Published on November 03, 2025

🌪️ Classifying Disaster Tweets with Deep Learning

machine_learning

Project Overview

Social media platforms like Twitter often serve as early indicators of real-world events. This project explores how machine learning can be used to automatically classify tweets as either disaster-related or not—an application with potential value for emergency response teams, journalists, and public safety organizations.

The dataset comes from the Kaggle NLP with Disaster Tweets competition, and includes over 7,600 labeled tweets. The challenge lies in the ambiguity of language: disaster-related terms are often used metaphorically, making classification non-trivial.

Approach

The analysis began with a baseline model using metadata features (like location and keywords), followed by a more advanced pipeline using text-based features and pretrained GloVe embeddings. The text was cleaned and tokenized to align with GloVe’s training conventions, and padded for input into recurrent neural networks.

Two types of RNN architectures were explored:

LSTM (Long Short-Term Memory)
GRU (Gated Recurrent Unit)

Each was tested across multiple configurations, varying the number of layers, units, and dropout rates. Early stopping and custom callbacks were used to optimize for F1-score, which was chosen as the primary evaluation metric.

Key Findings

GRU outperformed LSTM, achieving an F1-score of 0.857 and an accuracy of 88.2%.
Dropout helped reduce overfitting but significantly increased training time due to limitations in GPU optimization.
Model performance fluctuated, suggesting that randomness and training dynamics played a larger role than expected.
Metadata alone (via Random Forest) was insufficient, with an F1-score of just 0.617.

Reflections

While GRU emerged as the best model, the results showed sensitivity to hyperparameters and training conditions. Dropout and early stopping helped mitigate overfitting, but the trade-offs in training time and reproducibility were notable.

Future improvements could include:

Exploring alternative regularization techniques
Leveraging attention mechanisms or transformer-based models
Incorporating tweet length and other linguistic features more explicitly

If you're curious about the details, the full notebook is embedded below 👇

You can also view the notebook in a separate page, or check it on GitHub.