Online Toxicity Analysis: Comparative Evaluation of Text Classification and Topic Modeling Techniques
This project applies text mining techniques to classify and analyze toxic comments from the "Toxic Comment Classification Challenge" dataset. The project is divided into two main tasks:
- Text Classification (detecting various types of toxicity).
- Topic Modeling (analyzing underlying themes in toxic and identity-hate comments).
For a detailed explanation of the methodology, results, and conclusions, please refer to the project documentation in the report/ folder.
The repository is organized as follows:
Toxic_Comment_Classification/
├── notebooks/ # Jupyter notebooks for analysis
│ ├── Toxic_Comment_Classification.ipynb # Main classification pipeline
│ ├── Toxic_Comment_Topic_Modeling.ipynb # Topic modeling on general toxicity
│ └── Identity_Hate_Topic_Modeling.ipynb # Topic modeling on identity hate
│
├── report/ # Project documentation
│ ├── Text_Mining_Presentation.pdf # Project presentation (PDF)
│ ├── Text_Mining_Presentation.pptx # Project presentation (PPTX)
│ └── Text_Mining_Report.pdf # Project report
│
├── requirements.txt # Required Python libraries
- Python
- Google Colab environment (preferred) or Jupyter Notebook
To run the notebooks in a local setup, simply install the dependencies from the requirements file:
pip install -r requirements.txt
You can access the trained models via the following Google Drive folder.
The notebooks are configured to download the dataset (jigsaw-toxic-comment-classification-challenge.zip) directly from Google Drive.
notebooks/Toxic_Comment_Classification.ipynb
This notebook performs the supervised learning tasks. It includes:
- Data Preprocessing: Cleaning text, tokenization, lemmatization, and stopword removal.
- Exploratory Data Analysis (EDA): Visualization of label distribution and word clouds.
- Model Training & Evaluation:
  - Traditional ML: Logistic Regression, Naive Bayes, and Linear SVM using TF-IDF.
  - Deep Learning: Fine-tuning a DistilBERT transformer model.
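As a rough illustration of the preprocessing step listed above, the following is a minimal sketch using only the Python standard library. The function name `preprocess` and the tiny `STOPWORDS` set are assumptions made for this example, not the notebook's actual code, and lemmatization is left out because it needs an external resource such as NLTK or spaCy.

```python
import re

# Tiny illustrative stopword list (an assumption; the notebook's real list is not shown here).
STOPWORDS = {"a", "an", "the", "is", "are", "you", "and", "to", "of", "in"}

def preprocess(comment):
    """Lowercase, strip URLs and non-letters, split on whitespace, drop stopwords."""
    text = comment.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep ASCII letters and whitespace only
    tokens = text.split()                      # simple whitespace tokenization
    return [tok for tok in tokens if tok not in STOPWORDS]

print(preprocess("You are the WORST editor!!! See http://example.com"))
# ['worst', 'editor', 'see']
```

The cleaned token lists would then feed a TF-IDF vectorizer for the traditional models, while a transformer such as DistilBERT typically relies on its own tokenizer instead.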
- Open Toxic_Comment_Classification.ipynb.
- Run the cells sequentially. The data will automatically download to a data/ folder.
- The notebook evaluates models using ROC-AUC and F1 scores.
- Saved models corresponding to this section are expected to be located in models/ML_model (Logistic Regression) and models/DistilBERT_model.
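To make the two evaluation metrics concrete, here is a small pure-Python sketch for a single binary label. The helper names `f1_binary` and `roc_auc_binary` and the toy scores are hypothetical; the notebook most likely uses library implementations, and how it averages across the toxicity labels is not stated here.

```python
def f1_binary(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for 0/1 labels and 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def roc_auc_binary(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (made up for illustration): true labels and predicted probabilities.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5 for F1

print(round(f1_binary(y_true, y_pred), 3))     # 0.667
print(round(roc_auc_binary(y_true, scores), 3))  # 0.833
```

F1 depends on the chosen decision threshold, whereas ROC-AUC only depends on how the scores rank positives against negatives, which is one reason to report both on imbalanced labels.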
notebooks/Toxic_Comment_Topic_Modeling.ipynb and notebooks/Identity_Hate_Topic_Modeling.ipynb
These notebooks use unsupervised learning to discover abstract topics within the comments.
- Toxic_Comment_Topic_Modeling.ipynb: Applies LDA and BERTopic to comments labeled as 'toxic'.
- Identity_Hate_Topic_Modeling.ipynb: Focuses specifically on the 'identity_hate' category to understand targeted hate speech patterns.
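Both notebooks start from a subset of comments carrying a given label. The sketch below shows one way to select such a subset with the standard `csv` module. The helper `select_comments` and the in-memory sample rows are made up for illustration; the column names `comment_text`, `toxic`, and `identity_hate` are assumed to match the challenge's train.csv.

```python
import csv
import io

def select_comments(csv_file, label):
    """Return the comment texts whose `label` column equals "1"."""
    reader = csv.DictReader(csv_file)
    return [row["comment_text"] for row in reader if row[label] == "1"]

# Tiny in-memory stand-in for train.csv (made-up rows, assumed column names).
sample = io.StringIO(
    "id,comment_text,toxic,identity_hate\n"
    "1,thanks for the edit,0,0\n"
    "2,example toxic comment,1,0\n"
    "3,example identity hate comment,1,1\n"
)

toxic = select_comments(sample, "toxic")
sample.seek(0)  # rewind so the same file object can be read again
identity_hate = select_comments(sample, "identity_hate")

print(len(toxic), len(identity_hate))  # 2 1
```

With the real dataset, the `io.StringIO` object would be replaced by an opened file, for example `open("data/train.csv", newline="", encoding="utf-8")`, where the path is an assumption based on the data/ folder mentioned earlier.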
- Open the desired notebook.
- Ensure bertopic and gensim are installed (cells included in the notebook).
- Run the cells to process the text and generate topic clusters.
- The notebooks include visualizations for interpreting the resulting topics.