Phishing Email Classification using Logistic Regression

Identifying Malicious Communications

Phishing attacks remain one of the most prominent cybersecurity threats. Modern spam filters depend on machine learning models that can reliably distinguish phishing emails from legitimate ones.

Modeling Strategy

In a recent Hiring Hackathon by MachineHack, I evaluated several classification models for this problem. It is tempting to reach for complex NLP models, but benchmarking showed that Logistic Regression, paired with tuned hyperparameters and TF-IDF feature extraction, gave the best balance of speed and accuracy.
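
To make the approach concrete, here is a minimal sketch of a TF-IDF plus Logistic Regression pipeline in scikit-learn. It is not the exact competition code: the toy emails, the labels, and the hyperparameter values (ngram_range, sublinear_tf, C) are assumptions chosen only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Toy data invented for this sketch (1 = phishing, 0 = legitimate).
emails = [
    "Your account has been suspended, verify your password now",
    "Urgent: confirm your bank details to avoid account closure",
    "You have won a prize, click this link to claim your reward",
    "Security alert: login to update your payment information",
    "Meeting moved to 3pm tomorrow, agenda attached",
    "Here are the notes from the project review on Friday",
    "Lunch on Thursday? Let me know what time works for you",
    "The quarterly report draft is ready for your comments",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]


def build_model():
    """Return an untrained TF-IDF + Logistic Regression pipeline."""
    return make_pipeline(
        # Word unigrams and bigrams; the settings are illustrative, not tuned.
        TfidfVectorizer(lowercase=True, ngram_range=(1, 2), sublinear_tf=True),
        # C is the inverse regularization strength; 10.0 is an assumed value.
        LogisticRegression(C=10.0, max_iter=1000),
    )


model = build_model()
model.fit(emails, labels)

# Scored on the training data only to show the API; a real evaluation
# needs a held-out validation split or cross-validation.
predictions = model.predict(emails)
train_f1 = f1_score(labels, predictions)
print(f"Training F1: {train_f1:.5f}")
```

In practice, values such as C and ngram_range would be tuned on a validation split, for example with scikit-learn's GridSearchCV, rather than fixed by hand as they are here.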

Performance

The model achieved an F1 score of 0.99988 on the public leaderboard and 0.99984 on the private leaderboard, placing 34th out of 109 participants. This highlights how well fundamental machine learning algorithms can perform when feature engineering is done carefully.
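
For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from raw counts; the function name and the example counts are hypothetical and are not taken from the competition data.

```python
def f1_from_counts(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Compute F1 = 2 * TP / (2 * TP + FP + FN).

    This is algebraically the same as the harmonic mean of
    precision (TP / (TP + FP)) and recall (TP / (TP + FN)).
    """
    denominator = 2 * true_pos + false_pos + false_neg
    if denominator == 0:
        # No positive examples and no positive predictions: return 0.0 by convention.
        return 0.0
    return 2 * true_pos / denominator


# Hypothetical counts: 8 phishing emails caught, 2 false alarms, 2 missed.
example_f1 = f1_from_counts(true_pos=8, false_pos=2, false_neg=2)
print(example_f1)
```

Because F1 ignores true negatives, it only rewards a filter for catching phishing emails without raising false alarms, which is why it is a common choice over plain accuracy for this kind of task.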