- **Dataset** — AG News, filtered to sports vs. politics (the Sports class, with the World class as a politics proxy)
- **Features** — four feature-extraction approaches compared (Bag of Words and TF-IDF, each with unigram and unigram+bigram variants)
- **Models** — five classic machine-learning algorithms
- **Metrics** — accuracy, precision, recall, and F1 on a held-out test set
## 🏆 Top Results

**Best Configuration:** TF-IDF (1-2 gram) + SVM
| Rank | Vectorizer | Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|---|
| #1 | TF-IDF (1-2 gram) | SVM | 0.9774 | 0.9781 | 0.9774 | 0.9774 |
| #2 | TF-IDF (1-gram) | SVM | 0.9762 | 0.9762 | 0.9762 | 0.9762 |
| #3 | BoW (1-2 gram) | Logistic Reg. | 0.9751 | 0.9751 | 0.9751 | 0.9751 |
| #4 | BoW (1-2 gram) | Naive Bayes | 0.9741 | 0.9742 | 0.9741 | 0.9741 |
| #5 | TF-IDF (1-gram) | Logistic Reg. | 0.9736 | 0.9736 | 0.9736 | 0.9736 |
| #6 | BoW (1-2 gram) | SVM | 0.9734 | 0.9734 | 0.9734 | 0.9734 |
| #7 | TF-IDF (1-2 gram) | Logistic Reg. | 0.9734 | 0.9734 | 0.9734 | 0.9734 |
| #8 | TF-IDF (1-gram) | Naive Bayes | 0.9732 | 0.9732 | 0.9732 | 0.9732 |
| #9 | BoW (1-gram) | Logistic Reg. | 0.9732 | 0.9732 | 0.9732 | 0.9732 |
| #10 | TF-IDF (1-2 gram) | Naive Bayes | 0.9729 | 0.9729 | 0.9729 | 0.9729 |
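The top-ranked configuration can be sketched with scikit-learn. This is a minimal illustration, not the experiment's exact setup: the hyperparameters are library defaults, and the tiny corpus below stands in for the filtered AG News data.

```python
# Sketch of the #1 configuration: TF-IDF over 1-2 grams feeding a linear SVM.
# Hyperparameters are assumptions (scikit-learn defaults), not the settings
# used in the reported experiments.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigrams + bigrams
    ("svm", LinearSVC()),
])

# Tiny stand-in corpus; the real data is the filtered AG News split.
train_texts = [
    "the team won the championship game last night",
    "the striker scored twice in the cup final",
    "parliament passed the new budget bill today",
    "the president signed the trade agreement",
]
train_labels = ["sports", "sports", "politics", "politics"]

pipeline.fit(train_texts, train_labels)
preds = pipeline.predict(["lawmakers debated the budget bill"])
```

Swapping the vectorizer or estimator in the pipeline is how the other table rows (BoW, Logistic Regression, Naive Bayes) would be reproduced.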
## 💡 Key Findings

### Model Performance Analysis

#### 🎯 Linear Models Excel
Logistic Regression and SVM achieved the highest average F1 scores (~0.973), demonstrating the effectiveness of linear models on high-dimensional, sparse text data. The best configuration (TF-IDF bigrams + SVM) reached an F1 score of 97.74%.
#### 📊 TF-IDF Significantly Better
TF-IDF representations outperformed Bag of Words by 5.11% F1, showing that inverse-document-frequency weighting helps distinguish topical keywords from common function words.
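The effect of IDF weighting can be seen directly on a toy corpus: a word that appears in every document ("the") keeps its full relative weight under raw counts but is down-weighted relative to a rarer topical word ("senate") under TF-IDF. The corpus below is illustrative, not the AG News data.

```python
# Contrast Bag-of-Words counts with TF-IDF weights on an illustrative corpus.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the senate debated the election",
    "the team lost the match",
    "the election results surprised the senate",
]

bow = CountVectorizer().fit(docs)
tfidf = TfidfVectorizer().fit(docs)  # same tokenizer, so same vocabulary

bow_row = bow.transform([docs[0]]).toarray()[0]
tfidf_row = tfidf.transform([docs[0]]).toarray()[0]

i_the = bow.vocabulary_["the"]        # appears in all three documents
i_senate = bow.vocabulary_["senate"]  # appears in two documents

# The relative weight of "the" vs. "senate" shrinks once IDF is applied:
bow_ratio = bow_row[i_the] / bow_row[i_senate]      # raw counts: 2 vs 1
tfidf_ratio = tfidf_row[i_the] / tfidf_row[i_senate]
```

`tfidf_ratio` comes out smaller than `bow_ratio` because the IDF factor penalizes "the" for occurring in every document.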
#### 🎲 Multiple Strong Configurations
The top 5 configurations achieved F1 scores between 97.36% and 97.74%, so several feature/model combinations are viable for this task rather than a single dominant one.
#### 📉 KNN Struggles with High Dimensionality
KNN performed poorly (F1 = 0.64-0.80), suffering from the curse of dimensionality: distances become uninformative in sparse, high-dimensional feature spaces.
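The distance-concentration effect behind this finding is easy to demonstrate numerically: the relative contrast between the nearest and farthest neighbor of a query collapses as dimensionality grows, so "nearest" stops being meaningful. This sketch uses dense random points rather than actual TF-IDF vectors.

```python
# Illustrate the curse of dimensionality: relative contrast
# (d_max - d_min) / d_min between a query and random points shrinks
# sharply as the number of dimensions grows.
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim: int, n_points: int = 200) -> float:
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrast_low = relative_contrast(2)        # large: neighbors are distinct
contrast_high = relative_contrast(10_000)  # small: distances concentrate
```

In 10,000 dimensions all pairwise distances cluster around the same value, which is why KNN's neighbor rankings become unreliable on sparse bigram features with vocabularies of that size or larger.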
## ⚠️ Dataset Transition: BBC → AG News
We initially used the BBC News dataset but found it produced an unrealistic 100% accuracy across multiple models: its topical vocabulary is so distinct per class that the classes are trivially separable. We switched to AG News (60K documents) for a more realistic evaluation, since its sports and politics articles share vocabulary.
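The filtering step described above can be sketched as below. The integer label mapping (0 = World, 1 = Sports, 2 = Business, 3 = Sci/Tech) follows the commonly distributed AG News release, and the sample rows are hypothetical stand-ins for the real records.

```python
# Sketch of reducing AG News to a two-class sports-vs-politics task,
# using the World class as a politics proxy. Label mapping assumed:
# 0 = World, 1 = Sports, 2 = Business, 3 = Sci/Tech.
KEEP = {0: "politics", 1: "sports"}

rows = [  # hypothetical stand-ins for real AG News records
    {"text": "UN council votes on new sanctions", "label": 0},
    {"text": "Striker signs record transfer deal", "label": 1},
    {"text": "Tech firm posts quarterly earnings", "label": 3},
]

filtered = [
    {"text": r["text"], "label": KEEP[r["label"]]}
    for r in rows
    if r["label"] in KEEP
]
```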