🌸 Clustering Iris Species with K-Means and Hierarchical Methods
Why Clustering?
Clustering is a foundational technique in unsupervised learning, used to uncover patterns in data without predefined labels. This project explores how two popular clustering algorithms—K-Means and Agglomerative Hierarchical Clustering—perform on the well-known Iris flower dataset, aiming to group samples by species based solely on their morphological features.
About the Dataset
The Iris dataset contains 150 samples from three species: setosa, versicolor, and virginica. Each sample includes four features:
- Sepal length
- Sepal width
- Petal length
- Petal width
While setosa is linearly separable, versicolor and virginica overlap significantly, making this dataset ideal for testing clustering algorithms.
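For reference, the dataset ships with scikit-learn, so no download is needed. A minimal loading sketch (the `X` and `y` defined here are reused in the sketches below):

```python
from sklearn.datasets import load_iris

# Load the bundled Iris dataset: 150 samples, 4 features, 3 species.
iris = load_iris()
X, y = iris.data, iris.target      # feature matrix and true species labels
print(iris.feature_names)          # sepal/petal length and width (cm)
print(iris.target_names)           # ['setosa' 'versicolor' 'virginica']
```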
What Was Explored
The analysis focused on:
- Comparing K-Means and Agglomerative Clustering
- Evaluating performance using accuracy, silhouette score, and Adjusted Rand Index (ARI); a metric sketch follows this list
- Testing different linkage methods and distance metrics
- Visualizing clusters and errors using PCA
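One subtlety worth noting: cluster IDs are arbitrary, so "accuracy" is only meaningful after mapping each cluster to its best-matching species. A sketch of how such an evaluation could look (the `clustering_accuracy` helper and the Hungarian-algorithm mapping are my assumptions about the approach, not necessarily the notebook's exact code):

```python
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix, silhouette_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """Match clusters to species (Hungarian algorithm), then score accuracy."""
    cm = confusion_matrix(y_true, y_pred)
    rows, cols = linear_sum_assignment(-cm)   # maximize matched counts
    return cm[rows, cols].sum() / cm.sum()

def evaluate(X, y_true, y_pred):
    return {
        "accuracy": clustering_accuracy(y_true, y_pred),
        "silhouette": silhouette_score(X, y_pred),   # label-free: cohesion vs. separation
        "ARI": adjusted_rand_score(y_true, y_pred),  # chance-corrected label agreement
    }
```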
Key Experiments & Findings
Dimensionality Reduction with PCA
- 95.8% of the variance was captured using just 2 principal components, confirming strong correlations among features—especially between petal length and width.
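A sketch of the PCA step; standardizing the features first is assumed here, since it reproduces the reported 95.8% figure (unscaled features give a different split):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_.sum())     # ~0.958 with standardized features
```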
Optimal Cluster Count
- Sweeping the cluster count with inertia (elbow method), silhouette score, and label-mapped accuracy pointed to 3 clusters, matching the true number of species.
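A sketch of that sweep, continuing from the variables above (the range of k is illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    sil = silhouette_score(X_scaled, km.labels_)
    print(f"k={k}: inertia={km.inertia_:.1f}, silhouette={sil:.3f}")
# The elbow in inertia, together with the other metrics, pointed to k=3 in this analysis.
```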
Parameter Tuning for Agglomerative Clustering
- Tried combinations of:
  - Linkage: `ward`, `complete`, `average`, `single`
  - Metric: `euclidean`, `manhattan`, `cosine`
- Best result: `average` linkage with `manhattan` distance achieved 88.7% accuracy, outperforming the default settings.
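A sketch of that grid search, reusing `clustering_accuracy` from the metric sketch above (note that `ward` linkage only supports euclidean distance, and scikit-learn's `metric` argument was called `affinity` before version 1.2):

```python
from itertools import product
from sklearn.cluster import AgglomerativeClustering

results = {}
for linkage, metric in product(["ward", "complete", "average", "single"],
                               ["euclidean", "manhattan", "cosine"]):
    if linkage == "ward" and metric != "euclidean":
        continue                        # ward is defined only for euclidean distance
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage, metric=metric)
    labels = model.fit_predict(X_scaled)
    results[(linkage, metric)] = clustering_accuracy(y, labels)

best = max(results, key=results.get)
print(best, f"{results[best]:.1%}")     # ('average', 'manhattan') won in this analysis
```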
Performance Comparison
| Algorithm | Accuracy | Silhouette Score | ARI |
|---|---|---|---|
| K-Means | 83.3% | 0.46 | 0.62 |
| Agglomerative (default) | 82.7% | 0.45 | 0.61 |
| Agglomerative (best) | 88.7% | 0.45 | 0.72 |
Error Analysis
- Setosa was classified almost perfectly across all methods.
- Most errors occurred between versicolor and virginica, confirming their overlapping nature.
- Agglomerative Clustering's error pattern shifted with its parameters, sometimes misclassifying one of the two overlapping species far more than the other.
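One way to see this error pattern concretely is a confusion matrix after mapping clusters to species; a sketch, continuing from the pieces above:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def map_clusters(y_true, y_pred):
    """Relabel arbitrary cluster IDs to their best-matching species."""
    cm = confusion_matrix(y_true, y_pred)
    rows, cols = linear_sum_assignment(-cm)
    mapping = {c: r for r, c in zip(rows, cols)}
    return np.array([mapping[p] for p in y_pred])

print(confusion_matrix(y, map_clusters(y, labels)))
# Rows/columns: setosa, versicolor, virginica. The setosa row is (near-)diagonal;
# the off-diagonal counts sit in the versicolor/virginica block.
```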
Final Thoughts
While Agglomerative Clustering achieved the highest accuracy with tuned parameters, its sensitivity to configuration and instability in cluster composition make it less reliable for real-world applications without labeled data.
K-Means, despite slightly lower accuracy, offered more balanced results and greater stability, making it a safer choice for practical clustering tasks.
Future Work
- Extend analysis to other clustering algorithms like DBSCAN or Spectral Clustering
- Apply to more complex datasets
- Explore automated parameter tuning techniques
The full notebook with code and visualizations is embedded below 👇
You can also view the notebook on a separate page, or check it out on GitHub.