Supervised vs unsupervised: 2 problems, 1 API
Supervised learning maps inputs X to a known target y. Classification predicts a discrete class (spam or ham, digit 0 through 9, iris species). Regression predicts a continuous value (house price, exam score, temperature). The training data carries the right answer for every example, the model learns the mapping, and at inference time it produces predictions on unseen X.
Unsupervised learning has no y. The model finds structure in X alone. Clustering groups similar examples (KMeans, DBSCAN, hierarchical). Dimensionality reduction projects high-dimensional X into a smaller space while keeping as much structure as possible: PCA keeps the directions of greatest variance, while t-SNE and UMAP preserve local neighborhoods. Anomaly detection flags points that look unlike the rest of the data (IsolationForest, OneClassSVM). Whether you have a target, and what question you are asking, decides the family; the specific algorithm is a secondary choice.
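Clustering is only one of the three unsupervised tasks named above. As a minimal sketch of the other two, the snippet below runs PCA and IsolationForest from scikit-learn on the iris features with the labels ignored. The `contamination=0.05` value is an illustrative assumption about the outlier fraction, not a recommended setting.

```python
# Requires: pip install scikit-learn
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

X = load_iris().data  # 150 samples, 4 features; the labels are deliberately ignored

# Dimensionality reduction: project 4 features onto the 2 directions of highest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(f"Reduced shape: {X_2d.shape}")
print(f"Variance kept: {pca.explained_variance_ratio_.sum():.3f}")

# Anomaly detection: fit_predict returns -1 for flagged points and 1 for the rest.
# contamination=0.05 is an assumed outlier fraction chosen for illustration only.
iso = IsolationForest(contamination=0.05, random_state=0)
flags = iso.fit_predict(X)
print(f"Points flagged as anomalies: {(flags == -1).sum()}")
```

Neither call receives a y: `fit_transform(X)` and `fit_predict(X)` work from the features alone.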
Scikit-learn unifies both families behind one estimator interface. Every estimator implements `fit`: `fit(X, y)` for supervised models, `fit(X)` for unsupervised ones. Supervised estimators add `predict(X)` and `score(X, y)`. Clustering estimators expose a `labels_` attribute after fitting. Transformers (like StandardScaler, OneHotEncoder) implement `fit` plus `transform`. This shared shape is what makes Pipeline work: you can swap a LogisticRegression for a RandomForestClassifier without changing any other code.
```python
# Requires: pip install scikit-learn
from sklearn.datasets import load_iris, load_diabetes
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

# Supervised classification: predict iris species
iris = load_iris()
clf = LogisticRegression(max_iter=200)
clf.fit(iris.data, iris.target)
# score() on a classifier returns mean accuracy
print(f"Classifier accuracy on train: {clf.score(iris.data, iris.target):.3f}")

# Supervised regression: predict diabetes progression
diab = load_diabetes()
reg = LinearRegression()
reg.fit(diab.data, diab.target)
# score() on a regressor returns R^2
print(f"Regressor R^2 on train: {reg.score(diab.data, diab.target):.3f}")

# Unsupervised clustering: 3 groups on the same iris features, no target passed
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(iris.data)
# Cast to int so the list prints as plain numbers under NumPy 2.x
print(f"Cluster sizes: {[int((km.labels_ == c).sum()) for c in range(3)]}")
```
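The scores above are measured on the training data, and the swap claim about Pipeline is not yet exercised. The sketch below does both: it holds out a test split and fits the same scaler-plus-model pipeline twice, changing only the final estimator. The 25% split, the `candidates` dictionary, and the hyperparameters are illustrative assumptions.

```python
# Requires: pip install scikit-learn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows so the score reflects data the model has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical set of interchangeable final estimators
candidates = {
    "logistic regression": LogisticRegression(max_iter=200),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in candidates.items():
    # Only the final step changes; the scaler and the fit/score calls stay the same
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_train, y_train)
    scores[name] = pipe.score(X_test, y_test)
    print(f"{name}: held-out accuracy {scores[name]:.3f}")
```

Because the pipeline is itself an estimator, `fit` and `score` are called on it exactly as they would be on a bare model, and the scaler is fitted on the training split only.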