
Machine Learning with Python

Introduction to ML concepts and libraries in Python

Machine learning in Python splits into 2 families: supervised learning (predict a known label from features) and unsupervised learning (find structure without labels). The scikit-learn API unifies both behind 3 methods: `fit`, `predict`, `score`. Master the train-test split, the Pipeline pattern, and choosing the right metric, and a CS229 or 10-601 problem set becomes a 60-line script.

Supervised vs unsupervised: 2 problems, 1 API

Supervised learning maps inputs X to a known target y. Classification predicts a discrete class (spam or ham, digit 0 through 9, iris species). Regression predicts a continuous value (house price, exam score, temperature). The training data carries the right answer for every example, the model learns the mapping, and at inference time it produces predictions on unseen X.

Unsupervised learning has no y. The model finds structure in X alone. Clustering groups similar examples (KMeans, DBSCAN, hierarchical). Dimensionality reduction projects high-dimensional X to a smaller space while preserving variance (PCA, t-SNE, UMAP). Anomaly detection flags points that look unlike the rest of the data (IsolationForest, OneClassSVM). The use case decides the family, not the algorithm.
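
The two unsupervised tasks beyond clustering can be sketched briefly. This is a minimal illustration, assuming the iris features as input; `contamination=0.05` is an arbitrary illustrative setting, not a recommended default.

```python
# Requires: pip install scikit-learn
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

X, _ = load_iris(return_X_y=True)  # labels are discarded on purpose

# Dimensionality reduction: project 4 features down to 2 components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(f"Reduced shape: {X_2d.shape}")
print(f"Variance kept: {pca.explained_variance_ratio_.sum():.3f}")

# Anomaly detection: fit_predict returns -1 for outliers and 1 for inliers
iso = IsolationForest(contamination=0.05, random_state=0)
flags = iso.fit_predict(X)
n_outliers = int((flags == -1).sum())
print(f"Flagged outliers: {n_outliers}")
```

Neither call receives `y`: both estimators work from the structure of X alone.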

Scikit-learn unifies both families behind one estimator interface. Every estimator implements `fit(X, y)` for supervised or `fit(X)` for unsupervised. Supervised estimators add `predict(X)` and `score(X, y)`. Clustering estimators add `labels_` after fitting. Transformers (like StandardScaler, OneHotEncoder) implement `fit` plus `transform`. This shared shape is what makes Pipeline work: you can swap a LogisticRegression for a RandomForest without changing any other code.
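
A short sketch of the shared interface, assuming the iris data: a transformer used through `fit` and `transform`, then two different classifiers driven by identical calling code. The scores are training scores and only show that the calls are interchangeable.

```python
# Requires: pip install scikit-learn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Transformer: fit learns the column means and stds, transform applies them
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
print(f"Column means after scaling: {X_scaled.mean(axis=0).round(3)}")

# Two different estimators, same fit / score calls
train_scores = {}
for est in (LogisticRegression(max_iter=200), RandomForestClassifier(random_state=0)):
    est.fit(X_scaled, y)
    train_scores[type(est).__name__] = est.score(X_scaled, y)
print(train_scores)
```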

Example

                      
# Requires: pip install scikit-learn
from sklearn.datasets import load_iris, load_diabetes
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

# Supervised classification: predict iris species
iris = load_iris()
clf = LogisticRegression(max_iter=200)
clf.fit(iris.data, iris.target)
print(f"Classifier accuracy on train: {clf.score(iris.data, iris.target):.3f}")

# Supervised regression: predict diabetes progression
diab = load_diabetes()
reg = LinearRegression()
reg.fit(diab.data, diab.target)
print(f"Regressor R^2 on train:       {reg.score(diab.data, diab.target):.3f}")

# Unsupervised clustering: 3 groups on the same iris features
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(iris.data)
print(f"Cluster sizes: {[(km.labels_ == c).sum() for c in range(3)]}")
                      
                    

The train-validation-test split and why it matters

Reporting accuracy on the training set tells you nothing about generalization. A model with enough parameters memorizes the training data and scores 100% there while failing on new examples. The fix: split the labeled data into a training set (the model learns from this), a validation set (you compare candidate models and tune hyperparameters here), and a held-out test set (touched exactly once, at the end, to report the honest score).

The common ratio is 70 / 15 / 15 or 60 / 20 / 20. When data is scarce, k-fold cross-validation replaces the validation set: split the training data into k folds, train on k-1 folds, validate on the remaining fold, rotate through all k folds, average the scores. `cross_val_score(estimator, X, y, cv=5)` runs the whole loop in one call. Use 5-fold or 10-fold by default. Stratified k-fold preserves class proportions per fold and matters for imbalanced classification.
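
scikit-learn has no single call for a three-way split, so one common approach is to call `train_test_split` twice. The sketch below assumes a 70 / 15 / 15 target on the iris data; the variable names are illustrative.

```python
# Requires: pip install scikit-learn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Step 1: hold out 15% of all rows as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y
)

# Step 2: carve the validation set out of the remainder.
# 0.15 / 0.85 of the remainder is roughly 15% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85, random_state=42, stratify=y_rest
)

print(f"train={len(X_train)} val={len(X_val)} test={len(X_test)}")
```

Because row counts are rounded, the resulting sizes are close to the target ratio rather than exact.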

Random splits leak information when the data has temporal or grouped structure. For time series, the test set must come after the training set in time. `TimeSeriesSplit` produces forward-rolling folds. For grouped data (multiple rows per patient, per user, per document), use `GroupKFold` to keep all rows from one group on the same side of the split. Mixing rows from the same patient across train and test inflates the score and the resulting model fails on truly new patients.
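
Both splitters can be inspected on a toy array. In this sketch the 12 rows are assumed to be in time order already, and the patient IDs are hypothetical.

```python
# Requires: pip install scikit-learn
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 rows, assumed to be in time order

# Forward-rolling folds: every test index comes after every train index
tss = TimeSeriesSplit(n_splits=3)
time_folds = list(tss.split(X))
for train_idx, test_idx in time_folds:
    print(f"train up to row {train_idx.max()}, test rows {test_idx.min()}..{test_idx.max()}")

# Grouped folds: hypothetical patient IDs, 3 rows per patient
groups = np.repeat(["p1", "p2", "p3", "p4"], 3)
gkf = GroupKFold(n_splits=4)
group_folds = list(gkf.split(X, groups=groups))
for train_idx, test_idx in group_folds:
    print(f"test patients: {sorted(set(groups[test_idx]))}")
```

Either splitter can be passed as the `cv` argument of `cross_val_score`; `GroupKFold` additionally needs the `groups` argument there.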

Example

                      
# Requires: pip install scikit-learn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Stratified 70/30 split preserves class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(f"Train accuracy: {clf.score(X_train, y_train):.3f}")
print(f"Test accuracy:  {clf.score(X_test, y_test):.3f}")

# 5-fold cross-validation on the training set
scores = cross_val_score(clf, X_train, y_train, cv=5)
print(f"CV scores: {scores.round(3)}")
print(f"CV mean: {scores.mean():.3f} +/- {scores.std():.3f}")
                      
                    

Pipeline and ColumnTransformer: preprocessing without leakage

Real datasets need preprocessing before they reach the estimator. Numeric columns get scaled. Categorical columns get one-hot encoded. Missing values get imputed. The naive approach runs each step manually on the full dataset and then splits into train and test. This leaks test-set statistics into the training pipeline because the scaler saw the test means.

Pipeline solves this by chaining transformers and a final estimator into one object. `Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])` exposes a single `fit` method that fits the scaler on the training data only, transforms the training data, and fits the classifier on the result. `predict` and `score` then transform new data with those frozen statistics before it reaches the classifier. Cross-validation calls `fit` on each fold's training portion, so the imputer mean and scaler std come from inside the fold only. No leakage.
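
A minimal sketch of that behavior, assuming the iris data and a 70/30 split: the fitted scaler's stored means should equal the training-set means, which shows the test rows were never part of the fit.

```python
# Requires: pip install scikit-learn
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=200))])

# Each fold refits a fresh clone of the whole pipeline on that fold's training rows
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(f"CV mean: {cv_scores.mean():.3f}")

# fit sees training rows only; score reuses the frozen statistics on test rows
pipe.fit(X_train, y_train)
scaler_means = pipe.named_steps["scaler"].mean_
print(f"Scaler means equal training means: {np.allclose(scaler_means, X_train.mean(axis=0))}")
test_acc = pipe.score(X_test, y_test)
print(f"Test accuracy: {test_acc:.3f}")
```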

ColumnTransformer handles mixed-type tables. Pass a list of (name, transformer, columns) tuples, and each transformer applies only to the named columns. Numeric columns route to a StandardScaler. Categorical columns route to a OneHotEncoder. The output is one concatenated matrix that the downstream estimator consumes. Wrapping a ColumnTransformer plus an estimator in a Pipeline produces an end-to-end model with one saved artifact through `joblib.dump`.
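
The save-and-restore step can be sketched as follows. For brevity this assumes a simpler scaler-plus-classifier pipeline rather than a full ColumnTransformer, and the file name `model.joblib` is hypothetical.

```python
# Requires: pip install scikit-learn
import os
import tempfile

import joblib  # installed as a scikit-learn dependency
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
model = Pipeline([("sc", StandardScaler()),
                  ("clf", LogisticRegression(max_iter=200))]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")  # hypothetical file name
    joblib.dump(model, path)      # preprocessing and estimator in one file
    restored = joblib.load(path)  # ready to predict without refitting

same = bool((restored.predict(X) == model.predict(X)).all())
print(f"Restored pipeline reproduces predictions: {same}")
```

Loading a saved model generally expects the same scikit-learn version that wrote it, so the version is worth recording alongside the file.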

Example

                      
# Requires: pip install scikit-learn pandas
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "age": [22, 35, 41, None, 29, 58, 33, 47],
    "income": [40, 65, 80, 55, None, 120, 70, 95],
    "city": ["NY", "SF", "NY", "LA", "SF", "NY", "LA", "SF"],
    "churn": [0, 0, 1, 0, 1, 1, 0, 1],
})
X, y = df.drop(columns="churn"), df["churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

numeric = ["age", "income"]
categorical = ["city"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                       ("sc",  StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

model = Pipeline([("prep", preprocess),
                  ("clf",  RandomForestClassifier(random_state=0))])
model.fit(X_train, y_train)
print(f"Pipeline test accuracy: {model.score(X_test, y_test):.3f}")
                      
                    

Metrics: accuracy lies on imbalanced data

Accuracy is the fraction of correct predictions. It misleads when one class dominates. On a fraud detection dataset where 1% of transactions are fraud, a model that predicts "not fraud" every time scores 99% accuracy and catches zero fraud. The right metric depends on what mistakes cost.

Precision answers "of the items I flagged as positive, how many were actually positive". Recall answers "of the actual positives, how many did I catch". The two trade off: pushing the decision threshold lower catches more positives (higher recall) at the cost of more false alarms (lower precision). The F1 score is the harmonic mean of precision and recall and balances both. ROC-AUC summarizes how well the model ranks positives above negatives across every threshold and ignores the threshold choice entirely, which is useful when downstream consumers pick their own cutoff.
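
The threshold trade-off is easy to see on a toy example. The labels and probabilities below are made up for illustration, not the output of a trained model.

```python
# Requires: pip install scikit-learn
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
# Hypothetical predicted probabilities of the positive class
y_prob = np.array([0.1, 0.2, 0.32, 0.45, 0.6, 0.35, 0.4, 0.55, 0.7, 0.9])

results = {}
for threshold in (0.5, 0.3):
    y_pred = (y_prob >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
        f1_score(y_true, y_pred),
    )
    p, r, f = results[threshold]
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Lowering the threshold from 0.5 to 0.3 lifts recall from 0.75 to 1.00 on this data, while precision drops from 0.75 to 0.50 because four negatives are now flagged.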

For regression, the choice is between mean squared error (MSE, penalizes large errors quadratically), mean absolute error (MAE, treats every error linearly and resists outliers), and R squared (fraction of variance explained, dimensionless, useful for comparing models on the same target). For multi-class classification, pass `average="macro"` to compute the metric per class and average, or `average="weighted"` to weight by class size. Choose macro when each class matters equally, weighted when frequent classes matter more.
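
A small sketch of both ideas on hand-made arrays (all values are illustrative): one large regression miss affects MSE far more than MAE, and macro and weighted F1 diverge when class sizes differ.

```python
# Requires: pip install scikit-learn
import numpy as np
from sklearn.metrics import (f1_score, mean_absolute_error,
                             mean_squared_error, r2_score)

# Regression: errors are 1, 0, -1 and one large miss of 6
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([11.0, 12.0, 13.0, 22.0])
mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"MSE={mse:.2f} MAE={mae:.2f} R^2={r2:.2f}")

# Multi-class: class 0 has 6 examples, classes 1 and 2 have 2 each
y_true_c = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred_c = [0, 0, 0, 0, 0, 0, 1, 0, 2, 1]
macro = f1_score(y_true_c, y_pred_c, average="macro")
weighted = f1_score(y_true_c, y_pred_c, average="weighted")
print(f"macro F1={macro:.3f} weighted F1={weighted:.3f}")
```

The single miss of 6 contributes 36 of the 38 total squared error, and R squared turns negative because the predictions do worse than always predicting the mean. Weighted F1 comes out higher than macro F1 here because the well-predicted class 0 is also the largest.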

Example

                      
# Requires: pip install scikit-learn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                              f1_score, roc_auc_score, confusion_matrix)

# Imbalanced binary dataset: 5% positive class
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                            n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                            random_state=42, stratify=y)

clf = LogisticRegression(max_iter=500).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)[:, 1]

print(f"Accuracy:  {accuracy_score(y_te, y_pred):.3f}")
print(f"Precision: {precision_score(y_te, y_pred):.3f}")
print(f"Recall:    {recall_score(y_te, y_pred):.3f}")
print(f"F1 score:  {f1_score(y_te, y_pred):.3f}")
print(f"ROC-AUC:   {roc_auc_score(y_te, y_prob):.3f}")
print("Confusion matrix:")
print(confusion_matrix(y_te, y_pred))
                      
                    

End-to-end iris classification: under 40 lines, train to evaluate

The iris dataset is the canonical first ML assignment. 150 flowers, 4 numeric features (sepal length, sepal width, petal length, petal width), 3 species labels. The task: predict species from features. The full pipeline runs in under 40 lines and demonstrates every step a CS229 or 10-601 lab brief checks.

The order: load data, split into train and test with stratification, fit two candidate models (LogisticRegression as a linear baseline, RandomForest as a non-linear comparison), compare via 5-fold cross-validation on the training set, pick the winner, evaluate on the test set exactly once, print a confusion matrix and per-class metrics. Set `random_state=42` everywhere a stochastic process appears so the grader reproduces your numbers.

The output is interpretable. The confusion matrix is a 3-by-3 grid where row i column j counts how many true-class-i examples were predicted as class j. The diagonal is correct predictions. Off-diagonal entries are mistakes, and the largest off-diagonal cell tells you which two classes the model confuses most. For iris that is usually versicolor with virginica, because their petal measurements overlap. `classification_report` prints precision, recall, F1, and support per class, plus overall accuracy and the macro and weighted averages.
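
Locating the largest off-diagonal cell can be done programmatically. The sketch below uses hypothetical true and predicted labels purely to illustrate how the matrix is read.

```python
# Requires: pip install scikit-learn
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["setosa", "versicolor", "virginica"]
# Hypothetical true and predicted class indices, for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2, 2, 2, 1]

cm = confusion_matrix(y_true, y_pred)
print(cm)

# Zero the diagonal so only mistakes remain, then locate the largest cell
errors = cm.copy()
np.fill_diagonal(errors, 0)
i, j = np.unravel_index(errors.argmax(), errors.shape)
print(f"Most common mistake: true {labels[i]} predicted as {labels[j]} "
      f"({errors[i, j]} times)")
```

On this toy data the largest off-diagonal cell is row 1, column 2: two versicolor examples predicted as virginica.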

Example

                      
# Requires: pip install scikit-learn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

iris = load_iris()  # load once and reuse for features, labels, and class names
X, y = iris.data, iris.target
target_names = iris.target_names

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

candidates = {
    "logreg":   LogisticRegression(max_iter=500, random_state=42),
    "forest":   RandomForestClassifier(n_estimators=100, random_state=42),
}

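# Compare via 5-fold CV on the training set and record each mean score
cv_means = {}
for name, model in candidates.items():
    cv = cross_val_score(model, X_tr, y_tr, cv=5)
    cv_means[name] = cv.mean()
    print(f"{name}: CV {cv.mean():.3f} +/- {cv.std():.3f}")

# Final fit + evaluation: pick the candidate with the highest CV mean
best_name = max(cv_means, key=cv_means.get)
best = candidates[best_name]
print(f"Selected model: {best_name}")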
best.fit(X_tr, y_tr)
y_pred = best.predict(X_te)

print(f"\nTest accuracy: {best.score(X_te, y_te):.3f}")
print("\nConfusion matrix:")
print(confusion_matrix(y_te, y_pred))
print("\nPer-class report:")
print(classification_report(y_te, y_pred, target_names=target_names))
                      
                    

Linear vs Ridge vs RandomForest: choosing a regressor

LinearRegression fits the best straight-line relationship between X and y, minimizing squared error. It is fast, interpretable (every feature has one coefficient), and a strong baseline. The weakness: it overfits when features are correlated, and the coefficients explode when the design matrix is near-singular. Multi-collinearity inflates standard errors and produces unstable predictions.

Ridge regression adds an L2 penalty on the coefficients. scikit-learn's `Ridge` minimizes `sum((y - y_pred) ** 2) + alpha * sum(coef ** 2)`, the squared error plus the penalty, which shrinks coefficients toward zero and stabilizes them under collinearity. Larger alpha means stronger shrinkage. The hyperparameter is tuned with cross-validation, typically across a grid like `[0.01, 0.1, 1, 10, 100]`. Lasso (`Lasso`, L1 penalty) goes further and pushes some coefficients to exact zero, which doubles as feature selection.
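
One way to run that tuning is `GridSearchCV` over the alpha grid, sketched here on the diabetes data. The step name `"m"` and the Lasso setting `alpha=5.0` are arbitrary choices for illustration.

```python
# Requires: pip install scikit-learn
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
alphas = [0.01, 0.1, 1, 10, 100]

# Tune Ridge alpha by 5-fold CV; "m__alpha" targets the step named "m"
ridge = Pipeline([("sc", StandardScaler()), ("m", Ridge())])
search = GridSearchCV(ridge, {"m__alpha": alphas}, cv=5, scoring="r2")
search.fit(X, y)
best_alpha = search.best_params_["m__alpha"]
print(f"Best alpha: {best_alpha}, CV R^2: {search.best_score_:.3f}")

# Lasso with an illustrative penalty strength; count coefficients at exactly zero
lasso = Pipeline([("sc", StandardScaler()), ("m", Lasso(alpha=5.0))]).fit(X, y)
n_zero = int((lasso.named_steps["m"].coef_ == 0).sum())
print(f"Lasso coefficients at exactly zero: {n_zero} of {X.shape[1]}")
```

Scaling inside the pipeline matters here because the penalty treats all coefficients alike, so features on different scales would otherwise be shrunk unevenly.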

RandomForestRegressor fits hundreds of decision trees on bootstrap samples of the data and averages their predictions. It captures non-linear relationships and interactions without manual feature engineering. The cost: longer training time, larger memory footprint, less interpretable than a single linear model. Rule of thumb: try LinearRegression first, switch to Ridge if features are correlated, switch to RandomForest if the linear residuals show clear non-linear structure on a residual plot.

Example

                      
# Requires: pip install scikit-learn
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "linear": Pipeline([("sc", StandardScaler()), ("m", LinearRegression())]),
    "ridge":  Pipeline([("sc", StandardScaler()), ("m", Ridge(alpha=1.0))]),
    "forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

for name, m in models.items():
    m.fit(X_tr, y_tr)
    r2_train = m.score(X_tr, y_tr)
    r2_test  = m.score(X_te, y_te)
    cv = cross_val_score(m, X_tr, y_tr, cv=5, scoring="r2").mean()
    print(f"{name:<7} train R^2={r2_train:.3f} test R^2={r2_test:.3f} CV R^2={cv:.3f}")
                      
                    

Common pitfalls

Reporting accuracy on the training set as the model score.

Always evaluate on the held-out test set. Use `train_test_split` with `stratify=y` for classification and a fixed `random_state`, then call `score(X_test, y_test)`. Use the cross-validation mean for model selection.

Fitting the scaler or imputer on the full dataset before splitting, which leaks test statistics.

Wrap preprocessing and the estimator in a `Pipeline`. Call `fit` on the training data only. The pipeline guarantees the scaler sees only training statistics across CV folds and the final test evaluation.

Class imbalance hides as 95% accuracy when the minority class never gets predicted.

Switch metrics to F1, precision, recall, or ROC-AUC. Pass `class_weight="balanced"` to logistic regression or random forest, or resample the training set with imbalanced-learn (`SMOTE`, `RandomUnderSampler`).

LogisticRegression fails to converge and emits a `ConvergenceWarning` suggesting a higher `max_iter`.

Scale the features first with `StandardScaler` inside a Pipeline. Raise `max_iter` to 500 or 1000. Switch the solver to `"liblinear"` for small datasets or `"saga"` for L1 plus large sparse data.

KMeans returns different clusters on every run because centroids initialize randomly.

Set `random_state=42` and `n_init=10` so KMeans runs 10 initializations and keeps the best. For reproducibility across machines, pin scikit-learn and NumPy versions in `requirements.txt`.

Cross-validation scores are much higher than the held-out test score.

Check for data leakage: features computed using future information, the same user appearing in both train and test, or the target accidentally inside X. Use `GroupKFold` when records belong to natural groups and `TimeSeriesSplit` for temporal data.
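
The class-imbalance fix from the list above can be sketched as follows, assuming a synthetic dataset with about 5% positives. It compares minority-class recall with and without `class_weight="balanced"`.

```python
# Requires: pip install scikit-learn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=42, stratify=y)

recalls = {}
for cw in (None, "balanced"):
    clf = LogisticRegression(max_iter=500, class_weight=cw).fit(X_tr, y_tr)
    y_pred = clf.predict(X_te)
    recalls[str(cw)] = recall_score(y_te, y_pred)
    precision = precision_score(y_te, y_pred, zero_division=0)
    print(f"class_weight={cw}: recall={recalls[str(cw)]:.3f} precision={precision:.3f}")
```

Reweighting typically raises recall on the minority class and lowers precision, so the right setting still depends on what each kind of mistake costs.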

When to use machine learning with Python

Reach for scikit-learn whenever the dataset fits in memory and the task is classification, regression, clustering, or dimensionality reduction. For deep learning on images, text, or audio, switch to PyTorch or TensorFlow.
