Data Analysis with Python

Analyzing data using popular Python libraries

Data analysis in Python runs on a 4-library stack: NumPy for array math, Pandas for tabular work, Matplotlib for plotting, Seaborn for statistical charts. A complete coursework pipeline loads a CSV, cleans missing values, groups by a category, and outputs a plot in under 50 lines.

Why NumPy arrays beat Python lists for math (10x to 100x speedups)

NumPy is the foundation library because a NumPy array stores numbers in a contiguous block of typed memory, while a Python list stores pointers to boxed PyObjects scattered across the heap. The result: array arithmetic runs in C-level loops, while list arithmetic runs in Python-level loops with type checks on every element. A multiply on a 1-million-float array takes about 3 ms with NumPy and around 200 ms with a list comprehension on the same machine, though exact timings vary with hardware and library versions.
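A rough way to check the gap on your own machine is a `timeit` comparison. The sketch below is illustrative only: the array size and repeat count are arbitrary choices, and the measured ratio will differ between machines.

```python
import timeit

import numpy as np

n = 1_000_000
values_list = [float(i) for i in range(n)]    # boxed Python floats
values_arr = np.arange(n, dtype=np.float64)   # one contiguous typed buffer

# Time the same elementwise multiply both ways (best of 5 single runs)
list_time = min(timeit.repeat(lambda: [v * 2.0 for v in values_list], number=1, repeat=5))
numpy_time = min(timeit.repeat(lambda: values_arr * 2.0, number=1, repeat=5))

print(f"list comprehension: {list_time * 1000:.1f} ms")
print(f"NumPy multiply:     {numpy_time * 1000:.1f} ms")
print(f"speedup:            {list_time / numpy_time:.0f}x")

# Both approaches compute the same numbers
doubled_list = [v * 2.0 for v in values_list]
doubled_arr = values_arr * 2.0
```

Taking the minimum of several runs filters out one-off noise such as garbage collection pauses.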

Vectorization is the mental model. Replace explicit `for` loops with whole-array operations. `arr * 2` multiplies every element. `np.sqrt(arr)` applies the square root elementwise. `(arr - arr.mean()) / arr.std()` standardizes a column without writing a loop. The shapes of the operands determine how an operation applies: broadcasting stretches any axis of length 1 (or a missing leading axis) to match the other operand, so a scalar or a single row combines with a full array without explicit copying.
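To make the shape rules concrete, here is a minimal sketch that standardizes each column of a 2D array and centers each row. The score table is made-up data for illustration.

```python
import numpy as np

# Hypothetical data: 4 students (rows) x 3 quizzes (columns)
scores = np.array([
    [80.0, 90.0, 70.0],
    [60.0, 85.0, 75.0],
    [90.0, 95.0, 65.0],
    [70.0, 70.0, 90.0],
])

col_mean = scores.mean(axis=0)   # shape (3,)
col_std = scores.std(axis=0)     # shape (3,)

# (4, 3) minus (3,): the per-column stats are broadcast across every row
z = (scores - col_mean) / col_std

# keepdims=True leaves a length-1 axis, shape (4, 1), which stretches across columns
row_centered = scores - scores.mean(axis=1, keepdims=True)

print(z.round(2))
print(row_centered)
```

Without `keepdims=True` the row means would have shape `(4,)`, which does not line up with the trailing axis of a `(4, 3)` array, so the subtraction would fail.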

Indexing is the second pillar. NumPy supports integer indexing (`arr[3]`), slice indexing (`arr[2:7]`), boolean masks (`arr[arr > 5]`), and fancy indexing (`arr[[0, 2, 5]]`). Mask-based selection is the workhorse for filtering: students writing CS50P or DATA 100 problem sets use it to count rows that satisfy a predicate, often combining several conditions with `&` and `|`. Note the bitwise operators, not `and` / `or`: NumPy raises a `ValueError` about an ambiguous truth value if you use the keyword forms on arrays.
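The four indexing styles and the `and` versus `&` distinction can be seen side by side in a short sketch. The array values are arbitrary.

```python
import numpy as np

arr = np.array([3, 8, 1, 9, 4, 7, 2, 6])

print(arr[3])          # integer index: one element
print(arr[2:7])        # slice: positions 2 through 6
print(arr[arr > 5])    # boolean mask: elements greater than 5
print(arr[[0, 2, 5]])  # fancy index: positions 0, 2 and 5

# Compound predicate: parenthesize each comparison, combine with & or |
in_range = (arr > 2) & (arr < 8)
count_in_range = int(in_range.sum())   # each True counts as 1
print("Count in (2, 8):", count_in_range)

# The keyword form fails because a multi-element array has no single truth value
try:
    (arr > 2) and (arr < 8)
except ValueError as err:
    keyword_error = str(err)
    print("ValueError:", keyword_error)
```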

Example

# Requires: pip install numpy
import numpy as np

# Array creation from Python lists
heights_cm = np.array([162, 175, 168, 181, 170, 158, 177])
weights_kg = np.array([55, 78, 64, 85, 72, 50, 80])

# Vectorized BMI for the whole sample, no loop
bmi = weights_kg / (heights_cm / 100) ** 2
print("BMI values:", np.round(bmi, 2))

# Boolean mask: students with BMI above 25
overweight_mask = bmi > 25
print("Count above 25:", overweight_mask.sum())
print("Their heights:", heights_cm[overweight_mask])

# Aggregations
print(f"Mean BMI: {bmi.mean():.2f}, Std: {bmi.std():.2f}")
print(f"Min: {bmi.min():.2f}, Max: {bmi.max():.2f}")
                      
                    

Pandas DataFrame: 6 operations that cover 80% of coursework

A DataFrame is a 2D labeled table, conceptually a dict of Pandas Series (each typically backed by a NumPy array) sharing a row index. The six operations that solve most assignments: `read_csv` to load, square-bracket indexing to select columns, boolean indexing to filter rows, `groupby` to aggregate by category, `merge` to join two tables on a key, and `describe` to summarize. Master these and a typical DATA 100 lab becomes mechanical.

Column selection has two forms. `df['age']` returns a Series. `df[['age', 'name']]` returns a DataFrame with two columns. The double brackets matter: the inner pair is a Python list of column names. Row filtering uses boolean masks the same way NumPy does: `df[df['age'] > 18]` keeps rows where the predicate is True. Combine masks with `&` and `|`, and parenthesize each clause, because in Python `&` and `|` bind more tightly than comparisons such as `>` and `==`.

GroupBy answers questions like "average sales per region" or "max grade per course". The pattern is `df.groupby('region')['sales'].mean()`. The first call partitions the rows, the column selection chooses what to aggregate, and the aggregation method (`mean`, `sum`, `count`, `max`, `std`) collapses each group to a scalar. Swap the single method for `.agg(['mean', 'std', 'count'])` to get multiple statistics at once. Merge joins two DataFrames on a shared key column the same way SQL does: `pd.merge(orders, customers, on='customer_id', how='left')` keeps all orders, attaching customer data where it exists and NaN where it does not.
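A minimal sketch of the left join followed by a multi-statistic aggregation is shown here. The `orders` and `customers` tables and their values are hypothetical.

```python
import pandas as pd

# Hypothetical tables for illustration
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 11, 10, 99],   # 99 has no matching customer
    "sales": [250.0, 100.0, 50.0, 75.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 11],
    "region": ["West", "East"],
})

# Left join: every order survives; the unmatched order gets NaN in region
joined = pd.merge(orders, customers, on="customer_id", how="left")
print(joined)

# Several statistics per group in one call
per_region = joined.groupby("region")["sales"].agg(["mean", "std", "count"])
print(per_region)
```

By default `groupby` skips rows whose key is NaN, so the unmatched order does not appear in `per_region`. A group with a single row reports NaN for `std`.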

Example

# Requires: pip install pandas
import pandas as pd

# Build a small DataFrame the way most labs start
df = pd.DataFrame({
    "student": ["Ana", "Ben", "Cole", "Dia", "Eli", "Fae"],
    "course": ["CS50P", "CS50P", "DATA100", "DATA100", "CS50P", "DATA100"],
    "score": [88, 72, 95, 61, 79, 84],
    "hours": [12, 8, 20, 5, 10, 15],
})

# Filter rows first, then select columns
high_scorers = df[df["score"] >= 80][["student", "course", "score"]]
print("High scorers:")
print(high_scorers)

# GroupBy with multiple aggregations
summary = df.groupby("course").agg(
    mean_score=("score", "mean"),
    max_hours=("hours", "max"),
    n=("student", "count"),
)
print("\nPer-course summary:")
print(summary)

# Correlation between hours studied and score
print(f"\nCorrelation hours-vs-score: {df['hours'].corr(df['score']):.3f}")
                      
                    

Data cleaning: handling missing values and outliers in 3 steps

Real CSVs are dirty. Missing values appear as `NaN` (float) or `None` (object), and unhandled NaN propagates through arithmetic and breaks scikit-learn estimators downstream. The 3-step cleanup: detect, decide, transform. Detect with `df.isna().sum()` to count nulls per column. Decide per column whether to drop the row, fill with a constant, fill with the column mean or median, or forward-fill from the prior row. Transform with `df.dropna()`, `df.fillna(value)`, or `df['col'].fillna(df['col'].median())`.

Median beats mean for filling numeric columns when outliers exist, because outliers pull the mean toward them while the median barely moves. For categorical columns, fill with the mode (`df['col'].mode()[0]`) or with the string `"Unknown"` if missingness itself carries information. As rules of thumb, dropping rows is safe when the missing column is critical and missingness is small (under 5%), and dropping the column is the move when over 60% of its values are missing.
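The fill choices above can be sketched on a small frame. The column names and values are hypothetical, chosen so that one outlier separates the mean from the median.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with gaps in both columns
df = pd.DataFrame({
    "major": ["CS", "Math", None, "CS", None, "Stats"],
    "income": [40.0, 42.0, np.nan, 45.0, 41.0, 400.0],   # 400 is an outlier
})

# Share of missing values per column guides the drop-or-fill decision
missing_share = df.isna().mean()
print(missing_share)

# Numeric: the median (42.0) ignores the outlier, the mean (113.6) does not
income_filled = df["income"].fillna(df["income"].median())

# Categorical, option 1: most frequent value
major_mode = df["major"].fillna(df["major"].mode()[0])

# Categorical, option 2: keep missingness visible as its own label
major_unknown = df["major"].fillna("Unknown")

print(income_filled.tolist())
print(major_mode.tolist())
print(major_unknown.tolist())
```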

Outliers distort summaries and regressions. The interquartile range (IQR) rule flags any value below Q1 minus 1.5 times IQR or above Q3 plus 1.5 times IQR as suspect. The Z-score rule flags values more than 3 standard deviations from the mean. Use IQR when the distribution is skewed, Z-score when the distribution is roughly normal. Cap, transform with `np.log1p`, or remove the offending rows depending on what the assignment asks.
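Both flagging rules and the three responses (cap, transform, remove) fit in a few lines. This is a sketch on made-up values, not a recommendation for any particular dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 20 ordinary values plus one extreme value
base = [12, 15, 14, 13, 16, 15, 14, 13, 15, 14]
values = pd.Series(base * 2 + [120], dtype=float)

# IQR rule: flag anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_flag = (values < low) | (values > high)

# Z-score rule: flag anything more than 3 standard deviations from the mean
z = (values - values.mean()) / values.std()
z_flag = z.abs() > 3

# Three possible responses
capped = values.clip(lower=low, upper=high)   # cap at the IQR fences
logged = np.log1p(values)                     # compress the right tail
removed = values[~iqr_flag]                   # drop flagged rows

print("IQR flags:", int(iqr_flag.sum()), "Z flags:", int(z_flag.sum()))
print("Max after capping:", capped.max())
```

In very small samples the Z-score rule can miss an obvious outlier, because the outlier itself inflates the standard deviation used to judge it.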

Example

# Requires: pip install pandas numpy
import pandas as pd
import numpy as np

# Synthetic dirty data
df = pd.DataFrame({
    "score": [88, 72, np.nan, 61, 79, 84, 999],     # 999 is a sentinel outlier
    "hours": [12, 8, 20, 5, np.nan, 15, 6],
})

# Step 1: count nulls
print("Missing per column:\n", df.isna().sum(), "\n")

# Step 2: median-fill numeric columns
df_filled = df.fillna(df.median(numeric_only=True))

# Step 3: keep only rows whose score lies inside the IQR fences
q1, q3 = df_filled["score"].quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df_clean = df_filled[(df_filled["score"] >= low) & (df_filled["score"] <= high)]

print(f"IQR bounds: [{low:.1f}, {high:.1f}]")
print("Cleaned rows:", len(df_clean), "of", len(df))
                      
                    

Matplotlib and Seaborn: 5 plot types for 90% of lab reports

Five plot types cover almost every assignment: line for trends over time, scatter for relationships between two numeric variables, histogram for the distribution of one variable, bar for categorical comparisons, box for distribution plus outliers. Matplotlib gives the low-level control. Seaborn wraps Matplotlib with statistical defaults, sensible color palettes, and one-call APIs like `sns.boxplot` and `sns.heatmap`.

The pyplot rhythm is consistent. Create a figure with `plt.figure(figsize=(8, 5))`. Draw with the plot function. Label axes and add a title. Call `plt.show()` to render in a script or notebook, or `plt.savefig('out.png', dpi=150, bbox_inches='tight')` to write the figure to a file. Forgetting axis labels is the number-one deduction in CSE 163 lab grading: every axis needs units, every plot needs a title.
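The rhythm can be sketched as a short script. The data is made up, and the `Agg` backend line is an assumption for running without a display; remove it in a notebook.

```python
import os

import matplotlib
matplotlib.use("Agg")   # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical measurements
weeks = [1, 2, 3, 4, 5, 6]
avg_score = [62, 68, 71, 75, 74, 81]

plt.figure(figsize=(8, 5))                        # 1. create the figure
plt.plot(weeks, avg_score, marker="o")            # 2. draw
plt.xlabel("Week of term")                        # 3. label axes, with units
plt.ylabel("Average quiz score (points, 0-100)")
plt.title("Average quiz score by week")
plt.savefig("out.png", dpi=150, bbox_inches="tight")   # 4. write to file
plt.close()                                       # release the figure

print(os.path.getsize("out.png"), "bytes written")
```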

Seaborn shines for correlation heatmaps and pairwise plots. A one-line call (`sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='Greys')`) produces an annotated correlation matrix ready for a report. `sns.pairplot(df, hue='category')` gives an n-by-n grid of scatter plots, which is the fastest way to spot relationships in a new dataset. For comparing groups, `sns.boxplot(x='course', y='score', data=df)` shows the median, quartiles, and outliers per group in a single call.

Example

# Requires: pip install pandas matplotlib seaborn
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({
    "course": ["CS50P"] * 4 + ["DATA100"] * 4,
    "score": [88, 72, 79, 91, 95, 61, 84, 77],
    "hours": [12, 8, 10, 14, 20, 5, 15, 11],
})

# Two side-by-side subplots
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Boxplot by course
sns.boxplot(x="course", y="score", data=df, ax=axes[0], color="lightgray")
axes[0].set_title("Score distribution by course")
axes[0].set_ylabel("Score (0-100)")

# Scatter with regression line
sns.regplot(x="hours", y="score", data=df, ax=axes[1], color="black")
axes[1].set_title("Hours studied vs final score")
axes[1].set_xlabel("Hours studied per week")

plt.tight_layout()
plt.savefig("analysis.png", dpi=120, bbox_inches="tight")
print("Saved analysis.png")
                      
                    

End-to-end coursework example: load, clean, group, plot

The full pipeline ties the prior 4 sections into one script. A DATA 100 or CSE 163 lab brief usually reads: "Given the attached CSV, report the mean and median per category, flag outliers using IQR, and produce a visualization." That is 30 to 40 lines of Pandas plus Matplotlib.

The order matters. Load first with `pd.read_csv`, set dtypes if the CSV is ambiguous, then immediately call `df.info()` and `df.describe()` to check schema and ranges. Cleaning comes next: fill or drop nulls, flag outliers, normalize units if the assignment asks. Aggregation runs after cleaning so the statistics reflect the cleaned data. Plotting is last and pulls from the cleaned, aggregated frame.
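The load-and-inspect step might look like the sketch below. The inline text stands in for a real file, and the column names and dtype choices are assumptions for illustration.

```python
from io import StringIO

import pandas as pd

# Inline text stands in for a real file path (hypothetical data)
raw = StringIO(
    "student_id,course,score\n"
    "00123,CS50P,88\n"
    "00456,DATA100,\n"
    "00789,CS50P,79\n"
)

# Without an explicit dtype, student_id would parse as an integer and lose its leading zeros
df = pd.read_csv(raw, dtype={"student_id": "string", "course": "category"})

df.info()              # column names, dtypes, non-null counts
print(df.describe())   # count, mean, std, min, quartiles, max for numeric columns
```

Here `df.info()` reveals the one missing score before any statistic is computed, which is the point of inspecting first.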

Reproducibility matters for grading. Set `np.random.seed(42)` if any randomness enters (sampling, train-test splits). Use relative file paths so the notebook runs on the grader's machine. Save intermediate results to CSV if the lab specifies, with `df_clean.to_csv('clean.csv', index=False)`. The `index=False` prevents an unwanted unnamed first column on reload, a common loss of two points in Gradescope auto-grading.
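The effect of `index=False` is easy to verify with a round trip. This sketch writes to in-memory buffers instead of files, and the frame contents are hypothetical.

```python
from io import StringIO

import pandas as pd

df_clean = pd.DataFrame({"course": ["CS50P", "DATA100"], "mean_score": [79.7, 89.5]})

# Default: the row index is written as an extra, unnamed first column
with_index = StringIO()
df_clean.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the real columns are written
without_index = StringIO()
df_clean.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))   # ['Unnamed: 0', 'course', 'mean_score']
print(list(reloaded_clean.columns))        # ['course', 'mean_score']
```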

Example

# Requires: pip install pandas matplotlib
import pandas as pd
import matplotlib.pyplot as plt
from io import StringIO

# Inline CSV stands in for read_csv("students.csv")
csv = """name,course,score,hours
Ana,CS50P,88,12
Ben,CS50P,72,8
Cole,DATA100,95,20
Dia,DATA100,,5
Eli,CS50P,79,10
Fae,DATA100,84,15
Gus,CS50P,999,9
"""
df = pd.read_csv(StringIO(csv))

# Clean: drop rows with missing scores, then drop scores above the 100-point maximum
df = df.dropna(subset=["score"])
df = df[df["score"] <= 100]

# Aggregate
summary = df.groupby("course")["score"].agg(["mean", "median", "count"])
print(summary)

# Plot
summary["mean"].plot(kind="bar", color="dimgray", figsize=(6, 4))
plt.ylabel("Mean score")
plt.title("Average score per course (cleaned)")
plt.tight_layout()
plt.savefig("course_means.png", dpi=120)
print("Wrote course_means.png")
                      
                    

Common pitfalls

SettingWithCopyWarning when you chain indexing like `df[df.x > 0]["y"] = 1`.

Use a single `.loc` call. Rewrite as `df.loc[df["x"] > 0, "y"] = 1`. The chained version may assign into a temporary copy rather than the original frame, so the assignment can be silently lost.
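A minimal sketch of the safe pattern, on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"x": [-1, 2, 3, -4], "y": [0, 0, 0, 0]})

# One .loc call selects rows and column together, so the write reaches df itself
df.loc[df["x"] > 0, "y"] = 1
print(df)

# If you need a filtered frame that you will modify later, take an explicit copy
positive = df[df["x"] > 0].copy()
positive["y"] = 5          # changes only the copy, never df
print(positive)
```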

NumPy raises `ValueError`: "The truth value of an array with more than one element is ambiguous" when you combine masks with `and` or `or`.

Use the bitwise operators `&` and `|`, and parenthesize each clause. `(df["a"] > 0) & (df["b"] < 5)`, not `df["a"] > 0 and df["b"] < 5`.

Filling NaN with the column mean shrinks the variance and biases regression coefficients, and the mean itself is distorted by outliers.

Fill with the median for skewed numeric columns. For columns where missingness carries signal, add an indicator column `df["col_was_missing"] = df["col"].isna().astype(int)` before filling.
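The sketch below records missingness in an indicator column before filling, and also measures what a mean fill does to the sample variance. The `hours` values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two gaps
df = pd.DataFrame({"hours": [4.0, 6.0, np.nan, 10.0, np.nan, 20.0]})

# Record missingness first; after fillna that information is gone
df["hours_was_missing"] = df["hours"].isna().astype(int)

# Compare the spread before and after a mean fill
var_before = df["hours"].var()                         # NaN is skipped by default
mean_filled = df["hours"].fillna(df["hours"].mean())
var_after = mean_filled.var()
print(f"Variance before: {var_before:.2f}, after mean fill: {var_after:.2f}")

# Median fill for the column actually kept
df["hours"] = df["hours"].fillna(df["hours"].median())
print(df)
```

Every filled value sits exactly at the center of the column, so it adds rows without adding spread, and the variance drops.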

Matplotlib charts ship without axis labels or titles and lose points in grading.

Add `plt.xlabel`, `plt.ylabel`, and `plt.title` for every plot. Use `plt.tight_layout()` before saving so labels do not clip. Save with `bbox_inches="tight"`.

`pd.merge` produces unexpected row counts because the join key has duplicates on both sides.

Check `df["key"].duplicated().sum()` on both frames before merging. Pass `validate="one_to_one"` or `"one_to_many"` to `pd.merge` so Pandas raises an error when the relationship is wrong.
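As a sketch of both checks, the example below uses two hypothetical tables in which a customer row is duplicated by mistake.

```python
import pandas as pd

# Hypothetical tables; customer_id 10 appears twice in customers by mistake
orders = pd.DataFrame({"customer_id": [10, 10, 11], "sales": [250, 50, 100]})
customers = pd.DataFrame({
    "customer_id": [10, 10, 11],
    "region": ["West", "West", "East"],
})

print("duplicate keys in orders:   ", orders["customer_id"].duplicated().sum())
print("duplicate keys in customers:", customers["customer_id"].duplicated().sum())

# Unchecked merge: each order for customer 10 matches both customer rows
unchecked = pd.merge(orders, customers, on="customer_id")
print("rows after unchecked merge:", len(unchecked))

# many_to_one requires unique keys in the right-hand frame
try:
    pd.merge(orders, customers, on="customer_id", validate="many_to_one")
    merge_failed = False
except pd.errors.MergeError as err:
    merge_failed = True
    print("MergeError:", err)
```

Three orders turn into five rows here, which is the kind of silent inflation that `validate` converts into an immediate error.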

Plots in Jupyter render on screen but save as a blank image when `plt.show()` is called before `plt.savefig`.

Reverse the order. Call `plt.savefig("out.png")` first, then `plt.show()`. Once `plt.show()` returns, the figure is typically closed, so a later `plt.savefig` creates and saves a fresh, empty canvas.

When to use data analysis with Python

Reach for Pandas plus NumPy whenever an assignment hands you tabular data and asks for summary statistics, filtering, grouping, or plotting. For datasets above roughly 10 million rows or for streaming I/O, swap Pandas for Polars or Dask.
