Skip to main content

Data Structures

Working with Python's built-in data structures

Python's built-in data structures cover 95% of coursework storage needs without pip installing anything: list, dict, set, tuple, and str. Each has a fixed-cost profile that decides whether a homework script runs in 2 seconds or 2 minutes on the grader. Pick the right one upfront and the rest of the assignment writes itself; pick the wrong one and you end up nesting for-loops inside for-loops to fake what a dict already does in O(1).

Lists: ordered, mutable, indexed by position

A list stores items in insertion order, allows duplicates, and lets you mutate it in place. Index access list[i] is O(1), but membership tests like value in list scan from the start at O(n). On a 100,000-item list, 1,000 membership tests cost 100 million comparisons. Switch to a set for membership and the cost drops to 1,000.

List comprehensions [expr for item in iterable if condition] are the Pythonic way to build a new list from another iterable. They run faster than the equivalent for-loop with .append() because the interpreter optimizes the dedicated bytecode. The pattern shows up in CS50P problem sets weekly: filter a list of student records, square a list of numbers, extract every word longer than 4 characters.

Lists are the right container when order matters, when you need to mutate (append, pop, insert), or when the items repeat. Lists are the wrong container when you only need fast lookup, when you need every item to be unique, or when the data must not change after creation.

Example

                      
                        # Build a list of squares for x in 0 through 9, keep only the even ones
squares = [x ** 2 for x in range(10) if x % 2 == 0]  # list comprehension
print(f'Even squares: {squares}')  # [0, 4, 16, 36, 64]

scores = [78, 92, 65, 88, 95, 71]  # quiz scores in submission order
scores.append(83)  # add one more at the end, O(1)
scores.sort(reverse=True)  # in-place sort, descending
print(f'Top 3: {scores[:3]}')  # slice the first three

# Membership scan vs set lookup
names = ['Alice', 'Bob', 'Carol', 'Dave']  # small list
if 'Bob' in names:  # O(n) for list, fine at this size
    print('Bob is enrolled')
                      
                    

Dictionaries: O(1) lookup by key, the workhorse of Python

A dict maps unique keys to values with average O(1) lookup, insert, and delete. The hash table underneath turns key in dict and dict[key] into a near-constant operation regardless of size. A dict with 1,000,000 entries answers a lookup in the same time as one with 10.

The most common coursework pattern is counting frequencies: how many times each word appears in a paragraph, how many students earned each letter grade, how many requests hit each URL in a server log. The collections.Counter subclass of dict does this in one line, but the manual pattern with dict.get(key, 0) + 1 is what TAs expect in the first weeks of CS106A or CSE 163.

Keys must be hashable: str, int, float, tuple of hashables, frozenset. Lists are not hashable and raise TypeError if used as keys. Since Python 3.7, dicts preserve insertion order by language guarantee, so iteration order matches the order keys were added.

Example

                      
                        # Count word frequencies in a sentence
sentence = 'the quick brown fox jumps over the lazy dog the fox'
words = sentence.split()  # list of strings, split on whitespace

counts = {}  # empty dict, the accumulator
for word in words:
    counts[word] = counts.get(word, 0) + 1  # default 0 if key absent

print(f'Word counts: {counts}')
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, ...}

# Lookup by key, O(1) average
print(f"'fox' appeared {counts['fox']} times")

# Safe lookup, returns None or a default if key missing
print(f"'cat' count: {counts.get('cat', 0)}")  # 0, no KeyError
                      
                    

Sets: unique items, O(1) membership, fast deduplication

A set is an unordered collection of unique, hashable items. Adding a duplicate is a no-op, so set(iterable) deduplicates in one call. Membership tests x in set are O(1), making sets the right answer whenever a list-of-strings becomes large enough that in checks slow the program.

Sets support mathematical operations directly: union (|), intersection (&), difference (-), symmetric difference (^). A coursework prompt asking "which students submitted both assignments" is one line: set_a & set_b. The same prompt with two lists takes a nested loop or a comprehension over a membership check.

The trade-off is order. Sets do not preserve insertion order, do not support indexing (set[0] raises TypeError), and do not allow duplicates. If you need any of those, use a list. If you need none of those and want fast lookup or set algebra, use a set.

Example

                      
                        # Deduplicate a list of student IDs that may contain repeats
raw_ids = [101, 205, 101, 308, 205, 412, 101]
unique_ids = set(raw_ids)  # deduplicates in O(n)
print(f'Unique IDs: {sorted(unique_ids)}')  # [101, 205, 308, 412]

# Set algebra: students who submitted both quizzes
quiz1_submitters = {'Alice', 'Bob', 'Carol', 'Dave'}
quiz2_submitters = {'Bob', 'Carol', 'Eve', 'Frank'}

both = quiz1_submitters & quiz2_submitters  # intersection
print(f'Submitted both: {sorted(both)}')  # ['Bob', 'Carol']

only_quiz1 = quiz1_submitters - quiz2_submitters  # difference
print(f'Quiz 1 only: {sorted(only_quiz1)}')  # ['Alice', 'Dave']

all_submitters = quiz1_submitters | quiz2_submitters  # union
print(f'Anyone who submitted: {len(all_submitters)}')  # 6
                      
                    

Tuples: immutable records and dict keys

A tuple is an ordered, immutable sequence. Once created, you cannot change its contents. That immutability lets tuples be hashable (as long as every element is hashable), which is why tuples work as dict keys and set members while lists do not.

Use tuples for fixed-size records: a 2D coordinate (x, y), an RGB color (r, g, b), a date (year, month, day), a student record (id, name, gpa). The shape never changes, so a tuple signals intent. Multiple-return functions use tuples implicitly: return mean, median, std returns a 3-tuple the caller can unpack with mean, median, std = compute_stats(data).

Named tuples from the typing or collections modules add field names without losing immutability or speed. typing.NamedTuple is the modern subclass form. They give you point.x instead of point[0], which reads better and survives refactors when fields move.

Example

                      
                        from typing import NamedTuple  # standard library since 3.5+

# Plain tuple as a 2D point
origin = (0, 0)  # x, y
target = (3, 4)

# Tuple unpacking, common in coursework
x, y = target  # x is 3, y is 4
print(f'Target x={x}, y={y}')

# Function returning multiple values via tuple
def stats(numbers):
    return min(numbers), max(numbers), sum(numbers) / len(numbers)

lo, hi, mean = stats([4, 8, 15, 16, 23, 42])  # unpack 3-tuple
print(f'Range {lo}-{hi}, mean {mean:.2f}')  # 4-42, 18.00

# Named tuple for a richer record
class Student(NamedTuple):
    student_id: int
    name: str
    gpa: float

alice = Student(101, 'Alice', 3.85)
print(f'{alice.name} has GPA {alice.gpa}')  # field access by name
                      
                    

Strings: sequences of characters with 40+ built-in methods

A str is an immutable sequence of Unicode characters. Every string operation that looks like mutation (.upper(), .replace(), .strip()) returns a new string and leaves the original unchanged. That immutability makes strings hashable, safe to use as dict keys, and safe to share across threads.

Indexing and slicing work the same as lists: text[0] is the first character, text[-1] is the last, text[1:5] is characters 1 through 4, text[::2] is every other character. The slice notation is the same idiom that powers every list slice and tuple slice, so learning it once pays back across every data structure.

The methods every student uses weekly: .strip() removes whitespace from both ends, .split() breaks on a delimiter and returns a list, .join() concatenates a list of strings with a separator, .lower() and .upper() normalize case, .startswith() and .endswith() check prefix and suffix, .replace() swaps substrings, f-strings interpolate values cleanly.

Example

                      
                        # Parse a homework brief header line
header = '  Student: Alice Carter | Course: CS50P | Grade: A-  '

clean = header.strip()  # remove surrounding whitespace
print(f'Clean: {repr(clean)}')

parts = clean.split('|')  # ['Student: Alice Carter ', ' Course: CS50P ', ' Grade: A-']
fields = {}
for part in parts:
    key, value = part.split(':', 1)  # split on first colon only
    fields[key.strip().lower()] = value.strip()  # normalize keys

print(f'Parsed fields: {fields}')
# {'student': 'Alice Carter', 'course': 'CS50P', 'grade': 'A-'}

# Slicing
name = fields['student']
first = name[:name.index(' ')]  # everything before the first space
last = name[name.index(' ') + 1:]  # everything after
print(f'First: {first}, Last: {last}')
                      
                    

Picking the right structure: a 4-question decision tree

Four questions decide every container choice. Does order matter? If yes, list or tuple; if no, dict or set. Do you need to mutate after creation? If yes, list or dict or set; if no, tuple or frozenset. Do you look up by key or position? Key means dict; position means list or tuple. Are duplicates allowed? Yes for list, no for set or dict.

A few concrete pairings from common coursework. Student grades by name: dict (lookup by key, O(1)). Words in a paragraph in order: list (order matters, duplicates expected). Unique tags on a blog post: set (deduplication, fast membership). A 3D coordinate that never changes: tuple (immutable record). Lines of a file read into memory: list (order matters, append-friendly).

The wrong default in beginner code is using a list for everything. A class roster of 5,000 students looked up by name 1,000 times runs 5,000,000 comparisons as a list and 1,000 as a dict. The performance gap shows up in autograded assignments with a 30-second time limit.

Example

                      
                        # Concrete pairings for the four-question test
# Order matters + duplicates ok -> list
submissions_in_order = ['Alice', 'Bob', 'Alice', 'Carol']
print(f'First submission: {submissions_in_order[0]}')

# Lookup by name -> dict
grades = {'Alice': 92, 'Bob': 78, 'Carol': 88}
print(f"Bob's grade: {grades['Bob']}")  # O(1) lookup

# Unique items, fast membership -> set
enrolled = {'Alice', 'Bob', 'Carol', 'Dave'}
print(f'Eve enrolled? {"Eve" in enrolled}')  # False, O(1)

# Fixed-shape record, immutable -> tuple
origin = (0, 0, 0)  # x, y, z
print(f'Origin: {origin}')

# Pick by intent, not by default
rubric_weights = {'quizzes': 0.3, 'midterm': 0.3, 'final': 0.4}
total_weight = sum(rubric_weights.values())
print(f'Total rubric weight: {total_weight}')  # 1.0
                      
                    

Common pitfalls

Using a list for membership tests on 10,000+ items makes the script timeout on Gradescope.

Convert the list to a set once with seen = set(my_list) before the loop. Membership drops from O(n) to O(1).

Mutating a list while iterating over it skips items or raises RuntimeError unpredictably.

Iterate over a copy with for item in my_list[:], or build a new list with a comprehension instead of mutating in place.

Using a list as a dict key raises TypeError: unhashable type: list.

Convert the list to a tuple with tuple(my_list) before using it as a key. Tuples are hashable when their elements are.

Relying on dict iteration order from older code: Python 3.6 made it implementation detail, 3.7+ made it a guarantee.

On Python 3.7+ insertion order is guaranteed. For older runtimes use collections.OrderedDict explicitly.

Trying to assign s[0] = "X" on a string raises TypeError because strings are immutable.

Build a new string with concatenation, .replace(), or "".join(list_of_chars). Never mutate a str in place.

Calling my_set[0] raises TypeError because sets do not support indexing.

Convert to a list with sorted(my_set) or list(my_set) first, then index. Sets are unordered by design.

When to use data structures

Use built-in data structures for any coursework that fits in memory and does not require columnar or numerical vectorization. Switch to numpy arrays for numeric math over 10,000+ elements and pandas DataFrames for tabular work with named columns and joins.

Need Help?

Having trouble with this topic on an assignment? Our Python developers ship working code plus a walkthrough that helps you explain the code in class.