File Handling

Work with files using Python

Python file handling reads, writes, and parses data on disk through built-in functions and the stdlib modules csv, json, and pathlib. The default approach in modern Python uses a context manager (with open(...) as f) so the file is closed, and its descriptor handed back to the operating system, as soon as your block exits, even if an exception fires. That single habit heads off many of the bugs students hit in CS50P, DATA 100, and CSE 163 file I/O assignments.

Why context managers replace manual open and close

The with statement opens a file, hands you the file object, then closes it on block exit. Closing matters: an unclosed file may keep buffered bytes in memory and never write them to disk, and the OS caps how many descriptors one process holds (the soft limit is 1,024 on many Linux defaults). A loop that opens 5,000 log files and keeps every handle open fails with OSError: [Errno 24] Too many open files.

The manual pattern f = open('data.txt'); ... ; f.close() looks fine until a line in between raises. Then the close() call never runs, buffered output may never reach disk, and Gradescope reports a missing or incomplete output file. Wrap every open() in with and the close() is automatic, paired with the matching open even through exceptions.
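To make the guarantee concrete, here is a minimal sketch that compares a manual open paired with try/finally against the with form. The file name, its contents, and the helper names parse_manual and parse_with are assumptions made up for illustration; a temporary directory keeps the sketch self-contained.

```python
import tempfile
from pathlib import Path

handles = []  # keep references so .closed can be inspected afterwards


def parse_manual(path):
    # Manual pattern made safe: finally runs whether or not int() raises
    f = open(path, encoding="utf-8")
    handles.append(f)
    try:
        return [int(line) for line in f]  # ValueError on non-numeric input
    finally:
        f.close()


def parse_with(path):
    # Same guarantee with less code: close() happens at block exit
    with open(path, encoding="utf-8") as f:
        handles.append(f)
        return [int(line) for line in f]


raised = []  # names of the helpers whose call raised ValueError
with tempfile.TemporaryDirectory() as tmp:
    scratch = Path(tmp) / "data.txt"  # hypothetical input file
    scratch.write_text("42\nnot a number\n", encoding="utf-8")

    for parse in (parse_manual, parse_with):
        try:
            parse(scratch)
        except ValueError:
            raised.append(parse.__name__)

print(raised)
print([h.closed for h in handles])  # both handles are closed despite the error
```

Both helpers hit the same ValueError on the second line, and both file objects end up closed. The with version gets there without the explicit try/finally.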

A single with statement also chains multiple files in one line: with open('in.csv') as src, open('out.csv', 'w') as dst:. Both files close at block exit regardless of which one raised.
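As a sketch of the two-file form, the snippet below copies a small CSV while upper-casing each line. The file names and contents are illustrative assumptions, and the backslash continuation is used only so the statement fits on two lines in any Python 3 version.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    src_path = Path(tmp) / "in.csv"   # hypothetical input
    dst_path = Path(tmp) / "out.csv"  # hypothetical output
    src_path.write_text("name,score\nalice,92\nbob,78\n", encoding="utf-8")

    # One with statement manages both files
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line.upper())

    both_closed = src.closed and dst.closed
    copied = dst_path.read_text(encoding="utf-8")

print(both_closed)
print(copied)
```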

Example

                      
# Read a homework brief, count its lines, write the count to a result file
with open('brief.txt', 'r', encoding='utf-8') as src:  # opens for reading, UTF-8 decoded
    lines = src.readlines()  # list of strings, one per line

line_count = len(lines)  # integer count

with open('result.txt', 'w', encoding='utf-8') as dst:  # opens for writing, truncates first
    dst.write(f'Lines: {line_count}\n')  # write expects a string, not bytes

print(f'Wrote {line_count} to result.txt')  # observable output for grading
# Both files are already closed here, no f.close() needed
                      
                    

Text mode versus binary mode, and why encoding matters

Text mode ('r', 'w', 'a') decodes bytes to str using an encoding. UTF-8 is the safe cross-platform choice, but unless you pass encoding= explicitly, most current Python versions take the default from the platform locale. Binary mode ('rb', 'wb', 'ab') returns raw bytes objects and skips decoding and newline translation entirely. Pick binary for images, PDFs, ZIP archives, or anything where a stray newline conversion corrupts the file.

The trap is the platform default encoding. On Windows the default is often cp1252; on macOS and Linux it is usually UTF-8. A homework script that reads a CSV without encoding='utf-8' works on the student's Mac, then on the TA's Windows grader either raises UnicodeDecodeError or silently produces garbled characters at the first non-ASCII character. Always pass encoding='utf-8' to open() in text mode.
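The mismatch can be reproduced on any machine by naming the wrong encoding on purpose. This is a small sketch under assumed file names and contents: the first read shows the silent garbling case, the second shows the decode error.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # File saved as UTF-8, then read as if it were cp1252
    utf8_csv = Path(tmp) / "utf8.csv"  # hypothetical file
    utf8_csv.write_text("José,92\n", encoding="utf-8")
    with open(utf8_csv, encoding="cp1252") as f:
        garbled = f.read()  # no exception, but the accent is corrupted

    # File saved as cp1252, then read as if it were UTF-8
    legacy_csv = Path(tmp) / "legacy.csv"  # hypothetical file
    legacy_csv.write_text("José,92\n", encoding="cp1252")
    try:
        with open(legacy_csv, encoding="utf-8") as f:
            f.read()
        decode_failed = False
    except UnicodeDecodeError:
        decode_failed = True

print(repr(garbled))
print(decode_failed)
```

The silent case is the more dangerous one: the script runs to completion and writes the wrong name into its output.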

Binary mode also gives you exact byte counts. len(content) on a text-mode read returns characters, which differ from bytes for any multi-byte codepoint. For checksum, hash, or network protocol work, open binary.
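A short sketch of the difference, using an assumed sample string with three multi-byte characters. The same file is read once in text mode and once in binary mode, and the hash is computed over the raw bytes.

```python
import hashlib
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"  # hypothetical file
    sample.write_bytes("naïve café ☕".encode("utf-8"))

    with open(sample, "r", encoding="utf-8") as f:
        char_count = len(f.read())  # counts decoded characters

    with open(sample, "rb") as f:
        raw = f.read()
        byte_count = len(raw)  # counts bytes on disk

digest = hashlib.sha256(raw).hexdigest()  # hashing works on bytes, not str

print(char_count, byte_count)
print(digest[:16])
```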

Example

                      
# Copy an image as raw bytes, then read a CSV as decoded text
with open('photo.jpg', 'rb') as src:  # binary read, returns bytes
    image_bytes = src.read()  # whole file into memory

with open('photo_copy.jpg', 'wb') as dst:  # binary write, truncates
    dst.write(image_bytes)

print(f'Copied {len(image_bytes)} bytes')  # exact byte count

with open('students.csv', 'r', encoding='utf-8') as f:  # text mode, UTF-8 explicit
    first_line = f.readline().strip()  # first row, newline stripped

print(f'CSV header: {first_line}')  # decoded string, safe on Windows graders
                      
                    

Line iteration beats read() for files over 100 MB

Iterating a file object yields one line at a time, holding only that line in memory. A 10 GB server log streams through a 50 MB Python process when you iterate. The same file with f.read() tries to load all 10 GB into a single str and, on a machine without that much free memory, raises MemoryError.

Use for line in f: for streaming patterns: log parsing, CSV processing without pandas, line-numbered output. The newline character stays attached, so .strip() or .rstrip('\n') trims it. Use f.read() only when the entire file content drives the next operation, like a JSON document or a small config.

f.readlines() falls in between. It returns a list of every line, which costs roughly the same memory as f.read() plus list overhead. Reach for it only when you need random access to lines, like reading the 47th line of a 200-line homework dataset.
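For the "47th line" case, a sketch with an assumed 200-line dataset shows two options: readlines() when several lines will be indexed, and itertools.islice when only one line is needed and the rest of the file should not be kept in memory.

```python
import itertools
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "dataset.txt"  # hypothetical 200-line file
    dataset.write_text(
        "".join(f"row {n}\n" for n in range(1, 201)), encoding="utf-8"
    )

    # Random access: load every line, then index (line 47 is index 46)
    with open(dataset, encoding="utf-8") as f:
        lines = f.readlines()
    line_47 = lines[46].rstrip("\n")

    # Streaming alternative: skip 46 lines, take one, keep nothing else
    with open(dataset, encoding="utf-8") as f:
        streamed_47 = next(itertools.islice(f, 46, 47)).rstrip("\n")

print(len(lines), line_47, streamed_47)
```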

Example

                      
# Stream a log file, count how many lines contain ERROR
error_count = 0  # accumulator

with open('app.log', 'r', encoding='utf-8') as log:
    for line in log:  # iterator, one line in memory at a time
        if 'ERROR' in line:  # substring check
            error_count += 1

print(f'ERROR lines: {error_count}')  # final count

# Compare to the memory-heavy version that loads everything
with open('app.log', 'r', encoding='utf-8') as log:
    all_text = log.read()  # entire file as one string, can blow up RAM
    total_errors = all_text.count('ERROR')  # one scan, but full load

print(f'Total ERROR substring hits: {total_errors}')  # can exceed error_count when one line contains ERROR more than once
                      
                    

Reading CSV with the csv stdlib module

The csv module handles quoting, delimiters, escaped commas, and embedded newlines that a naive line.split(',') misses. A name field of "Smith, Jr." has a comma inside quotes; csv.reader returns it as the single field 'Smith, Jr.', while split breaks it into two pieces and shifts every later column.
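The difference shows up on a single line of input. In this sketch the sample row is an assumption, and io.StringIO stands in for an open file.

```python
import csv
import io

raw_line = '"Smith, Jr.",88\n'  # one CSV row with a quoted comma

# Naive split treats the quoted comma as a delimiter
naive_fields = raw_line.rstrip("\n").split(",")

# csv.reader understands the quotes and removes them
parsed_fields = next(csv.reader(io.StringIO(raw_line)))

print(naive_fields)
print(parsed_fields)
```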

csv.reader yields each row as a list of strings, in file order. csv.DictReader yields each row as a dict keyed by the header row, which is the right shape for assignments that ask you to look up by column name. Both take a file object; open it with newline='' so the csv module handles line endings itself. Without it, newlines inside quoted fields can be misread, and csv.writer on Windows emits an extra \r that shows up as blank rows.

The csv module ships with the standard library: zero pip install, available on every Gradescope and AutoGrader environment. For tabular work beyond filtering or summing, switch to pandas. For coursework prompts that say "use only the standard library," csv is the answer.

Example

                      
import csv  # standard library, no install

# students.csv contents:
# name,score
# Alice,92
# Bob,78

with open('students.csv', 'r', encoding='utf-8', newline='') as f:  # newline='' for csv
    reader = csv.DictReader(f)  # rows as dicts keyed by header
    high_scorers = []  # output list

    for row in reader:  # one row at a time
        score = int(row['score'])  # string from CSV, cast to int
        if score >= 80:
            high_scorers.append(row['name'])  # collect passing names

print(f'High scorers: {high_scorers}')  # observable list

# Writing CSV back
with open('passers.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)  # writes lists as rows
    writer.writerow(['name'])  # header
    for name in high_scorers:
        writer.writerow([name])  # one row per name
                      
                    

Reading and writing JSON with the json module

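json.load(file_object) parses a JSON document into a Python dict, list, or scalar in one call. json.dump(obj, file_object) writes a Python object back as JSON. The two functions handle the type mapping in both directions: dict, list, str, int, float, bool, None map to JSON object, array, string, number, true/false, null. Two conversions are one-way: a tuple is written as an array and comes back as a list, and non-string dict keys such as ints come back as strings.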
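A quick round trip illustrates the mapping, including the cases where the value that comes back is not identical to the one written. The record is made up for illustration, and io.StringIO stands in for a file opened in text mode.

```python
import io
import json

record = {
    "name": "Alice",
    "scores": (92, 88.5),  # tuple is written as a JSON array
    "passed": True,
    "advisor": None,
    1: "int key",          # non-string key is written as "1"
}

buffer = io.StringIO()     # stand-in for an open text file
json.dump(record, buffer)
buffer.seek(0)
restored = json.load(buffer)

print(restored)
```

The str, float, bool, and None values come back unchanged. The tuple returns as a list and the integer key returns as the string "1", so restored is not equal to record.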

Use json for configs, API responses saved to disk, gradebook exports, and any structured data that nests. CSV is flat; JSON nests. A student record with a list of grades and a nested address dict serializes cleanly to JSON and loses structure if forced into CSV.

Watch the difference between load (reads from a file) and loads (reads from a string). The trailing 's' is for 'string'. Use load with a file object inside a with block, and loads when you already have JSON text in a variable.
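The pairing can be sketched in a few lines. The JSON text is an illustrative assumption, and io.StringIO again plays the role of an open file.

```python
import io
import json

json_text = '{"course": "CS50P", "week": 6}'

from_string = json.loads(json_text)            # loads: parse a str
from_file = json.load(io.StringIO(json_text))  # load: parse a file-like object
back_to_text = json.dumps(from_string)         # dumps: serialize to a str

# Passing a str to load fails because load calls .read() on its argument
try:
    json.load(json_text)
    wrong_call_failed = False
except AttributeError:
    wrong_call_failed = True

print(from_string == from_file, back_to_text, wrong_call_failed)
```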

Example

                      
import json  # standard library

student = {
    'name': 'Alice',
    'scores': [92, 88, 95],  # list nests cleanly in JSON
    'address': {'city': 'Boston', 'zip': '02139'},  # nested dict
}

# Write Python dict to JSON file
with open('student.json', 'w', encoding='utf-8') as f:
    json.dump(student, f, indent=2)  # indent=2 makes it human-readable

# Read it back into a fresh dict
with open('student.json', 'r', encoding='utf-8') as f:
    loaded = json.load(f)  # one call, fully parsed

print(f"Name: {loaded['name']}")  # dict access
print(f"First score: {loaded['scores'][0]}")  # nested list access
print(f"City: {loaded['address']['city']}")  # nested dict access
                      
                    

pathlib for cross-platform file paths

pathlib.Path replaces string concatenation and os.path.join with a typed Path object that works the same on Windows, macOS, and Linux. Path('data') / 'students.csv' builds a path whose string form is data/students.csv on Unix and data\students.csv on Windows, with no manual separator handling.
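Both renderings can be inspected from one machine with the pure path classes, which build paths for a given platform without touching the file system. This sketch assumes nothing beyond the standard library.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

posix = PurePosixPath("data") / "students.csv"
windows = PureWindowsPath("data") / "students.csv"
native = Path("data") / "students.csv"  # picks the flavor of the current OS

posix_str = str(posix)              # forward slash
windows_str = str(windows)          # backslash
windows_portable = windows.as_posix()  # forward slashes on request

print(posix_str, windows_str, windows_portable)
print(native.parts)  # the components are the same on every platform
```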

Path objects answer the questions students actually need: does the file exist (.exists()), is it a directory (.is_dir()), what is the file size (.stat().st_size), what is the extension (.suffix). The methods .read_text() and .write_text() open, read or write, and close in one call, removing the with boilerplate for small files.

For coursework that walks a directory of input files, Path('inputs').glob('*.txt') yields every .txt file in inputs/. Path('inputs').rglob('*.txt') recurses into subdirectories. Both are lazy iterators, so they cost nothing until you consume them.
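The difference between the two globs is easiest to see on a small tree. The directory layout and file names below are assumptions built inside a temporary directory.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    inputs = Path(tmp) / "inputs"  # hypothetical input folder
    (inputs / "week1").mkdir(parents=True)
    (inputs / "a.txt").write_text("top level", encoding="utf-8")
    (inputs / "b.csv").write_text("x,y", encoding="utf-8")
    (inputs / "week1" / "c.txt").write_text("nested", encoding="utf-8")

    # glob matches only direct children; rglob also descends into week1/
    top_level = sorted(p.name for p in inputs.glob("*.txt"))
    recursive = sorted(p.name for p in inputs.rglob("*.txt"))

print(top_level)
print(recursive)
```

Sorting the results is a deliberate choice here: the order in which glob yields paths is not guaranteed, so a grader that compares output line by line needs a stable order.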

Example

                      
from pathlib import Path  # standard library, Python 3.4+

data_dir = Path('data')  # platform-correct path object

# Check existence before reading
if not data_dir.exists():  # method, not attribute
    data_dir.mkdir()  # create the directory
    print(f'Created {data_dir}')

target = data_dir / 'notes.txt'  # / operator builds child path

# One-call write
target.write_text('Homework notes\n', encoding='utf-8')  # opens, writes, closes

# One-call read
content = target.read_text(encoding='utf-8')  # opens, reads, closes
print(f'File contents: {content.strip()}')

# Walk every .txt under data/
for txt_file in data_dir.rglob('*.txt'):  # recursive glob
    size = txt_file.stat().st_size  # bytes
    print(f'{txt_file.name}: {size} bytes')  # name strips parent path
                      
                    

Common pitfalls

Forgetting to close a file leaks descriptors and may lose buffered writes that never flush to disk.

Use with open(...) as f: for every file open. The block guarantees close() on exit, even on exceptions.

Opening text files without encoding="utf-8" breaks on Windows graders the moment non-ASCII appears.

Pass encoding="utf-8" to every open() call in text mode. Add newline="" for csv module compatibility.

Calling f.read() on a 5 GB log file can trigger MemoryError because the whole file loads into one string.

Iterate the file object with for line in f: to stream one line at a time and keep memory flat.

Splitting CSV rows with line.split(",") corrupts fields that contain quoted commas like "Smith, Jr.".

Use csv.reader or csv.DictReader from the standard library. Both handle quoting and escaping correctly.

Opening in mode "w" silently deletes the existing file content before writing.

Use mode "a" to append, mode "x" to fail if the file exists. Only use "w" when truncation is intended.

Hardcoding paths like "C:/Users/me/data.csv" makes the script unrunnable on the grader machine.

Build paths with pathlib.Path or relative paths like Path("data") / "students.csv". Never hardcode an absolute home directory.
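The write-mode pitfall above can be checked with a short sketch. The log file name and contents are assumptions, and each step records what the file holds afterwards.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "run.log"  # hypothetical output file
    log.write_text("first\n", encoding="utf-8")

    # "a" keeps existing content and writes at the end
    with open(log, "a", encoding="utf-8") as f:
        f.write("second\n")
    after_append = log.read_text(encoding="utf-8")

    # "x" refuses to touch a file that already exists
    try:
        with open(log, "x", encoding="utf-8") as f:
            f.write("never written\n")
        x_refused = False
    except FileExistsError:
        x_refused = True

    # "w" truncates first, so the earlier lines are gone
    with open(log, "w", encoding="utf-8") as f:
        f.write("third\n")
    after_truncate = log.read_text(encoding="utf-8")

print(repr(after_append), x_refused, repr(after_truncate))
```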

When to use file handling

Use the file-handling stdlib for any coursework that reads input data from disk, writes output for a grader, or persists state between runs. Reach for pandas instead when the task is tabular analysis over more than 10,000 rows or involves joins, groupby, and pivot operations.
