From Data Engineering Fundamentals

Python syntax reference

General notes I keep for stdlib data work: parsing, grouping, counting, sorting, aggregation, dates.

File IO

with open("data.csv") as f:
    text = f.read()              # whole file as one string
lines = text.splitlines()        # list of lines, no trailing \n
# Calling f.read() a second time on the same handle returns "": the position is already at the end.

with open("data.csv") as f:
    for line in f:               # iterate line by line, line keeps trailing \n
        line = line.strip()

with open("out.txt", "w") as f:  # "w" write, "a" append
    f.write("hello\n")

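The with block closes the file on exit, even if an exception is raised. f.read() returns the whole file as one string. f.readlines() returns a list of lines with \n kept. Iterating over f reads one line at a time, which is memory-friendly for big files.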
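A minimal sketch of the streaming pattern, assuming a small sample file written to a temp directory so the example is self-contained (the file contents and the names path and row_count are made up for illustration). It also shows that a second read() on the same handle comes back empty.

```python
import os
import tempfile

# Hypothetical sample file, written to a temp directory for the sketch.
path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w") as f:
    f.write("name,amount\nalice,10\nbob,5\n")

# Stream the file: only one line is held in memory at a time.
row_count = 0
with open(path) as f:
    next(f)                      # skip the header line
    for line in f:
        if line.strip():         # ignore blank lines
            row_count += 1

# Reading twice from the same handle: the second read() returns ""
# because the file position is already at the end.
with open(path) as f:
    first = f.read()
    second = f.read()
```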

Approach for a data problem

  1. Confirm the input shape (list of dicts? CSV string? list of strings?).
  2. Note the edge cases up front: empty input, ties, missing fields, malformed rows.
  3. Say the plan in one sentence before writing: parse, then group, then aggregate, then sort/select.
  4. Talk through each step while writing it.
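The steps above can be sketched end to end as parse, group, aggregate, sort/select. This is only an illustration under assumed input: the customer,amount layout and the function name top_customers are hypothetical, and skipping malformed rows is one possible policy, not the only one.

```python
from collections import defaultdict

# Hypothetical input: one "customer,amount" pair per line, with a header.
data = """customer,amount
alice,10
bob,5
alice,7
"""

def top_customers(csv_text, k=1):
    lines = csv_text.strip().splitlines()
    if len(lines) < 2:                  # edge case: empty input or header only
        return []
    header = lines[0].split(",")
    totals = defaultdict(int)
    for line in lines[1:]:              # parse
        row = line.split(",")
        if len(row) != len(header):     # edge case: malformed row, skipped here
            continue
        record = dict(zip(header, row))
        totals[record["customer"]] += int(record["amount"])  # group + aggregate
    # sort by total descending, then name ascending to break ties
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]                   # select

result = top_customers(data, k=2)
```

Here result holds alice with 17 first and bob with 5 second; an empty string returns an empty list instead of raising.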

Parse a CSV string

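lines = data.strip().splitlines()
header = lines[0].split(",")
records = []                                    # collect parsed rows here
for line in lines[1:]:
    row = line.split(",")
    record = dict(zip(header, row))             # pair each header name with its value
    record["amount"] = int(record["amount"])    # cast numeric fields
    records.append(record)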

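First line is the header. dict(zip(header, row)) builds the record. Cast numeric fields, since every value starts out as a string. Plain split(",") breaks on quoted fields that contain commas; the stdlib csv module handles those. Free text: line.split() splits on any run of whitespace. Key=value: split(";") then split("=").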
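A short sketch of the key=value case, followed by the csv module on a quoted field. The sample strings are invented for illustration, and csv.DictReader over io.StringIO is one way to handle quoted commas, not something the notes above prescribe.

```python
import csv
import io

# Hypothetical key=value line, e.g. from a log or config dump.
line = "user=alice;amount=42;region=eu"

record = {}
for pair in line.split(";"):
    key, value = pair.split("=", 1)   # maxsplit=1 keeps any "=" inside the value
    record[key] = value
record["amount"] = int(record["amount"])

# Quoted field containing a comma: plain split(",") would cut it in two.
text = 'name,note,amount\nalice,"hello, world",10\n'
naive = text.splitlines()[1].split(",")
rows = list(csv.DictReader(io.StringIO(text)))
```

naive ends up with four pieces for a three-column row, while rows[0]["note"] keeps the full quoted value.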

sorted

sorted(xs, key=lambda x: x["amount"])

reverse=True for descending. key=lambda x: (-x["amount"], x["name"]) for desc-then-asc. key=lambda w: w.lower() for case-insensitive.
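A small sketch of the tuple-key and case-insensitive sorts on made-up data (the rows and words lists are hypothetical):

```python
rows = [
    {"name": "bob", "amount": 5},
    {"name": "carol", "amount": 10},
    {"name": "alice", "amount": 10},
]

# Amount descending, then name ascending to break the tie at 10.
ranked = sorted(rows, key=lambda x: (-x["amount"], x["name"]))
names = [r["name"] for r in ranked]

words = ["banana", "apple", "Cherry"]
plain = sorted(words)                            # uppercase letters sort before lowercase
folded = sorted(words, key=lambda w: w.lower())  # case-insensitive order
```

The negation trick only works for numeric fields. sorted returns a new list and leaves the input unchanged.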

defaultdict

from collections import defaultdict
d = defaultdict(int)
d["a"] += 5

defaultdict(int) starts at 0. defaultdict(list) starts at [] (use .append). defaultdict(lambda: defaultdict(int)) for two keys.
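A sketch of the list and two-key variants on a hypothetical orders list of (region, customer, amount) tuples:

```python
from collections import defaultdict

# Hypothetical input rows: (region, customer, amount).
orders = [
    ("eu", "alice", 10),
    ("us", "bob", 5),
    ("eu", "alice", 7),
    ("eu", "carol", 3),
]

by_region = defaultdict(list)                    # region -> list of amounts
totals = defaultdict(lambda: defaultdict(int))   # region -> customer -> sum
for region, customer, amount in orders:
    by_region[region].append(amount)
    totals[region][customer] += amount

# Caveat: merely reading a missing key inserts it with the default value.
before = "asia" in by_region
_ = by_region["asia"]
after = "asia" in by_region
```

To check for a key without creating it, use the in operator or .get().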

Counter

from collections import Counter
c = Counter(["a", "b", "a"])
c.most_common(2)

Counter(list) counts occurrences. Counter(dict) wraps existing counts. .most_common(k) returns top k as [(item, count), ...].
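A brief sketch of counting, wrapping existing counts, and looking up a missing item (the sample values are made up):

```python
from collections import Counter

c = Counter(["a", "b", "a", "c", "a", "b"])
top2 = c.most_common(2)            # two most frequent items with their counts

wrapped = Counter({"a": 3, "b": 2})  # start from existing counts
wrapped.update(["b", "b"])           # update() adds to counts rather than replacing

missing = c["zzz"]                 # missing items count as 0, no KeyError
```

Unlike defaultdict, looking up a missing item in a Counter does not insert it.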

statistics

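import statistics
statistics.mean(xs)      # arithmetic mean; raises StatisticsError if xs is empty
statistics.median(xs)    # middle value; averages the two middle values for even length
statistics.stdev(xs)     # sample standard deviation (n - 1); needs at least 2 values
sum(xs)
min(xs)                  # min/max raise ValueError on empty input
max(xs)
round(x, 2)              # round to 2 decimal places
len(xs)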
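A sketch of these calls on a small made-up list, including the empty-input edge case. statistics.pstdev (population standard deviation) is added here for contrast with stdev; safe_mean is a hypothetical helper, and returning None for empty input is an assumed policy.

```python
import statistics

xs = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(xs)          # 40 / 8
median = statistics.median(xs)      # even length: average of the two middle values, 4 and 5
sample_sd = statistics.stdev(xs)    # divides by n - 1
pop_sd = statistics.pstdev(xs)      # divides by n

def safe_mean(values):
    """Return the mean, or None for empty input instead of raising."""
    return statistics.mean(values) if values else None

# stdev needs at least two data points.
try:
    statistics.stdev([1])
    stdev_failed = False
except statistics.StatisticsError:
    stdev_failed = True
```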

datetime

from datetime import datetime, timedelta

now = datetime.now()                            # current time
dt = datetime.fromisoformat("2026-05-28T09:00:00")  # from ISO string
dt = datetime(2026, 5, 28, 9, 30)               # from numbers: y, m, d, h, min
dt = datetime.strptime("05/28/2026", "%m/%d/%Y")    # from non-ISO string

now - timedelta(days=30)        # point minus duration -> point (30 days ago)
now + timedelta(hours=2)        # point plus duration -> point

gap = dt2 - dt1                 # point minus point -> duration (timedelta)
gap.days                        # whole days in the duration
gap.total_seconds()             # whole duration as seconds

dt < other_dt                   # datetimes compare directly
dt.year, dt.month, dt.day, dt.hour   # individual fields as ints
dt.weekday()                    # 0=Mon ... 6=Sun

timedelta = a duration (a length of time). datetime = a point in time. Add a timedelta to a datetime to anchor it back to a point.
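A sketch that puts the point/duration distinction to work: keep only events from the last 30 days. The events list is invented, and now is pinned to a fixed value (instead of datetime.now()) so the result is reproducible.

```python
from datetime import datetime, timedelta

# Fixed reference point for the sketch; real code would use datetime.now().
now = datetime(2026, 5, 28, 9, 0)

# Hypothetical events with ISO-format timestamps.
events = [
    {"id": 1, "ts": "2026-05-20T08:00:00"},
    {"id": 2, "ts": "2026-03-01T12:30:00"},
    {"id": 3, "ts": "2026-05-27T23:15:00"},
]

cutoff = now - timedelta(days=30)   # point minus duration -> point
recent = [e["id"] for e in events
          if datetime.fromisoformat(e["ts"]) >= cutoff]

latest = datetime.fromisoformat(events[2]["ts"])
gap = now - latest                  # point minus point -> duration (9 h 45 min)
gap_days = gap.days                 # whole days only, so less than a day gives 0
gap_seconds = gap.total_seconds()   # the full duration as a float
```

gap.days drops the sub-day remainder, so total_seconds() is the safer choice when hours and minutes matter.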