Optimizing Pandas Chunksize for 10M Row Fare Files

You have a nightly automated fare collection (AFC) export that has grown past 10 million tap rows, and a naive pd.read_csv now either OOM-kills the worker or drags the overnight settlement window past its SLA. The task is precise: stream that single monolithic CSV through validation and into a partitioned sink at a chunksize that keeps peak memory bounded without starving the CPU on Python-level iteration. This is a throughput-tuning job inside the CSV Batch Parsing Workflows stage of the Fare Data Ingestion & GTFS-RT Sync pipeline, where vendor flat files become auditable rows — so the parser must never silently drop or double-count a fare. Treat pd.read_csv not as one atomic call but as a deterministic streaming iterator whose chunk boundary dictates I/O throughput, garbage-collection frequency, and downstream reconciliation latency.

This page targets transit operations teams, revenue analysts, and Python automation builders who own that ingestion job. It assumes the export has already been mapped onto the canonical tap schema during smart card schema mapping, so the parser here concerns itself only with volume, memory, and structural integrity — not media decoding or pricing.

Memory Allocation & Throughput Mechanics

When processing high-volume tap streams, pandas loads each chunk into contiguous memory blocks. A 10M-row fare file with standard AFC columns (timestamp, card_id, route_id, fare_type, tap_status, device_id, amount_cents) typically consumes 1.2–1.8 GB when parsed with default object dtypes. Setting chunksize=100_000 consistently yields a workable balance between per-batch overhead and memory footprint on standard 16–32 GB worker nodes. Smaller chunks (10k–25k) multiply the fixed per-chunk cost of Python-level iteration, validation calls, and Parquet writes; larger chunks (500k+) risk a kernel OOM kill when concurrent reconciliation jobs share the node.

Money is carried as integer minor units (amount_cents, Int64) end to end — never float. A float32 fare column looks tempting for memory reasons, but binary floating point cannot represent most decimal fares exactly, and a fraction-of-a-cent drift multiplied across 10M taps is a settlement figure no auditor will sign. Integer cents keeps each reprocess bit-for-bit reproducible; any true decimal arithmetic downstream stays in Decimal.
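
To make the drift concrete, here is a minimal sketch using only the standard library. The $2.35 fare and the one-million-tap batch are illustrative assumptions, not figures from a real export.

```python
from decimal import Decimal

# Decimal(float) reveals the exact binary value a float actually stores.
fare_as_float = 2.35
print(Decimal(fare_as_float) == Decimal("2.35"))  # False: 2.35 has no exact binary form

# Classic symptom: float addition of decimal amounts does not round-trip.
print(0.10 + 0.20 == 0.30)  # False

# Integer minor units stay exact no matter how many taps are summed.
taps_cents = [235] * 1_000_000  # hypothetical batch of $2.35 fares
total_cents = sum(taps_cents)
print(total_cents)  # 235000000

# Convert to a decimal currency amount only at the reporting edge.
settlement = Decimal(total_cents) / Decimal(100)
print(settlement == Decimal("2350000.00"))  # True
```

The integer total is the value that gets stored and compared; the Decimal conversion happens once, at the point where a human-readable figure is needed.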

The right chunk size is hardware-dependent, so estimate the peak resident set before benchmarking. Peak memory scales roughly with the chunk row count $C$, the mean serialized row width $\bar{w}$, and a pandas allocation-overhead multiplier $f_{overhead}$ (empirically 2–3× for mixed dtypes, closer to 1.4× once categories and nullable integers are enforced):

$$M_{peak} \approx C \times \bar{w} \times f_{overhead}$$
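
Rearranged for the chunk row count, the estimate becomes $C \le M_{peak} / (\bar{w} \times f_{overhead})$. The sketch below works one example through. The helper name max_chunk_rows, the 512 MiB parser budget, and the 160-byte mean row width are all assumptions chosen for illustration; measure your own export before relying on any of them.

```python
import math


def max_chunk_rows(ceiling_bytes: int, row_width_bytes: float, overhead: float) -> int:
    """Largest row count C such that C * row_width * overhead <= ceiling."""
    if row_width_bytes <= 0 or overhead <= 0:
        raise ValueError("row width and overhead must be positive")
    return math.floor(ceiling_bytes / (row_width_bytes * overhead))


# Hypothetical inputs: 512 MiB reserved for the parser, 160-byte mean row.
budget = 512 * 1024**2

print(max_chunk_rows(budget, 160, 2.5))  # 1342177 rows with mixed dtypes
print(max_chunk_rows(budget, 160, 1.4))  # 2396745 rows once dtypes are enforced
```

The result is an upper bound from a rough model, so it is better read as the top of the range to benchmark than as the value to deploy.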

Solve for $C$ against your worker’s safe memory ceiling, then confirm empirically: iterate through candidate values ([50_000, 100_000, 200_000, 500_000]) while tracking RSS and wall-clock time. Pair the winning value with strict upfront dtype mapping ({'card_id': 'category', 'route_id': 'Int32', 'amount_cents': 'Int64'}) to cut per-chunk allocation by 30–40% and pull $f_{overhead}$ down toward its floor. For allocation hotspots, Python’s native tracemalloc isolates exactly which stage retains memory before you scale to production.
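
The empirical pass can be sketched as a small harness. This is an illustration under stated assumptions: benchmark_chunksizes is a hypothetical helper, the synthetic 20,000-row file and the scaled-down candidate sizes exist only so the sketch runs quickly, and tracemalloc's traced peak is used as a stand-in for RSS (it counts allocations Python can see, not the full resident set).

```python
import tempfile
import time
import tracemalloc
from pathlib import Path

import pandas as pd

# Subset of the dtype mapping quoted above.
DTYPES = {"card_id": "category", "route_id": "Int32", "amount_cents": "Int64"}


def benchmark_chunksizes(csv_path: Path, candidates: list[int]) -> list[dict]:
    """Run one full streaming pass per candidate; record time and traced peak."""
    results = []
    for size in candidates:
        tracemalloc.start()
        started = time.perf_counter()
        rows = 0
        for chunk in pd.read_csv(csv_path, chunksize=size, dtype=DTYPES):
            rows += len(chunk)  # stand-in for the real validation work
        elapsed = time.perf_counter() - started
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results.append(
            {"chunksize": size, "rows": rows, "seconds": elapsed,
             "traced_peak_bytes": peak_bytes}
        )
    return results


# Tiny synthetic file (assumption) so the harness is runnable anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample_taps.csv"
    pd.DataFrame(
        {
            "card_id": [f"CARD_{i % 500}" for i in range(20_000)],
            "route_id": [i % 40 for i in range(20_000)],
            "amount_cents": [275] * 20_000,
        }
    ).to_csv(sample, index=False)
    results = benchmark_chunksizes(sample, [1_000, 5_000, 10_000])

for row in results:
    print(row)
```

Every candidate must report the same row count; if one does not, the difference is a parsing problem, not a tuning result.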

Chunked Ingestion Flow

Each chunk streams from the 10M-row CSV through validation to Parquet partitions, with per-chunk failures isolated so the batch never aborts.

Step-by-Step Implementation

The script below is a type-hinted, audit-ready ingestion routine. It enforces schema validation, normalizes transit timestamps to UTC, isolates chunk-level failures so one bad chunk never aborts the batch, and emits structured reconciliation metrics. Each chunk is hashed for the audit trail and sunk to a partitioned Parquet dataset.

import logging
import time
import hashlib
from pathlib import Path
from typing import Iterator, Dict, Any, Optional
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# ---------------------------------------------------------------------------
# Audit & logging configuration
# ---------------------------------------------------------------------------
AUDIT_LOGGER = logging.getLogger("afc_reconciliation")
AUDIT_LOGGER.setLevel(logging.INFO)
_handler = logging.StreamHandler()
_handler.setFormatter(logging.Formatter("%(asctime)s | %(levelname)s | %(message)s"))
AUDIT_LOGGER.addHandler(_handler)


class AFCIngestionError(Exception):
    """Raised when chunk validation or I/O operations fail."""


# ---------------------------------------------------------------------------
# Schema contract — money is Int64 minor units, never float
# ---------------------------------------------------------------------------
REQUIRED_COLUMNS = {
    "timestamp", "card_id", "route_id", "fare_type",
    "tap_status", "device_id", "amount_cents",
}
DTYPE_MAP: Dict[str, str] = {
    "card_id": "category",
    "route_id": "Int32",
    "fare_type": "category",
    "tap_status": "category",
    "device_id": "category",
    "amount_cents": "Int64",
}
VALID_STATUSES = {"TAP_IN", "TAP_OUT", "TRANSFER", "CASH"}


def _compute_chunk_hash(df: pd.DataFrame) -> str:
    """Generate a deterministic MD5 hash of a chunk for audit-trail verification."""
    return hashlib.md5(pd.util.hash_pandas_object(df).values.tobytes()).hexdigest()


def _validate_and_normalize(chunk: pd.DataFrame, chunk_idx: int) -> pd.DataFrame:
    """Enforce schema compliance and transit-specific business rules on one chunk."""
    missing = REQUIRED_COLUMNS - set(chunk.columns)
    if missing:
        raise AFCIngestionError(f"Chunk {chunk_idx} missing columns: {missing}")

    # Normalize timestamps to UTC and drop malformed rows deterministically.
    chunk["timestamp"] = pd.to_datetime(chunk["timestamp"], utc=True, errors="coerce")
    invalid_ts = int(chunk["timestamp"].isna().sum())
    if invalid_ts:
        AUDIT_LOGGER.warning("Chunk %d: dropped %d rows with invalid timestamps", chunk_idx, invalid_ts)
        chunk = chunk.dropna(subset=["timestamp"])

    # Retain only revenue-bearing taps; drop TEST / MAINTENANCE / unknown statuses.
    chunk = chunk[chunk["tap_status"].isin(VALID_STATUSES)]

    # Idempotency guard: network retries can replay identical taps.
    chunk = chunk.drop_duplicates(subset=["card_id", "timestamp", "device_id"], keep="last")
    return chunk


def ingest_fare_csv(
    file_path: Path,
    chunksize: int = 100_000,
    output_dir: Optional[Path] = None,
) -> Dict[str, Any]:
    """Stream a 10M+ row AFC CSV into validated Parquet partitions.

    Returns an audit dictionary with per-run reconciliation metrics.
    """
    if not file_path.exists():
        raise FileNotFoundError(f"AFC log not found: {file_path}")

    audit: Dict[str, Any] = {
        "file": str(file_path),
        "chunksize": chunksize,
        "chunks_processed": 0,
        "total_rows_in": 0,
        "total_rows_out": 0,
        "validation_errors": 0,
        "start_time": time.time(),
        "checksums": [],
        "status": "SUCCESS",
    }

    try:
        # Iterator keeps only one chunk resident at a time.
        reader: Iterator[pd.DataFrame] = pd.read_csv(
            file_path,
            chunksize=chunksize,
            dtype=DTYPE_MAP,
            on_bad_lines="warn",
        )

        for idx, chunk in enumerate(reader, start=1):
            audit["chunks_processed"] += 1
            audit["total_rows_in"] += len(chunk)
            try:
                clean_chunk = _validate_and_normalize(chunk, idx)
                audit["total_rows_out"] += len(clean_chunk)
                audit["checksums"].append(_compute_chunk_hash(clean_chunk))
                if output_dir is not None and not clean_chunk.empty:
                    table = pa.Table.from_pandas(clean_chunk, preserve_index=False)
                    pq.write_to_dataset(table, root_path=str(output_dir / "afc_normalized"))
            except AFCIngestionError as exc:
                audit["validation_errors"] += 1
                AUDIT_LOGGER.error("Chunk %d rejected: %s", idx, exc)
                # Continue the batch; a single bad chunk must not abort settlement.

    except pd.errors.ParserError as exc:
        audit["status"] = "PARSER_FAILURE"
        raise AFCIngestionError(f"CSV parsing aborted: {exc}") from exc
    except MemoryError as exc:
        audit["status"] = "OOM_FAILURE"
        raise AFCIngestionError("Worker exhausted RAM. Reduce chunksize or add swap.") from exc
    finally:
        audit["duration_sec"] = round(time.time() - audit["start_time"], 2)
        AUDIT_LOGGER.info(
            "Audit complete: %d rows reconciled in %ss",
            audit["total_rows_out"], audit["duration_sec"],
        )

    return audit


if __name__ == "__main__":
    output = Path("parquet_sink")
    output.mkdir(exist_ok=True)
    try:
        report = ingest_fare_csv(Path("daily_taps_2024_10_24.csv"), chunksize=100_000, output_dir=output)
        print(f"Reconciliation report: {report}")
    except AFCIngestionError as err:
        AUDIT_LOGGER.critical("Pipeline halted: %s", err)

Two choices in that loop are load-bearing. First, only AFCIngestionError is caught inside the per-chunk block, so a KeyboardInterrupt or a genuine coding bug still propagates rather than being silently counted as a “validation error.” Second, the identity total_rows_in == total_rows_out + dropped + rejected must always balance across the batch. The audit dictionary records only the two totals, so dropped and rejected rows appear as their difference; it is the record a revenue analyst reconciles against the vendor manifest.
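
The balance identity can be checked mechanically at the end of a run. A minimal sketch, assuming the run also tracks two hypothetical counters, rows_dropped and rows_rejected, which the audit dictionary in the script does not record; the figures in the call are invented for illustration.

```python
def check_balance(rows_in: int, rows_out: int, rows_dropped: int, rows_rejected: int) -> None:
    """Raise if any input row is unaccounted for after the batch."""
    accounted = rows_out + rows_dropped + rows_rejected
    if rows_in != accounted:
        raise ValueError(
            f"Row count mismatch: {rows_in} in vs {accounted} accounted for"
        )


# Hypothetical run: three 100k chunks, one rejected outright,
# 150 rows dropped by validation in the other two.
check_balance(rows_in=300_000, rows_out=199_850, rows_dropped=150, rows_rejected=100_000)
print("balanced")
```

Failing loudly here is deliberate: a mismatch means a row left the pipeline without a recorded reason.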

Validation & Test Cases

Chunk tuning is only safe if _validate_and_normalize behaves identically regardless of where a chunk boundary falls. Exercise it directly with small in-memory fixtures covering a normal chunk, a dirty chunk, and a schema violation.

import io
import pandas as pd

# --- Case 1: well-formed chunk — every row is a revenue tap ----------------
raw = io.StringIO(
    "timestamp,card_id,route_id,fare_type,tap_status,device_id,amount_cents\n"
    "2024-10-24T07:15:02Z,CARD_A,12,adult,TAP_IN,VAL_009,275\n"
    "2024-10-24T07:15:59Z,CARD_B,12,senior,TAP_OUT,VAL_009,0\n"
)
clean = _validate_and_normalize(pd.read_csv(raw, dtype=DTYPE_MAP), chunk_idx=1)
print(len(clean), str(clean["timestamp"].dt.tz))
# -> 2 UTC

# --- Case 2: dirty chunk — one bad timestamp, one non-revenue tap ----------
raw_dirty = io.StringIO(
    "timestamp,card_id,route_id,fare_type,tap_status,device_id,amount_cents\n"
    "not-a-time,CARD_C,12,adult,TAP_IN,VAL_010,275\n"          # dropped: bad timestamp
    "2024-10-24T08:00:00Z,CARD_D,12,adult,MAINTENANCE,VAL_010,0\n"  # dropped: non-revenue
    "2024-10-24T08:01:00Z,CARD_E,12,adult,TAP_IN,VAL_010,275\n"     # kept
)
clean = _validate_and_normalize(pd.read_csv(raw_dirty, dtype=DTYPE_MAP), chunk_idx=2)
print(len(clean), clean["card_id"].tolist())
# -> 1 ['CARD_E']

# --- Case 3: schema violation — a required column is absent ----------------
raw_bad = io.StringIO("timestamp,card_id\n2024-10-24T08:01:00Z,CARD_F\n")
try:
    _validate_and_normalize(pd.read_csv(raw_bad), chunk_idx=3)
except AFCIngestionError as exc:
    print(exc)
# -> Chunk 3 missing columns: {'route_id', 'fare_type', 'tap_status', 'device_id', 'amount_cents'}  (set order may vary between runs)

Case 1 confirms clean rows survive untouched and timestamps land in UTC. Case 2 confirms the two independent drop paths (coerced NaT and non-revenue status) both fire within a single chunk. Case 3 confirms a truncated header — the classic symptom of a partial SFTP transfer — raises rather than silently producing a short batch. Wire these three into your test suite and they double as a regression guard when you later change chunksize: the row counts must not move.
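
The "row counts must not move" property can itself be tested by replaying one fixture at several chunk sizes. The sketch below uses kept_rows, a simplified stand-in for _validate_and_normalize that applies only the timestamp and status rules, so that it stays self-contained. The fixture deliberately contains no duplicate taps: the in-chunk drop_duplicates guard cannot see a replay whose two copies fall in different chunks, so duplicates would make the count depend on the boundary.

```python
import io

import pandas as pd

VALID = {"TAP_IN", "TAP_OUT", "TRANSFER", "CASH"}

CSV = (
    "timestamp,card_id,tap_status\n"
    "2024-10-24T07:00:00Z,CARD_A,TAP_IN\n"
    "not-a-time,CARD_B,TAP_IN\n"                 # dropped: bad timestamp
    "2024-10-24T07:05:00Z,CARD_C,MAINTENANCE\n"  # dropped: non-revenue
    "2024-10-24T07:10:00Z,CARD_D,TAP_OUT\n"
    "2024-10-24T07:15:00Z,CARD_E,CASH\n"
)


def kept_rows(chunk: pd.DataFrame) -> int:
    """Stand-in validator: count rows with a parseable timestamp and a revenue status."""
    ts = pd.to_datetime(chunk["timestamp"], utc=True, errors="coerce")
    return int((ts.notna() & chunk["tap_status"].isin(VALID)).sum())


def total_kept(chunksize: int) -> int:
    """Stream the fixture at the given chunk size and total the surviving rows."""
    reader = pd.read_csv(io.StringIO(CSV), chunksize=chunksize)
    return sum(kept_rows(chunk) for chunk in reader)


totals = {size: total_kept(size) for size in (1, 2, 3, 5)}
print(totals)
# -> {1: 3, 2: 3, 3: 3, 5: 3}
```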

Transit-Specific Debugging & Troubleshooting

Chunked parsing introduces edge cases unique to fare reconciliation. Use the following diagnostic matrix when pipeline anomalies occur:

Symptom	Root cause	Resolution
Journey fragmentation	Entry/exit taps split across chunk boundaries	Carry a stateful buffer holding the last events per `card_id` into the next chunk before applying any Transfer Window Logic so a paired tap-out is never orphaned.
Validator clock skew	`timestamp` drifts >30s across devices	Normalize with `pd.to_datetime(..., utc=True)`, then apply a rolling median offset per `device_id` before aggregation.
Sudden OOM spikes	`object` dtype fallback on malformed strings	Enforce `dtype=DTYPE_MAP` at parse time and set `on_bad_lines="warn"` (or `"skip"`) so corrupted rows cannot force a type upcast.
Duplicate revenue inflation	Network retries replaying identical tap logs	Deduplicate on `["card_id", "timestamp", "device_id"]` inside `_validate_and_normalize`, as shown above.
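
The in-chunk deduplication in the last row cannot see a replay whose two copies land in different chunks. One possible extension, sketched here as an assumption and not part of the script above, is to carry the set of keys already seen across the chunk boundary. The helper name dedupe_across_chunks is hypothetical, and the seen set is unbounded in this sketch; a production version would need to evict old keys to keep memory flat.

```python
import pandas as pd

KEY = ["card_id", "timestamp", "device_id"]


def dedupe_across_chunks(chunks):
    """Yield each chunk with in-chunk and previously seen duplicate taps removed."""
    seen = set()  # keys from all earlier chunks (unbounded in this sketch)
    for chunk in chunks:
        chunk = chunk.drop_duplicates(subset=KEY, keep="last")
        keys = list(zip(chunk["card_id"], chunk["timestamp"], chunk["device_id"]))
        keep = pd.Series([k not in seen for k in keys], index=chunk.index, dtype=bool)
        seen.update(keys)
        yield chunk[keep]


def make_chunk(cards, times):
    """Build a small fixture chunk; all taps come from one validator."""
    return pd.DataFrame(
        {
            "card_id": cards,
            "timestamp": pd.to_datetime(times, utc=True),
            "device_id": ["VAL_009"] * len(cards),
            "amount_cents": [275] * len(cards),
        }
    )


chunk1 = make_chunk(["CARD_A", "CARD_B"], ["2024-10-24T07:00:00Z", "2024-10-24T07:01:00Z"])
# CARD_B's tap is replayed at the start of the next chunk.
chunk2 = make_chunk(["CARD_B", "CARD_C"], ["2024-10-24T07:01:00Z", "2024-10-24T07:02:00Z"])

deduped = pd.concat(dedupe_across_chunks([chunk1, chunk2]), ignore_index=True)
print(deduped["card_id"].tolist())
# -> ['CARD_A', 'CARD_B', 'CARD_C']
```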

Quick diagnostic commands

# Profile allocation during ingestion (-X tracemalloc enables the tracer).
python -X tracemalloc -c "from pathlib import Path; from script import ingest_fare_csv; ingest_fare_csv(Path('test.csv'))"

# Verify Parquet schema consistency after ingestion.
duckdb -c "DESCRIBE SELECT * FROM parquet_scan('parquet_sink/afc_normalized/*.parquet');"
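
Enabling the tracer is only half of a profile; the allocations still have to be read back. A minimal sketch of that second half using the standard-library tracemalloc API, with a throwaway list standing in for the ingestion call (in practice the call to ingest_fare_csv would sit where the stand-in workload is):

```python
import tracemalloc

tracemalloc.start()

# Stand-in workload (assumption); replace with the ingestion call being profiled.
payload = [str(i) * 10 for i in range(50_000)]

# The snapshot must be taken while tracing is still active.
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

# Group traced allocations by source line and show the five largest.
top = snapshot.statistics("lineno")[:5]
for stat in top:
    print(stat)
```

Each printed line names a file and line number with the bytes it still holds, which is usually enough to tell whether a stage is retaining chunks it should have released.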

Integration Note

This tuning task feeds the broader CSV Batch Parsing Workflows stage: the checksummed Parquet partitions it emits become the input for schema enforcement in Implementing Pydantic Models for AFC Event Streams, where each normalized row is promoted to a validated event model. Align chunk boundaries to transactional windows (hourly or shift-based) so multi-tap journey reconstruction stays intact, and wrap the synchronous pandas loop in an asyncio-compatible executor when it must run alongside concurrent GTFS-RT Realtime Sync updates without blocking the event loop.
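
One way to keep the event loop responsive is asyncio.to_thread (Python 3.9+), which runs a blocking callable in a worker thread. The sketch below is illustrative only: blocking_ingest and realtime_heartbeat are hypothetical stand-ins for the pandas loop and the realtime sync work, and the file name is a placeholder.

```python
import asyncio
import time


def blocking_ingest(path: str) -> dict:
    """Stand-in for the synchronous pandas loop (assumption)."""
    time.sleep(0.2)  # simulates blocking parse and write work
    return {"file": path, "status": "SUCCESS"}


async def realtime_heartbeat(ticks: int) -> int:
    """Stand-in for concurrent realtime work that must keep running."""
    done = 0
    for _ in range(ticks):
        await asyncio.sleep(0.05)
        done += 1
    return done


async def main() -> tuple:
    # The ingest runs in a worker thread, so the heartbeat keeps ticking.
    return await asyncio.gather(
        asyncio.to_thread(blocking_ingest, "daily_taps.csv"),
        realtime_heartbeat(3),
    )


report, beats = asyncio.run(main())
print(report["status"], beats)
# -> SUCCESS 3
```

A thread is enough to unblock the loop; if the validation stage turns out to be CPU-bound in pure Python, a process pool passed to loop.run_in_executor is the usual alternative.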

Frequently Asked Questions

Is there a single "best" chunksize for a 10M-row fare file?

No — the optimum is a function of worker RAM, mean row width, and how many reconciliation jobs share the box. 100_000 is a sound default on a 16–32 GB node, but treat it as a starting point: estimate the peak resident set from the sizing formula above, then benchmark [50_000, 100_000, 200_000, 500_000] while tracking RSS and wall time. Once categories and Int64/Int32 dtypes are enforced, larger chunks become affordable because the per-row overhead multiplier drops.

Why not just raise chunksize until the file fits in one read?

Because the whole point of streaming is bounded, predictable memory. One giant read makes peak RSS scale with file size, so a vendor sending a 14M-row day instead of 10M silently pushes you into an OOM kill mid-settlement. A fixed chunksize decouples memory from file size: the batch takes longer on a bigger file but never exceeds its ceiling, which is exactly the guarantee an overnight SLA needs.

Can I store the fare amount as float32 to save memory per chunk?

No. Binary floats cannot represent most decimal fares exactly, and rounding drift compounds across millions of rows into a settlement total no auditor will accept. Carry money as Int64 minor units (amount_cents); it is money-safe, reproducible across reprocessing, and the memory difference versus float32 is negligible next to the categorical string columns.

How do I stop a paired tap-out being split across two chunks?

Keep a small stateful buffer keyed by card_id that carries a chunk’s trailing open taps forward, and defer any journey pairing until the matching event arrives — never pair within a single chunk in isolation. This is the same boundary problem that Transfer Window Logic solves at the rule layer; the parser’s only job is to make sure no event is dropped as a chunk boundary passes over it.
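
The carry-forward idea can be sketched with a dictionary of open tap-ins that outlives each chunk. This is an illustration under assumptions, not the pipeline's pairing logic: pair_with_carry is a hypothetical helper, it pairs each TAP_OUT with the most recent TAP_IN for the same card, and it iterates row by row for readability, which would be too slow at 10M rows without vectorizing.

```python
import pandas as pd


def pair_with_carry(chunks):
    """Pair TAP_IN/TAP_OUT per card while open tap-ins survive chunk boundaries."""
    open_taps = {}  # card_id -> timestamp of the unmatched TAP_IN
    journeys = []   # (card_id, tap_in_ts, tap_out_ts)
    for chunk in chunks:
        for row in chunk.sort_values("timestamp").itertuples(index=False):
            if row.tap_status == "TAP_IN":
                open_taps[row.card_id] = row.timestamp
            elif row.tap_status == "TAP_OUT" and row.card_id in open_taps:
                journeys.append((row.card_id, open_taps.pop(row.card_id), row.timestamp))
    return journeys, open_taps


def make_chunk(rows):
    """Build a fixture chunk from (timestamp, card_id, tap_status) tuples."""
    frame = pd.DataFrame(rows, columns=["timestamp", "card_id", "tap_status"])
    frame["timestamp"] = pd.to_datetime(frame["timestamp"], utc=True)
    return frame


chunk1 = make_chunk([
    ("2024-10-24T07:00:00Z", "CARD_A", "TAP_IN"),   # tap-out arrives in chunk 2
    ("2024-10-24T07:01:00Z", "CARD_B", "TAP_IN"),
    ("2024-10-24T07:20:00Z", "CARD_B", "TAP_OUT"),
])
chunk2 = make_chunk([
    ("2024-10-24T07:30:00Z", "CARD_A", "TAP_OUT"),
    ("2024-10-24T07:31:00Z", "CARD_C", "TAP_IN"),   # still open at end of batch
])

journeys, still_open = pair_with_carry([chunk1, chunk2])
print([card for card, _, _ in journeys], list(still_open))
# -> ['CARD_B', 'CARD_A'] ['CARD_C']
```

CARD_A's journey is recovered even though its two taps sit in different chunks, and CARD_C's open tap is returned rather than dropped, so the caller can decide how to treat it.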
