Fallback Routing Strategies

Fallback routing strategies keep an automated fare collection network validating and reconciling taps when the central clearinghouse is unreachable. They are the degraded-mode discipline inside the broader Core Architecture & Fare Taxonomy: the rules that decide how a validator caches a tap locally, how a gateway retries a batch without overwhelming a recovering API, and how offline events are later matched back to authoritative revenue records without double-charging a rider or leaking fare.

This is not a network-level failover switch. It is a data-engineering problem that spans edge persistence, deterministic retry, and reconciliation. Transit ops and revenue analysts own the outcome — minimizing unaccounted fare leakage and holding reconciliation SLAs. Mobility-tech developers and Python automation builders own the mechanism — idempotent ingestion pipelines that degrade gracefully without corrupting the downstream ledger. Because a validator running offline can neither authenticate against the AFC System Security Boundaries nor resolve a live fare, every design choice below trades a little latency or precision for audit-grade recoverability once connectivity returns.

How a Tap Flows Through the Fallback Path

When a validator cannot reach the clearinghouse, the tap must not be dropped and must not be trusted blindly. It is schema-checked at the edge, written to a durable local store keyed by an idempotency token, buffered, and pushed as a batch once the link recovers. The flow below shows how a tap branches between the live primary path and the local-cache fallback path:

Figure: the live path and the cached fallback path converge at one reconciliation queue, the single place where ordering and duplication are resolved.

Two invariants make this safe. First, the local write is idempotent: the same physical tap replayed during a flapping connection collapses to one row. Second, the reconciliation queue is the single merge point for both the live and recovered paths, so ordering and duplication are resolved in exactly one place rather than smeared across the pipeline.
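As a rough illustration of the second invariant, the sketch below merges two timestamp-sorted streams and emits each idempotency key once. The `Tap` tuple shape and the `merge_paths` name are assumptions made for this example, not part of the pipeline described here, and the in-memory `seen` set would need a bounded window on a real reconciliation tier.

```python
import heapq
from typing import Iterable, Iterator

# Assumed shape for this sketch: (event timestamp, idempotency key).
Tap = tuple[float, str]


def merge_paths(live: Iterable[Tap], recovered: Iterable[Tap]) -> Iterator[Tap]:
    """Merge two timestamp-sorted tap streams, emitting each key once."""
    seen: set[str] = set()
    # heapq.merge is lazy and requires both inputs to be sorted already.
    for ts, key in heapq.merge(live, recovered):
        if key in seen:
            continue  # the same physical tap arrived on both paths
        seen.add(key)
        yield ts, key
```

Because ordering and deduplication both happen inside this one function, neither upstream path has to know about the other.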

Prerequisites & Environment

Fallback routing runs on constrained edge hardware and a central reconciliation tier, so the dependency surface is deliberately small and boring.

Component	Assumption
Python	3.11+ (uses `datetime.UTC`, `Decimal`, typed generics)
Local store	`sqlite3` from the stdlib in WAL mode — no server process at the edge
Validation	`pydantic` v2 for strict AFC event models
Retry	`tenacity` >= 8 for backoff + jitter decorators
Transport	`requests` (or `httpx`) with per-attempt timeouts
Spatial	`numpy` for zone-centroid distance scoring
Edge hardware	ARM SoC, 256 MB–1 GB RAM, intermittent cellular/Wi-Fi backhaul

Data-schema expectations: every tap payload carries at minimum media_id, device_id, timestamp (validator-local epoch seconds), a media/product code, and — where available — GPS coordinates. Fields normalize against the same contracts used in Smart Card Schema Mapping, so media type, balance snapshots, and cryptographic signatures survive the round-trip through the cache untouched. Monetary values are handled exclusively with Decimal — never float — because fare caps and transfer discounts must reconcile to the cent against clearinghouse ledgers.
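The Decimal rule is easiest to see with a two-line comparison. The snippet below is a minimal sketch: the 50% discount and the half-up rounding mode are hypothetical choices for illustration, since the rounding policy is not stated here.

```python
from decimal import ROUND_HALF_UP, Decimal

CENT = Decimal("0.01")


def half_fare(fare: Decimal) -> Decimal:
    """Hypothetical 50% transfer discount, rounded half-up to the cent."""
    return (fare * Decimal("0.5")).quantize(CENT, rounding=ROUND_HALF_UP)


# Binary floats cannot represent most decimal fractions exactly.
float_sum_is_exact = (0.10 + 0.20) == 0.30
decimal_sum_is_exact = (Decimal("0.10") + Decimal("0.20")) == Decimal("0.30")
```

Here `float_sum_is_exact` is False and `decimal_sum_is_exact` is True, which is the difference between a ledger that balances and one that drifts by fractions of a cent.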

Core Implementation: Memory-Bounded Edge Caching

Memory efficiency at the edge is non-negotiable. Loading an entire transaction batch into RAM on a validator triggers OOM kills and corrupts local state mid-write. The pipeline instead streams payloads, validates each one cheaply, and persists in bounded chunks using write-ahead logging so an unclean power loss never leaves a torn database.

import json
import logging
import sqlite3
from contextlib import contextmanager
from pathlib import Path
from typing import Any, Iterator

logger = logging.getLogger("afc.fallback.edge_cache")


@contextmanager
def local_edge_cache(
    db_path: Path = Path("/var/lib/transit/fallback_cache.db"),
) -> Iterator[sqlite3.Connection]:
    """Memory-efficient SQLite edge cache in WAL mode.

    WAL survives unclean power loss without a torn write; NORMAL sync
    keeps validator I/O latency bounded on slow flash storage.
    """
    conn = sqlite3.connect(str(db_path), isolation_level=None)
    try:
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute("PRAGMA synchronous=NORMAL")
        conn.execute(
            """
            CREATE TABLE IF NOT EXISTS tap_events (
                idempotency_key TEXT PRIMARY KEY,
                raw_payload     TEXT NOT NULL,
                status          TEXT NOT NULL DEFAULT 'PENDING',
                created_at      REAL NOT NULL
            )
            """
        )
        yield conn
    finally:
        conn.close()


def stream_validate_and_cache(
    payload_stream: Iterator[bytes],
    schema: dict[str, Any],
    batch_size: int = 500,
) -> int:
    """Validate and cache tap events in bounded chunks.

    Returns the number of rows accepted. Malformed payloads are routed
    to the dead-letter queue and never block the pipeline.
    """
    accepted = 0
    with local_edge_cache() as conn:
        cursor = conn.cursor()
        batch: list[tuple[str, str, str, float]] = []

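        def flush() -> None:
            nonlocal accepted
            if not batch:
                return
            # One explicit transaction per chunk: the connection is in
            # autocommit mode, so without BEGIN each row would commit alone.
            cursor.execute("BEGIN")
            try:
                # INSERT OR IGNORE gives us idempotency: a replayed tap during
                # a flapping link collapses to the existing primary-key row.
                cursor.executemany(
                    "INSERT OR IGNORE INTO tap_events VALUES (?, ?, ?, ?)", batch
                )
                # Read rowcount before COMMIT, which resets it.
                inserted = max(cursor.rowcount, 0)
                cursor.execute("COMMIT")
            except sqlite3.Error:
                cursor.execute("ROLLBACK")
                raise
            accepted += inserted
            batch.clear()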

        for raw_bytes in payload_stream:
            try:
                event = json.loads(raw_bytes)
                if not _validate_schema(event, schema):
                    raise ValueError("schema mismatch")
                idem_key = (
                    f"{event['media_id']}:{event['timestamp']}:{event['device_id']}"
                )
                batch.append(
                    (idem_key, raw_bytes.decode("utf-8"), "PENDING", event["timestamp"])
                )
                if len(batch) >= batch_size:
                    flush()
            except (json.JSONDecodeError, ValueError, KeyError) as exc:
                logger.warning("edge cache rejected payload: %s", exc)
                _route_to_dlq(raw_bytes, str(exc))

        flush()
    logger.info("edge cache accepted %d tap events", accepted)
    return accepted

The generator contract (Iterator[bytes]) is the load-bearing detail: the caller can wrap a socket reader, a file handle, or a Kafka consumer as an iterator of payloads, and the function never materializes the full stream. INSERT OR IGNORE against a PRIMARY KEY is a cheap idempotency guard: SQLite resolves the duplicate inside the insert itself, so the application never performs a separate read-modify-write.

Schema Validation & Transit-Specific Edge Cases

Offline validation multiplies the ways a payload can be subtly wrong, and each one has a deterministic handling rule rather than a silent drop.

Null tap-out. A rider taps in offline and the validator loses power before tap-out. The event is retained with tap_out = None and later resolved by max-fare or grace-period logic — it is never discarded, because a missing tap-out is a revenue signal, not noise.
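Encoding fallback. Legacy validators emit Latin-1 or Shift-JIS product labels. Decode with an explicit fallback chain (utf-8 first, then latin-1, which maps every byte and therefore cannot raise) and flag the record, rather than letting a UnicodeDecodeError kill the batch. A flagged Shift-JIS label stays garbled until it is re-decoded with the right codec, but the tap itself is preserved.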
Idempotency across clock jumps. The idempotency key must not depend on wall-clock arrival time. Keying on media_id:timestamp:device_id means a tap replayed after an NTP correction still deduplicates.
Timezone normalization. Validator clocks drift and may report local time. Normalize every timestamp to UTC at ingestion; a tap-in/tap-out pair computed across mismatched offsets produces a phantom multi-hour dwell that later inflates fare-leakage estimates.
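Two of the rules above, the encoding fallback and the UTC normalization, can be sketched in a few lines. The function names and the idea of passing the validator's UTC offset in minutes are assumptions for this example; how the offset is actually discovered per device is not covered here.

```python
from datetime import datetime, timedelta, timezone


def decode_label(raw: bytes) -> tuple[str, bool]:
    """Decode a product label; the bool flags records that needed the fallback."""
    try:
        return raw.decode("utf-8"), False
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this step cannot raise.
        return raw.decode("latin-1"), True


def to_utc_epoch(local_naive: datetime, utc_offset_minutes: int) -> float:
    """Convert a validator-local naive datetime to UTC epoch seconds."""
    tz = timezone(timedelta(minutes=utc_offset_minutes))
    # Attaching the offset makes the datetime aware; timestamp() is then
    # independent of the machine's own timezone setting.
    return local_naive.replace(tzinfo=tz).timestamp()
```

For example, 09:00 local on a validator running at UTC+9 and 00:00 on one already in UTC map to the same epoch value, so a tap-in/tap-out pair across the two devices yields a zero-length dwell rather than a nine-hour one.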

Because these events later feed fare calculation, the same strict-mode contracts covered in Schema Validation Pipelines apply here — reject at the edge, not during batch reconciliation, so a malformed record is quarantined while it is still cheap to inspect.

Deterministic Retry & Dead-Letter Queues

Network partitions during a fallback state are rarely binary; they manifest as intermittent packet loss, partial ACKs, and stale DNS. Blind retries amplify load on a recovering clearinghouse and turn a brownout into an outage. Use bounded exponential backoff with jitter, and route anything malformed straight to the dead-letter queue instead of retrying it forever. The tenacity library provides production-grade decorators that handle transient failures without hand-rolled state.

Figure: transient failures loop through bounded backoff; malformed payloads and exhausted retries land in the dead-letter queue, never an infinite retry.

import logging

import requests
import tenacity

logger = logging.getLogger("afc.fallback.push")


@tenacity.retry(
    wait=tenacity.wait_random_exponential(min=1, max=30),
    stop=tenacity.stop_after_attempt(5),
    retry=tenacity.retry_if_exception_type(
        (requests.ConnectionError, requests.Timeout)
    ),
    before_sleep=tenacity.before_sleep_log(logger, logging.WARNING),
    reraise=True,
)
def push_to_clearinghouse(
    batch: list[dict[str, object]], endpoint: str
) -> requests.Response:
    """Idempotent batch push with exponential backoff and jitter.

    The Idempotency-Key header lets the clearinghouse dedupe a batch that
    was received but whose ACK we never saw before a timeout.
    """
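    if not batch:
        raise ValueError("refusing to push an empty batch")
    headers = {
        "Content-Type": "application/json",
        # Assumes a retried batch has the same events in the same order, so
        # the first event's key identifies the whole batch.
        "Idempotency-Key": str(batch[0]["idem_key"]),
    }
    response = requests.post(endpoint, json=batch, headers=headers, timeout=10)
    # HTTP error statuses raise HTTPError, which the retry policy above does
    # not retry; the caller decides whether to dead-letter or re-queue.
    response.raise_for_status()
    return response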

The Idempotency-Key header closes the most dangerous fallback failure mode: a batch that lands successfully but whose ACK is lost to a timeout. On retry, the server recognizes the key and returns the prior result instead of double-booking the fares.
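The server half of that contract is not shown in this article. A toy stand-in, assuming the clearinghouse keeps a key-to-result map, might look like the sketch below; `IdempotentReceiver` and its `book` callback are hypothetical names, and a real service would persist the map and expire old keys.

```python
from typing import Callable


class IdempotentReceiver:
    """Toy model of the clearinghouse side of the Idempotency-Key contract."""

    def __init__(self, book: Callable[[list[dict]], dict]) -> None:
        self._book = book
        self._results: dict[str, dict] = {}

    def handle(self, idempotency_key: str, batch: list[dict]) -> dict:
        if idempotency_key in self._results:
            # Replay of a batch we already booked: return the stored result
            # and do not touch the ledger again.
            return self._results[idempotency_key]
        result = self._book(batch)
        self._results[idempotency_key] = result
        return result
```

Calling `handle` twice with the same key invokes `book` only once, which is the behavior the retrying client relies on when an ACK is lost.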

Integration Pattern: Zone Reconciliation & GTFS-RT Backfill

Recovered taps still need a fare, and offline events cannot rely on live AVL. The system applies heuristic zone-crossing logic against a well-defined Fare Zone Taxonomy Design that specifies boundary geometry, transfer windows, and grace periods for degraded validation. Where GPS is present, coordinates map to a zone with a spatial confidence score; where it is absent, the event carries forward as low-confidence and is resolved by the fare engine’s degraded path.

from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class ZoneBoundary:
    zone_id: str
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float


@dataclass(frozen=True)
class TripSegment:
    tap_in: float
    tap_out: float | None
    device_id: str
    media_id: str
    heuristic_zone: str | None = None
    confidence: float = 0.0


def resolve_heuristic_zone(
    tap_lat: float, tap_lon: float, boundaries: list[ZoneBoundary]
) -> tuple[str, float]:
    """Map GPS coordinates to a fare zone with a spatial confidence score.

    Confidence decays with distance from the zone centroid, so a tap near a
    boundary is flagged for review rather than silently mischarged.
    """
    matches: list[tuple[str, float]] = []
    for b in boundaries:
        if b.lat_min <= tap_lat <= b.lat_max and b.lon_min <= tap_lon <= b.lon_max:
            centroid_lat = (b.lat_min + b.lat_max) / 2
            centroid_lon = (b.lon_min + b.lon_max) / 2
            dist = float(np.hypot(tap_lat - centroid_lat, tap_lon - centroid_lon))
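            # 0.005 deg is ~550 m of latitude; longitude degrees are shorter
            # away from the equator, so this is a rough planar approximation.
            confidence = max(0.0, 1.0 - (dist / 0.005))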
            matches.append((b.zone_id, confidence))
    if not matches:
        return "UNKNOWN", 0.0
    return max(matches, key=lambda m: m[1])

This component hands off in three directions. Backfilling probable trip paths cross-references recovered batches against historical feeds from GTFS-RT Realtime Sync; the GTFS Realtime specification defines the positional updates that make that reconstruction possible, though vehicle drift forces probabilistic matching. Transfer and capping decisions defer to the operator agreements encoded in Transfer Window Logic. And when even the heuristic path cannot resolve a fare, control passes to the reader-side Fallback Calculation Chains that compute a defensible degraded fare on-device. Apply caps and transfer discounts only when confidence clears the operational threshold.

Performance & Scale Considerations

A mid-size network can queue millions of offline taps during a regional backhaul outage, so the pipeline is tuned for bounded memory and predictable recovery throughput rather than peak speed.

Chunk sizing. Edge inserts flush at 500–2,000 rows; central recovery pushes at 5,000–50,000 events per batch. Below that, HTTP overhead dominates; above it, a single failed batch retries too much work.
Memory bounds. Every stage is a generator or a bounded batch. Peak RSS on the validator stays flat regardless of how long the outage lasts, because rows live in SQLite, not a Python list.
WAL checkpointing. Run PRAGMA wal_checkpoint(TRUNCATE) after a successful drain so the WAL file does not grow without bound during a multi-day partition.
Parallelism caveats. Push workers may run concurrently, but only if they shard by media_id — never split a single rider’s taps across workers, or windowed fare-cap state races and the cap is applied twice.
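The chunking and sharding rules above can be sketched with the standard library alone. `shard_for` and `chunked` are names invented for this example. The hash is taken with `hashlib` rather than the built-in `hash()` because string hashing in Python is randomized per process, which would send the same rider to different workers after a restart.

```python
import hashlib
from itertools import islice
from typing import Iterable, Iterator


def shard_for(media_id: str, n_workers: int) -> int:
    """Map a rider's media_id to a stable worker index."""
    digest = hashlib.sha256(media_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_workers


def chunked(events: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield lists of at most `size` events without materializing the input."""
    it = iter(events)
    while chunk := list(islice(it, size)):
        yield chunk
```

Routing every event through `shard_for(event["media_id"], n_workers)` before chunking keeps each rider's taps on one worker, so fare-cap state is only ever updated from a single place.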

Reconciliation & Revenue Integrity

Reconciliation drift occurs when fallback events are processed out of order, duplicated, or mismatched against clearinghouse records. The reconciliation layer enforces idempotency with a deterministic key (media_id:timestamp:device_id:direction, which extends the edge-cache key with the tap direction) and applies fare caps within windowed aggregation. All money is Decimal.

import logging
from collections import defaultdict
from decimal import Decimal

logger = logging.getLogger("afc.fallback.reconcile")

FARE_CAP_TRIGGER = Decimal("8.00")
TRANSFER_DISCOUNT = Decimal("1.50")
CONFIDENCE_FLOOR = 0.70


def reconcile_fare_batches(
    pending_events: list[dict[str, object]],
    clearinghouse_ledger: dict[str, Decimal],
) -> dict[str, dict[str, object]]:
    """Reconcile recovered taps against the authoritative ledger.

    Matched events adopt the ledger fare; unmatched events are priced by the
    heuristic zone path. Idempotency keys guarantee each tap resolves once.
    """
    reconciled: dict[str, dict[str, object]] = {}
    spend: dict[str, Decimal] = defaultdict(lambda: Decimal("0.00"))

    for evt in pending_events:
        key = str(evt["idem_key"])
        if key in reconciled:
            continue  # idempotency guard

        if key in clearinghouse_ledger:
            reconciled[key] = {
                "status": "MATCHED",
                "fare": clearinghouse_ledger[key],
            }
            continue

        media_id = str(evt["media_id"])
        zone_fare = _calculate_zone_fare(evt.get("heuristic_zone"))
        discount = _transfer_discount(evt, spend[media_id])
        final_fare = max(Decimal("0.00"), zone_fare - discount)

        reconciled[key] = {"status": "FALLBACK_RECONCILED", "fare": final_fare}
        spend[media_id] += final_fare

    logger.info("reconciled %d recovered taps", len(reconciled))
    return reconciled


def _transfer_discount(event: dict[str, object], prior_spend: Decimal) -> Decimal:
    """Return the transfer discount, or zero when a guard fails.

    Guards: confidence at or above CONFIDENCE_FLOOR, and prior spend
    strictly above FARE_CAP_TRIGGER.
    """
    if float(event.get("confidence", 0.0)) < CONFIDENCE_FLOOR:
        return Decimal("0.00")
    return TRANSFER_DISCOUNT if prior_spend > FARE_CAP_TRIGGER else Decimal("0.00")

Revenue analysts size the exposure of the fallback window with a single figure. Given the set $U$ of taps that failed to reconcile and, for each tap $i \in U$, an estimated fair fare $\hat{f}_i$ and a spatial confidence $c_i \in [0,1]$, the fare-leakage index is the confidence-weighted shortfall:

$$L = \sum_{i \in U} \hat{f}_i \,\bigl(1 - c_i\bigr)$$
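A small worked example shows how the index behaves. The sketch below is a direct transcription of the formula; the three sample taps are made up, and confidence is passed as Decimal here (an assumption of this example) so that it can be multiplied with a Decimal fare without mixing types.

```python
from decimal import Decimal


def fare_leakage_index(unreconciled: list[tuple[Decimal, Decimal]]) -> Decimal:
    """Sum of estimated fare times (1 - confidence) over unreconciled taps."""
    one = Decimal("1")
    return sum(
        (fare * (one - confidence) for fare, confidence in unreconciled),
        Decimal("0.00"),
    )


sample = [
    (Decimal("3.00"), Decimal("0.90")),  # contributes 0.30
    (Decimal("4.50"), Decimal("0.40")),  # contributes 2.70
    (Decimal("2.00"), Decimal("1.00")),  # fully trusted, contributes 0.00
]
leakage = fare_leakage_index(sample)  # numerically equal to 3.00
```

A high-confidence tap contributes almost nothing, while a low-confidence tap contributes most of its estimated fare, so the index is dominated by the events the heuristic path could not place.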

Reconciliation dashboards should surface the components that drive $L$:

Metric	Definition	Alert threshold
Unmatched fallback rate	Offline taps unreconciled within 24 h	> 2%
Fare-leakage index	Confidence-weighted revenue shortfall $L$	trend-based
DLQ backlog velocity	Malformed payloads resolved vs. accumulated per interval	resolved < accumulated (backlog growing)

Operational Checklist

Fallback routing is a temporary state, not a permanent architecture. Before enabling it in production:

Bound the state. Enforce a reconciliation-latency SLA and alert when recovered taps exceed it.
Isolate the write. Confirm every edge insert is idempotent and WAL-mode SQLite survives a simulated power cut mid-batch.
Gate the retries. Verify backoff caps out and that malformed payloads land in the DLQ instead of retrying indefinitely.
Trip the breaker. Wire a circuit breaker that halts fallback processing automatically when DLQ error rate exceeds 5%.
Instrument everything. Emit structured logs and Prometheus metrics for accepted, pending, reconciled, and dead-lettered counts.
Audit weekly. Have revenue analysts statistically sample outputs, flagging confidence < 0.6 with fare > 4.50, dwell over 3 h without a validated transfer, and idempotency-key collisions above 0.01% of daily volume.
Practice recovery. Drain a synthetic multi-day backlog into a staging clearinghouse and confirm the ledger reconciles to the cent.
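The "trip the breaker" item can be sketched as a sliding-window error-rate check. `DlqCircuitBreaker`, the 1,000-event window, and the 100-sample minimum are assumptions for this example; only the 5% threshold comes from the checklist above.

```python
from collections import deque


class DlqCircuitBreaker:
    """Opens when the dead-letter rate over a sliding window exceeds a threshold."""

    def __init__(
        self, threshold: float = 0.05, window: int = 1000, min_samples: int = 100
    ) -> None:
        self.threshold = threshold
        self.min_samples = min_samples
        # True marks a dead-lettered event; old outcomes fall off the left.
        self._outcomes: deque[bool] = deque(maxlen=window)

    def record(self, dead_lettered: bool) -> None:
        self._outcomes.append(dead_lettered)

    @property
    def error_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    @property
    def is_open(self) -> bool:
        # Require a minimum sample so one early failure cannot trip the breaker.
        enough = len(self._outcomes) >= self.min_samples
        return enough and self.error_rate > self.threshold
```

The ingestion loop would call `record()` once per event and stop draining the cache while `is_open` is true, leaving the cached taps in place for a later retry.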

By decoupling edge validation from centralized clearing, keeping ingestion memory-bounded, and enforcing deterministic reconciliation, a transit system holds service continuity through network degradation while preserving audit-grade revenue integrity.

Smart Card Schema Mapping — the contracts cached taps must satisfy before reconciliation.
Fare Zone Taxonomy Design — boundary geometry and grace periods that drive heuristic zone-crossing.
AFC System Security Boundaries — why an offline validator cannot authenticate live and what it may safely do instead.
Fallback Calculation Chains — the reader-side fare computation that offline routing defers to.
GTFS-RT Realtime Sync — the historical feeds used to backfill probable trip paths.
