How Recovery Works

Faultline is built for long-running jobs where failures are normal. We persist checkpoint blobs in object storage, store run metadata in Postgres, and compute recovery guidance from both sources.

Recovery semantics

Recovery can resume from the latest committed checkpoint only.
Estimated lost steps = latest metric step minus latest checkpoint step.
Stale-run detection marks jobs that have reported no metrics for an extended period.
Resume commands are generated from the launch config and the checkpoint state.
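
As an illustration of the lost-step estimate and stale-run check described above, here is a minimal Python sketch. The RunState fields, the function names, and the threshold parameter are hypothetical and are not Faultline's actual schema or API. The handling of a run with no committed checkpoint and the clamp at zero are assumptions as well.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class RunState:
    # Hypothetical fields for illustration; the real schema is not shown here.
    latest_metric_step: int
    latest_checkpoint_step: Optional[int]  # None if no checkpoint has committed
    last_metric_at: datetime


def estimated_lost_steps(run: RunState) -> int:
    """Latest metric step minus latest committed checkpoint step."""
    if run.latest_checkpoint_step is None:
        # Assumption: with no committed checkpoint, every reported step is lost.
        return run.latest_metric_step
    # Assumption: clamp at zero in case metrics lag behind the checkpoint.
    return max(0, run.latest_metric_step - run.latest_checkpoint_step)


def is_stale(run: RunState, now: datetime, threshold: timedelta) -> bool:
    """True when the most recent metric is older than the threshold."""
    return now - run.last_metric_at > threshold


# Example: metrics reached step 1200, last committed checkpoint was step 1000.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
run = RunState(
    latest_metric_step=1200,
    latest_checkpoint_step=1000,
    last_metric_at=now - timedelta(minutes=45),
)
lost = estimated_lost_steps(run)                      # 200
stale = is_stale(run, now, timedelta(minutes=30))     # True
```

The threshold is passed in explicitly because this document does not state what counts as an extended period.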

Durability model

Checkpoint bytes are written to S3-compatible storage before the checkpoint is shown as committed.
Checksums are stored and can be validated by background verification tasks.
Dashboard badges expose checkpoint health and restore readiness.
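
A background verification task of the kind mentioned above could look roughly like the following Python sketch. It assumes SHA-256 hex digests, which this document does not confirm, and the function names are hypothetical. The stream can be any binary file-like object, such as a downloaded checkpoint blob.

```python
import hashlib
from typing import BinaryIO


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> str:
    """Hash a binary stream in chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    # read() returns b"" at end of stream, which ends the iteration.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def checkpoint_is_intact(stream: BinaryIO, expected_hex: str) -> bool:
    """Compare the recomputed digest with the checksum stored at commit time."""
    # Assumption: the stored checksum is a SHA-256 hex string; case is normalized.
    return sha256_of_stream(stream) == expected_hex.strip().lower()
```

Reading in fixed-size chunks keeps memory use flat regardless of checkpoint size, which matters when verification runs alongside other work.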

Faultline does not guarantee zero loss under catastrophic storage failures. For production-grade durability, enable bucket versioning and replication.