How Recovery Works
Faultline is built for long-running jobs where failures are normal. We persist checkpoint blobs in object storage, store run metadata in Postgres, and compute recovery guidance from both sources.
Recovery semantics
- Recovery resumes only from the latest committed checkpoint.
- Estimated lost steps = latest metric step minus latest checkpoint step.
- Stale-run detection marks jobs as stale when they have reported no metrics for an extended period.
- Resume commands are generated from launch config + checkpoint state.
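
The lost-step estimate and the stale-run check above can be sketched in a few lines. This is a minimal illustration, not Faultline's actual code: the function names, the 30-minute default threshold, and the fallback to step 0 when no checkpoint has been committed are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def estimated_lost_steps(latest_metric_step: int,
                         latest_checkpoint_step: Optional[int]) -> int:
    """Latest metric step minus latest committed checkpoint step."""
    # Assumption: with no committed checkpoint, a resume restarts from step 0.
    base = latest_checkpoint_step if latest_checkpoint_step is not None else 0
    # Clamp at zero in case metric reporting lags behind the checkpoint.
    return max(0, latest_metric_step - base)


def is_stale(last_metric_at: datetime,
             now: datetime,
             stale_after: timedelta = timedelta(minutes=30)) -> bool:
    """True when no metric has arrived within `stale_after`.

    The 30-minute default is a placeholder, not a documented Faultline value.
    """
    return now - last_metric_at > stale_after


if __name__ == "__main__":
    # A run that logged step 1250 with its last checkpoint committed at step 1000.
    print(estimated_lost_steps(1250, 1000))  # 250

    now = datetime.now(timezone.utc)
    print(is_stale(now - timedelta(hours=2), now))  # True
```

Clamping at zero keeps the estimate meaningful when a checkpoint commit is recorded before the matching metric arrives.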
Durability model
- Checkpoint bytes are written to S3-compatible storage before the checkpoint is shown as committed.
- Checksums are stored and can be validated by background verification tasks.
- Dashboard badges expose checkpoint health and restore readiness.
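
As an illustration of the checksum validation mentioned above, a background verification task could re-hash the stored bytes and compare the result with the recorded checksum. The sketch below assumes SHA-256 hex digests and a file-like stream of checkpoint bytes. The source does not say which hash algorithm Faultline uses, and the function names are hypothetical.

```python
import hashlib
import hmac
import io
from typing import BinaryIO


def sha256_hex(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a stream in chunks so large checkpoints never sit fully in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def checkpoint_is_intact(stream: BinaryIO, expected_hex: str) -> bool:
    """Compare the recomputed digest with the checksum stored at commit time."""
    # Assumption: the stored checksum is a SHA-256 hex string.
    return hmac.compare_digest(sha256_hex(stream), expected_hex.lower())


if __name__ == "__main__":
    blob = b"example checkpoint bytes"
    stored_checksum = hashlib.sha256(blob).hexdigest()

    print(checkpoint_is_intact(io.BytesIO(blob), stored_checksum))         # True
    print(checkpoint_is_intact(io.BytesIO(blob + b"!"), stored_checksum))  # False
```

In practice the stream would be the object body returned by the storage client rather than an in-memory buffer.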
Faultline does not guarantee zero data loss under catastrophic storage failures. For production-grade durability, enable bucket versioning and/or replication.