How Recovery Works

Faultline is built for long-running jobs where failures are normal. We persist checkpoint blobs in object storage, store run metadata in Postgres, and compute recovery guidance from both sources.

Recovery semantics

  • Recovery resumes only from the latest committed checkpoint.
  • Estimated lost steps = latest metric step minus latest checkpoint step.
  • Stale-run detection marks jobs with no metrics for an extended period.
  • Resume commands are generated from launch config + checkpoint state.
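
The lost-step estimate and stale-run check above can be sketched as follows. This is a minimal illustration, not Faultline's actual implementation: the RunState fields, the function names, and the threshold parameter are all hypothetical. Treating a run with no committed checkpoint as resuming from step 0, and clamping the estimate at zero, are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class RunState:
    """Hypothetical snapshot of one run's metadata (field names are assumptions)."""
    latest_metric_step: int
    latest_checkpoint_step: Optional[int]  # None when no checkpoint is committed
    last_metric_at: datetime


def estimated_lost_steps(run: RunState) -> int:
    """Latest metric step minus latest committed checkpoint step."""
    # Assumption: with no committed checkpoint, the run restarts from step 0.
    base = run.latest_checkpoint_step if run.latest_checkpoint_step is not None else 0
    # Assumption: clamp at zero in case the checkpoint is ahead of the metrics.
    return max(0, run.latest_metric_step - base)


def is_stale(run: RunState, now: datetime, threshold: timedelta) -> bool:
    """True when no metrics have arrived for longer than the threshold."""
    return now - run.last_metric_at > threshold


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
run = RunState(
    latest_metric_step=1250,
    latest_checkpoint_step=1000,
    last_metric_at=now - timedelta(minutes=45),
)
print(estimated_lost_steps(run))                   # 250
print(is_stale(run, now, timedelta(minutes=30)))   # True
```

The threshold is passed in by the caller because the text does not state how long a run must be silent before it is marked stale.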

Durability model

  • Checkpoint bytes are written to S3-compatible storage before commit status is shown.
  • Checksums are stored and can be validated by background verification tasks.
  • Dashboard badges expose checkpoint health and restore readiness.
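
The write-before-commit ordering and the checksum validation described above might look roughly like this. It is a sketch under stated assumptions: in-memory dictionaries stand in for the S3-compatible bucket and the Postgres metadata table, SHA-256 is assumed as the checksum algorithm, and commit_checkpoint and verify_checkpoint are hypothetical names.

```python
import hashlib
from typing import Dict

# Stand-ins for the real backends (assumptions for illustration only).
object_store: Dict[str, bytes] = {}   # plays the role of the S3-compatible bucket
metadata: Dict[str, dict] = {}        # plays the role of the Postgres metadata table


def commit_checkpoint(key: str, blob: bytes) -> None:
    """Write the bytes first, then record the commit and its checksum."""
    object_store[key] = blob  # step 1: durable write of the checkpoint bytes
    metadata[key] = {         # step 2: only now is the checkpoint marked committed
        "sha256": hashlib.sha256(blob).hexdigest(),
        "committed": True,
    }


def verify_checkpoint(key: str) -> bool:
    """Recompute the checksum of the stored bytes and compare with the record."""
    row = metadata.get(key)
    blob = object_store.get(key)
    if row is None or blob is None:
        return False
    return hashlib.sha256(blob).hexdigest() == row["sha256"]


commit_checkpoint("run-1/step-1000", b"model weights")
print(verify_checkpoint("run-1/step-1000"))  # True

# Simulate silent corruption of the stored bytes.
object_store["run-1/step-1000"] = b"model weightz"
print(verify_checkpoint("run-1/step-1000"))  # False
```

Recording the checksum at commit time is what lets a later background task detect corruption without access to the original bytes.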

Faultline does not guarantee zero data loss if the underlying storage fails catastrophically. For production deployments, enable bucket versioning and replication.