Reliability

Built for long-running jobs on HPC clusters and cloud GPUs, where failures are expected and recoverability matters more than perfect uptime.

What we provide

  • Durable checkpoint storage (S3-compatible / MinIO)
  • Checkpoint verification and health badges before resume
  • Background worker for alert evaluation and resume tasks
  • Retries on storage operations in the SDK and API paths
  • Recovery summaries with estimated lost steps and resume snippets
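
To make the resume path above concrete, here is a minimal sketch of how retries, verification, and the lost-step estimate could fit together. It is an illustration under stated assumptions, not the project's actual SDK: the names `Checkpoint`, `with_retries`, `is_healthy`, `latest_healthy`, and `estimated_lost_steps` are hypothetical, SHA-256 is assumed as the integrity check, and `OSError` stands in for whatever transient error a real storage client raises.

```python
import hashlib
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Checkpoint:
    """Hypothetical checkpoint record; not the project's real schema."""
    step: int       # training step at which the checkpoint was written
    payload: bytes  # serialized checkpoint contents
    sha256: str     # hex digest recorded when the checkpoint was written


def with_retries(op: Callable[[], T], attempts: int = 3,
                 base_delay: float = 0.1,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call op(), retrying on OSError with exponential backoff.

    OSError is an assumed stand-in for transient storage failures.
    The last failure is re-raised once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return op()
        except OSError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


def is_healthy(ckpt: Checkpoint) -> bool:
    """Verify that the payload still matches the digest recorded at write time."""
    return hashlib.sha256(ckpt.payload).hexdigest() == ckpt.sha256


def latest_healthy(checkpoints: Iterable[Checkpoint]) -> Optional[Checkpoint]:
    """Return the highest-step checkpoint that passes verification, if any."""
    healthy = [c for c in checkpoints if is_healthy(c)]
    return max(healthy, key=lambda c: c.step, default=None)


def estimated_lost_steps(last_seen_step: int, resume_from: Checkpoint) -> int:
    """Steps that must be redone when resuming from the given checkpoint."""
    return max(0, last_seen_step - resume_from.step)


if __name__ == "__main__":
    good = Checkpoint(1000, b"weights-1000",
                      hashlib.sha256(b"weights-1000").hexdigest())
    # Simulated corruption: the payload no longer matches its recorded digest.
    bad = Checkpoint(2000, b"truncated",
                     hashlib.sha256(b"weights-2000").hexdigest())

    # A real loader would fetch from object storage inside the retried call.
    candidates = with_retries(lambda: [good, bad])
    resume = latest_healthy(candidates)
    if resume is not None:
        print(resume.step, estimated_lost_steps(2350, resume))  # prints: 1000 1350
```

In this sketch the corrupted step-2000 checkpoint is skipped, so the job resumes from step 1000 and reports 1350 lost steps. Verifying before resume trades some extra recomputation for never loading a damaged checkpoint.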

What we do not guarantee

  • Zero data loss on catastrophic storage failure (use bucket replication)
  • Automatic distributed training coordination across nodes
  • Exact step-level reproducibility after hardware changes
  • A 99.99% uptime SLA on the open-source Docker stack (bring your own high-availability setup)