Reliability
Built for long-running jobs on HPC clusters and cloud GPUs, where failures are expected and recoverability matters more than perfect uptime.
What we provide
- Durable checkpoint storage (S3-compatible / MinIO)
- Checkpoint verification and health badges before resume
- Background worker for alert evaluation and resume tasks
- Retries on storage operations in the SDK and API paths
- Recovery summaries with estimated lost steps and resume snippets
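
As a rough illustration of three items in the list above (storage retries, checkpoint verification, and the lost-step estimate), here is a minimal Python sketch. It is not the project's SDK: `with_retries`, `checkpoint_is_intact`, and `estimated_lost_steps` are hypothetical names. The SHA-256 comparison, the exponential backoff, and the "last seen step minus checkpoint step" estimate are all assumptions about how such checks could work, not a description of the actual implementation.

```python
import hashlib
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    op: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run a storage operation, retrying on OSError with exponential backoff.

    Hypothetical helper: the real SDK's retry policy may differ.
    """
    for attempt in range(attempts):
        try:
            return op()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise ValueError("attempts must be >= 1")


def checkpoint_is_intact(data: bytes, expected_sha256: str) -> bool:
    """Compare the checkpoint's SHA-256 digest with the one recorded at save time.

    Assumption: a hex digest was stored alongside the checkpoint.
    """
    return hashlib.sha256(data).hexdigest() == expected_sha256


def estimated_lost_steps(last_seen_step: int, checkpoint_step: int) -> int:
    """Steps that must be redone when resuming from the given checkpoint.

    Assumption: the estimate is the gap between the last reported step
    and the checkpoint's step, never negative.
    """
    return max(0, last_seen_step - checkpoint_step)
```

A resume path could wrap the download in `with_retries`, refuse to resume if `checkpoint_is_intact` returns `False`, and report `estimated_lost_steps(last_seen_step, checkpoint_step)` in the recovery summary.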
What we do not guarantee
- Zero data loss on catastrophic storage failure (use bucket replication)
- Automatic distributed training coordination across nodes
- Exact step-level reproducibility after hardware changes
- A 99.99% uptime SLA on the open-source Docker stack (bring your own high-availability setup)