Reliability

Built for long-running jobs on HPC clusters and cloud GPUs, where failures are expected and recoverability matters more than perfect uptime.

What we provide

Durable checkpoint storage (S3-compatible / MinIO)
Checkpoint verification and health badges before resume
Background worker for alert evaluation and resume tasks
Retries on storage operations in the SDK and API paths
Recovery summaries with estimated lost steps and resume snippets
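
To make three of the items above concrete (retries on storage operations, checkpoint verification before resume, and estimated lost steps), here is a minimal Python sketch. It is an illustration under stated assumptions, not the project's SDK: the names with_retries, verify_checkpoint, and estimated_lost_steps are hypothetical, verification is assumed to be a SHA-256 comparison, and "lost steps" is assumed to mean the gap between the last step the job reported and the step of the checkpoint being resumed.

```python
import hashlib
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    op: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call op(), retrying on OSError with exponential backoff.

    Hypothetical helper: the real SDK's retry policy may differ.
    """
    for attempt in range(attempts):
        try:
            return op()
        except OSError:
            # Re-raise once the final attempt has failed.
            if attempt == attempts - 1:
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be >= 1")


def verify_checkpoint(data: bytes, expected_sha256: str) -> bool:
    """Return True if the checkpoint bytes match the recorded digest.

    Assumes a SHA-256 hex digest was stored when the checkpoint was written.
    """
    return hashlib.sha256(data).hexdigest() == expected_sha256


def estimated_lost_steps(last_seen_step: int, checkpoint_step: int) -> int:
    """Steps that must be redone when resuming from checkpoint_step."""
    return max(0, last_seen_step - checkpoint_step)
```

In this sketch, a resume path would download the checkpoint through with_retries, refuse to resume if verify_checkpoint returns False, and report estimated_lost_steps in the recovery summary. Passing sleep as a parameter keeps the backoff testable without real delays.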

What we do not guarantee

Zero data loss on catastrophic storage failure (use bucket replication)
Automatic distributed training coordination across nodes
Exact step-level reproducibility after hardware changes
A 99.99% SLA on the open-source Docker stack (bring your own high-availability setup)