Training continuity
Never lose days of ML training again.
Faultline monitors long-running jobs, stores checkpoints, and shows you exactly how to resume training on a laptop, HPC cluster, or cloud GPUs.
How it works
From your trainer to durable checkpoints
SDK or framework callbacks stream metrics and state to the Cloud API. Postgres stores run metadata; object storage holds checkpoint blobs. The dashboard is your recovery control plane — not another experiment charting tool.
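The metadata/blob split described above can be illustrated with a small sketch. This is not Faultline's schema or API: `CheckpointRecord`, `register_checkpoint`, and `is_healthy` are hypothetical names, and plain Python containers stand in for Postgres and object storage. It only shows the general pattern of keeping a small searchable record that points at a large blob and carries a checksum for later health checks.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckpointRecord:
    """Small metadata row (hypothetical); the kind of thing a SQL table would hold."""
    run_id: str
    step: int
    blob_key: str     # where the checkpoint bytes live in object storage
    sha256: str       # checksum recorded at upload time
    size_bytes: int


def register_checkpoint(blobs, records, run_id, step, payload):
    """Store the bytes in the blob store and append a metadata record for them."""
    key = f"{run_id}/step_{step:08d}.ckpt"
    blobs[key] = payload  # a dict stands in for an object-storage bucket
    record = CheckpointRecord(
        run_id=run_id,
        step=step,
        blob_key=key,
        sha256=hashlib.sha256(payload).hexdigest(),
        size_bytes=len(payload),
    )
    records.append(record)  # a list stands in for a metadata table
    return record


def is_healthy(blobs, record):
    """A checkpoint is usable only if its blob exists and matches the checksum."""
    data = blobs.get(record.blob_key)
    return data is not None and hashlib.sha256(data).hexdigest() == record.sha256
```

Keeping the checksum in the metadata record means a missing or corrupted blob can be detected without downloading every checkpoint up front.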
When training fails
Built for the failures experiment trackers weren't designed to fix
Spot preemption
Your GPU job disappears mid-epoch — hours of spend at risk.
Slurm eviction
The cluster requeues your job, and the checkpoint path isn't at hand.
Process crash
OOM, Ctrl+C, or a bug — progress scattered across logs.
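Some of these failures arrive with a warning. As a rough sketch, assuming the scheduler or orchestrator delivers SIGTERM before killing the process (Slurm does this by default, and Ctrl+C arrives as SIGINT), a training loop can set a flag in the handler and save at the next step boundary. The names `StopRequested`, `install_handlers`, and `train` are hypothetical and not part of the Faultline SDK. A kernel OOM kill sends SIGKILL, which cannot be caught, so periodic checkpoints are still needed.

```python
import signal


class StopRequested:
    """Flag set from a signal handler; the loop polls it between steps."""

    def __init__(self):
        self.requested = False

    def __call__(self, signum, frame):
        # Keep the handler tiny: only record that a stop was requested.
        self.requested = True


def install_handlers(stop):
    """Route SIGTERM and SIGINT to the flag (call from the main thread)."""
    signal.signal(signal.SIGTERM, stop)
    signal.signal(signal.SIGINT, stop)


def train(total_steps, step_fn, save_fn, stop, start_step=0):
    """Run steps until done or until a stop is requested, then save once."""
    step = start_step
    while step < total_steps and not stop.requested:
        step_fn(step)   # stand-in for one optimizer step
        step += 1
    save_fn(step)       # final save, whether finished or interrupted
    return step
```

In a real script you would call `install_handlers(stop)` once before the loop and pass your own step and save functions. Saving at a step boundary rather than inside the handler avoids writing a checkpoint while model state is mid-update.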
Recovery
Checkpoint. Crash. Resume.
save(step) streams state to object storage. When the node dies, the dashboard shows lost steps, checkpoint health, and a copy-paste auto_resume() path.
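To make the checkpoint, crash, resume cycle concrete, here is a minimal local sketch. It does not show the real signatures of Faultline's `save(step)` or `auto_resume()`; the helpers `save_checkpoint`, `latest_checkpoint`, and `resume_point` are hypothetical, a local directory stands in for object storage, and JSON stands in for real model state. It illustrates three ideas: write checkpoints atomically, resume from the newest readable one, and count lost steps as the gap between the last reported step and the last checkpoint.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(ckpt_dir, step, state):
    """Write a checkpoint atomically: temp file first, then rename into place."""
    ckpt_dir = Path(ckpt_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final = ckpt_dir / f"step_{step:08d}.json"
    fd, tmp = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())   # make sure bytes hit disk before the rename
    os.replace(tmp, final)     # atomic on the same filesystem
    return final


def latest_checkpoint(ckpt_dir):
    """Return the newest checkpoint that parses, skipping unreadable files."""
    ckpt_dir = Path(ckpt_dir)
    if not ckpt_dir.is_dir():
        return None
    # Zero-padded step numbers make lexicographic order match numeric order.
    for path in sorted(ckpt_dir.glob("step_*.json"), reverse=True):
        try:
            with open(path) as f:
                return json.load(f)
        except (OSError, json.JSONDecodeError):
            continue
    return None


def resume_point(ckpt_dir, last_reported_step):
    """Return (step to resume from, steps lost since the last good checkpoint)."""
    ckpt = latest_checkpoint(ckpt_dir)
    start = ckpt["step"] if ckpt else 0
    return start, max(0, last_reported_step - start)
```

For example, if the last metric was reported at step 260 and the newest readable checkpoint is from step 200, `resume_point` returns `(200, 60)`: restart at step 200 with 60 steps to redo.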
Dashboard
Metrics, checkpoints, and resume commands in one place
Watch live loss curves while jobs run. Each run shows a checkpoint timeline and recovery-readiness badges, and can be relaunched from the dashboard once you've registered a launch config. These features are covered in the product overview above.
Integrations
Drop into the stack you already use
HuggingFace
FaultlineTrainerCallback
PyTorch Lightning
FaultlineLightningCallback
Raw PyTorch
faultline.auto_resume()
Quickstart
Up and running locally
Docker Compose brings up Postgres, MinIO, the API, and this UI, along with pre-seeded demo runs you can explore without writing a training script.
demo@faultline.local · faultlinedemo
pip install faultline-sdk
export FAULTLINE_API_KEY=fl_...
export FAULTLINE_API_URL=https://your-api.onrender.com
python train.py
Try it in two minutes
docker compose -f docker-compose.cloud.yml up --build