Training continuity

Never lose days of ML training again.

Faultline monitors long-running jobs, stores checkpoints, and shows you exactly how to resume training on a laptop, HPC cluster, or cloud GPUs.

How it works

From your trainer to durable checkpoints

SDK or framework callbacks stream metrics and state to the Cloud API. Postgres stores run metadata; object storage holds checkpoint blobs. The dashboard is your recovery control plane — not another experiment charting tool.
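The split between run metadata and checkpoint blobs can be pictured with a small sketch. Everything in it is an illustrative assumption: the class names, fields, and object-key layout are hypothetical and are not Faultline's actual schema, and in-memory containers stand in for Postgres and object storage.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class CheckpointRecord:
    """Small, queryable metadata row (the kind of thing Postgres would hold)."""
    run_id: str
    step: int
    object_key: str   # where the blob lives in object storage
    sha256: str       # lets a reader verify the blob before resuming
    size_bytes: int


@dataclass
class Store:
    """In-memory stand-ins for the two backends (hypothetical)."""
    metadata: list = field(default_factory=list)  # stands in for Postgres
    blobs: dict = field(default_factory=dict)     # stands in for object storage

    def put_checkpoint(self, run_id: str, step: int, blob: bytes) -> CheckpointRecord:
        key = f"runs/{run_id}/step-{step:08d}.ckpt"  # hypothetical key layout
        self.blobs[key] = blob                       # write the blob first
        record = CheckpointRecord(
            run_id, step, key, hashlib.sha256(blob).hexdigest(), len(blob)
        )
        self.metadata.append(record)                 # then record that it exists
        return record

    def is_healthy(self, record: CheckpointRecord) -> bool:
        """A checkpoint is usable only if its blob exists and matches its hash."""
        blob = self.blobs.get(record.object_key)
        return blob is not None and hashlib.sha256(blob).hexdigest() == record.sha256
```

Writing the blob before the metadata row means a crash between the two steps leaves an orphaned blob rather than a row that points at nothing, which is the safer failure for a recovery tool.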

When training fails

Built for failures trackers weren't designed to fix

Spot preemption
Your GPU job disappears mid-epoch — hours of spend at risk.
Slurm eviction
The cluster requeues without your checkpoint path handy.
Process crash
OOM, Ctrl+C, or a bug — progress scattered across logs.

Recovery

Checkpoint. Crash. Resume.

save(step) streams state to object storage. When the node dies, the dashboard shows lost steps, checkpoint health, and a copy-paste auto_resume() path.
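The checkpoint, crash, resume cycle can be tried locally with a small sketch. It is not the Faultline SDK: the function names echo save and auto_resume from the text above, but the signatures and bodies are hypothetical stand-ins that write JSON files to a local directory instead of streaming to object storage.

```python
import json
import os
import tempfile
from pathlib import Path


def save(ckpt_dir: Path, step: int, state: dict) -> None:
    """Write a checkpoint atomically so a crash never leaves a half-written file."""
    tmp = ckpt_dir / f"step-{step:08d}.json.tmp"
    tmp.write_text(json.dumps({"step": step, "state": state}))
    os.replace(tmp, ckpt_dir / f"step-{step:08d}.json")  # atomic rename


def auto_resume(ckpt_dir: Path):
    """Return (step, state) from the newest checkpoint, or a fresh start if none exist."""
    ckpts = sorted(ckpt_dir.glob("step-*.json"))  # zero-padded names sort by step
    if not ckpts:
        return 0, {"total": 0}
    payload = json.loads(ckpts[-1].read_text())
    return payload["step"], payload["state"]


def train(ckpt_dir: Path, total_steps: int, save_every: int, crash_at=None) -> dict:
    """Toy training loop: resumes from the latest checkpoint, saves periodically."""
    step, state = auto_resume(ckpt_dir)
    while step < total_steps:
        step += 1
        state["total"] += step  # stand-in for one optimizer step
        if step % save_every == 0:
            save(ckpt_dir, step, state)
        if crash_at is not None and step == crash_at:
            raise RuntimeError(f"simulated crash at step {step}")
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ckpt_dir = Path(d)
        try:
            train(ckpt_dir, total_steps=100, save_every=10, crash_at=37)
        except RuntimeError:
            last_step, _ = auto_resume(ckpt_dir)
            print("lost steps:", 37 - last_step)  # steps since the last checkpoint
        final = train(ckpt_dir, total_steps=100, save_every=10)
        print("final total:", final["total"])
```

In this toy run the crash at step 37 falls back to the checkpoint from step 30, so seven steps are repeated. The "lost steps" figure is simply the gap between the last durable checkpoint and the point of failure, which is why a shorter save interval reduces the cost of a crash.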

Dashboard

Metrics, checkpoints, and resume commands in one place

Watch live loss curves while jobs run. The dashboard also shows a checkpoint timeline and recovery readiness badges, and it can relaunch a run once you've registered a launch config. These features are covered in the product overview above.

Integrations

Drop into the stack you already use

HuggingFace: FaultlineTrainerCallback
PyTorch Lightning: FaultlineLightningCallback
Raw PyTorch: faultline.auto_resume()

Quickstart

Up and running locally

Docker Compose brings up Postgres, MinIO, the API, and the dashboard UI. Demo runs are pre-seeded, so you can explore without writing a training script.

Demo login: demo@faultline.local · password: faultlinedemo

pip install faultline-sdk
export FAULTLINE_API_KEY=fl_...
export FAULTLINE_API_URL=https://your-api.onrender.com
python train.py
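The page does not say how the SDK picks up the two environment variables, so the following is only a sketch of a fail-fast check you might place at the top of train.py. FaultlineConfig and load_config are hypothetical helper names, not part of faultline-sdk.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultlineConfig:
    """Hypothetical container for the two settings exported above."""
    api_key: str
    api_url: str


def load_config(env=os.environ) -> FaultlineConfig:
    """Read both variables and fail before training starts if either is missing."""
    required = ("FAULTLINE_API_KEY", "FAULTLINE_API_URL")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    # Strip a trailing slash so later path joins do not produce "//".
    return FaultlineConfig(env["FAULTLINE_API_KEY"], env["FAULTLINE_API_URL"].rstrip("/"))
```

Checking the configuration up front means a mistyped variable surfaces in the first second of the job rather than at the first checkpoint.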

Try it in two minutes

docker compose -f docker-compose.cloud.yml up --build
