Never lose days of ML training again.

Faultline monitors long-running jobs, stores checkpoints, and shows you exactly how to resume training on a laptop, HPC cluster, or cloud GPUs.

From your trainer to durable checkpoints

The SDK or a framework callback streams metrics and training state to the Cloud API. Postgres stores run metadata; object storage holds checkpoint blobs. The dashboard is your recovery control plane, not another experiment-charting tool.

Built for failures trackers weren't designed to fix

  • Spot preemption

    Your GPU job disappears mid-epoch — hours of spend at risk.

  • Slurm eviction

    The cluster requeues without your checkpoint path handy.

  • Process crash

    OOM, Ctrl+C, or a bug — progress scattered across logs.
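Two of these failures usually announce themselves: Ctrl+C arrives as SIGINT, and many schedulers send SIGTERM shortly before killing a job. A minimal sketch of using that window to write one last checkpoint follows. It uses only the Python standard library and is not Faultline SDK code; `StopRequested`, `install_handlers`, `train`, and `save_fn` are hypothetical names for illustration. A hard OOM kill (SIGKILL) cannot be caught, which is why periodic checkpoints still matter.

```python
import signal


class StopRequested:
    """Flag flipped by a signal handler so the loop can stop at a safe point."""

    def __init__(self):
        self.requested = False
        self.signum = None

    def handle(self, signum, frame):
        # Keep the handler tiny: record the request, let the loop react.
        self.requested = True
        self.signum = signum


def install_handlers(stop, signals=(signal.SIGTERM, signal.SIGINT)):
    """Route the given signals to stop.handle (call from the main thread)."""
    for sig in signals:
        signal.signal(sig, stop.handle)


def train(total_steps, stop, save_fn, on_step=None):
    """Run up to total_steps; checkpoint and return early if a stop is requested."""
    step = 0
    while step < total_steps:
        step += 1  # stand-in for one real optimizer step
        if on_step is not None:
            on_step(step)
        if stop.requested:
            save_fn(step)  # final checkpoint before the process is killed
            return step
    save_fn(step)  # normal end-of-run checkpoint
    return step
```

In a real script you would call `install_handlers(stop)` once at startup and pass your own checkpoint function as `save_fn`.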

Checkpoint. Crash. Resume.

save(step) streams state to object storage. When the node dies, the dashboard shows lost steps, checkpoint health, and a copy-paste auto_resume() path.
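To make the checkpoint-then-resume idea concrete, here is a local sketch of the pattern. It writes to a directory rather than object storage and does not use the Faultline SDK: `save_checkpoint`, `load_latest`, and `lost_steps` are hypothetical helpers, and "lost steps" is assumed here to mean the last step the trainer reported minus the step of the newest readable checkpoint.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(ckpt_dir, step, state):
    """Write a checkpoint atomically so a crash never leaves a half-written file."""
    ckpt_dir = Path(ckpt_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final = ckpt_dir / f"step_{step:08d}.json"
    fd, tmp = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, final)  # atomic rename within the same directory
    return final


def load_latest(ckpt_dir):
    """Return the newest readable checkpoint, skipping corrupt files, or None."""
    ckpt_dir = Path(ckpt_dir)
    if not ckpt_dir.is_dir():
        return None
    # Zero-padded step numbers make lexicographic order match numeric order.
    for path in sorted(ckpt_dir.glob("step_*.json"), reverse=True):
        try:
            with open(path) as f:
                return json.load(f)
        except (json.JSONDecodeError, OSError):
            continue
    return None


def lost_steps(last_seen_step, checkpoint):
    """Steps that must be redone after resuming from the given checkpoint."""
    if checkpoint is None:
        return last_seen_step
    return max(0, last_seen_step - checkpoint["step"])
```

On restart, a training script would call `load_latest(...)`, restore its state from the result, and continue from `checkpoint["step"] + 1`.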

Try the interactive demo →

Metrics, checkpoints, and resume commands in one place

Live loss curves while jobs run, plus a checkpoint timeline, recovery-readiness badges, and relaunch once you've registered a launch config. Covered in the product overview above.

Drop into the stack you already use

  • HuggingFace

    FaultlineTrainerCallback

  • PyTorch Lightning

    FaultlineLightningCallback

  • Raw PyTorch

    faultline.auto_resume()

Up and running locally

Docker Compose brings up Postgres, MinIO, the API, and this UI. Pre-seeded demo runs let you explore without writing a training script.

Demo login: demo@faultline.local · Password: faultlinedemo

pip install faultline-sdk
export FAULTLINE_API_KEY=fl_...
export FAULTLINE_API_URL=https://your-api.onrender.com
python train.py
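A training script can fail fast when those two variables are missing instead of discovering the problem hours into a run. The sketch below is one way to do that check. `read_faultline_env` is a hypothetical helper, not part of the SDK; only the variable names `FAULTLINE_API_KEY` and `FAULTLINE_API_URL` come from the commands above.

```python
import os
from urllib.parse import urlparse


def read_faultline_env(env=None):
    """Validate the two settings before training starts; raise if unusable."""
    env = os.environ if env is None else env
    api_key = env.get("FAULTLINE_API_KEY", "")
    api_url = env.get("FAULTLINE_API_URL", "")

    problems = []
    if not api_key:
        problems.append("FAULTLINE_API_KEY is not set")
    parsed = urlparse(api_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("FAULTLINE_API_URL must be an http(s) URL")
    if problems:
        raise RuntimeError("; ".join(problems))

    # Strip a trailing slash so later path joins stay predictable.
    return {"api_key": api_key, "api_url": api_url.rstrip("/")}
```

Calling `read_faultline_env()` at the top of `train.py` reads from the real environment; passing a dict makes the check easy to unit test.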

Try it in two minutes

docker compose -f docker-compose.cloud.yml up --build