Demo mode: sample data only. Sign up to connect real training runs.

failed · Lightning · recoverable

slurm-protein-exp7

Step 400 · loss 0.1800 · 7/3/2026, 12:17:06 AM

Recovery

resume_from_checkpoint

python -m faultline.cli resume demo-run-failed
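
The resume_from_checkpoint strategy can be pictured as choosing the most recent committed checkpoint at or before the failed step. The sketch below is only an illustration of that idea, not faultline's implementation: the `Checkpoint` record and the `pick_resume_checkpoint` function are hypothetical names, and the step values mirror the demo data on this page, assuming every listed checkpoint was committed.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Checkpoint:
    # Hypothetical record; field names are assumptions, not faultline's schema.
    step: int
    committed: bool


def pick_resume_checkpoint(
    checkpoints: Sequence[Checkpoint], failed_step: int
) -> Optional[Checkpoint]:
    """Return the latest committed checkpoint at or before failed_step, or None."""
    candidates = [c for c in checkpoints if c.committed and c.step <= failed_step]
    return max(candidates, key=lambda c: c.step, default=None)


# Demo values taken from this page; treating all four as committed is an assumption.
demo = [Checkpoint(step=s, committed=True) for s in (100, 200, 300, 400)]
chosen = pick_resume_checkpoint(demo, failed_step=400)
print(chosen.step if chosen else "no usable checkpoint")
```

Filtering on `committed` before taking the maximum means a partially written checkpoint is never selected, even if it has the highest step number.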

Metrics

Checkpoints

Latest checkpoint: step 400 · 10.0m ago · committed
100
200
300
400

Events

  • info · faultline.run.resume_completed · 7/3/2026, 12:18:06 AM

    recovery succeeded from checkpoint step 400

    Recovery successful

  • error · faultline.run.failed · 7/3/2026, 12:17:06 AM

    Slurm node eviction (demo)

  • info · faultline.checkpoint.saved · 7/3/2026, 12:16:06 AM

    checkpoint step 400