Architecture

Faultline is an ML training continuity and recovery platform. Training scripts send metrics and checkpoints to a Cloud API; Postgres stores metadata; object storage holds checkpoint blobs; the Next.js dashboard proxies through your login session.

[Trainer / HF / Lightning / SDK]
        |  HTTPS + API key
   Cloud API (FastAPI)
        |
   PostgreSQL + MinIO/S3
        |
   Dashboard (Next.js BFF + JWT)

Built for long-running ML jobs on laptops, HPC clusters, and cloud GPUs. The local Rust runtime remains available for offline checkpoint experiments.

Full details in the repository: docs/ARCHITECTURE.md