Expanse exposes two intelligence commands. Both reason over evidence from your own compute.

expanse analyse

Before submission. Predicts runtime, peak memory, and CPU and GPU utilisation.

expanse diagnose

After failure. Solution-oriented guidance with cited evidence.

Resource prediction

expanse analyse predicts what a workload will need before you submit it. It accepts a SLURM batch script, a source file, a Kubernetes manifest, or a SkyPilot task.

expanse analyse train.slurm
Workload: train.slurm
Compute:  hpc-01 (SLURM)

Prediction              median    p90
─────────────────────────────────────
Runtime                 1h 42m    2h 18m
Memory peak             24.6 GB   31.2 GB
CPU utilisation         78%       n/a
GPU utilisation         91%       n/a
Failure probability     6%

Suggested config
─────────────────────────────────────
--cpus-per-task   16
--mem             36G
--gres            gpu:1
--time            02:30:00

Evidence: 14 similar executions on hpc-01 (last 30 days)
Confidence: 0.94

Use suggested config? [y/n]

Predictions are grounded in your own compute: each one cites the executions that informed it. For genuinely novel workloads, the engine says so explicitly rather than fabricating a number.
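
If you prefer to edit the batch script by hand, the suggested values correspond one-to-one to SLURM #SBATCH directives. The following is a minimal sketch of that mapping, not a description of how Expanse applies the config. render_sbatch is a hypothetical helper written for illustration and is not part of the Expanse CLI. The values are copied from the sample output above.

```python
from typing import List, Mapping


def render_sbatch(config: Mapping[str, str]) -> List[str]:
    """Render option/value pairs as #SBATCH directive lines.

    Hypothetical helper for illustration; not part of Expanse.
    """
    return [f"#SBATCH --{option}={value}" for option, value in config.items()]


# Values copied from the "Suggested config" block in the sample output.
suggested = {
    "cpus-per-task": "16",
    "mem": "36G",
    "gres": "gpu:1",
    "time": "02:30:00",
}

directives = render_sbatch(suggested)
print("\n".join(directives))
```

The printed lines can be pasted near the top of train.slurm, replacing any existing directives for the same options.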

Failure diagnosis

expanse diagnose returns solution-oriented guidance for a failed execution. It cites the evidence it used: telemetry, logs, the captured source bundle, and similar successful executions of the same workload.

expanse diagnose 7f3e9a2b
Execution:  7f3e9a2b  (SLURM job 41982)
Workload:   train.slurm
Outcome:    FAILED   exit 137
Compute:    hpc-01

Likely cause
─────────────────────────────────────
Out of memory at step 3 (loss.backward).
Peak memory before kill: 29.8 GB / 24 GB requested.

Suggested fix
─────────────────────────────────────
Raise --mem to 36G, or reduce batch_size from 64 to 32.

  diff --git a/train.slurm b/train.slurm
  --- a/train.slurm
  +++ b/train.slurm
  @@ -3,5 +3,5 @@
   #SBATCH --job-name=train
   #SBATCH --cpus-per-task=16
  -#SBATCH --mem=24G
  +#SBATCH --mem=36G
   #SBATCH --gres=gpu:1
   #SBATCH --time=02:30:00

  diff --git a/train.py b/train.py
  --- a/train.py
  +++ b/train.py
  @@ -42,4 +42,4 @@
   def main():
       model = build_model().cuda()
  -    batch_size = 64
  +    batch_size = 32
       loader = DataLoader(dataset, batch_size=batch_size, num_workers=4)

Cited evidence
─────────────────────────────────────
- cgroup out-of-memory event at t+0:42:11
- 9 similar workloads succeeded with 36G+
- logs: "RuntimeError: CUDA out of memory"

Confidence: 0.92

Apply suggested fix? [y/n]

When the evidence supports a concrete fix, expanse diagnose also produces a unified diff. The diff is for you to review and apply locally; Expanse never modifies your repository.
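
The exit 137 in the sample output follows the common shell convention of reporting 128 + N when a process is ended by signal N. Here N is 9 (SIGKILL), the signal the Linux out-of-memory killer sends, which is consistent with the cited cgroup event. This is general shell behaviour rather than anything specific to Expanse. A minimal sketch of decoding such a status, where describe_exit is a hypothetical helper and not an Expanse API:

```python
import signal


def describe_exit(code: int) -> str:
    """Explain a shell-style exit status using the 128 + N signal convention.

    Hypothetical helper for illustration; not part of Expanse.
    """
    if code == 0:
        return "success"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


print(describe_exit(137))  # on Linux: terminated by SIGKILL
```

A status of 137 alone does not prove an out-of-memory kill, since anything can send SIGKILL; it is the combination with telemetry and logs that supports the diagnosis.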