expanse analyse predicts what a workload will need before you submit it. It accepts a SLURM batch script, a source file, a Kubernetes manifest, or a SkyPilot task.
expanse analyse train.slurm
Terminal
Workload: train.slurm
Compute: hpc-01 (SLURM)

Prediction              median    p90
─────────────────────────────────────
Runtime                 1h 42m    2h 18m
Memory peak             24.6 GB   31.2 GB
CPU utilisation         78%       n/a
GPU utilisation         91%       n/a
Failure probability     6%

Suggested config
─────────────────────────────────────
--cpus-per-task 16
--mem 36G
--gres gpu:1
--time 02:30:00

Evidence: 14 similar executions on hpc-01 (last 30 days)
Confidence: 0.94

Use suggested config? [y/n]
Predictions are tied to your compute: every prediction cites the executions on your compute that informed it. For genuinely novel workloads, the engine says so explicitly rather than fabricating a number.
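If you prefer to copy the suggested config into the batch script by hand, the flags map directly onto `#SBATCH` header lines. The sketch below is illustrative only: the `suggested` dict is typed in from the terminal output above, and both it and the helper `to_sbatch_directives` are hypothetical names, not part of Expanse. The schema of the JSON output is not shown here, so no assumption is made about it.

```python
# Hypothetical: values copied by hand from the "Suggested config" block above.
# This dict is an assumption for illustration, not Expanse's JSON schema.
suggested = {
    "--cpus-per-task": "16",
    "--mem": "36G",
    "--gres": "gpu:1",
    "--time": "02:30:00",
}


def to_sbatch_directives(config: dict[str, str]) -> list[str]:
    """Render sbatch long options as '#SBATCH --flag=value' header lines."""
    return [f"#SBATCH {flag}={value}" for flag, value in config.items()]


# Print the header lines so they can be pasted into train.slurm.
print("\n".join(to_sbatch_directives(suggested)))
```

The rendered lines use the same `--flag=value` form as the `#SBATCH` directives in a SLURM batch script, so they can replace the existing resource requests one for one.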
expanse diagnose returns solution-oriented guidance for a failed execution. It cites the evidence it used: telemetry, logs, the captured source bundle, and similar successful executions of the same workload.
expanse diagnose 7f3e9a2b
Terminal output
Execution: 7f3e9a2b (SLURM job 41982)
Workload: train.slurm
Outcome: FAILED exit 137
Compute: hpc-01

Likely cause
─────────────────────────────────────
Out of memory at step 3 (loss.backward).
Peak memory before kill: 29.8 GB / 24 GB requested.

Suggested fix
─────────────────────────────────────
Raise --mem to 36G, or reduce batch_size from 64 to 32.

  diff --git a/train.slurm b/train.slurm
  --- a/train.slurm
  +++ b/train.slurm
  @@ -3,7 +3,7 @@
   #SBATCH --job-name=train
   #SBATCH --cpus-per-task=16
  -#SBATCH --mem=24G
  +#SBATCH --mem=36G
   #SBATCH --gres=gpu:1
   #SBATCH --time=02:30:00

  diff --git a/train.py b/train.py
  --- a/train.py
  +++ b/train.py
  @@ -42,7 +42,7 @@ def main():
       model = build_model().cuda()
  -    batch_size = 64
  +    batch_size = 32
       loader = DataLoader(dataset, batch_size=batch_size, num_workers=4)

Cited evidence
─────────────────────────────────────
- cgroup out-of-memory event at t+0:42:11
- 9 similar workloads succeeded with 36G+
- logs: "RuntimeError: CUDA out of memory"

Confidence: 0.92

Apply suggested fix? [y/n]
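The `exit 137` in the output above follows the common shell convention that a process killed by a signal reports 128 plus the signal number, so 137 corresponds to signal 9 (SIGKILL), which is what the Linux out-of-memory killer sends. A minimal sketch of that decoding, using a hypothetical helper `describe_exit_code` that is not part of Expanse:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe an exit code using the shell convention 128 + signal number."""
    if code > 128:
        signum = code - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            # The signal number is not defined on this platform.
            return f"killed by unknown signal {signum}"
    return "success" if code == 0 else f"exited with status {code}"


print(describe_exit_code(137))
```

SIGKILL on its own does not prove an out-of-memory kill, since other things can send it too. That is why the diagnosis also cites the cgroup out-of-memory event and the log line.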
Diagnose may also produce a unified diff suggesting a concrete fix when the evidence supports one. The diff is for you to review and apply locally; Expanse never modifies your repository.
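Before applying a suggested diff, it can help to see at a glance which files it touches and how many lines change. The following is a rough sketch, assuming the diff text has been copied out of the diagnose output into a string; `summarise_diff` and `SAMPLE` are hypothetical names used for illustration, and the parser handles only the simple git-style layout shown above.

```python
# Hypothetical sample: the suggested diff copied from the diagnose output.
SAMPLE = """\
diff --git a/train.slurm b/train.slurm
--- a/train.slurm
+++ b/train.slurm
@@ -3,7 +3,7 @@
 #SBATCH --job-name=train
 #SBATCH --cpus-per-task=16
-#SBATCH --mem=24G
+#SBATCH --mem=36G
 #SBATCH --gres=gpu:1
 #SBATCH --time=02:30:00
diff --git a/train.py b/train.py
--- a/train.py
+++ b/train.py
@@ -42,7 +42,7 @@ def main():
     model = build_model().cuda()
-    batch_size = 64
+    batch_size = 32
     loader = DataLoader(dataset, batch_size=batch_size, num_workers=4)
"""


def summarise_diff(diff_text: str) -> dict[str, dict[str, int]]:
    """Count added and removed lines per file in a git-style unified diff."""
    summary: dict[str, dict[str, int]] = {}
    current = None
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            in_hunk = False  # a new file section starts; headers follow
        elif not in_hunk and line.startswith("+++ "):
            # '+++ b/train.py' names the post-change file; drop the 'b/' prefix.
            current = line[4:].removeprefix("b/")
            summary[current] = {"added": 0, "removed": 0}
        elif line.startswith("@@"):
            in_hunk = True  # hunk header; changed lines follow
        elif in_hunk and current is not None:
            if line.startswith("+"):
                summary[current]["added"] += 1
            elif line.startswith("-"):
                summary[current]["removed"] += 1
    return summary


for path, counts in summarise_diff(SAMPLE).items():
    print(f"{path}: +{counts['added']} -{counts['removed']}")
```

If the diff is saved to a file, `git apply --check` is one way to confirm that it applies cleanly to your working tree before you change anything.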