Once the daemon is registered, Expanse captures every workload that runs on the compute. Complete coverage gives you accurate waste numbers, sharper predictions, and an honest view of how the compute estate is being used. Nothing for users to install, import, or remember.
What gets captured
For every workload, Expanse captures three views:
- Before: what the scheduler asked for, the submitting user, the queue, the requested resources.
- During: live metrics streamed throughout the run including CPU, memory, GPU, and I/O.
- After: real runtime, peak memory, real utilisation, exit status, and error context for failures.
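To make the link between these views and a waste number concrete, here is a minimal sketch that compares a requested memory figure (Before) with a measured peak (After). The class names, field names, and the waste formula are illustrative assumptions, not Expanse's actual schema or calculation, and the During view is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Before:
    """What was requested at submission (hypothetical field names)."""
    user: str
    queue: str
    requested_cpus: int
    requested_memory_gb: float


@dataclass
class After:
    """What the run actually did (hypothetical field names)."""
    runtime_seconds: float
    peak_memory_gb: float
    exit_status: int
    error_context: Optional[str] = None


def memory_waste_fraction(before: Before, after: After) -> float:
    """Share of requested memory that was never used, clamped to [0, 1]."""
    if before.requested_memory_gb <= 0:
        return 0.0
    unused = before.requested_memory_gb - after.peak_memory_gb
    return max(0.0, min(1.0, unused / before.requested_memory_gb))


before = Before(user="alice", queue="gpu", requested_cpus=8, requested_memory_gb=64.0)
after = After(runtime_seconds=5400.0, peak_memory_gb=16.0, exit_status=0)
print(memory_waste_fraction(before, after))  # 0.75
```

A job that asked for 64 GB and peaked at 16 GB left three quarters of its request unused, which is the kind of gap that only shows up when both the Before and After views exist for the same workload.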
Compute coverage
Expanse captures workloads automatically where the scheduler can be observed, and wraps commands where the runtime starts directly:
- SLURM: captures sbatch jobs from the login node, including user, account, partition, requested CPU, memory, GPU, walltime, job state, runtime, exit code, and scheduler diagnostics.
- Kubernetes: captures pods in the installed namespace, including owner references for Volcano, Kueue, Argo Workflows, Flux, Ray, and Flyte, plus requested resources, limits, node placement, status, runtime, logs, and events.
- Databricks: captures Databricks Jobs and Mosaic AI Training runs from the job driver, including workspace, job, run, cluster, user, requested resources, status, runtime, and failure context.
- YARN: captures Apache Hadoop YARN applications and managed YARN distributions on AWS EMR, Google Dataproc, Azure HDInsight, and Cloudera CDP, including queue, user, application, container, memory, vcore, runtime, and exit state.
- Cloud batch and training: captures AWS Batch, AWS SageMaker Training Jobs, Azure Batch, Azure ML Jobs, Google Batch, and Vertex AI Custom Training, including job definitions, queues or pools, machine shapes, accelerators, status transitions, logs, and resource metrics.
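As an illustration of what observing a scheduler can yield, the sketch below pulls the owner reference, requested resources, node placement, and phase out of a Kubernetes Pod manifest held as a plain dictionary. The manifest keys (metadata.ownerReferences, spec.nodeName, spec.containers[].resources.requests, status.phase) are standard Pod fields; the summarize_pod function and the sample pod are hypothetical and say nothing about how Expanse itself reads pods.

```python
def summarize_pod(pod: dict) -> dict:
    """Extract owner, requests, node, and phase from a Pod manifest dict."""
    metadata = pod.get("metadata", {})
    spec = pod.get("spec", {})
    # Each owner reference names the controller that created the pod.
    owners = [
        f'{ref["kind"]}/{ref["name"]}'
        for ref in metadata.get("ownerReferences", [])
    ]
    # One requests mapping per container; empty if the container set none.
    requests = [
        container.get("resources", {}).get("requests", {})
        for container in spec.get("containers", [])
    ]
    return {
        "namespace": metadata.get("namespace"),
        "name": metadata.get("name"),
        "owners": owners,
        "requests": requests,
        "node": spec.get("nodeName"),
        "phase": pod.get("status", {}).get("phase"),
    }


# Hypothetical pod created by an Argo Workflow.
pod = {
    "metadata": {
        "name": "train-abc-123",
        "namespace": "ml",
        "ownerReferences": [
            {"apiVersion": "argoproj.io/v1alpha1", "kind": "Workflow", "name": "train-abc"}
        ],
    },
    "spec": {
        "nodeName": "node-7",
        "containers": [
            {"name": "main", "resources": {"requests": {"cpu": "4", "memory": "16Gi"}}}
        ],
    },
    "status": {"phase": "Succeeded"},
}
summary = summarize_pod(pod)
print(summary["owners"])  # ['Workflow/train-abc']
```

The owner reference is what lets a bare pod be attributed to the higher-level workflow or job that launched it.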
Supported schedulers and managed services
Expanse captures workloads from:
- Cluster schedulers: SLURM, Slinky, Apache Hadoop YARN, Kubernetes, Volcano, Kueue, Argo Workflows, Flux, Ray, and Flyte.
- Databricks: Databricks Jobs and Mosaic AI Training.
- Managed YARN platforms: AWS EMR, Google Dataproc, Azure HDInsight, and Cloudera CDP.
- Cloud batch and training services: AWS Batch, AWS SageMaker Training Jobs, Azure Batch, Azure ML Jobs, Google Batch, and Vertex AI Custom Training.
- User-launched tasks: SkyPilot tasks and standalone VMs wrapped with expanse run --.
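Where no scheduler sits between the user and the process, wrapping the command is the general way to observe it. The sketch below shows that pattern with the Python standard library: start the command as a child process, time it, and record the exit status. The run_wrapped function is hypothetical and is not how expanse run is implemented; it only illustrates the idea of a command wrapper.

```python
import subprocess
import sys
import time


def run_wrapped(command: list[str]) -> dict:
    """Run a command as a child process and record runtime and exit status."""
    started = time.monotonic()
    completed = subprocess.run(command)  # waits for the child to exit
    return {
        "command": command,
        "runtime_seconds": time.monotonic() - started,
        "exit_status": completed.returncode,
    }


# Wrap a short child process that exits with status 3.
record = run_wrapped([sys.executable, "-c", "raise SystemExit(3)"])
print(record["exit_status"])  # 3
```

Because the wrapper is the parent of the process, it sees the start, the end, and the exit status without any change to the wrapped program.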