Once the daemon is registered, Expanse captures every workload that runs on the compute. Complete coverage gives you accurate waste numbers, sharper predictions, and an honest view of how the compute estate is being used. There is nothing for users to install, import, or remember.

What gets captured

For every workload, Expanse captures three views:
  • Before: what the scheduler asked for, the submitting user, the queue, the requested resources.
  • During: live metrics streamed throughout the run, including CPU, memory, GPU, and I/O.
  • After: actual runtime, peak memory, actual utilisation, exit status, and error context for failures.
Together, those views feed the Console’s waste dashboards and the intelligence layer.
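To make the relationship between the views concrete, here is a minimal sketch of how a "before" and an "after" record could be compared to produce a waste number. The class names, field names, and the waste formula are assumptions made for illustration; they are not Expanse's actual schema or calculation, and the "during" view is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Before:
    """What the scheduler was asked for (hypothetical field names)."""
    user: str
    queue: str
    requested_cpus: int
    requested_memory_gb: float


@dataclass
class After:
    """What the run actually used (hypothetical field names)."""
    runtime_seconds: float
    peak_memory_gb: float
    exit_status: int
    error_context: Optional[str] = None


def memory_waste_fraction(before: Before, after: After) -> float:
    """Share of requested memory that was never used, clamped to [0, 1]."""
    if before.requested_memory_gb <= 0:
        return 0.0
    unused = before.requested_memory_gb - after.peak_memory_gb
    return max(0.0, min(1.0, unused / before.requested_memory_gb))


# Illustrative values only: 64 GB requested, 16 GB peak.
before = Before(user="alice", queue="gpu", requested_cpus=8, requested_memory_gb=64.0)
after = After(runtime_seconds=1800.0, peak_memory_gb=16.0, exit_status=0)
print(memory_waste_fraction(before, after))  # 0.75
```

A job that requested 64 GB and peaked at 16 GB left three quarters of its request unused, which is the kind of gap a waste dashboard would surface.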
Zero overhead. Capture happens on the compute, in the background. There is no measurable performance cost on the workloads being observed, and telemetry never blocks workload execution: if the daemon is unavailable or the Cloud API is unreachable, jobs run normally and telemetry is buffered locally until it can be delivered.
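The buffering behaviour described above can be sketched as a local queue that is drained only when delivery succeeds. This is an illustration of the general pattern, not the daemon's implementation: the TelemetryBuffer class, its bounded size, and the fake sender are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class TelemetryBuffer:
    """Hold events locally and deliver them when the sender succeeds."""

    def __init__(self, send: Callable[[Dict], None], max_events: int = 10_000):
        self._send = send
        # Bounded queue (an assumption): oldest events drop if the buffer fills.
        self._pending: Deque[Dict] = deque(maxlen=max_events)

    def record(self, event: Dict) -> None:
        """Queue an event locally; no network call happens on this path."""
        self._pending.append(event)

    def flush(self) -> int:
        """Deliver queued events in order; stop at the first failure."""
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except OSError:
                break  # endpoint unreachable: keep the event for the next flush
            self._pending.popleft()
            delivered += 1
        return delivered


# Simulated endpoint that is down at first, then recovers.
state = {"up": False}
received: List[Dict] = []


def fake_send(event: Dict) -> None:
    if not state["up"]:
        raise ConnectionError("API unreachable")
    received.append(event)


buf = TelemetryBuffer(fake_send)
buf.record({"job": 1, "cpu": 0.4})
buf.record({"job": 1, "cpu": 0.9})
first = buf.flush()    # endpoint down: nothing delivered, events stay buffered
state["up"] = True
second = buf.flush()   # endpoint back: both events delivered in order
print(first, second)   # 0 2
```

Because record only appends to an in-memory queue, the observed workload never waits on the network; delivery is retried separately.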

Compute coverage

Expanse captures workloads automatically where the scheduler can be observed, and wraps commands with expanse run -- where the runtime is started directly. Automatic capture covers:
  • SLURM: captures sbatch jobs from the login node, including user, account, partition, requested CPU, memory, GPU, walltime, job state, runtime, exit code, and scheduler diagnostics.
  • Kubernetes: captures pods in the installed namespace, including owner references for Volcano, Kueue, Argo Workflows, Flux, Ray, and Flyte, plus requested resources, limits, node placement, status, runtime, logs, and events.
  • Databricks: captures Databricks Jobs and Mosaic AI Training runs from the job driver, including workspace, job, run, cluster, user, requested resources, status, runtime, and failure context.
  • YARN: captures Apache Hadoop YARN applications and managed YARN distributions on AWS EMR, Google Dataproc, Azure HDInsight, and Cloudera CDP, including queue, user, application, container, memory, vcore, runtime, and exit state.
  • Cloud batch and training: captures AWS Batch, AWS SageMaker Training Jobs, Azure Batch, Azure ML Jobs, Google Batch, and Vertex AI Custom Training, including job definitions, queues or pools, machine shapes, accelerators, status transitions, logs, and resource metrics.

Supported schedulers and managed services

Expanse captures workloads from:
  • Cluster schedulers: SLURM, Slinky, Apache Hadoop YARN, Kubernetes, Volcano, Kueue, Argo Workflows, Flux, Ray, and Flyte.
  • Databricks: Databricks Jobs and Mosaic AI Training.
  • Managed YARN platforms: AWS EMR, Google Dataproc, Azure HDInsight, and Cloudera CDP.
  • Cloud batch and training services: AWS Batch, AWS SageMaker Training Jobs, Azure Batch, Azure ML Jobs, Google Batch, and Vertex AI Custom Training.
  • User-launched tasks: SkyPilot tasks and standalone VMs wrapped with expanse run --.
See Computes for how these environments map to compute types. If you already run Prometheus, Grafana, OpenTelemetry, or another telemetry pipeline, reach out and we’ll scope the integration.
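For the user-launched case, where a command is wrapped with expanse run --, a launcher script might build the wrapped command as in the sketch below. Only the expanse run -- prefix comes from this page; the helper name, train.py, and its arguments are hypothetical, and no other flags of the expanse CLI are assumed.

```python
import shlex
from typing import List


def wrap_with_expanse(command: List[str]) -> List[str]:
    """Prefix a command with `expanse run --` (prefix taken from this page)."""
    if not command:
        raise ValueError("command must not be empty")
    return ["expanse", "run", "--", *command]


# Hypothetical training command, used only to show the wrapping.
argv = wrap_with_expanse(["python", "train.py", "--epochs", "10"])
print(shlex.join(argv))  # expanse run -- python train.py --epochs 10

# To launch it, pass argv to subprocess.run(argv, check=True),
# assuming the expanse binary is installed and on PATH.
```

Building the command as a list rather than a string avoids shell-quoting problems when the wrapped command carries its own arguments.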