Telemetry

What gets captured

For every workload, Expanse captures three views:

Before: what was requested from the scheduler, including the submitting user, the queue, and the requested resources.

During: live metrics streamed throughout the run, including CPU and memory usage, plus GPU utilisation, memory, power, and clocks.

After: actual runtime, peak memory, actual utilisation, exit status, and error context for failures.

Together, those views feed the Console’s waste dashboards and the intelligence layer.
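As a rough illustration of how the three views could fit together, the sketch below models one workload record and derives a simple memory-waste figure from it. Every class, field, and function name here is hypothetical and chosen for the example; it is not Expanse's actual schema or the formula the Console uses.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Before:
    """What was requested at submission time (hypothetical field names)."""
    user: str
    queue: str
    requested_cpus: int
    requested_memory_gb: float


@dataclass
class Sample:
    """One live metrics sample taken during the run."""
    cpu_util: float   # fraction of requested CPU in use, 0.0 to 1.0
    memory_gb: float  # memory in use at sample time


@dataclass
class After:
    """What actually happened once the workload finished."""
    runtime_s: float
    peak_memory_gb: float
    exit_status: int
    error_context: Optional[str] = None


@dataclass
class WorkloadRecord:
    before: Before
    after: After
    during: List[Sample] = field(default_factory=list)


def memory_waste_fraction(record: WorkloadRecord) -> float:
    """Share of requested memory that was never used, even at peak."""
    requested = record.before.requested_memory_gb
    if requested <= 0:
        return 0.0
    unused = max(requested - record.after.peak_memory_gb, 0.0)
    return unused / requested


record = WorkloadRecord(
    before=Before(user="alice", queue="gpu", requested_cpus=8, requested_memory_gb=64.0),
    after=After(runtime_s=1800.0, peak_memory_gb=16.0, exit_status=0),
    during=[Sample(cpu_util=0.4, memory_gb=12.0), Sample(cpu_util=0.5, memory_gb=16.0)],
)
print(memory_waste_fraction(record))  # 0.75
```

In this example the job requested 64 GB and peaked at 16 GB, so three quarters of the request went unused.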

Low overhead, never blocking. Baseline capture observes the scheduler from the outside; optional GPU profiling runs in short, bounded windows so the amortised cost stays small. Telemetry never blocks workload execution: if the daemon or the data plane is unreachable, jobs run normally and telemetry is buffered locally until it can be delivered.
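The buffering behaviour described above can be sketched as a small local queue that keeps events until delivery succeeds. This is a minimal illustration under stated assumptions, not Expanse's implementation: `BufferedSender`, the `send` callable, and the use of `ConnectionError` to signal an unreachable data plane are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict


class BufferedSender:
    """Hold telemetry locally and deliver it once the data plane is reachable."""

    def __init__(self, send: Callable[[Dict], None]) -> None:
        self._send = send  # hypothetical delivery function; raises ConnectionError when offline
        self._buffer: Deque[Dict] = deque()

    def record(self, event: Dict) -> None:
        """Queue the event and attempt delivery; delivery failures are not raised to the caller."""
        self._buffer.append(event)
        self.flush()

    def flush(self) -> int:
        """Deliver buffered events in order, stopping at the first failure."""
        delivered = 0
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                break  # keep the event buffered and retry on a later flush
            self._buffer.popleft()
            delivered += 1
        return delivered

    def pending(self) -> int:
        return len(self._buffer)


# Simulated data plane that starts out unreachable.
state = {"online": False}
received = []


def send(event: Dict) -> None:
    if not state["online"]:
        raise ConnectionError("data plane unreachable")
    received.append(event)


sender = BufferedSender(send)
sender.record({"job": 1, "state": "RUNNING"})
sender.record({"job": 1, "state": "COMPLETED"})
pending_while_offline = sender.pending()  # both events are held locally

state["online"] = True
delivered_after_reconnect = sender.flush()  # both events are delivered in order
```

The caller of `record` never sees a delivery error, which mirrors the idea that a telemetry outage should not affect the job itself.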

Compute coverage

Expanse captures workloads automatically where the scheduler can be observed:

SLURM: captures every job (sbatch, srun, salloc) via scheduler prolog/epilog hooks, including user, account, partition, requested CPU, memory, GPU, walltime, job state, runtime, exit code, and scheduler diagnostics.

Kubernetes: captures pods across the cluster’s non-system namespaces (or a configured namespace list for scoped installs), including owner references for Volcano, Kueue, Argo Workflows, Flux, Ray, and Flyte, plus requested resources, limits, node placement, status, runtime, and log tails.

Nomad: captures allocations placed on the cluster, including job, group, task, namespace, and node context, requested CPU, memory, and device resources, allocation and task state, runtime, exit status, and failure context.
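For the Kubernetes case, the namespace rule (all non-system namespaces by default, or only a configured list for scoped installs) can be sketched as a single predicate. The function name, its parameters, and the set of namespaces treated as "system" are assumptions made for illustration; the set below lists the namespaces a stock Kubernetes cluster creates for its own use, and Expanse's actual list may differ.

```python
from typing import Iterable, Optional, Set

# Assumption: namespaces treated as "system". These are the ones Kubernetes
# itself creates; the real exclusion list used by Expanse may be different.
SYSTEM_NAMESPACES: Set[str] = {"kube-system", "kube-public", "kube-node-lease"}


def should_capture(namespace: str, scoped: Optional[Iterable[str]] = None) -> bool:
    """Decide whether pods in a namespace are captured.

    scoped=None models a cluster-wide install (everything except system
    namespaces); otherwise only the listed namespaces are captured.
    """
    if scoped is not None:
        return namespace in set(scoped)
    return namespace not in SYSTEM_NAMESPACES


cluster_wide = [ns for ns in ["kube-system", "ml-training", "default"] if should_capture(ns)]
scoped_install = [
    ns for ns in ["kube-system", "ml-training", "default"]
    if should_capture(ns, scoped=["ml-training"])
]
print(cluster_wide)    # ['ml-training', 'default']
print(scoped_install)  # ['ml-training']
```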

See Computes for how these environments map to compute types. If you already run Prometheus, Grafana, OpenTelemetry, or another telemetry pipeline, reach out and we’ll scope the integration.
