Launch HN: Expanse (YC P26) - Unlock Wasted GPU Capacity | EveryCorner

Expanse is a resource-prediction and observability tool for HPC and GPU clusters. It targets high-compute environments that run SLURM or Kubernetes, such as research institutions, AI labs, quantitative trading firms, and manufacturers. The team notes that effective data center utilization is often only around 30% to 40%, and that one cause is users over-requesting resources. Requesting too much wastes money and queue capacity, but requesting too little can cause a job to fail midway when it runs out of memory or time, costing days of work.
Expanse's approach is to analyze the source code, job submission script, hardware topology, and real-time telemetry before a job is submitted. From these it predicts how much GPU VRAM, GPU utilization, memory, CPU, and walltime the job will actually need, and it reports confidence intervals, p90 recommendations, and failure-risk warnings. While jobs run, it also collects DCGM, CUPTI, cgroups, network, and I/O monitoring data and presents a dashboard of hardware metrics and program stack profiles. If a job fails, the system correlates telemetry with profiling data to produce concise, solution-oriented diagnostics and line-level code suggestions.
The team emphasizes that the core model is not a general-purpose LLM but a custom architecture designed for source code, submission scripts, and hardware data. It is fine-tuned for each cluster, because the same program may perform differently across topologies and hardware. The team also claims that on real-workload data from EPCC the model outperforms other baselines by 34% and substantially outperforms general-purpose frontier LLM baselines on the same task. These numbers come from the team's own account, so readers should treat them as product launch information rather than independently verified results.
Overall, Expanse addresses a very practical infrastructure problem in an era of expensive compute: rather than buying more GPUs, it brings existing cluster scheduling and resource requests closer to actual needs. The product is currently positioned at the prediction/intelligence layer; it does not directly sell or schedule idle capacity, but instead provides more accurate resource recommendations to schedulers and users.
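
To make the idea of a "p90 recommendation" and a failure-risk warning concrete, here is a minimal Python sketch. It is not Expanse's model, which the team describes as a custom architecture working from source code and telemetry. It only shows a much simpler heuristic: take the 90th percentile of peak memory from past runs of a job, add some headroom, and compare that with what the user requested. The function names, the 10% headroom, and the sample numbers are all assumptions made for illustration.

```python
import math


def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    xs = sorted(values)
    rank = (len(xs) - 1) * q / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    # Interpolate between the two nearest observations.
    return xs[lo] + (xs[hi] - xs[lo]) * (rank - lo)


def recommend_memory(peaks_gib, requested_gib, q=90, headroom=1.1):
    """Suggest a memory request from historical peak usage (hypothetical helper).

    peaks_gib:     peak memory of earlier runs of the same job, in GiB
    requested_gib: what the user currently asks the scheduler for
    headroom:      safety multiplier on top of the percentile (assumed 10%)
    """
    p = percentile(peaks_gib, q)
    recommended = p * headroom
    # Share of past runs that would have exceeded the recommendation,
    # used here as a crude stand-in for a failure-risk estimate.
    exceeded = sum(1 for x in peaks_gib if x > recommended) / len(peaks_gib)
    return {
        "percentile_gib": p,
        "recommended_gib": recommended,
        "over_request_gib": max(0.0, requested_gib - recommended),
        "exceed_fraction": exceeded,
    }


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from any real cluster.
    history = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110]
    print(recommend_memory(history, requested_gib=256))
```

A percentile over past runs can only help with jobs that have already executed on the cluster. The summary above describes Expanse as predicting requirements before submission from code and hardware topology, which is a harder problem than this sketch addresses.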