>_TheQuery

Idle GPUs Are Costing the AI Industry Billions

By Addy · March 24, 2026

The industry is spending $650 billion on AI data center infrastructure this year. New GPU clusters. New cooling systems. New power contracts. The assumption underneath all of it is that more hardware is the answer to the AI performance problem.

Zain Asgar has a different question.

What if the hardware you already have is sitting idle 70-85% of the time?

That is not a theoretical concern. McKinsey's research puts current data center hardware utilization at 15-30%. The chips are there. The power is on. The cooling is running. The hardware is just waiting. "Another way to think about this," Asgar told TechCrunch, "you're wasting hundreds of billions of dollars because you're just leaving idle resources."

Asgar is the co-founder and CEO of Gimlet Labs, which raised an $80 million Series A led by Menlo Ventures on March 23. Total raised: $92 million. Customers include one of the top three frontier AI labs and one of the top three hyperscalers. Eight-figure revenues five months after emerging from stealth.

The product: what the company describes as the world's first multi-silicon inference cloud. And the question it raises is bigger than the funding round.


The Utilization Problem Nobody Is Solving

Every major AI lab and cloud provider runs homogeneous hardware fleets. Racks of H100s. Clusters of A100s. The same chip type, wall to wall, doing all the work: model inference, decoding, orchestration, tool calls, retrieval.

The problem is that these tasks have fundamentally different hardware requirements.

The prefill phase of inference, the heavy pass over the input prompt, is compute-bound. GPUs are the right tool. Decoding, the step-by-step generation of each output token, is memory-bandwidth-bound. SRAM-centric accelerators like those from Groq and Cerebras handle it dramatically faster than GPUs. Tool calls and orchestration, the glue between model calls, are network-bound. CPUs, which are dramatically cheaper than GPUs, handle these efficiently.

When you run all four on the same GPU fleet, you are using the most expensive hardware for the tasks that need it least. The GPU that costs $30,000 is handling tool calls that a $300 CPU could execute just as well. The SRAM-centric accelerator that could generate tokens 10x faster is not in the fleet at all because the infrastructure was not built to support multiple chip types.
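The logic above reduces to a lookup: classify each stage by its bottleneck, then pick the cheapest chip class that serves that bottleneck well. The sketch below is illustrative only; the stage names and the mappings are assumptions for exposition, not Gimlet Labs' actual scheduler.

```python
# Illustrative only: route pipeline stages to chip classes by their
# dominant bottleneck. Stage names and mappings are assumptions, not
# Gimlet Labs' actual taxonomy.

# Each stage is characterized by what it is bound on.
STAGE_BOTTLENECK = {
    "prefill": "compute",        # heavy matrix multiplies over the prompt
    "decode": "memory",          # one token at a time, bandwidth-limited
    "tool_call": "network",      # waiting on external APIs
    "orchestration": "network",  # glue logic between model calls
}

# The cheapest chip class that serves each bottleneck well.
BOTTLENECK_TO_CHIP = {
    "compute": "gpu",
    "memory": "sram_accelerator",
    "network": "cpu",
}

def route(stage: str) -> str:
    """Pick a chip class for a pipeline stage by its bottleneck."""
    return BOTTLENECK_TO_CHIP[STAGE_BOTTLENECK[stage]]
```

Run on a homogeneous GPU fleet, every one of these stages lands on "gpu" by default; the table makes visible how much of the work a cheaper chip could absorb.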

The result is an industry running AI workloads at 15-30% hardware utilization while spending $650 billion on more of the same underutilized hardware. The problem is not compute scarcity. It is a software layer that cannot route the right workload to the right chip.


What Gimlet Labs Actually Built

Gimlet Cloud is orchestration software that slices agentic workloads into their constituent stages and routes each to the optimal hardware available, without requiring developers to rewrite their existing PyTorch or Hugging Face pipelines.

The routing logic follows the hardware requirements of each stage. Compute-bound inference steps go to GPUs. Memory-bound decoding steps go to SRAM-centric accelerators. Orchestration and tool calls go to CPUs. When older GPU generations would otherwise sit idle, Gimlet deploys them for tasks that match their profile rather than retiring them.

The company can even slice a single model across different chip architectures, running different layers of the same model on different hardware simultaneously, matching each layer to the chip where it performs best.
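Per-layer placement can be framed as a simple optimization: benchmark each layer on each available chip class, then assign the layer to its fastest home. The numbers and layer names below are invented for illustration; Gimlet has not published its placement algorithm.

```python
# Illustrative: assign each model layer to whichever chip class ran it
# fastest in a hypothetical per-layer benchmark. All numbers are invented.
layer_latency_ms = {
    # layer -> {chip class: measured latency in milliseconds}
    "embedding": {"gpu": 0.4, "sram": 0.6, "cpu": 2.0},
    "attention": {"gpu": 1.2, "sram": 0.5, "cpu": 9.0},
    "mlp":       {"gpu": 0.8, "sram": 0.9, "cpu": 7.0},
}

# Greedy placement: each layer goes to its lowest-latency chip.
placement = {layer: min(chips, key=chips.get)
             for layer, chips in layer_latency_ms.items()}
```

A real system would also have to weigh the cost of moving activations between chips at each layer boundary, which a greedy per-layer pick ignores.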

The performance claim: 3-10x faster inference for the same cost and power envelope. Menlo Ventures' Tim Tully, who led the investment, summarizes the thesis: "The multi-silicon fleet is ready: it's just missing the software layer to make it work."

The customer traction suggests the claim is real. Tripling the customer base in five months and adding a top-three frontier lab as a customer is not something you achieve on benchmark performance alone.


The Consumer Hardware Question

Here is where it gets interesting beyond the data center story.

Gimlet Labs currently operates managed multi-silicon data centers and offers its software for enterprise deployment into private data centers. The product today is enterprise infrastructure.

But the architectural premise, routing each workload stage to the optimal hardware available, does not stop at the data center boundary.

A MacBook M4 has a CPU, a GPU, and a Neural Engine. Each handles different workloads at different efficiency profiles. An RTX 4070 gaming laptop has similar heterogeneity. A developer's local machine is already a multi-silicon system. What it lacks is the orchestration layer that knows how to route inference workloads across those chips optimally, and across the boundary between local hardware and cloud hardware when local compute is insufficient.

This is not what Gimlet Labs is building today. But the architectural direction points toward it: hybrid edge/cloud workload partitioning, SLA-aware dynamic scheduling of agentic workloads, a universal AI compiler for heterogeneous hardware.

If the orchestration software that today routes workloads across a data center's GPU fleet can be generalized to route workloads across a developer's laptop and a cloud endpoint, routing locally when latency and privacy matter and escalating to cloud when the task exceeds local capability, the economics of running AI change fundamentally.

The data center version of this problem is about utilization efficiency. The consumer hardware version is about accessibility: making capable AI inference available on hardware that already exists everywhere, at a cost that does not require a cloud API subscription for every token.


Why This Connects to the VC Subsidy Problem

We wrote on March 15 that current AI pricing is underwritten by venture capital: that the $14 billion OpenAI burns annually and the 15-30% hardware utilization across the industry are two sides of the same inefficiency. You cannot sustain $150,000/month API costs at enterprise scale indefinitely. The bill arrives eventually.

Gimlet Labs is a direct response to the hardware side of that equation. If inference workloads can be routed to the right chip rather than defaulting to the most expensive chip available, the cost per inference call drops. The frontier lab that added Gimlet as a customer presumably did so because the efficiency gains are real and measurable.

The $650 billion in data center CapEx projected for 2026 assumes the current utilization problem cannot be solved in software. If Gimlet's approach works at scale, some meaningful fraction of that CapEx becomes unnecessary. You do not need more H100s if the H100s you have are running at 80% utilization instead of 20%.

That is a significant claim. The funding round and the customer list suggest the claim is credible. Whether the approach scales to consumer hardware is the more interesting long-term question.


The Honest Uncertainty

Routing workloads across heterogeneous hardware introduces latency at every handoff. When a model call transitions from GPU inference to SRAM decoding to CPU orchestration, data moves between chips, and that movement costs time. At data center scale, with high-bandwidth interconnects, the overhead may be manageable. At consumer scale, where routing happens across a laptop's internal chips or across the network boundary between a local device and a cloud endpoint, the handoff latency may consume the efficiency gains.
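The uncertainty has a simple shape: a handoff only pays off when the time saved on the better chip exceeds the time spent moving data to it. The numbers below are invented to show how the same speedup can win in a data center and lose on a laptop.

```python
# Illustrative break-even arithmetic for a cross-chip handoff.
# All numbers are invented for exposition.

def handoff_worth_it(t_gpu_ms: float, speedup: float, transfer_ms: float) -> bool:
    """Is moving a stage to a faster chip a net win after transfer cost?"""
    t_accel_ms = t_gpu_ms / speedup          # time on the faster chip
    return t_accel_ms + transfer_ms < t_gpu_ms

# Data-center case: long-running stage, fast interconnect, small transfer cost.
assert handoff_worth_it(t_gpu_ms=100, speedup=5, transfer_ms=5)

# Consumer case: same 5x speedup, but the transfer dominates a short stage.
assert not handoff_worth_it(t_gpu_ms=10, speedup=5, transfer_ms=20)
```

This is why the workload distribution in the next paragraph matters: the shorter each stage, the more the transfer term dominates.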

Gimlet Labs has not published detailed latency benchmarks for the handoff cost specifically. The 3-10x performance claim is an aggregate. The workload distribution matters enormously: a pipeline that is 80% decoding benefits differently from a pipeline that is 80% orchestration.

The technology is real. The enterprise traction is real. The consumer hardware question is open.


Sources:

Previously on TheQuery: The VC Subsidy Behind Cheap AI Will Not Last - the cost problem this infrastructure is trying to solve.