When inference becomes a scarce resource, who captures its value?

Core Viewpoint

2026-06-08 23:35:55

The company that ultimately wins will not be the one with the most GPUs, but the one that can tell you which GPUs are available where and at what price, and route each workload to where it can run at the lowest cost.

Author: Frank Fu, IOSG

The "hole" that David Cahn identified in 2023 was never filled on the training side. It is being filled on the inference side, and the market has only begun to price that in over the past few weeks. By the time Nvidia reorganized its financial reporting around "service tokens" and Cerebras went public 20x oversubscribed, the bottleneck debate was effectively over. The real question became: when inference is the scarce resource, where in the computing stack does value concentrate?

1. Following the GPU: From the $200 Billion Question to the $600 Billion Question

In 2023, David Cahn of Sequoia posed the question looming over all of AI infrastructure, known as the "$200 Billion Question." For every dollar spent on GPUs, roughly another dollar must be spent on powering them in data centers, so each year's GPU CapEx implies that these chips must ultimately generate about $200 billion in revenue to recoup the capital. Even under very generous assumptions about AI revenue, he still found a gap of more than $125 billion between "investment" and "actual payments from end customers." The concern is straightforward: GPUs are being built out ahead of actual demand.

A year later, the gap had not narrowed; it had widened. In his 2024 follow-up, Cahn recast it as the "$600 Billion Question" as hyperscaler CapEx ballooned. The bear case settled into a familiar shape: overbuilding leads to oversupply, and oversupply burns capital.

Both articles are essentially asking the same question: who will fill this hole? The answer has never appeared on the "training" side of the ledger. It appears on the inference side, and the market has only begun to account for it in pricing over the past few weeks.

2. The Cerebras IPO and the Inference Squeeze

Cerebras went public on Thursday. The IPO was 20x oversubscribed, with pricing nearly double the final increase on Wednesday. The demand did not stem from a bet on the "next Nvidia killer," but from a simpler realization: the market is beginning to understand that the real bottleneck in AI is inference, not training.

Cerebras's specialty is a chip architecture that makes inference extremely fast. Not training: inference. That is what excites Wall Street. The inference market is recurring and grows with usage. Every time Claude answers a question or an agent performs a task, computing power is consumed. Training happens once; inference never stops.

J.P. Morgan estimates the inference market to be 10 to 50 times the size of the training market. Once machines start executing tasks assigned by other machines (the agentic expansion), inference demand no longer scales with the number of users; it scales with computing power itself.

3. Nvidia Redrawing the Landscape: Inference Takes Center Stage

If Cerebras represents the awakening of the market, then Nvidia's latest quarterly report is a confirmation from the top of the supply chain. In the latest earnings call, Jensen Huang made the unspoken truth clear: AI demand is growing parabolically. The reason is simple: agentic AI has arrived. Mainstream AI has transitioned from one-time inference to logical reasoning, and now to the agent phase where it can call tools and orchestrate tasks on its own. Huang stated, "Tokens are now profitable." In the AI era, computing power equates to revenue and profit.

This reshapes the entire industry. Training is a one-time cost to build a model, while inference is the recurring cost to run it, and the current bottleneck lies in inference, not training.

Nvidia has incorporated this judgment into its financial reporting. It now discloses its results across two platforms instead of one: Data Center and Edge Computing. The Data Center (approximately $75 billion this quarter, +92% year-over-year) is further broken down into Hyperscale (approximately $38 billion, +12% quarter-over-quarter) and ACIE, which includes AI cloud, industrial, and enterprise (approximately $37 billion, +31% quarter-over-quarter). A new line is Edge Computing: $6.4 billion, +29% year-over-year, covering the endpoints where agentic AI and physical AI truly operate, such as PCs, workstations, AI-RAN base stations, robots, and cars.

Edge currently accounts for less than 8% of total revenue, but Nvidia has elevated it to a "second platform" alongside Data Center. This signals that inference is splitting into two fronts: cloud inference in data centers and endpoint inference on the edge, where AI needs to see, move, and act in the physical world. The roadmap follows the same logic: Vera Rubin, which will start shipping in the third quarter, can achieve inference throughput up to 35 times that of Blackwell; Huang also provided a new total addressable market (TAM) of $200 billion for the Vera CPU designed for agentic workloads. Every leading model company is expected to fully pivot to it on day one.
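The segment figures above can be cross-checked with a few lines of arithmetic. This is a minimal sketch that uses only the approximate numbers quoted in this section and assumes total revenue is simply Data Center plus Edge; the variable names are illustrative.

```python
# Approximate segment figures quoted in this article, in billions of USD.
hyperscale = 38.0
acie = 37.0
edge = 6.4

# Data Center is reported as the sum of its two sub-segments.
data_center = hyperscale + acie

# Assumption: total revenue is Data Center plus Edge, with no other segments.
total = data_center + edge
edge_share = edge / total

print(f"Data Center: ${data_center:.0f}B")
print(f"Edge share of total revenue: {edge_share:.1%}")
```

Under that assumption the Edge share comes out just under 8%, which matches the figure cited above.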

As the highest-valued company on Earth reorganizes its financial disclosures around "service tokens," the bottleneck debate has already been settled. The remainder of this article discusses who captures value when inference (rather than training) becomes a scarce resource.

First, a scope clarification. This article discusses cloud inference: rented data center GPUs serving tokens through APIs. Endpoint inference runs on local chips inside the devices themselves (Nvidia's Jetson, RTX, Drive, AI-RAN) and completely bypasses the GPU leasing and aggregation stack discussed below. Treat it here as a tailwind that amplifies the entire inference economy and supports the bottleneck argument, not as the market in which Hyperbolic and Venice operate; both sit entirely on the cloud side.

4. The Squeeze Has Arrived

Anthropic is the canary in the coal mine. Usage far exceeds provisioned capacity, and complaints that Claude has been "lobotomized" are flooding the internet: throttled responses, slower inference, and compressed context windows. The fix is bluntly a matter of computing power: in May 2026, Anthropic took over the entire Colossus 1 data center from SpaceX, with over 220,000 Nvidia GPUs and 300+ megawatts, and dedicated it to inference rather than training.

That capacity unlocked a series of quota changes, each a signal in its own right. On May 6, Anthropic doubled the five-hour limit for Claude Code, lifted peak-hour throttling, and significantly raised the API rate limits for Opus. On May 13, it raised the weekly limit for Claude Code by another 50% (until July 13). Then, starting June 15, it did the opposite of "generous": it moved agentic and programmatic usage (Agent SDK, headless mode claude -p, CI pipelines) out of the flat subscription and into an independently metered credit pool (ranging from $20 to $200 per month, billed at API rates). This final step condenses the entire argument into one action: agents consume inference far faster than flat subscriptions were designed for, so that usage has to be priced as what it is, a recurring cost.
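To see why a flat subscription breaks under agentic load, consider a rough sketch of the provider's margin on a single flat-rate user. The token volumes and the blended rate of $10 per million tokens are hypothetical numbers chosen only for illustration, and the $200 flat price is an assumed top-tier plan; none of these are Anthropic's actual figures.

```python
def monthly_inference_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Cost of serving a user's tokens at a metered, API-style rate."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def flat_plan_margin(flat_price_usd: float, tokens_per_month: float,
                     usd_per_million_tokens: float) -> float:
    """Provider margin on one flat-rate subscriber (negative means a loss)."""
    return flat_price_usd - monthly_inference_cost(tokens_per_month, usd_per_million_tokens)


FLAT_PRICE = 200.0  # assumed flat monthly price, illustrative
RATE = 10.0         # hypothetical blended USD per million tokens

# Hypothetical usage levels: an interactive user vs. an always-on agent.
chat_margin = flat_plan_margin(FLAT_PRICE, 5_000_000, RATE)
agent_margin = flat_plan_margin(FLAT_PRICE, 500_000_000, RATE)

print(f"Interactive user margin: ${chat_margin:,.0f}")
print(f"Agentic user margin:     ${agent_margin:,.0f}")
```

With these made-up inputs the interactive user is profitable while the agentic user is deeply loss-making, which is the shape of the problem a separately metered credit pool addresses.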

Training is a one-time capital expenditure. Inference is a recurring operational cost, compounding with each new user and each new agent.

5. The Stack: Six Layers, One Bottleneck

Every AI application sits on a supply chain that starts at TSMC's wafer fabs and ends at the API endpoint.

Most companies only own one layer of this stack. Nvidia owns silicon, CoreWeave owns bare metal, Together AI owns inference optimization, and OpenRouter owns model API routing.

There is only one exception.

6. Hyperbolic: The Only Company Spanning Three Layers

Hyperbolic launched its on-demand GPU marketplace in June 2025. Within the first few months, its developer count surpassed 200,000, spanning cutting-edge AI labs, search companies, and large consumer platforms.

What is interesting is its architecture.

Hyperbolic does not own a single GPU. Every card comes from neoclouds and data centers, including CoreWeave, Lambda Labs, Nebius, and smaller operators with idle capacity. This sounds like a weakness, but it is actually a moat.

By sitting between GPU suppliers and consumers, Hyperbolic can see real-time data that others cannot. It knows who is buying what GPU at what price and when. It sees oversupply before it becomes public and detects demand surges before they hit the market.

Today, the moat itself is this multi-cloud aggregation. Hyperbolic stitches together fragmented capacity from dozens of independent clouds and data centers into a standardized unified pool, allowing developers to rent the cheapest available GPUs anywhere without negotiating with each operator or managing a bunch of accounts. The more clouds it connects to, the deeper the liquidity, and the richer the pricing data. Furthermore, the team is exploring how to model GPU price curves with this data and ultimately invest its own capital to smooth supply and demand, acting as a market maker for physical computing power; but this goal is still in its early stages, and what is truly compounding now is the aggregation layer.
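The routing idea can be illustrated with a toy example. This is a minimal sketch, not Hyperbolic's actual system: the `Offer` type, the `cheapest_offer` function, the provider labels, and all prices are hypothetical, and a real router would also weigh latency, reliability, region, and contract terms.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Offer:
    provider: str      # placeholder label, not a real quote
    gpu_model: str
    hourly_usd: float  # price per GPU-hour
    gpus_free: int     # GPUs currently available


def cheapest_offer(offers: Iterable[Offer], gpu_model: str,
                   gpus_needed: int) -> Optional[Offer]:
    """Return the lowest-priced offer that can satisfy the request, or None."""
    candidates = [
        o for o in offers
        if o.gpu_model == gpu_model and o.gpus_free >= gpus_needed
    ]
    return min(candidates, key=lambda o: o.hourly_usd, default=None)


# Made-up pool for illustration only.
pool = [
    Offer("cloud-a", "H100", 2.40, 64),
    Offer("cloud-b", "H100", 1.95, 8),
    Offer("cloud-c", "H100", 2.10, 32),
    Offer("cloud-d", "A100", 1.10, 128),
]

best = cheapest_offer(pool, "H100", 16)
print(best.provider if best else "no capacity")
```

Here the cheapest H100 listing cannot cover a 16-GPU request, so the request goes to the next-cheapest provider with enough free capacity. Each additional connected cloud widens the candidate set, which is the liquidity effect described above.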

This is the flywheel:

Connect more clouds → More aggregated supply
More supply → Deeper markets and real-time pricing data
Better data → Smarter routing now, and long-term pricing models
Better liquidity and prices → More developers → More clouds want to connect

No other company is attempting this. Hyperbolic is the only company that spans the GPU leasing layer, deployment layer, and model API layer.

7. Venice as a Mirror

Venice is the clearest manifestation of the inference economy at the application layer and serves as a useful contrast to Hyperbolic's position. It is a privacy-first inference application: a set of OpenAI-compatible APIs, along with consumer subscriptions (Free / Pro / Pro+ / Max), routing requests to about 75 models, two-thirds of which are open-source or self-hosted models (Llama, Mistral, Qwen, DeepSeek), while the rest are anonymous passthroughs of closed-source cutting-edge models. The key is that Venice does not own meaningful computing power. It rents from undisclosed GPU partners and confidential computing providers (NEAR AI Cloud, Phala) and pays cutting-edge labs for passthroughs, so its true cost of revenue is inference computing power, not SaaS hosting.

What Venice truly sells is privacy. The "privatization" here does not mean turning public computing power into private property; it means wrapping commoditized inference in a layer of guarantees: no data retention, no training on user data, request anonymization, and some workloads running in TEEs so that operators cannot see plaintext. The underlying computing power is commodity-grade, and the markup comes from this privacy wrapper. The guarantee is also tiered rather than uniform: for open-source models running on GPUs under its control or in TEEs, Venice can achieve near end-to-end confidential computing; for anonymous passthroughs of closed-source models such as Claude and GPT, privacy only strips identity, and the cutting-edge lab on the other end still processes your original prompt. The strongest privacy therefore covers only the open-source part, while the cutting-edge model part is "anonymous" rather than "truly confidential." Venice's gross profit = subscription price - the inference costs it pays its compute and model suppliers, and whatever it can charge above the bare API price rests almost entirely on this privacy premium, which is why it operates on thin margins and is constrained by cutting-edge passthrough pricing.

The token design packages this inference demand. Venice runs on two tokens: VVV (staking and platform access) and DIEM, an inference credit, with each DIEM roughly equivalent to $1 of computing power per day. Paid subscriptions trigger programmatic buybacks of VVV (approximately $2 / $5 / $10 for Pro / Pro+ / Max), and emissions decline on a fixed schedule: 6M → 5M → 4M VVV per month, stepping down to 3M on July 1. The buybacks are real but discretionary and still small: about $103,000 was burned in each of April and May, and June is slowly climbing toward about $110,000, far below the $200,000-per-month mark.
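The relationship between emissions and buybacks can be sketched as net monthly supply change. The emission schedule and the roughly $103,000 monthly buyback come from the figures above; the VVV price of $2.00 is a purely hypothetical input (the article does not state a price), and the function name is illustrative.

```python
def net_supply_change(emission_vvv: float, buyback_usd: float,
                      vvv_price_usd: float) -> float:
    """Tokens emitted in a month minus tokens bought back and burned."""
    if vvv_price_usd <= 0:
        raise ValueError("price must be positive")
    return emission_vvv - buyback_usd / vvv_price_usd


ASSUMED_PRICE_USD = 2.0   # hypothetical VVV price, for illustration only
MONTHLY_BUYBACK_USD = 103_000

# Monthly emission steps quoted in the article.
for emission in (6_000_000, 5_000_000, 4_000_000, 3_000_000):
    net = net_supply_change(emission, MONTHLY_BUYBACK_USD, ASSUMED_PRICE_USD)
    print(f"emission {emission:>9,} VVV -> net new supply {net:>11,.0f} VVV")
```

The result is sensitive to the assumed price: a lower token price means each buyback dollar retires more tokens. Plugging in an observed price gives a quick read on how far buybacks offset emissions at each step of the schedule.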

The fundamentals are healthier than the headlines. The publicly circulated figure of "$70 million ARR" is almost certainly the result of mistaking subscription renewals for net new customer acquisition; a defensible observable range is closer to $6 million to $15 million ARR. Beneath that headline number, the traction is real: about 136,000 unique wallet addresses, approximately 9.9 million website visits per month (about 330,000 per day), and new Pro subscriptions hovering around 1,400 per day. This is a real business, but a low-margin one, whose economics are constrained by the computing power it purchases.

This is precisely why Hyperbolic is positioned one layer upstream of it. If Venice is a gas station, Hyperbolic is the refinery. Venice buys computing power from the same constrained supply that everyone relies on; Hyperbolic aggregates and standardizes that fragmented supply and sells it to Venice and every player like it. As inference demand grows, value accrues not only to the applications that consume computing power but also to the layer that aggregates and routes it, capturing the cost of revenue those applications pay.

8. Why This Matters Now

Nvidia has reorganized its finances around "service tokens." Cerebras's IPO proves the market has understood that inference is the bottleneck. Anthropic's frantic search for capacity demonstrates this is a real issue. Agentic and physical AI will amplify demand by several orders of magnitude, spanning both cloud and edge lines.

This also closes the loop on the "$600 Billion Question" from the other side. Cahn's bearish logic, that overbuilding leads to oversupply, will likely be validated. But oversupply is precisely the best-case scenario for asset-light aggregators: when GPU prices fall and supply fragments across dozens of clouds, the player that owns no hardware and routes every workload to the cheapest available cards earns the spread, while operators holding depreciating GPUs bear the losses. Hyperbolic is betting on oversupply, not against it.

The company that ultimately prevails will not be the one with the most GPUs but the one that can tell you which GPUs are available where and at what price, routing every workload to the place where it can run at the lowest cost.

Hyperbolic is building such a company. It owns no GPUs, operates purely in software, spans three layers, and is becoming the ultimate aggregation layer for inference computing power.
