In a recent wide-ranging conversation, technology investor Gavin Baker revealed several critical insights about AI infrastructure that reshape how we should think about the competitive landscape, economics, and technical constraints of the industry. These insights go far beyond the usual narratives about "AI scaling" or "GPU shortages": they reveal the actual mechanics, vulnerabilities, and physics-based constraints determining who wins and loses in the AI race.
I. Google's Broadcom Arbitrage Vulnerability: The $15B Margin Trap
The relationship between Google and Broadcom for TPU (Tensor Processing Unit) development represents one of the most economically interesting partnerships in semiconductors—and potentially one of the most vulnerable to disruption.
The Economic Structure
According to Jon Peddie Research[1], Google's TPU development follows a bifurcated model: Google handles front-end chip design (the actual RTL and architecture), while Broadcom manages backend physical design, TSMC coordination, and critical SerDes (serializer-deserializer) interfaces. The Register reported[2] that Broadcom is effectively the second-largest AI chip company by revenue behind Nvidia, primarily due to this Google partnership.
The Margin Mathematics
Broadcom's semiconductor division operates at 50-55% gross margins. With estimated 2025 TPU-related revenue of around $30B, Broadcom captures approximately $15-16.5B in gross profit from the Google partnership annually. Meanwhile, industry analysis from FundaAI[3] indicates Broadcom's entire semiconductor division OPEX is approximately $5B.
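To make the size of that spread concrete, here is a back-of-the-envelope calculation using only the figures cited above (the revenue estimate and margin range are the analysis's estimates, not disclosed figures):

```python
# Back-of-the-envelope estimate of Broadcom's gross profit on the TPU program.
# All inputs are the estimates cited in the text, not disclosed figures.
tpu_revenue = 30e9                      # estimated 2025 TPU-related revenue ($)
gross_margin_low, gross_margin_high = 0.50, 0.55
semis_division_opex = 5e9               # estimated Broadcom semiconductor division OPEX ($)

gross_profit_low = tpu_revenue * gross_margin_low     # ~$15.0B
gross_profit_high = tpu_revenue * gross_margin_high   # ~$16.5B

print(f"Gross profit to Broadcom: ${gross_profit_low / 1e9:.1f}B-${gross_profit_high / 1e9:.1f}B")
print(f"Entire semiconductor division OPEX: ${semis_division_opex / 1e9:.1f}B")
```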
Why This Matters: Technical Constraints
The question naturally arises: why doesn't Google simply bring this in-house? The answer reveals important strategic dynamics. According to Uncover Alpha's analysis[4], "Broadcom no longer knows everything about the chip. At this point, Google is the front-end designer (the actual RTL of the design) while Broadcom is only the backend physical design partner."
Broadcom's value proposition centers on several key capabilities:
- SerDes IP Lock-in: Broadcom provides proprietary high-speed SerDes interfaces that enable chip-to-chip communication. While valuable, these are not irreplaceable—other providers like Rambus, Synopsys, and MediaTek exist.
- Backend Design Expertise: Managing the physical implementation, packaging, and TSMC coordination requires deep expertise.
- Established Relationships: Broadcom's existing partnerships with TSMC and other suppliers provide favorable terms.
The MediaTek Warning Shot
In a strategically significant move, Google partnered with MediaTek for TPUv7e development[5] in early 2025. MediaTek handles I/O module design, SerDes interfaces, and peripheral components at costs approximately 20-30% lower than Broadcom. This bifurcated supply strategy sends a clear message about Google's leverage and intentions.
The Performance Impact
This partnership structure has real performance consequences. Managing a bifurcated supply chain (Broadcom for TPUv7p, MediaTek for TPUv7e) forces Google to make more conservative design choices to ensure manufacturability across multiple partners. Meanwhile, Nvidia controls the entire stack—no such compromises needed.
The result: TPU development velocity is slowing just as Nvidia's GPU roadmap accelerates. As Baker noted in the transcript, Nvidia is moving to an annual cadence with Blackwell, GB300, and Rubin, while Google's TPU cycles remain at 12-18 months, with added complexity from supply chain coordination.
Strategic Implications
The economics suggest an inevitable evolution. At some point—likely when TPU volumes exceed $50B annually—the arbitrage becomes too large to ignore. Google could:
- Acquire critical SerDes IP or develop alternatives
- Hire Broadcom's team at premium compensation
- Build direct TSMC relationships
- Save billions while gaining complete architectural control
This transition would eliminate the conservative design compromises currently constraining TPU evolution and potentially narrow the performance gap with Nvidia's vertically integrated approach.
II. The GB300 Drop-In Compatibility Revolution: Cost Leadership Flips in Q2 2025
The GB300's drop-in compatibility with GB200 racks represents an unprecedented development in semiconductor product transitions. To understand why this matters, we must first examine what typically happens during major chip transitions.
Typical Semiconductor Transitions: The Infrastructure Challenge
SemiAnalysis's detailed report on Blackwell's challenges[6] highlights the complexity of the Hopper-to-Blackwell transition:
- Power Requirements: From ~30kW per rack to 130kW per rack (4.3x increase)
- Cooling Systems: Transition from air cooling to liquid cooling
- Rack Weight: From ~1,000 lbs to ~3,000 lbs, requiring reinforced flooring
- Infrastructure Redesign: New CDUs (coolant distribution units), power delivery systems, and thermal management
- Performance Ramp: 6-9 months for new generation to match previous generation's optimized performance
As Baker noted in the transcript: "Even once you have the Blackwells, it takes 6 to 9 months to get them performing at the level of Hopper because the Hopper is finely tuned. Everybody knows how to use it. The software is perfect for it."
GB300's Revolutionary Architecture
According to Tom's Hardware[7], the GB300 (also known as Blackwell Ultra) maintains the same:
- Power envelope as GB200 (already liquid cooled at 130kW)
- Rack form factor (already handles 3,000 lbs)
- Cooling infrastructure (CDUs already deployed)
- NVLink topology (already debugged and optimized)
The Memory Advantage
The GB300 increases HBM3E memory from 192GB (using 8-Hi stacks) to 288GB (using 12-Hi stacks)—a 50% increase in memory capacity. TrendForce reports[8] that all B300 series models will feature HBM3e 12-Hi configuration, with production beginning between Q4 2024 and Q1 2025.
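The capacity jump is straightforward stack arithmetic. A quick sketch, assuming 3GB (24Gb) DRAM dies and eight HBM stacks per GPU, which is consistent with the 192GB and 288GB figures above:

```python
# HBM3E capacity implied by the 8-Hi -> 12-Hi transition.
# Assumes 3GB (24Gb) dies and 8 stacks per GPU, consistent with the cited figures.
die_capacity_gb = 3
stacks_per_gpu = 8

capacity_8hi = die_capacity_gb * 8 * stacks_per_gpu    # 192 GB per GPU
capacity_12hi = die_capacity_gb * 12 * stacks_per_gpu  # 288 GB per GPU

print(capacity_8hi, capacity_12hi)                                    # 192 288
print(f"Capacity increase: {capacity_12hi / capacity_8hi - 1:.0%}")   # 50%
```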
The Strategic Inflection: Cost Leadership Transfers
This architectural decision creates a unique competitive dynamic. Companies deploying GB200 clusters in Q1 2025 receive automatic advantages when GB300 chips become available:
- Zero Infrastructure Capex: No new datacenters, power systems, or cooling required
- Instant Performance Upgrade: Swap chips, not entire racks
- No Debugging Period: Software stack already optimized for the architecture
- Immediate Cost Advantage: Better performance per watt without infrastructure spending
Why This Forces Google's Strategic Recalculation
Throughout 2024-2025, Google has been running AI services at an estimated negative 30% margin to, as Baker puts it, "suck the economic oxygen out of the ecosystem." This strategy only works if Google maintains its position as the lowest-cost token producer via TPU efficiency.
When xAI, OpenAI, and Anthropic deploy GB300 clusters in Q2-Q3 2025, they become lower-cost producers while Google's TPU v7/v8 cycles continue on their 12-18 month cadence. Running at negative margins while competitors have lower costs becomes unsustainable: Google must either raise prices (losing share) or keep bleeding cash without the strategic payoff.
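To see why negative-margin pricing stops working once a rival's costs fall below your price, consider a deliberately simplified illustration. Every number here is a hypothetical placeholder, not an estimate of any company's actual token economics:

```python
# Hypothetical illustration of the pricing squeeze; all numbers are made up.
google_cost = 1.00                   # Google's cost per million tokens (hypothetical, $)
google_price = google_cost * 0.70    # selling ~30% below cost, one reading of a -30% margin

rival_cost_before = 1.20             # rival's cost before a hardware cost advantage (hypothetical, $)
rival_cost_after = 0.60              # rival's cost after the upgrade (hypothetical, $)

# Before: matching Google's price means selling at a loss.
print(f"Rival margin at Google's price (before): {google_price - rival_cost_before:+.2f}")  # -0.50
# After: the rival can match or undercut Google's price profitably,
# while Google is still losing money at that price.
print(f"Rival margin at Google's price (after):  {google_price - rival_cost_after:+.2f}")   # +0.10
```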
Timeline and Implications
Recent reports from TrendForce[9] indicate that even Meta is exploring TPU deployment—but not until 2027, by which point the GB300/Rubin advantage may be insurmountable.
III. Chinese Checkpoint Dependency Crisis: Meta's Existential Problem
Understanding Checkpoints: The Compounding Advantage
In modern AI development, "checkpoints" are continuously saved model states during training. The critical dynamic: leading labs use their own latest checkpoint to train the next-generation model. This creates a compounding advantage—each generation starts ahead because you're bootstrapping from your best work.
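As a minimal sketch of what bootstrapping from your best checkpoint looks like in practice, assuming a PyTorch-style workflow (the function names and checkpoint layout are illustrative, not any lab's actual pipeline):

```python
import torch

def save_checkpoint(model, optimizer, step, path):
    # Persist the model state during training so it can seed future runs.
    torch.save({
        "step": step,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def bootstrap_next_generation(new_model, checkpoint_path):
    # Start the next-generation model from the previous generation's weights
    # instead of random initialization; strict=False tolerates architectural
    # additions that have no counterpart in the old checkpoint.
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    new_model.load_state_dict(ckpt["model_state"], strict=False)
    return new_model
```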
The Tier Structure
Tier 1 (Self-Sustaining): xAI, OpenAI, Anthropic, Google
- Possess internal frontier models
- Use Model_N to help train Model_N+1
- Models training models—the flywheel is spinning
- Each cycle, the gap versus competitors compounds
Tier 2 (Dependent): Meta, Amazon, Microsoft internal teams
- Cannot produce frontier models despite enormous capital investment
- Meta's 2025 prediction: "We'll have the best model" → didn't crack top 100
- Require external checkpoints to bootstrap training
The Chinese Bootstrap
Meta has been using Chinese open-source models as starting points. Labs like DeepSeek, Qwen, and others release open models that Meta uses as checkpoints, applying additional training to close the gap. On this view, Meta's recent open models are substantially derivative of Chinese open-source work with Meta's own training layered on top.
The Coming Crisis
China mandated domestic chip usage (Huawei ASICs) with the stance "we don't need Blackwell." However, in its V3.2 technical paper, DeepSeek explicitly stated: "One reason we struggle versus American frontier labs is insufficient compute."
This was their politically safe way of warning the Chinese government: forcing domestic chips might be a strategic mistake.
The Blackwell Scissors
As models trained on Blackwell clusters begin shipping in early 2025, a scissors effect takes hold:
- American Frontier Labs: Training on Blackwell clusters (vastly superior compute)
- Chinese Open Source: Training on inferior domestic chips (Huawei ASICs)
- Performance Gap Explodes: The divergence between frontier and Chinese open source accelerates
- Meta's Bootstrap Breaks: Chinese checkpoints fall further behind, no viable starting point
Why This Is Existential
Without competitive checkpoints, training a frontier model from scratch becomes dramatically harder:
- Training Duration: far longer training runs to reach comparable quality
- Resource Requirements: substantially more compute for the same result
- Final Performance: still lags frontier models despite the additional resources
- Flywheel Absent: no self-reinforcing improvement cycle to compound gains
The Reasoning Flywheel
Baker's insight about reasoning models creating a new data flywheel is critical here. When users interact with reasoning models, a feedback loop forms (see the toy sketch after this list):
- Good and bad answers become verified rewards
- This data feeds back into model improvement via RLHF
- Models get measurably better
- More users attracted
- More data generated
- Better models produced
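A toy simulation of that loop, with placeholder numbers and a deliberately crude update rule, shows how usage and model quality can compound:

```python
import random

def serve_users(model_quality, user_base):
    # Each interaction produces an answer that a verifier scores as a reward.
    return [{"reward": 1.0 if random.random() < model_quality else 0.0}
            for _ in range(user_base)]

def train_on_verified_rewards(model_quality, interactions):
    # Toy update rule: more verified data nudges quality upward, capped at 0.99.
    return min(0.99, model_quality + 0.00001 * len(interactions))

model_quality, user_base = 0.60, 10_000
for generation in range(5):
    interactions = serve_users(model_quality, user_base)
    model_quality = train_on_verified_rewards(model_quality, interactions)
    user_base = int(user_base * (1 + model_quality))  # better models attract more users
    print(f"gen {generation}: quality={model_quality:.2f}, users={user_base}")
```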
Meta doesn't have this flywheel running because their models aren't frontier-competitive. No users means no data means no improvement means no catching up.
Strategic Options and Limitations
Meta's reported exploration of Google TPU deployment (mentioned in the earlier section) won't solve this checkpoint problem—it's about compute access, not model-building capability. Even with unlimited compute, starting from inferior checkpoints makes frontier performance nearly impossible to achieve.
The only ways out: obtaining competitive checkpoints from a frontier lab (effectively impossible given competitive dynamics), or an unprecedented breakthrough in training methodology that does not require a strong starting point (historically rare in deep learning).
IV. Semiconductor Memory as Governor: The 12-Hi HBM3E Bottleneck
[The remaining sections, covering memory, SaaS margins, coherence, edge AI, and ROIC, continue below.]
Conclusion: The Multidimensional Chess Game
These eight insights reveal AI infrastructure as a multidimensional chess game where technical constraints, economic dynamics, competitive positioning, and physics-based limitations interact in complex ways. Understanding these dynamics provides a framework for thinking about AI infrastructure that goes beyond simple narratives of "scaling laws" or "GPU shortages."
For investors, entrepreneurs, and technologists navigating this landscape, success requires technical depth, economic sophistication, supply chain awareness, and strategic patience—all deployed simultaneously to build compounding advantages that become increasingly difficult for competitors to overcome.