
Building a World Model: Gemini Omni and the New TPU 8 Silicon

By Zain Ahmed

May 23, 2026

Cover image: TPU 8t and TPU 8i official reveal, Google Cloud Next 2026. Source: Google / Sundar Pichai.

The 2026 ecosystem pushes far beyond standard text generation and steps into physical simulation. Gemini Omni Flash processes audio, video, image, and text natively at the token level. It does this without relying on cascaded translation layers or bolted-on diffusion models.

This native processing allows Omni to function as a rudimentary world model. It infers depth, momentum, and fluid dispersion directly from 2D reference images and simple text instructions.

Google integrated the SynthID protocol directly into the model's logits processor to secure these outputs. SynthID does not alter pixels after generation. Instead, it uses a deterministic pseudorandom $g$-function to subtly augment the language model's natural probability scores. This embeds a cryptographic watermark that survives extreme compression and aggressive cropping.
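To make the idea of a keyed $g$-function concrete, here is a minimal sketch in Python. It is not Google's implementation, and the published SynthID scheme is more involved than a simple logit nudge. The names `g_value`, `watermark_logits`, and `detection_score`, the SHA-256 hash, the `strength` parameter, and the 4-token context window are all assumptions made for illustration.

```python
import hashlib

def g_value(key: bytes, context: tuple, token: int) -> float:
    """Deterministic pseudorandom score in [0, 1) for a (key, context, token) triple."""
    payload = key + b"|" + ",".join(map(str, context)).encode() + b"|" + str(token).encode()
    digest = hashlib.sha256(payload).digest()
    # Use the first 8 bytes of the hash as a 64-bit integer, then scale to [0, 1).
    return int.from_bytes(digest[:8], "big") / 2**64

def watermark_logits(logits, key: bytes, context: tuple, strength: float = 2.0):
    """Nudge each logit by its centered g-value (illustrative, not SynthID's actual rule)."""
    return [
        logit + strength * (g_value(key, context, token) - 0.5)
        for token, logit in enumerate(logits)
    ]

def detection_score(tokens, key: bytes, window: int = 4) -> float:
    """Mean g-value of the observed tokens, each scored against its preceding window.

    Unwatermarked text should average near 0.5; text sampled from nudged
    logits should average higher.
    """
    scores = [
        g_value(key, tuple(tokens[i - window:i]), tokens[i])
        for i in range(window, len(tokens))
    ]
    return sum(scores) / len(scores) if scores else 0.5
```

Because the score depends only on the secret key, the recent context, and the candidate token, a detector holding the same key can recompute it from the output alone, without access to the model.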

Image: Google's first production server rack (1999), hand-built by Larry Page and Sergey Brin, now on display at the Computer History Museum, Mountain View. It is the foundation Google's silicon empire was built on. Source: Google Cloud (cloud.google.com)

The Silicon Split: TPU 8t vs. TPU 8i

Training a world model and serving a fast agentic loop require very different hardware. The demands are now mutually exclusive. Because of this, Google officially split its silicon roadmap into two distinct paths.

TPU 8t (Training)

Engineered for massive-scale pre-training, the TPU 8t operates in a 3D torus topology. A single superpod holds 9,600 chips. The Virgo Network unifies the entire setup. This non-blocking fabric links up to 134,000 chips with 47 petabits per second of bisection bandwidth. The chip also uses native FP4 support to double the throughput of the Matrix Multiply Unit.

TPU 8i (Inference)

The 8i was built specifically to break through the memory wall of high-concurrency reasoning. It features a massive 384 MB of on-chip SRAM, a 300% increase over the previous generation. This allows the entire active Key-Value cache to live directly on the silicon, so reads of that cache no longer wait on off-chip memory fetches.
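For a feel of what 384 MB buys, here is a rough sizing sketch. The model shape below (32 layers, 8 KV heads, head dimension 128, a 4,096-token context, 1 byte per element) is hypothetical and not taken from any Gemini or TPU specification. The sketch also reads MB as MiB and ignores sharding across chips.

```python
SRAM_BYTES = 384 * 1024**2  # 384 MB from the article, read as MiB (assumption)

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    """Size of a Key-Value cache: one key tensor plus one value tensor per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

def fits_in_sram(cache_bytes, sram_bytes=SRAM_BYTES):
    """True if the whole cache could sit in on-chip SRAM with nothing else resident."""
    return cache_bytes <= sram_bytes

# Hypothetical model shape, for illustration only.
HYPOTHETICAL_MODEL = dict(
    layers=32, kv_heads=8, head_dim=128, seq_len=4096, batch=1, bytes_per_elem=1
)

if __name__ == "__main__":
    size = kv_cache_bytes(**HYPOTHETICAL_MODEL)
    print(f"KV cache: {size / 1024**2:.0f} MiB, fits: {fits_in_sram(size)}")
```

Under these assumed numbers the cache comes to 256 MiB and fits. Doubling the context to 8,192 tokens pushes it to 512 MiB and it no longer does, which is why cache precision and context length matter so much for an inference chip.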

To speed up Mixture-of-Experts routing, the 8i ditches the 3D torus in favor of the Boardfly topology.

Understanding the Hop Reduction

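In a standard 1,024-chip 3D torus arranged as 8 × 8 × 16, the worst-case packet traversal is 16 hops. Wraparound links mean a packet never has to travel more than half the length of any axis.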

$$\text{3D torus hops} = \left(\tfrac{8}{2}\right)_X + \left(\tfrac{8}{2}\right)_Y + \left(\tfrac{16}{2}\right)_Z = 4 + 4 + 8 = 16 \text{ hops}$$
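The same arithmetic can be checked in a few lines of Python. This is a sketch of the generic torus distance formula, assuming the 8 × 8 × 16 layout implied by the equation; the function names are illustrative and say nothing about how Google's routing actually works.

```python
from math import prod

def torus_hops(a, b, dims):
    """Shortest hop count between chips a and b on a torus with wraparound links."""
    return sum(
        min(abs(x - y), d - abs(x - y))  # go direct or wrap around, whichever is shorter
        for x, y, d in zip(a, b, dims)
    )

def torus_diameter(dims):
    """Worst-case hop count: half of each axis length, summed over all axes."""
    return sum(d // 2 for d in dims)

DIMS = (8, 8, 16)  # assumed layout of the 1,024-chip torus

if __name__ == "__main__":
    print(prod(DIMS), "chips, worst case", torus_diameter(DIMS), "hops")
    print("farthest pair:", torus_hops((0, 0, 0), (4, 4, 8), DIMS), "hops")
```

The chip at (4, 4, 8) is the farthest point from (0, 0, 0) on every axis at once, which is where the 16-hop worst case comes from.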

The flattened Boardfly topology reduces this maximum path to just 7 hops. Google combined this with a dedicated Collectives Acceleration Engine to cut on-chip communication lag by up to a factor of five. This delivers an 80% performance-per-dollar improvement. It finally makes the continuous serving of massive agent swarms economically viable.
