For thirty years, one side has been a computer and the other has been a chip.
JPEG, MP3, H.264, HEVC, every format you've ever loaded — designed for a world where the source was a studio and the sink was a TV, a phone, a browser running on hardware that could afford exactly one thing: an inverse transform fixed in silicon. The protocol baked the asymmetry in. The format pretended the receiver was a toaster because, at the time, it basically was.
Today both ends are full computers. Your phone has an NPU. Your laptop has a GPU that idles ninety-five percent of the time. The toaster assumption is over.
One title becomes dozens. Then it squeezes through a continent of cache.
A single movie does not arrive on your screen as a single movie. It is encoded into every codec the device fleet might know — H.264, HEVC, VP9, AV1 — at every resolution rung, with every audio track, every HDR variant. Then those dozens of artifacts are pushed to edge servers in every region, so the bytes don't have to travel far when a million people press play at once.
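As a rough illustration of how one title fans out, the sketch below multiplies out a ladder. The codec list comes from the paragraph above; the resolution rungs and dynamic-range variants are assumptions picked only for illustration, not any service's actual ladder, and audio tracks are left out entirely.

```python
from itertools import product

# Codecs named in the text above.
codecs = ["H.264", "HEVC", "VP9", "AV1"]
# Hypothetical resolution rungs and dynamic-range variants (illustrative only).
rungs = ["480p", "720p", "1080p", "2160p"]
dynamic_ranges = ["SDR", "HDR10"]

# Every combination is a separate video artifact to encode, store, and cache.
video_variants = list(product(codecs, rungs, dynamic_ranges))
print(len(video_variants))  # 4 * 4 * 2 = 32
```

Even with these modest assumed lists, a single title is already 32 video artifacts before any audio track is counted.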
The encode farm and the cache are not the product. They are the cost of pretending the sink is dumb.
Thirty years to halve the bitrate. Three to do it again.
Classical codecs improved at roughly one new generation per decade. Each generation cut bitrate in half at the cost of more decoder complexity, which the silicon could just barely afford. JPEG to JPEG 2000. H.264 to HEVC. HEVC to VVC. Steady, slow, committee-paced.
The neural era moves faster because it isn't waiting for a chip to be fabbed. Once the decoder is software running on hardware the receiver already has, every model release is a potential generation. JPEG AI, EnCodec, DAC, Mimi: in three years the field covered as much ground as it had in the previous thirty.
“AI codec” is at least four different things.
The press treats neural codecs as one category. They are not. Pixel-faithful codecs put quantized latents on the wire and reconstruct close to the original — good for stills and high-fidelity audio. Token-based codecs ship codebook indices that language models can ingest directly — the right tool for speech and music generation pipelines. Structured codecs ship compact primitives like facial keypoints and let the decoder animate from a reference frame — the talking-head workhorses. Generative codecs ship a sketch and a description, and the receiver synthesizes the rest — the ultra-low-bitrate frontier.
WAI commits to the deterministic, reversible end of this spectrum. The first three flavors can be made bit-exact by construction. The fourth is allowed but tagged in the fidelity field, so the receiver knows what it's getting.
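One way to picture the four families and the fidelity tag is the sketch below. The enum names and the two tag strings are hypothetical; the text only says that a fidelity field exists and that the generative family is tagged in it. The sketch also assumes the first three families have in fact been built to be bit-exact.

```python
from enum import Enum

class CodecFamily(Enum):
    # The value is what each family puts on the wire, per the text above.
    PIXEL_FAITHFUL = "quantized latents"
    TOKEN_BASED = "codebook indices"
    STRUCTURED = "compact primitives"
    GENERATIVE = "sketch plus description"

def fidelity_tag(family: CodecFamily) -> str:
    """Return a hypothetical fidelity tag for a codec family."""
    # Only the generative family synthesizes content the sender never encoded.
    if family is CodecFamily.GENERATIVE:
        return "synthesized"
    return "bit-exact"

for family in CodecFamily:
    print(family.name, "->", fidelity_tag(family))
```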
The bytes shrink by one to three orders of magnitude.
Talking-head video (news, lectures, interviews) drops from around 500 kbps with H.264 to about 1.5 kbps with audio-visual generative face coding, a reduction of roughly 330 times. Cinematic content sees more modest but still dramatic wins. Music drops about twenty-fold. Speech with a shared speaker prior can drop a hundred-fold or more.
These aren't projections. They are published bitrates from open, reviewed work. The technology is here. The deployment is not, because every existing pipeline was built around the old assumption.
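To make the talking-head figures quoted above concrete, here is the arithmetic as a small sketch. The two bitrates are the ones given in the text; the helper names are ours, and kbps is taken to mean 1000 bits per second.

```python
def reduction_factor(before_kbps: float, after_kbps: float) -> float:
    """How many times smaller the new bitrate is than the old one."""
    return before_kbps / after_kbps

def megabytes_per_hour(kbps: float) -> float:
    """Convert kilobits per second to megabytes transferred in one hour."""
    bits = kbps * 1000 * 3600
    return bits / 8 / 1_000_000

factor = reduction_factor(500, 1.5)
print(round(factor))                      # 333
print(megabytes_per_hour(500))            # 225.0
print(round(megabytes_per_hour(1.5), 3))  # 0.675
```

An hour of talking-head video goes from about 225 MB to well under 1 MB at these rates.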
The source encodes once. The sink runs WebAssembly.
Between source and sink, the only thing that matters is that the receiver can execute code. WebAssembly is that contract: every browser, every modern device, every platform Transaction Science targets already has a WASM runtime. The decoder rides with the stream, or is fetched once and cached forever by hash. The WASM module dispatches each piece of the payload to the silicon that's best at handling it — control flow on the CPU, tensor math on the GPU, neural decode on the NPU.
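The "fetched once and cached forever by hash" idea can be sketched as a content-addressed store. This is a minimal in-memory illustration, not the WAI implementation: the class name, the `fetch` callable, and the choice of SHA-256 are all assumptions, and the module is never executed here.

```python
import hashlib

class DecoderCache:
    """In-memory stand-in for a receiver's content-addressed decoder store."""

    def __init__(self, fetch):
        self._fetch = fetch    # hypothetical callable: hex digest -> module bytes
        self._modules = {}     # SHA-256 hex digest -> verified module bytes
        self.fetch_count = 0   # number of network fetches actually performed

    def get(self, digest: str) -> bytes:
        if digest not in self._modules:
            module = self._fetch(digest)
            self.fetch_count += 1
            # Refuse bytes that do not match the hash that named them.
            if hashlib.sha256(module).hexdigest() != digest:
                raise ValueError("decoder bytes do not match requested hash")
            self._modules[digest] = module
        return self._modules[digest]

# The 8-byte header of an empty WebAssembly module: magic "\0asm" plus version 1.
wasm_bytes = b"\x00asm\x01\x00\x00\x00"
digest = hashlib.sha256(wasm_bytes).hexdigest()

cache = DecoderCache(lambda d: wasm_bytes)
cache.get(digest)
cache.get(digest)
print(cache.fetch_count)  # 1
```

Because the key is the hash of the bytes, a cached decoder never goes stale: a changed decoder has a different hash and is simply a different entry.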
There is no codec registry. No version negotiation. No hardware decoder block to match. The sender picks the codec; the sender ships the decoder. The receiver runs it and signs a receipt.
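What "signs a receipt" could look like is sketched below. The receipt fields and the use of an HMAC with a shared key are assumptions made for a self-contained example; the text does not say what a WAI receipt contains, and a real design might well use asymmetric signatures instead.

```python
import hashlib
import hmac
import json

def _canonical(body: dict) -> bytes:
    # Canonical JSON so both sides authenticate identical bytes.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()

def make_receipt(decoder: bytes, payload: bytes, output: bytes, key: bytes) -> dict:
    """Record which decoder ran on which payload and what it produced."""
    body = {
        "decoder_sha256": hashlib.sha256(decoder).hexdigest(),
        "payload_sha256": hashlib.sha256(payload).hexdigest(),
        "output_sha256": hashlib.sha256(output).hexdigest(),
    }
    body["mac"] = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    return body

def verify_receipt(receipt: dict, key: bytes) -> bool:
    """Recompute the MAC over every field except the MAC itself."""
    body = {k: v for k, v in receipt.items() if k != "mac"}
    expected = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, receipt.get("mac", ""))

receipt = make_receipt(b"decoder", b"payload", b"decoded frames", b"shared-key")
print(verify_receipt(receipt, b"shared-key"))  # True
```

Changing any hash in the receipt, or verifying with a different key, makes `verify_receipt` return False.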
When the payload is small, the middle becomes a choice.
A cache exists to put copies of large bytes close to the people who want them. When the bytes are small, a cache is still useful for some audiences and unnecessary for others. Direct origin works because a cheap server can serve a tiny payload to many viewers without breaking a sweat. Peer mesh works because if the catalog is small enough, the audience is the distribution network.
All three (edge cache, direct origin, and peer mesh) are valid WAI deployments.
The dollars were in the middle. The middle just got optional.
The cost of streaming a movie today is mostly not bandwidth. It's the cost of the encode farm that produced thirty variants, the cache that put them near every viewer, and the licensing on the codecs that decoded them. Those three categories dwarf the cost of the underlying bytes. WAI collapses all three: encode once, distribute however, license nothing.
What's left is bandwidth — already cheap and getting cheaper — applied to payloads fifty to five hundred times smaller. The economics of media distribution invert.
Every piece of bloat had a reason. None of the reasons still hold.
Codec registries existed so the receiver knew what to expect. Standards bodies existed so the world's hardware could agree. Hardware decoder blocks existed because software wasn't fast enough. Codec licensing existed because patent holders controlled the only way to decode. ABR ladders existed so the same content could reach devices with different decoders. Encode farms existed to feed all of the above.
When the decoder is shipped with the stream and runs in WebAssembly on hardware the sink already has, the entire stack of historical workarounds drops out. What's left is what was always supposed to be there: bytes from one computer to another, with a receipt that says exactly what happened.
The spec is six sections long.
Container envelope, manifest schema, capability dispatch, registered capabilities, conformance, and versioning. A reader can get through it in an afternoon and hold the whole thing in their head.