What Is a Trie?— How a prefix tree stores thousands of strings using shared structure — and why that makes it one of the most efficient data structures for text search.
Building an Interactive Trie Visualizer with D3— How to turn a data structure into a living diagram — D3's tree layout, the enter/update/exit pattern, and making visualizations that adapt to any theme.
Scanning Text with a Trie— How to find every occurrence of thousands of patterns in a single pass through text — the algorithm behind entity detection, content moderation, and autocomplete.
Broadcasting a Trie in Spark— How to scan millions of documents for 100,000 entity names in parallel — by broadcasting a trie to every executor in a Spark cluster.
Building a Trie-Powered Autocomplete with React— How to build an autocomplete component that searches 100,000 entries in microseconds — no server, no debouncing, just a trie in the browser.
Shrinking the Trie for the Wire— We tried to build a compact wire format for tries. Gzip already solved the problem. Here's what we learned by measuring.
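The core idea the trie series builds on, shared prefixes stored once, fits in a few lines of Python. This is a minimal illustration, not the implementation from the posts:

```python
class TrieNode:
    """One node in the prefix tree; children are keyed by character."""
    def __init__(self):
        self.children = {}
        self.terminal = False  # True if a stored word ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.terminal

    def starts_with(self, prefix):
        return self._walk(prefix) is not None

    def _walk(self, s):
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

# "car" and "cart" share the path c -> a -> r; only the extra "t" costs a node.
t = Trie()
t.insert("car")
t.insert("cart")
print(t.contains("car"))    # True
print(t.contains("ca"))     # False: a prefix, not a stored word
print(t.starts_with("ca"))  # True
```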
Mergeable Operations in Distributed Computation
Split, Process, Combine— Why some operations survive being distributed across a thousand machines — and why others break. The hidden design constraint behind every choice in distributed data processing.
Sketches: Trading Precision for Scalability— HyperLogLog, Count-Min Sketch, Bloom filters, and T-Digest — the approximate data structures that dominate big data all share one property: they merge.
When Abstract Algebra Becomes Practical— Every mergeable operation in this series has been a monoid. Here's what that means, why it matters, and how a Scala library turned abstract algebra into the most practical tool in distributed computing.
Building HyperLogLog from Scratch— A step-by-step Python implementation of HyperLogLog, from hashing to registers to the harmonic mean — with tests and benchmarks at every stage.
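The merge property this series keeps returning to can be shown concretely: HyperLogLog registers built from two partitions combine by element-wise max, and the result equals the sketch of the union. This is a toy version with a hypothetical 16-register layout, not the series' implementation:

```python
import hashlib

M = 16  # number of registers (deliberately tiny for illustration)

def sketch(items):
    """Build a toy HyperLogLog register array for one partition."""
    registers = [0] * M
    for item in items:
        h = int(hashlib.sha256(item.encode()).hexdigest(), 16)
        idx = h % M        # which register this item lands in
        rest = h // M
        # rank = position of the first 1-bit in the remaining hash bits
        rank = 1
        while rest % 2 == 0 and rank < 64:
            rest //= 2
            rank += 1
        registers[idx] = max(registers[idx], rank)
    return registers

def merge(a, b):
    """Element-wise max: the monoid operation on HLL registers."""
    return [max(x, y) for x, y in zip(a, b)]

left = sketch(["a", "b", "c"])
right = sketch(["c", "d"])
# Merging partition sketches gives the same registers as sketching the union:
assert merge(left, right) == sketch(["a", "b", "c", "d"])
# And the all-zero register array is the identity element:
assert merge(left, [0] * M) == left
```

Associativity plus an identity element is exactly the monoid structure the series names, and it is why these sketches can be split across machines and combined in any order.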
Neural Nets from Scratch
Minds, Brains and Computers— I wrote neural nets in BASIC before they were cool. The experience changed the course of my life.
Neural Nets Are Simpler Than You Think— A neural network in 20 lines of Python. We build one from scratch, break it on purpose, fix it, and teach it arithmetic.
A Tour of Karpathy's Tutorials— From counting letters to building a transformer — the three conceptual leaps that turn a toy neural net into a language model.
Building a Mixture-of-Experts Model— Can a small model learn to specialize? We replace the transformer's feed-forward network with multiple experts and a router, then watch what happens.
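The spirit of the "20 lines of Python" claim can be shown with the smallest possible version: a single linear neuron learning y = 2x by gradient descent. A minimal sketch, not the network from the posts:

```python
import random

random.seed(0)
# One neuron, no bias: prediction is w * x.
# We train it on the function y = 2x using squared-error loss.
w = random.random()
data = [(x, 2 * x) for x in range(1, 5)]

for _ in range(200):
    for x, y in data:
        pred = w * x
        grad = 2 * (pred - y) * x  # derivative of (pred - y)^2 w.r.t. w
        w -= 0.01 * grad           # step downhill

print(round(w, 3))  # converges to 2.0
```

Everything a larger network adds, more weights, nonlinearities, layers, is elaboration on this loop: predict, measure error, nudge weights against the gradient.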
Do Drunk People Tip Better?— My friend Steve has a question about taxis, drinking, and tipping. We have 174 million taxi records. Let's find out.
Mapping Every Bar in New York City— 11,000 liquor licenses, weighted by drinkiness, visualized on interactive maps. Also: why does JFK airport have 47 liquor licenses in one building?
88.6 Million Taxi Rides— Filtering 174 million records down to the rides that actually matter. Plus: nobody tips in cash, and Friday nights are exactly what you'd expect.
Stumbling Distance— Building a custom distance metric using street graphs, k-d trees, and the realization that Google Maps would take a million years.
The Verdict— After all that work: do drunk people actually tip better? The answer is more interesting than you'd think.
Exploring High-Dimensional Data
What Embeddings Actually Are— Words as points in space. The moment that changed how I think about data — and the foundation for everything that follows.
Projecting to See — PCA, t-SNE, UMAP— Three methods for going from 768 dimensions to 2. Each one lies to you differently — and that's the point.
Building a 3D Data Explorer with Three.js— Take 10,000 movie review embeddings, project them to 3D with UMAP, and build an interactive tool to fly through your data. Color is the question; rotation is the answer.
Making Sense of Clusters— You see clumps in the point cloud. Are they real? Three clustering algorithms, TF-IDF keyword extraction, and NMF topic models turn spatial intuition into labeled structure.
Feature Engineering — Beyond the Embedding— Embeddings capture semantic similarity, but they're not the only signal. Structured features from the text itself reveal structure the embedding misses.
Graph Analysis — When Connections Tell the Story— Text tells you what each review says. A similarity graph tells you how reviews relate to each other. Community detection and graph embeddings reveal structure that spatial clustering misses.
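The projection step the series describes, going from 768 dimensions down to a few, can be sketched with plain PCA via NumPy's SVD. Synthetic data and a bare-bones projection, not the series' actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for embeddings: 200 points in 768 dimensions, with most
# of the variance deliberately concentrated in the first two directions.
scales = np.array([10.0, 5.0] + [0.1] * 766)
X = rng.normal(size=(200, 768)) * scales

def pca_project(X, k=2):
    """Project rows of X onto their top-k principal components."""
    Xc = X - X.mean(axis=0)                        # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                           # top-k coordinates

Y = pca_project(X)
print(Y.shape)  # (200, 2): each point now has just two coordinates to plot
```

PCA is the honest one of the three methods: a linear projection that maximizes retained variance. t-SNE and UMAP trade that global fidelity for local neighborhood structure, which is the "each one lies differently" point above.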
Other
On AI and Authorship— On using AI to build this site, and what that means for authorship.