Tries: Multi-Pattern Text Search

  1. What Is a Trie? — How a prefix tree stores thousands of strings using shared structure — and why that makes it one of the most efficient data structures for text search.
  2. Building an Interactive Trie Visualizer with D3 — How to turn a data structure into a living diagram — D3's tree layout, the enter/update/exit pattern, and making visualizations that adapt to any theme.
  3. Scanning Text with a Trie — How to find every occurrence of thousands of patterns in a single pass through text — the algorithm behind entity detection, content moderation, and autocomplete.
  4. Broadcasting a Trie in Spark — How to scan millions of documents for 100,000 entity names in parallel — by broadcasting a trie to every executor in a Spark cluster.
  5. Building a Trie-Powered Autocomplete with React — How to build an autocomplete component that searches 100,000 entries in microseconds — no server, no debouncing, just a trie in the browser.
  6. Shrinking the Trie for the Wire — We tried to build a compact wire format for tries. Gzip already solved the problem. Here's what we learned by measuring.
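
As a taste of the series' core idea, here is a minimal trie sketch in Python. Shared prefixes share nodes, so membership checks cost one dictionary lookup per character regardless of how many strings are stored. Class and method names here are mine, not the posts':

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_end = False  # True if a stored string ends here

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end
```

Lookup time depends only on the length of the query, not on the number of stored strings, which is what makes the structure attractive for large pattern sets.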

Mergeable Operations in Distributed Computation

  1. Split, Process, Combine — Why some operations survive being distributed across a thousand machines — and why others break. The hidden design constraint behind every choice in distributed data processing.
  2. Sketches: Trading Precision for Scalability — HyperLogLog, Count-Min Sketch, Bloom filters, and T-Digest — the approximate data structures that dominate big data all share one property: they merge.
  3. When Abstract Algebra Becomes Practical — Every mergeable operation in this series has been a monoid. Here's what that means, why it matters, and how a Scala library turned abstract algebra into the most practical tool in distributed computing.
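
The property the series keeps returning to fits in a few lines: an associative operation with an identity element (a monoid) lets partial results computed on separate machines be combined in any grouping and still give the single-machine answer. A toy sketch using max as the monoid (the partition data is made up for illustration):

```python
from functools import reduce

# A monoid: an associative binary op plus an identity element.
# Here the carrier is numbers, the op is max, the identity is -inf.
def merge(a, b):
    return max(a, b)

identity = float("-inf")

# Simulate three machines, each reducing its own partition...
partitions = [[3, 1, 4], [1, 5, 9], [2, 6]]
partials = [reduce(merge, p, identity) for p in partitions]

# ...then combine the partial results. Associativity guarantees this
# equals a single reduce over all the data, however the work was split.
total = reduce(merge, partials, identity)
```

Swap in sum, set union, or a sketch-merge for `merge` and the same split-process-combine shape carries over unchanged.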

Entity Detection: Finding What Matters in Text

  1. The 'You' Problem — Why Entity Detection Is Harder Than Ctrl+F — You have 100,000 entity names and a million documents. Finding the matches is easy. Knowing which matches are real — that's the hard part.
  2. Scoring Entity Matches — When Finding Isn't Enough — A trie finds the strings. Scoring decides which matches are real. Here's the disambiguation model that solves the 'You' problem.
  3. Entity Detection at Scale — The Broadcast Pattern — How to scan 10 million documents for 100,000 entity names across 8 entity types — all in parallel on a Spark cluster.
  4. From Batch to Real-Time — Entity Detection in a Web App — The same trie algorithm that scans 10 million documents overnight can tag a single document in milliseconds. Here's how to make the leap.
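
The single-pass scan these posts build on can be sketched briefly: at each start position, walk the trie as far as the text allows, recording every pattern boundary crossed along the way. This is the naive version (no failure links as in Aho-Corasick), and the dict-based trie and names are mine:

```python
def build_trie(patterns):
    root = {}
    for p in patterns:
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        node["$"] = p  # sentinel marking the end of a stored pattern
    return root

def scan(text, root):
    """Return (start_index, pattern) for every occurrence of any pattern."""
    matches = []
    for i in range(len(text)):
        node = root
        j = i
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "$" in node:
                matches.append((i, node["$"]))
    return matches
```

One traversal per start position finds every pattern beginning there, so the number of patterns never multiplies the passes over the text.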

HyperLogLog: Counting Unique Items at Scale

  1. The Longest Streak — Estimating Crowd Size from Coin Flips — How a simple coin-flipping game leads to one of the most elegant algorithms in computer science.
  2. Hashing — Turning Anything into Random Coin Flips — How hash functions produce uniformly random bits, and why that makes HyperLogLog work on any dataset.
  3. Splitting the Crowd — How HyperLogLog Tames Randomness — How registers, sub-crowds, and the harmonic mean turn a noisy coin-flip estimate into a precise counting algorithm.
  4. Building HyperLogLog from Scratch — A step-by-step Python implementation of HyperLogLog, from hashing to registers to the harmonic mean — with tests and benchmarks at every stage.
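
To give a flavor of post 4, here is a toy HyperLogLog: hash each item, use the first p bits to pick a register, treat the remaining bits as coin flips, keep the longest streak per register, then combine registers with a harmonic mean. This sketch omits the small- and large-range corrections of the full algorithm, and the names are mine:

```python
import hashlib

def hll_estimate(items, p=10):
    """Toy HyperLogLog cardinality estimate with m = 2**p registers."""
    m = 1 << p
    registers = [0] * m
    for item in items:
        # 64-bit hash of the item
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        idx = h >> (64 - p)                # first p bits choose a register
        rest = h & ((1 << (64 - p)) - 1)   # remaining bits are "coin flips"
        # rank = position of the leftmost 1-bit among the remaining bits
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    # harmonic mean of 2**r across registers, with the standard bias constant
    alpha = 0.7213 / (1 + 1.079 / m)
    return alpha * m * m / sum(2.0 ** -r for r in registers)
```

With p=10 (1,024 registers, 1 KB of state) the relative error is typically a few percent, no matter how many distinct items pass through.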

Neural Nets from Scratch

  1. Minds, Brains and Computers — I wrote neural nets in BASIC before they were cool. The experience changed the course of my life.
  2. Neural Nets Are Simpler Than You Think — A neural network in 20 lines of Python. We build one from scratch, break it on purpose, fix it, and teach it arithmetic.
  3. A Tour of Karpathy's Tutorials — From counting letters to building a transformer — the three conceptual leaps that turn a toy neural net into a language model.
  4. Building a Mixture-of-Experts Model — Can a small model learn to specialize? We replace the transformer's feed-forward network with multiple experts and a router, then watch what happens.
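
In the spirit of post 2, here is gradient descent teaching a single linear neuron a bit of arithmetic. This is my own sketch, not the post's code; the learning rate and epoch count are arbitrary but stable for this data:

```python
import random

random.seed(0)

# One linear neuron: pred = w * x + b, trained to learn y = 2x + 1.
w, b = random.random(), random.random()
lr = 0.01
data = [(x, 2 * x + 1) for x in range(-5, 6)]

for epoch in range(2000):
    for x, y in data:
        pred = w * x + b
        err = pred - y
        # gradients of the squared error 0.5 * err**2 w.r.t. w and b
        w -= lr * err * x
        b -= lr * err
```

After training, `w` and `b` sit essentially on top of the true slope and intercept; breaking it is as easy as cranking `lr` up past the stability limit, which is the kind of deliberate failure post 2 explores.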

Hulu Data Platform

  1. 12,000 Events Per Second: Inside Hulu's Beacon Data Pipeline — How Hulu collected, transformed, and processed billions of events daily — the architecture behind 175 MapReduce jobs per hour.
  2. Why Hulu Built a DSL for Its Data Pipeline (and Why You Should Care) — How BeaconSpec — a domain-specific language for metric definitions — improved monitoring, maintainability, and consistency across 175 MapReduce jobs per hour.
  3. Building Your First Domain-Specific Language: A Practical Guide in Python and Scala — How a small, focused language can eliminate boilerplate, reduce bugs, and make your team faster — with working examples.
  4. The Email Explosion: Why Monitoring a Data Pipeline Is Harder Than You Think — When 175 MapReduce jobs run every hour, traditional monitoring becomes a firehose of alerts. Here's how the first approach fell short — and what it taught us.
  5. Pattern Matching in Graphs: A Practical Introduction to Neo4j and Cypher — How to query connected data by drawing the patterns you're looking for — and why SQL struggles where graphs shine.
  6. Think Like a User: Graph-Based Troubleshooting for Data Pipelines — How flipping the monitoring model from 'what failed?' to 'who is affected?' transformed pipeline operations — using a graph database and the Cypher skills from the previous post.
  7. MVEL and User-Defined Jobs: Letting Users Configure Their Own Pipeline — How a lightweight expression language gave non-specialists the power to define custom pipeline logic — safely.
  8. The Reporting Layer: Building a Data API for Self-Service Analytics — How Hulu turned raw pipeline output into a self-service reporting platform — with query generation, scheduling, and a portal that put data in users' hands.
  9. From Batch to Stream: What 2014's Lessons Mean for Today's Pipelines — The Hadoop Summit talk was about MapReduce. But the principles — DSLs, graph-based monitoring, user-centric thinking — are more relevant than ever in the streaming era.
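
To give a flavor of posts 2 and 3, here is a toy metric-definition mini-language in Python: a parser for one-line declarations plus an evaluator over event dicts. The syntax is entirely hypothetical, invented for this sketch, and is not BeaconSpec:

```python
import re

# Hypothetical grammar:
#   metric NAME = count(FIELD) [where ATTR == 'VALUE']
#   metric NAME = sum(FIELD)   [where ATTR == 'VALUE']
RULE = re.compile(
    r"metric (\w+) = (count|sum)\((\w+)\)(?: where (\w+) == '([^']*)')?"
)

def parse(spec):
    metrics = []
    for line in spec.strip().splitlines():
        m = RULE.fullmatch(line.strip())
        if not m:
            raise ValueError(f"bad metric definition: {line!r}")
        name, op, field, attr, value = m.groups()
        metrics.append({"name": name, "op": op, "field": field,
                        "filter": (attr, value) if attr else None})
    return metrics

def evaluate(metrics, events):
    out = {}
    for m in metrics:
        rows = events
        if m["filter"]:
            attr, value = m["filter"]
            rows = [e for e in rows if e.get(attr) == value]
        if m["op"] == "count":
            out[m["name"]] = len(rows)
        else:
            out[m["name"]] = sum(e.get(m["field"], 0) for e in rows)
    return out
```

Even this toy shows the DSL payoff: adding a metric is one declarative line, validated up front by the parser, instead of another hand-written aggregation job.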

Python for Fixed-Income Risk Analysis

  1. Exploring Treasury Yield Data with Python — Get the data, explore it, and discover why simple risk assumptions break down.
  2. From Averages to GARCH — A Ladder of Time Series Models — Why each time series model exists, what breaks without it, and how GARCH becomes inevitable.
  3. When Markets Panic — Modeling Volatility with GARCH — Build a GARCH model, construct dynamic VaR, and investigate which historical crises triggered extreme moves.
  4. A Brief Detour — What PCA Actually Does — Building intuition for Principal Component Analysis — no finance required.
  5. Why Duration Rules — PCA and the Hidden Structure of the Yield Curve — Apply PCA to the yield curve and discover why duration and convexity capture most of the risk.
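
The PCA intuition behind posts 4 and 5 can be demonstrated on synthetic data: when one shared factor (a parallel shift) drives every maturity, the first principal component absorbs nearly all the variance. A sketch with made-up numbers, not real yield data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "daily yield changes": 200 days x 5 maturities, dominated by
# a shared level factor plus small idiosyncratic noise per maturity.
level = rng.normal(0, 1.0, size=(200, 1))   # parallel shift, same every column
noise = rng.normal(0, 0.2, size=(200, 5))   # maturity-specific wiggle
X = level + noise

# PCA = eigen-decomposition of the covariance of the centered data
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
explained = eigvals[::-1] / eigvals.sum()   # variance share, descending
```

Here `explained[0]` comes out well above 0.9: one component captures the level factor, which is the synthetic analogue of why duration alone explains most yield-curve risk.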

NYC Taxis and Tipping

  1. Do Drunk People Tip Better? — My friend Steve has a question about taxis, drinking, and tipping. We have 174 million taxi records. Let's find out.
  2. Mapping Every Bar in New York City — 11,000 liquor licenses, weighted by drinkiness, visualized on interactive maps. Also: why does JFK airport have 47 liquor licenses in one building?
  3. 88.6 Million Taxi Rides — Filtering 174 million records down to the rides that actually matter. Plus: nobody tips in cash, and Friday nights are exactly what you'd expect.
  4. Stumbling Distance — Building a custom distance metric using street graphs, k-d trees, and the realization that Google Maps would take a million years.
  5. The Verdict — After all that work: do drunk people actually tip better? The answer is more interesting than you'd think.

Exploring High-Dimensional Data

  1. What Embeddings Actually Are — Words as points in space. The moment that changed how I think about data — and the foundation for everything that follows.
  2. First Contact — Statistics and Distributions in High Dimensions — You've embedded 50,000 movie reviews into 768 dimensions. Before you project or cluster, you need to understand what you're looking at.
  3. Projecting to See — PCA, t-SNE, UMAP — Three methods for going from 768 dimensions to 2. Each one lies to you differently — and that's the point.
  4. Building a 3D Data Explorer with Three.js — Take 10,000 movie review embeddings, project them to 3D with UMAP, and build an interactive tool to fly through your data. Color is the question; rotation is the answer.
  5. Making Sense of Clusters — You see clumps in the point cloud. Are they real? Three clustering algorithms, TF-IDF keyword extraction, and NMF topic models turn spatial intuition into labeled structure.
  6. Feature Engineering — Beyond the Embedding — Embeddings capture semantic similarity, but they're not the only signal. Structured features from the text itself reveal structure the embedding misses.
  7. Graph Analysis — When Connections Tell the Story — Text tells you what each review says. A similarity graph tells you how reviews relate to each other. Community detection and graph embeddings reveal structure that spatial clustering misses.
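
The similarity-graph idea in post 7 can be sketched directly: normalize the embeddings, compute pairwise cosine similarity, and keep edges above a threshold. A toy version on two synthetic clusters (dimensions, cluster sizes, and the 0.9 threshold are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic "review" clusters in a 768-dimensional embedding space:
# each cluster is a random center plus small per-point noise.
a = rng.normal(0, 1, size=768) + rng.normal(0, 0.05, size=(20, 768))
b = rng.normal(0, 1, size=768) + rng.normal(0, 0.05, size=(20, 768))
X = np.vstack([a, b])  # rows 0-19 are cluster A, rows 20-39 are cluster B

# Cosine-similarity graph: normalize rows, take dot products, threshold.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = Xn @ Xn.T
np.fill_diagonal(sim, 0)          # no self-loops
edges = np.argwhere(sim > 0.9)    # (i, j) pairs above the threshold
```

On this data every edge stays inside its cluster, so even a plain connected-components pass would recover the two groups; community detection takes over when the structure is less clean.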

Other