Drinky Cab

  1. Do Drunk People Tip Better? — My friend Steve has a question about taxis, drinking, and tipping. We have 174 million taxi records. Let's find out.
  2. Mapping Every Bar in New York City — 11,000 liquor licenses, weighted by drinkiness, visualized on interactive maps. Also: why does JFK airport have 47 liquor licenses in one building?
  3. 88.6 Million Taxi Rides — Filtering 174 million records down to the rides that actually matter. Plus: nobody tips in cash, and Friday nights are exactly what you'd expect.
  4. Stumbling Distance — Building a custom distance metric using street graphs, k-d trees, and the realization that Google Maps would take a million years.
  5. The Verdict — After all that work: do drunk people actually tip better? The answer is more interesting than you'd think.

Hulu Pipeline

  1. 12,000 Events Per Second: Inside Hulu's Beacon Data Pipeline — How Hulu collected, transformed, and processed more than a billion events daily — the architecture behind 175 MapReduce jobs per hour.
  2. Why Hulu Built a DSL for Its Data Pipeline (and Why You Should Care) — How BeaconSpec — a domain-specific language for metric definitions — improved monitoring, maintainability, and consistency across 175 MapReduce jobs per hour.
  3. Building Your First Domain-Specific Language: A Practical Guide in Python and Scala — How a small, focused language can eliminate boilerplate, reduce bugs, and make your team faster — with working examples.
  4. The Email Explosion: Why Monitoring a Data Pipeline Is Harder Than You Think — When 175 MapReduce jobs run every hour, traditional monitoring becomes a firehose of alerts. Here's how the first approach fell short — and what it taught us.
  5. Pattern Matching in Graphs: A Practical Introduction to Neo4j and Cypher — How to query connected data by drawing the patterns you're looking for — and why SQL struggles where graphs shine.
  6. Think Like a User: Graph-Based Troubleshooting for Data Pipelines — How flipping the monitoring model from 'what failed?' to 'who is affected?' transformed pipeline operations — using a graph database and the Cypher skills from the previous post.
  7. MVEL and User-Defined Jobs: Letting Users Configure Their Own Pipeline — How a lightweight expression language gave non-specialists the power to define custom pipeline logic — safely.
  8. The Reporting Layer: Building a Data API for Self-Service Analytics — How Hulu turned raw pipeline output into a self-service reporting platform — with query generation, scheduling, and a portal that put data in users' hands.
  9. From Batch to Stream: What 2014's Lessons Mean for Today's Pipelines — The Hadoop Summit talk was about MapReduce. But the principles — DSLs, graph-based monitoring, user-centric thinking — are more relevant than ever in the streaming era.

HyperLogLog: Counting Unique Items the Clever Way

  1. The Longest Streak — Estimating Crowd Size from Coin Flips — How a simple coin-flipping game leads to one of the most elegant algorithms in computer science.
  2. Hashing — Turning Anything into Random Coin Flips — How hash functions produce uniformly random bits, and why that makes HyperLogLog work on any dataset.
  3. Splitting the Crowd — How HyperLogLog Tames Randomness — How registers, sub-crowds, and the harmonic mean turn a noisy coin-flip estimate into a reliably accurate one.
  4. Building HyperLogLog from Scratch — A step-by-step Python implementation of HyperLogLog, from hashing to registers to the harmonic mean — with tests and benchmarks at every stage.

Python for Fixed-Income Risk Analysis

  1. Exploring Treasury Yield Data with Python — Get the data, explore it, and discover why simple risk assumptions break down.
  2. From Averages to GARCH — A Ladder of Time Series Models — Why each time series model exists, what breaks without it, and how GARCH becomes inevitable.
  3. When Markets Panic — Modeling Volatility with GARCH — Build a GARCH model, construct dynamic VaR, and investigate which historical crises triggered extreme moves.
  4. A Brief Detour — What PCA Actually Does — Building intuition for Principal Component Analysis — no finance required.
  5. Why Duration Rules — PCA and the Hidden Structure of the Yield Curve — Apply PCA to the yield curve and discover why duration and convexity capture most of the risk.