The 'You' Problem — Why Entity Detection Is Harder Than Ctrl+F

Imagine you work for a streaming service. Your catalog has 100,000 titles — every show and film you’ve ever licensed or produced. You also have a knowledge base: millions of documents, reports, presentations, and messages generated by thousands of employees.

You want to connect them. For every document, you want to know: which shows is this about?

If you’ve been following the Tries series, you already know the performance story. A trie can scan a document for all 100,000 titles in a single pass, and we’ve built one, visualized one, and broadcast one across a Spark cluster.

But performance was the easy part. The hard part is knowing which matches are real.

I presented on these challenges at a Netflix Data Engineering Tech Talk (slides). This series explores the entity detection system that came out of that work — the design decisions, the scoring model, and the architecture that makes it scale.


The “You” Problem

Your catalog contains a show called You. It’s a popular thriller. It’s also one of the most common words in the English language.

A case-insensitive search for “you” can match dozens or even hundreds of times in a single long document. Almost none of those matches refer to the show. A case-sensitive search for “You” isn’t much better: “You” appears at the start of sentences, in titles of other documents, and in quotations.

This isn’t a theoretical concern. Here are real entries from a streaming catalog that are also common English words:

Title | The Problem
You | Matches every second-person pronoun
The Crown | Matches “the crown jewels,” “the crown prince”
Roma | Matches inside “romance,” “Roman,” “romantic”
Dark | Matches “dark mode,” “dark theme,” “in the dark”
Self Made | Matches “self-made millionaire”
Away | Matches “right away,” “far away,” “gave away”

Every one of these is a legitimate show title. Every one produces hundreds of false positives if you just search for the string.
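To make the false-positive problem concrete, here is a small sketch comparing a plain substring search with a whole-word search. The sample sentence and the helper names (`count_substring`, `count_whole_word`) are made up for illustration; only the titles come from the table above.

```python
import re

titles = ["You", "Roma", "Dark", "Away"]
# Invented sample text: none of these words refer to a show.
text = "You can enable dark mode right away. It is a romantic, Roman-era romance."

def count_substring(title, text):
    """Case-insensitive substring count, roughly what Ctrl+F does."""
    return text.lower().count(title.lower())

def count_whole_word(title, text):
    """Case-insensitive count that only accepts matches on word boundaries."""
    pattern = r"\b" + re.escape(title) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

for title in titles:
    print(title, count_substring(title, text), count_whole_word(title, text))
```

Word boundaries remove the “Roma” hits inside “romantic,” “Roman,” and “romance,” but “You,” “Dark,” and “Away” still match as whole words. Tokenization alone does not solve the problem.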

And it gets worse. Titles aren’t the only entity type. You might also be scanning for:

  • Talent — actors and directors (“Will Smith” is also a common name)
  • Employees — internal team members
  • Countries — “Jordan” is both a country and a name
  • Topics — “machine learning,” “retention,” “churn”

Each entity type brings its own flavor of ambiguity.


What Entity Detection Actually Is

Entity detection (also called entity recognition or entity matching) is the problem of finding mentions of known entities in unstructured text. You have a catalog — a curated list of entities with their names, alternate names, and IDs — and a pile of documents. For each document, you want to know which entities are mentioned.

This is a dictionary-based approach: you know exactly what you’re looking for. You have a list of 100,000 show titles, and you want to find which ones appear in each document.

The alternative is named entity recognition (NER), where a machine learning model identifies entities it was never explicitly told about. An NER model can spot that “Millie Bobby Brown” is probably a person even if it’s never seen that name before. This is powerful, but it’s a fundamentally different problem — discovery versus matching.

Dictionary-based detection wins when:

  • You have a curated catalog. If you already know your 100,000 entities, you don’t need a model to discover them.
  • You need high precision. An ML model can tag spans that are not entities at all. Dictionary matching can’t: if it says “Stranger Things” is in the document, the string “Stranger Things” is definitely there. (Whether it means the show is a different question, which is what scoring addresses.)
  • You need determinism. The same document with the same catalog produces the same results every time. No model versioning, no drift, no retraining.
  • You need to scale. A trie can be serialized, broadcast to a thousand machines, and scanned in parallel with zero coordination. ML inference at that scale is much more expensive.

Three Hard Problems

Strip away the performance question (we solved that with a trie), and entity detection reduces to three hard problems:

1. Ambiguity — “You” means a lot of things

The same string can refer to different things in different contexts. “You” is a show, a pronoun, and a song. “The Crown” is a show, a monarchy, and a dental procedure. “Jordan” is a country, a person, and a sneaker brand.

The entity detector finds the string. Something else has to decide whether this particular occurrence is actually the entity.

2. Surface form variation — “Stranger Things” has many faces

Entities have multiple names. A show’s official title might be “Stranger Things,” but people also write “stranger things,” “ST,” “Stranger Things Season 4,” or the localized title in another language. Talent names are even worse: “Robert Downey Jr.,” “Robert Downey Jr,” “Robert Downey,” “RDJ.”

The catalog typically includes alternate names and search terms for each entity, but the trie needs to map all of them back to the same entity ID.
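One way to picture that mapping is an index from every surface form to the entity IDs that use it. The sketch below uses a plain dictionary rather than a trie, and the record layout (`id`, `name`, `aliases`), the `normalize` rule, and the talent ID are all assumptions for illustration, not the real catalog schema.

```python
from collections import defaultdict

# Hypothetical catalog records; field names and the talent ID are invented.
catalog = [
    {"id": "show:80057281", "name": "Stranger Things",
     "aliases": ["ST", "Stranger Things Season 4"]},
    {"id": "talent:0001", "name": "Robert Downey Jr.",
     "aliases": ["Robert Downey Jr", "Robert Downey", "RDJ"]},
]

def normalize(surface_form):
    """Lowercase and collapse whitespace so trivially different spellings share a key."""
    return " ".join(surface_form.lower().split())

def build_alias_index(catalog):
    """Map every normalized surface form to the set of entity IDs that use it."""
    index = defaultdict(set)
    for record in catalog:
        for form in [record["name"], *record["aliases"]]:
            index[normalize(form)].add(record["id"])
    return index

index = build_alias_index(catalog)
print(index["rdj"])
print(index[normalize("Stranger   Things")])
```

The values are sets because one surface form can legitimately belong to several entities, which is exactly the ambiguity described in the first problem.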

3. Overlapping matches — “New York” vs. “New York City”

When your catalog contains “New York,” “New York City,” and “York,” scanning the text “New York City” produces three overlapping matches. Which one do you keep?

We covered the overlap resolution algorithm in the Trie series — sort by position and length, keep the longest non-overlapping match. But for entity detection, length isn’t always the right tiebreaker. A shorter, more distinctive match might be more confident than a longer, generic one.
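As a reminder of the length-based rule, here is a minimal sketch of a greedy leftmost-longest pass. The `Match` tuple and `resolve_overlaps` are illustrative names, not the API from the Trie series.

```python
from typing import NamedTuple

class Match(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    name: str

def resolve_overlaps(matches):
    """Prefer earlier starts, then longer spans; drop anything that overlaps a kept match."""
    ordered = sorted(matches, key=lambda m: (m.start, -(m.end - m.start)))
    kept = []
    for m in ordered:
        if not kept or m.start >= kept[-1].end:
            kept.append(m)
    return kept

# The three overlapping matches produced by scanning "New York City".
raw = [Match(0, 8, "New York"), Match(0, 13, "New York City"), Match(4, 8, "York")]
print(resolve_overlaps(raw))
```

Swapping the length term in the sort key for a confidence score would be one way to let a shorter but more distinctive match win.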


The Naive Pipeline

Here’s what a first attempt at entity detection looks like, using the trie we built in the earlier series:

from trie import Trie

# Build the catalog: entity name → entity ID
catalog = {
    "Stranger Things": "show:80057281",
    "You": "show:80211991",
    "The Crown": "show:80025678",
    "Roma": "show:80240312",
    "Dark": "show:80100172",
}

# Load into a trie
trie = Trie()
for name, entity_id in catalog.items():
    trie.insert(name, value=entity_id)

# Scan a document
text = "You should watch Stranger Things. The Crown is also good."
matches = trie.find_all_matches(text)

This produces:

Match | Position | Entity
“You” | 0–3 | show:80211991
“Stranger Things” | 17–32 | show:80057281
“The Crown” | 34–43 | show:80025678

Three matches, three entity IDs. Two of them are correct: the document really is about Stranger Things and The Crown. The third is already a false positive, because “You” here is just the pronoun.

But change the text slightly:

text = "You should take care of yourself. The crown jewels are priceless."
matches = trie.find_all_matches(text)

With a case-insensitive scan, this produces:

Match | Position | Entity
“You” | 0–3 | show:80211991
“The crown” | 34–43 | show:80025678

Two matches, zero correct. The document has nothing to do with either show. The trie did its job perfectly — those strings really are in the text. The problem is that matching a string isn’t the same as detecting an entity.
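If you don't have the trie from the earlier series at hand, the following stand-in is enough to reproduce both scans. It is a sketch under assumptions: matching is case-insensitive and restricted to whole words, `find_all_matches` returns `(matched_text, start, end, value)` tuples with an exclusive end, and it restarts from the root at every position instead of scanning in a single pass. The real `Trie` may differ on all of these points.

```python
class Trie:
    """Minimal stand-in: case-insensitive, whole-word matches only."""

    def __init__(self):
        self.root = {}

    def insert(self, name, value):
        node = self.root
        for ch in name.lower():
            node = node.setdefault(ch, {})
        # Multi-character key, so it cannot collide with a single-character edge.
        node["$value"] = value

    def find_all_matches(self, text):
        """Return (matched_text, start, end, value) tuples; end is exclusive."""
        lowered = text.lower()  # assumes lowercasing keeps the length (true for ASCII)
        matches = []
        for start in range(len(lowered)):
            if start > 0 and lowered[start - 1].isalnum():
                continue  # not at the start of a word
            node = self.root
            for end in range(start, len(lowered)):
                node = node.get(lowered[end])
                if node is None:
                    break
                at_word_end = end + 1 == len(lowered) or not lowered[end + 1].isalnum()
                if "$value" in node and at_word_end:
                    matches.append((text[start:end + 1], start, end + 1, node["$value"]))
        return matches

trie = Trie()
trie.insert("Stranger Things", value="show:80057281")
trie.insert("You", value="show:80211991")
trie.insert("The Crown", value="show:80025678")

for match in trie.find_all_matches("You should take care of yourself. The crown jewels are priceless."):
    print(match)
```

The whole-word rule keeps “yourself” from matching “You,” yet both remaining matches are still wrong. That is the gap the rest of this series is about.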


From Matching to Detection

The gap between string matching and entity detection is filled by scoring — a system of signals that tells you how confident you should be that a given match is the entity it claims to be.

The signals are surprisingly intuitive once you see them:

How distinctive is the name? “Stranger Things” almost always refers to the show. “You” almost never does. A name’s distinctiveness — how often it appears in general text versus how often it means the entity — is the strongest signal.

Where in the document did the match appear? A show title in a document’s heading is a much stronger signal than the same title buried in the body text. In one production system, title-field matches are weighted 50 times higher than body matches.

Did the match preserve capitalization? If someone writes “Stranger Things” with the original capitalization, it’s more likely to be the show than “stranger things” in lowercase. Case-sensitive matches carry higher confidence.

What else is in the document? If a document mentions both “Stranger Things” and “Millie Bobby Brown,” the show match gets a boost — the presence of associated talent corroborates the identification.

Each of these signals can be quantified, weighted, and combined into a confidence score. A match for “Stranger Things” in a document title, properly capitalized, alongside the lead actor’s name, gets a very high score. A match for “You” in the body text, lowercase, with no corroborating evidence, gets a very low score — low enough to be filtered out entirely.
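To show the shape of such a combination, here is a deliberately simple sketch. It is not the scoring model from Part 2: the multiplicative formula, the bonus values, the distinctiveness numbers, and the threshold are all placeholders. Only the 50:1 title-to-body ratio comes from the text above.

```python
from dataclasses import dataclass

@dataclass
class MatchSignals:
    distinctiveness: float  # 0.0 (ordinary word) to 1.0 (almost always the entity); made-up values
    field: str              # "title" or "body"
    case_preserved: bool    # did the text use the catalog's capitalization?
    corroborated: bool      # is a related entity (e.g. the lead actor) also present?

# Illustrative weights. Only the 50:1 title/body ratio is taken from the article.
FIELD_WEIGHT = {"title": 50.0, "body": 1.0}
CASE_BONUS = 1.5
CORROBORATION_BONUS = 2.0
THRESHOLD = 1.0  # hypothetical cut-off below which a match is discarded

def score(signals):
    """Multiply the signals so a near-zero distinctiveness drags the whole score down."""
    value = signals.distinctiveness * FIELD_WEIGHT[signals.field]
    if signals.case_preserved:
        value *= CASE_BONUS
    if signals.corroborated:
        value *= CORROBORATION_BONUS
    return value

# "Stranger Things" in a title, capitalized, next to the lead actor's name.
strong = MatchSignals(distinctiveness=0.95, field="title", case_preserved=True, corroborated=True)
# "you" in body text, lowercase, with nothing to back it up.
weak = MatchSignals(distinctiveness=0.001, field="body", case_preserved=False, corroborated=False)

print(score(strong) >= THRESHOLD, score(weak) >= THRESHOLD)
```

A multiplicative form is only one option. Its appeal here is that no amount of field weight can rescue a name that is almost never the entity.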


The Architecture Preview

The full entity detection pipeline looks like this:

  1. Entity catalog (100K+ names with IDs, alternate forms, metadata)
  2. Build tries (case-sensitive + case-insensitive)
  3. Scan documents (trie walk → raw matches)
  4. Score matches (distinctiveness, field weight, case, corroboration)
  5. Resolve overlaps (highest-confidence match wins)
  6. Entity annotations (document → entities with confidence scores)

The trie handles the middle step — the scan. Everything before and after the scan is what makes entity detection a system rather than just a string search.

In this series, we’ll build up each layer:

  1. The ‘You’ Problem — You are here
  2. Scoring Entity Matches — The disambiguation model
  3. Entity Detection at Scale — Broadcasting tries across a Spark cluster
  4. From Batch to Real-Time — Moving entity detection into a web application

What’s Next

String matching is solved. The trie handles that beautifully. The interesting question is: how do you know a match is real?

In Part 2, we’ll build the scoring model — the system that turns a bag of ambiguous string matches into confident entity annotations. We’ll see how a Wikipedia-derived distinctiveness score, field weighting, case sensitivity, and cross-entity corroboration combine to solve the “You” problem.


Next: Scoring Entity Matches