From Batch to Real-Time — Entity Detection in a Web App
The batch pipeline from Part 3 is powerful. It processes 10 million documents against 8 entity types in minutes, runs overnight, and writes scored results to a table. By morning, every document in the corpus is tagged with its entity mentions.
But what about the document that was written this morning?
A user opens a web app, pastes in some text, and expects to see entity tags instantly — which shows are mentioned, which people, which topics. They can’t wait for the nightly batch run. They need entity detection right now.
The good news: the algorithm is identical. The same dual-trie pattern, the same scoring model, the same overlap resolution. What changes is everything around the algorithm — how the tries get built, how they’re cached, and how multiple entity types are processed concurrently.
Same Algorithm, New Constraints
In the Spark pipeline, the entity detection workflow looks like this:
Load catalog from SQL → Build tries → Broadcast → Scan → Score → Write
Each step can take seconds or minutes because the pipeline runs as a batch job. Memory is plentiful (distributed across a cluster). The tries are built fresh on every run.
In a web application, the constraints are different:
| Concern | Batch (Spark) | Real-Time (Web App) |
|---|---|---|
| Latency | Minutes acceptable | Milliseconds required |
| Catalog loading | Read from SQL table | Fetch from remote storage |
| Trie construction | Build once per job | Build once, reuse for many requests |
| Memory | Distributed across executors | Single process, limited RAM |
| Concurrency | Spark parallelism | async/await |
| Entity types | Separate Spark jobs | Concurrent coroutines |
The core scan — walking the trie character by character — is already fast. On a single core, it processes a typical document (a few kilobytes of text) in microseconds. The latency challenge isn’t the scan itself. It’s everything else: loading catalogs, building tries, and coordinating across entity types.
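To put rough numbers on that, here's a micro-benchmark sketch using the Trie API from the earlier series (the toy terms and document are invented, and absolute timings vary by machine):

import time
from trie_match import Trie

# Toy catalog; production tries hold hundreds of thousands of terms.
trie = Trie()
for term in ["Stranger Things", "The Crown", "Bridgerton"]:
    trie.insert(term, term)

doc = "A press note mentions Stranger Things and The Crown. " * 60  # a few KB

start = time.perf_counter()
matches = trie.find_all(doc, word_boundaries=True)
elapsed = time.perf_counter() - start
print(f"{len(matches)} matches in {elapsed * 1e6:.0f} microseconds")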
The EntityTrie Wrapper
The first step is to encapsulate the dual-trie pattern into a clean abstraction. Instead of managing two tries manually, wrap them in a class:
from dataclasses import dataclass
from trie_match import Trie

# likelihood(), should_include_ci(), and resolve_overlaps() are the
# scoring helpers from Part 2, assumed to be importable here.

@dataclass
class EntityMatch:
    entity_id: str
    likelihood: float
    # Populated during the scan and by detect_all (used by the API below):
    matched_text: str = ""
    start: int = -1
    end: int = -1
    entity_type: str = ""

class EntityTrie:
    """Wraps a case-sensitive and a case-insensitive trie for a single entity type."""

    def __init__(self, catalog, term_frequencies=None):
        self.case_sensitive_trie = Trie()
        self.case_insensitive_trie = Trie()
        # Catalog entries are plain dicts so they round-trip through
        # the JSON cache introduced below.
        for entity in catalog:
            for term in entity["search_terms"]:
                lik = likelihood(term, entity["wiki_frequency"])
                # Case-sensitive: always include, 2× confidence
                self.case_sensitive_trie.insert(
                    term,
                    EntityMatch(entity["id"], lik * 2.0),
                )
                # Case-insensitive: only if not too common
                if should_include_ci(term, term_frequencies):
                    self.case_insensitive_trie.insert(
                        term.lower(),
                        EntityMatch(entity["id"], lik),
                    )

    def detect(self, text):
        """Find and score all entity mentions in text."""
        cs_matches = self.case_sensitive_trie.find_all(
            text, word_boundaries=True
        )
        ci_matches = self.case_insensitive_trie.find_all(
            text.lower(), word_boundaries=True
        )
        all_matches = cs_matches + ci_matches
        resolved = resolve_overlaps(all_matches)
        return resolved
This is the same logic as the Spark UDF from Part 3, but packaged as a reusable Python class. One EntityTrie per entity type. Call detect(text) on any string and get back scored, resolved matches.
The EntityTrie is immutable after construction — once built, it’s thread-safe and can handle concurrent reads from multiple async handlers without locking. This is the same property that made tries ideal for Spark broadcast: build once, read many.
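For example, a toy catalog is enough to exercise it (the ids and frequencies here are invented; the dicts carry the fields the constructor reads):

catalog = [
    {"id": "show:stranger-things", "search_terms": ["Stranger Things"],
     "wiki_frequency": 0.0001},
    {"id": "show:you", "search_terms": ["You"],
     "wiki_frequency": 0.05},
]

shows = EntityTrie(catalog)  # build once, e.g. at app startup
for m in shows.detect("Critics agree: Stranger Things is worth your weekend."):
    print(m.entity_id, round(m.likelihood, 3))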
The Catalog Loading Problem
In Spark, the entity catalog is a SQL table. Read it, build the trie, broadcast. Simple.
In a web app, the catalog lives somewhere else — object storage like S3, a database, or an API behind an authenticated proxy. And it’s big. A full entity catalog with 100,000+ entries, their alternate names, Wikipedia frequencies, and metadata can be tens of megabytes.
You can’t fetch the catalog on every request. The first request would take seconds (network round-trip + deserialization + trie construction), and every subsequent request would do the same work again.
You need caching. And not just one layer of it.
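Before layering caches on top, here's a hedged sketch of the remote fetch those caches will wrap. The bucket name and key layout are invented, and since boto3's client is blocking, the call runs in a worker thread:

import asyncio
import json
import boto3

_s3 = boto3.client("s3")

def _fetch_catalog(entity_type):
    # Blocking S3 read; bucket and key layout are hypothetical.
    obj = _s3.get_object(
        Bucket="entity-catalogs", Key=f"catalogs/{entity_type}.json"
    )
    return json.loads(obj["Body"].read())

async def load_catalog_from_s3(entity_type):
    # Run the blocking fetch in a worker thread so the event loop stays free.
    return await asyncio.to_thread(_fetch_catalog, entity_type)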
Two-Tier Caching
The solution is a cache with two tiers: in-memory for speed and filesystem for durability.
import json
from pathlib import Path

class EntityCache:
    def __init__(self, cache_dir="/tmp/entity_cache"):
        self._memory_cache = {}  # entity_type → EntityTrie
        self._cache_dir = Path(cache_dir)
        self._cache_dir.mkdir(parents=True, exist_ok=True)

    async def get_trie(self, entity_type, loader):
        """
        Get an EntityTrie, checking memory first, then filesystem,
        then loading from remote.
        """
        # Tier 1: in-memory
        if entity_type in self._memory_cache:
            return self._memory_cache[entity_type]

        # Tier 2: filesystem
        cache_path = self._cache_dir / f"{entity_type}.json"
        if cache_path.exists():
            catalog = json.loads(cache_path.read_text())
            trie = EntityTrie(catalog)
            self._memory_cache[entity_type] = trie
            return trie

        # Tier 3: remote fetch
        catalog = await loader(entity_type)
        cache_path.write_text(json.dumps(catalog))
        trie = EntityTrie(catalog)
        self._memory_cache[entity_type] = trie
        return trie

    def invalidate(self, entity_type=None):
        """Clear cache, forcing a re-fetch on next access."""
        if entity_type:
            self._memory_cache.pop(entity_type, None)
            cache_path = self._cache_dir / f"{entity_type}.json"
            cache_path.unlink(missing_ok=True)
        else:
            self._memory_cache.clear()
            for f in self._cache_dir.glob("*.json"):
                f.unlink()
The lifecycle of a catalog:
Remote storage (S3 / database / API)
↓ (first request: fetch over network, ~2-5 seconds)
Filesystem cache (local disk)
↓ (cold start: read from disk, ~200ms)
In-memory cache (Python dict)
↓ (warm: instant)
EntityTrie (ready to scan)
On the first request after a cold start, the app fetches from remote storage, saves to disk, builds the trie, and caches it in memory. On subsequent requests, the trie is already in memory — zero overhead. If the process restarts, the filesystem cache avoids the expensive remote fetch.
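One wrinkle the sketch glosses over: if several requests miss the cache for the same entity type at the same moment, each kicks off its own remote fetch. A per-type asyncio.Lock dedupes the cold path (a minimal sketch layered on the cache above):

import asyncio
from collections import defaultdict

class DedupedCache(EntityCache):
    def __init__(self, cache_dir="/tmp/entity_cache"):
        super().__init__(cache_dir)
        self._locks = defaultdict(asyncio.Lock)  # one lock per entity type

    async def get_trie(self, entity_type, loader):
        # First check is lock-free, so the common warm path stays fast.
        if entity_type in self._memory_cache:
            return self._memory_cache[entity_type]
        async with self._locks[entity_type]:
            # Re-check inside the lock: another coroutine may have
            # populated the cache while we waited.
            return await super().get_trie(entity_type, loader)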
Cache invalidation is TTL-based or event-driven. A daily cron job can call invalidate() to force a refresh, or a webhook from the catalog management system can trigger invalidation when entities are added or removed. The next request after invalidation transparently rebuilds.
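A minimal sketch of the scheduled variant, assuming the app runs a long-lived event loop (the 24-hour interval is an arbitrary choice):

import asyncio

async def refresh_catalogs(cache, interval_seconds=24 * 60 * 60):
    """Background task: periodically drop the cache so catalogs re-fetch."""
    while True:
        await asyncio.sleep(interval_seconds)
        cache.invalidate()  # the next get_trie() call rebuilds from remote

# At startup, e.g. in a FastAPI lifespan handler:
# asyncio.create_task(refresh_catalogs(cache))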
Parallel Entity Type Detection
In the Spark pipeline, entity types run as separate jobs on the cluster. In a web app, they run as concurrent coroutines with asyncio.gather:
import asyncio

class EntityDetector:
    ENTITY_TYPES = [
        "show", "licensed_show", "talent", "employee",
        "country", "topic", "genre", "data_asset",
    ]

    def __init__(self, cache, catalog_loader):
        self.cache = cache
        self.loader = catalog_loader

    async def detect_entities(self, text, entity_type):
        """Detect entities of a single type in text."""
        trie = await self.cache.get_trie(entity_type, self.loader)
        return trie.detect(text)

    async def detect_all(self, text):
        """Detect all entity types in parallel."""
        tasks = [
            self.detect_entities(text, et)
            for et in self.ENTITY_TYPES
        ]
        results = await asyncio.gather(*tasks)

        # Flatten into a single list, tagged by type
        all_matches = []
        for entity_type, matches in zip(self.ENTITY_TYPES, results):
            for match in matches:
                match.entity_type = entity_type
                all_matches.append(match)
        return all_matches
asyncio.gather schedules all 8 entity-type detections concurrently. When the tries are already cached in memory (the common case), each detection is synchronous, CPU-bound work, so the coroutines actually run one after another on the event loop rather than in parallel. Each takes microseconds, though, so the gather still returns effectively instantaneously; the concurrency earns its keep on the cold path.
The first request after a cold start is slower: gather kicks off 8 concurrent remote fetches, each taking a few seconds. But they run in parallel, so the total time is the time of the slowest single fetch — not the sum. And once cached, every subsequent request is fast.
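One caveat on the warm path: detect is synchronous, so a very large document briefly blocks the event loop while it scans. If that matters, the scan can be pushed onto a worker thread, which is safe precisely because EntityTrie is immutable. A hedged variant:

import asyncio

class ThreadedEntityDetector(EntityDetector):
    async def detect_entities(self, text, entity_type):
        trie = await self.cache.get_trie(entity_type, self.loader)
        # Offload the CPU-bound scan; read-only tries tolerate
        # concurrent access from worker threads without locking.
        return await asyncio.to_thread(trie.detect, text)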
Putting It Together: The API
Here’s what the real-time entity detection endpoint looks like in a web framework:
from fastapi import FastAPI
from pydantic import BaseModel

class DetectRequest(BaseModel):
    text: str
    min_confidence: float = 0.5  # default threshold; tune for your catalog

app = FastAPI()
cache = EntityCache()
detector = EntityDetector(cache, load_catalog_from_s3)  # the loader sketched earlier

@app.post("/detect")
async def detect_entities(request: DetectRequest):
    matches = await detector.detect_all(request.text)
    return {
        "entities": [
            {
                "entity_id": m.entity_id,
                "entity_type": m.entity_type,
                "matched_text": m.matched_text,
                "score": m.likelihood,
                "start": m.start,
                "end": m.end,
            }
            for m in matches
            if m.likelihood > request.min_confidence
        ]
    }
A user sends text. The API runs all entity types in parallel, applies the scoring model, and returns tagged entities with confidence scores — all in the time it takes to walk 8 tries through the text. For a typical document, that’s under 10 milliseconds.
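Exercising the endpoint from a client script might look like this (the host, port, and threshold are assumptions):

import requests

resp = requests.post(
    "http://localhost:8000/detect",
    json={
        "text": "Stranger Things tops this week's engagement report.",
        "min_confidence": 0.5,
    },
)
for e in resp.json()["entities"]:
    print(e["entity_type"], e["entity_id"], e["score"])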
Detection vs. Search: Two Sides of the Same Coin
There’s a subtle distinction in how entity detection fits into a larger application. The trie-based system we’ve built answers one question:
Given a document, which entities are mentioned in it?
But users often ask the inverse question:
Given an entity, which documents mention it?
These are complementary operations, and they use different tools:
| Direction | Tool | Technique |
|---|---|---|
| Document → Entities | Trie scan | Walk the trie through the text, emit matches |
| Entity → Documents | Search index | Query an index (e.g., Elasticsearch) for the entity name |
The trie scan is how you enrich documents — adding entity metadata at ingestion time. The search index is how you query documents — finding them by their entity tags at read time.
In a complete system, these feed each other:
New document arrives
↓
Trie scan → entity tags (shows, talent, topics, ...)
↓
Index document + entity tags into search engine
↓
User searches for "Stranger Things"
↓
Search engine returns documents tagged with that entity
The batch pipeline from Part 3 does the enrichment at scale — processing millions of documents overnight. The real-time API from this post does it on demand — tagging a new document the moment it’s created. The search engine serves the inverse query using the tags that both systems produce.
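In code, the enrichment half of that loop is just the detector feeding an indexer. A sketch, assuming an Elasticsearch cluster at a hypothetical local address and an invented index name and document shape:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

async def ingest_document(doc_id, text):
    # Enrich: one pass tags the document with every entity type
    # (reuses the module-level detector from the API section).
    matches = await detector.detect_all(text)
    # Index: storing entity_ids alongside the text turns the inverse
    # question ("which documents mention this entity?") into a filter.
    es.index(  # blocking client call, kept simple for the sketch
        index="documents",
        id=doc_id,
        document={
            "text": text,
            "entity_ids": sorted({m.entity_id for m in matches}),
        },
    )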
This is why the entity detection system works well as infrastructure rather than a standalone feature. It produces structured metadata that downstream systems (search, recommendations, analytics, dashboards) can consume. The trie scan is the engine. The entity tags are the fuel.
What We Built
Let’s zoom out. Over four posts, we’ve built a complete entity detection system:
Part 1: The “You” Problem — The problem: string matching finds “You” everywhere, but almost never means the show. Entity detection requires scoring, not just matching.
Part 2: Scoring Entity Matches — The scoring model: five signals (distinctiveness, case sensitivity, field weighting, corroboration, pre-filtering) that layer together to separate real entity mentions from coincidental string overlaps.
Part 3: Entity Detection at Scale — The distributed architecture: a DAG of parallel Spark jobs, each broadcasting dual tries and scanning millions of documents. 460,000 entity terms, 10 million documents, 5 minutes.
Part 4: From Batch to Real-Time — The same algorithm in a web app: EntityTrie wrappers, two-tier caching, asyncio.gather for parallel entity type detection. Under 10 milliseconds per document.
The through-line is the trie. The data structure we explored in the Trie series — a tree that shares prefixes — turned out to be the foundation of a production system that tags millions of documents with hundreds of thousands of entity names. Its properties (compact, serializable, read-only, fast) make it equally suited to a Spark broadcast and an in-memory web cache.
But the trie is just the scanner. What makes entity detection work is everything around it: the scoring model that knows “Stranger Things” is distinctive and “You” is not, the dual-trie pattern that treats capitalization as signal, the field weighting that values title matches over body matches, and the corroboration that recognizes when multiple entities reinforce each other.
The system we’ve described is deterministic, scalable, and maintainable. It doesn’t require training data, GPU clusters, or model versioning. It requires a curated catalog, a well-tuned scoring model, and a trie. Sometimes the best approach to a hard problem isn’t the most sophisticated one — it’s the one whose failure modes you can understand and fix.
The Series
- The “You” Problem — Why entity detection is harder than string matching
- Scoring Entity Matches — The disambiguation model
- Entity Detection at Scale — Broadcasting tries across a Spark cluster
- From Batch to Real-Time — You are here
Previous: Entity Detection at Scale