The Reporting Layer: Building a Data API for Self-Service Analytics
The pipeline processes the data. The aggregation layer rolls it up. But none of that matters if the people who need the numbers can’t get to them.
Earlier in this series, we covered how data flowed through Hulu’s pipeline from raw beacons to structured basefacts and up through aggregation layers. But a pipeline that produces clean, aggregated data is only half the story. The other half is getting that data to the humans who make decisions with it.
At Hulu, the reporting layer was a multi-service stack that turned pipeline output into on-demand, self-service analytics. It’s worth examining because it solved a problem that nearly every data team faces: how do you give report users autonomy without giving them footguns?
The Challenge: From Aggregated Tables to Answers
By the time data reached the reporting layer, it had been through the wringer: collected as beacons, transformed by MapReduce into basefacts, and rolled up through hourly, daily, weekly, monthly, and quarterly aggregations. The data was clean and structured.
But “clean and structured” doesn’t mean “accessible.” The aggregated data lived in Hive tables — powerful but not user-friendly. Expecting a product manager or content strategist to write Hive queries was unrealistic.
The team needed a stack that could:
- Let non-technical users request reports without writing code
- Validate requests before executing them (Are these columns available? Is the date range valid?)
- Generate efficient queries from high-level specifications
- Schedule and queue reports so the cluster wasn’t overwhelmed
- Present results in a browsable, shareable format
The Architecture
The reporting stack was a pipeline-within-the-pipeline, composed of distinct services:
Reporting Portal UI (RP2)
The user-facing web application. Report users — analysts, product managers, content teams — interacted with this portal to:
- Browse available metrics and dimensions
- Select columns, date ranges, and filters
- Submit report requests
- View completed reports
- Schedule recurring reports
The key design decision: the portal presented domain concepts, not database concepts. Users saw “video starts by content partner, daily, last 30 days” — not “SELECT content_partner_id, SUM(total_count) FROM playback_daily WHERE hourid BETWEEN…”
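To make that concrete, here’s roughly what a declarative report specification might look like once the portal serialized a user’s selections. The field names and shape below are illustrative assumptions, not Hulu’s actual request format.

```python
# Illustrative report specification as a portal like RP2 might serialize it.
# Every field name here is an assumption for the sketch, not Hulu's schema.
report_request = {
    "dataset": "playback",             # domain concept, not a Hive table name
    "metrics": ["video_starts"],
    "dimensions": ["content_partner"],
    "granularity": "daily",
    "date_range": {"last_n_days": 30},
    "filters": [],
}
# The Report Controller, not the user, decides that this maps to something like
# SELECT content_partner_id, SUM(total_count) FROM playback_daily WHERE ...
```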
Report Controller
The orchestration service. When a user submitted a report request through the portal, the Report Controller:
- Validated the request — checked that the requested columns existed, the date range was within available data, and the combination of dimensions and metrics was valid
- Generated the query — translated the high-level report specification into an optimized Hive query
- Submitted the query to the execution layer
Validation before execution was crucial. Without it, users would submit malformed requests, wait 20 minutes for execution, and get a cryptic Hive error. With it, they got immediate, friendly feedback: “The column ‘device_type’ isn’t available for ad impression data. Did you mean ‘platform_type’?”
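Here’s a rough sketch of that validation step. The column catalog, dataset names, and fuzzy-match threshold are illustrative assumptions; in the real system, this metadata came from the Data API rather than a hard-coded dictionary.

```python
import difflib

# Hypothetical column catalog; the real metadata lived behind the Data API.
AVAILABLE_COLUMNS = {
    "ad_impressions": {"platform_type", "campaign_id", "impression_count"},
    "playback": {"content_partner", "platform_type", "video_starts"},
}

def validate_request(dataset: str, columns: list[str]) -> list[str]:
    """Return human-readable errors; an empty list means the request is valid."""
    known = AVAILABLE_COLUMNS.get(dataset)
    if known is None:
        return [f"Unknown dataset '{dataset}'."]
    errors = []
    for col in columns:
        if col not in known:
            close = difflib.get_close_matches(col, known, n=1, cutoff=0.4)
            hint = f" Did you mean '{close[0]}'?" if close else ""
            errors.append(f"The column '{col}' isn't available for {dataset} data.{hint}")
    return errors

# Immediate, friendly feedback instead of a long wait and a Hive stack trace:
print(validate_request("ad_impressions", ["device_type"]))
# ["The column 'device_type' isn't available for ad_impressions data. Did you mean 'platform_type'?"]
```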
Data API Service
The abstraction layer between the portal and the underlying data infrastructure. The Data API:
- Exposed available columns and their metadata (types, descriptions, valid date ranges)
- Handled authentication and authorization (not every user could see every metric)
- Provided a consistent interface regardless of whether the underlying data was in Hive, a published database, or a cache
This service decoupled the portal from the storage layer. If the team moved data from Hive to a faster query engine, the portal didn’t need to change.
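A minimal sketch of that decoupling, with hypothetical class and method names (the real Data API was a service with its own contract, not a Python interface):

```python
from abc import ABC, abstractmethod

class ReportBackend(ABC):
    """The contract the portal codes against, whatever the storage layer is."""

    @abstractmethod
    def list_columns(self, dataset: str) -> list[dict]:
        """Column metadata: name, type, description, valid date range."""

    @abstractmethod
    def run(self, query: str) -> list[dict]:
        """Execute a query and return rows."""

class HiveBackend(ReportBackend):
    def list_columns(self, dataset):
        ...  # e.g. read from the Hive metastore or a metadata table

    def run(self, query):
        ...  # e.g. hand the query to HiveRunner and wait for results

class PublishedDbBackend(ReportBackend):
    def list_columns(self, dataset):
        ...  # e.g. read the published database's schema

    def run(self, query):
        ...  # e.g. query the fast store directly

def backend_for(dataset: str) -> ReportBackend:
    # Placeholder routing rule: pre-published datasets go to the fast store;
    # everything else falls back to Hive.
    return PublishedDbBackend() if dataset == "playback_daily" else HiveBackend()
```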
HiveRunner
The execution engine. HiveRunner:
- Received generated queries from the Report Controller
- Managed a queue of pending queries
- Executed queries against the Hive cluster
- Handled retries, timeouts, and resource management
- Returned results to the Report Controller for delivery to the portal
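A stripped-down sketch of the pattern: a worker drains a queue of generated queries, runs each against Hive with a timeout, and retries with backoff on failure. The “hive -e” invocation, timeout, and retry values are assumptions for illustration, not Hulu’s actual configuration.

```python
import queue
import subprocess
import threading
import time

pending = queue.Queue()  # generated HiveQL strings from the Report Controller

def run_query(hql: str, timeout_s: int = 1800, max_retries: int = 2) -> str:
    """Run one query against Hive with a timeout, retrying with backoff on failure."""
    for attempt in range(max_retries + 1):
        try:
            result = subprocess.run(
                ["hive", "-e", hql],
                capture_output=True, text=True, timeout=timeout_s,
            )
            if result.returncode == 0:
                return result.stdout
        except subprocess.TimeoutExpired:
            pass  # treat a timeout like any other failed attempt
        time.sleep(30 * (attempt + 1))  # back off before retrying
    raise RuntimeError(f"Query failed after {max_retries + 1} attempts")

def worker() -> None:
    while True:
        hql = pending.get()
        try:
            results = run_query(hql)
            # ...return `results` to the Report Controller for delivery to the portal
        finally:
            pending.task_done()

threading.Thread(target=worker, daemon=True).start()
```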
Scheduler
For recurring reports — “send the daily content performance report to the Content Team every morning at 8 AM” — the Scheduler service handled:
- Cron-like scheduling of report executions
- Dependency checking (don’t run the daily report if the daily aggregation hasn’t completed yet)
- Notification on completion or failure
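A sketch of the dependency check, with hypothetical helper names; the completion check is a placeholder for however the pipeline actually signaled that an aggregation had landed.

```python
import datetime

def aggregation_complete(table: str, day: datetime.date) -> bool:
    # Placeholder: in practice this might check a partition marker, a
    # _SUCCESS file, or a job-status table maintained by the pipeline.
    return False

def notify(message: str) -> None:
    print(message)  # stand-in for email / chat notification

def submit_report(name: str, day: datetime.date) -> None:
    ...  # hand the request off to the Report Controller

def run_scheduled_daily_report(day: datetime.date) -> None:
    """Fired by the cron-like scheduler; runs only if its input data is ready."""
    if not aggregation_complete("playback_daily", day):
        notify(f"Daily content performance report for {day} delayed: aggregation not ready")
        return
    submit_report("daily_content_performance", day)
```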
Published Databases
For frequently accessed, latency-sensitive reports, the aggregated data was published to optimized databases (outside Hive) that could serve queries faster. This was the “popular data publishing” path: identify which reports are run most often, pre-compute their results, and serve them from a fast store.
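The mechanics might look something like this: count which report specifications recur, pre-compute the most popular ones, and serve future requests from a fast store keyed by a hash of the spec. The store, threshold, and hashing scheme here are all illustrative assumptions.

```python
import hashlib
import json
from collections import Counter

def spec_key(spec: dict) -> str:
    """Stable key for a report specification."""
    return hashlib.sha1(json.dumps(spec, sort_keys=True).encode()).hexdigest()

def publish_popular(request_log: list[dict], run_query, fast_store: dict, top_n: int = 20) -> None:
    """Pre-compute the most frequently requested reports into the fast store."""
    counts = Counter(spec_key(spec) for spec in request_log)
    specs_by_key = {spec_key(spec): spec for spec in request_log}
    for key, _ in counts.most_common(top_n):
        fast_store[key] = run_query(specs_by_key[key])

def serve(spec: dict, run_query, fast_store: dict):
    """Serve from the published store when possible; fall back to Hive."""
    cached = fast_store.get(spec_key(spec))
    return cached if cached is not None else run_query(spec)
```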
The Flow
Here’s a complete request lifecycle:
User opens Reporting Portal (RP2)
→ Browses available columns via Data API
→ Selects dimensions, metrics, date range, filters
→ Submits report request
Report Controller receives request
→ Validates columns and date range (via Data API)
→ Generates optimized Hive query
→ Submits to HiveRunner queue
HiveRunner executes query
→ Manages queue position and cluster resources
→ Runs query against Hive
→ Returns results
Report Controller delivers results
→ Formats for display in portal
→ Stores for later retrieval
→ Notifies user: "Your report is ready"
User views completed report in RP2
Design Lessons
Several aspects of the reporting architecture generalize well beyond Hulu:
1. Validate Before You Execute
The single biggest usability improvement in the reporting stack was front-loading validation. Every invalid request caught at submission time is a query that doesn’t waste cluster resources and a user who doesn’t wait 15 minutes for an error.
Modern analytics tools (Looker, dbt, Metabase) all do this: type-check the query, validate field references, and check data availability before execution. In 2013, none of that came off the shelf; building this validation layer was a deliberate architectural choice.
2. Separate the Interface from the Engine
The Data API’s abstraction layer meant the portal could evolve independently from the data infrastructure. This is the same principle behind modern semantic layers (dbt metrics, Looker’s LookML, Cube.js): define the business logic once in a middle tier, and let the UI and the query engine each do their job.
3. Queue and Prioritize
Not all report requests are equal. A recurring daily report for the executive team should take priority over an ad-hoc exploration. The Scheduler and HiveRunner’s queue handled this, ensuring that high-priority reports completed on time even when the cluster was busy.
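A minimal illustration of that ordering, assuming a simple numeric priority (lower runs first) with FIFO order within each tier:

```python
import heapq
import itertools

_queue, _submission_order = [], itertools.count()

def enqueue(report: str, priority: int) -> None:
    # Lower number = higher priority; the counter preserves FIFO order within a tier.
    heapq.heappush(_queue, (priority, next(_submission_order), report))

def next_report() -> str:
    return heapq.heappop(_queue)[2]

enqueue("ad-hoc exploration", priority=5)
enqueue("daily executive report", priority=1)
assert next_report() == "daily executive report"
```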
4. Publish Popular Data
The hot-path optimization — pre-computing and publishing frequently accessed reports to faster stores — is the same pattern that today’s analytics engineers call “materialized views” or “metric layer caching.” The insight was the same in 2013: if you know what people will ask for, prepare the answer in advance.
5. Domain Language in the UI
The portal presented domain concepts (metrics, dimensions, content partners, platforms), not database concepts (tables, columns, JOINs). This is what made it truly self-service. Users didn’t need to know that “video starts by partner” meant a specific Hive query against a specific table with specific aggregation — they just asked for what they wanted.
This echoes the “declarative over imperative” philosophy that drove BeaconSpec and MVEL. At every layer — metric definition, data filtering, and now report generation — the pattern was the same: express what you want, and let the system handle how.
How This Connects to Monitoring
One important detail: the reporting layer was also a consumer of the pipeline, which made it a critical part of the monitoring graph we discussed in Post 6.
When a MapReduce job failed, the impact rippled through: basefacts weren’t produced, aggregations stalled, and reports returned stale data (or failed entirely). The graph-based monitoring system modeled these connections: Job → Table → Aggregation → Published Data → Report → User Group.
The reporting architecture’s clean service separation made this modeling straightforward. Each service boundary was a natural node in the graph. The scheduled reports provided the link between data infrastructure and business users. Without the reporting layer’s structure, the monitoring graph would have had no way to connect failures to the humans who cared about them.
This post is part of a series based on Monitoring the Data Pipeline at Hulu, presented at Hadoop Summit 2014. See the full Hulu Pipeline series for more.