Rosetta Stone: Scalable ID Mapping System for Tubi's Content Library Using LLMs and Embeddings

This article describes how Tubi built the Rosetta Stone system, a flexible ID mapping workflow that uses large language models, embedding similarity ranking, and K-nearest-neighbors search to unify and enrich metadata across a 200,000-title library, improve content recommendation, and streamline operations.

Bitu Technology

Jan 17, 2024

As the most popular free streaming service in the United States, Tubi aims to let anyone watch stories from around the world for free. By November 2023 its monthly active users exceeded 74 million, a success built on high-quality content.

Tubi’s catalog contains about 200,000 movies and TV series, making it one of the largest content libraries globally, with daily updates from numerous third‑party sources that bring rich metadata.

Managing such a massive library poses challenges, so Tubi created a unified ID space and the Rosetta Stone system to automatically match each piece of metadata with its corresponding content.

Rosetta Stone is a flexible ID mapping system

ID Mapping System

Creating a standardized ID space is essential for monitoring and linking all content; it underpins content processing, analytics, recommendation, partner payments, and platform‑wide operations.

When content lacks a recognizable ID, automatic matching becomes critical, especially during early data-cleaning phases: it reduces manual effort and lets staff focus on higher-impact work.

Systematic mapping between ID spaces leverages metadata, text descriptions, images, reviews, popularity ratings, and performance metrics to enrich original metadata, improve content understanding, and optimize recommendations.

Large Language Models (LLMs) Provide New Perspectives

Based on research and analysis, Tubi adopted an embedding‑similarity ranking approach to solve ID matching, after earlier methods proved unsatisfactory.

Rapid advances in LLMs now offer an effective way to match content and deliver strong results.

LLM technology maps text into a unified semantic space, enhancing fuzzy matching across different ID spaces and excelling at classifying and recognizing similar content styles, thereby filling missing metadata and accurately locating content.

As a product‑centric company, Tubi aims to use LLMs to build an advanced content‑metadata embedding space that serves various teams and use cases.

Rosetta Stone Workflow

We can understand the Rosetta Stone workflow in three steps:

1. We created an embedding library containing all relevant content with ground-truth IDs. An illustration in the original post shows the combinatorial characteristics we must handle; expanding the combinations improves similarity-score analysis and ID-matching accuracy.

2. To find a matching ID for a request, we construct a structured string with the same metadata order as the Rosetta Stone embedding DB and generate its embedding using the same LLM.

3. We run a K‑Nearest‑Neighbors (KNN) algorithm to retrieve and rank matching content from the pre‑stored embedding DB by similarity score, returning multiple results to improve recall.
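The three steps above can be sketched roughly as follows. This is a minimal illustration, not Tubi's implementation: `toy_embed` is a hashed character-trigram stand-in for the LLM embedding call, and the field order, IDs, and sample records are hypothetical. The similarity score is assumed to be cosine similarity, which the article does not specify.

```python
import math
import zlib
from typing import Dict, List, Tuple

DIM = 256  # size of the toy embedding vector (assumption)
FIELD_ORDER = ("title", "year", "director")  # hypothetical metadata order


def to_structured_string(meta: Dict[str, str]) -> str:
    """Serialize metadata in one fixed field order, as step 2 requires."""
    return " | ".join(f"{f}: {meta.get(f, '').strip().lower()}" for f in FIELD_ORDER)


def toy_embed(text: str) -> List[float]:
    """Stand-in for the LLM embedding: hashed character trigrams, L2-normalized."""
    vec = [0.0] * DIM
    for i in range(len(text) - 2):
        vec[zlib.crc32(text[i:i + 3].encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_library(records: Dict[str, Dict[str, str]]) -> Dict[str, List[float]]:
    """Step 1: embed every reference record, keyed by its ground-truth ID."""
    return {ref_id: toy_embed(to_structured_string(m)) for ref_id, m in records.items()}


def knn(query: Dict[str, str], library: Dict[str, List[float]], k: int = 3) -> List[Tuple[str, float]]:
    """Steps 2 and 3: embed the request, then rank stored IDs by cosine similarity."""
    q = toy_embed(to_structured_string(query))
    # Vectors are unit length, so the dot product equals cosine similarity.
    scored = [(ref_id, sum(a * b for a, b in zip(q, v))) for ref_id, v in library.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]


reference = {
    "ref-001": {"title": "Seven Samurai", "year": "1954", "director": "Akira Kurosawa"},
    "ref-002": {"title": "The Magnificent Seven", "year": "1960", "director": "John Sturges"},
    "ref-003": {"title": "Rashomon", "year": "1950", "director": "Akira Kurosawa"},
}
library = build_library(reference)
matches = knn(
    {"title": "The Seven Samurai", "year": "1954", "director": "Kurosawa Akira"},
    library,
    k=2,
)
print(matches)
```

Returning the top `k` candidates rather than a single best match mirrors the recall argument in step 3. At catalog scale, the brute-force scan would likely be replaced by an indexed nearest-neighbor search.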

With this LLM-backed content recognition and matching system, we adopted a comprehensive third-party reference dataset as the standard ID space. That gave us a unified ID mapping system and high-accuracy matches against multiple third-party content libraries.

How to Use Rosetta Stone

Optimizing Tubi’s Content Management System

The Rosetta Stone tool has had an immediate impact, most notably correcting erroneous reference IDs in the catalog, especially for foreign films and alternate titles.

Several Tubi teams and backend applications rely on statistics derived from reference IDs; inaccurate references propagate negative effects to downstream services. By correcting these IDs, the catalog becomes more accurate and robust, giving teams confidence in subsequent analytics.
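One way such a correction pass could work is to flag catalog rows whose stored reference ID does not appear among the top-k candidates returned by the matcher. The article does not describe the actual procedure, so the sketch below is an assumption throughout: `audit_reference_ids`, `fake_search`, the IDs, and the scores are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Ranked = List[Tuple[str, float]]  # (reference ID, similarity score), best first


def audit_reference_ids(
    catalog: Dict[str, Tuple[str, str]],
    search: Callable[[str, int], Ranked],
    k: int = 3,
) -> List[Tuple[str, str, str]]:
    """Return (catalog ID, stored reference ID, suggested reference ID) for suspect rows."""
    suspects = []
    for catalog_id, (metadata_text, stored_ref) in catalog.items():
        ranked = search(metadata_text, k)
        top_ids = [ref_id for ref_id, _ in ranked]
        # Flag the row only when the stored ID is absent from the top-k candidates.
        if ranked and stored_ref not in top_ids:
            suspects.append((catalog_id, stored_ref, top_ids[0]))
    return suspects


def fake_search(metadata_text: str, k: int) -> Ranked:
    """Hypothetical stand-in for the embedding search; the scores are made up."""
    canned = {
        "la haine | 1995": [("ref-10", 0.94), ("ref-77", 0.61)],
        "hero | 2002": [("ref-20", 0.97), ("ref-21", 0.58)],
    }
    return canned.get(metadata_text, [])[:k]


catalog = {
    "tubi-1": ("la haine | 1995", "ref-77"),  # stored ID is in the top-k, so left alone
    "tubi-2": ("hero | 2002", "ref-99"),      # stored ID is absent, so flagged
}
suspects = audit_reference_ids(catalog, fake_search)
print(suspects)
```

Flagged rows are suggestions for review rather than automatic overwrites, which keeps a human in the loop for ambiguous cases such as alternate titles.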

Tubi Uses Rosetta Stone

Matching Third‑Party Movie Resources

Tubi has integrated multiple commercial third‑party movie and TV databases to fill missing information and enrich content descriptions. While strict matching conditions allow some third‑party results to align with Tubi’s library, many remain unmatched due to incomplete data. Rosetta Stone enabled us to retrieve a large number of previously unmatched third‑party entries, greatly enhancing our catalog.
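The combination of strict matching with an embedding fallback might look like the sketch below. It assumes the strict condition is an exact match on normalized title and year, which the article does not state. `embed` is again a toy trigram stand-in for the LLM embedding, and `MIN_SCORE` is an arbitrary threshold chosen for illustration.

```python
import math
import zlib
from typing import Dict, List, Optional, Tuple

DIM = 256        # toy embedding size (assumption)
MIN_SCORE = 0.6  # arbitrary acceptance threshold (assumption)


def embed(text: str) -> List[float]:
    """Toy stand-in for an LLM embedding: hashed character trigrams, L2-normalized."""
    vec = [0.0] * DIM
    for i in range(len(text) - 2):
        vec[zlib.crc32(text[i:i + 3].encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def key(meta: Dict[str, str]) -> str:
    """Normalized title and year used for both matching stages."""
    return f"{meta['title'].strip().lower()} ({meta['year']})"


def match(entry: Dict[str, str], catalog: Dict[str, Dict[str, str]]) -> Tuple[Optional[str], str]:
    """Return (catalog ID or None, which stage produced the result)."""
    entry_key = key(entry)
    # Stage 1: strict match on the normalized key.
    for catalog_id, meta in catalog.items():
        if key(meta) == entry_key:
            return catalog_id, "strict"
    # Stage 2: nearest neighbor by cosine similarity of the toy embeddings.
    q = embed(entry_key)
    scored = [
        (catalog_id, sum(a * b for a, b in zip(q, embed(key(meta)))))
        for catalog_id, meta in catalog.items()
    ]
    if not scored:
        return None, "unmatched"
    best_id, best_score = max(scored, key=lambda s: s[1])
    if best_score >= MIN_SCORE:
        return best_id, "embedding"
    return None, "unmatched"


catalog = {
    "tubi-1": {"title": "Crouching Tiger, Hidden Dragon", "year": "2000"},
    "tubi-2": {"title": "Hero", "year": "2002"},
}
exact = match({"title": "Hero", "year": "2002"}, catalog)
fuzzy = match({"title": "Crouching Tiger Hidden Dragon", "year": "2000"}, catalog)
missing = match({"title": "Unknown Documentary", "year": "1999"}, catalog)
print(exact, fuzzy, missing)
```

The second entry fails the strict stage because of a missing comma but is recovered by the embedding stage, which is the kind of previously unmatched record the paragraph above describes.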

Summary

Rosetta Stone is a powerful system for managing complex content metadata; it expands and enriches the library, delivering highly personalized viewing experiences.

Thanks

Special thanks to Yuanbo Chen and John Trenkle for authoring this blog, to the Tubi product and machine‑learning teams for their close collaboration, and to ML CTO Clair Dorman and VP of ML Technology Jaya Kawale for reviewing.

We Are Hiring!

If you are interested in high-impact, large-scale projects like Rosetta Stone, join Tubi’s China team as a Big Data Platform Development Lead!

Authors: Rudra Roy Choudhury, Yuanbo Chen, John Trenkle

Translator: Yuanbo Chen

Proofreader: Shengwu Yang


Written by

Bitu Technology

Bitu Technology is the registered company of Tubi's China team. We are engineers passionate about leveraging advanced technology to improve lives, and we hope to use this channel to connect and advance together.
