Cognitive Creations Strategy · Governance · PMO · Agentic AI

Case Study: Spotify — Domain Data, LLMs, and MCP-Style Tool Orchestration

This article explains how Spotify typically solves the “domain-specific data” problem: they do not rely on buying a generic domain dataset. Instead, they generate high-quality signals from their own platform at massive scale and use them to train, evaluate, and continuously improve ML systems. We also map a practical MCP-style approach for orchestrating LLM tools safely and reliably in a large product ecosystem.


1 — Overview

Spotify does not rely on buying generic domain datasets; they generate signals from their own platform at scale. This case study maps domain data, model ownership, LLM value, and a practical MCP-style orchestration pattern; each numbered section below covers one part of that story.

2 — Study case

Problem Framing
“Where does domain data come from?”
People often assume “domain data” means buying a dataset from a vendor. In consumer platforms like Spotify, the domain data is primarily produced inside the product.
“Do they fine-tune from customer feedback?”
Yes — but the richest feedback is usually implicit (behavior) rather than explicit ratings.
“How do they prevent model/tool chaos?”
Large systems need orchestration: tool catalogs, policy checks, routing, evaluation, observability, and rollback.

Answer

One Sentence

Spotify does not need to “buy” domain data for its core. Spotify is the domain: users continuously generate domain-specific signals through listening behavior, playlists, searches, and engagement patterns. These signals become training/evaluation data that improves recommendation, ranking, discovery, and (in modern setups) LLM-powered experiences.

The hard part is less “finding data” and more: capturing it reliably, turning it into usable labels, controlling bias and privacy, and operationalizing continuous learning without harming user trust.

Diagram: Users & Context (listening, search, device, time, locale) → Behavioral Signals (plays, skips, repeats, playlist adds — "implicit labels" at massive scale) → Models (ranking, retrieval, generation, safety) → Product Value (discovery, personalization, retention; new surfaces: DJ, semantic search). Flywheel: more engagement → better signals → better models → better product → more engagement.
3 — Spotify's “DOMAIN DATA”

What It Actually Is
Behavioral Data
High-volume event streams: plays, skips, replays, session time, searches, shares, follows, playlist edits.
Content Understanding
Audio analysis and embeddings (tempo, timbre, energy), track/artist similarity, mood proxies.
Editorial + Knowledge Graph
Human curation, playlist taxonomy, artist relations, metadata enrichment (often multi-source).

Why Spotify Usually Doesn’t Buy “Domain Datasets”

Competitive Edge

In many industries (legal, medical, finance), domain data is scarce and often purchased or licensed from specialized providers. Spotify’s case is different: the most valuable signals are interaction sequences and preference trajectories that only exist inside Spotify’s own product. No third party can sell “how Spotify users behave on Spotify” at Spotify-scale.

Domain data category | Where it comes from | How it becomes training signal | Typical use
Implicit feedback events | In-app interactions (play/skip/replay/add/search) | Convert to labels: positive/negative preference proxies, dwell time, satisfaction metrics | Ranking, personalization, session modeling
Explicit feedback actions | Likes, hides, follows, blocks, "not interested" | High-precision labels but lower volume; used for calibration and constraints | Personalization guardrails, user control
Content features & embeddings | Audio analysis, track descriptors, learned representations | Self-supervised / contrastive learning; similarity; cold-start support | Discovery, similarity search, mix building
Metadata + editorial taxonomy | Label/artist metadata, curation, internal knowledge structures | Weak labels for genre/mood; validation sets; semantic navigation | Browse, playlists, search facets

Takeaway: Spotify’s domain data strategy is primarily first-party (generated from usage). They may still license certain data types (e.g., music rights and some metadata), but the “secret sauce” is the behavioral signal + representation learning that competitors can’t replicate easily.
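The cold-start row in the table above can be made concrete with a toy sketch: a brand-new track with zero plays can still be surfaced by nearest-neighbor search over content embeddings. The vectors, track names, and dimensionality below are invented for illustration, not Spotify's actual representation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy content embeddings (e.g., derived from audio analysis); values are invented.
catalog = {
    "known_chill_track": [0.9, 0.1, 0.0],
    "known_metal_track": [0.0, 0.2, 0.9],
    "known_focus_track": [0.8, 0.3, 0.1],
}

def cold_start_neighbors(new_track_embedding, k=2):
    """Rank catalog tracks by content similarity -- no play history required."""
    scored = [(cosine(new_track_embedding, v), name) for name, v in catalog.items()]
    return [name for _, name in sorted(scored, reverse=True)[:k]]

# A brand-new instrumental track lands near the "chill"/"focus" region.
print(cold_start_neighbors([0.85, 0.2, 0.05]))
```

In production this nearest-neighbor step would run against an approximate index rather than a linear scan, but the principle — content similarity substitutes for missing behavioral signal — is the same.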

4 — Model Ownership

Owned vs. Licensed
What’s typically “owned”
Event data pipelines, derived labels, embeddings, ranking models, retrieval indexes, evaluation suites, orchestration and governance.
What’s typically “licensed”
Music content rights, some metadata sources, lyrics where available, external catalogs—then enriched internally.
LLM source (often hybrid)
Many companies use a mix: internal models + vendor foundation models for specific tasks. The key is the domain signals and safe tool access.

How Spotify-Style Systems Train Models

Practical Blueprint

Spotify’s public engineering narrative emphasizes large-scale ML for recommendations and discovery. For teaching purposes, you can describe their training approach as a layered system:

1) Data capture + quality
Instrument the product to emit clean events; enforce schemas; deduplicate; handle bots; build privacy-preserving aggregation where needed.
2) Label construction
Transform events into “weak labels” (e.g., skip under 10s ≈ negative); calibrate with explicit actions; build holdout sets.
3) Representation learning
Learn embeddings for users, tracks, sessions. This supports similarity search, clustering, and cold-start recommendations.
4) Ranking + retrieval
Two-stage systems: retrieval narrows candidates; ranker orders them based on predicted satisfaction and constraints.

For LLM-adjacent features (e.g., conversational discovery, semantic search, “DJ” experiences), the training usually shifts to a hybrid of: retrieval + generation, tool calls, and strong evaluation/guardrails rather than pure “fine-tune everything.”

# Illustrative (teaching) pseudo-code: converting behavior to weak labels
events = stream("playback_events")  # play, skip, replay, add_to_playlist

labels = []
for e in events:
    if e.type == "skip" and e.seconds_listened < 10:
        labels.append((e.user, e.track, "negative"))
    elif e.type == "add_to_playlist":
        labels.append((e.user, e.track, "strong_positive"))
    elif e.type == "replay" and e.seconds_listened > 60:
        labels.append((e.user, e.track, "positive"))

train_ranker(labels, features=["user_embed", "track_embed", "context", "session"])
evaluate(rank_metrics=["NDCG", "MAP", "retention_proxy"], safety_checks=["bias", "privacy"])
deploy_with_canary()

Classroom nuance: “fine-tuning” can mean many things (ranking model updates, embedding training, LLM adapter tuning, preference optimization). Spotify is best explained as a continuous learning system driven by first-party signals, not as a one-time LLM fine-tune.

5 — Where LLMs Create Value

Product Surfaces
Semantic Search
Understand intent beyond keywords (“chill indie for studying”).
Discovery Narratives
Explain recommendations; generate context; guide exploration safely.
Support & Operations
Customer support, internal analytics, incident triage, content policy workflows.

LLMs vs “Classic ML” in Spotify-Style Systems

Division of Labor

In a large consumer platform, LLMs rarely replace recommendation rankers outright. Instead, LLMs typically serve as a natural-language layer over: (1) retrieval/ranking systems, (2) knowledge graphs/metadata, (3) tooling and business rules.

Capability | Classic ML (rankers/embeddings) | LLMs (generation/tool-using agents) | Best combined pattern
Personalization ranking | Excellent at optimizing engagement proxies at scale | Not ideal for direct ranking alone (cost/latency) | LLM explains / steers; rankers decide ordering
Semantic intent | Needs engineered features or embeddings + heuristics | Strong at intent extraction and reformulation | LLM rewrites query → retrieval → ranker → response
Explaining "why" | Limited interpretability | Strong narrative generation (with constraints) | LLM uses grounded facts + policy templates
Multi-step workflows | Requires orchestration logic | Can plan + call tools if governed | MCP-style tool catalog + sandbox + audit

Teaching point: LLMs add most value when they are grounded (retrieval + structured tool calls), and when the org has governance to prevent hallucinations, unsafe actions, and uncontrolled tool sprawl.
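The "LLM rewrites query → retrieval" pattern can be sketched as a small pipeline. `llm_rewrite` below is a keyword stub standing in for a real model call, and the catalog fields (`instrumental`, `mood`) are invented; the point is that the LLM's output is a structured constraint object, and every candidate the user sees is grounded in the catalog.

```python
def llm_rewrite(query):
    """Stub for an LLM call that normalizes free-text intent into
    structured constraints. A real system would invoke a model here."""
    q = query.lower()
    return {
        "instrumental": "no vocals" in q or "instrumental" in q,
        "mood": "calm" if "calm" in q or "chill" in q else "any",
    }

def retrieve(constraints, catalog):
    """Grounding step: return only tracks satisfying the structured constraints."""
    return [
        t for t, f in catalog.items()
        if f["instrumental"] == constraints["instrumental"]
        and constraints["mood"] in (f["mood"], "any")
    ]

catalog = {  # invented catalog features
    "focus_beats": {"instrumental": True,  "mood": "calm"},
    "pop_anthem":  {"instrumental": False, "mood": "upbeat"},
    "piano_drift": {"instrumental": True,  "mood": "calm"},
}

constraints = llm_rewrite("chill instrumental, no vocals")
print(retrieve(constraints, catalog))  # → ['focus_beats', 'piano_drift']
```

Because the LLM never emits track names directly — only constraints that are resolved against the catalog — it cannot hallucinate a song that does not exist.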

6 — MCP-Style Orchestration

Reference Pattern
Tool Catalog
One registry of tools + schemas + permissions + owners.
Policy Gate
AuthZ, rate limits, data minimization, redaction, audit logs.
Observability
Tracing, evaluations, prompt/tool telemetry, rollbacks.

How an “MCP Server” Concept Fits Spotify-Like LLM Systems

Operational Control

As the number of tools grows (search, playlists, recommendations, ads, safety, customer support, analytics), a single LLM can “get lost” choosing the right tool. An MCP-style layer prevents collapse by enforcing: standard tool definitions, routing, permissioning, and evaluation.
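A minimal sketch of such a control plane, with every tool call routed through one choke point that checks registration, permissions, and auditing. Tool names, roles, and the registry shape are invented for teaching, not an actual MCP server API.

```python
# Minimal MCP-style tool registry + policy gate (all names invented).

REGISTRY = {
    "catalog_search": {"owner": "search-team",   "mode": "read",  "roles": {"user", "agent"}},
    "playlist_write": {"owner": "playlist-team", "mode": "write", "roles": {"agent"}},
}

class PolicyError(Exception):
    """Raised when a tool call fails a policy check."""

def call_tool(tool, caller_role, payload, audit_log):
    """Single choke point: registration, permission, and audit checks
    all happen before any tool logic executes."""
    spec = REGISTRY.get(tool)
    if spec is None:
        raise PolicyError(f"unregistered tool: {tool}")
    if caller_role not in spec["roles"]:
        raise PolicyError(f"role {caller_role!r} not allowed on {tool}")
    audit_log.append({"tool": tool, "role": caller_role, "mode": spec["mode"]})
    return {"tool": tool, "ok": True}  # a real system would dispatch here

log = []
print(call_tool("catalog_search", "user", {"q": "lofi"}, log))
```

Note the asymmetry the registry encodes: read tools are broadly available, while write tools (playlist edits) are restricted to governed agent roles — the "read vs write" boundary the tool-governance bullet below the diagram calls for.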

Diagram: User/Client (app, voice, web, APIs; intent in natural language) → LLM Gateway/Orchestrator (prompt policy, routing, safety checks; tool selection, retries, fallbacks; offline eval + online monitoring) → MCP-Style Control Plane (tool registry, schemas, RBAC, audit; rate limits, PII redaction, approvals) → Tools: Search (semantic query → retrieval), Recommendations (candidates + rankers), Playlists (create/edit/curate safely), plus ads, support, policy, and analytics. MCP-style pattern: central tool catalog + policy gate + observability → prevents tool sprawl and unsafe actions.

In practical terms, this “control plane” solves the student’s real concern: too many MCPs/tools causing the model to choose poorly or loop. A Spotify-scale organization typically implements:

Tool governance
Ownership per tool, schema contracts, versioning, deprecation rules, and clear boundaries (read vs write tools).
Routing strategy
Intent classification → tool shortlist → constrained selection → structured outputs. Avoid “free-for-all” tool choice.
Evaluation pipeline
Offline test sets + online A/B tests + guardrail metrics (hallucinations, policy violations, latency, cost).
Safety & privacy
PII minimization, redaction, RBAC, audit logs, rate limits, and business-policy enforcement before any tool call.
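The evaluation pipeline above can be operationalized as a release gate: a candidate rollout proceeds only if every guardrail metric clears its threshold. The metric names and threshold values below are invented examples, not Spotify's actual criteria.

```python
# Release gate over guardrail metrics (names and thresholds invented).

THRESHOLDS = {
    "hallucination_rate": ("max", 0.02),  # at most 2% of eval cases
    "policy_violations":  ("max", 0.00),  # zero tolerance
    "p95_latency_ms":     ("max", 800),
    "task_success_rate":  ("min", 0.90),
}

def gate(metrics):
    """Return (passed, failures) for a candidate rollout."""
    failures = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = metrics[name]
        ok = value <= limit if kind == "max" else value >= limit
        if not ok:
            failures.append(f"{name}={value} violates {kind} {limit}")
    return (not failures, failures)

passed, failures = gate({
    "hallucination_rate": 0.01,
    "policy_violations": 0.0,
    "p95_latency_ms": 640,
    "task_success_rate": 0.93,
})
print(passed)  # True: safe to proceed to canary
```

The same gate runs twice in practice: offline against golden test sets before any traffic, then online against canary metrics before full rollout.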
7 — Other AI Beyond LLMs

Broader AI Stack
Recommender Systems
Candidate generation, ranking, session-based models, multi-objective optimization.
Representation Learning
User/track embeddings, similarity search, clustering, cold-start handling.
Trust & Safety
Content policy detection, abuse prevention, fraud/bot detection, anomaly monitoring.

AI Portfolio: Typical Spotify-Style Use Cases

Value Map

A realistic way to teach Spotify’s AI landscape is to show it as a portfolio with multiple model families, each optimized for different constraints (latency, throughput, cost, explainability, safety).

Area | What AI does | Signals used | Core risks | Typical controls
Recommendations | Personalize home feed, mixes, radios, session flow | Plays/skips/dwell, embeddings, context | Filter bubbles, bias, churn from bad recs | Multi-objective ranking, diversification, A/B tests
Search | Keyword + semantic search, intent reformulation | Queries, clicks, retrieval logs, embeddings | Wrong intent, unsafe outputs | Grounding to catalog, policy templates, fallbacks
Content understanding | Audio features, similarity, classification | Audio streams, metadata, weak labels | Mislabeling genres/moods | Human-in-the-loop sampling, calibration sets
LLM experiences | Explain, guide discovery, conversational workflows | Catalog + retrieval + tool outputs | Hallucinations, policy violations | MCP-style tool governance, evaluation harness
Trust & safety | Detect abuse, bots, policy violations | Behavioral anomalies, reports, patterns | False positives / negatives | Thresholding, escalation queues, audit trails

Teaching note: “AI in Spotify” is not one model. It’s a system-of-systems. LLMs are an interface layer and workflow accelerator; ranking and embeddings remain foundational.

8 — Managing “Too Many Tools”

Classroom Playbook
Reduce tool surface area
Group tools by domain; provide “meta-tools” that wrap many small endpoints.
Constrain selection
Never allow the model to pick from 500 tools at once.
Measure everything
Tool success rate, latency, cost, hallucination rate, user satisfaction proxies.
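"Measure everything" can start very simply: aggregate per-tool success rate and latency from raw call records, then alarm on regressions. The record fields below are invented for illustration.

```python
from collections import defaultdict

def summarize(calls):
    """Aggregate per-tool success rate and mean latency from call records."""
    buckets = defaultdict(list)
    for c in calls:
        buckets[c["tool"]].append(c)
    summary = {}
    for tool, rows in buckets.items():
        summary[tool] = {
            "calls": len(rows),
            "success_rate": sum(r["ok"] for r in rows) / len(rows),
            "mean_latency_ms": sum(r["latency_ms"] for r in rows) / len(rows),
        }
    return summary

calls = [  # invented telemetry records
    {"tool": "catalog_search", "ok": True,  "latency_ms": 120},
    {"tool": "catalog_search", "ok": True,  "latency_ms": 180},
    {"tool": "playlist_write", "ok": False, "latency_ms": 900},
]
print(summarize(calls))
```

Even this toy summary exposes the decisions that matter: a tool with a falling success rate gets pulled from shortlists; one with rising latency gets a tighter timeout and a fallback.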

A Practical MCP-Style Governance Model

Roles & Controls

When the number of tools grows, the organization needs operating discipline. Here’s a teachable governance model that maps well to a Spotify-scale scenario and also to enterprise platforms (like CCDOP/WAIFLOW):

Tool Owner
Owns the tool contract (schema), SLOs, and versioning. Maintains docs and test fixtures.
MCP Platform Team
Runs the registry, authZ, policy gates, observability, and standard SDKs for tool integration.
Model Risk / Safety
Defines guardrails, redaction rules, approvals for write actions, and incident response playbooks.
Evaluation & Quality
Maintains offline eval sets, golden conversations, regression checks, and rollout criteria.
# Illustrative (teaching) routing policy: keep tool choice constrained
intent = classify(user_query)

tool_shortlist = {
    "music_search": ["catalog_search", "semantic_retrieval"],
    "playlist_edit": ["playlist_read", "playlist_write_with_approval"],
    "recommendation_explain": ["rec_reason_codes", "catalog_lookup"],
    "support_issue": ["kb_retrieval", "ticket_create"],
}[intent]

# Enforce policy gate BEFORE any tool call
require_rbac(user, tool_shortlist)
apply_redaction(user_query)
enforce_rate_limits(user)

result = llm_call(
    system="Use ONLY tools in tool_shortlist. Output JSON.",
    tools=tool_shortlist,
    response_format="json",
)
audit_log(user, intent, tool_used=result.tool, outcome=result.status)
return result

The goal is not “more tools.” The goal is predictable behavior: constrained choice, testability, and safe operation at scale.

9 — Example: Indirect User Interaction with an LLM in Spotify

End-to-End Flow
User Input (Visible)
The user types: “instrumental focus, no vocals, upbeat but calm” — or uses voice search.
User Behavior (Invisible)
The user skips vocal tracks in < 10s, listens fully to instrumentals, and saves a track to a playlist — these actions become implicit feedback signals.
Why This Matters
Spotify’s “domain data” is continuously produced by the product itself. The LLM helps translate natural language into structured intent and safe tool calls; ranking models optimize the final ordering.

Diagram: User → LLM Layer → Tools → Ranking → Feedback Loop

MCP-Style Orchestration

The user may never “see” an LLM, yet they interact with it indirectly through search queries, recommendations, and behavior. The LLM’s role is typically to extract intent, normalize constraints, and route safely to a controlled set of tools. The final playlist/feed ordering is usually produced by specialized retrieval + ranking systems, and the user’s behavior becomes fresh domain signals that improve the system over time.

User (visible input: search "instrumental focus, no vocals", voice "play calm upbeat study music"; invisible behavior: skips < 10s, full listens, saves, repeats) → LLM / Semantic Layer (intent & constraints: goal focus/concentration, no vocals, upbeat but calm; safe routing from an allowed tool shortlist) → MCP-Style Control Plane (tool registry, schemas, owners, versions; RBAC, PII redaction, rate limiting, audit; eval gates, canary rollout, fallbacks — purpose: prevent "tool chaos" as tools and teams scale) → Tools/APIs (catalog search, semantic retrieval; guarded playlist read/write) → Retrieval (candidate tracks via embeddings; filters such as instrumental proxies) → Ranking (per-user, per-context ordering; diversity + policy constraints) → Output to User (playlist/feed results, optional explanation). Indirect interaction: the user's behavior (skips, saves, listens) becomes first-party domain signals that continuously improve retrieval/ranking and calibrate LLM intent parsing.
What the LLM is responsible for
Natural-language understanding, constraint normalization, query expansion, tool routing, safe structured outputs, and optionally generating grounded explanations.
What the ranking system is responsible for
Large-scale candidate generation + ranking optimized for satisfaction proxies (dwell, replays, saves), plus constraints (diversity, policy, latency, cost).
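The LLM's "safe structured outputs" responsibility deserves one concrete sketch: validate the model's JSON strictly against an allowed schema before anything downstream runs, and reject anything malformed or outside the vocabulary. The schema fields and allowed values below are invented.

```python
import json

ALLOWED_MOODS = {"calm", "upbeat", "any"}

def parse_intent(raw_llm_output):
    """Validate an LLM's JSON intent before routing. Returns the parsed
    dict, or None for anything malformed or outside the allowed schema."""
    try:
        data = json.loads(raw_llm_output)
    except json.JSONDecodeError:
        return None
    if set(data) != {"instrumental", "mood"}:
        return None  # unexpected or missing keys
    if not isinstance(data["instrumental"], bool) or data["mood"] not in ALLOWED_MOODS:
        return None  # wrong types or out-of-vocabulary values
    return data

print(parse_intent('{"instrumental": true, "mood": "calm"}'))  # accepted
print(parse_intent('{"mood": "angry", "drop_tables": true}'))  # rejected -> None
```

This is the last line of defense in the flow above: even if the model misbehaves, only well-formed, in-vocabulary intents ever reach retrieval or a write-capable tool.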
