Machine learning is at the core of Pinterest. Pinterest personalizes and ranks 1B+ pins and 700M+ boards for 100M+ users all over the world, using data gathered from collaborative filtering, user curation, web crawling, and more. At Pinterest we model relationships between pins, handle cold-start problems, and deal with real-time recommendations.
In this presentation Jure gave an overview of the problems and effective solutions developed at Pinterest. He focused on the systems and engineering choices that make machine learning development productive and let multiple engineers effectively develop, test, and deploy machine-learned models.
2. What is Pinterest?
Pinterest is a visual bookmarking tool and discovery engine
Users pin images and sites they like onto boards
Every pin on Pinterest is added by a human and lives on a board
Users heavily curate their content
3. What is a Pin?
• Image
• URL: http://www.culinaria.com…
• User-generated details
• User-curated pin-board graph
• User-curated annotations
• On-site performance (click actions, impressions, …)
• Web crawl data
8. ML at Pinterest
Many parts driven by ML
Personalization
• Pin and board recommendations
• New-user topic recommendations
Notifications
• Email timing, frequency, content
Ads and monetization
• User action prediction
Related pins
• Which pins are related to a given pin
Ranking
• Homefeed pin ranking
9. What interests shall we recommend to a new user?
Example ML projects at Pinterest
[Pong Eksombatchai, Dave Cummings, Pei Yin, Dan Frankowski]
12. Why is it hard?
A user has just joined and has no clue what Pinterest is
• Problem: Product comprehension
We have tens of thousands of interests to recommend from
• Problem: We cannot score all the interests
The business metric we want to optimize is WAR28 (weekly active repinner after 28 days)
• Problem: What is the right notion of a positive label?
13. How do we generate an engaging homefeed?
Example ML projects at Pinterest
[Mukund Narasimhan, Yuchen Lie, Dmitry Chechik, Yunsong Guo, …]
14. Homefeed
A diverse, relevant, endless set of pins for each user
Show pins and content meaningful to a user without a specific query
Combines content from:
• Users or boards you follow
• Interests you follow
• Recommendations
15. Why is it hard?
Generating candidates
• Find pins that we think you'll like
Scoring and ranking
• Picking the best of the best among candidates
Blending of different sources
• Followed boards/users/interests, recommendations
Creating the final feed
• Doing this for tens of millions of users multiple times a day (all four stages are sketched below)
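A minimal sketch of those four stages with toy stand-ins; the source names, the random scoring function, and the round-robin blending are all assumptions for illustration, not Pinterest's actual system:

    # Toy sketch of the homefeed pipeline: candidates -> scoring -> blending -> feed.
    import random

    def score(user, pin):
        return random.random()  # stand-in for the learned ranking model

    def build_homefeed(user, sources, feed_size=6):
        # 1. Generate candidates from each source (followed boards/users,
        #    followed interests, recommendations), then
        # 2. score and rank: pick the best of the best within each source.
        ranked = {name: sorted(pins, key=lambda p: score(user, p), reverse=True)
                  for name, pins in sources.items()}
        # 3. Blend sources by interleaving so no single source dominates.
        feed = []
        while len(feed) < feed_size and any(ranked.values()):
            for name in sources:
                if ranked[name]:
                    feed.append(ranked[name].pop(0))
        # 4. Create the final feed.
        return feed[:feed_size]

    sources = {"followed_boards": ["pin_a", "pin_b"],
               "followed_interests": ["pin_c", "pin_d"],
               "recommendations": ["pin_e", "pin_f"]}
    print(build_homefeed("user_123", sources))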
22. Related Pins
Why is it hard?
Systems challenges
• Billions of pins
• Find related pins for each given pin
Machine learning approach
• Classification vs. ranking?
Ground-truth labels
• What is a good notion of ground truth?
• Clicks? How do we correct for position bias? (one common fix is sketched below)
Offline evaluation
• What is a good metric for offline evaluation?
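The deck leaves the position-bias question open; one common remedy (an assumption here, not something the slides prescribe) is inverse-propensity weighting, which up-weights clicks at positions users rarely examine:

    # Inverse-propensity weighting of click labels. The examination
    # probabilities below are assumed for illustration; in practice they
    # are estimated, e.g., from result randomization.
    EXAMINE_PROB = {1: 0.95, 2: 0.80, 3: 0.60, 4: 0.45, 5: 0.35}

    def label_and_weight(clicked, position):
        """Turn a logged impression into a (label, sample_weight) pair."""
        if clicked:
            # Up-weight clicks at low-attention positions.
            return 1, 1.0 / EXAMINE_PROB.get(position, 0.25)
        return 0, 1.0  # a simple, common choice for non-clicks

    print(label_and_weight(clicked=True, position=4))  # (1, ~2.22)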
23. What “interests” does a pin belong to?
Example ML projects at Pinterest
[Leon Lin, Lingzhi Luo, Ningning Hu, Eugene Ie, Tao Cheng, …]
25. From Pins to Interests
TASK: Given a pin, determine its interest(s)
[Diagram: a pin feeds a black-box classifier that outputs interests such as Food & Drink, Lower back tattoos, Canoeing, Hair, Geek, …]
26. Why is it hard?
Some interests are specific, others are general
Huge interest-size imbalance: from 10% down to 0.1%
• Problem: Always saying “not my interest” is 99% correct (one standard mitigation is sketched below)
We don't know the interest sizes in the “wild”
• Problem: We overpredict rare interests and underpredict common ones
The solution has to scale to thousands of interests and many languages
• We developed on English, deployed in French
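A minimal sketch of the imbalance problem and one standard mitigation, class reweighting, on synthetic data; the technique is illustrative, not a documented Pinterest choice:

    # With ~99:1 classes, plain accuracy rewards the trivial "not my
    # interest" classifier, so reweight the classes and judge by F1.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    # ~1% positives, mimicking a rare interest.
    X, y = make_classification(n_samples=10000, weights=[0.99], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # class_weight="balanced" up-weights the rare positive class so the
    # model cannot win by always predicting "not my interest".
    clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)
    print(f1_score(y_te, clf.predict(X_te)))  # judge by F1, not accuracy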
28. Machine Learning Problems
Problems we're trying to solve:
Generating candidates
• Find pins that we think you'll like
Scoring and ranking
• Picking the best of the best among candidates
Blending of different sources
• Followed boards/users/topics, recommendations
Creating final recommendations
• Doing this for tens of millions of users multiple times a day
29. Many Challenges
No dataset
• We have to create a dataset
• Which users to use? What time period?
No labels
• We have to pick the labels
• What is a good signal for a positive/negative label?
• Can “no label” be treated as a negative label?
Deployment
• We have to serve the model to 100M+ users
• How do we generate, store, and query features?
• How do we score the recommendations?
30. Lessons Learned
What did we learn along the way?
1. Know your data: carefully think about the input data
2. More is better: don't be afraid to try many times!
3. Evaluation is hard: move fast but be scientific about it :)
31. Learning 1: Know your data
There is no objective dataset
Production changes everything
Make it easy to look at the raw data, raw results…
Build intuition about the data and what steps to take next
32. There is no Objective Dataset
• There are lots of subtleties in how training data is generated
- How the data is sampled matters
- The characteristics of the data change with time
- Distributions change upon deployment
- We make choices based on computational constraints (ratio of positive to negative instances, size of the data set)
• Varying these has a bigger impact on the final model than varying algorithms
• It is more important to examine/vary/test these than (for example) the regularization parameter
33. Production Changes Everything
• The data distribution is different
- Need to deal with missing data
- Need to deal with malformed data
- Systems have to work under difficult circumstances: upstream services may go down, but the system should continue to provide reasonable responses
- Defining fallback behavior is important (see the sketch below)
• Offline/online consistency takes work
• Investment in monitoring, measurement, deployment, and debugging is crucial
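A minimal sketch of the fallback idea; the feature service, its fetch method, and the default feature values are all hypothetical:

    # Graceful degradation when an upstream feature service fails: return
    # neutral defaults so ranking still produces a reasonable (if less
    # personalized) response instead of failing the request.
    DEFAULT_FEATURES = {"country": "unknown", "activity_level": 0.0}

    def get_user_features(user_id, feature_service, timeout_s=0.05):
        try:
            return feature_service.fetch(user_id, timeout=timeout_s)
        except Exception:
            # Upstream is down or slow: fall back to the defaults.
            return DEFAULT_FEATURES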
35. Approach: One vs. Rest
[Diagram: pins labeled with interests (Geek, Women's Fashion, Food and Drink, Canoeing, …) train one binary interest classifier per interest; here, a Canoeing-vs-rest classifier]
36. Approach: One vs. Rest
[Same diagram, now training a Geek-vs-rest classifier]
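The deck shows this only as a diagram; below is a minimal runnable sketch of the same one-vs-rest setup in scikit-learn, with synthetic stand-ins for the pin features and interest labels:

    # One binary classifier per interest ("Canoeing vs. rest", "Geek vs.
    # rest", ...), trained on multi-label pin data.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    X = np.random.rand(6, 4)  # stand-in for per-pin features
    interests = [["Geek"], ["Canoeing"], ["Food and Drink"],
                 ["Geek", "Canoeing"], ["Women's Fashion"], ["Canoeing"]]

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(interests)            # one indicator column per interest
    clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
    print(mlb.classes_)                         # order of the per-interest columns
    print(clf.predict(X[:2]))                   # 0/1 prediction per interest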
37. Production Data Distribution does not Match
[Diagram: the Geek-vs-rest classifier, trained on labeled pins, is applied to unlabeled pins whose distribution differs from the training data]
38. Production Data Distribution does not Match
[Same diagram, with the takeaway: carefully think about biases in the data]
42. Learning 2: More is better
• More is better:
- Models, data, features, experiments
• Hard to tell upfront what will work and what won't
- Try lots of things
• Optimize for scale, flexibility, debuggability
- Simple and consistent systems scale better
43. More is better
Classifying pins to interests (Women's Fashion, Food & Drink, Geek, …)
For example: we started with 39 features… and quickly expanded to 670,000 features
7x gain in performance (F1 score)!
No manual feature selection. Let the model select features! (one common way is sketched below)
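The slides don't say how the model selects features; a standard way to get this behavior in a linear model (an assumption here, not a documented Pinterest choice) is L1 regularization, which drives the coefficients of unhelpful features to exactly zero:

    # L1-regularized logistic regression on synthetic data: most of the
    # 1,000 features are noise, and the penalty zeroes their weights out.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=5000, n_features=1000,
                               n_informative=20, random_state=0)
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
    print("non-zero features:", np.count_nonzero(clf.coef_))  # sparse model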
46. But: ML Systems Hide Bugs
While scaling up, be very careful
• Robust systems work in the presence of errors
- Incorrectly implemented features
- Features that go missing
- Models translated incorrectly
- Data missing for a subset of users
• Treating ML systems as black boxes and looking only at their output is dangerous
- Especially when you are not sure what to look for
- Because errors manifest as slightly lower accuracy
- And because you don't know what accuracy to expect
47. Learning 3: Evaluation is Hard
Evaluation is always hard
Not obvious whether an offline metric will correlate with an online metric
The offline metric is a complex function of dataset creation, ground-truth labels, and the ML algorithm
48. Evaluation is Hard
Training objectives, offline metrics, and online metrics can be very different
- Some correlation is expected, but once your models are sufficiently optimized, they begin to diverge
- Online metrics are the only ones that matter, but they are very expensive
- The offline metric should predict the online metric
A naive training/testing split is suboptimal
- There is often a lot of subjectivity in training data selection
- It is more important that the evaluation data reflect reality than that they reflect the training data
49. New User Interest Recommendations
Users follow interests; users interact with pins
Idea: recommend interests that the user is going to interact with in the future
• User features: landing page, demographics, Facebook
• Interest features: topics, annotations, etc.
• Model: user-cross-interest features with feature hashing (see the sketch below)
• What are the labels? Not what the user follows, but the interests of the pins the user is going to interact with in the future
• Negative labels: seen but not interacted with
• Scoring: score 1k location- and gender-specific interests in real time
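A small sketch of user-cross-interest features with the hashing trick mentioned on this slide; the feature names and the exact crossing scheme are illustrative:

    # Cross every user feature with the candidate interest, then hash the
    # resulting strings into a fixed-size sparse vector for a linear model.
    from sklearn.feature_extraction import FeatureHasher

    hasher = FeatureHasher(n_features=2**20, input_type="string")

    def cross_features(user, interest):
        return [f"{name}={value}_x_{interest}"
                for name, value in user.items()]

    user = {"gender": "f", "country": "US", "landing_page": "travel"}
    X = hasher.transform([cross_features(user, "Canoeing")])
    print(X.shape)  # (1, 1048576) sparse vector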
50. New User Interest Recommendations
Evaluation
• Number of followed interests (bad)
• Number of pins interacted with (good)
• AUC and precision at top 10 (see the sketch below)
• Baselines: random, popularity
In two months we:
• Ran 1,000s of offline experiments
• Trained 1,000s of models to find a useful one
• Generated 2,338 graphs and 148k pin galleries
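A minimal sketch of these two offline metrics on synthetic data; the random scores stand in for a model, so this also doubles as the random baseline:

    # AUC over all (interest, label) pairs, plus precision among the
    # top-10 scored interests.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    def precision_at_10(scores, labels):
        top10 = np.argsort(scores)[::-1][:10]   # indices of 10 highest scores
        return np.mean(np.asarray(labels)[top10])

    labels = np.random.randint(0, 2, size=100)  # 1 = user interacted later
    scores = np.random.rand(100)                # model scores (random baseline)
    print(roc_auc_score(labels, scores), precision_at_10(scores, labels))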
51. Evaluation is Hard
Define clear offline success metrics
• Consider many metrics
Build meaningful baselines
Clear offline metrics allow you to quickly compare solutions and prune bad directions
52. Old Models Never Die
Models can live for a long time
- Keep long-term holdouts (> 1 year)
- Not all effects can be observed in a short timeframe
Models should be independent of infrastructure and environment
- Infrastructure lifetime and model lifetime should be independent
- You should be able to deploy models in different environments
It is harder to track progress over time
- Changes are not additive
- The only way to determine progress is to compare with older models
55. Automation Pays for Itself
• Having a repeatable, push-button, stable process is enormously valuable
• Automation encourages experimentation
- Try variations easily
- Reduces the temptation to bundle changes
- Easy baseline, good starting point
• Regular retraining is enormously valuable
• A new team member should be able to go through a documented process and end up with a model that is on par with production
56. Models Need to be Managed
• We have hundreds of models in production
- Trained by different engineers
- Optimizing for different criteria
- Using different features
- Meant for different purposes
- But running on the same infrastructure
• You need a process for
- Model storage and search
- Model deployment, documentation, and review
- Keeping model coupling/dependencies in check
- Tracking experiments, communicating successes and failures
57. Avoid Implicit Assumptions
• Make everything explicit (via a DSL)
- A (linear) model is not just an array of coefficients
- It should list the source/raw features
- It should contain the feature transforms
- It should contain the score transform/calibration/link function
- It should document how it was built, who built it, and when, and point to instructions to reproduce it
• Config is better than code
- Create a well-documented model specification language (a hypothetical example is sketched below)
- That is human readable
- But manipulable by tools (introspection, refactoring, etc.)
• Minimize dependencies on the environment
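The deck doesn't show Pinterest's actual specification language; the dictionary below is a hypothetical illustration of what "everything explicit" could look like, and every name and path in it is invented:

    # Hypothetical model spec in the spirit of this slide: the model is an
    # explicit, tool-readable document, not just an array of coefficients.
    MODEL_SPEC = {
        "name": "homefeed_ranker",
        "built_by": "ranking-team",
        "built_at": "2015-06-12",
        "reproduce": "docs/homefeed_ranker_howto.md",  # rebuild instructions
        "raw_features": ["user_country", "pin_click_rate", "board_match"],
        "feature_transforms": [("pin_click_rate", "log1p"),
                               ("user_country", "one_hot")],
        "model": {"type": "linear",
                  "coefficients": "models/homefeed_ranker/coeffs.json"},
        "link_function": "sigmoid",                    # score calibration
    }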
58. There is more to ML Systems than ML
• Infrastructure is critical
• Building high-quality systems requires experts from different domains
- How do ML engineers build models without a deep understanding of the infrastructure?
- How do infrastructure experts build/scale/evolve the system?
• Decoupling infrastructure and modeling is hard but worth it
- Allows people with different backgrounds to work together
- Requires well-thought-out interfaces
- Which are rarely achieved through organic evolution
59. Everything Gets Amplified at Scale
• 100M+ users
- A vast, diverse, and changing user base makes user modeling a challenge
- The product has to work well for niche as well as mainstream populations
- Optimizing for the majority can hurt subgroups
- Monitoring needs to be intelligent
• Billions of pieces of content
- Modeling is crucial
- Need to trade off recency, diversity, relevance, and ecosystem effects