Machine learning is at the core of Pinterest. Pinterest personalizes and ranks 1B+ pins and 700M+ boards for 100M+ users all over the world, using data gathered from collaborative filtering, user curation, web crawling, and more. At Pinterest we model relationships between pins, handle cold-start problems, and deal with real-time recommendations.
In this presentation Jure gave an overview of the problems and effective solutions developed at Pinterest. He focused on the systems and engineering choices that make machine learning development productive and let multiple engineers effectively develop, test, and deploy machine-learned models.
2. What is Pinterest?
Pinterest is a visual bookmarking tool and discovery engine
Users pin images and sites they like onto boards
Every pin on Pinterest is added by a human and lives on a board
Users heavily curate their content
3. What is a Pin?
• Image
• URL: http://www.culinaria.com…
• User-generated details
• User-curated pin-board graph
• User-curated annotations
• On-site performance (click actions, impressions, …)
• Web crawl data
8. ML at Pinterest
Many parts driven by ML
Personalization
• Pin and board recommendations
• New-user topic recommendations
Notifications
• Email timing, frequency, content
Ads and monetization
• User action prediction
Related pins
• Which pins are related to a given pin
Ranking
• Homefeed pin ranking
9. What interests shall we recommend to a new user?
Example ML projects at Pinterest
[Pong Eksombatchai, Dave Cummings, Pei Yin, Dan Frankowski]
12. Why is it hard?
A user has just joined and has no clue what Pinterest is
• Problem: Product comprehension
We have tens of thousands of interests to recommend from
• Problem: We cannot score all the interests
The business metric we want to optimize is WAR28 (weekly active repinner after 28 days)
• Problem: What is the right notion of a positive label?
13. How do we generate an engaging homefeed?
Example ML projects at Pinterest
[Mukund Narasimhan, Yuchen Lie, Dmitry Chechik, Yunsong Guo, …]
14. Homefeed
A diverse, relevant, endless set of pins for each user
Show pins and content meaningful to a user without a specific query
Combines content from:
• Users or boards you follow
• Interests you follow
• Recommendations
15. Why is it hard?
Generating candidates
• Find pins that we think you'll like
Scoring and ranking
• Picking the best of the best among candidates
Blending of different sources
• Followed boards/users/interests, recommendations
Creating the final feed
• Doing this for tens of millions of users multiple times a day (all four stages are sketched below)
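A minimal sketch of those four stages with toy stand-ins; the source names, the random scoring function, and the round-robin blending are all assumptions for illustration, not Pinterest's actual system:

    # Toy sketch of the homefeed pipeline: candidates -> scoring -> blending -> feed.
    import random

    def score(user, pin):
        return random.random()  # stand-in for the learned ranking model

    def build_homefeed(user, sources, feed_size=6):
        # 1. Generate candidates from each source (followed boards/users,
        #    followed interests, recommendations), then
        # 2. score and rank: pick the best of the best within each source.
        ranked = {name: sorted(pins, key=lambda p: score(user, p), reverse=True)
                  for name, pins in sources.items()}
        # 3. Blend sources by interleaving so no single source dominates.
        feed = []
        while len(feed) < feed_size and any(ranked.values()):
            for name in sources:
                if ranked[name]:
                    feed.append(ranked[name].pop(0))
        # 4. Create the final feed.
        return feed[:feed_size]

    sources = {"followed_boards": ["pin_a", "pin_b"],
               "followed_interests": ["pin_c", "pin_d"],
               "recommendations": ["pin_e", "pin_f"]}
    print(build_homefeed("user_123", sources))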
22. Related Pins
Why is it hard?
Systems challenges
• Billions of pins
• Find related pins for each given pin
Machine learning approach
• Classification vs. ranking?
Ground-truth labels
• What is a good notion of ground truth?
• Clicks? How do we correct for position bias? (one common fix is sketched below)
Offline evaluation
• What is a good metric for offline evaluation?
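The deck leaves the position-bias question open; one common remedy (an assumption here, not something the slides prescribe) is inverse-propensity weighting, which up-weights clicks at positions users rarely examine:

    # Inverse-propensity weighting of click labels. The examination
    # probabilities below are assumed for illustration; in practice they
    # are estimated, e.g., from result randomization.
    EXAMINE_PROB = {1: 0.95, 2: 0.80, 3: 0.60, 4: 0.45, 5: 0.35}

    def label_and_weight(clicked, position):
        """Turn a logged impression into a (label, sample_weight) pair."""
        if clicked:
            # Up-weight clicks at low-attention positions.
            return 1, 1.0 / EXAMINE_PROB.get(position, 0.25)
        return 0, 1.0  # a simple, common choice for non-clicks

    print(label_and_weight(clicked=True, position=4))  # (1, ~2.22)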
23. What “interests” does a pin belong to?
Example ML projects at Pinterest
[Leon Lin, Lingzhi Luo, Ningning Hu, Eugene Ie, Tao Cheng, …]
25. From Pins to Interests
TASK: Given a pin, determine its interest(s)
[Diagram: a pin feeds a black-box classifier that outputs interests such as Food & Drink, Lower back tattoos, Canoeing, Hair, Geek, …]
26. Why is it hard?
Some interests are specific, others are general
Huge interest-size imbalance: from 10% down to 0.1%
• Problem: Always saying “not my interest” is 99% correct (one standard mitigation is sketched below)
We don't know the interest sizes in the “wild”
• Problem: We overpredict rare interests and underpredict common ones
The solution has to scale to thousands of interests and many languages
• We developed on English, deployed in French
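A minimal sketch of the imbalance problem and one standard mitigation, class reweighting, on synthetic data; the technique is illustrative, not a documented Pinterest choice:

    # With ~99:1 classes, plain accuracy rewards the trivial "not my
    # interest" classifier, so reweight the classes and judge by F1.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    # ~1% positives, mimicking a rare interest.
    X, y = make_classification(n_samples=10000, weights=[0.99], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # class_weight="balanced" up-weights the rare positive class so the
    # model cannot win by always predicting "not my interest".
    clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)
    print(f1_score(y_te, clf.predict(X_te)))  # judge by F1, not accuracy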
28. Machine Learning Problems
Problems we're trying to solve:
Generating candidates
• Find pins that we think you'll like
Scoring and ranking
• Picking the best of the best among candidates
Blending of different sources
• Followed boards/users/topics, recommendations
Creating final recommendations
• Doing this for tens of millions of users multiple times a day
29. Many Challenges
No dataset
• We have to create a dataset
• Which users to use? What time period?
No labels
• We have to pick the labels
• What is a good signal for a positive/negative label?
• Can “no label” be treated as a negative label?
Deployment
• We have to serve the model to 100M+ users
• How do we generate, store, and query features?
• How do we score the recommendations?
30. Lessons Learned
What did we learn along the way?
1. Know your data: carefully think about the input data
2. More is better: don't be afraid to try many times!
3. Evaluation is hard: move fast but be scientific about it :)
31. Learning 1: Know your data
There is no objective dataset
Production changes everything
Make it easy to look at the raw data, raw results…
Build intuition about the data and what steps to take next
32. There is no Objective Dataset
• There are lots of subtleties in how training data is generated
- How the data is sampled matters
- The characteristics of the data change with time
- Distributions change upon deployment
- We make choices based on computational constraints (ratio of positive to negative instances, size of the data set)
• Varying these has a bigger impact on the final model than varying algorithms
• It is more important to examine/vary/test these than (for example) the regularization parameter
33. Production Changes Everything
• The data distribution is different
- Need to deal with missing data
- Need to deal with malformed data
- Systems have to work under difficult circumstances: upstream services may go down, but the system should continue to provide reasonable responses
- Defining fallback behavior is important (see the sketch below)
• Offline/online consistency takes work
• Investment in monitoring, measurement, deployment, and debugging is crucial
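A minimal sketch of the fallback idea; the feature service, its fetch method, and the default feature values are all hypothetical:

    # Graceful degradation when an upstream feature service fails: return
    # neutral defaults so ranking still produces a reasonable (if less
    # personalized) response instead of failing the request.
    DEFAULT_FEATURES = {"country": "unknown", "activity_level": 0.0}

    def get_user_features(user_id, feature_service, timeout_s=0.05):
        try:
            return feature_service.fetch(user_id, timeout=timeout_s)
        except Exception:
            # Upstream is down or slow: fall back to the defaults.
            return DEFAULT_FEATURES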
35. Approach: One vs. Rest
[Diagram: pins labeled with interests (Geek, Women's Fashion, Food and Drink, Canoeing, …) train one binary interest classifier per interest; here, a Canoeing-vs-rest classifier]
36. Approach: One vs. Rest
[Same diagram, now training a Geek-vs-rest classifier]
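The deck shows this only as a diagram; below is a minimal runnable sketch of the same one-vs-rest setup in scikit-learn, with synthetic stand-ins for the pin features and interest labels:

    # One binary classifier per interest ("Canoeing vs. rest", "Geek vs.
    # rest", ...), trained on multi-label pin data.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    X = np.random.rand(6, 4)  # stand-in for per-pin features
    interests = [["Geek"], ["Canoeing"], ["Food and Drink"],
                 ["Geek", "Canoeing"], ["Women's Fashion"], ["Canoeing"]]

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(interests)            # one indicator column per interest
    clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
    print(mlb.classes_)                         # order of the per-interest columns
    print(clf.predict(X[:2]))                   # 0/1 prediction per interest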
37. Production Data Distribution does not Match
[Diagram: the Geek-vs-rest classifier, trained on labeled pins, is applied to unlabeled pins whose distribution differs from the training data]
38. Production Data Distribution does not Match
[Same diagram, with the takeaway: carefully think about biases in the data]
42. Learning 2: More is better
• More is better:
- Models, data, features, experiments
• Hard to tell upfront what will work and what won't
- Try lots of things
• Optimize for scale, flexibility, debuggability
- Simple and consistent systems scale better
43. More is better
Classifying pins to interests (Women's Fashion, Food & Drink, Geek, …)
For example: we started with 39 features… and quickly expanded to 670,000 features
7x gain in performance (F1 score)!
No manual feature selection. Let the model select features! (one common way is sketched below)
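The slides don't say how the model selects features; a standard way to get this behavior in a linear model (an assumption here, not a documented Pinterest choice) is L1 regularization, which drives the coefficients of unhelpful features to exactly zero:

    # L1-regularized logistic regression on synthetic data: most of the
    # 1,000 features are noise, and the penalty zeroes their weights out.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=5000, n_features=1000,
                               n_informative=20, random_state=0)
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
    print("non-zero features:", np.count_nonzero(clf.coef_))  # sparse model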
46. But: ML Systems Hide Bugs
While scaling up, be very careful
• Robust systems work in the presence of errors
- Incorrectly implemented features
- Features that go missing
- Models translated incorrectly
- Data missing for a subset of users
• Treating ML systems as black boxes and looking only at their output is dangerous
- Especially when you are not sure what to look for
- Because errors manifest as slightly lower accuracy
- And because you don't know what accuracy to expect
47. Learning 3: Evaluation is Hard
Evaluation is always hard
Not obvious whether an offline metric will correlate with an online metric
The offline metric is a complex function of dataset creation, ground-truth labels, and the ML algorithm
48. Evaluation is Hard
Training objectives, offline metrics, and online metrics can be very different
- Some correlation is expected, but once your models are sufficiently optimized, they begin to diverge
- Online metrics are the only ones that matter, but they are very expensive
- The offline metric should predict the online metric
A naive training/testing split is suboptimal
- There is often a lot of subjectivity in training data selection
- It is more important that the evaluation data reflect reality than that they reflect the training data
49. New User Interest Recommendations
Users follow interests; users interact with pins
Idea: recommend interests that the user is going to interact with in the future
• User features: landing page, demographics, Facebook
• Interest features: topics, annotations, etc.
• Model: user-cross-interest features with feature hashing (see the sketch below)
• What are the labels? Not what the user follows, but the interests of the pins the user is going to interact with in the future
• Negative labels: seen but not interacted with
• Scoring: score 1k location- and gender-specific interests in real time
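A small sketch of user-cross-interest features with the hashing trick mentioned on this slide; the feature names and the exact crossing scheme are illustrative:

    # Cross every user feature with the candidate interest, then hash the
    # resulting strings into a fixed-size sparse vector for a linear model.
    from sklearn.feature_extraction import FeatureHasher

    hasher = FeatureHasher(n_features=2**20, input_type="string")

    def cross_features(user, interest):
        return [f"{name}={value}_x_{interest}"
                for name, value in user.items()]

    user = {"gender": "f", "country": "US", "landing_page": "travel"}
    X = hasher.transform([cross_features(user, "Canoeing")])
    print(X.shape)  # (1, 1048576) sparse vector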
50. New User Interest Recommendations
Evaluation
• Number of followed interests (bad)
• Number of pins interacted with (good)
• AUC and precision at top 10 (see the sketch below)
• Baselines: random, popularity
In two months we:
• Ran 1,000s of offline experiments
• Trained 1,000s of models to find a useful one
• Generated 2,338 graphs and 148k pin galleries
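A minimal sketch of these two offline metrics on synthetic data; the random scores stand in for a model, so this also doubles as the random baseline:

    # AUC over all (interest, label) pairs, plus precision among the
    # top-10 scored interests.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    def precision_at_10(scores, labels):
        top10 = np.argsort(scores)[::-1][:10]   # indices of 10 highest scores
        return np.mean(np.asarray(labels)[top10])

    labels = np.random.randint(0, 2, size=100)  # 1 = user interacted later
    scores = np.random.rand(100)                # model scores (random baseline)
    print(roc_auc_score(labels, scores), precision_at_10(scores, labels))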
51. Evaluation is Hard
Define clear offline success metrics
• Consider many metrics
Build meaningful baselines
Clear offline metrics allow you to quickly compare solutions and prune bad directions
52. Old Models Never Die
Models can live for a long time
- Keep long-term holdouts (> 1 year)
- Not all effects can be observed in a short timeframe
Models should be independent of infrastructure and environment
- Infrastructure lifetime and model lifetime should be independent
- You should be able to deploy models in different environments
It is harder to track progress over time
- Changes are not additive
- The only way to determine progress is to compare with older models
55. Automation Pays for Itself
• Having a repeatable, push-button, stable process is enormously valuable
• Automation encourages experimentation
- Try variations easily
- Reduces the temptation to bundle changes
- Easy baseline, good starting point
• Regular retraining is enormously valuable
• A new team member should be able to go through a documented process and end up with a model that is on par with production
56. Models Need to be Managed
• We have hundreds of models in production
- Trained by different engineers
- Optimizing for different criteria
- Using different features
- Meant for different purposes
- But running on the same infrastructure
• You need a process for
- Model storage and search
- Model deployment, documentation, and review
- Keeping model coupling/dependencies in check
- Tracking experiments, communicating successes and failures
57. Avoid Implicit Assumptions
• Make everything explicit (via a DSL)
- A (linear) model is not just an array of coefficients
- It should list the source/raw features
- It should contain the feature transforms
- It should contain the score transform/calibration/link function
- It should document how it was built, who built it, and when, and point to instructions to reproduce it
• Config is better than code
- Create a well-documented model specification language (a hypothetical example is sketched below)
- That is human readable
- But manipulable by tools (introspection, refactoring, etc.)
• Minimize dependencies on the environment
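The deck doesn't show Pinterest's actual specification language; the dictionary below is a hypothetical illustration of what "everything explicit" could look like, and every name and path in it is invented:

    # Hypothetical model spec in the spirit of this slide: the model is an
    # explicit, tool-readable document, not just an array of coefficients.
    MODEL_SPEC = {
        "name": "homefeed_ranker",
        "built_by": "ranking-team",
        "built_at": "2015-06-12",
        "reproduce": "docs/homefeed_ranker_howto.md",  # rebuild instructions
        "raw_features": ["user_country", "pin_click_rate", "board_match"],
        "feature_transforms": [("pin_click_rate", "log1p"),
                               ("user_country", "one_hot")],
        "model": {"type": "linear",
                  "coefficients": "models/homefeed_ranker/coeffs.json"},
        "link_function": "sigmoid",                    # score calibration
    }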
58. There is more to ML Systems than ML
• Infrastructure is critical
• Building high-quality systems requires experts from different domains
- How do ML engineers build models without a deep understanding of the infrastructure?
- How do infrastructure experts build/scale/evolve the system?
• Decoupling infrastructure and modeling is hard but worth it
- Allows people with different backgrounds to work together
- Requires well-thought-out interfaces
- Which are rarely achieved through organic evolution
59. Everything Gets Amplified at Scale
• 100M+ users
- A vast, diverse, and changing user base makes user modeling a challenge
- The product has to work well for niche as well as mainstream populations
- Optimizing for the majority can hurt subgroups
- Monitoring needs to be intelligent
• Billions of pieces of content
- Modeling is crucial
- Need to trade off recency, diversity, relevance, and ecosystem effects