SlideShare a Scribd company logo
1 of 13
Social Network Mining
    Solutions using Google App Engine Map Reduce




     J Singh, DataThinks.org



                                        October 19, 2011
MapReduce: A Genealogical Perspective
• Roots
   – Lisp, Scheme
   – APL


• Google OS papers, 2004
   – Exploit extreme parallelism of data


• Apache Top Level Project (Hadoop)

• MapReduceGAE borrows from these




© J Singh, 2011                            2
                                   2
Social Network Mining
• Finding people based on data in social networks
   –   Love and Romance
   –   Common interests
   –   Similar buying habits
   –   Similar voting propensities
   –   Location


• It‟s not a new problem
   – We have additional solutions for the old problem
        • Examples based on proprietary data: eHarmony, etc.
        • Early examples based on social network data: ShoutFlow,
          WhoIsJustLikeMe.



© J Singh, 2011                                                     3
                                      3
Based on clustering algorithms
• On-line demo of clustering       • Resource intensive.
                                      – Best done in batch mode


                                   • Exploit data parallelism of the
                                     algorithm
                                      – App Engine Map Reduce,
                                        employing one map job for
                                        each cluster
                                      – App Engine Pipeline API,
                                        employing one stage of the
                                        pipeline for each „step‟


                                   • But first, a detour into Map
                                     Reduce…
© J Singh, 2011                                                      4
                               4
MapReduce Conceptual Underpinnings
• Based on Functional Programming model
   – From Lisp / Scheme
        • (map square '(1 2 3 4))   (1 4 9 16)
        • (reduce plus '(1 4 9 16))   30
   – From APL
        • +/ N    N  1 2 3 4


• Easy to distribute (based on each element of the vector)

• New for Map/Reduce: Nice failure/retry semantics
   – Hundreds and thousands of low-end servers are running at the
     same time



© J Singh, 2011                                                     5
                                  5
MapReduce Flow




© J Singh, 2011       6
                  6
MapReduce Components in GAE 2011
                  • Input Reader
                     – Several provided by GAE, can write your own


                  • Map function: Written by Programmer

                  • Shuffle function:
                     – Provided by GAE, can write your own


                  • Reduce function: Written by Programmer

                  • Output Writer
                     – Several provided by GAE, can write your own




© J Singh, 2011                                                      7
                               7
Invoking GAE Map Reduce
class MapreducePipeline (…):
    def run(self,
          job_name,             #   A string
          mapper_spec,          #   Mapper function
          reducer_spec,         #   Reducer function
          input_reader_spec,    #   Input reader fn
          output_writer_spec,   #   Output writer
          mapper_params,        #   A dictionary
          reducer_params,       #   A dictionary
          shards,               #   An int
            )


© J Singh, 2011                                        8
                          8
GAE Pipeline API
• Based on Python Generator functions

• The old Unix idea on steroids:
   – Perform complex operations by piping data between primitives
   – But the primitives are not so primitive
   – Data lives in permanent storage between pipeline stages


• MapreducePipeline (prev page) was just one type of pipeline




© J Singh, 2011                                                     9
                                   9
Pipeline API Example Code
Split and Merge example


  class aPipe(pipeline.Pipeline):
      def run(self, e_kind, prop_name, *value_list):
          all_bs = []
          for v in value_list:
              stage = yield bPipe(e_kind, prop_name, v)
              all_bs.append(stage)
          yield common.Append(*all_bs)




© J Singh, 2011                                           10
                            10
Pause and Assess
• Assertion:
   – GAE Map/Reduce is a complete solution for analysis of social
     network mining
   – We know it will scale, the question is how far.


• Working on one Proof of Concept for Social Network Mining
   – Recruiting a second test case


• Will report back in 3-4 months with data on
   – Performance
   – Cost
   – Limits of scalability


© J Singh, 2011                                                     11
                                     11
Adapting the algorithm to M/R
• Clustering Algorithm

   1. Create k randomly placed centroids       Map each
                                               data point

   2. Find the centroid (1-k) closest to each data point


   3. Move each centroid to the average of its members
                                              Reduce
                                           Each Centroid
   4. Repeat 2 and 3 until there is no more change

          Connect to next stage
           using Pipelining API

© J Singh, 2011                                             12
                                  12
About Us
• Involved with Map/Reduce and NoSQL technologies on several
  platforms
   – Google App Engine, MongoDB


• DataThinks.org is a new service of Early Stage IT
   – Building and operating “Big Data” analytics services




                           Thanks
© J Singh, 2011                                                13
                                   13

More Related Content

Similar to Social Network Mining Solutions using Google App Engine Map Reduce

Big Data Laboratory
Big Data LaboratoryBig Data Laboratory
Big Data LaboratoryJ Singh
 
Srinivas Muddana Resume
Srinivas Muddana ResumeSrinivas Muddana Resume
Srinivas Muddana Resumemuddanas
 
Srinivas Muddana Resume
Srinivas Muddana ResumeSrinivas Muddana Resume
Srinivas Muddana Resumemuddanas
 
Srinivas Muddana Resume
Srinivas Muddana ResumeSrinivas Muddana Resume
Srinivas Muddana Resumemuddanas
 
Download It
Download ItDownload It
Download Itbutest
 
DYNAMIC SLICING OF ASPECT-ORIENTED PROGRAMS
DYNAMIC SLICING OF ASPECT-ORIENTED PROGRAMSDYNAMIC SLICING OF ASPECT-ORIENTED PROGRAMS
DYNAMIC SLICING OF ASPECT-ORIENTED PROGRAMSPraveen Penumathsa
 
DYNAMIC SLICING OF ASPECT-ORIENTED PROGRAMS
DYNAMIC SLICING OF ASPECT-ORIENTED PROGRAMSDYNAMIC SLICING OF ASPECT-ORIENTED PROGRAMS
DYNAMIC SLICING OF ASPECT-ORIENTED PROGRAMSPraveen Penumathsa
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop EcosystemJ Singh
 
Scalable image recognition model with deep embedding
Scalable image recognition model with deep embeddingScalable image recognition model with deep embedding
Scalable image recognition model with deep embedding捷恩 蔡
 
Deep Learning Applications to Satellite Imagery
Deep Learning Applications to Satellite ImageryDeep Learning Applications to Satellite Imagery
Deep Learning Applications to Satellite Imageryrlewis48
 
Using R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy LansleyUsing R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy LansleyGuy Lansley
 
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...Srivatsan Ramanujam
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.pptSathish24111
 
GoFFish - A Sub-graph centric framework for large scale graph analytics
GoFFish - A Sub-graph centric framework for large scale graph analyticsGoFFish - A Sub-graph centric framework for large scale graph analytics
GoFFish - A Sub-graph centric framework for large scale graph analyticscharithwiki
 
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15MLconf
 

Similar to Social Network Mining Solutions using Google App Engine Map Reduce (20)

Big Data Laboratory
Big Data LaboratoryBig Data Laboratory
Big Data Laboratory
 
Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 
Resume
ResumeResume
Resume
 
Srinivas Muddana Resume
Srinivas Muddana ResumeSrinivas Muddana Resume
Srinivas Muddana Resume
 
Srinivas Muddana Resume
Srinivas Muddana ResumeSrinivas Muddana Resume
Srinivas Muddana Resume
 
Srinivas Muddana Resume
Srinivas Muddana ResumeSrinivas Muddana Resume
Srinivas Muddana Resume
 
Download It
Download ItDownload It
Download It
 
DYNAMIC SLICING OF ASPECT-ORIENTED PROGRAMS
DYNAMIC SLICING OF ASPECT-ORIENTED PROGRAMSDYNAMIC SLICING OF ASPECT-ORIENTED PROGRAMS
DYNAMIC SLICING OF ASPECT-ORIENTED PROGRAMS
 
DYNAMIC SLICING OF ASPECT-ORIENTED PROGRAMS
DYNAMIC SLICING OF ASPECT-ORIENTED PROGRAMSDYNAMIC SLICING OF ASPECT-ORIENTED PROGRAMS
DYNAMIC SLICING OF ASPECT-ORIENTED PROGRAMS
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm Design
 
Project Progress
Project ProgressProject Progress
Project Progress
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
Scalable image recognition model with deep embedding
Scalable image recognition model with deep embeddingScalable image recognition model with deep embedding
Scalable image recognition model with deep embedding
 
Deep Learning Applications to Satellite Imagery
Deep Learning Applications to Satellite ImageryDeep Learning Applications to Satellite Imagery
Deep Learning Applications to Satellite Imagery
 
Using R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy LansleyUsing R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy Lansley
 
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.ppt
 
GoFFish - A Sub-graph centric framework for large scale graph analytics
GoFFish - A Sub-graph centric framework for large scale graph analyticsGoFFish - A Sub-graph centric framework for large scale graph analytics
GoFFish - A Sub-graph centric framework for large scale graph analytics
 
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
 

More from J Singh

OpenLSH - a framework for locality sensitive hashing
OpenLSH  - a framework for locality sensitive hashingOpenLSH  - a framework for locality sensitive hashing
OpenLSH - a framework for locality sensitive hashingJ Singh
 
Designing analytics for big data
Designing analytics for big dataDesigning analytics for big data
Designing analytics for big dataJ Singh
 
Open LSH - september 2014 update
Open LSH  - september 2014 updateOpen LSH  - september 2014 update
Open LSH - september 2014 updateJ Singh
 
PaaS - google app engine
PaaS  - google app enginePaaS  - google app engine
PaaS - google app engineJ Singh
 
Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)J Singh
 
Data Analytic Technology Platforms: Options and Tradeoffs
Data Analytic Technology Platforms: Options and TradeoffsData Analytic Technology Platforms: Options and Tradeoffs
Data Analytic Technology Platforms: Options and TradeoffsJ Singh
 
Facebook Analytics with Elastic Map/Reduce
Facebook Analytics with Elastic Map/ReduceFacebook Analytics with Elastic Map/Reduce
Facebook Analytics with Elastic Map/ReduceJ Singh
 
High Throughput Data Analysis
High Throughput Data AnalysisHigh Throughput Data Analysis
High Throughput Data AnalysisJ Singh
 
NoSQL and MapReduce
NoSQL and MapReduceNoSQL and MapReduce
NoSQL and MapReduceJ Singh
 
CS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed CommitCS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed CommitJ Singh
 
CS 542 -- Failure Recovery, Concurrency Control
CS 542 -- Failure Recovery, Concurrency ControlCS 542 -- Failure Recovery, Concurrency Control
CS 542 -- Failure Recovery, Concurrency ControlJ Singh
 
CS 542 -- Query Optimization
CS 542 -- Query OptimizationCS 542 -- Query Optimization
CS 542 -- Query OptimizationJ Singh
 
CS 542 -- Query Execution
CS 542 -- Query ExecutionCS 542 -- Query Execution
CS 542 -- Query ExecutionJ Singh
 
CS 542 Putting it all together -- Storage Management
CS 542 Putting it all together -- Storage ManagementCS 542 Putting it all together -- Storage Management
CS 542 Putting it all together -- Storage ManagementJ Singh
 
CS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduceCS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduceJ Singh
 
CS 542 Database Index Structures
CS 542 Database Index StructuresCS 542 Database Index Structures
CS 542 Database Index StructuresJ Singh
 
CS 542 Controlling Database Integrity and Performance
CS 542 Controlling Database Integrity and PerformanceCS 542 Controlling Database Integrity and Performance
CS 542 Controlling Database Integrity and PerformanceJ Singh
 
CS 542 Overview of query processing
CS 542 Overview of query processingCS 542 Overview of query processing
CS 542 Overview of query processingJ Singh
 
CS 542 Introduction
CS 542 IntroductionCS 542 Introduction
CS 542 IntroductionJ Singh
 
Cloud Computing from an Entrpreneur's Viewpoint
Cloud Computing from an Entrpreneur's ViewpointCloud Computing from an Entrpreneur's Viewpoint
Cloud Computing from an Entrpreneur's ViewpointJ Singh
 

More from J Singh (20)

OpenLSH - a framework for locality sensitive hashing
OpenLSH  - a framework for locality sensitive hashingOpenLSH  - a framework for locality sensitive hashing
OpenLSH - a framework for locality sensitive hashing
 
Designing analytics for big data
Designing analytics for big dataDesigning analytics for big data
Designing analytics for big data
 
Open LSH - september 2014 update
Open LSH  - september 2014 updateOpen LSH  - september 2014 update
Open LSH - september 2014 update
 
PaaS - google app engine
PaaS  - google app enginePaaS  - google app engine
PaaS - google app engine
 
Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)
 
Data Analytic Technology Platforms: Options and Tradeoffs
Data Analytic Technology Platforms: Options and TradeoffsData Analytic Technology Platforms: Options and Tradeoffs
Data Analytic Technology Platforms: Options and Tradeoffs
 
Facebook Analytics with Elastic Map/Reduce
Facebook Analytics with Elastic Map/ReduceFacebook Analytics with Elastic Map/Reduce
Facebook Analytics with Elastic Map/Reduce
 
High Throughput Data Analysis
High Throughput Data AnalysisHigh Throughput Data Analysis
High Throughput Data Analysis
 
NoSQL and MapReduce
NoSQL and MapReduceNoSQL and MapReduce
NoSQL and MapReduce
 
CS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed CommitCS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed Commit
 
CS 542 -- Failure Recovery, Concurrency Control
CS 542 -- Failure Recovery, Concurrency ControlCS 542 -- Failure Recovery, Concurrency Control
CS 542 -- Failure Recovery, Concurrency Control
 
CS 542 -- Query Optimization
CS 542 -- Query OptimizationCS 542 -- Query Optimization
CS 542 -- Query Optimization
 
CS 542 -- Query Execution
CS 542 -- Query ExecutionCS 542 -- Query Execution
CS 542 -- Query Execution
 
CS 542 Putting it all together -- Storage Management
CS 542 Putting it all together -- Storage ManagementCS 542 Putting it all together -- Storage Management
CS 542 Putting it all together -- Storage Management
 
CS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduceCS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduce
 
CS 542 Database Index Structures
CS 542 Database Index StructuresCS 542 Database Index Structures
CS 542 Database Index Structures
 
CS 542 Controlling Database Integrity and Performance
CS 542 Controlling Database Integrity and PerformanceCS 542 Controlling Database Integrity and Performance
CS 542 Controlling Database Integrity and Performance
 
CS 542 Overview of query processing
CS 542 Overview of query processingCS 542 Overview of query processing
CS 542 Overview of query processing
 
CS 542 Introduction
CS 542 IntroductionCS 542 Introduction
CS 542 Introduction
 
Cloud Computing from an Entrpreneur's Viewpoint
Cloud Computing from an Entrpreneur's ViewpointCloud Computing from an Entrpreneur's Viewpoint
Cloud Computing from an Entrpreneur's Viewpoint
 

Social Network Mining Solutions using Google App Engine Map Reduce

  • 1. Social Network Mining Solutions using Google App Engine Map Reduce J Singh, DataThinks.org October 19, 2011
  • 2. MapReduce: A Genealogical Perspective • Roots – Lisp, Scheme – APL • Google OS papers, 2004 – Exploit extreme parallelism of data • Apache Top Level Project (Hadoop) • MapReduceGAE borrows from these © J Singh, 2011 2 2
  • 3. Social Network Mining • Finding people based on data in social networks – Love and Romance – Common interests – Similar buying habits – Similar voting propensities – Location • It‟s not a new problem – We have additional solutions for the old problem • Examples based on proprietary data: eHarmony, etc. • Early examples based on social network data: ShoutFlow, WhoIsJustLikeMe. © J Singh, 2011 3 3
  • 4. Based on clustering algorithms • On-line demo of clustering • Resource intensive. – Best done in batch mode • Exploit data parallelism of the algorithm – App Engine Map Reduce, employing one map job for each cluster – App Engine Pipeline API, employing one stage of the pipeline for each „step‟ • But first, a detour into Map Reduce… © J Singh, 2011 4 4
  • 5. MapReduce Conceptual Underpinnings • Based on Functional Programming model – From Lisp / Scheme • (map square '(1 2 3 4)) (1 4 9 16) • (reduce plus '(1 4 9 16)) 30 – From APL • +/ N N  1 2 3 4 • Easy to distribute (based on each element of the vector) • New for Map/Reduce: Nice failure/retry semantics – Hundreds and thousands of low-end servers are running at the same time © J Singh, 2011 5 5
  • 6. MapReduce Flow © J Singh, 2011 6 6
  • 7. MapReduce Components in GAE 2011 • Input Reader – Several provided by GAE, can write your own • Map function: Written by Programmer • Shuffle function: – Provided by GAE, can write your own • Reduce function: Written by Programmer • Output Writer – Several provided by GAE, can write your own © J Singh, 2011 7 7
  • 8. Invoking GAE Map Reduce class MapreducePipeline (…): def run(self, job_name, # A string mapper_spec, # Mapper function reducer_spec, # Reducer function input_reader_spec, # Input reader fn output_writer_spec, # Output writer mapper_params, # A dictionary reducer_params, # A dictionary shards, # An int ) © J Singh, 2011 8 8
  • 9. GAE Pipeline API • Based on Python Generator functions • The old Unix idea on steroids: – Perform complex operations by piping data between primitives – But the primitives are not so primitive – Data lives in permanent storage between pipeline stages • MapreducePipeline (prev page) was just one type of pipeline © J Singh, 2011 9 9
  • 10. Pipeline API Example Code Split and Merge example class aPipe(pipeline.Pipeline): def run(self, e_kind, prop_name, *value_list): all_bs = [] for v in value_list: stage = yield bPipe(e_kind, prop_name, v) all_bs.append(stage) yield common.Append(*all_bs) © J Singh, 2011 10 10
  • 11. Pause and Assess • Assertion: – GAE Map/Reduce is a complete solution for analysis of social network mining – We know it will scale, the question is how far. • Working on one Proof of Concept for Social Network Mining – Recruiting a second test case • Will report back in 3-4 months with data on – Performance – Cost – Limits of scalability © J Singh, 2011 11 11
  • 12. Adapting the algorithm to M/R • Clustering Algorithm 1. Create k randomly placed centroids Map each data point 2. Find the centroid (1-k) closest to each data point 3. Move each centroid to the average of its members Reduce Each Centroid 4. Repeat 2 and 3 until there is no more change Connect to next stage using Pipelining API © J Singh, 2011 12 12
  • 13. About Us • Involved with Map/Reduce and NoSQL technologies on several platforms – Google App Engine, MongoDB • DataThinks.org is a new service of Early Stage IT – Building and operating “Big Data” analytics services Thanks © J Singh, 2011 13 13