SlideShare a Scribd company logo
1 of 10
Google App Engine
A Big Data Laboratory?
J Singh, Early Stage IT
March 20, 2012
2
© J Singh, 2011 2
App Engine as a Big Data Laboratory?
• Why bother? Why not use Hadoop?
• EvaluatingApp Engine as a Big Data Laboratory
– Loading Data
– Analytics Capabilities
– Visualization Capabilities
• Conclusions
3
© J Singh, 2011 3
Why Bother? Why not Hadoop?
• No install and configuration required
– Focus on the task: Analytics and Visualization
– Use the technology that powers Google Earth and Google Finance
• Works with Google Datastore
– Makes sense if your data is already there
• No import/export of data necessary
• But a purely „low-level‟ programming environment
– Write Map and Reduce functions in Python / Java
– No Pig, Hive, …
• Is this story for real? We wanted to find out.
4
© J Singh, 2011 4
Loading Data into GAE
• What? No native OS environment to work in?
– No OS commands, no file system accessible to the programmer
– Data Prep must be done elsewhere.
• But other options exist
1. Upload a file into Blobstore through an HTTP request
• Max object size 2GB, max get/put in one call: 1MB.
• Process into Datastore entities using BlobstoreInputReader or
BlobstoreZipInputReader classes.
2. Use remote_api to upload CSV files
• It‟s painful
– Only needs to be done one-time, we hope
– Or we need to set up a process for staging and feeding the data
5
© J Singh, 2011 5
Data Analysis: NumPy and SciPy
• NumPy and SciPy libraries using the traditional computing
model (not Map/Reduce) include:
– Array and Matrix manipulation
– Optimization algorithms, e.g., curve fitting, linear regression,
multi-variate regression.
– Multithreading (for embarrassingly parallel problems)
• Replace map(…) with parallel_map(…).
– map is a Python primitive
– parallel_map is a NumPy primitive
– Other scientific algorithms, e.g., Kalman Filtering, Signal
smoothing, Markov Chains.
• NumPy and SciPy depended on Python 2.7
– Enabled in Fall, 2011.
6
© J Singh, 2011 6
Data Analysis: MapReduce
• Input Reader
– Several provided by GAE, can write your own
• Map function: Written by Programmer
• Shuffle function: Provided
– Can write your own overrides for partitioning
(sharding) and comparison (use in sort)
• Reduce function: Written by Programmer
– Can be skipped if not needed
• Output Writer
– Several provided by GAE, can write your own
7
© J Singh, 2011 7
Data Analysis: Pipeline API
• Based on Python Generator functions
• Allows chaining of map reduce jobs
– Primitives for setting up various types of chains
• MapreducePipeline (prev page) was just one type of pipeline
• Available for Python or Java
– Python side better documented
Split and Merge example
class aPipe(pipeline.Pipeline):
def run(self, e_kind, prop_name, *value_list):
all_bs = []
for v in value_list:
stage = yield bPipe(e_kind, prop_name, v)
all_bs.append(stage)
yield common.Append(*all_bs)
8
© J Singh, 2011 8
Data Visualization
• Appengine supports multiple web frameworks for serving data
directly from the Datastore into an HTML5 Browser:
– Django, Jinja2, CherryPy, …
• Options:
– jQuery Visualize
– Google Visualization API
• Including MotionCharts
– Hans Rosling‟s Visualization API
– Check out his TED talk
• Conclusion:
– A rich set of facilities for visualization and taking action
9
© J Singh, 2011 9
Decision Factors
Usage Discussion
Proof of Concept
or
Demo
In GAE
Need a process for Data Loading
But saves on having to do Hadoop setup
Absence of Pig/Hive may be a limiting factor
Advantage in Visualization
Better security and isolation than Hadoop
Production
In GAE
Analyze cost before committing
Lock-in risk?
Production
elsewhere
Good semantic match between Datastore and HBase.
Need to do Hadoop setup and operation
10
© J Singh, 2011 10
Thank you
• J Singh
– President, Early Stage IT
• Technology Services and Strategy for Startups
• DataThinks.org is a new service of Early Stage IT
– “Big Data” analytics solutions

More Related Content

What's hot

Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadooproyans
 
Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Marcel Krcah
 
Map reduce paradigm explained
Map reduce paradigm explainedMap reduce paradigm explained
Map reduce paradigm explainedDmytro Sandu
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureDataWorks Summit
 
Extending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitExtending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitMilind Bhandarkar
 
Hadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFSHadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFSpraveen bhat
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop EasyNick Dimiduk
 
Pig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big DataPig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big DataDataWorks Summit
 
Pig programming is more fun: New features in Pig
Pig programming is more fun: New features in PigPig programming is more fun: New features in Pig
Pig programming is more fun: New features in Pigdaijy
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big DataDataWorks Summit
 
Getting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceGetting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceobdit
 
Future of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsFuture of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsMilind Bhandarkar
 
Hadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiHadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiJoydeep Sen Sarma
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1Sperasoft
 

What's hot (20)

Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
 
February 2014 HUG : Hive On Tez
February 2014 HUG : Hive On TezFebruary 2014 HUG : Hive On Tez
February 2014 HUG : Hive On Tez
 
Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 
Map reduce paradigm explained
Map reduce paradigm explainedMap reduce paradigm explained
Map reduce paradigm explained
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Extending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitExtending Hadoop for Fun & Profit
Extending Hadoop for Fun & Profit
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
Hadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFSHadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFS
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 
Pig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big DataPig on Tez: Low Latency Data Processing with Big Data
Pig on Tez: Low Latency Data Processing with Big Data
 
Pig programming is more fun: New features in Pig
Pig programming is more fun: New features in PigPig programming is more fun: New features in Pig
Pig programming is more fun: New features in Pig
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big Data
 
Getting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduceGetting started with Hadoop, Hive, and Elastic MapReduce
Getting started with Hadoop, Hive, and Elastic MapReduce
 
Future of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsFuture of Data Intensive Applicaitons
Future of Data Intensive Applicaitons
 
Hadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiHadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-Delhi
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
 

Viewers also liked

OpenLSH - a framework for locality sensitive hashing
OpenLSH  - a framework for locality sensitive hashingOpenLSH  - a framework for locality sensitive hashing
OpenLSH - a framework for locality sensitive hashingJ Singh
 
Tableau reseller partner in Australia Bilytica Best business Intelligence com...
Tableau reseller partner in Australia Bilytica Best business Intelligence com...Tableau reseller partner in Australia Bilytica Best business Intelligence com...
Tableau reseller partner in Australia Bilytica Best business Intelligence com...Carie John
 
2016 Standardization of Laboratory Test Coding - PHI Conference
2016 Standardization of Laboratory Test Coding - PHI Conference2016 Standardization of Laboratory Test Coding - PHI Conference
2016 Standardization of Laboratory Test Coding - PHI ConferenceMegan Sawchuk
 
Whitepaper2012 "Virtual Laboratory for Analytic Geometry" UNAM
Whitepaper2012 "Virtual Laboratory for Analytic Geometry" UNAMWhitepaper2012 "Virtual Laboratory for Analytic Geometry" UNAM
Whitepaper2012 "Virtual Laboratory for Analytic Geometry" UNAMmetagraphos
 
Tableau reseller partner in Cape Verde Bilytica Best business Intelligence Co...
Tableau reseller partner in Cape Verde Bilytica Best business Intelligence Co...Tableau reseller partner in Cape Verde Bilytica Best business Intelligence Co...
Tableau reseller partner in Cape Verde Bilytica Best business Intelligence Co...Carie John
 
Checking in on Healthcare Data Analytics
Checking in on Healthcare Data AnalyticsChecking in on Healthcare Data Analytics
Checking in on Healthcare Data AnalyticsCybera Inc.
 
INCREASING LABORATORY EFFICIENCY AND VALUE OF LABORATORY DATA BY MAXIMISING ...
INCREASING LABORATORY EFFICIENCY AND VALUE  OF LABORATORY DATA BY MAXIMISING ...INCREASING LABORATORY EFFICIENCY AND VALUE  OF LABORATORY DATA BY MAXIMISING ...
INCREASING LABORATORY EFFICIENCY AND VALUE OF LABORATORY DATA BY MAXIMISING ...Keynetix
 
Exploring the Role of Information Technology Systems in Preventing and Managi...
Exploring the Role of Information Technology Systems in Preventing and Managi...Exploring the Role of Information Technology Systems in Preventing and Managi...
Exploring the Role of Information Technology Systems in Preventing and Managi...Health Informatics New Zealand
 
Process Improvement - 10 Essential Ingredients
Process Improvement - 10 Essential IngredientsProcess Improvement - 10 Essential Ingredients
Process Improvement - 10 Essential IngredientsRichard Ouellette
 
Advanced Laboratory Analytics — A Disruptive Solution for Health Systems
Advanced Laboratory Analytics — A Disruptive Solution for Health SystemsAdvanced Laboratory Analytics — A Disruptive Solution for Health Systems
Advanced Laboratory Analytics — A Disruptive Solution for Health SystemsViewics
 
2008 Spotfire Life Science Forum
2008 Spotfire Life Science Forum2008 Spotfire Life Science Forum
2008 Spotfire Life Science ForumWolfgang G. Hoeck
 
The Evolution of Laboratory Data Systems: Replacing Paper, Streamlining Proce...
The Evolution of Laboratory Data Systems: Replacing Paper, Streamlining Proce...The Evolution of Laboratory Data Systems: Replacing Paper, Streamlining Proce...
The Evolution of Laboratory Data Systems: Replacing Paper, Streamlining Proce...IDBS
 
eHealth: Big Data, Sports Analysis & Clinical Records
eHealth: Big Data, Sports Analysis & Clinical Records eHealth: Big Data, Sports Analysis & Clinical Records
eHealth: Big Data, Sports Analysis & Clinical Records Health Informatics New Zealand
 
Electronic Medical Records - Paperless to Big Data Initiative
Electronic Medical Records - Paperless to Big Data InitiativeElectronic Medical Records - Paperless to Big Data Initiative
Electronic Medical Records - Paperless to Big Data InitiativeData Science Thailand
 
Basics of laboratory internal quality control, Ola Elgaddar, 2012
Basics of laboratory internal quality control, Ola Elgaddar, 2012Basics of laboratory internal quality control, Ola Elgaddar, 2012
Basics of laboratory internal quality control, Ola Elgaddar, 2012Ola Elgaddar
 
Quality control in the medical laboratory
Quality control in the medical laboratoryQuality control in the medical laboratory
Quality control in the medical laboratoryAdnan Jaran
 
Data analysis powerpoint
Data analysis powerpointData analysis powerpoint
Data analysis powerpointjamiebrandon
 
Big Data in Healthcare Made Simple: Where It Stands Today and Where It’s Going
Big Data in Healthcare Made Simple: Where It Stands Today and Where It’s GoingBig Data in Healthcare Made Simple: Where It Stands Today and Where It’s Going
Big Data in Healthcare Made Simple: Where It Stands Today and Where It’s GoingHealth Catalyst
 

Viewers also liked (20)

OpenLSH - a framework for locality sensitive hashing
OpenLSH  - a framework for locality sensitive hashingOpenLSH  - a framework for locality sensitive hashing
OpenLSH - a framework for locality sensitive hashing
 
Tableau reseller partner in Australia Bilytica Best business Intelligence com...
Tableau reseller partner in Australia Bilytica Best business Intelligence com...Tableau reseller partner in Australia Bilytica Best business Intelligence com...
Tableau reseller partner in Australia Bilytica Best business Intelligence com...
 
2016 Standardization of Laboratory Test Coding - PHI Conference
2016 Standardization of Laboratory Test Coding - PHI Conference2016 Standardization of Laboratory Test Coding - PHI Conference
2016 Standardization of Laboratory Test Coding - PHI Conference
 
Whitepaper2012 "Virtual Laboratory for Analytic Geometry" UNAM
Whitepaper2012 "Virtual Laboratory for Analytic Geometry" UNAMWhitepaper2012 "Virtual Laboratory for Analytic Geometry" UNAM
Whitepaper2012 "Virtual Laboratory for Analytic Geometry" UNAM
 
Tableau reseller partner in Cape Verde Bilytica Best business Intelligence Co...
Tableau reseller partner in Cape Verde Bilytica Best business Intelligence Co...Tableau reseller partner in Cape Verde Bilytica Best business Intelligence Co...
Tableau reseller partner in Cape Verde Bilytica Best business Intelligence Co...
 
Dmla0609 Hoeck Presentation
Dmla0609 Hoeck PresentationDmla0609 Hoeck Presentation
Dmla0609 Hoeck Presentation
 
Checking in on Healthcare Data Analytics
Checking in on Healthcare Data AnalyticsChecking in on Healthcare Data Analytics
Checking in on Healthcare Data Analytics
 
INCREASING LABORATORY EFFICIENCY AND VALUE OF LABORATORY DATA BY MAXIMISING ...
INCREASING LABORATORY EFFICIENCY AND VALUE  OF LABORATORY DATA BY MAXIMISING ...INCREASING LABORATORY EFFICIENCY AND VALUE  OF LABORATORY DATA BY MAXIMISING ...
INCREASING LABORATORY EFFICIENCY AND VALUE OF LABORATORY DATA BY MAXIMISING ...
 
Exploring the Role of Information Technology Systems in Preventing and Managi...
Exploring the Role of Information Technology Systems in Preventing and Managi...Exploring the Role of Information Technology Systems in Preventing and Managi...
Exploring the Role of Information Technology Systems in Preventing and Managi...
 
Process Improvement - 10 Essential Ingredients
Process Improvement - 10 Essential IngredientsProcess Improvement - 10 Essential Ingredients
Process Improvement - 10 Essential Ingredients
 
Advanced Laboratory Analytics — A Disruptive Solution for Health Systems
Advanced Laboratory Analytics — A Disruptive Solution for Health SystemsAdvanced Laboratory Analytics — A Disruptive Solution for Health Systems
Advanced Laboratory Analytics — A Disruptive Solution for Health Systems
 
2008 Spotfire Life Science Forum
2008 Spotfire Life Science Forum2008 Spotfire Life Science Forum
2008 Spotfire Life Science Forum
 
The Evolution of Laboratory Data Systems: Replacing Paper, Streamlining Proce...
The Evolution of Laboratory Data Systems: Replacing Paper, Streamlining Proce...The Evolution of Laboratory Data Systems: Replacing Paper, Streamlining Proce...
The Evolution of Laboratory Data Systems: Replacing Paper, Streamlining Proce...
 
Clinical data analytics
Clinical data analyticsClinical data analytics
Clinical data analytics
 
eHealth: Big Data, Sports Analysis & Clinical Records
eHealth: Big Data, Sports Analysis & Clinical Records eHealth: Big Data, Sports Analysis & Clinical Records
eHealth: Big Data, Sports Analysis & Clinical Records
 
Electronic Medical Records - Paperless to Big Data Initiative
Electronic Medical Records - Paperless to Big Data InitiativeElectronic Medical Records - Paperless to Big Data Initiative
Electronic Medical Records - Paperless to Big Data Initiative
 
Basics of laboratory internal quality control, Ola Elgaddar, 2012
Basics of laboratory internal quality control, Ola Elgaddar, 2012Basics of laboratory internal quality control, Ola Elgaddar, 2012
Basics of laboratory internal quality control, Ola Elgaddar, 2012
 
Quality control in the medical laboratory
Quality control in the medical laboratoryQuality control in the medical laboratory
Quality control in the medical laboratory
 
Data analysis powerpoint
Data analysis powerpointData analysis powerpoint
Data analysis powerpoint
 
Big Data in Healthcare Made Simple: Where It Stands Today and Where It’s Going
Big Data in Healthcare Made Simple: Where It Stands Today and Where It’s GoingBig Data in Healthcare Made Simple: Where It Stands Today and Where It’s Going
Big Data in Healthcare Made Simple: Where It Stands Today and Where It’s Going
 

Similar to Big Data Laboratory

Experimentation Platform on Hadoop
Experimentation Platform on HadoopExperimentation Platform on Hadoop
Experimentation Platform on HadoopDataWorks Summit
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopTony Ng
 
Social Media Mining using GAE Map Reduce
Social Media Mining using GAE Map ReduceSocial Media Mining using GAE Map Reduce
Social Media Mining using GAE Map ReduceJ Singh
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to productionGeorg Heiler
 
MongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and VisualizationMongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and VisualizationMongoDB
 
Python Powered Data Science at Pivotal (PyData 2013)
Python Powered Data Science at Pivotal (PyData 2013)Python Powered Data Science at Pivotal (PyData 2013)
Python Powered Data Science at Pivotal (PyData 2013)Srivatsan Ramanujam
 
SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightTillmann Eitelberg
 
Guider: An Integrated Runtime Performance Analyzer on AGL
Guider: An Integrated Runtime Performance Analyzer on AGLGuider: An Integrated Runtime Performance Analyzer on AGL
Guider: An Integrated Runtime Performance Analyzer on AGLPeace Lee
 
EPUG UKI - Lancaster Analytics
EPUG UKI - Lancaster AnalyticsEPUG UKI - Lancaster Analytics
EPUG UKI - Lancaster Analyticsjhkrug
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onDony Riyanto
 
Capital onehadoopintro
Capital onehadoopintroCapital onehadoopintro
Capital onehadoopintroDoug Chang
 
Data Pipelines with Python - NWA TechFest 2017
Data Pipelines with Python - NWA TechFest 2017Data Pipelines with Python - NWA TechFest 2017
Data Pipelines with Python - NWA TechFest 2017Casey Kinsey
 
O365Engage17 - How to Automate SharePoint Provisioning with PNP Framework
O365Engage17 - How to Automate SharePoint Provisioning with PNP FrameworkO365Engage17 - How to Automate SharePoint Provisioning with PNP Framework
O365Engage17 - How to Automate SharePoint Provisioning with PNP FrameworkNCCOMMS
 
Machine Learning Platform in LINE Fukuoka
Machine Learning Platform in LINE FukuokaMachine Learning Platform in LINE Fukuoka
Machine Learning Platform in LINE FukuokaLINE Corporation
 
Data analytics in the cloud with Jupyter notebooks.
Data analytics in the cloud with Jupyter notebooks.Data analytics in the cloud with Jupyter notebooks.
Data analytics in the cloud with Jupyter notebooks.Graham Dumpleton
 
aip_developer_overview_icar_2014
aip_developer_overview_icar_2014aip_developer_overview_icar_2014
aip_developer_overview_icar_2014Matthew Vaughn
 

Similar to Big Data Laboratory (20)

Experimentation Platform on Hadoop
Experimentation Platform on HadoopExperimentation Platform on Hadoop
Experimentation Platform on Hadoop
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on Hadoop
 
Social Media Mining using GAE Map Reduce
Social Media Mining using GAE Map ReduceSocial Media Mining using GAE Map Reduce
Social Media Mining using GAE Map Reduce
 
Python ml
Python mlPython ml
Python ml
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to production
 
MongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and VisualizationMongoDB for Spatio-Behavioral Data Analysis and Visualization
MongoDB for Spatio-Behavioral Data Analysis and Visualization
 
App Engine Meetup
App Engine MeetupApp Engine Meetup
App Engine Meetup
 
Python Powered Data Science at Pivotal (PyData 2013)
Python Powered Data Science at Pivotal (PyData 2013)Python Powered Data Science at Pivotal (PyData 2013)
Python Powered Data Science at Pivotal (PyData 2013)
 
SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsight
 
Guider: An Integrated Runtime Performance Analyzer on AGL
Guider: An Integrated Runtime Performance Analyzer on AGLGuider: An Integrated Runtime Performance Analyzer on AGL
Guider: An Integrated Runtime Performance Analyzer on AGL
 
EPUG UKI - Lancaster Analytics
EPUG UKI - Lancaster AnalyticsEPUG UKI - Lancaster Analytics
EPUG UKI - Lancaster Analytics
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
 
Capital onehadoopintro
Capital onehadoopintroCapital onehadoopintro
Capital onehadoopintro
 
Data Pipelines with Python - NWA TechFest 2017
Data Pipelines with Python - NWA TechFest 2017Data Pipelines with Python - NWA TechFest 2017
Data Pipelines with Python - NWA TechFest 2017
 
O365Engage17 - How to Automate SharePoint Provisioning with PNP Framework
O365Engage17 - How to Automate SharePoint Provisioning with PNP FrameworkO365Engage17 - How to Automate SharePoint Provisioning with PNP Framework
O365Engage17 - How to Automate SharePoint Provisioning with PNP Framework
 
Machine Learning Platform in LINE Fukuoka
Machine Learning Platform in LINE FukuokaMachine Learning Platform in LINE Fukuoka
Machine Learning Platform in LINE Fukuoka
 
Data analytics in the cloud with Jupyter notebooks.
Data analytics in the cloud with Jupyter notebooks.Data analytics in the cloud with Jupyter notebooks.
Data analytics in the cloud with Jupyter notebooks.
 
aip_developer_overview_icar_2014
aip_developer_overview_icar_2014aip_developer_overview_icar_2014
aip_developer_overview_icar_2014
 
What's up?
What's up?What's up?
What's up?
 

More from J Singh

Designing analytics for big data
Designing analytics for big dataDesigning analytics for big data
Designing analytics for big dataJ Singh
 
Open LSH - september 2014 update
Open LSH  - september 2014 updateOpen LSH  - september 2014 update
Open LSH - september 2014 updateJ Singh
 
PaaS - google app engine
PaaS  - google app enginePaaS  - google app engine
PaaS - google app engineJ Singh
 
Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)J Singh
 
Data Analytic Technology Platforms: Options and Tradeoffs
Data Analytic Technology Platforms: Options and TradeoffsData Analytic Technology Platforms: Options and Tradeoffs
Data Analytic Technology Platforms: Options and TradeoffsJ Singh
 
High Throughput Data Analysis
High Throughput Data AnalysisHigh Throughput Data Analysis
High Throughput Data AnalysisJ Singh
 
NoSQL and MapReduce
NoSQL and MapReduceNoSQL and MapReduce
NoSQL and MapReduceJ Singh
 
CS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed CommitCS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed CommitJ Singh
 
CS 542 -- Failure Recovery, Concurrency Control
CS 542 -- Failure Recovery, Concurrency ControlCS 542 -- Failure Recovery, Concurrency Control
CS 542 -- Failure Recovery, Concurrency ControlJ Singh
 
CS 542 -- Query Optimization
CS 542 -- Query OptimizationCS 542 -- Query Optimization
CS 542 -- Query OptimizationJ Singh
 
CS 542 -- Query Execution
CS 542 -- Query ExecutionCS 542 -- Query Execution
CS 542 -- Query ExecutionJ Singh
 
CS 542 Putting it all together -- Storage Management
CS 542 Putting it all together -- Storage ManagementCS 542 Putting it all together -- Storage Management
CS 542 Putting it all together -- Storage ManagementJ Singh
 
CS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduceCS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduceJ Singh
 
CS 542 Database Index Structures
CS 542 Database Index StructuresCS 542 Database Index Structures
CS 542 Database Index StructuresJ Singh
 
CS 542 Controlling Database Integrity and Performance
CS 542 Controlling Database Integrity and PerformanceCS 542 Controlling Database Integrity and Performance
CS 542 Controlling Database Integrity and PerformanceJ Singh
 
CS 542 Overview of query processing
CS 542 Overview of query processingCS 542 Overview of query processing
CS 542 Overview of query processingJ Singh
 
CS 542 Introduction
CS 542 IntroductionCS 542 Introduction
CS 542 IntroductionJ Singh
 
Cloud Computing from an Entrpreneur's Viewpoint
Cloud Computing from an Entrpreneur's ViewpointCloud Computing from an Entrpreneur's Viewpoint
Cloud Computing from an Entrpreneur's ViewpointJ Singh
 

More from J Singh (18)

Designing analytics for big data
Designing analytics for big dataDesigning analytics for big data
Designing analytics for big data
 
Open LSH - september 2014 update
Open LSH  - september 2014 updateOpen LSH  - september 2014 update
Open LSH - september 2014 update
 
PaaS - google app engine
PaaS  - google app enginePaaS  - google app engine
PaaS - google app engine
 
Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)Mining of massive datasets using locality sensitive hashing (LSH)
Mining of massive datasets using locality sensitive hashing (LSH)
 
Data Analytic Technology Platforms: Options and Tradeoffs
Data Analytic Technology Platforms: Options and TradeoffsData Analytic Technology Platforms: Options and Tradeoffs
Data Analytic Technology Platforms: Options and Tradeoffs
 
High Throughput Data Analysis
High Throughput Data AnalysisHigh Throughput Data Analysis
High Throughput Data Analysis
 
NoSQL and MapReduce
NoSQL and MapReduceNoSQL and MapReduce
NoSQL and MapReduce
 
CS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed CommitCS 542 -- Concurrency Control, Distributed Commit
CS 542 -- Concurrency Control, Distributed Commit
 
CS 542 -- Failure Recovery, Concurrency Control
CS 542 -- Failure Recovery, Concurrency ControlCS 542 -- Failure Recovery, Concurrency Control
CS 542 -- Failure Recovery, Concurrency Control
 
CS 542 -- Query Optimization
CS 542 -- Query OptimizationCS 542 -- Query Optimization
CS 542 -- Query Optimization
 
CS 542 -- Query Execution
CS 542 -- Query ExecutionCS 542 -- Query Execution
CS 542 -- Query Execution
 
CS 542 Putting it all together -- Storage Management
CS 542 Putting it all together -- Storage ManagementCS 542 Putting it all together -- Storage Management
CS 542 Putting it all together -- Storage Management
 
CS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduceCS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduce
 
CS 542 Database Index Structures
CS 542 Database Index StructuresCS 542 Database Index Structures
CS 542 Database Index Structures
 
CS 542 Controlling Database Integrity and Performance
CS 542 Controlling Database Integrity and PerformanceCS 542 Controlling Database Integrity and Performance
CS 542 Controlling Database Integrity and Performance
 
CS 542 Overview of query processing
CS 542 Overview of query processingCS 542 Overview of query processing
CS 542 Overview of query processing
 
CS 542 Introduction
CS 542 IntroductionCS 542 Introduction
CS 542 Introduction
 
Cloud Computing from an Entrpreneur's Viewpoint
Cloud Computing from an Entrpreneur's ViewpointCloud Computing from an Entrpreneur's Viewpoint
Cloud Computing from an Entrpreneur's Viewpoint
 

Recently uploaded

Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 

Recently uploaded (20)

Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 

Big Data Laboratory

  • 1. Google App Engine A Big Data Laboratory? J Singh, Early Stage IT March 20, 2012
  • 2. 2 © J Singh, 2011 2 App Engine as a Big Data Laboratory? • Why bother? Why not use Hadoop? • EvaluatingApp Engine as a Big Data Laboratory – Loading Data – Analytics Capabilities – Visualization Capabilities • Conclusions
  • 3. 3 © J Singh, 2011 3 Why Bother? Why not Hadoop? • No install and configuration required – Focus on the task: Analytics and Visualization – Use the technology that powers Google Earth and Google Finance • Works with Google Datastore – Makes sense if your data is already there • No import/export of data necessary • But a purely „low-level‟ programming environment – Write Map and Reduce functions in Python / Java – No Pig, Hive, … • Is this story for real? We wanted to find out.
  • 4. 4 © J Singh, 2011 4 Loading Data into GAE • What? No native OS environment to work in? – No OS commands, no file system accessible to the programmer – Data Prep must be done elsewhere. • But other options exist 1. Upload a file into Blobstore through an HTTP request • Max object size 2GB, max get/put in one call: 1MB. • Process into Datastore entities using BlobstoreInputReader or BlobstoreZipInputReader classes. 2. Use remote_api to upload CSV files • It‟s painful – Only needs to be done one-time, we hope – Or we need to set up a process for staging and feeding the data
  • 5. 5 © J Singh, 2011 5 Data Analysis: NumPy and SciPy • NumPy and SciPy libraries using the traditional computing model (not Map/Reduce) include: – Array and Matrix manipulation – Optimization algorithms, e.g., curve fitting, linear regression, multi-variate regression. – Multithreading (for embarrassingly parallel problems) • Replace map(…) with parallel_map(…). – map is a Python primitive – parallel_map is a NumPy primitive – Other scientific algorithms, e.g., Kalman Filtering, Signal smoothing, Markov Chains. • NumPy and SciPy depended on Python 2.7 – Enabled in Fall, 2011.
  • 6. 6 © J Singh, 2011 6 Data Analysis: MapReduce • Input Reader – Several provided by GAE, can write your own • Map function: Written by Programmer • Shuffle function: Provided – Can write your own overrides for partitioning (sharding) and comparison (use in sort) • Reduce function: Written by Programmer – Can be skipped if not needed • Output Writer – Several provided by GAE, can write your own
  • 7. 7 © J Singh, 2011 7 Data Analysis: Pipeline API • Based on Python Generator functions • Allows chaining of map reduce jobs – Primitives for setting up various types of chains • MapreducePipeline (prev page) was just one type of pipeline • Available for Python or Java – Python side better documented Split and Merge example class aPipe(pipeline.Pipeline): def run(self, e_kind, prop_name, *value_list): all_bs = [] for v in value_list: stage = yield bPipe(e_kind, prop_name, v) all_bs.append(stage) yield common.Append(*all_bs)
  • 8. 8 © J Singh, 2011 8 Data Visualization • Appengine supports multiple web frameworks for serving data directly from the Datastore into an HTML5 Browser: – Django, Jinja2, CherryPy, … • Options: – jQuery Visualize – Google Visualization API • Including MotionCharts – Hans Rosling‟s Visualization API – Check out his TED talk • Conclusion: – A rich set of facilities for visualization and taking action
  • 9. 9 © J Singh, 2011 9 Decision Factors Usage Discussion Proof of Concept or Demo In GAE Need a process for Data Loading But saves on having to do Hadoop setup Absence of Pig/Hive may be a limiting factor Advantage in Visualization Better security and isolation than Hadoop Production In GAE Analyze cost before committing Lock-in risk? Production elsewhere Good semantic match between Datastore and HBase. Need to do Hadoop setup and operation
  • 10. 10 © J Singh, 2011 10 Thank you • J Singh – President, Early Stage IT • Technology Services and Strategy for Startups • DataThinks.org is a new service of Early Stage IT – “Big Data” analytics solutions