@Galvanize with Pandas !!
Pandas, Data Wrangling & Data Science
August 12, 2016
@ksankar // doubleclix.wordpress.com
San Francisco 2016
o Intro & Setup [10:45-10:55) (10)
• Goals/non-goals
o Data Wrangling & Data Science Pipeline [10:55-11:05) (10)
o Pandas – APIs & Namespaces [11:05-11:15) (10)
o Pandas – Basic Maneuvers [11:15-11:30) (15)
o Pandas – Data Wrangling – Transformations, Aggregations & Join [11:30-12:15) (45)
• Hands-on : Titanic Dataset
• Hands-on : NW Dataset, State Of The Union Speeches, Recsys-2015 Data
o Q & A [12:15-Inf) (10)
o Not covering – Panels, Time Series
Agenda : Pandas, Data Wrangling & Data Science
http://pydata.org/sfo2016/schedule/presentation/67/
Goals & non-goals
Goals
¤Understand Data Wrangling with
Pandas
¤Focus on APIs and usage
¤Give you a focused time to work
thru examples
§ Work with me. I will wait if you
want to catch-up
¤Less theory, more usage - let us see
if this works
¤As straightforward as possible
§ The programs can be optimized
¤Foundation for the next 2 tutorials
§ Python Visualization for Exploration of
Data by Stephen F. Elston, Ronald Lopez
§ Applied Time Series Econometrics in Python (and R) by Jeffrey Yau
Non-goals
¡ Not “expert” Pandas
• We don’t have sufficient
time. The topic can be
easily a 1 day tutorial !
¡ Time to do hands-on
• Only 90 minutes
¡ Python vs. R
• I’ve come to discuss
Pandas, not to praise R !
¡ A passive talk
• Nope. Interactive &
hands-on
1. Brandon Rhodes - Pandas From The Ground Up - PyCon
2015 https://www.youtube.com/watch?v=5JnMutdy6Fw
2. A Visual Guide To Pandas by Jason Wirth
https://www.youtube.com/watch?v=9d5-Ti6onew
3. 2012 PyData Workshop: Data Analysis in Python with
Pandas by Wes McKinney
https://www.youtube.com/watch?v=MxRMXhjXZos
4. http://nbviewer.jupyter.org/github/jbochi/recsyschallenge2015/blob/master/visualization.ipynb
5. https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/
Thanks to the Giants whose work
helped to prepare this tutorial
About Me
o AI/Data Scientist
• Autonomous Vehicles [https://goo.gl/BgicSY][https://goo.gl/LZ3fY9]
• Building Autonomous Drone-Jarvis / Working towards FAA Drone Pilot Certification
• What would you want AI to do, if it could do whatever you want it to do ?[https://goo.gl/eqWUEn]
• Decision Data Science & Product Data Science
• Insights = Intelligence + Inference + Interface [https://goo.gl/s2KB6L]
• Predicting NFL with Elo like Nate Silver & 538 [NFL : http://goo.gl/Q2OgeJ, NBA’15 : https://goo.gl/aUhdo3]
o Have been speaking at OSCON [http://goo.gl/1MJLu], PyCon, Pydata [http://vimeo.com/63270513,
http://www.slideshare.net/ksankar/pydata-19] …
o Have done lots of things:
• Big Data (Retail, Bioinformatics, Financial, AdTech), Starting MS-CFRM, University of WA
• Written Books (Web 2.0, Wireless, Java,…), Standards, some work in AI
• Guest Lecturer at Naval PG School,…
o Volunteer as Robotics Judge at First Lego league World Competitions
o @ksankar, doubleclix.wordpress.com
Pandas & Notebook Installation
o Pandas – Best with Anaconda
o Notebook - Install Jupyter/iPython
Tutorial Materials
o Github : https://github.com/xsankar/cautious-octo-waffle
• Clone or download zip
o Open terminal
• cd ~/cautious-octo-waffle
• jupyter notebook
o Click on ipython dashboard
• Run 000-PreFlightCheck.ipynb
• Now you are ready for the workshop !
o One More Thing !!
• The RecSys-2015 data is ~2GB, so please download the data to the data/recsys-2015 directory
Data Wrangling & Data Science Pipeline
10:55
Pipelines …
“[Collect-Store-Transform]-[Reason-Model]-[Deploy]-[Visualize-Recommend-Predict]-[Explore]”
Data Science - Context
o Scalable Model
Deployment
o Big Data
automation &
purpose built
appliances
(soft/hard)
o Manage SLAs &
response times
o Volume
o Velocity
o Streaming Data
o Canonical form
o Data catalog
o Data Fabric across the
organization
o Access to multiple
sources of data
o Think Hybrid – Big Data
Apps, Appliances &
Infrastructure
Collect Store Transform
o Metadata
o Monitor counters &
Metrics
o Structured vs. Multi-structured
o Flexible & Selectable
§ Data Subsets
§ Attribute sets
o Refine model with
§ Extended Data
subsets
§ Engineered
Attribute sets
o Validation run across a
larger data set
Reason Model Deploy
Data Management
Data Science
o Dynamic Data Sets
o 2 way key-value tagging of
datasets
o Extended attribute sets
o Advanced Analytics
Explore Visualize Recommend Predict
o Performance
o Scalability
o Refresh Latency
o In-memory Analytics
o Advanced Visualization
o Interactive Dashboards
o Map Overlay
o Infographics
¤ Bytes to Business
a.k.a. Build the full
stack
¤ Find Relevant Data
For Business
¤ Connect the Dots
Data Science :
The art of building a model with known knowns, which when let loose, works with unknown
unknowns!
Donald Rumsfeld is an armchair Data Scientist !
http://smartorg.com/2013/07/valuepoint19/
The World / You : Knowns / Unknowns
Unknown Known
o Others know, you don’t
o What we do
o Facts, outcomes or scenarios we have not encountered, nor considered
o “Black swans”, outliers, long tails of probability distributions
o Lack of experience, imagination
o Potential facts, outcomes we are aware of, but not with certainty
o Stochastic processes, probabilities
o Known Knowns
o There are things we know that we know
o Known Unknowns
o That is to say, there are things that we
now know we don't know
o But there are also Unknown Unknowns
o There are things we do not know we
don't know
The curious case of the Data Scientist
o Data Scientist is multi-faceted & Contextual
o Data Scientist should be building Data Products
o Data Scientist should tell a story
http://doubleclix.wordpress.com/2014/01/25/the-curious-case-of-the-data-scientist-profession/
Large is hard; Infinite is much easier !
– Titus Brown
Pandas – APIs & Namespaces
11:05
Pandas Data Model
o Layer over NumPy
o Data Model
• 1D Series (NumPy array w/labels)
• Data frame - 2D labelled sheet
• Column operations similar to vector operations
o Pay attention to the index
• Indexed rows, Indexed Columns & info at the center
o Pay attention to the objects
• DataFrame vs Series vs numpy array
• Eg. size() vs size
o “Answer all questions about a dataset” - Wes
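The difference between the objects is easiest to see on a tiny example. A minimal sketch, assuming a made-up three-row DataFrame (the column names `cls` and `age` are illustrative and not taken from the tutorial notebooks): `size` is an attribute on a DataFrame or Series, while `size()` is a method on a groupby object.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not from the tutorial notebooks
df = pd.DataFrame({"cls": [1, 1, 3], "age": [22.0, 38.0, 26.0]})

ages = df["age"]      # one column of a DataFrame is a Series (1D, labelled)
raw = ages.values     # the underlying NumPy array, with the labels dropped

n_cells = df.size                        # attribute: rows * columns
group_sizes = df.groupby("cls").size()   # method: number of rows per group

doubled = df["age"] * 2   # column operations behave like vector operations
```

Here `n_cells` is a plain integer, while `group_sizes` is a Series indexed by the group key, which is why it pays to check which object you are holding.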
pandas namespaces
objects
o pandas.Series
o pandas.DataFrame
o pandas.Panel
o pandas.Panel4D
o pandas.Index
I/O
o read_(csv, table,
excel, json, gbq,…)
o to_(csv, table,
excel, json, gbq,…)
o pandas.read_
o df.to_
Computations, operations,…
o +,-,
o pow,
o corr, …
DateTime
o .dt
NaN,
Missing
o isnull()
o fillna()
o dropna()
o skipna
o interpolate
String
o .str
Plotting
o .plot
Notes :
[1] df[“date”].dt - only series has date time ! df.dt won’t work
[2] .sort a DataFrame, but .order a Series (both are deprecated since pandas 0.17; .sort_values works on either)
[3] to_frame() converts a Series to a DataFrame. Most of the time DataFrame is the preferred
object
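The notes above can be tried on a small frame. This is only a sketch, assuming a made-up two-row DataFrame with hypothetical columns `date` and `clicks`; it uses `sort_values`, which is available on both DataFrame and Series in pandas 0.17 and later.

```python
import pandas as pd

# Hypothetical toy data, not from the tutorial notebooks
df = pd.DataFrame({"date": pd.to_datetime(["2016-08-12", "2016-08-13"]),
                   "clicks": [10, 7]})

years = df["date"].dt.year    # .dt lives on a datetime Series
has_dt = hasattr(df, "dt")    # a DataFrame has no .dt accessor

clicks_frame = df["clicks"].to_frame()   # Series -> one-column DataFrame
sorted_df = df.sort_values("clicks")     # sort rows by a column
```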
Data Science “folk knowledge” (1 of A)
o "If you torture the data long enough, it will confess to anything." – Hal Varian,
Computer Mediated Transactions
o Learning = Representation + Evaluation + Optimization
o It’s Generalization that counts
• The fundamental goal of machine learning is to generalize beyond the
examples in the training set
o Data alone is not enough
• Induction not deduction - Every learner should embody some knowledge
or assumptions beyond the data it is given in order to generalize beyond it
o Machine Learning is not magic – one cannot get something from nothing
• In order to infer, one needs the knobs & the dials
• One also needs a rich expressive dataset
A few useful things to know about machine learning - by Pedro Domingos
http://dl.acm.org/citation.cfm?id=2347755
Pandas- Basic Maneuvers
11:15
1-Getting Data in/out
o pd.read_csv(…)
o read_table(...) <- arbitrary delimited file
o read_{clipboard,json,excel,sas,sql,gbq,...}
o df.to_csv(...)
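A minimal sketch of the round trip, assuming in-memory text instead of real files so that it runs anywhere (the column names and values are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical inline data standing in for a file on disk
csv_text = "name,age\nAllen,35\nBraund,22\n"
df = pd.read_csv(io.StringIO(csv_text))

# read_table handles an arbitrary delimiter via sep
pipe_text = "name|age\nAllen|35\n"
df2 = pd.read_table(io.StringIO(pipe_text), sep="|")

# With no path given, to_csv returns the CSV as a string
out = df.to_csv(index=False)
round_trip = pd.read_csv(io.StringIO(out))
```

Passing `index=False` keeps the row labels out of the file, so reading it back gives the same columns.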
2-Basic Operations
o head()
o tail()
o count()
o describe()
o dtypes
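The operations listed above, sketched on a made-up frame with one missing value (the columns `age` and `fare` are illustrative, not the actual Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age
df = pd.DataFrame({"age": [22.0, np.nan, 26.0, 35.0],
                   "fare": [7.25, 71.28, 7.92, 53.10]})

first_two = df.head(2)    # first rows
last_one = df.tail(1)     # last rows
non_null = df.count()     # per-column count of non-missing values
summary = df.describe()   # count, mean, std, min, quartiles, max per column
types = df.dtypes         # attribute, not a method: dtype of each column
```

Note that `count()` skips missing values, so it is a quick way to spot columns with gaps.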
3-Labelled Indexing
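A small sketch of label-based versus position-based selection, assuming a made-up frame with string row labels (names and values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with a labelled index
df = pd.DataFrame({"age": [22, 38, 26], "fare": [7.25, 71.28, 7.92]},
                  index=["braund", "cumings", "heikkinen"])

by_label = df.loc["cumings", "fare"]               # row label, column label
by_position = df.iloc[0, 0]                        # row position, column position
label_slice = df.loc["braund":"cumings", "age"]    # label slices include both ends
older = df[df["age"] > 25]                         # boolean mask keeps the index labels
```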
4. na-Missing Data
o One of the tenets of big data and data science is that data is never fully clean. While we
can handle types, formats et al., missing values are always challenging
o One easy solution is to drop the rows that have missing values, but then we would lose
valuable data in the columns that do have values.
o A better solution is to impute data based on some criteria. It is true that data cannot be
created out of thin air, but data can be inferred with some success, which is better than
dropping the rows.
• We can replace null with 0
• A better solution is to replace missing numerical values with the average of the rest of the valid
values; for categorical values, replacing with the most common value is a good strategy
• We could use mode or median instead of mean
• Another good strategy is to infer the missing value from other attributes, i.e. “Evidence
from multiple fields”.
• For example, the Titanic data has a name field, and for imputing the missing age field we could use the
Mr., Master., Mrs., Miss. designation from the name and then fill in the average of the age field
from the corresponding designation. So a row with a missing age and “Master.” in the name would get
the average age of all records with “Master.”
• There is also a field for the number of siblings and spouses. We could average the age
based on the value of that field.
• We could even average the ages from different strategies.
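The designation-based imputation described above can be sketched as follows. This is a minimal illustration on hand-made rows with invented names and ages in the style of the Titanic data; the column names `Name`, `Age`, `Title` and `AgeFilled` and the regular expression are assumptions, not the code from the tutorial notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; two ages are missing
df = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Doe, Mrs. Jane", "Roe, Master. Tim",
             "Poe, Mr. Adam", "Roe, Master. Sam", "Moe, Mr. Carl"],
    "Age": [22.0, 38.0, 2.0, 35.0, np.nan, np.nan],
})

# Pull the designation: the word after the comma that ends in a period
df["Title"] = df["Name"].str.extract(r",\s*([A-Za-z]+)\.", expand=False)

# Mean age per designation, broadcast back to every row (NaN is skipped)
title_mean = df.groupby("Title")["Age"].transform("mean")

# Fill only the missing ages; known ages are left untouched
df["AgeFilled"] = df["Age"].fillna(title_mean)
```

The missing "Master." row gets 2.0 and the missing "Mr." row gets 28.5, the averages of the known ages for those designations. The same `groupby(...).transform(...)` pattern would work for the siblings/spouses field by swapping the grouping column.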
4. na-Missing Data
o NaN better than 0 - says I don’t know
• Comes in handy in recommendation, stock data on a Saturday,…
o Skipna
o Fillna
• forward fill/backward fill method !
o Interpolate
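A short sketch of these options on a made-up price series with a two-value gap (the values are illustrative only). It uses the `ffill()` and `bfill()` methods, which are the forward and backward fill the slide refers to.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap, e.g. no quotes over a weekend
price = pd.Series([10.0, np.nan, np.nan, 16.0])

mean_skip = price.mean()                  # skipna=True by default
mean_strict = price.mean(skipna=False)    # NaN propagates: "I don't know"

zero_filled = price.fillna(0)     # replace NaN with a constant
forward = price.ffill()           # carry the last valid value forward
backward = price.bfill()          # pull the next valid value backward
interp = price.interpolate()      # linear interpolation between valid points
```

Filling with 0 changes the mean, while leaving NaN in place lets pandas skip the gap, which is the point of "NaN better than 0".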
5-Statistics
o Min
o Max
o Quantile
o Mean,SD,variance,…
o Correlation
• Pearson
• Spearman
o Covariance
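The statistics listed above, sketched on a made-up two-column frame where `y` is the square of `x`, so the relationship is monotonic but not linear (column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data: y = x squared
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                   "y": [1.0, 4.0, 9.0, 16.0]})

lo, hi = df["x"].min(), df["x"].max()
median = df["x"].quantile(0.5)
mean, sd, var = df["x"].mean(), df["x"].std(), df["x"].var()

pearson = df.corr(method="pearson").loc["x", "y"]     # linear correlation
spearman = df.corr(method="spearman").loc["x", "y"]   # rank correlation
cov = df["x"].cov(df["y"])                            # sample covariance
```

Spearman works on ranks, so it reports a perfect correlation here, while Pearson stays slightly below 1 because the relationship is curved.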
6-Aggregation/Groupby
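A minimal groupby sketch, assuming a made-up five-row frame (the lowercase column names `pclass`, `survived` and `fare` are illustrative and the values are not the real Titanic data):

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"pclass": [1, 1, 3, 3, 3],
                   "survived": [1, 0, 0, 1, 0],
                   "fare": [80.0, 60.0, 8.0, 7.0, 9.0]})

grouped = df.groupby("pclass")            # split the rows by class

survival_rate = grouped["survived"].mean()                 # one value per group
fare_stats = grouped["fare"].agg(["mean", "max", "count"]) # several aggregates at once
```

`survival_rate` is a Series indexed by `pclass`, while `fare_stats` is a DataFrame with one column per aggregate.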
Pandas – Data Wrangling –
Transformations, Aggregations & Join
11:30
Merge,Join and friends
o merge
• Use Merge
• join is a set of common merge patterns with defaults
o groupby
• Think in terms of split-apply-combine
o stack/unstack
• unstack operation to compare unlike things - parameter to unstack
different columns
• Too much stack-unstack results in a series !
• Be ready to handle NaN
o Powerful Techniques
• groupby + merge
• groupby + unstack
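The techniques above can be sketched on two small made-up tables. The table and column names (`orders`, `customers`, `cust_id`, `amount`, `cust_total`) are assumptions for illustration and are not the Orders dataset from the notebook.

```python
import pandas as pd

# Hypothetical toy tables
orders = pd.DataFrame({"order_id": [1, 2, 3, 4],
                       "cust_id": ["a", "b", "a", "c"],
                       "year": [2015, 2015, 2016, 2016],
                       "amount": [10.0, 20.0, 30.0, 40.0]})
customers = pd.DataFrame({"cust_id": ["a", "b"],
                          "region": ["west", "east"]})

# merge: explicit key and join type; customer "c" has no match, so region is NaN
joined = orders.merge(customers, on="cust_id", how="left")

# groupby: split by customer, apply sum, combine into a Series
per_customer = orders.groupby("cust_id")["amount"].sum()

# groupby + merge: attach the aggregate back to every order row
totals = per_customer.rename("cust_total").reset_index()
with_totals = orders.merge(totals, on="cust_id")

# groupby + unstack: move the "year" level from the row index into columns
by_year = orders.groupby(["cust_id", "year"])["amount"].sum().unstack("year")
```

`by_year` has one row per customer and one column per year. Combinations that never occurred, such as customer "b" in 2016, show up as NaN, which is why the slide says to be ready to handle NaN.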
Hands-On : Pandas@Kaggle
o 020-Titanic.ipynb
o GitHub : https://github.com/xsankar/cautious-octo-waffle/blob/master/020-Titanic.ipynb
• Let us analyze the Titanic Dataset for a Kaggle Competition
Hands-On : Orders Data
o 030-Orders.ipynb
o Github : https://github.com/xsankar/cautious-octo-waffle/blob/master/030-Orders.ipynb
• Data wrangling with the Orders dataset
Data Science “folk knowledge” (Wisdom of Kaggle)
Jeremy’s Axioms
o Iteratively explore data
o Tools
• Excel Format, Perl, Perl Book, Pandas !
o Get your head around data
• Pivot Table
o Don’t over-complicate
o If people give you data, don’t assume that you
need to use all of it
o Look at pictures !
o History of your submissions – keep a tab
o Don’t be afraid to submit simple solutions
• We will do this during this workshop
Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
Hands-On : Clicks & Buys
o 050-RecSys-2015.ipynb
o Github : https://github.com/xsankar/cautious-octo-waffle/blob/master/050-RecSys-2015.ipynb
• Data wrangling with the RecSys-2015 dataset
Questions ?
12:15
Essential Reading List
o A few useful things to know about machine learning - by Pedro Domingos
• http://dl.acm.org/citation.cfm?id=2347755
o The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert
• http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf
o http://www.no-free-lunch.org/
o Controlling the false discovery rate: a practical and powerful approach to multiple testing by Benjamini, Y. and Hochberg, Y.
• http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf
o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe
• http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o Avoid these three mistakes, James Faghmous
• https://medium.com/about-data/73258b3848a4
o Leakage in Data Mining: Formulation, Detection, and Avoidance
• http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
For your reading & viewing pleasure … An ordered List
① An Introduction to Statistical Learning
• http://www-bcf.usc.edu/~gareth/ISL/
② ISL Class Stanford/Hastie/Tibshirani at their best - Statistical Learning
• http://online.stanford.edu/course/statistical-learning-winter-2014
③ Prof. Pedro Domingos
• https://class.coursera.org/machlearning-001/lecture/preview
④ Prof. Andrew Ng
• https://class.coursera.org/ml-003/lecture/preview
⑤ Prof. Abu-Mostafa, CaltechX: CS1156x: Learning From Data
• https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120
⑥ Mathematicalmonk @ YouTube
• https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
⑦ The Elements Of Statistical Learning
• http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/
The Beginning As The End
How did we do ?
4:45
Pandas, Data Wrangling & Data Science

More Related Content

What's hot

The Art of Social Media Analysis with Twitter & Python-OSCON 2012
The Art of Social Media Analysis with Twitter & Python-OSCON 2012The Art of Social Media Analysis with Twitter & Python-OSCON 2012
The Art of Social Media Analysis with Twitter & Python-OSCON 2012OSCON Byrum
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 
Information Visualization for Knowledge Discovery: An Introduction
Information Visualization for Knowledge Discovery: An IntroductionInformation Visualization for Knowledge Discovery: An Introduction
Information Visualization for Knowledge Discovery: An IntroductionKrist Wongsuphasawat
 
Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017
Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017
Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017MLconf
 
Test trend analysis: Towards robust reliable and timely tests
Test trend analysis: Towards robust reliable and timely testsTest trend analysis: Towards robust reliable and timely tests
Test trend analysis: Towards robust reliable and timely testsHugh McCamphill
 
The Road to Data Science - Joel Grus, June 2015
The Road to Data Science - Joel Grus, June 2015The Road to Data Science - Joel Grus, June 2015
The Road to Data Science - Joel Grus, June 2015Seattle DAML meetup
 
The Empirical Turn in Knowledge Representation
The Empirical Turn in Knowledge RepresentationThe Empirical Turn in Knowledge Representation
The Empirical Turn in Knowledge RepresentationFrank van Harmelen
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with RStephen Withington
 
Semantic Web questions we couldn't ask 10 years ago
Semantic Web questions we couldn't ask 10 years agoSemantic Web questions we couldn't ask 10 years ago
Semantic Web questions we couldn't ask 10 years agoFrank van Harmelen
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflowsSSSW
 
Measuring Relevance in the Negative Space
Measuring Relevance in the Negative SpaceMeasuring Relevance in the Negative Space
Measuring Relevance in the Negative SpaceTrey Grainger
 
Bio ontologies and semantic technologies
Bio ontologies and semantic technologiesBio ontologies and semantic technologies
Bio ontologies and semantic technologiesProf. Wim Van Criekinge
 
Slow-cooked data and APIs in the world of Big Data: the view from a city per...
Slow-cooked data and APIs in the world of Big Data: the view from a city per...Slow-cooked data and APIs in the world of Big Data: the view from a city per...
Slow-cooked data and APIs in the world of Big Data: the view from a city per...Oscar Corcho
 
Modular design patterns for systems that learn and reason: a boxology
Modular design patterns for systems that learn and reason: a boxologyModular design patterns for systems that learn and reason: a boxology
Modular design patterns for systems that learn and reason: a boxologyFrank van Harmelen
 
Myria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsMyria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsUniversity of Washington
 
Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Eli White
 

What's hot (20)

The Art of Social Media Analysis with Twitter & Python-OSCON 2012
The Art of Social Media Analysis with Twitter & Python-OSCON 2012The Art of Social Media Analysis with Twitter & Python-OSCON 2012
The Art of Social Media Analysis with Twitter & Python-OSCON 2012
 
Analyzing social media with Python and other tools (4/4)
Analyzing social media with Python and other tools (4/4) Analyzing social media with Python and other tools (4/4)
Analyzing social media with Python and other tools (4/4)
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
Information Visualization for Knowledge Discovery: An Introduction
Information Visualization for Knowledge Discovery: An IntroductionInformation Visualization for Knowledge Discovery: An Introduction
Information Visualization for Knowledge Discovery: An Introduction
 
Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017
Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017
Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017
 
Test trend analysis: Towards robust reliable and timely tests
Test trend analysis: Towards robust reliable and timely testsTest trend analysis: Towards robust reliable and timely tests
Test trend analysis: Towards robust reliable and timely tests
 
The Road to Data Science - Joel Grus, June 2015
The Road to Data Science - Joel Grus, June 2015The Road to Data Science - Joel Grus, June 2015
The Road to Data Science - Joel Grus, June 2015
 
The Empirical Turn in Knowledge Representation
The Empirical Turn in Knowledge RepresentationThe Empirical Turn in Knowledge Representation
The Empirical Turn in Knowledge Representation
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with R
 
Semantic Web questions we couldn't ask 10 years ago
Semantic Web questions we couldn't ask 10 years agoSemantic Web questions we couldn't ask 10 years ago
Semantic Web questions we couldn't ask 10 years ago
 
Data Structure in Elixir
Data Structure in ElixirData Structure in Elixir
Data Structure in Elixir
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflows
 
Measuring Relevance in the Negative Space
Measuring Relevance in the Negative SpaceMeasuring Relevance in the Negative Space
Measuring Relevance in the Negative Space
 
2017 biological databases_part1_vupload
2017 biological databases_part1_vupload2017 biological databases_part1_vupload
2017 biological databases_part1_vupload
 
Bio ontologies and semantic technologies
Bio ontologies and semantic technologiesBio ontologies and semantic technologies
Bio ontologies and semantic technologies
 
Slow-cooked data and APIs in the world of Big Data: the view from a city per...
Slow-cooked data and APIs in the world of Big Data: the view from a city per...Slow-cooked data and APIs in the world of Big Data: the view from a city per...
Slow-cooked data and APIs in the world of Big Data: the view from a city per...
 
My Spark Journey
My Spark JourneyMy Spark Journey
My Spark Journey
 
Modular design patterns for systems that learn and reason: a boxology
Modular design patterns for systems that learn and reason: a boxologyModular design patterns for systems that learn and reason: a boxology
Modular design patterns for systems that learn and reason: a boxology
 
Myria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsMyria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) Scientists
 
Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011
 

Viewers also liked

NLP on a Billion Documents: Scalable Machine Learning with Apache Spark
NLP on a Billion Documents: Scalable Machine Learning with Apache SparkNLP on a Billion Documents: Scalable Machine Learning with Apache Spark
NLP on a Billion Documents: Scalable Machine Learning with Apache SparkMartin Goodson
 
NLP Structured Data Investigation on Non-Text by Casey Stella
NLP Structured Data Investigation on Non-Text by Casey StellaNLP Structured Data Investigation on Non-Text by Casey Stella
NLP Structured Data Investigation on Non-Text by Casey StellaSpark Summit
 
Bayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive BayesBayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive BayesKrishna Sankar
 
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...Spark Summit
 
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark Summit
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Architecture in action 01
Architecture in action 01Architecture in action 01
Architecture in action 01Krishna Sankar
 
Dictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit PalDictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit PalSpark Summit
 
Introduction to Greenplum
Introduction to GreenplumIntroduction to Greenplum
Introduction to GreenplumDave Cramer
 
Programmer interview exposed - lection 5 temp version
Programmer interview exposed - lection 5 temp versionProgrammer interview exposed - lection 5 temp version
Programmer interview exposed - lection 5 temp versionIgor Kleiner
 
Advanced Spark Meetup - Jan 12, 2016
Advanced Spark Meetup - Jan 12, 2016Advanced Spark Meetup - Jan 12, 2016
Advanced Spark Meetup - Jan 12, 2016Michelle Casbon
 
Aioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAiougVizagChapter
 
COUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesCOUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesAlfredo Abate
 
Oracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA TechnologiesOracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA TechnologiesNAYATech
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Bryan Yang
 
SPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingSPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingKristian Alexander
 
From Beginners to Experts, Data Wrangling for All
From Beginners to Experts, Data Wrangling for AllFrom Beginners to Experts, Data Wrangling for All
From Beginners to Experts, Data Wrangling for AllDataWorks Summit
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
 

Viewers also liked (20)

NLP on a Billion Documents: Scalable Machine Learning with Apache Spark
NLP on a Billion Documents: Scalable Machine Learning with Apache SparkNLP on a Billion Documents: Scalable Machine Learning with Apache Spark
NLP on a Billion Documents: Scalable Machine Learning with Apache Spark
 
NLP Structured Data Investigation on Non-Text by Casey Stella
NLP Structured Data Investigation on Non-Text by Casey StellaNLP Structured Data Investigation on Non-Text by Casey Stella
NLP Structured Data Investigation on Non-Text by Casey Stella
 
Bayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive BayesBayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive Bayes
 
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
 
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Architecture in action 01
Architecture in action 01Architecture in action 01
Architecture in action 01
 
Dictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit PalDictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit Pal
 
Introduction to Greenplum
Introduction to GreenplumIntroduction to Greenplum
Introduction to Greenplum
 
Programmer interview exposed - lection 5 temp version
Programmer interview exposed - lection 5 temp versionProgrammer interview exposed - lection 5 temp version
Programmer interview exposed - lection 5 temp version
 
Advanced Spark Meetup - Jan 12, 2016
Advanced Spark Meetup - Jan 12, 2016Advanced Spark Meetup - Jan 12, 2016
Advanced Spark Meetup - Jan 12, 2016
 
Aioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_features
 
COUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesCOUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_Features
 
Oracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA TechnologiesOracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA Technologies
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0
 
Spark etl
Spark etlSpark etl
Spark etl
 
SPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingSPARQL and Linked Data Benchmarking
SPARQL and Linked Data Benchmarking
 
AMIS Oracle OpenWorld 2015 Review – part 3- PaaS Database, Integration, Ident...
AMIS Oracle OpenWorld 2015 Review – part 3- PaaS Database, Integration, Ident...AMIS Oracle OpenWorld 2015 Review – part 3- PaaS Database, Integration, Ident...
AMIS Oracle OpenWorld 2015 Review – part 3- PaaS Database, Integration, Ident...
 
From Beginners to Experts, Data Wrangling for All
From Beginners to Experts, Data Wrangling for AllFrom Beginners to Experts, Data Wrangling for All
From Beginners to Experts, Data Wrangling for All
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 

Similar to Pandas, Data Wrangling & Data Science

Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with SparkKrishna Sankar
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesKrishna Sankar
 
Statistics vs machine learning
Statistics vs machine learningStatistics vs machine learning
Statistics vs machine learningTom Dierickx
 
Fortune Teller API - Doing Data Science with Apache Spark
Fortune Teller API - Doing Data Science with Apache SparkFortune Teller API - Doing Data Science with Apache Spark
Fortune Teller API - Doing Data Science with Apache SparkBas Geerdink
 
Barga Data Science lecture 2
Barga Data Science lecture 2Barga Data Science lecture 2
Barga Data Science lecture 2Roger Barga
 
How to Become a Data Scientist
How to Become a Data ScientistHow to Become a Data Scientist
How to Become a Data Scientistryanorban
 
What Managers Need to Know about Data Science
What Managers Need to Know about Data ScienceWhat Managers Need to Know about Data Science
What Managers Need to Know about Data ScienceAnnie Flippo
 
Bridging the Gap Between Data Science & Engineer: Building High-Performance T...
Bridging the Gap Between Data Science & Engineer: Building High-Performance T...Bridging the Gap Between Data Science & Engineer: Building High-Performance T...
Bridging the Gap Between Data Science & Engineer: Building High-Performance T...ryanorban
 
Sql saturday el salvador 2016 - Me, A Data Scientist?
Sql saturday el salvador 2016 - Me, A Data Scientist?Sql saturday el salvador 2016 - Me, A Data Scientist?
Sql saturday el salvador 2016 - Me, A Data Scientist?Fabricio Quintanilla
 
Data Science 101
Data Science 101Data Science 101
Data Science 101odsc
 
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...Big Data Spain
 
Clare Corthell: Learning Data Science Online
Clare Corthell: Learning Data Science OnlineClare Corthell: Learning Data Science Online
Clare Corthell: Learning Data Science Onlinesfdatascience
 
Barga Data Science lecture 1
Barga Data Science lecture 1Barga Data Science lecture 1
Barga Data Science lecture 1Roger Barga
 
Session 01 designing and scoping a data science project
Session 01 designing and scoping a data science projectSession 01 designing and scoping a data science project
Session 01 designing and scoping a data science projectbodaceacat
 
Session 01 designing and scoping a data science project
Session 01 designing and scoping a data science projectSession 01 designing and scoping a data science project
Session 01 designing and scoping a data science projectSara-Jayne Terp
 

Pandas, Data Wrangling & Data Science

  • 1. @Galvanize with Pandas !! Pandas, Data Wrangling & Data Science August 12, 2016 @ksankar // doubleclix.wordpress.com San Francisco 2016
  • 2. o Intro & Setup [10:45-10:55) (10) • Goals/non-goals o Data Wrangling & Data Science Pipeline [10:55 – 11:05) (10) o Pandas – APIs & Namespaces [11:05-11:15) (10) o Pandas – Basic Maneuvers [11:15-11:30) (15) o Pandas – Data Wrangling – Transformations, Aggregations & Join [11:30-12:15) (45) • Hands-on : Titanic Dataset • Hands-on : NW Dataset, State Of The Union Speeches, Recsys-2015 Data o Q & A [12:15-Inf) (10) o Not covering – Panels, Time Series Agenda : Pandas, Data Wrangling & Data Science http://pydata.org/sfo2016/schedule/presentation/67/
  • 3. Goals & non-goals Goals ¤ Understand Data Wrangling with Pandas ¤ Focus on APIs and usage ¤ Give you a focused time to work thru examples § Work with me. I will wait if you want to catch up ¤ Less theory, more usage - let us see if this works ¤ As straightforward as possible § The programs can be optimized ¤ Foundation for the next 2 tutorials § Python Visualization for Exploration of Data by Stephen F. Elston, Ronald Lopez § Applied Time Series Econometrics in Python (and R) by Jeffrey Yau Non-goals ¡ Not “expert” Pandas • We don’t have sufficient time. The topic could easily be a 1-day tutorial ! ¡ Time to do hands-on • Only 90 minutes ¡ Python vs. R • I’ve come to discuss Pandas, not to praise R ! ¡ A passive talk • Nope. Interactive & hands-on
  • 4. 1. Brandon Rhodes - Pandas From The Ground Up - PyCon 2015 https://www.youtube.com/watch?v=5JnMutdy6Fw 2. A Visual Guide To Pandas by Jason Wirth https://www.youtube.com/watch?v=9d5-Ti6onew 3. 2012 PyData Workshop: Data Analysis in Python with Pandas by Wes McKinney https://www.youtube.com/watch?v=MxRMXhjXZos 4. http://nbviewer.jupyter.org/github/jbochi/recsyschallenge2015/blob/master/visualization.ipynb 5. https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/ Thanks to the Giants whose work helped to prepare this tutorial
  • 5. About Me o AI/Data Scientist • Autonomous Vehicles [https://goo.gl/BgicSY][https://goo.gl/LZ3fY9] • Building Autonomous Drone-Jarvis / Working towards FAA Drone Pilot Certification • What would you want AI to do, if it could do whatever you want it to do ? [https://goo.gl/eqWUEn] • Decision Data Science & Product Data Science • Insights = Intelligence + Inference + Interface [https://goo.gl/s2KB6L] • Predicting NFL with Elo like Nate Silver & 538 [NFL : http://goo.gl/Q2OgeJ, NBA’15 : https://goo.gl/aUhdo3] o Have been speaking at OSCON [http://goo.gl/1MJLu], PyCon, PyData [http://vimeo.com/63270513, http://www.slideshare.net/ksankar/pydata-19] … o Have done lots of things: • Big Data (Retail, Bioinformatics, Financial, AdTech), Starting MS-CFRM, University of WA • Written Books (Web 2.0, Wireless, Java, …), Standards, some work in AI • Guest Lecturer at Naval PG School, … o Volunteer as Robotics Judge at FIRST LEGO League World Competitions o @ksankar, doubleclix.wordpress.com
  • 6. Pandas & Notebook Installation o Pandas – Best with Anaconda o Notebook - Install Jupyter/iPython
  • 7. Tutorial Materials o GitHub : https://github.com/xsankar/cautious-octo-waffle • Clone or download zip o Open terminal • cd ~/cautious-octo-waffle • jupyter notebook o Click on the IPython dashboard • Run 000-PreFlightCheck.ipynb • Now you are ready for the workshop ! o One More Thing !! • The RecSys-2015 data is ~2GB, so please download the data to the data/recsys-2015 directory
  • 8. Data Wrangling & Data Science Pipeline 10:55 Pipelines … “[Collect-Store-Transform]-[Reason-Model]-[Deploy]-[Visualize-Recommend-Predict]-[Explore]”
  • 9. Data Science - Context o Scalable Model Deployment o Big Data automation & purpose built appliances (soft/hard) o Manage SLAs & response times o Volume o Velocity o Streaming Data o Canonical form o Data catalog o Data Fabric across the organization o Access to multiple sources of data o Think Hybrid – Big Data Apps, Appliances & Infrastructure Collect Store Transform o Metadata o Monitor counters & Metrics o Structured vs. Multi-structured o Flexible & Selectable § Data Subsets § Attribute sets o Refine model with § Extended Data subsets § Engineered Attribute sets o Validation run across a larger data set Reason Model Deploy Data Management Data Science o Dynamic Data Sets o 2 way key-value tagging of datasets o Extended attribute sets o Advanced Analytics Explore Visualize Recommend Predict o Performance o Scalability o Refresh Latency o In-memory Analytics o Advanced Visualization o Interactive Dashboards o Map Overlay o Infographics ¤ Bytes to Business a.k.a. Build the full stack ¤ Find Relevant Data For Business ¤ Connect the Dots
  • 10. Data Science : The art of building a model with known knowns, which when let loose, works with unknown unknowns! Donald Rumsfeld is an armchair Data Scientist ! http://smartorg.com/2013/07/valuepoint19/ The World Knowns Unknowns You UnKnown Known o Others know, you don’t o What we do o Facts, outcomes or scenarios we have not encountered, nor considered o “Black swans”, outliers, long tails of probability distributions o Lack of experience, imagination o Potential facts, outcomes we are aware, but not with certainty o Stochastic processes, Probabilities o Known Knowns o There are things we know that we know o Known Unknowns o That is to say, there are things that we now know we don't know o But there are also Unknown Unknowns o There are things we do not know we don't know
  • 11. The curious case of the Data Scientist o Data Scientist is multi-faceted & Contextual o Data Scientist should be building Data Products o Data Scientist should tell a story http://doubleclix.wordpress.com/2014/01/25/the-curious-case-of-the-data-scientist-profession/ Large is hard; Infinite is much easier ! – Titus Brown
  • 12. Pandas – APIs & Namespaces 11:05
  • 13. Pandas Data Model o Layer over NumPy o Data Model • 1D Series (NumPy array w/labels) • DataFrame - 2D labelled sheet • Column operations similar to vector operations o Pay attention to the index • Indexed rows, Indexed Columns & info at the center o Pay attention to the objects • DataFrame vs Series vs NumPy array • E.g. size() (a method, as on a groupby) vs size (an attribute) o “Answer all questions about a dataset” - Wes
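A minimal sketch of the object distinctions on this slide, using a tiny made-up frame (the column names and values are illustrative assumptions, not taken from the tutorial notebooks). It shows a labelled Series, a DataFrame whose columns are Series, and the size versus size() difference.

```python
import numpy as np
import pandas as pd

# A Series is a 1D NumPy array with labels (the index).
ages = pd.Series([22.0, 38.0, 26.0], index=["a", "b", "c"], name="age")

# A DataFrame is a 2D labelled sheet; each column is a Series.
# The rows below are invented for illustration.
df = pd.DataFrame(
    {"age": [22.0, 38.0, 26.0], "sex": ["male", "female", "female"]},
    index=["a", "b", "c"],
)

# Column operations are vectorized, like NumPy array operations.
age_in_months = df["age"] * 12

# Pay attention to the object type at each step.
col = df["age"]            # pandas.Series, keeps the index labels
values = df["age"].values  # numpy.ndarray, labels are gone

# size is an attribute on a DataFrame or Series (number of elements) ...
n_cells = df.size
# ... while size() is a method on a groupby object (rows per group).
group_sizes = df.groupby("sex").size()
```

Checking type(obj) after each step is a cheap way to catch a Series where a DataFrame was expected.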
  • 14. pandas namespaces objects o pandas.Series o pandas.DataFrame o pandas.Panel o pandas.Panel4D o pandas.Index I/O o read_(csv, table, excel, json, gbq,…) o to_(csv, table, excel, json, gbq,…) o pandas.read_ o df.to_ Computations, operations, … o +,-, o pow, o corr, … DateTime o .dt NaN, Missing o isnull() o fillna() o dropna() o skipna o interpolate() String o .str Plotting o .plot Notes : [1] df[“date”].dt - only a Series has the date-time accessor ! df.dt won’t work [2] .sort a DataFrame, but .order a Series (both were replaced by sort_values() in pandas 0.17) [3] to_frame() converts a Series to a DataFrame. Most of the time DataFrame is the preferred object
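The notes on this slide can be tried out with a short sketch. The frame below is made up for illustration; the example assumes pandas 0.17 or later, where sort_values() is the sorting method on both DataFrame and Series.

```python
import pandas as pd

# Invented rows; the names only imitate the "Surname, Title. Given" pattern.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2016-08-12", "2016-01-05", "2016-03-20"]),
        "name": ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss. Ann"],
        "fare": [7.25, 71.28, 7.93],
    }
)

# .dt lives on a datetime Series, not on the DataFrame.
months = df["date"].dt.month
has_dt_on_frame = hasattr(df, "dt")

# .str gives vectorized string methods on a Series.
is_mr = df["name"].str.contains("Mr.", regex=False)

# to_frame() turns a Series into a one-column DataFrame.
fare_frame = df["fare"].to_frame()

# sort_values() works on both objects.
by_fare = df.sort_values("fare")
sorted_fares = df["fare"].sort_values()
```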
  • 15. Data Science “folk knowledge” (1 of A) o "If you torture the data long enough, it will confess to anything." – Hal Varian, Computer Mediated Transactions o Learning = Representation + Evaluation + Optimization o It’s Generalization that counts • The fundamental goal of machine learning is to generalize beyond the examples in the training set o Data alone is not enough • Induction not deduction - Every learner should embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it o Machine Learning is not magic – one cannot get something from nothing • In order to infer, one needs the knobs & the dials • One also needs a rich expressive dataset A few useful things to know about machine learning - by Pedro Domingos http://dl.acm.org/citation.cfm?id=2347755
  • 17. 1-Getting Data in/out o pd.read_csv(…) o read_table(...) <- arbitrary delimited file o read_{clipboard,json,excel,sas,sql,gbq,...} o df.to_csv(...)
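A small round-trip sketch of the I/O calls listed above. To stay self-contained it reads from in-memory strings through io.StringIO rather than from the tutorial's data files; the three rows are invented.

```python
import io

import pandas as pd

# Invented sample with one missing Age value.
csv_text = "PassengerId,Name,Age\n1,Doe,22\n2,Roe,38\n3,Poe,\n"

# read_csv accepts a path, a URL, or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# read_table defaults to a tab delimiter; here is the same data, tab-separated.
tsv_text = csv_text.replace(",", "\t")
df_tab = pd.read_table(io.StringIO(tsv_text))

# to_csv with no path returns the CSV as a string; index=False drops the row labels.
out = df.to_csv(index=False)
```

Passing a file path instead of the StringIO object works the same way for real files.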
  • 18. 2-Basic Operations o head() o tail() o count() o describe() o dtypes
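For a first look at a frame, the calls on this slide can be sketched as follows; the data is a made-up four-row sample, not the workshop dataset.

```python
import numpy as np
import pandas as pd

# Invented rows with one missing Age.
df = pd.DataFrame(
    {
        "Age": [22.0, 38.0, np.nan, 35.0],
        "Fare": [7.25, 71.28, 7.93, 53.10],
        "Sex": ["male", "female", "female", "female"],
    }
)

first_two = df.head(2)      # first 2 rows
last_two = df.tail(2)       # last 2 rows
non_null = df.count()       # non-null values per column (NaN is not counted)
summary = df.describe()     # count, mean, std, min, quartiles, max for numeric columns
column_types = df.dtypes    # dtypes is an attribute, not a method
```

Comparing count() against len(df) is a quick way to spot columns with missing values.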
  • 20. 4. na-Missing Data o One of the tenets of big data and data science is that data is never fully clean: while we can handle types, formats et al., missing values are always challenging o One easy solution is to drop the rows that have missing values, but then we would lose valuable data in the columns that do have values. o A better solution is to impute data based on some criteria. It is true that data cannot be created out of thin air, but data can be inferred with some success – it is better than dropping the rows. • We can replace null with 0 • A better solution is to replace missing numerical values with the average of the rest of the valid values; for categorical values, replacing with the most common value is a good strategy • We could use mode or median instead of mean • Another good strategy is to infer the missing value from other attributes, i.e. “Evidence from multiple fields”. • For example, the Titanic data has a name field, and for imputing the missing age field we could use the Mr., Master., Mrs., Miss. designation from the name and then fill in the average age for the corresponding designation. So a row with a missing age and “Master.” in the name would get the average age of all records with “Master.” • There are also fields for the number of siblings and spouses. We could average the age based on the value of those fields. • We could even average the ages from different strategies.
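The title-based imputation described above can be sketched as follows. This is one possible way to do it, not necessarily how the tutorial notebook does it; the rows are invented, and the Title and AgeFilled column names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Invented rows that imitate the "Surname, Title. Given" name format.
df = pd.DataFrame(
    {
        "Name": [
            "Doe, Mr. John",
            "Roe, Mr. Alan",
            "Poe, Master. Tim",
            "Lee, Master. Sam",
            "Kim, Miss. Ann",
            "Ray, Mr. Paul",
            "Fox, Master. Ned",
        ],
        "Age": [22.0, 35.0, 2.0, 4.0, 26.0, np.nan, np.nan],
    }
)

# Pull the designation that sits between the comma and the first period.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Mean age per title, broadcast back to every row (NaN ages are skipped).
title_mean = df.groupby("Title")["Age"].transform("mean")

# Fill only the missing ages; known ages are left untouched.
df["AgeFilled"] = df["Age"].fillna(title_mean)
```

Here the missing "Mr" age becomes 28.5 and the missing "Master" age becomes 3.0. Swapping "mean" for "median" in transform gives the median-based variant mentioned on the slide.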
  • 21. 4. na-Missing Data o NaN better than 0 - says I don’t know • Comes in handy in recommendation, stock data on a Saturday, … o skipna o fillna • forward fill/backward fill method ! o interpolate
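A short sketch of these options on a hypothetical price series with a two-day gap (the values are invented). It uses ffill() and bfill(), the method forms of forward and backward fill.

```python
import numpy as np
import pandas as pd

# Hypothetical daily prices; the two middle days have no data.
prices = pd.Series(
    [10.0, np.nan, np.nan, 13.0],
    index=pd.to_datetime(["2016-08-12", "2016-08-13", "2016-08-14", "2016-08-15"]),
)

mean_skip = prices.mean()                # skipna=True by default: (10 + 13) / 2
mean_strict = prices.mean(skipna=False)  # NaN, because a missing value is present

zero_filled = prices.fillna(0)  # treats "don't know" as 0, which drags the mean down
forward = prices.ffill()        # carry the last known value forward
backward = prices.bfill()       # pull the next known value backward
linear = prices.interpolate()   # linear interpolation between known points
dropped = prices.dropna()       # keep only the known values
```

The mean of zero_filled is 5.75 against 11.5 when NaN is skipped, which illustrates why NaN is the safer placeholder.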
  • 22. 5-Statistics o Min o Max o Quantile o Mean,SD,variance,… o Correlation • Pearson • Spearman o Covariance
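The statistics listed here map onto pandas calls roughly as sketched below, on a made-up two-column frame where y is the square of x. The correlation matrices are taken from the DataFrame so that both the Pearson and Spearman variants come from the same call.

```python
import pandas as pd

# Invented data: y = x squared, so the relation is monotonic but not linear.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0], "y": [1.0, 4.0, 9.0, 16.0, 25.0]})

x_min, x_max = df["x"].min(), df["x"].max()
median = df["x"].quantile(0.5)
mean, sd, var = df["x"].mean(), df["x"].std(), df["x"].var()  # sample std/var (ddof=1)

pearson = df.corr(method="pearson").loc["x", "y"]    # linear correlation
spearman = df.corr(method="spearman").loc["x", "y"]  # rank correlation
cov_xy = df.cov().loc["x", "y"]                      # sample covariance
```

Because y grows monotonically with x, the Spearman coefficient is 1 while the Pearson coefficient is slightly below 1.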
  • 24. Pandas – Data Wrangling – Transformations, Aggregations & Join 11:30
  • 25. Merge,Join and friends o merge • Use Merge • join is a set of common merge patterns with defaults o groupby • Think in terms of split-apply-combine o stack/unstack • unstack operation to compare unlike things - parameter to unstack different columns • Too much stack-unstack results in a series ! • Be ready to handle NaN o Powerful Techniques • groupby + merge • groupby + unstack
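The patterns on this slide can be sketched on two small made-up tables. The orders and customers frames and their column names are assumptions for illustration only, not the Orders dataset used in the hands-on notebook.

```python
import pandas as pd

# Invented tables; customer C has no row in the customers table.
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "customer_id": ["A", "B", "A", "C"],
        "year": [2015, 2015, 2016, 2016],
        "amount": [100.0, 50.0, 25.0, 75.0],
    }
)
customers = pd.DataFrame({"customer_id": ["A", "B"], "city": ["SF", "NY"]})

# merge: SQL-style join on a key; how="left" keeps orders with no matching customer.
merged = orders.merge(customers, on="customer_id", how="left")

# groupby: split by key, apply an aggregate, combine the results.
per_customer = orders.groupby("customer_id")["amount"].sum()

# groupby + merge: attach a per-group aggregate back onto each row.
totals = per_customer.rename("customer_total").reset_index()
with_totals = orders.merge(totals, on="customer_id")

# groupby + unstack: pivot the inner index level (year) into columns.
# Combinations that never occur (for example B in 2016) become NaN.
by_year = orders.groupby(["customer_id", "year"])["amount"].sum().unstack("year")
```

The NaN cells in merged and by_year are the "be ready to handle NaN" case from the slide: they mark keys or combinations with no data rather than zeros.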
  • 26. Hands-On : Pandas@Kaggle o 020-Titanic.ipynb o GitHub : https://github.com/xsankar/cautious-octo-waffle/blob/master/020-Titanic.ipynb • Let us analyze the Titanic Dataset for a Kaggle Competition
  • 27. Hands-On : Orders Data o 030-Orders.ipynb o GitHub : https://github.com/xsankar/cautious-octo-waffle/blob/master/030-Orders.ipynb • Data wrangling with the Orders dataset
  • 28. Data Science “folk knowledge” (Wisdom of Kaggle) Jeremy’s Axioms o Iteratively explore data o Tools • Excel Format, Perl, Perl Book, Pandas ! o Get your head around data • Pivot Table o Don’t over-complicate o If people give you data, don’t assume that you need to use all of it o Look at pictures ! o History of your submissions – keep a tab o Don’t be afraid to submit simple solutions • We will do this during this workshop Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
  • 29. Hands-On : Clicks & Buys o 050-RecSys-2015.ipynb o GitHub : https://github.com/xsankar/cautious-octo-waffle/blob/master/050-RecSys-2015.ipynb • Data wrangling with the RecSys-2015 dataset
  • 31. Essential Reading List o A few useful things to know about machine learning - by Pedro Domingos • http://dl.acm.org/citation.cfm?id=2347755 o The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert • http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf o http://www.no-free-lunch.org/ o Controlling the false discovery rate: a practical and powerful approach to multiple testing Benjamini, Y. and Hochberg, Y. C • http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe • http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/ o Avoid these three mistakes, James Faghmo • https://medium.com/about-data/73258b3848a4 o Leakage in Data Mining: Formulation, Detection, and Avoidance • http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
  • 32. For your reading & viewing pleasure … An ordered List ① An Introduction to Statistical Learning • http://www-bcf.usc.edu/~gareth/ISL/ ② ISL Class Stanford/Hastie/Tibshirani at their best - Statistical Learning • http://online.stanford.edu/course/statistical-learning-winter-2014 ③ Prof. Pedro Domingos • https://class.coursera.org/machlearning-001/lecture/preview ④ Prof. Andrew Ng • https://class.coursera.org/ml-003/lecture/preview ⑤ Prof. Abu Mostafa, CaltechX: CS1156x: Learning From Data • https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120 ⑥ Mathematicalmonk @ YouTube • https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA ⑦ The Elements Of Statistical Learning • http://statweb.stanford.edu/~tibs/ElemStatLearn/ http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/
  • 33. The Beginning As The End How did we do ? 4:45