Pandas, Data Wrangling & Data Science
1. @Galvanize with Pandas !!
August 12, 2016
@ksankar // doubleclix.wordpress.com
San Francisco 2016
2. o Intro & Setup [10:45-10:55) (10)
• Goals/non-goals
o Data Wrangling & Data Science Pipeline [10:55-11:05) (10)
o Pandas – APIs & Namespaces [11:05-11:15) (10)
o Pandas – Basic Maneuvers [11:15-11:30) (15)
o Pandas – Data Wrangling – Transformations,
Aggregations & Join [11:30-12:15) (45)
• Hands-on : Titanic Dataset
• Hands-on : NW Dataset, State Of The
Union Speeches, Recsys-2015 Data
o Q & A [12:15-Inf) (10)
o Not covering – Panels, Time Series
Agenda : Pandas, Data Wrangling & Data Science
http://pydata.org/sfo2016/schedule/presentation/67/
3. Goals & non-goals
Goals
¤Understand Data Wrangling with
Pandas
¤Focus on APIs and usage
¤Give you a focused time to work
thru examples
§ Work with me. I will wait if you
want to catch-up
¤Less theory, more usage - let us see
if this works
¤As straightforward as possible
§ The programs can be optimized
¤Foundation for the next 2 tutorials
§ Python Visualization for Exploration of
Data by Stephen F. Elston, Ronald Lopez
§ Applied Time Series Econometrics in Python
(and R) by Jeffrey Yau
Non-goals
¡ Not “expert” Pandas
• We don’t have sufficient
time. The topic can be
easily a 1 day tutorial !
¡ Time to do hands-on
• Only 90 minutes
¡ Python vs. R
• I’ve come to discuss
Pandas, not to praise R !
¡ A passive talk
• Nope. Interactive &
hands-on
4. 1. Brandon Rhodes - Pandas From The Ground Up - PyCon
2015 https://www.youtube.com/watch?v=5JnMutdy6Fw
2. A Visual Guide To Pandas by Jason Wirth
https://www.youtube.com/watch?v=9d5-Ti6onew
3. 2012 PyData Workshop: Data Analysis in Python with
Pandas by Wes McKinney
https://www.youtube.com/watch?v=MxRMXhjXZos
4. http://nbviewer.jupyter.org/github/jbochi/recsyschallenge2015/blob/master/visualization.ipynb
5. https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/
Thanks to the Giants whose work
helped to prepare this tutorial
5. About Me
o AI/Data Scientist
• Autonomous Vehicles [https://goo.gl/BgicSY][https://goo.gl/LZ3fY9]
• Building Autonomous Drone-Jarvis / Working towards FAA Drone Pilot Certification
• What would you want AI to do, if it could do whatever you want it to do ?[https://goo.gl/eqWUEn]
• Decision Data Science & Product Data Science
• Insights = Intelligence + Inference + Interface [https://goo.gl/s2KB6L]
• Predicting NFL with Elo like Nate Silver & 538 [NFL : http://goo.gl/Q2OgeJ, NBA’15 : https://goo.gl/aUhdo3]
o Have been speaking at OSCON [http://goo.gl/1MJLu], PyCon, Pydata [http://vimeo.com/63270513,
http://www.slideshare.net/ksankar/pydata-19] …
o Have done lots of things:
• Big Data (Retail, Bioinformatics, Financial, AdTech), Starting MS-CFRM, University of WA
• Written Books (Web 2.0, Wireless, Java,…), Standards, some work in AI
• Guest Lecturer at Naval PG School,…
o Volunteer as Robotics Judge at FIRST LEGO League World Competitions
o @ksankar, doubleclix.wordpress.com
6. Pandas & Notebook Installation
o Pandas – Best with Anaconda
o Notebook - Install Jupyter/IPython
7. Tutorial Materials
o Github : https://github.com/xsankar/cautious-octo-waffle
• Clone or download zip
o Open terminal
• cd ~/cautious-octo-waffle
• jupyter notebook
o Click on the IPython dashboard
• Run 000-PreFlightCheck.ipynb
• Now you are ready for the workshop !
o One More Thing !!
• The RecSys-2015 data is ~2GB, so please download the data
to the data/recsys-2015 directory
8. Data Wrangling & Data Science Pipeline
10:55
Pipelines …
“[Collect-Store-Transform]-[Reason-Model]-[Deploy]-[Visualize-Recommend-Predict]-[Explore]”
9. Data Science - Context
o Scalable Model
Deployment
o Big Data
automation &
purpose built
appliances
(soft/hard)
o Manage SLAs &
response times
o Volume
o Velocity
o Streaming Data
o Canonical form
o Data catalog
o Data Fabric across the
organization
o Access to multiple
sources of data
o Think Hybrid – Big Data
Apps, Appliances &
Infrastructure
Collect Store Transform
o Metadata
o Monitor counters &
Metrics
o Structured vs. Multi-
structured
o Flexible & Selectable
§ Data Subsets
§ Attribute sets
o Refine model with
§ Extended Data
subsets
§ Engineered
Attribute sets
o Validation run across a
larger data set
Reason Model Deploy
Data Management
Data Science
o Dynamic Data Sets
o 2 way key-value tagging of
datasets
o Extended attribute sets
o Advanced Analytics
Explore Visualize Recommend Predict
o Performance
o Scalability
o Refresh Latency
o In-memory Analytics
o Advanced Visualization
o Interactive Dashboards
o Map Overlay
o Infographics
¤ Bytes to Business
a.k.a. Build the full
stack
¤ Find Relevant Data
For Business
¤ Connect the Dots
10. Data Science :
The art of building a model with known knowns, which when let loose, works with unknown
unknowns!
Donald Rumsfeld is an armchair Data Scientist !
http://smartorg.com/2013/07/valuepoint19/
2x2 matrix : The World (Knowns / Unknowns) vs. You (Unknown / Known)
o World: Known, You: Unknown
• Others know, you don’t
o World: Known, You: Known
• What we do
o World: Unknown, You: Unknown
• Facts, outcomes or scenarios we have not encountered, nor considered
• “Black swans”, outliers, long tails of probability distributions
• Lack of experience, imagination
o World: Unknown, You: Known
• Potential facts, outcomes we are aware of, but not with certainty
• Stochastic processes, probabilities
o Known Knowns
o There are things we know that we know
o Known Unknowns
o That is to say, there are things that we
now know we don't know
o But there are also Unknown Unknowns
o There are things we do not know we
don't know
11. The curious case of the Data Scientist
o Data Scientist is multi-faceted & Contextual
o Data Scientist should be building Data Products
o Data Scientist should tell a story
http://doubleclix.wordpress.com/2014/01/25/the-curious-case-of-the-data-scientist-profession/
Large is hard; Infinite is much easier !
– Titus Brown
13. Pandas Data Model
o Layer over NumPy
o Data Model
• 1D Series (NumPy array w/ labels)
• DataFrame - 2D labelled sheet
• Column operations similar to vector operations
o Pay attention to the index
• Indexed rows, indexed columns & info at the center
o Pay attention to the objects
• DataFrame vs Series vs NumPy array
• E.g. size() vs size
o “Answer all questions about a dataset” - Wes McKinney
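The object distinctions on this slide can be sketched as below. This is a minimal illustration on made-up data (the column names and values are assumptions, not the workshop dataset); it shows a Series, a DataFrame, the underlying NumPy array, and one place where size and size() differ.

```python
import numpy as np
import pandas as pd

# A Series is a 1D NumPy array with labels (the index).
ages = pd.Series([22.0, 38.0, 26.0], index=["a", "b", "c"], name="age")

# A DataFrame is a 2D labelled sheet; each column is a Series.
df = pd.DataFrame(
    {"sex": ["male", "female", "female"], "age": [22.0, 38.0, 26.0]},
    index=["a", "b", "c"],
)

# Column operations behave like vector operations.
df["age_in_months"] = df["age"] * 12

# Pay attention to the objects: the same column as three different types.
col_as_series = df["age"]        # pandas.Series
col_as_frame = df[["age"]]       # pandas.DataFrame (note the double brackets)
col_as_array = df["age"].values  # numpy.ndarray

# size vs size(): an attribute on a DataFrame, a method on a groupby.
n_cells = df.size                          # rows * columns, a plain integer
rows_per_group = df.groupby("sex").size()  # Series of row counts per group

print(type(col_as_series).__name__, type(col_as_frame).__name__, type(col_as_array).__name__)
print(n_cells)
print(rows_per_group)
```

Checking type(...) at each step is a cheap habit that avoids most "why does this method not exist" surprises.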
14. pandas namespaces
objects
o pandas.Series
o pandas.DataFrame
o pandas.Panel
o pandas.Panel4D
o pandas.Index
I/O
o read_(csv, table,
excel, json, gbq,…)
o to_(csv, table,
excel, json, gbq,…)
o pandas.read_
o df.to_
Computations,
operations, …
o +,-,
o pow,
o corr, …
DateTime
o .dt
NaN,
Missing
o isnull()
o fillna()
o dropna()
o skipna
o interpolate
String
o .str
Plotting
o .plot
Notes :
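[1] df[“date”].dt - only a Series has the .dt accessor ! df.dt won’t work
[2] .sort_values works on both a DataFrame and a Series (the older .sort for a DataFrame and .order for a Series are deprecated in newer pandas releases)
[3] to_frame() converts a Series to a DataFrame. Most of the time DataFrame is the preferred
object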
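A small sketch of the accessor notes, on a made-up three-row frame (names and values are illustrative assumptions). It uses sort_values, which is the spelling that works for both objects in recent pandas versions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2016-08-12", "2016-08-13", "2016-08-10"]),
        "name": ["Braund, Mr. Owen", "Heikkinen, Miss. Laina", "Palsson, Master. Gosta"],
        "fare": [7.25, 7.925, 21.075],
    }
)

# .dt and .str are accessors on a Series (one column), not on the DataFrame.
day_of_week = df["date"].dt.dayofweek  # Monday=0 ... Sunday=6
has_master = df["name"].str.contains("Master.", regex=False)

# sort_values works on a DataFrame and on a Series.
by_date = df.sort_values("date")
fares_sorted = df["fare"].sort_values(ascending=False)

# to_frame() turns a Series into a one-column DataFrame.
fare_frame = df["fare"].to_frame()

print(list(day_of_week))
print(list(has_master))
print(type(fare_frame).__name__)
```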
15. Data Science “folk knowledge” (1 of A)
o "If you torture the data long enough, it will confess to anything." – Hal Varian,
Computer Mediated Transactions
o Learning = Representation + Evaluation + Optimization
o It’s Generalization that counts
• The fundamental goal of machine learning is to generalize beyond the
examples in the training set
o Data alone is not enough
• Induction not deduction - Every learner should embody some knowledge
or assumptions beyond the data it is given in order to generalize beyond it
o Machine Learning is not magic – one cannot get something from nothing
• In order to infer, one needs the knobs & the dials
• One also needs a rich expressive dataset
A few useful things to know about machine learning - by Pedro Domingos
http://dl.acm.org/citation.cfm?id=2347755
20. 4. na-Missing Data
o One of the tenets of big data and data science is that data is never fully clean. While we
can handle types, formats et al., missing values are always challenging
o One easy solution is to drop the rows that have missing values, but then we would lose
valuable data in the columns that do have values.
o A better solution is to impute data based on some criteria. It is true that data cannot be
created out of thin air, but data can be inferred with some success – it is better than
dropping the rows.
• We can replace null with 0
• A better solution is to replace numerical values with the average of the rest of the valid
values; for categorical replacing with the most common value is a good strategy
• We could use mode or median instead of mean
• Another good strategy is to infer the missing value from other attributes ie “Evidence
from multiple fields”.
• For example, the Titanic data has a name field. To impute a missing age, we could use the
Mr., Master., Mrs., Miss. designation from the name and fill in the average age
for the corresponding designation. So a row with a missing age and “Master.” in the name would get
the average age of all records with “Master.”
• There is also the field for the number of siblings and spouses. We could average the age
based on the value of that field.
• We could even average the ages from different strategies.
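The "evidence from multiple fields" idea can be sketched as follows. This is a minimal illustration on a tiny made-up sample shaped like the Titanic data (the rows, the Title and AgeImputed column names, and the regular expression are assumptions, not the workshop notebook's code).

```python
import numpy as np
import pandas as pd

# Tiny made-up sample in the shape of the Titanic data (not the real records).
df = pd.DataFrame(
    {
        "Name": [
            "Smith, Mr. John", "Smith, Mrs. Jane", "Doe, Master. Tim",
            "Doe, Master. Tom", "Roe, Miss. Ann", "Poe, Mr. Ed",
        ],
        "Age": [40.0, 35.0, 4.0, np.nan, 20.0, np.nan],
    }
)

# Pull the designation (Mr, Mrs, Miss, Master) out of the name:
# the text between the comma and the first period.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Average age per designation, broadcast back to every row.
title_mean = df.groupby("Title")["Age"].transform("mean")

# Fill only the missing ages; fall back to the overall mean
# if a designation has no known ages at all.
df["AgeImputed"] = df["Age"].fillna(title_mean).fillna(df["Age"].mean())

print(df[["Name", "Title", "Age", "AgeImputed"]])
```

Swapping "mean" for "median" in the transform call gives the median-based variant mentioned above.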
21. 4. na-Missing Data
o NaN better than 0 - says I don’t know
• Comes in handy in recommendation, stock data on a Saturday,…
o Skipna
o Fillna
• forward fill/backward fill method !
o Interpolate
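The options on this slide, sketched on a made-up price series with two gaps (the dates and values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

prices = pd.Series(
    [10.0, np.nan, 14.0, np.nan, 20.0],
    index=pd.to_datetime(
        ["2016-08-08", "2016-08-09", "2016-08-10", "2016-08-11", "2016-08-12"]
    ),
)

# skipna: reductions ignore NaN by default; skipna=False propagates it.
mean_skipping_nan = prices.mean()
mean_with_nan = prices.mean(skipna=False)

# fillna with a constant: 0 claims a value we do not actually know.
zero_filled = prices.fillna(0)

# Forward fill / backward fill: carry the previous or the next valid value.
forward_filled = prices.ffill()
backward_filled = prices.bfill()

# interpolate: linear by default, treating the points as equally spaced.
interpolated = prices.interpolate()

# dropna: discard the gaps entirely.
dropped = prices.dropna()

print(mean_skipping_nan, zero_filled.mean())
print(list(forward_filled))
print(list(interpolated))
```

Comparing mean_skipping_nan with zero_filled.mean() shows how filling with 0 drags the average down, which is the point of "NaN better than 0".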
22. 5-Statistics
o Min
o Max
o Quantile
o Mean,SD,variance,…
o Correlation
• Pearson
• Spearman
o Covariance
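A short sketch of these statistics on made-up columns, where y is the square of x (an assumption chosen so that the relationship is monotonic but not linear). Spearman is taken from DataFrame.corr here; the choice of data and variable names is illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0, 5.0], "y": [1.0, 4.0, 9.0, 16.0, 25.0]}
)

# Basic descriptive statistics for one column.
x_min = df["x"].min()
x_max = df["x"].max()
x_median = df["x"].quantile(0.5)
x_mean = df["x"].mean()
x_std = df["x"].std()  # sample standard deviation (ddof=1)
x_var = df["x"].var()  # sample variance (ddof=1)

# Pearson measures linear association; Spearman works on ranks,
# so it is 1.0 for any strictly increasing relationship.
pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

# Covariance between the two columns.
covariance = df["x"].cov(df["y"])

print(x_min, x_max, x_median, x_mean, x_var)
print(round(pearson, 3), round(spearman, 3), covariance)
```

df.describe() returns most of the single-column numbers in one call.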
25. Merge,Join and friends
o merge
• Use Merge
• join is a set of common merge patterns with defaults
o groupby
• Think in terms of split-apply-combine
o stack/unstack
• unstack operation to compare unlike things - pass the level parameter to choose
which index level becomes columns
• Too much stack-unstack results in a series !
• Be ready to handle NaN
o Powerful Techniques
• groupby + merge
• groupby + unstack
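The patterns on this slide, sketched on two small made-up tables (the orders and customers frames and their columns are assumptions for illustration, not the workshop's Orders dataset):

```python
import pandas as pd

orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "customer_id": ["A", "B", "A", "C"],
        "year": [2015, 2015, 2016, 2016],
        "amount": [10.0, 20.0, 30.0, 40.0],
    }
)
customers = pd.DataFrame({"customer_id": ["A", "B"], "country": ["US", "UK"]})

# merge: a left join keeps every order; customer C has no match,
# so its country comes back as NaN.
merged = orders.merge(customers, on="customer_id", how="left")

# groupby: split by customer, apply sum, combine into one Series.
total_per_customer = orders.groupby("customer_id")["amount"].sum()

# groupby + merge: attach the per-customer total back to each order.
totals = total_per_customer.rename("customer_total").reset_index()
orders_with_totals = orders.merge(totals, on="customer_id")

# groupby + unstack: customers as rows, years as columns.
# Combinations that never occurred show up as NaN.
by_customer_year = (
    orders.groupby(["customer_id", "year"])["amount"].sum().unstack("year")
)

print(merged)
print(orders_with_totals)
print(by_customer_year)
```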
26. Hands-On : Pandas@Kaggle
o 020-Titanic.ipynb
o GitHub : https://github.com/xsankar/cautious-octo-waffle/blob/master/020-Titanic.ipynb
• Let us analyze the Titanic Dataset for a Kaggle Competition
27. Hands-On : Orders Data
o 030-Orders.ipynb
o Github : https://github.com/xsankar/cautious-octo-waffle/blob/master/030-Orders.ipynb
• Data wrangling with the Orders dataset
28. Data Science “folk knowledge” (Wisdom of Kaggle)
Jeremy’s Axioms
o Iteratively explore data
o Tools
• Excel Format, Perl, Perl Book, Pandas !
o Get your head around data
• Pivot Table
o Don’t over-complicate
o If people give you data, don’t assume that you
need to use all of it
o Look at pictures !
o History of your submissions – keep a tab
o Don’t be afraid to submit simple solutions
• We will do this during this workshop
Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
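The "get your head around data with a pivot table" axiom can be sketched in pandas as below. The rows are made up in the shape of the Titanic columns (Sex, Pclass, Survived); they are an assumption for illustration, not the competition data.

```python
import pandas as pd

# Made-up rows in the shape of the Titanic columns (not the real data).
df = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male", "male", "female"],
        "Pclass": [3, 1, 3, 1, 3, 2],
        "Survived": [0, 1, 1, 0, 0, 1],
    }
)

# Rows by sex, columns by passenger class, cells hold the mean of Survived,
# i.e. the survival rate of each group. Empty combinations become NaN.
survival_rate = df.pivot_table(
    index="Sex", columns="Pclass", values="Survived", aggfunc="mean"
)

print(survival_rate)
```

A table like this is often enough for a first simple submission, in the spirit of "don't be afraid to submit simple solutions".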
29. Hands-On : Clicks & Buys
o 050-RecSys-2015.ipynb
o Github : https://github.com/xsankar/cautious-octo-waffle/blob/master/050-RecSys-2015.ipynb
• Data wrangling with the RecSys-2015 dataset
31. Essential Reading List
o A few useful things to know about machine learning - by Pedro Domingos
• http://dl.acm.org/citation.cfm?id=2347755
o The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert
• http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf
o http://www.no-free-lunch.org/
o Controlling the false discovery rate: a practical and powerful approach to multiple testing, by Benjamini, Y.
and Hochberg, Y.
• http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf
o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe
• http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o Avoid these three mistakes, James Faghmous
• https://medium.com/about-data/73258b3848a4
o Leakage in Data Mining: Formulation, Detection, and Avoidance
• http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
32. For your reading & viewing pleasure … An ordered List
① An Introduction to Statistical Learning
• http://www-bcf.usc.edu/~gareth/ISL/
② ISL Class Stanford/Hastie/Tibshirani at their best - Statistical Learning
• http://online.stanford.edu/course/statistical-learning-winter-2014
③ Prof. Pedro Domingos
• https://class.coursera.org/machlearning-001/lecture/preview
④ Prof. Andrew Ng
• https://class.coursera.org/ml-003/lecture/preview
⑤ Prof. Abu-Mostafa, CaltechX: CS1156x: Learning From Data
• https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120
⑥ Mathematicalmonk @ YouTube
• https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
⑦ The Elements Of Statistical Learning
• http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/