Author: Rory Creedon
Contact: rcreedon@poverty-action.com
Please feel free to mail questions/comments/corrections
from IPython.core.display import Image
Image(filename=r'C:\Users\rcreedon\Dropbox\Rory Notes\Notes\Pandas\Intro\Untitled.png')
Pandas is just one of hundreds of Python libraries. If you can program in Python, you have access not only to the data capabilities of Pandas, the numerical abilities of NumPy and the scientific computing power of SciPy, but also to all of the other libraries that will allow you to do almost anything under the sun.
This means that what you used to compartmentalise as your 'analysis' work, or 'cleaning' work can become integrated in an altogether much broader pattern of achievement. You can write a web crawler and scraper to get data from the web, have pandas clean and prepare data, analyse it, visualise it, and send a report to your PIs without ever lifting a finger, leaving you free to do something else entirely:
Image(url='http://25.media.tumblr.com/2b3cd2edeb031737e08f9344a6236256/tumblr_mo66pgUm501r0yba7o1_r1_500.gif')
Or perhaps go one step further and build yourself a data collection application to load onto an Android tablet, monitor and communicate with your enumerators in real time, and receive, clean, and analyse your data on the spot. The possibilities are truly endless.
The objects that are available to programmers working with pandas make working with data a totally different ball game:
Image(filename=r'C:\Users\rcreedon\Dropbox\Rory Notes\Notes\Pandas\Intro\ballgame.png')
The flexibility that can be achieved by being able to work with lists, dictionaries, series, arrays, and data frames will transform the work you are able to do as well as reduce the amount of code you have to write. No longer will you have to rack your brain to find a lengthy Stata work-around: no endless reshaping, no saving and merging of countless different data sets. Now you will be able to think big and find an elegant Pythonic solution. Create objects to work on, compare, and manipulate different data sets simultaneously. That's right, AT THE SAME TIME!
Not only that, but you can control the way that the output is presented to users. Suppose, for example, you write a genius survey cleaning file. Now, rather than creating indicator dummies to help you find errors, you can use more intuitive output and make your cleaning commands interactive, so that other people can run your files and make corrections without having to delve into the whys and wherefores of your program. Check out this small example of interactive cleaning methods:
import pandas as pd
from pandas import DataFrame
import numpy as np
ChildDF = DataFrame({'age':np.random.RandomState(13247).randint(13, 20, 7), \
'years_at_school' : np.random.RandomState(144).randint(1, 10, 7)}, \
index = pd.Index(['100' + str(x) for x in range(1, 8)], name = 'UID'))
ChildDF
UID | age | years_at_school |
---|---|---|
1001 | 19 | 8 |
1002 | 13 | 9 |
1003 | 19 | 7 |
1004 | 18 | 7 |
1005 | 13 | 4 |
1006 | 15 | 3 |
1007 | 18 | 1 |
Now in this mini-survey of school age (13-18 years) children, imagine you are cleaning the age variable to make sure that only observations of kids between the ages of 13 and 18 are included. What would you think if you could do it like this?
def FindErrors(minval, maxval, df, VAR):
    # Return the rows whose VAR value lies outside [minval, maxval]
    return df[(df[VAR] < minval) | (df[VAR] > maxval)]

def MessageAction(df):
    # Even a single flagged row is an error, so test for a non-empty frame
    if len(df) > 0:
        print('You have errors in the following observations: ')
        return df
    else:
        return 'HAPPY Days, the data are error free'

MessageAction(FindErrors(13, 18, ChildDF, 'age'))
You have errors in the following observations:
UID | age | years_at_school |
---|---|---|
1001 | 19 | 8 |
1003 | 19 | 7 |
From here it would be simplicity itself to prompt the user to enter either 'D' to drop the erroneous observations, 'N' to do nothing, or 'NA' to introduce missing values. This is not demonstrated here, as it is assumed that, for now, you do not have the relevant Python modules installed. But it's so simple to achieve. If you are already up and running and have pandas installed, then go ahead and open the Interactive_Cleaning.py example file distributed with this notebook and run the code in the Python shell you currently use.
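For readers who do have pandas installed, here is a minimal sketch of what applying those three choices might look like. It is not the contents of Interactive_Cleaning.py; the function name `apply_cleaning_action` and the small `survey` frame are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

def apply_cleaning_action(df, var, minval, maxval, action):
    """Return a cleaned copy of df. `action` is 'D', 'N' or 'NA' (hypothetical helper)."""
    out_of_range = (df[var] < minval) | (df[var] > maxval)
    if action == 'D':
        # Drop the offending observations
        return df[~out_of_range].copy()
    if action == 'NA':
        # Replace offending values with missing values (needs a float column)
        cleaned = df.copy()
        cleaned[var] = cleaned[var].astype(float)
        cleaned.loc[out_of_range, var] = np.nan
        return cleaned
    if action == 'N':
        # Do nothing
        return df.copy()
    raise ValueError("action must be 'D', 'N' or 'NA'")

# Hypothetical data, in the spirit of the mini-survey above
survey = pd.DataFrame({'age': [19, 13, 19, 18]},
                      index=pd.Index(['1001', '1002', '1003', '1004'], name='UID'))

dropped = apply_cleaning_action(survey, 'age', 13, 18, 'D')
blanked = apply_cleaning_action(survey, 'age', 13, 18, 'NA')
```

In an interactive session the `action` argument could come from a prompt, for example `input("Enter D, N or NA: ").strip().upper()` in Python 3 (`raw_input` in Python 2).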
Development environments, add-ons and Python side projects are being developed faster than you can say Innovations for Poverty Action. It's happening so quickly it's hard to keep up.
Just look at this notebook. Imagine if you could present your work to PIs like this. You can include code, output, markdown and formatting, and you have already seen embedded images. Just a few of the other things you can do are demonstrated below:
from IPython.display import Math
Math(r'F(k) = \int_{-\infty}^{\infty} f(x) e^{2\pi i k x} dx')
from IPython.display import Latex
Latex(r"""\begin{eqnarray}
\nabla \times \vec{\mathbf{B}} -\, \frac1c\, \frac{\partial\vec{\mathbf{E}}}{\partial t} & = \frac{4\pi}{c}\vec{\mathbf{j}} \\
\nabla \cdot \vec{\mathbf{E}} & = 4 \pi \rho \\
\nabla \times \vec{\mathbf{E}}\, +\, \frac1c\, \frac{\partial\vec{\mathbf{B}}}{\partial t} & = \vec{\mathbf{0}} \\
\nabla \cdot \vec{\mathbf{B}} & = 0
\end{eqnarray}""")
That's right! You can use the Latex class to display mathematical expressions typeset in LaTeX!
Now, you want to direct your PIs to an external website, to demonstrate where you got your ideas from. Good news! You can do that too:
from IPython.display import HTML
HTML('<iframe src=http://en.mobile.wikipedia.org/?useformat=mobile width=700 height=350></iframe>')
Perhaps there is a really great YouTube video that you would like your PIs to watch. Well, why not display the video in your notebook?
from IPython.display import YouTubeVideo
YouTubeVideo('jmtMf6VJklI')
Now don't pretend you don't think that is awesome!
It goes without saying that you can also include graphs of your work (the following is a demonstration of transforming time series data and plotting it):
%pylab inline
import matplotlib.pylab as pylab
pylab.rcParams['figure.figsize'] = 16, 12
index = pd.date_range('10/1/1999', periods=1100)
ts = pd.Series(np.random.normal(0.5, 2, 1100), index)
ts = ts.rolling(window=100, min_periods=100).mean().dropna()  # pd.rolling_mean was removed from later pandas releases
key = lambda x: x.year
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ts.groupby(key).transform(zscore)
compare = DataFrame({'Original': ts, 'Transformed': transformed})
compare.plot()
Populating the interactive namespace from numpy and matplotlib
<matplotlib.axes.AxesSubplot at 0x66a49b0>
You can even include audio to soothe the reader as they make their way through your notebook.
The bottom line is that no longer will the way you present your data be a combination of unreadable do-files, impenetrable SMCL files and logs. The notebook allows you to program interactively and tell a real story with your data!!!
Not only that, but through projects like Wakari you can share your notebooks online for people to access, edit and run on any machine connected to the internet without even having to have Python downloaded...
Image(url = 'http://static5.vayagif.com/gifs/2012/05/GIF_110633_its_free.gif')
If you have mastered Python and pandas you are now a fully fledged computer programmer! CONGRATS. The world is now your oyster, plus you will be king of the nerd herd, which let's face it is where all IPA staff long to be!!!
It's not all rosy; there are specific limitations to the use of pandas. They are as follows:
Many journals require Stata code to be submitted with papers, so it's not always possible to do the analysis in pandas. However, there is no reason not to do the cleaning/munging in pandas. Of course you should check with your PIs before embarking on any grand pandas projects. If it is important for them to be able to review all your work, then perhaps it's best to stick to the status quo.... Sad Face.
Most importantly, Python/pandas does not have a specific GUI (Graphical User Interface), which is a fancy way of saying that there is no simple way to "see" your data. There are no drop-down menus, no file save buttons, no data browser window. This means that you have to be very clear about what you are doing to your data. When working directly in the shell you will only be able to "see" a small portion of your data. If there were no workaround to this, then I suspect I would rely a lot less on Python for data work. Thankfully there is a workaround: the IPython notebook.
The notebook will allow you to display all of your data in HTML format, so that you can see it. Whilst it is important that you know what you are doing to the data, I personally find that being able to visually inspect it can also help. The notebook view is certainly not as pretty as the Stata browser, but it is a tradeoff well worth making.
Related to the above, it should be noted that pandas is not a data presentation tool. Therefore labeling values etc. is not possible.
Missing numbers are a little awkward to work with in pandas. The np.nan type of missing value can be confusing and, if not handled carefully, can generate perverse results. Having said that, you also have to be careful in Stata. It is expected that this will be improved in future releases.
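To make the missing-value point concrete, here is a small sketch of the kind of surprise np.nan can cause. The `ages` series is hypothetical. The key facts are that NaN never compares equal to anything (including itself) and that every comparison involving NaN evaluates to False, so a range check like the one used for cleaning above silently lets missing values through unless you test for them explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value
ages = pd.Series([19, np.nan, 15], index=['1001', '1002', '1003'])

# NaN is not equal to itself
nan_equals_itself = (np.nan == np.nan)

# Comparisons with NaN are False, so the missing age is NOT flagged here
out_of_range = (ages < 13) | (ages > 18)

# Missing values have to be tested for explicitly
missing = ages.isna()
needs_review = out_of_range | missing

# Aggregations such as mean() skip NaN by default
mean_skipping_nan = ages.mean()
```

Here `out_of_range` flags only observation 1001, while `needs_review` also catches the missing age for 1002.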
Pandas has changed the way I work with data and my PIs are thrilled. I think yours could be too. This motivating notebook is the first of many notebooks that will hopefully get you toward an intricate understanding of Pandas and what you can achieve with python and pandas. The sky is the limit!!!!
Image(url = 'http://media.tumblr.com/63d05473abe6c876afb2d2f275a81fb4/tumblr_inline_mpf97j3ya61qz4rgp.gif')