Pandas in One Post


pandas provides high-level data structures and manipulation tools designed to make data analysis fast and easy in Python. It is built on top of NumPy, which makes it easy to use in NumPy-centric applications.

Key Features of Pandas

Fast and efficient DataFrame object with default and customized indexing.

Tools for loading data into in-memory data objects from different file formats.

Data alignment and integrated handling of missing data.

Reshaping and pivoting of data sets.

Label-based slicing, indexing and subsetting of large data sets.

Columns from a data structure can be deleted or inserted.

Group-by functionality for aggregation and transformations.

High performance merging and joining of data.

Time Series functionality.
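
Most of these features are demonstrated later in this post; merging and group-by are not, so as a quick taste, a minimal group-by aggregation on a made-up table might look like this (the column names are invented for illustration):

```python
import pandas as pd

# hypothetical sales table, invented for illustration
sales = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West'],
    'units':  [10, 20, 30, 40],
})

# sum the units within each region; the result is a Series indexed by region
totals = sales.groupby('region')['units'].sum()
totals
```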

Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.

Series

A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index.

A Series can be created from various inputs, such as:

Array

Dict

Scalar value or constant
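
A quick sketch of each of these construction paths (the variable names are just for illustration):

```python
import numpy as np
import pandas as pd

# from an ndarray: gets a default integer index
s_arr = pd.Series(np.array([10, 20, 30]))

# from a dict: the keys become the index
s_dict = pd.Series({'a': 1, 'b': 2})

# from a scalar: the value is repeated for every index label
s_scalar = pd.Series(5, index=['x', 'y', 'z'])
```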

In [28]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np

obj = Series([10, 20 , 30, 40])
obj
Out[28]:
0    10
1    20
2    30
3    40
dtype: int64

You can get only the array representation using:

In [29]:
obj.values
Out[29]:
array([10, 20, 30, 40], dtype=int64)

You can get only the index representation using obj.index.

To make it a list:

In [30]:
list(obj.index)
Out[30]:
[0, 1, 2, 3]

To get the index as a NumPy array:

In [31]:
obj.index.values
Out[31]:
array([0, 1, 2, 3], dtype=int64)
In [32]:
# getting a value by position
obj[0]

# obj.iloc[0] is safer for positional access: an integer index
# that does not start at zero makes plain obj[0] ambiguous;
# for non-integer labels, index with the label itself, like obj['a']
obj.iloc[0]
Out[32]:
10
In [33]:
# defining the index values
obj1 = Series([10, 20, 30, 40, 50], index=['b','a','c','d','e'])
obj1
Out[33]:
b    10
a    20
c    30
d    40
e    50
dtype: int64
In [34]:
obj1.index
Out[34]:
Index(['b', 'a', 'c', 'd', 'e'], dtype='object')
In [35]:
# obj1['a'] - below is better
obj1.loc['a']
Out[35]:
20
In [36]:
obj1.iloc[1]
Out[36]:
20
In [37]:
obj1['e']
Out[37]:
50
In [38]:
obj1[['c', 'b', 'a']]
Out[38]:
c    30
b    10
a    20
dtype: int64

NumPy-style array operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, preserve the index-value link.

In [39]:
obj1[obj1 > 20]
Out[39]:
c    30
d    40
e    50
dtype: int64
In [40]:
obj1 * 2
Out[40]:
b     20
a     40
c     60
d     80
e    100
dtype: int64
In [41]:
np.exp(obj1)
Out[41]:
b    2.202647e+04
a    4.851652e+08
c    1.068647e+13
d    2.353853e+17
e    5.184706e+21
dtype: float64

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be substituted into many functions that expect a dict.
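
Because of this dict-like behavior, a Series round-trips cleanly through a plain dict; a small sketch:

```python
import pandas as pd

obj1 = pd.Series([10, 20, 30], index=['b', 'a', 'c'])

# Series -> dict: index labels become keys
d = obj1.to_dict()

# dict -> Series: insertion order of the keys is preserved
roundtrip = pd.Series(d)
```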

In [42]:
'a' in obj1
Out[42]:
True
In [43]:
'f' in obj1
Out[43]:
False
In [44]:
# you can create a Series from a Python dict
dic ={'Ohio':123, 'Texas':456, 'Oregon':879, 'Utah':667}
obj2 = Series(dic)
obj2
Out[44]:
Ohio      123
Texas     456
Oregon    879
Utah      667
dtype: int64
In [45]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj3 = Series(obj2, index = states)
obj3
Out[45]:
California      NaN
Ohio          123.0
Oregon        879.0
Texas         456.0
dtype: float64
In [46]:
# using isnull function on Series
pd.isnull(obj3)
Out[46]:
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
In [47]:
pd.notnull(obj3)
Out[47]:
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
In [48]:
# or Series has this function
obj3.isnull()
Out[48]:
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
In [49]:
obj2
Out[49]:
Ohio      123
Texas     456
Oregon    879
Utah      667
dtype: int64
In [50]:
obj3
Out[50]:
California      NaN
Ohio          123.0
Oregon        879.0
Texas         456.0
dtype: float64
In [51]:
# adding two Series aligns the values on common index labels; other arithmetic operations behave the same way
obj2 + obj3
Out[51]:
California       NaN
Ohio           246.0
Oregon        1758.0
Texas          912.0
Utah             NaN
dtype: float64
In [52]:
obj2
Out[52]:
Ohio      123
Texas     456
Oregon    879
Utah      667
dtype: int64
In [53]:
# both the Series object itself and its index have a name attribute
obj2.name = 'population'
obj2.index.name = 'state'
obj2
Out[53]:
state
Ohio      123
Texas     456
Oregon    879
Utah      667
Name: population, dtype: int64
In [54]:
obj
Out[54]:
0    10
1    20
2    30
3    40
dtype: int64
In [55]:
# you can alter Series index in place
obj.index = ['u', 'v', 'x', 'y']
obj
Out[55]:
u    10
v    20
x    30
y    40
dtype: int64

DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can have a different value type (numeric, boolean, etc.). Since a DataFrame has both a row and a column index, it can be thought of as a dict of Series.

In [56]:
data = {'state':['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year':[2000, 2001, 2002, 2001, 2002],       
        'pop':[1.5, 1.7, 3.6, 2.4, 2.9] 
       }
frame = DataFrame(data)
frame

# renaming columns
#frame.rename(columns={'pop':'a', 'state':'b', 'year':'c'})
Out[56]:
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
In [57]:
# you can specify the order of the columns
d1 = DataFrame(data, columns=['year', 'state', 'pop'])
d1
Out[57]:
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
In [58]:
# As with Series, if you pass a column that isn't contained in data, it will appear with NaN values
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index =['one', 'two', 'three', 'four', 'five'])
frame2
Out[58]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
In [59]:
frame2.index
Out[59]:
Index(['one', 'two', 'three', 'four', 'five'], dtype='object')
In [60]:
frame2.columns
Out[60]:
Index(['year', 'state', 'pop', 'debt'], dtype='object')
In [61]:
# returns a 2d array
frame2.values
Out[61]:
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, nan],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, nan],
       [2002, 'Nevada', 2.9, nan]], dtype=object)
In [62]:
frame2['state']
# retrieve a row by its label:
#frame2.loc['one']
# retrieve a row by its zero-based position:
#frame2.iloc[0]
Out[62]:
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
In [63]:
type(frame2['state'])
Out[63]:
pandas.core.series.Series
In [64]:
frame2.year
Out[64]:
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64
In [65]:
type(frame2.year)
Out[65]:
pandas.core.series.Series
In [66]:
# retrieving rows using ix indexing field (deprecated)
#frame2.ix['two']
frame2.loc['two']
# gives you data for first index
#frame2.iloc[0]
Out[66]:
year     2001
state    Ohio
pop       1.7
debt      NaN
Name: two, dtype: object
In [67]:
#type(frame2.ix['two'])
type(frame2.loc['two'])
Out[67]:
pandas.core.series.Series
In [68]:
# assigning values to a column 
frame2['debt'] = 16.6
frame2
Out[68]:
year state pop debt
one 2000 Ohio 1.5 16.6
two 2001 Ohio 1.7 16.6
three 2002 Ohio 3.6 16.6
four 2001 Nevada 2.4 16.6
five 2002 Nevada 2.9 16.6
In [69]:
frame2['debt'] = [10, 20, 15, 22, 32]
frame2
Out[69]:
year state pop debt
one 2000 Ohio 1.5 10
two 2001 Ohio 1.7 20
three 2002 Ohio 3.6 15
four 2001 Nevada 2.4 22
five 2002 Nevada 2.9 32
In [70]:
frame2['debt'] = np.arange(5)
frame2
Out[70]:
year state pop debt
one 2000 Ohio 1.5 0
two 2001 Ohio 1.7 1
three 2002 Ohio 3.6 2
four 2001 Nevada 2.4 3
five 2002 Nevada 2.9 4
In [71]:
# you can assign a Series to a DataFrame column; labels are aligned, and unmatched index entries get NaN
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
val
Out[71]:
two    -1.2
four   -1.5
five   -1.7
dtype: float64
In [72]:
frame2['debit'] = val
frame2
Out[72]:
year state pop debt debit
one 2000 Ohio 1.5 0 NaN
two 2001 Ohio 1.7 1 -1.2
three 2002 Ohio 3.6 2 NaN
four 2001 Nevada 2.4 3 -1.5
five 2002 Nevada 2.9 4 -1.7
In [73]:
# assigning to a column that does not exist creates a new column
frame2['eastern'] = frame2.state == 'Ohio'
frame2
Out[73]:
year state pop debt debit eastern
one 2000 Ohio 1.5 0 NaN True
two 2001 Ohio 1.7 1 -1.2 True
three 2002 Ohio 3.6 2 NaN True
four 2001 Nevada 2.4 3 -1.5 False
five 2002 Nevada 2.9 4 -1.7 False
In [74]:
# deleting a column
del frame2['eastern']
frame2
Out[74]:
year state pop debt debit
one 2000 Ohio 1.5 0 NaN
two 2001 Ohio 1.7 1 -1.2
three 2002 Ohio 3.6 2 NaN
four 2001 Nevada 2.4 3 -1.5
five 2002 Nevada 2.9 4 -1.7
In [75]:
frame2.columns
Out[75]:
Index(['year', 'state', 'pop', 'debt', 'debit'], dtype='object')
In [76]:
# Another common form of data is a nested dict of dicts format
pop = {'Nevada':{2001: 2.4, 2002: 2.9}, 'Ohio':{2000: 1.5, 2001: 1.7, 2002: 3.6}}
pop
Out[76]:
{'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
In [77]:
# loading this into a DataFrame makes the outer keys the columns and the inner keys the row indices
frame3 = DataFrame(pop)
frame3
Out[77]:
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2000 NaN 1.5
In [78]:
# you can transpose the DataFrame
f = frame3.T
f
Out[78]:
2001 2002 2000
Nevada 2.4 2.9 NaN
Ohio 1.7 3.6 1.5
In [79]:
# you can explicitly declare the indices
DataFrame(pop, index=[2001, 2002, 2003])
Out[79]:
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2003 NaN NaN

Index Objects

In [80]:
obj = Series(range(3), index =['a', 'b', 'c'])
index = obj.index
index
Out[80]:
Index(['a', 'b', 'c'], dtype='object')
In [81]:
index[1:]
Out[81]:
Index(['b', 'c'], dtype='object')
In [82]:
index[1]
Out[82]:
'b'
In [83]:
# index objects are immutable, so you cannot do:
index[1] = 'd'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-83-c11369242898> in <module>
      1 # index objects are immutable, so you cannot do:
----> 2 index[1] = 'd'

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   3908 
   3909     def __setitem__(self, key, value):
-> 3910         raise TypeError("Index does not support mutable operations")
   3911 
   3912     def __getitem__(self, key):

TypeError: Index does not support mutable operations
In [84]:
# Immutability is important so that Index objects can be safely shared among data structures
index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index = index)
obj2.index is index
Out[84]:
True
In [85]:
# In addition to being array-like, an Index also functions as a fixed-size set:
frame3
Out[85]:
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2000 NaN 1.5
In [86]:
'Ohio' in frame3.columns
Out[86]:
True
In [87]:
2003 in frame3.index
Out[87]:
False
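
Beyond membership tests, an Index also supports set-style methods such as union, intersection, and difference; a small sketch:

```python
import pandas as pd

a = pd.Index(['a', 'b', 'c'])
b = pd.Index(['b', 'c', 'd'])

# each operation returns a new Index; the originals stay immutable
u = a.union(b)         # all labels from both
i = a.intersection(b)  # labels present in both
dif = a.difference(b)  # labels in a but not in b
```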

Essential Functionality

Reindexing

In [88]:
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
Out[88]:
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
In [89]:
# NaN appears for any index label that does not exist in the source
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
Out[89]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
In [90]:
# you can fill the NA with any value 
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value = 0)
Out[90]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
In [91]:
# for ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing
# use the method option: ffill means forward fill, bfill means backward fill
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')
Out[91]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
In [92]:
obj3.reindex(range(6), method='bfill')
Out[92]:
0      blue
1    purple
2    purple
3    yellow
4    yellow
5       NaN
dtype: object
In [93]:
# with a DataFrame, reindex can alter the (row) index, the columns, or both
frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'])
frame
Out[93]:
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
In [94]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2
Out[94]:
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
In [95]:
frame
Out[95]:
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
In [96]:
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)
Out[96]:
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
In [97]:
frame
Out[97]:
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
In [98]:
states
Out[98]:
['Texas', 'Utah', 'California']
In [99]:
# doing both index and columns reindexing in one shot
# this used to work, but newer pandas versions reject method= combined with columns=
#frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill', columns=states)
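
As a workaround in newer pandas versions, you can either pass index and columns together without a fill method, or chain two reindex calls; a sketch using the same frame and states as above:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# one call, no fill method: new row 'b' and new column 'Utah' come out all NaN
both = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)

# two chained calls: forward-fill the rows first, then reindex the columns
filled = frame.reindex(['a', 'b', 'c', 'd'], method='ffill').reindex(columns=states)
```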
In [100]:
frame
Out[100]:
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
In [101]:
states
Out[101]:
['Texas', 'Utah', 'California']

Dropping entries from an axis

In [102]:
obj = Series(np.arange(5), index=['a','b','c','d','e'])
obj
Out[102]:
a    0
b    1
c    2
d    3
e    4
dtype: int32
In [103]:
new_obj = obj.drop('c')
new_obj
Out[103]:
a    0
b    1
d    3
e    4
dtype: int32
In [104]:
# in case I forgot to mention it: you can use the copy method on Series and DataFrames
copyObj = new_obj.copy()
copyObj
Out[104]:
a    0
b    1
d    3
e    4
dtype: int32
In [105]:
data = DataFrame(np.arange(16).reshape((4,4)), index=['Ohio','Colorado','Utah','New York'],
                                                columns=['one','two','three','four'])
data
#data['one']
Out[105]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [106]:
# drop returns a new object; data itself is unchanged unless you assign the result
data.drop(['Colorado','Ohio'])
Out[106]:
one two three four
Utah 8 9 10 11
New York 12 13 14 15
In [107]:
data.drop('two', axis=1)
Out[107]:
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
In [108]:
data.drop(['two','four'],axis=1)
Out[108]:
one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14

Indexing, selection and filtering

In [109]:
obj = Series(np.arange(4), index=['a','b','c','d'])
obj
Out[109]:
a    0
b    1
c    2
d    3
dtype: int32
In [110]:
obj['b']
Out[110]:
1
In [111]:
obj[1]
Out[111]:
1
In [112]:
obj[2:4]
Out[112]:
c    2
d    3
dtype: int32
In [113]:
obj[['b','a','d']]
Out[113]:
b    1
a    0
d    3
dtype: int32
In [114]:
obj[[1,3]]
Out[114]:
b    1
d    3
dtype: int32
In [115]:
obj[obj < 2]
Out[115]:
a    0
b    1
dtype: int32
In [116]:
# slicing with labels is different from normal Python slicing: the endpoint is inclusive
obj['b':'c']
Out[116]:
b    1
c    2
dtype: int32
In [117]:
obj['b':'c'] = 5
obj
Out[117]:
a    0
b    5
c    5
d    3
dtype: int32
In [118]:
data = DataFrame(np.arange(16).reshape((4,4)),
                 index=['Ohio','Colorado','Utah','New York'],
                 columns=['one','two','three','four'])
data
Out[118]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [119]:
# note: indexing with a single label selects a column, while loc/iloc select rows
data['two']
#data.loc['Ohio'] # row by label
#data.iloc[2] # row by position
Out[119]:
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32
In [120]:
data[['two','three']]
Out[120]:
two three
Ohio 1 2
Colorado 5 6
Utah 9 10
New York 13 14
In [121]:
# selecting rows by slicing
data[:2]
Out[121]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
In [122]:
data[data['three'] > 5]
Out[122]:
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [123]:
data
Out[123]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [124]:
data < 5
Out[124]:
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
In [125]:
data[data < 5] = 0
data
Out[125]:
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [126]:
# the deprecated ix indexer used to cover these operations in one attribute;
# use loc (for labels) and iloc (for positions) instead; see the pandas DataFrame reference
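
For reference, the loc/iloc equivalents of the old ix-style mixed selection might look like this (the data frame from above is rebuilt here so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# label-based: row 'Colorado', columns 'two' and 'three'
by_label = data.loc['Colorado', ['two', 'three']]

# position-based: third row, columns in a custom order
by_position = data.iloc[2, [3, 0, 1]]
```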

Arithmetic and data alignment

In [127]:
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a','c','d','e'])
s1
Out[127]:
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
In [128]:
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a','c','e','f','g'])
s2
Out[128]:
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64
In [129]:
# where the indices don't match, NaN is placed
s1 + s2
Out[129]:
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
In [130]:
df1 = DataFrame(np.arange(9).reshape((3,3)), columns=list('bcd'), index=['Ohio','Texas','Colorado'])
df1
Out[130]:
b c d
Ohio 0 1 2
Texas 3 4 5
Colorado 6 7 8
In [131]:
df2 = DataFrame(np.arange(12).reshape((4,3)), columns=list('bde'), index=['Utah','Ohio','Texas','Oregon'])
df2
Out[131]:
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
In [132]:
df1 + df2
Out[132]:
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN

Arithmetic methods with fill values

In [133]:
df1 = DataFrame(np.arange(12).reshape((3,4)), columns=list('abcd'))
df1
Out[133]:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
In [134]:
df2 = DataFrame(np.arange(20).reshape((4,5)), columns=list('abcde'))
df2
Out[134]:
a b c d e
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
In [135]:
df1 + df2
Out[135]:
a b c d e
0 0.0 2.0 4.0 6.0 NaN
1 9.0 11.0 13.0 15.0 NaN
2 18.0 20.0 22.0 24.0 NaN
3 NaN NaN NaN NaN NaN
In [136]:
# fill_value substitutes 0 for entries missing from one of the DataFrames
# this works for add, sub, div, mul
df1.add(df2, fill_value=0)
Out[136]:
a b c d e
0 0.0 2.0 4.0 6.0 4.0
1 9.0 11.0 13.0 15.0 9.0
2 18.0 20.0 22.0 24.0 14.0
3 15.0 16.0 17.0 18.0 19.0
In [137]:
df1
Out[137]:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
In [138]:
df2
Out[138]:
a b c d e
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
In [139]:
df1.reindex(columns=df2.columns, fill_value=0)
Out[139]:
a b c d e
0 0 1 2 3 0
1 4 5 6 7 0
2 8 9 10 11 0

Operation between DataFrame and Series

In [140]:
arr = np.arange(12).reshape((4,3))
arr
Out[140]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
In [141]:
arr[0]
Out[141]:
array([0, 1, 2])
In [142]:
# this is called broadcasting: the subtraction is performed once for each row
arr - arr[0]
Out[142]:
array([[0, 0, 0],
       [3, 3, 3],
       [6, 6, 6],
       [9, 9, 9]])
In [143]:
# the old ix-based row selection is deprecated; iloc is used here instead
frame = DataFrame(np.arange(12).reshape((4,3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
frame
Out[143]:
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
In [144]:
series
Out[144]:
b    0
d    1
e    2
Name: Utah, dtype: int32
In [145]:
# broadcasting down the rows
frame - series
Out[145]:
b d e
Utah 0 0 0
Ohio 3 3 3
Texas 6 6 6
Oregon 9 9 9
In [146]:
series2 = Series(range(3), index=['b','e','f'])
series2
Out[146]:
b    0
e    1
f    2
dtype: int64
In [147]:
frame
Out[147]:
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
In [148]:
# by default, the Series index is matched against the DataFrame's columns
frame.add(series2)
#frame + series2
Out[148]:
b d e f
Utah 0.0 NaN 3.0 NaN
Ohio 3.0 NaN 6.0 NaN
Texas 6.0 NaN 9.0 NaN
Oregon 9.0 NaN 12.0 NaN
In [149]:
frame
Out[149]:
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
In [150]:
# you can do broadcasting on the columns using the arithmetic methods, as follows
series3 = frame.loc['Ohio']
frame
Out[150]:
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
In [151]:
series3
Out[151]:
b    3
d    4
e    5
Name: Ohio, dtype: int32
In [152]:
# multiplication
frame.multiply(series3)
Out[152]:
b d e
Utah 0 4 10
Ohio 9 16 25
Texas 18 28 40
Oregon 27 40 55

Function application and mapping

NumPy ufuncs work fine with pandas objects:

In [153]:
frame = DataFrame(np.random.randn(4, 3), columns=list('dbe'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
Out[153]:
d b e
Utah 1.049846 -1.740212 0.752126
Ohio 0.540530 0.354600 -0.636673
Texas -0.083665 -0.874059 0.699313
Oregon 1.460114 0.169469 -1.925029
In [154]:
np.abs(frame)
Out[154]:
d b e
Utah 1.049846 1.740212 0.752126
Ohio 0.540530 0.354600 0.636673
Texas 0.083665 0.874059 0.699313
Oregon 1.460114 0.169469 1.925029

Another frequent operation is applying a function that works on a one-dimensional array to each column or row:

In [155]:
f = lambda x: x.max() - x.min()

# by default axis is zero
frame.apply(f)
Out[155]:
d    1.543779
b    2.094812
e    2.677155
dtype: float64
In [156]:
frame.apply(f, axis = 1)
Out[156]:
Utah      2.790059
Ohio      1.177203
Texas     1.573372
Oregon    3.385143
dtype: float64

Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
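
For example, summing each column directly gives the same result as spelling it out with apply; a small comparison on a toy frame:

```python
import pandas as pd

frame = pd.DataFrame({'x': [1.0, 4.0], 'y': [2.0, 8.0]})

built_in = frame.sum()                          # column sums via the built-in method
via_apply = frame.apply(lambda col: col.sum())  # same result, more typing
```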

apply need not return a scalar value, it can also return a Series with multiple values:

In [157]:
def f(x): return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)
Out[157]:
d b e
min -0.083665 -1.740212 -1.925029
max 1.460114 0.354600 0.752126

Element-wise Python functions can be used too. Suppose you wanted to compute a formatted string from each floating-point value in frame.

In [158]:
# old formatting way
pi = 3.14159
print(" pi = %1.2f " % pi)
 pi = 3.14 
In [159]:
# applymap applies a function element-wise, while apply applies a function along an axis
format = lambda x: '%.2f' % x
frame.applymap(format)
Out[159]:
d b e
Utah 1.05 -1.74 0.75
Ohio 0.54 0.35 -0.64
Texas -0.08 -0.87 0.70
Oregon 1.46 0.17 -1.93

The reason for the name applymap is that Series has a map method for applying an element-wise function:

In [160]:
frame['e'].map(format)
Out[160]:
Utah       0.75
Ohio      -0.64
Texas      0.70
Oregon    -1.93
Name: e, dtype: object

Sorting

In [161]:
obj = Series(range(4), index=['d','a','b','c'])
obj
Out[161]:
d    0
a    1
b    2
c    3
dtype: int64
In [162]:
obj.sort_index()
Out[162]:
a    1
b    2
c    3
d    0
dtype: int64
In [163]:
frame = DataFrame(np.arange(8).reshape((2,4)), index=['three','one'], columns=['d','a','b','c'])
frame                                                                        
Out[163]:
d a b c
three 0 1 2 3
one 4 5 6 7
In [164]:
# default axis is 0
frame.sort_index()
Out[164]:
d a b c
one 4 5 6 7
three 0 1 2 3
In [165]:
frame.sort_index(axis=1)
Out[165]:
a b c d
three 1 2 3 0
one 5 6 7 4
In [166]:
obj = Series([4,7,-3,2])
obj.sort_values()
Out[166]:
2   -3
3    2
0    4
1    7
dtype: int64
In [167]:
# any missing values are sorted to the end of the Series by default
obj = Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()
Out[167]:
4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64
In [168]:
frame = DataFrame({'b':[4,7,-3,2], 'a':[0,1,0,1]})
frame
Out[168]:
b a
0 4 0
1 7 1
2 -3 0
3 2 1
In [169]:
frame.sort_values(by='b')
Out[169]:
b a
2 -3 0
3 2 1
0 4 0
1 7 1
In [170]:
frame.sort_values(by=['a','b'])
Out[170]:
b a
2 -3 0
0 4 0
3 2 1
1 7 1

Summarizing and Computing Descriptive Statistics

In [171]:
df = DataFrame([[1.4,np.nan], [7.1,-4.5],
               [np.nan, np.nan], [0.75, -1.3]],
               index=['a','b','c','d'],
               columns=['one','two'])
df
Out[171]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
In [172]:
df.sum()
Out[172]:
one    9.25
two   -5.80
dtype: float64
In [173]:
df.sum(axis=1)
Out[173]:
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
In [174]:
df.mean(axis=1, skipna=True)
Out[174]:
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
In [175]:
df.describe()
Out[175]:
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000
In [176]:
# on non-numeric, it produces alternative summary statistics:
obj = Series(['a', 'a', 'b', 'c'] * 4)
obj
Out[176]:
0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object
In [177]:
obj.describe()
Out[177]:
count     16
unique     3
top        a
freq       8
dtype: object

Some summary statistics and related functions:

count
describe
min, max
quantile
sum
mean
median
mad
var
std
diff
pct_change
cumsum
cumprod
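
A couple of the accumulation functions from this list, shown on a toy Series:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 8.0])

running = s.cumsum()      # running totals: 1, 3, 7, 15
changes = s.pct_change()  # relative change from the previous value (first is NaN)
```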

Unique Values, Value Counts, and Membership

In [179]:
obj = Series(['c','a','d','a','a','b','b','c','c'])
obj
Out[179]:
0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object
In [180]:
uniques = obj.unique()
uniques
Out[180]:
array(['c', 'a', 'd', 'b'], dtype=object)
In [181]:
obj.value_counts()
Out[181]:
a    3
c    3
b    2
d    1
dtype: int64
In [182]:
# pandas also has a top-level function for this that can be used with any array or sequence
pd.value_counts(obj.values)
Out[182]:
a    3
c    3
b    2
d    1
dtype: int64
In [183]:
obj
Out[183]:
0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object
In [184]:
# isin performs vectorized set membership and can be very useful for filtering a data set
mask = obj.isin(['b','c'])
mask
Out[184]:
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
In [185]:
data = DataFrame({'Qu1': [1,3,4,5,4],
                  'Qu2': [2,3,1,2,3],
                  'Qu3': [1,5,2,4,4]})

data      
Out[185]:
Qu1 Qu2 Qu3
0 1 2 1
1 3 3 5
2 4 1 2
3 5 2 4
4 4 3 4
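
A common follow-up on a frame like this is a per-column histogram of values: apply value_counts to each column and fill the gaps with zero. A sketch using the same data:

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 5, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# value_counts per column; the rows are the distinct values, missing counts become 0
result = data.apply(lambda col: col.value_counts()).fillna(0)
```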

Handling Missing Data

In [186]:
# pandas uses the floating point value NaN to represent missing data
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
Out[186]:
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
In [187]:
# built-in Python None value is also treated as NaN
string_data[0] = None
string_data.isnull()
Out[187]:
0     True
1    False
2     True
3    False
dtype: bool

Filtering Out Missing Data

In [188]:
from numpy import nan as NA
data = Series([1, NA, 3.5, NA, 7])
data.dropna()
Out[188]:
0    1.0
2    3.5
4    7.0
dtype: float64
In [189]:
data
Out[189]:
0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64
In [190]:
data[data.notnull()]
Out[190]:
0    1.0
2    3.5
4    7.0
dtype: float64
In [191]:
data = DataFrame([[1, 6.5, 3], [1, NA, NA],
                 [NA, NA, NA], [NA, 6.5, 3]])

data
Out[191]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
In [192]:
cleaned = data.dropna()
In [193]:
# any row with NA will be dropped
cleaned
Out[193]:
0 1 2
0 1.0 6.5 3.0
In [194]:
data
Out[194]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
In [195]:
# drop a row only if all of its values are NA
data.dropna(how='all')
Out[195]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
In [196]:
data
Out[196]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
In [197]:
data[2] = NA
data
Out[197]:
0 1 2
0 1.0 6.5 NaN
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 NaN
In [198]:
# drop the column if all NA
data.dropna(how='all', axis = 1)
Out[198]:
0 1
0 1.0 6.5
1 1.0 NaN
2 NaN NaN
3 NaN 6.5
In [199]:
df = DataFrame(np.random.randn(7, 3))
df
Out[199]:
0 1 2
0 0.264870 -1.089576 1.044288
1 -0.597993 0.847850 1.140516
2 1.311829 0.970399 0.457232
3 -0.095538 0.154005 -0.608618
4 0.254186 0.731226 0.422971
5 -0.891403 0.163922 -0.615027
6 2.001235 0.663682 1.256099
In [200]:
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df
Out[200]:
0 1 2
0 0.264870 NaN NaN
1 -0.597993 NaN NaN
2 1.311829 NaN 0.457232
3 -0.095538 NaN -0.608618
4 0.254186 0.731226 0.422971
5 -0.891403 0.163922 -0.615027
6 2.001235 0.663682 1.256099
In [201]:
# keep only columns having at least thresh non-NA observations
df.dropna(thresh=4, axis='columns')  # or axis='rows'/0/1 as usual
Out[201]:
0 2
0 0.264870 NaN
1 -0.597993 NaN
2 1.311829 0.457232
3 -0.095538 -0.608618
4 0.254186 0.422971
5 -0.891403 -0.615027
6 2.001235 1.256099

Filling in Missing Data

In [202]:
df
Out[202]:
0 1 2
0 0.264870 NaN NaN
1 -0.597993 NaN NaN
2 1.311829 NaN 0.457232
3 -0.095538 NaN -0.608618
4 0.254186 0.731226 0.422971
5 -0.891403 0.163922 -0.615027
6 2.001235 0.663682 1.256099
In [203]:
df.fillna(0)
Out[203]:
0 1 2
0 0.264870 0.000000 0.000000
1 -0.597993 0.000000 0.000000
2 1.311829 0.000000 0.457232
3 -0.095538 0.000000 -0.608618
4 0.254186 0.731226 0.422971
5 -0.891403 0.163922 -0.615027
6 2.001235 0.663682 1.256099
In [204]:
df
Out[204]:
0 1 2
0 0.264870 NaN NaN
1 -0.597993 NaN NaN
2 1.311829 NaN 0.457232
3 -0.095538 NaN -0.608618
4 0.254186 0.731226 0.422971
5 -0.891403 0.163922 -0.615027
6 2.001235 0.663682 1.256099
In [205]:
# pass a dict that maps each column to the value used to fill its NAs
df.fillna({1:0.5, 2: -1})
Out[205]:
0 1 2
0 0.264870 0.500000 -1.000000
1 -0.597993 0.500000 -1.000000
2 1.311829 0.500000 0.457232
3 -0.095538 0.500000 -0.608618
4 0.254186 0.731226 0.422971
5 -0.891403 0.163922 -0.615027
6 2.001235 0.663682 1.256099
In [206]:
df
Out[206]:
0 1 2
0 0.264870 NaN NaN
1 -0.597993 NaN NaN
2 1.311829 NaN 0.457232
3 -0.095538 NaN -0.608618
4 0.254186 0.731226 0.422971
5 -0.891403 0.163922 -0.615027
6 2.001235 0.663682 1.256099
In [207]:
# fillna returns a new object, but you can modify the existing object in place
df.fillna(0, inplace=True)
df
Out[207]:
0 1 2
0 0.264870 0.000000 0.000000
1 -0.597993 0.000000 0.000000
2 1.311829 0.000000 0.457232
3 -0.095538 0.000000 -0.608618
4 0.254186 0.731226 0.422971
5 -0.891403 0.163922 -0.615027
6 2.001235 0.663682 1.256099
In [208]:
df = DataFrame(np.random.randn(6, 3))
df
Out[208]:
0 1 2
0 1.681404 1.151962 -0.296332
1 0.060390 -1.713858 -0.778501
2 1.216945 -0.585990 0.457104
3 0.843060 -0.148200 1.413625
4 -0.704657 1.052509 -0.520097
5 -0.875149 -1.522447 -0.562694
In [209]:
df.iloc[2:,1] = np.nan
df.iloc[4:,2] = np.nan
df
Out[209]:
0 1 2
0 1.681404 1.151962 -0.296332
1 0.060390 -1.713858 -0.778501
2 1.216945 NaN 0.457104
3 0.843060 NaN 1.413625
4 -0.704657 NaN NaN
5 -0.875149 NaN NaN
In [210]:
# forward filling
df.fillna(method='ffill')
Out[210]:
0 1 2
0 1.681404 1.151962 -0.296332
1 0.060390 -1.713858 -0.778501
2 1.216945 -1.713858 0.457104
3 0.843060 -1.713858 1.413625
4 -0.704657 -1.713858 1.413625
5 -0.875149 -1.713858 1.413625
In [211]:
df
Out[211]:
0 1 2
0 1.681404 1.151962 -0.296332
1 0.060390 -1.713858 -0.778501
2 1.216945 NaN 0.457104
3 0.843060 NaN 1.413625
4 -0.704657 NaN NaN
5 -0.875149 NaN NaN
In [212]:
# you can put a limit on how many consecutive values to fill
df.fillna(method='ffill', limit=2)
Out[212]:
0 1 2
0 1.681404 1.151962 -0.296332
1 0.060390 -1.713858 -0.778501
2 1.216945 -1.713858 0.457104
3 0.843060 -1.713858 1.413625
4 -0.704657 NaN 1.413625
5 -0.875149 NaN 1.413625
In [213]:
# With fillna you can do lots of other things with a little creativity
data = Series([1, np.nan, 3.5, np.nan, 7])
# fill the NAs with the mean of the observed values
data.fillna(data.mean())
Out[213]:
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
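
The same idea extends to a DataFrame, where df.mean() supplies a per-column fill value; a small sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 10.0, 20.0]})

# df.mean() is a Series of column means; fillna aligns it on the columns
filled = df.fillna(df.mean())
```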