270 min to read
Pandas in One Post
Learn Pandas Free
Pandas in One Post¶
pandas contains high-level data structures and manipulation tools designed to make data analysis fast and easy in Python. pandas is built on top of NumPy and makes it easy to use in NumPy-centric application.
Key Features of Pandas¶
Fast and efficient DataFrame object with default and customized indexing.
Tools for loading data into in-memory data objects from different file formats.
Data alignment and integrated handling of missing data.
Reshaping and pivoting of date sets.
Label-based slicing, indexing and subsetting of large data sets.
Columns from a data structure can be deleted or inserted.
Group by data for aggregation and transformations.
High performance merging and joining of data.
Time Series functionality.
Introduction to pandas Data Structure¶
To get started with pandas, you will need to get comfortable with its two workhorse data structures Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applictions.
Series¶
A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index.
A series can be created using various inputs like −
Array
Dict
Scalar value or constant
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
obj = Series([10, 20 , 30, 40])
obj
You can get only the array representation using:¶
obj.values
list(S1.index)
to make np.array¶
obj.index.values
# getting value using index
obj[0]
# better to use obj.iloc[0], index are numbers
#that start not with zero can cause issues
# but is ok if they are not numbers use the literal like obj['a']
obj.iloc[0]
# defining the index values
obj1 = Series([10, 20, 30, 40, 50], index=['b','a','c','d','e'])
obj1
obj1.index
# obj1['a'] - below is better
obj1.loc['a']
obj1.iloc[1]
obj1['e']
obj1[['c', 'b', 'a']]
NumPy array operations will preserve the index value link.
obj1[obj1 > 20]
obj1 * 2
np.exp(obj1)
Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be substituted into many functions that expect a dict.
'a' in obj1
'f' in obj1
# you can create a Series using Python dic
dic ={'Ohio':123, 'Texas':456, 'Oregon':879, 'Utah':667}
obj2 = Series(dic)
obj2
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj3 = Series(obj2, index = states)
obj3
# using isnull function on Series
pd.isnull(obj3)
pd.notnull(obj3)
# or Series has this function
obj3.isnull()
obj2
obj3
# adding two Series. It adds the values based on common key. You can perform other operations too.
obj2 + obj3
obj2
# both Series object itself and its index have a name
obj2.name = 'population'
obj2.index.name = 'state'
obj2
obj
# you can alter Series index in place
obj.index = ['u', 'v', 'x', 'y']
obj
DataFrame¶
A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different valuetype (numeric, boolean, etc.) The DataFrame has both a row and column index, it can be though of as a dict of Series.
data = {'state':['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year':[2000, 2001, 2002, 2001, 2002],
'pop':[1.5, 1.7, 3.6, 2.4, 2.9]
}
frame = DataFrame(data)
frame
# renameing columns
#frame.rename(columns={'pop':'a', 'state':'b', 'year':'c'})
# you can specifiy the sequence of the columns
d1 = DataFrame(data, columns=['year', 'state', 'pop'])
d1
# As like Series, if you pass a column that isnt contained in data, it will appear with NA
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index =['one', 'two', 'three', 'four', 'five'])
frame2
frame2.index
frame2.columns
# returns a 2d array
frame2.values
frame2['state']
# Gives the index data
#frame2.loc['one']
# Gives the index data by 0 based index number
#frame2.iloc[0]
type(frame2['state'])
frame2.year
type(frame2.year)
# retrieving rows using ix indexing field (deprecated)
#frame2.ix['two']
frame2.loc['two']
# gives you data for first index
#frame2.iloc[0]
#type(frame2.ix['two'])
type(frame2.loc['two'])
# assigning values to a column
frame2['debt'] = 16.6
frame2
frame2['debt'] = [10, 20, 15, 22, 32]
frame2
frame2['debt'] = np.arange(5)
frame2
# you can assign Series to DataFrame columns. The indices should match otherwise gets populated with NA
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
val
frame2['debit'] = val
frame2
# assigning a column that does not exist will create a new column
frame2['eastern'] = frame2.state == 'Ohio'
frame2
# deleting a column
del frame2['eastern']
frame2
frame2.columns
# Another common form of data is a nested dict of dicts format
pop = {'Nevada':{2001: 2.4, 2002: 2.9}, 'Ohio':{2000: 1.5, 2001: 1.7, 2002: 3.6}}
pop
# you can put this in DataFrame and it makes outer keys columns and inner keys indices
frame3 = DataFrame(pop)
frame3
# you can transpose the DataFrame
f = frame3.T
f
# you can explicitly declare the indices
DataFrame(pop, index=[2001, 2002, 2003])
Index Objects¶
obj = Series(range(3), index =['a', 'b', 'c'])
index = obj.index
index
index[1:]
index[1]
# index objects are immutable, so you cannot do:
index[1] = 'd'
# Immutability is important so that Index objects can be safely shared among data structures
index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index = index)
obj2.index is index
# In addition to being array-like, an index also functions as fixed-size set:
frame3
'Ohio' in frame3.columns
2003 in frame3.index
Essential Functionality¶
Reindexing¶
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
# NA for index which does not exist
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
# you can fill the NA with any value
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value = 0)
# for ordered data like time series, it maybe desirable to do some interpolation or filling of values when reindexing
# use method option, ffill means forward filling, bfill means backward filling
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')
obj3.reindex(range(6), method='bfill')
# with DataFrame reindex can alter either the (row) index, columns or both.
frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'])
frame
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2
frame
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)
frame
states
# doing both index and columns reindexing in one shot
# was working before lib update
#frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill', columns=states)
frame
states
Dropping entries from an axis¶
obj = Series(np.arange(5), index=['a','b','c','d','e'])
obj
new_obj = obj.drop('c')
new_obj
# just if I forgot to mention this you can use copy function on Series and DataFrames
copyObj = new_obj.copy
print
copyObj
data = DataFrame(np.arange(16).reshape((4,4)), index=['Ohio','Colorado','Utah','New York'],
columns=['one','two','three','four'])
data
#data['one']
# this is a view no effect on data unless assigned to another variable
data.drop(['Colorado','Ohio'])
data.drop('two', axis=1)
data.drop(['two','four'],axis=1)
Indexing, selection and filtering¶
obj = Series(np.arange(4), index=['a','b','c','d'])
obj
obj['b']
obj[1]
obj[2:4]
obj[['b','a','d']]
obj[[1,3]]
obj[obj < 2]
# slicing with labels are different from normal Python slicing and as you can see the endpoint is inclusive
obj['b':'c']
obj['b':'c'] = 5
obj
data = DataFrame(np.arange(16).reshape((4,4)),
index=['Ohio','Colorado','Utah','New York'],
columns=['one','two','three','four'])
data
# these might be a bit inconsistence with previous examples
data['two']
#data.loc['Ohio'] # row
#data.iloc[2] # row
data[['two','three']]
# selecting rows by slicing
data[:2]
data[data['three'] > 5]
data
data < 5
data[data < 5] = 0
data
# you can use ix property of data frame for mentioned operations too.
# please refer to DataFrame pandas reference
Arithmetic and data alignment¶
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a','c','d','e'])
s1
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a','c','e','f','g'])
s2
# indices not match NaN will be placed
s1 + s2
df1 = DataFrame(np.arange(9).reshape((3,3)), columns=list('bcd'), index=['Ohio','Texas','Colorado'])
df1
df2 = DataFrame(np.arange(12).reshape((4,3)), columns=list('bde'), index=['Utah','Ohio','Texas','Oregon'])
df2
df1 + df2
Arithmetic methods with fill values¶
df1 = DataFrame(np.arange(12).reshape((3,4)), columns=list('abcd'))
df1
df2 = DataFrame(np.arange(20).reshape((4,5)), columns=list('abcde'))
df2
df1 + df2
# populates the missing one on each DataFrame to zero
# this works for add, sub, div, mul
df1.add(df2, fill_value=0)
df1
df2
df1.reindex(columns=df2.columns, fill_value=0)
Operation between DataFrame and Series¶
arr = np.arange(12).reshape((4,3))
arr
arr[0]
# this is called broadcasting, it subtracts row by row
arr - arr[0]
# deprecated
frame = DataFrame(np.arange(12).reshape((4,3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
frame
series
# broadcasting down the rows
frame - series
series2 = Series(range(3), index=['b','e','f'])
series2
frame
# default is axis 0
frame.add(series2)
#frame + series2
frame
# you can do broadcasting on columns using arithmetic methods as follow
series3 = frame.loc['Ohio']
frame
series3
# multiplication
frame.multiply(series3)
Function application and mapping¶
NumPy ufuncs work fine with pandas objects:
frame = DataFrame(np.random.randn(4, 3), columns=list('dbe'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
np.abs(frame)
Another frequent operation is applying a function on 1D array to each column or row
f = lambda x: x.max() - x.min()
# by default axis is zero
frame.apply(f)
frame.apply(f, axis = 1)
Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
apply need not return a scalar value, it can also return a Series with multiple values:
def f(x): return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)
Elemenst-wise Python functions can be used too. Suppose you wanted to compute a formatted string from each floating point value in frame.
# old formatting way
pi = 3.14159
print(" pi = %1.2f " % pi)
# applymap element wise and apply applies a function on axis
format = lambda x: '%.2f' % x
frame.applymap(format)
The reason for the name applymap is that Series has a map method for applying an element-wise function:
frame['e'].map(format)
Sorting¶
obj = Series(range(4), index=['d','a','b','c'])
obj
obj.sort_index()
frame = DataFrame(np.arange(8).reshape((2,4)), index=['three','one'], columns=['d','a','b','c'])
frame
# default axis is 0
frame.sort_index()
frame.sort_index(axis=1)
obj = Series([4,7,-3,2])
obj.sort_values()
# any missing values are sorted to the end of the Series by default
obj = Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()
frame = DataFrame({'b':[4,7,-3,2], 'a':[0,1,0,1]})
frame
frame.sort_values(by='b')
frame.sort_values(by=['a','b'])
Summarizing and Computing Descriptive Statistics¶
df = DataFrame([[1.4,np.nan], [7.1,-4.5],
[np.nan, np.nan], [0.75, -1.3]],
index=['a','b','c','d'],
columns=['one','two'])
df
df.sum()
df.sum(axis=1)
df.mean(axis=1, skipna=True)
df.describe()
# on non-numeric, it produces alternative summary statistics:
obj = Series(['a', 'a', 'b', 'c'] * 4)
obj
obj.describe()
Some summary statistics and related functions:
count
describe
min, max
quantile
sum
mean
median
mad
var
std
diff
pct_change
cumsum
cumprod
Unique Values, Value Counts, and Membership¶
obj = Series(['c','a','d','a','a','b','b','c','c'])
obj
uniques = obj.unique()
uniques
obj.value_counts()
# panda has method for this too that can be used for any array or sequence
pd.value_counts(obj.values)
obj
# isin is responsible for vectorized set memebership and can be very useful in filtering a data set
mask = obj.isin(['b','c'])
mask
data = DataFrame({'Qu1': [1,3,4,5,4],
'Qu2': [2,3,1,2,3],
'Qu3': [1,5,2,4,4]})
data
Handling Missing Data¶
# pandas uses the floating point value NaN to represent missing data
string_data = Series(['aardvark', 'artichoke', np.nan, 'avacado'])
string_data
# built-in Python None value is also treated as NaN
string_data[0] = None
string_data.isnull()
Filtering Out Missing Data¶
from numpy import nan as NA
data = Series([1, NA, 3.5, NA, 7])
data.dropna()
data
data[data.notnull()]
data = DataFrame([[1, 6.5, 3], [1, NA, NA],
[NA, NA, NA], [NA, 6.5, 3]])
data
cleaned = data.dropna()
# any row with NA will be dropped
cleaned
data
# if all row has NA
data.dropna(how='all')
data
data[2] = NA
data
# drop the column if all NA
data.dropna(how='all', axis = 1)
df = DataFrame(np.random.randn(7, 3))
df
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df
# keep only rows having certain number of observations
df.dropna(thresh=4,axis='columns') # or axis='rows'or 0 or 1 as usual
Filling in Missing Data¶
df
df.fillna(0)
df
# use a dic which indicates what to fill NA at each column
df.fillna({1:0.5, 2: -1})
df
# fillna returns a new object, but you can modify the existing object in place
df.fillna(0, inplace=True)
df
df = DataFrame(np.random.randn(6, 3))
df
df.iloc[2:,1] = np.nan
df.iloc[4:,2] = np.nan
df
# forward filling
df.fillna(method='ffill')
df
# you can put a limit of how many to fill
df.fillna(method='ffill', limit=2)
# With fillna you can do lots of other things with a little creativity
data = Series([1, np.nan, 3.5, np.nan, 7])
# putting mean of values for NAs
data.fillna(data.mean())