Pandas in One Post


pandas provides high-level data structures and manipulation tools designed to make data analysis fast and easy in Python. It is built on top of NumPy, which makes it easy to use in NumPy-centric applications.

Key Features of Pandas

Fast and efficient DataFrame object with default and customized indexing.

Tools for loading data into in-memory data objects from different file formats.

Data alignment and integrated handling of missing data.

Reshaping and pivoting of data sets.

Label-based slicing, indexing and subsetting of large data sets.

Columns from a data structure can be deleted or inserted.

Group-by functionality for aggregation and transformations.

High performance merging and joining of data.

Time Series functionality.
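
Most of these features are demonstrated later in this post; merging and group-by are not, so as a quick taste, a minimal group-by aggregation on a made-up table might look like this (the column names are invented for illustration):

```python
import pandas as pd

# hypothetical sales table, invented for illustration
sales = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West'],
    'units':  [10, 20, 30, 40],
})

# sum the units within each region; the result is a Series indexed by region
totals = sales.groupby('region')['units'].sum()
totals
```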

Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.

Series

A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index.

A Series can be created from various inputs, such as:

Array

Dict

Scalar value or constant
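
A quick sketch of each of these construction paths (the variable names are just for illustration):

```python
import numpy as np
import pandas as pd

# from an ndarray: gets a default integer index
s_arr = pd.Series(np.array([10, 20, 30]))

# from a dict: the keys become the index
s_dict = pd.Series({'a': 1, 'b': 2})

# from a scalar: the value is repeated for every index label
s_scalar = pd.Series(5, index=['x', 'y', 'z'])
```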

In [28]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np

obj = Series([10, 20 , 30, 40])
obj
Out[28]:
0    10
1    20
2    30
3    40
dtype: int64

You can get only the array representation using:

In [29]:
obj.values
Out[29]:
array([10, 20, 30, 40], dtype=int64)

You can get only the index representation using obj.index.

To make it a list:

In [30]:
list(obj.index)
Out[30]:
[0, 1, 2, 3]

To get the index as a NumPy array:

In [31]:
obj.index.values
Out[31]:
array([0, 1, 2, 3], dtype=int64)
In [32]:
# getting a value by position
obj[0]

# obj.iloc[0] is safer for positional access: an integer index
# that does not start at zero makes plain obj[0] ambiguous;
# for non-integer labels, index with the label itself, like obj['a']
obj.iloc[0]
Out[32]:
10
In [33]:
# defining the index values
obj1 = Series([10, 20, 30, 40, 50], index=['b','a','c','d','e'])
obj1
Out[33]:
b    10
a    20
c    30
d    40
e    50
dtype: int64
In [34]:
obj1.index
Out[34]:
Index(['b', 'a', 'c', 'd', 'e'], dtype='object')
In [35]:
# obj1['a'] - below is better
obj1.loc['a']
Out[35]:
20
In [36]:
obj1.iloc[1]
Out[36]:
20
In [37]:
obj1['e']
Out[37]:
50
In [38]:
obj1[['c', 'b', 'a']]
Out[38]:
c    30
b    10
a    20
dtype: int64

NumPy-style array operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, preserve the index-value link.

In [39]:
obj1[obj1 > 20]
Out[39]:
c    30
d    40
e    50
dtype: int64
In [40]:
obj1 * 2
Out[40]:
b     20
a     40
c     60
d     80
e    100
dtype: int64
In [41]:
np.exp(obj1)
Out[41]:
b    2.202647e+04
a    4.851652e+08
c    1.068647e+13
d    2.353853e+17
e    5.184706e+21
dtype: float64

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be substituted into many functions that expect a dict.
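
Because of this dict-like behavior, a Series round-trips cleanly through a plain dict; a small sketch:

```python
import pandas as pd

obj1 = pd.Series([10, 20, 30], index=['b', 'a', 'c'])

# Series -> dict: index labels become keys
d = obj1.to_dict()

# dict -> Series: insertion order of the keys is preserved
roundtrip = pd.Series(d)
```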

In [42]:
'a' in obj1
Out[42]:
True
In [43]:
'f' in obj1
Out[43]:
False
In [44]:
# you can create a Series from a Python dict
dic ={'Ohio':123, 'Texas':456, 'Oregon':879, 'Utah':667}
obj2 = Series(dic)
obj2
Out[44]:
Ohio      123
Texas     456
Oregon    879
Utah      667
dtype: int64
In [45]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj3 = Series(obj2, index = states)
obj3
Out[45]:
California      NaN
Ohio          123.0
Oregon        879.0
Texas         456.0
dtype: float64
In [46]:
# using isnull function on Series
pd.isnull(obj3)
Out[46]:
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
In [47]:
pd.notnull(obj3)
Out[47]:
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
In [48]:
# or Series has this function
obj3.isnull()
Out[48]:
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
In [49]:
obj2
Out[49]:
Ohio      123
Texas     456
Oregon    879
Utah      667
dtype: int64
In [50]:
obj3
Out[50]:
California      NaN
Ohio          123.0
Oregon        879.0
Texas         456.0
dtype: float64
In [51]:
# adding two Series aligns the values on common index labels; other arithmetic operations behave the same way
obj2 + obj3
Out[51]:
California       NaN
Ohio           246.0
Oregon        1758.0
Texas          912.0
Utah             NaN
dtype: float64
In [52]:
obj2
Out[52]:
Ohio      123
Texas     456
Oregon    879
Utah      667
dtype: int64
In [53]:
# both the Series object itself and its index have a name attribute
obj2.name = 'population'
obj2.index.name = 'state'
obj2
Out[53]:
state
Ohio      123
Texas     456
Oregon    879
Utah      667
Name: population, dtype: int64
In [54]:
obj
Out[54]:
0    10
1    20
2    30
3    40
dtype: int64
In [55]:
# you can alter Series index in place
obj.index = ['u', 'v', 'x', 'y']
obj
Out[55]:
u    10
v    20
x    30
y    40
dtype: int64

DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can have a different value type (numeric, boolean, etc.). Since a DataFrame has both a row and a column index, it can be thought of as a dict of Series.

In [56]:
data = {'state':['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year':[2000, 2001, 2002, 2001, 2002],       
        'pop':[1.5, 1.7, 3.6, 2.4, 2.9] 
       }
frame = DataFrame(data)
frame

# renaming columns
#frame.rename(columns={'pop':'a', 'state':'b', 'year':'c'})
Out[56]:
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
In [57]:
# you can specify the order of the columns
d1 = DataFrame(data, columns=['year', 'state', 'pop'])
d1
Out[57]:
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
In [58]:
# As with Series, if you pass a column that isn't contained in data, it will appear with NaN values
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index =['one', 'two', 'three', 'four', 'five'])
frame2
Out[58]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
In [59]:
frame2.index
Out[59]:
Index(['one', 'two', 'three', 'four', 'five'], dtype='object')
In [60]:
frame2.columns
Out[60]:
Index(['year', 'state', 'pop', 'debt'], dtype='object')
In [61]:
# returns a 2d array
frame2.values
Out[61]:
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, nan],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, nan],
       [2002, 'Nevada', 2.9, nan]], dtype=object)
In [62]:
frame2['state']
# retrieve a row by its label:
#frame2.loc['one']
# retrieve a row by its zero-based position:
#frame2.iloc[0]
Out[62]:
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
In [63]:
type(frame2['state'])
Out[63]:
pandas.core.series.Series
In [64]:
frame2.year
Out[64]:
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64
In [65]:
type(frame2.year)
Out[65]:
pandas.core.series.Series
In [66]:
# retrieving rows using ix indexing field (deprecated)
#frame2.ix['two']
frame2.loc['two']
# gives you data for first index
#frame2.iloc[0]
Out[66]:
year     2001
state    Ohio
pop       1.7
debt      NaN
Name: two, dtype: object
In [67]:
#type(frame2.ix['two'])
type(frame2.loc['two'])
Out[67]:
pandas.core.series.Series
In [68]:
# assigning values to a column 
frame2['debt'] = 16.6
frame2
Out[68]:
year state pop debt
one 2000 Ohio 1.5 16.6
two 2001 Ohio 1.7 16.6
three 2002 Ohio 3.6 16.6
four 2001 Nevada 2.4 16.6
five 2002 Nevada 2.9 16.6
In [69]:
frame2['debt'] = [10, 20, 15, 22, 32]
frame2
Out[69]:
year state pop debt
one 2000 Ohio 1.5 10
two 2001 Ohio 1.7 20
three 2002 Ohio 3.6 15
four 2001 Nevada 2.4 22
five 2002 Nevada 2.9 32
In [70]:
frame2['debt'] = np.arange(5)
frame2
Out[70]:
year state pop debt
one 2000 Ohio 1.5 0
two 2001 Ohio 1.7 1
three 2002 Ohio 3.6 2
four 2001 Nevada 2.4 3
five 2002 Nevada 2.9 4
In [71]:
# you can assign a Series to a DataFrame column; labels are aligned, and unmatched index entries get NaN
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
val
Out[71]:
two    -1.2
four   -1.5
five   -1.7
dtype: float64
In [72]:
frame2['debit'] = val
frame2
Out[72]:
year state pop debt debit
one 2000 Ohio 1.5 0 NaN
two 2001 Ohio 1.7 1 -1.2
three 2002 Ohio 3.6 2 NaN
four 2001 Nevada 2.4 3 -1.5
five 2002 Nevada 2.9 4 -1.7
In [73]:
# assigning to a column that does not exist creates a new column
frame2['eastern'] = frame2.state == 'Ohio'
frame2
Out[73]:
year state pop debt debit eastern
one 2000 Ohio 1.5 0 NaN True
two 2001 Ohio 1.7 1 -1.2 True
three 2002 Ohio 3.6 2 NaN True
four 2001 Nevada 2.4 3 -1.5 False
five 2002 Nevada 2.9 4 -1.7 False
In [74]:
# deleting a column
del frame2['eastern']
frame2
Out[74]:
year state pop debt debit
one 2000 Ohio 1.5 0 NaN
two 2001 Ohio 1.7 1 -1.2
three 2002 Ohio 3.6 2 NaN
four 2001 Nevada 2.4 3 -1.5
five 2002 Nevada 2.9 4 -1.7
In [75]:
frame2.columns
Out[75]:
Index(['year', 'state', 'pop', 'debt', 'debit'], dtype='object')
In [76]:
# Another common form of data is a nested dict of dicts format
pop = {'Nevada':{2001: 2.4, 2002: 2.9}, 'Ohio':{2000: 1.5, 2001: 1.7, 2002: 3.6}}
pop
Out[76]:
{'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
In [77]:
# loading this into a DataFrame makes the outer keys the columns and the inner keys the row indices
frame3 = DataFrame(pop)
frame3
Out[77]:
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2000 NaN 1.5
In [78]:
# you can transpose the DataFrame
f = frame3.T
f
Out[78]:
2001 2002 2000
Nevada 2.4 2.9 NaN
Ohio 1.7 3.6 1.5
In [79]:
# you can explicitly declare the indices
DataFrame(pop, index=[2001, 2002, 2003])
Out[79]:
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2003 NaN NaN

Index Objects

In [80]:
obj = Series(range(3), index =['a', 'b', 'c'])
index = obj.index
index
Out[80]:
Index(['a', 'b', 'c'], dtype='object')
In [81]:
index[1:]
Out[81]:
Index(['b', 'c'], dtype='object')
In [82]:
index[1]
Out[82]:
'b'
In [83]:
# index objects are immutable, so you cannot do:
index[1] = 'd'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-83-c11369242898> in <module>
      1 # index objects are immutable, so you cannot do:
----> 2 index[1] = 'd'

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   3908 
   3909     def __setitem__(self, key, value):
-> 3910         raise TypeError("Index does not support mutable operations")
   3911 
   3912     def __getitem__(self, key):

TypeError: Index does not support mutable operations
In [84]:
# Immutability is important so that Index objects can be safely shared among data structures
index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index = index)
obj2.index is index
Out[84]:
True
In [85]:
# In addition to being array-like, an Index also functions as a fixed-size set:
frame3
Out[85]:
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2000 NaN 1.5
In [86]:
'Ohio' in frame3.columns
Out[86]:
True
In [87]:
2003 in frame3.index
Out[87]:
False
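
Beyond membership tests, an Index also supports set-style methods such as union, intersection, and difference; a small sketch:

```python
import pandas as pd

a = pd.Index(['a', 'b', 'c'])
b = pd.Index(['b', 'c', 'd'])

# each operation returns a new Index; the originals stay immutable
u = a.union(b)         # all labels from both
i = a.intersection(b)  # labels present in both
dif = a.difference(b)  # labels in a but not in b
```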

Essential Functionality

Reindexing

In [88]:
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
Out[88]:
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
In [89]:
# NaN appears for any index label that does not exist in the source
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
Out[89]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
In [90]:
# you can fill the NA with any value 
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value = 0)
Out[90]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
In [91]:
# for ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing
# use the method option: ffill means forward fill, bfill means backward fill
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')
Out[91]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
In [92]:
obj3.reindex(range(6), method='bfill')
Out[92]:
0      blue
1    purple
2    purple
3    yellow
4    yellow
5       NaN
dtype: object
In [93]:
# with a DataFrame, reindex can alter the (row) index, the columns, or both
frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'])
frame
Out[93]:
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
In [94]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2
Out[94]:
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
In [95]:
frame
Out[95]:
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
In [96]:
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)
Out[96]:
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
In [97]:
frame
Out[97]:
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
In [98]:
states
Out[98]:
['Texas', 'Utah', 'California']
In [99]:
# doing both index and columns reindexing in one shot
# this used to work, but newer pandas versions reject method= combined with columns=
#frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill', columns=states)
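
As a workaround in newer pandas versions, you can either pass index and columns together without a fill method, or chain two reindex calls; a sketch using the same frame and states as above:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# one call, no fill method: new row 'b' and new column 'Utah' come out all NaN
both = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)

# two chained calls: forward-fill the rows first, then reindex the columns
filled = frame.reindex(['a', 'b', 'c', 'd'], method='ffill').reindex(columns=states)
```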
In [100]:
frame
Out[100]:
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
In [101]:
states
Out[101]:
['Texas', 'Utah', 'California']

Dropping entries from an axis

In [102]:
obj = Series(np.arange(5), index=['a','b','c','d','e'])
obj
Out[102]:
a    0
b    1
c    2
d    3
e    4
dtype: int32
In [103]:
new_obj = obj.drop('c')
new_obj
Out[103]:
a    0
b    1
d    3
e    4
dtype: int32
In [104]:
# in case I forgot to mention it: you can use the copy method on Series and DataFrames
copyObj = new_obj.copy()
copyObj
Out[104]:
a    0
b    1
d    3
e    4
dtype: int32
In [105]:
data = DataFrame(np.arange(16).reshape((4,4)), index=['Ohio','Colorado','Utah','New York'],
                                                columns=['one','two','three','four'])
data
#data['one']
Out[105]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [106]:
# drop returns a new object; data itself is unchanged unless you assign the result
data.drop(['Colorado','Ohio'])
Out[106]:
one two three four
Utah 8 9 10 11
New York 12 13 14 15
In [107]:
data.drop('two', axis=1)
Out[107]:
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
In [108]:
data.drop(['two','four'],axis=1)
Out[108]:
one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14

Indexing, selection and filtering

In [109]:
obj = Series(np.arange(4), index=['a','b','c','d'])
obj
Out[109]:
a    0
b    1
c    2
d    3
dtype: int32
In [110]:
obj['b']
Out[110]:
1
In [111]:
obj[1]
Out[111]:
1
In [112]:
obj[2:4]
Out[112]:
c    2
d    3
dtype: int32
In [113]:
obj[['b','a','d']]
Out[113]:
b    1
a    0
d    3
dtype: int32
In [114]:
obj[[1,3]]
Out[114]:
b    1
d    3
dtype: int32
In [115]:
obj[obj < 2]
Out[115]:
a    0
b    1
dtype: int32
In [116]:
# slicing with labels is different from normal Python slicing: the endpoint is inclusive
obj['b':'c']
Out[116]:
b    1
c    2
dtype: int32
In [117]:
obj['b':'c'] = 5
obj
Out[117]:
a    0
b    5
c    5
d    3
dtype: int32
In [118]:
data = DataFrame(np.arange(16).reshape((4,4)),
                 index=['Ohio','Colorado','Utah','New York'],
                 columns=['one','two','three','four'])
data
Out[118]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [119]:
# note: indexing with a single label selects a column, while loc/iloc select rows
data['two']
#data.loc['Ohio'] # row by label
#data.iloc[2] # row by position
Out[119]:
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32
In [120]:
data[['two','three']]
Out[120]:
two three
Ohio 1 2
Colorado 5 6
Utah 9 10
New York 13 14
In [121]:
# selecting rows by slicing
data[:2]
Out[121]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
In [122]:
data[data['three'] > 5]
Out[122]:
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [123]:
data
Out[123]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [124]:
data < 5
Out[124]:
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
In [125]:
data[data < 5] = 0
data
Out[125]:
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [126]:
# the deprecated ix indexer used to cover these operations in one attribute;
# use loc (for labels) and iloc (for positions) instead; see the pandas DataFrame reference
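
For reference, the loc/iloc equivalents of the old ix-style mixed selection might look like this (the data frame from above is rebuilt here so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# label-based: row 'Colorado', columns 'two' and 'three'
by_label = data.loc['Colorado', ['two', 'three']]

# position-based: third row, columns in a custom order
by_position = data.iloc[2, [3, 0, 1]]
```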

Arithmetic and data alignment

In [127]:
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a','c','d','e'])
s1
Out[127]:
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
In [128]:
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a','c','e','f','g'])
s2
Out[128]:
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64
In [129]:
# where the indices don't match, NaN is placed
s1 + s2
Out[129]:
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
In [130]:
df1 = DataFrame(np.arange(9).reshape((3,3)), columns=list('bcd'), index=['Ohio','Texas','Colorado'])
df1
Out[130]:
b c d
Ohio 0 1 2
Texas 3 4 5
Colorado 6 7 8
In [131]:
df2 = DataFrame(np.arange(12).reshape((4,3)), columns=list('bde'), index=['Utah','Ohio','Texas','Oregon'])
df2
Out[131]:
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
In [132]:
df1 + df2
Out[132]:
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN

Arithmetic methods with fill values

In [133]:
df1 = DataFrame(np.arange(12).reshape((3,4)), columns=list('abcd'))
df1
Out[133]:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
In [134]:
df2 = DataFrame(np.arange(20).reshape((4,5)), columns=list('abcde'))
df2
Out[134]:
a b c d e
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
In [135]:
df1 + df2
Out[135]:
a b c d e
0 0.0 2.0 4.0 6.0 NaN
1 9.0 11.0 13.0 15.0 NaN
2 18.0 20.0 22.0 24.0 NaN
3 NaN NaN NaN NaN NaN
In [136]:
# fill_value substitutes 0 for entries missing from one of the DataFrames
# this works for add, sub, div, mul
df1.add(df2, fill_value=0)
Out[136]:
a b c d e
0 0.0 2.0 4.0 6.0 4.0
1 9.0 11.0 13.0 15.0 9.0
2 18.0 20.0 22.0 24.0 14.0
3 15.0 16.0 17.0 18.0 19.0
In [137]:
df1
Out[137]:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
In [138]:
df2
Out[138]:
a b c d e
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
In [139]:
df1.reindex(columns=df2.columns, fill_value=0)
Out[139]:
a b c d e
0 0 1 2 3 0
1 4 5 6 7 0
2 8 9 10 11 0

Operation between DataFrame and Series

In [140]:
arr = np.arange(12).reshape((4,3))
arr
Out[140]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
In [141]:
arr[0]
Out[141]:
array([0, 1, 2])
In [142]:
# this is called broadcasting: the subtraction is performed once for each row
arr - arr[0]
Out[142]:
array([[0, 0, 0],
       [3, 3, 3],
       [6, 6, 6],
       [9, 9, 9]])
In [143]:
# the old ix-based row selection is deprecated; iloc is used here instead
frame = DataFrame(np.arange(12).reshape((4,3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
frame
Out[143]:
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
In [144]:
series
Out[144]:
b    0
d    1
e    2
Name: Utah, dtype: int32
In [145]:
# broadcasting down the rows
frame - series
Out[145]:
b d e
Utah 0 0 0
Ohio 3 3 3
Texas 6 6 6
Oregon 9 9 9
In [146]:
series2 = Series(range(3), index=['b','e','f'])
series2
Out[146]:
b    0
e    1
f    2
dtype: int64
In [147]:
frame
Out[147]:
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
In [148]:
# by default, the Series index is matched against the DataFrame's columns
frame.add(series2)
#frame + series2
Out[148]:
b d e f
Utah 0.0 NaN 3.0 NaN
Ohio 3.0 NaN 6.0 NaN
Texas 6.0 NaN 9.0 NaN
Oregon 9.0 NaN 12.0 NaN
In [149]:
frame
Out[149]:
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
In [150]:
# you can do broadcasting on the columns using the arithmetic methods, as follows
series3 = frame.loc['Ohio']
frame
Out[150]:
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
In [151]:
series3
Out[151]:
b    3
d    4
e    5
Name: Ohio, dtype: int32
In [152]:
# multiplication
frame.multiply(series3)
Out[152]:
b d e
Utah 0 4 10
Ohio 9 16 25
Texas 18 28 40
Oregon 27 40 55

Function application and mapping

NumPy ufuncs work fine with pandas objects:

In [153]:
frame = DataFrame(np.random.randn(4, 3), columns=list('dbe'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
Out[153]:
d b e
Utah 1.049846 -1.740212 0.752126
Ohio 0.540530 0.354600 -0.636673
Texas -0.083665 -0.874059 0.699313
Oregon 1.460114 0.169469 -1.925029
In [154]:
np.abs(frame)
Out[154]:
d b e
Utah 1.049846 1.740212 0.752126
Ohio 0.540530 0.354600 0.636673
Texas 0.083665 0.874059 0.699313
Oregon 1.460114 0.169469 1.925029

Another frequent operation is applying a function that works on a one-dimensional array to each column or row:

In [155]:
f = lambda x: x.max() - x.min()

# by default axis is zero
frame.apply(f)
Out[155]:
d    1.543779
b    2.094812
e    2.677155
dtype: float64
In [156]:
frame.apply(f, axis = 1)
Out[156]:
Utah      2.790059
Ohio      1.177203
Texas     1.573372
Oregon    3.385143
dtype: float64

Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
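
For example, summing each column directly gives the same result as spelling it out with apply; a small comparison on a toy frame:

```python
import pandas as pd

frame = pd.DataFrame({'x': [1.0, 4.0], 'y': [2.0, 8.0]})

built_in = frame.sum()                          # column sums via the built-in method
via_apply = frame.apply(lambda col: col.sum())  # same result, more typing
```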

apply need not return a scalar value, it can also return a Series with multiple values:

In [157]:
def f(x): return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)
Out[157]:
d b e
min -0.083665 -1.740212 -1.925029
max 1.460114 0.354600 0.752126

Element-wise Python functions can be used too. Suppose you wanted to compute a formatted string from each floating-point value in frame.

In [158]:
# old formatting way
pi = 3.14159
print(" pi = %1.2f " % pi)
 pi = 3.14 
In [159]:
# applymap applies a function element-wise, while apply applies a function along an axis
format = lambda x: '%.2f' % x
frame.applymap(format)
Out[159]:
d b e
Utah 1.05 -1.74 0.75
Ohio 0.54 0.35 -0.64
Texas -0.08 -0.87 0.70
Oregon 1.46 0.17 -1.93

The reason for the name applymap is that Series has a map method for applying an element-wise function:

In [160]:
frame['e'].map(format)
Out[160]:
Utah       0.75
Ohio      -0.64
Texas      0.70
Oregon    -1.93
Name: e, dtype: object

Sorting

In [161]:
obj = Series(range(4), index=['d','a','b','c'])
obj
Out[161]:
d    0
a    1
b    2
c    3
dtype: int64
In [162]:
obj.sort_index()
Out[162]:
a    1
b    2
c    3
d    0
dtype: int64
In [163]:
frame = DataFrame(np.arange(8).reshape((2,4)), index=['three','one'], columns=['d','a','b','c'])
frame                                                                        
Out[163]:
d a b c
three 0 1 2 3
one 4 5 6 7
In [164]:
# default axis is 0
frame.sort_index()
Out[164]:
d a b c
one 4 5 6 7
three 0 1 2 3
In [165]:
frame.sort_index(axis=1)
Out[165]:
a b c d
three 1 2 3 0
one 5 6 7 4
In [166]:
obj = Series([4,7,-3,2])
obj.sort_values()
Out[166]:
2   -3
3    2
0    4
1    7
dtype: int64
In [167]:
# any missing values are sorted to the end of the Series by default
obj = Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()
Out[167]:
4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64
In [168]:
frame = DataFrame({'b':[4,7,-3,2], 'a':[0,1,0,1]})
frame
Out[168]:
b a
0 4 0
1 7 1
2 -3 0
3 2 1
In [169]:
frame.sort_values(by='b')
Out[169]:
b a
2 -3 0
3 2 1
0 4 0
1 7 1
In [170]:
frame.sort_values(by=['a','b'])
Out[170]:
b a
2 -3 0
0 4 0
3 2 1
1 7 1

Summarizing and Computing Descriptive Statistics

In [171]:
df = DataFrame([[1.4,np.nan], [7.1,-4.5],
               [np.nan, np.nan], [0.75, -1.3]],
               index=['a','b','c','d'],
               columns=['one','two'])
df
Out[171]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
In [172]:
df.sum()
Out[172]:
one    9.25
two   -5.80
dtype: float64
In [173]:
df.sum(axis=1)
Out[173]:
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
In [174]:
df.mean(axis=1, skipna=True)
Out[174]:
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
In [175]:
df.describe()
Out[175]:
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000
In [176]:
# on non-numeric, it produces alternative summary statistics:
obj = Series(['a', 'a', 'b', 'c'] * 4)
obj
Out[176]:
0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object
In [177]:
obj.describe()
Out[177]:
count     16
unique     3
top        a
freq       8
dtype: object

Some summary statistics and related functions:

count
describe
min, max
quantile
sum
mean
median
mad
var
std
diff
pct_change
cumsum
cumprod
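
A couple of the accumulation functions from this list, shown on a toy Series:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 8.0])

running = s.cumsum()      # running totals: 1, 3, 7, 15
changes = s.pct_change()  # relative change from the previous value (first is NaN)
```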

Unique Values, Value Counts, and Membership

In [179]:
obj = Series(['c','a','d','a','a','b','b','c','c'])
obj
Out[179]:
0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object
In [180]:
uniques = obj.unique()
uniques
Out[180]:
array(['c', 'a', 'd', 'b'], dtype=object)
In [181]:
obj.value_counts()
Out[181]:
a    3
c    3
b    2
d    1
dtype: int64
In [182]:
# pandas also has a top-level function for this that can be used with any array or sequence
pd.value_counts(obj.values)
Out[182]:
a    3
c    3
b    2
d    1
dtype: int64
In [183]:
obj
Out[183]:
0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object
In [184]:
# isin performs vectorized set membership and can be very useful for filtering a data set
mask = obj.isin(['b','c'])
mask
Out[184]:
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
In [185]:
data = DataFrame({'Qu1': [1,3,4,5,4],
                  'Qu2': [2,3,1,2,3],
                  'Qu3': [1,5,2,4,4]})

data      
Out[185]:
Qu1 Qu2 Qu3
0 1 2 1
1 3 3 5
2 4 1 2
3 5 2 4
4 4 3 4
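
A common follow-up on a frame like this is a per-column histogram of values: apply value_counts to each column and fill the gaps with zero. A sketch using the same data:

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 5, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# value_counts per column; the rows are the distinct values, missing counts become 0
result = data.apply(lambda col: col.value_counts()).fillna(0)
```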

Handling Missing Data

In [186]:
# pandas uses the floating point value NaN to represent missing data
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
Out[186]:
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
In [187]:
# built-in Python None value is also treated as NaN
string_data[0] = None
string_data.isnull()
Out[187]:
0     True
1    False
2     True
3    False
dtype: bool

Filtering Out Missing Data

In [188]:
from numpy import nan as NA
data = Series([1, NA, 3.5, NA, 7])
data.dropna()
Out[188]:
0    1.0
2    3.5
4    7.0
dtype: float64
In [189]:
data
Out[189]:
0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64
In [190]:
data[data.notnull()]
Out[190]:
0    1.0
2    3.5
4    7.0
dtype: float64
In [191]:
data = DataFrame([[1, 6.5, 3], [1, NA, NA],
                 [NA, NA, NA], [NA, 6.5, 3]])

data
Out[191]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
In [192]:
cleaned = data.dropna()
In [193]:
# any row with NA will be dropped
cleaned
Out[193]:
0 1 2
0 1.0 6.5 3.0
In [194]:
data
Out[194]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
In [195]:
# drop a row only if all of its values are NA
data.dropna(how='all')
Out[195]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
In [196]:
data
Out[196]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
In [197]:
data[2] = NA
data
Out[197]:
0 1 2
0 1.0 6.5 NaN
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 NaN
In [198]:
# drop the column if all NA
data.dropna(how='all', axis = 1)
Out[198]:
0 1
0 1.0 6.5
1 1.0 NaN
2 NaN NaN
3 NaN 6.5
In [199]:
df = DataFrame(np.random.randn(7, 3))
df
Out[199]:
0 1 2
0 0.264870 -1.089576 1.044288
1 -0.597993 0.847850 1.140516
2 1.311829 0.970399 0.457232
3 -0.095538 0.154005 -0.608618
4 0.254186 0.731226 0.422971
5 -0.891403 0.163922 -0.615027
6 2.001235 0.663682 1.256099
In [200]:
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df
Out[200]:
0 1 2
0 0.264870 NaN NaN
1 -0.597993 NaN NaN
2 1.311829 NaN 0.457232
3 -0.095538 NaN -0.608618
4 0.254186 0.731226 0.422971
5 -0.891403 0.163922 -0.615027
6 2.001235 0.663682 1.256099
In [201]:
# keep only columns having at least thresh non-NA observations
df.dropna(thresh=4, axis='columns')  # or axis='rows'/0/1 as usual
Out[201]:
0 2
0 0.264870 NaN
1 -0.597993 NaN
2 1.311829 0.457232
3 -0.095538 -0.608618
4 0.254186 0.422971
5 -0.891403 -0.615027
6 2.001235 1.256099

Filling in Missing Data

In [202]:
df
Out[202]:
0 1 2
0 0.264870 NaN NaN
1 -0.597993 NaN NaN
2 1.311829 NaN 0.457232
3 -0.095538 NaN -0.608618
4 0.254186 0.731226 0.422971
5 -0.891403 0.163922 -0.615027
6 2.001235 0.663682 1.256099
In [203]:
df.fillna(0)
Out[203]:
0 1 2
0 0.264870 0.000000 0.000000
1 -0.597993 0.000000 0.000000
2 1.311829 0.000000 0.457232
3 -0.095538 0.000000 -0.608618
4 0.254186 0.731226 0.422971
5 -0.891403 0.163922 -0.615027
6 2.001235 0.663682 1.256099
In [204]:
df
Out[204]:
0 1 2
0 0.264870 NaN NaN
1 -0.597993 NaN NaN
2 1.311829 NaN 0.457232
3 -0.095538 NaN -0.608618
4 0.254186 0.731226 0.422971
5 -0.891403 0.163922 -0.615027
6 2.001235 0.663682 1.256099
In [205]:
# pass a dict that maps each column to the value used to fill its NAs
df.fillna({1:0.5, 2: -1})
Out[205]:
0 1 2
0 0.264870 0.500000 -1.000000
1 -0.597993 0.500000 -1.000000
2 1.311829 0.500000 0.457232
3 -0.095538 0.500000 -0.608618
4 0.254186 0.731226 0.422971
5 -0.891403 0.163922 -0.615027
6 2.001235 0.663682 1.256099
In [206]:
df
Out[206]:
0 1 2
0 0.264870 NaN NaN
1 -0.597993 NaN NaN
2 1.311829 NaN 0.457232
3 -0.095538 NaN -0.608618
4 0.254186 0.731226 0.422971
5 -0.891403 0.163922 -0.615027
6 2.001235 0.663682 1.256099
In [207]:
# fillna returns a new object, but you can modify the existing object in place
df.fillna(0, inplace=True)
df
Out[207]:
0 1 2
0 0.264870 0.000000 0.000000
1 -0.597993 0.000000 0.000000
2 1.311829 0.000000 0.457232
3 -0.095538 0.000000 -0.608618
4 0.254186 0.731226 0.422971
5 -0.891403 0.163922 -0.615027
6 2.001235 0.663682 1.256099
In [208]:
df = DataFrame(np.random.randn(6, 3))
df
Out[208]:
0 1 2
0 1.681404 1.151962 -0.296332
1 0.060390 -1.713858 -0.778501
2 1.216945 -0.585990 0.457104
3 0.843060 -0.148200 1.413625
4 -0.704657 1.052509 -0.520097
5 -0.875149 -1.522447 -0.562694
In [209]:
df.iloc[2:,1] = np.nan
df.iloc[4:,2] = np.nan
df
Out[209]:
0 1 2
0 1.681404 1.151962 -0.296332
1 0.060390 -1.713858 -0.778501
2 1.216945 NaN 0.457104
3 0.843060 NaN 1.413625
4 -0.704657 NaN NaN
5 -0.875149 NaN NaN
In [210]:
# forward filling
df.fillna(method='ffill')
Out[210]:
0 1 2
0 1.681404 1.151962 -0.296332
1 0.060390 -1.713858 -0.778501
2 1.216945 -1.713858 0.457104
3 0.843060 -1.713858 1.413625
4 -0.704657 -1.713858 1.413625
5 -0.875149 -1.713858 1.413625
In [211]:
df
Out[211]:
0 1 2
0 1.681404 1.151962 -0.296332
1 0.060390 -1.713858 -0.778501
2 1.216945 NaN 0.457104
3 0.843060 NaN 1.413625
4 -0.704657 NaN NaN
5 -0.875149 NaN NaN
In [212]:
# you can put a limit on how many consecutive values to fill
df.fillna(method='ffill', limit=2)
Out[212]:
0 1 2
0 1.681404 1.151962 -0.296332
1 0.060390 -1.713858 -0.778501
2 1.216945 -1.713858 0.457104
3 0.843060 -1.713858 1.413625
4 -0.704657 NaN 1.413625
5 -0.875149 NaN 1.413625
In [213]:
# With fillna you can do lots of other things with a little creativity
data = Series([1, np.nan, 3.5, np.nan, 7])
# fill the NAs with the mean of the observed values
data.fillna(data.mean())
Out[213]:
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
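
The same idea extends to a DataFrame, where df.mean() supplies a per-column fill value; a small sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 10.0, 20.0]})

# df.mean() is a Series of column means; fillna aligns it on the columns
filled = df.fillna(df.mean())
```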