Python Libraries

Libraries Needed for Data Science

On this site, we will use a variety of tools that require some initial configuration. To make sure everything goes smoothly later, we will set up most of those tools in this session. Some of this will likely be dull, but doing it now lets us do more exciting work in the weeks that follow without getting bogged down in further software configuration.

Getting Python

You will be using Python throughout the rest of these posts, including many popular third-party Python libraries for scientific computing. Anaconda is an easy-to-install bundle of Python and most of these libraries.

Please visit this page and follow the instructions to set up Python

Hello, Anaconda

Notebooks are composed of many "cells", which can contain text (like this one), or code (like the one below).

In [1]:
x = [10, 20, 30, 40, 50]
for item in x:
    print ("Item is ", item)
Item is  10
Item is  20
Item is  30
Item is  40
Item is  50

Python Libraries
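
Before checking library versions, it can help to confirm which Python installation the notebook kernel is actually running, especially if more than one is installed. The following is a minimal sketch using only the standard library; the variable names are illustrative and not part of the original notebook.

```python
import sys

# Full path of the interpreter that is executing this code.
interpreter = sys.executable

# Short version string, e.g. the "3.x.y" part of sys.version.
python_version = sys.version.split()[0]

# Folders that Python searches when you write "import something".
search_path = list(sys.path)

print("Interpreter:   ", interpreter)
print("Python version:", python_version)
print("Module search path:")
for entry in search_path:
    print("   ", entry)
```

If the interpreter path does not point inside the Anaconda folder you expect, packages installed with that Anaconda's pip will not be visible to the notebook.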

In [3]:
# To see which folders the Jupyter (IPython) notebook searches for Python 2
# and Python 3 modules, uncomment the next two lines:
#import sys
#print(sys.path)


# Check your OS path to see which Anaconda installation it points to.
# The pip that ships with that Anaconda installs modules locally, under
# lib/python2.7/site-packages inside that installation.
# I had two Anaconda installations, one of them broken, and my path pointed
# at the wrong one. So I went to the bin folder of the correct Anaconda
# (mine is under Users/XXXX/anaconda/bin) and ran ./pip from there.
# For Python 3, find your Python 3 path, go to the bin folder under it, and
# run ./pip3 install followed by the package name, so that the modules are
# installed for Python 3 too.
# Mine was: /Library/Frameworks/Python.framework/Versions/3.4/bin

# My notebook had only a Python 2 kernel, so I used this command to make
# Python 3 available too:
# go to Users/XXXX/anaconda/bin
# sudo ipython3 kernel install
 
    
# IPython is what you are using now to run the notebook
import IPython
print ("IPython version:    ", IPython.__version__, "(need at least 1.0)")

# NumPy is a library for working with arrays
import numpy as np
print ("Numpy version:        ", np.__version__, "(need at least 1.7.1)")

# SciPy implements many different numerical algorithms
import scipy as sp
print ("SciPy version:        ", sp.__version__, "(need at least 0.12.0)")

# Pandas makes working with data tables easier
import pandas as pd
print ("Pandas version:       ", pd.__version__, "(need at least 0.11.0)")

# Matplotlib is the module for plotting
import matplotlib
print ("Matplotlib version:   ", matplotlib.__version__, "(need at least 1.2.1)")

# Scikit-Learn implements several machine learning algorithms
import sklearn
print ("Scikit-Learn version: ", sklearn.__version__, "(need at least 0.13.1)")

# Requests is a library for getting data from the Web
import requests
print ("requests version:     ", requests.__version__, "(need at least 1.2.3)")

# NetworkX is a library for working with networks
import networkx as nx
print ("NetworkX version:     ", nx.__version__, "(need at least 1.7)")

# BeautifulSoup (imported as bs4) is a library for parsing HTML and XML documents
import bs4
print ("BeautifulSoup version:", bs4.__version__, "(need at least 3.2)")
IPython version:     7.12.0 (need at least 1.0)
Numpy version:         1.18.1 (need at least 1.7.1)
SciPy version:         1.4.1 (need at least 0.12.0)
Pandas version:        1.0.1 (need at least 0.11.0)
Matplotlib version:    3.1.3 (need at least 1.2.1)
Scikit-Learn version:  0.22.1 (need at least 0.13.1)
requests version:      2.22.0 (need at least 1.2.3)
NetworkX version:      2.4 (need at least 1.7)
BeautifulSoup version: 4.8.2 (need at least 3.2)
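
The cell above leaves the comparison against the minimum versions to the reader. As a rough sketch of how that check could be done in code, the two helpers below (`version_tuple` and `meets_minimum`, both hypothetical names, not part of any library) turn version strings into tuples of integers before comparing them. Comparing the raw strings would be wrong, because "1.18.1" sorts before "1.7.1" as text.

```python
def version_tuple(version):
    """Turn a version string such as '1.18.1' into a tuple of ints, (1, 18, 1).

    Hypothetical helper: it keeps only the leading digits of each
    dot-separated piece and stops at the first piece without any.
    """
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    a = version_tuple(installed)
    b = version_tuple(minimum)
    # Pad the shorter tuple with zeros so that '2.4' equals '2.4.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


print("1.18.1" >= "1.7.1")               # False: text comparison gets it wrong
print(meets_minimum("1.18.1", "1.7.1"))  # True: numeric comparison
print(meets_minimum("2.4", "1.7"))       # True
```

This is only an illustration for simple numeric versions. For pre-releases and other unusual version strings, a dedicated tool such as the `packaging` library is likely a better choice.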

Hello matplotlib

The notebook integrates nicely with Matplotlib, the primary plotting package for Python. The following cell should embed a figure of a sine wave:

In [4]:
#this line prepares IPython for working with matplotlib
%matplotlib inline  

# this actually imports matplotlib
import matplotlib.pyplot as plt  

x = np.linspace(0, 10, 30)  #array of 30 points from 0 to 10
y = np.sin(x)
z = y + np.random.normal(size=30) * .2
#print (z)
plt.plot(x, y, 'ro-', label='A sine wave')
plt.plot(x, z, 'b-', label='Noisy sine')
plt.legend(loc = 'lower right')
plt.xlabel("X axis")
plt.ylabel("Y axis")           
Out[4]:
Text(0, 0.5, 'Y axis')
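
The `%matplotlib inline` line only works inside IPython or a notebook. As a hedged sketch of the same plot in a plain Python script, the version below selects the non-interactive "Agg" backend and saves the figure to a file instead of embedding it. The output file name and the seeded random generator are assumptions made for this example.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: draw to files, no window needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # seeded so the noise is reproducible

x = np.linspace(0, 10, 30)      # array of 30 points from 0 to 10
y = np.sin(x)
z = y + rng.normal(size=30) * 0.2

fig, ax = plt.subplots()
ax.plot(x, y, "ro-", label="A sine wave")
ax.plot(x, z, "b-", label="Noisy sine")
ax.legend(loc="lower right")
ax.set_xlabel("X axis")
ax.set_ylabel("Y axis")

# Hypothetical output location: a PNG file in the system temp folder.
out_path = os.path.join(tempfile.gettempdir(), "sine_wave.png")
fig.savefig(out_path)
plt.close(fig)
print("Figure written to", out_path)
```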

Hello Numpy

The NumPy array processing library is the basis of nearly all numerical computing in Python. Here's a 30-second crash course. For more details, consult Chapter 4 of Python for Data Analysis, or the NumPy User's Guide.

In [5]:
print ("Make a 3 row x 4 column array of random numbers")
x = np.random.random((3, 4))
print (x)
print()

print ("Add 1 to every element")
x = x + 1
print (x)
print()

print ("Get the element at row 1, column 2")
print (x[1, 2])
print()

# The colon syntax is called "slicing" the array. 
print ("Get the first row")
print (x[0, :])
print()

print ("Get every 2nd column of the first row")
print (x[0, ::2])
print()
Make a 3 row x 4 column array of random numbers
[[0.65131408 0.70095201 0.11668714 0.90682329]
 [0.98177087 0.54238409 0.31524446 0.88818107]
 [0.25609469 0.9957122  0.63371904 0.05045888]]

Add 1 to every element
[[1.65131408 1.70095201 1.11668714 1.90682329]
 [1.98177087 1.54238409 1.31524446 1.88818107]
 [1.25609469 1.9957122  1.63371904 1.05045888]]

Get the element at row 1, column 2
1.315244464390577

Get the first row
[1.65131408 1.70095201 1.11668714 1.90682329]

Get every 2nd column of the first row
[1.65131408 1.11668714]
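
Because the array above is random, the printed values change on every run. As a small sketch of the same indexing and slicing on predictable data, the example below uses `np.arange` to build a 3 x 4 array holding the numbers 0 to 11 (the array name `a` is only for illustration). Rows and columns are counted from 0, so "row 1, column 2" is the second row and third column.

```python
import numpy as np

# 3 rows x 4 columns holding 0..11:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
a = np.arange(12).reshape(3, 4)

element = a[1, 2]        # row 1, column 2 -> 6
first_row = a[0, :]      # every column of row 0 -> [0 1 2 3]
every_other = a[0, ::2]  # every 2nd column of row 0 -> [0 2]
last_column = a[:, -1]   # last column of every row -> [ 3  7 11]

print(element)
print(first_row)
print(every_other)
print(last_column)
```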

Print the maximum, minimum, and mean of the array. This does not require writing a loop. In the code cell below, type x.m<TAB> to find built-in operations for common array statistics like this.

In [6]:
print ("Max is  ", x.max())
print ("Min is  ", x.min())
print ("Mean is ", x.mean())
Max is   1.9957121978681591
Min is   1.0504588773182602
Mean is  1.5866118176205317

Call the x.max function again, but use the axis keyword to print the maximum of each row in x.

In [7]:
print (x.max(axis=1))
[1.90682329 1.98177087 1.9957122 ]

Print the maximum of each column.

In [8]:
print (x.max(axis=0))
[1.98177087 1.9957122  1.63371904 1.90682329]
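
The `axis` keyword names the dimension that gets collapsed: `axis=1` reduces across the columns and leaves one value per row, while `axis=0` reduces across the rows and leaves one value per column. A minimal sketch on a small, predictable array (the names here are illustrative) makes this easier to check by hand:

```python
import numpy as np

# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
a = np.arange(12).reshape(3, 4)

row_max = a.max(axis=1)    # one value per row    -> [ 3  7 11]
col_max = a.max(axis=0)    # one value per column -> [ 8  9 10 11]
col_mean = a.mean(axis=0)  # mean of each column  -> [4. 5. 6. 7.]

print(row_max, row_max.shape)  # shape (3,): one entry per row
print(col_max, col_max.shape)  # shape (4,): one entry per column
print(col_mean)
```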

Each trial in a binomial experiment has two mutually exclusive outcomes, often referred to as "success" and "failure". If the probability of success is p, the probability of failure is 1 - p.

A single trial of this kind, whose outcome is random and can be either of two possibilities, "success" or "failure", is called a Bernoulli trial, after the Swiss mathematician Jacob Bernoulli (1654 - 1705).
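
To make the link between the two ideas concrete, a binomial count is just the number of successes in a series of independent Bernoulli trials. The sketch below simulates the individual trials and adds them up. It assumes NumPy's `default_rng` generator with a fixed seed, which the rest of this post does not use, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded so the run is reproducible

n = 500  # number of Bernoulli trials (coin tosses)
p = 0.5  # probability of success (heads) on each trial

# One Bernoulli trial per entry: True means success, False means failure.
trials = rng.random(n) < p

# The binomial outcome is the count of successes among the n trials.
heads = int(trials.sum())

print("number of heads:", heads)
```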

Here's a way to quickly simulate 500 "fair" coin tosses (where the probability of getting Heads is 50%, or 0.5):

In [9]:
x = np.random.binomial(500, .5)
print ("number of heads:", x)
number of heads: 247

Repeat this simulation 500 times, and use the plt.hist() function to plot a histogram of the number of Heads (1s) in each simulation.

In [10]:
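# Three equivalent ways to draw the 500 samples. Each assignment to
# "heads" overwrites the previous one, so only the last is plotted.

# 1. explicit loop
heads = []
for i in range(500):
    heads.append(np.random.binomial(500, .5))

# 2. list comprehension
heads = [np.random.binomial(500, .5) for i in range(500)]

# 3. pure NumPy: the size argument draws all 500 samples in one call
heads = np.random.binomial(500, .5, size=500)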

histogram = plt.hist(heads, bins=10)
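
The histogram should be roughly bell-shaped and centered near 250. For a binomial count with n trials and success probability p, the mean is n * p and the standard deviation is sqrt(n * p * (1 - p)), which gives 250 and about 11.18 here. The sketch below compares those values with a simulated sample. It assumes a seeded `default_rng` generator for reproducibility, and the variable names are illustrative.

```python
import math

import numpy as np

n, p, repeats = 500, 0.5, 500

rng = np.random.default_rng(seed=0)
heads = rng.binomial(n, p, size=repeats)  # 500 simulations of 500 tosses

expected_mean = n * p                      # 250.0
expected_std = math.sqrt(n * p * (1 - p))  # sqrt(125), about 11.18

print("sample mean:", heads.mean(), " expected:", expected_mean)
print("sample std: ", heads.std(), " expected:", round(expected_std, 2))
```

The sample values will not match the theory exactly, but with 500 repetitions they should land close to it.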

What's Next?

Please visit this page and follow the instructions to start learning Python.