Numpy in One Post
NumPy Basics: Arrays and Vectorized Computation¶
NumPy, short for Numerical Python, is the fundamental package required for high-performance scientific computing and data analysis.
Here is what it provides:
1- ndarray, a fast and space-efficient multidimensional array.
2- Standard mathematical functions for fast operations on entire arrays of data without having to write loops.
3- Tools for reading / writing array data to disk and working with memory-mapped files.
4- Linear algebra, random number generation, and Fourier transform capabilities.
5- Tools for integrating code written in C/C++ and Fortran.
The NumPy ndarray: A Multidimensional Array Object¶
One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large data sets in Python.
Creating an Array¶
# need to import the numpy library
import numpy as np
# one dimensional array
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1
# two dimensional array
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
arr2
# dimension of the array
arr2.ndim
# shape of the array
#type(arr2.shape)
arr2.shape
# data type of the array
arr1.dtype
# size of the array
arr2.size
# number of rows
len(arr2)
#arr2
# number of columns
# refer to this after reading about slicing
len(arr2[0,:])
# create one dimensional array and all zero
np.zeros(10)
# create one dimensional array and all ones
np.ones(5)
# create two dimensional array and all zero
np.zeros((3,5))
# similar to range but create one dimensional array
np.arange(10)
arr2
# create an array with the same shape as arr2, filled with ones
arr3 = np.ones_like(arr2)
arr3
# create an array with the same shape as arr2, filled with zeros
arr4 = np.zeros_like(arr2)
# create empty array (allocating new memory so values might be garbage)
arr5 = np.empty((3, 4))
arr5
# create an uninitialized array with the same shape as arr2
arr6 = np.empty_like(arr2)
arr6
# create n x n identity matrix
arr7 = np.identity(5)
arr7
# create n x n identity matrix
arr8 = np.eye(3)
arr8
Data Types for ndarrays¶
arr1 = np.array([1,2,3])
arr1.dtype
arr2 = np.array([1, 2, 3], dtype=np.int32)
arr2.dtype
Array types¶
int8, uint8
int16, uint16
int32, uint32
int64, uint64
float16
float32
float64
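float128 (extended precision; availability is platform-dependent)
complex64, complex128
complex256 (availability is platform-dependent)
bool
object
bytes_ (called string_ before NumPy 2.0)
str_ (called unicode_ before NumPy 2.0)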
arr = np.array([1, 2, 3])
arr.dtype
float_arr = arr.astype(np.float64)
float_arr
arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
arr.astype(np.int32)
# you can drop the dtype and get the same result
# (np.string_ was removed in NumPy 2.0; np.bytes_ is the same type under its current name)
numeric_strings = np.array(['1.2', '3.4', '5.6'], dtype=np.bytes_)
numeric_strings.astype(np.float64)
Operations between Arrays and Scalars¶
arr = np.array([[1, 2, 3], [4, 5, 6]])
arr
arr * arr
arr + arr
arr - arr
1.0 / arr
arr ** 2
Basic Indexing and Slicing¶
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
arr
arr[5]
arr[5:8]
arr[5:8] = 12
arr
# IMPORTANT: slices are views of the original array, so a change to the view affects the original
arr_slice = arr[5:8]
arr_slice[1] = 1000
arr
arr_slice[:] = 64
arr
# this is how you create a new array rather than a view of the original array
arr_new = np.array(arr[5:8])
arr[6] = 200
# no side effect on arr_new
arr_new
# or you can use
arr_new = arr[5:8].copy()
arr_new
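The view-versus-copy behaviour above can be checked directly. The following is a minimal sketch, assuming np.shares_memory is an acceptable way to test whether two arrays use the same buffer; the variable names are only illustrative.

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]            # basic slice: a view onto arr's memory
copied = arr[5:8].copy()   # explicit copy: owns its own memory

shares_view = np.shares_memory(arr, view)
shares_copy = np.shares_memory(arr, copied)

view[0] = -1     # writes through to arr[5]
copied[1] = -2   # leaves arr[6] alone

print(shares_view, shares_copy)
print(arr)
```

If the slice really is a view, arr[5] changes while arr[6] keeps its original value.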
# some examples for higher dimensional arrays
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d[2]
arr2d[2][2]
# or you can
arr2d[2, 2]
# examples for 3D arrays
arr3d = np.array([[[1, 2, 3], [3, 4, 5]], [[6, 7, 8], [9, 10 , 11]]])
arr3d
# each index you supply steps one level of brackets deeper
# the expression below returns a 2 x 3 array
arr3d[0]
arr3d[0][1]
arr3d[0][1][2]
# or you can type
arr3d[0, 1, 2]
# some more operations
# again, you need copy() so that you do not get a view
old_values = arr3d[0].copy()
arr3d[0] = 42
arr3d
arr3d[0] = old_values
arr3d
Indexing with Slices¶
arr2d
arr2d[:2]
arr2d[:2, 1:]
arr2d[1, :2]
arr2d[2, :1]
arr2d[:, :1]
arr2d[:2, 1:] = 1000
arr2d
Boolean Indexing¶
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
# random samples from the standard normal distribution (mean 0, standard deviation 1)
data = np.random.randn(7, 4)
data
names.shape
data
#names
names == 'Bob'
# the boolean array is matched against the rows; only rows where it is True are selected
data[names == 'Bob', 2:]
data[names == 'Bob', 3]
# To select everything but Bob
names != 'Bob'
# or you can use ~
data[~(names == 'Bob')]
Note: Selecting data from an array by boolean indexing always creates a copy of the data¶
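A small sketch of what the note means in practice, using a made-up 4 x 3 array rather than the random data above: changing the selected rows leaves the source alone, while assigning through the mask on the left-hand side modifies the source in place.

```python
import numpy as np

data = np.arange(12).reshape(4, 3)
mask = np.array([True, False, True, False])

original = data.copy()

selected = data[mask]   # boolean indexing on the right-hand side: a copy
selected[:] = -1        # only the copy changes
still_same = np.array_equal(data, original)

data[mask] = 0          # assignment through the mask changes data itself

print(still_same)
print(data)
```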
# you can use & and | for boolean expressions
mask = (names == 'Bob') | (names == 'Will')
mask
data[mask]
Note: keywords and/or do not work with boolean arrays¶
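To see why, here is a short sketch with a shortened names array. Python's or asks for the truth value of a whole array, which NumPy refuses to give for arrays with more than one element, so it raises ValueError. The element-wise | operator, or equivalently np.logical_or, works.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob'])

try:
    mask = (names == 'Bob') or (names == 'Will')
    raised = False
except ValueError:
    # "The truth value of an array with more than one element is ambiguous"
    raised = True

mask = (names == 'Bob') | (names == 'Will')              # element-wise OR
same = np.logical_or(names == 'Bob', names == 'Will')    # equivalent ufunc

print(raised)
print(mask)
```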
data
# setting all negative values in array data to zero
data[data < 0] = 0
data
Fancy Indexing¶
arr = np.zeros((8, 4))
for i in range(len(arr)):
    arr[i] = i
arr
# fancy indexing
# picks complete row of each element of the list
arr[[4, 3, 0, 6]]
# negative indices count from the end: -1 is the last row
arr[[-3, -5, -7]]
# reshape being introduced here
arr = np.arange(32).reshape((8, 4))
arr
# another fancy indexing
# two index lists are paired up: this selects elements (1, 0), (5, 3), (7, 1) and (2, 2)
arr[[1, 5, 7, 2], [0, 3, 1, 2]]
Note: Fancy indexing, unlike slicing, always copies the data into a new array¶
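The sketch below checks the copy behaviour and also contrasts paired index lists with a rectangular selection. Using np.ix_ for the rectangular case is an addition here, not something the notes above rely on.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))

rows = arr[[1, 5]]   # fancy indexing: a new array, not a view
rows[:] = 0          # arr is unaffected
shares = np.shares_memory(arr, rows)

# paired indices: one element per (row, column) pair
pairs = arr[[1, 5, 7, 2], [0, 3, 1, 2]]

# rectangular block: every listed row crossed with every listed column
block = arr[np.ix_([1, 5, 7, 2], [0, 3, 1, 2])]

print(shares)
print(pairs)
print(block)
```

Here pairs has four elements, while block is a 4 x 4 array whose first row is row 1 of arr with its columns reordered.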
Transposing Arrays and Swapping Axes¶
arr = np.arange(15).reshape((3, 5))
arr
# .T returns the transpose as a view of the array (no data is copied)
arr.T
arr = np.array([[1, 2], [3, 4]])
arr
# matrix multiplication
arr.dot(arr)
# or you can type
np.dot(arr, arr)
arr = np.random.randn(6, 3)
np.dot(arr.T, arr)
arr = np.array([[1, 2, 3], [4, 5, 6]])
arr
# transpose permutes the axes; axes are numbered 0, 1, ... up to the number of dimensions minus one
# the following swaps the rows and columns
arr.transpose(1,0)
arr = np.arange(16).reshape((2, 2, 4))
arr
# the following keeps the last axis in place and swaps the first axis with the second
# to see what happens, think of element A[i, j, k] moving to position [j, i, k]
arr.transpose(1, 0, 2)
# swapaxes works like transpose but takes a single pair of axes to swap
arr
arr.swapaxes(1,2)
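As a quick check of the A[i, j, k] picture, the sketch below verifies element by element that transpose(1, 0, 2) moves arr[i, j, k] to position [j, i, k], and that it matches swapaxes(0, 1) on the same array.

```python
import numpy as np

arr = np.arange(16).reshape((2, 2, 4))

t = arr.transpose(1, 0, 2)   # swap axes 0 and 1, keep axis 2
s = arr.swapaxes(0, 1)       # the same permutation written as a pair

# every element arr[i, j, k] should appear at t[j, i, k]
moved_correctly = all(
    t[j, i, k] == arr[i, j, k]
    for i in range(2) for j in range(2) for k in range(4)
)

is_view = np.shares_memory(arr, t)       # transposing does not copy data
swapped_shape = arr.swapaxes(1, 2).shape

print(moved_correctly, is_view, swapped_shape)
```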
Universal Functions: Fast Element-wise Array Functions¶
A universal function, or ufunc, is a function that performs elementwise operations on data in ndarrays. You can think of them as fast vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results.
arr = np.arange(10)
# unary universal function of sqrt
np.sqrt(arr)
# unary universal function of exponent
np.exp(arr)
x = np.random.randn(8)
y = np.random.randn(8)
x
y
# binary universal function of maximum (compares element by element in order)
np.maximum(x, y)
arr = np.random.randn(8)
# modf returns two arrays as a tuple: the fractional parts and the integral parts of the numbers
np.modf(arr)
Some unary ufuncs (Please refer to NumPy documentation for the explanation of each)¶
abs, fabs
sqrt
square
exp
log, log10, log2, log1p
sign
ceil
floor
rint
modf
isnan
isfinite, isinf
cos, cosh, sin, sinh
tan, tanh
arccos, arccosh, arcsin
arcsinh, arctan, arctanh
logical_not
Some binary ufuncs (Please refer to NumPy documentation for the explanation of each)¶
add
subtract
multiply
divide, floor_divide
power
maximum, fmax
minimum, fmin
mod
copysign
greater, greater_equal
less, less_equal, equal
not_equal
logical_and
logical_or
logical_xor
Data Processing Using Arrays¶
Using NumPy arrays enables you to express many kinds of data processing tasks as concise array expressions that might otherwise require writing loops. This practice of replacing explicit loops with array expressions is commonly referred to as vectorization. In general, vectorized array operations will often be one or two (or more) orders of magnitude faster than their pure Python equivalents.
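One way to see the speed difference on your own machine is sketched below. The array size, the repeat count, and the helper names loop_sqrt and vectorized_sqrt are arbitrary choices for illustration, and the measured ratio will vary with hardware and NumPy version.

```python
import math
import timeit

import numpy as np

values = np.arange(100_000, dtype=np.float64)

def loop_sqrt(xs):
    # pure Python: one math.sqrt call per element
    return [math.sqrt(x) for x in xs]

def vectorized_sqrt(xs):
    # a single ufunc call over the whole array
    return np.sqrt(xs)

loop_time = timeit.timeit(lambda: loop_sqrt(values), number=5)
vec_time = timeit.timeit(lambda: vectorized_sqrt(values), number=5)

results_match = np.allclose(loop_sqrt(values), vectorized_sqrt(values))

print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s")
print(results_match)
```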
# let's say you want to calculate the function sqrt(x^2 + y^2) across a regular grid of values.
# np.meshgrid takes two 1D arrays and produces two 2D arrays; look at the following example to see how
points = np.arange(0, 10, 2)
points
xs, ys = np.meshgrid(points, points)
xs
ys
z = np.sqrt(xs ** 2 + ys ** 2)
z
Expressing Conditional Logic as Array Operations¶
The numpy.where function is a vectorized version of the ternary expression x if condition else y
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])
# zip() is a built-in Python function that returns an iterator aggregating elements from each of the iterables.
result = [(x if c else y) for x, y, c in zip(xarr, yarr, cond)]
result
This has multiple problems.
First, it will not be very fast for large arrays, because all the work is done in pure Python.
Second, it will not work with multidimensional arrays.
With np.where you can write:
result = np.where(cond, xarr, yarr)
result
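To illustrate the second point, here is a small sketch with made-up 2 x 2 arrays. The same np.where call works unchanged on multidimensional input, where the list comprehension would need nested loops.

```python
import numpy as np

cond2d = np.array([[True, False],
                   [False, True]])
x2d = np.array([[1, 2],
                [3, 4]])
y2d = np.array([[10, 20],
                [30, 40]])

# take from x2d where cond2d is True, otherwise from y2d
result2d = np.where(cond2d, x2d, y2d)

print(result2d)
```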
# the second and third arguments of np.where can each be an array or a scalar
arr = np.random.randn(4,4)
arr
# we want to replace all positive values with 2 and all negative values with -2
np.where(arr > 0, 2, -2)
# or setting only positive values to 2
np.where(arr > 0, 2, arr)
'''
Consider the following example, where we have two boolean arrays, cond1 and cond2, and wish to assign
a different value for each of the 4 possible pairs of boolean values.
Pure Python:
'''
cond1 = np.array([True, True, False, False])
cond2 = np.array([True, False, True, False])
result = []
for i in range(len(cond1)):
    if cond1[i] and cond2[i]:
        result.append(0)
    elif cond1[i]:
        result.append(1)
    elif cond2[i]:
        result.append(2)
    else:
        result.append(3)
result
# smart way of using np.where
np.where(cond1 & cond2, 0, np.where(cond1, 1, np.where(cond2, 2, 3)))
# in arithmetic, True is treated as 1 and False as 0
# so we can rewrite the previous code as:
result = 1 * (cond1 & ~cond2) + 2 * (~cond1 & cond2) + 3 * (~cond1 & ~cond2)
result
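Nested np.where calls get hard to read as the number of cases grows. As an alternative that the text above does not use, np.select takes a list of conditions and a list of matching values, and picks the value for the first condition that is True. A minimal sketch:

```python
import numpy as np

cond1 = np.array([True, True, False, False])
cond2 = np.array([True, False, True, False])

# conditions are tested in order; the first True one wins, else the default
result = np.select(
    [cond1 & cond2, cond1, cond2],
    [0, 1, 2],
    default=3,
)

print(result)
```

The order of the conditions matters here in the same way as the order of the if/elif branches in the loop version.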
Mathematical and Statistical Methods¶
arr = np.random.randn(5, 4)
arr
arr.mean()
arr.sum()
arr.std()
arr = np.array([[1, 2, 3],[4, 5, 6], [7, 8, 9]])
arr
# mean along an axis: for a 2D array, axis 0 gives one mean per column and axis 1 gives one mean per row
arr.mean(0)
arr.mean(axis = 1)
arr.sum(axis = 0)
# cumulative sum along axis 0 (running total down each column)
arr.cumsum(axis = 0)
# cumulative product along axis 1 (running product across each row)
arr.cumprod(axis = 1)
Basic array statistical methods¶
sum
mean
std, var
min, max
argmin, argmax (Indices of minimum and maximum elements, respectively. By default, the index is for the flattened array)
cumsum
cumprod
arr.min(axis = 0)
#arr
# max index for flattened array
arr.argmax()
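Because argmax returns an index into the flattened array by default, it is often useful to convert it back to a (row, column) position. The sketch below uses np.unravel_index for that, with a made-up 3 x 3 array chosen so the maximum is not in a corner.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 9, 6],
                [7, 8, 5]])

flat_index = arr.argmax()                          # index into the flattened array
row, col = np.unravel_index(flat_index, arr.shape)
row, col = int(row), int(col)

per_column = arr.argmax(axis=0)                    # row index of the max in each column

print(flat_index, (row, col))
print(per_column)
```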
Methods for Boolean Arrays¶
Boolean values are coerced to 1 (True) and 0 (False).
arr = np.random.randn(10)
arr
arr > 0
(arr > 0).sum()
# any() method returns True if any element is True
# (the array is named bools so that it does not shadow the built-in bool)
bools = np.array([False, False, True, False])
bools.any()
# all() method returns True if all elements are True
bools.all()
Sorting¶
arr = np.random.randn(10)
arr
arr.sort()
arr
arr = np.random.randn(3, 4)
arr
arr.sort(axis = 0)
arr
arr.sort(axis = 1)
arr
# finding 5% quantile
large_array = np.random.randn(1000)
large_array.sort()
large_array[int(0.05 * len(large_array))]
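The sort-and-index approach can be compared with np.quantile, which computes the same kind of statistic without sorting in place. This is a sketch under two assumptions: a seeded generator is used so the run is repeatable, and np.quantile is left at its default of interpolating between neighbouring values, so the two answers can differ slightly.

```python
import numpy as np

rng = np.random.default_rng(0)            # seeded only to make the run repeatable
large_array = rng.standard_normal(1000)

sorted_copy = np.sort(large_array)        # np.sort returns a sorted copy
by_sorting = sorted_copy[int(0.05 * len(sorted_copy))]

by_quantile = np.quantile(large_array, 0.05)

print(by_sorting, by_quantile)
```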
Unique and Other Set Logic¶
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
np.unique(names)
ints = np.array([3, 3, 3, 2, 2, 1, 1, 4, 4])
np.unique(ints)
# putting in set to remove the duplicates
sorted(set(names))
#names.sort()
#names
# compute a boolean array indicating whether each element of x is contained in y
# (np.isin is the current name for this test; np.in1d is deprecated as of NumPy 2.0)
values = np.array([6, 0, 0, 3, 2, 5, 6])
np.isin(values, [2, 3, 6])
# compute the sorted union of elements
np.union1d(values, [200, 100])
# compute the sorted, common elements
np.intersect1d(values, [3, 2])
# set difference, elements in first set but not the second one
np.setdiff1d(values, [0, 6, 10])
# set symmetric difference, elements that are in either of the arrays but not both
np.setxor1d(values, [0, 6, 10])
Linear Algebra¶
# example of matrix multiplication
x = np.array([[1, 2, 3], [4, 5, 6]])
y = np.array([[6, 23], [-1, 7], [8, 9]])
np.dot(x, y)
# the same
x.dot(y)
# numpy.linalg has a standard set of matrix decompositions plus operations such as inverse and determinant
from numpy.linalg import inv, qr
X = np.random.randn(5,5)
# T for transpose
mat = X.T.dot(X)
# inverse of the matrix
inv(mat)
# the product should be the identity matrix, up to floating-point rounding
mat.dot(inv(mat))
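Because of floating-point rounding, the product will not be exactly the identity, so an exact == comparison is not a reliable check. A minimal sketch of a tolerance-based check with np.allclose follows; the seed and the atol value are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only to make the run repeatable
X = rng.standard_normal((5, 5))
mat = X.T @ X                    # same product as X.T.dot(X)

product = mat @ np.linalg.inv(mat)

# compare with the identity within a small absolute tolerance
is_identity = np.allclose(product, np.eye(5), atol=1e-6)

print(is_identity)
```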
Commonly-used numpy.linalg functions¶
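diag (return the diagonal of a matrix; lives in the top-level namespace as np.diag)
dot (matrix multiplication; np.dot)
trace (sum of the main diagonal; np.trace)
det (determinant)
eig (eigenvalues and eigenvectors of a square matrix)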
inv (inverse)
qr (QR decomposition)
svd (singular value decomposition)
solve (solve Ax = b for x, where A is a square matrix)
mat.trace()
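As a small worked example of two entries from the list, the sketch below solves a 2 x 2 system with np.linalg.solve and computes a determinant with np.linalg.det. The matrix and right-hand side are made up so that the answer is easy to check by hand.

```python
import numpy as np

# 3x + 1y = 9
# 1x + 2y = 8
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)     # solves A x = b without forming the inverse
determinant = np.linalg.det(A)

print(x)            # approximately [2. 3.]
print(determinant)  # approximately 5.0
```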
Random Number Generation¶
# np.random supplements the built-in Python random module with functions for efficiently generating whole arrays of samples
# for example a 4 by 4 array of samples from standard normal distribution
samples = np.random.normal(size=(4,4))
samples
List vs Array in Python¶
Arrays and lists are both used in Python to store data, but they do not serve exactly the same purpose. Both can store any data type (real numbers, strings, etc.), and both can be indexed and iterated over, but the similarities do not go much further. The main difference between a list and an array is the set of operations you can perform on them: arithmetic on an array is applied element by element, while a list has no element-wise arithmetic.
Another difference is that all elements of an array share the same data type, whereas list elements can have different data types.
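A short sketch of both differences, with illustrative variable names: the same operators mean different things for lists and arrays, and an array built from mixed numbers is converted to a single dtype.

```python
import numpy as np

py_list = [1, 2, 3]
np_arr = np.array([1, 2, 3])

list_times_two = py_list * 2      # repeats the list
arr_times_two = np_arr * 2        # multiplies every element

list_plus = py_list + py_list     # concatenates
arr_plus = np_arr + np_arr        # adds element by element

mixed = np.array([1, 2.5, 3])     # ints and a float: everything becomes float64

print(list_times_two, arr_times_two)
print(list_plus, arr_plus)
print(mixed.dtype)
```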
Some of numpy.random functions¶
seed (seed the random number generator)
permutation (return a random permutation)
shuffle (randomly permute a sequence in place)
rand (draw samples from a uniform distribution)
randint (draw random integers from a given low-to-high range)
randn (draw samples from a normal distribution with mean 0 and standard deviation 1)
binomial (draw samples from a binomial distribution)
normal (draw samples from normal (Gaussian) distribution)
beta (draw samples from beta distribution)
chisquare (draw samples from a chi-square distribution)
gamma (draw samples from gamma distribution)
uniform (draw samples from a uniform [0, 1) distribution)
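A brief sketch of a few of the functions listed above. Seeding makes the draws repeatable, which helps when debugging. The last lines use np.random.default_rng, the newer Generator interface, which is not part of the list above and is shown only as an alternative.

```python
import numpy as np

# seeding the global generator twice with the same value repeats the draws
np.random.seed(42)
first = np.random.randn(3)
np.random.seed(42)
second = np.random.randn(3)

perm = np.random.permutation(5)          # random ordering of 0..4
dice = np.random.randint(1, 7, size=10)  # integers from 1 to 6 (high is exclusive)

# newer interface: an independent, explicitly seeded generator
rng = np.random.default_rng(42)
ints = rng.integers(0, 10, size=5)       # integers from 0 to 9

print(first, second)
print(perm, dice, ints)
```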