Python for Data Analytics
Lectures 3 & 4: Essential Libraries NumPy and pandas
Rodrigo [email protected]
Spring 2015
-
NumPy
-
NumPy
NumPy is the fundamental package for high-performance scientific computing and data analysis. It provides:
ndarray, a fast and space-efficient multidimensional array providing vectorized operations
Standard mathematical functions for fast operations over entire arrays without having to write loops
Tools for reading and writing array data to disk and working with memory-mapped files
Tools for integrating code written in C, C++, and Fortran
A good understanding of how NumPy works will help you use tools like pandas
-
ndarray
ndarray stands for N-dimensional array.
data
array([[ 0.73230045,  0.25494037,  0.79516021],
       [ 0.62986533,  0.3420035 ,  0.08914765]])
You can get the shape of an array and the type of its elements by accessing the attributes shape and dtype:
print(data.shape)
print(data.dtype)
(2, 3)
float64
-
Creating ndarrays
It is possible to create ndarrays from a list or a list of lists
From a list:
import numpy as np
data1 = [1, 2, 3, 4]
arr1 = np.array(data1)
arr1
array([1, 2, 3, 4])
From a list of lists:
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
arr2
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
-
Creating ndarrays
Creating an array initialized with zeros:
np.zeros((3, 6))
array([[ 0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.]])
-
Creating ndarrays
Creating an array with random numbers:
data = np.random.rand(2, 3)
print(data.shape)
print(data.dtype)
data
(2, 3)
float64
array([[ 0.73230045,  0.25494037,  0.79516021],
       [ 0.62986533,  0.3420035 ,  0.08914765]])
-
Data Types for ndarrays
ndarrays are composed of elements that are all of the same type:
int
float
complex
bool
string
object
In practice, an array of type object can hold elements of any type, but such arrays are not common
Example
arr = np.array(['Hello', np.random.rand])
arr
array(['Hello', <built-in method rand of mtrand.RandomState object at 0x...>], dtype=object)
-
Operations between Arrays and Scalars
ndarray supports vectorized operations, i.e., operations that are applied to each element of an array without the need for explicit loops
Multiplication by a scalar
data * 10
array([[ 6.39219315,  6.8102819 ,  4.34637984],
       [ 0.34237044,  5.39243817,  1.26276343]])
Addition
data + data
array([[ 1.27843863,  1.36205638,  0.86927597],
       [ 0.06847409,  1.07848763,  0.25255269]])
-
Operations between Arrays and Scalars
arr = np.array([[1., 2., 3], [4, 5, 6], [7, 8, 9]])
arr
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.],
       [ 7.,  8.,  9.]])
Multiplication
arr * arr
array([[  1.,   4.,   9.],
       [ 16.,  25.,  36.],
       [ 49.,  64.,  81.]])
Division
1 / arr
array([[ 1.        ,  0.5       ,  0.33333333],
       [ 0.25      ,  0.2       ,  0.16666667],
       [ 0.14285714,  0.125     ,  0.11111111]])
-
Basic Indexing and Slicing
Indexing along the first axis works in the same way as for lists and tuples; indices for further axes are separated by commas:
arr
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.],
       [ 7.,  8.,  9.]])
arr[1]
array([ 4.,  5.,  6.])
arr[:, 1]
array([ 2.,  5.,  8.])
arr[1:, :2]
array([[ 4.,  5.],
       [ 7.,  8.]])
-
Boolean Indexing
names = np.array(['Bob', 'Joe', 'Bill', 'Tess', 'Joe', 'Joe', 'Bob'])
data = np.random.randn(7, 4)
print(names)
data
['Bob' 'Joe' 'Bill' 'Tess' 'Joe' 'Joe' 'Bob']
array([[ 1.31273264, -1.4027545 ,  1.76375476, -0.19194064],
       [-0.41950145, -0.21455786,  0.28687505,  0.70312942],
       [ 0.76013575,  0.68719731,  1.45771087,  0.07268093],
       [ 1.20720934, -0.52305673,  0.56317445,  0.33062879],
       [ 0.81500408,  1.12185486,  1.31608209,  0.80725464],
       [-0.70584486, -0.86788517, -0.07373691,  0.83189097],
       [-0.44775752, -1.67612963, -0.01396541,  0.2745861 ]])
We can create an array of Booleans that is used to select the relevant rows:
print(names == 'Bob')
data[names == 'Bob']
[ True False False False False False  True]
array([[ 1.31273264, -1.4027545 ,  1.76375476, -0.19194064],
       [-0.44775752, -1.67612963, -0.01396541,  0.2745861 ]])
-
Boolean Indexing
You can use different indexing methods at once:
data[names == 'Bob', 2:]
array([[ 1.76375476, -0.19194064],
       [-0.01396541,  0.2745861 ]])
You can combine conditions with the Boolean operators | (or) and & (and):
data[(names == 'Bob') | (names == 'Joe')]
array([[ 1.31273264, -1.4027545 ,  1.76375476, -0.19194064],
       [-0.41950145, -0.21455786,  0.28687505,  0.70312942],
       [ 0.81500408,  1.12185486,  1.31608209,  0.80725464],
       [-0.70584486, -0.86788517, -0.07373691,  0.83189097],
       [-0.44775752, -1.67612963, -0.01396541,  0.2745861 ]])
data[(names == 'Bob') & (names == 'Joe')]
array([], shape=(0, 4), dtype=float64)
Note: Selecting data from an array by Boolean indexing always creates a copy of the data, even if the returned array is unchanged
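Boolean indexing returns a copy, whereas basic slicing returns a view onto the same data. The sketch below checks both behaviours on a small array; the array a and the variable names are illustrative assumptions, not part of the lecture.

```python
import numpy as np

a = np.arange(6)                 # array([0, 1, 2, 3, 4, 5])

selected = a[a > 2]              # Boolean indexing returns a copy
selected[:] = -1                 # modifying the copy ...
unchanged = a.copy()             # ... leaves a as it was (snapshot taken here)

view = a[:3]                     # basic slicing returns a view
view[:] = 100                    # writing to the view changes a

print(unchanged)                 # [0 1 2 3 4 5]
print(a)                         # [100 100 100   3   4   5]
```

If an independent array is needed from a slice, calling .copy() on the slice avoids the write-through.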
-
Boolean Indexing
You can use Boolean indexing to assign values to specific positions of the array:
data[data < 0] = 0
data
array([[ 1.31273264,  0.        ,  1.76375476,  0.        ],
       [ 0.        ,  0.        ,  0.28687505,  0.70312942],
       [ 0.76013575,  0.68719731,  1.45771087,  0.07268093],
       [ 1.20720934,  0.        ,  0.56317445,  0.33062879],
       [ 0.81500408,  1.12185486,  1.31608209,  0.80725464],
       [ 0.        ,  0.        ,  0.        ,  0.83189097],
       [ 0.        ,  0.        ,  0.        ,  0.2745861 ]])
Note: data < 0 is itself an array of Booleans, and it is used here to index data
-
Boolean Indexing
data[names != 'Joe'] = 7
data
array([[ 7.        ,  7.        ,  7.        ,  7.        ],
       [ 0.        ,  0.        ,  0.28687505,  0.70312942],
       [ 7.        ,  7.        ,  7.        ,  7.        ],
       [ 7.        ,  7.        ,  7.        ,  7.        ],
       [ 0.81500408,  1.12185486,  1.31608209,  0.80725464],
       [ 0.        ,  0.        ,  0.        ,  0.83189097],
       [ 7.        ,  7.        ,  7.        ,  7.        ]])
-
Fancy Indexing
Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays
arr = np.empty((8, 4))
for i in range(8):
    arr[i] = i
arr
array([[ 0.,  0.,  0.,  0.],
       [ 1.,  1.,  1.,  1.],
       [ 2.,  2.,  2.,  2.],
       [ 3.,  3.,  3.,  3.],
       [ 4.,  4.,  4.,  4.],
       [ 5.,  5.,  5.,  5.],
       [ 6.,  6.,  6.,  6.],
       [ 7.,  7.,  7.,  7.]])
Example
arr[[3, 0, 2]]
array([[ 3.,  3.,  3.,  3.],
       [ 0.,  0.,  0.,  0.],
       [ 2.,  2.,  2.,  2.]])
-
Transposing Arrays
It is easy to transpose arrays with the attribute T:
arr.T
array([[ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.],
       [ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.],
       [ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.],
       [ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.]])
-
Data Processing Using Arrays
NumPy arrays allow us to express many kinds of data processing tasks as concise array expressions
This practice of replacing explicit loops with array expressions is commonly referred to as vectorization
Vectorized array operations are often one or two orders of magnitude faster than their pure Python equivalents
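As a minimal sketch of vectorization, the example below computes the same squares once with an explicit Python loop and once with a single array expression. The array values and the variable names are illustrative assumptions; the actual speed-up depends on the machine and the array size, and can be measured in IPython with %timeit.

```python
import numpy as np

values = np.arange(1000, dtype=float)

# Pure Python: explicit loop over the elements
squares_loop = [v * v for v in values]

# Vectorized: one array expression, the loop runs in compiled code
squares_vec = values * values

print(np.allclose(squares_loop, squares_vec))   # True
```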
-
Universal Functions
A universal function is a function that performs elementwise operations on data in ndarrays. They are fast vectorized wrappers for simple functions
Examples
arr = np.arange(10)
np.sqrt(arr)
array([ 0.        ,  1.        ,  1.41421356,  1.73205081,  2.        ,
        2.23606798,  2.44948974,  2.64575131,  2.82842712,  3.        ])
np.exp(arr)
array([  1.00000000e+00,   2.71828183e+00,   7.38905610e+00,
         2.00855369e+01,   5.45981500e+01,   1.48413159e+02,
         4.03428793e+02,   1.09663316e+03,   2.98095799e+03,
         8.10308393e+03])
-
Conditional Logic as Array Operations
np.where is a vectorized version of the conditional expression x if condition else y:
Example
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])
Suppose we want to create an array that takes the value in xarr when cond is True and the value in yarr when cond is False
np.where(cond, xarr, yarr)
array([ 1.1,  2.2,  1.3,  1.4,  2.5])
This function can also be applied to n-dimensional arrays
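A short sketch of np.where on a two-dimensional array follows. The array arr and the replacement values 2 and -2 are illustrative assumptions; the point is that the second and third arguments may be scalars as well as arrays.

```python
import numpy as np

arr = np.array([[ 1.5, -0.3],
                [-2.0,  0.7]])

# Scalars are allowed in place of the second and third arguments
signs = np.where(arr > 0, 2, -2)

# Mix of scalar and array: replace only the positive entries
clipped = np.where(arr > 0, 2, arr)

print(signs)
print(clipped)
```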
-
Mathematical and Statistical Methods
NumPy arrays provide a good set of statistical methods
Basic array statistical methods
Method            Description
sum               Sum of all the elements in the array or along an axis
mean              Arithmetic mean; zero-length arrays have NaN mean
std, var          Standard deviation and variance, respectively
min, max          Minimum and maximum
argmin, argmax    Indices of minimum and maximum elements, respectively
cumsum            Cumulative sum of elements starting from 0
cumprod           Cumulative product of elements starting from 1
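The methods in the table can be applied to the whole array or along one axis. The following is a small sketch on an illustrative 2 x 3 array (the array and variable names are assumptions made for the example):

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

total = arr.sum()             # 21, over all elements
col_sums = arr.sum(axis=0)    # [5 7 9], one value per column
row_means = arr.mean(axis=1)  # [2. 5.], one value per row
running = arr.cumsum()        # [ 1  3  6 10 15 21], in flattened order
largest = arr.argmax()        # 5, index into the flattened array
```

Here axis=0 aggregates down the rows (giving one result per column) and axis=1 aggregates across the columns (giving one result per row).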
-
Methods for Boolean Arrays
Booleans are coerced to 1 (True) and 0 (False), so the sum method can be used to count the number of True values in an array:
arr = np.random.randn(100)
(arr > 0).sum()
55
-
Sorting
NumPy arrays can be sorted in-place using the sort method:
arr = np.random.randn(5)
print('unsorted:', arr)
arr.sort()
print('sorted:', arr)
unsorted: [-0.21132983  0.25338333 -1.27090331  0.88185258  0.32729311]
sorted: [-1.27090331 -0.21132983  0.25338333  0.32729311  0.88185258]
-
Sorting
You can specify the axis along which to sort an n-dimensional array:
arr = np.random.randn(5, 3)
arr.sort(axis=0)
arr
array([[-1.17850016,  0.05609878, -1.11894931],
       [-0.15450684,  0.14064359, -0.12111114],
       [ 0.66674063,  0.39402912, -0.09261304],
       [ 0.79119149,  1.18169535,  0.09052968],
       [ 1.61247548,  1.48936384,  0.11534684]])
arr.sort(axis=1)
arr
array([[-1.17850016, -1.11894931,  0.05609878],
       [-0.15450684, -0.12111114,  0.14064359],
       [-0.09261304,  0.39402912,  0.66674063],
       [ 0.09052968,  0.79119149,  1.18169535],
       [ 0.11534684,  1.48936384,  1.61247548]])
-
Set Logic
We can get the sorted unique values of an array with the function np.unique:
names = np.array(['Bob', 'Joe', 'Bill', 'Tess', 'Joe', 'Joe', 'Bob'])
np.unique(names)
array(['Bill', 'Bob', 'Joe', 'Tess'],
      dtype='|S4')
It is possible to check whether each element of an array belongs to a set of values with the function np.in1d:
np.in1d(names, ['Bob', 'Joe'])
array([ True,  True, False, False,  True,  True,  True], dtype=bool)
-
Set Logic
Array set operations:
Method               Description
unique(x)            Compute the sorted, unique elements in x
intersect1d(x, y)    Compute the sorted, common elements in x and y
union1d(x, y)        Compute the sorted union of elements
in1d(x, y)           Compute a boolean array indicating whether each element of x is contained in y
setdiff1d(x, y)      Set difference, elements in x that are not in y
setxor1d(x, y)       Set symmetric difference, elements that are in either of the arrays but not both
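The set operations in the table can be tried on two small arrays. This is a sketch with illustrative inputs x and y (assumptions for the example). It uses np.isin for the membership test, which in recent NumPy versions plays the same role as in1d.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 2])
y = np.array([3, 4, 5])

common = np.intersect1d(x, y)     # [3 4]
either = np.union1d(x, y)         # [1 2 3 4 5]
only_x = np.setdiff1d(x, y)       # [1 2]
exclusive = np.setxor1d(x, y)     # [1 2 5]
member = np.isin(x, y)            # [False False  True  True False]
```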
-
File Input and Output with Arrays
np.save and np.load are the two main functions to save and load array data on disk
Arrays are saved by default in an uncompressed binary format with file extension .npy
Example
Saving an array:
arr = np.arange(10)
print(arr)
np.save('my_array', arr)
[0 1 2 3 4 5 6 7 8 9]
Loading the array:
arr = np.load('my_array.npy')
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
-
Linear Algebra
In NumPy, multiplying two arrays with * is an elementwise operation. In some cases we are interested in matrix operations.
The function np.dot performs matrix multiplication, and the numpy.linalg module contains further matrix operations
Example
We can use the function np.dot to multiply two matrices:
arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([[2, 2, 2], [2, 2, 2], [2, 2, 2]])
np.dot(arr1, arr2)
array([[12, 12, 12],
       [30, 30, 30]])
np.dot(arr2, arr1.T)
array([[12, 30],
       [12, 30],
       [12, 30]])
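Beyond np.dot, the numpy.linalg module offers functions such as inv, solve and det. The following sketch applies them to an illustrative 2 x 2 system; the matrix A and the vector b are assumptions chosen so that the results are easy to verify by hand.

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 3.]])
b = np.array([3., 5.])

A_inv = np.linalg.inv(A)      # matrix inverse
x = np.linalg.solve(A, b)     # solves A x = b without forming the inverse
det = np.linalg.det(A)        # determinant: 2*3 - 1*1 = 5

print(np.allclose(np.dot(A, A_inv), np.eye(2)))   # True
print(np.allclose(np.dot(A, x), b))               # True
```

For solving linear systems, solve is generally preferable to multiplying by the inverse, since it avoids computing the inverse explicitly.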
-
Random Number Generation
The numpy.random module supplements the built-in Python random module with functions that efficiently generate whole arrays of sample values from many different kinds of probability distributions
samples = np.random.normal(size=(4, 4))
samples
array([[-0.08695804,  0.18486392, -0.32093721, -1.812208  ],
       [ 1.31593422,  0.56465651, -1.43691046, -0.40667169],
       [ 1.74605033,  1.27025025, -0.67012289,  0.57377713],
       [-1.46157084, -0.86130787, -0.64128062,  0.66803304]])
The numpy.random module is much faster than the standard random module in Python when generating large samples:
from random import normalvariate
%timeit samples = [normalvariate(0, 1) for _ in range(1000*1000)]
%timeit samples = np.random.normal(size=1000*1000)
1 loops, best of 3: 1.29 s per loop
10 loops, best of 3: 36.3 ms per loop
-
Term Project
Requirements:
Teams of 2 or 3 students
Include all three components covered in this class:
1 Data collection from an online source
2 Data storage in appropriate format
3 Descriptive and graphical analysis of the data; regression analysis or other technique
Dates:
Project proposal and teams: Thursday, April 2
4 paragraphs:
Goals
Data collection strategy
Data storage strategy
Analysis strategy
Iterations over one week max.
Progress report: Tuesday, April 21
Final report: Sunday, May 3, 2015 at 11:59 pm (hard deadline)
-
pandas
-
pandas
pandas is the main library used for data analysis in Python
Built on top of NumPy
Designed to make data analysis fast and easy in Python
Main data structures:
Series
DataFrame
-
Series
A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index.
Example
from pandas import Series
obj = Series([1, 3, 4, -5])
obj
0    1
1    3
2    4
3   -5
dtype: int64
You can get the array representation and index object of the Series via its attributes values and index:
obj.values
array([ 1,  3,  4, -5])
obj.index
Int64Index([0, 1, 2, 3], dtype='int64')
-
Series
You can use any labels as the index of a Series:
obj2 = Series([4, 7, -4, 3], index=['d', 'b', 'a', 'c'])
obj2
d    4
b    7
a   -4
c    3
dtype: int64
Boolean filtering preserves the index-value link:
obj2[obj2 > 0]
d    4
b    7
c    3
dtype: int64
-
Series
Series automatically aligns differently indexed data in arithmetic operations
Example
obj3 = Series([4, 7, -4, 3], index=['a', 'b', 'c', 'd'])
obj4 = Series([4, 7, -4, 3], index=['d', 'b', 'a', 'c'])
print(obj3)
print(obj4)
a    4
b    7
c   -4
d    3
dtype: int64
d    4
b    7
a   -4
c    3
dtype: int64
obj3 + obj4
a     0
b    14
c    -1
d     7
dtype: int64
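Alignment also matters when the two indexes do not contain the same labels. In the sketch below (the Series s1 and s2 are illustrative assumptions), a label present in only one operand produces NaN, and the add method with fill_value treats the missing side as 0 instead.

```python
from pandas import Series

s1 = Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = Series([10, 20, 30], index=['b', 'c', 'd'])

total = s1 + s2                     # labels present in only one Series give NaN
filled = s1.add(s2, fill_value=0)   # treat a missing label as 0 instead

print(total)
print(filled)
```

The result index is the union of the two indexes, and the result is of float type because NaN is a floating-point value.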
-
DataFrame
A DataFrame represents a spreadsheet-like data structure containing an ordered collection of columns
Each column is a Series object
Each column can contain a different data type
A DataFrame can be seen as a dictionary of Series objects
Example
from pandas import DataFrame
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
-
DataFrame
The order of the columns can be defined with the argument columns
Example
DataFrame(data, columns=['year', 'state', 'pop'])
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
-
DataFrame
A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute notation
Example
frame['state']
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object
frame.year
0    2000
1    2001
2    2002
3    2001
4    2002
Name: year, dtype: int64
-
DataFrame
Columns can be modified and created by assignment
Example
frame['debt'] = 0
frame
   pop   state  year  debt
0  1.5    Ohio  2000     0
1  1.7    Ohio  2001     0
2  3.6    Ohio  2002     0
3  2.4  Nevada  2001     0
4  2.9  Nevada  2002     0
frame['debt'] = range(len(frame))
frame
   pop   state  year  debt
0  1.5    Ohio  2000     0
1  1.7    Ohio  2001     1
2  3.6    Ohio  2002     2
3  2.4  Nevada  2001     3
4  2.9  Nevada  2002     4
-
DataFrame
Columns can be deleted using the del statement
del frame['debt']
frame
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
-
Index Objects
pandas's Index objects are responsible for holding the axis labels and other metadata (like the axis name or names)
Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index
Example
obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
Index([u'a', u'b', u'c'], dtype='object')
Index objects are immutable and thus can't be changed by the user
-
Reindexing
A critical method on pandas objects is reindex, which creates a new object with the data conformed to a new index
Example
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
obj2 = obj.reindex(['a', 'b', 'c', 'd'])
obj2
a   -5.3
b    7.2
c    3.6
d    4.5
dtype: float64
-
Reindexing
We can provide an optional fill value for index labels that do not exist in the original object
Example
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
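Two related behaviours of reindex are sketched below, using an illustrative Series named colors (an assumption for this example): without a fill value, labels that are new get NaN, and with method='ffill' the last valid value is carried forward, which is often useful for ordered data such as time series.

```python
from pandas import Series

colors = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Without a fill value, new labels are filled with NaN
sparse = colors.reindex(range(6))

# method='ffill' carries the last valid value forward
filled = colors.reindex(range(6), method='ffill')

print(sparse)
print(filled)
```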
-
Dropping Entries from an axis
Dropping one or more entries from an axis can be performed using the method drop
Example
obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj.drop('c')
a    0
b    1
d    3
e    4
dtype: float64
-
Dropping Entries from an axis
Dropping can be performed on either axis
Example
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data.drop(['Colorado', 'Ohio'])
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
data.drop('two', axis=1)
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
-
Indexing, selection, and filtering
Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series's index values instead of only integers:
obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
print(obj['b'])
1.0
print(obj[['a', 'c']])
a    0
c    2
dtype: float64
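One difference from NumPy is worth checking in code: slicing a Series by label includes the end label, whereas positional slicing excludes the end position as usual. The sketch below reuses the same obj as the slide; the other variable names are illustrative assumptions.

```python
import numpy as np
from pandas import Series

obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj['b':'c']       # label slice: 'c' is included
by_position = obj.iloc[1:2]   # positional slice: position 2 is excluded
below_two = obj[obj < 2]      # Boolean filtering works as in NumPy

print(by_label)
print(by_position)
print(below_two)
```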
-
Function application and mapping
Elementwise array methods work well with pandas objects
Example
frame = DataFrame(np.random.randn(4, 3),
                  columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
               b         d         e
Utah   -0.091392 -1.935977  0.271981
Ohio   -0.034697  0.823547  0.655560
Texas   0.316441 -0.603441  1.380851
Oregon  0.045986 -0.965604  0.227028
np.abs(frame)
               b         d         e
Utah    0.091392  1.935977  0.271981
Ohio    0.034697  0.823547  0.655560
Texas   0.316441  0.603441  1.380851
Oregon  0.045986  0.965604  0.227028
-
Function application and mapping
Another common operation is applying a function that works on 1D arrays to each column or row
Example
f = lambda x: x.max() - x.min()
frame.apply(f)
b    0.407834
d    2.759525
e    1.153823
dtype: float64
frame.apply(f, axis=1)
Utah      2.207958
Ohio      0.858245
Texas     1.984292
Oregon    1.192632
dtype: float64
-
Function application and mapping
If a function operates on a single element, it can be applied to every element of a DataFrame with the method applymap
Example
frame.applymap(lambda x: '%.2f' % x)
            b      d     e
Utah    -0.09  -1.94  0.27
Ohio    -0.03   0.82  0.66
Texas    0.32  -0.60  1.38
Oregon   0.05  -0.97  0.23
-
Sorting and ranking
It is possible to sort a DataFrame by index on either axis
Example
frame.sort_index()
               b         d         e
Ohio   -0.034697  0.823547  0.655560
Oregon  0.045986 -0.965604  0.227028
Texas   0.316441 -0.603441  1.380851
Utah   -0.091392 -1.935977  0.271981
frame.sort_index(axis=1)
               b         d         e
Utah   -0.091392 -1.935977  0.271981
Ohio   -0.034697  0.823547  0.655560
Texas   0.316441 -0.603441  1.380851
Oregon  0.045986 -0.965604  0.227028
-
Sorting and ranking
It is possible to sort in descending order
Example
frame.sort_index(ascending=False)
               b         d         e
Utah   -0.091392 -1.935977  0.271981
Texas   0.316441 -0.603441  1.380851
Oregon  0.045986 -0.965604  0.227028
Ohio   -0.034697  0.823547  0.655560
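Sorting by value and ranking can be sketched on a small Series. The data below is an illustrative assumption, and sort_values is the method name used in recent pandas versions (older releases used a different name for sorting a Series by value).

```python
from pandas import Series

obj = Series([7, -5, 7, 4, 2])

ordered = obj.sort_values()   # sort by value, index labels travel with the data
ranks = obj.rank()            # ties get the mean of the ranks they span

print(ordered.tolist())       # [-5, 2, 4, 7, 7]
print(ranks.tolist())         # [4.5, 1.0, 4.5, 3.0, 2.0]
```

The two 7s occupy ranks 4 and 5 in sorted order, so each receives the average rank 4.5.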
-
Summarizing and Descriptive Statistics
pandas objects are equipped with a set of common mathematical and statistical methods
Example
frame.describe()
              b         d         e
count  4.000000  4.000000  4.000000
mean   0.059085 -0.670369  0.633855
std    0.180594  1.143852  0.533834
min   -0.091392 -1.935977  0.227028
25%   -0.048871 -1.208198  0.260742
50%    0.005645 -0.784522  0.463770
75%    0.113600 -0.246694  0.836882
max    0.316441  0.823547  1.380851
-
Summarizing and Descriptive Statistics
Descriptive and summary statistics:
Method      Description
count       Number of non-NA values
describe    Compute set of summary statistics
min, max    Compute minimum and maximum values
quantile    Compute sample quantile ranging from 0 to 1
sum         Sum of values
mean        Mean of values
median      Arithmetic median (50% quantile) of values
var         Sample variance of values
std         Sample standard deviation of values
cumsum      Cumulative sum of values
cumprod     Cumulative product of values
-
Correlation and Covariance
Correlation and covariance are computed from pairs of data sets
Example
Get stock prices and volumes from Yahoo! Finance
import pandas.io.data as web
all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '3/22/2015')
price = DataFrame({tic: data['Adj Close']
                   for tic, data in all_data.items()})
volume = DataFrame({tic: data['Volume']
                    for tic, data in all_data.items()})
price.tail()
              AAPL    GOOG     IBM   MSFT
Date
2015-03-16  124.95  554.51  157.08  41.56
2015-03-17  127.04  550.84  156.96  41.70
2015-03-18  128.47  559.50  159.81  42.50
2015-03-19  127.50  557.99  159.81  42.29
2015-03-20  125.90  560.36  162.88  42.88
-
Correlation and Covariance
Calculate the percentage change from the previous value:
returns = price.pct_change()
returns.tail()
                AAPL      GOOG       IBM      MSFT
Date
2015-03-16  0.011004  0.013137  0.018149  0.004350
2015-03-17  0.016727 -0.006618 -0.000764  0.003369
2015-03-18  0.011256  0.015721  0.018157  0.019185
2015-03-19 -0.007550 -0.002699  0.000000 -0.004941
2015-03-20 -0.012549  0.004247  0.019210  0.013951
-
Correlation and Covariance
The corr method calculates the correlation between two Series:
returns.MSFT.corr(returns.IBM)
0.50052763872781603
The DataFrame corr and cov methods return a full correlation or covariance matrix as a DataFrame:
returns.corr()
          AAPL      GOOG       IBM      MSFT
AAPL  1.000000  0.265999  0.368079  0.345835
GOOG  0.265999  1.000000  0.315613  0.409107
IBM   0.368079  0.315613  1.000000  0.500528
MSFT  0.345835  0.409107  0.500528  1.000000
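The relationship between the pairwise and the matrix forms can be sketched on synthetic data. The column names X, Y, Z and the random values below are assumptions made for illustration, not stock returns:

```python
import numpy as np
import pandas as pd

# Synthetic "returns": Y is built from X, Z is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy_returns = pd.DataFrame({
    "X": x,
    "Y": 2.0 * x + rng.normal(size=200),
    "Z": rng.normal(size=200),
})

pair_corr = toy_returns["X"].corr(toy_returns["Y"])  # a single float
corr_matrix = toy_returns.corr()                      # 3 x 3 DataFrame
cov_matrix = toy_returns.cov()                        # 3 x 3 DataFrame

print(corr_matrix)
```

The off-diagonal entry corr_matrix.loc["X", "Y"] matches the pairwise result, and the diagonal of the covariance matrix holds each column's sample variance.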
56
-
Unique Values
To get unique values we can use the method unique from the Series object:
print len(price.AAPL)
unique_prices = price.AAPL.unique()
print len(unique_prices)
1312
1192
57
-
Value Counts
We can also count the occurrences of each value
Example
price.AAPL.value_counts().head()
45.29    3
34.26    3
27.75    2
45.47    2
45.72    2
dtype: int64
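A small sketch of unique and value_counts side by side, using a made-up Series (toy_prices is an illustrative name and the values are not real quotes):

```python
import pandas as pd

# Hypothetical prices with repeats.
toy_prices = pd.Series([45.29, 34.26, 45.29, 27.75, 45.29, 34.26])

unique_prices = toy_prices.unique()   # NumPy array, in order of first appearance
counts = toy_prices.value_counts()    # Series: value -> count, most frequent first

print(len(toy_prices), len(unique_prices))
print(counts)
```

The counts always add up to the number of non-missing observations, which is a quick sanity check on the result.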
58
-
Missing Data
Missing data is common in most data analysis applications
By default, pandas functions handle missing data gracefully
Example
First, let's calculate the average price for GOOG:
price.GOOG.mean()
550.01818548387075
How many missing observations do we have?
price.GOOG.isnull().sum()
1064
Now, let's calculate the mean without skipping the missing observations:
price.GOOG.mean(skipna=False)
nan
The average price cannot be calculated if we do not remove or replace the missing values
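The same three steps can be reproduced on a tiny made-up Series (the values and the names n_missing, mean_default, and mean_strict are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical series whose first two observations are missing.
s = pd.Series([np.nan, np.nan, 10.0, 20.0])

n_missing = s.isnull().sum()         # count of NaN entries
mean_default = s.mean()              # NaN values are skipped by default
mean_strict = s.mean(skipna=False)   # any NaN makes the result NaN

print(n_missing, mean_default, mean_strict)
```

With skipna left at its default, the mean is taken over the two observed values only.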
59
-
Filtering out Missing Data
In many applications it is important to know that we are always using the same observations
In such cases it may be wise to remove observations with missing values:
price.dropna().head()
             AAPL    GOOG     IBM   MSFT
Date
2014-03-27  75.35  558.46  185.05  38.33
2014-03-28  75.27  559.99  185.65  39.24
2014-03-31  75.25  556.97  187.64  39.91
2014-04-01  75.94  567.16  189.60  40.33
2014-04-02  76.06  567.00  188.67  40.26
Data starts on 2014-03-27, the first date for which we have data for GOOG
60
-
Filtering out Missing Data
We could also drop the columns that have missing data:
price.dropna(axis=1).head()
             AAPL     IBM   MSFT
Date
2010-01-04  28.84  119.53  26.94
2010-01-05  28.89  118.09  26.95
2010-01-06  28.43  117.32  26.79
2010-01-07  28.38  116.92  26.51
2010-01-08  28.56  118.09  26.69
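The difference between dropping rows and dropping columns is easiest to see on a small made-up frame (the column names A and G and the labels d1 to d3 are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column G has no data for the first date.
toy = pd.DataFrame({"A": [1.0, 2.0, 3.0],
                    "G": [np.nan, 5.0, 6.0]},
                   index=["d1", "d2", "d3"])

rows_kept = toy.dropna()         # drops every row that contains a NaN
cols_kept = toy.dropna(axis=1)   # drops every column that contains a NaN

print(rows_kept)
print(cols_kept)
```

Dropping rows keeps all columns but shortens the sample; dropping columns keeps every date but loses the incomplete series.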
61
-
Filling in Missing Data
In some situations we want to fill in the missing observations with default values:
Example
Filling with zeros:
price.fillna(0).head()
             AAPL  GOOG     IBM   MSFT
Date
2010-01-04  28.84     0  119.53  26.94
2010-01-05  28.89     0  118.09  26.95
2010-01-06  28.43     0  117.32  26.79
2010-01-07  28.38     0  116.92  26.51
2010-01-08  28.56     0  118.09  26.69
Filling with the mean:
price.fillna(price.mean()).head()
             AAPL        GOOG     IBM   MSFT
Date
2010-01-04  28.84  550.018185  119.53  26.94
2010-01-05  28.89  550.018185  118.09  26.95
2010-01-06  28.43  550.018185  117.32  26.79
2010-01-07  28.38  550.018185  116.92  26.51
2010-01-08  28.56  550.018185  118.09  26.69
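Both fill strategies can be sketched on a small made-up frame (toy, zero_filled, and mean_filled are illustrative names). Passing a Series of column means to fillna fills each column with its own mean:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column G is missing its first observation.
toy = pd.DataFrame({"A": [1.0, 2.0, 3.0],
                    "G": [np.nan, 4.0, 6.0]})

zero_filled = toy.fillna(0)           # every NaN becomes 0
mean_filled = toy.fillna(toy.mean())  # NaN in G becomes the mean of G (5.0)

print(zero_filled)
print(mean_filled)
```

Filling with zeros distorts statistics such as the mean of G, whereas filling with the column mean preserves it.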
62
-
Filling in Missing Data
Note: By default, these operations return a new object and leave the original data unchanged
price.head()
             AAPL  GOOG     IBM   MSFT
Date
2010-01-04  28.84   NaN  119.53  26.94
2010-01-05  28.89   NaN  118.09  26.95
2010-01-06  28.43   NaN  117.32  26.79
2010-01-07  28.38   NaN  116.92  26.51
2010-01-08  28.56   NaN  118.09  26.69
63
-
Hierarchical Indexing
Hierarchical indexing enables using multiple (two or more) index levels on an axis
It provides a way to work with higher dimensional data in a lower dimensional form
Example
data = Series(np.random.randn(10),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [2010, 2011, 2012, 2010, 2011, 2012, 2010, 2011, 2011, 2012]])
data
a  2010    0.547634
   2011    0.792182
   2012   -0.821709
b  2010    0.172503
   2011    0.714497
   2012   -0.004165
c  2010   -0.095196
   2011    0.096810
d  2011    0.553003
   2012    0.167027
dtype: float64
64
-
Hierarchical Indexing
Example
Accessing a:
data['a']
2010    0.547634
2011    0.792182
2012   -0.821709
dtype: float64
Accessing 2011:
data[:, 2011]
a    0.792182
b    0.714497
c    0.096810
d    0.553003
dtype: float64
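A self-contained sketch of selection on each level, using fixed made-up values instead of random ones so the results can be checked by hand. It uses the explicit label-based .loc accessor, which is an assumption about style rather than what the slides show:

```python
import pandas as pd

# Hypothetical two-level Series: outer level is a letter, inner level is a year.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0],
              index=[["a", "a", "b", "b", "c"],
                     [2010, 2011, 2010, 2011, 2011]])

outer_a = s.loc["a"]         # all years for 'a'; result is indexed by year
year_2011 = s.loc[:, 2011]   # all letters for 2011; result is indexed by letter

print(outer_a)
print(year_2011)
```

In both cases the level that was fixed is dropped from the result, leaving an ordinary single-level Series.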
65
-
Summary Statistics by Level
We can summarize the results by each level of the index
Example
data.sum(level=0)
a    0.518107
b    0.882835
c    0.001614
d    0.720031
dtype: float64
data.sum(level=1)
2010    0.624941
2011    2.156492
2012   -0.658846
dtype: float64
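In recent pandas releases the level argument of sum is no longer available; grouping on an index level gives the same per-level totals. A minimal sketch on made-up values (s, by_letter, and by_year are illustrative names):

```python
import pandas as pd

# Hypothetical two-level Series: outer level is a letter, inner level is a year.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0],
              index=[["a", "a", "b", "b", "c"],
                     [2010, 2011, 2010, 2011, 2011]])

by_letter = s.groupby(level=0).sum()  # total per outer label
by_year = s.groupby(level=1).sum()    # total per inner label

print(by_letter)
print(by_year)
```

The groupby(level=...) form also works with other reductions such as mean or count.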
66