Python for Data Analytics, Lectures 3 & 4: Essential Libraries – NumPy and pandas. Rodrigo Belo, [email protected], Spring 2015

  • Python for Data Analytics

    Lectures 3 & 4: Essential Libraries – NumPy and pandas

    Rodrigo Belo [email protected]

    Spring 2015

    1

  • NumPy

    2

  • NumPy

    NumPy is the fundamental package required for high-performance scientific computing and data analysis. It provides:

    ndarray, a fast and space-efficient multidimensional array providing vectorized operations

    Standard mathematical functions for fast operations over entire arrays without having to write loops

    Tools for reading and writing array data to disk and working with memory-mapped files

    Tools for integrating code written in C, C++, and Fortran

    Having a good understanding of how NumPy works will help you use tools like pandas

    3

  • ndarray

    ndarray stands for N-dimensional array.

        data

        array([[ 0.73230045,  0.25494037,  0.79516021],
               [ 0.62986533,  0.3420035 ,  0.08914765]])

    You can get the shape of an array and the type of its elements by accessing the attributes shape and dtype:

        print(data.shape)
        print(data.dtype)

        (2, 3)
        float64

    4

  • Creating ndarrays

    It is possible to create ndarrays from a list or a list of lists.

    From a list:

        import numpy as np
        data1 = [1, 2, 3, 4]
        arr1 = np.array(data1)
        arr1

        array([1, 2, 3, 4])

    From a list of lists:

        data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
        arr2 = np.array(data2)
        arr2

        array([[1, 2, 3, 4],
               [5, 6, 7, 8]])

    5

  • Creating ndarrays

    Creating an array initialized with zeros:

        np.zeros((3, 6))

        array([[ 0.,  0.,  0.,  0.,  0.,  0.],
               [ 0.,  0.,  0.,  0.,  0.,  0.],
               [ 0.,  0.,  0.,  0.,  0.,  0.]])

    6

  • Creating ndarrays

    Creating an array with random numbers:

        data = np.random.rand(2, 3)
        print(data.shape)
        print(data.dtype)
        data

        (2, 3)
        float64
        array([[ 0.73230045,  0.25494037,  0.79516021],
               [ 0.62986533,  0.3420035 ,  0.08914765]])

    7

  • Data Types for ndarrays

    ndarrays are composed of elements that are all of the same type:

    int

    float

    complex

    bool

    string

    object

    In practice an array of type object can hold elements of any type, but such arrays are not common.

    Example

        arr = np.array(['Hello', np.random.rand])
        arr

        array(['Hello', <built-in method rand ...>], dtype=object)
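
    The single-dtype rule can be sketched as follows. This is a minimal illustration using only standard NumPy calls (np.array, astype); the variable names are made up for the example and do not come from the slides.

```python
import numpy as np

# Mixing ints and floats: NumPy upcasts everything to one common dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)            # float64

# An explicit dtype can be requested at creation time.
ints = np.array([1, 2, 3], dtype=np.int32)
print(ints.dtype)             # int32

# astype returns a new array converted to the requested dtype.
as_float = ints.astype(np.float64)
print(as_float.dtype)         # float64

# Unrelated Python objects can be stored when dtype=object is requested.
objs = np.array(['Hello', 3.14, None], dtype=object)
print(objs.dtype)             # object
```

    Because astype always builds a new array, the original ints array keeps its int32 dtype.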

    8

  • Operations between Arrays and Scalars

    ndarray supports vectorized operations, i.e., operations that are applied to each element of an array without the need for loops.

    Multiplication by a scalar:

        data * 10

        array([[ 6.39219315,  6.8102819 ,  4.34637984],
               [ 0.34237044,  5.39243817,  1.26276343]])

    Addition:

        data + data

        array([[ 1.27843863,  1.36205638,  0.86927597],
               [ 0.06847409,  1.07848763,  0.25255269]])

    9

  • Operations between Arrays and Scalars

        arr = np.array([[1., 2., 3], [4, 5, 6], [7, 8, 9]])
        arr

        array([[ 1.,  2.,  3.],
               [ 4.,  5.,  6.],
               [ 7.,  8.,  9.]])

    Multiplication:

        arr * arr

        array([[  1.,   4.,   9.],
               [ 16.,  25.,  36.],
               [ 49.,  64.,  81.]])

    Division:

        1 / arr

        array([[ 1.        ,  0.5       ,  0.33333333],
               [ 0.25      ,  0.2       ,  0.16666667],
               [ 0.14285714,  0.125     ,  0.11111111]])

    10

  • Basic Indexing and Slicing

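    Indexing with a single index or slice works in the same way as for lists and tuples. Multidimensional arrays also accept one index or slice per axis, separated by commas:

        arr

        array([[ 1.,  2.,  3.],
               [ 4.,  5.,  6.],
               [ 7.,  8.,  9.]])

        arr[1]

        array([ 4.,  5.,  6.])

        arr[:, 1]

        array([ 2.,  5.,  8.])

        arr[1:, :2]

        array([[ 4.,  5.],
               [ 7.,  8.]])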

    11

  • Boolean Indexing

        names = np.array(['Bob', 'Joe', 'Bill', 'Tess', 'Joe', 'Joe', 'Bob'])
        data = np.random.randn(7, 4)
        print(names)
        data

        ['Bob' 'Joe' 'Bill' 'Tess' 'Joe' 'Joe' 'Bob']
        array([[ 1.31273264, -1.4027545 ,  1.76375476, -0.19194064],
               [-0.41950145, -0.21455786,  0.28687505,  0.70312942],
               [ 0.76013575,  0.68719731,  1.45771087,  0.07268093],
               [ 1.20720934, -0.52305673,  0.56317445,  0.33062879],
               [ 0.81500408,  1.12185486,  1.31608209,  0.80725464],
               [-0.70584486, -0.86788517, -0.07373691,  0.83189097],
               [-0.44775752, -1.67612963, -0.01396541,  0.2745861 ]])

    We can create an array of Booleans and use it to select the relevant rows:

        print(names == 'Bob')
        data[names == 'Bob']

        [ True False False False False False  True]
        array([[ 1.31273264, -1.4027545 ,  1.76375476, -0.19194064],
               [-0.44775752, -1.67612963, -0.01396541,  0.2745861 ]])

    12

  • Boolean Indexing

    You can use different indexing methods at once:

        data[names == 'Bob', 2:]

        array([[ 1.76375476, -0.19194064],
               [-0.01396541,  0.2745861 ]])

    You can combine conditions with the Boolean operators | (or) and & (and):

        data[(names == 'Bob') | (names == 'Joe')]

        array([[ 1.31273264, -1.4027545 ,  1.76375476, -0.19194064],
               [-0.41950145, -0.21455786,  0.28687505,  0.70312942],
               [ 0.81500408,  1.12185486,  1.31608209,  0.80725464],
               [-0.70584486, -0.86788517, -0.07373691,  0.83189097],
               [-0.44775752, -1.67612963, -0.01396541,  0.2745861 ]])

        data[(names == 'Bob') & (names == 'Joe')]

        array([], shape=(0, 4), dtype=float64)

    Note: Selecting data from an array by Boolean indexing always creates a copy of the data, even if the returned array is unchanged
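
    To make the copy behaviour concrete, here is a minimal sketch that contrasts Boolean indexing with basic slicing, which returns a view rather than a copy. The array and variable names are illustrative; np.shares_memory is a standard NumPy helper.

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)

# Basic slicing returns a view: writing to it changes the original.
view = arr[:, 1:]
view[0, 0] = 100
print(arr[0, 1])              # 100

# Boolean indexing returns a copy: writing to it leaves the original alone.
mask = np.array([True, False])
copy = arr[mask]
copy[0, 0] = -1
print(arr[0, 0])              # 0

# np.shares_memory reports whether two arrays overlap in memory.
print(np.shares_memory(arr, view), np.shares_memory(arr, copy))  # True False
```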

    13

  • Boolean Indexing

    You can use Boolean indexing to assign values to specific positions of the array:

        data[data < 0] = 0
        data

        array([[ 1.31273264,  0.        ,  1.76375476,  0.        ],
               [ 0.        ,  0.        ,  0.28687505,  0.70312942],
               [ 0.76013575,  0.68719731,  1.45771087,  0.07268093],
               [ 1.20720934,  0.        ,  0.56317445,  0.33062879],
               [ 0.81500408,  1.12185486,  1.31608209,  0.80725464],
               [ 0.        ,  0.        ,  0.        ,  0.83189097],
               [ 0.        ,  0.        ,  0.        ,  0.2745861 ]])

    Note: here the array is indexed with an array of Booleans (data < 0)

    14

  • Boolean Indexing

        data[names != 'Joe'] = 7
        data

        array([[ 7.        ,  7.        ,  7.        ,  7.        ],
               [ 0.        ,  0.        ,  0.28687505,  0.70312942],
               [ 7.        ,  7.        ,  7.        ,  7.        ],
               [ 7.        ,  7.        ,  7.        ,  7.        ],
               [ 0.81500408,  1.12185486,  1.31608209,  0.80725464],
               [ 0.        ,  0.        ,  0.        ,  0.83189097],
               [ 7.        ,  7.        ,  7.        ,  7.        ]])

    15

  • Fancy Indexing

    Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays.

        arr = np.empty((8, 4))

        for i in range(8):
            arr[i] = i

        arr

        array([[ 0.,  0.,  0.,  0.],
               [ 1.,  1.,  1.,  1.],
               [ 2.,  2.,  2.,  2.],
               [ 3.,  3.,  3.,  3.],
               [ 4.,  4.,  4.,  4.],
               [ 5.,  5.,  5.,  5.],
               [ 6.,  6.,  6.,  6.],
               [ 7.,  7.,  7.,  7.]])

    Example

        arr[[3, 0, 2]]

        array([[ 3.,  3.,  3.,  3.],
               [ 0.,  0.,  0.,  0.],
               [ 2.,  2.,  2.,  2.]])
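
    A few further variations on integer-array indexing, sketched on an illustrative array whose entries encode their own position (value = 4 * row + column). The array is an assumption chosen for readability, not part of the slides.

```python
import numpy as np

arr = np.arange(32).reshape(8, 4)

# A list of row indices selects rows in the given order.
rows = arr[[3, 0, 2]]
print(rows[:, 0])             # [12  0  8]

# Negative indices count from the end.
last_rows = arr[[-1, -2]]
print(last_rows[:, 0])        # [28 24]

# Two index arrays are paired element by element: (1, 0), (5, 3), (7, 1).
pairs = arr[[1, 5, 7], [0, 3, 1]]
print(pairs)                  # [ 4 23 29]
```

    Note that passing two index arrays yields a 1-D result with one element per (row, column) pair, not a rectangular block.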

    16

  • Transposing Arrays

    It is easy to transpose arrays with the attribute T:

        arr.T

        array([[ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.],
               [ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.],
               [ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.],
               [ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.]])

    17

  • Data Processing Using Arrays

    NumPy arrays allow us to express many kinds of data processing tasks as concise array expressions

    This practice of replacing explicit loops with array expressions is commonly referred to as vectorization

    Vectorized array operations are often one or two orders of magnitude faster than their pure Python equivalents
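
    As a rough sketch of what vectorization means in practice, the two functions below compute the same sum of squares, once with an explicit loop and once as an array expression. The function names are hypothetical and chosen only for this illustration.

```python
import numpy as np

values = np.arange(1000, dtype=np.float64)

def sum_of_squares_loop(xs):
    """Pure Python version: one multiplication per loop iteration."""
    total = 0.0
    for x in xs:
        total += x * x
    return total

def sum_of_squares_vectorized(xs):
    """Array expression: the loop runs inside NumPy's compiled code."""
    return np.sum(xs * xs)

loop_result = sum_of_squares_loop(values)
vec_result = sum_of_squares_vectorized(values)
print(loop_result == vec_result)   # True
```

    In IPython, running %timeit on each function with a larger array is a simple way to observe the speed difference on your own machine.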

    18

  • Universal Functions

    A universal function (ufunc) is a function that performs elementwise operations on data in ndarrays. Ufuncs are fast vectorized wrappers for simple functions.

    Examples

        arr = np.arange(10)
        np.sqrt(arr)

        array([ 0.        ,  1.        ,  1.41421356,  1.73205081,  2.        ,
                2.23606798,  2.44948974,  2.64575131,  2.82842712,  3.        ])

        np.exp(arr)

        array([  1.00000000e+00,   2.71828183e+00,   7.38905610e+00,
                 2.00855369e+01,   5.45981500e+01,   1.48413159e+02,
                 4.03428793e+02,   1.09663316e+03,   2.98095799e+03,
                 8.10308393e+03])

    19

  • Conditional Logic as Array Operations

    np.where is a vectorized version of the conditional expression x if condition else y:

    Example

        xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
        yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
        cond = np.array([True, False, True, True, False])

    Suppose we want to create an array that takes the value in xarr when cond is True and the value in yarr when cond is False:

        np.where(cond, xarr, yarr)

        array([ 1.1,  2.2,  1.3,  1.4,  2.5])

    This method can be applied to n-dimensional arrays
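
    A small sketch of np.where on a 2-D array, assuming illustrative values. It also shows that either branch may be a scalar instead of an array.

```python
import numpy as np

arr = np.array([[ 1.5, -0.3],
                [-2.0,  0.7]])

# Scalars can stand in for either branch: 2 where positive, -2 elsewhere.
signs = np.where(arr > 0, 2, -2)
print(signs)

# Mix a scalar and an array: replace only the negative entries with 0.
clipped = np.where(arr < 0, 0.0, arr)
print(clipped)
```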

    20

  • Mathematical and Statistical Methods

    NumPy arrays provide a good set of statistical methods

    Basic array statistical methods

    Method          Description

    sum             Sum of all the elements in the array or along an axis
    mean            Arithmetic mean; zero-length arrays have NaN mean
    std, var        Standard deviation and variance, respectively
    min, max        Minimum and maximum
    argmin, argmax  Indices of minimum and maximum elements, respectively
    cumsum          Cumulative sum of elements starting from 0
    cumprod         Cumulative product of elements starting from 1
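
    The "along an axis" wording can be illustrated with a short sketch on a 2 x 3 array (the values are illustrative): axis=0 aggregates down the rows and gives one result per column, while axis=1 gives one result per row.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

print(arr.sum())          # 21, over all elements
print(arr.sum(axis=0))    # [5 7 9], one sum per column
print(arr.mean(axis=1))   # [2. 5.], one mean per row
print(arr.argmax())       # 5, index into the flattened array
print(arr.cumsum())       # [ 1  3  6 10 15 21]
```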

    21

  • Methods for Boolean Arrays

    Boolean values are coerced to 1 (True) and 0 (False), so the sum method can be used to count the number of True values in an array:

        arr = np.random.randn(100)
        (arr > 0).sum()

        55

    22

  • Sorting

    NumPy arrays can be sorted in-place using the sort method:

        arr = np.random.randn(5)
        print('unsorted:', arr)
        arr.sort()
        print('sorted:', arr)

        unsorted: [-0.21132983  0.25338333 -1.27090331  0.88185258  0.32729311]
        sorted: [-1.27090331 -0.21132983  0.25338333  0.32729311  0.88185258]

    23

  • Sorting

    You can specify the axis along which to sort an n-dimensional array:

        arr = np.random.randn(5, 3)
        arr.sort(axis=0)
        arr

        array([[-1.17850016,  0.05609878, -1.11894931],
               [-0.15450684,  0.14064359, -0.12111114],
               [ 0.66674063,  0.39402912, -0.09261304],
               [ 0.79119149,  1.18169535,  0.09052968],
               [ 1.61247548,  1.48936384,  0.11534684]])

        arr.sort(axis=1)
        arr

        array([[-1.17850016, -1.11894931,  0.05609878],
               [-0.15450684, -0.12111114,  0.14064359],
               [-0.09261304,  0.39402912,  0.66674063],
               [ 0.09052968,  0.79119149,  1.18169535],
               [ 0.11534684,  1.48936384,  1.61247548]])
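
    Since the sort method modifies the array in place, it can help to see it next to np.sort, which returns a sorted copy. This is a minimal sketch with illustrative values; np.sort and argsort are standard NumPy calls.

```python
import numpy as np

arr = np.array([3, 1, 2])

# np.sort returns a sorted copy and leaves the input untouched.
sorted_copy = np.sort(arr)
print(arr, sorted_copy)       # [3 1 2] [1 2 3]

# The sort method sorts in place and returns None.
result = arr.sort()
print(arr, result)            # [1 2 3] None

# argsort gives the indices that would sort the array.
order = np.array([30, 10, 20]).argsort()
print(order)                  # [1 2 0]
```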

    24

  • Set Logic

    We can get all the unique values of an array with the function np.unique:

        names = np.array(['Bob', 'Joe', 'Bill', 'Tess', 'Joe', 'Joe', 'Bob'])
        np.unique(names)

        array(['Bill', 'Bob', 'Joe', 'Tess'],
              dtype='|S4')

    It is possible to check whether each of the elements of an array belongs to a set of values with the function np.in1d:

        np.in1d(names, ['Bob', 'Joe'])

        array([ True,  True, False, False,  True,  True,  True], dtype=bool)

    25

  • Set Logic

    Array set operations:

    Method             Description

    unique(x)          Compute the sorted, unique elements in x
    intersect1d(x, y)  Compute the sorted, common elements in x and y
    union1d(x, y)      Compute the sorted union of elements
    in1d(x, y)         Compute a boolean array indicating whether each element of x is contained in y
    setdiff1d(x, y)    Set difference, elements in x that are not in y
    setxor1d(x, y)     Set symmetric difference, elements that are in either of the arrays but not both
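
    The set operations can be tried on two small illustrative arrays. The sketch uses np.isin for the membership test; it plays the same role as in1d, which recent NumPy releases deprecate.

```python
import numpy as np

x = np.array([3, 1, 2, 3, 1])
y = np.array([2, 3, 4])

print(np.unique(x))           # [1 2 3]
print(np.intersect1d(x, y))   # [2 3]
print(np.union1d(x, y))       # [1 2 3 4]
print(np.setdiff1d(x, y))     # [1]
print(np.setxor1d(x, y))      # [1 4]
print(np.isin(x, y))          # [ True False  True  True False]
```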

    26

  • File Input and Output with Arrays

    np.save and np.load are the two main functions to save and load array data on disk.

    Arrays are saved by default in an uncompressed binary format with file extension .npy.

    Example

    Saving an array:

        arr = np.arange(10)
        print(arr)
        np.save('my_array', arr)

        [0 1 2 3 4 5 6 7 8 9]

    Loading the array:

        arr = np.load('my_array.npy')
        arr

        array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
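
    The same round trip can be sketched without leaving files behind by writing into a temporary directory. The use of tempfile and the variable names are assumptions made for this illustration.

```python
import os
import tempfile

import numpy as np

arr = np.arange(10)

with tempfile.TemporaryDirectory() as tmp_dir:
    # np.save appends the .npy extension when the name has none.
    base_path = os.path.join(tmp_dir, 'my_array')
    np.save(base_path, arr)
    print(os.listdir(tmp_dir))            # ['my_array.npy']

    loaded = np.load(base_path + '.npy')

print(np.array_equal(arr, loaded))        # True
```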

    27

  • Linear Algebra

    In NumPy, multiplying two arrays with * is an elementwise operation. In some cases we are interested in matrix operations instead.

    np.dot performs matrix multiplication, and the numpy.linalg module contains further matrix operations such as inverses, determinants, and decompositions.

    Example

    We can use the function np.dot to multiply two matrices:

        arr1 = np.array([[1, 2, 3], [4, 5, 6]])
        arr2 = np.array([[2, 2, 2], [2, 2, 2], [2, 2, 2]])
        np.dot(arr1, arr2)

        array([[12, 12, 12],
               [30, 30, 30]])

        np.dot(arr2, arr1.T)

        array([[12, 30],
               [12, 30],
               [12, 30]])
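
    A brief sketch of a few numpy.linalg routines on an illustrative 2 x 2 system. The matrix A and vector b are made up for the example; the @ operator requires Python 3.5 or later.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

# The @ operator is matrix multiplication, like np.dot for 2-D arrays.
print(np.allclose(A @ A, np.dot(A, A)))       # True

# numpy.linalg provides the inverse and determinant.
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))      # True
print(np.linalg.det(A))                       # approximately 5.0

# Solve A x = b directly instead of forming the inverse.
x = np.linalg.solve(A, b)
print(np.allclose(A @ x, b))                  # True
```

    Solving with np.linalg.solve is generally preferred over multiplying by the inverse, since it avoids computing the inverse explicitly.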

    28

  • Random Number Generation

    The numpy.random module supplements the built-in Python random module with functions that efficiently generate whole arrays of sample values from many different kinds of probability distributions.

        samples = np.random.normal(size=(4, 4))
        samples

        array([[-0.08695804,  0.18486392, -0.32093721, -1.812208  ],
               [ 1.31593422,  0.56465651, -1.43691046, -0.40667169],
               [ 1.74605033,  1.27025025, -0.67012289,  0.57377713],
               [-1.46157084, -0.86130787, -0.64128062,  0.66803304]])

    The numpy.random functions are much faster than the standard random module in Python when generating many samples (timed with IPython's %timeit):

        from random import normalvariate
        %timeit samples = [normalvariate(0, 1) for _ in range(1000*1000)]
        %timeit samples = np.random.normal(size=1000*1000)

        1 loops, best of 3: 1.29 s per loop
        10 loops, best of 3: 36.3 ms per loop
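
    Random samples differ from run to run unless the generator is seeded. The sketch below uses np.random.default_rng, the newer generator interface (assumed available, i.e. NumPy 1.17 or later) rather than the np.random.normal call used in the slides; the seed value 42 is arbitrary.

```python
import numpy as np

# A seeded generator makes the "random" samples reproducible.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

samples_a = rng_a.normal(loc=0.0, scale=1.0, size=(4, 4))
samples_b = rng_b.normal(loc=0.0, scale=1.0, size=(4, 4))

print(samples_a.shape)                        # (4, 4)
print(np.array_equal(samples_a, samples_b))   # True

# Other distributions follow the same pattern.
uniform = rng_a.uniform(low=0.0, high=1.0, size=5)
print(((uniform >= 0.0) & (uniform < 1.0)).all())   # True
```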

    29

  • Term Project

    Requirements:

    Teams of 2 or 3 students

    Include all three components covered in this class:
    1. Data collection from an online source
    2. Data storage in an appropriate format
    3. Descriptive and graphical analysis of the data; regression analysis or other technique

    Dates:

    Project proposal and teams: Thursday, April 2
    4 paragraphs: Goals, Data collection strategy, Data storage strategy, Analysis strategy
    Iterations over one week max.

    Progress report: Tuesday, April 21

    Final report: Sunday, May 3, 2015 at 11:59 pm (hard deadline)

    30

  • pandas

    31

  • pandas

    pandas is the main library used for data analysis in Python

    Built on top of NumPy

    Designed to make data analysis fast and easy in Python

    Main data structures:

    Series

    DataFrame

    32

  • Series

    A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index.

    Example

        from pandas import Series
        obj = Series([1, 3, 4, -5])
        obj

        0    1
        1    3
        2    4
        3   -5
        dtype: int64

    You can get the array representation and index object of the Series via its attributes values and index:

        obj.values

        array([ 1,  3,  4, -5])

        obj.index

        Int64Index([0, 1, 2, 3], dtype='int64')

    33

  • Series

    You can specify your own index labels for a Series:

        obj2 = Series([4, 7, -4, 3], index=['d', 'b', 'a', 'c'])
        obj2

        d    4
        b    7
        a   -4
        c    3
        dtype: int64

    Boolean filtering preserves the index-value link:

        obj2[obj2 > 0]

        d    4
        b    7
        c    3
        dtype: int64

    34

  • Series

    Series automatically aligns differently indexed data in arithmetic operations.

    Example

        obj3 = Series([4, 7, -4, 3], index=['a', 'b', 'c', 'd'])
        obj4 = Series([4, 7, -4, 3], index=['d', 'b', 'a', 'c'])
        print(obj3)
        print(obj4)

        a    4
        b    7
        c   -4
        d    3
        dtype: int64
        d    4
        b    7
        a   -4
        c    3
        dtype: int64

        obj3 + obj4

        a     0
        b    14
        c    -1
        d     7
        dtype: int64
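
    Alignment also applies when the two indexes only partly overlap. The following sketch, with illustrative data, shows that labels found in just one operand yield NaN, and that the add method with fill_value offers an alternative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Labels present in only one operand produce NaN in the result.
total = s1 + s2
print(total)

# The add method with fill_value treats a missing label as 0 instead.
filled = s1.add(s2, fill_value=0)
print(filled)
```

    Here total is NaN at 'a' and 'd', while filled keeps the single available value at those labels.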

    35

  • DataFrame

    A DataFrame represents a spreadsheet-like data structure containing an ordered collection of columns

    Each column is a Series object

    Each column can contain a different data type

    A DataFrame can be seen as a dictionary of Series objects

    Example

        from pandas import DataFrame

        data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                'year': [2000, 2001, 2002, 2001, 2002],
                'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

        frame = DataFrame(data)
        frame

           pop   state  year
        0  1.5    Ohio  2000
        1  1.7    Ohio  2001
        2  3.6    Ohio  2002
        3  2.4  Nevada  2001
        4  2.9  Nevada  2002

    36

  • DataFrame

    The order of the columns can be defined with the argument columns

    Example

        DataFrame(data, columns=['year', 'state', 'pop'])

           year   state  pop
        0  2000    Ohio  1.5
        1  2001    Ohio  1.7
        2  2002    Ohio  3.6
        3  2001  Nevada  2.4
        4  2002  Nevada  2.9

    37

  • DataFrame

    A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute notation

    Example

        frame['state']

        0      Ohio
        1      Ohio
        2      Ohio
        3    Nevada
        4    Nevada
        Name: state, dtype: object

        frame.year

        0    2000
        1    2001
        2    2002
        3    2001
        4    2002
        Name: year, dtype: int64

    38

  • DataFrame

    Columns can be modified and created by assignment

    Example

        frame['debt'] = 0
        frame

           pop   state  year  debt
        0  1.5    Ohio  2000     0
        1  1.7    Ohio  2001     0
        2  3.6    Ohio  2002     0
        3  2.4  Nevada  2001     0
        4  2.9  Nevada  2002     0

        frame['debt'] = range(len(frame))
        frame

           pop   state  year  debt
        0  1.5    Ohio  2000     0
        1  1.7    Ohio  2001     1
        2  3.6    Ohio  2002     2
        3  2.4  Nevada  2001     3
        4  2.9  Nevada  2002     4

    39

  • DataFrame

    Columns can be deleted using the del statement

        del frame['debt']
        frame

           pop   state  year
        0  1.5    Ohio  2000
        1  1.7    Ohio  2001
        2  3.6    Ohio  2002
        3  2.4  Nevada  2001
        4  2.9  Nevada  2002

    40

  • Index Objects

    pandas's Index objects are responsible for holding the axis labels and other metadata (like the axis name or names)

    Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index

    Example

        obj = Series(range(3), index=['a', 'b', 'c'])
        index = obj.index
        index

        Index(['a', 'b', 'c'], dtype='object')

    Index objects are immutable and thus can't be changed by the user
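
    Immutability can be checked directly: assigning to one position of an Index raises a TypeError. A minimal sketch, with an illustrative Series and a hypothetical error_message variable used only to capture the exception text:

```python
import pandas as pd

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index

try:
    index[1] = 'z'            # item assignment is not allowed on an Index
except TypeError as exc:
    error_message = str(exc)
    print('TypeError:', error_message)

# To relabel, assign a whole new index to the Series instead.
obj.index = ['x', 'y', 'z']
print(list(obj.index))        # ['x', 'y', 'z']
```

    Replacing the index as a whole is allowed because it swaps in a new Index object rather than mutating the existing one.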

    41

  • Reindexing

    A critical method on pandas objects is reindex, which creates a new object with the data conformed to a new index

    Example

        obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
        obj

        d    4.5
        b    7.2
        a   -5.3
        c    3.6
        dtype: float64

        obj2 = obj.reindex(['a', 'b', 'c', 'd'])
        obj2

        a   -5.3
        b    7.2
        c    3.6
        d    4.5
        dtype: float64

    42

  • Reindexing

    We can provide an optional fill value for index labels that do not exist in the original object

    Example

        obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

        a   -5.3
        b    7.2
        c    3.6
        d    4.5
        e    0.0
        dtype: float64
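
    Besides a constant fill value, reindex can also fill gaps from neighbouring entries. The sketch below, on illustrative data, uses the method='ffill' option of reindex to carry the last valid value forward.

```python
import pandas as pd

obj = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Without a fill rule, new labels get missing values.
plain = obj.reindex(range(6))
print(plain.isna().tolist())   # [False, True, False, True, False, True]

# method='ffill' carries the last valid value forward.
filled = obj.reindex(range(6), method='ffill')
print(filled.tolist())         # ['blue', 'blue', 'purple', 'purple', 'yellow', 'yellow']
```

    Forward filling of this kind is mostly useful for ordered data such as time series.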

    43

  • Dropping Entries from an axis

    Dropping one or more entries from an axis can be performed using the method drop

    Example

        obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
        obj.drop('c')

        a    0
        b    1
        d    3
        e    4
        dtype: float64

    44

  • Dropping Entries from an axis

    Dropping can be performed along any axis

    Example

        data = DataFrame(np.arange(16).reshape((4, 4)),
                         index=['Ohio', 'Colorado', 'Utah', 'New York'],
                         columns=['one', 'two', 'three', 'four'])

        data.drop(['Colorado', 'Ohio'])

                  one  two  three  four
        Utah        8    9     10    11
        New York   12   13     14    15

        data.drop('two', axis=1)

                  one  three  four
        Ohio        0      2     3
        Colorado    4      6     7
        Utah        8     10    11
        New York   12     14    15

    45

  • Indexing, selection, and filtering

    Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series's index values instead of only integers:

        obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
        print(obj['b'])

        1.0

        print(obj[['a', 'c']])

        a    0
        c    2
        dtype: float64
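
    One difference from NumPy is worth a short sketch: slicing a Series by label includes the end label, whereas positional slicing excludes the end position. The example assumes the loc and iloc accessors, which make the choice between labels and positions explicit.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Label slices include the end label.
print(obj['b':'c'].tolist())      # [1.0, 2.0]

# Positional slices via iloc exclude the end position, as in NumPy.
print(obj.iloc[1:2].tolist())     # [1.0]

# loc selects by label, iloc by integer position.
print(obj.loc['d'], obj.iloc[3])  # 3.0 3.0
```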

    46

  • Function application and mapping

    NumPy's elementwise functions (ufuncs) work well with pandas objects

    Example

        frame = DataFrame(np.random.randn(4, 3),
                          columns=list('bde'),
                          index=['Utah', 'Ohio', 'Texas', 'Oregon'])

        frame

                       b         d         e
        Utah   -0.091392 -1.935977  0.271981
        Ohio   -0.034697  0.823547  0.655560
        Texas   0.316441 -0.603441  1.380851
        Oregon  0.045986 -0.965604  0.227028

        np.abs(frame)

                       b         d         e
        Utah    0.091392  1.935977  0.271981
        Ohio    0.034697  0.823547  0.655560
        Texas   0.316441  0.603441  1.380851
        Oregon  0.045986  0.965604  0.227028

  • Function application and mapping

    It is also common to apply a function that operates on 1D arrays to each column or row

    Example

    f = lambda x: x.max() - x.min()

    frame.apply(f)

    b    0.407834
    d    2.759525
    e    1.153823
    dtype: float64

    frame.apply(f, axis=1)

    Utah      2.207958
    Ohio      0.858245
    Texas     1.984292
    Oregon    1.192632
    dtype: float64
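
    To make the role of axis concrete, here is a sketch on a small hand-made frame (the values are invented so the result can be checked by hand):

```python
from pandas import DataFrame

small = DataFrame({'b': [1.0, 4.0], 'd': [2.0, 8.0]}, index=['Utah', 'Ohio'])

# Range (max minus min) of a 1D array.
f = lambda x: x.max() - x.min()

per_column = small.apply(f)          # f receives each column in turn
per_row = small.apply(f, axis=1)     # f receives each row in turn

print(per_column)
print(per_row)
```

    With the default axis the result is indexed by column name; with axis=1 it is indexed by row label.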

  • Function application and mapping

    If a function operates on a single element, it can be applied element-wise with the applymap method

    Example

    frame.applymap(lambda x: '%.2f' % x)

                b      d     e
    Utah    -0.09  -1.94  0.27
    Ohio    -0.03   0.82  0.66
    Texas    0.32  -0.60  1.38
    Oregon   0.05  -0.97  0.23
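
    A note for newer installations: pandas 2.1 deprecated DataFrame.applymap in favour of DataFrame.map. The sketch below works with either; the helper name elementwise is made up for this example:

```python
from pandas import DataFrame

def elementwise(frame, func):
    """Apply func to every element, using whichever method this pandas offers."""
    if hasattr(frame, 'map'):          # pandas >= 2.1
        return frame.map(func)
    return frame.applymap(func)        # older pandas

small = DataFrame({'b': [-0.091392, 0.316441], 'd': [-1.935977, -0.603441]},
                  index=['Utah', 'Texas'])
formatted = elementwise(small, lambda x: '%.2f' % x)
print(formatted)
```

    The result holds strings, not numbers, so it is suited to display rather than further computation.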

  • Sorting and ranking

    It is possible to sort a DataFrame by index on either axis

    Example

    frame.sort_index()

                   b         d         e
    Ohio   -0.034697  0.823547  0.655560
    Oregon  0.045986 -0.965604  0.227028
    Texas   0.316441 -0.603441  1.380851
    Utah   -0.091392 -1.935977  0.271981

    frame.sort_index(axis=1)

                   b         d         e
    Utah   -0.091392 -1.935977  0.271981
    Ohio   -0.034697  0.823547  0.655560
    Texas   0.316441 -0.603441  1.380851
    Oregon  0.045986 -0.965604  0.227028

  • Sorting and ranking

    It is possible to sort in descending order

    Example

    frame.sort_index(ascending=False)

                   b         d         e
    Utah   -0.091392 -1.935977  0.271981
    Texas   0.316441 -0.603441  1.380851
    Oregon  0.045986 -0.965604  0.227028
    Ohio   -0.034697  0.823547  0.655560
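
    The slides sort by index only. As a sketch of the other two operations in the slide title, sorting by a column's values and ranking, assuming pandas 0.17 or later (where sort_values is available) and using made-up numbers:

```python
from pandas import DataFrame

small = DataFrame({'b': [3.0, 1.0, 2.0], 'd': [0.5, 0.7, 0.1]},
                  index=['Utah', 'Ohio', 'Texas'])

by_index_desc = small.sort_index(ascending=False)   # Utah, Texas, Ohio
by_column_b = small.sort_values(by='b')             # Ohio, Texas, Utah
ranks = small['b'].rank()                           # rank 1 = smallest value

print(list(by_index_desc.index))
print(list(by_column_b.index))
print(ranks)
```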

  • Summarizing and Descriptive Statistics

    pandas objects are equipped with a set of common mathematical and statistical methods

    Example

    frame.describe()

                  b         d         e
    count  4.000000  4.000000  4.000000
    mean   0.059085 -0.670369  0.633855
    std    0.180594  1.143852  0.533834
    min   -0.091392 -1.935977  0.227028
    25%   -0.048871 -1.208198  0.260742
    50%    0.005645 -0.784522  0.463770
    75%    0.113600 -0.246694  0.836882
    max    0.316441  0.823547  1.380851

  • Summarizing and Descriptive Statistics

    Descriptive and summary statistics:

    Method      Description

    count       Number of non-NA values
    describe    Compute set of summary statistics
    min, max    Compute minimum and maximum values
    quantile    Compute sample quantile ranging from 0 to 1
    sum         Sum of values
    mean        Mean of values
    median      Arithmetic median (50% quantile) of values
    var         Sample variance of values
    std         Sample standard deviation of values
    cumsum      Cumulative sum of values
    cumprod     Cumulative product of values
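
    A few of these methods on a tiny Series, with values chosen so the results are easy to verify by hand. Note that var and std use the sample (n - 1) denominator by default:

```python
import numpy as np
from pandas import Series

s = Series([1.0, 2.0, 3.0, 4.0, np.nan])

n_valid = s.count()          # the NaN is not counted
mean = s.mean()              # NaN is skipped by default
variance = s.var()           # sample variance (ddof=1)
q25 = s.quantile(0.25)       # linear interpolation between data points
running = s.cumsum()         # NaN stays NaN in the cumulative result

print(n_valid, mean, variance, q25)
print(list(running))
```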

  • Correlation and Covariance

    Correlation and covariance are computed from pairs of data series

    Example

    Get stock prices and volumes from Yahoo! Finance

    import pandas.io.data as web

    all_data = {}
    for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
        all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '3/22/2015')

    price = DataFrame({tic: data['Adj Close']
                       for tic, data in all_data.items()})
    volume = DataFrame({tic: data['Volume']
                        for tic, data in all_data.items()})

    price.tail()

                  AAPL    GOOG     IBM   MSFT
    Date
    2015-03-16  124.95  554.51  157.08  41.56
    2015-03-17  127.04  550.84  156.96  41.70
    2015-03-18  128.47  559.50  159.81  42.50
    2015-03-19  127.50  557.99  159.81  42.29
    2015-03-20  125.90  560.36  162.88  42.88
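
    pandas.io.data was later split out of pandas into the separate pandas-datareader package, and the Yahoo! service has changed over time, so the download may not run as written on a current installation. To follow the next slides offline, a stand-in price frame can be built the same way from a dict of Series. This sketch uses only the five rows printed above, not the full history:

```python
from pandas import DataFrame, Series, to_datetime

dates = to_datetime(['2015-03-16', '2015-03-17', '2015-03-18',
                     '2015-03-19', '2015-03-20'])

# Prices copied from the tail() output shown on the slide.
all_data = {
    'AAPL': Series([124.95, 127.04, 128.47, 127.50, 125.90], index=dates),
    'GOOG': Series([554.51, 550.84, 559.50, 557.99, 560.36], index=dates),
    'IBM': Series([157.08, 156.96, 159.81, 159.81, 162.88], index=dates),
    'MSFT': Series([41.56, 41.70, 42.50, 42.29, 42.88], index=dates),
}

# Each dict key becomes a column; the Series are aligned on their dates.
price = DataFrame({tic: data for tic, data in all_data.items()})
print(price)
```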

  • Correlation and Covariance

    Calculate the percentage change from the previous value:

    returns = price.pct_change()
    returns.tail()

                    AAPL      GOOG       IBM      MSFT
    Date
    2015-03-16  0.011004  0.013137  0.018149  0.004350
    2015-03-17  0.016727 -0.006618 -0.000764  0.003369
    2015-03-18  0.011256  0.015721  0.018157  0.019185
    2015-03-19 -0.007550 -0.002699  0.000000 -0.004941
    2015-03-20 -0.012549  0.004247  0.019210  0.013951
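
    For each column, pct_change computes (p_t - p_{t-1}) / p_{t-1}. A sketch checking this by hand against the last two AAPL prices shown earlier (127.50 and 125.90):

```python
from pandas import Series

aapl = Series([127.50, 125.90], index=['2015-03-19', '2015-03-20'])

returns = aapl.pct_change()                 # first entry has no predecessor, so it is NaN
by_hand = (125.90 - 127.50) / 127.50        # same formula written out

print(returns)
print(round(by_hand, 6))
```

    The hand-computed value rounds to -0.012549, matching the 2015-03-20 AAPL entry in the table.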

  • Correlation and Covariance

    The corr method calculates the correlation between two series:

    returns.MSFT.corr(returns.IBM)

    0.50052763872781603

    DataFrame's corr and cov methods return a full correlation or covariance matrix as a DataFrame:

    returns.corr()

              AAPL      GOOG       IBM      MSFT
    AAPL  1.000000  0.265999  0.368079  0.345835
    GOOG  0.265999  1.000000  0.315613  0.409107
    IBM   0.368079  0.315613  1.000000  0.500528
    MSFT  0.345835  0.409107  0.500528  1.000000
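
    For reference, the (Pearson) correlation that corr reports by default is the covariance divided by the product of the two standard deviations. A sketch on made-up return series (the numbers are illustrative, not market data):

```python
from pandas import DataFrame

returns = DataFrame({'X': [0.01, -0.02, 0.03, 0.00],
                     'Y': [0.02, -0.01, 0.02, 0.01]})

pairwise = returns.X.corr(returns.Y)
by_hand = returns.X.cov(returns.Y) / (returns.X.std() * returns.Y.std())
matrix = returns.corr()              # symmetric, with ones on the diagonal

print(pairwise, by_hand)
print(matrix)
```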

  • Unique Values

    To get the unique values, use the Series method unique:

    print(len(price.AAPL))
    unique_prices = price.AAPL.unique()
    print(len(unique_prices))

    1312
    1192

  • Value Counts

    We can also count the occurrences of each value

    Example

    price.AAPL.value_counts().head()

    45.29    3
    34.26    3
    27.75    2
    45.47    2
    45.72    2
    dtype: int64
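
    unique and value_counts are closely related: the number of unique values equals the number of entries in the value_counts result. A sketch on a short made-up Series that repeats a few of the prices above:

```python
from pandas import Series

prices = Series([45.29, 34.26, 45.29, 27.75, 45.29, 34.26])

uniques = prices.unique()            # distinct values, in order of first appearance
counts = prices.value_counts()       # sorted by count, most frequent first

print(uniques)
print(counts)
```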

  • Missing Data

    Missing data is common in most data analysis applications

    By default, pandas functions deal with missing data gracefully

    Example

    First, let's calculate the average price for GOOG:

    price.GOOG.mean()

    550.01818548387075

    How many missing observations do we have?

    price.GOOG.isnull().sum()

    1064

    Now, let's calculate the mean without discarding the missing observations

    price.GOOG.mean(skipna=False)

    nan

    The average price cannot be calculated if we do not remove or replace the missing values
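
    The same three steps on a tiny stand-in Series (two missing values followed by two GOOG prices), as a sketch that runs without the downloaded data:

```python
import numpy as np
from pandas import Series

goog = Series([np.nan, np.nan, 558.46, 559.99])

default_mean = goog.mean()                # NaN values are skipped by default
n_missing = goog.isnull().sum()           # count of missing observations
strict_mean = goog.mean(skipna=False)     # NaN propagates to the result

print(default_mean, n_missing, strict_mean)
```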

  • Filtering out Missing Data

    In many applications it is important to ensure that we always use the same set of observations

    In such cases it may be wise to remove observations with missing values:

    price.dropna().head()

                 AAPL    GOOG     IBM   MSFT
    Date
    2014-03-27  75.35  558.46  185.05  38.33
    2014-03-28  75.27  559.99  185.65  39.24
    2014-03-31  75.25  556.97  187.64  39.91
    2014-04-01  75.94  567.16  189.60  40.33
    2014-04-02  76.06  567.00  188.67  40.26

    Data starts on 2014-03-27, the first date for which we have data for GOOG

  • Filtering out Missing Data

    We could also drop the columns that have missing data:

    price.dropna(axis=1).head()

                 AAPL     IBM   MSFT
    Date
    2010-01-04  28.84  119.53  26.94
    2010-01-05  28.89  118.09  26.95
    2010-01-06  28.43  117.32  26.79
    2010-01-07  28.38  116.92  26.51
    2010-01-08  28.56  118.09  26.69
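
    A sketch contrasting the two directions of dropna on a three-row stand-in frame (a few values taken from the tables above, arranged for illustration):

```python
import numpy as np
from pandas import DataFrame

price = DataFrame({'AAPL': [28.84, 28.89, 75.35],
                   'GOOG': [np.nan, np.nan, 558.46]},
                  index=['2010-01-04', '2010-01-05', '2014-03-27'])

complete_rows = price.dropna()          # keep only rows with no missing value
complete_cols = price.dropna(axis=1)    # keep only columns with no missing value

print(complete_rows)
print(complete_cols)
```

    Dropping rows keeps every column but shortens the sample; dropping columns keeps every date but loses GOOG entirely.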

  • Filling in Missing Data

    In some situations we want to fill in the missing observations with default values:

    Example

    Filling with zeros:

    price.fillna(0).head()

                 AAPL  GOOG     IBM   MSFT
    Date
    2010-01-04  28.84     0  119.53  26.94
    2010-01-05  28.89     0  118.09  26.95
    2010-01-06  28.43     0  117.32  26.79
    2010-01-07  28.38     0  116.92  26.51
    2010-01-08  28.56     0  118.09  26.69

    Filling with the mean:

    price.fillna(price.mean()).head()

                 AAPL        GOOG     IBM   MSFT
    Date
    2010-01-04  28.84  550.018185  119.53  26.94
    2010-01-05  28.89  550.018185  118.09  26.95
    2010-01-06  28.43  550.018185  117.32  26.79
    2010-01-07  28.38  550.018185  116.92  26.51
    2010-01-08  28.56  550.018185  118.09  26.69
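
    Filling with price.mean() works column by column: price.mean() is a Series indexed by column name, and fillna matches each column with its own mean. A sketch with made-up numbers:

```python
import numpy as np
from pandas import DataFrame

price = DataFrame({'AAPL': [28.0, 30.0], 'GOOG': [np.nan, 550.0]})

column_means = price.mean()               # AAPL -> 29.0, GOOG -> 550.0 (NaN skipped)
with_zeros = price.fillna(0)
with_means = price.fillna(column_means)   # each column is filled with its own mean

print(with_zeros)
print(with_means)
```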

  • Filling in Missing Data

    Note: By default, these operations return a new object and leave the original data unchanged

    price.head()

                 AAPL  GOOG     IBM   MSFT
    Date
    2010-01-04  28.84   NaN  119.53  26.94
    2010-01-05  28.89   NaN  118.09  26.95
    2010-01-06  28.43   NaN  117.32  26.79
    2010-01-07  28.38   NaN  116.92  26.51
    2010-01-08  28.56   NaN  118.09  26.69
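
    In practice this means the result has to be assigned to a name if it is to be kept. A minimal sketch:

```python
import numpy as np
from pandas import Series

goog = Series([np.nan, 558.46])

filled = goog.fillna(0)                   # new object holding the filled values
still_missing = goog.isnull().sum()       # the original still has its NaN

print(filled.iloc[0], still_missing)
```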

  • Hierarchical Indexing

    Hierarchical indexing enables using multiple (two or more) index levels on an axis

    It provides a way to work with higher dimensional data in a lower dimensional form

    Example

    data = Series(np.random.randn(10),
                  index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                         [2010, 2011, 2012, 2010, 2011, 2012, 2010, 2011, 2011, 2012]])

    data

    a  2010    0.547634
       2011    0.792182
       2012   -0.821709
    b  2010    0.172503
       2011    0.714497
       2012   -0.004165
    c  2010   -0.095196
       2011    0.096810
    d  2011    0.553003
       2012    0.167027
    dtype: float64

  • Hierarchical Indexing

    Example

    Accessing all entries for a (outer level):

    data['a']

    2010    0.547634
    2011    0.792182
    2012   -0.821709
    dtype: float64

    Accessing all entries for 2011 (inner level):

    data[:, 2011]

    a    0.792182
    b    0.714497
    c    0.096810
    d    0.553003
    dtype: float64
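
    A sketch of the "higher dimensional data in a lower dimensional form" idea: the two-level Series can be pivoted into a table with unstack. The values here are np.arange(10.) instead of random numbers so the result is predictable, and .loc is used as the explicit label-based spelling of the selections:

```python
import numpy as np
from pandas import Series

data = Series(np.arange(10.),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [2010, 2011, 2012, 2010, 2011, 2012, 2010, 2011, 2011, 2012]])

outer = data.loc['a']          # all years for 'a'
inner = data.loc[:, 2011]      # the 2011 value under every outer label
wide = data.unstack()          # outer level -> rows, inner level -> columns

print(inner)
print(wide)
```

    Combinations that do not exist in the Series, such as ('c', 2012), show up as NaN in the unstacked table.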

  • Summary Statistics by Level

    We can summarize the results by each level of the index

    Example

    data.sum(level=0)

    a    0.518107
    b    0.882835
    c    0.001614
    d    0.720031
    dtype: float64

    data.sum(level=1)

    2010    0.624941
    2011    2.156492
    2012   -0.658846
    dtype: float64
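
    The level argument of sum was deprecated in later pandas releases and removed in pandas 2.0. On current versions the same summaries can be written with groupby; a sketch using np.arange(10.) as stand-in values:

```python
import numpy as np
from pandas import Series

data = Series(np.arange(10.),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [2010, 2011, 2012, 2010, 2011, 2012, 2010, 2011, 2011, 2012]])

by_outer = data.groupby(level=0).sum()   # one total per letter
by_year = data.groupby(level=1).sum()    # one total per year

print(by_outer)
print(by_year)
```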
