Python for Data Analytics

Lectures 3 & 4: Essential Libraries NumPy and pandas

Rodrigo [email protected]

Spring 2015

NumPy

NumPy is the fundamental package for high-performance scientific computing and data analysis in Python. It provides:

ndarray, a fast and space-efficient multidimensional array providing vectorized operations

Standard mathematical functions for fast operations over entire arrays without having to write loops

Tools for reading and writing array data to disk and working with memory-mapped files

Tools for integrating code written in C, C++, and Fortran

Having a good understanding of how NumPy works will help you use tools like pandas

ndarray

ndarray stands for N-dimensional array.

data

array([[ 0.73230045, 0.25494037, 0.79516021],
       [ 0.62986533, 0.3420035 , 0.08914765]])

You can get the shape of an array and the type of its elements by accessing the attributes shape and dtype:

print data.shape
print data.dtype

(2, 3)
float64

Creating ndarrays

It is possible to create ndarrays from a list or a list of lists

From a list:

import numpy as np
data1 = [1, 2, 3, 4]
arr1 = np.array(data1)
arr1

array([1, 2, 3, 4])

From a list of lists:

data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
arr2

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
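As a small sketch of how np.array treats different inputs (the variable name arr_mixed is illustrative and not from the lecture): a flat list gives a 1-D array, equal-length nested lists give a 2-D array, and mixing integers with floats upcasts every element to a floating-point type.

```python
import numpy as np

# A flat list becomes a 1-D array.
arr1 = np.array([1, 2, 3, 4])

# Mixing ints and floats upcasts every element to float64.
arr_mixed = np.array([1, 2.5, 3])

# Equal-length nested lists become a 2-D array.
arr2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(arr1.ndim)         # 1
print(arr2.shape)        # (2, 4)
print(arr_mixed.dtype)   # float64
```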

Creating ndarrays

Creating an array initialized with zeros:

np.zeros((3, 6))

array([[ 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0.]])

Creating ndarrays

Creating an array with random numbers:

data = np.random.rand(2, 3)
print data.shape
print data.dtype
data

(2, 3)
float64
array([[ 0.73230045, 0.25494037, 0.79516021],
       [ 0.62986533, 0.3420035 , 0.08914765]])

Data Types for ndarrays

ndarrays are composed of elements that are all of the same type:

int

float

complex

bool

string

object

In practice an array of type object can hold elements of any type, but such arrays are not common

Example

arr = np.array(['Hello', np.random.rand])
arr

array(['Hello', <built-in method rand ...>], dtype=object)

Operations between Arrays and Scalars

ndarray supports vectorized operations, i.e., operations that are applied to each element of an array without the need for explicit loops

Multiplication by a scalar

data * 10

array([[ 6.39219315, 6.8102819 , 4.34637984],
       [ 0.34237044, 5.39243817, 1.26276343]])

Addition

data + data

array([[ 1.27843863, 1.36205638, 0.86927597],
       [ 0.06847409, 1.07848763, 0.25255269]])

arr = np.array([[1., 2., 3.], [4, 5, 6], [7, 8, 9]])
arr

array([[ 1., 2., 3.],
       [ 4., 5., 6.],
       [ 7., 8., 9.]])

Multiplication

arr * arr

array([[ 1., 4., 9.],
       [ 16., 25., 36.],
       [ 49., 64., 81.]])

Division

1 / arr

array([[ 1. , 0.5 , 0.33333333],
       [ 0.25 , 0.2 , 0.16666667],
       [ 0.14285714, 0.125 , 0.11111111]])

Basic Indexing and Slicing

Indexing along the first axis works the same way as for lists and tuples; a comma separates the indices or slices for each further axis:

arr

array([[ 1., 2., 3.],
       [ 4., 5., 6.],
       [ 7., 8., 9.]])

arr[1]

array([ 4., 5., 6.])

arr[:, 1]

array([ 2., 5., 8.])

arr[1:, :2]

array([[ 4., 5.],
       [ 7., 8.]])
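The slicing patterns above can be tried on a small array. One point worth checking, sketched here on an assumed 3x3 float array built with np.arange, is that basic slices are views on the original data rather than copies, so writing through a slice changes the original array.

```python
import numpy as np

arr = np.arange(1.0, 10.0).reshape(3, 3)   # values 1.0 .. 9.0

row = arr[1]           # second row
col = arr[:, 1]        # second column
block = arr[1:, :2]    # rows 1-2, columns 0-1

# Basic slices share memory with the original array.
print(np.shares_memory(arr, block))   # True

# Writing through the slice therefore changes `arr` as well.
block[0, 0] = -1.0
print(arr[1, 0])                      # -1.0
```

When an independent array is needed, calling .copy() on the slice is the usual way to get one.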

Boolean Indexing

names = np.array(['Bob', 'Joe', 'Bill', 'Tess', 'Joe', 'Joe', 'Bob'])
data = np.random.randn(7, 4)
print names
data

['Bob' 'Joe' 'Bill' 'Tess' 'Joe' 'Joe' 'Bob']
array([[ 1.31273264, -1.4027545 , 1.76375476, -0.19194064],
       [-0.41950145, -0.21455786, 0.28687505, 0.70312942],
       [ 0.76013575, 0.68719731, 1.45771087, 0.07268093],
       [ 1.20720934, -0.52305673, 0.56317445, 0.33062879],
       [ 0.81500408, 1.12185486, 1.31608209, 0.80725464],
       [-0.70584486, -0.86788517, -0.07373691, 0.83189097],
       [-0.44775752, -1.67612963, -0.01396541, 0.2745861 ]])

We can create an array of Booleans that is used to select the relevant rows:

print names == 'Bob'
data[names == 'Bob']

[ True False False False False False True]
array([[ 1.31273264, -1.4027545 , 1.76375476, -0.19194064],
       [-0.44775752, -1.67612963, -0.01396541, 0.2745861 ]])

Boolean Indexing

You can combine different indexing methods at once:

data[names == 'Bob', 2:]

array([[ 1.76375476, -0.19194064],
       [-0.01396541, 0.2745861 ]])

You can combine conditions with the Boolean operators | (or) and & (and):

data[(names == 'Bob') | (names == 'Joe')]

array([[ 1.31273264, -1.4027545 , 1.76375476, -0.19194064],
       [-0.41950145, -0.21455786, 0.28687505, 0.70312942],
       [ 0.81500408, 1.12185486, 1.31608209, 0.80725464],
       [-0.70584486, -0.86788517, -0.07373691, 0.83189097],
       [-0.44775752, -1.67612963, -0.01396541, 0.2745861 ]])

data[(names == 'Bob') & (names == 'Joe')]

array([], shape=(0, 4), dtype=float64)

Note: Selecting data from an array by Boolean indexing always creates a copy of the data, even if the returned array is unchanged
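A minimal sketch of the copy behaviour described in the note, using stand-in values from np.arange instead of the random data on the slide (an assumption made so the result is reproducible): modifying the Boolean-selected rows leaves the original array untouched.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Bill', 'Tess', 'Joe', 'Joe', 'Bob'])
data = np.arange(28.0).reshape(7, 4)   # stand-in values, one row per name

# Combine two conditions with | (or); parentheses are required.
mask = (names == 'Bob') | (names == 'Joe')
selected = data[mask]                  # Boolean selection returns a new array

selected[:] = 0.0                      # modifies only the copy

print(int(mask.sum()))                 # 5
print(data[0, 1])                      # 1.0
print(np.shares_memory(data, selected))  # False
```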

Boolean Indexing

You can use Boolean indexing to assign values to specific positions of the array:

data[data < 0] = 0
data

array([[ 1.31273264, 0. , 1.76375476, 0. ],
       [ 0. , 0. , 0.28687505, 0.70312942],
       [ 0.76013575, 0.68719731, 1.45771087, 0.07268093],
       [ 1.20720934, 0. , 0.56317445, 0.33062879],
       [ 0.81500408, 1.12185486, 1.31608209, 0.80725464],
       [ 0. , 0. , 0. , 0.83189097],
       [ 0. , 0. , 0. , 0.2745861 ]])

Note: Here the array is indexed with an array of Booleans

Boolean Indexing

data[names != 'Joe'] = 7
data

array([[ 7. , 7. , 7. , 7. ],
       [ 0. , 0. , 0.28687505, 0.70312942],
       [ 7. , 7. , 7. , 7. ],
       [ 7. , 7. , 7. , 7. ],
       [ 0.81500408, 1.12185486, 1.31608209, 0.80725464],
       [ 0. , 0. , 0. , 0.83189097],
       [ 7. , 7. , 7. , 7. ]])

Fancy Indexing

Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays

arr = np.empty((8, 4))

for i in range(8):
    arr[i] = i

arr

array([[ 0., 0., 0., 0.],
       [ 1., 1., 1., 1.],
       [ 2., 2., 2., 2.],
       [ 3., 3., 3., 3.],
       [ 4., 4., 4., 4.],
       [ 5., 5., 5., 5.],
       [ 6., 6., 6., 6.],
       [ 7., 7., 7., 7.]])

Example

arr[[3, 0, 2]]

array([[ 3., 3., 3., 3.],
       [ 0., 0., 0., 0.],
       [ 2., 2., 2., 2.]])

Transposing Arrays

It is easy to transpose arrays with the attribute T:

arr.T

array([[ 0., 1., 2., 3., 4., 5., 6., 7.],
       [ 0., 1., 2., 3., 4., 5., 6., 7.],
       [ 0., 1., 2., 3., 4., 5., 6., 7.],
       [ 0., 1., 2., 3., 4., 5., 6., 7.]])

Data Processing Using Arrays

NumPy arrays allow us to express many kinds of data processing tasks as concise array expressions

This practice of replacing explicit loops with array expressions is commonly referred to as vectorization

Vectorized array operations are often one or two orders of magnitude faster than their pure Python equivalents
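To make the idea of vectorization concrete, here is a minimal sketch that computes the same result twice, once with an explicit loop and once as a single array expression. The array size and the formula are arbitrary choices for illustration; no timings are claimed here, since they depend on the machine.

```python
import numpy as np

values = np.arange(10000, dtype=np.float64)

# Explicit Python loop: one interpreted iteration per element.
loop_result = np.empty_like(values)
for i in range(len(values)):
    loop_result[i] = values[i] ** 2 + 1.0

# Vectorized form: the same arithmetic as one array expression.
vec_result = values ** 2 + 1.0

print(np.array_equal(loop_result, vec_result))   # True
```

In IPython, the %timeit magic can be placed in front of each variant to compare their running times.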

Universal Functions

A universal function is a function that performs elementwise operations on data in ndarrays. They are fast vectorized wrappers for simple functions

Examples

arr = np.arange(10)
np.sqrt(arr)

array([ 0. , 1. , 1.41421356, 1.73205081, 2. ,
        2.23606798, 2.44948974, 2.64575131, 2.82842712, 3. ])

np.exp(arr)

array([ 1.00000000e+00, 2.71828183e+00, 7.38905610e+00,
        2.00855369e+01, 5.45981500e+01, 1.48413159e+02,
        4.03428793e+02, 1.09663316e+03, 2.98095799e+03,
        8.10308393e+03])

Conditional Logic as Array Operations

np.where is the vectorized version of the conditional expression x if cond else y:

Example

xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

Suppose we want to create an array that takes the value in xarr when cond is True and the value in yarr when cond is False

np.where(cond, xarr, yarr)

array([ 1.1, 2.2, 1.3, 1.4, 2.5])

This method can be applied to n-dimensional arrays
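As an illustration of the n-dimensional case, the sketch below applies np.where to a small 2-D array. The values are made up for the example; the second and third arguments may be scalars, which are broadcast against the condition.

```python
import numpy as np

arr = np.array([[1.5, -0.5],
                [-2.0, 3.0]])

# Replace negative entries with 0 and keep the others.
clipped = np.where(arr < 0, 0.0, arr)

# Label every element by its sign using two scalars.
signs = np.where(arr > 0, 1, -1)

print(clipped.tolist())   # [[1.5, 0.0], [0.0, 3.0]]
print(signs.tolist())     # [[1, -1], [-1, 1]]
```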

Mathematical and Statistical Methods

NumPy arrays provide a good set of statistical methods

Basic array statistical methods

Method            Description

sum               Sum of all the elements in the array or along an axis.
mean              Arithmetic mean. Zero-length arrays have NaN mean.
std, var          Standard deviation and variance, respectively.
min, max          Minimum and maximum.
argmin, argmax    Indices of minimum and maximum elements, respectively.
cumsum            Cumulative sum of elements starting from 0.
cumprod           Cumulative product of elements starting from 1.
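A short sketch of a few of the methods in the table, on a made-up 2x3 array. It shows the difference between reducing over the whole array and reducing along an axis (axis=0 works down the columns, axis=1 across the rows).

```python
import numpy as np

arr = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])

total = arr.sum()              # over all elements
col_sums = arr.sum(axis=0)     # one value per column
row_means = arr.mean(axis=1)   # one value per row
largest_at = arr.argmax()      # position in the flattened array
running = arr.cumsum()         # cumulative sum over the flattened array

print(total)                   # 21.0
print(col_sums.tolist())       # [5.0, 7.0, 9.0]
print(row_means.tolist())      # [2.0, 5.0]
print(largest_at)              # 5
print(running.tolist())        # [1.0, 3.0, 6.0, 10.0, 15.0, 21.0]
```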

Methods for Boolean Arrays

Booleans are coerced to 1 and 0, so the sum method can be used to count the number of True values in an array:

arr = np.random.randn(100)
(arr > 0).sum()

55

Sorting

NumPy arrays can be sorted in-place using the sort method:

arr = np.random.randn(5)
print 'unsorted:', arr
arr.sort()
print 'sorted:', arr

unsorted: [-0.21132983 0.25338333 -1.27090331 0.88185258 0.32729311]
sorted: [-1.27090331 -0.21132983 0.25338333 0.32729311 0.88185258]

Sorting

You can specify the axis along which you want to sort an n-dimensional array:

arr = np.random.randn(5, 3)
arr.sort(axis=0)
arr

array([[-1.17850016, 0.05609878, -1.11894931],
       [-0.15450684, 0.14064359, -0.12111114],
       [ 0.66674063, 0.39402912, -0.09261304],
       [ 0.79119149, 1.18169535, 0.09052968],
       [ 1.61247548, 1.48936384, 0.11534684]])

arr.sort(axis=1)
arr

array([[-1.17850016, -1.11894931, 0.05609878],
       [-0.15450684, -0.12111114, 0.14064359],
       [-0.09261304, 0.39402912, 0.66674063],
       [ 0.09052968, 0.79119149, 1.18169535],
       [ 0.11534684, 1.48936384, 1.61247548]])

Set Logic

We can get all the unique values of an array with the function np.unique:

names = np.array(['Bob', 'Joe', 'Bill', 'Tess', 'Joe', 'Joe', 'Bob'])
np.unique(names)

array(['Bill', 'Bob', 'Joe', 'Tess'],
      dtype='|S4')

It is possible to check whether each of the elements of an array belongs to a set of values with the function np.in1d:

np.in1d(names, ['Bob', 'Joe'])

array([ True, True, False, False, True, True, True], dtype=bool)

Set Logic

Array set operations:

Method               Description

unique(x)            Compute the sorted, unique elements in x
intersect1d(x, y)    Compute the sorted, common elements in x and y
union1d(x, y)        Compute the sorted union of elements
in1d(x, y)           Compute a Boolean array indicating whether each element of x is contained in y
setdiff1d(x, y)      Set difference, elements in x that are not in y
setxor1d(x, y)       Set symmetric differences
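The set operations in the table can be sketched on two small integer arrays (made-up values). For the membership test this example uses np.isin, which to my knowledge is the newer spelling of np.in1d in recent NumPy releases; treat that substitution as an assumption about the installed version.

```python
import numpy as np

x = np.array([3, 1, 2, 3, 1])
y = np.array([2, 3, 5])

uniq = np.unique(x)            # sorted unique elements of x
common = np.intersect1d(x, y)  # elements present in both
union = np.union1d(x, y)       # elements present in either
member = np.isin(x, y)         # per-element membership of x in y
only_x = np.setdiff1d(x, y)    # in x but not in y
either = np.setxor1d(x, y)     # in exactly one of the two

print(uniq.tolist())     # [1, 2, 3]
print(common.tolist())   # [2, 3]
print(union.tolist())    # [1, 2, 3, 5]
print(member.tolist())   # [True, False, True, True, False]
print(only_x.tolist())   # [1]
print(either.tolist())   # [1, 5]
```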

File Input and Output with Arrays

np.save and np.load are the two main functions to save and load array data on disk

Arrays are saved by default in an uncompressed binary format with file extension .npy

Example

Saving an array:

arr = np.arange(10)
print arr
np.save('my_array', arr)

[0 1 2 3 4 5 6 7 8 9]

Loading the array:

arr = np.load('my_array.npy')
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
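A self-contained round trip, sketched under the assumption of Python 3 (for tempfile.TemporaryDirectory) so that the example cleans up after itself instead of leaving my_array.npy in the working directory.

```python
import os
import tempfile

import numpy as np

arr = np.arange(10)

# Write to a temporary directory that is removed when the block exits.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'my_array.npy')
    np.save(path, arr)        # binary .npy format
    loaded = np.load(path)    # read the array back

print(np.array_equal(arr, loaded))   # True
print(loaded.dtype == arr.dtype)     # True
```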

Linear Algebra

In NumPy, multiplying two arrays with * is an elementwise operation. In some cases we are interested in matrix operations.

The numpy.linalg module contains matrix operations such as inverses and determinants

Example

We can use the function np.dot to multiply two matrices:

arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([[2, 2, 2], [2, 2, 2], [2, 2, 2]])
np.dot(arr1, arr2)

array([[12, 12, 12],
       [30, 30, 30]])

np.dot(arr2, arr1.T)

array([[12, 30],
       [12, 30],
       [12, 30]])
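Beyond np.dot, a brief sketch of two numpy.linalg functions, inv and solve, on a made-up 2x2 system. The matrix A and vector b are illustrative values only; the checks use np.allclose because the results are floating-point.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

# Solve the linear system A x = b.
x = np.linalg.solve(A, b)

# Matrix inverse: A times its inverse is (numerically) the identity.
A_inv = np.linalg.inv(A)

print(np.allclose(np.dot(A, x), b))              # True
print(np.allclose(np.dot(A, A_inv), np.eye(2)))  # True
```

For solving a system, np.linalg.solve is generally preferred over forming the inverse explicitly.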

Random Number Generation

The numpy.random module supplements the built-in Python random module with functions that efficiently generate whole arrays of sample values from many different kinds of probability distributions

samples = np.random.normal(size=(4, 4))
samples

array([[-0.08695804, 0.18486392, -0.32093721, -1.812208 ],
       [ 1.31593422, 0.56465651, -1.43691046, -0.40667169],
       [ 1.74605033, 1.27025025, -0.67012289, 0.57377713],
       [-1.46157084, -0.86130787, -0.64128062, 0.66803304]])

The numpy.random module is much faster than the standard random module in Python when generating many samples:

from random import normalvariate
%timeit samples = [normalvariate(0, 1) for _ in xrange(1000*1000)]
%timeit samples = np.random.normal(size=1000*1000)

1 loops, best of 3: 1.29 s per loop
10 loops, best of 3: 36.3 ms per loop
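Because the samples are random, the printed values differ from run to run. A minimal sketch of making them reproducible, assuming the np.random.RandomState class with a fixed seed (the seed 0 and the loc/scale values are arbitrary choices for illustration):

```python
import numpy as np

# Two generators created with the same seed produce the same samples.
rng_a = np.random.RandomState(0)
rng_b = np.random.RandomState(0)

samples_a = rng_a.normal(size=(4, 4))
samples_b = rng_b.normal(size=(4, 4))
print(np.array_equal(samples_a, samples_b))   # True

# loc and scale set the mean and standard deviation of the distribution.
big = rng_a.normal(loc=10.0, scale=2.0, size=100000)
print(abs(big.mean() - 10.0) < 0.1)           # True
```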

Term Project

Requirements:

Teams of 2 or 3 students

Include all three components covered in this class:
1 Data collection from an online source
2 Data storage in appropriate format
3 Descriptive and graphical analysis of the data; regression analysis or other technique

Dates:

Project proposal and teams: Thursday, April 2
4 paragraphs:
Goals
Data collection strategy
Data storage strategy
Analysis strategy

Iterations over one week max.

Progress report: Tuesday, April 21

Final report: Sunday, May 3, 2015 at 11:59 pm (hard deadline)

pandas

pandas is the main library used for data analysis in Python

Built on top of NumPy

Designed to make data analysis fast and easy in Python

Main data structures:

Series

DataFrame

Series

A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index.

Example

from pandas import Series
obj = Series([1, 3, 4, -5])
obj

0    1
1    3
2    4
3   -5
dtype: int64

You can get the array representation and index object of the Series via its attributes values and index:

obj.values

array([ 1, 3, 4, -5])

obj.index

Int64Index([0, 1, 2, 3], dtype='int64')

Series

You can use any index in a Series:

obj2 = Series([4, 7, -4, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -4
c    3
dtype: int64

Boolean selection preserves the index-value link:

obj2[obj2 > 0]

d    4
b    7
c    3
dtype: int64

Series

Series automatically aligns differently indexed data in arithmetic operations

Example

obj3 = Series([4, 7, -4, 3], index=['a', 'b', 'c', 'd'])
obj4 = Series([4, 7, -4, 3], index=['d', 'b', 'a', 'c'])
print obj3
print obj4

a    4
b    7
c   -4
d    3
dtype: int64
d    4
b    7
a   -4
c    3
dtype: int64

obj3 + obj4

a     0
b    14
c    -1
d     7
dtype: int64
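In the example above both Series carry the same labels. A sketch of what happens when the labels only partly overlap (made-up values): labels present in only one operand produce NaN, and the add method with fill_value is one way to treat the missing side as 0.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Alignment is by label; 'a' and 'd' exist on one side only and become NaN.
total = s1 + s2

# Treat a missing label as 0 instead of propagating NaN.
filled = s1.add(s2, fill_value=0)

print(total)
print(filled)
```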

DataFrame

A DataFrame represents a spreadsheet-like data structure containing an ordered collection of columns

Each column is a Series object

Each column can contain a different data type

A DataFrame can be seen as a dictionary of Series objects

Example

from pandas import DataFrame

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

frame = DataFrame(data)
frame

   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
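The "dictionary of Series" view can be checked directly. This sketch rebuilds the same frame (with an explicit column order, an addition for the example) and inspects one column and the per-column data types.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

frame = pd.DataFrame(data, columns=['year', 'state', 'pop'])

# Each column is a Series that shares the frame's row index.
pop = frame['pop']
print(isinstance(pop, pd.Series))    # True

# Columns keep their own data types.
print(frame.dtypes)

# Like a dict, the frame can be asked for its keys (the column labels).
print(list(frame.columns))           # ['year', 'state', 'pop']
```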

DataFrame

The order of the columns can be defined with the argument columns

Example

DataFrame(data, columns=['year', 'state', 'pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9

DataFrame

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute notation

Example

frame['state']

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object

frame.year

0    2000
1    2001
2    2002
3    2001
4    2002
Name: year, dtype: int64

DataFrame

Columns can be modified and created by assignment

Example

frame['debt'] = 0
frame

   pop   state  year  debt
0  1.5    Ohio  2000     0
1  1.7    Ohio  2001     0
2  3.6    Ohio  2002     0
3  2.4  Nevada  2001     0
4  2.9  Nevada  2002     0

frame['debt'] = xrange(len(frame))
frame

   pop   state  year  debt
0  1.5    Ohio  2000     0
1  1.7    Ohio  2001     1
2  3.6    Ohio  2002     2
3  2.4  Nevada  2001     3
4  2.9  Nevada  2002     4

DataFrame

Columns can be deleted using the del statement

del frame['debt']
frame

Index Objects

pandas's Index objects are responsible for holding the axis labels and other metadata (like the axis name or names)

Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index

Example

obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index

Index([u'a', u'b', u'c'], dtype='object')

Index objects are immutable and thus can't be changed by the user
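A small sketch of what immutability means in practice: assigning to a single position of an Index raises a TypeError, while replacing the whole index of a Series is allowed. The replacement labels x, y, z are made up for the example.

```python
import pandas as pd

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index

# Item assignment on an Index is rejected.
error = None
try:
    index[1] = 'z'
except TypeError as exc:
    error = exc
print(type(error).__name__)   # TypeError

# Relabeling is done by assigning a complete new index instead.
obj.index = ['x', 'y', 'z']
print(list(obj.index))        # ['x', 'y', 'z']
```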

Reindexing

A critical method on pandas objects is reindex, which creates a new object with the data conformed to a new index

Example

obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

obj2 = obj.reindex(['a', 'b', 'c', 'd'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
dtype: float64

Reindexing

We can provide an optional fill value in case some index value does not exist

Example

obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
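Besides a constant fill value, reindex can also carry the last known value forward for ordered data. The sketch below uses the method='ffill' option on a made-up Series with a sorted integer index; without either option, labels that did not exist before simply become NaN.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# New labels 1, 3 and 5 take the value of the closest earlier label.
filled = colors.reindex(range(6), method='ffill')
print(filled.tolist())
# ['blue', 'blue', 'purple', 'purple', 'yellow', 'yellow']

# Without a fill option the new labels are missing values.
plain = colors.reindex(range(6))
print(int(plain.isnull().sum()))   # 3
```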

Dropping Entries from an axis

Dropping one or more entries from an axis can be performed using the method drop

Example

obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj.drop('c')

a    0
b    1
d    3
e    4
dtype: float64

Dropping can be performed on any axis

Example

data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])

data.drop(['Colorado', 'Ohio'])

          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

data.drop('two', axis=1)

          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15

Indexing, selection, and filtering

Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series's index values instead of only integers:

obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
print obj['b']

1.0

print obj[['a', 'c']]

a    0
c    2
dtype: float64

Function application and mapping

Elementwise array methods work well with pandas objects

Example

frame = DataFrame(np.random.randn(4, 3),
                  columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])

frame

               b         d         e
Utah   -0.091392 -1.935977  0.271981
Ohio   -0.034697  0.823547  0.655560
Texas   0.316441 -0.603441  1.380851
Oregon  0.045986 -0.965604  0.227028

np.abs(frame)

               b         d         e
Utah    0.091392  1.935977  0.271981
Ohio    0.034697  0.823547  0.655560
Texas   0.316441  0.603441  1.380851
Oregon  0.045986  0.965604  0.227028

It is also common to apply a function on 1D arrays to each column or row

Example

f = lambda x: x.max() - x.min()

frame.apply(f)

b    0.407834
d    2.759525
e    1.153823
dtype: float64

frame.apply(f, axis=1)

Utah      2.207958
Ohio      0.858245
Texas     1.984292
Oregon    1.192632
dtype: float64
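The same range function can be traced by hand on a tiny frame with made-up values, which makes the effect of the axis argument easy to verify: by default the function receives each column, with axis=1 it receives each row.

```python
import pandas as pd

frame = pd.DataFrame([[1.0, 5.0],
                      [3.0, 2.0]],
                     columns=['b', 'd'],
                     index=['Utah', 'Ohio'])

spread = lambda x: x.max() - x.min()

by_column = frame.apply(spread)       # one value per column
by_row = frame.apply(spread, axis=1)  # one value per row

print(by_column.tolist())   # [2.0, 3.0]
print(by_row.tolist())      # [4.0, 1.0]
```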

If a function operates on a single element, it can be applied to every element of the DataFrame with the method applymap

Example

frame.applymap(lambda x: '%.2f' % x)

Sorting and ranking

It is possible to sort a DataFrame by index on either axis

Example

frame.sort_index()

               b         d         e
Ohio   -0.034697  0.823547  0.655560
Oregon  0.045986 -0.965604  0.227028
Texas   0.316441 -0.603441  1.380851
Utah   -0.091392 -1.935977  0.271981

frame.sort_index(axis=1)

Sorting and ranking

It is possible to sort by descending order

Example

frame.sort_index(ascending=False)

               b         d         e
Utah   -0.091392 -1.935977  0.271981
Texas   0.316441 -0.603441  1.380851
Oregon  0.045986 -0.965604  0.227028
Ohio   -0.034697  0.823547  0.655560

Summarizing and Descriptive Statistics

pandas objects are equipped with a set of common mathematical and statistical methods

Example

frame.describe()

              b         d         e
count  4.000000  4.000000  4.000000
mean   0.059085 -0.670369  0.633855
std    0.180594  1.143852  0.533834
min   -0.091392 -1.935977  0.227028
25%   -0.048871 -1.208198  0.260742
50%    0.005645 -0.784522  0.463770
75%    0.113600 -0.246694  0.836882
max    0.316441  0.823547  1.380851

Summarizing and Descriptive Statistics

Descriptive and summary statistics:

Method      Description

count       Number of non-NA values
describe    Compute set of summary statistics
min, max    Compute minimum and maximum values
quantile    Compute sample quantile ranging from 0 to 1
sum         Sum of values
mean        Mean of values
median      Arithmetic median (50% quantile) of values
var         Sample variance of values
std         Sample standard deviation of values
cumsum      Cumulative sum of values
cumprod     Cumulative product of values

Correlation and Covariance

Correlation and covariance require two sets of data

Example

Get stock prices and volumes from Yahoo! Finance

import pandas.io.data as web

all_data = {}

for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '3/22/2015')

price = DataFrame({tic: data['Adj Close']
                   for tic, data in all_data.iteritems()})

volume = DataFrame({tic: data['Volume']
                    for tic, data in all_data.iteritems()})

price.tail()

              AAPL    GOOG     IBM   MSFT
Date
2015-03-16  124.95  554.51  157.08  41.56
2015-03-17  127.04  550.84  156.96  41.70
2015-03-18  128.47  559.50  159.81  42.50
2015-03-19  127.50  557.99  159.81  42.29
2015-03-20  125.90  560.36  162.88  42.88

Correlation and Covariance

Calculate the percentage change from the previous value:

returns = price.pct_change()
returns.tail()

                AAPL      GOOG       IBM      MSFT
Date
2015-03-16  0.011004  0.013137  0.018149  0.004350
2015-03-17  0.016727 -0.006618 -0.000764  0.003369
2015-03-18  0.011256  0.015721  0.018157  0.019185
2015-03-19 -0.007550 -0.002699  0.000000 -0.004941
2015-03-20 -0.012549  0.004247  0.019210  0.013951

The corr method calculates the correlation between two Series:

returns.MSFT.corr(returns.IBM)

0.50052763872781603

DataFrame's corr and cov methods return a full correlation or covariance matrix as a DataFrame:

returns.corr()

          AAPL      GOOG       IBM      MSFT
AAPL  1.000000  0.265999  0.368079  0.345835
GOOG  0.265999  1.000000  0.315613  0.409107
IBM   0.368079  0.315613  1.000000  0.500528
MSFT  0.345835  0.409107  0.500528  1.000000
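To show how the two quantities relate, this sketch recomputes a correlation from the covariance and the standard deviations, corr(x, y) = cov(x, y) / (std(x) * std(y)). The returns are made-up numbers in hypothetical columns x and y, not the Yahoo! Finance data.

```python
import pandas as pd

returns = pd.DataFrame({'x': [0.01, -0.02, 0.03, 0.00],
                        'y': [0.02, -0.01, 0.02, 0.01]})

# Pairwise correlation between two columns.
r = returns['x'].corr(returns['y'])

# The same value from the covariance and the standard deviations.
manual = returns['x'].cov(returns['y']) / (returns['x'].std() * returns['y'].std())
print(abs(r - manual) < 1e-12)     # True

# Full matrices: symmetric, with ones on the correlation diagonal.
corr_matrix = returns.corr()
cov_matrix = returns.cov()
print(corr_matrix.shape)           # (2, 2)
```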

Unique Values

To get unique values we can use the method unique of the Series object:

print len(price.AAPL)
unique_prices = price.AAPL.unique()
print len(unique_prices)

1312
1192

Value Counts

We can also count the occurrences of each value

Example

price.AAPL.value_counts().head()

45.29    3
34.26    3
27.75    2
45.47    2
45.72    2
dtype: int64

Missing Data

Missing data is common in most data analysis applications

By default pandas functions deal with missing data gracefully

Example

First, let's calculate the average price for GOOG:

price.GOOG.mean()

550.01818548387075

How many missing observations do we have?

price.GOOG.isnull().sum()

1064

Now, let's calculate the mean without discarding the missing observations

price.GOOG.mean(skipna=False)

nan

The average price cannot be calculated if we do not remove or replace themissing values
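The same behaviour can be reproduced on a three-element Series with one missing value (made-up prices), which makes the arithmetic easy to follow: by default the NaN is skipped, with skipna=False it propagates into the result.

```python
import numpy as np
import pandas as pd

prices = pd.Series([10.0, np.nan, 14.0])

default_mean = prices.mean()               # NaN skipped: (10 + 14) / 2
strict_mean = prices.mean(skipna=False)    # NaN propagates
n_missing = prices.isnull().sum()

print(default_mean)            # 12.0
print(np.isnan(strict_mean))   # True
print(int(n_missing))          # 1
```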

Filtering out Missing Data

In many applications it is important to know that we are always using the same observations

In such cases it may be wise to remove observations with missing values:

price.dropna().head()

Data starts on 2014-03-27, the first date for which we have data for GOOG

We could also drop the columns that have missing data:

price.dropna(axis=1).head()

             AAPL     IBM   MSFT
Date
2010-01-04  28.84  119.53  26.94
2010-01-05  28.89  118.09  26.95
2010-01-06  28.43  117.32  26.79
2010-01-07  28.38  116.92  26.51
2010-01-08  28.56  118.09  26.69

Filling in Missing Data

In some situations we want to fill in the missing observations with default values:

Example

Filling with zeros:

price.fillna(0).head()

             AAPL  GOOG     IBM   MSFT
Date
2010-01-04  28.84     0  119.53  26.94
2010-01-05  28.89     0  118.09  26.95
2010-01-06  28.43     0  117.32  26.79
2010-01-07  28.38     0  116.92  26.51
2010-01-08  28.56     0  118.09  26.69

Filling with the mean:

price.fillna(price.mean()).head()

Filling in Missing Data

Note: By default these operations return a new object and leave the original data unchanged

price.head()

             AAPL  GOOG     IBM   MSFT
Date
2010-01-04  28.84   NaN  119.53  26.94
2010-01-05  28.89   NaN  118.09  26.95
2010-01-06  28.43   NaN  117.32  26.79
2010-01-07  28.38   NaN  116.92  26.51
2010-01-08  28.56   NaN  118.09  26.69
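A minimal sketch of this behaviour on a small made-up Series: fillna and dropna return new objects, so the original still contains its missing value until the result is assigned back to a name.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

filled = s.fillna(0)     # new Series with NaN replaced by 0
dropped = s.dropna()     # new Series without the NaN row

print(filled.tolist())           # [1.0, 0.0, 3.0]
print(len(dropped))              # 2
print(int(s.isnull().sum()))     # 1  (the original is unchanged)

# To keep the result, assign it to a name.
s_clean = s.fillna(s.mean())
print(s_clean.tolist())          # [1.0, 2.0, 3.0]
```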

Hierarchical Indexing

Hierarchical indexing enables using multiple (two or more) index levels on an axis

It provides a way to work with higher dimensional data in a lower dimensional form

Example

data = Series(np.random.randn(10),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [2010, 2011, 2012, 2010, 2011, 2012, 2010, 2011, 2011, 2012]])

data

a  2010    0.547634
   2011    0.792182
   2012   -0.821709
b  2010    0.172503
   2011    0.714497
   2012   -0.004165
c  2010   -0.095196
   2011    0.096810
d  2011    0.553003
   2012    0.167027
dtype: float64

Hierarchical Indexing

Example

Selecting the outer level a:

data['a']

2010    0.547634
2011    0.792182
2012   -0.821709
dtype: float64

Selecting the inner level 2011:

data[:, 2011]

a    0.792182
b    0.714497
c    0.096810
d    0.553003
dtype: float64

Summary Statistics by Level

We can summarize the results by each level of the index

Example

data.sum(level=0)

a    0.518107
b    0.882835
c    0.001614
d    0.720031
dtype: float64

data.sum(level=1)

2010    0.624941
2011    2.156492
2012   -0.658846
dtype: float64
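The level argument of sum reflects the pandas version used in these slides. As far as I know, recent pandas releases no longer accept it, and grouping by index level gives the same per-level totals; the sketch below uses that groupby form on a small made-up Series, together with unstack to show the data as a two-dimensional table.

```python
import pandas as pd

data = pd.Series([1.0, 2.0, 3.0, 4.0],
                 index=[['a', 'a', 'b', 'b'],
                        [2010, 2011, 2010, 2011]])

# Totals per outer label and per inner label.
by_outer = data.groupby(level=0).sum()
by_inner = data.groupby(level=1).sum()

print(by_outer.tolist())   # [3.0, 7.0]
print(by_inner.tolist())   # [4.0, 6.0]

# unstack moves the inner level into the columns.
wide = data.unstack()
print(wide.shape)          # (2, 2)
```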
