Introduction to Pandas Module 5

Post on 21-Jan-2022

Page 1: Module 5 Introduction to Pandas

WELCOME TO GENERAL ASSEMBLY

Introduction to Pandas
Module 5

Page 2: Module 5 Introduction to Pandas

Modules of this Course

Module 1: Foundations of Programming
Module 2: Introduction to Python
Module 3: Python Data Structures
Module 4: Intermediate Python
Module 5: Introduction to Pandas
Module 6: EDA with Pandas

Page 3: Module 5 Introduction to Pandas

Module 5: Lesson Objectives

After learning this module, you will be able to:

● Understand what data science is and the roles needed in data science
● Apply the data science workflow
● Understand programming tools for data science and how Pandas fits in
● Use Pandas to read in a dataset
● Get data information and summary statistics
● Investigate a dataset's integrity
● Filter, sort, and manipulate DataFrame series

3 | © 2021 True Digital Academy

Page 4: Module 5 Introduction to Pandas

Introduction to Pandas

Intro to Data Science

Page 5: Module 5 Introduction to Pandas

What is Data Science?

Data science encompasses:

● Framing the problem
● Collecting the raw data needed for your problem
● Processing the data for analysis
● Exploring the data
● Performing in-depth analysis
● Communicating results of the analysis

Page 6: Module 5 Introduction to Pandas

Data Science Examples

● Fraud and risk detection
● Healthcare
● A bank approving a credit card
● Recommendation systems in marketing and advertising

Page 7: Module 5 Introduction to Pandas

Data Science Examples

● The oil and gas industry can benefit from data science:
  ○ Exploration and discovery: rock types can be used to predict oil pockets.
  ○ Production accounting: production data can be linked with alarms.
  ○ Drilling and completions: predictive analytics can use geological, completion, and drilling data to determine the preferred drilling locations.
  ○ Equipment maintenance: real-time streaming data from rigs can be compared with historical drilling data to help predict and prevent problems and to better understand operational risks.

Page 8: Module 5 Introduction to Pandas

Conway Venn Diagram

Source: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Page 9: Module 5 Introduction to Pandas

Job Roles in Data Science

What does that break down to?

● Machine Learning Engineer
● Data Engineer
● Research Scientist
● Advanced Analyst

Page 10: Module 5 Introduction to Pandas

Machine Learning Engineer

● Identify machine learning applications.
● Work in production code.
● Manage infrastructure and data pipelines.
● “Straddle the line between knowing the mathematics and coding the mathematics.” (eBay VP of engineering Japjit Tulsi)

Production code

It is hard to give a general definition of production code, but a key difference from non-production code is that production code is read and executed by many other people, not just the person who wrote it. We should therefore aim for our code to be:

● Reproducible, because many people are going to run it.
● Modular and well-documented, because many people are going to read it.

Page 11: Module 5 Introduction to Pandas

Data Engineer

● Create the architecture that allows data acquisition and machine learning problems to run at scale.
● Focus on the algorithm and the analysis.
● Don't work much on the software side.

A data engineer is a person who designs, builds, and installs data systems. A data engineer is responsible for data flow, data pipelines, data storage, and data structure.

A data pipeline is a set of tools and activities for moving data from one system to another. It comprises four phases: ingest, store, process, and consume.

Page 12: Module 5 Introduction to Pandas

Research Scientist

● PhD-heavy field.
● Determines new algorithmic optimizations.
● Focused on driving scientific discovery.
● Less concerned with pursuing industrial applications.

Applied research scientists:

● Specialized research scientist.
● Backgrounds in both data science and computer science.
● Invaluable members of any AI team.
● “They can both pitch in on data science and write code. Finding a good applied research scientist is worth her weight in gold.” (Japjit Tulsi)

Page 13: Module 5 Introduction to Pandas

Advanced Analysts

● Quantitative-minded.
● Apply descriptive and inferential statistics, exploratory data analysis, and modeling to data.

Page 14: Module 5 Introduction to Pandas

Quick Review: Intro to Data Science

● Data science is the practice of:
  ○ Acquiring, organizing, and delivering complex data; discovering relationships and anomalies among variables.
  ○ Building and deploying machine learning models.
  ○ Synthesizing data to influence decision-making.
● Specific data science roles include:
  ○ Machine Learning Engineer
  ○ Data Engineer
  ○ Research Scientist
  ○ Advanced Analyst

Next: Data Science Workflow

Page 15: Module 5 Introduction to Pandas

Introduction to Pandas

Data Science Workflow

Page 16: Module 5 Introduction to Pandas

How Do We...

● Go through the data science workflow?
● Solve a data science problem?
● Craft a data science problem statement?

Focus on how to conduct exploratory data analysis (EDA).

Page 17: Module 5 Introduction to Pandas

The Data Science Workflow

Frame -> Prepare -> Analyze -> Interpret -> Communicate

● Frame: Develop hypothesis-driven questions for your analysis.
● Prepare: Get, understand, explore, and scrub your data.
● Analyze: Structure and visualize your data to find significant patterns and trends using statistical methods, and complete your analysis.
● Interpret: Make recommendations and business decisions from your data.
● Communicate: Present insights from your data to audiences.

Page 18: Module 5 Introduction to Pandas

Notes on the Steps

● Not hard-set rules.
● Really, problem-solving guidelines.

Every problem is different!

● Some projects may not require every step.
● It's normal to repeat certain steps a few times.
● The process is cyclical with new findings!

Page 19: Module 5 Introduction to Pandas

Step 1 is Always "Frame the Problem"

Solving a data science task starts with a clearly defined problem.

● Poor results stem from no defined goal.

“A problem well stated is half solved.” (Charles Kettering)

From there, you can apply your steps.

Page 20: Module 5 Introduction to Pandas

The Data Science Workflow: Applied

Assume that a clothing retail company, Data Science Wearables (DSW), is interested in improving its human resource operations.

You need to reduce the costs of staffing.

You have a table of DSW's current retail sales associates across department stores.

The first three rows look like this:

Job Level | Current Employee | Reason for Termination | Years of Service | Candidate Source | Previous Employer | School | Time to Fill (Days)
Associate | N | New offer | 1.5 | Referral | Jake's Hawaiian Shirts | University of Minnesota | 40
Associate | Y | N/A | 2.0 | Internship | N/A | University of Iowa | 15
Associate | No | Tardiness | 0.5 | Online | Hats and Caps | University of Nebraska | 25

Time to Fill (Days): How long did it take to fill this person's role? Typically, minimizing time to fill is key to lowering costs.

Page 21: Module 5 Introduction to Pandas

Step One: Frame

We know:

● We want to reduce costs associated with staffing.

We don't know:

● What drives up costs of staffing?
● Is there an underlying reason for those costs?
● What hypothesis can we test to reduce costs?

Page 22: Module 5 Introduction to Pandas

Step Two: Prepare

What questions do you have about the dataset?

Job Level | Current Employee | Reason for Termination | Years of Service | Candidate Source | Previous Employer | School | Time to Fill (Days)
Associate | N | New offer | 1.5 | Referral | Jake's Hawaiian Shirts | University of Minnesota | 40
Associate | Y | N/A | 2.0 | Internship | N/A | University of Iowa | 15
Associate | No | Tardiness | 0.5 | Online | Hats and Caps | University of Nebraska | 25

Callouts on the slide:

● N/A: missing values
● Inconsistencies (for example, Current Employee uses both "N" and "No")
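A minimal sketch of how these two issues could be surfaced with pandas. The rows are retyped by hand from the slide; the snake_case column names and the decision to treat "No" as "N" are assumptions made for illustration only.

```python
import pandas as pd

# Three sample rows retyped from the slide; N/A cells are entered as None.
# Column names are hypothetical.
staff = pd.DataFrame({
    "current_employee": ["N", "Y", "No"],
    "reason_for_termination": ["New offer", None, "Tardiness"],
    "previous_employer": ["Jake's Hawaiian Shirts", None, "Hats and Caps"],
})

# Count the missing values in each column.
missing_per_column = staff.isna().sum()
print(missing_per_column)

# List the distinct labels to spot inconsistent spellings.
print(staff["current_employee"].unique())

# Assumption: "No" and "N" mean the same thing, so map one onto the other.
staff["current_employee"] = staff["current_employee"].replace({"No": "N"})
print(staff["current_employee"].tolist())
```

Whether "No" really means "N" is a question for whoever owns the data; the code only shows where such a decision would be applied.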

Page 23: Module 5 Introduction to Pandas

Step Three: Analyze

We want to:

● Create meaning and conduct statistical description and inference.

For example, the average Years of Service is ~1.33 years.

● Could we build a machine learning model to predict this?
● The data could center on their background (school, previous employers, and application source).

For example, is the relationship between Time to Fill and Years of Service positive or negative?

● Positive: when one increases, the other increases.
● Negative: when one increases, the other decreases.
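Both questions can be checked on the three sample rows with a short pandas sketch. The column names are hypothetical, and with only three rows the correlation is an illustration of the idea, not evidence.

```python
import pandas as pd

# Years of Service and Time to Fill, retyped from the three sample rows.
staff = pd.DataFrame({
    "years_of_service": [1.5, 2.0, 0.5],
    "time_to_fill_days": [40, 15, 25],
})

# Average Years of Service (the slide quotes ~1.33).
mean_service = staff["years_of_service"].mean()
print(round(mean_service, 2))

# Pearson correlation: above 0 is a positive relationship, below 0 is negative.
corr = staff["time_to_fill_days"].corr(staff["years_of_service"])
print("negative" if corr < 0 else "positive or zero")
```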

Page 24: Module 5 Introduction to Pandas

Step Four: Interpret

How do our results compare to our initial hypothesis?

What concrete actions do we recommend?

Question: Even with an extremely limited dataset (n=3), can you identify hypothesis-validating or invalidating anecdotes?

At this stage, treat metrics and results like "check engine lights."

● Result summaries may point you in the right direction, but they do not necessarily explain the full context at hand.

Page 25: Module 5 Introduction to Pandas

Step Five: Communicate

Results are only as convincing as the way they are conveyed to key stakeholders!

Back up your statement with evidence, including statistical tests, visualizations, and model results.

Page 26: Module 5 Introduction to Pandas

Quick Review: Data Science Workflow

The data science workflow:

Frame -> Prepare -> Analyze -> Interpret -> Communicate

● Frame: Develop hypothesis-driven questions for your analysis.
● Prepare: Get, import, understand, explore, and scrub your data.
● Analyze: Structure and visualize your data to find significant patterns and trends using statistical methods, and complete your analysis.
● Interpret: Make recommendations and business decisions from your data.
● Communicate: Present insights from your data to audiences.

Next: Data Science Tools

Page 27: Module 5 Introduction to Pandas

Introduction to Pandas

Data Science Tools

Page 28: Module 5 Introduction to Pandas

Why Python for Data Science

Easy to write

● Data science is inherently a cross-functional discipline!
● A language for all audiences is key.

Open source

● New techniques become available daily!
● Developers from around the world race to implement new libraries.
● This places Python in contrast to closed source, paid data analysis tools like SAS and SPSS.

Often used for data analysis, scripting, and rapid software development.

Page 29: Module 5 Introduction to Pandas

Getting Data Science Tools

● We can analyze data to determine what Python is most used for:

Pandas?

● A Python package for exploratory analysis. The name comes from "panel data".
● Let's use it!

Page 30: Module 5 Introduction to Pandas

Your Data Science Development Tools

Python packages in DS are ubiquitous (universal):

● Reading CSVs, linear algebra, linear regressions, matrices...

Anaconda ("Conda"):

● Package manager.
● Downloads everything for us!

Follow these steps:

1. Download Anaconda: https://www.anaconda.com/download/. Select Python 3.7+ for your machine (macOS or PC).
2. Open the file. Follow the on-screen prompts. Don't hesitate to ask questions!

Page 31: Module 5 Introduction to Pandas

What Are We Downloading?

Pandas:

● The default tool for data exploration and manipulation in Python.

Jupyter Notebooks and JupyterLab:

● The preferred integrated development environments (IDEs) of data science.

● We'll write our code in this!

NumPy, SciPy, and more:

● Other packages for statistical inference, visualization, and parallelizing operations.

Page 32: Module 5 Introduction to Pandas

Why Jupyter Notebook?

Data science is both code and methods

What if we're missing many values?

● Do you fill in missing values with the mean or the median?
● Easy to create code cells next to text cells.

Easy to connect to remote computers (data centers).

● Thus, the Jupyter Notebook is in your browser!
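The mean-or-median question is a good example of code and method living side by side. A small sketch, using made-up ages with one outlier, shows how different the two fill values can be:

```python
import pandas as pd

# Made-up values: one missing entry and one outlier (90).
ages = pd.Series([22.0, 25.0, 24.0, None, 90.0])

# The mean is pulled up by the outlier; the median is not.
print(ages.mean())    # 40.25
print(ages.median())  # 24.5

filled_with_mean = ages.fillna(ages.mean())
filled_with_median = ages.fillna(ages.median())
print(filled_with_mean[3], filled_with_median[3])
```

In a notebook, the reasoning for choosing one over the other would go in a text cell right next to this code cell.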

Page 33: Module 5 Introduction to Pandas

Quick Review: Data Science Tools

● Pandas
  ○ A Python package for exploratory analysis.
● Jupyter Notebooks and JupyterLab:
  ○ The preferred integrated development environments (IDEs) of data science.
  ○ We'll write our code in this!
● Anaconda helps us download these. You only had to download it once!

Page 34: Module 5 Introduction to Pandas

Quick Review: Data Science Tools

● Data scientists:
  ○ Use data of all kinds (numbers, text, images).
  ○ Make explanations and predictive decisions.
● Data Science Workflow:
  ○ Frame -> Prepare -> Analyze -> Interpret -> Communicate.
● Jupyter Notebooks:
  ○ The industry tool!
  ○ Interactive with Python.

Next: Pandas

Page 35: Module 5 Introduction to Pandas

Introduction to Pandas:

Pandas

Page 36: Module 5 Introduction to Pandas

What is Pandas?

● A group of adorable bears 🐼🐼🐼
● A Python library for data manipulation.

Page 37: Module 5 Introduction to Pandas

So, Pandas the Library

The Swiss Army Knife of data manipulation!

Pandas:

● Is the library for exploratory data analysis (EDA).
● Formats, wrangles, cleans, and prepares our data.

Quick Backstory from 2009:

● A humble open source project for panel data (hence "pandas") from Wes McKinney.
● Historically, a Panel was the pandas object for holding 3-dimensional data; it has since been removed from the library.
● Don't let the term fool you: a panel is effectively the same thing as an Excel workbook (a collection of sheets).
● The 2-dimensional structure is a DataFrame (rows and columns).
● The 1-dimensional structure is a Series (a single column).
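The dimensionality of the two structures can be checked directly. This is a small sketch with made-up values; the variable names are illustrative.

```python
import pandas as pd

column = pd.Series([1, 2, 3], name="x")
table = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# ndim is the number of dimensions; shape gives the size along each one.
print(column.ndim, column.shape)  # 1 (3,)
print(table.ndim, table.shape)    # 2 (3, 2)

# Selecting one column of a DataFrame gives back a Series.
single = table["x"]
print(type(single).__name__)      # Series
```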

Page 38: Module 5 Introduction to Pandas

Quick Review: Pandas

● Exploratory Data Analysis (EDA) is the process of understanding our dataset, and producing our first level of insights.

● Pandas is a prominent Python library used for exploratory data analysis.

Next: Series and DataFrame

Page 39: Module 5 Introduction to Pandas

Introduction to Pandas:

Series and DataFrame

Page 40: Module 5 Introduction to Pandas

Import Pandas

import pandas as pd

Here pd is an alias: a short name used to refer to pandas in the rest of the code.

Page 41: Module 5 Introduction to Pandas

Pandas’ Data structures

pandas provides two data structures that shape data into a readable form:

● Series
● DataFrame

Page 42: Module 5 Introduction to Pandas

Series

A pandas Series is a one-dimensional data structure made up of key-value pairs: each value is stored under an index label (the key).

Page 43: Module 5 Introduction to Pandas

pd.Series()

To initialize a Series, use pd.Series():

import pandas as pd

##### INITIALIZATION #####
# STRING SERIES
fruits = pd.Series(["apples", "oranges", "bananas"])
print("MY FRUIT SERIES")
print(fruits, "\n")

Output:

MY FRUIT SERIES
0     apples
1    oranges
2    bananas
dtype: object

Page 44: Module 5 Introduction to Pandas

pd.Series()

Change the key column to user-defined keys by passing a list to the index argument of pd.Series().

import pandas as pd

##### INITIALIZATION #####
# FLOAT SERIES
temperature = pd.Series([32.6, 34.1, 28.0, 35.9], index=["a", "b", "c", "d"])
print("TEMPERATURE IN CELSIUS")
print(temperature, "\n")

Output:

TEMPERATURE IN CELSIUS
a    32.6
b    34.1
c    28.0
d    35.9
dtype: float64

Page 45: Module 5 Introduction to Pandas

Query a Series

To query a Series, use .iloc[] to look a value up by its integer position and .loc[] to look it up by its index label (the user-defined key). Plain [] looks values up by label; with the default integer index, labels and positions coincide, which is why fruits[1] returns the same value as fruits.iloc[1] here.

##### QUERY #####
# USING INDEX
print("2nd fruit: ", fruits.iloc[1])
# OR
print("2nd fruit: ", fruits[1], "\n")
# USING KEY
print("temperature at key \"b\": ", temperature.loc["b"])

Output:

2nd fruit:  oranges
2nd fruit:  oranges

temperature at key "b":  34.1
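The difference between position-based and label-based lookup is easiest to see when the index labels are integers that do not start at 0. The sketch below uses made-up data for that purpose.

```python
import pandas as pd

# Integer labels that differ from the positions 0, 1, 2.
readings = pd.Series(["low", "mid", "high"], index=[10, 20, 30])

first_by_position = readings.iloc[0]  # position 0
first_by_label = readings.loc[10]     # label 10
by_brackets = readings[10]            # plain [] uses the label here

print(first_by_position, first_by_label, by_brackets)
# readings[0] would raise a KeyError, because no label 0 exists.
```

Being explicit with .iloc[] and .loc[] avoids having to remember which rule plain [] follows.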

Page 46: Module 5 Introduction to Pandas

Dataframe

A pandas dataframe is a two-dimensional data-structure that can be thought of as a spreadsheet.

A dataframe can also be thought of as a combination of two or more series.
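As a rough illustration of that idea, two Series that share an index can be combined into a DataFrame, and each column can be pulled back out as a Series. The data is made up.

```python
import pandas as pd

jack = pd.Series(["apples", "oranges"], index=["a", "b"])
john = pd.Series(["guavas", "kiwis"], index=["a", "b"])

# Each Series becomes one column; the dictionary keys become column names.
fruits = pd.DataFrame({"Jack's": jack, "John's": john})
print(fruits)

# Selecting a column returns a Series again.
jack_again = fruits["Jack's"]
print(type(jack_again).__name__)
```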

Page 47: Module 5 Introduction to Pandas

Dataframe

To initialize a DataFrame, use pd.DataFrame():

import pandas as pd

##### INITIALIZATION #####
fruits_jack = ["apples", "oranges", "bananas"]
fruits_john = ["guavas", "kiwis", "strawberries"]
index = ["a", "b", "c"]
all_fruits = {"Jack's": fruits_jack, "John's": fruits_john}

fruits = pd.DataFrame(all_fruits, index=index)
print(fruits, "\n")

# reset_index(drop=True) replaces the custom keys with the default 0, 1, 2, ...
new_fruits = fruits.reset_index(drop=True)
print(new_fruits, "\n")

Output (fruits):

    Jack's        John's
a   apples        guavas
b  oranges         kiwis
c  bananas  strawberries

Output (new_fruits):

    Jack's        John's
0   apples        guavas
1  oranges         kiwis
2  bananas  strawberries

Page 48: Module 5 Introduction to Pandas

Query a Dataframe

##### QUERY #####
# USING INDEX
print("1st fruit:")
print(new_fruits.iloc[0], "\n")

# USING KEY
print("Fruits at key \"c\":")
print(fruits.loc["c"], "\n")

# USING COLUMN NAME
print("Jack's fruits: ")
print(fruits["Jack's"], "\n")

# CHAINED QUERY
print("Johns third fruit: ")
print(new_fruits["John's"][2], "\n")

Output:

1st fruit:
Jack's    apples
John's    guavas
Name: 0, dtype: object

Fruits at key "c":
Jack's         bananas
John's    strawberries
Name: c, dtype: object

Jack's fruits:
a     apples
b    oranges
c    bananas
Name: Jack's, dtype: object

Johns third fruit:
strawberries
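The chained query selects a column and then a row in two steps. A single-step alternative, sketched here on the same fruit data, passes the row and the column together to .loc[] (labels) or .iloc[] (positions).

```python
import pandas as pd

fruits = pd.DataFrame(
    {"Jack's": ["apples", "oranges", "bananas"],
     "John's": ["guavas", "kiwis", "strawberries"]},
    index=["a", "b", "c"],
)

# Row label and column label in one lookup.
by_label = fruits.loc["c", "John's"]

# Row position and column position in one lookup.
by_position = fruits.iloc[2, 1]

print(by_label, by_position)
```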

Page 49: Module 5 Introduction to Pandas

Quick Review: Series and DataFrame

● Pandas provides two data structures that shape data into a readable form:
  ○ A Series is a one-dimensional data structure.
  ○ A DataFrame is a two-dimensional data structure.
● To query a Series or DataFrame, use .iloc[] or .loc[].

Next: Data Input and Validation

Page 50: Module 5 Introduction to Pandas

Introduction to Pandas:

Data Input and Validation

Page 51: Module 5 Introduction to Pandas

Data Input

The easiest way to get data into a Python program is to read it from a file, and pandas provides functions to do this.

Pandas can read many kinds of files: csv, xls, xlsx, and so on.
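A self-contained sketch of reading CSV data: to avoid depending on a file on disk, the CSV text is held in a string and wrapped in io.StringIO, which pd.read_csv() accepts in place of a file path. The names and values are invented for illustration.

```python
import io
import pandas as pd

# Invented CSV content; in practice this would be a file such as "data.csv".
csv_text = """Name,Team,Age
Ann,Red,25
Bob,Blue,31
"""

# index_col uses the Name column as the row labels.
players = pd.read_csv(io.StringIO(csv_text), index_col="Name")
print(players)
print(players.loc["Bob", "Age"])
```

For Excel files, pandas offers pd.read_excel(), which relies on an additional engine package being installed.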

Page 52: Module 5 Introduction to Pandas

Reading a Dataset

Import the pandas library and then read the dataset:

# importing pandas package
import pandas as pd

# making data frame from csv file
df = pd.read_csv("nba.csv", index_col="Name")

Page 53: Module 5 Introduction to Pandas

dataframe.shape

The shape property is used to get a tuple representing the dimensionality of the DataFrame.

df.shape

Output:

(458, 8)

Here 458 is the number of rows and 8 is the number of columns.

Page 54: Module 5 Introduction to Pandas

Getting the First n Rows Using head()

The head() function is used to get the first n rows. By default, it returns the first 5 rows of the Dataframe.

df.head()

Page 55: Module 5 Introduction to Pandas

Getting Last n Rows Using tail()

The tail() function is used to get the last n rows. By default, it returns the last 5 rows of the DataFrame.

df.tail()
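Both functions take the number of rows as an optional argument. A quick sketch on a made-up ten-row DataFrame:

```python
import pandas as pd

numbers = pd.DataFrame({"n": range(10)})

first_three = numbers.head(3)  # rows 0, 1, 2
last_two = numbers.tail(2)     # rows 8, 9
default_head = numbers.head()  # 5 rows when n is not given

print(first_three)
print(last_two)
```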

Page 56: Module 5 Introduction to Pandas

Printing Info about the DataFrame Using info()

info() prints a summary of the DataFrame: the index range, the number of columns, each column's label, non-null count, and data type, and the memory usage.

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Index: 458 entries, Avery Bradley to nan
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Team      457 non-null    object
 1   Number    457 non-null    float64
 2   Position  457 non-null    object
 3   Age       457 non-null    float64
 4   Height    457 non-null    object
 5   Weight    457 non-null    float64
 6   College   373 non-null    object
 7   Salary    446 non-null    float64
dtypes: float64(4), object(4)

Page 57: Module 5 Introduction to Pandas

Changing Column Name

● Rename column / index names (labels): rename()
● Change multiple names (labels)
● Update the original object: inplace

df_new = df.rename(columns={'Team': 'BasketTeam'},
                   index={'Jeff Withey': 'changed_index'})
print(df_new)

Page 58: Module 5 Introduction to Pandas

Changing Column Name

● Rename column / index names (labels): rename()
● Change multiple names (labels)
● Update the original object: inplace

print(df.rename(columns={'Team': 'Col_1', 'Position': 'Col_3'}))

Page 59: Module 5 Introduction to Pandas

Changing Column Name

● Rename column / index names (labels): rename()
● Change multiple names (labels)
● Update the original object: inplace

df_copy = df.copy()

df_copy.rename(columns={'Team': 'Col_1'}, index={'Avery Bradley': 'Row_1'}, inplace=True)

print(df_copy)

Page 60: Module 5 Introduction to Pandas

Quick Review: Data Input and Validation

● Pandas makes it easy to read data from a file into a Python program and to get information about that data:
  ○ Read files using pd.read_csv()
  ○ Get the dimensionality of a DataFrame using df.shape
  ○ Get the first and last rows using df.head() and df.tail() respectively
  ○ Get information about a DataFrame using df.info()
  ○ Change column names and index labels using df.rename()

Next: Basic Analysis

Page 61: Module 5 Introduction to Pandas

Introduction to Pandas:

Basic Analysis

Page 62: Module 5 Introduction to Pandas

Summary Statistics using .describe()

The describe() function is used to generate descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.

data.describe()

For example, for a numeric Series:

s = pd.Series([2, 3, 4])
s.describe()

Output, with the meaning of each row:

count    3.0    number of values (3 numbers)
mean     3.0    mean or average
std      1.0    standard deviation
min      2.0    minimum value
25%      2.5    25th percentile
50%      3.0    50th percentile
75%      3.5    75th percentile
max      4.0    maximum value

Page 63: Module 5 Introduction to Pandas

Describing a Categorical Series

s = pd.Series(['p', 'p', 'q', 'r'])
s.describe()

Output, with the meaning of each row:

count     4    number of values (4 letters)
unique    3    number of unique letters (p, q, r)
top       p    the top (most frequent) letter is p
freq      2    p has the highest frequency (2)

Page 64: Module 5 Introduction to Pandas

Describing a Timestamp Series

import numpy as np
import pandas as pd

s = pd.Series([
    np.datetime64("2018-02-01"),
    np.datetime64("2019-02-01"),
    np.datetime64("2019-02-01")])
s.describe()

Output (as produced by an older pandas release; pandas 2.0 and later summarize datetime values like numeric data, reporting count, mean, min, quartiles, and max instead):

count                       3
unique                      2
top       2019-02-01 00:00:00
freq                        2
first     2018-02-01 00:00:00
last      2019-02-01 00:00:00
dtype: object

Page 65: Module 5 Introduction to Pandas

mean

The mean() function is used to return the mean of the values for the requested axis.

axis: {index (0), columns (1)}. With axis=0 the mean is taken down the rows, giving one value per column; with axis=1 it is taken across the columns, giving one value per row.

import pandas as pd

info = pd.DataFrame({"A": [8, 2, 7, 12, 6],
                     "B": [26, 19, 7, 5, 9],
                     "C": [10, 11, 15, 4, 3],
                     "D": [16, 24, 14, 22, 1]})
print(info)
print("*************")
info.mean(axis=0)

Output:

    A   B   C   D
0   8  26  10  16
1   2  19  11  24
2   7   7  15  14
3  12   5   4  22
4   6   9   3   1
*************
A     7.0
B    13.2
C     8.6
D    15.4
dtype: float64
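For comparison, a short sketch of the other axis on the same DataFrame: axis=1 averages across the columns, so the result has one value per row.

```python
import pandas as pd

info = pd.DataFrame({"A": [8, 2, 7, 12, 6],
                     "B": [26, 19, 7, 5, 9],
                     "C": [10, 11, 15, 4, 3],
                     "D": [16, 24, 14, 22, 1]})

# One mean per row: (8 + 26 + 10 + 16) / 4 = 15.0 for row 0, and so on.
row_means = info.mean(axis=1)
print(row_means.tolist())  # [15.0, 14.0, 10.75, 10.75, 4.75]
```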

Page 66: Module 5 Introduction to Pandas

sum

DataFrame.sum() function is used to return the sum of the values for the requested axis.

axis: {index (0), columns (1)}

import pandas as pd

info = pd.DataFrame({"A":[8, 2, 7, 12, 6],

"B":[26, 19, 7, 5, 9],

"C":[10, 11, 15, 4, 3],

"D":[16, 24, 14, 22, 1]})

print("The sum of A is ", info['A'].sum())

The sum of A is 35

Page 67: Module 5 Introduction to Pandas

max

DataFrame.max() function is used to find the maximum value along the axis.

import pandas as pd

info = pd.DataFrame({"A":[8, 2, 7, 12, 6],

"B":[26, 19, 7, 5, 9],

"C":[10, 11, 15, 4, 3],

"D":[16, 24, 14, 22, 1]})

print("The max of A is ", info['A'].max())

The max of A is 12

Page 68: Module 5 Introduction to Pandas

Get Maximum Values of Every Column

import pandas as pd

info = pd.DataFrame({"A": [8, 2, 7, 12, 6],
                     "B": [26, 19, 7, 5, 9],
                     "C": [10, 11, 15, 4, 3],
                     "D": [16, 24, 14, 22, 1]})

# info.max() returns the maximum of every column, not just A
print("The max of each column is:")
print(info.max())

Output:

The max of each column is:
A    12
B    26
C    15
D    24
dtype: int64

Page 69: Module 5 Introduction to Pandas

Get Maximum Values of Every Row

import pandas as pd

info = pd.DataFrame({"A":[8, 2, 7, 12, 6],

"B":[26, 19, 7, 5, 9],

"C":[10, 11, 15, 4, 3],

"D":[16, 24, 14, 22, 1]})

print(info.max(axis=1))

Output:

0    26
1    24
2    15
3    22
4     9
dtype: int64

Next: value_counts()

Page 70: Module 5 Introduction to Pandas

value_counts

The value_counts() function returns an object containing counts of unique values.

Let's explore the Titanic dataset.

# Importing necessary libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

%matplotlib inline

df = pd.read_csv('titanic.csv')

Page 71: Module 5 Introduction to Pandas

Titanic Dataset

VARIABLE DESCRIPTIONS

1. Pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
2. Survived: Survival (0 = No; 1 = Yes)
3. Name: Name
4. Sex: Sex
5. Age: Age
6. SibSp: Number of Siblings/Spouses Aboard
7. Parch: Number of Parents/Children Aboard
8. Ticket: Ticket Number
9. Fare: Passenger Fare (British pound)
10. Cabin: Cabin
11. Embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
12. Boat: Lifeboat
13. Body: Body Identification Number
14. Home.dest: Home/Destination

df.head()

Page 72: Module 5 Introduction to Pandas

Calling the value_counts() on the Sex Column of the Dataset

df['Sex'].value_counts()

male      577
female    314

df['Embarked'].value_counts()

S    644
C    168
Q     77

We can quickly see that most passengers embarked from Southampton, followed by Cherbourg and then Queenstown.

Page 73: Module 5 Introduction to Pandas

value_counts() with Relative Frequencies of the Unique Values

Sometimes, a percentage of the total is a better criterion than the count.

By setting normalize=True, the object returned will contain the relative frequencies of the unique values.

df['Embarked'].value_counts(normalize=True)

S    0.724409
C    0.188976
Q    0.086614

Page 74: Module 5 Introduction to Pandas

value_counts() in Ascending Order

To sort the results in ascending order, set the ascending parameter to True.

df['Embarked'].value_counts(ascending=True)

Q     77
C    168
S    644

Page 75: Module 5 Introduction to Pandas

value_counts() with NaN Values

By default, null values are excluded from the counts. This can be changed by setting dropna=False.

df['Embarked'].value_counts(dropna=False)

S      644
C      168
Q       77
NaN      2
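The same options can be tried without the Titanic file. The sketch below uses a tiny made-up Series of port codes with one missing value.

```python
import pandas as pd

# Made-up port codes; None stands for a missing value.
ports = pd.Series(["S", "S", "S", "C", "Q", None])

counts = ports.value_counts()                     # missing value ignored
shares = ports.value_counts(normalize=True)       # fractions of the non-missing total
smallest_first = ports.value_counts(ascending=True)
with_missing = ports.value_counts(dropna=False)   # missing value counted too

print(counts["S"], shares["S"])
print(smallest_first.iloc[-1])
print(with_missing.sum())
```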

Page 76: Module 5 Introduction to Pandas

value_counts() with Bins

value_counts() can also be used to bin continuous data into discrete intervals with the help of the bins parameter.

This option works only with numerical data.

# applying value_counts on a numerical column
df['Fare'].value_counts()

This doesn't convey much, since the call above counts every distinct Fare amount. Instead, let's group the fares into 7 bins.

Page 77: Module 5 Introduction to Pandas

Grouping Fares into 7 Bins.

df['Fare'].value_counts(bins=7)

We can easily see that most passengers paid less than 73.19 for their ticket.
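A minimal sketch of binning on made-up fares, so it runs without the dataset. The values and the choice of three bins are illustrative.

```python
import pandas as pd

# Made-up fares: four cheap tickets and one expensive one.
fares = pd.Series([5.0, 7.5, 8.0, 12.0, 80.0])

# Split the range of fares into 3 equal-width intervals and count each one.
binned = fares.value_counts(bins=3)
print(binned)

# The result is sorted by count, so the first entry is the fullest bin.
print(binned.iloc[0])
```

The index of the result holds the intervals themselves, which is how the upper edge of the lowest bin can be read off.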

Next: sort_values()

Page 78: Module 5 Introduction to Pandas

Sorting Data Frame Using sort_values()

The pandas sort_values() function sorts a DataFrame in ascending or descending order of the column passed to it.

Page 79: Module 5 Introduction to Pandas

Sorting by Name

# sorting data frame by name
df.sort_values("Name", axis=0, ascending=True,
               inplace=True, na_position='last')

# display
df

Page 80: Module 5 Introduction to Pandas

Changing Position of Null Values

With na_position='first', null values are kept at the top.

# sorting data frame by Cabin
df.sort_values("Cabin", axis=0, ascending=True,
               inplace=True, na_position='first')

# display
df
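A small sketch of the same calls on made-up data. It leaves out inplace=True, so sort_values() returns a new sorted DataFrame and the original stays unchanged.

```python
import pandas as pd

# Made-up passengers; one cabin is missing.
people = pd.DataFrame({
    "name": ["Cara", "Abe", "Bea"],
    "cabin": ["C2", None, "A1"],
})

# Ascending by name (the default direction).
by_name = people.sort_values("name")

# Descending by cabin, with the missing cabin placed first.
by_cabin = people.sort_values("cabin", ascending=False, na_position="first")

print(by_name["name"].tolist())   # ['Abe', 'Bea', 'Cara']
print(by_cabin["name"].tolist())  # ['Abe', 'Cara', 'Bea']
```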

Next: Boolean masking

Page 81: Module 5 Introduction to Pandas

Boolean Masking

Boolean masking is used to filter data.

Let's filter the Titanic dataset again!

# Importing necessary libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

%matplotlib inline

df = pd.read_csv('titanic.csv')

Page 82: Module 5 Introduction to Pandas

Masking Data Based on Column Value

Using a comparison operator for filtering of data: ==, >, <, <=, >=

# Importing necessary libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

%matplotlib inline

df = pd.read_csv('titanic.csv')

print(df.shape)

#mask

print(df['Age'] >25)

Output:

(891, 12)
0      False
1       True
2       True
3       True
4       True
       ...
886     True
887    False
888    False
889     True
890     True
Name: Age, Length: 891, dtype: bool

Page 83: Module 5 Introduction to Pandas

Applying a Boolean Mask (age > 25)

When we apply a boolean mask to a DataFrame, we get back only the rows where the mask is True.

# Importing necessary libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

%matplotlib inline

df = pd.read_csv('titanic.csv')

print(df.shape)

mask = df['Age'] > 25

print(df[mask])

Page 84: Module 5 Introduction to Pandas

Applying a Boolean Mask (Embarked == ‘S’)

mask = df['Embarked'] == 'S'

print(df[mask]) # print(df[df['Embarked'] == 'S'])

Page 85: Module 5 Introduction to Pandas

Combine Conditions Using Boolean Operators

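Combine conditions with Python's bitwise logic operators: & (and), | (or), ^ (xor), and ~ (not). Wrap each condition in parentheses, because these operators bind more tightly than comparison operators.

# | means "or": keep rows where either condition is True
mask = (df['Embarked'] == 'S') | (df["Age"] > 25)
print(df[mask])

# print(df[(df['Embarked'] == 'S') | (df["Age"] > 25)])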
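The other operators work the same way. This sketch uses a four-row made-up table in place of the Titanic file, and also shows that a comparison against a missing Age evaluates to False.

```python
import pandas as pd

# Made-up passengers; one Age is missing.
passengers = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Embarked": ["S", "S", "S", "Q"],
})

over_25 = passengers["Age"] > 25       # the missing Age compares as False
from_s = passengers["Embarked"] == "S"

both = passengers[from_s & over_25]    # and: embarked at S and older than 25
not_s = passengers[~from_s]            # not: did not embark at S

print(over_25.tolist())
print(both.index.tolist(), not_s.index.tolist())
```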

Next: String Manipulation

Page 86: Module 5 Introduction to Pandas

String Manipulations in DataFrame

Series.str can be used to access the values of a Series as strings and apply string methods to them.

The pandas Series.str.contains() function tests whether a pattern or regex is contained within each string of a Series or Index.

import pandas as pd

df = pd.DataFrame({

'name': ['alice smith','bob jones','charlie joneson','daisy white'],

'age': [25,20,30,35]

})

Page 87: Module 5 Introduction to Pandas

Select by Partial String

import pandas as pd

df = pd.DataFrame({
    'name': ['alice smith', 'bob jones', 'charlie joneson', 'daisy white'],
    'age': [25, 20, 30, 35]
})

# .str.contains('jones') keeps the rows whose name contains 'jones';
# set regex=False to treat the pattern as a literal string, for better performance
df[df['name'].str.contains('jones', regex=False)]

Page 88: Module 5 Introduction to Pandas

Select by Regular Expression

import pandas as pd

df = pd.DataFrame({
    'name': ['alice smith', 'bob jones', 'charlie joneson', 'daisy white'],
    'age': [25, 20, 30, 35]
})

# names starting with 'b' or 'd'
# ('^b|d' would also match any name that merely contains a 'd')
df[df['name'].str.contains('^[bd]')]

Page 89: Module 5 Introduction to Pandas

Concatenate String Columns

import pandas as pd

df = pd.DataFrame({
    'first_name': ['alice', 'bob', 'charlie', 'daisy'],
    'last_name': ['smith', 'jones', 'joneson', 'white'],
    'age': [25, 20, 30, 35]
})

# add the two columns, with a space between the names
df['full_name'] = df['first_name'] + ' ' + df['last_name']
df

Next: apply()

Page 90: Module 5 Introduction to Pandas

apply()

Series.apply() lets users pass a function and apply it to every single value of a pandas Series.
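A minimal sketch of the idea, using a small made-up Series and a hypothetical helper function named initials:

```python
import pandas as pd

names = pd.Series(["alice smith", "bob jones"])

def initials(full_name):
    # Take the first letter of each word and upper-case it.
    return "".join(part[0].upper() for part in full_name.split())

# apply() calls initials() once per value and collects the results.
result = names.apply(initials)
print(result.tolist())  # ['AS', 'BJ']
```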

Page 91: Module 5 Introduction to Pandas

Split String Column

import pandas as pd

df = pd.DataFrame({
    'name': ['alice smith', 'bob jones', 'charlie joneson', 'daisy white'],
    'age': [25, 20, 30, 35]
})

# a function that takes the value and returns
# a series with as many columns as you want
def split_name(name):
    first_name, last_name = name.split(' ')
    return pd.Series({
        'first_name': first_name,
        'last_name': last_name
    })

# df_new has the new columns
df_new = df['name'].apply(split_name)
df_new

The slide shows two tables: df (the original data) and df_new (the new first_name and last_name columns).

Page 92: Module 5 Introduction to Pandas

Split String Column

import pandas as pd

df = pd.DataFrame({
    'name': ['alice smith', 'bob jones', 'charlie joneson', 'daisy white'],
    'age': [25, 20, 30, 35]
})

# a function that takes the value and returns
# a series with as many columns as you want
def split_name(name):
    first_name, last_name = name.split(' ')
    return pd.Series({
        'first_name': first_name,
        'last_name': last_name
    })

# df_new has the new columns
df_new = df['name'].apply(split_name)

# append the columns to the original dataframe
df_final = pd.concat([df, df_new], axis=1)

The slide shows three tables: df, df_new, and df_final. In df_final, two new columns were created by splitting name into two.
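As an alternative sketch, and not the approach shown on the slide, the same split can be done with the built-in string method Series.str.split(), where expand=True spreads the pieces into separate columns. The shortened data is for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["alice smith", "bob jones"],
    "age": [25, 20],
})

# n=1 splits at the first space only; expand=True returns a DataFrame.
parts = df["name"].str.split(" ", n=1, expand=True)
parts.columns = ["first_name", "last_name"]

# Append the new columns to the original DataFrame.
df_final = pd.concat([df, parts], axis=1)
print(df_final)
```

This avoids writing a helper function, at the cost of naming the new columns by hand.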

Page 93: Module 5 Introduction to Pandas

Quick Review: Basic Analysis

● The describe() function produces summary statistics.
● The mean() function is used to return the mean of the values for the requested axis.
● The DataFrame.sum() function is used to return the sum of the values for the requested axis.
● The DataFrame.max() function is used to find the maximum value along the axis.
● The value_counts() function returns an object containing counts of unique values.
● The pandas sort_values() function sorts a DataFrame in ascending or descending order of the column passed to it.
● Boolean masking is used to filter data.
● Series.str can be used to access the values of a Series as strings and apply string methods to them.

Page 94: Module 5 Introduction to Pandas

Module Summary

We’ve learned to:

● Understand what data science is and the roles needed in data science
● Apply the data science workflow
● Understand programming tools for data science and how Pandas fits in
● Use Pandas to read in a dataset
● Get data information and summary statistics
● Investigate a dataset's integrity
● Filter, sort, and manipulate DataFrame series

Page 95: Module 5 Introduction to Pandas

THANK YOU! See you next time!
