data analysis with pandas
![Page 1: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/1.jpg)
Data Analysis with Pandas
![Page 2: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/2.jpg)
When you think of Python...
![Page 3: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/3.jpg)
Meet Jupyter Notebook
![Page 4: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/4.jpg)
And me
job_title != "Developer"
I’m a Consultant at Distilled (since September 2015)
I do build some software in Python
But I mainly use it for data analysis
![Page 5: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/5.jpg)
Getting Started
![Page 6: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/6.jpg)
Python for scientific computing
Huge community
Fantastic ecosystem of packages other people have written
Can be tedious to actually install everything
![Page 7: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/7.jpg)
Just use this! (https://continuum.io/downloads)
![Page 8: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/8.jpg)
What is Anaconda?
Essentially a large (~400 MB) Python installation
But contains everything* you need for data analysis
Unless you have a special reason not to, you should just install and use this
*OK, technically not true, but it has everything you’re likely to need
![Page 9: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/9.jpg)
You need the command line (but only for a minute)
On Windows, open PowerShell
On a Mac, open Terminal or iTerm2
![Page 10: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/10.jpg)
Just one line, though:
1. Just type “jupyter notebook”
2. Wait
3. ...
![Page 11: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/11.jpg)
Back to safety
![Page 12: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/12.jpg)
Open a new Notebook
![Page 13: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/13.jpg)
Your very own data analysis environment
![Page 14: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/14.jpg)
So that was fairly easy...
![Page 15: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/15.jpg)
but why is it better than Excel?
![Page 16: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/16.jpg)
There’s not enough room to list everything, but:
1. Handle larger data sets: there is no fixed row limit, only the memory on your machine
2. Combine multiple files and data sources together instantaneously. Pull data straight from APIs or scraping
3. Everything is completely customisable—if you can imagine a query, it can be done (though not always easily)
4. It’s a safe place to mess things up
5. Keeps a record of your workflow—retrace your steps
![Page 17: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/17.jpg)
...and it’s the perfect playground for learning Python
![Page 18: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/18.jpg)
Side note: don’t know any Python?
![Page 19: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/19.jpg)
Can’t cover it all today, so go here:
1. Learn Python the Hard Way (free)
2. Real Python ($60, but good)
3. Writing Idiomatic Python (~$15)
![Page 20: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/20.jpg)
Unless you’re building applications:
1. Stick with the small building blocks
2. Learn how to write a function (we’ll do this today)
3. Learn about loops, conditional statements, and handling data
4. Probably no need to learn about managing projects and apps
![Page 21: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/21.jpg)
Jupyter Notebook
Save notebooks for later
Run and re-run Python code
Really cool features like post-mortem debugging if you make a mistake
![Page 22: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/22.jpg)
Cells
1. Type all the code you want
2. Shift+Enter to run it
3. View the result
![Page 23: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/23.jpg)
The Star of the Show
Now that we have our Jupyter Notebook up and running, you can start playing around with almost any Python code
We’re going to look at Pandas, though: a data analysis library written in Python
Started its life in finance
Great for fast, flexible computation
![Page 24: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/24.jpg)
A little setup, first
You’ll do this more or less at the beginning of each session
It’ll become second nature; just import the workhorse libraries we always use: numpy, pandas, pyplot.
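As a rough sketch of that setup cell, these are the conventional aliases most people use for the three libraries named on the slide. The `%matplotlib inline` line is a notebook-only command, so it appears here as a comment.

```python
# Typical first cell of a notebook session.
import numpy as np                # fast numeric arrays
import pandas as pd               # DataFrames and file readers
import matplotlib.pyplot as plt   # charts

# In a Jupyter Notebook you would usually also run this so charts
# appear under the cell (it is not plain Python, hence the comment):
# %matplotlib inline

print(pd.__version__)
```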
![Page 25: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/25.jpg)
The DataFrame
If you’re used to spreadsheets, the DataFrame isn’t too difficult to understand
It’s the fundamental, flexible building block in Pandas
![Page 26: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/26.jpg)
At its simplest, it looks rather like a spreadsheet would
The only obvious difference from Excel is the column labels, which are numeric by default instead of A, B, C...
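To see the numeric labels for yourself, here is a minimal sketch that builds a DataFrame from a plain grid of numbers. The values are made up for illustration; they are not the data shown on the slide.

```python
import numpy as np
import pandas as pd

# A 3x4 grid of numbers with no labels supplied, so pandas
# numbers both the rows and the columns starting from 0.
df = pd.DataFrame(np.arange(12).reshape(3, 4))

print(df)
print(list(df.columns))  # [0, 1, 2, 3]

# You can give the columns friendlier names at any time.
df.columns = ["a", "b", "c", "d"]
print(df)
```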
![Page 27: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/27.jpg)
You’ll usually create them from some other source:
The Pandas library provides some nice functions for importing from common file formats and databases, so you won’t usually be building them “by hand”:
1. pd.read_csv()
2. pd.read_table()
3. pd.read_sql()
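A small sketch of all three readers follows. So that it runs anywhere, it reads from in-memory text and an in-memory SQLite database instead of real files; the keyword data and the table name are invented for illustration.

```python
import io
import sqlite3

import pandas as pd

# read_csv: comma-separated text (a file path works the same way).
csv_text = "keyword,volume\nhotels in boston,1200\ncheap hotels,5400\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# read_table: the same idea, but tab-separated by default.
tsv_text = "keyword\tvolume\nhotels in boston\t1200\ncheap hotels\t5400\n"
from_table = pd.read_table(io.StringIO(tsv_text))

# read_sql: run a query against a database connection.
conn = sqlite3.connect(":memory:")
from_csv.to_sql("keywords", conn, index=False)
from_sql = pd.read_sql(
    "SELECT keyword, volume FROM keywords WHERE volume > 2000", conn
)
conn.close()

print(from_csv)
print(from_sql)
```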
![Page 28: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/28.jpg)
Reading a CSV
We have so much data stored in CSVs
Our first function call will just read some data into a DataFrame, where we can analyse it
![Page 29: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/29.jpg)
Get help at any time with Shift+Tab
![Page 30: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/30.jpg)
1. pd.read_csv() will read in the data
2. Fields are separated by tabs
3. The encoding is UTF-16 (don’t ask…)
4. The whole result is assigned to the variable ‘df’
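The call on the slide is not in the transcript, so this is a sketch of what those four points look like together. The file name, the `URL` column, and the sample rows are assumptions; the sketch writes a tiny tab-separated UTF-16 file first so that it can run on its own.

```python
import os
import tempfile

import pandas as pd

# Create a small stand-in for the export (hypothetical content).
sample = (
    "URL\tLink Active?\n"
    "http://example.com/a\tTrue\n"
    "http://example.com/b\tFalse\n"
)
path = os.path.join(tempfile.mkdtemp(), "links.csv")
with open(path, "w", encoding="utf-16") as f:
    f.write(sample)

# Read it: tab-separated fields, UTF-16 encoding, result stored in df.
df = pd.read_csv(path, sep="\t", encoding="utf-16")
print(df)
```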
![Page 31: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/31.jpg)
Get a quick sense of the data (658k rows, here)
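The exact command on the slide is not in the transcript. A few calls that commonly serve this purpose are sketched below on a small made-up DataFrame.

```python
import pandas as pd

# Made-up stand-in data; the real file had around 658k rows.
df = pd.DataFrame({
    "URL": ["http://example.com/%d" % i for i in range(10)],
    "Link Active?": [i % 2 == 0 for i in range(10)],
})

print(df.shape)     # (number of rows, number of columns)
print(df.head(3))   # first three rows
df.info()           # column types and non-null counts
```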
![Page 32: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/32.jpg)
See the columns
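As a sketch, the column names live on the `columns` attribute, and `dtypes` shows what kind of data each one holds. The column names here are illustrative, apart from `Link Active?`, which appears later in the talk.

```python
import pandas as pd

df = pd.DataFrame({
    "URL": ["http://example.com/a", "http://example.com/b"],
    "Anchor Text": ["click here", "read more"],
    "Link Active?": [True, False],
})

print(df.columns.tolist())  # column names as a plain list
print(df.dtypes)            # data type of each column
```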
![Page 33: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/33.jpg)
Filtering
![Page 34: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/34.jpg)
What’s happening there?
df['Link Active?'] is:
1. Checking that whole column for values that are True or False
2. Returning an array of True/False values
3. This is fast, and lets us filter in an amazing variety of ways
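The filtering code itself is only in the slide image, so here is a minimal sketch of the idea on made-up rows: a column of True/False values acts as a mask, and passing it back into the DataFrame keeps only the rows marked True.

```python
import pandas as pd

df = pd.DataFrame({
    "URL": ["http://example.com/a", "http://example.com/b", "http://example.com/c"],
    "Link Active?": [True, False, True],
})

# One True/False value per row.
mask = df["Link Active?"]

active = df[mask]    # rows where the mask is True
broken = df[~mask]   # ~ flips the mask, giving the rest

print(active)
print(broken)
```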
![Page 35: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/35.jpg)
Filtering (again)
![Page 36: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/36.jpg)
We’re probably ready for this one, now:
![Page 37: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/37.jpg)
Example project: Getting data from SEMRush
![Page 38: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/38.jpg)
Writing your own function
![Page 39: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/39.jpg)
Call our function, get a DataFrame!
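The function from the talk is not in the transcript, and this sketch does not call the SEMRush API. It only shows the shape of the idea: wrap the parsing step in a function that hands back a DataFrame. The function name, the semicolon delimiter, and the sample report text are all assumptions made for illustration.

```python
import io

import pandas as pd


def report_to_frame(raw_text, sep=";"):
    """Turn delimited report text into a DataFrame (hypothetical helper)."""
    return pd.read_csv(io.StringIO(raw_text), sep=sep)


# Made-up text standing in for whatever the data source returns.
raw = (
    "Keyword;Position;Search Volume\n"
    "hotels in boston;3;1200\n"
    "cheap hotels;12;5400\n"
)

df = report_to_frame(raw)
print(df)
```

Once the parsing lives in a function, fetching a second or third report is one more call instead of a copied cell.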
![Page 40: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/40.jpg)
Write to disk in case anything goes wrong
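A sketch of that safety step, assuming CSV as the format (the slide may have used something else). The file name and data are hypothetical.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "Keyword": ["hotels in boston", "cheap hotels"],
    "Search Volume": [1200, 5400],
})

# Save a copy; index=False leaves out the row numbers.
path = os.path.join(tempfile.mkdtemp(), "semrush_backup.csv")
df.to_csv(path, index=False, encoding="utf-8")

# Later, or after a crash, load it straight back.
restored = pd.read_csv(path)
print(restored)
```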
![Page 41: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/41.jpg)
Reading in multiple files
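One common way to do this, sketched under the assumption that the files share the same columns: find them with `glob`, read each one, and stack them with `pd.concat`. The folder, file names, and the extra `source_file` column are illustrative.

```python
import glob
import os
import tempfile

import pandas as pd

# Create two small files so the example can run on its own.
folder = tempfile.mkdtemp()
pd.DataFrame({"Keyword": ["a", "b"], "Search Volume": [10, 20]}).to_csv(
    os.path.join(folder, "report_1.csv"), index=False
)
pd.DataFrame({"Keyword": ["c", "d"], "Search Volume": [30, 40]}).to_csv(
    os.path.join(folder, "report_2.csv"), index=False
)

frames = []
for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
    frame = pd.read_csv(path)
    frame["source_file"] = os.path.basename(path)  # remember where each row came from
    frames.append(frame)

# Stack everything into one DataFrame with a fresh row index.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```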
![Page 42: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/42.jpg)
Apply custom filters
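The filters used in the talk are not in the transcript. As an illustration of the technique, here is a sketch where a small hand-written function decides which rows to keep, alongside a built-in text match. The keywords, the question-word list, and the volume threshold are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "Keyword": [
        "how to book a hotel",
        "cheap hotels boston",
        "what is a boutique hotel",
        "hotels near me",
    ],
    "Search Volume": [300, 5400, 150, 9000],
})

QUESTION_WORDS = {"how", "what", "why", "where", "when", "who"}


def is_question(keyword):
    """Return True when the keyword starts with a question word."""
    words = keyword.lower().split()
    return bool(words) and words[0] in QUESTION_WORDS


# Custom rule: apply() runs the function on every keyword.
questions = df[df["Keyword"].apply(is_question)]

# Built-in rules combined with &: text match plus a numeric cut-off.
big_hotel_terms = df[
    df["Keyword"].str.contains("hotel", case=False)
    & (df["Search Volume"] >= 1000)
]

print(questions)
print(big_hotel_terms)
```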
![Page 43: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/43.jpg)
Drill down into individual words:
Counter() will save you a huge amount of work
Here we wanted to home in on modifier words
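A minimal sketch of the word-counting idea, using `Counter` from the standard library on a handful of made-up keywords. The list of core terms to ignore is an assumption; in practice you would pick your own.

```python
from collections import Counter

import pandas as pd

keywords = pd.Series([
    "cheap hotels in boston",
    "luxury hotels in boston",
    "cheap hotels near me",
    "boston hotels with pool",
])

# Split every keyword into words and count them all in one go.
word_counts = Counter(
    word for keyword in keywords for word in keyword.lower().split()
)

# Drop the core terms so only the modifier words remain.
core_terms = {"hotels", "boston", "in"}
modifiers = Counter(
    {word: n for word, n in word_counts.items() if word not in core_terms}
)

print(word_counts.most_common(2))
print(modifiers.most_common(1))  # [('cheap', 2)]
```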
![Page 44: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/44.jpg)
More detailed questions
How local are the searches?
Do people search by state code or full name?
Do people search by hotel category?
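One way to approach the state question, sketched on invented keywords with a two-state lookup: match whole words with a regular expression and count how many keywords use each form. The lookup table and the data are assumptions, not the analysis from the talk.

```python
import pandas as pd

keywords = pd.Series([
    "hotels in boston ma",
    "boston massachusetts hotels",
    "5 star hotels austin tx",
    "cheap hotels texas",
    "hotels near me",
])

# Hypothetical lookup of state code -> full name.
states = {"ma": "massachusetts", "tx": "texas"}

# \b marks a word boundary, so "ma" does not match inside "massachusetts".
code_pattern = r"\b(?:" + "|".join(states.keys()) + r")\b"
name_pattern = r"\b(?:" + "|".join(states.values()) + r")\b"

uses_code = keywords.str.contains(code_pattern, regex=True)
uses_name = keywords.str.contains(name_pattern, regex=True)

summary = pd.Series({
    "state code": int(uses_code.sum()),
    "full name": int(uses_name.sum()),
    "neither": int((~uses_code & ~uses_name).sum()),
})
print(summary)
```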
![Page 45: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/45.jpg)
Second example: Custom Rank Tracking Charts
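The charts from the talk are only in the slide images. As a rough sketch of what a custom rank chart can look like, this reshapes made-up ranking data so each keyword becomes a line, then flips the y-axis so position 1 sits at the top. The keywords, dates, and positions are invented.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; in a notebook you can leave this out
import matplotlib.pyplot as plt
import pandas as pd

ranks = pd.DataFrame({
    "date": pd.to_datetime(
        ["2016-01-01", "2016-01-08", "2016-01-15"] * 2
    ),
    "keyword": ["cheap hotels"] * 3 + ["hotels in boston"] * 3,
    "position": [12, 9, 7, 3, 4, 2],
})

# One row per date, one column per keyword.
pivoted = ranks.pivot(index="date", columns="keyword", values="position")

ax = pivoted.plot(marker="o")
ax.invert_yaxis()            # rank 1 should be at the top
ax.set_ylabel("Position")
ax.set_title("Rank tracking (sample data)")

fig = ax.get_figure()
```

In a notebook the chart appears under the cell; elsewhere, `fig.savefig("ranks.png")` writes it to disk.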
![Page 46: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/46.jpg)
Where to begin?
If you don’t know Python, start with those books I shared earlier.
If you do, check out Python for Data Analysis
Keep Jupyter Notebook open at all times
Experiment!
![Page 47: Data analysis with pandas](https://reader036.vdocument.in/reader036/viewer/2022062310/58ec65b81a28ab80438b4661/html5/thumbnails/47.jpg)
Questions?