Sarah Guido
@sarah_guido
Reonomy
OSCON 2014
ANALYZING DATA WITH PYTHON
- Data scientist at Reonomy
- University of Michigan graduate
- NYC Python organizer
- PyGotham organizer
ABOUT ME
- Bird’s-eye overview: not a comprehensive explanation of these tools!
- Take data from start to finish
  - Preprocessing: Pandas
  - Analysis: scikit-learn
  - Analysis: NLTK
  - Data pipeline: MRJob
  - Visualization: matplotlib
- What next?
ABOUT THIS TALK
- So many tools
  - Preprocessing, analysis, statistics, machine learning, natural language processing, network analysis, visualization, scalability
- Community support
- “Easy” language to learn
- Both a scripting language and a production-ready language
WHY PYTHON?
- How to find the best tool(s)?
- The 90/10 rule
- Simple is better than complex
FROM POINT A TO POINT…X?
- Available resources
  - Documentation, tutorials, books, videos
- Ease of use (with a grain of salt)
- Community support and continuous development
- Widely used
WHY I CHOSE THESE TOOLS
- The importance of data preprocessing
  - AKA wrangling, munging, manipulating, and so on
- Preprocessing is also getting to know your data
  - Missing values? Categorical/continuous? Distribution?
PREPROCESSING
- Data analysis and modeling
- Similar to R and Excel
- Easy-to-use data structures
  - DataFrame
- Data wrangling tools
  - Merging, pivoting, etc.
PANDAS
- Keep everything in Python
- Community support/resources
- Use for preprocessing
  - File I/O, cleaning, manipulation, etc.
- Combinable with other modules
  - NumPy, SciPy, statsmodels, matplotlib
PANDAS
File I/O
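
A minimal sketch of file I/O with Pandas. The CSV content and column names here are made up for illustration, and an in-memory buffer stands in for a file on disk; `read_csv` also accepts a file path.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """name,borough,price
A,Brooklyn,100
B,Queens,
C,Brooklyn,250
"""

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())

# With no path given, to_csv returns the CSV as a string;
# pass a filename instead to write to disk.
out = df.to_csv(index=False)
```

The empty `price` field in the second row is read in as a missing value (`NaN`).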
PANDAS
Finding missing values
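
One way to find missing values in a DataFrame, sketched on a small made-up table (the column names and values are assumptions, not the talk's data):

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in both columns.
df = pd.DataFrame({
    "borough": ["Brooklyn", "Queens", None, "Bronx"],
    "price": [100.0, np.nan, 250.0, np.nan],
})

# Count missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Select the rows that have at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```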
PANDAS
Removing missing values
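
A sketch of removing missing values with `dropna`, again on made-up data. Filling gaps with `fillna` is shown as an alternative; whether to drop or fill depends on the dataset.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in both columns.
df = pd.DataFrame({
    "borough": ["Brooklyn", "Queens", None, "Bronx"],
    "price": [100.0, np.nan, 250.0, np.nan],
})

# Drop every row that has any missing value.
complete_rows = df.dropna()

# Drop rows only when a specific column is missing.
priced_rows = df.dropna(subset=["price"])

# Alternative: keep the rows and fill missing prices with the column mean.
filled = df.fillna({"price": df["price"].mean()})
```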
PANDAS
Pivoting
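
A small pivoting sketch. The `sales` table and its columns are hypothetical; `pivot` reshapes long data into a wide table, and `pivot_table` does the same while aggregating.

```python
import pandas as pd

# Made-up long-format data: one row per (borough, year).
sales = pd.DataFrame({
    "borough": ["Brooklyn", "Brooklyn", "Queens", "Queens"],
    "year": [2013, 2014, 2013, 2014],
    "price": [100, 120, 80, 90],
})

# One row per borough, one column per year.
wide = sales.pivot(index="borough", columns="year", values="price")
print(wide)

# pivot_table aggregates when several rows share the same index value.
avg = sales.pivot_table(index="borough", values="price", aggfunc="mean")
print(avg)
```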
PANDAS
- Other things
  - Statistical methods
  - Merge/join like SQL
  - Time series
  - Has some visualization functionality
PANDAS
- Application of algorithms that learn from examples
- Representation and generalization
- Useful in everyday life
- Especially useful in data analysis
MACHINE LEARNING
- Supervised learning
  - Classification and regression
- Unsupervised learning
  - Clustering and dimensionality reduction
MACHINE LEARNING
- Machine learning module
- Open-source
- Built-in datasets
- Good resources for learning
SCIKIT-LEARN
scikit-learn: your data has to be numeric (continuous), not raw categories
Here’s what one observation/label looks like:
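
As an illustration, here is one observation and its label from the iris dataset bundled with scikit-learn. The choice of iris is an assumption made for this sketch and may not be the dataset used in the talk.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# One observation: four numeric measurements for a single flower.
print(X[0])

# Its label: an integer class index, with the name looked up separately.
print(y[0], iris.target_names[y[0]])
```

Both the features and the label are numbers, which is the form scikit-learn estimators expect.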
SCIKIT-LEARN
Transform categorical values/labels
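
A sketch of turning string labels into integers with scikit-learn's `LabelEncoder`. The label values are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up string labels.
labels = ["cat", "dog", "bird", "dog"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)  # classes are numbered in sorted order
print(encoder.classes_)
print(encoded)

# The mapping is reversible.
decoded = encoder.inverse_transform(encoded)
```

`LabelEncoder` is intended for target labels. For categorical input features, scikit-learn also provides `OneHotEncoder`, which avoids implying an order between categories.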
SCIKIT-LEARN
Classification
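
A minimal classification sketch, assuming the bundled iris dataset and a k-nearest-neighbors classifier; both are choices made for this example, not necessarily the talk's. It also assumes a recent scikit-learn, where `train_test_split` lives in `sklearn.model_selection`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data to evaluate the trained model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# Fraction of held-out observations classified correctly.
accuracy = clf.score(X_test, y_test)
print(accuracy)
```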
SCIKIT-LEARN
Classification
SCIKIT-LEARN
- Other things
  - Very comprehensive set of machine learning algorithms
  - Preprocessing tools
  - Methods for testing the accuracy of your model
SCIKIT-LEARN
- Concerned with interactions between computers and human languages
- Derive meaning from text
- Many NLP algorithms are based on machine learning
NATURAL LANGUAGE PROCESSING
- Natural Language ToolKit
- Access to over 50 corpora
  - Corpus: body of text
- NLP tools
  - Stemming, tokenizing, etc.
- Resources for learning
NLTK
Stopword removal
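
A minimal sketch of stopword removal. To stay self-contained it assumes a tiny hand-picked stopword set and a made-up sentence; NLTK ships a fuller English list as `nltk.corpus.stopwords.words("english")`, available after a one-time `nltk.download("stopwords")`.

```python
# Tiny hand-picked stopword set (an assumption for this sketch, not NLTK's list).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "was", "in", "is", "her"}


def remove_stopwords(text):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in STOPWORDS]


# Made-up example sentence.
sentence = "Miss Taylor was the first to notice the change in her father"
filtered = remove_stopwords(sentence)
print(filtered)
```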
NLTK
Stopword removal
NLTK
Stemming
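
A short stemming sketch using NLTK's `PorterStemmer`, which needs no corpus download. The word list is made up, and the Porter stemmer is one choice among the stemmers NLTK provides.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Made-up words to reduce to their stems.
words = ["wedding", "running", "cats"]
stems = [stemmer.stem(word) for word in words]
print(stems)
```

Stems are not always dictionary words; the goal is only that related word forms map to the same token.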
NLTK
- Other things
  - Lemmatizing, tokenization, tagging, parse trees
  - Classification
  - Chunking
  - Sentence structure
NLTK
- Data that takes too long to process on your machine
  - Not “big data” but larger data
- Solution: MapReduce!
  - Processing large datasets with a parallel, distributed algorithm
  - Map step
  - Reduce step
PROCESSING LARGE DATA
- Map step
  - Takes a series of key/value pairs
  - Ex. word counts: break line into words, return each word and its count within the line
- Reduce step
  - Runs once for each unique key: iterates through the values associated with that key
  - Ex. word counts: returns each word and the sum of all its counts
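
The two steps can be sketched in plain Python, with no framework involved. The function names and input lines below are made up for illustration; a real MapReduce framework runs the map and reduce calls in parallel and does the grouping in between.

```python
from collections import Counter, defaultdict


def map_step(line):
    """Emit (word, count-within-line) pairs for one line of text."""
    return list(Counter(line.split()).items())


def group_by_key(pairs):
    """Collect all values emitted for each key (the framework's job)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_step(key, values):
    """Called once per unique key: sum the counts for that word."""
    return key, sum(values)


# Made-up input lines.
lines = ["a b a", "b c"]

mapped = [pair for line in lines for pair in map_step(line)]
reduced = dict(reduce_step(k, v) for k, v in group_by_key(mapped).items())
print(mapped)
print(reduced)
```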
PROCESSING LARGE DATA
- Write MapReduce jobs in Python
- Test code locally without installing Hadoop
- Lots of thorough documentation
- A few things to know
  - Keep everything in one class
  - MRJob program in a separate file
  - Output to new file if doing something like word counts
MRJOB
Stemmed file
- Line 1: (‘miss’, 2), (‘taylor’, 1)
- Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1)
- And so on…
MRJOB
Map
- Line 1: (‘miss’, 2), (‘taylor’, 1)
- Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1)
- Line 3: (‘first’, 1), (‘wed’, 1)
- Line 4: (‘father’, 1)
- Line 5: (‘father’, 1)

Reduce
- (‘miss’, 2)
- (‘taylor’, 2)
- (‘first’, 2)
- (‘wed’, 2)
- (‘father’, 2)
MRJOB
Let’s count all words in the Gutenberg file
Map step
MRJOB
Reduce (and run) step
MRJOB
- Results
  - Mapped counts reduced
  - Key/value pairs
MRJOB
- Other things
  - Run on Hadoop clusters
  - Can write highly complex jobs
  - Works with Elasticsearch
MRJOB
- The “final step”
- Conveying your results in a meaningful way
- Literally see what’s going on
DATA VISUALIZATION
- 2D visualization library
- Very VERY widely used
- Wide variety of plots
- Easy to feed in results from other modules (like Pandas, scikit-learn, NumPy, SciPy, etc.)
MATPLOTLIB
Remember this?
MATPLOTLIB
Bar chart of distribution
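
A small bar chart sketch with matplotlib. The categories and counts are made up, and the `Agg` backend is selected only so the script runs without a display.

```python
import io

import matplotlib

matplotlib.use("Agg")  # headless backend; omit this in an interactive session
import matplotlib.pyplot as plt

# Made-up category counts.
categories = ["A", "B", "C"]
counts = [12, 30, 7]

fig, ax = plt.subplots()
bars = ax.bar(categories, counts)
ax.set_xlabel("Category")
ax.set_ylabel("Count")
ax.set_title("Distribution by category")

# Render to an in-memory PNG; pass a filename to savefig to write a file.
buf = io.BytesIO()
fig.savefig(buf, format="png")
```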
MATPLOTLIB
- Let’s graph our word count frequencies
- (Hint: It’s a power law distribution!)
MATPLOTLIB
High frequency of low numbers, low frequency of high numbers
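
One way to plot this kind of distribution is to count how many words share each count value and draw the result on log-log axes. The word counts below are made up for illustration and are far too few to show a real power law.

```python
from collections import Counter

import matplotlib

matplotlib.use("Agg")  # headless backend; omit this in an interactive session
import matplotlib.pyplot as plt

# Made-up word counts standing in for the output of a word-count job.
word_counts = {
    "the": 50, "of": 20, "miss": 5, "taylor": 2,
    "wed": 2, "first": 1, "father": 1, "change": 1,
}

# How many distinct words occur exactly n times?
words_per_count = Counter(word_counts.values())
xs = sorted(words_per_count)
ys = [words_per_count[x] for x in xs]

fig, ax = plt.subplots()
ax.loglog(xs, ys, marker="o", linestyle="none")
ax.set_xlabel("Occurrences of a word")
ax.set_ylabel("Number of words")
```

On log-log axes, data that follows a power law falls roughly along a straight line.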
MATPLOTLIB
- Other things
  - Many different kinds of graphs
  - Customizable
  - Time series
MATPLOTLIB
- Phew!
- Which tool to choose depends on your needs
- Workflow:
  - Preprocess
  - Analyze
  - Visualize
WHAT NEXT?
- Pandas: http://pandas.pydata.org/
- scikit-learn: http://scikit-learn.org/
- NLTK: http://www.nltk.org/
- MRJob: http://mrjob.readthedocs.org/
- matplotlib: http://matplotlib.org/
RESOURCES
- Twitter: @sarah_guido
- LinkedIn: https://www.linkedin.com/in/sarahguido
- NYC Python: http://www.meetup.com/nycpython/
CONTACT ME!
AND FINALLY…
Questions?
THE END!