Getting Started With Python and R for Text Analysis
Jason T. Kiley, Oklahoma State University
Overview
Goal: familiarization with tools and specific resources for learning to use Python (and R) for text analysis.
How to get started with Python
Gathering and processing data
Using data for analyses
Getting started: Good news!
You know more about programming basics than you may realize.
Most statistical software eventually requires you to learn about different ways of formatting data (e.g. strings and dates).
Commands often require that you specify options in particular ways and provide particular kinds of data, much like functions in Python and R.
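To make the parallel concrete, here is a minimal sketch of a Python function that takes data plus options, much as a statistical command does. The function name `summarize` and its options are hypothetical, chosen only for illustration.

```python
def summarize(values, digits=2, drop_missing=True):
    """Return the mean of `values`, rounded to `digits` decimal places."""
    if drop_missing:
        # Treat None as a missing value, like a blank cell in a dataset.
        values = [v for v in values if v is not None]
    return round(sum(values) / len(values), digits)


scores = [3.0, None, 4.0, 5.0]

# The data comes first; options are given by name, like command options.
mean_score = summarize(scores, digits=1)
print(mean_score)
```

Calling a function with the wrong kind of data or a misspelled option raises an error, just as a statistical package rejects a malformed command.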
Getting started: software
Download Python and R.
Choose a good text editor that is designed for coding. I prefer Atom.
Download add-on software for your text editor, as needed.
Install the analysis packages that you would like to try out. Hint: start with TextBlob.
PYTHON: HELLO
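The original slide was a screenshot that is not reproduced here. As an assumption about what a first script might look like (not the slide's actual content), a minimal example is:

```python
# Store a string in a variable, then print it to the screen.
greeting = "Hello, world!"
print(greeting)
```

Saved as a file such as `hello.py` (a hypothetical filename), this can be run from a terminal with `python hello.py`.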
TEXTEDIT (BAD)
ATOM (GOOD)
Getting started: What to learn
Start with the basics: data types, operators, and control structures. These are things that your statistical software (partially) hides from you.
Learn how to read and write files and work with filenames and paths.
Spend less time on classes and inheritance.
Once you can comfortably manipulate your text data into desired forms (e.g. splitting files, extracting titles and body text, combining texts and metadata into CSVs), move to analysis tools.
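A short sketch of those basics in one place: a few data types, a loop, and reading and writing a file through a path object. The sample text and the filename `release_001.txt` are made up for illustration, and the file is written to a temporary folder so nothing is left behind.

```python
import tempfile
from pathlib import Path

# Basic data types: a string, a list of strings, and an integer.
text = "Acme Corp. announces record earnings.\nShares rose 5% on the news."
lines = text.splitlines()
n_words = len(text.split())

# Work in a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "release_001.txt"

    # Write the text out, then read it back in.
    path.write_text(text, encoding="utf-8")
    loaded = path.read_text(encoding="utf-8")

    # Paths know their own parts, which helps when naming output files.
    stem, suffix = path.stem, path.suffix

# A control structure: loop over the lines with a running line number.
for number, line in enumerate(lines, start=1):
    print(number, line)
```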
LEXISNEXIS: MULTIPLE TEXTS PER FILE
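The slide's screenshot is not reproduced here. As a sketch of how one download might be split into individual texts, the code below assumes each document is preceded by a separator line of the form `1 of 2 DOCUMENTS`. That pattern and the sample text are assumptions, so check them against your own files before relying on this.

```python
import re

# Hypothetical sample of a multi-document download; real files will differ.
raw = """
                               1 of 2 DOCUMENTS

First headline
Body of the first story.

                               2 of 2 DOCUMENTS

Second headline
Body of the second story.
"""

# Assumed separator: a line containing only something like "1 of 2 DOCUMENTS".
separator = re.compile(r"^[ \t]*\d+ of \d+ DOCUMENTS[ \t]*$", flags=re.MULTILINE)

# Split on the separator and drop empty pieces (e.g. text before the first one).
documents = [part.strip() for part in separator.split(raw) if part.strip()]

for number, document in enumerate(documents, start=1):
    print(number, document.splitlines()[0])
```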
Data: Collecting and reading
Gather data in forms that are easiest to work with.
Process new (or existing) data into usable formats.
Extract the information that we want to analyze.
Use data for analyses.
Data: gathering
In general, plain text is best, and the closer a format is to plain text, the better.
Some other formats (e.g. CSV) are plain text files that adhere to a further specification.
LexisNexis: choose plain text (*.txt).
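To illustrate that a CSV is just plain text with extra rules, this sketch writes two made-up rows with Python's standard `csv` module, prints the raw text, and parses it back. The column names and values are hypothetical.

```python
import csv
import io

# Hypothetical rows of text metadata; all values are kept as strings.
rows = [
    {"id": "1", "title": "Earnings, revisited", "words": "412"},
    {"id": "2", "title": "New CEO named", "words": "288"},
]

# Write the rows to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "title", "words"])
writer.writeheader()
writer.writerows(rows)

# The result is ordinary text: commas between fields, one row per line.
csv_text = buffer.getvalue()
print(csv_text)

# Reading the same text back recovers the original rows.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
```

Note that the title containing a comma is wrapped in quotation marks in the raw text. Handling cases like that is the "further specification", and it is why using a CSV library is safer than splitting lines on commas yourself.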
VIEWING A CSV AS A SPREADSHEET
VIEWING A CSV AS TEXT
Data: gathering other types
“But my data is .rtf, .doc, HTML, Morse code...!”
You will need some additional processing steps, but you should be fine.
Factiva: gather .rtf files and process them into plain text.
HTML: strip tags or use Beautiful Soup to parse pages.
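For the "strip tags" route, here is a minimal sketch that uses only Python's standard library `html.parser` rather than Beautiful Soup. The class name `TextExtractor`, the helper `html_to_text`, and the sample page are hypothetical. A full parser such as Beautiful Soup is likely the better choice for messy real-world pages.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string, joined with spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><head><title>News</title><style>p {color: red}</style></head>"
    "<body><h1>Acme hires CEO</h1>"
    "<p>The board <b>approved</b> the move.</p></body></html>"
)
plain = html_to_text(page)
print(plain)
```

Joining chunks with spaces is a simplification: punctuation that directly follows an inline tag will pick up an extra space.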
Data: extracting information
We often want something less than the full text that we gathered.
Examples
Press releases and news stories: analyze headlines separately or with a weight.
Web pages: analyze the body content or comments.
Some libraries have their own tools, but you may have to extract data yourself using regular expressions.
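As a sketch of extracting pieces with regular expressions, the code below pulls a headline, a date, and the body out of one press release. The `HEADLINE:` and `DATE:` labels and the sample text are assumptions about the layout, so your own sources will need their own patterns.

```python
import re

# Hypothetical press release layout: labeled header lines, a blank line, then body.
release = """HEADLINE: Acme Corp. Names New Chief Executive
DATE: March 3, 2015

TULSA, Okla. - Acme Corp. said on Tuesday that its board had named a new chief executive.
"""

# re.MULTILINE lets ^ and $ match at the start and end of each line.
headline_match = re.search(r"^HEADLINE:\s*(.+)$", release, flags=re.MULTILINE)
date_match = re.search(r"^DATE:\s*(.+)$", release, flags=re.MULTILINE)

# Fall back to None when a label is missing rather than raising an error.
headline = headline_match.group(1).strip() if headline_match else None
date = date_match.group(1).strip() if date_match else None

# Everything after the first blank line is treated as body text.
body = release.split("\n\n", 1)[1].strip()

print(headline)
print(date)
```

With the headline separated out, it can be analyzed on its own or given extra weight relative to the body.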
Data: workflow
Gather raw data.
Write code that extracts the data you want from one text. This is often the most challenging part.
Make the single-text code into a function.
Write the code that opens files, processes each one using your function, and writes out the data that you want to analyze.
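The workflow steps above can be sketched as follows. The function names `extract_fields` and `process_folder`, the rule that the first line is the title, and the sample files are all assumptions made for illustration. The demo writes to a temporary folder.

```python
import csv
import tempfile
from pathlib import Path


def extract_fields(text):
    """Single-text step: first line is the title, the rest is the body."""
    lines = [line.strip() for line in text.strip().splitlines()]
    title = lines[0] if lines else ""
    body = " ".join(line for line in lines[1:] if line)
    return {"title": title, "body": body, "n_words": len(body.split())}


def process_folder(in_dir, out_csv):
    """Apply extract_fields to every .txt file and write one CSV row per file."""
    fieldnames = ["filename", "title", "body", "n_words"]
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        for path in sorted(Path(in_dir).glob("*.txt")):
            row = extract_fields(path.read_text(encoding="utf-8"))
            row["filename"] = path.name
            writer.writerow(row)


# Demo with two made-up texts in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp)
    (raw_dir / "a.txt").write_text(
        "First title\n\nFirst body text.\n", encoding="utf-8"
    )
    (raw_dir / "b.txt").write_text(
        "Second title\n\nSecond body, a bit longer.\n", encoding="utf-8"
    )
    out_path = raw_dir / "texts.csv"
    process_folder(raw_dir, out_path)

    # Read the CSV back to confirm what was written.
    with open(out_path, newline="", encoding="utf-8") as handle:
        results = list(csv.DictReader(handle))
```

Keeping the single-text logic in its own function means you can test it on one tricky document without rerunning the whole folder.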
EXAMPLE: FUNCTION
Analyses
Generally, you will use collections of strings (perhaps with metadata) for text analysis.
You may also process texts into CSVs that you can use either for fast human coding of a variable of interest or as a training set for machine learning.
As Laura showed us, there are many techniques and tools available, so read up on the particular library that you intend to use.
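As a sketch of what "collections of strings with metadata" can look like in practice, the code below counts words per document and across a tiny made-up corpus using only the standard library. The corpus, the `tokenize` helper, and the letters-only tokenization rule are assumptions. A dedicated library will usually tokenize more carefully.

```python
import re
from collections import Counter

# Hypothetical corpus: each text is a string paired with simple metadata.
corpus = [
    {"id": "a", "year": 2014, "text": "Profits rose as sales rose."},
    {"id": "b", "year": 2015, "text": "Sales fell, and profits fell with them."},
]


def tokenize(text):
    """Lowercase the text and return runs of letters a-z as tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Word counts per document, keyed by the document id.
counts = {doc["id"]: Counter(tokenize(doc["text"])) for doc in corpus}

# Word counts across the whole collection.
total = Counter()
for doc_counts in counts.values():
    total.update(doc_counts)

print(total.most_common(4))
```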
COMMENTS AND QUESTIONS