getting from raw data to content analysis with custom programs
TRANSCRIPT
-
8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs
1/13
GETTING FROM RAW DATATO CONTENT ANALYSISWITH CUSTOM PROGRAMS
JASON T. KILEY
UNIVERSITY OF GEORGIA
-
8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs
2/13
Overview
Goal: high-level familiarization with data issues
where programming (Python 3.x) can help.
Retrieve the structure from semi-structured text.
Identify texts of interest within a larger set.
Output data in appropriate forms for content
analysis.
-
8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs
3/13
Semi-structured text
Defined: Text that has a structure that cannot be
used directly.
Convert that semi-structure to a form we can use.
Examples: LexisNexis (press releases, news), web
pages, recent EDGAR documents.
Isolate the text we want to analyze, and use the
metadata.
-
8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs
4/13
PRESS RELEASE 1
-
8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs
5/13
PRESS RELEASE 1 & 2
-
8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs
6/13
Regular expressions
Defined: a way of specifying search patterns.
Any given regex has a binary outcome dependingon whether it matches: true or false.
In some cases, this true or false result is enough.
We can also specify that we want to capture a part
of the matching text to use.
-
8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs
7/13
Identifying interesting texts
Archival raw data is often full of texts that we are
not interested in.
Some data is naturally noisy.
Other times, over-gathering is easier if we can
automate the elimination of uninteresting texts.
Human labor is smart but expensive and slow;
computers are fast but only as smart as you can
make it.
-
8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs
8/13
REGULAR EXPRESSIONS
-
8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs
9/13
Output
Forms
Individual files: LIWC, comparisons.
Spreadsheets: coding, statistical analysis.
Testing and project management: gatherdescriptive statistics, diagnostics.
-
8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs
10/13
OUTPUT: INDIVIDUAL FILES
-
8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs
11/13
OUTPUT: SPREADSHEET
-
8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs
12/13
Scratching the surface
If you can precisely describe how to do something,
even if that explanation is complex, a computer can
do the hard work.
Automate raw data gathering.
Use open source libraries for analysis.
Write your own algorithms to analyze content.
-
8/10/2019 Getting from Raw Data to Content Analysis with Custom Programs
13/13
COMMENTS ANDQUESTIONS