Words and More Words: Challenges of Big Data by Prof. Edie Rasmussen

Words and More Words: Challenges of Big (Text) Data. Edie Rasmussen, Visiting Professor, Nanyang Technological University; Professor, University of British Columbia. WKWSCI SYMPOSIUM 2014: Big Data, Big Ideas for Smarter Communities.

Upload: wkwsci-research

Post on 26-Jan-2015


DESCRIPTION

Presented during the WKWSCI Symposium 2014, 21 March 2014, Marina Bay Sands Expo and Convention Centre. Organized by the Wee Kim Wee School of Communication and Information at Nanyang Technological University.

TRANSCRIPT

Page 1: Words and More Words: Challenges of Big Data by Prof. Edie Rasmussen

Words and More Words: Challenges of Big (Text) Data

Edie Rasmussen Visiting Professor, Nanyang Technological University

Professor, University of British Columbia

WKWSCI SYMPOSIUM 2014

Big Data, Big Ideas for Smarter Communities


Outline

• The Rise of Big Text Data

• Challenges for Text Data

• Research Opportunities

– Counting and Culturomics

– Extracting Meaning from Text


The Rise of Big Text Data

• Before there was Big Data, there were large bibliographic databases:

– Dialog: ~180 scholarly databases

– Lexis/Nexis: 5 billion documents (business/law/news)

– Citation Indexes: > 40 million records

• IR techniques designed for rapid access to very large (text) databases

• Swanson: “Undiscovered public knowledge” (1986)


Current Text Sources

• Digitized Legacy Materials

– Google Books, HathiTrust (11 million volumes, 500 TB)

• The Web

• Search Logs (over 2 million queries per minute)

• Wikipedia (~4.5 million English articles)

• Blogs (The Blogosphere)

• Twitter (The Twitterverse)

• Test Collections

– Smaller

– Experimentally more robust


Challenges of Text

• Legacy Text/Digitization Costs

• Quality (OCR Errors; Metadata Errors)

• Availability (Access, Copyright, Privacy)

• Reliability

– Algorithmic dependencies

– Creator trustworthiness

• Authorship Issues (Identification, Authority)

• Lack of Structure

• Lack of Context

• Ambiguity of human language

• Breadth vs. Depth


Processing Text

• Tokenizing, stopping, stemming

• Statistics of text: term values (tf*idf)

• “Bag of Words” approach

• Other evidence: network structures

• Similarity calculations

• Creating ranked lists

• Note: Probabilistic rather than Deterministic
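
The processing steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from the talk: the stopword list, the one-rule "stemmer" (dropping a trailing "s"), the smoothed idf formula, and the function names `tokenize`, `tfidf_vectors`, `cosine`, and `rank` are all assumptions chosen for brevity. A real system would likely use a fuller stopword list and a proper stemmer such as Porter's.

```python
import math
import re
from collections import Counter

# Illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for"}


def tokenize(text):
    """Tokenize, remove stopwords, and apply a crude stand-in for stemming."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    # Strip a trailing "s" from longer words; a real stemmer does far more.
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in kept]


def tfidf_vectors(docs):
    """Build bag-of-words tf*idf vectors; word order is discarded."""
    bags = [Counter(tokenize(d)) for d in docs]
    n_docs = len(bags)
    doc_freq = Counter(term for bag in bags for term in bag)
    # One common idf variant (log(N/df) + 1); other formulations exist.
    idf = {t: math.log(n_docs / df) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in bag.items()} for bag in bags]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return (score, document index) pairs, best match first."""
    vectors, idf = tfidf_vectors(docs)
    q_bag = Counter(tokenize(query))
    q_vec = {t: tf * idf.get(t, 0.0) for t, tf in q_bag.items()}
    scores = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, reverse=True)
```

The output is a ranked list of scores rather than a yes/no match, which is the sense in which retrieval is graded rather than deterministic.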


Counting and the Rise of Culturomics

• “Culturomics is the application of high-throughput data collection and analysis to the study of human culture”

• Database of >5 million digitized books (~4% of all books ever published)

• Michel et al. (Science, 2011): “Quantitative analysis of culture using millions of digitized books”

• Google’s N-Gram Viewer
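
The counting behind an n-gram viewer can be sketched as follows. This is an illustrative toy, not the pipeline used by Michel et al. or Google: the `(year, text)` corpus format, whitespace tokenization, and the function names `ngrams` and `yearly_frequency` are assumptions. It computes, for each year, the share of all n-grams of that length that match a given phrase.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def yearly_frequency(corpus, phrase):
    """Relative frequency of `phrase` per year.

    `corpus` is an iterable of (year, text) pairs (a hypothetical format).
    """
    target = phrase.lower()
    n = len(target.split())
    hits = Counter()
    totals = Counter()
    for year, text in corpus:
        grams = ngrams(text.lower().split(), n)
        totals[year] += len(grams)
        hits[year] += grams.count(target)
    # Normalizing by the yearly total lets years with different amounts
    # of text be compared on the same scale.
    return {year: hits[year] / totals[year]
            for year in sorted(totals) if totals[year]}
```

Normalizing by the yearly total matters because the number of digitized books varies widely from year to year, so raw counts alone would mostly track corpus size.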


Using the N-Gram Viewer

[Figure: N-Gram Viewer chart of the terms typhoid, gout, HIV, and cholera, 1800 to 2000]


How Far Will Counting Take Us?

• Many limitations (e.g. incomplete data set)

• Some surprisingly sophisticated analyses:

– Size of English lexicon

– Change in word usage (irregular verbs) over time

– Cultural turnover (inventions)

– The nature (duration) of fame

– Patterns of censorship (“suppression index”)


Critiques of Culturomics

• “The death of theory”

• “…second-rate scholars will use the Google Books corpus to churn out gigabytes of uninformative graphs and insignificant conclusions.” (Nunberg, 2011)

• Books as a representation of human history

• A “time sink”


TM: Topic Detection and Tracking

• Tracking a story line over time

• News wire input, identify new story, find subsequent instances

• Story segmentation, First story detection, Clustering of like stories

• Interesting to news, business, security analysts
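
First story detection and clustering of like stories can be sketched as a single pass over the incoming stream. This is one simple approach under stated assumptions, not the specific method discussed in the talk: stories are compared as raw bag-of-words vectors, each cluster is represented by summed term counts, and the similarity threshold of 0.3 and the name `track_stories` are arbitrary illustrative choices.

```python
import math
import re
from collections import Counter


def bag(text):
    """Bag-of-words term counts for one story."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def track_stories(stream, threshold=0.3):
    """Assign each incoming story to a cluster in arrival order.

    Returns a list of (cluster_id, is_first_story) pairs. A story that
    is not similar enough to any existing cluster starts a new one.
    """
    clusters = []  # one Counter of summed term counts per story line
    labels = []
    for text in stream:
        vec = bag(text)
        best, best_sim = None, 0.0
        for idx, centroid in enumerate(clusters):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = idx, sim
        if best is None or best_sim < threshold:
            clusters.append(Counter(vec))          # first story detected
            labels.append((len(clusters) - 1, True))
        else:
            clusters[best].update(vec)             # follow-up to a known story
            labels.append((best, False))
    return labels
```

The threshold controls the trade-off: set too low, unrelated stories merge into one story line; set too high, every follow-up is flagged as new.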


TM: Sentiment Analysis/Opinion Mining

• Rich data from Blogs and Tweets

• Basically a classification problem (SVM, Naïve Bayes, etc.) -> positive, negative, neutral

• Involves Entity Extraction, NLP, sentiment vocabularies

• Of interest to government and businesses

• See Stanford SA of movie reviews: http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
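
Treating sentiment as a classification problem can be illustrated with a small multinomial Naïve Bayes classifier. This is a from-scratch sketch with add-one smoothing, not the Stanford model linked above; the class name `NaiveBayes` and its methods are hypothetical, and the sketch skips the entity extraction and sentiment vocabularies a real system would add.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the log likelihood of each known word.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

A usage sketch with a tiny made-up training set (far too small for real use):

```python
train_texts = [
    "a wonderful moving film", "great acting and a great story",
    "a dull boring film", "terrible plot and awful acting",
    "the film opens on friday", "the cast includes two newcomers",
]
train_labels = ["positive", "positive", "negative", "negative",
                "neutral", "neutral"]
model = NaiveBayes().fit(train_texts, train_labels)
label = model.predict("great story")
```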


TM: Trends and Predictions

• Can Tweets and Search Logs be used to predict the future?

• Google Flu Trends, Google Dengue Trends

– Correlated with Search Terms

• Network analysis on Tweets on Arab Spring

• Assessing tone of global news data to predict national stability, location of terrorists, etc. (Leetaru)

• Predicting opinions (recommender systems)
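
The idea of correlating search-term volume with reported cases can be sketched with a Pearson correlation, optionally shifted by a lag so that earlier search activity is compared with later case counts. This is only an illustration of the statistic, not how Google Flu Trends was built; the function names `pearson` and `lagged_correlation` and the weekly-series framing are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


def lagged_correlation(queries, cases, lag):
    """Correlate query volume with case counts `lag` periods later.

    `lag` must be a non-negative integer smaller than the series length.
    """
    if lag < 0:
        raise ValueError("lag must be non-negative")
    return pearson(queries[:len(queries) - lag], cases[lag:])
```

A high correlation at a positive lag suggests that search activity tends to move ahead of the reported figures, but it does not by itself establish that the signal will keep predicting them.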


TM: Question Answering

• Combines multiple sources of evidence:

– Question type identification

– Information retrieval of candidate text

– Natural language processing

– Entity extraction

– Hypothesis generation and scoring (confidence)

– Ranking hypotheses
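
The way these sources of evidence combine can be sketched as a toy pipeline. This is a deliberately simplified illustration and not Watson's architecture: question typing uses only the first word, retrieval is word overlap, entity extraction is two regular expressions, and confidence is the summed retrieval score. All function names and rules here are assumptions.

```python
import re
from collections import defaultdict


def question_type(question):
    """Question type identification from the first word (very crude)."""
    parts = question.lower().split()
    first = parts[0] if parts else ""
    return {"who": "PERSON", "when": "DATE"}.get(first, "OTHER")


def words(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question, passages, k=2):
    """Information retrieval: top-k passages by word overlap with the question."""
    q_words = words(question)
    if not q_words:
        return []
    scored = [(len(q_words & words(p)) / len(q_words), p) for p in passages]
    return sorted((s for s in scored if s[0] > 0), reverse=True)[:k]


def candidates(passage, qtype):
    """Entity extraction: pull answer candidates matching the question type."""
    if qtype == "DATE":
        return re.findall(r"\b(?:1[0-9]{3}|20[0-9]{2})\b", passage)
    if qtype == "PERSON":
        return re.findall(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b", passage)
    return []


def answer(question, passages):
    """Generate hypotheses, score them, and return them ranked by confidence."""
    qtype = question_type(question)
    confidence = defaultdict(float)
    for score, passage in retrieve(question, passages):
        for cand in candidates(passage, qtype):
            confidence[cand] += score  # evidence from several passages adds up
    return sorted(confidence.items(), key=lambda kv: kv[1], reverse=True)
```

For example, given a few illustrative passages, `answer("Who presented the talk?", passages)` returns a list of (candidate, confidence) pairs, best first, rather than a single definitive answer.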


[Images: Hans Peter Luhn, 1952; Watson, 2011]


Structuring Research: “Digging Into Data” Program

• Addresses: “how "big data" changes the research landscape for the humanities and social sciences”

• 3 rounds of international research funding

• Canada, US, UK, plus Netherlands

• Team approach: scholars, scientists, information professionals

• Requires international teams; funding from at least two countries

• Wide range of datasets made available

• http://www.diggingintodata.org/


Thank you!
