Words and More Words: Challenges of Big Data by Prof. Edie Rasmussen
Presented during the WKWSCI Symposium 2014, 21 March 2014, Marina Bay Sands Expo and Convention Centre. Organized by the Wee Kim Wee School of Communication and Information at Nanyang Technological University.
Words and More Words: Challenges of Big (Text) Data
Edie Rasmussen Visiting Professor, Nanyang Technological University
Professor, University of British Columbia
WKWSCI SYMPOSIUM 2014: Big Data, Big Ideas for Smarter Communities
Outline
• The Rise of Big Text Data
• Challenges for Text Data
• Research Opportunities
– Counting and Culturomics
– Extracting Meaning from Text
The Rise of Big Text Data
• Before there was Big Data, there were large bibliographic databases:
– Dialog: ~180 scholarly databases
– Lexis/Nexis: 5 billion documents (business/law/news)
– Citation Indexes: > 40 million records
• IR techniques designed for rapid access to very large (text) databases
• Swanson: “Undiscovered public knowledge” (1987)
Current Text Sources
• Digitized Legacy Materials
– Google Books, Hathi Trust (11 million volumes, 500 TB)
• The Web
• Search Logs (over 2 million queries per minute)
• Wikipedia (~4.5 million English articles)
• Blogs (The Blogosphere)
• Twitter (The Twitterverse)
• Test Collections
– Smaller
– Experimentally more robust
Challenges of Text
• Legacy Text/Digitization Costs
• Quality (OCR Errors; Metadata Errors)
• Availability (Access, Copyright, Privacy)
• Reliability
– Algorithmic dependencies
– Creator trustworthiness
• Authorship Issues (Identification, Authority)
• Lack of Structure
• Lack of Context
• Ambiguity of human language
• Breadth vs. Depth
Processing Text
• Tokenizing, stopping, stemming
• Statistics of text: term values (tf*idf)
• “Bag of Words” approach
• Other evidence: network structures
• Similarity calculations
• Creating ranked lists
• Note: Probabilistic rather than Deterministic
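The processing steps listed above can be sketched end to end in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any particular IR system: the stopword list is a tiny hypothetical one, `stem` is a crude suffix stripper standing in for a real stemmer, and the idf formula (`log(N/df) + 1`) is just one common variant.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, kept tiny for illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on"}


def stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stopwords, and stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Build bag-of-words tf*idf vectors (word order is discarded)."""
    bags = [Counter(preprocess(d)) for d in docs]
    n = len(bags)
    df = Counter(term for bag in bags for term in bag)  # document frequency
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = [{t: tf * idf[t] for t, tf in bag.items()} for bag in bags]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (score, doc_index) pairs, best match first."""
    vectors, idf = tfidf_vectors(docs)
    q_bag = Counter(preprocess(query))
    q_vec = {t: tf * idf.get(t, 0.0) for t, tf in q_bag.items()}
    scored = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    docs = [
        "Indexing large text databases",
        "Cooking pasta at home",
        "Text retrieval for large databases",
    ]
    for score, index in rank("text databases", docs):
        print(f"{score:.3f}  {docs[index]}")
```

The output is a ranked list of scores rather than a yes/no match, which is the sense in which the approach is graded rather than deterministic.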
Counting and the Rise of Culturomics
• “Culturomics is the application of high-throughput data collection and analysis to the study of human culture”
• Database of >5 million digitized books (~4% of all books ever published)
• Michel et al. (Science, 2011): “Quantitative analysis of culture using millions of digitized books”
• Google’s N-Gram Viewer
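The kind of counting behind an n-gram viewer can be sketched as follows. This is a toy illustration, not Google's implementation: the `(year, text)` corpus format and the function names are assumptions, and tokenization is plain whitespace splitting. Counts are normalized by the number of n-grams seen in each year so that years with more published text do not dominate.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def yearly_frequency(corpus, phrase):
    """Relative frequency of `phrase` per year.

    `corpus` is an iterable of (year, text) pairs (a hypothetical format).
    """
    target = phrase.lower()
    n = len(target.split())
    hits = Counter()    # occurrences of the phrase per year
    totals = Counter()  # all n-grams of the same length per year
    for year, text in corpus:
        grams = ngrams(text.lower().split(), n)
        totals[year] += len(grams)
        hits[year] += grams.count(target)
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}


if __name__ == "__main__":
    # Made-up sentences, for illustration only.
    corpus = [
        (1850, "cholera spread and cholera killed"),
        (1990, "hiv research and hiv testing expanded"),
    ]
    print(yearly_frequency(corpus, "cholera"))
```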
Using the N-Gram Viewer
[Figure: N-Gram Viewer plot of the terms typhoid, gout, HIV, and cholera over the years 1800 to 2000]
How Far Will Counting Take Us?
• Many limitations (e.g. incomplete data set)
• Some surprisingly sophisticated analyses:
– Size of English lexicon
– Change in word usage (irregular verbs) over time
– Cultural turnover (inventions)
– The nature (duration) of fame
– Patterns of censorship (“suppression index”)
Critiques of Culturomics
• “The death of theory”
• “…second-rate scholars will use the Google Books corpus to churn out gigabytes of uninformative graphs and insignificant conclusions.” (Nunberg, 2011)
• Books as a representation of human history
• A “time sink”
Social Media as Big Data
• ‘Internet Minute’
– 320+ new Twitter accounts
– 100,000 new Tweets
– 2+ million search queries
– 6 new Wikipedia articles
– 30 hours of video uploaded
(Source: Intel, http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html)
TM: Topic Detection and Tracking
• Tracking a story line over time
• News wire input, identify new story, find subsequent instances
• Story segmentation, First story detection, Clustering of like stories
• Interesting to news, business, security analysts
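One simple way to picture first story detection and the clustering of like stories is a single pass over the stream: each incoming story is compared with everything seen so far, and it starts a new topic when nothing is similar enough. The sketch below assumes raw term counts, cosine similarity, and a hypothetical threshold of 0.3. It illustrates the idea only and is not a description of any evaluated TDT system.

```python
import math
import re
from collections import Counter


def bag(text):
    """Bag-of-words term counts for one story."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two Counters (missing terms count as 0)."""
    dot = sum(count * v[term] for term, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def track_stories(stream, threshold=0.3):
    """Return (story_index, topic_id, is_first_story) for each story.

    `threshold` is a hypothetical cut-off chosen for this example.
    """
    seen = []      # (term counts, topic_id) for every earlier story
    results = []
    next_topic = 0
    for i, text in enumerate(stream):
        vec = bag(text)
        best_sim, best_topic = 0.0, None
        for old_vec, topic in seen:
            sim = cosine(vec, old_vec)
            if sim > best_sim:
                best_sim, best_topic = sim, topic
        if best_topic is None or best_sim < threshold:
            # Nothing similar enough: this is the first story of a new topic.
            topic, is_first = next_topic, True
            next_topic += 1
        else:
            # Follow-up story: attach it to the most similar earlier story's topic.
            topic, is_first = best_topic, False
        seen.append((vec, topic))
        results.append((i, topic, is_first))
    return results


if __name__ == "__main__":
    # Made-up headlines, for illustration only.
    stream = [
        "earthquake strikes coastal city",
        "rescue teams reach city after earthquake",
        "central bank raises interest rates",
        "bank holds interest rates steady",
    ]
    for row in track_stories(stream):
        print(row)
```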
TM: Sentiment Analysis/Opinion Mining
• Rich data from Blogs and Tweets
• Basically a classification problem (SVM, Naïve Bayes, etc.) -> positive, negative, neutral
• Involves Entity Extraction, NLP, sentiment vocabularies
• Of interest to government and businesses
• See Stanford SA of movie reviews: http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
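To make the "classification problem" framing concrete, here is a minimal multinomial Naïve Bayes classifier with add-one smoothing, written from scratch. The training sentences and labels are made up for illustration, and the class and function names are assumptions. A real system would add the entity extraction, NLP, and sentiment vocabularies mentioned above, and would need far more data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, examples):
        self.class_counts = Counter()            # documents per label
        self.word_counts = defaultdict(Counter)  # word counts per label
        self.vocab = set()
        for text, label in examples:
            self.class_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus the sum of log likelihoods of known words.
            score = math.log(count / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data, far too small for real use.
TRAINING = [
    ("I love this film, great acting", "positive"),
    ("wonderful and moving story", "positive"),
    ("terrible plot and awful acting", "negative"),
    ("I hate this boring film", "negative"),
    ("the film opens on friday", "neutral"),
    ("the story is set in paris", "neutral"),
]

if __name__ == "__main__":
    model = NaiveBayes().fit(TRAINING)
    for sentence in ("great wonderful film", "awful boring plot"):
        print(sentence, "->", model.predict(sentence))
```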
TM: Trends and Predictions
• Can Tweets and Search Logs be used to predict the future?
• Google Flu Trends, Google Dengue Trends – Correlated with Search Terms
• Network analysis on Tweets on Arab Spring
• Assessing tone of global news data to predict national stability, location of terrorists, etc. (Leetaru)
• Predicting opinions (recommender systems)
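The "correlated with search terms" idea in the list above can be illustrated with a plain Pearson correlation between a weekly search-volume series and a weekly case-count series. The numbers below are invented for the example and the function name is an assumption. This is not how Google Flu Trends was built, and a high correlation on past data does not by itself establish predictive power.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


# Hypothetical weekly figures, for illustration only.
WEEKLY_SEARCHES = [120, 150, 310, 480, 400, 220]  # queries for a symptom term
WEEKLY_CASES = [10, 14, 30, 52, 41, 20]           # reported cases

if __name__ == "__main__":
    print(f"r = {pearson(WEEKLY_SEARCHES, WEEKLY_CASES):.3f}")
```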
TM: Question Answering
• Combines multiple sources of evidence:
– Question type identification
– Information retrieval of candidate text
– Natural language processing
– Entity extraction
– Hypothesis generation and scoring (confidence)
– Ranking hypotheses
[Images: Hans Peter Luhn, 1952; Watson, 2011]
Structuring Research: “Digging Into Data” Program
• Addresses: “how "big data" changes the research landscape for the humanities and social sciences”
• 3 rounds of international research funding
• Canada, US, UK, plus Netherlands
• Team approach: scholars, scientists, information professionals
• Requires international teams; funding from at least two countries
• Wide range of datasets made available
• http://www.diggingintodata.org/
Thank you!