Leetaru, Kalev: The GDELT Project
TRANSCRIPT
Tweets per month (1% sample) Active users per month (1% sample)
Datasets
• NEWS: Worldwide local news coverage in 100 languages (65 live translated) – online news preserved via Internet Archive
• TELEVISION: Collaboration with the Internet Archive to process more than 100 television stations across the US, updating daily
• ACADEMIC LITERATURE: 21 billion words covering 70 years (JSTOR/DTIC/CORE/CITESEER/IA)
• BOOKS: Collaboration with Internet Archive and HathiTrust to process 3.5 million books 1800-2015
• HUMAN RIGHTS: Half century of worldwide human rights reports
• IMAGERY: Large fraction of global news imagery processed via deep learning: objects/activities, OCR, logos, facial sentiment, geolocation
Preserving Online News
• World’s largest initiative to preserve online news
• Only program to focus on worldwide local news in local languages
• Partnership with Internet Archive’s NO404 program - prior to this, IA’s news archiving was very limited and focused heavily on the Western world and major English-language sources
• Most web archiving efforts favor English and Western news outlets
• Working with IA to ensure preservation of mobile formats and enhanced preservation of embedded article imagery
Preserving Online News
• 1.5-2% of news articles disappear within 2 weeks
• 5% disappear within a month
• Up to 14% gone after 2 months – half with 404 and half ranging from sustained 500s to domain removal (common in some areas of the world)
• Of GDELT-relevant coverage, 140,000 articles published today will be gone in 2 months
• 14 million GDELT-monitored articles disappeared over a 6-month period, representing 2x the total output of the New York Times over the last half century
• Numbers vastly higher in some countries
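The figures on this slide can be cross-checked with simple arithmetic. The sketch below is illustrative only: it assumes the 140,000 lost articles correspond to the "up to 14%" two-month loss rate, and that the same half-404 split applies to them. Neither assumption is stated on the slide, and the function name is hypothetical.

```python
# Illustrative arithmetic only. The two constants are taken from the slide;
# pairing them together is an assumption, not something the slide states.
LOSS_RATE_2_MONTHS = 0.14   # "up to 14% gone after 2 months"
LOST_PER_DAY = 140_000      # "140,000 articles published today will be gone"


def implied_daily_volume(lost: int, rate: float) -> float:
    """Daily article volume implied by a loss count and a loss rate."""
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    return lost / rate


daily_volume = implied_daily_volume(LOST_PER_DAY, LOSS_RATE_2_MONTHS)

# Assumed split: half of the losses are 404s, the rest are sustained
# 5xx errors or removed domains.
lost_to_404 = LOST_PER_DAY // 2
lost_to_other = LOST_PER_DAY - lost_to_404

print(f"Implied articles monitored per day: {daily_volume:,.0f}")
print(f"Lost to 404: {lost_to_404:,}; lost to 5xx or domain removal: {lost_to_other:,}")
```

Under these assumptions the two numbers are consistent with a monitored volume on the order of one million articles per day.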
Preserving Online News
• Manual efforts like Archive-It don’t scale to sudden-onset events like natural disasters or terror attacks – need “always on” archiving. The majority of coverage appears in the first 72 hours and levels off within 14 days.
• Nepal 2015 earthquake: Yale + Columbia preserved 107 URLs with Archive-It.
• Nepal 2015 earthquake: GDELT captured over 667,000 articles about the earthquake and the country’s recovery over the following year, including 225,000 in languages other than English, with the top language being Nepali – capturing the local perspective
Global Event Database Global Knowledge Graph
[Figure panels: Greece, France, Germany, Italy, United Kingdom]
Burundi - 12/13/2015
Instability
Tone
Media Attention
Topics
Physical Unrest
Anxiety
Positive/Negative: “Cautiously Optimistic” Trending
US Ebola News Coverage
Number of American television news broadcasts per week mentioning "ebola"
• March 2014 WHO announcement
• First American infections
• Eric Duncan arrives in Dallas
Average “tone” of English language media coverage of “ebola”
• Steady ascent towards more and more positive coverage as “Western medicine miracles to the rescue” theme dominates coverage
Carbon Capture & Sequestration
• English coverage of CCS 2010-2015
• 32,000 websites, 250,000 people, 140,000 organizations, 50,000 locations
• Green cluster (center): senior American policymakers
• Green cluster (lower): “cap and trade” politicians
• Red cluster (bottom): American lawmakers on Congressional energy committee or sponsoring energy-related legislation
• Purple cluster (top right): climate skeptics
• Yellow (upper left): Australian politicians
• Pink (upper center): British politicians
• Periphery of all clusters: journalists and financial analysts who feature prominently in coverage or who write much of the coverage – Karolin Schaps (Reuters) and Alex Morales (Bloomberg News London) are attached to British political cluster; Tom Friedman is attached to American political cluster
• Red: Actual Ukraine
• Green: Avg Turkey (2/19/1999-4/20/1999) and Lebanon (3/24/2007-5/23/2007)
• (r=0.49)
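The r value on this slide reads as a Pearson correlation between the actual Ukraine timeline and the average of the two historical windows. The slide does not say how the series were built or aligned, so the following is only a minimal sketch: it assumes equal-length series aligned day by day, and the function names and sample numbers are hypothetical placeholders, not GDELT data.

```python
from math import sqrt


def average_series(*series):
    """Element-wise mean of several equal-length series."""
    if len({len(s) for s in series}) != 1:
        raise ValueError("all series must have the same length")
    return [sum(values) / len(values) for values in zip(*series)]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Placeholder values for illustration only (not real GDELT measurements).
ukraine = [1.0, 1.4, 1.2, 2.0, 2.6, 2.1]
turkey_1999 = [0.9, 1.1, 1.5, 1.7, 2.2, 2.5]
lebanon_2007 = [1.2, 1.0, 1.6, 2.1, 1.9, 2.4]

reference = average_series(turkey_1999, lebanon_2007)
print(f"r = {pearson_r(ukraine, reference):.2f}")
```

With real data, each list would hold one value per day of the comparison window, with the historical windows shifted so that day 1 of each lines up with day 1 of the Ukraine series.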