Common Crawl: An Open Repository of Web Data
Post on 06-May-2015
What Does The Data World Mean to Society?
Lisa Green
1 October 2012
London HUG
Lisa Green
10 October 2012
Common Crawl: An Open Repository of Web Data
Photo license: Public Domain Origin: http://en.wikipedia.org/wiki/File:Floppy_disk_2009_G1.jpg
Photo license: CC-BY-SA Origin: http://en.wikipedia.org/wiki/File:Wikimedia_Foundation_Servers-8055_08.jpg
Image license: CC-BY Origin: http://en.wikipedia.org/wiki/File:Internet_map_1024.jpg
Still Nascent
• Even cheaper storage
• Even cheaper compute
• Education
• Open Data
Image license: CC-BY Credit: NASA, ESA, and the Hubble Heritage Team (STScI/AURA)
Proprietary
Commercial
Gratis
Libre
Progress
Insight
Analysis
Data
Gil Elbaz
Common Crawl Data
• ~8 billion web pages
• ~120 TB
• 2008-2012
• ARC files, JSON metadata, text files
• Available to anyone
ARC Files - Raw Content
Metadata
• Status information
• HTTP response code
• File names & offsets of ARC files
• HTML title
• HTML meta tags
• RSS/Atom information
• All anchors/hyperlinks
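The metadata slide mentions file names and offsets of ARC files. As a rough sketch of why an offset is useful, the Python below seeks straight to one record in an archive and reads it without scanning the rest. It assumes the ARC version 1 record header (URL, IP address, archive date, content type, length, separated by spaces) and an uncompressed in-memory archive built for the demo. The names `ArcRecord`, `read_record_at`, and `build_demo_archive` are hypothetical, and the slides do not say how the real metadata stores its offsets or how the real files are compressed.

```python
import io
from typing import NamedTuple


class ArcRecord(NamedTuple):
    url: str
    ip_address: str
    archive_date: str
    content_type: str
    payload: bytes


def read_record_at(stream, offset):
    """Seek to `offset` and read one ARC-style record (assumed v1 header layout)."""
    stream.seek(offset)
    header = stream.readline().decode("utf-8").rstrip("\n")
    # Assumed header: url, ip-address, archive-date, content-type, length
    url, ip_address, archive_date, content_type, length = header.split(" ")
    payload = stream.read(int(length))
    return ArcRecord(url, ip_address, archive_date, content_type, payload)


def build_demo_archive():
    """Build a tiny uncompressed archive in memory plus a url -> offset index."""
    records = [
        ("http://example.com/", "192.0.2.1", "20120101000000", "text/html",
         b"<html>first</html>"),
        ("http://example.org/", "192.0.2.2", "20120102000000", "text/html",
         b"<html>second</html>"),
    ]
    buf = io.BytesIO()
    offsets = {}
    for url, ip, date, ctype, body in records:
        offsets[url] = buf.tell()  # this is what a metadata entry would record
        header = f"{url} {ip} {date} {ctype} {len(body)}\n".encode("utf-8")
        buf.write(header + body + b"\n")
    return buf, offsets


archive, offsets = build_demo_archive()
record = read_record_at(archive, offsets["http://example.org/"])
print(record.url, len(record.payload))
```

With real files the stream would be an opened archive and the offset would come from the metadata. Any decompression step is left out here because the slides do not describe it.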
Text Files - Text Only
http://commoncrawl.org/get-started
http://webdatacommons.org
Change between 2010 and 2012
• URLs with embedded data +6%
• Microdata +14%
• RDFa +26%
• 22% of Web pages contain Facebook URLs
• 8% of Web pages implement Open Graph tags
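To make the Open Graph figure concrete, here is a minimal Python sketch of how one might check whether a page carries Open Graph tags, that is, `<meta>` elements whose `property` attribute starts with `og:`. It is not the method behind the numbers on the slide. The helper names `open_graph_properties` and `fraction_with_open_graph` and the sample pages are made up for illustration.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..."> tags from an HTML document."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        if prop.startswith("og:"):
            self.properties[prop] = attr.get("content") or ""


def open_graph_properties(html):
    """Return a dict of Open Graph properties found in `html`."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


def fraction_with_open_graph(pages):
    """Share of pages that contain at least one Open Graph tag."""
    if not pages:
        return 0.0
    hits = sum(1 for page in pages if open_graph_properties(page))
    return hits / len(pages)


# Hypothetical sample pages, for illustration only.
sample_pages = [
    '<html><head><meta property="og:title" content="Hello"></head></html>',
    '<html><head><meta name="description" content="No Open Graph"></head></html>',
]
print(fraction_with_open_graph(sample_pages))
```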
A corpus of (anchor text, Wikipedia concept, count) entries extracted from the Common Crawl dataset, to benefit
research on word sense disambiguation (WSD), NLP and IR.
Explicit Topic Modeling: Given a concept (represented as a Wikipedia page), it can tell which terms people most commonly use to describe the concept.
Given a sentence, it can help identify entities (person, location, organization) in the sentence and map them onto Wikipedia concepts.
http://wikientities.appspot.com
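The two uses described above (most common terms for a concept, and mapping a mention to a concept) can both be read off a table of anchor text, concept, and count. The Python sketch below shows the idea on toy data. The class `AnchorConceptCounts`, its methods, and the counts are hypothetical; this is not the project's actual code or corpus format.

```python
from collections import Counter, defaultdict


class AnchorConceptCounts:
    """Toy index over (anchor text, Wikipedia concept, count) entries."""

    def __init__(self):
        self.by_concept = defaultdict(Counter)  # concept -> anchor text counts
        self.by_anchor = defaultdict(Counter)   # anchor text -> concept counts

    def add(self, anchor_text, concept, count=1):
        key = anchor_text.strip().lower()
        self.by_concept[concept][key] += count
        self.by_anchor[key][concept] += count

    def top_terms(self, concept, n=3):
        """Most common anchor texts people use to link to `concept`."""
        counts = self.by_concept.get(concept, Counter())
        return [term for term, _ in counts.most_common(n)]

    def best_concept(self, anchor_text):
        """Concept most often linked from `anchor_text`, or None if unseen."""
        counts = self.by_anchor.get(anchor_text.strip().lower())
        if not counts:
            return None
        return counts.most_common(1)[0][0]


# Made-up counts, for illustration only.
index = AnchorConceptCounts()
index.add("Paris", "Paris", 5)
index.add("City of Light", "Paris", 2)
index.add("Paris", "Paris_Hilton", 1)

print(index.top_terms("Paris"))
print(index.best_concept("paris"))
```

Picking the most frequent concept is the simplest possible disambiguation rule. A real entity linker would also use the surrounding sentence.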
Mapping French websites related to Open Data
Other Use Examples
• Apache Giraph Testing
• MapLight
• TinEye
• Factual
• Sentiment Analysis Projects
In Development
• N-gram and Link Graph Extracts
• Pig Reader
• More Frequent Full Crawls
• Focused Subset Crawls at High Frequency
• Open Educational Resources
Lisa Green
lisa@commoncrawl.org
www.commoncrawl.org
@commoncrawl
@boudicca
Thank You
London HUG