Common Crawl: An Open Repository of Web Data
DESCRIPTION
Talk given by Lisa Green from the Common Crawl Foundation at the Hadoop User Group UK meetup on 10 October 2012 in London
TRANSCRIPT
![Page 1: Common Crawl: An Open Repository of Web Data](https://reader038.vdocument.in/reader038/viewer/2022102700/554a374bb4c905293a8b468c/html5/thumbnails/1.jpg)
What Does The Data World Mean to Society?
Lisa Green
1 October 2012
London HUG
Lisa Green
10 October 2012
Common Crawl: An Open Repository of Web Data
![Page 2: Common Crawl: An Open Repository of Web Data](https://reader038.vdocument.in/reader038/viewer/2022102700/554a374bb4c905293a8b468c/html5/thumbnails/2.jpg)
Photo license: Public Domain Origin: http://en.wikipedia.org/wiki/File:Floppy_disk_2009_G1.jpg
![Page 3: Common Crawl: An Open Repository of Web Data](https://reader038.vdocument.in/reader038/viewer/2022102700/554a374bb4c905293a8b468c/html5/thumbnails/3.jpg)
Photo license: CC-BY-SA Origin: http://en.wikipedia.org/wiki/File:Wikimedia_Foundation_Servers-8055_08.jpg
![Page 4: Common Crawl: An Open Repository of Web Data](https://reader038.vdocument.in/reader038/viewer/2022102700/554a374bb4c905293a8b468c/html5/thumbnails/4.jpg)
Image license: CC-BY Origin: http://en.wikipedia.org/wiki/File:Internet_map_1024.jpg
![Page 5: Common Crawl: An Open Repository of Web Data](https://reader038.vdocument.in/reader038/viewer/2022102700/554a374bb4c905293a8b468c/html5/thumbnails/5.jpg)
Still Nascent
• Even cheaper storage
• Even cheaper compute
• Education
• Open Data
Image license: CC-BY Credit: NASA, ESA, and the Hubble Heritage Team (STScI/AURA)
![Page 6: Common Crawl: An Open Repository of Web Data](https://reader038.vdocument.in/reader038/viewer/2022102700/554a374bb4c905293a8b468c/html5/thumbnails/6.jpg)
Proprietary
Commercial
Gratis
Libre
![Page 7: Common Crawl: An Open Repository of Web Data](https://reader038.vdocument.in/reader038/viewer/2022102700/554a374bb4c905293a8b468c/html5/thumbnails/7.jpg)
Progress
Insight
Analysis
Data
![Page 8: Common Crawl: An Open Repository of Web Data](https://reader038.vdocument.in/reader038/viewer/2022102700/554a374bb4c905293a8b468c/html5/thumbnails/8.jpg)
Gil Elbaz
![Page 9: Common Crawl: An Open Repository of Web Data](https://reader038.vdocument.in/reader038/viewer/2022102700/554a374bb4c905293a8b468c/html5/thumbnails/9.jpg)
![Page 10: Common Crawl: An Open Repository of Web Data](https://reader038.vdocument.in/reader038/viewer/2022102700/554a374bb4c905293a8b468c/html5/thumbnails/10.jpg)
Common Crawl Data
• ~8 billion web pages
• ~120 TB
• 2008-2012
• ARC files, JSON metadata, text files
• Available to anyone
![Page 11: Common Crawl: An Open Repository of Web Data](https://reader038.vdocument.in/reader038/viewer/2022102700/554a374bb4c905293a8b468c/html5/thumbnails/11.jpg)
ARC Files - Raw Content
Metadata
• Status information
• HTTP response code
• File names & offsets of ARC files
• HTML title
• HTML meta tags
• RSS/Atom information
• All anchors/hyperlinks
Text Files - Text Only
http://commoncrawl.org/get-started
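As a rough illustration of what the raw ARC content looks like to a program, the sketch below parses a single record header line. It assumes the files follow the Internet Archive ARC v1 layout, where each record starts with a line of five space-separated fields (URL, IP address, archive date, content type, length). The slides do not spell out the layout, so treat this as an assumption. The names `ArcRecordHeader` and `parse_arc_header` and the sample line are hypothetical.

```python
from typing import NamedTuple


class ArcRecordHeader(NamedTuple):
    """Fields of one ARC v1 record header line (assumed layout)."""
    url: str
    ip_address: str
    archive_date: str  # timestamp in YYYYMMDDhhmmss form
    content_type: str
    length: int  # number of payload bytes that follow the header line


def parse_arc_header(line: str) -> ArcRecordHeader:
    """Split a header line into its five space-separated fields."""
    parts = line.strip().split(" ")
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    url, ip_address, archive_date, content_type, length = parts
    return ArcRecordHeader(url, ip_address, archive_date, content_type, int(length))


# Hypothetical header line, made up for illustration only.
example = parse_arc_header(
    "http://example.com/ 192.0.2.1 20120101000000 text/html 1234"
)
```

A reader of the real files would use the `length` field to know how many bytes of raw content to read before the next record header.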
![Page 12: Common Crawl: An Open Repository of Web Data](https://reader038.vdocument.in/reader038/viewer/2022102700/554a374bb4c905293a8b468c/html5/thumbnails/12.jpg)
![Page 13: Common Crawl: An Open Repository of Web Data](https://reader038.vdocument.in/reader038/viewer/2022102700/554a374bb4c905293a8b468c/html5/thumbnails/13.jpg)
http://webdatacommons.org
Change between 2010 and 2012
• URLs with embedded data +6%
• Microdata +14%
• RDFa +26%
![Page 14: Common Crawl: An Open Repository of Web Data](https://reader038.vdocument.in/reader038/viewer/2022102700/554a374bb4c905293a8b468c/html5/thumbnails/14.jpg)
• 22% of Web pages contain Facebook URLs
• 8% of Web pages implement Open Graph tags
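Figures like these come from scanning each page for particular markup. The slide does not say how the study was done, so the following is only a minimal sketch of one plausible check: it flags a page as using Open Graph if any `<meta>` tag has a `property` starting with `og:`, and as containing a Facebook URL if any `href` or `src` attribute mentions `facebook.com`. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import Iterable, Tuple


class SocialMarkupDetector(HTMLParser):
    """Records whether a page has Open Graph tags or Facebook URLs."""

    def __init__(self) -> None:
        super().__init__()
        self.has_open_graph = False
        self.has_facebook_url = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # Open Graph tags look like <meta property="og:title" ...>.
        if tag == "meta" and (attr.get("property") or "").startswith("og:"):
            self.has_open_graph = True
        # Simplified heuristic: any link or embed pointing at facebook.com.
        for name in ("href", "src"):
            if "facebook.com" in (attr.get(name) or ""):
                self.has_facebook_url = True


def scan(html: str) -> Tuple[bool, bool]:
    """Return (has_open_graph, has_facebook_url) for one page."""
    detector = SocialMarkupDetector()
    detector.feed(html)
    detector.close()
    return detector.has_open_graph, detector.has_facebook_url


def share_with_open_graph(pages: Iterable[str]) -> float:
    """Fraction of pages that carry at least one Open Graph tag."""
    flags = [scan(page)[0] for page in pages]
    return sum(flags) / len(flags) if flags else 0.0


# Two made-up pages for illustration.
page_a = ('<html><head><meta property="og:title" content="x"></head>'
          '<body><a href="http://www.facebook.com/page">fb</a></body></html>')
page_b = "<p>hello</p>"
```

Run over a crawl, the same per-page flags would be aggregated into percentages of the kind quoted on the slide.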
![Page 15: Common Crawl: An Open Repository of Web Data](https://reader038.vdocument.in/reader038/viewer/2022102700/554a374bb4c905293a8b468c/html5/thumbnails/15.jpg)
A corpus of anchortext-WikipediaConcept-Count records from the CommonCrawl dataset, to benefit research on word-sense disambiguation (WSD), NLP and IR.
Explicit Topic Modeling: Given a concept (represented as a Wikipedia page), it can tell which terms people most commonly use to describe the concept.
Given a sentence, it can help identify entities (person, location, organization) in the sentence and map them onto Wikipedia concepts.
http://wikientities.appspot.com
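The idea behind such a corpus can be sketched in a few lines: for every hyperlink that points at a Wikipedia article, count how often each anchor text is used for that article. This is only an illustration under assumptions, not the project's actual pipeline. The function names, the lowercasing of anchor text, and the sample links are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple
from urllib.parse import unquote, urlparse


def wikipedia_concept(href: str) -> Optional[str]:
    """Return the article title if href points at a Wikipedia page."""
    parsed = urlparse(href)
    host = parsed.netloc.lower()
    if host != "wikipedia.org" and not host.endswith(".wikipedia.org"):
        return None
    prefix = "/wiki/"
    if not parsed.path.startswith(prefix):
        return None
    return unquote(parsed.path[len(prefix):]) or None


def count_anchor_concepts(links: Iterable[Tuple[str, str]]) -> Counter:
    """Count (anchor text, concept) pairs over (anchor text, href) links."""
    counts: Counter = Counter()
    for anchor_text, href in links:
        concept = wikipedia_concept(href)
        text = anchor_text.strip().lower()  # assumed normalisation
        if concept and text:
            counts[(text, concept)] += 1
    return counts


def top_terms(counts: Counter, concept: str, n: int = 3) -> List[Tuple[str, int]]:
    """Most common anchor texts used to describe one concept."""
    per_term: Counter = Counter()
    for (text, linked_concept), k in counts.items():
        if linked_concept == concept:
            per_term[text] += k
    return per_term.most_common(n)


# Made-up links for illustration.
sample_links = [
    ("Big Apple", "http://en.wikipedia.org/wiki/New_York_City"),
    ("NYC", "http://en.wikipedia.org/wiki/New_York_City"),
    ("nyc", "http://en.wikipedia.org/wiki/New_York_City"),
    ("example", "http://example.com/"),
]
counts = count_anchor_concepts(sample_links)
```

Looking up `top_terms(counts, "New_York_City")` gives the concept-to-terms direction described above. Inverting the same counts (anchor text to most likely concept) is the basis for mapping entity mentions onto Wikipedia concepts.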
![Page 16: Common Crawl: An Open Repository of Web Data](https://reader038.vdocument.in/reader038/viewer/2022102700/554a374bb4c905293a8b468c/html5/thumbnails/16.jpg)
Mapping French websites related to Open Data
![Page 17: Common Crawl: An Open Repository of Web Data](https://reader038.vdocument.in/reader038/viewer/2022102700/554a374bb4c905293a8b468c/html5/thumbnails/17.jpg)
Other Use Examples
• Apache Giraph Testing
• Maplight
• Tineye
• Factual
• Sentiment Analysis Projects
![Page 18: Common Crawl: An Open Repository of Web Data](https://reader038.vdocument.in/reader038/viewer/2022102700/554a374bb4c905293a8b468c/html5/thumbnails/18.jpg)
In Development
• N-gram and Link Graph Extracts
• Pig Reader
• More Frequent Full Crawls
• Focused Subset Crawls at High Frequency
• Open Educational Resources
![Page 19: Common Crawl: An Open Repository of Web Data](https://reader038.vdocument.in/reader038/viewer/2022102700/554a374bb4c905293a8b468c/html5/thumbnails/19.jpg)
Lisa Green
@commoncrawl @boudicca
Thank You
London HUG