open data bay area (obda) | kurt bollacker: public metadata commons

A Public Metadata Commons: What is it? Why do we need it? How do we get it? Kurt Bollacker Open Data Bay Area 2012 Nov 27 1 Wednesday, April 3, 2013

Upload: open-data-bay-area-obda

Post on 08-May-2015

TRANSCRIPT

Page 1: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

A Public Metadata Commons: What is it?

Why do we need it? How do we get it?

Kurt Bollacker

Open Data Bay Area

2012 Nov 27

1 Wednesday, April 3, 2013

Page 2: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

A long time ago, there was no “open” data. All of the media we used to create was physical.

Page 3: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

Then most (all?) of the media became digital.

Page 4: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

The Internet let us ship data around for (almost) free.

Page 5: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

And we learned how to connect it all together.

So naturally, we started to build a Global Digital Data Commons!

Page 6: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

At first it was a “free for all” of academics and enthusiasts.

Almost all data on the Web was considered to be “open”.

Page 7: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

And then folks figured out how to make money from our contributions,

so they started to “lock down” part of the Internet that previously would have been part of the commons.

Page 8: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

Why is this bad?

For the data archivist, centrally controlled data depends on far fewer points (a single one?) where a failure can make it unavailable:

• Technical Failure

• Legal Barriers

• Incompetence

Page 9: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

"Those who cannot remember the past are condemned to repeat it" --- George Santayana

A (Potential) Digital Dark Age

Page 10: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

How Do We Avoid This Lockdown Of Central Control (And Hopefully A Digital Dark Age)?

We Need A Practical Perspective On the Problem.

Page 11: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

Example Surviving Archives

Page 12: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

Data tends to survive if, over the long term, it is:

• Visible

• Mobile

• Well Loved

These happen to also be the properties of data in a public commons.

Page 13: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

Historical Examples:

• Bible / Torah / Koran

• U.S. Constitution

• DNA?

Present Day Examples:

• Wikipedia

• Open Street Maps

• Freebase

• MusicBrainz

Why?

• There are many copies. (mobile)

• Their use is mostly unrestricted. (visible)

• Everyone can access and contribute. (well loved)

Page 14: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

But what about data that is still trapped by:

• Technical Barriers?

• Legal Restrictions?

• Limited Resources?

Page 15: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

We build a metadata commons to hold the “cultural context” of our trapped data.

Page 16: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

How does a metadata commons work?

Even if the original contribution is lost or otherwise made unavailable, we still have its cultural context.

[Diagram: Trapped Datasets, Extraction Processes, Metadata]

Page 17: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

The cultural context in a metadata commons might contain:

• Indices and Tags (to find and organize)

• Comments (to analyze and interpret)

• Technical metadata (e.g. provenance, format info)

• Transforms and Interpretations (to make something useful)
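The four kinds of cultural context listed above could be held in one record per trapped dataset. The sketch below is only an illustration: the `MetadataRecord` class, its field names, the `find_by_tag` helper, and the example dataset ids are all hypothetical and do not come from the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MetadataRecord:
    """Hypothetical record holding the cultural context of one trapped dataset."""
    dataset_id: str                                          # reference to the trapped dataset
    tags: List[str] = field(default_factory=list)            # indices and tags (find, organize)
    comments: List[str] = field(default_factory=list)        # comments (analyze, interpret)
    technical: Dict[str, str] = field(default_factory=dict)  # provenance, format info
    transforms: List[str] = field(default_factory=list)      # pointers to derived works


def find_by_tag(records, tag):
    """Return the ids of all records that carry the given tag."""
    return [r.dataset_id for r in records if tag in r.tags]


# Made-up example records; the ids and tags are placeholders.
records = [
    MetadataRecord("example-dataset-1", tags=["maps", "bay-area"],
                   technical={"format": "csv"}),
    MetadataRecord("example-dataset-2", tags=["music"]),
]
print(find_by_tag(records, "maps"))
```

Because the record refers to the dataset only by id, it stays usable even if the dataset itself becomes unavailable.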

Page 18: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

Where is the trapped data that we care about? A lot of it is on the World Wide Web!

But the Web is:

• Very large (10TB - 100TB for accessible / deduped)

• Very noisy (useless pages, partial duplicates)

• Very diverse (in content, purpose, and target audience)

How do we build a Metadata Commons from the Web?

Page 19: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

A Practical Place To Start:

Common Crawl (and cheap cloud computing resources) make the Web far cheaper and easier to access and manipulate.

• Can be downloaded wholesale

• Can be processed and analyzed in situ.

• Parts can be publicly referenced

Page 20: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

This foundation helps us scale up to “Web size”, but:

• What is the useful “metadata of the Web”?

• How do we extract that metadata?

Page 21: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

Useful Web Extracts:

• Are interesting to many people (to me!)

• Can be used to answer relevant questions.

• Can be used to build useful products and services.

Almost everyone will have an itch to scratch.

Page 22: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

Specific Examples Of Useful Web Extracts

(From the Common Crawl code contest)

• WikiEntities

• Congressional sentiment

• Reach of Facebook on the Web

Page 23: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

(A Few) General Shapes Of Web Metadata Extracts

• Link graphs

• N-gram counts

• File Indices by domain or keyword

• Mashups with interesting datasets:

  • Wikipedia

  • Freebase

  • Location databases (e.g. Open Street Maps)

We should all create an extract!
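Two of the shapes listed on this slide, link graphs and n-gram counts, are simple enough to sketch with the Python standard library alone. This is a minimal illustration that assumes each page is already available as a URL plus an HTML or plain-text string. The names `LinkCollector`, `link_edges`, and `ngram_counts` are hypothetical, and the whitespace tokenization is a deliberate simplification.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href targets of <a> tags found in one page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def link_edges(page_url, html):
    """Return (source domain, target domain) edges for one page."""
    parser = LinkCollector()
    parser.feed(html)
    source = urlparse(page_url).netloc
    edges = []
    for href in parser.hrefs:
        # Resolve relative links against the page URL before taking the domain.
        target = urlparse(urljoin(page_url, href)).netloc
        if target:
            edges.append((source, target))
    return edges


def ngram_counts(text, n=2):
    """Count word n-grams in plain text, splitting on whitespace."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
```

Calling `link_edges` on every page of a crawl and concatenating the results gives a domain-level link graph; summing the `Counter` objects returned by `ngram_counts` gives corpus-wide n-gram counts.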

Page 24: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

How do I create an extract?

An easy recipe:

• Ingredients:

• A Web crawl snapshot

• A little bit of programming skill

• Access to cloud computing resources (e.g. EMR)

• Directions:

• http://commoncrawl.org/mapreduce-for-the-masses/
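The linked directions are built around MapReduce. To show the shape of such a job without any cloud setup, here is a small in-memory sketch that counts crawled pages per domain. It is not the code from the linked tutorial and does not read Common Crawl's actual file formats: the `(url, content)` input tuples, the function names, and the example URLs are assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_page(url, content):
    """Map step: emit (domain, 1) for each crawled page."""
    domain = urlparse(url).netloc
    if domain:
        yield domain, 1


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one domain."""
    return key, sum(values)


def run_job(pages):
    """Run map, group by key (the shuffle), then reduce, all in memory."""
    grouped = defaultdict(list)
    for url, content in pages:
        for key, value in map_page(url, content):
            grouped[key].append(value)
    return dict(reduce_counts(k, v) for k, v in grouped.items())


# Placeholder pages standing in for records of a crawl snapshot.
pages = [
    ("http://a.example/1", "<html>...</html>"),
    ("http://a.example/2", "<html>...</html>"),
    ("http://b.example/", "<html>...</html>"),
]
print(run_job(pages))
```

On a real cluster the framework would perform the grouping step across machines; only the map and reduce functions would carry over from a sketch like this.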

Page 25: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

What Happens Once I’ve Made This Awesome Extract?

• Share the extracted data

• Share the code you created / modified

• https://github.com/commoncrawl/commoncrawl-examples/

• Broadcast it to the world!

Page 26: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

And The World Is Saved!

Thank you.

Page 27: Open Data Bay Area (OBDA) | Kurt Bollacker: Public Metadata Commons

Some Useful Links

• https://github.com/commoncrawl

• http://commoncrawl.org/mapreduce-for-the-masses/

• https://github.com/commoncrawl/commoncrawl-examples/

• https://aws.amazon.com/amis/common-crawl-quick-start

• https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set
