Introduction to Collections as Data

Follow along: https://tinyurl.com/whseumv

Joe Carrano, Alex McGee, Greta Suiter, Chris Tanguay, Rachel Van Unen

AGENDA

9:00 – 9:15 Group Norms and Intros

9:15 – 10:15 Collections as Data Introduction

10:15 – 10:45 Project Percolator

10:45 – 11:00 Break

11:00 – 12:00 Case Studies

12:00 – 1:00 Lunch

1:00 – 2:00 Overview of Collections as Data Tools

2:00 – 2:45 Collaboration and Assumptions

2:45 – 3:00 Wrap Up

Group norms / ROPES

● Respect

● Open mind, opportunity

● Participate / pass

● Empathy, Education, Escuchar

● Safe space

https://www.bjcschooloutreach.org/Portals/0/Tools%20for%20Building%20Healthier%20Youth/Materials/Gender%20Alphabet/ROPES%20Handout.pdf

Introductions

● Introduce yourself to someone you’ve never met before

● Name, affiliation, reason for attending

● As a pair, find another pair, and introduce your partner

Collections as Data Introduction

Definition

● Ordered information

● Stored digitally

● Inherently amenable to computation

- Elizabeth Russey Roke, bloggERS! July 16, 2019

https://saaers.wordpress.com/2019/07/16/collections-as-data/

Evolution of Digital Access

Early initiatives focused on digital surrogates

Special Collections and Archives, Cal Poly

Evolution of Digital Access

A collections as data approach provides access beyond the object itself, e.g., OCR text downloads

Special Collections and Archives, Cal Poly

“To see collections as data begins with reframing all digital objects as data.”

- Thomas Padilla, “On a Collections as Data Imperative”

http://digitalpreservation.gov/meetings/dcs16/tpadilla_OnaCollectionsasDataImperative_final.pdf

Collections as Data Myths

● I need to be a programmer / have access to a programmer’s time

○ False! Although public APIs are great!

■ New York Public Library

■ Library of Congress

● My project needs to have a very broad scope to reach as many people as possible

○ False! Don’t try to be all things to all people

http://api.repo.nypl.org/

https://libraryofcongress.github.io/data-exploration/

Collections as Data Myths (continued)

● My colleagues are so busy, I need to do this project on my own

○ False! Involving an appropriately sized group of various stakeholders will lead to better results

“Digital tools we build and provide are likely to reflect and perpetuate stereotypes, biases, and inequalities”

- Chris Bourg, OLA Super Conference, January 2015

https://chrisbourg.wordpress.com/2015/01/28/never-neutral-libraries-technology-and-inclusion/

The Collections as Data Project

Always Already Computational: Collections as Data focused on current and potential approaches to developing cultural heritage collections that support computationally-driven research and teaching.

“Collections as Data: Part to Whole aims to foster the development of broadly viable models that support implementation and use of collections as data.”

Principles

Santa Barbara Statement on Collections as Data

Version 2

Written by the Always Already Computational: Collections as Data project team

Collections as data development aims to encourage computational use of digitized and born digital collections.

Collections as data stewards are guided by ongoing ethical commitments.

Collection stewards aim to respect the rights and needs of the content creators, those represented in collections, and the communities that use them.

Collections as data stewards aim to lower barriers to use.

A range of accessible instructional materials and documentation should be developed to support data use.

Collections as data designed for everyone serve no one.

Specific needs inform collections as data development.

Shared documentation helps others find a path to doing the work.

➔ Document!

➔ Document!

➔ Document!

Collections as Data should be made openly accessible by default, except in cases where ethical or legal obligations preclude it.

Collections as data development values interoperability.

Collections as data stewards work transparently in order to develop trustworthy, long-lived collections.

Data as well as the data that describe those data are considered in scope.

➔ images

➔ metadata

➔ finding aids or other description

➔ data resulting from analysis of those data

The development of collections as data is an ongoing process and does not necessarily conclude with a final version.

Who is using collections as data?

Humanists

Distant reading

Text and Data mining

Network graphs

Spatial analysis, GIS, mapping

Timelines

N-gram

Basically any sort of data viz!

https://en.wikipedia.org/wiki/Distant_reading

https://dev.hiphopwordcount.com/

https://linkedjazz.org/

https://dsl.richmond.edu/panorama/redlining/

https://archives.library.illinois.edu/thought-collective/data-visualizations/grid-visualization/

https://books.google.com/ngrams#
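As a rough illustration of the "N-gram" method listed above, the sketch below counts word pairs (bigrams) in a short string. The sample sentence is invented for illustration, and the simple tokenizer is an assumption; tools such as the Google Books Ngram Viewer work at a much larger scale and with their own processing.

```python
from collections import Counter
import re


def ngrams(text, n=2):
    """Return the n-grams in text as tuples of lowercase words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Invented sample sentence, not drawn from any real collection.
sample = "the archive holds the letters and the archive holds the maps"

# Count how often each bigram occurs and list the most frequent ones.
bigram_counts = Counter(ngrams(sample, n=2))
print(bigram_counts.most_common(3))
```

Changing `n` to 1 or 3 gives single-word or three-word counts from the same function.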

Science Community

https://research.noaa.gov/article/ArtMID/587/ArticleID/2560/Old-weather-%E2%80%9Ctime-machine%E2%80%9D-opens-a-treasure-trove-for-researchers

https://escholarship.umassmed.edu/jeslib/vol8/iss2/5/

Data Journalists

https://www.storybench.org/how-the-wall-street-journal-visualized-the-500-conflicts-of-interest-of-the-trumps/

https://www.washingtonpost.com/graphics/2019/investigations/dea-pain-pill-database/

Civic data users

Community members

City planners

Open data advocates

Nonprofit and research professionals

https://civic-switchboard.github.io/

Social Scientists

https://journal.code4lib.org/articles/11358

Computer Scientists

Natural Language Processing

Named Entity Recognition

Sentiment analysis

Machine learning, Computer vision

Computational Archival Science is an example of a crossover between GLAM/comp sci

https://dcicblog.umd.edu/cas/

Teachers and Students

Need easy intro datasets for students, low barrier methods that aren’t programming heavy

Create workshops and classes for students to learn new methods

Miriam Posner’s intro to DH class website is a great example

http://miriamposner.com/classes/dh201w19/

Artists

https://twitter.com/pomological

http://sarahhattonartist.com/detachment/

Cultural Heritage Professionals

Archivists

Metadata experts

Data Librarians

Subject specialists

Repository administrators

Library administrators

Library developers

Outreach staff

50 things you can do

Within the next 1-6 months, what can you realistically and relatively easily do in your own workplace to improve or move forward collections as data work?

https://collectionsasdata.github.io/fiftythings/

50 things themes

● Generate institutional support/interest

○ ex. #10, 14, 15, 16, 17, 18, 19, 20, 23, 24, 29, 30, 31, 38, 40, 41, 45

● Lay groundwork for future projects

○ ex. #1, 2, 4, 5, 6, 7, 9, 11, 12, 13, 25, 27, 29, 32, 33, 34, 36, 37, 42, 47, 48

● Scope potential projects

○ ex. #43, 44, 46

Ways to access collections as data

Finding data

Data download – csv, xml, json, database, image files, etc

API calls, OAI-PMH for metadata

FOIA, lawsuit

Web scraping

Jupyter notebooks

Digitization/reformatting
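To make the "API calls, OAI-PMH for metadata" item above more concrete, here is a minimal sketch of building an OAI-PMH `ListRecords` request and reading titles out of a response. The repository address and the sample record are invented for illustration; `verb`, `metadataPrefix`, `set`, and the `oai_dc` format come from the OAI-PMH protocol. The sketch does not contact any server.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical repository endpoint, used only for illustration.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# A trimmed, made-up response in the shape OAI-PMH returns for oai_dc.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

# Namespace prefix mapping for the Dublin Core elements.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def titles(xml_text):
    """Return the text of every dc:title element in the response."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iterfind(".//dc:title", NS)]


print(list_records_url(BASE_URL))
print(titles(SAMPLE_RESPONSE))
```

A real harvest would also need to follow the protocol's resumption tokens to page through large result sets, which this sketch leaves out.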

Getting Data

http://miriamposner.com/classes/dh201w19/tutorials-guides/finding-data/

Once you have the data you often have to edit or enhance it to suit your needs

Project Percolator

What data set makes a good collections as data project? Where do I start?

Learning Goals

● Discuss the application of Collections as Data principles in relation to example data sets.

● Practice analyzing and evaluating example data sets to better recognize potential projects in the future.

Project Percolator

● Intro - 5 mins

● In groups of 2-3, choose a data set and discuss the questions below - 10 mins

○ What format is the data in?

○ What are the opportunities?

○ What are the challenges?

● Find another pair/group and share your data set and at least one opportunity or challenge - 10 mins

● Group share - 5 mins

BREAK!

Collections as Data Case Studies

A Matchbook Map

Case study from the University of Utah J. Willard Marriott Library Special Collections

ArcGIS in action!

Harold Stanley Sanders Matchbook Collection

➔ Series from the Harold Stanley Sanders collection

➔ Documents businesses in Utah

➔ Comprises 4 of 134 boxes in the collection

➔ 678 matchbooks

The Data

➔ Additional metadata by Rachel Wittmann

➔ Shared on GitHub

➔ CSV and TXT files

➔ Open for reuse

https://github.com/marriott-library/collections-as-data/tree/master/matchbooks
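Because the data is shared as CSV, it can be read with very little code. The sketch below is only an illustration: the rows and the column names (`title`, `city`) are invented and may not match the actual files in the repository, so check the real headers before adapting it.

```python
import csv
import io
from collections import Counter

# Made-up rows with hypothetical column names; the real files may differ.
SAMPLE_CSV = """title,city
Example Cafe,Salt Lake City
Example Motel,Ogden
Example Diner,Salt Lake City
"""


def count_by_column(csv_text, column):
    """Count how many rows share each value in the given column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


city_counts = count_by_column(SAMPLE_CSV, "city")
print(city_counts.most_common())
```

To work from a downloaded file instead, pass the result of `open(path, newline="", encoding="utf-8").read()` in place of `SAMPLE_CSV`.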


The Data



https://tinyurl.com/s7zbvxn



Library of Colors

A case study from the Library of Congress LC Labs Innovator in Residence, Jer Thorp

https://library-of-colors.glitch.me/

Analyzing permanent collection data at the National Gallery of Art in DC

“The National Gallery of Art will be the first American art museum to invite teams of data scientists and art historians to analyze, contextualize, and visualize its permanent collection data.”

https://www.nga.gov/press/2019/datathon.html



Six teams were given the NGA’s full permanent collection data

Team 1 - InceptionV3 neural network

Team 2 - Spatial data w/in paintings + w/in museum

Team 3 - Reflect vs. attract America

Team 4 - Diversity on display

Team 5 - Intention in acquisition

Team 6 - Focus on women donors

Team 1 - Pittsburgh - Neural network - clustering images by visual similarity

Team 3 - George Mason University

Team 4 - Diversity on display (or is it?) - calls for data curation along w/ art curation

https://www.nga.gov/content/dam/ngaweb/press/2019/datathon/team-4.pdf

Team 5 - undergrads from multiple institutions - Intention in acquisition

https://www.nga.gov/content/dam/ngaweb/press/2019/datathon/team-5.pdf

Team 6 - NGA - Women as donors

Takeaways from NGA datathon

● What do we want from the data?

○ How do we form questions to ask of the data?

○ What is the data telling us?

● What do we want from the project?

○ Who shares information about the project?

○ How are audiences engaged in the project?

○ Scale of project?

Eastern Apps: Visualizing Historic Prison Data

https://diglib.amphilsoc.org/labs/eastern-apps/




MIT Indie Tarot Collection

Every 3 minutes / Texas Runaway Ads Twitter Bots

● @Every3Minutes

● @TxRunawayAds

https://twitter.com/Every3Minutes

https://twitter.com/TxRunawayAds

Every 3 minutes

● Twitter bot created by historian W. Caleb McDaniel

● Tweets every 3 minutes that an enslaved person was sold, based on statistics compiled by historian Herbert Gutman

● Now includes data/images from cultural heritage collections

● More info

http://wcaleb.org/blog/slave-sales-on-twitter

Texas Runaway Ads

● Twitter bot created by students in historian W. Caleb McDaniel’s digital humanities course

● Tweets the text and image of a runaway slave ad from the Portal to Texas History twice a day

● More info

https://ricedh.github.io/05-twitterbot.html

LUNCH!

Collections as Data Tools

Photo by Ashim D’Silva on Unsplash

https://unsplash.com/@randomlies?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText

https://unsplash.com/?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText

Library as laboratory

2020 IAP library offerings - focus on making

● Basics of Copyrights, Data, and Software Intellectual Property

● Data Management for Postdocs and Research Scientists

● Data Visualization: Introduction to Tools and Principles

● GIS Level 1: Introduction to GIS & Mapping

● Introduction to Cleaning and Prepping Data with OpenRefine

● Introduction to R Graphics with ggplot2

● Introduction to Web Scraping with Python

● Lockedletter Book Making

● Make an Online Map: Introduction to Carto

● Software Carpentries: Introduction to Unix Shell, Python, and Git

http://student.mit.edu/iap/ns194.html

The Wikimedia universe

Wikipedia Written sum of human knowledge

Wikidata Structured data, semantic representation

Commons Multimedia formats

slides borrowed from Andrew Lih, @fuzheado, from LD4 2019 keynote

What is Wikidata?

● www.wikidata.org

● A website that anyone can edit

● That uses software called Wikibase (an extension to MediaWiki)

● To create a repository of linked open data

http://www.wikidata.org/

● Wikidata is composed of items

● Items have properties

● Properties have values

item            property      value
Noam Chomsky    Instance of   Human
Q9049           P31           Q5

item                                    property     value
Massachusetts Institute of Technology   Founded by   William Barton Rogers
Q49108                                  P112         Q2341286
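One way to picture the item / property / value pattern is as a small data structure. The sketch below stores the two statements from the slides and turns their identifiers into Wikidata page addresses and SPARQL-style triples. The `Statement` class and the helper function names are invented for illustration; the identifiers are the ones shown above.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One Wikidata-style statement (illustrative structure, not an official API)."""
    item: str   # Q identifier of the item
    prop: str   # P identifier of the property
    value: str  # Q identifier of the value


# The two statements shown on the slides.
STATEMENTS = [
    Statement("Q9049", "P31", "Q5"),          # Noam Chomsky, instance of, human
    Statement("Q49108", "P112", "Q2341286"),  # MIT, founded by, William Barton Rogers
]


def entity_url(identifier):
    """Return the Wikidata page address for a Q or P identifier."""
    if identifier.startswith("P"):
        return "https://www.wikidata.org/wiki/Property:" + identifier
    return "https://www.wikidata.org/wiki/" + identifier


def as_sparql_triple(statement):
    """Write the statement the way it would appear in a SPARQL query pattern."""
    return f"wd:{statement.item} wdt:{statement.prop} wd:{statement.value} ."


for statement in STATEMENTS:
    print(as_sparql_triple(statement), entity_url(statement.item))
```

The `wd:` and `wdt:` prefixes are the same ones used in the query later in these slides.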

Benefit: re-using, re-mixing and re-imagining that human knowledge in new innovative ways

What’s in Wikidata?

● 75,322,058 data items

○ One item for every topic that is in Wikipedia

○ There can also be items for topics that are not yet in Wikipedia

■ Places

■ People

■ Paintings & artworks

■ Scientific articles…

● Items are multilingual

https://www.wikidata.org/wiki/Special:Statistics

You can ask Wikidata things with SPARQL

Q1368825

query.wikidata.org

Let’s Query!

SELECT ?person ?personLabel ?birthplaceLabel ?coordinates ?birthdate
WHERE {
  ?person wdt:P485 wd:Q6784299 .       # person with archives at MIT Libraries
  ?person wdt:P21 wd:Q6581072 .        # person with sex or gender: female
  ?person wdt:P19 ?birthplace .        # person with any place of birth
  ?birthplace wdt:P625 ?coordinates .  # place of birth with any coordinate location
  ?person wdt:P569 ?birthdate .        # person with any date of birth
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

Ewan McAndrew Wikidata Sparql Query Tutorial - YouTube

https://youtu.be/1jHoUkj_mKw
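Queries like the one above can also be sent from a script rather than typed into query.wikidata.org. The following is a minimal sketch, assuming the public endpoint at `https://query.wikidata.org/sparql` and its JSON output; the shortened query, the User-Agent string, and the sample response are placeholders made up for illustration. Only `run_query` touches the network, and it is not called here.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://query.wikidata.org/sparql"

# A shortened version of the workshop query, kept small for illustration.
QUERY = """
SELECT ?person ?personLabel ?birthdate WHERE {
  ?person wdt:P485 wd:Q6784299 .
  ?person wdt:P569 ?birthdate .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def build_url(query):
    """Encode the query into a request URL that asks for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


def run_query(query):
    """Send the query over the network and return the parsed JSON."""
    # Placeholder User-Agent; replace it with one that identifies you.
    headers = {"User-Agent": "collections-as-data-workshop-example/0.1"}
    with urlopen(Request(build_url(query), headers=headers)) as response:
        return json.load(response)


def rows(result):
    """Flatten SPARQL JSON results into a list of plain dictionaries."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in result["results"]["bindings"]
    ]


# A made-up response in the standard SPARQL JSON results shape.
SAMPLE_RESULT = {
    "head": {"vars": ["person", "personLabel"]},
    "results": {"bindings": [
        {"person": {"type": "uri", "value": "http://www.wikidata.org/entity/Q0"},
         "personLabel": {"type": "literal", "value": "Example Person"}},
    ]},
}
print(rows(SAMPLE_RESULT))
```

Calling `rows(run_query(QUERY))` would return live results in the same flattened form.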

Use OpenRefine to see what is already in Wikidata

Detailed instructions: “Using OpenRefine to Reconcile Name Entities” by Karen H. on the Metro Fellowship website

http://mnylc.org/fellows/2017/03/17/using-openrefine-to-reconcile-name-entities/

OpenRefine

A data manipulation tool

https://openrefine.org/

Resources

OpenRefine tutorial: Getting Started with OpenRefine

British Library data sets: Downloads

https://thomaspadilla.org/dataprep/

http://www.bl.uk/bibliographic/datafree.html

Facet data to find possible duplication

Normalize your terms

General Refine Expression Language (GREL) example
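The faceting and normalizing steps above are done inside OpenRefine itself, but the underlying idea can be sketched in a few lines of Python. This is not OpenRefine's code: `fingerprint` below is a simplified stand-in for key-based clustering, doing roughly what a GREL expression such as `value.trim().toLowercase()` does plus punctuation removal and token sorting. The sample names are illustrative variants only.

```python
import string
from collections import Counter, defaultdict


def fingerprint(value):
    """Simplified key: trim, lowercase, drop punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def facet(values):
    """Count how often each distinct value appears, like a text facet."""
    return Counter(values)


def cluster(values):
    """Group distinct values that share a fingerprint (possible duplicates)."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Illustrative name variants; not taken from a real data set.
names = [
    "Rogers, William Barton",
    "William Barton Rogers",
    "rogers, william barton ",
    "Chomsky, Noam",
]

print(facet(names))
print(cluster(names))
```

The three Rogers variants end up in one cluster because they reduce to the same key, which is the kind of candidate duplicate a facet or cluster view is meant to surface for human review.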

Voyant-Tools

An online application for text analysis

https://voyant-tools.org/

RAWGraphs (rawgraphs.io)

A data visualization tool

https://rawgraphs.io/

Archives Unleashed

Data analysis of web archives

Archives Unleashed Cloud

Archives Unleashed Cloud allows you to import data from Archive-It and is currently free for Archive-It users.

● How to set it up with your account

● Allows you to create derivative sets of data to analyze

https://cloud.archivesunleashed.org

https://cloud.archivesunleashed.org/documentation

https://cloud.archivesunleashed.org/derivatives

Archives Unleashed Jupyter Notebooks

● Jupyter notebooks walk you through code snippets and how to run larger programs

● How to install and set up

● Let’s try an example

https://cloud.archivesunleashed.org/derivatives/notebooks

Questions?

Other tools that you love/work with/want to share with others?

Suggested tools:

● Timeline JS

● Drupal Views

● Internet Archive

● StoryMaps

Collaboration and Assumptions

Definitions

Assumption: a thing that is accepted as true or as certain to happen, without proof.

Bias: prejudice in favor of or against one thing, person, or group compared with another, usually in a way considered to be unfair.

Implicit bias refers to the attitudes or stereotypes that affect our understanding, actions, and decisions in an unconscious manner.

Thinking things through...

● How many people will be involved in this project and what will their roles be?

● What assumptions are you making about the project?

● What potential biases are there in these assumptions?

Assumption Junction - 15 mins

Go back to your data set and project percolator partner(s). Keeping in mind the format of the data, and what the opportunities and challenges are, spend 15 minutes discussing the following questions:

● What does the ideal team look like to complete this project? What can you do? What expertise do you need from others?

● How will you recruit and build your team?

● What does the complete project look like? What tools might you use?

● When you think about the opportunities and challenges for your project, what assumptions are you making?

Share out: What assumptions and/or biases did you talk about? Best practices for overcoming them?

Group Google Doc here:

https://tinyurl.com/r6t4345