TRANSCRIPT
Introduction to Collections as Data
Follow along: https://tinyurl.com/whseumv
Joe Carrano, Alex McGee, Greta Suiter, Chris Tanguay, Rachel Van Unen
AGENDA
9:00 – 9:15 Group Norms and Intros
9:15 – 10:15 Collections as Data Introduction
10:15 – 10:45 Project Percolator
10:45 – 11:00 Break
11:00 – 12:00 Case Studies
12:00 – 1:00 Lunch
1:00 – 2:00 Overview of Collections as Data Tools
2:00 – 2:45 Collaboration and Assumptions
2:45 – 3:00 Wrap Up
Group norms / ROPES
Respect
Open mind, opportunity
Participate / pass
Empathy, Education, Escuchar (listen)
Safe space
Introductions
● Introduce yourself to someone you’ve never met before
● Name, affiliation, reason for attending
● As a pair, find another pair, and introduce your partner
Collections as Data: Introduction
Definition
● Ordered information
● Stored digitally
● Inherently amenable to computation
- Elizabeth Russey Roke, bloggERS! July 16, 2019
Evolution of Digital Access
Early initiatives focused on digital surrogates
Special Collections and Archives, Cal Poly
Evolution of Digital Access
Collections as data approach provides access beyond just the object itself, e.g. OCR text downloads
Special Collections and Archives, Cal Poly
“To see collections as data begins with reframing all digital objects as data.”
- Thomas Padilla, “On a Collections as Data Imperative”
Collections as Data Myths
● I need to be a programmer / have access to a programmer’s time
○ False! Although public APIs are great!
■ New York Public Library
■ Library of Congress
● My project needs to have a very broad scope to reach as many people as possible
○ False! Don’t try to be all things to all people
Collections as Data Myths (continued)
● My colleagues are so busy, I need to do this project on my own
○ False! Involving an appropriately sized group of various stakeholders will lead to better results
“Digital tools we build and provide are likely to reflect and perpetuate stereotypes, biases, and inequalities”
- Chris Bourg, OLA Super Conference, January 2015
The Collections as Data Project
Always Already Computational: Collections as Data focused on current and potential approaches to developing cultural heritage collections that support computationally-driven research and teaching.
“Collections as Data: Part to Whole aims to foster the development of broadly viable models that support implementation and use of collections as data.”
Principles
Santa Barbara Statement on Collections as Data, Version 2
Written by the Always Already Computational: Collections as Data project team
Collections as data development aims to encourage computational use of digitized and born digital collections.
Collections as data stewards are guided by ongoing ethical commitments.
Collection stewards aim to respect the rights and needs of the content creators, those represented in collections, and the communities that use them.
Collections as data stewards aim to lower barriers to use.
A range of accessible instructional materials and documentation should be developed to support data use.
Collections as data designed for everyone serve no one.
Specific needs inform collections as data development.
Shared documentation helps others find a path to doing the work.
➔ Document!
➔ Document!
➔ Document!
Collections as Data should be made openly accessible by default, except in cases where ethical or legal obligations preclude it.
Collections as data development values interoperability.
Collections as data stewards work transparently in order to develop trustworthy, long-lived collections.
Data as well as the data that describe those data are considered in scope.
➔ images
➔ metadata
➔ finding aids or other description
➔ data resulting from analysis of those data
The development of collections as data is an ongoing process and does not necessarily conclude with a final version.
Who is using collections as data?
Humanists
Distant reading
Text and Data mining
Network graphs
Spatial analysis, GIS, mapping
Timelines
N-gram
Basically any sort of data viz!
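As a minimal sketch of the n-gram counting listed above, the following Python tallies word bigrams in a short text. The sample sentence and the function name `ngram_counts` are illustrative assumptions, not workshop data.

```python
from collections import Counter
import re


def ngram_counts(text, n=2):
    """Count word n-grams in text (lowercased, punctuation ignored)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Slide a window of n words across the text.
    # Note: this simple version lets n-grams span sentence boundaries.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)


# Made-up sample text for illustration only.
sample = "Collections as data. Data as collections. Collections as data again."
counts = ngram_counts(sample, n=2)
print(counts.most_common(3))
```

Run over a whole corpus instead of one sentence, the same counts are the raw material for n-gram charts and other distant-reading views.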
Science Community
Data Journalists
Civic data users
Community members
City planners
Open data advocates
Nonprofit and research professionals
Social Scientists
Computer Scientists
Natural Language Processing
Named Entity Recognition
Sentiment analysis
Machine learning, Computer vision
Computational Archival Science is an example of a crossover between GLAM/comp sci
Teachers and Students
Need easy intro datasets for students, low barrier methods that aren’t programming heavy
Create workshops and classes for students to learn new methods
Miriam Posner’s intro to DH class website is a great example
Cultural Heritage Professionals
Archivists
Metadata experts
Data Librarians
Subject specialists
Repository administrators
Library administrators
Library developers
Outreach staff
50 things you can do
Within the next 1-6 months, what can you do in your own workplace, realistically and relatively easily, to improve or move forward collections as data work?
50 things themes
● Generate institutional support/interest
○ ex. #10, 14, 15, 16, 17, 18, 19, 20, 23, 24, 29, 30, 31, 38, 40, 41, 45
● Lay groundwork for future projects
○ ex. #1, 2, 4, 5, 6, 7, 9, 11, 12, 13, 25, 27, 29, 32, 33, 34, 36, 37, 42, 47, 48
● Scope potential projects
○ ex. #43, 44, 46
Ways to access collections as data
Finding data
Data download – CSV, XML, JSON, database, image files, etc.
API calls, OAI-PMH for metadata
FOIA, lawsuit
Web scraping
Jupyter notebooks
Digitization/reformatting
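To make the OAI-PMH route above more concrete, here is a minimal sketch that builds a `ListRecords` request URL and pulls Dublin Core titles out of a response. The base URL `https://example.org/oai` and the sample record are hypothetical; a real repository publishes its own endpoint. The function names are likewise illustrative.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Standard Dublin Core namespace used in oai_dc records.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL."""
    return base_url + "?" + urlencode(
        {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    )


def titles_from_response(xml_text):
    """Return every dc:title found in an OAI-PMH ListRecords response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iterfind(".//dc:title", NS)]


# Hypothetical response, trimmed to a single made-up record.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example record title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://example.org/oai"))
print(titles_from_response(SAMPLE))
```

The sketch parses a stored response rather than making a network call, so it can be tried without access to any particular repository.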
Getting Data
Once you have the data you often have to edit or enhance it to suit your needs
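One common enhancement is pulling a usable year out of free-text date fields. The sketch below is a simple regular-expression approach; the date strings are hypothetical examples, and the 1500-2099 range is an assumption you would adjust for your own collection.

```python
import re

# Matches a standalone four-digit year between 1500 and 2099 (assumed range).
YEAR = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def extract_year(date_text):
    """Return the first four-digit year in a date string, or None."""
    match = YEAR.search(date_text or "")
    return int(match.group(1)) if match else None


# Hypothetical date values of the kind found in legacy metadata.
raw_dates = ["circa 1952", "1948-03-12", "undated", "Jan. 5, 1961"]
years = [extract_year(d) for d in raw_dates]
print(years)
```

Keeping the original string alongside the extracted year preserves the nuance ("circa", "undated") that the normalized value drops.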
Project Percolator
What data set makes a good collections as data project? Where do I start?
Learning Goals
● Discuss the application of Collections as Data principles in relation to example data sets.
● Practice analyzing and evaluating example data sets to better recognize potential projects in the future.
Project Percolator
● Intro - 5 mins
● In groups of 2-3 choose a data set and discuss the questions below - 10 mins
○ What format is data in?
○ What are opportunities?
○ What are challenges?
● Find another pair/group and share data set and at least one opportunity or challenge - 10 mins
● Group share - 5 mins
BREAK!
Collections as Data: Case Studies
A Matchbook Map
Case study from the University of Utah J. Willard Marriott Library Special Collections
ArcGIS in action!
Harold Stanley Sanders Matchbook Collection
➔ Series from the Harold Stanley Sanders collection
➔ Documents businesses in Utah
➔ Comprises 4 of 134 boxes in the collection
➔ 678 matchbooks
The Data
➔ Additional metadata by Rachel Wittmann
➔ Shared on GitHub
➔ CSV and TXT files
➔ Open for reuse
https://github.com/marriott-library/collections-as-data/tree/master/matchbooks
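As a rough sketch of how a CSV like this can feed a map, the Python below converts rows with coordinates into GeoJSON, a format many mapping tools accept. The column names (`name`, `city`, `latitude`, `longitude`) and the two rows are hypothetical; the actual matchbook files may be structured differently.

```python
import csv
import io
import json

# Hypothetical rows; the real CSV's column names and values may differ.
SAMPLE_CSV = """name,city,latitude,longitude
Example Cafe,Salt Lake City,40.7608,-111.8910
Example Motel,Ogden,41.2230,-111.9738
"""


def csv_to_geojson(csv_text):
    """Convert rows with latitude/longitude columns to a GeoJSON FeatureCollection."""
    features = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {
                "type": "Point",
                "coordinates": [float(row["longitude"]), float(row["latitude"])],
            },
            "properties": {"name": row["name"], "city": row["city"]},
        })
    return {"type": "FeatureCollection", "features": features}


collection = csv_to_geojson(SAMPLE_CSV)
print(json.dumps(collection, indent=2))
```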
Library of Colors
A case study from the Library of Congress LC Labs Innovator in Residence, Jer Thorp
Analyzing permanent collection data at the National Gallery of Art in DC
“The National Gallery of Art will be the first American art museum to invite teams of data scientists and art historians to analyze, contextualize, and visualize its permanent collection data.”
6 teams given the NGA’s full permanent collection data
Team 1 - InceptionV3 neural network
Team 2 - Spatial data w/in paintings + w/in museum
Team 3 - Reflect vs. attract America
Team 4 - Diversity on display
Team 5 - Intention in acquisition
Team 6 - Focus on women donors
Team 1 - Pittsburgh - Neural network - clustering images by visual similarity
Team 3 - George Mason University
Team 4 - Diversity on display (or is it?) - calls for data curation along w/ art curation
Team 5 - undergrads from multiple institutions - Intention in acquisition
Team 6 - NGA - Women as donors
Takeaways from NGA datathon
● What do we want from the data?
○ How do we form questions to ask of the data?
○ What is the data telling us?
● What do we want from the project?
○ Who shares information about the project?
○ How are audiences engaged in the project?
○ Scale of project?
Eastern Apps: Visualizing Historic Prison Data
[Figure: visualization panels labeled “Women” and “Men”]
MIT Indie Tarot Collection
Every 3 minutes/Texas Runaway Ads Twitter Bots
● @Every3Minutes
● @TxRunawayAds
Every 3 minutes
● Twitter bot created by historian W. Caleb McDaniel
● Tweets every 3 minutes to mark the sale of an enslaved person, a rate based on statistics compiled by historian Herbert Gutman
● Now includes data/images from cultural heritage collections
● More info
Texas Runaway Ads
● Twitter bot created by students in historian W. Caleb McDaniel’s digital humanities course
● Twice a day, tweets the text and image of a runaway slave ad from the Portal to Texas History
● More info
LUNCH!
Collections as Data: Tools
Photo by Ashim D’Silva on Unsplash
Library as laboratory
2020 IAP library offerings - focus on making
● Basics of Copyrights, Data, and Software Intellectual Property
● Data Management for Postdocs and Research Scientists
● Data Visualization: Introduction to Tools and Principles
● GIS Level 1: Introduction to GIS & Mapping
● Introduction to Cleaning and Prepping Data with OpenRefine
● Introduction to R Graphics with ggplot2
● Introduction to Web Scraping with Python
● Lockedletter Book Making
● Make an Online Map: Introduction to Carto
● Software Carpentries: Introduction to Unix Shell, Python, and Git
The Wikimedia universe
Wikipedia Written sum of human knowledge
Wikidata Structured data, semantic representation
Commons Multimedia formats
slides borrowed from Andrew Lih, @fuzheado, from LD4 2019 keynote
What is Wikidata?
● www.wikidata.org
● A website that anyone can edit
● That uses software called Wikibase (an extension to MediaWiki)
● To create a repository of linked open data
● Wikidata is composed of items
● Items have properties
● Properties have values
item                                    property      value
Noam Chomsky                            Instance of   Human
Q9049                                   P31           Q5

item                                    property      value
Massachusetts Institute of Technology   Founded by    William Barton Rogers
Q49108                                  P112          Q2341286
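The item / property / value pattern maps naturally onto a three-part record. The following is a minimal sketch, not Wikidata's own data model: the `Statement` class and helper names are hypothetical, while the identifiers and labels are the ones shown above.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    item: str   # Q identifier of the subject
    prop: str   # P identifier of the property
    value: str  # Q identifier of the value


# Human-readable labels for the identifiers shown above.
LABELS = {
    "Q9049": "Noam Chomsky",
    "P31": "instance of",
    "Q5": "human",
    "Q49108": "Massachusetts Institute of Technology",
    "P112": "founded by",
    "Q2341286": "William Barton Rogers",
}

STATEMENTS = [
    Statement("Q9049", "P31", "Q5"),
    Statement("Q49108", "P112", "Q2341286"),
]


def describe(statement):
    """Render a statement using labels instead of identifiers."""
    return " ".join(LABELS[part] for part in statement)


def item_url(q_id):
    """Return the Wikidata page address for a Q item."""
    return "https://www.wikidata.org/wiki/" + q_id


for s in STATEMENTS:
    print(describe(s), "->", item_url(s.item))
```

Because every part of a statement is an identifier rather than a string of text, the same statement can be displayed in any language that has labels for those identifiers.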
Benefit: re-using, re-mixing and re-imagining that human knowledge in new innovative ways
What’s in Wikidata?
● 75,322,058 data items
○ One item for every topic that is in Wikipedia
○ There can also be items for topics that are not yet in Wikipedia
■ Places
■ People
■ Paintings & artworks
■ Scientific articles…
● Items are multilingual
You can ask Wikidata things with SPARQL
Q1368825
query.wikidata.org
Let’s Query!
SELECT ?person ?personLabel ?birthplaceLabel ?coordinates ?birthdate
WHERE {
  ?person wdt:P485 wd:Q6784299 .        # person with archives at MIT Libraries
  ?person wdt:P21 wd:Q6581072 .         # person with sex or gender: female
  ?person wdt:P19 ?birthplace .         # person with any place of birth
  ?birthplace wdt:P625 ?coordinates .   # place of birth with any coordinate location
  ?person wdt:P569 ?birthdate .         # person with any date of birth
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
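Outside the browser, the same query can be sent to the query service and the JSON results read in a script. The sketch below only builds the request URL and parses a stored response, so it runs offline. The single result row is made up for illustration (it is not real query output), and the helper names are hypothetical.

```python
import re
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"


def query_url(sparql):
    """Build a GET URL that asks the endpoint for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})


def parse_point(wkt):
    """Turn a WKT literal such as 'Point(-71.09 42.36)' into (lat, lon)."""
    match = re.fullmatch(r"Point\(([-\d.]+) ([-\d.]+)\)", wkt)
    lon, lat = (float(x) for x in match.groups())  # WKT lists longitude first
    return lat, lon


def rows(result):
    """Flatten SPARQL JSON bindings into plain dictionaries."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in result["results"]["bindings"]
    ]


# Hypothetical response with one made-up row, in the standard SPARQL JSON shape.
SAMPLE = {
    "head": {"vars": ["person", "personLabel", "birthplaceLabel",
                      "coordinates", "birthdate"]},
    "results": {"bindings": [{
        "person": {"type": "uri", "value": "http://www.wikidata.org/entity/Q0"},
        "personLabel": {"type": "literal", "value": "Example Person"},
        "birthplaceLabel": {"type": "literal", "value": "Example Town"},
        "coordinates": {"type": "literal", "value": "Point(-71.09 42.36)"},
        "birthdate": {"type": "literal", "value": "1900-01-01T00:00:00Z"},
    }]},
}

for row in rows(SAMPLE):
    print(row["personLabel"], parse_point(row["coordinates"]))
```

With latitude and longitude separated out, the birthplaces are ready to plot in a mapping tool.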
Ewan McAndrew Wikidata Sparql Query Tutorial - YouTube
Use OpenRefine to see what is already in Wikidata
Detailed instructions: “Using OpenRefine to Reconcile Name Entities” by Karen H. on The Metro Fellowship website
Resources
OpenRefine tutorial: Getting Started with OpenRefine
British Library data sets: Downloads
Facet data to find possible duplication
Normalize your terms
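The idea behind faceting and normalizing can be sketched in a few lines of Python. A text facet is essentially a count of each distinct value; grouping values by a normalized key then shows which variants probably refer to the same thing. The `fingerprint` function here is a simplified take on the key-collision idea used in OpenRefine's clustering, not its exact algorithm, and the names are hypothetical.

```python
from collections import Counter
import string


def fingerprint(value):
    """Simplified key: lowercase, drop punctuation, sort unique tokens."""
    cleaned = value.strip().lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    return " ".join(sorted(set(cleaned.split())))


# Hypothetical name column with inconsistent entries.
names = ["Smith, John", "John Smith", "smith john ", "Doe, Jane"]

# A text facet: how often each distinct raw value occurs.
facet = Counter(names)
print(facet)

# Grouping by fingerprint reveals which distinct values likely match.
clusters = {}
for name in names:
    clusters.setdefault(fingerprint(name), []).append(name)
print(clusters)
```

Four distinct raw values collapse into two clusters, which is the kind of duplication a facet makes visible and normalization resolves.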
General Refine Expression Language (GREL) example
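For readers more comfortable in a scripting language, a GREL cell transformation has a close Python analogue. The sketch below assumes a cleanup along the lines of `value.trim().replace(/\s+/, " ")`; that expression and the cell values are illustrative choices, not taken from the slide.

```python
import re


def clean_cell(value):
    """Trim the ends and collapse internal runs of whitespace to one space."""
    return re.sub(r"\s+", " ", value.strip())


# Hypothetical messy cell values.
cells = ["  Salt  Lake City ", "Ogden\t", "Provo"]
cleaned = [clean_cell(c) for c in cells]
print(cleaned)
```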
Archives Unleashed
Data analysis of web archives
Archives Unleashed Cloud
Archives Unleashed Cloud imports data from Archive-It and is currently free for Archive-It users.
● How to set it up with your account
● Allows you to create derivative sets of data to analyze
Archives Unleashed Jupyter Notebooks
● Jupyter notebooks walk you through code snippets and how to run larger programs
● How to install and set up
● Let’s try an example
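One simple analysis of a web archive is counting captures per domain. The following is a minimal sketch that assumes a plain list of captured URLs; the list is hypothetical and is not the Archives Unleashed output format.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_counts(urls):
    """Count how many URLs fall under each host name, ignoring a leading 'www.'."""
    hosts = (urlparse(u).netloc.lower() for u in urls)
    return Counter(h[4:] if h.startswith("www.") else h for h in hosts)


# Hypothetical captured URLs for illustration.
urls = [
    "https://www.example.org/about",
    "https://example.org/news/1",
    "http://blog.example.com/post",
]
counts = domain_counts(urls)
print(counts.most_common())
```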
Questions?
Other tools that you love/work with/want to share with others?
Suggested tools:
● Timeline JS
● Drupal Views
● Internet Archive
● StoryMaps
Collaboration and Assumptions
Definitions
Assumption: a thing that is accepted as true or as certain to happen, without proof.
Bias: prejudice in favor of or against one thing, person, or group compared with another, usually in a way considered to be unfair.
Implicit bias refers to the attitudes or stereotypes that affect our understanding, actions, and decisions in an unconscious manner.
Thinking things through...
● How many people will be involved in this project and what will their roles be?
● What assumptions are you making about the project?
● What potential biases are there in these assumptions?
Assumption Junction - 15 mins
Go back to your data set and project percolator partner(s). Keeping in mind the format of the data, and what the opportunities and challenges are, spend 15 minutes discussing the following questions:
● What does the ideal team look like to complete this project? What can you do? What expertise do you need from others?
● How will you recruit and build your team?
● What does the complete project look like? What tools might you use?
● When you think about the opportunities and challenges for your project what assumptions are you making?
Share out: What assumptions and/or biases did you talk about? Best practices for overcoming them?
Group Google Doc here:
https://tinyurl.com/r6t4345
Thank you!
Joe Carrano | [email protected]
Alex McGee | [email protected]
Greta Suiter | @gkur | [email protected]
Chris Tanguay | [email protected]
Rachel Van Unen | [email protected]