dhoxss 2014 - introduction to relational databases
DESCRIPTION
This introduction to relational databases was presented at the Digital Humanities at Oxford Summer School 2014.
TRANSCRIPT
18 July, 2014
An Introduction to Relational Databases
Dr James A J Wilson
Dr Meriel Patrick
Relational Databases
Defined in 1970
First commercially available relational database management system released by Oracle in 1979
Widespread adoption by both business and research communities
Underpin many websites
Well understood and widely supported
Digital Humanities Summer School -An Introduction to Relational Databases
Options when structuring data
Spreadsheets
    Recording the common properties of a single thing
    Numerical analysis
    Generating charts and graphs
Relational databases
    Recording the common properties of multiple related things
    Flexible querying
Document-orientated databases / ‘semi-structured’ databases
    Recording items which share some common properties
    Avoids need to define rigid structure in advance
XML / XML databases
    Categorizing elements of text
RDF (Resource Description Framework) triplestores
    Records relationships between things (basis of Semantic Web)
When to use a relational database
You are collecting information about things which share common properties
You want to be able to list particular records that meet certain conditions
You wish to encourage consistency
You want to be efficient, and avoid duplication of information
You value flexibility when querying
Good for collaborative working – one person sets up the database, many can edit the data, many more can view or query the data
Structure of a relational database - tables
Example scenario: study of 18th century book trade
What things are we interested in?
    Publications
    Publishers
    People
    Our sources for the information we’re collecting
And what information might we want to know about each of these things?
    Names
    Dates
    Places
    References
Person
    Surname
    First name
    Middle initial(s)
    Date of birth
    Notes
Publication
    Title
    Author(s)
    Publisher
    Date of publication
    Place of publication
    Edition
    Format
    Type of publication
    Price
    Sales
    Notes
Publisher
    Name
    Staff
    Founded
    Ceased
    Address
    Notes
Reference
    Author(s)
    Title
    Date of publication
    Edition
    Volume
    Page(s)
    URL
    Notes
Structure of a relational database – data types
Most relational database management systems require that each field has a defined data type
    Text (e.g. varchar, memo)
    Numeric (e.g. integer, decimal)
    Date
    Boolean (true / false; on / off)
    Blob (for otherwise undefined data, such as image files)
Each table needs at least one field that only contains unique values, which can be used as a ‘primary key’
    Commonly an auto-incrementing whole (integer) number
Person
    ID                    Int
    Surname               Text
    First name            Text
    Middle initial(s)     Text
    Date of birth         Date
    Notes                 Text
Publication
    ID                    Int
    Title                 Text
    Author(s)             Text
    Publisher             Text
    Date of publication   Int?
    Place of publication  Text
    Edition               Int
    Format                Text
    Type of publication   Text
    Price                 Dec?
    Sales                 Int?
    Notes                 Text
Publisher
    ID                    Int
    Name                  Text
    Staff                 Text
    Founded               Int?
    Ceased                Int?
    Address               Text
    Notes                 Text
Reference
    ID                    Int
    Author(s)             Text
    Title                 Text
    Date of publication   Int?
    Edition               Int?
    Volume                Int?
    Page(s)               Text?
    URL                   Text
    Notes                 Text
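As a rough illustration of typed fields and an integer primary key, the Person table might be declared as follows. This is only a sketch: it assumes Python's built-in sqlite3 module (the slides do not prescribe a system), the lower-case column names are hypothetical renderings of the field names above, and the two rows are invented.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# Hypothetical column names modelled on the Person table above.
# INTEGER PRIMARY KEY gives each row a unique whole-number ID,
# assigned automatically by SQLite when none is supplied.
conn.execute(
    """
    CREATE TABLE person (
        id              INTEGER PRIMARY KEY,
        surname         TEXT NOT NULL,
        first_name      TEXT,
        middle_initials TEXT,
        date_of_birth   TEXT,  -- SQLite has no DATE type, so ISO 8601 text is used here
        notes           TEXT
    )
    """
)

# Invented sample rows; no ID is given, so the database assigns one.
conn.execute(
    "INSERT INTO person (surname, first_name, date_of_birth) VALUES (?, ?, ?)",
    ("Smith", "Ann", "1701-02-03"),
)
conn.execute(
    "INSERT INTO person (surname, first_name) VALUES (?, ?)",
    ("Jones", "Tom"),
)

person_ids = [row[0] for row in conn.execute("SELECT id FROM person ORDER BY id")]
print(person_ids)
```

Other systems spell the types differently (for example varchar rather than TEXT), so the declaration would need adjusting for MySQL, PostgreSQL, or Access.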
Structure of a relational database - relationships
Our different things are related to one another
    A person may be the author of a publication, or a reference work, or they may be a publisher
    Each edition of a publication has a publisher, or maybe more than one?
    The information you record about a particular publication, or publisher, may come from one or more sources
Relationships between things can be of various sorts:
    One-to-many (e.g. a publisher may have many publications)
    Many-to-many (e.g. a publication may have many authors, and an author may have many publications)
    One-to-one (rarely used – can improve performance, overcome system limitations, or enable more granular access permissions)
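A one-to-many relationship is usually recorded by storing the ID of the 'one' side in a field on the 'many' side. The sketch below assumes Python's built-in sqlite3 module; the table and column names (publisher, publication, publisher_id) and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript(
    """
    CREATE TABLE publisher (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE publication (
        id           INTEGER PRIMARY KEY,
        title        TEXT NOT NULL,
        -- Each publication points at exactly one publisher.
        publisher_id INTEGER REFERENCES publisher(id)
    );
    -- Invented rows: one publisher with two publications.
    INSERT INTO publisher (id, name) VALUES (1, 'Example Press');
    INSERT INTO publication (title, publisher_id) VALUES
        ('First Invented Title', 1),
        ('Second Invented Title', 1);
    """
)

titles = [
    row[0]
    for row in conn.execute(
        "SELECT title FROM publication WHERE publisher_id = ? ORDER BY id", (1,)
    )
]

# A publication that points at a non-existent publisher is rejected.
try:
    conn.execute(
        "INSERT INTO publication (title, publisher_id) VALUES (?, ?)",
        ("Orphaned Title", 99),
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(titles, orphan_rejected)
```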
Person
    ID                    Int
    Surname               Text
    First name            Text
    Middle initial(s)     Text
    Date of birth         Date
    Reference             Int
    Page                  Text
    Notes                 Text
Publication
    ID                    Int
    Title                 Text
    Author(s)             INT
    Publisher             INT
    Date of publication   Int?
    Place of publication  Text
    Edition               Int
    Format                Text
    Type of publication   Text
    Price                 Dec?
    Sales                 Int?
    Reference             Int
    Page                  Text
    Notes                 Text
Publisher
    ID                    Int
    Name                  Text
    Staff                 Text
    Founded               Int?
    Ceased                Int?
    Address               Text
    Reference             Int
    Page                  Text
    Notes                 Text
Reference
    ID                    Int
    Author(s)             Text
    Title                 Text
    Date of publication   Int?
    Edition               Int?
    Volume                Int?
    URL                   Text
    Notes                 Text
[Diagram: lines linking the tables, with cardinality labels 1, ∞, ?1, ∞, ∞, ∞]
[Diagram label: Many to many]
Person
    ID                    Int
    Surname               Text
    First name            Text
    Middle initial(s)     Text
    Date of birth         Date
    Reference             Int
    Page                  Text
    Notes                 Text
Publication
    ID                    Int
    Title                 Text
    Publisher             INT
    Date of publication   Int?
    Place of publication  Text
    Edition               Int
    Format                Text
    Type of publication   Text
    Price                 Dec?
    Sales                 Int?
    Reference             Int
    Page                  Text
    Notes                 Text
Publisher
    ID                    Int
    Name                  Text
    Staff                 Text
    Founded               Int?
    Ceased                Int?
    Address               Text
    Reference             Int
    Page                  Text
    Notes                 Text
Reference
    ID                    Int
    Author(s)             Text
    Title                 Text
    Date of publication   Int?
    Edition               Int?
    Volume                Int?
    URL                   Text
    Notes                 Text
[Diagram: lines linking the tables, with cardinality labels 1, ∞, ?1, ∞, ∞, ∞ and the label "Many to many"]
Authorship
    ID                    Int
    Author                Int
    Publication           Int
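The Authorship table acts as a link table: each row pairs one person with one publication, so a publication can have several authors and a person several publications. A minimal sketch, assuming Python's built-in sqlite3 module, with the Person and Publication tables cut down to two columns each and with invented rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (id INTEGER PRIMARY KEY, surname TEXT NOT NULL);
    CREATE TABLE publication (id INTEGER PRIMARY KEY, title TEXT NOT NULL);

    -- Link table: one row per (author, publication) pairing.
    CREATE TABLE authorship (
        id          INTEGER PRIMARY KEY,
        author      INTEGER NOT NULL REFERENCES person(id),
        publication INTEGER NOT NULL REFERENCES publication(id)
    );

    INSERT INTO person (id, surname) VALUES (1, 'Author A'), (2, 'Author B');
    INSERT INTO publication (id, title) VALUES (1, 'Title X'), (2, 'Title Y');
    -- Title X has two authors, and Author A wrote both titles.
    INSERT INTO authorship (author, publication) VALUES (1, 1), (2, 1), (1, 2);
    """
)

# All authors of one publication.
authors_of_x = [
    row[0]
    for row in conn.execute(
        """
        SELECT p.surname
        FROM authorship a
        INNER JOIN person p ON a.author = p.id
        WHERE a.publication = 1
        ORDER BY p.surname
        """
    )
]

# All publications by one author.
titles_by_a = [
    row[0]
    for row in conn.execute(
        """
        SELECT pub.title
        FROM authorship a
        INNER JOIN publication pub ON a.publication = pub.id
        WHERE a.author = 1
        ORDER BY pub.title
        """
    )
]

print(authors_of_x, titles_by_a)
```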
Alternative structures
If you are certain that no publication is going to have more than three authors, you might want to have fields in the ‘publication’ table for author1, author2, author3 – each with a one-to-many relationship with the ‘person’ table
You could create another table just consisting of IDs and different types of publication. This could then be linked to the ‘publication’ table and act as a controlled vocabulary
Have a separate table for edition information. In most cases authors will not change, format might, sales and price almost certainly will. This will avoid data duplication
    But maybe authors will be credited differently (anon revealed?), or titles vary between editions?
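The controlled-vocabulary idea can be sketched as a small look-up table that the publication table refers to by ID. This assumes Python's built-in sqlite3 module; the names publication_type and type_id, and the two example types, are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    -- Look-up table: the only permitted types of publication.
    CREATE TABLE publication_type (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE publication (
        id      INTEGER PRIMARY KEY,
        title   TEXT NOT NULL,
        type_id INTEGER NOT NULL REFERENCES publication_type(id)
    );
    INSERT INTO publication_type (id, name) VALUES (1, 'Pamphlet'), (2, 'Novel');
    INSERT INTO publication (title, type_id) VALUES
        ('Title X', 1), ('Title Y', 2), ('Title Z', 2);
    """
)

# Because every record uses a term from the list, counting by type is reliable.
counts = conn.execute(
    """
    SELECT t.name, COUNT(*)
    FROM publication p
    INNER JOIN publication_type t ON p.type_id = t.id
    GROUP BY t.name
    ORDER BY t.name
    """
).fetchall()

# A type that is not in the vocabulary cannot be used.
try:
    conn.execute(
        "INSERT INTO publication (title, type_id) VALUES (?, ?)", ("Title Q", 99)
    )
    unlisted_type_rejected = False
except sqlite3.IntegrityError:
    unlisted_type_rejected = True

print(counts, unlisted_type_rejected)
```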
Database design – good practice
Database normalization:
    Shouldn’t have to enter the same data twice
    Separate tables for separate things
    Don’t define duplicate fields in the same table (e.g. author1, author2, etc.)
    Fields should be ‘atomic’ – containing information at the most granular level (usually)
Enforce data integrity
Keep ‘blobs’ of data (images, audio, etc.) outside of your database or at the very least in separate tables; include links to the files within the database
Table / field naming conventions:
    Be consistent
    Avoid spaces, punctuation marks, and other non-alphanumeric characters (although it’s fine to use underscores instead of spaces)
Document your database!
    You will thank yourself later
Database design workflow
Querying a relational database
Queries usually constructed using SQL statements
    SQL stands for ‘Structured Query Language’
Some Relational Database Management Systems hide the raw SQL from the user by providing query-builder tools
SELECT statements indicate which fields should be returned
FROM statements indicate the table(s) in which those fields are to be found
JOIN statements are used when you wish to query multiple tables
WHERE statements provide the conditions that a record must meet in order to be listed in results
ORDER BY statements control the order in which results are returned
Querying a relational database - examples
Imagine we have a single table in a database, called ‘countries’
Countries
    ID            Int.
    name          Text
    area          Int.
    population    Int.
    continent     Text
    If_visited    Bool.
    observations  Text
SELECT * FROM Countries
    would return all information about all countries
SELECT name, area, population FROM Countries
    would return only the information in the named fields
SELECT * FROM Countries WHERE If_visited = TRUE
    would return all information about countries that have been visited
SELECT * FROM Countries WHERE If_visited = TRUE AND population > 1000000
    would return all information about countries that have been visited and have a population of greater than a million
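The single-table queries can be tried out with a throwaway database. The sketch below assumes Python's built-in sqlite3 module and uses invented placeholder rows rather than real statistics. SQLite stores Boolean values as 0 and 1, so the condition is written as `If_visited = 1`, which also works on SQLite versions that predate the TRUE keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Countries (
        ID           INTEGER PRIMARY KEY,
        name         TEXT,
        area         INTEGER,
        population   INTEGER,
        continent    TEXT,
        If_visited   INTEGER,  -- Boolean stored as 0 / 1
        observations TEXT
    );
    -- Invented placeholder rows, not real statistics.
    INSERT INTO Countries (name, area, population, continent, If_visited) VALUES
        ('Country A', 1000, 5000000, 'Europe', 1),
        ('Country B', 2000,  500000, 'Europe', 1),
        ('Country C', 3000, 9000000, 'Asia',   0);
    """
)

# All information about all countries.
all_rows = conn.execute("SELECT * FROM Countries").fetchall()

# Only the named fields.
named = conn.execute(
    "SELECT name, area, population FROM Countries ORDER BY ID"
).fetchall()

# Countries that have been visited (column 1 of each row is the name).
visited = [
    row[1]
    for row in conn.execute(
        "SELECT * FROM Countries WHERE If_visited = 1 ORDER BY name"
    )
]

# Visited countries with a population greater than a million.
big_visited = [
    row[1]
    for row in conn.execute(
        "SELECT * FROM Countries WHERE If_visited = 1 AND population > 1000000"
    )
]

print(len(all_rows), named[0], visited, big_visited)
```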
Querying a relational database - examples
JOINS are used to search across multiple tables
SELECT c.name, c.area, c.population, d.name
FROM Countries c INNER JOIN Continents d ON c.continent = d.ID
WHERE c.If_visited = TRUE AND d.name = 'Europe'
    would return selected information about each European country visited
Countries
    ID            Int.
    name          Text
    area          Int.
    population    Int.
    continent     Int.
    If_visited    Bool.
    observations  Text
Continents
    ID            Int.
    name          Text
    area          Int.
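With the continent moved into its own table, the join can be run as follows. Again this is a sketch under assumptions: Python's built-in sqlite3 module, invented placeholder rows, and `If_visited = 1` standing in for the Boolean test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Continents (
        ID   INTEGER PRIMARY KEY,
        name TEXT,
        area INTEGER
    );
    CREATE TABLE Countries (
        ID           INTEGER PRIMARY KEY,
        name         TEXT,
        area         INTEGER,
        population   INTEGER,
        continent    INTEGER REFERENCES Continents(ID),
        If_visited   INTEGER,  -- Boolean stored as 0 / 1
        observations TEXT
    );
    -- Invented placeholder rows; continent areas are left empty (NULL).
    INSERT INTO Continents (ID, name) VALUES (1, 'Europe'), (2, 'Asia');
    INSERT INTO Countries (name, area, population, continent, If_visited) VALUES
        ('Country A', 1000, 5000000, 1, 1),
        ('Country B', 2000,  500000, 1, 0),
        ('Country C', 3000, 9000000, 2, 1);
    """
)

# Each country row is matched to the continent row whose ID it stores.
european_visited = conn.execute(
    """
    SELECT c.name, c.area, c.population, d.name
    FROM Countries c INNER JOIN Continents d ON c.continent = d.ID
    WHERE c.If_visited = 1 AND d.name = 'Europe'
    ORDER BY c.name
    """
).fetchall()

print(european_visited)
```

Country B is European but unvisited and Country C is visited but Asian, so only Country A meets both conditions.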
What query results look like
A single table / spreadsheet
    Although software / websites may format results into a report.
What can you do with your results?
Count, sort, and sometimes filter further
Export and analyse
    .csv file format is standard, and compatible with almost all statistical analysis / data visualisation software
Save and make available to others
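Exporting a result set to .csv might look like the sketch below. It assumes Python's built-in sqlite3 and csv modules and invented rows, and it writes to an in-memory buffer; a real export would open a file instead.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Countries (ID INTEGER PRIMARY KEY, name TEXT, population INTEGER);
    -- Invented placeholder rows, not real statistics.
    INSERT INTO Countries (name, population) VALUES
        ('Country B', 500000),
        ('Country A', 5000000);
    """
)

# Sort in the query itself, then write the rows out unchanged.
cursor = conn.execute("SELECT name, population FROM Countries ORDER BY name")

# cursor.description holds one entry per result column; item 0 is its name.
header = [column[0] for column in cursor.description]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(cursor)

csv_text = buffer.getvalue()
print(csv_text)
```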
Common database challenges in the humanities
Patchy or incomplete data
    Beware of the difference between 0 and null
Varying degrees of accuracy
    Often an issue with historical dates
    Splitting the separate elements of a date into separate fields may help
Interpreted and uncertain information
    Include a field indicating the degree of certainty of a particular ‘fact’ – e.g. ‘Definite, Probable, Possible’
Inconsistent or changing terminology
    Alternative spellings, different forms of address, name changes
    Can be an idea to have a table of controlled vocabulary
‘Fuzziness’ vs. ‘queryableness’
    e.g. if you store a date as ‘c. 310 BCE’, you can’t use it in a conditional query such as ‘list all the inscriptions from the fourth century BCE’
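One way of keeping an uncertain date queryable is to store the year as a number and the uncertainty in a separate field. The sketch below assumes Python's built-in sqlite3 module; the inscription table, its columns (including the invented `copies` field used to show 0 against null), the convention of negative numbers for BCE years, and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE inscription (
        id        INTEGER PRIMARY KEY,
        label     TEXT NOT NULL,
        year      INTEGER,  -- negative = BCE; NULL = date unknown
        certainty TEXT,     -- 'Definite', 'Probable' or 'Possible'
        copies    INTEGER   -- NULL = not recorded; 0 = recorded as none
    );
    -- Invented rows. 'c. 310 BCE' becomes year -310 with certainty 'Probable'.
    INSERT INTO inscription (label, year, certainty, copies) VALUES
        ('Inscription A', -310, 'Probable', 0),
        ('Inscription B', -250, 'Definite', 4),
        ('Inscription C', NULL, NULL, NULL);
    """
)

def labels(sql):
    """Run a query and return the first column of each row as a list."""
    return [row[0] for row in conn.execute(sql)]

# The fourth century BCE runs from 400 to 301 BCE.
fourth_century = labels(
    "SELECT label FROM inscription WHERE year BETWEEN -400 AND -301"
)

# 0 and NULL are different answers to 'how many copies?'.
zero_copies = labels("SELECT label FROM inscription WHERE copies = 0")
unknown_copies = labels("SELECT label FROM inscription WHERE copies IS NULL")

# Aggregates skip NULL, so the average is taken over two rows, not three.
average_copies = conn.execute("SELECT AVG(copies) FROM inscription").fetchone()[0]

print(fourth_century, zero_copies, unknown_copies, average_copies)
```

Had the missing count been entered as 0 instead of being left null, the average would have been pulled down by a value that was never observed.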
Your exercise today…
Draft a structure for a relational database recording information about membership of gentlemen’s clubs in Victorian London
    Think about the tables, fields, and relationships you’d need
Your evidence collection (membership records, letters, diaries, etc.) tells you which clubs people belonged to, and when
However, the information is patchy
    Names may not be given in full – identity is sometimes uncertain
    Dates may be uncertain or missing
All clubs have multiple members; some people were members of multiple clubs at varying periods
Over the years, some clubs changed locations
Our example solution
Possible enhancements
If dates are uncertain, integer may not be the best data type.
Make the relationship between club_memberships and evidence many-to-many rather than one-to-many
    Done by adding a link table
Split author entries into a separate table
    Allows multiple authors for each piece of evidence
Impose a controlled vocabulary on the occupation field by adding a look-up table
Add longitude and latitude to the addresses table.
Relational Database Management Systems
Databases can seem rather complicated, but there is software that can help
    MS Access
    FileMaker Pro
    Coming soon to Oxford – the Online Research Database Service (ORDS)
For web-hosted relational database manipulation:
    MySQL
    PostgreSQL
Questions?