Data Warehousing: the New Knowledge Management Architecture for Humanities Research?

Data Warehousing and Knowledge Management Slide 1 Data Warehousing: the New Knowledge Management Architecture for Humanities Research? Janet Delve University of Portsmouth, UK UKAIS 2004



Page 1: Data Warehousing: the New Knowledge Management Architecture for Humanities Research?

Data Warehousing and Knowledge Management

Slide 1

Data Warehousing: the New Knowledge Management Architecture for Humanities Research?

Janet Delve

University of Portsmouth, UK

UKAIS 2004

Slide 2

Introduction

Data Warehouses everywhere

• Amazon

• Wal*Mart

• Opodo

DWs are used a lot in industry and scientific research, but not in humanities research.

Written paper covers linguistics and history. Talk covers history in detail and gestures towards linguistics.

Slide 3

Overview

• Introduction

• Data modelling and traditional databases

• Source-oriented data modelling

• Data Mining

• Philosophy of data warehousing

• Background of DWs

• Basic components of a data warehouse (DW)

• Advantages of DWs

• Findings – Humanities and DWs

• Humanities and DWs – some issues

• Examples of possible Humanities DWs

• Ideas for the future?

Slide 4

Data Modelling

Relational data modelling – material split into many tables in order to gain enhanced performance – no duplication, updating or insertion anomalies etc.

Source-oriented data modelling – emphasis on modelling data as closely as possible to original source which is included in its entirety for posterity.

DW data modelling nearer to source-oriented approach in spirit.

Slide 5

Traditional databases

[Figure: ERD, Harvey and Press p. 117]

Slide 6

Traditional databases

[Figure: Harvey and Press p. 129]

Slide 7

Historical Data

This can be difficult to model because:

• It is irregular in structure

• It is complex

• It is erratic in terms of when it occurs

Using a relational database can mean data from a single source being split into many tables.
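The splitting described above can be sketched as follows. This is only an illustration under stated assumptions: the source entry, the table names and the column names are all hypothetical and are not taken from Harvey and Press. It uses Python's built-in sqlite3 module.

```python
import sqlite3

# Hypothetical source entry; every name and value here is invented.
source_entry = {
    "person": "Person A",
    "parish": "Parish A",
    "occupation": "baker",
    "hearths": 3,
}

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parish (parish_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE occupation (occupation_id INTEGER PRIMARY KEY, title TEXT UNIQUE);
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name TEXT,
        parish_id INTEGER REFERENCES parish(parish_id),
        occupation_id INTEGER REFERENCES occupation(occupation_id)
    );
    CREATE TABLE assessment (
        person_id INTEGER REFERENCES person(person_id),
        hearths INTEGER
    );
""")


def store(entry):
    """Split one source entry across four normalised tables."""
    conn.execute("INSERT OR IGNORE INTO parish (name) VALUES (?)", (entry["parish"],))
    conn.execute("INSERT OR IGNORE INTO occupation (title) VALUES (?)", (entry["occupation"],))
    parish_id = conn.execute(
        "SELECT parish_id FROM parish WHERE name = ?", (entry["parish"],)
    ).fetchone()[0]
    occupation_id = conn.execute(
        "SELECT occupation_id FROM occupation WHERE title = ?", (entry["occupation"],)
    ).fetchone()[0]
    person_id = conn.execute(
        "INSERT INTO person (name, parish_id, occupation_id) VALUES (?, ?, ?)",
        (entry["person"], parish_id, occupation_id),
    ).lastrowid
    conn.execute(
        "INSERT INTO assessment (person_id, hearths) VALUES (?, ?)",
        (person_id, entry["hearths"]),
    )


store(source_entry)

# Reassembling the original entry now needs a three-way join.
rebuilt = conn.execute("""
    SELECT p.name, pa.name, o.title, a.hearths
    FROM person p
    JOIN parish pa ON pa.parish_id = p.parish_id
    JOIN occupation o ON o.occupation_id = p.occupation_id
    JOIN assessment a ON a.person_id = p.person_id
""").fetchone()

tables_used = [
    name
    for name in ("parish", "occupation", "person", "assessment")
    if conn.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0] > 0
]
```

One line of the source ends up in four tables, and the historian has to join them all to see the entry as it was written.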

Slide 8

Source-oriented data modelling

‘a semantic network tempered by hierarchical considerations’ [Thaller 1991, 155].

Its flexible nature gives a ‘rubber band data structures’ facility [Denley 1994, 37].

The fluid nature of creating a database with such a system marks it out as an ‘organic’ DBMS.

Slide 9

Data Mining

The whole field is often referred to as data mining, although data mining is strictly just one major component within it.

Data mining (DM) is normally used on large quantities (terabytes) of data, to find meaningful patterns. Neural nets, statistical modelling and decision trees are just some of the AI methods used. SQL can be used too. Parallel data processing is used with DM.

In order to mine data, it must be kept in a suitable system - a data warehouse is ideal.

Slide 10

Philosophy of data warehousing

‘Data warehousing is an architecture, not a technology. There is the architecture, and there is the underlying technology, and they are two very different things. Unquestionably there is a relationship between data warehousing and database technology, but they are most certainly not the same. Data warehousing requires the support of many different kinds of technology.’

Inmon 2002

Slide 11

Background of DWs

Business-oriented – serve the analytical needs of a company. The ordinary DBMS is still needed for the day-to-day queries, and also to feed the DW.

W.H. Inmon, father of DW. Cabinet effect – 1991

R. Kimball, expert on dimensional modelling

Need for a single, integrated source of clean data, particularly for multinational and similar companies

Supporting technology from e.g. Oracle, Prism Solutions, IBM

Slide 12

Data Marts

Data marts contain DW data but are restricted to one department or one business process.

The industry is divided about data marts:

Inmon recommends building the DW first, then siphoning off the data to data marts.

Kimball believes you should build several data marts first, then integrate them into a DW.

Slide 13

Basic components of a Data Warehouse (DW)

A DW is subject-oriented, integrated, non-volatile & time-variant.

The major subjects for an insurance company are customer, policy, premium and claim. Previously, data was modelled around applications – car, health, life and accident.

Integration is the most important facet of a DW. Previous inconsistencies are ironed out and all data unambiguously entered into DW. Many sources of data can be placed in DW.
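A minimal sketch of what ironing out inconsistencies can look like, assuming three hypothetical application systems that each encode the same attribute differently. The system names, the attribute and the encodings are invented for illustration.

```python
# Hypothetical per-source encodings of one attribute (all invented).
SOURCE_ENCODINGS = {
    "car_system": {"m": "male", "f": "female"},
    "health_system": {"1": "male", "0": "female"},
    "life_system": {"male": "male", "female": "female"},
}


def integrate(source, raw_value):
    """Map a source-specific code onto the single warehouse encoding."""
    mapping = SOURCE_ENCODINGS[source]
    key = str(raw_value).strip().lower()
    if key not in mapping:
        # Refuse ambiguous input rather than guess.
        raise ValueError(f"unmapped value {raw_value!r} from {source}")
    return mapping[key]


records = [("car_system", "M"), ("health_system", 0), ("life_system", "Female")]
integrated = [integrate(source, value) for source, value in records]
```

Rejecting unmapped values, rather than passing them through, is one way to keep everything that enters the DW unambiguous.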

Slide 14

Basic components of a Data Warehouse (DW)

Non-volatile – data in a DW is not changed in the way data is in an operational database: it is loaded en masse and isn’t updated. This obviates the need for normalisation.

Time-variant – DW time horizon 5–10 years, operational database 2–3 months. A DW holds snapshots, an operational database holds current data. A DW always has an element of time, an operational database may or may not have. Inmon 2002
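The two properties above can be sketched together as an append-only snapshot table. This is an illustrative assumption, not Inmon's design: the table name, columns, dates and premium values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE policy_snapshot (
        snapshot_date TEXT NOT NULL,   -- the element of time carried by every row
        policy_id INTEGER NOT NULL,
        premium REAL NOT NULL,
        PRIMARY KEY (snapshot_date, policy_id)
    )
""")


def load_snapshot(snapshot_date, rows):
    """Bulk-load one snapshot; rows already in the table are never updated."""
    with conn:  # commit the whole batch, or roll it back on error
        conn.executemany(
            "INSERT INTO policy_snapshot VALUES (?, ?, ?)",
            [(snapshot_date, policy_id, premium) for policy_id, premium in rows],
        )


# Invented figures for two yearly snapshots.
load_snapshot("2003-12-31", [(1, 200.0), (2, 350.0)])
load_snapshot("2004-12-31", [(1, 220.0), (2, 350.0)])

# Because nothing is overwritten, the history of policy 1 is still there.
history = conn.execute(
    "SELECT snapshot_date, premium FROM policy_snapshot "
    "WHERE policy_id = 1 ORDER BY snapshot_date"
).fetchall()
```

An operational database would typically overwrite the premium in place and so lose the earlier value.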

Slide 15

Kimball p7

Basic components of a Data Warehouse (DW)

Slide 16

Typical Architecture of a Data Warehouse

Slide 17

Meta Data

Meta data is extremely important in a DW. It is used:

• to log the extraction and loading of data into the warehouse;

• in query management to locate the most appropriate data source and also to help end users to build queries;

• to show how the data has been mapped when carrying out data cleansing and transformations;

• to manage all the data in the DW – recording where the data came from, when, etc.
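A minimal sketch of the first and third uses listed above: each load is logged, together with a note on how the data was transformed. The record layout, the file name and the cleansing rule are hypothetical, and a real DW would keep this meta data in its own repository rather than in a Python list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LoadRecord:
    """One meta data entry describing a single load into the warehouse."""
    source: str
    rows_loaded: int
    transformation: str
    loaded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


warehouse = []      # stands in for the DW itself
metadata_log = []   # stands in for the meta data repository


def load(source, rows, transform, description):
    """Cleanse rows, load them, and record where they came from and when."""
    cleaned = [transform(row) for row in rows]
    warehouse.extend(cleaned)
    record = LoadRecord(source, len(cleaned), description)
    metadata_log.append(record)
    return record


# Hypothetical source file and surnames.
record = load(
    "hearth_tax_extract.csv",
    [" smith ", "JONES"],
    lambda surname: surname.strip().title(),
    "trim whitespace, title-case surnames",
)
```

With a log like this, any value in the warehouse can be traced back to its source and to the cleansing step applied to it.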

Slide 18

Basic components of a Data Warehouse (DW)

Fact Tables

‘A fact table is the primary table in a dimensional model where the numerical performance measurements of the business are stored…

The measurement data resulting from a business process is stored in a single data mart

Since measurement data is overwhelmingly the largest part of any data mart, we avoid duplicating it in multiple places around the enterprise’ Kimball 2002

Slide 19

Basic components of a Data Warehouse (DW)

Dimension tables

These contain the textual descriptors of the business. Their depth and breadth define the usefulness of the DW.

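They contain data that doesn’t change frequently.

Can have 50-100 attributes.

Not usually normalized (snowflake and starflake schemas are the normalized and partly normalized variants).

Cryptic codes in place of descriptive text are disparaged (long-term view).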
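A minimal sketch of how a fact table and its dimension tables fit together in a star schema, using sqlite3. The table names, columns and all figures are hypothetical and are not taken from Kimball.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: textual descriptors, deliberately not normalized.
    CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, year INTEGER, month_name TEXT);
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT, category TEXT);
    -- Fact table: foreign keys to the dimensions plus numerical measurements.
    CREATE TABLE sales_fact (
        date_key INTEGER REFERENCES date_dim(date_key),
        product_key INTEGER REFERENCES product_dim(product_key),
        quantity INTEGER,
        amount REAL
    );
""")

# Invented sample rows.
conn.executemany("INSERT INTO date_dim VALUES (?, ?, ?)",
                 [(1, 2003, "December"), (2, 2004, "January")])
conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?)",
                 [(1, "Book A", "Books"), (2, "Map A", "Maps")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                 [(1, 1, 2, 40.0), (1, 2, 1, 5.0), (2, 1, 3, 60.0)])

# Analysis: constrain and group by dimension attributes, sum the facts.
totals = conn.execute("""
    SELECT d.year, p.category, SUM(f.amount)
    FROM sales_fact f
    JOIN date_dim d ON d.date_key = f.date_key
    JOIN product_dim p ON p.product_key = f.product_key
    GROUP BY d.year, p.category
    ORDER BY d.year, p.category
""").fetchall()
```

Storing the category as plain text in the dimension table, rather than as a code, keeps the query results readable without a lookup.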

Slide 20

Star schema Kimball p51

Basic components of a Data Warehouse (DW)

Slide 21

Kimball p43

Basic components of a Data Warehouse (DW)

Slide 22

Basic components of a Data Warehouse (DW)

Kimball p39

Slide 23

Data Warehousing Tools and Technologies

Building a data warehouse is a complex task because there is no vendor that provides an ‘end-to-end’ set of tools.

This means a data warehouse has to be built using multiple products from different vendors.

Ensuring that these products work well together and are fully integrated is a major challenge.

Slide 24

Advantages of DWs

• Flexibility in modelling data.

• Time dimension – country-specific calendars and synchronization across multiple time zones.

• Easy to add external data and summarised data.

• Built for analysis.

• Built for huge volumes of data (terabytes of data – a trillion, 10^12).

• Can cope with ‘idiosyncrasies of geographic location dimensions’ within GISs.

Slide 25

Possible advantages of DWs

• Indexing facilities of DW.

• Publishing the ‘right data’ – data collected from a variety of sources and edited for quality and consistency.

• DW seeks to collate all data so a variety of different subsets can be analysed whenever required.

• Easy to extend DW and add material from a new source.

• Data cleansing techniques.

• Tracking facility afforded by meta data.

Slide 26

Disadvantages of DWs

• Some humanities data fits into the ‘numerical fact’ topology, some doesn’t

• The technology is not easy, and it is based on having existing databases to extract from

• Humanities data does not arrive as regular snapshots, but data sets taken at different periods of time (e.g. 1841 census, 1861 census) could serve as their equivalent

• A lot to learn.

Slide 27

Findings – Humanities and DWs

NAGARA (National Association of Government Archives and Records Administrators)

Article on DWs by Mary Klauda (archivist) of the Minnesota Historical Society, 1999

Eastern Connecticut schools DW 2002

Bo Wandschneider – University of Guelph, Canada – DW and the use of census data. ICPSR (Inter-university Consortium for Political and Social Research)

Slide 28

Findings – Humanities and DWs

University of California DW – memo to Humanities department

Social Science DW – Human Resources DW project of Human Sciences Research Council, South Africa

GEOBASE, Israel. DW of Israel’s regional statistics, supported by National Planning Authority in the Ministry of Interior Affairs.

Slide 29

Humanities and DWs – some issues

Scale – can cope with really large country- or state-wide problems.

Can analyse e.g. British censuses 1841-1901 (10^8).

Can put several databases together to produce a time run – e.g. hearth taxes, window taxes, poll taxes, land taxes, poor rates all in one DW.

Oracle site licenses.
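The point above about putting several databases together to produce a time run can be sketched as follows. The three extracts, their field names and every figure are invented, and a real DW would do this at load time rather than in memory.

```python
# Hypothetical extracts from three separate tax databases (all figures invented).
hearth_tax = [{"year": 1664, "parish": "Parish A", "amount": 12.0}]
window_tax = [{"year": 1750, "parish": "Parish A", "amount": 20.0}]
land_tax = [
    {"year": 1798, "parish": "Parish A", "amount": 35.0},
    {"year": 1750, "parish": "Parish A", "amount": 8.0},
]


def build_time_run(**sources):
    """Merge rows from several sources into one chronologically sorted run."""
    run = []
    for source_name, rows in sources.items():
        for row in rows:
            # Keep the source name so each row stays traceable to its origin.
            run.append((row["year"], row["parish"], source_name, row["amount"]))
    return sorted(run)


time_run = build_time_run(hearth_tax=hearth_tax, window_tax=window_tax, land_tax=land_tax)
```

Each tuple keeps the name of the database it came from, so one parish can be followed across tax types and periods in a single query.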

Slide 30

Examples of possible History DWs

Slide 31

Examples of possible History DWs

MANOR
----------------------------
Manor Id
Holding Id
Property Id
Original Owner Id
Date
Manor Value
Tax (Hides)
Cottar Population
Bordar Population
Villein Population
Sokeman Population
Priest Population
Number of Burgesses
Number of slaves
Etc.

HOLDING DETAILS
-----------------------------
Holding ID
King
Tenant in Chief
Manor Lord
VILL
Etc.

ORIGINAL OWNER
----------------------------
Original Owner ID
Etc.

PROPERTY INFORMATION
--------------------------------
Property Id
Property description
Property value
Etc.
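The four tables above can be read as a star schema, with MANOR as the fact table and the other three as dimensions. A minimal sqlite3 sketch under that assumption follows. Only some of the listed columns are included, the `name` column on the owner table is a hypothetical addition, and every row is placeholder data rather than real Domesday material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE holding_details (
        holding_id INTEGER PRIMARY KEY,
        king TEXT, tenant_in_chief TEXT, manor_lord TEXT, vill TEXT
    );
    CREATE TABLE original_owner (
        original_owner_id INTEGER PRIMARY KEY,
        name TEXT  -- hypothetical column; the slide only says 'Etc.'
    );
    CREATE TABLE property_information (
        property_id INTEGER PRIMARY KEY, description TEXT, property_value REAL
    );
    -- Central table: keys to the three surrounding tables plus numeric measures.
    CREATE TABLE manor (
        manor_id INTEGER PRIMARY KEY,
        holding_id INTEGER REFERENCES holding_details(holding_id),
        property_id INTEGER REFERENCES property_information(property_id),
        original_owner_id INTEGER REFERENCES original_owner(original_owner_id),
        date TEXT,
        manor_value REAL,
        tax_hides REAL,
        villein_population INTEGER
    );
""")

# Placeholder rows only.
conn.executemany("INSERT INTO holding_details VALUES (?, ?, ?, ?, ?)", [
    (1, "King A", "Tenant A", "Lord A", "Vill X"),
    (2, "King A", "Tenant B", "Lord B", "Vill Y"),
])
conn.executemany("INSERT INTO original_owner VALUES (?, ?)", [(1, "Owner A"), (2, "Owner B")])
conn.executemany("INSERT INTO property_information VALUES (?, ?, ?)", [
    (1, "mill", 2.0), (2, "meadow", 1.0), (3, "woodland", 1.5),
])
conn.executemany("INSERT INTO manor VALUES (?, ?, ?, ?, ?, ?, ?, ?)", [
    (1, 1, 1, 1, "1086", 40.0, 5.0, 12),
    (2, 1, 2, 1, "1086", 10.0, 2.0, 4),
    (3, 2, 3, 2, "1086", 25.0, 3.0, 7),
])

# Total manor value and villein population per tenant in chief.
value_by_tenant = conn.execute("""
    SELECT h.tenant_in_chief, SUM(m.manor_value), SUM(m.villein_population)
    FROM manor m
    JOIN holding_details h ON h.holding_id = m.holding_id
    GROUP BY h.tenant_in_chief
    ORDER BY h.tenant_in_chief
""").fetchall()
```

The numeric columns of MANOR (value, hides, population counts) play the role of facts, which is why this source fits the dimensional model comparatively well.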

Slide 32

Examples of possible History DWs

Data from a variety of sources over time – hearth tax, poor rates, trade directories, census, street directories, wills and inventories, GIS maps for a city e.g. Winchester.

Voting data – poll book data and rate book data up to 1870 for whole country (note some data missing).

Port data – all data from portbooks for all British ports together with yearly trade figures.

Street directories for whole country for last 100 years.

Taxation overview – different types / areas / periods.

Slide 33

Examples of possible History DWs

19th C British census data doesn’t fit into the typical DW model as it doesn’t have the numerical facts to go into a fact table.

However, there’s a recent development in DWs – ‘factless’ fact tables.

There is real scope to model historical data using these.
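A minimal sketch of a factless fact table for census-style data, using sqlite3. The fact table holds only keys, one row per person enumerated, and analysis is done by counting rows. Table names, columns, people and parishes are all hypothetical; only the census years are taken from the earlier slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE census_dim (census_key INTEGER PRIMARY KEY, census_year INTEGER);
    CREATE TABLE person_dim (person_key INTEGER PRIMARY KEY, name TEXT, occupation TEXT);
    CREATE TABLE place_dim (place_key INTEGER PRIMARY KEY, parish TEXT);
    -- Factless fact table: foreign keys only, no numeric measures.
    CREATE TABLE enumeration_fact (
        census_key INTEGER REFERENCES census_dim(census_key),
        person_key INTEGER REFERENCES person_dim(person_key),
        place_key INTEGER REFERENCES place_dim(place_key)
    );
""")

conn.executemany("INSERT INTO census_dim VALUES (?, ?)", [(1, 1841), (2, 1861)])
conn.executemany("INSERT INTO person_dim VALUES (?, ?, ?)", [
    (1, "Person A", "weaver"), (2, "Person B", "baker"), (3, "Person C", "servant"),
])
conn.executemany("INSERT INTO place_dim VALUES (?, ?)", [(1, "Parish A"), (2, "Parish B")])

# Each row records the event 'this person was enumerated here in this census'.
conn.executemany("INSERT INTO enumeration_fact VALUES (?, ?, ?)", [
    (1, 1, 1), (1, 2, 1),             # 1841
    (2, 1, 2), (2, 2, 1), (2, 3, 1),  # 1861
])

# With no measures to sum, the analysis counts rows instead.
population = conn.execute("""
    SELECT c.census_year, pl.parish, COUNT(*)
    FROM enumeration_fact f
    JOIN census_dim c ON c.census_key = f.census_key
    JOIN place_dim pl ON pl.place_key = f.place_key
    GROUP BY c.census_year, pl.parish
    ORDER BY c.census_year, pl.parish
""").fetchall()
```

In this sketch Person A appears in Parish A in 1841 and in Parish B in 1861, so the same table could also be queried for movement between censuses.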

Slide 34

Examples of possible History DWs

Kimball p247

Slide 35

Examples of possible Humanities DWs

Language DW – could contain databases of different languages for comparison, or many databases of the same languages over a larger area.

DW of worldwide scholarly community / whole culture

GIS or archaeological DW by continent etc. rather than country.

DW of biographies.

DW of library catalogues or archives for enhanced public access.

Slide 36

Ideas for the future?

Instead of ‘me and my database’ - emphasis on smallish, individual, national projects,

Maybe

‘Our integrated warehouse’ – emphasis on large scale, collaborative, international projects?