

Using metadata analytics to understand legacy digital collections prior to disposal

David Canning, DRO - Cabinet Office

The National Archives - 4 April 2019

Our strategy is to design and build a digital archiving capability that will operate over three phases:

● Acquisition
● Management
● Review & Transfer

Using metadata analytics to understand legacy digital collections prior to disposal. David Canning, DRO - Cabinet Office. The National Archives - 4 April 2019


The Acquisition Phase collects, catalogues and carries out first review at seven years

Process: Cabinet Office ‘Story’
Input: A summarised version of the records discovery process, noting key events in the Department’s history during the period.

Process: Records discovery at business unit level
Input: The central team works with business areas to identify the main themes of the record (e.g. projects, policy, legislation) over a Parliament and/or Prime Ministerial term of office.
Outcome: Information related to the work theme is located in the file plan.

Process: Annual ‘Spring Clean’
Input: On an annual basis, every May, business units surrender one (financial) year’s worth of information more than 7 years after its actual or assumed creation date.
Outcome: Information can be collected and isolated for analysis.

Process: First Review and Disposal
Input: Remove ROT (redundant, obsolete or trivial information) according to policy rules identified via analytics.
Outcome: Ephemera is destroyed and the remaining information is reconstructed, where possible, into a chronological record.

Operational Selection Policy

Process: First Review and Disposal
Input: Remove ROT according to policy rules identified via analytics.
Outcome: Ephemera is destroyed and the remaining information is reconstructed, where possible, into a chronological record.

We’ll be looking in detail today at how we are approaching this part of the Acquisition Phase.

The Management Phase curates our corporate memory and makes it accessible for use

Process: Structured digital archives
Input: Where possible, archivists reconstruct the record around the work themes identified through records discovery into chronological order, grouped into Parliaments and subdivided by financial year. This includes various media and formats in original and converted form.
Outcome: We are confident that the record exists in an organised system.

Process: Assisted Search
Input: The business may ask the Archivists to carry out research or searches of the archived material in order to exploit the department’s corporate memory.
Outcome: The department is able to exploit its corporate memory.

Process: Digital Catalogue
Input: The list of what is available in the Archive is made available to the business via an online catalogue. This is the department’s first-line knowledge resource.
Outcome: The contents of the archive are clear and accessible to our people.

Phase Three is selection, review and transfer

Process: Appraisal and Selection
Input: As now, the Archivists work with TNA to identify records of particular historic interest, informed by the OSP and the CO Story.
Outcome: Records in scope for transfer are identified and agreed.

Process: Sensitivity Review
Input: As now, reviewers will apply explanatory memoranda and judgement. E-discovery or similar may be used along with search strings to identify sensitive areas and reduce the volume requiring human eyes.
Outcome: We avoid opening material that should not be published.

Process: Disposal
Input: Information may be of historic value (to TNA or withheld) or of continued knowledge value to Cabinet Office. The remainder, considered to be of little or no value, is destroyed (weeded).
Outcome: We manage the heap, keeping only what is useful or needs to be retained for security purposes.

Process: Transfer
Input: Digital ‘transfer’ to TNA with verification of integrity.
Outcome: We fulfil our obligations to the Public Records Act, supporting transparency.

Our objective was to design and test a process for first review and disposal of digital legacy content in volume

● 11 million data objects*, 3.51 TB
● 706 top-level folders/drives
● We selected a file share containing the oldest digital information in the department.
● The earliest created and last modified dates were 1 January 1970!

*Some of this (circa 0.3 million) is live data still being used by the business.

Overview and programmable search interface

File paths are provided in a list for visual inspection

The first step was to accurately age the data and understand the volume of file formats for each year

Year    Created (volume)    Modified (volume)
1970    42                  414
1980    826                 3,940
1996    407                 16,093
1997    14,334              33,818
1998    22,370              59,700
2008    786,583             820,314
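A table like the one above can be produced by counting files per year for each of the two date fields. The following is a minimal sketch only: the record layout, the function name `year_volumes` and the sample rows are assumptions for illustration, not the tool or data used in the project.

```python
from collections import Counter
from datetime import datetime


def year_volumes(records):
    """Count files per year of creation and per year of last modification.

    `records` yields (path, created, modified) tuples of str, datetime,
    datetime. This layout is an assumption made for the sketch.
    """
    created_by_year, modified_by_year = Counter(), Counter()
    for _path, created, modified in records:
        created_by_year[created.year] += 1
        modified_by_year[modified.year] += 1
    return created_by_year, modified_by_year


# Hypothetical rows for illustration only; not taken from the file share.
sample = [
    ("teamA/minutes.doc", datetime(1970, 1, 1), datetime(1970, 1, 1)),
    ("teamA/budget.xls", datetime(1998, 3, 2), datetime(1998, 6, 9)),
    ("teamB/brief.doc", datetime(1998, 11, 5), datetime(2008, 1, 15)),
]
created_by_year, modified_by_year = year_volumes(sample)
for year in sorted(set(created_by_year) | set(modified_by_year)):
    # A Counter returns 0 for years that appear in only one of the two fields.
    print(year, created_by_year[year], modified_by_year[year])
```

Counting the two fields separately keeps the created and modified columns independent, which is what makes later disagreement between them visible.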

● 01/01/1970 is the Unix default date
● 01/01/1980 is the MS-DOS default date
● The metadata in files subject to corruption is often absent, but two docs were created in 2038!
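The default dates noted above can be flagged automatically before any counting is done. This is a hedged sketch: the labels and the function name `date_status` are hypothetical, and treating any date after "today" as suspect is an assumption rather than a rule stated in the talk.

```python
from datetime import date

UNIX_DEFAULT = date(1970, 1, 1)   # Unix default date, as noted above
MSDOS_DEFAULT = date(1980, 1, 1)  # MS-DOS default date, as noted above


def date_status(stamp, today):
    """Label a created or modified date as a default, future or plausible value."""
    if stamp == UNIX_DEFAULT:
        return "unix-default"
    if stamp == MSDOS_DEFAULT:
        return "msdos-default"
    if stamp > today:
        # Assumption: a date later than the analysis date cannot be genuine.
        return "future"
    return "plausible"


analysis_date = date(2019, 4, 4)
for stamp in (date(1970, 1, 1), date(1980, 1, 1), date(2038, 1, 19), date(2003, 5, 6)):
    print(stamp.isoformat(), date_status(stamp, analysis_date))
```

Files labelled with a default or future date are the ones that need another source of evidence, such as the file path, to be aged.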

Visual inspection of document titles suggests creation dates ranging from 2002 to 2004. Files ‘modified’ in 1970 include:

● 132 .doc
● 117 .msg
● 103 .jpg
● 22 .xls
● 14 .ppt
● 11 .pdf
● various unreadable/exotic formats

● 3,863 files have a date of 01/01/1980
● 258 .doc files (two have the year 2013 in their title)

First significant volume of .doc and .wpd files:

● 2,505 .doc
● 1,279 .wpd
● 1,149 .htm
● 502 .xls

● 10,805 .doc
● 1,827 .wpd

● 21,459 .doc
● zero .wpd
● Between 1995 and 1999, a near doubling of volume year on year

● Print to paper policy ends and first EDRM introduced
● By 2008 the volume of new documents being created was thirty-five times the volume in 1998
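The thirty-five-fold figure follows directly from the created volumes for 1998 and 2008 in the table. A short check, using only those two numbers (the variable names are illustrative):

```python
# Created volumes taken from the table for 1998 and 2008.
created_1998 = 22_370
created_2008 = 786_583

growth = created_2008 / created_1998
print(f"2008 volume is {growth:.1f} times the 1998 volume")
```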

The metadata corruption reveals a history of poor data management by IT suppliers

Chart: Document growth and metadata corruption in legacy fileshare, 1996 to 2016

Created and last modified dates normally fall in the same calendar year and appear to follow the same trend. Accurate aging was achieved by using a matrix of the two, combined with an occasional visual sense check of file paths.
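One way to picture the matrix of created and last modified dates is as a cross-tabulation of the two years, with a simple rule for choosing a working year per file. The rule below is a hypothetical illustration, not the method used in the project: it ignores the default years 1970 and 1980 and any year after the analysis year, then takes the earliest remaining year. Files with no usable year are left for a visual check of the file path.

```python
from collections import Counter

DEFAULT_YEARS = {1970, 1980}  # years produced by the Unix and MS-DOS default dates


def year_matrix(pairs):
    """Cross-tabulate (created_year, modified_year) pairs into cell counts."""
    return Counter(pairs)


def working_year(created_year, modified_year, latest_plausible=2019):
    """Pick a working year from the two date fields (hypothetical rule).

    Returns None when neither field is usable, meaning the file path
    should be inspected by eye.
    """
    usable = [
        year
        for year in (created_year, modified_year)
        if year not in DEFAULT_YEARS and year <= latest_plausible
    ]
    return min(usable) if usable else None


# Hypothetical (created_year, modified_year) pairs for illustration only.
pairs = [(1998, 1998), (1998, 1998), (1970, 2003), (1970, 1980), (2038, 2004)]
matrix = year_matrix(pairs)
print(matrix[(1998, 1998)])               # cell on the diagonal: dates agree
print([working_year(c, m) for c, m in pairs])
```

Cells on the diagonal of the matrix are the files whose two dates agree; off-diagonal cells, especially those involving a default year, are where the sense check is most useful.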

Chart annotation: New IT platforms deployed

Information was identified for deletion through a multi-layered filtering process

Diagram labels: Retain in Corporate Memory (Archive); File format analysis; Data classification analysis; Human analysis; Operational Selection Policy; ROT

File format analysis identified a large volume of exotic and obsolete file formats that are unreadable remnants of old software. These make up:

● 87% of the pre-1996 data;
● 36% of the 1996-2008 data; and
● 49% of the 2008-2011 data.
● The average is 36%.

Further analysis excluded:

● Image files, adding a further 25% to the ROT pile.
● HTM and HTML files, adding a further 10%.
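Percentages of this kind can be derived by grouping files by extension and taking each group's share of the total. The sketch below is illustrative only: the extension sets, group names and sample paths are assumptions, and the project's actual definition of an exotic or obsolete format is not given in the slides.

```python
from collections import Counter
from pathlib import PureWindowsPath

# Assumed groupings for illustration; the real lists are not given in the talk.
READABLE = {".doc", ".msg", ".pdf", ".xls", ".ppt"}
IMAGES = {".jpg", ".gif", ".bmp"}
WEB = {".htm", ".html"}


def format_group(path):
    """Assign a file to a format group based on its extension."""
    ext = PureWindowsPath(path).suffix.lower()
    if ext in READABLE:
        return "readable"
    if ext in IMAGES:
        return "image"
    if ext in WEB:
        return "web"
    return "exotic/obsolete"  # everything else, including files with no extension


def group_shares(paths):
    """Return each format group's share of the files as a fraction of 1."""
    counts = Counter(format_group(p) for p in paths)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()} if total else {}


# Hypothetical paths for illustration only.
paths = ["S:/a/report.DOC", "S:/a/photo.jpg", "S:/a/index.htm", "S:/a/data.xyz"]
shares = group_shares(paths)
for group, share in sorted(shares.items()):
    print(f"{group}: {share:.0%}")
```

Running the same grouping separately for each date band (for example pre-1996, 1996-2008 and 2008-2011) would give per-period shares comparable to those listed above.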

● We were left with:
  ○ 2 million .doc,
  ○ 600,000 .msg, and
  ○ 500,000 .pdf files.
● There were also:
  ○ 422,000 .xls, and
  ○ 138,000 .ppt files.
● .msg and .pdf were retained.
● We used keyword search to identify:
  ○ .doc files for destruction
  ○ .xls and .ppt files for retention
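The layered filtering described in this part of the talk can be sketched as a chain of checks, where each layer either marks a file as ROT or passes it on, and whatever survives every automated layer goes to human analysis. The layer rules, names and sample paths below are assumptions for illustration; the slides do not specify the actual rules.

```python
from pathlib import PureWindowsPath

ROT, REVIEW = "rot", "human review"


def file_format_layer(path):
    """File format analysis: assumed rule that unlisted formats are ROT."""
    readable = {".doc", ".msg", ".pdf", ".xls", ".ppt"}
    return None if PureWindowsPath(path).suffix.lower() in readable else ROT


def classification_layer(path):
    """Data classification analysis: assumed rule based on one keyword."""
    return ROT if "catering" in path.lower() else None


def filter_layers(paths, layers):
    """Pass each path through the layers in order; the first verdict wins.

    Paths that no layer marks as ROT are handed to human analysis.
    """
    outcome = {}
    for path in paths:
        verdict = None
        for layer in layers:
            verdict = layer(path)
            if verdict is not None:
                break
        outcome[path] = verdict if verdict is not None else REVIEW
    return outcome


# Hypothetical paths for illustration only.
paths = ["S:/fin/catering order.xls", "S:/old/setup.xyz", "S:/policy/briefing.doc"]
result = filter_layers(paths, [file_format_layer, classification_layer])
for path, verdict in result.items():
    print(f"{verdict:12} {path}")
```

Keeping each layer as a separate function makes it possible to reorder the layers or tighten a single rule without touching the rest of the chain.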

‘Catering’ produced a 100% confidence rating for ROT

Mostly .xls but also order forms

‘Drinks’ was not so clear cut

Some of this relates to alcohol policy

‘Submission’ produced 25,000 hits, 670 of which were spreadsheets

Most spreadsheets are simply financial but some are an annex to a ministerial submission
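The keyword examples above can be approximated with a case-insensitive search over file paths, followed by a breakdown of the hits by file type. This is a sketch under stated assumptions: the slides do not say how the confidence rating was calculated, so `rot_proportion` simply takes the share of a manually reviewed sample judged to be ROT, and all names and sample paths are hypothetical.

```python
from collections import Counter
from pathlib import PureWindowsPath


def keyword_hits(paths, keyword):
    """Return the paths that contain the keyword, ignoring case."""
    needle = keyword.lower()
    return [p for p in paths if needle in p.lower()]


def hits_by_extension(hits):
    """Count keyword hits per file extension."""
    return Counter(PureWindowsPath(p).suffix.lower() for p in hits)


def rot_proportion(reviewed):
    """Share of a reviewed sample judged ROT; `reviewed` maps path -> bool.

    Assumption: stands in for the confidence rating, whose real
    calculation is not described in the slides.
    """
    return sum(reviewed.values()) / len(reviewed) if reviewed else 0.0


# Hypothetical paths for illustration only.
paths = [
    "S:/private office/Submission to Minister.doc",
    "S:/private office/submission annex A.xls",
    "S:/finance/monthly totals.xls",
]
hits = keyword_hits(paths, "submission")
by_type = hits_by_extension(hits)
print(len(hits), dict(by_type))
print(rot_proportion({hits[0]: False, hits[1]: False}))
```

A breakdown by extension is what separates the spreadsheets that are annexes to a submission from those that are simply financial.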

Conclusions & Next Steps


● Enforcement of dates in the content and the file path (i.e. document title) is critical for the accurate aging of information.
● Metadata corruption is a problem, the downstream effects of which (risks and costs) need to be better understood.
● Data classification helps to identify smaller groups of files to zoom in on, but further work is required to develop a reliable (standard) set of classifications to facilitate automation.
● More research is required on the use of data classification:
  ○ Our work only analysed metadata; applying classifications to document content may produce different results.
  ○ Results are often not clear cut and provide only a likely proportion of ROT. The desire to review documents before destruction will depend on an individual department’s risk appetite and knowledge of its content.
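Where a date does appear in the file path or document title, it can be picked up with a simple pattern and compared against the metadata dates. The following is a minimal sketch: the pattern, the 1980 to 2039 year range and the function name `years_in_path` are assumptions for illustration, and compact dates written without separators (for example eight digits run together) are deliberately not matched.

```python
import re

# Four-digit years from 1980 to 2039 that are not part of a longer digit run.
YEAR = re.compile(r"(?<!\d)(19[89]\d|20[0-3]\d)(?!\d)")


def years_in_path(path):
    """Return the distinct years mentioned in a file path, in ascending order."""
    return sorted({int(year) for year in YEAR.findall(path)})


# Hypothetical paths for illustration only.
print(years_in_path("Policy/2003/Briefing 2004-02-11.doc"))
print(years_in_path("minutes_2013_final.doc"))
print(years_in_path("notes.doc"))
```

A file whose metadata shows a default date but whose path mentions a year can then be aged from the path, which is the situation described for the files dated 01/01/1980 with 2013 in their title.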

Questions & Discussion
