data clean-up: is there a better way?
DESCRIPTION
Moving data about library resources among systems often engenders data cleanup processes. What is the best way to clean up data? Which tools and skills for non-programmers can help? See how University of California, Riverside Libraries tackle this issue, then share tips and techniques in an open forum.TRANSCRIPT
![Page 1: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/1.jpg)
Data Clean-Up:
Is there a Better Way?
Margaret Hogarth
ER&L, 2/2/2010© Camille Pissaro, 1882, “Peasant Woman Digging,” ARTstor 40-06-12/42 1
![Page 2: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/2.jpg)
To Be Covered:
• Commonalities• Issues• Deficits• Excel• Access• MarcEdit• Global Update• Data Quality Policy• Suggestions?
© The Metropolitan Museum of Art, “Book of the Knowledge of Ingenious Mechanical Devices by al-Jazari,” ARTstor
2
![Page 3: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/3.jpg)
Commonalities
• Data sources• Tools• Data clean-up• Capture and use
issues• Needed technical
skills
bonobokids.com
3
![Page 4: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/4.jpg)
General Approach
• Keep original data original; copy to a new worksheet
• Save frequently• Use meaningful
file/worksheet names• Folder of “final”
documents© John Vink, 2005, "CAMBODIA," ARTstor PAR287761
4
![Page 5: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/5.jpg)
Issues
• Import issues• System limitations• Dirty data• Non-standardized
data
• Standardized but variable data
• Application-related issues• Others?
© Miguel Rio Branco, 1985, "BRAZIL," ARTstor PAR52903
5
![Page 6: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/6.jpg)
Deficits
• Time• Staff• Budget• Skills• Development• Confidence• Big-picture view• Systems
© John Murphy, “Men Digging," ARTstor DMCC.1993.4
6
![Page 7: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/7.jpg)
System Limitations
• Field character limits• Report field limits
7
![Page 8: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/8.jpg)
Import => Excel, Zeros
Problem:
• Loss of trailing zero• Loss of leading zeros
Solution:• Import > Delimited >
Tab > Text format
Example:• 1944826 =>1944-8260• 14826 => 0001-4826
8
![Page 9: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/9.jpg)
Add Hyphens to ISSNs - 1
• Sort by ISSN to make sure leading zeros are intact.
• In a new column type the formula =MID(A1,1,4)&”-”&MID(A1,5,4)
• Syntax: =MID(text,start_num,num_chars)
9
![Page 10: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/10.jpg)
Add Hyphens to ISSNs - 2
• Or use Cell Formatting
• Select range• Format Cells
(CTRL+1) > Number tab > Custom > 0000-0000 > OK
10
![Page 11: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/11.jpg)
Import => Excel, Numbers
Problem:• ID # garbled
Solution:• Choose Number format• Remove decimal places
11
![Page 12: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/12.jpg)
Import => Excel, Commas
Problem:
Solution:
Comma
12
![Page 13: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/13.jpg)
Restore Dropped Leading Zeros
• Change to TEXT format
• Sort by ISSN• Use
=CONCATENATE(“000”,A1) to add zeros
13
![Page 14: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/14.jpg)
Excel: Remove Quotes
• Select column• Find & Select >
Replace > Find what: [space]” > Replace with: [leave blank]
14
![Page 15: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/15.jpg)
Remove Non-Printable Characters• Use TRIM (removes
ASCII value 32 = space character, except single spaces between words)
• Use CLEAN (ASCII codes 0-31, Unicode 127, 129, 141, 143, 144, 157
• Use SUBSTITUTE for higher codes
Book of Hours, c. 1440 “Use of Tornai,” © ARTstor, Rawl.liturg.e.14_roll314.1_frame3
15
![Page 16: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/16.jpg)
ASCII Characters
http://en.wikipedia.org/wiki/ASCII 12/22/2009
16
![Page 17: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/17.jpg)
Access: Subscript Out of Range - 1
Try These Steps:• Check for spaces in
column headings• Use TRIM (removes
ASCII value 32 = space character)
• Use CLEAN for ASCII code 0-31
• Delete empty right columns
• Remove empty “used” cells:
1.Find end of "used cells": CTRL+SHIFT+END
2.Select all empty “used” cells > Edit > Clear > All or Edit > Delete. Save the file.
17
![Page 18: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/18.jpg)
Access: Subscript Out of Range - 2
• Copy and paste cells into a new workbook. Save. Import into Access.
• Or, save file as CSV, import into Access. = Will see data error.
Unknown, circa 1885-1900, “Sorting Mail,” © ARTstor MC212-D-99
18
![Page 19: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/19.jpg)
Access: Type Conversion Failure
• Make sure data types in fields match data types in columns.
• Data like ISBNs are text but can be “read” like numbers.
• Add top row with correct data/type: XXX for ISBN
19
![Page 20: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/20.jpg)
Access: Remove Quotes
• Search for records
with “”: Criteria: LIKE "*" & Chr(34) & "*"
• Replace([SomeField],Chr(34),"") will replace a quotation mark (") with a zero-length string
© Erich Lessing, Bayeaux Tapestry, c. 1070-80, ARTstor 31-01-01/23
20
![Page 21: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/21.jpg)
Access: ISSN Issues
• Find too-short ISSNs: Len([FieldName])<n [9 is good here]
• Find ISSNs without hyphens: SELECT table.field, table.field FROM table WHERE (((table.field) Not Like "*-*));
© Eve Arnold, 1979, “Hsishuang Panna Weeding,” ARTstor
21
![Page 22: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/22.jpg)
Access & ARL Stats - 1
• =Sum([YTD Total]) Sum of article downloads in COUNTER Journal 1 report.
• =[Jan-09]+[Feb-09]+[Mar-09]+[Apr-09]+[May-09]+[Jun-09] Sum of Jan-Jun 2009 COUNTER J1.
• =[Jul-09]+[Aug-09]+[Sep-09]+[Oct-09]+[Nov-09]+[Dec-09] Sum of Jul-Dec 2009 COUNTER J1.
22
![Page 23: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/23.jpg)
Access & ARL Stats - 2
• RowCount:Count(*) Number of titles in a set.
• =[YTD Total]*[Cost] Annual cost-per-use.
• Access Expressions: http://office.microsoft.com/en-us/access/HA011814491033.aspx
23
![Page 24: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/24.jpg)
Access or Excel?
Access:• Relational• Large amount of data• Primary key• Many people working• Long text strings
Excel:• Non-relational• Mostly numeric• Calculations/Statistics
Nelson, Emma. 2010. Using Access or Excel to Manage Your Data. http://office.microsoft.com/en-us/help/HA010429181033.aspxSee also:Microsoft. 2010. Examples of Expressions. http://office.microsoft.com/en-us/access/HA011814491033.aspx
24
![Page 25: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/25.jpg)
XML - Excel
• Excel can interpret XML
• Data > Get External Data > From XML Data Import
• Format without affecting source data
25
Later Excel: Activate Developer tab through Office logo (upper left)
![Page 26: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/26.jpg)
MarcEdit: XML - 1
• Convert large XML files to Excel
• Specify input, output files
• Choose MARC21XML => MARC
26
![Page 27: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/27.jpg)
MarcEdit: XML - 2• Choose display fields,
input, output files• View, format in Excel
27
![Page 28: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/28.jpg)
MarcEdit: MARC - 1
• Convert large files to local practices
DELETE existing
999 field
910 field(s)…
Copy 035 to 001
Remove (Sc-P) Prefix from 001
ADD
910 |aDEL SCP ; jc ; 2009/7/8
Field:910
Field data: \\$aDEL SCP ; jc ; 2009/7/8
998 |an
Field: 998
Field data: \\$an
28
![Page 29: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/29.jpg)
MarcEdit: MARC - 2
29
![Page 30: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/30.jpg)
MarcEdit Information
• By Terry Reese
• http://oregonstate.edu/~reeset/marcedit/html/
• MARCEDIT-L listserv at https://listserv.gmu.edu/cgi-bin/wa?SUBED1=marcedit-l&A=1
• Regular updates
• Tutorials, templates, scripts
30
![Page 31: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/31.jpg)
ILS: “Global Update”• For records within ILS system
• For universal changes
• “Check website for coverage.”
31
![Page 32: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/32.jpg)
A Better Way? Macros
• microsoft.public.excel (General Excel group) http://groups.google.com/groups/dir?sel=33606583&hl=en
• OzGrid Forum (Excel tips and VBA macros) http://www.ozgrid.com/forum/
• http://www.lib-stats.org.uk/ (statistics listserv)
[Courtesy Tansy Matthews]
32
![Page 33: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/33.jpg)
Data Quality
Strategies to Improve Data Quality:
1. Identify problems
2. Treat data as an asset
3. Implement quality systems
Principle Activities for Data:
• Acquire• Store• Use
33
![Page 34: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/34.jpg)
Poor Quality Data Indicators
• Uncorrected errors• Redundant data/
processes• Lack of data for
strategizing• Frustration with data,
data supplier, IT (c) 2006, SCALA, “Shoes,” 6th century BCE, ARTstor
34
![Page 35: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/35.jpg)
Treat Data as an Asset
• Inventory data assets• Data = dynamic;
process = asset• Align responsibilities:
acquire, store, use data.
• Establish customer-supplier relationships for data. © The Metropolitan Museum of
Art, “Tournament,” late 16th century, ARTstor
35
![Page 36: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/36.jpg)
Apply Quality Principles
• Create and keep a customer
• Detect and correct errors
• Determine root cause of defects
• Manage the process
• Communicate results• Audit supplier
performance
36
![Page 37: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/37.jpg)
Library Data Quality Policy - 1
Suppliers/Creators:• Understand users,
uses, & requirements• Ensure requirements
are met• Manage data creation
process
Data Processors:• Avoid duplication• Safeguard data• Make data accessible• Promote data quality
in IT
37
![Page 38: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/38.jpg)
Library Data Quality Policy - 2
Users:• Define requirements,
work with suppliers• Provide feedback• Interpret data
correctly• Use data legitimately• Protect privacy
Logistics:• Determine master
systems• Understand system
limitations• Accessible storage• Match inputs with
needs• Identify key keepers
38
![Page 39: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/39.jpg)
Digging for Coal, The Illustrated Bartsch, vol 85, 1486, ARTstor 8586.1486/154
Other Techniques?
39
![Page 40: Data Clean-up: Is There A Better Way?](https://reader036.vdocument.in/reader036/viewer/2022081602/54ba35d94a7959041e8b458a/html5/thumbnails/40.jpg)
Bibliography• Microsoft, 2009. Top Ten Ways to Clean Your Data.
http://office.microsoft.com/en-us/excel/HA102218401033.aspx, accessed 12/18/2009.
• Use error checking to convert numbers that are stored as text to numbers. http://office.microsoft.com/en-us/excel/HP012167611033.aspx, accessed 12/22/2009.
• Apply a number format to numbers that are stored as text http://office.microsoft.com/en-us/excel/HP012167611033.aspx
• Redman, Thomas C. 1995. Improve Data Quality for Competitive Advantage. Sloan Management Review, 36:2, 99-107.
• Rothschiller, Chad. 2007. Manipulating and Massaging Data in Excel. http://blogs.msdn.com/excel/archive/2007/11/12/manipulating-and-massaging-data-in-excel.aspx 12/18/2009.
• Spencer, John. March 6, 2008. Find/ /replace characters like quotes http://www.eggheadcafe.com/software/aspnet/31782118/findreplace-characters-l.aspx 12/22/2009
40