Data quality challenges in the Canadensys network of
occurrence records: examples, tools, and solutions
Christian Gendreau, David Shorthouse & Peter Desmet
Game plan
• Introduction to Canadensys
• Data quality @ Canadensys
• Canadensys processing solutions
• Numbers from Canadensys
• Hopes and expectations
A Network
Of people and collections
Canadensys Headquarters Université de Montréal Biodiversity Centre
data.canadensys.net/vascan
data.canadensys.net/ipt
data.canadensys.net/explorer
Data quality related activities
From an aggregator perspective
During data entry
• Help to avoid typographical errors
• Help to convert verbatim data
Actor: data entry person
Before publication
Actor: data publisher
• Detect file character encoding issues
• Detect duplicate or missing IDs
Previous activity: Data entry
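The two pre-publication checks can be sketched in a few lines. This is a minimal illustration, not a Canadensys tool: the function names, the `id` column name, and the sample data are all assumptions made for the example, and the encoding check only tests whether the bytes decode as UTF-8.

```python
import csv
import io
from collections import Counter


def check_ids(rows, id_field="id"):
    """Return (row numbers with a missing ID, sorted list of duplicated IDs).

    `rows` is any iterable of dicts; `id_field` is a hypothetical column name.
    """
    missing = []
    counts = Counter()
    for row_number, row in enumerate(rows, start=1):
        value = (row.get(id_field) or "").strip()
        if not value:
            missing.append(row_number)
        else:
            counts[value] += 1
    duplicates = sorted(v for v, n in counts.items() if n > 1)
    return missing, duplicates


def decodes_as_utf8(raw_bytes):
    """Return True if the bytes are valid UTF-8, False otherwise."""
    try:
        raw_bytes.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Made-up sample file: row 3 has no ID and "A1" appears twice.
sample = "id,country\nA1,Canada\nA2,USA\n,Canada\nA1,Mexico\n"
rows = list(csv.DictReader(io.StringIO(sample)))
missing, duplicates = check_ids(rows)
print(missing, duplicates)
```

Running both checks before publication keeps these problems with the data publisher, who is best placed to fix them at the source.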
During aggregation
• Process data: validation, cleaning
• Produce structured reports: quality control
Actor: data aggregator
Previous activity: Before publication
After aggregation
• Allow and facilitate community feedback
• Help data publishers to integrate corrections
Actor: users and community
Previous activity: Aggregation
Canadensys tools
during data entry
data.canadensys.net/tools
Why do we process data?
• Enrich our Explorer, http://data.canadensys.net
• Provide structured reports to data providers
• Help identify records that need re-examination
• Help to improve data entry procedures
Data processing
Processing solutions
Narwhals to the rescue
Narwhal image Public Domain
The narwhal-processor approach
● Single field processing to allow complex processing (combined fields)
● Processors with common interface ease integration and usage
● Collaboration
https://github.com/Canadensys/narwhal-processor
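The idea of single-field processors behind a common interface can be sketched as follows. This is an illustrative Python sketch only and does not reproduce the narwhal-processor API: `Result`, `FieldProcessor`, `LookupProcessor`, `process_record`, the field names, and the tiny lookup tables are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Protocol


@dataclass
class Result:
    """Outcome of processing one field: a cleaned value plus any errors."""
    value: Optional[str]
    errors: List[str] = field(default_factory=list)


class FieldProcessor(Protocol):
    """The common interface: every processor handles one raw field value."""

    def process(self, raw: Optional[str]) -> Result: ...


class LookupProcessor:
    """Map free text to a code through a lookup table (toy data only)."""

    def __init__(self, label: str, table: Dict[str, str]):
        self.label = label
        self.table = table

    def process(self, raw: Optional[str]) -> Result:
        key = (raw or "").strip().lower()
        if not key:
            return Result(None, [f"{self.label}: empty value"])
        code = self.table.get(key)
        if code is None:
            return Result(None, [f"{self.label}: unrecognized value {raw!r}"])
        return Result(code)


def process_record(record: Dict[str, str],
                   processors: Dict[str, FieldProcessor]) -> Dict[str, Result]:
    """Combine single-field processors: apply each one to its own field."""
    return {name: p.process(record.get(name)) for name, p in processors.items()}


processors = {
    "country": LookupProcessor(
        "country", {"usa": "US", "united states": "US", "canada": "CA"}),
    "stateProvince": LookupProcessor(
        "stateProvince", {"qué": "CA-QC", "quebec": "CA-QC"}),
}
results = process_record({"country": " USA ", "stateProvince": "Qué"}, processors)
print(results["country"].value, results["stateProvince"].value)
```

Because every processor returns the same `Result` shape, the collected errors can feed a structured report without any per-field special cases.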
Data usability
before processing
[Bar chart; y-axis: % of non-null clean verbatim data]
• country text: 92%
• state/province text: 60%
• coordinates: 96%
• dates: 44%
Data usability
after processing
• 7% of provided country text, e.g. USA → ISO 3166-2:US, United States
• 16% of provided state/province text, e.g. Qué → ISO 3166-2 CA-QC, Quebec
• 4% of provided coordinates, e.g. 45° 32' 25" N, 129° 40' 31" W → 45.5402778, -129.6752778
• 42% of provided dates, e.g. 2008 VI 13 → 2008-06-13
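The coordinate and date examples on this slide can be reproduced with a short sketch. It is an assumption-laden illustration, not the narwhal-processor implementation: the function names are hypothetical, and the parsers accept only the two formats shown on the slide (degrees/minutes/seconds with a hemisphere letter, and "year, Roman-numeral month, day").

```python
import re
from datetime import date

# Matches one degrees/minutes/seconds value such as: 45° 32' 25" N
_DMS = re.compile(
    r"""(\d+)\s*°\s*(\d+)\s*'\s*(\d+(?:\.\d+)?)\s*"\s*([NSEW])""",
    re.IGNORECASE,
)

_ROMAN_MONTHS = {
    "I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
    "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12,
}


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to decimal degrees (S and W are negative)."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere.upper() in ("S", "W") else value


def parse_dms_pair(text):
    """Parse a 'latitude, longitude' string in DMS notation into decimal degrees."""
    parts = _DMS.findall(text)
    if len(parts) != 2:
        raise ValueError(f"expected two DMS values, got {len(parts)}: {text!r}")
    lat, lon = (dms_to_decimal(int(d), int(m), float(s), h) for d, m, s, h in parts)
    return round(lat, 7), round(lon, 7)


def parse_roman_month_date(text):
    """Parse 'YYYY <Roman month> DD' into an ISO 8601 date string."""
    year, month, day = text.split()
    # date() raises ValueError for impossible dates such as day 31 in month VI.
    return date(int(year), _ROMAN_MONTHS[month.upper()], int(day)).isoformat()


lat, lon = parse_dms_pair("45° 32' 25\" N, 129° 40' 31\" W")
print(lat, lon)
print(parse_roman_month_date("2008 VI 13"))
```

Values that do not match either format raise an exception here; in an aggregation setting those records would more likely be logged in the structured report than rejected outright.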
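Data usability
including processed data
[Bar chart; y-axis: % of non-null provided]
• country text: 92% before processing + 7% after processing
• state/province text: 60% before processing + 16% after processing
• coordinates: 96% before processing + 4% after processing
• dates: 44% before processing + 42% after processing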
Projects With Data Quality Tools
• Atlas of Living Australia
• GBIF Norway, GBIF Spain, National Biodiversity Network, BioVeL …
• GBIF libraries
• Most nodes have their own data quality routine
Hopes and expectations
We do not want to
• Maintain taxonomic authority files
• Maintain country, province and city lists
We prefer to
• Efficiently use specialized resources/services
• Provide reports, quality indices
Help from Semantic Web
• Data in other languages (French, Spanish, …) should not be flagged as errors
• Misspellings should be shared as a common resource (e.g. SKOS)
• Understand historical data (e.g. collected in USSR in 1980)
Reporting and log
• DarwinCore annotations for processed data
• Shared vocabulary for structured reports and quality indices
Summary
• Tools available for sharing
• Use, review, contribute
• Opportunity for broad coordination and increased efficiencies
Thanks
Anne Bruneau, Institut de recherche en biologie végétale and Département de Sciences Biologiques, Université de Montréal
Contact
http://www.canadensys.net
http://github.com/Canadensys
@Canadensys
Gulo gulo, Larry Master (www.masterimages.org)