Modelling and computing the quality of information in e-science
Paolo Missier, Suzanne Embury, Mark Greenwood
School of Computer Science
University of Manchester, UK
Alun Preece, Binling Jin
Department of Computing Science
University of Aberdeen, UK
http://www.qurator.org
Roma, 3/4/07
Combining the strengths of UMIST and The Victoria University of Manchester
Quality of data
Main driver, historically: data cleaning for
• Integration: use of same IDs across data sources
• Warehousing, analytics:
– restore completeness,
– reconcile referential constraints
– cross-validation of numeric data by aggregation
Focus:
• Record de-duplication, reconciliation, “linkage”
– Ample literature: see e.g. the Nov 2006 issue of IEEE TKDE
• Consistency of data across sources
• Managing uncertainty in databases (Trio - Stanford)
Data quality control in the data management practice
Common quality issues
• Completeness: not missing any of the results
• Correctness: each data item should reflect the actual real-world entity that it is intended to model
– The actual address where you live, the correct balance in your bank account…
• Timeliness: delivered in time for use by a consumer process
– E.g. stock information
• …
Taxonomy for data quality dimensions
Our motivation: quality in public e-science data
GenBank
UniProt
EnsEMBL
Entrez
dbSNP
• Large volumes of data in many public repositories
• Increasingly creative uses for this data
Problem: using third party data of unknown quality may result in misleading scientific conclusions
Some quality issues in biology
“Quality” covers a broader spectrum of issues than traditional DQ
• “X% of database A may be wrong (unreliable) – but I have no easy way to test that”
• “This microarray data looks ok but is testing the wrong hypothesis”
• The output from this sequence matching algorithm produces false positives
• …
Each of these issues calls for a separate testing procedure
Difficult to generalize
Correctness in biology - examples
Data type | Creation process | Correctness
UniProt protein annotation | Manual curation | Functional annotation f for p is correct if function f can reliably be attributed to p
Qualitative proteomics: protein identification | Generate peptide peak lists, match peak lists (e.g. Imprint) | No false positives: every protein in the output is actually present in the cell sample
Transcriptomics: gene expression report (up/down-regulation) | Microarray data analysis | No false positives, no false negatives
Defining quality in e-science is challenging
• In-silico experiments express cutting-edge research
– Experimental data liable to change rapidly
– Definitions of quality are themselves experimental
• Scientists’ quality requirements often just a hunch
– Quality tests missing or based on experimental heuristics
– Definitions of quality criteria are personal and subjective
• Quality controls tightly coupled to data processing
– Often implicit and embedded in the experiment
– Not reusable
“Quality” = personal criteria for data acceptability
Research goals
1. Make personal definitions of quality explicit and formal
– Identify a common denominator for quality concepts
– Expressed as a conceptual model for Information Quality
Elicit “nuggets” of latent quality knowledge from the experts
2. Make existing data processing quality-aware
– Define an architectural framework that accommodates personal definitions of quality
– Compute quality levels and expose them to the user
Example: protein identification
Pipeline: “wet lab” experiment → data output → protein identification algorithm → protein hitlist → protein function prediction
Correct entry = true positive
Evidence:
mass coverage (MC) measures the amount of protein sequence matched
Hit ratio (HR) gives an indication of the signal to noise ratio in a mass spectrum
ELDP reflects the completeness of the digestion that precedes the peptide mass fingerprinting
This evidence is independent of the algorithm / SW package
It is readily available and inexpensive to obtain
Correctness of protein identification
Estimator function: (computes a score rather than a probability)
PMF score = (HR x 100) + MC + (ELDP x 10)
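As a minimal sketch, the estimator can be written as a plain function. Only the weights come from the formula above; the function name, parameter names and the sample evidence values are illustrative assumptions.

```python
def pmf_score(hit_ratio: float, mass_coverage: float, eldp: float) -> float:
    """PMF score = (HR x 100) + MC + (ELDP x 10), with the weights from the slide."""
    return (hit_ratio * 100) + mass_coverage + (eldp * 10)


# Hypothetical evidence values for a single protein hit (not from the talk).
example_score = pmf_score(hit_ratio=0.5, mass_coverage=30.0, eldp=2.0)
print(example_score)
```

The result is a score used for ranking hits, not a probability, so only comparisons between hits are meaningful.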
Prediction performance – comparing 3 models:
ROC curve: true positives vs false positives
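For readers unfamiliar with ROC curves, the sketch below shows one way the points of such a curve could be computed from scored hits with known ground truth. The function name and the toy data are assumptions made for illustration; they are not the models or results compared in the talk.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[bool]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    A hit is predicted positive when its score is >= the threshold.
    Assumes the labels contain at least one positive and one negative.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, is_true in zip(scores, labels) if s >= threshold and is_true)
        fp = sum(1 for s, is_true in zip(scores, labels) if s >= threshold and not is_true)
        points.append((fp / negatives, tp / positives))
    return points


# Toy data, invented for illustration only.
toy_scores = [90.0, 80.0, 70.0, 60.0]
toy_labels = [True, True, False, True]
points = roc_points(toy_scores, toy_labels)
print(points)
```

A model whose curve rises towards the top-left corner separates true positives from false positives better than one that stays close to the diagonal.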
Quality process components
Pipeline: “wet lab” experiment → data output → protein identification algorithm → protein hitlist → protein function prediction
Goal: to automatically add the additional filtering step in a principled way
Quality filtering
Quality assertion: PMF score = (HR x 100) + MC + (ELDP x 10)
Evidence:
• Mass coverage (MC)
• Hit ratio (HR)
• ELDP
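The filtering step can be sketched as a function over the protein hitlist. This is an illustration under assumptions: the dictionary layout of a hit, the protein identifiers, the evidence values and the threshold are all invented here; only the score formula comes from the slide.

```python
from typing import Dict, List


def pmf_score(hr: float, mc: float, eldp: float) -> float:
    # Weights as given on the slide.
    return (hr * 100) + mc + (eldp * 10)


def quality_filter(hitlist: List[Dict], min_score: float) -> List[Dict]:
    """Keep only the hits whose PMF score reaches the (assumed) threshold."""
    return [
        hit for hit in hitlist
        if pmf_score(hit["HR"], hit["MC"], hit["ELDP"]) >= min_score
    ]


# Hypothetical hitlist; identifiers and evidence values are made up.
hitlist = [
    {"protein": "P1", "HR": 0.5, "MC": 30.0, "ELDP": 2.0},
    {"protein": "P2", "HR": 0.25, "MC": 10.0, "ELDP": 0.0},
]
accepted = quality_filter(hitlist, min_score=50.0)
print([hit["protein"] for hit in accepted])
```

Because the filter only reads MC, HR and ELDP, it does not depend on which identification algorithm produced the hitlist.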
Quality Assertions
QA(D): any function of evidence (metadata for D) that computes equivalence classes on D
1. Score model (total or partial order)
2. Classification model
Quality-equivalent regions (labelled A, B, C and D in the figure)
Actions associated with regions: e.g. accept/reject, but possibly more
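One way to read this definition: a quality assertion maps each data item to a value, and items with the same value form a quality-equivalent region. The sketch below illustrates that reading with a classification model; the class names, cut-offs and sample items are assumptions, not part of Qurator.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List


def partition(items: List[dict], assertion: Callable[[dict], Hashable]) -> Dict[Hashable, List[dict]]:
    """Group items into regions: equal assertion values share a region."""
    regions = defaultdict(list)
    for item in items:
        regions[assertion(item)].append(item)
    return dict(regions)


def classify(item: dict) -> str:
    # Classification model with illustrative cut-offs (assumptions).
    score = item["score"]
    if score >= 80:
        return "high"
    if score >= 40:
        return "mid"
    return "low"


# Actions associated with regions; accept/reject is the simplest case.
actions = {"high": "accept", "mid": "accept", "low": "reject"}

items = [{"id": "a", "score": 95}, {"id": "b", "score": 50}, {"id": "c", "score": 10}]
regions = partition(items, classify)
decisions = {
    item["id"]: actions[label]
    for label, members in regions.items()
    for item in members
}
print(decisions)
```

A score model fits the same shape if the assertion returns the score itself, in which case the regions are ordered by that score.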
Layered definition of Quality
Data sources (DB, Env)
Annotation functions, producing quality evidence annotations
Quality Assertion functions (QA), capturing custom quality knowledge
Quality Views (QV): definition of acceptability regions
Figure side labels: long-lived, reusable; commodities, expert-defined; dynamic, user-controlled
Abstract Quality Views
An operational definition for personal quality:
1. Formulate a quality assertion on the dataset:
– i.e. a ranking of proteins by PMF score
– “quality knowledge, possibly subjective”
2. Identify underlying evidence necessary to compute the assertion
– the variables used to compute the score (HR, MC, ELDP)
– Objective, inexpensive
3. Define annotation functions that compute evidence values
– Functions that compute HR, MC, ELDP
4. Define quality regions on the ranked dataset
– In this case, intervals of acceptability
5. Associate actions with each region
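The five steps can be sketched as a small data structure. This is an illustration only: the class name, field names, region bounds and action names are assumptions, and Qurator itself expresses quality views declaratively rather than as Python objects.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class AbstractQualityView:
    # Steps 2 and 3: one annotation function per evidence type.
    annotators: Dict[str, Callable[[dict], float]]
    # Step 1: quality assertion computed from the evidence values.
    assertion: Callable[[Dict[str, float]], float]
    # Step 4: regions as (name, lower bound) intervals on the score.
    regions: List[Tuple[str, float]]
    # Step 5: action associated with each region.
    actions: Dict[str, str]

    def apply(self, item: dict) -> str:
        evidence = {name: fn(item) for name, fn in self.annotators.items()}
        score = self.assertion(evidence)
        # Check regions from the highest lower bound downwards.
        for name, lower in sorted(self.regions, key=lambda r: r[1], reverse=True):
            if score >= lower:
                return self.actions[name]
        raise ValueError("score falls below every region")


view = AbstractQualityView(
    annotators={k: (lambda item, k=k: item[k]) for k in ("HR", "MC", "ELDP")},
    assertion=lambda ev: ev["HR"] * 100 + ev["MC"] + ev["ELDP"] * 10,
    regions=[("acceptable", 50.0), ("unacceptable", float("-inf"))],
    actions={"acceptable": "accept", "unacceptable": "reject"},
)
print(view.apply({"HR": 0.5, "MC": 30.0, "ELDP": 2.0}))
```

Keeping the regions and actions separate from the assertion reflects the idea that a user can change what counts as acceptable without touching the expert-defined scoring.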
Computable quality views as commodities
Cost-effective quality-awareness for data processing:
• Reuse of high-level definitions of quality views
• Compilation of abstract quality views into quality components
Abstract quality views → binding and compilation → executable quality process
Qurator architectural framework:
- runtime environment
- data-specific quality services
Quality hypotheses discovery and testing
Figure: quality model definition → abstract quality view → targeted compilation → target-specific quality component → deployment → quality-enhanced user environment → execution on test data → quality model performance assessment
Multiple target environments:
• Workflow
• Query processor
Experimental quality
Making data processing quality-aware using Quality Views
– Query, browsing, retrieval, data-intensive workflows
Discovery and validation: “nuggets of quality knowledge”
Figure labels: Quality View, model testing, test datasets
Embedding quality views and flow-through testing
Execution model for Quality views
Binding → compilation → executable component
– Sub-flow of an existing workflow
– Query processing interceptor
Figure: the QV compiler turns an abstract quality view into an embedded quality workflow inside the host workflow
Host workflow: D → D’
Quality view on D’
Qurator quality framework: services registry, services implementation
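The embedding idea can be illustrated with plain function composition: the host step turns D into D’, and the quality view is then applied to D’. This is a sketch under assumptions, standing in for a workflow sub-flow; the step names and the sample data are hypothetical and this is not the Qurator or Taverna API.

```python
from typing import Callable, List

Step = Callable[[List[dict]], List[dict]]


def embed_quality_view(host_step: Step, quality_view: Step) -> Step:
    """Return a step that runs the host step (D -> D') and then applies
    the quality view to its output D'."""
    def embedded(data: List[dict]) -> List[dict]:
        d_prime = host_step(data)
        return quality_view(d_prime)
    return embedded


# Hypothetical host step and quality view, for illustration only.
def identify_proteins(data: List[dict]) -> List[dict]:
    return [dict(row, identified=True) for row in data]


def keep_high_coverage(data: List[dict]) -> List[dict]:
    return [row for row in data if row["Coverage"] > 12]


workflow = embed_quality_view(identify_proteins, keep_high_coverage)
result = workflow([{"id": "P1", "Coverage": 20}, {"id": "P2", "Coverage": 5}])
print(result)
```

The host step is left unchanged, which matches the goal of making existing data processing quality-aware without rewriting it.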
Example: original proteomics workflow
Taverna workflow
Quality flow embedding point
Example: embedded quality workflow
Interactive conditions / actions
Generic quality process pattern
Collect evidence
- Fetch persistent annotations
- Compute on-the-fly annotations
<variables>
  <var variableName="Coverage" evidence="q:Coverage"/>
  <var variableName="PeptidesCount" evidence="q:PeptidesCount"/>
</variables>
Evaluate conditions
Execute actions
<action>
  <filter>
    <condition>ScoreClass in {"q:high", "q:mid"} and Coverage > 12</condition>
  </filter>
</action>
Compute assertions
Classifiers
<QualityAssertion
  serviceName="PIScoreClassifier"
  serviceType="q:PIScoreClassifier"
  tagSemType="q:PIScoreClassification"
  tagName="ScoreClass"
Persistent evidence
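The pattern on this slide (collect evidence, compute assertions, evaluate conditions, execute actions) can be sketched as a single loop. The filter condition mirrors the one in the XML fragment; everything else, including the classifier rule, the item layout and the sample values, is a hypothetical stand-in for the actual Qurator services.

```python
from typing import Callable, Dict, List


def run_quality_process(
    items: List[dict],
    persistent_evidence: Dict[str, Dict[str, float]],
    on_the_fly: Dict[str, Callable[[dict], float]],
    classifier: Callable[[Dict[str, float]], str],
) -> List[dict]:
    accepted = []
    for item in items:
        # 1. Collect evidence: fetch stored annotations, compute the rest.
        evidence = dict(persistent_evidence.get(item["id"], {}))
        for name, fn in on_the_fly.items():
            if name not in evidence:
                evidence[name] = fn(item)
        # 2. Compute assertions: tag the item with a score class.
        score_class = classifier(evidence)
        # 3. Evaluate the condition; 4. execute the action (a filter).
        if score_class in {"q:high", "q:mid"} and evidence["Coverage"] > 12:
            accepted.append(item)
    return accepted


def pi_score_classifier(evidence: Dict[str, float]) -> str:
    # Hypothetical rule; the real PIScoreClassifier is not shown in the talk.
    count = evidence["PeptidesCount"]
    if count >= 10:
        return "q:high"
    return "q:mid" if count >= 5 else "q:low"


items = [
    {"id": "P1", "peptides": ["x"] * 12},
    {"id": "P2", "peptides": ["x"] * 6},
    {"id": "P3", "peptides": ["x"] * 2},
]
stored = {"P1": {"Coverage": 30.0}, "P2": {"Coverage": 8.0}, "P3": {"Coverage": 40.0}}
kept = run_quality_process(
    items, stored, {"PeptidesCount": lambda it: len(it["peptides"])}, pi_score_classifier
)
print([item["id"] for item in kept])
```

The sketch assumes that Coverage is always available as stored evidence; a missing value would raise a KeyError rather than being handled.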
Reference (semantic) model
Data sources (DB, Env)
Annotation functions, producing quality evidence annotations
Quality Assertion functions (QA), capturing custom quality knowledge
Quality Views (QV): definition of acceptability regions
Common Semantic Model (IQ Ontology)
A semantic model for quality concepts
Quality “upper ontology” (OWL)
Quality evidence types
Evidence annotations are class instances
Evidence meta-data model (RDF)
Main taxonomies and properties
Class restriction: MassCoverage is-evidence-for . ImprintHitEntry
Class restrictions:
PIScoreClassifier assertion-based-on-evidence . HitScore
PIScoreClassifier assertion-based-on-evidence . MassCoverage
assertion-based-on-evidence: QualityAssertion → QualityEvidence
is-evidence-for: QualityEvidence → DataEntity
The ontology-driven user interface
Detecting inconsistencies: no annotators for this Evidence type
Detecting inconsistencies: unsatisfied input requirements for Quality Assertion
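The two checks named on this slide can be approximated with plain set operations. This is a sketch only: the actual interface is ontology-driven, and the function name, the argument layout and the annotator name ImprintAnnotator are assumptions introduced for illustration.

```python
from typing import Dict, List, Set


def check_quality_view(
    assertion_inputs: Dict[str, Set[str]],
    annotators: Dict[str, Set[str]],
    declared_evidence: Set[str],
) -> List[str]:
    """Report the two kinds of inconsistency named on the slide."""
    problems = []
    # Evidence types that at least one annotator can produce.
    provided = set().union(*annotators.values())
    for evidence in sorted(declared_evidence - provided):
        problems.append(f"no annotators for evidence type {evidence}")
    for assertion, needed in sorted(assertion_inputs.items()):
        missing = sorted(needed - declared_evidence)
        if missing:
            problems.append(
                f"unsatisfied input requirements for {assertion}: {', '.join(missing)}"
            )
    return problems


problems = check_quality_view(
    assertion_inputs={"PIScoreClassifier": {"HitScore", "MassCoverage"}},
    annotators={"ImprintAnnotator": {"HitScore"}},  # hypothetical annotator name
    declared_evidence={"HitScore", "PeptidesCount"},
)
for line in problems:
    print(line)
```

In the ontology-based version, the required inputs would presumably follow from the assertion-based-on-evidence restrictions rather than from a hand-written dictionary.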
Qurator architecture
Quality-aware query processing
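Figure: a query client sends a quality-aware query to the query processor (SQL, XQuery), which evaluates it over the data and returns a result R. The Quality View component runs annotate, assert and act steps over R, using evidence, and returns R’. Other figure labels: dump.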
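A minimal sketch of the annotate, assert and act steps applied to a query result, using Python's built-in sqlite3 module as a stand-in query processor. The table, column names, class labels and threshold are assumptions made for illustration and do not describe the Qurator implementation.

```python
import sqlite3
from typing import Callable, Dict, List, Set


def quality_aware_query(
    conn: sqlite3.Connection,
    sql: str,
    annotate: Callable[[sqlite3.Row], Dict[str, float]],
    assert_quality: Callable[[Dict[str, float]], str],
    accepted_classes: Set[str],
) -> List[sqlite3.Row]:
    """Run the query to get R, then annotate, assert and act to get R'."""
    conn.row_factory = sqlite3.Row
    r = conn.execute(sql).fetchall()
    r_prime = []
    for row in r:
        evidence = annotate(row)                   # annotate
        quality_class = assert_quality(evidence)   # assert
        if quality_class in accepted_classes:      # act (filter)
            r_prime.append(row)
    return r_prime


# Hypothetical in-memory data for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (protein TEXT, coverage REAL)")
conn.executemany("INSERT INTO hits VALUES (?, ?)", [("P1", 30.0), ("P2", 5.0)])
rows = quality_aware_query(
    conn,
    "SELECT protein, coverage FROM hits ORDER BY protein",
    annotate=lambda row: {"Coverage": row["coverage"]},
    assert_quality=lambda ev: "high" if ev["Coverage"] > 12 else "low",
    accepted_classes={"high"},
)
print([row["protein"] for row in rows])
```

The original query text is passed through untouched; quality handling happens entirely on the result set, which is what allows the same quality view to sit behind different query clients.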
Research issues
Quality modelling:
• Provenance as evidence
– Can data/process provenance be turned into evidence?
• Experimental elicitation of new Quality Assertions
– Seeking new collaborations with biologists!
• Classification with uncertainty
– Data elements belong to a quality class with some probability
• Computing Quality Assertions with limited evidence
– Evidence may be expensive and sometimes unavailable
– Robust classification / score models
Architecture:
• Metadata management model
– Quality Evidence is a type of metadata with known features…
Summary
For complex data types, often no single “correct” and agreed-upon definition of quality of data
• Qurator provides an environment for fast prototyping of quality hypotheses
– Based on the notion of “evidence” supporting a quality hypothesis
– With support for an incremental learning cycle
• Quality views offer an abstract model for making data processing environments quality-aware
– To be compiled into executable components and embedded
– Qurator provides an invocation framework for Quality Views
Publications: http://www.qurator.org
Qurator is registered with OMII-UK