Databases for Knowledge Discovery
Jan H. van Bemmel, Erasmus University Rotterdam
● Natural sciences (physics, chemistry, engineering): models, experiments, theories ► 'hard' data
● Humanities (arts, social sciences, economics): behavioural studies, text analysis ► 'soft' data
● Biomedical and health sciences (biomedicine, health sciences): models, experiments, studies ► hard and soft data
Biomedical research: related to the 'hard' scientific approach, as in physics and engineering
Clinical research: uses mostly 'hard' data, and sometimes 'soft' subjective observations
Population-based research: data collected from populations of healthy and ill persons. This research can be subdivided into:
• retrospective research
• prospective research
● Biomedicine & health sciences
[Figure: Biomedicine and Health Sciences. Labels: Basic Research (experiments), Clinical Research (patients), Health Research (populations); three Regional Databases; Research Database.]
Discovery of new scientific knowledge from large databases of measurements, observations and interpretations
Until recently, basic research in biomedicine was done on organs and organisms.
Nowadays the fundamental challenges lie an order of magnitude lower: at the level of molecules and cells.
Research on organs and organisms is still of interest: breakthroughs from biomolecular research are to be translated to higher levels.
● Biomedicine & health sciences
Knowledge contained in multiple databases of refereed articles and databases on genes and proteins
MedLine: 11 million abstracts; 500,000/year
• searching for articles in one's sphere of interest
• how to find new knowledge?
• how to cope with serendipity?
● Biomedical research
Different methods to retrieve knowledge:
• simple Boolean expressions (too specific: few references; too broad: avalanche of references)
• use of a more complex 'fingerprint'
• combination of different databases: complex retrieval using an ontology database
● Biomedical research
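The contrast between Boolean retrieval and fingerprint matching can be sketched as follows. This is a minimal illustration, assuming that a fingerprint is a weighted vector of concepts and that fingerprints are compared by cosine similarity; the documents, concept names and weights are hypothetical and not taken from any system mentioned in these slides.

```python
import math

# Hypothetical documents, each reduced to a fingerprint: concept -> weight.
DOCS = {
    "doc1": {"diabetes": 0.9, "insulin": 0.7, "gene": 0.2},
    "doc2": {"diabetes": 0.3, "retina": 0.8},
    "doc3": {"gene": 0.9, "protein": 0.6, "insulin": 0.4},
}

def boolean_and(docs, terms):
    """Boolean AND query: keep documents that contain every term."""
    return sorted(name for name, fp in docs.items() if all(t in fp for t in terms))

def cosine(a, b):
    """Cosine similarity between two sparse fingerprints."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_by_fingerprint(docs, query):
    """Rank all documents by similarity to the query fingerprint."""
    return sorted(docs, key=lambda name: cosine(docs[name], query), reverse=True)

query = {"diabetes": 1.0, "insulin": 0.8, "gene": 0.5}
print(boolean_and(DOCS, ["diabetes", "insulin", "gene"]))  # narrow query: few hits
print(boolean_and(DOCS, ["diabetes"]))                     # broad query: more hits
print(rank_by_fingerprint(DOCS, query))                    # graded ranking of all documents
```

The Boolean query either keeps or drops a document, whereas the fingerprint ranking orders every document by degree of overlap, so partially matching documents are not lost.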
[Figure labels: forward, inverse]
● Biomedical research
[Figure: content fingerprints (jobs, CVs, skills; articles, books; emails, Word, RFPs); people fingerprints (average); organisation fingerprints (average).]
[Figure: matching methods to find new associations between a Genetics Database and a Literature Database.]
● Biomedical research
Data mining: A – B and B – C suggest A – C
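The A – B, B – C, A – C pattern can be read as follows: if known associations link A to B and B to C, but A and C are not linked directly, then A – C is a candidate new association. A minimal sketch of that idea, assuming associations are stored as plain pairs; the concept names and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical known associations (undirected), e.g. mined from co-occurrence in abstracts.
KNOWN = [("A", "B"), ("B", "C"), ("C", "D")]

def candidate_links(pairs):
    """Return {(a, c): {b, ...}} for concepts a, c that are not linked directly
    but share at least one intermediate concept b."""
    neighbours = defaultdict(set)
    for x, y in pairs:
        neighbours[x].add(y)
        neighbours[y].add(x)
    candidates = defaultdict(set)
    for b, linked in list(neighbours.items()):
        for a in linked:
            for c in linked:
                if a < c and c not in neighbours[a]:
                    candidates[(a, c)].add(b)
    return dict(candidates)

print(candidate_links(KNOWN))
```

Each candidate pair is returned together with the intermediate concepts that support it, so a researcher can inspect why the link was proposed.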
Composition of a thesaurus from separate databases
GDB: AAA; BBB
LocusLink: AAA; CCC
Hugo NC: AAA
OMIM: BBB; CCC
SwissProt: BBB
concept: AAA
synonyms: BBB; CCC
● Biomedical research
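One way to arrive at a single concept with synonyms from the database entries listed here is to merge every set of names that shares at least one name. A minimal sketch, assuming that the concept label is the name listed by Hugo NC; that choice, like the function name, is an assumption made for illustration.

```python
# Names recorded for the same gene in each source database (placeholders from the slide).
SOURCES = {
    "GDB": ["AAA", "BBB"],
    "LocusLink": ["AAA", "CCC"],
    "Hugo NC": ["AAA"],
    "OMIM": ["BBB", "CCC"],
    "SwissProt": ["BBB"],
}

def build_thesaurus(sources, preferred_source="Hugo NC"):
    """Merge overlapping name sets; return a list of (concept, synonyms) entries."""
    groups = []  # disjoint sets of names, one set per concept
    for names in sources.values():
        merged = set(names)
        rest = []
        for group in groups:
            if group & merged:
                merged |= group   # overlapping group: absorb it
            else:
                rest.append(group)
        groups = rest + [merged]
    preferred = set(sources.get(preferred_source, []))
    entries = []
    for group in groups:
        # Assumption: prefer a name from preferred_source, else the alphabetically first.
        label = min(group & preferred) if group & preferred else min(group)
        entries.append((label, sorted(group - {label})))
    return entries

print(build_thesaurus(SOURCES))  # [('AAA', ['BBB', 'CCC'])]
```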
[Figure: Collexion ontology database, ACS constructor, ACS model, ACS viewer, ACS validation. ACS: Associative Concept Space.]
● Clinical research
[Figure: Growth of information systems in primary care: percentage of primary care practices (0 to 100) with computer-based patient records, by year ('78 to '98), for the UK and NL.]
BloodLink: the impact of guidelines-based decision support on lab test ordering in primary care.
● Clinical research
BloodLink controlled clinical trial

                      Control Group   Guideline Group
No. of practices      21              23
No. of physicians     29              31
No. of patients       97,177          98,432
Sick funds            52%             52%
No. of order forms    12,786          12,700
● Clinical research

Test                 BloodLink Guideline   BloodLink control   Difference
ESR                  5612                  7932                -29%
Hemoglobin           6061                  7332                -17%
WBC count            3719                  5039                -26%
Hematocrit           3611                  4830                -25%
Creatinine           3314                  5024                -34%
Erythrocytes         3360                  4690                -28%
MCV                  3159                  4642                -32%
Differential count   3060                  4151                -26%
Cholesterol          3413                  4354                -1%
TSH                  3213                  2954                +9%
Gamma-GT             2004                  3466                -42%
Glucose in serum     2964                  2501                +19%
ALAT (SGPT)          1892                  2850                -34%
Potassium            1096                  2320                -53%
ASAT (SGOT)          959                   2269                -58%
Glucose fasting      1286                  1611                -20%
Triglycerides        1398                  1380                +1%
HDL cholesterol      1350                  1382                -2%
Sodium               745                   1070                -30%
Free T4              618                   1163                -47%
In case of thyroid disease, physicians used to order the free T4 test (free thyroxine); the protocol prescribed the TSH test (thyroid-stimulating hormone) instead.
TSH: +9%
Free T4: -47%
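For the rows used here, the Difference values agree with the relative change of the guideline group with respect to the control group. A minimal sketch using three rows of the test table; the function name is illustrative.

```python
def relative_change(guideline, control):
    """Percentage change of the guideline count relative to the control count."""
    return round(100 * (guideline - control) / control)

# (guideline, control) counts for three tests, copied from the table.
ROWS = {"ESR": (5612, 7932), "TSH": (3213, 2954), "Free T4": (618, 1163)}

for test, (guideline, control) in ROWS.items():
    print(f"{test}: {relative_change(guideline, control):+d}%")
```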
Tests such as ASAT (SGOT, serum glutamic oxaloacetic transaminase), Gamma-GT and ALAT (SGPT) had been ordered almost automatically; the protocols, however, did not support such tests. The same applies to potassium (K+).
Gamma-GT: -42%
ALAT (SGPT): -34%
ASAT (SGOT): -58%
Potassium: -53%
BloodLink controlled clinical trial

                                      Control Group   Guideline Group
No. of practices                      21              23
No. of GPs                            29              31
No. of patients                       97,177          98,432
Sick funds                            52%             52%
No. of order forms                    12,786          12,700
% of forms generated by BloodLink     89%             73%
No. of requested tests                87,634          70,479
Average No. of tests per order (1)    6.9             5.5
(1) Student's t-test, N=44, p<0.001
● Clinical research
Cardiology
● Clinical research
Critiquing system for hypertension

#     sens   spec
1     0.94   0.36
2     0.86   0.70
3     0.72   0.82
4     0.65   0.75
5     0.73   0.69
6     0.70   0.78
7     0.88   0.52
8     0.74   0.77
CS    0.74   0.88

[Figure: sens (%) versus spec (%), both axes 0 to 100.]
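Sensitivity and specificity are derived from a 2×2 table of decisions against a reference standard. A minimal sketch with hypothetical counts, since the counts behind the figures in the table are not given; they are chosen only so that the result matches the CS row.

```python
def sensitivity(tp, fn):
    """Fraction of reference-positive cases that were flagged."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of reference-negative cases that were not flagged."""
    return tn / (tn + fp)

# Hypothetical counts, chosen only to reproduce sens 0.74 and spec 0.88.
tp, fn, tn, fp = 74, 26, 88, 12
print(round(sensitivity(tp, fn), 2), round(specificity(tn, fp), 2))
```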
● Clinical research
Rows: reference class

Class     N     NL    LVH    RVH    BVH    AMI    IMI    MIX    OTH  VH+MI
NL      382   95.5    0.9    0.4    0.0    1.4    1.6    0.0    0.1
LVH     183   19.0   69.0    0.5    0.0    4.3    6.9    0.2    0.0
RVH      55   40.6    6.7   45.8    2.7    1.2    2.1    0.0    0.9
BVH      53   22.0   54.7   14.5    1.6    5.3    1.9    0.0    0.0
AMI     170   14.3    2.6    0.6    0.0   80.0    1.8    0.7    0.0
IMI     273   19.8    2.6    0.2    0.0    0.7   76.7    0.1    0.0
MIX      73    2.5    4.1    1.6    0.0   51.6   37.4    2.7    0.0
VH+MI    31   22.6    0.0    0.0    0.0    0.0    0.0    0.0   16.1   61.3
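A classification table like this one can be summarised per class and overall. The sketch below assumes that each row holds percentages of the N reference cases in that class, and that a case counts as correct only when the assigned class equals the reference class. The slide does not state how the results were actually scored, so the overall figure is illustrative only.

```python
# (reference class, N, % assigned to the same class), as read from the table.
DIAGONAL = [
    ("NL", 382, 95.5), ("LVH", 183, 69.0), ("RVH", 55, 45.8), ("BVH", 53, 1.6),
    ("AMI", 170, 80.0), ("IMI", 273, 76.7), ("MIX", 73, 2.7), ("VH+MI", 31, 61.3),
]

def total_cases(rows):
    """Total number of reference cases over all classes."""
    return sum(n for _, n, _ in rows)

def overall_agreement(rows):
    """N-weighted mean of the per-class percentages."""
    return sum(n * pct for _, n, pct in rows) / total_cases(rows)

print(total_cases(DIAGONAL))
print(round(overall_agreement(DIAGONAL), 1))
```

Weighting by N matters here because the classes differ widely in size: the small classes with low agreement pull the overall figure down far less than an unweighted mean would suggest.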
● Clinical research
Computer-assisted ECG interpretation
Assessment of different interpretation programs
[Figure: % agreement with referees versus % agreement with clinical data, both axes 60 to 90; series: cardiologists, systems.]
● Population-based research: retrospective
Post-marketing surveillance of drugs
Combinations of drugs: interactions
Longitudinal databases of about 500,000 patients
Patient privacy and data security
[Figure: Growth of information systems in primary care (computer-based patient records), UK and NL; health care practices with CPRs and a Central Database.]
● Population-based research: retrospective
[Figure: Research database; research data (four sources); population-based research.]
coupling of clinical data to a genealogical database
municipal records of more than 20,000 individuals
each disorder could be coupled to a common ancestor:
genes involved in diabetes, Alzheimer's disease, etc.
[Figure: pedigree tree (recessive).]
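Coupling patients to a genealogical database makes it possible to search for ancestors shared by all patients with a given disorder. A minimal sketch, assuming the genealogy is available as a mapping from each person to their parents; all identifiers are hypothetical.

```python
# Hypothetical genealogy: person -> list of parents.
PARENTS = {
    "p1": ["m1", "f1"], "p2": ["m2", "f2"],
    "m1": ["g1", "g2"], "f2": ["g1", "g3"],
}

def ancestors(person, parents):
    """All ancestors of a person, found by walking up the pedigree."""
    found, stack = set(), list(parents.get(person, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found

def common_ancestors(patients, parents):
    """Ancestors shared by every patient in the list."""
    sets = [ancestors(p, parents) for p in patients]
    return set.intersection(*sets) if sets else set()

print(common_ancestors(["p1", "p2"], PARENTS))  # {'g1'}
```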
● Population-based research: prospective
Rotterdam Study
Prospective longitudinal database
10,000 persons over 55 years of age
relationships between risks and diseases
cardiovascular and vessel-wall diseases, glaucoma
neurologic diseases (Alzheimer's), osteoporosis
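In a prospective cohort, the relationship between a risk factor and a disease is often expressed as a relative risk. A minimal sketch with hypothetical counts; none of these numbers come from the Rotterdam Study.

```python
def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk of disease among the exposed divided by the risk among the unexposed."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed

# Hypothetical follow-up counts for one risk factor and one disease.
rr = relative_risk(exposed_cases=60, exposed_total=2000,
                   unexposed_cases=120, unexposed_total=8000)
print(round(rr, 2))
```

A value above 1 indicates that the disease occurred more often among persons with the risk factor than among those without it.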
● Population-based research: prospective
Generation R
Prospective longitudinal database
10,000 children from pregnancy onwards
relations between risks and genetic/environmental data
perinatal circumstances, diseases at young age
cultural backgrounds, impact of education, etc.
A formal ('forward') method of analysing large research databases may hamper the flexible attitude of a researcher who does not know in advance what to expect (serendipity).
'Hard' and 'soft' examples from biomedicine and the health sciences show that computers can be very helpful in finding new and unforeseen ('inverse') associations between the data stored in research databases.
Well-documented databases are an enormous treasure for the advancement of scientific research.