Automating the Committee Meeting: Intelligent Integration of Information From Diverse Sources
Pedrito Maynard-Zhang
Department of Computer Science & Systems Analysis
Information Integration
Information integration is ubiquitous:
• Committee meetings
• Research papers
• Information retrieval on the web
• Assessing intelligence on the battlefield
• …
Information Integration
Outline
• Introduction
• Automating Information Integration
  – Database Integration
  – Model Integration
  – Conflict Resolution and Meta-Information
• Integrating Learned Probabilistic Information
• Conclusion and Current Work
Multi-Disciplinary Research
• Databases (e.g., Halevy’s group at U. of Washington)
• Artificial Intelligence (e.g., Stanford’s Knowledge Systems Laboratory)
• Business (e.g., MIT-Sloan’s Aggregators Group)
• Decision Analysis (e.g., Clemen & Winkler’s work at Duke)
Database Integration
[Figure: a bioinformatics query is posed to a mediation layer (Genes, Proteins, Nucleotide Sequences), which sits over the source databases (Gene-Clinics, Locus-Link, Entrez, OMIM).]
Database Integration
• Application: Querying distributed databases
• Examples
  – Bioinformatics
  – Corporate data management
  – Question-answer systems on the web
  – Detecting bioterrorism
Model Integration
[Figure: an expert system (e.g., the rule “if cancer then operate…”), a probabilistic model, and a mathematical model are combined into a super model.]
Model Integration
• Applications: Diagnosis and prediction
• Examples:
  – Medical diagnosis
  – NASA spacecraft design and diagnosis
  – Expert system integration
  – Combining commonsense knowledge bases
Challenges
• Efficient query processing and optimization
  – Parsing XML
• Defining expressive yet tractable mediator languages
• Handling heterogeneous source languages
  – Wrapper technology development
Challenges
• Resolving ontological differences
  – e.g., realizing that the field “Name” for one source stores the same information as “First Name” and “Last Name” for another.
• Detecting conflicts
• Resolving conflicts
  – Resolution done manually in practice
  – We can automate more!
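The “Name” versus “First Name”/“Last Name” example can be sketched in a few lines. This is only an illustration: the record layouts, the mediated field `name`, and the helper `to_mediated` are all hypothetical, and real ontology matching has to discover such mappings rather than hard-code them.

```python
def to_mediated(record):
    """Map a source record onto a hypothetical mediated schema with one 'name' field."""
    if "Name" in record:
        return {"name": record["Name"].strip()}
    if "First Name" in record and "Last Name" in record:
        first = record["First Name"].strip()
        last = record["Last Name"].strip()
        return {"name": f"{first} {last}"}
    raise ValueError("no known name field in record")


# Two hypothetical sources describing the same person with different fields.
source_a = {"Name": "Ada Lovelace"}
source_b = {"First Name": "Ada", "Last Name": "Lovelace"}

same_person = to_mediated(source_a) == to_mediated(source_b)
```

Once both records are expressed in the mediated schema, detecting agreement or conflict becomes a plain comparison.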
Uninformed Integration
[Figure: asked “What’s the weather like?”, three anonymous sources answer: raining, sunny, raining.]
Intelligent Integration
[Figure: asked “What’s the weather like?”, the answers are raining, sunny, raining; the sources are now identified as a meteorologist, a practical joker, and one’s own eyes.]
Types of Meta-Information
• Credibility, experience, political clout
• Areas of expertise
• How source acquired information:
  – Source’s sources
  – Processes source used to accumulate information
• Structure of the data representation
Outline
• Introduction
• Automating Information Integration
• Integrating Learned Probabilistic Information
  – Medical Scenario
  – Semantic Framework
  – LinOP-Based Aggregation
  – Aggregating Bayesian Networks
  – Experimental Validation
• Conclusion and Current Work
Medical Expert System Scenario
[Figure: three doctors, with 3, 20, and 10 years of experience respectively, feed their knowledge into an expert system.]
Source Meta-Information
• Doctors learned probabilistic models from patient data using some known standard learning algorithm.
• We know the relative amount of experience doctors have had (i.e., years of practice).
Popular Aggregation Approaches
• Intuition approach: Take simple weighted averages, etc. → unexpected behavior
• Axiomatic approach: Find aggregation algorithm satisfying certain “obvious” properties → impossibility results
• Problem: Not semantically grounded
Aggregation Semantics
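[Figure: M samples are generated from the true distribution p. Running the learning algorithm on all M samples would yield the optimal distribution p*, which is not available in practice. Instead, each source runs the learning algorithm on its own samples to obtain p1, p2, …, pL, and an aggregation algorithm combines these into the aggregate distribution p̂.]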
Linear Opinion Pool (LinOP)
• LinOP: Weighted sum of joint distributions.
• Precisely, for joint distributions $p_i$ and joint variable instantiation $w$,
  $\mathrm{LinOP}(p_1, p_2, \dots, p_L)(w) = \sum_i \alpha_i \, p_i(w)$.
  $\alpha_i$ weights: relative experience.
• Satisfies unanimity, non-dictatorship, and marginalization.
• Doesn’t preserve shared independences.
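A minimal sketch of LinOP over small discrete joints, assuming each distribution is a dict from instantiations to probabilities over the same set of instantiations. The helper names (`linop`, `product_joint`, `is_independent`) and the numbers are illustrative, not from the talk. The example also shows the last point: two sources that each believe X and Y are independent can pool to a joint in which they are dependent.

```python
import itertools


def linop(dists, weights):
    """Weighted sum of joint distributions; weights are normalized to sum to 1."""
    total = sum(weights)
    alphas = [w / total for w in weights]
    return {inst: sum(a * p[inst] for a, p in zip(alphas, dists))
            for inst in dists[0]}


def product_joint(px, py):
    """Joint over binary X and Y in which X and Y are independent."""
    return {(x, y): (px if x else 1 - px) * (py if y else 1 - py)
            for x, y in itertools.product([0, 1], repeat=2)}


def is_independent(joint, tol=1e-12):
    """For two binary variables, checking one cell against the marginals suffices."""
    px = joint[(1, 0)] + joint[(1, 1)]
    py = joint[(0, 1)] + joint[(1, 1)]
    return abs(joint[(1, 1)] - px * py) < tol


p1 = product_joint(0.9, 0.9)   # source 1: X and Y independent, both likely
p2 = product_joint(0.1, 0.1)   # source 2: X and Y independent, both unlikely
pooled = linop([p1, p2], weights=[1, 1])
```

Here `pooled[(1, 1)]` is 0.41 while the pooled marginals are both 0.5, so the product of marginals (0.25) no longer matches the joint.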
LinOP and Joint Learning
If
– sources learn joint distributions using maximum likelihood or MAP learning and
– the same learning framework would be used on the combined data set to learn p*
then $p^* \approx \mathrm{LinOP}(p_1, p_2, \dots, p_L)$.
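The maximum likelihood case can be checked directly on toy data. In this sketch (data sets and names are made up for illustration), each source estimates a discrete distribution by relative frequency, and pooling with weights proportional to each source's sample count reproduces the estimate from the combined data. The MAP case, where priors also enter, is not shown.

```python
from collections import Counter


def mle(samples):
    """Maximum likelihood estimate of a discrete distribution: relative frequencies."""
    counts = Counter(samples)
    n = len(samples)
    return {value: c / n for value, c in counts.items()}


# Hypothetical data seen by two sources.
data1 = ["a", "a", "b", "c"]
data2 = ["a", "b", "b", "b", "c", "c"]

p1, p2 = mle(data1), mle(data2)
p_star = mle(data1 + data2)          # what we would learn from the combined data

m = len(data1) + len(data2)
alphas = [len(data1) / m, len(data2) / m]   # weights: share of the data seen
pooled = {v: alphas[0] * p1.get(v, 0.0) + alphas[1] * p2.get(v, 0.0)
          for v in p_star}
```

The sources never exchange their data here, only their learned distributions and how much data each saw.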
Bayesian Network (BN)
• Summary: Compact, graphical representation of a probability distribution.
• Definition: Directed acyclic graph (DAG) over nodes (random variables); each node has a local conditional probability distribution (CPD) associated with it.
• Exploits causal structure in the domain.
Alarm BN
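[Figure: Burglary and Earthquake are the parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls. Each node's CPD is listed below.]

P(B) = .001
P(E) = .002

B  E  P(A)
+  +  .95
+  -  .94
-  +  .29
-  -  .001

A  P(J)
+  .90
-  .05

A  P(M)
+  .70
-  .01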
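To make the "compact representation" concrete, the sketch below multiplies the Alarm BN's local CPDs to get a full joint probability. The numbers are the ones on the slide; the function and variable names are illustrative.

```python
import itertools

# CPDs from the Alarm BN: probability that each variable is true given its parents.
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}


def bern(p_true, value):
    """Probability of a binary value given the probability of True."""
    return p_true if value else 1.0 - p_true


def joint(b, e, a, j, m):
    """Chain-rule product of the local CPDs along the DAG."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))


# Both neighbors call and the alarm rings, but there is no burglary or earthquake.
p = joint(b=False, e=False, a=True, j=True, m=True)   # about 0.000628

total = sum(joint(*values)
            for values in itertools.product([True, False], repeat=5))
```

Ten numbers in the CPDs determine all 32 entries of the joint, which is where the compactness comes from.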
BN Advantages
• Compact representation; the graph encodes conditional independences.
• Elicitation easy in practice.
• Inference efficient in practice.
• Can be learned from data.
• Deployed successfully: medical diagnosis, Microsoft Office, NASA Mission Control, and more.
BN Learning
• Idea: Select BN most likely to have generated data.
• Standard algorithm:
  – Search over structures by adding, deleting, and reversing edges.
  – Parameterize and score structures using statistics from the data.
  – Penalize complex structures.
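The scoring and penalty steps can be sketched as follows. The slides do not name a scoring function, so this uses a BIC-style score (log-likelihood under maximum likelihood parameters minus a complexity penalty) as one common choice; binary variables are assumed, and the edge search itself is omitted. All names and data are illustrative.

```python
import math
from collections import Counter


def bic_score(data, structure):
    """data: list of dicts var -> 0/1; structure: dict var -> tuple of parent names."""
    m = len(data)
    score = 0.0
    for var, parents in structure.items():
        # Sufficient statistics: counts per (parent configuration, value) and per configuration.
        family = Counter((tuple(r[p] for p in parents), r[var]) for r in data)
        parent_cfg = Counter(tuple(r[p] for p in parents) for r in data)
        # Log-likelihood of this node's column under maximum likelihood parameters.
        score += sum(n * math.log(n / parent_cfg[pa]) for (pa, _), n in family.items())
        # Penalty: a binary node with k binary parents has 2**k free parameters.
        score -= 0.5 * math.log(m) * (2 ** len(parents))
    return score


copy_data = [{"X": v, "Y": v} for v in (0, 1) for _ in range(50)]       # Y copies X
indep_data = [{"X": x, "Y": y} for x in (0, 1) for y in (0, 1) for _ in range(25)]

no_edge = {"X": (), "Y": ()}
with_edge = {"X": (), "Y": ("X",)}
```

On `copy_data` the structure with the edge scores higher; on `indep_data` the extra parameters buy nothing, so the penalty makes the empty structure win.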
Aggregating BNs
• Each source i learns BN pi.
• p* is the BN we would learn from the combined data set.
• We want to approximate p* as closely as possible by aggregating p1, …, pL.
• Source information: estimates for the relative experience of the sources and the total amount of data seen (M).
AGGR: BN Aggregation Algorithm
• Idea: Use BN learning algorithm.
• Problem: We don’t have data.
• Key observation: We can use LinOP to approximate the statistics needed for the parameterization and scoring steps!
• Also, we can use LinOP properties to make algorithm reasonably efficient.
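The key observation can be illustrated with a rough sketch: a count that the learner would have read off the combined data is approximated by M times the LinOP probability of the same event. This is not the AGGR implementation, only the idea as stated on the slide; the marginals, weights, and M below are hypothetical, and in practice each marginal would be obtained by inference in the source's BN.

```python
def linop_marginal(marginals, weights, key):
    """LinOP applied to the sources' marginals over the same small set of variables."""
    return sum(a * p.get(key, 0.0) for a, p in zip(weights, marginals))


def approx_count(marginals, weights, m_total, key):
    """Approximate the count of `key` in the (unavailable) combined data set."""
    return m_total * linop_marginal(marginals, weights, key)


# Hypothetical marginals over (smoking, lung_cancer) from two sources.
s1 = {(1, 1): 0.10, (1, 0): 0.40, (0, 1): 0.05, (0, 0): 0.45}
s2 = {(1, 1): 0.06, (1, 0): 0.24, (0, 1): 0.07, (0, 0): 0.63}
weights = [0.25, 0.75]    # assumed relative experience
M = 1000                  # assumed estimate of total data seen

n_11 = approx_count([s1, s2], weights, M, (1, 1))
n_10 = approx_count([s1, s2], weights, M, (1, 0))

# Parameterization step: estimate P(lung_cancer = 1 | smoking = 1) from the approximate counts.
p_cancer_given_smoking = n_11 / (n_11 + n_10)
```

Because LinOP satisfies marginalization, pooling the sources' marginals gives the same result as marginalizing the pooled joint, so the full joints never need to be built.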
Asia BN
[Figure: nodes are Visit to Asia, Smoking, Tuberculosis, Lung Cancer, Bronchitis, Abnormality in Chest, X-Ray, and Dyspnea.]
Experimental Setup
• Generate data for sources from the well-known ASIA BN, which relates smoking, visiting Asia, and lung cancer.
• Compare our algorithm AGGR against the optimal algorithm OPT that has access to the combined data set.
• Accuracy measure: KL divergence from generating distribution.
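For reference, a minimal sketch of the accuracy measure for discrete distributions stored as dicts. The direction KL(generating || learned) is an assumption here, since the slide does not spell it out, and the example distributions are made up.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; q must be positive wherever p is positive."""
    return sum(pv * math.log(pv / q[w]) for w, pv in p.items() if pv > 0)


generating = {"a": 0.5, "b": 0.5}   # hypothetical generating distribution
learned = {"a": 0.9, "b": 0.1}      # hypothetical learned distribution

d_self = kl_divergence(generating, generating)
d_learned = kl_divergence(generating, learned)
```

The divergence is zero only when the two distributions agree, so lower values mean the learned model is closer to the generating one.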
Sensitivity to M Experiments
• Sensitivity to M
  – Size of the combined data set M varies.
  – AGGR’s estimate of M is accurate.
• Sensitivity to Estimate of M
  – Size of the combined data set M is fixed.
  – AGGR’s estimate of M varies.
Sensitivity to M
[Plot: KL divergence (0.000 to 0.100) versus M (200 to 3000) for S1, S2, OPT, and AGGR.]
Sensitivity to Estimate of M
[Plot: KL divergence (0.0000 to 0.0016) versus M'/M on a log10 scale (-1.00 to 1.00) for S1, S2, OPT, and AGGR, with M = 10k.]
Subpopulations
• Each source’s data may come from a different subpopulation P(D|Si), where D is the data.
• We want to learn P(D).
• P(D) = LinOP(P(D|S1), P(D|S2), …, P(D|SL)) with sources’ weights based on P(Si).
• We can apply the same algorithm.
Subpopulations Experiments
• In the Asia network domain, one doctor practices in San Francisco, another in Cincinnati.
• Subpopulations have different priors for smoking and having visited Asia, so doctors’ beliefs are biased.
• The aggregate distribution comes much closer to the original distribution.
Asia BN
[Figure: the Asia BN (Visit to Asia, Smoking, Tuberculosis, Lung Cancer, Bronchitis, Abnormality in Chest, X-Ray, Dyspnea) with an added Doctor node.]
[Plot: Subpopulations. KL divergence (0.00 to 0.18) versus M (200 to 3000) for S1, S2, OPT, and AGGR.]
Contributions
• A semantic framework for aggregating learned probabilistic models.
• A LinOP-based algorithm for aggregating learned BNs.
• Experiments showing the algorithm behaves well.
Outline
• Introduction
• Automating Information Integration
• Integrating Learned Probabilistic Information
• Conclusion and Current Work
Conclusion
• Conflict resolution is key in automated information integration.
• This is a difficult task in general.
• However, information about sources is often readily available.
• Principled use of this information can greatly enhance the ability to resolve conflicts intelligently.
Current Work
• Allow dependence between sources’ data sets in probabilistic aggregation work.
• Apply semantic framework to aggregation in other learning paradigms.
• Explore application of algorithms to database integration, RoboCup, stock market prediction, etc.
• Making committee meetings obsolete!
Multi-Agent Research Zone
• Research interests:
  – Information integration
  – Multi-agent machine learning
  – RoboCup soccer simulation league testbed
• Masters students
  – Jian Xu: Information integration in medical informatics
  – Linxin Gan: Ensemble learning in stock market prediction
CSA Graduate Program
• Masters in Computer Science
• Research areas include:
  – machine learning, KRR, and MAS
  – information retrieval, databases, and NLP
  – networking and virtual environments
  – simulation and evolutionary computation
  – software engineering and formal methods
http://unixgen.muohio.edu/~maynarp/