automating the committee meeting: intelligent integration of information from diverse sources...

Automating the Committee Meeting:Intelligent Integration of

Information From Diverse Sources

Pedrito Maynard-Zhang

Department of Computer Science & Systems Analysis

Information Integration

Information integration is ubiquitous:• Committee meetings• Research papers• Information retrieval on the web• Assessing intelligence on the

battlefield• …

Information Integration

Outline

• Introduction• Automating Information Integration

– Database Integration– Model Integration– Conflict Resolution and Meta-

Information• Integrating Learned Probabilistic

Information• Conclusion and Current Work

Multi-Disciplinary Research• Databases (e.g., Halevy’s group at

U. of Washington)• Artificial Intelligence (e.g.,

Stanford’s Knowledge Systems Laboratory)

• Business (e.g., MIT-Sloan’s Aggregators Group)

• Decision Analysis (e.g., Clemen & Winkler’s work at Duke)

Database Integration

Gene-Clinics

Locus-Link

EntrezOMIM

Genes Proteins NucleotideSequences

Mediation Layer

Source Databases

bioinformatics query

Database Integration

• Application: Querying distributed databases

• Examples– Bioinformatics– Corporate data management– Question-answer systems on the web– Detecting bioterrorism

Model Integration

if cancer then operate…

cdi = CIRDE BML…

expert system probabilistic model mathematical model

super model

Model Integration

• Applications: Diagnosis and prediction

• Examples:– Medical diagnosis– NASA spacecraft design and diagnosis– Expert system integration– Combining commonsense knowledge

bases

Challenges

• Efficient query processing and optimization– Parsing XML

• Defining expressive yet tractable mediator languages

• Handling heterogeneous source languages– Wrapper technology development

Challenges

• Resolving ontological differences– e.g., realizing that the field “Name”

for one source stores the same information as “First Name” and “Last Name” for another.

• Detecting conflicts• Resolving conflicts

– Resolution done manually in practice– We can automate more!

Uninformed Integration

raining sunny raining

What’s the weather like?

Intelligent Integration

raining sunny raining

meteorologist practical joker own eyes

What’s the weather like?

Types of Meta-Information

• Credibility, experience, political clout• Areas of expertise• How source acquired information:

– Source’s sources– Processes source used to accumulate

information

• Structure of the data representation

Outline

• Introduction• Automating Information Integration• Integrating Learned Probabilistic

Information– Medical Scenario– Semantic Framework– LinOP-Based Aggregation– Aggregating Bayesian Networks– Experimental Validation

• Conclusion and Current Work

Medical Expert System Scenario

3 years experience

20 years experience

10 years experience

Expert system

Source Meta-Information

• Doctors learned probabilistic models from patient data using some known standard learning algorithm.

• We know the relative amount of experience doctors have had (i.e., years of practice).

Popular Aggregation Approaches

• Intuition approach: Take simple weighted averages, etc. unexpected behavior

• Axiomatic approach: Find aggregation algorithm satisfying certain “obvious” properties impossibility results

• Problem: Not semantically grounded

Aggregation Semantics

M samples generated from the true distribution p

learningalgorithm

optimaldistribution

p*

learningalgorithm

learningalgorithm

learningalgorithm

p1

p2

pL

aggregationalgorithm

aggregatedistribution

p

not available in practice

^

……

…

Linear Opinion Pool (LinOP)• LinOP: Weighted sum of joint distributions.

• Precisely, for joint distributions pi and joint variable instantiation w,

LinOP(p1, p2, …, pL)(w) = i ipi(w).

i weights: relative experience.

• Satisfies unanimity, non-dictatorship, and marginalization.

• Doesn’t preserve shared independences.

LinOP and Joint Learning

If– sources learn joint distributions using

maximum likelihood or MAP learning and– the same learning framework would be

used on the combined data set to learn p*

thenp* LinOP(p1, p2, …, pL).

Bayesian Network (BN)

• Summary: Compact, graphical representation of a probability distribution.

• Definition: Directed acyclic graph (DAG) over nodes (random variables); each node has a local conditional probability distribution (CPD) associated with it.

• Exploits causal structure in the domain.

Alarm BN

Burglary Earthquake

JohnCalls

Alarm

MaryCalls

P(B)

.001

P(E)

.002

B E P(A)

+ + .95

+ - .94

- + .29

- - .001A P(M)

+ .70

- .01

A P(J)

+ .90

- .05

BN Advantages

• Compact representation and graph encodes conditional independences.

• Elicitation easy in practice.• Inference efficient in practice.• Can be learned from data.• Deployed successfully – medical

diagnosis, Microsoft Office, NASA Mission Control, and more.

BN Learning

• Idea: Select BN most likely to have generated data.

• Standard algorithm:– Search over structures by adding,

deleting, and reversing edges.– Parameterize and score structures

using statistics from the data.– Penalize complex structures.

Aggregating BNs

• Each source i learns BN pi.• p* is the BN we would learn from the

combined data set.• We want to approximate p* as closely

as possible by aggregating p1, …, pL.• Source information: estimates for

the relative experience of the sources and the total amount of data seen (M).

AGGR: BN Aggregation Algorithm

• Idea: Use BN learning algorithm.• Problem: We don’t have data.• Key observation: We can use

LinOP to approximate the statistics needed for the parameterization and scoring steps!

• Also, we can use LinOP properties to make algorithm reasonably efficient.

Asia BNVisit to

AsiaSmoking

Lung CancerTuberculosis

Abnormalityin Chest

Bronchitis

X-Ray Dyspnea

Experimental Setup

• Generate data for sources from well-known ASIA BN which relates smoking, visiting Asia, and lung cancer.

• Compare our algorithm AGGR against the optimal algorithm OPT that has access to the combined data set.

• Accuracy measure: KL divergence from generating distribution.

Sensitivity to M Experiments

• Sensitivity to M– Size of the combined data set M

varies.– AGGR’s estimate of M is accurate.

• Sensitivity to Estimate of M– Size of the combined data set M is

fixed.– AGGR’s estimate of M varies.

Sensitivity to M

0.0000.0100.0200.0300.0400.0500.0600.0700.0800.0900.100

200 600 1000 1400 1800 2200 2600 3000

M

KL

Div

erg

ence

S1S2OPTAGGR

Sensitivity to Estimate of M

0.0000

0.0002

0.0004

0.0006

0.0008

0.0010

0.0012

0.0014

0.0016

-1.00 -0.75 -0.50 -0.25 0.00 0.25 0.50 0.75 1.00

M'/M (log10)

KL D

iverg

en

ce

S1

S2

OPT

AGGR

M=10k

Subpopulations

• Each source’s data may come from a different subpopulation P(D|Si), where D is the data.

• We want to learn P(D).

• P(D) = LinOP(P(D|S1), P(D|S2), …, P(D|SL)) with sources’ weights based on P(Si).

• We can apply the same algorithm.

Subpopulations Experiments• In the Asia network domain, one

doctor practices in San Francisco, another in Cincinnati.

• Subpopulations have different priors for smoking and having visited Asia, so doctors’ beliefs are biased.

• The aggregate distribution comes much closer to the original distribution.

Asia BNVisit to

AsiaSmoking

Lung CancerTuberculosis

Abnormalityin Chest

Bronchitis

X-Ray Dyspnea

Doctor

0.000.020.040.060.080.100.120.140.160.18

200 600 1000 1400 1800 2200 2600 3000

M

KL

Div

erg

ence

S1

S2

OPT

AGGR

Subpopulations

Contributions

• A semantic framework for aggregating learned probabilistic models.

• A LinOP-based algorithm for aggregating learned BNs.

• Experiments showing algorithm behaves well.

Outline

• Introduction• Automating Information Integration• Integrating Learned Probabilistic

Information• Conclusion and Current Work

Conclusion

• Conflict resolution is key in automated information integration.

• This is a difficult task in general.• However, information about

sources is often readily available.• Principled use of this information

can greatly enhance the ability to resolve conflicts intelligently.

Current Work

• Allow dependence between sources’ data sets in probabilistic aggregation work.

• Apply semantic framework to aggregation in other learning paradigms.

• Explore application of algorithms to database integration, RoboCup, stock market prediction, etc.

• Making committee meetings obsolete!

Multi-Agent Research Zone• Research interests:

– Information integration– Multi-agent machine learning– RoboCup soccer simulation league testbed

• Masters students– Jian Xu: Information integration in medical

informatics– Linxin Gan: Ensemble learning in stock

market prediction

CSA Graduate Program

• Masters in Computer Science• Research areas include:

– machine learning, KRR, and MAS– information retrieval, databases, and NLP– networking and virtual environments– simulation and evolutionary computation– software engineering and formal methods

http://unixgen.muohio.edu/~maynarp/[email protected]

automating the committee meeting: intelligent integration of information from diverse sources...

Documents

years of practice

source stores

probabilistic models

aggregation algorithm

patient data

webassessing intelligence

experience doctors

clemen winklers work