Automating the Committee Meeting: Intelligent Integration of Information From Diverse Sources Pedrito Maynard-Zhang Department of Computer Science & Systems Analysis


Automating the Committee Meeting: Intelligent Integration of Information From Diverse Sources

Pedrito Maynard-Zhang

Department of Computer Science & Systems Analysis

Information Integration

Information integration is ubiquitous:
• Committee meetings
• Research papers
• Information retrieval on the web
• Assessing intelligence on the battlefield
• …

Outline

• Introduction
• Automating Information Integration
  – Database Integration
  – Model Integration
  – Conflict Resolution and Meta-Information
• Integrating Learned Probabilistic Information
• Conclusion and Current Work

Multi-Disciplinary Research

• Databases (e.g., Halevy’s group at U. of Washington)
• Artificial Intelligence (e.g., Stanford’s Knowledge Systems Laboratory)
• Business (e.g., MIT-Sloan’s Aggregators Group)
• Decision Analysis (e.g., Clemen & Winkler’s work at Duke)

Database Integration

[Figure: a bioinformatics query passes through a mediation layer to the source databases. Labels: Gene-Clinics, Locus-Link, Entrez, OMIM; Genes, Proteins, Nucleotide Sequences.]

Database Integration

• Application: Querying distributed databases
• Examples:
  – Bioinformatics
  – Corporate data management
  – Question-answer systems on the web
  – Detecting bioterrorism

Model Integration

[Figure: an expert system (“if cancer then operate…”), a probabilistic model, and a mathematical model are combined into a super model.]

Model Integration

• Applications: Diagnosis and prediction
• Examples:
  – Medical diagnosis
  – NASA spacecraft design and diagnosis
  – Expert system integration
  – Combining commonsense knowledge bases

Challenges

• Efficient query processing and optimization
  – Parsing XML
• Defining expressive yet tractable mediator languages
• Handling heterogeneous source languages
  – Wrapper technology development

Challenges

• Resolving ontological differences
  – e.g., realizing that the field “Name” for one source stores the same information as “First Name” and “Last Name” for another.
• Detecting conflicts
• Resolving conflicts
  – Resolution done manually in practice
  – We can automate more!
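The “Name” versus “First Name” / “Last Name” example can be sketched as a small schema mapping. This is only an illustration: the mediated field `full_name` and the function `to_mediated` are hypothetical names, not part of the talk.

```python
def to_mediated(record):
    """Map a source record onto a hypothetical mediated schema with one full_name field."""
    if "Name" in record:
        # Source A stores the whole name in a single field.
        return {"full_name": record["Name"].strip()}
    # Source B splits the same information across two fields.
    first = record["First Name"].strip()
    last = record["Last Name"].strip()
    return {"full_name": f"{first} {last}"}


source_a = {"Name": "Ada Lovelace"}
source_b = {"First Name": "Ada", "Last Name": "Lovelace"}

# Once both records share a schema, conflicts can be detected by direct comparison.
same_person = to_mediated(source_a) == to_mediated(source_b)
```

Real sources rarely line up this cleanly (name order, middle names, titles), which is part of why resolution is still done manually in practice.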

Uninformed Integration

[Figure: asked “What’s the weather like?”, three sources answer “raining”, “sunny”, and “raining”.]

Intelligent Integration

[Figure: asked “What’s the weather like?”, three sources answer “raining”, “sunny”, and “raining”. The sources are labeled, in the same order, meteorologist, practical joker, and own eyes.]
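The difference between uninformed and intelligent integration can be sketched as a plain majority vote versus a credibility-weighted vote. This is a minimal illustration rather than the talk's method, and the credibility values are hypothetical numbers chosen for the example.

```python
from collections import Counter, defaultdict


def majority_vote(reports):
    """Uninformed integration: every source counts equally."""
    return Counter(answer for _, answer in reports).most_common(1)[0][0]


def weighted_vote(reports, credibility):
    """Intelligent integration: weight each answer by its source's credibility."""
    scores = defaultdict(float)
    for source, answer in reports:
        scores[answer] += credibility[source]
    total = sum(scores.values())
    return {answer: score / total for answer, score in scores.items()}


reports = [
    ("meteorologist", "raining"),
    ("practical joker", "sunny"),
    ("own eyes", "raining"),
]
# Hypothetical credibility weights (assumption, not from the slides).
credibility = {"meteorologist": 0.8, "practical joker": 0.1, "own eyes": 0.95}

print(majority_vote(reports))
print(weighted_vote(reports, credibility))
```

Both votes pick “raining” here, but the weighted vote also says how little support “sunny” has once the joker's low credibility is taken into account.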

Types of Meta-Information

• Credibility, experience, political clout
• Areas of expertise
• How source acquired information:
  – Source’s sources
  – Processes source used to accumulate information
• Structure of the data representation

Outline

• Introduction
• Automating Information Integration
• Integrating Learned Probabilistic Information
  – Medical Scenario
  – Semantic Framework
  – LinOP-Based Aggregation
  – Aggregating Bayesian Networks
  – Experimental Validation
• Conclusion and Current Work

Medical Expert System Scenario

[Figure: three sources with 3, 20, and 10 years of experience feed into an expert system.]

Source Meta-Information

• Doctors learned probabilistic models from patient data using some known standard learning algorithm.

• We know the relative amount of experience doctors have had (i.e., years of practice).

Popular Aggregation Approaches

• Intuition approach: Take simple weighted averages, etc. → unexpected behavior
• Axiomatic approach: Find aggregation algorithm satisfying certain “obvious” properties → impossibility results
• Problem: Not semantically grounded

Aggregation Semantics

[Figure: M samples are generated from the true distribution p. A learning algorithm applied to all M samples would yield the optimal distribution p*, which is not available in practice. Instead, learning algorithms produce the sources’ distributions p1, p2, …, pL, and an aggregation algorithm combines them into the aggregate distribution p̂.]

Linear Opinion Pool (LinOP)

• LinOP: Weighted sum of joint distributions.
• Precisely, for joint distributions $p_i$ and joint variable instantiation $w$,

  $\mathrm{LinOP}(p_1, p_2, \ldots, p_L)(w) = \sum_i \alpha_i \, p_i(w)$.

  $\alpha_i$ weights: relative experience.
• Satisfies unanimity, non-dictatorship, and marginalization.
• Doesn’t preserve shared independences.
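A small sketch can make both the definition and the last bullet concrete. Here two sources each believe two binary variables X and Y are independent, yet their equally weighted pool does not. The distributions, the equal weights, and the helper names are illustrative assumptions.

```python
from itertools import product


def linop(dists, weights):
    """Linear opinion pool: weighted sum of joint distributions over the same outcomes."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return {w: sum(a * p[w] for a, p in zip(weights, dists)) for w in dists[0]}


def independent_joint(px, py):
    """Joint over binary X, Y in which X and Y are independent."""
    return {
        (x, y): (px if x else 1 - px) * (py if y else 1 - py)
        for x, y in product([True, False], repeat=2)
    }


def is_independent(joint, tol=1e-9):
    """For a 2x2 joint, checking one cell against the product of marginals suffices."""
    p_x = joint[(True, True)] + joint[(True, False)]
    p_y = joint[(True, True)] + joint[(False, True)]
    return abs(joint[(True, True)] - p_x * p_y) < tol


p1 = independent_joint(0.9, 0.9)   # source 1: X and Y both likely
p2 = independent_joint(0.1, 0.1)   # source 2: X and Y both unlikely
pooled = linop([p1, p2], [0.5, 0.5])

print(is_independent(p1), is_independent(p2), is_independent(pooled))
```

In the pool, P(X, Y) is 0.41 while P(X)·P(Y) is 0.25, so the independence both sources agreed on is lost.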

LinOP and Joint Learning

If
  – sources learn joint distributions using maximum likelihood or MAP learning and
  – the same learning framework would be used on the combined data set to learn p*
then
  $p^* \approx \mathrm{LinOP}(p_1, p_2, \ldots, p_L)$.
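For the maximum likelihood case the relationship can be checked numerically. When each source estimates an unrestricted joint distribution by relative frequencies and the weights are proportional to the sources' sample sizes, pooling the estimates gives the same result as learning from the combined data. The data sets below are made up for illustration, and the MAP case is not covered by this sketch.

```python
from collections import Counter


def mle(samples, outcomes):
    """Maximum likelihood estimate of a discrete distribution: relative frequencies."""
    counts = Counter(samples)
    return {w: counts[w] / len(samples) for w in outcomes}


outcomes = ["a", "b", "c"]
data1 = ["a"] * 6 + ["b"] * 3 + ["c"] * 1      # source 1 saw 10 samples
data2 = ["a"] * 5 + ["b"] * 5 + ["c"] * 20     # source 2 saw 30 samples

p1 = mle(data1, outcomes)
p2 = mle(data2, outcomes)

# Weights proportional to each source's share of the combined data.
m = len(data1) + len(data2)
weights = [len(data1) / m, len(data2) / m]
pooled = {w: weights[0] * p1[w] + weights[1] * p2[w] for w in outcomes}

# What the same learner would produce given the combined data set.
p_star = mle(data1 + data2, outcomes)

max_gap = max(abs(pooled[w] - p_star[w]) for w in outcomes)
print(max_gap)
```

The gap is zero up to floating-point rounding, which is why relative experience is a natural choice of weight.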

Bayesian Network (BN)

• Summary: Compact, graphical representation of a probability distribution.

• Definition: Directed acyclic graph (DAG) over nodes (random variables); each node has a local conditional probability distribution (CPD) associated with it.

• Exploits causal structure in the domain.

Alarm BN

[Figure: Burglary and Earthquake are the parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls.]

P(B) = .001
P(E) = .002

B  E  P(A)
+  +  .95
+  -  .94
-  +  .29
-  -  .001

A  P(M)
+  .70
-  .01

A  P(J)
+  .90
-  .05
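The tables above define a full joint distribution: the probability of any complete assignment is the product of one entry from each CPD. The sketch below encodes the tables directly; the variable and function names are illustrative.

```python
from itertools import product

P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=+ | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J=+ | A)
P_M = {True: 0.70, False: 0.01}                      # P(M=+ | A)


def bern(p_true, value):
    """Probability that a binary variable takes `value`, given P(value is True)."""
    return p_true if value else 1.0 - p_true


def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) as the product of the five local CPD entries."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))


# Both neighbors call and the alarm rings, but there is no burglary or earthquake.
p_false_alarm = joint(False, False, True, True, True)

# Sanity check: the 32 joint entries sum to 1.
total = sum(joint(*values) for values in product([True, False], repeat=5))
print(p_false_alarm, total)
```

Ten numbers in the CPDs stand in for the 31 free parameters of the unrestricted joint, which is the compactness the BN representation buys.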

BN Advantages

• Compact representation and graph encodes conditional independences.
• Elicitation easy in practice.
• Inference efficient in practice.
• Can be learned from data.
• Deployed successfully: medical diagnosis, Microsoft Office, NASA Mission Control, and more.

BN Learning

• Idea: Select BN most likely to have generated data.
• Standard algorithm:
  – Search over structures by adding, deleting, and reversing edges.
  – Parameterize and score structures using statistics from the data.
  – Penalize complex structures.
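The parameterize, score, and penalize steps can be sketched for binary variables. The slides do not name a scoring function, so the BIC-style score used here is an assumption, and the edge search itself is omitted.

```python
import math
from collections import Counter


def bic_score(data, parents):
    """BIC-style score of a structure (assumed scoring function, binary variables).

    data: list of dicts mapping variable name -> 0/1
    parents: dict mapping variable name -> tuple of parent names
    """
    n = len(data)
    score = 0.0
    for var, pa in parents.items():
        joint_counts = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        parent_counts = Counter(tuple(row[p] for p in pa) for row in data)
        # Log-likelihood under maximum likelihood parameters (count ratios).
        for (pa_val, _), c in joint_counts.items():
            score += c * math.log(c / parent_counts[pa_val])
        # Penalty: one free parameter per parent configuration.
        score -= 0.5 * math.log(n) * (2 ** len(pa))
    return score


def make_rows(n00, n01, n10, n11):
    """Build a data set with the given counts of (X, Y) value pairs."""
    rows = []
    for (x, y), n in {(0, 0): n00, (0, 1): n01, (1, 0): n10, (1, 1): n11}.items():
        rows += [{"X": x, "Y": y}] * n
    return rows


no_edge = {"X": (), "Y": ()}
with_edge = {"X": (), "Y": ("X",)}

dependent = make_rows(45, 5, 5, 45)      # Y mostly copies X
independent = make_rows(25, 25, 25, 25)  # Y unrelated to X

print(bic_score(dependent, with_edge) > bic_score(dependent, no_edge))
print(bic_score(independent, with_edge) > bic_score(independent, no_edge))
```

On the dependent data the extra edge pays for itself; on the independent data the penalty makes the simpler structure win. A search procedure would apply this comparison after each candidate edge change.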

Aggregating BNs

• Each source i learns BN pi.
• p* is the BN we would learn from the combined data set.
• We want to approximate p* as closely as possible by aggregating p1, …, pL.
• Source information: estimates for the relative experience of the sources and the total amount of data seen (M).

AGGR: BN Aggregation Algorithm

• Idea: Use BN learning algorithm.
• Problem: We don’t have data.
• Key observation: We can use LinOP to approximate the statistics needed for the parameterization and scoring steps!
• Also, we can use LinOP properties to make algorithm reasonably efficient.
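The key observation can be sketched as follows: wherever a learner would count how often a configuration occurs in the data, substitute M times the LinOP probability of that configuration. This is an illustrative reading of the slide, not the AGGR implementation; the function names are hypothetical, and for brevity the sources are given as explicit joint tables rather than BNs.

```python
def linop_marginal(dists, weights, variables, query):
    """LinOP probability that the variables in `query` take the given values.

    dists: list of dicts mapping full assignments (tuples ordered like
    `variables`) to probabilities. Uses the fact that a marginal of the
    LinOP equals the LinOP of the sources' marginals.
    """
    index = {v: i for i, v in enumerate(variables)}
    total = 0.0
    for alpha, p in zip(weights, dists):
        total += alpha * sum(
            prob for assignment, prob in p.items()
            if all(assignment[index[v]] == val for v, val in query.items())
        )
    return total


def expected_count(dists, weights, variables, query, m):
    """Approximate count of `query` in the unseen combined data set of size m."""
    return m * linop_marginal(dists, weights, variables, query)


def cpd_entry(dists, weights, variables, child, child_val, parent_vals):
    """Approximate P(child = child_val | parents = parent_vals) from LinOP statistics."""
    joint = linop_marginal(dists, weights, variables, {**parent_vals, child: child_val})
    parent = linop_marginal(dists, weights, variables, parent_vals)
    return joint / parent


variables = ("X", "Y")
p1 = {(1, 1): 0.81, (1, 0): 0.09, (0, 1): 0.09, (0, 0): 0.01}
p2 = {(1, 1): 0.01, (1, 0): 0.09, (0, 1): 0.09, (0, 0): 0.81}
weights = [0.5, 0.5]   # hypothetical: equal relative experience

count_11 = expected_count(dists=[p1, p2], weights=weights, variables=variables,
                          query={"X": 1, "Y": 1}, m=1000)
p_y_given_x = cpd_entry([p1, p2], weights, variables, "Y", 1, {"X": 1})
print(count_11, p_y_given_x)
```

M cancels in the CPD ratio but not in the expected counts, so the estimate of M would matter for any count-based scoring step. In a setting where the sources are BNs, each source's marginal would come from inference in its own network rather than from a full joint table.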

Asia BN

[Figure: Asia BN with nodes Visit to Asia, Smoking, Tuberculosis, Lung Cancer, Bronchitis, Abnormality in Chest, X-Ray, and Dyspnea.]

Experimental Setup

• Generate data for sources from well-known ASIA BN which relates smoking, visiting Asia, and lung cancer.

• Compare our algorithm AGGR against the optimal algorithm OPT that has access to the combined data set.

• Accuracy measure: KL divergence from generating distribution.
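The accuracy measure above can be written in a few lines for discrete distributions stored as dictionaries. The direction of the divergence, KL(generating ‖ learned), is an assumption; the slide does not spell it out.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; p is the generating distribution, q the learned one.

    Assumes q[w] > 0 wherever p[w] > 0.
    """
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)


generating = {"a": 0.5, "b": 0.5}
learned = {"a": 0.25, "b": 0.75}

print(kl_divergence(generating, generating))
print(kl_divergence(generating, learned))
```

The divergence is zero only when the learned distribution matches the generating one, so lower curves in the plots mean better approximations.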

Sensitivity to M Experiments

• Sensitivity to M
  – Size of the combined data set M varies.
  – AGGR’s estimate of M is accurate.
• Sensitivity to Estimate of M
  – Size of the combined data set M is fixed.
  – AGGR’s estimate of M varies.

Sensitivity to M

[Figure: KL divergence (axis 0.000 to 0.100) versus M (200 to 3000) for S1, S2, OPT, and AGGR.]

Sensitivity to Estimate of M

[Figure: KL divergence (axis 0.0000 to 0.0016) versus M'/M on a log10 scale (-1.00 to 1.00) for S1, S2, OPT, and AGGR, with M = 10k.]

Subpopulations

• Each source’s data may come from a different subpopulation P(D|Si), where D is the data.

• We want to learn P(D).

• P(D) = LinOP(P(D|S1), P(D|S2), …, P(D|SL)) with sources’ weights based on P(Si).

• We can apply the same algorithm.
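The identity in the third bullet is the law of total probability over which source a sample comes from. Written out, with $\alpha_i$ denoting the weight of source $i$ (the numbers in the worked line are hypothetical):

```latex
% Law of total probability over the source S_i that a sample is drawn from.
P(D) \;=\; \sum_{i=1}^{L} P(S_i)\, P(D \mid S_i)
     \;=\; \mathrm{LinOP}\bigl(P(D \mid S_1), \ldots, P(D \mid S_L)\bigr),
\qquad \alpha_i = P(S_i), \quad \sum_{i=1}^{L} \alpha_i = 1 .

% Hypothetical two-source example for a single event "smoker":
% P(S_1) = P(S_2) = 0.5, \; P(\text{smoker} \mid S_1) = 0.2, \; P(\text{smoker} \mid S_2) = 0.4
P(\text{smoker}) \;=\; 0.5 \cdot 0.2 + 0.5 \cdot 0.4 \;=\; 0.3 .
```

Because the pooled form is exactly a LinOP, the same aggregation machinery applies once the weights are read as subpopulation proportions.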

Subpopulations Experiments

• In the Asia network domain, one doctor practices in San Francisco, another in Cincinnati.
• Subpopulations have different priors for smoking and having visited Asia, so doctors’ beliefs are biased.
• The aggregate distribution comes much closer to the original distribution.

Asia BN

[Figure: Asia BN with an added Doctor node. Nodes: Visit to Asia, Smoking, Tuberculosis, Lung Cancer, Bronchitis, Abnormality in Chest, X-Ray, Dyspnea, and Doctor.]

Subpopulations

[Figure: KL divergence (axis 0.00 to 0.18) versus M (200 to 3000) for S1, S2, OPT, and AGGR.]

Contributions

• A semantic framework for aggregating learned probabilistic models.

• A LinOP-based algorithm for aggregating learned BNs.

• Experiments showing algorithm behaves well.

Outline

• Introduction
• Automating Information Integration
• Integrating Learned Probabilistic Information
• Conclusion and Current Work

Conclusion

• Conflict resolution is key in automated information integration.
• This is a difficult task in general.
• However, information about sources is often readily available.
• Principled use of this information can greatly enhance the ability to resolve conflicts intelligently.

Current Work

• Allow dependence between sources’ data sets in probabilistic aggregation work.

• Apply semantic framework to aggregation in other learning paradigms.

• Explore application of algorithms to database integration, RoboCup, stock market prediction, etc.

• Making committee meetings obsolete!

Multi-Agent Research Zone

• Research interests:
  – Information integration
  – Multi-agent machine learning
  – RoboCup soccer simulation league testbed
• Master’s students:
  – Jian Xu: Information integration in medical informatics
  – Linxin Gan: Ensemble learning in stock market prediction

CSA Graduate Program

• Master’s in Computer Science
• Research areas include:
  – machine learning, KRR, and MAS
  – information retrieval, databases, and NLP
  – networking and virtual environments
  – simulation and evolutionary computation
  – software engineering and formal methods

http://unixgen.muohio.edu/~maynarp/