on matching large life science ontologies in parallel2 matching large life science... · • needed...

Post on 01-Jan-2020

1 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

ON MATCHING LARGE LIFE SCIENCE

ONTOLOGIES IN PARALLEL

Anika Groß, Michael Hartung, Toralf Kirsten, Erhard Rahm

Database Group Leipzig

http://dbs.uni-leipzig.de

Interdisciplinary Centre for Bioinformatics

http://www.izbi.uni-leipzig.de

Göteborg, 25th August 2010

ONTOLOGY MAPPINGS

• Ontologies are often highly related

• Identify overlapping information

• Crucial for data integration, enhanced dataanalysis across ontologies, ontology merging, …

On Matching Large Life Science Ontologies in Parallel 2 / 20

GALEN MouseAnatomy

NCIThesaurus

SNOMEDCT

FMA

NCI Thesaurus(Anatomic Structure..)

Vertebra

Sacral_Vertebra

Sacrum

S2_Vertebra S1_Vertebra…

……

Adult Mouse Anatomical Dictionary

vertebra

sacral vertebra

pelvis bone

sacral vertebra 2sacral vertebra 1 …

… …

• Manual creation is complex or even infeasible

• (Semi-) automatic determination of similarities between ontology elements to find correspondences

• Ontology Mapping = set of semantic correspondences between concepts of different ontologies

?

?

??

Preprocessing

• Use metadata (element-, structure-level), instance data

• High quality mappings

• Combined execution of several matchers in more complex match workflows

On Matching Large Life Science Ontologies in Parallel 3 / 20

• Time consuming and memory intensive task

ONTOLOGY MATCHING

… …

O1

O2

… …

Combination SelectionMatchResult

Mapping

M2

M1

M3

Matcher execution

→ Long execution times, high memory requirements

→ Reasonable to improve efficiency

EXAMPLE – MATCH SUB-ONTOLOGIES OF GENE ONTOLOGY

On Matching Large Life Science Ontologies in Parallel 4 / 20

Molecular Function

Biological Process

2*104

104

Evaluation ofCartesian product

2·108 comparisons

Multiplied by number of different

matchers

6·108 comparisons

HOW TO DEAL WITH PERFORMANCE AND MEMORY ISSUES?

On Matching Large Life Science Ontologies in Parallel 5 / 20

→ Further methods

• Reduction of search space: Avoid evaluation of Cartesian product using e.g. early pruning, cluster- or fragment-based methods

• Reuse of match results: Save recomputation (ontology evolution)

• …

→ Combination of approaches

→ Parallelization

• Parallel execution of ontology matching on multiple compute nodes

• Use the broad availability of multi-core systems

• Reduce execution time, memory requirements

CONTRIBUTIONS

• Two strategies for parallel ontology matching• Inter-matcher parallelization: execute independent matchers in parallel

• Intra-matcher parallelization: internal parallelization of matchers based on partitioning of the ontologies

• Parallelization of different kinds of matchers• Element-level, structure-level, instance-based matchers

• Implementation and evaluation• Distributed infrastructure for parallel ontology matching

• Evaluation of the approaches for matching large life science ontologies

On Matching Large Life Science Ontologies in Parallel 6 / 20

OVERVIEW

• Motivation

• Parallelization Strategies

• Inter-matcher Parallelization

• Intra-matcher Parallelization

• Infrastructure

• Evaluation

• Conclusion & Future Work

On Matching Large Life Science Ontologies in Parallel 7 / 20

INTER-MATCHER PARALLELIZATION

• Parallel execution of independently executable matchers

• Process matchers on different cores or computing nodes

On Matching Large Life Science Ontologies in Parallel 8 / 20

+ Improve execution time up to factor n (n=|matchers|)

– Limited degree of parallelization (|independently executable matchers|)

– Slowest matcher limits the achievable speedup

– Memory requirements are not reduced since matchers evaluate complete ontologies

...

M1

O2

O1

Mn

...MatchResult

M1

O2

O1

M3

...MatchResult

M2

INTRA-MATCHER PARALLELIZATION

• Internal decomposition of individual matchers or matcher parts

• Partitioning the input data (ontologies)

• Many small match tasks can be executed in parallel

On Matching Large Life Science Ontologies in Parallel 9 / 20

...

...Ontology Partitioning

M11

M1k

O2

O1

...

Ontology Partitioning

Match Task Generation M1

Match Task Generation Mn

MatchResult

...

Mn1

Mnk

...

+ Match tasks of limited complexity → reduced memory and processing requirements

+ Applicable to sequential and independently executable matchers

• Each match task matches one O1 against one O2 partition

SIZE-BASED PARTITIONING

• Enable parallel matching of Cartesian product• Partition input ontologies into partitions of equal size (|concepts|)

On Matching Large Life Science Ontologies in Parallel 10 / 20

...

p10

p1

...

p1

p10

+ Scalable to large ontologies+ Good load balancing + Optimizes performance without sacrificing match quality+ Applicable to element-level, structure-level and instance-based matchers

… … … …

|O1|= 10,000concepts

Generate 100match tasks

|O2|= 10,000concepts

Acc: c1

Name: c1.name

Acc: c3

Name: c3.name

Acc: d1

Name: d1.name

Acc: d2

Name: d2.nameAcc: d3

Name: d3.name

Acc: d4

Name: d4.name

Acc: d5

Name: d5.nameAcc: c2

Name: c2.name

PARALLELIZATION OF ELEMENT-LEVEL MATCHERS

• Needed information is directly associated with concepts themselves

On Matching Large Life Science Ontologies in Parallel 11 / 20

→ Match-information is retained

• Attribute values like names, synonyms …

0.70.90.9

0.7

Acc: c1

Name: c1.name

Acc: c3

Name: c3.name

Acc: d1

Name: d1.name

Acc: d2

Name: d2.nameAcc: d3

Name: d3.name

Acc: d4

Name: d4.name

Acc: d5

Name: d5.nameAcc: c2

Name: c2.name

c∈O1, d∈O2

→ Easy partitioning of ontologies into subsets of concepts for element-level matchers

• Needed information is not provided by concepts themselves

• Use information from the neighborhood of concepts, e.g. children, siblings, namePath, …

PARALLELIZATION OF STRUCTURE-LEVEL MATCHERS

On Matching Large Life Science Ontologies in Parallel 12 / 20

Approach

• Extend the concept-level information within special multi-valued context attributes, e.g. Child, NamePath, …

• Preprocessing step of linear effort

→ Size-based partitioning as for element-level matching

• Note: parallelization of similarity propagation algorithms is more difficult (need the whole structural information)

→ Only parallelization of the initial matcher

• Average element similarity between children of concepts

→Compare all their Child attributes

Acc: c1

Name: c1.name

Acc: c3

Name: c3.name

Acc: d1

Name: d1.name

Acc: d2

Name: d2.nameAcc: d3

Name: d3.name

Acc: d4

Name: d4.name

Acc: d5

Name: d5.nameAcc: c2

Name: c2.name

EXAMPLE: CHILD MATCHER

• Use a multi-valued Child context attribute: get name values of child concepts in a preprocessing step

On Matching Large Life Science Ontologies in Parallel 13 / 20

corr(c,d) simChildren(c,d)

c1 – d1 (0.9 + 0.7 + 0.7 + 0.9) / (2·2) = 0.8

c1 – d4 (0.7 + 0 + 0 + 0.9) / (2·2) = 0.4

• Similarly applicable for other local context matchers, e.g. namePath

0.8 0.4

Acc: c1

Name: c1.nameChild: c2.nameChild: c3.name

Acc: c3

Name: c3.nameChild: …

Acc: d1

Name: d1.nameChild: d2.nameChild: d3.name

Acc: d2

Name: d2.nameChild: …

Acc: d3

Name: d3.nameChild: …

Acc: d4

Name: d4.nameChild: d3.nameChild: d5.name

Acc: d5

Name: d5.nameChild: …

Acc: c2

Name: c2.nameChild: …

0.7 0.90.90.7

INFRASTRUCTURE INTRA-MATCHER PARALLELIZATION EXAMPLE

On Matching Large Life Science Ontologies in Parallel 14 / 20

CPU1

CPU2

CPU1

CPU2

Data Service

… …… …

Workflow Service

...

c1c2

ck-1ck

...

c1c2

cl-1cl

Concepts with context-attributes

...

pn

p1

...

p1

pm

Partitions (sets of concepts)

Match Service 1

Match Service 2

j1j1

j2

j3

j4

js

j2

j3

j4

js

Job queue

......

INFRASTRUCTURE – INTRA-MATCHER PARALLELIZATION EXAMPLE

On Matching Large Life Science Ontologies in Parallel 15 / 20

CPU1

CPU2

CPU1

CPU2

Data Service

Workflow Service

CPU1...

...

...

...

...

...

...

Job executionUnify Partial Match Results

Match Service 1

Match Service 2

Further aggregation,selection, …

in the match workflow...

...

...

...

• Two match problems to evaluate efficiency (execution times)

EVALUATION

On Matching Large Life Science Ontologies in Parallel 16 / 20

Adult Mouse Anatomical Dictionary

NCI Thesaurus Anatomic Structure or Substance

Biological Processes

Molecular Functions

~2,700 · ~3,300 = 9 · 106 comparisons

~9,400 · ~17,100 = 1.6 · 108 comparisons

• Matchers (element and structure-level)

• NameSynonym (NS) Max TriGram similarity for the name and multi-valued synonym attr.

• Children (CH) Avg TriGram similarity on the context attributes Child

• NamePath (NP) Avg TriGram similarity on the context attributes NamePath

• Computing environmentFour nodes consisting of four cores (up to 16 cores)

Medium

Scale

Large

Scale

• 1 node, 4 cores, 8 threads

INTRA-MATCHER PARALLELIZATION OF INDIVIDUAL MATCHERS

On Matching Large Life Science Ontologies in Parallel 17 / 20

• ↑↑↑↑ degree of parallelization → ↓↓↓↓ execution times

• Almost linear speedup for up to 4 threads

6h15

• NP - most expensive, long concatenated names, multiple inheritance

• NS - many synonyms per concept in GO

1h15

NP NS CH speedup NP speedup NS speedup CH

Medium Scale Large Scale

PARALLELIZATION STRATEGIES

• Parallelization strategies benefit from multiple threads/cores

• Combined approach slightly better, because execution delays between matchers are avoided

On Matching Large Life Science Ontologies in Parallel 18 / 20

• 4 nodes, 16 cores; matcher combination of NP, CH, NS

11h

50min

Intra-matcher parallelization speedup Intra

Intra- & Inter-matcher parallelization speedup Intra&Inter

→ Intra and the combined approach are very effective and thus especially valuable for parallel ontology matching

Speedup12.5

CONCLUSIONS AND FUTURE WORK

• General, flexible strategies for parallel ontology matching on multiple compute nodes to improve efficiency (inter- and intra-matcher parallelization)

• Size-based partitioning enabled good load balancing, scalable, reduced memory consumptions, quality not affected

• Parallelization of element-level, structure-level, instance-based matchers using multi-valued context attributes

• Implemented a distributed infrastructure, evaluation for large life science ontology match problems

On Matching Large Life Science Ontologies in Parallel 19 / 20

• Investigate parallel ontology matching for additional matchers, evaluate effectiveness

• Combine parallelization with advanced fragmentation strategies

THANK YOU FOR YOUR ATTENTION

On Matching Large Life Science Ontologies in Parallel 20 / 20

top related