on matching large life science ontologies in parallel2 matching large life science... · • needed...

ON MATCHING LARGE LIFE SCIENCE

ONTOLOGIES IN PARALLEL

Anika Groß, Michael Hartung, Toralf Kirsten, Erhard Rahm

Database Group Leipzig

http://dbs.uni-leipzig.de

Interdisciplinary Centre for Bioinformatics

http://www.izbi.uni-leipzig.de

Göteborg, 25th August 2010

ONTOLOGY MAPPINGS

• Ontologies are often highly related

• Identify overlapping information

• Crucial for data integration, enhanced dataanalysis across ontologies, ontology merging, …

On Matching Large Life Science Ontologies in Parallel 2 / 20

GALEN MouseAnatomy

NCIThesaurus

SNOMEDCT

NCI Thesaurus(Anatomic Structure..)

Vertebra

Sacral_Vertebra

Sacrum

S2_Vertebra S1_Vertebra…

……

Adult Mouse Anatomical Dictionary

vertebra

sacral vertebra

pelvis bone

sacral vertebra 2sacral vertebra 1 …

… …

• Manual creation is complex or even infeasible

• (Semi-) automatic determination of similarities between ontology elements to find correspondences

• Ontology Mapping = set of semantic correspondences between concepts of different ontologies

Preprocessing

• Use metadata (element-, structure-level), instance data

• High quality mappings

• Combined execution of several matchers in more complex match workflows

• Time consuming and memory intensive task

ONTOLOGY MATCHING

… …

Combination SelectionMatchResult

Mapping

Matcher execution

→ Long execution times, high memory requirements

→ Reasonable to improve efficiency

EXAMPLE – MATCH SUB-ONTOLOGIES OF GENE ONTOLOGY

Molecular Function

Biological Process

Evaluation ofCartesian product

2·108 comparisons

Multiplied by number of different

matchers

6·108 comparisons

HOW TO DEAL WITH PERFORMANCE AND MEMORY ISSUES?

→ Further methods

• Reduction of search space: Avoid evaluation of Cartesian product using e.g. early pruning, cluster- or fragment-based methods

• Reuse of match results: Save recomputation (ontology evolution)

• …

→ Combination of approaches

→ Parallelization

• Parallel execution of ontology matching on multiple compute nodes

• Use the broad availability of multi-core systems

• Reduce execution time, memory requirements

CONTRIBUTIONS

• Two strategies for parallel ontology matching• Inter-matcher parallelization: execute independent matchers in parallel

• Intra-matcher parallelization: internal parallelization of matchers based on partitioning of the ontologies

• Parallelization of different kinds of matchers• Element-level, structure-level, instance-based matchers

• Implementation and evaluation• Distributed infrastructure for parallel ontology matching

• Evaluation of the approaches for matching large life science ontologies

OVERVIEW

• Motivation

• Parallelization Strategies

• Inter-matcher Parallelization

• Intra-matcher Parallelization

• Infrastructure

• Evaluation

• Conclusion & Future Work

INTER-MATCHER PARALLELIZATION

• Parallel execution of independently executable matchers

• Process matchers on different cores or computing nodes

+ Improve execution time up to factor n (n=|matchers|)

– Limited degree of parallelization (|independently executable matchers|)

– Slowest matcher limits the achievable speedup

– Memory requirements are not reduced since matchers evaluate complete ontologies

...MatchResult

INTRA-MATCHER PARALLELIZATION

• Internal decomposition of individual matchers or matcher parts

• Partitioning the input data (ontologies)

• Many small match tasks can be executed in parallel

...Ontology Partitioning

Ontology Partitioning

Match Task Generation M1

Match Task Generation Mn

MatchResult

+ Match tasks of limited complexity → reduced memory and processing requirements

+ Applicable to sequential and independently executable matchers

• Each match task matches one O1 against one O2 partition

SIZE-BASED PARTITIONING

• Enable parallel matching of Cartesian product• Partition input ontologies into partitions of equal size (|concepts|)

+ Scalable to large ontologies+ Good load balancing + Optimizes performance without sacrificing match quality+ Applicable to element-level, structure-level and instance-based matchers

… … … …

|O1|= 10,000concepts

Generate 100match tasks

|O2|= 10,000concepts

Acc: c1

Name: c1.name

Acc: c3

Name: c3.name

Acc: d1

Name: d1.name

Acc: d2

Name: d2.nameAcc: d3

Name: d3.name

Acc: d4

Name: d4.name

Acc: d5

Name: d5.nameAcc: c2

Name: c2.name

PARALLELIZATION OF ELEMENT-LEVEL MATCHERS

• Needed information is directly associated with concepts themselves

→ Match-information is retained

• Attribute values like names, synonyms …

0.70.90.9

Acc: c1

Name: c1.name

Acc: c3

Name: c3.name

Acc: d1

Name: d1.name

Acc: d2

Name: d3.name

Acc: d4

Name: d4.name

Acc: d5

Name: c2.name

c∈O1, d∈O2

→ Easy partitioning of ontologies into subsets of concepts for element-level matchers

• Needed information is not provided by concepts themselves

• Use information from the neighborhood of concepts, e.g. children, siblings, namePath, …

PARALLELIZATION OF STRUCTURE-LEVEL MATCHERS

Approach

• Extend the concept-level information within special multi-valued context attributes, e.g. Child, NamePath, …

• Preprocessing step of linear effort

→ Size-based partitioning as for element-level matching

• Note: parallelization of similarity propagation algorithms is more difficult (need the whole structural information)

→ Only parallelization of the initial matcher

• Average element similarity between children of concepts

→Compare all their Child attributes

Acc: c1

Name: c1.name

Acc: c3

Name: c3.name

Acc: d1

Name: d1.name

Acc: d2

Name: d3.name

Acc: d4

Name: d4.name

Acc: d5

Name: c2.name

EXAMPLE: CHILD MATCHER

• Use a multi-valued Child context attribute: get name values of child concepts in a preprocessing step

corr(c,d) simChildren(c,d)

c1 – d1 (0.9 + 0.7 + 0.7 + 0.9) / (2·2) = 0.8

c1 – d4 (0.7 + 0 + 0 + 0.9) / (2·2) = 0.4

• Similarly applicable for other local context matchers, e.g. namePath

0.8 0.4

Acc: c1

Name: c1.nameChild: c2.nameChild: c3.name

Acc: c3

Name: c3.nameChild: …

Acc: d1

Name: d1.nameChild: d2.nameChild: d3.name

Acc: d2

Name: d2.nameChild: …

Acc: d3

Acc: d4

Name: d4.nameChild: d3.nameChild: d5.name

Acc: d5

Acc: c2

Name: c2.nameChild: …

0.7 0.90.90.7

INFRASTRUCTURE INTRA-MATCHER PARALLELIZATION EXAMPLE

Data Service

… …… …

Workflow Service

ck-1ck

cl-1cl

Concepts with context-attributes

Partitions (sets of concepts)

Match Service 1

Match Service 2

Job queue

......

INFRASTRUCTURE – INTRA-MATCHER PARALLELIZATION EXAMPLE

Data Service

Workflow Service

CPU1...

Job executionUnify Partial Match Results

Match Service 1

Match Service 2

Further aggregation,selection, …

in the match workflow...

• Two match problems to evaluate efficiency (execution times)

EVALUATION

Adult Mouse Anatomical Dictionary

NCI Thesaurus Anatomic Structure or Substance

Biological Processes

Molecular Functions

~2,700 · ~3,300 = 9 · 106 comparisons

~9,400 · ~17,100 = 1.6 · 108 comparisons

• Matchers (element and structure-level)

• NameSynonym (NS) Max TriGram similarity for the name and multi-valued synonym attr.

• Children (CH) Avg TriGram similarity on the context attributes Child

• NamePath (NP) Avg TriGram similarity on the context attributes NamePath

• Computing environmentFour nodes consisting of four cores (up to 16 cores)

Medium

• 1 node, 4 cores, 8 threads

INTRA-MATCHER PARALLELIZATION OF INDIVIDUAL MATCHERS

• ↑↑↑↑ degree of parallelization → ↓↓↓↓ execution times

• Almost linear speedup for up to 4 threads

• NP - most expensive, long concatenated names, multiple inheritance

• NS - many synonyms per concept in GO

NP NS CH speedup NP speedup NS speedup CH

Medium Scale Large Scale

PARALLELIZATION STRATEGIES

• Parallelization strategies benefit from multiple threads/cores

• Combined approach slightly better, because execution delays between matchers are avoided

• 4 nodes, 16 cores; matcher combination of NP, CH, NS

Intra-matcher parallelization speedup Intra

Intra- & Inter-matcher parallelization speedup Intra&Inter

→ Intra and the combined approach are very effective and thus especially valuable for parallel ontology matching

Speedup12.5

CONCLUSIONS AND FUTURE WORK

• General, flexible strategies for parallel ontology matching on multiple compute nodes to improve efficiency (inter- and intra-matcher parallelization)

• Size-based partitioning enabled good load balancing, scalable, reduced memory consumptions, quality not affected

• Parallelization of element-level, structure-level, instance-based matchers using multi-valued context attributes

• Implemented a distributed infrastructure, evaluation for large life science ontology match problems

• Investigate parallel ontology matching for additional matchers, evaluate effectiveness

• Combine parallelization with advanced fragmentation strategies

THANK YOU FOR YOUR ATTENTION

on matching large life science ontologies in parallel2 matching large life science... · • needed...

Documents

xwm: a high-speed matching algorithm for large...

efficient blocking method for a large scale citation...

matching in the large: an experimental study · matching in...

matching bibliographic data from publication lists with...

on matching large life science ontologies in parallel...

stable matching in large economies - columbia...

large-scale matching

object retrieval with large vocabularies and fast spatial...

framework for matching and linking large ontologies

embedding-based subsequence matching in large sequence...

a real-time matching system for large fingerprint...

on ranked approximate matching of large attributed graphs

towards exhaustive pairwise matching in large image...

lighthouse: large-scale graph pattern matching on giraph

evaluating large-scale biomedical ontology matching...

object mining using a matching graph on very large image...

agreementmaker: e cient matching for large real...

large-scale collective entity matching

geometrically motivated dense large displacement matching in...

balanced large scale knowledge matching using lsh forest ·...