Laboratoire LIP6
The Gedeon Project: Data, Metadata and Databases
Yves DENNEULIN
LIG laboratory, Grenoble
ACI MD
Context and goals
● Heterogeneous metadata management on grids
  Clusters of clusters
● High-level queries using metadata
● Easy and flexible deployment and configuration
● Minimal overhead
● Various interfaces
● Initial target application domains
  Biocomputing (lots of metadata, little data)
  Microscopic imaging (lots of data, little metadata)
[Figure: Gedeon middleware instances on a bioinformatics cluster and on a grid. Labels: query, result, proprietary sequences, swissprot, TrEMBL]
The Gedeon middleware
Metadata management on lightweight grids
● Records of (attribute, value) pairs stored in files
Flexible requests
● Can be combined through scripting
Various interfaces
● Command line (tools)
● Libraries
● Virtual FS (legacy applications support)
Deployment "à la carte"
● Composition of various data sources
Performance
● Dedicated I/O library
● Semantic caching
Outline
1. General architecture
a. Gedeon internal structure
b. Composition of various data sources
2. Practical use
3. « dual » cache
Conclusion
Example of a deployment
[Figure: a client with a query interface (API, FS, GUI, ...) and a local proxy; interconnect middleware; servers "close" to the client; storage sites, each with a local proxy; caches attached to the proxies and servers]
Gedeon components
● Gedeon kernel
  fuple
  ● I/O library
  ● Evaluates the queries
  lowerG
  ● Operators to compose bases
  ● Remote access
● Interface
  API lowerG
  Virtual FS
● Cache
[Figure: software stacks. Labels: application, vSGF, lowerG, fuple, network, cache; local proxy: network, lowerG, fuple]
What is inside the sources?
● Records of attribute/value pairs

Record
  Id        457
  classifA  Bacteria
  classifB  Clostridia
  taille    26
  ref
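To make the record model concrete, here is a minimal sketch that holds the record from the slide as a Python dict. The dict representation and the helper names `pairs` and `attributes` are assumptions for illustration, not the Gedeon storage format; the `ref` attribute is left out because its value is not shown.

```python
# Hypothetical in-memory model: one record is a dict of attribute -> value.
record = {
    "Id": 457,
    "classifA": "Bacteria",
    "classifB": "Clostridia",
    "taille": 26,
}


def pairs(rec):
    """Return the record as a list of (attribute, value) pairs."""
    return list(rec.items())


def attributes(records):
    """Return every attribute name used by at least one record."""
    names = set()
    for rec in records:
        names.update(rec)  # iterating a dict yields its keys
    return names
```

Records do not need to share the same attributes, which is why `attributes` collects names across a whole set of records.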
Example of composition of sources
[Figure: a client queries a tree of operators; a union (+) and a join (J) combine sources R located on sites S1, S2, S3, ...]
Metadata can be local or copies
Union
  Source A: record A1, record A2, record A3, record A4, ...
  +
  Source B: record B1, record B2, record B3, record B4, ...
  Result: record A1, record A2, record A3, record A4, record B1, record B2, record B3, record B4, ...
● Unify storage space
● Parallel evaluation
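The union operator can be sketched as follows. This is an illustration under assumptions, not the lowerG implementation: sources are plain Python lists of dict records, and `union` and `parallel_select` are hypothetical names. The second function shows one way the "parallel evaluation" point could work, with each source scanned by its own worker.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def union(*sources):
    """Present several sources as one stream of records."""
    return chain.from_iterable(sources)


def parallel_select(sources, predicate):
    """Filter each source in its own worker, then concatenate the results."""
    def scan(source):
        return [rec for rec in source if predicate(rec)]

    with ThreadPoolExecutor() as pool:
        parts = pool.map(scan, sources)  # results come back in source order
        return [rec for part in parts for rec in part]


source_a = [{"Id": "A1"}, {"Id": "A2"}, {"Id": "A3"}, {"Id": "A4"}]
source_b = [{"Id": "B1"}, {"Id": "B2"}, {"Id": "B3"}, {"Id": "B4"}]
merged = list(union(source_a, source_b))
```

In a real deployment the sources would sit on different sites, so the scan of each source would run where the data is stored.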
Join operator
  First source:
    Id 457 | A1 v1 | A2 v2 | A3 v3
    Id 458 | A1 v4 | A2 v5 | A3 v6
    ...
  J
  Second source:
    Id 457 | An vAn1
    Id 458 | An vAn2
    ...
  Result:
    Id 457 | A1 v1 | A2 v2 | A3 v3 | An vAn1
    Id 458 | A1 v4 | A2 v5 | A3 v6 | An vAn2
● Enrich a source with another
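A minimal sketch of the join on `Id`, assuming dict records. The function name `join` is hypothetical, and keeping a first-source record that has no partner is an assumption; the slide does not say what Gedeon does in that case.

```python
def join(left, right, key="Id"):
    """Enrich each record of `left` with the attributes of the `right`
    record that has the same `key` value."""
    by_key = {rec[key]: rec for rec in right}
    result = []
    for rec in left:
        extra = by_key.get(rec[key], {})  # no partner: keep the record as is
        result.append({**rec, **extra})
    return result


first = [
    {"Id": 457, "A1": "v1", "A2": "v2", "A3": "v3"},
    {"Id": 458, "A1": "v4", "A2": "v5", "A3": "v6"},
]
second = [
    {"Id": 457, "An": "vAn1"},
    {"Id": 458, "An": "vAn2"},
]
enriched = join(first, second)
```

If both sources define the same attribute, this sketch lets the second source win; a different policy may be more appropriate.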
Tools 2/2
● Examples
  sort
    sort(attr='taille')
    $> cat mesmeta.g | fsort 'taille' > trie_taille.g
  index
    create_idx(attr='Id')
    search_idx('Id', 'P0123')
    [Figure: index files named .Id.idx]
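The idea behind the sort and index tools can be sketched in Python. The functions `sort_by`, `build_index` and `lookup` are hypothetical stand-ins and do not reproduce the signatures or file formats of `fsort`, `create_idx` or `search_idx`. Placing records that lack the sort attribute last is also an assumption.

```python
def sort_by(records, attr):
    """Sort records by `attr`; records without `attr` go last."""
    return sorted(records, key=lambda r: (attr not in r, r.get(attr, 0)))


def build_index(records, attr):
    """Map each value of `attr` to the positions of the records holding it."""
    index = {}
    for position, rec in enumerate(records):
        if attr in rec:
            index.setdefault(rec[attr], []).append(position)
    return index


def lookup(records, index, value):
    """Return the records whose indexed attribute equals `value`."""
    return [records[position] for position in index.get(value, [])]


records = [
    {"Id": 459, "classifB": "Bacteria", "taille": 47},
    {"Id": 460, "classifA": "Fermicutes"},
    {"Id": 457, "classifA": "Bacteria", "classifB": "Clostridia", "taille": 26},
]
by_size = sort_by(records, "taille")
id_index = build_index(records, "Id")
```

With an index, a lookup by `Id` no longer needs to scan every record.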
Language for the requests
● Simple ($, type control with the operators)
● Regular expressions
● Second order (conditions on attribute names, e.g. $/classif[AB]/)
Select expression
  Input records:
    Id 457 | classifA Bacteria | classifB Clostridia | taille 26
    Id 459 | classifB Bacteria | taille 47
    Id 460 | classifA Fermicutes
  Select $Id>459
  Result:
    Id 460 | classifA Fermicutes
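The simple select above can be mimicked with a small Python function. This is a sketch under assumptions: the name `select` and its argument layout are hypothetical, and "type control" is modelled by skipping values whose type differs from that of the comparison value.

```python
import operator

OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


def select(records, attr, op, value):
    """Keep the records that have `attr` and satisfy `record[attr] op value`."""
    compare = OPERATORS[op]
    kept = []
    for rec in records:
        # Assumed "type control": ignore values of a different type.
        if attr in rec and isinstance(rec[attr], type(value)) and compare(rec[attr], value):
            kept.append(rec)
    return kept


RECORDS = [
    {"Id": 457, "classifA": "Bacteria", "classifB": "Clostridia", "taille": 26},
    {"Id": 459, "classifB": "Bacteria", "taille": 47},
    {"Id": 460, "classifA": "Fermicutes"},
]
result = select(RECORDS, "Id", ">", 459)
```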
Select using regexp
  Input records:
    Id 457 | classifA Bacteria | classifB Clostridia | taille 26
    Id 459 | classifB Bacteria | taille 47
    Id 460 | classifA Fermicutes
  Select $classifB==/.*a$/
  Result:
    Id 457 | classifA Bacteria | classifB Clostridia | taille 26
    Id 459 | classifB Bacteria | taille 47
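A regular-expression select can be sketched with Python's `re` module. The function name `select_regex` is hypothetical, and the matching semantics are an assumption: the pattern is applied with `re.search`, so anchors such as `$` decide where it must match.

```python
import re


def select_regex(records, attr, pattern):
    """Keep the records whose `attr` value matches the regular expression."""
    regex = re.compile(pattern)
    return [
        rec for rec in records
        if attr in rec and regex.search(str(rec[attr]))
    ]


RECORDS = [
    {"Id": 457, "classifA": "Bacteria", "classifB": "Clostridia", "taille": 26},
    {"Id": 459, "classifB": "Bacteria", "taille": 47},
    {"Id": 460, "classifA": "Fermicutes"},
]
ending_in_a = select_regex(RECORDS, "classifB", r".*a$")
```

Record 460 is dropped because it has no `classifB` attribute at all, not because its value fails the pattern.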
Select using 2nd order logic
  Input records:
    Id 457 | classifA Bacteria | classifB Clostridia | taille 26
    Id 459 | classifB Bacteria | taille 47
    Id 460 | classifA Fermicutes
  Select $/classif[AB]/==Bacteria && $taille>=36
  Result:
    Id 459 | classifB Bacteria | taille 47
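In the second-order form, the regular expression applies to attribute names rather than values. The sketch below builds such a query from small predicates. All names (`any_attr_equals`, `attr_at_least`, `select_all`) are hypothetical, and requiring the pattern to match the whole attribute name is an assumption.

```python
import re


def any_attr_equals(name_pattern, value):
    """True if some attribute whose name matches the pattern holds `value`."""
    regex = re.compile(name_pattern)

    def predicate(rec):
        return any(
            regex.fullmatch(name) and val == value
            for name, val in rec.items()
        )
    return predicate


def attr_at_least(attr, bound):
    """True if the record has `attr` and its value is >= `bound`."""
    def predicate(rec):
        return attr in rec and rec[attr] >= bound
    return predicate


def select_all(records, *predicates):
    """Keep the records satisfying every predicate (the && of the query)."""
    return [rec for rec in records if all(p(rec) for p in predicates)]


RECORDS = [
    {"Id": 457, "classifA": "Bacteria", "classifB": "Clostridia", "taille": 26},
    {"Id": 459, "classifB": "Bacteria", "taille": 47},
    {"Id": 460, "classifA": "Fermicutes"},
]
result = select_all(
    RECORDS,
    any_attr_equals(r"classif[AB]", "Bacteria"),
    attr_at_least("taille", 36),
)
```

Record 457 passes the first condition through `classifA` but fails `taille>=36`, so only record 459 remains.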
Virtual FS interface
● Just a specific file-oriented interface
● Data and metadata can be anywhere in the grid
● Definition of logical directories
  Ex: cd '$classifB==|.*a$|'
  "and" between directories
  1 filename = value of a metadata: logical view

/fs_virt/$classifB==|.*a$|> ls
457 459
/fs_virt/$classifB==|.*a$|> cat * >/tmp/mater
/fs_virt/$classifB==|.*a$|>
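The logical-directory idea can be illustrated without a real file system. In this sketch each path component is a query of the form `$attr==|regex|`, nested directories are combined with "and", and the listing shows the `Id` of each matching record. The function `list_directory`, the restricted query form and the choice of `Id` as the file name are assumptions; this is not the Gedeon virtual FS code.

```python
import re

# One path component: $attr==|regex|
COMPONENT = re.compile(r"^\$(\w+)==\|(.*)\|$")


def list_directory(records, path, name_attr="Id"):
    """List a logical directory given relative to the mount point.

    Each path component filters the records further ("and" between
    directories). Patterns containing "/" are not supported here.
    """
    matching = records
    for component in path.strip("/").split("/"):
        parsed = COMPONENT.match(component)
        if parsed is None:
            raise ValueError(f"unsupported directory name: {component!r}")
        attr, regex = parsed.group(1), re.compile(parsed.group(2))
        matching = [
            rec for rec in matching
            if attr in rec and regex.search(str(rec[attr]))
        ]
    return [str(rec[name_attr]) for rec in matching]


RECORDS = [
    {"Id": 457, "classifA": "Bacteria", "classifB": "Clostridia", "taille": 26},
    {"Id": 459, "classifB": "Bacteria", "taille": 47},
    {"Id": 460, "classifA": "Fermicutes"},
]
listing = list_directory(RECORDS, "$classifB==|.*a$|")
```

Entering a sub-directory narrows the view: `list_directory(RECORDS, "$classifB==|.*a$|/$classifA==|^Bact|")` keeps only record 457.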
Dual cache (1)
● 2 cooperative caches
  cache of requests (R, {id,...}) -> save computing power
  cache of data (id, {attr,...}) -> save bandwidth
● Semantic cache
  Can evaluate a query using the data in the cache
  Can generate a remainder to complement the cached data
Example
● Refinement of a request
  1) '$OC==/Eukaryota/' -> (R, Lid={id1,id2, ...})
  2) '$OC==/Eukaryota/ && $year>=1998' -> Select(*Lid, '$year>=1998')
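The refinement example can be sketched as a small dual cache: one dictionary maps a request to its list of ids, another maps an id to its record. The class `DualCache`, its methods, the fake backend and the three records are all made up for illustration. Remainder queries are not covered.

```python
class DualCache:
    """Request cache (R -> ids) plus data cache (id -> record)."""

    def __init__(self, fetch_ids, fetch_record):
        self.fetch_ids = fetch_ids        # remote evaluation of a request
        self.fetch_record = fetch_record  # remote transfer of one record
        self.requests = {}                # R -> [id, ...]
        self.objects = {}                 # id -> {attr: value, ...}

    def record(self, rec_id):
        if rec_id not in self.objects:
            self.objects[rec_id] = self.fetch_record(rec_id)
        return self.objects[rec_id]

    def run(self, request):
        if request not in self.requests:
            self.requests[request] = self.fetch_ids(request)
        return [self.record(i) for i in self.requests[request]]

    def refine(self, cached_request, extra_condition):
        """Select(*Lid, condition): filter the ids of a cached request.

        Raises KeyError if `cached_request` has not been run before.
        """
        ids = self.requests[cached_request]
        return [self.record(i) for i in ids if extra_condition(self.record(i))]


# Made-up records and a fake remote backend, for illustration only.
REMOTE = {
    "id1": {"OC": "Eukaryota", "year": 1995},
    "id2": {"OC": "Eukaryota", "year": 2001},
    "id3": {"OC": "Archaea", "year": 1999},
}
remote_calls = {"requests": 0, "records": 0}


def fetch_ids(request):
    # The fake backend only understands the single request used below.
    remote_calls["requests"] += 1
    return [i for i, rec in REMOTE.items() if rec["OC"] == "Eukaryota"]


def fetch_record(rec_id):
    remote_calls["records"] += 1
    return REMOTE[rec_id]


cache = DualCache(fetch_ids, fetch_record)
first = cache.run("$OC==/Eukaryota/")
refined = cache.refine("$OC==/Eukaryota/", lambda rec: rec["year"] >= 1998)
```

The refined request is answered entirely from the two caches: the counters in `remote_calls` do not change between the first request and the refinement.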
Dual cache (2)
● Distributed semantic cache
  Typically used inside communities
  ● Lots of common requests
  No location constraints
  ● Members of the community can be geographically scattered
● Distributed data cache
  Minimize time and data transfer
  Cooperation between sites that are close from a topological point of view
Dual cache (3)
[Figure: servers in Grenoble and in Rennes, with a dual cache made of a query cache and an object cache. Labels: semantic locality (community Eukaryota, community Archaea), geographic locality]
Dual cache (4)
● Work in progress on the notion of distance
  Find geographical proximity
  Find common interests between communities
  ● Create hybrid communities based on their requests
● Could be used to change the cache parameters
  Manual and/or automatic
Conclusion
● A data integration middleware
  Handling of metadata
● Distributed and modular
  Deployment can be done according to architectural/organisational constraints
● Definition of a dual cache infrastructure
  Reflect both organisational use
● Prototype in use
  Packaging and documentation needed