amga metadata catalogue system - home · in2p3 events ... · lhcb and ganga were early testers ......

22
EGEE-II INFSO-RI-031688 Enabling Grids for E-sciencE www.eu-egee.org EGEE and gLite are registered trademarks AMGA metadata catalogue system Hurng-Chun Lee ACGrid School, Hanoi, Vietnam

Upload: ngoque

Post on 19-Apr-2018

215 views

Category:

Documents


1 download

TRANSCRIPT

EGEE-II INFSO-RI-031688

Enabling Grids for E-sciencE

www.eu-egee.org

EGEE and gLite are registered trademarks

AMGA metadata catalogue systemHurng-Chun LeeACGrid School, Hanoi, Vietnam

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

Outline

• AMGA overview– AMGA Background and Motivation for AMGA– Interface, Architecture and Implementation– Metadata Replication on AMGA– AMGA use cases

• AMGA hands-on exercise

2

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

Metadata catalogue

3

logical filename Physical filename

Data contentMetadata(data about data)

Metadata catalogue ≠ File catalogue

Metadata catalogue can be used as a File catalogue

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

Metadata on the Grid

4

SE 1

SE 2

SE 3

SE n

The Grid Storage EnvironmentData are stored as files on many distributed storage elements

User

Find the file XXX.YY

Find the data produced by my job x and with energy level < -10 eV

File Catalogue

Metadata Catalogue

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

More than a file catalogue

5

• Metadata is data about data

• On the Grid: information about data content – Describe files– Locate files based on their contents

• But also makes DB access a simple task on the Grid– Many Grid applications need structured data– Many applications require only simple schemas

Can be modeled as metadata– Main advantage: better integration with the Grid environment

Metadata Service is a Grid component Grid security Hide DB heterogeneity

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

ARDA/gLite Metadata Interface

• Existing Metadata Services in High-Energy Physics– AMI (ATLAS), RefDB (CMS), Alien Metadata Catalogue (ALICE)– Similar goals, similar concepts– Domain specific– Technical limitations: large answers, scalability, performance

• ARDA proposal: AMGA– based on the HEP requirements– generic for other applications– design jointly with the gLite/EGEE team

• Adopted as the official EGEE Metadata Interface– endorsed by PTF (Project Technical Forum of EGEE)

6

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

The AMGA project

• ARDA developed a Project Task Force in order to develop:– AMGA – ARDA Metadata Grid Application

• Began as prototype to evaluate the Metadata Interface– Evaluated by community since the beginning:

LHCb and Ganga were early testers (more on this later)– Matured quickly thanks to users feedback

• Now is part of the gLite middleware– Official Metadata Service for EGEE– First release with gLite 1.5– Also available as standalone component

• It is expanding to other user communities:– HEP, Biomed, UNOSAT…

7

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

AMGA metadata concept

• Some Concepts:– Metadata - List of attributes associated with entries– Attribute – key/value pair with type information

Type – The type (int, float, string,…) Name/Key – The name of the attribute Value - Value of an entry's attribute

• Schema – A set of attributes

• Collection – A set of entries associated with a schema

• Think of schemas as tables, attributes as columns, entries as rows

8

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

AMGA features

• Dynamic Schemas– Schemas can be modified at runtime by client

Create, delete schemas Add, remove attributes

• Metadata organized as an hierarchy– Collections can contain sub-collections– Analogy to file system:

Collection ⇔ Directory

Entry ⇔ File

• Flexible Queries– SQL-like query language– Joins between schemas

9

QUERY EXAMPLE:

selectattr /gLibrary:FileName \ /gLibrary:Author \ ‘/gLibrary:FILE=/gLAudio:FILE \ and \ like(/gLibrary:FileName,“%.mp3")‘

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

AMGA security

• Unix style permissions

• ACLs – per-collection or per-entry

• Secure connections – SSL

• Client Authentication based on– Username/password– General X509 certificates– Grid-proxy certificates

• Access control via a Virtual Organization Management System (VOMS)

10

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

AMGA implementation

• C++ multiprocess server– Runs on any Linux flavour

• Backends– Oracle, MySQL, PostgreSQL, SQLite

• Two frontends– TCP Streaming

High performance Client API for: C++, Java, Python, Perl,

Ruby– SOAP

Interoperability

• Also implemented as standalone Python library

– Data stored on filesystem

11

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

Architecture

• Designed for scalability– Asynchronous operation

Reading from DB and sending data to client

– Response sent to client in chunks No limit on the maximum

response size

• Example: TCP Streaming– Text based protocol (like SMTP,

POP3,…)– Response streamed to client

12

Client: listattr entry

Server: 0

entry value1 value2 … <EOT>

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

Metadata replication

• Motivation– Scalability – Support hundreds/thousands of concurrent users– Geographical distribution – Hide network latency– Reliability – No single point of failure– DB Independent replication – Heterogeneous DB systems– Disconnected computing – Off-line access (laptops)

• Architecture– Asynchronous replication– Master-slave – Writes only allowed on the master– Replication at the application level

Replicate Metadata commands, not SQL → DB independence– Partial replication – supports replication of only sub-trees of the

metadata hierarchy

13

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

Metadata replication (cont.)

14

Full replication Partial replication

Federation Proxy

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

Early adoption of AMGA

15

• LHCb-bookkeeping (keep additional information from executed jobs)– Migrated bookkeeping metadata to ARDA prototype

20M entries, 15 GB Large amount of static metadata

– Feedback valuable in improving interface and fixing bugs– AMGA showing good scalability

• Ganga– Job management system

Developed jointly by Atlas and LHCb– Uses AMGA for storing information about job status

Small amount of highly dynamic metadata

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

Accessing AMGA

• TCP Streaming Front-end– mdcli & mdclient and C++ API (md_cli.h, MD_Client.h)– Java Client API and command line mdjavaclient.sh & mdjavacli.sh

(also under Windows)– Python Client API

• SOAP Frontend (WSDL)– C++ gSOAP– AXIS (Java)– ZSI (Python)

16

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

Conclusion

• AMGA – Metadata Service of gLite– Part of gLite (but still not certificed in gLite 3.0. it will be done with 3.1

release)– Useful for simplified DB access– Integrated on the Grid environment (Security)

• Replication/Federation features

• Tests show good performance/scalability

• Already deployed by several Grid Applications– LHCb, ATLAS, Biomed, …

• AMGA Web Site– http://project-arda-dev.web.cern.ch/project-arda-dev/metadata/

17

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

Medical data management

• Scenario:– Store and access medical images exploiting metadata on the Grid

• Requirements:– Strong security requirements

Patient data is sensitive Data must be encrypted Metadata access must be restricted to authorized users

• AMGA used as metadata server– Demonstrates authentication and encrypted access– Used as a simplified DB

• More details: – http://www.i3s.unice.fr/~johan/mdm/mdm-051013.pdf

18

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

WISDOM drug analysis

• Scenario:– essential docking information (e.g. binding energy, conformation)

provider for on-line data analysis on the simulation outputs

• Requirement:– high speed and large-scale I/O– supporting large number of entries ( > 10 million database records )– replication and backup

• AMGA used as a file catalogue as well as part of the post-analysis infrastructure– first level data analysis on the metadata– grid physical file name is one of the metadata attributes– users are pointed to download the full output according to the

metadata query

19

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

Grid-enable drug analysis system

20

simulationpreparation

Histogram on 295535 screens: DC2_T07IAN

binding energy

- 1 6 - 1 5 - 1 4 - 1 3 - 1 2 - 1 1 - 1 0 - 9 - 8 - 7 - 6 - 5 - 4 - 30

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

5500

6000

6500

online data analysis

Grid Computing Units(Worker nodes, agents)

job submission

Best demo prize of EGEE’07

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

gMOD: grid Movie on Demand

• gMOD provides a Video-On-Demand service

• User chooses among a list of video and the chosen one is streamed in real time to the video client of the user’s workstation

• For each movie a lot of details (Title, Runtime, Country, Release Date, Genre, Director, Case, Plot Outline) are stored and users can search a particular movie querying on one or more attributes

• Two kind of users can interact with gMOD: TrailersManagers that can administer the db of movies (uploading new ones and attaching metadata to them); GILDA VO users (guest) can browse, search and choose a movie to be streamed.

21

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688

gMOD workflow

22

VOMS

LFCCatalogue

MetadataCatalogue

WN WN

WN

CE

Storage Elements

User

Genius Portal

Workload Management System

get RoleAMGA