Thoughts on Data Management
Nicholas Schwarz
Software Services Group
Advanced Engineering Support (AES) Division
Advanced Photon Source (APS)
25 June 2013
Thoughts on Data Management - SSG - 14 June 2013
What is Data Management?
Data Management is the development and execution of architectures, practices and procedures, and policies that properly manage our data lifecycle needs.
Architecture
The architecture is the unambiguous definition of data, and the data storage and distribution infrastructure, i.e. hardware and software.
Data Examples
– Data are files on disk
– Data are a list of names and telephone numbers
– Data are a tuple of real numbers
– Data are …
Hardware and Software Examples
– Each sector has a dserv with storage
– There is central storage
– There is one internal and one external GlobusOnline endpoint
– A web-based system is used to set ownership permissions
Practices and Procedures
Standard practices and procedures are required so that data can be handled properly. These practices and procedures must be embedded in regular operations processes.
Examples
– All measurement data must be saved to the local sector’s dserv every 24 hours
– Selected measurement data must be transferred to central storage
– Data on central storage must be saved in /data/managed/esaf123456
– Data to be archived indefinitely must be flagged for archival within 7 days of the end of the experiment period
Policies
Data policies dictate what is done with data so that data management helps meet the organization’s goals and operates within its requirements.
Examples
– All systems must comply with requirements in ANL-593
– Only members of an ESAF can access data collected with that ESAF
– APS firewalls must not change
– APS must not lose data when the outside network connection is lost
– Data management at one sector must not interfere with data collection at another sector
– All measurement data must be kept for 90 days
– All metadata should be kept indefinitely
– Old metadata must be accessible within 48 hours of a request
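The retention examples above can be read as a rule on the age of a record. The following is a minimal Python sketch of that reading, not an APS implementation. The function name `is_expired` and the record kinds `"measurement"` and `"metadata"` are hypothetical labels chosen for illustration; only the 90-day and "kept indefinitely" values come from the examples.

```python
from datetime import datetime, timedelta
from typing import Optional

# Retention periods taken from the policy examples above.
# None means "keep indefinitely". The dictionary keys are hypothetical labels.
RETENTION = {
    "measurement": timedelta(days=90),
    "metadata": None,
}


def is_expired(kind: str, created: datetime, now: datetime) -> bool:
    """Return True if a record of this kind is past its retention period."""
    period: Optional[timedelta] = RETENTION[kind]
    if period is None:
        return False  # kept indefinitely
    return now - created > period


created = datetime(2013, 1, 1)
print(is_expired("measurement", created, datetime(2013, 3, 1)))  # 59 days old
print(is_expired("measurement", created, datetime(2013, 6, 1)))  # 151 days old
print(is_expired("metadata", created, datetime(2013, 6, 1)))
```

Keeping the periods in one table means a policy change (for example a different retention period) touches a single value rather than the checking logic.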
Interdependency
Data policies, practices and procedures, and architecture drive each other.
Examples
Policy: Data management at one sector must not interfere with data collection at another sector
Architecture: Distributed server (dserv) for each sector

Architecture: The only commonality of APS data is that it is stored in files
Architecture: Data ownership enforcement mechanism is based on file system permissions

Policy: APS must not lose data when the outside network connection is lost
Practices and procedures: Data is stored internal to the APS
Thoughts / Questions / Tasks
Define what data management is to the APS.
Perspectives
Data management depends on your perspective…
User / Scientist
– Do science
– Output measured primarily by publications (patents)
Facility
– Produce x-rays (maximize uptime)
– Maximize data collection
User / Scientist Perspective
[Figure labels: Laboratory Microscope Data; Synchrotron Derived Data]
Publication
– Multiple figures
– Different types of data
User / Scientist Perspective
Synchrotron Derived Data
Even a single figure with synchrotron data may have data from multiple facilities.
User / Scientist Perspective
[Figure labels: Normalize Intensity; Cell Finding Algorithm; Data Fusion; Synchrotron Derived Data]
Process of analyzing data generates new knowledge and data (and metadata).
Facility Perspective
Sources | Type | Example
N | Administrative Data | PI, User; Dates; Description; ESAF, BTR, GUP; …
N | Experiment / Measurement Data | Sample and sample conditions; Area detector images; Point detector scalars; Motor positions; Energy (Undulator, Monochromator); …
N | Beamline / Sector Data | BL 1-XX, BL 2-XX, …, BL 35-XX; Sector 1, Sector 2, …, Sector 35; Energy (Undulator, Monochromator); …
1 | Accelerator Data | Machine Data; Status; Orbit, Power Supply; …
[Diagram labels: Publication; Data Source 1, Data Source 2, …, Data Source N; Synchrotron 1 Data, Synchrotron 2 Data, …, Synchrotron N Data; Administrative Data; Sample / Experiment / Measurement Metadata; Accelerator Data; Analysis; Measured Data; Facility; User / Scientist]
Thoughts / Questions / Tasks
What’s the perspective of the APS?
APS is one of many scientific instruments
As a facility, what can the APS do to enable science without knowing what goes on outside the facility, and with little control of what goes on outside the facility?
Every facility agrees and does the exact same thing?
– Data formats, equipment, passwords, etc.
Help facilitate transition of data from facility to user?
Data Management at the APS
1. What is/are our architecture (data, hardware, software), practices and procedures, and policies for data management?
2. As a facility, what can the APS do to enable science without knowing what goes on outside the facility, and with little control of what goes on outside the facility?
3. What are our limitations?
4. What do we hope to be?
– Streamlined facility so the user can realize their perspective
APS Architecture - Data
Many types of data at the APS
– Administrative Data: well defined
– Accelerator Data: well defined
– Beamline Data: varies
– Measurement/Experiment Data: defined based on technique/beamline/user
  – Great variability: commonality is files on disk
  – Database entries for protein crystallography
One experiment has data from all of these categories
APS Policies
Goal: Streamlined facility so users can realize their science perspective
Policies
– Maximize data collection
– ANL-593
– Operate without outside network
– Firewalls cannot change
– Data ownership (only data owners can see their data)
– Data should be deleted after some set amount of time
– Many, many more to follow…
Implications
– No cloud-only solution
– Critical services work internally
– User access is tied to APS computer access
Data Management Roles
Roles: Data Administrator, Group Manager, User
Experiment (or Project) Directory
– Data Administrator: rw. Data administrator owns all group directories; enforced at creation time
– Group Manager: r. Group manager is in experiment group; experiment directory is rx for group
– User: r. User is in experiment group; experiment directory is rx for group
Data in Experiment (or Project) Directory
– Data Administrator: rw. Data administrator owns all files and subdirectories; enforced with inotify script
– Group Manager: rw. Group manager is in experiment group; experiment directory is rwx for group
– User: rw. User is in experiment group; experiment directory is rwx for group
Experiment (or Project) Group
– Data Administrator: create group, modify group members
– Group Manager: modify group members. Group manager uid has additional group owner attribute in schema
– User: none. User cannot modify group
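The group permissions described above map onto ordinary POSIX mode bits. The Python sketch below illustrates that mapping under stated assumptions: the slides give only the group bits (rx and rwx), so rwx for the owner and no access for others are assumed here, and the constant and role names are hypothetical.

```python
import stat

# Hypothetical mode bits illustrating the roles above. Only the group bits
# (r-x and rwx) come from the slides; owner rwx and no access for "other"
# are assumptions made for this sketch.
EXPERIMENT_DIR_MODE = stat.S_IFDIR | 0o750  # group has r-x
DATA_DIR_MODE = stat.S_IFDIR | 0o770        # group has rwx


def group_can_write(mode: int) -> bool:
    """Return True if the group write bit is set in a mode."""
    return bool(mode & stat.S_IWGRP)


# Who may change experiment group membership, per the roles above.
# The role keys are hypothetical labels.
CAN_MODIFY_GROUP = {
    "data_administrator": True,
    "group_manager": True,
    "user": False,
}

print(stat.filemode(EXPERIMENT_DIR_MODE), group_can_write(EXPERIMENT_DIR_MODE))
print(stat.filemode(DATA_DIR_MODE), group_can_write(DATA_DIR_MODE))
```

Because access is granted through the experiment group rather than per user, adding or removing a group member changes who can reach the data without touching any files.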
APS Architecture - Hardware
[Diagram labels: Beamline Acquisition Computer and dserv (repeated for several beamlines); lustre; gridFTP Server; Internal gridFTP Server; External GO Endpoint; Globus; APS Firewall]
APS Architecture – Software
Internal Transfer & Tracking
– Storage Resource Broker (SRB) (SDSC)
– SPADE (ALS-LBL)
– Modify our internal workflow pipeline (APS-ANL)
– SLAC has an internal system XRootD
– SSG is investigating which to adopt
User Accounts
– Integrate user badges with APS LDAP
Management
– Develop web site for modifying ownership and access permissions
APS Architecture – Software
External Transfer & Access
– GlobusOnline provides access to APS data from the outside
– Users authenticate using their APS badge number and password
– Users can only see their data
– Users can integrate with other Globus tools
APS Practices and Procedures – Data Storage Workflow
– Data should be transferred from the acquisition computer to the local dserv
– Data on the dserv is transferred to lustre storage at one of the following intervals:
  – Immediately
  – Daily (at a designated time)
  – Every Tuesday @ 8AM
  – At the end of an experiment
  – At the end of a run
– Data on lustre is automatically deleted at a time determined by APS policy
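The transfer intervals above mix clock-based schedules with event-driven ones. The Python sketch below shows one way to compute the next clock-based transfer time; it is illustrative only. The function `next_transfer`, the policy strings, and the default daily time of 02:00 are all assumptions, since the slides say only "at a designated time".

```python
from datetime import datetime, time, timedelta
from typing import Optional


def next_transfer(policy: str, now: datetime,
                  daily_at: time = time(2, 0)) -> Optional[datetime]:
    """Return the next scheduled dserv-to-lustre transfer time.

    Event-driven policies ("end_of_experiment", "end_of_run") have no
    clock time, so None is returned for them.
    """
    if policy == "immediately":
        return now
    if policy == "daily":
        candidate = datetime.combine(now.date(), daily_at)
        return candidate if candidate > now else candidate + timedelta(days=1)
    if policy == "weekly_tuesday_8am":
        days_ahead = (1 - now.weekday()) % 7  # Monday is 0, Tuesday is 1
        candidate = datetime.combine(now.date() + timedelta(days=days_ahead),
                                     time(8, 0))
        return candidate if candidate > now else candidate + timedelta(days=7)
    if policy in ("end_of_experiment", "end_of_run"):
        return None
    raise ValueError(f"unknown policy: {policy}")


# 24 June 2013 was a Monday, so the next Tuesday 8AM slot is the following day.
print(next_transfer("weekly_tuesday_8am", datetime(2013, 6, 24, 12, 0)))
```

The event-driven intervals would need a trigger from whatever system knows when an experiment or run ends, which is outside the scope of this sketch.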
APS Practices and Procedures – Data Storage Organization
Experiment Data
– Experiment data must be stored in a directory named e[ESAFNumber]_[PILastName], e.g. e123456_Smith
– Experiment data directories must be located in /data/managed/experiments/r[RunNumber], e.g. /data/managed/experiments/r2013-2
  /data/managed/experiments/r2013-2/e123456_Smith
Project Data
– Project data must be stored in a directory named p[ProjectID]_[ProjectName], e.g. p000001_MyProject
– Project data directories must be located in /data/managed/projects
  /data/managed/projects/p000001_MyProject
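The naming rules above are regular enough to build and check paths programmatically. The Python sketch below is one hypothetical way to do so. The helper names are invented for illustration, and two details are assumptions inferred from the examples rather than stated rules: that project IDs are zero-padded to six digits (from p000001) and that run numbers look like YYYY-N (from r2013-2).

```python
import re
from pathlib import PurePosixPath

# Root directories taken from the slide; everything else here is hypothetical.
EXPERIMENTS_ROOT = PurePosixPath("/data/managed/experiments")
PROJECTS_ROOT = PurePosixPath("/data/managed/projects")

# Assumed pattern: r<YYYY-N>/e<ESAF number>_<PI last name>.
EXPERIMENT_RE = re.compile(r"^/data/managed/experiments/r\d{4}-\d/e\d+_\w+$")


def experiment_dir(run: str, esaf: int, pi_last_name: str) -> PurePosixPath:
    """Build the directory for an experiment, e.g. r2013-2/e123456_Smith."""
    return EXPERIMENTS_ROOT / f"r{run}" / f"e{esaf}_{pi_last_name}"


def project_dir(project_id: int, name: str) -> PurePosixPath:
    """Build the directory for a project; six-digit padding is an assumption."""
    return PROJECTS_ROOT / f"p{project_id:06d}_{name}"


def is_valid_experiment_dir(path: str) -> bool:
    """Check a path against the assumed experiment naming pattern."""
    return EXPERIMENT_RE.match(path) is not None


print(experiment_dir("2013-2", 123456, "Smith"))
print(project_dir(1, "MyProject"))
print(is_valid_experiment_dir("/data/managed/experiments/r2013-2/e123456_Smith"))
```

Generating paths from one helper, rather than typing them by hand at each beamline, is one way to keep the convention embedded in regular operations.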