Code Analysis Repository and Modelling for E-Neuroscience An e-science virtual laboratory supporting collaboration in neurophysiology
Content
• Overview of CARMEN – What it is – Who it is for – What it consists of
• Current status
• Portal technology – To the extent that I know it!
• Ways forward and issues
CARMEN is a Big Data portal for neurophysiology
• An e-science virtual laboratory supporting collaboration in neurophysiology
• Runs on a cluster of machines at the University of York
• Portal based – that is, accessed over the web via a portal, from anywhere
• Enables uploading/downloading of data
• Running of services, including some display services
• Enables workflows.
Theoreticians and modellers need data and can provide predictions to experimentalists
Analysts provide the tools to statistically and mathematically describe the data
Experimentalists obtain original data to test hypotheses and derive new knowledge
CARMEN Objectives
• To create a grid-enabled ‘virtual laboratory’ environment for neurophysiological data (‘co-laboratory’, or ‘virtual research environment’)
• To develop an extensible, client-defined ‘toolkit’ for data extraction, analysis and modelling
• To provide a ‘repository’ for archiving, sharing, integration and discovery of data
• To demonstrate and sustain advances in neuroscience enabled by e-science technology
Newcastle
York Stirling
Leicester
Imperial
Manchester
Sheffield
Warwick
Plymouth
St Andrews
Cambridge
CARMEN Consortium
First two compute nodes (CAIRNS)
Collaborators in: Edinburgh; Berkeley; Washington; St. Louis; Aberdeen; Seoul; Pennsylvania; New York; Boston; Brazil
Access Portal through www.carmen.org.uk
Click on link to access the portal
Registration Open to all Academic Users
Automatic Registration for New Users
Organisation of Workspace
Workspace divided into a repository of resources (left-hand window) and information and activity (right-hand window)
Organisation of Data
Data you have permission to access are organised into ‘personal’, ‘shared’ and ‘public’ folders
Entering Data and Metadata
Upload of new data generates metadata forms. Pre-compiled templates speed up the upload process
Entering Data and Metadata
Expanding windows are used for collecting the metadata
Entering Data and Metadata
A file browser allows you to select files to upload to the platform
Setting Security Attributes
You have options to make data and metadata private, shared with collaborators, or public.
Setting Security Attributes
Typing part of a name, address or e-mail suggests possible registered users for sharing
Finding Resources in the Repository Search function using
multiple terms enables users to find appropriate resources
Viewing Metadata and Annotations
Metadata can be accessed to understand any resource, and those with permission can add annotations
Viewing Data Files
Time series data files can be viewed by launching the thick client tool ‘Signal Data Explorer’
Viewing Time-Series Data Files
Multiple views can be used to explore the data, including multielectrode array view
Feature Searching
Pattern matching function allows feature searching and averaging
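The idea behind pattern-matched feature searching can be sketched in a few lines: slide a template over the signal, score each window, and average the matching windows. This is an illustrative sketch only; the function names and the sum-of-squared-error score are my assumptions, not the Signal Data Explorer's actual algorithm.

```python
# Hypothetical sketch of template-based feature search and averaging.
# Scoring by sum of squared errors (SSE) is an assumption for illustration.

def find_features(signal, template, max_sse=1.0):
    """Return start indices of windows whose SSE against the template
    is at most max_sse."""
    n, m = len(signal), len(template)
    hits = []
    for start in range(n - m + 1):
        sse = sum((signal[start + i] - template[i]) ** 2 for i in range(m))
        if sse <= max_sse:
            hits.append(start)
    return hits

def average_features(signal, starts, width):
    """Element-wise average of the matched windows (e.g. spike averaging)."""
    windows = [signal[s:s + width] for s in starts]
    return [sum(col) / len(windows) for col in zip(*windows)]

signal = [0.0, 0.1, 1.0, 0.2, 0.0, 0.1, 0.9, 0.3, 0.0]
template = [0.1, 1.0, 0.2]
starts = find_features(signal, template, max_sse=0.05)  # two matching windows
```

A real tool would work on multi-channel data and use a normalised similarity measure, but the search-then-average structure is the same.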
Execute Services Services are uploaded and shared like data. Available services can be added
to ‘Favourites’ list in the resources.
Execute Services Associated metadata describe the function and required parameters
Set Service Parameters through Interface
Parameters are entered using an automatically generated entry form
Service Log A Service Log monitors progress
with executed services and shows results of prior services
Service Results Output
On completion results are written to the resources
directory and can be viewed
Common File Format (NDF) An internal file format (NDF) allows services to run across multiple data types. A service converts files from
original formats.
Workflow Software
NDF will enable services to be linked into more
complex workflows
Platform Development: CARMEN ‘Cloud’ (CAIRN)
[Architecture diagram: rich clients and a web portal connect through a security layer to a workflow enactment engine, a registry service and a repository. Behind these sit a compute cluster on which services are dynamically deployed, a structured metadata store enabling search and annotation, an analysis code store, and a raw and derived data store. Annotated functions: enactment of scientific analysis processes; security policies controlling access to data and code; search for data and analysis code; raw signal data search and visualisation.]
Metadata – The MINI Document
• Attempt to identify core information in 8 domains:
  1. Contact and context (4 terms)
  2. Study subject (16 terms)
  3. Recording location (5 terms)
  4. Task (4 terms)
  5. Stimulus (5 terms)
  6. Behavioural event (3 terms)
  7. Recording (6 terms)
  8. Time series data (3 terms)
• Most terms are user defined, but this will become fixed by a lexicon – work with INCF
Nature Precedings (2009) http://hdl.handle.net/10101/npre.2009.1720.2
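The domain structure above can be sketched as a simple record type: one group of free-text terms per domain. The field names in the example are my own illustrative assumptions; the actual term list is defined in the MINI document itself.

```python
# Minimal sketch of a MINI-style metadata record with the 8 domains
# listed above. Field names within domains are hypothetical examples.
MINI_DOMAINS = [
    "contact_and_context", "study_subject", "recording_location",
    "task", "stimulus", "behavioural_event", "recording",
    "time_series_data",
]

def new_mini_record():
    """Empty record: one dict of (currently user-defined) terms per domain."""
    return {domain: {} for domain in MINI_DOMAINS}

record = new_mini_record()
record["study_subject"]["species"] = "Rattus norvegicus"
record["recording"]["sampling_rate_hz"] = 25000
```

Fixing the terms via a lexicon, as planned with INCF, would amount to replacing the free-form dicts with a controlled vocabulary per domain.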
Data format issues
• Raw electrophysiology data comes from many different manufacturers and sources – Manufacturers: Multichannel Systems, Blackrock, Plexon, …, and some researchers build their own systems
– Proprietary data formats • Not open data formats!
– Some assistance with data conversion • Neuroshare DLLs (unidirectional: enables data interrogation)
• … and that’s just electrophysiology data: not including EEG data, for example
Data formats and services
• We want to run services – that is code that processes data to produce derived data – [and updates the metadata to show how the derived data was reached]
• We don’t want to have to write a new service for each possible data format. – So we need to either:
• Convert all the data into a fixed format, or • Write a set of data conversion services that enable each service to cope with all
the data formats – CARMEN went with the first option. – In fact we developed our own data format, NDF, which is not directly
HDF5 compatible • Services run on a machine hidden behind a portal
– What implications does this have for interactivity?
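The "convert everything once" choice can be sketched as a single conversion service that dispatches on the source format and always emits the same internal structure, so that every analysis service only ever sees one format. The format names, registry mechanism and field names below are illustrative assumptions, not CARMEN's actual conversion code.

```python
# Hypothetical sketch: one conversion service, many format readers,
# one common in-memory structure for all downstream services.

READERS = {}

def register(fmt):
    """Decorator: record a reader function for one vendor format."""
    def deco(fn):
        READERS[fmt] = fn
        return fn
    return deco

@register("vendor_a")
def read_vendor_a(raw):
    # pretend: parse vendor A's proprietary layout into the common structure
    return {"channels": raw["chans"], "rate_hz": raw["fs"], "samples": raw["data"]}

def to_common(fmt, raw):
    """The single conversion service: dispatch on the source format."""
    if fmt not in READERS:
        raise ValueError(f"no reader registered for {fmt!r}")
    return READERS[fmt](raw)

common = to_common("vendor_a", {"chans": 4, "fs": 25000, "data": [0, 1]})
```

The alternative (a conversion layer inside every service) multiplies the per-format work by the number of services, which is why converting once at upload is attractive.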
NDF (Neural Data Format) and HDF5
• NDF was developed by Bojian Liang at York University – At the time HDF5 was not directly available to us – HDF4 could not cope with large datasets
• Ours could be more than 20 Gbytes per experiment • If we started again we would use HDF5
– But actually, HDF5 doesn’t really solve the problem • HDF5 is only a mechanism for data storage • The precise format still needs to be defined
– And there needs to be an Application Programming Interface (API) to go with it …
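The point that a container format is not enough can be illustrated without HDF5 at all: even with the storage mechanism fixed, you still have to decide a layout and ship an accessor API. The sketch below uses plain dicts standing in for what would be HDF5 groups and datasets; all names are hypothetical, not the real NDF layout.

```python
# Sketch of an NDF-like layout plus its accessor API. In a real system
# the same three functions would sit on top of HDF5 groups/datasets;
# the container changes, the layout and API decisions remain.

def make_ndf(rate_hz, channel_names):
    """Fixed layout: a header block plus one sample list per channel."""
    return {
        "header": {"rate_hz": rate_hz, "channels": list(channel_names)},
        "data": {name: [] for name in channel_names},
    }

def append_samples(ndf, channel, samples):
    ndf["data"][channel].extend(samples)

def duration_s(ndf):
    """Recording length implied by the longest channel."""
    n = max((len(v) for v in ndf["data"].values()), default=0)
    return n / ndf["header"]["rate_hz"]
```

Everything an analysis service needs (rate, channel list, samples) is reachable only through the agreed layout, which is exactly the part HDF5 leaves open.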
Services and Workflows
• Services take in data, and produce new derived data, and/or displays from that data – Services are often parameterised: these are values that can be set which alter the precise behaviour of a service
• Workflows are concatenated sequences of services – Again, each service may have its own parameter set – Also, there may be loops in the workflow, and possibly decisions (like when to terminate) as well
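The service/workflow model above can be sketched as a list of (service, parameters) pairs applied in order, with an optional stopping decision for looped workflows. This is a minimal illustration under my own naming assumptions, not the CARMEN workflow engine.

```python
# Minimal sketch: a workflow is a sequence of parameterised services,
# optionally repeated until a stopping decision fires.

def run_workflow(data, steps, stop=lambda d: True, max_iters=10):
    """Apply each (service, params) pair in order; repeat until stop(data)
    is true or max_iters passes have run (guards against endless loops)."""
    for _ in range(max_iters):
        for service, params in steps:
            data = service(data, **params)
        if stop(data):
            break
    return data

# toy services, each with its own parameter set
def scale(xs, factor):
    return [x * factor for x in xs]

def clip(xs, limit):
    return [min(x, limit) for x in xs]

result = run_workflow(
    [1.0, 2.0, 3.0],
    steps=[(scale, {"factor": 2.0}), (clip, {"limit": 5.0})],
)  # one pass: scale then clip
```

A real enactment engine also records provenance (updating the metadata to show how derived data was reached), which this sketch omits.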
Issues
• Design of a portal-based access to Big Data – Who are the target clients? – What sort of user interface do they want/expect/are they willing to put up with? • What are they used to?
– Portal-based systems tend to have a higher latency/delay than local systems
• (Latency is the time between the last key stroke/mouse gesture and the beginning of a visible reaction)
• Because the latency is the sum of the latency of the local machine sending a message on the internet, plus the portal and server processing it, plus the round trip delay
• This can present a problem in acceptance for those who previously used only local systems
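The latency decomposition above can be written as a trivial back-of-envelope helper. The component values in the example are made-up illustrations, not measurements of CARMEN.

```python
# Perceived portal latency = local handling + network round trip
# + portal/server processing, per the decomposition above.
def portal_latency_ms(local_ms, network_rtt_ms, server_ms):
    return local_ms + network_rtt_ms + server_ms

# e.g. 10 ms locally + 80 ms round trip + 150 ms of server work,
# versus roughly the 10 ms a purely local tool would show
delay = portal_latency_ms(10, 80, 150)
```

Even with a fast server, the round-trip term puts a floor under the portal's responsiveness that a local tool never pays.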
Parameter entry for services
• Currently entering them one by one … – Even with a default
• … is clumsy and slow
• Is there a better design?
• Would a command-driven approach be better? – Even if rather old-fashioned?
• Is there a better way?
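One shape the command-driven alternative could take is a single typed line naming the service and its parameters, with defaults filled in automatically. The key=value syntax, service name and parameters below are my own assumptions for illustration, not an existing CARMEN interface.

```python
# Sketch of command-driven parameter entry using only the standard
# library: one line replaces a generated form of one-by-one fields.
import shlex

def parse_command(line, defaults=None):
    """'spikesort threshold=4.5 window=2' -> (service name, params dict).
    Defaults are applied first, then overridden by the typed values."""
    tokens = shlex.split(line)
    service, params = tokens[0], dict(defaults or {})
    for tok in tokens[1:]:
        key, _, value = tok.partition("=")
        params[key] = value
    return service, params

service, params = parse_command(
    "spikesort threshold=4.5 window=2",
    defaults={"window": "1"},
)
```

Old-fashioned or not, a one-line command is scriptable and repeatable, which form-based entry is not.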
Workflow creation is currently graphical
Workflows and GUIs
• Graphical workflow creation looks good – And sounds like the right way to go:
• But: for complex workflows – Like ones that cycle through a set of parameters in the different internal services
• … it is not easy to see how to create the appropriate graphical tools – What might be better?
Technologies in CARMEN
• The CARMEN Virtual Laboratory Architecture: – Three-tier
architecture
Web Portal: Google Web Toolkit
System management: Java Servlet Based
MySQL Database
CARMEN Architecture continued
Web Portal: Google Web Toolkit
System management: Java Servlet Based
MySQL Database
User's View
Services on Compute Servers
Service metadata, security info, asset identifiers
1st tier: Portal built using Ajax, GWT running in the browser 2nd tier: Back end built using Java, Servlets, C/C++, XML; runs across Linux and Windows 3rd tier: Storage: MySQL and HP X9000 storage system.
Service architecture
Service wrapping • Services are written by users in many languages:
– Matlab, R, Python, … • … and they need to be deployed • They need to respect the service architecture
– Getting parameters, reading and writing files, returning results, …
• … and to do this, they need to be wrapped. • Web services wrapping: turn each service into a web
service – Very general: enables remote services to be run – Heavyweight: using JAX-WS, added about
20 Mbytes/service, added 10-20 seconds to deployment
• Alternative: make them a simple loadable class – Simple and fast – But does imply that service deployment is local.
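The lightweight "loadable class" alternative amounts to a shared interface that every wrapped tool implements, so deployment is just local instantiation. The interface and class names below are illustrative assumptions, not CARMEN's wrapping code.

```python
# Sketch of the loadable-class wrapping: one contract for every
# analysis tool, and "deployment" is plain local instantiation.

class Service:
    """Contract each wrapped tool satisfies: a name and a run() method
    taking data plus keyword parameters."""
    name = "base"

    def run(self, data, **params):
        raise NotImplementedError

class Threshold(Service):
    """Toy wrapped tool: keep samples at or above a level."""
    name = "threshold"

    def run(self, data, level=0.0):
        return [x for x in data if x >= level]

def deploy(service_cls):
    """Simple and fast, but the class must be loadable on the portal's
    own machine - the trade-off against web-service wrapping."""
    return service_cls()

svc = deploy(Threshold)
out = svc.run([-1.0, 0.5, 2.0], level=0.0)
```

Compared with generating a ~20 Mbyte web service per tool, this adds essentially no deployment overhead, at the cost of losing remote execution.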
What have we learned?
• CARMEN is quite an old project, as Big Data projects go – Started in 2006
• What would we do differently – In response to the users – if we started again?
• Poster.
What’s happening now with CARMEN?
• Right now (next week): being demonstrated again at Society for Neuroscience meeting, Washington DC. – 30,000 delegates!
• Practical issues – Running out of funding in February 2015
• Very soon – Writing a Horizon 2020 proposal to join it together with other European “Big Data in Neuroscience” projects to create a large-scale Virtual Research Environment
CARMEN Consortium