Code Analysis Repository and Modelling for E-Neuroscience An e-science virtual laboratory supporting collaboration in neurophysiology
Content
• Overview of CARMEN – What it is – Who it is for – What it consists of
• Current status
• Portal technology – To the extent that I know it!
• Ways forward and issues
CARMEN is a Big Data portal for neurophysiology
• An e-science virtual laboratory supporting collaboration in neurophysiology
• Runs on a cluster of machines at the University of York
• Portal based – that is, accessed over the web via a portal, from anywhere
• Enables uploading/downloading of data
• Running of services, including some display services
• Enables workflows.
Theoreticians and modellers need data and can provide predictions to experimentalists
Analysts provide the tools to statistically and mathematically describe the data
Experimentalists obtain original data to test hypotheses and derive new knowledge
CARMEN Objectives
• To create a grid-enabled ‘virtual laboratory’ environment for neurophysiological data (‘co-laboratory’, or ‘virtual research environment’)
• To develop an extensible, client-defined ‘toolkit’ for data extraction, analysis and modelling
• To provide a ‘repository’ for archiving, sharing, integration and discovery of data
• To demonstrate and sustain advances in neuroscience enabled by e-science technology
Newcastle
York Stirling
Leicester
Imperial
Manchester
Sheffield
Warwick
Plymouth
St Andrews
Cambridge
CARMEN Consortium
First two compute nodes (CAIRNS)
Collaborators in: Edinburgh; Berkeley; Washington; St. Louis; Aberdeen; Seoul; Pennsylvania; New York; Boston; Brazil
Access Portal through www.carmen.org.uk
Click on link to access the portal
Registration Open to all Academic Users
Automatic Registration for New Users
Organisation of Workspace
Workspace divided into a repository of resources (left-hand window) and information and activity (right-hand window)
Organisation of Data
Data you have permission to access are organised into ‘personal’, ‘shared’ and ‘public’ folders
Entering Data and Metadata
Upload of new data generates metadata forms. Pre-compiled templates speed up the upload process
Entering Data and Metadata
Expanding windows are used for collecting the metadata
Entering Data and Metadata
A file browser allows you to select files to upload to the platform
Setting Security Attributes
You have options to make data and metadata private, shared with collaborators, or public.
Setting Security Attributes
Typing part of a name, address or e-mail suggests possible registered users for sharing
Finding Resources in the Repository Search function using
multiple terms enables users to find appropriate resources
Viewing Metadata and Annotations
Metadata can be accessed to understand any resource, and those with permission can add annotations
Viewing Data Files
Time series data files can be viewed by launching the thick client tool ‘Signal Data Explorer’
Viewing Time-Series Data Files
Multiple views can be used to explore the data, including multielectrode array view
Feature Searching
Pattern matching function allows feature searching and averaging
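The idea behind pattern-matched feature searching can be sketched in a few lines: slide a template over the signal, score each window, and average the matching windows. This is an illustrative sketch only; the function names and the sum-of-squared-error score are my assumptions, not the Signal Data Explorer's actual algorithm.

```python
# Hypothetical sketch of template-based feature search and averaging.
# Scoring by sum of squared errors (SSE) is an assumption for illustration.

def find_features(signal, template, max_sse=1.0):
    """Return start indices of windows whose SSE against the template
    is at most max_sse."""
    n, m = len(signal), len(template)
    hits = []
    for start in range(n - m + 1):
        sse = sum((signal[start + i] - template[i]) ** 2 for i in range(m))
        if sse <= max_sse:
            hits.append(start)
    return hits

def average_features(signal, starts, width):
    """Element-wise average of the matched windows (e.g. spike averaging)."""
    windows = [signal[s:s + width] for s in starts]
    return [sum(col) / len(windows) for col in zip(*windows)]

signal = [0.0, 0.1, 1.0, 0.2, 0.0, 0.1, 0.9, 0.3, 0.0]
template = [0.1, 1.0, 0.2]
starts = find_features(signal, template, max_sse=0.05)  # two matching windows
```

A real tool would work on multi-channel data and use a normalised similarity measure, but the search-then-average structure is the same.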
Execute Services Services are uploaded and shared like data. Available services can be added
to ‘Favourites’ list in the resources.
Execute Services Associated metadata describe the function and required parameters
Set Service Parameters through Interface
Parameters are entered using an automatically generated entry form
Service Log A Service Log monitors progress
with executed services and shows results of prior services
Service Results Output
On completion results are written to the resources
directory and can be viewed
Common File Format (NDF) An internal file format (NDF) allows services to run across multiple data types. A service converts files from
original formats.
Workflow Software
NDF will enable services to be linked into more
complex workflows
Platform Development: CARMEN ‘Cloud’ (CAIRN)
[Architecture diagram: rich clients and a web portal connect through a security layer to a workflow enactment engine, a registry service and a repository. Behind these sit a compute cluster on which services are dynamically deployed, a structured metadata store enabling search and annotation, an analysis code store, and a raw and derived data store. Annotated functions: enactment of scientific analysis processes; security policies controlling access to data and code; search for data and analysis code; raw signal data search and visualisation.]
Metadata – The MINI Document
• Attempt to identify core information in 8 domains:
  1. Contact and context (4 terms)
  2. Study subject (16 terms)
  3. Recording location (5 terms)
  4. Task (4 terms)
  5. Stimulus (5 terms)
  6. Behavioural event (3 terms)
  7. Recording (6 terms)
  8. Time series data (3 terms)
• Most terms are user defined, but this will become fixed by a lexicon – work with INCF
Nature Precedings (2009) http://hdl.handle.net/10101/npre.2009.1720.2
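The domain structure above can be sketched as a simple record type: one group of free-text terms per domain. The field names in the example are my own illustrative assumptions; the actual term list is defined in the MINI document itself.

```python
# Minimal sketch of a MINI-style metadata record with the 8 domains
# listed above. Field names within domains are hypothetical examples.
MINI_DOMAINS = [
    "contact_and_context", "study_subject", "recording_location",
    "task", "stimulus", "behavioural_event", "recording",
    "time_series_data",
]

def new_mini_record():
    """Empty record: one dict of (currently user-defined) terms per domain."""
    return {domain: {} for domain in MINI_DOMAINS}

record = new_mini_record()
record["study_subject"]["species"] = "Rattus norvegicus"
record["recording"]["sampling_rate_hz"] = 25000
```

Fixing the terms via a lexicon, as planned with INCF, would amount to replacing the free-form dicts with a controlled vocabulary per domain.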
Data format issues
• Raw electrophysiology data comes from many different manufacturers and sources – Manufacturers: Multichannel Systems, Blackrock, Plexon, …, and some researchers build their own systems
– Proprietary data formats • Not open data formats!
– Some assistance with data conversion • Neuroshare DLLs (unidirectional: enables data interrogation)
• … and that’s just electrophysiology data: not including EEG data, for example
Data formats and services
• We want to run services – that is code that processes data to produce derived data – [and updates the metadata to show how the derived data was reached]
• We don’t want to have to write a new service for each possible data format. – So we need to either:
• Convert all the data into a fixed format, or • Write a set of data conversion services that enable each service to cope with all
the data formats – CARMEN went with the first option. – In fact we developed our own data format, NDF, which is not directly
HDF5 compatible • Services run on a machine hidden behind a portal
– What implications does this have for interactivity?
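The "convert everything once" choice can be sketched as a single conversion service that dispatches on the source format and always emits the same internal structure, so that every analysis service only ever sees one format. The format names, registry mechanism and field names below are illustrative assumptions, not CARMEN's actual conversion code.

```python
# Hypothetical sketch: one conversion service, many format readers,
# one common in-memory structure for all downstream services.

READERS = {}

def register(fmt):
    """Decorator: record a reader function for one vendor format."""
    def deco(fn):
        READERS[fmt] = fn
        return fn
    return deco

@register("vendor_a")
def read_vendor_a(raw):
    # pretend: parse vendor A's proprietary layout into the common structure
    return {"channels": raw["chans"], "rate_hz": raw["fs"], "samples": raw["data"]}

def to_common(fmt, raw):
    """The single conversion service: dispatch on the source format."""
    if fmt not in READERS:
        raise ValueError(f"no reader registered for {fmt!r}")
    return READERS[fmt](raw)

common = to_common("vendor_a", {"chans": 4, "fs": 25000, "data": [0, 1]})
```

The alternative (a conversion layer inside every service) multiplies the per-format work by the number of services, which is why converting once at upload is attractive.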
NDF (Neural Data Format) and HDF5
• NDF was developed by Bojian Liang at York University – At the time HDF5 was not directly available to us – HDF4 could not cope with large datasets
• Ours could be more than 20 Gbytes per experiment • If we started again we would use HDF5
– But actually, HDF5 doesn’t really solve the problem • HDF5 is only a mechanism for data storage • The precise format still needs to be defined
– And there needs to be an Application Programming Interface (API) to go with it …
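The point that a container format is not enough can be illustrated without HDF5 at all: even with the storage mechanism fixed, you still have to decide a layout and ship an accessor API. The sketch below uses plain dicts standing in for what would be HDF5 groups and datasets; all names are hypothetical, not the real NDF layout.

```python
# Sketch of an NDF-like layout plus its accessor API. In a real system
# the same three functions would sit on top of HDF5 groups/datasets;
# the container changes, the layout and API decisions remain.

def make_ndf(rate_hz, channel_names):
    """Fixed layout: a header block plus one sample list per channel."""
    return {
        "header": {"rate_hz": rate_hz, "channels": list(channel_names)},
        "data": {name: [] for name in channel_names},
    }

def append_samples(ndf, channel, samples):
    ndf["data"][channel].extend(samples)

def duration_s(ndf):
    """Recording length implied by the longest channel."""
    n = max((len(v) for v in ndf["data"].values()), default=0)
    return n / ndf["header"]["rate_hz"]
```

Everything an analysis service needs (rate, channel list, samples) is reachable only through the agreed layout, which is exactly the part HDF5 leaves open.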
Services and Workflows
• Services take in data, and produce new derived data, and/or displays from that data – Services are often parameterised: these are values that can be set which alter the precise behaviour of a service
• Workflows are concatenated sequences of services – Again, each service may have its own parameter set – Also, there may be loops in the workflow, and possibly decisions (like when to terminate) as well
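The service/workflow model above can be sketched as a list of (service, parameters) pairs applied in order, with an optional stopping decision for looped workflows. This is a minimal illustration under my own naming assumptions, not the CARMEN workflow engine.

```python
# Minimal sketch: a workflow is a sequence of parameterised services,
# optionally repeated until a stopping decision fires.

def run_workflow(data, steps, stop=lambda d: True, max_iters=10):
    """Apply each (service, params) pair in order; repeat until stop(data)
    is true or max_iters passes have run (guards against endless loops)."""
    for _ in range(max_iters):
        for service, params in steps:
            data = service(data, **params)
        if stop(data):
            break
    return data

# toy services, each with its own parameter set
def scale(xs, factor):
    return [x * factor for x in xs]

def clip(xs, limit):
    return [min(x, limit) for x in xs]

result = run_workflow(
    [1.0, 2.0, 3.0],
    steps=[(scale, {"factor": 2.0}), (clip, {"limit": 5.0})],
)  # one pass: scale then clip
```

A real enactment engine also records provenance (updating the metadata to show how derived data was reached), which this sketch omits.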
Issues
• Design of a portal-based access to Big Data – Who are the target clients? – What sort of user interface do they want/expect/are they willing to put up with? • What are they used to?
– Portal-based systems tend to have a higher latency/delay than local systems
• (Latency is the time between the last key stroke/mouse gesture and the beginning of a visible reaction)
• Because the latency is the sum of the latency of the local machine sending a message on the internet, plus the portal and server processing it, plus the round trip delay
• This can present a problem in acceptance for those who previously used only local systems
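The latency decomposition above can be written as a trivial back-of-envelope helper. The component values in the example are made-up illustrations, not measurements of CARMEN.

```python
# Perceived portal latency = local handling + network round trip
# + portal/server processing, per the decomposition above.
def portal_latency_ms(local_ms, network_rtt_ms, server_ms):
    return local_ms + network_rtt_ms + server_ms

# e.g. 10 ms locally + 80 ms round trip + 150 ms of server work,
# versus roughly the 10 ms a purely local tool would show
delay = portal_latency_ms(10, 80, 150)
```

Even with a fast server, the round-trip term puts a floor under the portal's responsiveness that a local tool never pays.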
Parameter entry for services
• Currently entering them one by one … – Even with a default
• … is clumsy and slow
• Is there a better design?
• Would a command-driven approach be better? – Even if rather old-fashioned?
• Is there a better way?
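One shape the command-driven alternative could take is a single typed line naming the service and its parameters, with defaults filled in automatically. The key=value syntax, service name and parameters below are my own assumptions for illustration, not an existing CARMEN interface.

```python
# Sketch of command-driven parameter entry using only the standard
# library: one line replaces a generated form of one-by-one fields.
import shlex

def parse_command(line, defaults=None):
    """'spikesort threshold=4.5 window=2' -> (service name, params dict).
    Defaults are applied first, then overridden by the typed values."""
    tokens = shlex.split(line)
    service, params = tokens[0], dict(defaults or {})
    for tok in tokens[1:]:
        key, _, value = tok.partition("=")
        params[key] = value
    return service, params

service, params = parse_command(
    "spikesort threshold=4.5 window=2",
    defaults={"window": "1"},
)
```

Old-fashioned or not, a one-line command is scriptable and repeatable, which form-based entry is not.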
Workflow creation is currently graphical
Workflows and GUIs
• Graphical workflow creation looks good – And sounds like the right way to go:
• But: for complex workflows – Like ones that cycle through a set of parameters in the different internal services
• … it is not easy to see how to create the appropriate graphical tools – What might be better?
Technologies in CARMEN
• The CARMEN Virtual Laboratory Architecture: – Three-tier
architecture
Web Portal: Google Web Toolkit
System management: Java Servlet Based
MySQL Database
CARMEN Architecture continued
Web Portal: Google Web Toolkit
System management: Java Servlet Based
MySQL Database
User's View
Services on Compute Servers
Service metadata, security info, asset identifiers
1st tier: Portal built using Ajax, GWT running in the browser 2nd tier: Back end built using Java, Servlets, C/C++, XML; runs across Linux and Windows 3rd tier: Storage: MySQL and HP X9000 storage system.
Service architecture
Service wrapping • Services are written by users in many languages:
– Matlab, R, Python, … • … and they need to be deployed • They need to respect the service architecture
– Getting parameters, reading and writing files, returning results, …
• … and to do this, they need to be wrapped. • Web services wrapping: turn each service into a web
service – Very general: enables remote services to be run – Heavyweight: using JAX-WS, added about
20 Mbytes/service, added 10-20 seconds to deployment
• Alternative: make them a simple loadable class – Simple and fast – But does imply that service deployment is local.
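The lightweight "loadable class" alternative amounts to a shared interface that every wrapped tool implements, so deployment is just local instantiation. The interface and class names below are illustrative assumptions, not CARMEN's wrapping code.

```python
# Sketch of the loadable-class wrapping: one contract for every
# analysis tool, and "deployment" is plain local instantiation.

class Service:
    """Contract each wrapped tool satisfies: a name and a run() method
    taking data plus keyword parameters."""
    name = "base"

    def run(self, data, **params):
        raise NotImplementedError

class Threshold(Service):
    """Toy wrapped tool: keep samples at or above a level."""
    name = "threshold"

    def run(self, data, level=0.0):
        return [x for x in data if x >= level]

def deploy(service_cls):
    """Simple and fast, but the class must be loadable on the portal's
    own machine - the trade-off against web-service wrapping."""
    return service_cls()

svc = deploy(Threshold)
out = svc.run([-1.0, 0.5, 2.0], level=0.0)
```

Compared with generating a ~20 Mbyte web service per tool, this adds essentially no deployment overhead, at the cost of losing remote execution.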
What have we learned?
• CARMEN is quite an old project, as Big Data projects go – Started in 2006
• What would we do differently – In response to the users – if we started again?
• Poster.
What’s happening now with CARMEN?
• Right now (next week): being demonstrated again at Society for Neuroscience meeting, Washington DC. – 30,000 delegates!
• Practical issues – Running out of funding in February 2015
• Very soon – Writing a Horizon 2020 proposal to join it together with other European “Big Data in Neuroscience” projects to create a large-scale Virtual Research Environment
CARMEN Consortium