IDENTIFYING AND ACCESSING IMAGING FLOWCYTOBOT DATA AND IMAGERY Information architecture and prototype Joe Futrelle, Heidi Sosik Woods Hole Oceanographic Institution August 2011

Page 1: Identifying and accessing imaging  Flowcytobot data and imagery

IDENTIFYING AND ACCESSING IMAGING FLOWCYTOBOT DATA AND IMAGERY
Information architecture and prototype

Joe Futrelle, Heidi Sosik
Woods Hole Oceanographic Institution
August 2011

Page 2: Identifying and accessing imaging  Flowcytobot data and imagery

What / why / so?
• Goal: improve access to IFCB data
  • Consistent, unique identifiers for all important datasets
  • Standard representations of measurements and other metadata
  • Ability to cite and link to IFCB data in a variety of contexts
• What problem does this solve?
  • Current data access requires access to IFCB laboratory computers, a MATLAB license, ability to read MATLAB code, and advice from Heidi and/or Rob
• Who cares?
  • Users of IFCB data: get improved ability to use existing tools with IFCB data, and develop new tools using a variety of technologies
  • Heidi: gets new capabilities “for free” (more on that later)

Page 3: Identifying and accessing imaging  Flowcytobot data and imagery

Some informatics terminology
• An identifier is a short piece of text that identifies a digital object (e.g., “myfile.txt”)
• An identifier can be resolved to the digital object by software that uses the identifier to access the object (e.g., printing the file called “myfile.txt”)
• The scope (i.e., context) of an identifier is the set of conditions under which it can be resolved to one and only one object (e.g., a folder containing “myfile.txt”)
• The global scope is the international community. Any other scope is local.
• An identifier scheme is a format that specifies the syntax of identifiers (e.g., name + “.” + extension)

Page 4: Identifying and accessing imaging  Flowcytobot data and imagery

Some terminology (cont.)
• A namespace is an identifier for the scope of another identifier (e.g., “edu” is a namespace that makes “whoi.edu” distinct from “whoi.org”)
  • Namespaces are generally appended to names as a prefix or suffix
• Namespaces are generally used to transform local identifiers into global identifiers (e.g., via prefixing)

Page 5: Identifying and accessing imaging  Flowcytobot data and imagery

Global vs. Local identifier schemes

Local identifier schemes

• e.g., pathnames
• Non-standard
  • Dependent on current software used
  • Generally undocumented
• “Break” when data changes
• May “collide”
  • e.g., my “data.csv” is a different file than your “data.csv”

Global identifier schemes

• e.g., URL’s
• Standard
  • Specified by standards bodies
  • Exhaustively documented
• Data-independent
• Do not “collide”
  • e.g., I cannot replace a web page at a URL you control

Page 6: Identifying and accessing imaging  Flowcytobot data and imagery

IFCB data acquisition flow
• Seawater is sampled and forced through flow channel
• Photomultiplier triggers many frame grabs (“ROI’s” or “targets”)
• Data and imagery are written to a set of files
• At end of sample, files are closed and new ones are opened

[Diagram: ROI’s written to three files per sample]

.hdr
• Context
• Metadata

.adc
• Scattering data
• ROI metrics

.roi
• Raw image data

Page 7: Identifying and accessing imaging  Flowcytobot data and imagery

Imaging FlowCytobot existing ID

• Identifies a bin of observations, generally over an entire seawater sample
• Contains the instrument number and UTC date/time
• Used as a filename
• Local identifier
  • Non-standard scheme
  • Non-standard resolution mechanism
  • Scope = all existing IFCB deployments and software (not so far off from global scope)

IFCB1_2011_234_052230
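
The existing ID packs an instrument number and a UTC date/time into one string. A minimal sketch of unpacking it, assuming the layout is always `IFCB<n>_<year>_<day of year>_<HHMMSS>` (inferred from the single example on the slide; the function name `parse_bin_id` is hypothetical):

```python
import re
from datetime import datetime, timezone

# Pattern inferred from the one example ID on the slide; real IDs may vary.
BIN_ID = re.compile(r"^IFCB(\d+)_(\d{4})_(\d{3})_(\d{6})$")


def parse_bin_id(bin_id):
    """Split a bin identifier into (instrument number, UTC timestamp)."""
    match = BIN_ID.match(bin_id)
    if match is None:
        raise ValueError(f"not a recognized IFCB bin identifier: {bin_id!r}")
    instrument, year, day_of_year, hhmmss = match.groups()
    # %j parses the day of the year (001-366).
    stamp = datetime.strptime(f"{year} {day_of_year} {hhmmss}", "%Y %j %H%M%S")
    return int(instrument), stamp.replace(tzinfo=timezone.utc)


instrument, when = parse_bin_id("IFCB1_2011_234_052230")
print(instrument, when.isoformat())  # 1 2011-08-22T05:22:30+00:00
```

Day 234 of 2011 is August 22, which matches the date given for this bin on the next slide.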

Page 8: Identifying and accessing imaging  Flowcytobot data and imagery

Resolving IFCB identifiers to files

Component                 Meaning                         How to resolve
\\cheese.whoi.edu         Windows file server             Known a priori
\J_IFCB                   Windows share                   Known a priori
\ifcb_data_MVCO_jun06     MVCO time series                Known a priori
\IFCB1_2011_234           Data from August 22, 2011 UTC   Prefix of local identifier
\IFCB1_2011_234_052230    Bin @ 5:22:30 UTC               Local identifier
.roi                      Data type                       One of {hdr, adc, roi}

\\cheese.whoi.edu\J_IFCB\ifcb_data_MVCO_jun06\IFCB1_2011_234\IFCB1_2011_234_052230.roi
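
The table amounts to a recipe for building a pathname from a local identifier. A rough sketch of that recipe, assuming the three "known a priori" components are held as constants (the names `SERVER_SHARE`, `TIME_SERIES`, and `bin_path` are made up for illustration):

```python
# Components that must be "known a priori"; they change if the data moves.
SERVER_SHARE = r"\\cheese.whoi.edu\J_IFCB"
TIME_SERIES = "ifcb_data_MVCO_jun06"
DATA_TYPES = {"hdr", "adc", "roi"}


def bin_path(bin_id, data_type):
    """Build the file-server pathname for one file of a bin."""
    if data_type not in DATA_TYPES:
        raise ValueError(f"unknown data type: {data_type!r}")
    # The day directory is a prefix of the local ID: instrument, year, day.
    day_dir = "_".join(bin_id.split("_")[:3])
    return "\\".join([SERVER_SHARE, TIME_SERIES, day_dir, f"{bin_id}.{data_type}"])


print(bin_path("IFCB1_2011_234_052230", "roi"))
```

Only the last two components are derived from the identifier itself; everything else is location knowledge that lives outside the ID.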

Page 9: Identifying and accessing imaging  Flowcytobot data and imagery

Is this pathname a global ID? No

\\cheese.whoi.edu\J_IFCB\ifcb_data_MVCO_jun06\IFCB1_2011_234\IFCB1_2011_234_052230.roi

• It’s global, but it’s not a global identifier of an IFCB dataset; rather, it identifies a location on a file server

• If the files are moved to a different server, share, or directory, the pathnames will change but the dataset will not

• The .roi file represents the same dataset as the .adc and .hdr files, so those pathnames are different but do not identify a different dataset (uniqueness depends on exact matches, not partial matches)

Page 10: Identifying and accessing imaging  Flowcytobot data and imagery

Proposed global ID scheme

http://ifcb-data.whoi.edu/IFCB1_2011_234_052230

• Standard scheme (URL)
• Identifies a single instrument, single time bin
• Single ID per dataset (i.e., no extension)
• No “day’s worth of data” directory (redundant)
• Preserves existing local ID scheme (no need to generate new ID’s)
• Works unmodified as a web page URL, XML tag name, or RDF resource
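
Because the scheme keeps the existing local ID, converting between local and global forms is plain namespace prefixing. A small sketch of that idea (the helper names `to_global` and `to_local` are hypothetical):

```python
# The URL prefix acts as the namespace that makes a local ID global.
NAMESPACE = "http://ifcb-data.whoi.edu/"


def to_global(local_id):
    """Turn a local IFCB identifier into a global one by prefixing."""
    return NAMESPACE + local_id


def to_local(global_id):
    """Recover the local identifier from a global one."""
    if not global_id.startswith(NAMESPACE):
        raise ValueError(f"not in the IFCB namespace: {global_id!r}")
    return global_id[len(NAMESPACE):]


print(to_global("IFCB1_2011_234_052230"))
# http://ifcb-data.whoi.edu/IFCB1_2011_234_052230
```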

Page 11: Identifying and accessing imaging  Flowcytobot data and imagery

ID variant: a single ROI

http://ifcb-data.whoi.edu/IFCB1_2011_234_052230_00031

• Identifies a single observation (image + scattering data)

• Observations are numbered sequentially in a time bin

Page 12: Identifying and accessing imaging  Flowcytobot data and imagery

ID variant: a day’s worth of data

http://ifcb-data.whoi.edu/IFCB1_2011_234

• Prefix of existing identifiers
• Acts as a namespace for each bin in that day
• Note that the instrument number makes this per-instrument

Page 13: Identifying and accessing imaging  Flowcytobot data and imagery

ID variant: an instrument’s data

http://ifcb-data.whoi.edu/IFCB1

• All data from a given instrument
• Metadata about the instrument

Page 14: Identifying and accessing imaging  Flowcytobot data and imagery

ID variant: a formatted representation

http://ifcb-data.whoi.edu/IFCB1_2011_234_052230.xml

• Extension added to global ID
• Returns an XML representation of a bin’s worth of data
• Includes metadata and links to individual ROI’s contained in that bin
• Other formats available based on extension
  • HTML
  • RDF/XML (Resource Description Framework)
  • JSON (JavaScript Object Notation)
  • JPEG / TIFF / PNG / etc. for ROI images
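
The ID variants (instrument, day, bin, single ROI, optional format extension) nest inside one another, so a single pattern can tell them apart. A sketch under assumptions: field widths are taken from the example IDs on these slides (5-digit ROI number, 6-digit time), and the function name `classify` is hypothetical:

```python
import re

# Each optional group extends the previous one: instrument, then day,
# then time bin, then ROI number, then an optional format extension.
ID_PATTERN = re.compile(
    r"^IFCB(?P<instrument>\d+)"
    r"(?:_(?P<year>\d{4})_(?P<day>\d{3})"
    r"(?:_(?P<time>\d{6})"
    r"(?:_(?P<roi>\d{5}))?)?)?"
    r"(?:\.(?P<ext>[a-z]+))?$"
)


def classify(local_id):
    """Return (kind, extension) where kind is instrument, day, bin, or roi."""
    match = ID_PATTERN.match(local_id)
    if match is None:
        raise ValueError(f"not a recognized IFCB identifier: {local_id!r}")
    if match["roi"]:
        kind = "roi"
    elif match["time"]:
        kind = "bin"
    elif match["day"]:
        kind = "day"
    else:
        kind = "instrument"
    return kind, match["ext"]


for example in ("IFCB1", "IFCB1_2011_234", "IFCB1_2011_234_052230",
                "IFCB1_2011_234_052230_00031", "IFCB1_2011_234_052230.xml"):
    print(example, classify(example))
```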

Page 15: Identifying and accessing imaging  Flowcytobot data and imagery

Resolution of IFCB global ID’s

[Architecture diagram; labels recovered from the figure]
• Web Server (Apache) @ http://ifcb-data.whoi.edu
• mod_rewrite
• GID endpoint
• resolve.py?id=…
• path, format
• convert.py
• samba
• Windows file server @ \\cheese.whoi.edu
• memcached
• requested representation (XML, JSON, RDF, jpg, tiff)
• request / response
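
One way the labels in this diagram might fit together is sketched below. It is not the actual resolve.py: the function names, the default format, and the use of `lru_cache` as an in-process stand-in for memcached are all assumptions, and the sketch handles bin IDs only.

```python
from functools import lru_cache

# Formats named on the slides; the default for a bare ID is an assumption.
FORMATS = {"xml", "json", "rdf", "html", "jpg", "tiff", "png"}
DEFAULT_FORMAT = "html"


def split_request(url_path):
    """Split '/<id>.<ext>' into (id, format); a bare ID gets the default."""
    name = url_path.lstrip("/")
    local_id, dot, ext = name.rpartition(".")
    if dot and ext in FORMATS:
        return local_id, ext
    return name, DEFAULT_FORMAT


@lru_cache(maxsize=1024)  # stand-in for memcached in this sketch
def resolve(local_id):
    """Map a bin ID to the file-server location (without file extension)."""
    day_dir = "_".join(local_id.split("_")[:3])
    return "\\".join([r"\\cheese.whoi.edu\J_IFCB", "ifcb_data_MVCO_jun06",
                      day_dir, local_id])


def handle(url_path):
    """Produce the (path, format) pair a conversion step would consume."""
    local_id, fmt = split_request(url_path)
    return resolve(local_id), fmt


print(handle("/IFCB1_2011_234_052230.xml"))
```

Keeping resolution (where the files are) separate from conversion (what representation to return) means a move of the data only touches the resolution step.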

Page 16: Identifying and accessing imaging  Flowcytobot data and imagery

IFCB global ID resolution in action

Page 17: Identifying and accessing imaging  Flowcytobot data and imagery

Interoperability: RSS feed of live data

Page 18: Identifying and accessing imaging  Flowcytobot data and imagery

Interoperability: Android / iPhone

Page 19: Identifying and accessing imaging  Flowcytobot data and imagery

Approach: leave data alone
• Reuse as much of existing local ID scheme as possible
• “Wrap” with global ID resolution backed by format conversion service
• Do not require data to be reformatted and put in a repository for management
• If data moves, point the services at the new location
• If data format changes, tweak the format conversion service and reuse / extend provided representations
• Clients using the ID resolution and format conversion service (e.g., manual annotation tool TBD, image processing workflow TBD) will be unaffected

Page 20: Identifying and accessing imaging  Flowcytobot data and imagery

Roles of scientist vs. informaticist

Joe the informaticist
• Ask questions
• Co-develop documentation of data formats
• Develop ID scheme
• Develop resolution service
• Develop representations and format service

Heidi / Rob the scientists
• Answer questions
• Co-develop documentation of data formats
• Provide access to data
• Share existing data handling code
• Review ID scheme / representations

Page 21: Identifying and accessing imaging  Flowcytobot data and imagery

What did we just do?
• Created long-term, global identifiers for IFCB data
  • Citable
  • “Actionable” (Kunze) = live URL’s
  • Can continue to be used in metadata and digital preservation packages even if they are no longer live URL’s
• Prototyped services providing access to IFCB data in standard formats (XML, JSON, RDF)
  • Supports building web applications using HTML5
  • Supports web service data access workflow modules
  • Provides a way to align to standard vocabularies and ontologies
• And what is left to do (… on next slides)

Page 22: Identifying and accessing imaging  Flowcytobot data and imagery

Additional issues to address
• Timestamps only recorded in filenames (!)
  • Syringes with many ROI’s are split across multiple bins, and timestamp of observations in second bin must come from the filename of the first bin
• No way to identify time series that use more than one instrument
  • MVCO time series involves IFCB1 being occasionally swapped with IFCB5
• No way to identify deployments generally
  • IFCB1 could be moved to a different location to sample plankton as part of a non-MVCO study; there is no way in this scheme to figure out which data goes with which study

Page 23: Identifying and accessing imaging  Flowcytobot data and imagery

Next steps
• Clients!
  • Manual annotation prototype (using HTML5 / AJAX and JSON format conversion)
  • MATLAB (retrofit existing code to use global ID’s)
  • Kepler (already supports fetching data from web services)
• Improving next-generation IFCB’s data acquisition
  • Modify on-instrument code
  • Include timestamp in data (not just filenames)
  • Use ISO 8601 standard time formats
  • Generate column headers on CSV data
  • Record units of measure where appropriate
  • Align terms in IFCB data (e.g., “temperature”) with standard terms where appropriate
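
Several of the data acquisition items (timestamp in the data, ISO 8601, column headers, units) can be illustrated together. The sketch below is not the instrument's actual output format: the column names, the unit suffix convention, and the sample row are all invented for illustration.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical column names; the unit is recorded in the header itself.
COLUMNS = ["timestamp", "roi_number", "temperature_degC"]


def write_rows(rows, out):
    """Write a header row, then one row per observation with an ISO 8601 time."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(COLUMNS)
    for when, roi_number, temperature in rows:
        writer.writerow([when.isoformat(), roi_number, temperature])


# Made-up sample values, not real IFCB measurements.
sample = [(datetime(2011, 8, 22, 5, 22, 30, tzinfo=timezone.utc), 31, 18.5)]
buffer = io.StringIO()
write_rows(sample, buffer)
print(buffer.getvalue())
```

With the timestamp in each row, observations no longer depend on the filename of the bin (or of a previous bin) for their time.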