Identifying and Accessing Imaging FlowCytobot Data and Imagery
Information architecture and prototype
Joe Futrelle, Heidi Sosik
Woods Hole Oceanographic Institution
August 2011
What / why / so?
• Goal: improve access to IFCB data
  • Consistent, unique identifiers for all important datasets
  • Standard representations of measurements and other metadata
  • Ability to cite and link to IFCB data in a variety of contexts
• What problem does this solve?
  • Current data access requires access to IFCB laboratory computers, a Matlab license, ability to read Matlab code, and advice from Heidi and/or Rob
• Who cares?
  • Users of IFCB data: get improved ability to use existing tools with IFCB data, and develop new tools using a variety of technologies
  • Heidi: gets new capabilities “for free” (more on that later)
Some informatics terminology
• An identifier is a short piece of text that identifies a digital object (e.g., “myfile.txt”)
• An identifier can be resolved to the digital object by software that uses the identifier to access the object (e.g., printing the file called “myfile.txt”)
• The scope (i.e., context) of an identifier is the set of conditions under which it can be resolved to one and only one object (e.g., a folder containing “myfile.txt”)
• The global scope is the international community. Any other scope is local.
• An identifier scheme is a format that specifies the syntax of identifiers (e.g., name + “.” + extension)
Some terminology (cont.)
• A namespace is an identifier for the scope of another identifier (e.g., “edu” is a namespace that makes “whoi.edu” distinct from “whoi.org”)
• Namespaces are generally appended to names as a prefix or suffix
• Namespaces are generally used to transform local identifiers into global identifiers (e.g., via prefixing)
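To make the prefixing idea concrete, here is a minimal Python sketch. The function name `to_global` and both example namespaces are made up for illustration; the point is only that the same local name stops colliding once each copy is qualified by its own namespace.

```python
def to_global(namespace: str, local_id: str) -> str:
    """Prefix a local identifier with a namespace (hypothetical helper)."""
    return namespace.rstrip("/") + "/" + local_id


# Both namespaces below are invented examples, not real services.
mine = to_global("http://example.org/alice", "data.csv")
yours = to_global("http://example.org/bob", "data.csv")

print(mine)
print(yours)
print(mine != yours)  # the local names match, the global identifiers do not
```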
Global vs. Local identifier schemes
Local identifier schemes
• e.g., pathnames
• Non-standard
  • Dependent on current software used
• Generally undocumented
• “Break” when data changes
• May “collide”
  • e.g., my “data.csv” is a different file than your “data.csv”
Global identifier schemes
• e.g., URL’s
• Standard
  • Specified by standards bodies
• Exhaustively documented
• Data-independent
• Do not “collide”
  • e.g., I cannot replace a web page at a URL you control
IFCB data acquisition flow
• Seawater is sampled and forced through a flow channel
• Photomultiplier triggers many frame grabs (“ROI’s” or “targets”)
• Data and imagery are written to a set of files
• At end of sample, files are closed and new ones are opened
[Figure: ROI’s written to three files]
• .hdr: context, metadata
• .adc: scattering data, ROI metrics
• .roi: raw image data
Imaging FlowCytobot existing ID
• Identifies a bin of observations, generally over an entire seawater sample
• Contains the instrument number and UTC date/time
• Used as a filename
• Local identifier
  • Non-standard scheme
  • Non-standard resolution mechanism
  • Scope = all existing IFCB deployments and software (not so far off from global scope)
IFCB1_2011_234_052230
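As an illustration of what the existing ID encodes, the sketch below pulls the instrument number and UTC timestamp out of an identifier. The layout (instrument, year, day of year, HHMMSS) is inferred from the single example on this slide, so treat the pattern as an assumption; `parse_bin_id` is a hypothetical name.

```python
import re
from datetime import datetime, timezone
from typing import Tuple

# Assumed layout, inferred from the one example identifier:
# IFCB<instrument>_<year>_<day of year>_<HHMMSS>
BIN_ID = re.compile(r"^IFCB(\d+)_(\d{4})_(\d{3})_(\d{6})$")


def parse_bin_id(bin_id: str) -> Tuple[int, datetime]:
    """Return (instrument number, UTC timestamp) for a bin identifier."""
    match = BIN_ID.match(bin_id)
    if match is None:
        raise ValueError(f"not a recognized IFCB bin identifier: {bin_id!r}")
    instrument, year, day_of_year, hhmmss = match.groups()
    # %j is the day of the year (001-366)
    naive = datetime.strptime(f"{year} {day_of_year} {hhmmss}", "%Y %j %H%M%S")
    return int(instrument), naive.replace(tzinfo=timezone.utc)


instrument, timestamp = parse_bin_id("IFCB1_2011_234_052230")
print(instrument, timestamp.isoformat())  # 1 2011-08-22T05:22:30+00:00
```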
Resolving IFCB identifiers to files

Component                 Meaning                          How to resolve
\\cheese.whoi.edu         Windows file server              Known a priori
\J_IFCB                   Windows share                    Known a priori
\ifcb_data_MVCO_jun06     MVCO time series                 Known a priori
\IFCB1_2011_234           Data from August 22, 2011 UTC    Prefix of local identifier
\IFCB1_2011_234_052230    Bin @ 5:22:30 UTC                Local identifier
.roi                      Data type                        One of {hdr, adc, roi}

\\cheese.whoi.edu\J_IFCB\ifcb_data_MVCO_jun06\IFCB1_2011_234\IFCB1_2011_234_052230.roi
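The resolution steps above can be sketched in a few lines of Python. This is not the laboratory's actual code: `bin_path` is a hypothetical helper, and the three constants simply restate the components listed as "known a priori", which is exactly the knowledge a caller currently has to bring along.

```python
from pathlib import PureWindowsPath

# Components listed as "known a priori"; all three change if the data moves.
SERVER = "cheese.whoi.edu"
SHARE = "J_IFCB"
TIME_SERIES = "ifcb_data_MVCO_jun06"
DATA_TYPES = {"hdr", "adc", "roi"}


def bin_path(local_id: str, data_type: str) -> PureWindowsPath:
    """Build the file-server path for one file of a bin (hypothetical helper)."""
    if data_type not in DATA_TYPES:
        raise ValueError(f"unknown data type: {data_type!r}")
    # The day directory is the local identifier minus its trailing time field.
    day_dir = local_id.rsplit("_", 1)[0]
    root = PureWindowsPath(f"//{SERVER}/{SHARE}/")
    return root / TIME_SERIES / day_dir / f"{local_id}.{data_type}"


print(bin_path("IFCB1_2011_234_052230", "roi"))
```

Note how much of the result comes from constants rather than from the identifier itself; only the last two path components are derived from the ID.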
Is this pathname a global ID? No
\\cheese.whoi.edu\J_IFCB\ifcb_data_MVCO_jun06\IFCB1_2011_234\IFCB1_2011_234_052230.roi
• It’s global, but it’s not a global identifier of an IFCB dataset; rather, it identifies a location on a file server
• If the files are moved to a different server, share, or directory, the pathnames will change but the dataset will not
• The .roi file represents the same dataset as the .adc and .hdr files, so those pathnames are different but do not identify a different dataset (uniqueness depends on exact matches, not partial matches)
Proposed global ID scheme
http://ifcb-data.whoi.edu/IFCB1_2011_234_052230
• Standard scheme (URL)
• Identifies a single instrument, single time bin
• Single ID per dataset (i.e., no extension)
• No “day’s worth of data” directory (redundant)
• Preserves existing local ID scheme (no need to generate new ID’s)
• Works unmodified as a web page URL, XML tag name, or RDF resource
ID variant: a single ROI
http://ifcb-data.whoi.edu/IFCB1_2011_234_052230_00031
• Identifies a single observation (image + scattering data)
• Observations are numbered sequentially in a time bin
ID variant: a day’s worth of data
http://ifcb-data.whoi.edu/IFCB1_2011_234
• Prefix of existing identifiers
• Acts as a namespace for each bin in that day
• Note that the instrument number makes this per-instrument
ID variant: an instrument’s data
http://ifcb-data.whoi.edu/IFCB1
• All data from a given instrument
• Metadata about the instrument
ID variant: a formatted representation
http://ifcb-data.whoi.edu/IFCB1_2011_234_052230.xml
• Extension added to global ID
• Returns an XML representation of a bin’s worth of data
• Includes metadata and links to individual ROI’s contained in that bin
• Other formats available based on extension
  • HTML
  • RDF/XML (Resource Description Framework)
  • JSON (JavaScript Object Notation)
  • JPEG / TIFF / PNG / etc. for ROI images
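The ID variants (instrument, day, bin, single ROI, each with an optional format extension) all share one prefix structure, so a single pattern can tell them apart. The sketch below is one way to do that, assuming the field widths seen in the example URLs; `describe` and the `kind` labels are hypothetical names, not part of the prototype.

```python
import re
from typing import Dict, Optional
from urllib.parse import urlparse

# Each nested optional group adds one level of detail:
# instrument -> day -> bin (time) -> ROI number, then an optional extension.
PATTERN = re.compile(
    r"^IFCB(?P<instrument>\d+)"
    r"(?:_(?P<year>\d{4})_(?P<day>\d{3})"
    r"(?:_(?P<time>\d{6})"
    r"(?:_(?P<roi>\d+))?)?)?"
    r"(?:\.(?P<format>[A-Za-z0-9]+))?$"
)


def describe(global_id: str) -> Dict[str, Optional[str]]:
    """Classify a global ID and split it into its fields (hypothetical helper)."""
    name = urlparse(global_id).path.lstrip("/")
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognized IFCB global ID: {global_id!r}")
    parts = match.groupdict()
    if parts["roi"] is not None:
        kind = "roi"
    elif parts["time"] is not None:
        kind = "bin"
    elif parts["day"] is not None:
        kind = "day"
    else:
        kind = "instrument"
    return {"kind": kind, **parts}


for url in (
    "http://ifcb-data.whoi.edu/IFCB1",
    "http://ifcb-data.whoi.edu/IFCB1_2011_234",
    "http://ifcb-data.whoi.edu/IFCB1_2011_234_052230.xml",
    "http://ifcb-data.whoi.edu/IFCB1_2011_234_052230_00031",
):
    info = describe(url)
    print(info["kind"], info["format"])
```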
Resolution of IFCB global ID’s
[Diagram: resolution of a global ID. Labels: request / response; Web Server (Apache) @ http://ifcb-data.whoi.edu with mod_rewrite; GID endpoint; resolve.py?id=…; path, format; convert.py; samba; Windows file server @ \\cheese.whoi.edu; memcached; requested representation (XML, JSON, RDF, jpg, tiff)]
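The format-conversion step can be pictured as a lookup from the requested format to a serializer. The following is only a sketch of that idea, not the prototype's convert.py: the record layout, the XML element names (`bin`, `roi`), and the function names are all assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET
from typing import Callable, Dict


def to_json(record: dict) -> str:
    return json.dumps(record, sort_keys=True)


def to_xml(record: dict) -> str:
    # Element and attribute names are invented for this sketch.
    root = ET.Element("bin", id=record["id"])
    for roi_id in record["rois"]:
        ET.SubElement(root, "roi", id=roi_id)
    return ET.tostring(root, encoding="unicode")


# Supporting a new representation means adding one entry here.
CONVERTERS: Dict[str, Callable[[dict], str]] = {"json": to_json, "xml": to_xml}


def convert(record: dict, fmt: str) -> str:
    """Serialize a bin record in the requested format."""
    try:
        converter = CONVERTERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt!r}") from None
    return converter(record)


record = {
    "id": "IFCB1_2011_234_052230",
    "rois": ["IFCB1_2011_234_052230_00031"],
}
print(convert(record, "json"))
print(convert(record, "xml"))
```

Keeping the serializers behind one lookup table means a change in the underlying data format touches only the code that builds the record, not the clients requesting representations.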
IFCB global ID resolution in action
Interoperability: RSS feed of live data
Interoperability: Android / iPhone
Approach: leave data alone
• Reuse as much of existing local ID scheme as possible
• “Wrap” with global ID resolution backed by format conversion service
• Do not require data to be reformatted and put in a repository for management
• If data moves, point the services at the new location
• If data format changes, tweak the format conversion service and reuse / extend provided representations
• Clients using the ID resolution and format conversion service (e.g., manual annotation tool TBD, image processing workflow TBD) will be unaffected
Roles of scientist vs. informaticist
Joe the informaticist
• Ask questions
• Co-develop documentation of data formats
• Develop ID scheme
• Develop resolution service
• Develop representations and format service
Heidi / Rob the scientists
• Answer questions
• Co-develop documentation of data formats
• Provide access to data
• Share existing data handling code
• Review ID scheme / representations
What did we just do?
• Created long-term, global identifiers for IFCB data
  • Citable
  • “Actionable” (Kunze) = live URL’s
  • Can continue to be used in metadata and digital preservation packages even if they are no longer live URL’s
• Prototyped services providing access to IFCB data in standard formats (XML, JSON, RDF)
  • Supports building web applications using HTML5
  • Supports web service data access workflow modules
  • Provides a way to align to standard vocabularies and ontologies
• And what is left to do (… on next slides)
Additional issues to address
• Timestamps only recorded in filenames (!)
  • Syringes with many ROI’s are split across multiple bins, and timestamp of observations in second bin must come from the filename of the first bin
• No way to identify time series that use more than one instrument
  • MVCO time series involves IFCB1 being occasionally swapped with IFCB5
• No way to identify deployments generally
  • IFCB1 could be moved to a different location to sample plankton as part of a non-MVCO study; there is no way in this scheme to figure out which data goes with which study
Next steps
• Clients!
  • Manual annotation prototype (using HTML5 / AJAX and JSON format conversion)
  • MATLAB (retrofit existing code to use global ID’s)
  • Kepler (already supports fetching data from web services)
• Improving next-generation IFCB’s data acquisition
  • Modify on-instrument code
  • Include timestamp in data (not just filenames)
  • Use ISO 8601 standard time formats
  • Generate column headers on CSV data
  • Record units of measure where appropriate
  • Align terms in IFCB data (e.g., “temperature”) with standard terms where appropriate
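Several of the acquisition improvements listed above (timestamps in the data, ISO 8601, column headers, units) can be shown together in one short Python sketch. The column names, the unit suffix convention, and the sample values are all invented for illustration; they are not the IFCB file format.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical columns; the unit is carried in the header name.
FIELDS = ["timestamp", "roi_number", "temperature_degC"]


def write_rows(rows, out) -> None:
    """Write rows as CSV with a header line and ISO 8601 timestamps."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "timestamp": row["timestamp"].isoformat()})


# Made-up sample row, not real IFCB data.
sample = [
    {
        "timestamp": datetime(2011, 8, 22, 5, 22, 30, tzinfo=timezone.utc),
        "roi_number": 31,
        "temperature_degC": 18.5,
    }
]

buffer = io.StringIO()
write_rows(sample, buffer)
print(buffer.getvalue())
```

With the timestamp stored in each row, a reader no longer needs the filename to recover when an observation was made.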