National Center for Supercomputing Applications
University of Illinois at Urbana–Champaign
NCSA Brown Dog
DATA TRANSFORMATION SERVICE
Smruti Padhy, Ph.D.
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
2016 NAGARA ANNUAL CONFERENCE
July 15, 2016, Lansing, MI
The Problem
• Large collections of unstructured and/or uncurated
digital data
• No textual content – scanned handwritten documents,
satellite/aerial photos, tabular data, historic maps, audio,
video
• No consistent or useful naming of files/directories
• No metadata
• Many data types and hundreds of file formats
• Variety of existing software for access and analysis
• Short life span of digital data and software
What we need
A system that
• Enables access to data contents irrespective of file
formats
• Extracts metadata from data content and does
automatic curation – enabling indexing based on the
content and making it searchable
• Uses existing conversion/extraction/data analysis tools
• Is extensible – easily add new tools
• Is dynamically scalable
• Is easy to use – provide uniform interface
Brown Dog Project
• NSF Data Infrastructure Building Blocks (DIBBs) Award
to NCSA ($10.5M, 2013-2018)
• Collaboration between NCSA, University of Illinois at
Urbana-Champaign, University of Maryland, Boston
University, Southern Methodist University
• Leverages prior NCSA/NARA and UMD/NARA funded
work
Brown Dog – Data Transformation Service
• File Format Conversions
• File in, File out
• Example – image to image/text/PDF formats, Autodesk's DXF to
(svg, jpg, png, tif, pdf, XML), video/audio formats (avi, flv, wav, mp3,
mp4) to another video/audio format
• Can chain multiple conversion steps
• Metadata Extractions
• Extraction of metadata, signatures or derived products from a file’s
content, tags, previews
• File in, JSON out
• Example – Face extraction from image, text extraction using OCR,
table from pdf, extraction of rivers from historical river maps,
previews
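The two service shapes above ("file in, file out" and "file in, JSON out") can be sketched in a few lines of Python. This is only an illustration of the interface contract, not Brown Dog's API: the function names `convert` and `extract` are hypothetical, the "conversion" is a placeholder byte copy, and the metadata fields are assumed for the example.

```python
import json
import tempfile
from pathlib import Path


def convert(src: Path, target_ext: str) -> Path:
    """File in, file out. Hypothetical stand-in for a real converter."""
    dst = src.with_suffix("." + target_ext)
    # Placeholder only: a real tool would re-encode the content here.
    dst.write_bytes(src.read_bytes())
    return dst


def extract(src: Path) -> str:
    """File in, JSON out. Returns basic file-level metadata as a JSON string."""
    data = src.read_bytes()
    metadata = {
        "filename": src.name,
        "extension": src.suffix.lstrip("."),
        "size_bytes": len(data),
    }
    return json.dumps(metadata)


# Small demonstration on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "scan.txt"
    source.write_text("hello")
    converted_name = convert(source, "bak").name
    metadata = json.loads(extract(source))
```

The point of the sketch is the uniform contract: every conversion returns a file, and every extraction returns JSON, regardless of which tool does the work.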
Brown Dog – Data Transformation Service
• Tools Catalog (TC)
• Allows new conversion/extraction tools to be added to the Brown Dog
Service
• Highly extensible set of services
• Maintains Provenance
• Auto scaling
• Information loss (minimized when choosing conversion paths)
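As a rough illustration of what a catalog of tools might look like, here is a minimal in-memory registry. Everything in it is an assumption made for the example: the `Tool` and `ToolsCatalog` names, the fields, and the format lists are hypothetical and do not describe the actual Tools Catalog implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    kind: str              # "conversion" or "extraction"
    inputs: frozenset      # accepted input formats (illustrative)
    outputs: frozenset     # produced output formats (illustrative)


class ToolsCatalog:
    """Hypothetical registry: add tools, then look them up by kind and input."""

    def __init__(self):
        self._tools = {}

    def add(self, tool: Tool) -> None:
        if tool.name in self._tools:
            raise ValueError(f"tool already registered: {tool.name}")
        self._tools[tool.name] = tool

    def find(self, kind: str, input_format: str) -> list:
        """Return the sorted names of tools of `kind` that accept `input_format`."""
        return sorted(
            t.name
            for t in self._tools.values()
            if t.kind == kind and input_format in t.inputs
        )


catalog = ToolsCatalog()
catalog.add(Tool("ImageMagick", "conversion",
                 frozenset({"jpg", "png"}), frozenset({"png", "tif", "pdf"})))
catalog.add(Tool("Tesseract", "extraction",
                 frozenset({"png", "tif"}), frozenset({"json"})))

png_extractors = catalog.find("extraction", "png")
png_converters = catalog.find("conversion", "png")
```

Adding a tool is a single `add` call, which is the kind of extensibility the slide describes.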
An Example – Conversion and Metadata Extraction
As seen in the Clowder Web Interface
[Screenshot: OpenCV Face Extractor]
An Example – Conversion and Metadata Extraction
As seen in the Clowder Web Interface
[Screenshot: Tesseract OCR Extractor and Conversion Options]
Use Cases Across Many Disciplines
• Biology
• Ecology
• Civil and Environmental Engineering
• Library and Information Science
• Social Science
Let's See Some Examples…
From 19th-Century Digitized Historic River
Maps to the Study of River Meanders
To see the demo and contributors:
http://tinyurl.com/browndog-clients
Census Data
• Extraction of the handwritten contents from each cell of
digitized 1930s census data
Water Bodies Detection from Aerial Photos
To see the demo and contributors:
http://tinyurl.com/browndog-clients
File Manager Extension
• Brown Dog Conversion Service as part of the Windows
Explorer right-click menu
Google’s Chrome Browser Extension
To see the demo:
http://tinyurl.com/browndog-clients
Other Examples …
• Video Analysis Tableau – Cinemetrics Extractors
• Person Tracking from Videos
• Extraction of tables/graphs from PDFs when the source data is not
available
Software Behind …
• Conversion – Polyglot
• Wraps third-party software, scripts, and libraries developed by
researchers
• Intelligently selects a conversion path: can chain different software to
minimize conversion hops and information loss
• ImageMagick, Kabeja, Daffodil, FFmpeg, GDAL, etc.
• Metadata Extraction - Clowder
• Wraps third-party content analysis/extraction software or scripts
developed by researchers
• NCSA Versus, Siegfried, FITS, Tesseract, OpenCV, Stanford NLP,
ArcGIS, etc.
• Provenance Maintenance – DataWolf
• Scalability – Elasticity Module, Cloud (NCSA Nebula), HPC
(XSEDE resource)
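The idea of chaining tools while minimizing conversion hops can be sketched as a shortest-path search over a graph whose nodes are file formats and whose edges are tools. The sketch below uses a plain breadth-first search as a stand-in; it is not Polyglot's actual algorithm, and the `EDGES` list and the function name `shortest_conversion_path` are illustrative assumptions rather than the real capabilities of the named tools.

```python
from collections import deque

# (input format, output format, tool) triples; illustrative only.
EDGES = [
    ("dxf", "svg", "Kabeja"),
    ("svg", "png", "ImageMagick"),
    ("svg", "tif", "ImageMagick"),
    ("png", "tif", "ImageMagick"),
    ("avi", "mp4", "ffmpeg"),
]


def shortest_conversion_path(edges, source, target):
    """Return the fewest-hop list of (src, dst, tool) steps, or None."""
    graph = {}
    for src, dst, tool in edges:
        graph.setdefault(src, []).append((dst, tool))

    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, path = queue.popleft()
        if fmt == target:
            return path
        for nxt, tool in graph.get(fmt, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(fmt, nxt, tool)]))
    return None  # no chain of tools reaches the target format


path = shortest_conversion_path(EDGES, "dxf", "tif")
unreachable = shortest_conversion_path(EDGES, "dxf", "mp4")
```

Breadth-first search treats every hop as equally costly. To account for information loss as well, one could attach a weight to each edge and use a weighted shortest-path search instead.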
Brown Dog Service Demo
Code Snippets
Tools Catalog
Add a Tool
Summary
• Huge diversity in data and analysis tools
• Programmable Interface – various client applications
• Automatically scales up/down
• Place to preserve/reuse software/tools
• Integrates with scientific workflow system
• Reusable modules
Brown Dog Services – Software Components,
Cloud/HPC Resources
[Diagram: software components, including Polyglot, Versus, and Daffodil]
Project website:
http://browndog.ncsa.illinois.edu/
Acknowledgements
• The Brown Dog team and collaborators
• This material is based upon work supported by the
National Science Foundation under Grant Number NSF
ACI-1261582: “CIF21 DIBBs: Brown Dog”.
• Any opinions, findings, and conclusions or
recommendations expressed in this material are those of
the author(s) and do not necessarily reflect the views of
the National Science Foundation.
• For more information:
• Project website: http://browndog.ncsa.illinois.edu/
• Brown Dog Clients:
https://www.youtube.com/watch?v=MvaHQKT3BPQ
Thank You
Questions?
@NCSABrownDog