
National Center for Supercomputing Applications

University of Illinois at Urbana–Champaign

NCSA Brown Dog

DATA TRANSFORMATION SERVICE

Smruti Padhy, Ph.D.

2016 NAGARA ANNUAL CONFERENCE

July 15, 2016, Lansing, MI

The Problem

• Large collections of unstructured and/or un-curated digital data

• No textual content – Scanned Handwritten Documents, Satellite/Aerial Photos, Tabular data, Historic Maps, Audio, Videos

• No consistent or useful naming of files/directories

• No metadata

• Many data types and hundreds of file formats

• Variety of existing software for access and analysis

• Short life span of digital data and software

What we need

A system that

• Enables access to data contents irrespective of file formats

• Extracts metadata from data content and does automatic curation – enabling indexing based on the content and making it searchable

• Uses existing conversion/extraction/data analysis tools

• Is extensible – new tools are easy to add

• Is dynamically scalable

• Is easy to use – provides a uniform interface

Brown Dog Project

• NSF Data Infrastructure Building Blocks (DIBBs) Award to NCSA ($10.5M, 2013–2018)

• Collaboration between NCSA, University of Illinois at Urbana-Champaign, University of Maryland, Boston University, Southern Methodist University

• Leverages prior NCSA/NARA- and UMD/NARA-funded work

Brown Dog – Data Transformation Service

• File Format Conversions

• File in, File out

• Example – Image to (Image/text/pdf) format, AutoDesk’s DXF to (svg, jpg, png, tif, pdf, XML), Video/audio formats (avi, flv, wav, mp3, mp4) to another video/audio format

• Can use multiple conversion steps

• Metadata Extractions

• Extraction of metadata, signatures or derived products from a file’s content, tags, previews

• File in, JSON out

• Example – Face extraction from image, text extraction using OCR, table from pdf, extraction of rivers from historical river maps, previews
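The two contracts above ("File in, File out" and "File in, JSON out") can be sketched in a few lines of Python. This is only an illustration using the standard library, not Brown Dog code: the function names convert_csv_to_tsv and extract_metadata, the CSV-to-TSV conversion, and the fields of the JSON record are all assumptions made for the example.

```python
import csv
import hashlib
import json
import os
import tempfile


def convert_csv_to_tsv(src_path, dst_path):
    """Conversion contract (hypothetical example): file in, file out."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        for row in csv.reader(src):
            writer.writerow(row)
    return dst_path


def extract_metadata(path):
    """Extraction contract (hypothetical example): file in, JSON out."""
    with open(path, "rb") as f:
        data = f.read()
    record = {
        "filename": os.path.basename(path),
        "size_bytes": len(data),
        # A simple content signature: SHA-256 digest of the raw bytes.
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "table.csv")
        with open(src, "w", newline="") as f:
            f.write("name,year\nLansing,2016\n")
        dst = convert_csv_to_tsv(src, os.path.join(tmp, "table.tsv"))
        print(extract_metadata(dst))
```

Real extractors in the talk (face detection, OCR, table extraction) produce far richer JSON; the point here is only the shape of the interface.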

Brown Dog – Data Transformation Service

• Tools Catalog (TC)

• Allows new conversion/extraction tools to be added to the Brown Dog Service

• Highly extensible set of services

• Maintains Provenance

• Auto scaling

• Information loss

An Example – Conversion and Metadata Extraction

As seen in the Clowder Web Interface

[Screenshots: OpenCV Face Extractor; Tesseract OCR Extractor; Conversion Options]

Use Cases Across Many Disciplines

• Biology

• Ecology

• Civil and Environmental Engineering

• Library and Information Science

• Social Science

Let’s See Some Examples…

From 19th Century Digitized Historic River Maps To Study of River Meander

To see the demo and contributors: http://tinyurl.com/browndog-clients

Census Data

• Extraction of the handwritten contents from each cell of digitized 1930s census data

Water Bodies Detection from Aerial Photos

To see the demo and contributors: http://tinyurl.com/browndog-clients

File Manager Extension

• Brown Dog Conversion Service as part of the Windows Explorer right-click menu

Google’s Chrome Browser Extension

To see the demo: http://tinyurl.com/browndog-clients

Other Examples …

• Video Analysis Tableau – Cinemetrics Extractors

• Person Tracking from Videos

• Extraction of tables/graphs from pdf when the source is not available

Software Behind …

• Conversion – Polyglot

• Wraps third-party software/scripts/libraries developed by researchers

• Intelligently selects a conversion path – can chain different software while minimizing conversion hops and information loss

• ImageMagick, Kabeja, Daffodil, ffmpeg, GDAL, etc.

• Metadata Extraction – Clowder

• Wraps third-party content analysis/extraction software or scripts developed by researchers

• NCSA Versus, Siegfried, FITS, Tesseract, OpenCV, Stanford NLP, ArcGIS, etc.

• Provenance Maintenance – DataWolf

• Scalability – Elasticity Module, Cloud (NCSA Nebula), HPC (XSEDE resource)
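The idea of choosing a conversion path that chains tools with as few hops as possible can be illustrated with a breadth-first search over a graph of formats. This is a minimal sketch, not Polyglot's actual algorithm: the CONVERSIONS table and the function find_conversion_path are hypothetical, and the format/tool pairings are illustrative assumptions rather than a listing of what Brown Dog supports.

```python
from collections import deque

# Hypothetical catalog entries: (input format, output format, tool).
CONVERSIONS = [
    ("dxf", "svg", "Kabeja"),
    ("svg", "png", "ImageMagick"),
    ("png", "jpg", "ImageMagick"),
    ("png", "pdf", "ImageMagick"),
    ("avi", "mp4", "ffmpeg"),
]


def find_conversion_path(source, target, conversions=CONVERSIONS):
    """Return the fewest-hop list of (src, dst, tool) steps, or None."""
    # Build an adjacency list: format -> [(next format, tool), ...].
    graph = {}
    for src, dst, tool in conversions:
        graph.setdefault(src, []).append((dst, tool))

    # Breadth-first search visits formats in order of hop count,
    # so the first path that reaches the target has the fewest hops.
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, path = queue.popleft()
        if fmt == target:
            return path
        for nxt, tool in graph.get(fmt, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(fmt, nxt, tool)]))
    return None


if __name__ == "__main__":
    for step in find_conversion_path("dxf", "pdf") or []:
        print(step)
```

Breadth-first search minimizes only the number of hops. Accounting for information loss would need a per-conversion cost and a weighted shortest-path search instead.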

Brown Dog Service Demo

Code Snippets

Tools Catalog

Add a Tool

Summary

• Huge diversity in data and analysis tools

• Programmable Interface – various client applications

• Automatically scales up/down

• Place to preserve/reuse software/tools

• Integrates with scientific workflow system

• Reusable modules

Brown Dog Services – Software Components, Cloud/HPC Resources

[Diagram labels: Polyglot, Versus, Daffodil]

Project website: http://browndog.ncsa.illinois.edu/

Acknowledgements

• The Brown Dog team and collaborators

• This material is based upon work supported by the National Science Foundation under Grant Number NSF ACI-1261582: “CIF21 DIBBs: Brown Dog”.

• Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

• For More Information:

• Project website: http://browndog.ncsa.illinois.edu/

• Brown Dog Clients: https://www.youtube.com/watch?v=MvaHQKT3BPQ

Thank You

Questions?

@NCSABrownDog
