National Center for Supercomputing Applications University of Illinois at Urbana-Champaign NCSA Brown Dog DATA TRANSFORMATION SERVICE Smruti Padhy, Ph.D. National Center for Supercomputing Applications University of Illinois at Urbana-Champaign 2016 NAGARA ANNUAL CONFERENCE July 15, 2016, Lansing, MI


Page 1: NCSA Brown Dog - SSO · Software Behind … • Conversion –Polyglot • Wraps Third-party software/scripts/libraries developed by researcher • Intelligently selects conversion

National Center for Supercomputing Applications

University of Illinois at Urbana–Champaign

NCSA Brown Dog

DATA TRANSFORMATION SERVICE

Smruti Padhy, Ph.D.

National Center for Supercomputing Applications

University of Illinois at Urbana-Champaign

2016 NAGARA ANNUAL CONFERENCE

July 15, 2016, Lansing, MI

Page 2

The Problem

• Large collections of unstructured and/or un-curated digital data

• No textual content – Scanned Handwritten Documents, Satellite/Aerial Photos, Tabular data, Historic Maps, Audio, Video

• No consistent or useful naming of files/directories

• No metadata

• Many data types and hundreds of file formats

• Variety of existing software for access and analysis

• Short life span of digital data and software

Page 3

What we need

A system that

• Enables access to data contents irrespective of file formats

• Extracts metadata from data content and does automatic curation – enabling indexing based on the content and making it searchable

• Uses existing conversion/extraction/data analysis tools

• Is extensible – new tools are easy to add

• Is dynamically scalable

• Is easy to use – provides a uniform interface

Page 4

Brown Dog Project

• NSF Data Infrastructure Building Blocks (DIBBs) Award to NCSA ($10.5 M, 2013-2018)

• Collaboration between NCSA, University of Illinois at Urbana-Champaign, University of Maryland, Boston University, Southern Methodist University

• Leverages prior NCSA/NARA and UMD/NARA funded work

Page 5

Brown Dog – Data Transformation Service

• File Format Conversions

• File in, File out

• Example – Image to (image/text/pdf) format, Autodesk’s DXF to (svg, jpg, png, tif, pdf, XML), video/audio formats (avi, flv, wav, mp3, mp4) to another video/audio format

• Can use multiple conversion steps

• Metadata Extractions

• Extraction of metadata, signatures or derived products from a file’s content, tags, previews

• File in, JSON out

• Example – Face extraction from image, text extraction using OCR, table from pdf, extraction of rivers from historical river maps, previews
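The "File in, JSON out" shape of a metadata extraction can be sketched as follows. This is only an illustration of the contract, not Brown Dog or Clowder code: the function name `extract_metadata` and the chosen fields are assumptions, and the SHA-256 digest merely stands in for the kind of content signature the slide mentions.

```python
import hashlib
import json
import mimetypes
import os
import tempfile


def extract_metadata(path):
    """Hypothetical extractor: takes a file path, returns a JSON-serializable dict."""
    with open(path, "rb") as handle:
        content = handle.read()
    # The MIME type is guessed from the file extension only and may be None.
    mime_type, _ = mimetypes.guess_type(path)
    return {
        "filename": os.path.basename(path),
        "size_bytes": len(content),
        "mime_type": mime_type,
        # A content hash used here as a simple stand-in for a file signature.
        "sha256": hashlib.sha256(content).hexdigest(),
    }


if __name__ == "__main__":
    # Write a small throwaway file so the example runs without external data.
    with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
        tmp.write(b"hello")
    try:
        print(json.dumps(extract_metadata(tmp.name), indent=2))
    finally:
        os.remove(tmp.name)
```

A real extractor would replace the body with a call into a content-analysis tool (OCR, face detection, and so on) while keeping the same file-in, JSON-out interface.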

Page 6

Brown Dog – Data Transformation Service

• Tools Catalog (TC)

• Allows new conversion/extraction tools to be added to the Brown Dog Service

• Highly extensible set of services

• Maintains Provenance

• Auto scaling

• Information loss

Page 7

An Example – Conversion and Metadata Extraction

As seen in the Clowder Web Interface

[Screenshot callouts: OpenCV Face Extractor, Tesseract OCR Extractor]

Page 8

An Example – Conversion and Metadata Extraction

As seen in the Clowder Web Interface

[Screenshot callout: Conversion Options]

Page 9

Use Cases Across Many Disciplines

• Biology

• Ecology

• Civil and Environmental Engineering

• Library and Information Science

• Social Science

Page 10

Let's See Some Examples…

Page 11

From 19th Century Digitized Historic River Maps To Study of River Meander

To see the demo and contributors:

http://tinyurl.com/browndog-clients

Page 12

Census Data

• Extraction of the handwritten contents from each cell of digitized 1930s census data

Page 13

Water Bodies Detection from Aerial Photos

To see the demo and contributors:

http://tinyurl.com/browndog-clients

Page 14

File Manager Extension

• Brown Dog Conversion Service as part of the Windows Explorer right-click menu

Page 15

Google’s Chrome Browser Extension

To see the demo:

http://tinyurl.com/browndog-clients

Page 16

Other Examples …

• Video Analysis Tableau – Cinemetrics Extractors

• Person Tracking from Videos

• Extraction of tables/graphs from pdf when source is not available

Page 17

Software Behind …

• Conversion – Polyglot

• Wraps third-party software/scripts/libraries developed by researchers

• Intelligently selects conversion path – can chain different software; minimizes conversion hops/information loss

• ImageMagick, Kabeja, Daffodil, FFmpeg, GDAL, etc.

• Metadata Extraction – Clowder

• Wraps content analysis/extraction third-party software or scripts developed by researchers

• NCSA Versus, Siegfried, FITS, Tesseract, OpenCV, Stanford NLP, ArcGIS, etc.

• Provenance Maintenance – DataWolf

• Scalability – Elasticity Module, Cloud (NCSA Nebula), HPC (XSEDE resource)
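The conversion path selection mentioned under Conversion above can be pictured as a shortest-path search over a graph whose nodes are file formats and whose edges are tool conversions. The sketch below uses breadth-first search to find the path with the fewest hops. It is an illustration only, not Polyglot's actual algorithm: the `CONVERSIONS` table and the `find_conversion_path` function are hypothetical, the tool-to-format pairings are simplified examples, and hop count is used as a rough proxy for information loss, which this sketch does not model.

```python
from collections import deque

# Hypothetical catalog: (input format, output format) -> tool that performs it.
CONVERSIONS = {
    ("dxf", "svg"): "Kabeja",
    ("svg", "png"): "ImageMagick",
    ("png", "jpg"): "ImageMagick",
    ("png", "pdf"): "ImageMagick",
    ("avi", "mp4"): "ffmpeg",
}


def find_conversion_path(source, target, conversions=CONVERSIONS):
    """Return the fewest-hop list of (tool, src, dst) steps, or None if unreachable."""
    if source == target:
        return []
    # Build an adjacency list: format -> [(reachable format, tool), ...].
    neighbors = {}
    for (src, dst), tool in conversions.items():
        neighbors.setdefault(src, []).append((dst, tool))
    # Breadth-first search reaches each format by a minimum number of hops.
    came_from = {source: None}
    queue = deque([source])
    while queue:
        fmt = queue.popleft()
        for dst, tool in neighbors.get(fmt, []):
            if dst in came_from:
                continue
            came_from[dst] = (fmt, tool)
            if dst == target:
                # Walk back from the target to the source to rebuild the chain.
                steps = []
                node = dst
                while came_from[node] is not None:
                    prev, used = came_from[node]
                    steps.append((used, prev, node))
                    node = prev
                return steps[::-1]
            queue.append(dst)
    return None


if __name__ == "__main__":
    for tool, src, dst in find_conversion_path("dxf", "pdf"):
        print(f"{src} -> {dst} via {tool}")
```

With this toy catalog, a DXF-to-PDF request is chained through three tools' steps, while a request with no connecting tools (for example avi to pdf) returns `None`. Weighting edges by an estimated loss and switching to Dijkstra's algorithm would be one way to account for information loss directly.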

Page 18

Brown Dog Service Demo

Page 19

Code Snippets

Page 20

Tools Catalog

Page 21

Add a Tool

Page 22

Summary

• Huge diversity in data and analysis tools

• Programmable Interface – various client applications

• Automatically scales up/down

• Place to preserve/reuse software/tools

• Integrates with scientific workflow systems

• Reusable modules

Page 23

Brown Dog Services – Software Components, Cloud/HPC Resources

[Diagram labels: Polyglot, Versus, Daffodil]

Project website:

http://browndog.ncsa.illinois.edu/

Page 24

Acknowledgements

• The Brown Dog team and collaborators

• This material is based upon work supported by the National Science Foundation under Grant Number NSF ACI-1261582: “CIF21 DIBBs: Brown Dog”.

• Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Page 25

• For More Information:

• Project website: http://browndog.ncsa.illinois.edu/

• Brown Dog Clients: https://www.youtube.com/watch?v=MvaHQKT3BPQ

Thank You

Questions?

@NCSABrownDog