Serverless Workflows for Indexing Large Scientific Data
Tyler J. Skluzacek, Ryan Chard, Ryan Wong, Zhuozhao Li, Yadu Babuji, Logan Ward, Ben Blaiszik, Kyle Chard, Ian Foster


Page 1:

Serverless Workflows for Indexing Large Scientific Data
Tyler J. Skluzacek, Ryan Chard, Ryan Wong, Zhuozhao Li, Yadu Babuji, Logan Ward, Ben Blaiszik, Kyle Chard, Ian Foster

Page 2:

Data are big, diverse, and distributed

Big: petabytes → exabytes

Diverse: thousands → millions of unique file extensions

Distributed: IoT (edge), HPC, cloud; from many individuals

Page 3:

Generally, scientific data are not FAIR

Findable, Accessible, Interoperable, Reusable

Root of the problem: files lack descriptive metadata
Root of the root of the problem: humans are lazy, metadata are hard

[Diagram: Files → Humans (metadata extraction) → Ingestion → Search Index; metadata include {location, physical attributes, derived information, provenance}]

Page 4:

We need an automated metadata extraction system

Ideally, to cancel* humans

[Diagram: Files → (HTTPS) → Metadata Extraction Service → Ingestion → Search Index; metadata include {location, physical attributes, derived information, provenance}]

Page 5:

We need a flexible, decentralized, scalable metadata extraction system:

1. Flexible: send metadata extraction functions to the data; no need to ship big data

2. Decentralized: extract the data in their natural habitats (e.g., the edge)

3. Scalable: run many concurrent metadata extraction processes (see the sketch after the wc examples below)

wc -l $FILE1
wc -l $FILE1  wc -l $FILE2
wc -l $FILE1  wc -l $FILE2  ...  wc -l $FILE600000
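To make the scaling idea concrete, here is a minimal local sketch (not Xtract code) that runs many line-count "extractions" concurrently; the file paths are placeholders.

import subprocess
from concurrent.futures import ThreadPoolExecutor

def count_lines(path):
    # Equivalent of `wc -l $FILE`: return (path, line count).
    out = subprocess.run(["wc", "-l", path], capture_output=True, text=True, check=True)
    return path, int(out.stdout.split()[0])

files = [f"/data/file{i}.txt" for i in range(600)]    # placeholder paths
with ThreadPoolExecutor(max_workers=32) as pool:
    for path, n in pool.map(count_lines, files):
        print(path, n)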

Page 6:

funcX for FaaS anywhere

Enable secure, isolated, on-demand function serving on myriad compute resources (cloud, HPC, laptops, edge)

Abstract away underlying infrastructure via the Parsl parallel scripting language

Users deploy endpoints on available compute resources, use Globus Auth for access control, and access a library of containers for running functions

[Diagram: a function (e.g., wc -l $FILE1) is submitted to the funcX service, routed to a funcX endpoint on a compute resource, executed by a worker inside a container, and its result is returned]
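A minimal sketch of driving funcX from Python; the endpoint UUID and file path are placeholders, the import path may differ across funcX releases, and this is not the authors' code.

import time
from funcx.sdk.client import FuncXClient

def count_lines(path):
    # Toy extractor: count the lines of a file that lives on the remote resource.
    with open(path) as f:
        return sum(1 for _ in f)

fxc = FuncXClient()                                  # authenticates via Globus Auth
func_id = fxc.register_function(count_lines)         # register once, reuse anywhere
task_id = fxc.run("/data/file1.txt",                 # the file never leaves the endpoint
                  endpoint_id="<endpoint-uuid>",
                  function_id=func_id)
while True:                                          # get_result raises while the task is pending
    try:
        print(fxc.get_result(task_id))
        break
    except Exception:
        time.sleep(2)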

Page 7:

Metadata Extractor = Function

Metadata extractor: instructions that create a mapping from an input file to output JSON, i.e., it looks like a function

Function: Python/Bash metadata extraction instructions
Payload: the file or group of files from which to extract
Function containers: containers bundling all execution dependencies

[Diagram: file → ? → {location, physical attributes, derived information, provenance}]
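As a concrete illustration (a sketch of the idea, not Xtract's actual extractor code; the file name is a placeholder), a minimal physical-metadata extractor could be:

import json
import os
import time

def physical_metadata_extractor(path):
    # Map an input file to a JSON-able metadata document.
    st = os.stat(path)
    return {
        "location": path,
        "physical": {
            "size_bytes": st.st_size,
            "extension": os.path.splitext(path)[1],
            "last_edited": time.ctime(st.st_mtime),
        },
    }

print(json.dumps(physical_metadata_extractor("example.csv"), indent=2))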

Page 8:

Xtract: the serverless metadata extraction system

Built atop funcX: deploy endpoints on heterogeneous compute resources (cloud, laptops, HPC, scientific instruments)

A central web service determines which extractors to send to each endpoint: send extractors to the data, receive results, determine future extractors

Secure: uses Globus Auth for access control on data collections and compute

Crawls any Globus-connected endpoint, recursively generating file groups dir-by-dir

Prototype

Page 9:

Xtract: the serverless metadata extraction system

Site A: compute at the data. Site B: compute elsewhere.

Consider how one could automatically move data to available endpoints to suit constraints

Page 10:

Step 1: Crawl the File System

Recursively crawls all files in all nested dirs on a Globus endpoint
Generates an initial metadata index for each file/file-group
Extracts physical metadata (path, size, extension, last-edited)
User authentication: Globus Auth (OAuth 2)

[Diagram: a directory queue and a file queue feeding multiple parallel crawl workers]
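A simplified local sketch of the queue-based crawl pattern (the real crawler walks a Globus endpoint and authenticates with Globus Auth; both are omitted here, and the root path is a placeholder):

import os
from collections import deque

def crawl(root):
    dir_queue = deque([root])          # directories still to be listed
    index = []                         # initial metadata index, one entry per file
    while dir_queue:
        current = dir_queue.popleft()
        for entry in os.scandir(current):
            if entry.is_dir(follow_symlinks=False):
                dir_queue.append(entry.path)
            else:
                st = entry.stat()
                index.append({
                    "path": entry.path,
                    "size": st.st_size,
                    "extension": os.path.splitext(entry.name)[1],
                    "last_edited": st.st_mtime,
                })
    return index

print(len(crawl("/data")), "files crawled")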

Page 11:

Step 2: File Type Identification

Need to “guess” a file’s type: it is impractical to apply all extractors to all files (most yield no metadata), and applying an incorrect extractor to a file can waste significant time

Random Forests model trained on 5% of files in a given repo
Features: the first 512 bytes of the file header

Training labels: a file’s type is determined by the first applicable metadata extractor
Feasible because extractors can find other applicable extractors
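A hedged sketch of such a model with scikit-learn (paths and labels are placeholders; in Xtract the labels come from the first extractor that applies to each sampled file):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

HEADER_BYTES = 512

def header_features(path):
    # Use the first 512 bytes of the file as the feature vector, zero-padded.
    with open(path, "rb") as f:
        head = f.read(HEADER_BYTES)
    return np.frombuffer(head.ljust(HEADER_BYTES, b"\x00"), dtype=np.uint8)

paths = ["a.csv", "b.h5", "c.txt"]                   # placeholder training sample
labels = ["tabular", "hierarchical", "keyword"]      # placeholder extractor-derived labels

X = np.stack([header_features(p) for p in paths])
clf = RandomForestClassifier(n_estimators=100).fit(X, labels)
print(clf.predict([header_features("unknown.bin")]))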

Page 12:

Step 3: Metadata Extractor Orchestration

Xtract uses file type identity to choose the first appropriate extractor

Extractors return results to the service and may immediately deploy additional extractors to the endpoint; this can be done recursively.

One file will likely receive multiple metadata extraction functions
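A conceptual sketch of that loop (not Xtract's service code): each extractor returns metadata plus the names of any follow-up extractors, so extraction recurses until none remain. The extractor names and outputs below are made up.

def orchestrate(file_path, first_extractor, extractors):
    metadata = {}
    pending = [first_extractor]                 # seeded by the file type model
    while pending:
        name = pending.pop()
        result = extractors[name](file_path)    # in Xtract this executes remotely via funcX
        metadata.update(result.get("metadata", {}))
        pending.extend(result.get("next_extractors", []))
    return metadata

# e.g., a tabular extractor might request a keyword pass over free-text columns:
extractors = {
    "tabular": lambda p: {"metadata": {"columns": 12}, "next_extractors": ["keyword"]},
    "keyword": lambda p: {"metadata": {"keywords": ["alloy", "dft"]}},
}
print(orchestrate("sample.csv", "tabular", extractors))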

Page 13:

Step 4: Ingest Metadata Document

Currently, Xtract supports ingesting JSON documents directly into Globus Search
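A rough sketch of what that ingest could look like with the globus-sdk SearchClient; the access token, index UUID, subject, and document contents are placeholders, and the GMetaEntry layout follows my reading of the Globus Search ingest format rather than Xtract's code.

import globus_sdk

search = globus_sdk.SearchClient(
    authorizer=globus_sdk.AccessTokenAuthorizer("<search-scoped-access-token>")
)

entry = {
    "ingest_type": "GMetaEntry",
    "ingest_data": {
        "subject": "globus://<endpoint-id>/path/to/file.csv",   # what the metadata describes
        "visible_to": ["public"],
        "content": {"size_bytes": 1024, "extension": ".csv"},   # extractor output
    },
}
search.ingest("<search-index-uuid>", entry)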

Page 14:

Diverse, Plentiful Data in Materials Science

The Materials Data Facility (MDF):
• is a centralized hub for publishing, storing, and discovering materials data
• stores many terabytes of data from myriad research groups
• is spread across tens of millions of files
• is co-hosted by ANL and NCSA (at UIUC)

Thus, manual metadata curation is difficult

Page 15:

The Materials Extractor

Atomistic simulations, crystal structures, density functional theory (DFT) calculations, electron microscopy outputs, images, papers, tabular data, abstracts, . . .

MaterialsIO is a library of tools to generate summaries of materials science data files

We developed a ‘materials extractor’ that returns these summaries as metadata

https://materialsio.readthedocs.io/en/latest/

Page 16:

Extractor Library

We operate a (growing!) suite of metadata extractors, including:

Extractor         Description
File Type         Generate hints to guide extractor selection
Images            SVM analysis to determine image type (map, plot, photo, etc.)
Semi-Structured   Extract headings and compute attribute-level metadata
Keyword           Extract keyword tags from text
Materials         Extract information from identifiable materials science formats
Hierarchical      Extract and derive attributes from hierarchical files (NetCDF, HDF)
Tabular           Column-level metadata and aggregates, nulls, and headers

Page 17:

Experimental Machinery

Xtract service: AWS EC2 t2.small instance (Intel Xeon; 1 vCPU, 2 GB RAM)

Endpoint: funcX deployed on ANL’s PetrelKube, a 14-node Kubernetes cluster

Data: stored on the Petrel data service (a 3 PB, Globus-accessible endpoint at ANL); 255,000 randomly selected files from the Materials Data Facility

Page 18:

We evaluate Xtract on the following dimensions:
1. Crawling performance
2. File type training
3. Extractor latency

Future work will evaluate:
4. Metadata quality
5. Tradeoff optimization (transfer or move data when resource usage is nonuniform)

Page 19:

1. Crawling Performance

Sequential crawling: 2.2 million files in ~5.2 hours
Parallelization? Soon. The remote ls command was previously rate-limited, and a majority of directories have 0 or 1 files.

[Diagram: crawler with a directory queue, a file queue, and multiple crawl workers]

Page 20:

2. File Type Training

Train the file type identification model on 110,900 files in the MDF

Total time: 5.3 hours (one-time cost)

Label generation: 5.3 hours
Feature collection + random forests training: 45 seconds

Accuracy: 97%
Precision: 97%
Recall: 91%

Page 21:

3. Extraction Performance

[Figure: Batching]

Extractor Latency

Extractor    # Files    Avg. Size (MB)    Avg. Extract Time (ms)    Avg. Stage Time (ms)
File Type    255,132    1.52              3.48                      714
Images       76,925     4.17              19.30                     1,198
Semi-Str.    29,850     0.38              8.97                      412
Keyword      25,997     0.06              0.20                      346
Materials    95,434     0.001             24                        1,760
Hierarch.    3,855      695               1.90                      9,150
Tabular      1,227      1.03              113                       625

Page 22:

Conclusion

Data are big, diverse, and distributed, and are not FAIR (by default)

Xtract is a prototype that enables scalable, distributed metadata extraction on heterogeneous data stores and compute resources

Future work centers on taking advantage of heterogeneous, distributed resources subject to a number of usage and cost constraints

Next up: index the full 30+ million file Materials Data Facility

Page 23:

Learn more about future work at the Doctoral Symposium

[email protected]

Doctoral Symposium Article: “Dredging a Data Lake: Decentralized Metadata Extraction”. Tyler J. Skluzacek. Middleware ‘19