
Page 1: Addressing the Challenges of the Scientific Data Deluge

Addressing the Challenges of the Scientific Data Deluge

Kenneth Chiu, SUNY Binghamton

Page 2: Addressing the Challenges of the Scientific Data Deluge


Outline

• Overview of collaborative projects that I’m working on.

• Discussion of challenges and approaches.

• Technical overview of specific projects.

Page 3: Addressing the Challenges of the Scientific Data Deluge

Autoscaling Project

• Traditional sensor-network research focuses on energy, routing, etc.

• In “environmental observatories”, management is the problem.

• Adding a sensor takes a lot of manual reconfiguration.
– Calibration, recalibration.
– QA/QC is also a major issue.

• What corrections have been applied to the data, and what calibrations/maintenance have been applied to the sensor?

• With U. Wisconsin, SDSC, and Indiana University.

Page 4: Addressing the Challenges of the Scientific Data Deluge

Motivation

• Adding a sensor requires a great deal of manual effort.
– Reconfiguring the datalogger
– Reconfiguring the data acquisition software
– Reconfiguring QA/QC triggers
– Reconfiguring database tables

• QA/QC is not very automated.
• Result: sensor networks are not very scalable.
• Goal: automate.

Page 5: Addressing the Challenges of the Scientific Data Deluge

Metadata for Each Final Table

Metadata:

• describes each final table.

• is used to dynamically generate forms for data retrieval from the website.

• is entered manually.

Page 6: Addressing the Challenges of the Scientific Data Deluge

Approach

• Use an agent-based, bottom-up approach.

• Agents coordinate among themselves as much as possible.

• Unify communications: all communication is done via data streams.

• Data streams are represented as content-based publish-subscribe systems, as in the sketch below.
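To make “content-based” concrete, here is a minimal C++ sketch of such a broker (the names and message shape are illustrative, not the project’s actual implementation): a subscription is a predicate over message attributes, and a message is delivered to every subscriber whose predicate matches.

#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// A message on a data stream: a set of attribute/value pairs.
using Message = std::map<std::string, double>;

// A subscription is a predicate over message content plus a callback.
struct Subscription {
    std::function<bool(const Message&)> matches;
    std::function<void(const Message&)> deliver;
};

class Broker {
    std::vector<Subscription> subs_;
public:
    void subscribe(Subscription s) { subs_.push_back(std::move(s)); }
    // Content-based routing: deliver to every subscriber whose
    // predicate matches the message content; there are no fixed topics.
    void publish(const Message& m) {
        for (const auto& s : subs_)
            if (s.matches(m)) s.deliver(m);
    }
};

int main() {
    Broker broker;
    // A QA agent subscribes to zero wind-speed readings.
    broker.subscribe({
        [](const Message& m) {
            auto it = m.find("wind_speed");
            return it != m.end() && it->second == 0.0;
        },
        [](const Message&) { std::cout << "QA: zero wind speed\n"; }
    });
    broker.publish({{"wind_speed", 0.0}, {"temp_c", 21.5}});
}

A subscriber such as a QA agent can thus select, say, zero wind-speed readings without any fixed topic hierarchy.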

Page 7: Addressing the Challenges of the Scientific Data Deluge


Long-Term Ecological Research (LTER)

[Architecture diagram: at Trout Lake Station, sensors on a buoy feed a datalogger connected to an ORB; QA agents, an environment agent, and configuration agents exchange CIMA configuration and environment events. An ARTS connection links the station to the University of Wisconsin campus, where data flows through Oracle (via JDBC/ODBC) to a web server and on to web browsers at other locations.]

Page 8: Addressing the Challenges of the Scientific Data Deluge

Agents

• Characteristics:
– Autonomous
– Bottom-up
– Distributed coordination
– Independent/loosely coupled

• Can be thought of as a “style” for implementing distributed systems.

Page 9: Addressing the Challenges of the Scientific Data Deluge

Sensor Metadata

• Each sensor has intrinsic and extrinsic properties.
– Intrinsic properties are type, model number, etc.
• Static: cannot be changed.
• Dynamic: e.g., the SDI-12 address.
– Extrinsic properties are location, sampling rate, etc.

• Use code generation techniques to generate the proper code from the sensor metadata, as sketched below.
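As an illustration of metadata-driven code generation (the fields and the emitted pseudo-program are hypothetical; a real system would target the datalogger’s actual programming language):

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sensor metadata record; the fields are illustrative.
struct SensorMeta {
    std::string type;       // intrinsic, static (e.g., "temperature")
    std::string model;      // intrinsic, static
    int sdi12Addr;          // intrinsic, but dynamic (can be reassigned)
    std::string location;   // extrinsic
    int sampleSecs;         // extrinsic
};

// Emit one acquisition stanza per detected sensor. The output format
// is invented; a real generator targets the datalogger's language.
std::string generateProgram(const std::vector<SensorMeta>& sensors) {
    std::ostringstream out;
    for (const auto& s : sensors)
        out << "scan " << s.type << " model=" << s.model
            << " sdi12=" << s.sdi12Addr << " at=" << s.location
            << " every=" << s.sampleSecs << "s\n";
    return out.str();
}

int main() {
    std::vector<SensorMeta> sensors = {
        {"temperature", "107-L", 0, "buoy-1m", 60},
        {"wind_speed", "05103", 1, "buoy-mast", 10},
    };
    std::cout << generateProgram(sensors);
}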

Page 10: Addressing the Challenges of the Scientific Data Deluge


Automatic Sensor Detection and Inventory

[Diagram: (1) the datalogger raises a detection event when a sensor is attached; (2-3) an instrument agent on the field station computer exchanges a request and response with a web service at the sensor metadata repository; (4) a datalogger program is generated and (5) uploaded to the datalogger on the acquisition computer; (6-7) data then flows to the database at the data center.]

Page 11: Addressing the Challenges of the Scientific Data Deluge

QA/QC

• A malfunctioning anemometer is detected as an abnormal occurrence of zero wind-speed values.

[Chart: frequency of zero hourly-average wind-speed values per month, ranging roughly 0-250, from Jan-95 to Jan-03.]
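A minimal sketch of this style of QA rule, with invented field names and an invented threshold:

#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct Reading { std::string month; double windSpeed; };

// Count zero hourly-average wind-speed values per month, then flag
// months whose count exceeds a threshold as suspect.
std::vector<std::string> flagSuspectMonths(const std::vector<Reading>& data,
                                           int threshold) {
    std::map<std::string, int> zeros;
    for (const auto& r : data)
        if (r.windSpeed == 0.0) ++zeros[r.month];
    std::vector<std::string> suspect;
    for (const auto& [month, count] : zeros)
        if (count > threshold) suspect.push_back(month);
    return suspect;
}

int main() {
    std::vector<Reading> data = {
        {"1999-01", 0.0}, {"1999-01", 3.2}, {"1999-01", 0.0},
        {"1999-02", 0.0},
    };
    for (const auto& m : flagSuspectMonths(data, /*threshold=*/1))
        std::printf("suspect month: %s\n", m.c_str());
}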

Page 12: Addressing the Challenges of the Scientific Data Deluge

Another Example

• The buoy was pulled down into the water by ice.

[Charts: water temperature (deg C), roughly -2 to 4, for two winters; one panel (23-Nov through 21-Feb) shows the winter with displaced sensors, the other (15-Nov through 13-Feb) a normal winter. Source: Hu and Benson.]

Page 13: Addressing the Challenges of the Scientific Data Deluge


Crystal Grid Framework

• Seeks to develop standards and middleware for integrating instrument and sensor data into wide-area infrastructures, such as grid computing.

• With Indiana University.

Page 14: Addressing the Challenges of the Scientific Data Deluge

Motivation

• The process of collecting and generating data is often critical.
– Current mechanisms for monitoring and control either require physical presence or use ad hoc protocols and formats.

• Instruments and sensors are already “wired”.
– Usually via obscure, or perhaps proprietary, protocols.

• Using standard mechanisms and protocols can give these devices a grid presence.
– Benefit from a single, unified paradigm and terminology.
– Single set of standards; exploit existing grid standards.
– Simplifies end-to-end provenance tracking.
– Faster, more seamless interaction between data acquisition and data processing.
– Greater interoperability and compatibility.

Philosophy: Push grid standards as close to the instrument or sensor as possible. (But no further!) Deal with “impedance mismatches” close to the instrument, so as to localize complexity.

Page 15: Addressing the Challenges of the Scientific Data Deluge

Goals

• Develop a set of standard grid services for accessing and controlling instruments.
– Based on Web standards such as WSDL, SOAP, XML, etc.

• Develop an instrument ontology for describing instruments.
– Applications use the description to interact.

• The goal is to develop middleware that abstracts and layers functionality.
– Minor differences in instruments should result in only a minor loss of functionality to the application.

• Move metadata and provenance as close to the instrument as possible.

Page 16: Addressing the Challenges of the Scientific Data Deluge


Overview

[Diagram: an instrument (sensors plus a controller) feeds a data pipeline of acquisition, analysis, and curation components over a physical network transport. Each component pairs its own code with an Instrument Access layer, which splits into a device-independent application module and a device-dependent virtualization module over a shared implementation; an Instrument Presentation layer with remote access and a GUI connects the scientist.]

Page 17: Addressing the Challenges of the Scientific Data Deluge

Distributed X-Ray Crystallography

• Crystallographer, chemist, and technician may be separated.
– Large resources such as synchrotrons
– Convenience and productivity
– Expanding usage to smaller institutions

• Data collection, analysis, and curation may be separated.

• Approximate data requirements: 1-10 TB/year.
– Currently stored at IU.

• Real-time data collection and control.
• Collaboration with IU, Sydney, JCU, and Southampton.

Page 18: Addressing the Challenges of the Scientific Data Deluge

X-Ray Crystallography

• Scientists are (understandably) very reluctant to let you install software on the acquisition machine.
– Use a proxy box to access its files via CIFS or NFS.
– Scan for files that indicate activity.

• Unfortunately, scientists can manually create files, which can confuse the scanner. There is no ideal solution.

• For sensor data, request-response is not ideal.
– Push data using one-way messages.
• In WSDL 2.0, consider “connecting” out-only services to in-only services.

Page 19: Addressing the Challenges of the Scientific Data Deluge


X-Ray Crystallography

[Diagram: deployments at Indiana University, the University of Sydney, Argonne National Labs, and the University of Southampton. At each acquisition site, a diffractometer feeds an acquisition machine whose files a proxy box reads over CIFS and exposes as instrument services. Portals, instrument managers, and data archives are shown as grid or non-grid services, persistent or non-persistent.]

Page 20: Addressing the Challenges of the Scientific Data Deluge

TASCS: Center for Technology for Advanced Scientific Component Software

• Multi-institution DOE project.
• Seeks to develop a common component architecture for scientific components.
• My focus within it is to develop a Babel RMI/Proteus implementation.
– And to develop C++ reflection techniques to improve dynamic connection abilities.

• With LLNL and many other institutions.

Page 21: Addressing the Challenges of the Scientific Data Deluge


Babel

• Language interoperability toolkit developed at LLNL.

• Allows writing objects in a number of languages, including non-OOP ones such as Fortran.

• Began as a purely in-process tool, now includes an RMI interface.

Page 22: Addressing the Challenges of the Scientific Data Deluge


Proteus

• Started off as a unification API for messaging over multiple standards and implementations, such as CORBA, JMS, SOAP.

• Moving towards focusing on multiprotocol web services.

• Though almost always bound to SOAP, WSDL actually fully supports almost any protocol.

Page 23: Addressing the Challenges of the Scientific Data Deluge


Runtime

[Diagram: the Babel-Proteus runtime stack. On one side, a Stub, IOR, C++ Skel, and RMI Stub lead through a B-P Adapter and SerializableObject into Proteus (via WSIT); on the other side, Proteus delivers through a SerializableObject and B-P Adapter to a C++ Stub, IOR, and Skel that invoke the Impl. Components are marked Generated, Library, Babel-Proteus Generated, or User.]

Page 24: Addressing the Challenges of the Scientific Data Deluge


Multiprotocol

[Diagram: two processes, each with a client and providers A and B on top of Proteus; the processes communicate over the network using either protocol A or protocol B, with Proteus selecting the matching provider.]

Page 25: Addressing the Challenges of the Scientific Data Deluge

Lake Sunapee

• Most e-Science/cyberinfrastructure R&D is for institutional science.
– It assumes significant resources and expertise.

• Much less work on CI for citizen science, non-profit organizations, etc.

• This project explores how to engage them in the development of cyberinfrastructure and e-Science.
– Also with a focus on how to use e-Science to engage and educate K-12.
– Also with a focus on how to train CS students to better engage scientists.

• With U. Wisconsin, U. Michigan, LSPA, and IES.

Page 26: Addressing the Challenges of the Scientific Data Deluge

• Hold a series of workshops to understand needs.

• Research and develop systems that give them accessible means to interpret the sensor data.

• Course component: seminar/project course where students will work with citizen scientists in small groups to define and implement e-Science projects with the lake association.

Page 27: Addressing the Challenges of the Scientific Data Deluge

• Semantic publish-subscribe.
– Content-based publish-subscribe needs a content model.
– Semantic web/description logics provide an ideal content model.

Page 28: Addressing the Challenges of the Scientific Data Deluge

Many Small Datasets

• Much ecological data is characterized not by a few large datasets, but by many small datasets.
– e-Science has so far mostly chosen to focus on a few large datasets.

Page 29: Addressing the Challenges of the Scientific Data Deluge


Flexible Electronics and Nanotechnology

• Work with Howard Wang in BU ME.

• “Ontologies” for materials science processes (internal).

• Undergraduate education project (NSF).

Page 30: Addressing the Challenges of the Scientific Data Deluge

Material Processes

• The product of materials science research is the characterization of a process (vibration, heating, chemical, electrical, etc.).

• Applying such research means finding a sequence of processes that will transform a material A (with certain properties, such as particle size) into a material B (with certain other properties).

• It is very difficult to search the research literature.
• This is also a type of path-finding problem.

Page 31: Addressing the Challenges of the Scientific Data Deluge


[Diagram: two anonymous process nodes; each has a hasName edge to the literal “annealing” and a tempSchedule edge, one to “a schedule” and the other to “a different schedule”.]

Conceptually, the schedule is just a function that gives the temperature as output given the time as input. One question is whether to attempt to represent it partially in the graph model, or to treat its representation as completely outside the model. For example, a function can be represented as a table, a Fourier series, wavelets, etc.

The anonymous node only serves to “bind” the other nodes together. You can think of it as representing the process as a whole.

Information is sparse.

Page 32: Addressing the Challenges of the Scientific Data Deluge


Undergraduate Education

• Groups of nanotechnology students develop senior design projects with CS students.

Page 33: Addressing the Challenges of the Scientific Data Deluge

Programs: Australia, Canada, China, Finland, Florida, New Zealand, Israel, South Korea, Taiwan, United Kingdom, Wisconsin

First meeting: San Diego, March 7-9, 2005

Source: T. Kratz

Page 34: Addressing the Challenges of the Scientific Data Deluge

Vision and Driving Rationale for GLEON

• A global network of hundreds of instrumented lakes, data, researchers, and students.

• Predict lake ecosystems’ response to natural and anthropogenically mediated events:
– through improved data inputs to simulation models;
– to better plan for and preserve the planet’s freshwater resources.

• More or less a grass-roots organization.
• Led by Peter Arzberger at SDSC, and with U. Wisconsin.

Page 35: Addressing the Challenges of the Scientific Data Deluge

Why develop such a network?

• Global e-science is becoming increasingly possible.

• Developments in sensors and sensor networks allow some key measurements to be automated

Porter, Arzberger, C. Lin, F. P. Lin, Kratz, et al. (2005), July 2005 issue.

Source: T. Kratz


Page 37: Addressing the Challenges of the Scientific Data Deluge


Outline

• Overview of collaborative projects that I’m working on.

• Discussion of challenges and approaches.

• Technical overview of specific projects.

Page 38: Addressing the Challenges of the Scientific Data Deluge

Research Challenges

• The biggest challenge is data.
• Much time and effort is spent managing data by time-consuming and human-intensive means.
– Often stored in Excel, text files, SAS.
– Metadata in notebooks and gray matter.

• No incentives to make data reusable.
– Providing data is not valued academically.

• Too much manual work involved in acquisition.
– This means much is not captured automatically and semantically.

• Standardization of things such as ontologies is very slow, and tends to be top-down.
– Can we first build a system that provides some benefit without forcing scientists through a painful standardization process?

Page 39: Addressing the Challenges of the Scientific Data Deluge


Cyberinfrastructure and e-Science

• There have been huge improvements in hardware.

• There have been huge local improvements in software.

• Not so many improvements in large-scale integration and interoperability.

Page 40: Addressing the Challenges of the Scientific Data Deluge


Data, Data, and More Data!

• Data is the driver of science.

• Recent advances in technology have given us the ability to acquire and generate prodigious amounts of data.

• Processing power, disk, memory have increased at exponential rates.

Page 41: Addressing the Challenges of the Scientific Data Deluge

It’s Not a Few Huge Datasets

• Huge datasets get more attention.
– More glamorous.
– Traditional type of CS problem.
– Easier to think about.

• But it’s the number of different datasets that is the real problem.
– If you have one big dataset, you can concentrate efforts on the problem.
– Not very amenable to traditional CS “thinking”, since there is a very significant “human-in-the-loop” component.
– The best CS research is useless if the human ignores it.

Page 42: Addressing the Challenges of the Scientific Data Deluge

We Are the Same! (More or Less)

Technology advances fast.

People advance slowly! People compose our institutions, our organizations, our modes of practice.

Result: the old ways of doing things don’t cut it, but we haven’t yet figured out the new ways.

Page 43: Addressing the Challenges of the Scientific Data Deluge

Technology Impacts Slowly

• Technologies often require many systemic changes to bring benefits.
– Sometimes other, complementary technologies must first be invented.

• The steam engine was invented in 1712 but did not become a huge economic success until the 1800s.

• The motor and the generator were invented in the early 1800s.
– The real benefits did not arrive until the 1900s.

Page 44: Addressing the Challenges of the Scientific Data Deluge

Steam to Electric

• Steam-powered factories were built around a single large engine.
• Belts and other mechanical drives distributed power.
• If you brought a motor to a factory foreman:
– His factory wasn’t built for it.
– He might not be able to power it.
• Chicken-and-egg problem.
– He doesn’t even know how to use it.

• It took decades.
• Similarly, I believe we are in the early stages when it comes to computer technology.

Page 45: Addressing the Challenges of the Scientific Data Deluge

Socio-Technical Problem

• What will it take to figure out how to use all this data?

• Not a pure CS problem; people’s actions affect how easy it is to use all the data.

• Many problems these days are sociotechnical in nature.
– Password security is a solved problem.
– Interoperability is a solved problem.

• Figuring out how to use data is even harder than power, since power distribution is physical and easy to see.
– Data/information flow is hard to see.

Page 46: Addressing the Challenges of the Scientific Data Deluge

A Vision

• A scientist sits in his office.
• He wonders: “Do children who live closer to cell towers have higher rates of autism?”
• How much time would it take the scientist to test this hypothesis?
– Find the data.
– Reformat the data, convert it, etc.
– Run some analysis tools. Maybe find time on a large resource.

• But the data is out there!
– There are many hypotheses that are never tested because it would take too much work.

Page 47: Addressing the Challenges of the Scientific Data Deluge

• This vision also applies to business, military, medicine, industry, management, etc.

• There are a million sources of data out there.
– Real-time data streams, archived data, scientific publications, etc.

• How can we build a flexible infrastructure that will allow analyses to be composed and answered on the fly?

• How do we go from data+computation to knowledge?

Page 48: Addressing the Challenges of the Scientific Data Deluge

RDF-like Data Model

• We hypothesize that part of the problem is that RDBMSs are based on data models that do not fit scientific data well.
– This “impedance mismatch” is a barrier.

• Thus, develop models that more closely resemble the mental model that scientists use when thinking about data.
– The less a priori structure imposed on the data, the better.

Page 49: Addressing the Challenges of the Scientific Data Deluge

Goals

• Allow some common subset of code and design to be used for many kinds of scientific data and applications.

• Suggest a data and information architecture for querying and storage.

• Provide some fundamental semantics. Each discipline would then refine these semantics.

• Don’t get bogged down in trying to figure out everything. Just try to find some lowest common denominator (LCD).

• This is a logical model of data. We also need a “physical” model to handle transport, archiving, etc., and then a mapping from the physical model to the logical model. For example, an image file contains more than just the raw intensities, and some metadata may not be in the file at all. We don’t want the logical model to be concerned with how the data is actually arranged.

• Promote bottom-up, grass-roots approaches to building standards.

Page 50: Addressing the Challenges of the Scientific Data Deluge

One Person’s Metadata Is Another Person’s Data

• The distinction between data and metadata is artificial and problematic.
– What is metadata in one context becomes data in another. For example, suppose you are taking the temperature at a set of locations (determined via GPS). For each reading, the temperature is the data and the location is the metadata. But now suppose you need the error of the location; the error becomes the meta-metadata of the location metadata.
– A made-up example based loosely on crystallography: the spatial correction is based on a calibration image obtained from a brass plate, so the calibration image is metadata for the set of frames. Now suppose the temperature of the brass plate when the image was made is needed; that temperature is meta-metadata.

Page 51: Addressing the Challenges of the Scientific Data Deluge

• Use a graph-based model.
– Based on RDF.
– The actual data is stored as a graph.
• Contrast with models like E-R, where the graph “models” the data rather than actually being the data.
• A node in E-R might be “customer”, representing the class of entities that are customers rather than any specific customer.

• The model (sketched below):
– Each node is a datum.
– Each edge denotes an association/attribute/property.
– Nodes can be grouped into nodesets, which are also nodes.
• A node may be in more than one nodeset.
– A node-edge-node triple can also be a node.
– The main difference from RDF is an attempt to build reification into the model.

• Somewhat similar to a hypergraph.
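A minimal C++ sketch of the model (the types and names are mine, not a published API); the essential point is that a triple is itself a node, so attributes can attach to edges, and nodesets are nodes too:

#include <memory>
#include <string>
#include <variant>
#include <vector>

struct Node;
using NodePtr = std::shared_ptr<Node>;

// An edge denotes an association/attribute/property between two nodes.
struct Edge { std::string name; NodePtr from, to; };

// A node is a datum: a literal value, a nodeset (a grouping of nodes,
// itself a node), or a reified node-edge-node triple (also a node).
struct Node {
    std::variant<std::string,           // literal datum
                 std::vector<NodePtr>,  // nodeset members
                 Edge>                  // reified triple
        value;
};

NodePtr literal(std::string v) {
    return std::make_shared<Node>(Node{std::move(v)});
}
NodePtr triple(std::string name, NodePtr from, NodePtr to) {
    return std::make_shared<Node>(Node{Edge{std::move(name), from, to}});
}

int main() {
    // temperature(reading-1) = 13; the triple is itself a node, so a
    // property can attach to it (reification is built into the model).
    auto t = triple("temperature", literal("reading-1"), literal("13"));
    auto annotated = triple("triple_prop", t, literal("calibrated"));
    (void)annotated;
}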

Page 52: Addressing the Challenges of the Scientific Data Deluge

• The edge with the attribute name set_attr_1 is an attribute of a nodeset.
• The edge with the attribute name triple_prop is an attribute of the above edge.

[Diagram: nodes with values 13 and 20 reached by temperature and angle edges; two nodesets; a set_attr_1 edge attached to a nodeset, and a triple_prop edge attached to an edge.]

Page 53: Addressing the Challenges of the Scientific Data Deluge

Complete Capture of Raw Data

• Complete digital capture of data and metadata.
– It is already digital.

• Must have full provenance and other metadata.

Page 54: Addressing the Challenges of the Scientific Data Deluge


Put Everything In the Triplestore

• Unify semantic networks and data graphs.

• Metadata relationships can use reified triples.

• Don’t wait for standards; people take too long to decide.
– Bottom-up standards tend to work better.
– First there must be demand for the standard.

• All data is read-only.

Page 55: Addressing the Challenges of the Scientific Data Deluge


But We Can Never Store That Much

• Maybe we can.

• But to drive a technology, first need to show a need.

• RDBMS have had several decades of research to improve performance.

Page 56: Addressing the Challenges of the Scientific Data Deluge


Publications Are Data

• In some fields, such as materials science, papers are 80% boilerplate text.

• It’s better to directly publish this as structured, semantic data.
– No natural language (NL).

• Use NL annotations where needed.

Page 57: Addressing the Challenges of the Scientific Data Deluge

• A scientist runs experiments.
– All data is captured.

• She reaches a point where she wishes to publish.

• She reviews her experimental data (all captured with provenance and full metadata, sensor calibration, etc.) and drags and drops what is most relevant.

• She creates a narrative by creating annotated links between experiments to explain the insights.
– Typically at most one page of text, maybe less.

• She clicks a button to submit for publication.

Page 58: Addressing the Challenges of the Scientific Data Deluge

Closer Ties Between Theoreticians and Practitioners

• In the real world, semantic data treatments will likely need to deal with uncertainty, quantitativeness, ambiguity, and fuzziness.
– There is research in these areas, but not much penetration into practice, which prevents good feedback to the theoreticians.
– For example, many practitioners don’t even know about polyhierarchies (Clay Shirky).

• Attempts to create ontologies often get stuck on trying to figure out which class is the parent.

Page 59: Addressing the Challenges of the Scientific Data Deluge


Outline

• Overview of collaborative projects that I’m working on.

• Discussion of challenges and approaches.

• Technical overview of specific projects.

Page 60: Addressing the Challenges of the Scientific Data Deluge


Distributed Triplestores

• Published in e-Science 2007.

• With IU student Tharaka Devadithya.

Page 61: Addressing the Challenges of the Scientific Data Deluge

Motivation

• Data in some domains is dynamically structured.
• Predefining structures (e.g., schemas in an RDBMS) creates a barrier for storing such data.
– Certain minute details may get discarded.

• Scientists generally store experiment details in text or binary files (e.g., spreadsheets, word-processing documents).
– These files can be stored in databases as BLOBs.
– However, it is not possible to efficiently query these data.
– Sharing data with other collaborators requires that everyone can read the format used by the author.

Page 62: Addressing the Challenges of the Scientific Data Deluge

Storing Dynamically Structured Data

• An RDBMS can be used by modifying its schema each time the structure of the data changes.
– Not a feasible option if the schemas need to be modified very frequently.

• Data can be stored in a file system, with a hierarchical directory structure to organize it.
– The author needs to remember the organization of the data.
– Difficult to share data among collaborators.

• There is a strong requirement for a store of dynamically structured data that does not hinder efficient querying.

Page 63: Addressing the Challenges of the Scientific Data Deluge


Dynamic Structures with Databases

Original table:

Timestamp | Value | Units
2006-10-12 14:23:33 | 25.2 | Celsius
2006-10-12 16:44:25 | 25.5 | Celsius

Adding a new Timezone column (what value for the existing rows: EST, or NULL?):

Timestamp | Timezone | Value | Units
2006-10-12 14:23:33 | EST (or NULL?) | 25.2 | Celsius
2006-10-12 16:44:25 | EST (or NULL?) | 25.5 | Celsius

Splitting the timestamp into separate Date and Time columns:

Date | Time | Timezone | Value | Units
2006-10-12 | 14:23:33 | EST (or NULL?) | 25.2 | Celsius
2006-10-12 | 16:44:25 | EST (or NULL?) | 25.5 | Celsius

Page 64: Addressing the Challenges of the Scientific Data Deluge

Dynamic Structures with Databases… more issues

• Suppose the following information is stored about a sensor:
– Manufacturer
– Measurement type (e.g., temperature, humidity)
– Measurement units

• What if there is one sensor whose manufacturer is not known?
– Insert NULL into the Manufacturer field?

• Now, what if the purchase date must be stored for only one sensor?
– Add a new column? What value should this column hold for the other sensors?
– Add another table and join with the original table?

Page 65: Addressing the Challenges of the Scientific Data Deluge

Semantic Web Solution

• Semantic web solutions have been used successfully in both scientific and commercial environments.
– They do not impose any structure on the data.
– Data is modeled as a directed graph.

• The Resource Description Framework (RDF) is the most commonly used standard for representing such graphs.
– It can be used to describe any property of any resource.

Page 66: Addressing the Challenges of the Scientific Data Deluge

RDF and Triplestores

• Triple:
– Subject: the resource being described
– Predicate: the property being described
– Object: the value of the property

• E.g., (methyl-cyanide, crystallographer, John)
– The crystallographer for methyl-cyanide is John.

• A graph in RDF is represented as a set of triples.
– Each triple connects a subject node to an object node in the graph.

• A persistent set of such triples is known as a triplestore. A minimal sketch is given below.

[Diagram: an edge labeled Predicate from a Subject node to an Object node.]
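A minimal in-memory triplestore sketch in C++ (illustrative only; a real triplestore uses persistent storage and indexes). Here "?" acts as a wildcard in a query pattern:

#include <iostream>
#include <string>
#include <vector>

struct Triple { std::string subject, predicate, object; };

// An in-memory triplestore; match() treats "?" as a wildcard.
class TripleStore {
    std::vector<Triple> triples_;
public:
    void add(Triple t) { triples_.push_back(std::move(t)); }
    std::vector<Triple> match(const std::string& s, const std::string& p,
                              const std::string& o) const {
        std::vector<Triple> out;
        for (const auto& t : triples_)
            if ((s == "?" || t.subject == s) &&
                (p == "?" || t.predicate == p) &&
                (o == "?" || t.object == o))
                out.push_back(t);
        return out;
    }
};

int main() {
    TripleStore store;
    store.add({"methyl-cyanide", "crystallographer", "John"});
    // Who is the crystallographer for methyl-cyanide?
    for (const auto& t : store.match("methyl-cyanide", "crystallographer", "?"))
        std::cout << t.object << "\n";  // prints "John"
}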

Page 67: Addressing the Challenges of the Scientific Data Deluge


Example of RDF Graph

Page 68: Addressing the Challenges of the Scientific Data Deluge

XML Databases

• Proposed as suitable for such dynamically structured data.

• Commercial databases are starting to provide native support for XML.

• XML is extensible and does not impose any structure on the data.
– Therefore, it allows structures to be built dynamically.

• Suffers from update anomalies.

Page 69: Addressing the Challenges of the Scientific Data Deluge

Update Anomalies with XML

• Assume an XML database is used for storing information about crystallography experiments, as follows:

<experiment>
  <crystallographer>
    <name>John Smith</name>
    <designation>Scientist</designation>
    <address>...</address>
  </crystallographer>
  <startTime>...</startTime>
  <location>IUMSC</location>
  ...
</experiment>

• This results in storing redundant information.
– The address of John Smith will be the same for all experiments.
– What happens if he changes his address? Update all previous XML fragments?
• Solution: normalize certain details, as in a relational DBMS.
– E.g., separate the address information from the experiment details and provide a link (reference) to an address document.

Page 70: Addressing the Challenges of the Scientific Data Deluge

• However, in order to normalize, the schema must be known in advance.

• This is not possible when data is added arbitrarily, without conforming to any predefined schema.

• The user has to determine how to normalize the data.
• Solution: normalize everything, resulting in only attribute-value pairs. E.g.,

<experiment>
  <crystallographer ref="JohnSmith"/>
</experiment>

<JohnSmith>
  <name>John Smith</name>
</JohnSmith>
…

– Very similar to the RDF model.

Page 71: Addressing the Challenges of the Scientific Data Deluge

Need for a Distributed Triplestore

• Origination points
• Ownership
• Scalability
– Large number of triples.
• E.g., consider a table in an RDBMS having 15 columns. Migrating its data to a triplestore would produce 15 triples for each row of the table.
• There will also be:
– data from more than one table;
– data that normally does not get stored in a database.
– This leads to scalability issues.
• E.g., querying would be slow, and indices might often need to be fetched from and stored to disk.
– To go beyond the scalability limits of a single triplestore, triples must be distributed across multiple triplestores.

Page 72: Addressing the Challenges of the Scientific Data Deluge

Our Approach

• Clients access the triplestores via a mediator.
• The mediator maintains several indexes to facilitate efficient querying.
• When the mediator receives a query, it:
– breaks the query down into several sub-queries;
– finds out which triplestores are capable of responding to each sub-query.
• The indexes are mainly used to:
– build a cost model for the querying;
– eliminate triplestores that cannot return results for a given sub-query.

A sketch of the sub-query routing is given below.
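An illustrative simplification (my own, not the paper’s implementation) of the mediator decomposing a query into triple-pattern sub-queries and routing each one to candidate stores using a predicate index:

#include <iostream>
#include <set>
#include <string>
#include <vector>

// A sub-query is a single triple pattern; "?x" marks a variable.
struct Pattern { std::string subject, predicate, object; };

// Stand-in for what the mediator knows about a remote triplestore:
// here, only which predicates it holds (a simplified predicate index).
struct StoreInfo { std::string name; std::set<std::string> predicates; };

// Route each sub-query to the stores that could answer it; stores
// lacking the predicate are eliminated without being contacted.
void route(const std::vector<Pattern>& subqueries,
           const std::vector<StoreInfo>& stores) {
    for (const auto& q : subqueries) {
        std::cout << "(" << q.subject << " " << q.predicate << " "
                  << q.object << ") ->";
        for (const auto& s : stores)
            if (s.predicates.count(q.predicate)) std::cout << " " << s.name;
        std::cout << "\n";
    }
}

int main() {
    // Query: who is the crystallographer of an experiment at IUMSC?
    // It decomposes into two triple patterns sharing the variable ?e.
    std::vector<Pattern> subqueries = {
        {"?e", "location", "IUMSC"},
        {"?e", "crystallographer", "?who"},
    };
    std::vector<StoreInfo> stores = {
        {"store-A", {"location"}},
        {"store-B", {"crystallographer", "location"}},
    };
    route(subqueries, stores);
}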

Page 73: Addressing the Challenges of the Scientific Data Deluge

Types of Indexes at the Mediator

• Predicate Index
– Contains details about the predicates in each triplestore.
– Certain fields are used for cost estimation for sub-queries.

• Node Index
– Maintains a list of nodes in the triple graph, along with the triplestores in which those nodes exist.
– Contains only resources (e.g., ns:crystallographer); literals (e.g., “John Smith”) are not stored.
– Used to eliminate certain triplestores when sub-querying.

• Edge Index
– Two edge indexes are used, for outgoing and incoming edges respectively.
– Used to avoid querying triplestores that do not have the corresponding edges from or to them.

A rough sketch of these index shapes is given below.
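A rough data-structure sketch of the three indexes (the shapes and field names are mine; the real predicate index carries more cost-estimation fields than shown):

#include <map>
#include <set>
#include <string>

using StoreId = std::string;

// Predicate index: predicate -> per-store details (e.g., triple counts
// usable for estimating sub-query costs).
struct PredicateInfo { long tripleCount = 0; /* other cost fields */ };
using PredicateIndex =
    std::map<std::string, std::map<StoreId, PredicateInfo>>;

// Node index: resource node -> stores containing it (literals such as
// "John Smith" are deliberately not indexed).
using NodeIndex = std::map<std::string, std::set<StoreId>>;

// Edge indexes: which predicates leave from (outgoing) or arrive at
// (incoming) nodes in each store; used to skip stores that have no
// corresponding edges.
struct EdgeIndexes {
    std::map<StoreId, std::set<std::string>> outgoing;
    std::map<StoreId, std::set<std::string>> incoming;
};

int main() {
    PredicateIndex pidx;
    pidx["crystallographer"]["store-B"].tripleCount = 12000;
    NodeIndex nidx;
    nidx["ns:crystallographer"] = {"store-A", "store-B"};
    EdgeIndexes eidx;
    eidx.outgoing["store-A"].insert("location");
}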

Page 74: Addressing the Challenges of the Scientific Data Deluge

Future Work

• Minimize joins between triplestores.
– Identify frequent joins.
– Instruct the triplestores to redistribute their triples such that most future joins can be performed locally.

• Avoid the extra network hop through the mediator by using a mediator cache.

• Consider network communication when estimating costs for the query plan.

Page 75: Addressing the Challenges of the Scientific Data Deluge


Parallel XML Parsing

• Published in Grid 2006, CCGrid 2007, e-Science 2007, IPDPS 2008, ICWS 2008 (streaming), HiPC 2008 (streaming).

• With BU students Yinfei Pan and Ying Zhang.

Page 76: Addressing the Challenges of the Scientific Data Deluge

Motivation

• XML has gained wide prevalence as a data format for input and output.

• Multicore CPUs are becoming widespread.
– There are plans for 100 cores.

• If you have 100 cores and you are using only one of them to read and write your output, that could be a significant waste.

Page 77: Addressing the Challenges of the Scientific Data Deluge

Parallel XML Parsing

• How can XML parsing be parallelized?
– Task parallelism.
– Pipeline parallelism.
– Data parallelism.

Page 78: Addressing the Challenges of the Scientific Data Deluge

• Task parallelism.
– Multiple independent processing steps.
– The sauce for a dish can be made in parallel with the main part.

[Diagram: step 1 runs on core 1; steps 2A and 2B run in parallel on cores 1 and 2; step 3 runs on core 1, laid out along a time axis.]

Page 79: Addressing the Challenges of the Scientific Data Deluge

• Pipeline parallelism.
– Multiple stages, all performed simultaneously in parallel.
– If you are making two cakes (but have only one oven), you can start mixing the batter for the second cake while the first one is in the oven.

[Diagram: cores 1-3 run stages 1-3 along a time axis; at each step the stages work on successive data items (core 1: stage 1 on data C, D, E; core 2: stage 2 on data B, C, D; core 3: stage 3 on data A, B, C).]

Page 80: Addressing the Challenges of the Scientific Data Deluge

• Data parallelism.
– Divide the data up and process multiple pieces in parallel, as in the sketch following the diagram.

[Diagram: input chunks 1-3 are processed on cores 1-3; the resulting output chunks are merged into the final output.]
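A generic data-parallel skeleton in C++ (illustrative; the per-chunk work is a toy stand-in, and the following slides explain why XML cannot be chunked this naively):

#include <algorithm>
#include <iostream>
#include <numeric>
#include <string>
#include <thread>
#include <vector>

// Toy per-chunk work: count '<' characters in a chunk.
int countAngles(const std::string& chunk) {
    int n = 0;
    for (char c : chunk)
        if (c == '<') ++n;
    return n;
}

int main() {
    std::string input = "<a><b>text</b><c/></a>";
    const int nChunks = 3;
    const std::size_t sz = (input.size() + nChunks - 1) / nChunks;

    // Process each chunk on its own thread, then merge the outputs.
    std::vector<int> partial(nChunks);
    std::vector<std::thread> workers;
    for (int i = 0; i < nChunks; ++i)
        workers.emplace_back([&, i] {
            std::size_t begin = std::min(input.size(), i * sz);
            std::size_t end = std::min(input.size(), begin + sz);
            partial[i] = countAngles(input.substr(begin, end - begin));
        });
    for (auto& w : workers) w.join();

    std::cout << std::accumulate(partial.begin(), partial.end(), 0) << "\n";
}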

Page 81: Addressing the Challenges of the Scientific Data Deluge

But XML is Inherently Sequential

• How can a chunk be parsed without knowing what came before?

• The parser doesn’t know what state to start in.
• One could scan forwards and backwards in various ways, but that is ad hoc and tricky.
– Special characters like < can appear in comments.

<element attr="value">content</element>

Page 82: Addressing the Challenges of the Scientific Data Deluge

Previous Work

• We used a fast, sequential preparse scan:
– Build an outline of the document (the skeleton).
– The skeleton is used to guide the full parse by first decomposing the XML document into well-formed fragments at well-defined, unambiguous positions.
– The XML fragments are parsed separately on each core with the libxml2 APIs.
– The results are merged into the final DOM with the libxml2 APIs.

• The preparse is sequential, however, so Amdahl’s law kicks in. We scale well to about 4 cores.

• So how can we parallelize the preparse?

Page 83: Addressing the Challenges of the Scientific Data Deluge

Example: The Preparsing DFA

• The preparsing DFA has two actions, START and END, which are used to build the skeleton during execution of the DFA.

[State diagram: the preparsing DFA, with states 0-7 and transitions on characters such as <, >, /, !, quotes, and name characters (a); the START action is emitted on the transition that begins an element name, and END on the transitions that close an element.]

Page 84: Addressing the Challenges of the Scientific Data Deluge

Example of Running the Preparsing DFA

<foo>sample</foo>

[Diagram: the sequence of DFA states while scanning <foo>sample</foo>, with the START action emitted at the element name and END at the close tag.]

How can this be parallelized?

Page 85: Addressing the Challenges of the Scientific Data Deluge

Meta-DFA

• Goal:
– Pursue simultaneously all possible states at the beginning of a chunk when a processor is about to parse the chunk.

• Achieved by:
– Transforming the original DFA into a meta-DFA whose transition function runs multiple instances of the original DFA in parallel via sub-DFAs.
– For each state q of the original DFA, the meta-DFA includes a complete copy of the DFA as a sub-DFA, which begins execution in state q at the beginning of the chunk.
– In actual execution, the meta-DFA transitions from a set of states to another set of states (see the sketch below).
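A minimal sketch of the state-pursuit idea, using a toy two-state DFA rather than the actual preparsing DFA: each chunk is scanned once, but the scan tracks, for every possible starting state, where the original DFA would end up.

#include <array>
#include <iostream>
#include <string>
#include <vector>

// Toy DFA with 2 states: 1 = inside a tag, 0 = outside a tag.
constexpr int kStates = 2;
int step(int state, char c) {
    if (c == '<') return 1;
    if (c == '>') return 0;
    return state;
}

// Meta-DFA scan of one chunk: a single pass that tracks, for every
// possible starting state q, the state the original DFA would reach
// by the end of the chunk. (A real implementation also records one
// skeleton output per q.)
std::array<int, kStates> scanChunk(const std::string& chunk) {
    std::array<int, kStates> end{};
    for (int q = 0; q < kStates; ++q) end[q] = q;
    for (char c : chunk)
        for (int q = 0; q < kStates; ++q) end[q] = step(end[q], c);
    return end;
}

int main() {
    // Each chunk would be scanned on a different core.
    std::vector<std::string> chunks = {"<foo>sam", "ple</fo", "o>"};
    for (const auto& ch : chunks) {
        auto end = scanChunk(ch);
        std::cout << "start 0 -> " << end[0]
                  << ", start 1 -> " << end[1] << "\n";
    }
}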

Page 86: Addressing the Challenges of the Scientific Data Deluge

Output Merging

• Since the meta-DFA pursues multiple possibilities simultaneously, there are also multiple outputs when a chunk is finished.
– One corresponding to each possible initial state.

• We know definitively the state at the end of the first chunk.
– This is used to select which output of the second chunk is the correct one.
– The definitive state at the end of the second chunk is then known.
– Etc. (A sketch of this selection pass follows.)
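Continuing the toy example above, a sketch of the sequential selection pass (the per-chunk outputs and end states are invented for illustration):

#include <iostream>
#include <string>
#include <vector>

// Per-chunk result of a meta-DFA scan: for each possible starting
// state q, the end state reached and the output produced from q.
struct ChunkResult {
    std::vector<int> endState;        // indexed by starting state q
    std::vector<std::string> output;  // indexed by starting state q
};

int main() {
    // Invented results for two chunks of a two-state toy DFA.
    std::vector<ChunkResult> results = {
        {{1, 0}, {"chunk0-from-q0", "chunk0-from-q1"}},
        {{0, 1}, {"chunk1-from-q0", "chunk1-from-q1"}},
    };
    int state = 0;  // the known initial state of the document
    for (const auto& r : results) {
        std::cout << r.output[state] << "\n";  // keep the correct output
        state = r.endState[state];  // now definitive for the next chunk
    }
}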

Page 87: Addressing the Challenges of the Scientific Data Deluge

Performance Evaluation

• Machine:
– Sun E6500 with 30 400 MHz US-II processors
– Operating system: Solaris 10
– Compiler: g++ 4.0 with the -O3 option
– XML standard library: Libxml2 2.6.16

• Tests:
– We take the average of ten runs.
– The test file is from the well-known Protein Data Bank (PDB) project, sized to 34 MB.
– All speedups are measured against parsing with stand-alone Libxml2.

Page 88: Addressing the Challenges of the Scientific Data Deluge

• The full parsing process is:
– First do a parallel preparse using a meta-DFA. This generates an outline of the document known as the skeleton.
– Then use techniques based on parallel depth-first tree search to parallelize the full parse.
– Subtrees of the document are parsed using unmodified libxml2.

Page 89: Addressing the Challenges of the Scientific Data Deluge


Preparser Speedup

• Speedup of the parallel preparser relative to the non-parallel preparser.

Page 90: Addressing the Challenges of the Scientific Data Deluge


Speedup on parallel full parsing

• After applying our meta-DFA technique to parallelize the preparsing stage, the full parallel parse is now scalable.

Page 91: Addressing the Challenges of the Scientific Data Deluge

Summary

• Data-parallel XML parsing is challenging because the parser does not know in which state to begin a chunk.
– One solution is to simply begin the parser in all states simultaneously.

• This can be achieved by modeling the parser as a DFA with actions, then transforming the DFA into a meta-DFA (a product machine).

• The meta-DFA runs multiple instances of the original DFA, one instance for each state of the original DFA.

• The number of states in the meta-DFA is finite, so it is also a DFA and can be executed by a single core.
– The parallelism of the meta-DFA is logical parallelism.

Page 92: Addressing the Challenges of the Scientific Data Deluge

Future Work

• Parallelizing XPath.
– Significantly more challenging, but due to Amdahl’s law, parsing must be parallelized first.

• Offload preparsing to an FPGA, or perhaps a GPU.

Page 93: Addressing the Challenges of the Scientific Data Deluge

Acknowledgements

• Grateful for the support provided by the NSF and the DOE for this work.
– NSF awards 0836667, 0753178, 0513687, and 0446298
– DOE award DE-FG02-07ER25803