Managing Bulk Sensor Data for Heterogeneous Distributed Sensor
Systems
A Thesis Presented
by
Hanjiao Qiu
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements
for the degree of
Master of Science
in
Electrical and Computer Engineering
Northeastern University
Boston, Massachusetts
April 2014
To my family.
Contents
List of Figures iv
List of Tables v
List of Acronyms vi
Acknowledgments vii
Abstract of the Thesis viii
1 Introduction 1
2 Background and Motivation 5
2.1 CPS Approach 5
2.2 Big Data Challenges 6
2.3 Data Stream Management 8
3 Versatile Onboard Traffic-Embedded Roaming Sensors (VOTERS) System Overview 10
3.1 VOTERS Van Implementation 10
3.2 Data Volumes and Types 12
3.3 The VOTERS Software Solution Scalable Intelligent ROaming Multi-modal Multi-sensor (SIROM3) 13
4 Heterogeneous Stream File system Overlay (HSFO) 17
4.1 Fusion Foundations 18
4.2 HSFO Overview 19
4.3 Metadata Definition 22
4.3.1 Streams and Files 23
4.3.2 Sessions 26
4.3.3 Surveys 27
4.4 Big Data Storage Location 28
4.5 Metadata Example for Data Correlation 30
4.6 Bulk Data Handling Library 31
4.7 Big Data Transfer and Aggregation 33
5 Plugin Executor (PLEX) 35
5.1 Plugin Executor (PLEX) Environment Overview 35
5.2 Plugins 36
5.2.1 Plugin Definition 36
5.2.2 Interaction with HSFO 38
5.3 Plugin Executor 40
5.3.1 Rule-based scheduler 41
5.3.2 Design Automation and Flexibility 42
6 Experimental Results 44
6.1 Performance Analysis 44
6.1.1 Time Synchronization Accuracy 44
6.1.2 Multi-Sensor Aggregator (MSA) Performance Analysis 45
6.1.3 Statistics of Data Behaviors 46
6.1.4 Heterogeneous Stream File-system Overlay (HSFO) Performance Overhead 47
6.2 Data Fusion and System Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7 Conclusion 52
Bibliography 54
List of Figures
1.1 Time-varying behavior of civil infrastructure . . . . . . . . . . . . . . . . . . . . . 2
2.1 Transportation CPS Overview 6
2.2 Big Data Challenge 7
2.3 Automated Data Flow 8
3.1 VOTERS Van with Sensors 11
3.2 SIROM3 Multi-Tier Hierarchical Architecture 13
3.3 SIROM3 Implementation Architecture 15
4.1 HSFO Overview 20
4.2 HSFO Metafile Definition 23
4.3 Example Stream Definition Table 25
4.4 Stream Creation 26
4.5 Example Stream Definition Table 27
4.6 Big Data Storage Location 29
4.7 Metafile Example for Data Correlation 30
4.8 Abstraction of Bulk Data Handling Library 33
4.9 Data Transfer and Aggregation 34
5.1 PLEX Environment 36
5.2 Interaction of plugins with HSFO 39
5.3 Scheduling of Plugins 41
6.1 Timing Analysis 44
6.2 Temporal Data Fusion Example 49
6.3 City-wide Infrastructure Performance Inspection 50
List of Tables
3.1 Data diversity and volume . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.1 Stream Definition 24
4.2 Name Convention for Big Data Directories 28
4.3 Big Data Handling Library APIs 32
5.1 Plugin Definition 37
5.2 Crack Distortion Correction Plugin Example 37
5.3 Crack Detection Plugin Code Example 40
5.4 Three Processing Levels 43
6.1 MSA Performance Results 45
6.2 Statistics of Data Behaviors 47
6.3 HSFO Performance Results 48
6.4 The Overall Impact of SIROM3 49
List of Acronyms
CPS Cyber-Physical Systems.
HSFO Heterogeneous Stream File-system Overlay.
PLEX Plugin Executor.
SIROM3 Scalable Intelligent ROaming Multi-modal Multi-sensor.
MMMS Multi-Modal Multi-Sensor.
RMMMS Roaming Multi-Modal Multi-Sensor.
MSA Multi-Sensor Aggregator.
RSS Roaming Sensor System.
FCM Fleet Management and Control.
VOTERS Versatile Onboard Traffic-Embedded Roaming Sensors.
RTE Run-time Environment.
PTP Precision Timing Protocol.
NTP Network Timing Protocol.
SBC Single Board Computer.
BDH Bulk Data Handling.
Acknowledgments
The work presented here could not have been completed without the contributions of many supportive and knowledgeable people.
Foremost, I want to express my deep gratitude to the committee chair and my advisor, Dr. Gunar Schirner, for his continuous support and mentorship during my master's program. His patience, enthusiasm, and immense knowledge helped and encouraged me throughout my research and the writing of this thesis. From him I have learned not only technical expertise, but also a rigorous attitude towards work and a relentless pursuit of perfection.
My sincere appreciation also goes to the rest of the committee, Dr. Ralf Birken and Dr. Ming Wang, for their guidance and help over these years. Their devotion to the VOTERS project spurred me on in completing my work and in actively participating in this interdisciplinary project.
Special thanks to my colleague, Ph.D. candidate Jiaxing Zhang. Without his referral and recommendations, I would never have had the chance to join this lab and obtain financial support. He encouraged me and offered creative ideas when I was inexperienced, is always willing to lend a hand with his best suggestions, and sets a good example for me as a good friend in my life. I also want to thank my friends, most of whom are also alumni of Northeastern University, for all the fun we had in this city far from home, and for their cheer and care.
Last but not least, I want to thank my family and Guancheng Gu. They are always there for me and offer sincere consolation and love when I am fragile, and they stand by me through the good times and bad. Their arms and hugs are always my safest harbor.
This work is part of the VOTERS project, a joint venture of Northeastern University, the University of Vermont, the University of Massachusetts Lowell, Earth Science System LLC, and Trilion Quality Systems Inc. This project is supported by the U.S. Department of Commerce, National Institute of Standards and Technology, Technology Innovation Program.
Abstract of the Thesis
Managing Bulk Sensor Data for Heterogeneous Distributed Sensor
Systems
by
Hanjiao Qiu
Master of Science in Electrical and Computer Engineering
Northeastern University, April 2014
Gunar Schirner, Adviser
The current U.S. transportation infrastructure requires tremendous investment to maintain due to critical roadway and pavement conditions. To prioritize repair expenditures, Cyber-Physical Systems (CPS) are a promising way to obtain intrinsic knowledge about infrastructure performance, such as roadway surface and subsurface deterioration, through sensors and actuators. However, to characterize and quantify the infrastructure's time-varying behavior (infrastructure health and life cycle) in a cost-efficient and non-intrusive way, an underlying framework to handle domain-specific big data in CPS is needed.
This thesis proposes a holistic approach to managing the domain-specific bulk sensor data generated by heterogeneous distributed sensor systems, addressing the point where CPS meets big data. We focus on big data handling in the Scalable Intelligent ROaming Multi-Modal Multi-Sensor (SIROM3) framework, which collects data about roadway conditions from multiple domains through mobile agents. A Heterogeneous Stream File-system Overlay (HSFO) is proposed as a platform-independent layer to uniformly define, organize, and manage the high volume of heterogeneous streaming data. Additionally, a flexible plugin system (PLEX) is introduced to simplify and automate data feature extraction, correlation, fusion, and visualization. Both HSFO and PLEX are designed for high scalability and adaptability: they can be executed on a wide range of platforms, from mobile systems to mainstream servers, with a common software/hardware stack. Our solution addresses big data collection, storage, and aggregation through to processing and knowledge discovery. The embodied automation eliminates human intervention at every stage and increases overall efficiency.
Over 20 terabytes of data covering 300 miles have been collected, aggregated, and fused to comprehend the pavement dynamics of the entire city of Brockton, MA. The performance of data processing with and without HSFO was compared. The results indicate that processing data
with HSFO adds an overhead of 0.19 µs/KB over processing without HSFO, and the difference in CPU utilization between the two methods is less than 5%. This implies that HSFO and PLEX have quite low overhead and pose a negligible impediment to system performance. The unified automation they provide has demonstrated a significant increase in overall productivity, by nearly 25 times, from data collection to processing. As a result, we established foundational tools for managing big data for distributed multi-modal multi-sensor systems in civil infrastructure monitoring. They provide a rapid and comprehensive understanding of civil infrastructure health and support life-cycle management.
Chapter 1
Introduction
There are more than four million miles of roads and nearly 600,000 bridges in the U.S.
demanding a broad range of maintenance activities [2]. According to the ASCE report in 2013 [32],
the U.S. infrastructure embarrassingly scored only a "D+" due to deterioration and lack of proper care. One in four bridges is either structurally deficient or functionally obsolete; one third of the U.S.'s major roads are in poor or mediocre condition. The monumental challenge of current transportation infrastructure management is the prioritization of expenditures within budgetary constraints as well as the implementation of maintenance and repairs. It is estimated that $3.6 trillion in investments is needed over 5 years to bring the nation's infrastructure up to a good condition.
In current civil infrastructure, pavement deterioration frequently takes place below the surface and cannot be evaluated by visual means [8]. Pavement deteriorates due to internal moisture damage, debonding, and loss of subsurface support. Pavement layers are subjected to extensive abrasion and deterioration from service loading (e.g. traffic) and environmental attacks (e.g. freeze-thaw, rain, road salts). Figure 1.1 conceptually illustrates the dynamic change of roads (life-cycle modeling) and the optimal opportunities for early deterioration identification and responsive repairs.
If certain distresses are repaired before they reach critical levels, at least five times less money will be spent compared to a late overhaul. Worse, the impact is exponentially amplified for large-scale civil infrastructure left without intervention, whereas with early repairs the pavement condition can be maintained over a long span of time. Identifying trouble spots as soon as they appear will save huge amounts of money, time, and effort, and prolong the longevity of the pavement. It is of primary importance to acquire detailed information about the speed of structural health deterioration so that responsive maintenance is possible and the actual infrastructure life cycle can be mapped. Long time intervals between two inspections may miss the optimal window to
Figure 1.1: Time-varying behavior of civil infrastructure. (Plot: pavement condition, from Excellent down to Failed, versus pavement age in years, 1-25. Spending $1 on preventive maintenance early (seal cracks and fill potholes) avoids $4 to $5 of major maintenance later (resurface entire road).)
repair. Therefore, knowing the infrastructure life cycle is necessary to prioritize and direct investments to the areas of most need and benefit. As a result, frequent structural health monitoring is essential to obtain the complete curve of the infrastructure life cycle.
However, current inspection methods often introduce issues such as intrusive data gathering (e.g. stopping traffic), huge amounts of human effort, and consequently infrequent data collection (e.g. at intervals of decades) and limited coverage [32]. Traditional inspection methods [3, 4, 5] are slow, cause traffic delays and road closures, and are often not effective. Even novel technologies such as ground penetrating radar (GPR) and infrared thermography [30, 16, 24, 6, 29] suffer either from the need for traffic closures or from insufficient spatial data coverage. With these methods, it is impossible to obtain the life cycle of the time-varying behavior of civil infrastructure due to the difficulty of continuously monitoring the pavement status. Therefore, a non-intrusive, automated, and cost-effective system with easy-to-deploy sensor technology is crucial for improved roadway inspection methodologies and for comprehensively understanding the dynamics of civil infrastructures. In addition, the ability to manage and jointly consider data from multiple service providers, locations, and inspection dates is highly desired.
A Cyber-Physical System (CPS) is a potential candidate: it integrates computing elements with physical processes to observe, react to, and actuate physical entities, and often involves heterogeneous components, hybrid systems, and design automation methodologies. In particular, applying CPS approaches to infrastructure monitoring is a promising way to address the societal need outlined above through interdisciplinary research (from computer science to civil engineering) and to invent solutions where CPS meets big data. An application of the
CPS solution is a Roaming Multi-Modal Multi-Sensor (RMMMS) system, which embeds heterogeneous sensors and computing onto a mobile agent to collect multi-modal data under roaming conditions [8, 10]. Research efforts have been devoted to applying multi-modal multi-sensor systems to the civil domain in a cyber-physical approach [9, 19, 27, 25].
However, the CPS solution is difficult to develop due to the heterogeneity in multi-modal sensors, synchronization principles, reliable communication, and coordinated operations. Typically, sensor systems are tailored to a specific type of application, which impedes overall adaptability and scalability. In addition, as data-intensive sensors produce large volumes of streaming data in real time, it is challenging to store, access, process, and cross-reference the diverse big data. Complexity and challenges explode further if multiple RMMMS systems are deployed to increase the geographical coverage area and to survey repetitively for a time-varying assessment.
The Versatile Onboard Traffic-embedded Roaming Sensors (VOTERS) [7, 8] system is a real implementation of an RMMMS system. It incorporates the CPS approach and complements periodic localized inspections of roadways and bridge decks with continuous network-wide health monitoring using multi-modal sensors mounted on vehicles. To solve the above challenges in RMMMS systems, the Scalable Intelligent ROaming Multi-Modal Multi-Sensor (SIROM3) framework is proposed in [38] as the software solution of VOTERS. It provides a unified solution to simplify the development, deployment, prototyping, and management of multiple RMMMS systems.
In this work, we focus on managing the bulk sensor data generated by the heterogeneous distributed sensors in SIROM3. The large volume of data from multiple domains must be efficiently stored, aggregated, and manipulated for knowledge discovery. We propose a Heterogeneous Stream File-system Overlay (HSFO) to efficiently store, manage, and aggregate the heterogeneous big data through a metafile hierarchy. The design complexity arising from survey dynamics and sensor versatility is decomposed into the different layers of HSFO, and each layer abstracts its corresponding metadata into metafiles. A Plugin Executor (PLEX) is designed to run on top of HSFO for data correlation and referencing across distributed components and different domains. Both HSFO and PLEX are platform-agnostic and can run on any level or component of SIROM3. They cooperatively automate big data handling from data collection and transfer through processing and fusion.
We used the VOTERS system to conduct a city-wide pavement condition inspection in Brockton, MA. Over 20 terabytes of data covering 300 miles of road were collected, aggregated, fused, and geo-spatially visualized for roadway assessment. The design automation enables frequent monitoring, which offers an ideal platform for investigating the time-varying behaviors of roadways.
The thesis is organized as follows: Chapter 2 introduces the CPS approach and the big data challenges it faces. Chapter 3 overviews the VOTERS system as well as its software solution, the SIROM3 framework. Chapter 4 details how HSFO manages the big data. Chapter 5 elaborates on automated data processing and fusion using PLEX. Chapter 6 presents survey results and evaluates performance from multiple aspects. Chapter 7 draws conclusions from the research.
Chapter 2
Background and Motivation
This chapter examines the cyber-physical approach as a promising solution for understanding the life cycle of infrastructure health conditions. It also raises the challenges of big data handling in multi-modal multi-sensor systems, which drive the need for a unified solution to realize data fusion and an automated strategy to facilitate knowledge discovery. It then introduces and compares current methodologies for distributed data stream analysis and sensor data fusion.
2.1 CPS Approach
Current methods for structural health inspection often require intrusive data collection, obstruction of traffic, prolonged post-analysis, and huge manual effort and expenditure. Therefore, with current technology, it is infeasible to conceive an efficient, flexible, and automated solution to acquire the life cycle with the time-varying behaviors of civil infrastructures. This drives the need for an integrated cyber-physical system (CPS) approach, which incorporates resourceful computational power into the observation, measurement, and actuation of physical processes. In this case, the CPS approach offers automatic control of heterogeneous sensor groups and integration of diverse big data, providing an efficient way to comprehensively evaluate transportation infrastructure performance.
Infrastructure inspection methodologies can be portrayed as a grand loop including construction, usage, deterioration, maintenance, and repair. However, this control loop has been put aside in spite of its criticality, due to long latencies. Figure 2.1 abstracts the whole procedure into a CPS model and presents a conceptual solution. A diversity of sensing applications is embedded in mobile roaming acquisition systems. Huge amounts of data are collected by the mobile heterogeneous sensor systems during surveys, feeding into multi-layer multi-modal data fusion. An asset performance
Figure 2.1: Transportation CPS Overview. (Diagram: mobile sensors observe the transportation infrastructure and produce sensor big data; multi-layer, multi-modal fusion yields asset performance metrics that feed maintenance decisions, which in turn drive the actuators: construction and repair operations and the mobile sensors themselves.)
metric is subsequently yielded as a vital result and imported into a maintenance decision system whose duty is to make maintenance and construction decisions. The actionable information is fed back into the system to control the actuators: repair and construction operations and the mobile sensors. This control loop would accelerate data collection, storage, and processing to guide maintenance and construction decisions. The fine-grained and continuous infrastructure monitoring makes detailed modeling of infrastructure deterioration and life cycle readily available, leading to the possibility of early, affordable care. In short, the CPS approach establishes a way to manage the life cycle and health of civil infrastructures.
2.2 Big Data Challenges
The CPS approach hinges on overcoming big data challenges. It is inevitable that some sensor systems have to perform fast sampling in time (e.g. mm-wave radar) or dense sampling in space (e.g. Ground Penetrating Radar (GPR) arrays). The huge volume of streaming data generated may quickly flood the entire system during data collection. Since a hybrid system with computationally intensive components requires a distributed solution, this issue gets more severe when multiple such components operate simultaneously. Distributed systems entail coordination and collaboration, which can be even more convoluted; this is particularly true given that tight time synchronization is the foundation for sensor fusion in a distributed, fast-moving roaming sensor system. Network and real-time constraints in embedded processing unavoidably affect data migration and demand data compression.
The magnitude of the big data challenge is shown in Figure 2.2. Terabytes of data will be generated from densely distributed sensors of different types and attributes, including acoustic
Figure 2.2: Big Data Challenge. (Diagram: acoustic sensor arrays, radar sensor arrays, and an HD video camera generate over terabytes of data daily, flowing into storage and fusion.)
sensor arrays, HD video cameras, and radar sensor arrays in continuous streams daily; this data is then aggregated in centralized storage for data fusion and knowledge discovery. The problem becomes harder still as more sensor types are introduced and time-lapse surveys are performed. Big data handling poses challenges including managing large data streams, providing efficient data storage and compression, allowing identification and classification, supporting fast and reliable transfer, and enabling data fusion from heterogeneous sensors as well as geo-spatial visualization. The goal of this work is to manage and make use of the bulk data of heterogeneous distributed sensor systems. Consequently, a hierarchical file-system overlay such as the Heterogeneous Stream File-system Overlay (HSFO) is needed to store, categorize, filter, and aggregate big data originating from various sensors. The standard definition of each layer in HSFO benefits the management of streams with different attributes through general solutions. A data processing environment like the Plugin Executor (PLEX) is required as a consumer, focusing on retrieving, refining, cross-referencing, and fusing data to obtain joint and integrated results. It is essential for PLEX to be adaptable to complex streaming applications as well as capable of integrating new fusion algorithms easily. HSFO and PLEX should be platform-agnostic so that they work flexibly and compatibly on a wide range of platforms.
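Conceptually, integrating a new fusion algorithm into such a plugin environment needs only a small contract between the executor and each algorithm. The sketch below is hypothetical: the `Plugin` interface, the `CrackDetector` class, and the stream layout are invented for illustration, not the actual PLEX API described later in this thesis.

```python
from abc import ABC, abstractmethod

class Plugin(ABC):
    """Hypothetical plugin contract: an executor can schedule any
    algorithm that declares its inputs and implements run()."""

    @abstractmethod
    def inputs(self):
        """Names of the data streams this plugin consumes."""

    @abstractmethod
    def run(self, streams):
        """Process the input streams and return a derived result."""

class CrackDetector(Plugin):
    """Toy example of a new fusion algorithm dropped into the system."""

    def inputs(self):
        return ["camera_hd"]

    def run(self, streams):
        # Placeholder analysis: count frames flagged as containing a crack.
        return sum(1 for frame in streams["camera_hd"] if frame.get("crack"))

detector = CrackDetector()
result = detector.run({"camera_hd": [{"crack": True}, {"crack": False}]})
print(result)  # 1
```

Because the executor sees only the interface, adding an algorithm means adding one class, with no changes to the storage or scheduling layers.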
To boost overall productivity, a design automation methodology should also be integrated to facilitate data fusion and reduce human intervention throughout the whole procedure. Figure 2.3 illustrates the integral automated data flow. The physical entities show the real implementation, and the orange arrows reveal the data flow running through them, corresponding to the four phases on the right. The first phase is an array of multi-modal sensors deployed on mobile vehicles to collect data. Data from multiple domains is gathered by these distributed sensors during roaming and managed in their local storages. These local storages will then be
uploaded and aggregated to a control center for centralized storage and management whenever the network is available. Eventually, applications operate on the data for processing, correlation, fusion, and visualization. The whole procedure can be automated, allowing operators to work remotely and make timely decisions.
Figure 2.3: Automated Data Flow. (Diagram of the four phases: sensor collection; local storage; upload and aggregation; processing, fusion, and visualization.)
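The four phases can be sketched as a minimal pipeline. Every function name and data shape below is an illustrative assumption, not the SIROM3 implementation; the point is only that each phase has a clean input/output boundary, which is what makes end-to-end automation possible.

```python
# Phase 1: sensors on a vehicle collect samples (3 dummy samples each).
def collect(sensors):
    return [{"sensor": s, "samples": [i for i in range(3)]} for s in sensors]

# Phase 2: records are kept in the vehicle's local storage.
def store_locally(records):
    return list(records)

# Phase 3: local storages are uploaded and merged at a control center.
def upload_and_aggregate(local_storages):
    central = []
    for storage in local_storages:
        central.extend(storage)
    return central

# Phase 4: applications process and fuse the centralized data.
def process_and_fuse(central):
    return {rec["sensor"]: len(rec["samples"]) for rec in central}

# Two hypothetical vans with different sensor sets.
vans = [["gpr", "camera"], ["acoustic"]]
central = upload_and_aggregate([store_locally(collect(v)) for v in vans])
fused = process_and_fuse(central)
print(fused)  # {'gpr': 3, 'camera': 3, 'acoustic': 3}
```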
The strong processing power and automated design brought by HSFO and PLEX would effectively address the big data challenges and manage the heterogeneous bulk sensor data in the CPS approach, reducing the long latencies and shortening the execution cycle of the control loop. In the context of life-cycle management for roadway infrastructures, the design of HSFO with PLEX significantly benefits the capture of the pavement life cycle, so that preventive maintenance and early repairs can be attained for future civil infrastructures.
2.3 Data Stream Management
RMMMS systems in the civil application domain have recently emerged [21, 12, 13], offering real-time monitoring with more mobility and coverage. However, the large volumes of heterogeneous data generated by an RMMMS system have not received much attention.
In [17, 20], middleware and scalable architectures for processing and managing heterogeneous sensor data are proposed and implemented. Unlike those approaches, which focus on the efficiency of manipulating the data, SIROM3 addresses the big data issue from a file-system overlay perspective. Another line of work handles heterogeneous big data from a database perspective [40, 18]. Although this provides an easy interface for querying the data, certain overhead is associated with database operations. Furthermore, while the database approach is often only suitable on the data center side, the overlay design in SIROM3 is flexible enough to run platform-independently with little overhead.
Many research efforts have been devoted to managing and integrating data streams distributed over heterogeneous sensor nodes. As a wide range of surveillance applications using distributed wireless sensor networks (WSN) is desired, several avenues in data stream management systems (DSMS) have been explored in recent years to aggregate heterogeneous distributed sensor data and acquire complete knowledge of the physical world [11]. Global Sensor Networks (GSN) [1] offers middleware that integrates heterogeneous sensor networks at the network level. Each kind of sensor network is abstracted into a virtual sensor, which helps provide a homogeneous view of heterogeneous sensors. The virtual sensor definition is an XML file containing the stream's name, address, output structure, input streams, and the SQL queries that lead to the output stream. SStreaMware [18] uses three-level query processing. Unlike GSN, which processes in a bottom-up fashion, SStreaMware uses a hierarchical flow in which a centralized control site provides a global sensor query service (SQS) and distributes subqueries to lower-level gateways. Each gateway further sends decomposed subqueries down to the proxies of the different sensors. TinyDB [28] is another middleware approach, built directly on TinyOS. In this method, each sensor node is installed with TinyDB so that all nodes can be viewed homogeneously. Similar to SStreaMware's top-down approach, a base station parses a query and distributes it down to the corresponding sensors in a tree-based manner; the collected data is then retrieved in the reverse direction of query propagation.
However, unlike SIROM3, which keeps the data in permanent storage, these approaches only perform continuous real-time queries. Although they maintain some metadata for system integration or query optimization, that information is not as comprehensive or well-organized as the metafiles in HSFO. Moreover, the SQL-based or graph-based queries of these solutions bring certain overhead because of database operations. The metafiles in HSFO are in a textual representation and are queried through common interfaces implemented in multiple programming languages, which is faster and more adaptable.
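To illustrate why textual metafiles avoid database overhead, consider the minimal sketch below. The `key: value` format and every field name in it are invented for illustration (the actual HSFO metafile schema is defined in Chapter 4); the point is that querying such metadata needs only a few lines of plain text parsing in any language, with no database engine.

```python
# Hypothetical textual metafile for one sensor stream.
metafile_text = """\
stream: gpr_array_0
sensor_type: GPR
sample_rate_hz: 1000
session: 2013-10-02_brockton
"""

def parse_metafile(text):
    """Parse simple 'key: value' lines into a dictionary."""
    meta = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    return meta

meta = parse_metafile(metafile_text)
print(meta["sensor_type"])  # GPR
```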
Chapter 3
VOTERS System Overview
The Versatile Onboard Traffic-Embedded Roaming Sensors (VOTERS) project collects roadway and bridge deck condition information (both surface and subsurface) periodically using a Roaming Multi-Modal Multi-Sensor (RMMMS) system mounted on a vehicle roaming at traffic speeds [7, 37]. The goal is to achieve continuous network-wide infrastructure health monitoring using multiple RMMMS units. This chapter first overviews the real implementation of the VOTERS van system with its sensors and reveals the potential challenges and high complexity of designing such an RMMMS system. It then highlights the varieties and volume of data collected. Finally, the unified Scalable Intelligent ROaming Multi-Modal Multi-Sensor (SIROM3) framework is proposed as the VOTERS software solution to the system integration challenges.
3.1 VOTERS Van Implementation
VOTERS complements periodic localized inspections of roadways and bridge decks with continuous monitoring. It uses a set of homogeneous or heterogeneous sensors (acoustic, electromagnetic, and optical) mounted on a fleet of vehicles and collects inspection data while roaming through daily traffic. An important benefit of using RMMMS systems is the time-lapse survey sets allowing the analysis of time-varying behaviors of roadway infrastructures, thereby providing experimental results to validate and improve existing life-cycle models [14, 35, 36].
Figure 3.1 shows the VOTERS vehicle containing over 30 sensor units in 10 different
domains including: laser height sensors, acoustic microphone arrays, dynamic tire pressure sen-
sors, differential GPS systems, mm-wave radar, GPR arrays, inertial measurement unit systems,
DMI sensor systems, HD camera systems and GPS timing board systems. These data-intensive
sensors are grouped onto 5 Single Board Computers (SBCs) that distribute the computational and
storage resources (Table 3.1). Each SBC has a 2.6 GHz Intel quad-core processor, 4 GB of
memory, an array of solid-state drives, and PC-104 interfaces for hardware extensions,
such as DAQ systems or the GPS timing board. A local Gigabit network interconnects these SBCs for
intra-component communication, collaboration and coordination. To assist on-the-fly data vi-
sualization, the van includes a portable real-time monitoring tablet and an in-vehicle system control
(see Figure 3.1).
Figure 3.1: VOTERS Van with Sensors
However, a number of challenges arise in designing, realizing and implementing
the VOTERS system. Due to the versatility and heterogeneity of sensors and data types, as well as
the sheer number of sensor systems, it is intricate to manage, integrate and operate the system
components uniformly. Meanwhile, these sensor systems are distributed across multiple computing
units to satisfy their high computational demands (e.g. radar sensors, HD video camera). This distribution
brings challenges in intra-component communication, coordination and collaboration, which
carry strict real-time and time-synchronization constraints for data correlation across distributed
elements. The integration complexity increases when adding an arbitrary number of new sensor
systems or deploying multiple VOTERS vans; therefore a scalable and expandable framework is
critical for system integration. In addition, since the data-intensive sensors produce large volumes
of streaming data in real time, it is essential to effectively store, aggregate and process this big data.
Hence, a flexible and adaptable framework covering both software and hardware architecture is needed to
address the above challenges and the overall design complexity in a system-level design approach.
The next section gives the detailed specifications of the VOTERS system; a unified software solution
follows afterwards.
3.2 Data Volumes and Types
The VOTERS van collects data at traffic speed by an array of homogeneous or heteroge-
neous sensor systems from multiple domains (e.g. acoustic, optical and electromagnetic domains).
Table 3.1 indicates the domains and the corresponding recorded data amounts. The multi-modal
sensors require either fast sampling in time or dense sampling in space, and therefore have different
triggering methods and sampling rates.
Table 3.1: Data diversity and volume
Domains               | Max Sensor | Min Trigger Interval | Points / Sensor / Trigger | Size / Point [byte] | Data Rate [GB/h]
Positioning data      | 1          | 0.2 s                | 4                         | 4                   | 0.0003
Acoustic Microphones  | 4          | 25 us                | 1                         | 4                   | 2.1
Dynamic Tire Pressure | 2          | 25 us                | 1                         | 4                   | 1.1
Millimeter-wave radar | 10         | 25 us                | 1                         | 4                   | 5.4
Video Systems         | 1          | 1 m                  | 5018400                   | 1                   | 467.4
GPR systems           | 16         | 0.01 m               | 1024                      | 2                   | 305.2
Total                 |            |                      |                           |                     | 781.1
In Table 3.1, the GPR array and the HD video camera are both distance triggered and compose
most of the big data. Acoustic microphones, dynamic tire pressure sensor systems, and millimeter-
wave radar are time triggered; in the worst-case scenario they sample at an interval of 25 µs,
which leads to comparatively lower data rates. The positioning data stream is the only stream that
carries geo-location information, which is essential to geo-tag the data points from the other domains
and make the whole dataset meaningful. In total, 781 gigabytes are expected per hour when driving
at a maximum velocity of 100 km/h. While the system is roaming through traffic and data
collection is ongoing, the bulk data is stored on the vehicle locally due to the unstable cellular
connection. Once a good network connection is available (e.g. Ethernet), the big data
will be uploaded to central storage for aggregation and fusion. As a result, automatic data aggregation
and centralization is necessary to efficiently store, manage and transfer the big data. Additionally, a
common interface for data management is crucial for processing heterogeneous datasets uniformly
and increasing the overall adaptability and scalability.
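As a sanity check, the per-domain rates in Table 3.1 can be reproduced from the trigger intervals, sample counts and sample sizes. A sketch, assuming the table's GB/h figures are binary GiB/h and a constant 100 km/h speed for the distance-triggered streams:

```python
# Reproduce the per-domain data rates of Table 3.1.
# Assumptions: rates are GiB/h (1024**3 bytes), speed is 100 km/h.

GIB = 1024 ** 3
SPEED_M_PER_H = 100_000  # 100 km/h in metres per hour

def rate_time_triggered(sensors, interval_s, points, size_b):
    """GiB/h of a time-triggered stream sampled every interval_s seconds."""
    triggers_per_hour = 3600 / interval_s
    return sensors * triggers_per_hour * points * size_b / GIB

def rate_distance_triggered(sensors, interval_m, points, size_b):
    """GiB/h of a distance-triggered stream sampled every interval_m metres."""
    triggers_per_hour = SPEED_M_PER_H / interval_m
    return sensors * triggers_per_hour * points * size_b / GIB

print(round(rate_distance_triggered(16, 0.01, 1024, 2), 1))   # GPR: 305.2
print(round(rate_distance_triggered(1, 1, 5_018_400, 1), 1))  # video: 467.4
print(round(rate_time_triggered(10, 25e-6, 1, 4), 1))         # radar: 5.4
```

Both distance-triggered streams match the tabulated values exactly under these assumptions, which supports reading the table's "GB/h" column as GiB/h.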
3.3 The VOTERS Software Solution SIROM3
The Scalable Intelligent ROaming Multi-Modal Multi-Sensor (SIROM3) framework is the software
solution for the VOTERS system and provides the basis for an efficient realization of the real system.
It is a unified framework that simplifies the development, deployment, management and aggregation of
RMMMS systems tailored to civil infrastructure inspection for flexible and time-lapse monitoring.
Figure 3.2 illustrates the overall design of SIROM3 as a multi-tier hierarchical architecture [38, 39].
[Figure: multi-tier hierarchy; the FCM (fleet control, coordination) and the GIS visualization back-end at the top, a fleet of RSSes connected via cellular/LAN, each RSS containing MSAs with sensor drivers, and HSFO, PLEX plugins, stream definitions and the SIROM3-RTE attached to every component.]
Figure 3.2: SIROM3 Multi-Tier Hierarchical Architecture
The SIROM3 framework comprises sensors, MSAs, Roaming Sensor Systems (RSS), a Fleet
Control and Management (FCM) and a visualization back-end. Meanwhile, a Heterogeneous Stream
File-system Overlay (HSFO) and a Plugin Executor (PLEX) environment are tightly integrated into
each component to facilitate big data storage, processing and management. The Run-time Environ-
ment (RTE) is a layered model that defines the essential core services that are common and reusable
among all components in both hardware and software aspects. In this framework, each RSS rep-
resents a VOTERS van equipped with sensors. Similarly, each MSA within an RSS corresponds
to a distributed SBC hosting a group of sensors inside the van. A one-to-many relationship
exists from RSS to MSA, and likewise from FCM to RSS.
The hierarchical architecture enables a control/response mechanism between a higher-level
element and its children. For instance, the FCM centralizes and manipulates the data aggregated
from multiple RSSes to facilitate data fusion and information discovery through temporal/spatial
data correlation, visualized in the GIS visualization server. An array of RSSes can report their lo-
cation or transfer data back to the FCM, upon request or autonomously, via cellular or LAN connection.
Due to the limited cellular connection, only control and configuration messages are transferred
between FCM and RSS to conserve bandwidth while collecting data in the field [7]. Once a
sufficient network environment is available, such as a local Gigabit cable plugged into the sys-
tem, the collected bulk data can be uploaded to the FCM automatically. An RSS contains numerous
MSAs for distributed computational power and sensor heterogeneity. Within an MSA, mul-
tiple homogeneous and heterogeneous sensors are attached as the basic units of the RMMMS system.
Connected in a local Gigabit network, each MSA is aware of the existence of the others, and
can therefore communicate, coordinate and collaborate to function cooperatively.
Benefiting from the hierarchical design, data aggregation is easily achieved by propa-
gating the data streams from lower levels to higher levels. Due to the large volume of data
collected by data-intensive sensors (such as GPR and HD video), the HSFO is attached to each
MSA to provide reliable storage and efficient retrieval of the high-volume streaming data. A
centralized HSFO is also instantiated on the FCM to aggregate data from the distributed sources. A
PLEX environment is implemented to further simplify algorithm integration and eliminate manual
intervention during data processing and fusion with a fully automated approach.
The SIROM3 framework decomposes the high complexity into a hierarchical and modular-
ized design. The RTE encapsulates core services such as hardware requirements, operating systems,
and synchronization and communication middleware that can be reused by any system component.
Scalability is guaranteed in the vertical direction: computing power, storage and data aggregation
scale from the minimal unit (a sensor) up to the server cloud (FCM). Expandability is ensured in
the horizontal direction: at the level of sensors, MSA, RSS and FCM, an arbitrary number of
elements can be integrated seamlessly.
Figure 3.3 shows an overview of the SIROM3 architecture implementation. A hierarchical
system composition is illustrated on the left side, with an array of sensor systems contained in MSAs
mounted onto an RSS, matching the SIROM3 framework shown in Figure 3.2. A
fleet of RSSes is directed by the FCM. To conserve bandwidth, only control/management messages
[Figure: implementation overview; MSAs and RSSes on the left connect over LAN and cellular to the FCM cloud (storage, PLEX with rule-based scheduler and plugins, control & management); Data Intake and Models plugins feed the GIS server (layers, database, Web Adaptor) for visualization and analysis; HSFO sits between the streaming application/PLEX and the native Linux, QNX and Windows file systems.]
Figure 3.3: SIROM3 Implementation Architecture
are transferred between RSS and FCM through a cellular connection during a survey (see Figure 3.3).
After an RSS completes its survey, it returns to its base where a fast network is available (e.g. Gigabit
Ethernet or an 802.11n network). Upon detection of the home location and network, bulk
data upload to the FCM and data aggregation are triggered automatically.
The Fleet Control and Management (FCM) contains centralized data storage, a PLEX
and Control & Management cloud services. An external GIS server is seamlessly integrated into
the FCM for visualization and analysis of the fused results, leveraging the power and flexibility
of the PLEX module. The PLEX is informed when the upload of survey data to the centralized
storage has completed, which triggers it to start processing the available data managed in
HSFO through the contained algorithmic and systemic plugins.
Plugins operate on the uploaded data and import the results into the GIS server (see Figure 3.3) for
visualization. A configurable rule set in PLEX directs the running sequence of all plugins
and plays a key role in automating the whole processing procedure. The Data Intake plugin is
responsible for temporal correlation and fusion of both raw and refined data, attaching
meaningful geo-tags to each data point. Finally, the raw, refined and fused data are
transferred into the GIS server and displayed on different layers. The GIS server, based on ArcGIS,
was developed to enable large-scale geo-referencing and knowledge discovery. It
enables visualizing and comparing citywide pavement conditions. A Web Adaptor enables access
through a thin client application, making the data easily accessible via the Internet [34].
One of the benefits brought by the SIROM3 framework is the automation from data collec-
tion, storage and transmission to processing, which has not been realized in current inspection methods.
The localization services in VOTERS offer geo-fencing for automated start or stop of a survey, and
the survey is conducted from a roaming vehicle cruising in traffic [7]. This work presents an
efficient and automated approach to bulk sensor data handling. We propose an HSFO to facili-
tate the storage, management, transmission and aggregation of heterogeneous bulk sensor data, together
with a PLEX environment for data processing and fusion. The next chapters elaborate the detailed
implementation of HSFO and PLEX.
Chapter 4
Heterogeneous Stream File system
Overlay (HSFO)
The distributed sensor systems in an RSS and the data-intensive nature of some sensors in the
MSAs produce large streams of data (i.e. hundreds of gigabytes per hour, as shown in Table 3.1)
continuously during a data collection. Such high-volume data streams may flood the entire system
during data collection. The issue is particularly severe when multiple such components exist and
operate simultaneously.
The gathered data needs to be stored and processed automatically through a local storage
system attached to the MSAs (see Figure 3.2), and aggregated to the FCM from all MSAs in an
RSS. Therefore, the ability to handle data storage across multiple components (sensor, MSA, RSS
and FCM) at multiple levels of the hierarchy is of particular importance. Moreover, the hetero-
geneity and versatility of the data pose the challenge of cross-referencing big data with distinctive
attributes and characteristics (e.g. correlating camera images with acoustic signals).
In this chapter, a Heterogeneous Stream File system Overlay (HSFO) is proposed and
implemented to manage sensor data and address the above challenges with a metafile approach. The
metafile abstracts information about survey dynamics (e.g. sensor types, software/hardware configura-
tion during data collection) and stores the metadata together with the raw data in the file system. Since
the metafiles are in textual form, they incur lower overhead than a database solution during
data storage and processing. In addition, the HSFO is designed as a platform-independent layer
which can be used in any component of the system hierarchy, adapting to the scalability and
expandability of SIROM3. Thus, it can be executed on a wide range of platforms (from sensors,
embedded computers to data centers). Instead of being an afterthought to data acquisition,
the HSFO is tightly integrated into the SIROM3 framework and offers an efficient, fast and reliable
solution for handling heterogeneous sensor data. It also enables uniform and
automatic access, operation, processing and fusion of the data.
4.1 Fusion Foundations
The goal of data processing is to convert the collected raw datasets into meaningful
knowledge on which to base life-cycle management of infrastructure health and extend infrastructure
life-spans. This is challenging since sensors within an RSS are spatially distributed, and may have differ-
ent triggering requirements (e.g. time triggered or distance triggered) as well as varying sampling
intervals/spacings (see Table 3.1). As the multi-modal data is geographically dispersed during data
collection, the data management should allow each sensor sample to be accurately geo-located
to allow for spatial and temporal comparison and joint decisions.
The ability to determine an accurate position for each acquired data sample is of utmost
importance for data fusion and visualization. To achieve this, we rely on two facts: (a) the position of
each sensor relative to the whole system is known, and (b) tight time synchronization is in place. Each
sample is then timestamped when collected, either at the sensor or at the embedded system upon arrival.
An additional stream records the location of the mobile agent (e.g. the vehicle) during the survey. Each
data sample's location can then be computed from its timestamp, the system location and the sensor offset.
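The per-sample geo-location step can be sketched as follows, with linear interpolation of the positioning stream at the sample's timestamp. Applying the sensor offset in a fixed frame is a simplification (the real system would rotate the offset by the vehicle heading), and all names here are illustrative:

```python
from bisect import bisect_left

def interpolate_position(pos_stream, t):
    """Linearly interpolate the vehicle reference position at time t.
    pos_stream: list of (timestamp, x, y) in a local metric frame, sorted by time."""
    times = [p[0] for p in pos_stream]
    i = bisect_left(times, t)
    if i == 0:
        return pos_stream[0][1:]          # before first fix: clamp
    if i == len(pos_stream):
        return pos_stream[-1][1:]         # after last fix: clamp
    (t0, x0, y0), (t1, x1, y1) = pos_stream[i - 1], pos_stream[i]
    a = (t - t0) / (t1 - t0)
    return (x0 + a * (x1 - x0), y0 + a * (y1 - y0))

def sample_location(pos_stream, t, sensor_offset):
    """Geo-locate one sample: vehicle position at its timestamp plus the
    sensor's known (dx, dy) offset from the vehicle reference point."""
    x, y = interpolate_position(pos_stream, t)
    return (x + sensor_offset[0], y + sensor_offset[1])

pos = [(0.0, 0.0, 0.0), (1.0, 10.0, 0.0)]      # two position fixes, 10 m apart
print(sample_location(pos, 0.5, (1.0, 0.0)))   # (6.0, 0.0)
```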
Temporal correlation across data streams (and consequently spatial correlation) is achieved
by timestamping each sample with microsecond-accurate time. This poses strict require-
ments on availability (across platforms), robustness and reliability (time synchronization budgets
of differing rigidity). Spatial correlation is achieved globally by localization services (GPS)
and the local geometric relationship of the sensors to an absolute reference point on the vehicle (RSS).
Synchronizing streams with the positioning data requires tight time synchronization from the RSS
to all MSAs (within the timing budget). The granularity at which each sample is timestamped depends
on the sampling rate: streams with varying sample periods must stamp each sample, while streams
with a constant sample period can be timestamped at a coarser granularity.
In our implementation (see Chapter 3), we distributed a set of sensors on a vehicle
to collect data from multiple domains (acoustic, optical and electromagnetic). We fuse a
decimeter-accuracy GPS, a Distance Measurement Instrument (DMI) and an Inertial Measurement
Unit (IMU) to obtain a sufficiently accurate position. All the MSAs within an RSS demand strict
timing requirements: the jitter in timestamping data must not exceed 359 µs [7]. This
timing budget is calculated from the maximum vehicle speed and the desired spatial correlation
between two sensors. To achieve time synchronization across the distributed systems, we primarily
use the Precision Time Protocol (PTP) [22], with the Network Time Protocol (NTP) [31] as a backup.
Tests of the software-based time synchronization show a maximum jitter of 12 µs
with a standard deviation of 2.0375 µs, small enough to meet our triggering and timestamping
requirements. Conversely, only very loose requirements exist for global time synchronization,
which is only used to correlate the date and time of recordings between vehicles.
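The timing budget itself follows from dividing the desired spatial correlation by the maximum speed. A sketch (the ~1 cm target used below is back-calculated from the published 359 µs budget at 100 km/h, not stated explicitly in the source):

```python
def jitter_budget_us(max_speed_kmh, spatial_accuracy_m):
    """Maximum timestamp jitter (in µs) such that a sample stamped with that
    jitter is mislocated by at most spatial_accuracy_m at max_speed_kmh."""
    v = max_speed_kmh / 3.6              # vehicle speed in m/s
    return spatial_accuracy_m / v * 1e6  # seconds -> microseconds

# Assuming a ~1 cm desired spatial correlation at 100 km/h:
print(round(jitter_budget_us(100, 0.01)))  # 360, consistent with the 359 µs budget
```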
4.2 HSFO Overview
To make the raw data meaningful, the file system overlay must maintain
adequate information describing the data. For example, the timestamp of each sample needs to be
logged for temporal and spatial correlation, and the software configuration of a sensor (e.g. resolution)
is required for making decisions and improving the accuracy of results. Moreover, the overlay
should have high scalability and adaptability so that it can be instantiated across multiple components
at multiple levels in SIROM3 to store, organize, aggregate and transmit the data. The hierarchical
design of HSFO realizes the high scalability, while the metafiles at each layer describe the details
of the data sources.
As depicted in Figure 4.1, the HSFO is designed as a glue layer connecting the streaming
applications that produce or consume large volumes of streaming data with the native file systems of
Linux, QNX or Windows (e.g. EXT4, QNX4, NTFS and FAT). The overlay design
separates the meta information associated with the data streams from the raw binary data. It preserves
the rich information exhibited by the data, such as the geo-location of a survey or the sampling frequency
of a sensor, in the metafile hierarchy provided by the HSFO, while leveraging the underlying file
system for robust file storage. This design also guarantees platform independence, as the HSFO
sits above the native file system of any OS.
The metafile is designed to organize the heterogeneity exposed by the data collection dy-
namics and the design complexity. For instance, different geo-points of interest (e.g. an important
geographical area) may influence several data characteristics, as the sampling frequency can be ad-
justed. The HSFO exploits the hierarchical nature of the containment relationship between data
objects, aiming to decompose the complexity into manageable granularities. This section focuses
[Figure: the HSFO layer between the streaming application/PLEX and the native file systems (Ext4, QNX4, ..., NTFS); a metadata tree of root, surveys, sessions, streams and files with survey.meta, session.meta, stream.meta and bulk.meta attached, referencing the raw binary files.]
Figure 4.1: HSFO Overview
on the standard definition of each layer in HSFO and explains the relationships between layers. The
content and format of the metafiles will be described in the next section.
The hierarchy is shown in Figure 4.1 as a tree structure. We define multiple hierarchical
elements, namely surveys, sessions, and streams, to accommodate the dynamic data character-
istics. The root node is the top level from which the other elements originate. It may have multiple
surveys, which contain multiple sessions, which in turn contain streams. Streams may include
multiple files that point to the raw binary files on the file system. Note that each layer has
a corresponding metafile attached, which serves as a manifest that keeps track of
all elements at the current layer and stores information describing them. For instance, stream.meta
keeps a list of files, their names, timestamps and other meta information.
As the basic element in the hierarchy, a file is the binary/raw data in the host file system. It
could be a single image or a sequence of acoustic signals. The file here acts as a class that combines
the file from the native file system with the attributes describing it. Stream.meta aggregates and
records this information, such as file names, timestamps and the total number of samples in a file,
preserving the details needed to access each file and geo-tag each data point in it (details are
explained in Section 4.3).
Global data processing and correlation require a common definition for different groups
of files. Capturing the streaming essence of the data, a stream is defined to distinguish between
the different formats and semantic meanings of the diverse big data. A set of data that has the same
format and semantics can be recorded as a stream. This layer handles the heterogeneity and versatility
of the data by structuring and grouping it accordingly. It also serves as the basic processing unit
and the input for the plugin system to perform data correlation and fusion in the next chapter. Each
stream instance is associated with a stream type and an originating sensor type to identify the stream.
This enables the streaming applications to determine how to handle or display the streams. The stream
layer also distinguishes between raw, refined and real-time streams, recorded as the stream
qualifier. The definition of the stream type and other detailed stream metadata will
be described later. The stream metadata is also recorded in stream.meta.
Above the stream layer, which focuses on managing data heterogeneity during a contin-
uous data collection, a session is defined to manage a set of streams. The session.meta located
in the stream layer stores the session metadata and keeps track of all available streams in the current
session. A session is defined as a consecutive (i.e. uninterrupted) measurement of an area
with the same sensor configuration (e.g. camera resolution). It is introduced to facilitate managing
streams with respect to sensor settings. A new session is started each time the system configuration
changes, or after a brief stop of the recording (e.g. when passing over an area of non-interest). If a
single sensor configuration changes during a session, a new session needs to be started. Thus,
streams of the same stream type generated from the same sensor but with different measurement
accuracies are separated into different sessions. This resolves the conflict of a session
containing two identical streams and eliminates ambiguity during data processing. The session layer
also potentially separates the heterogeneous data according to different inspection needs and
geofencing interests.
Composed of a group of sessions, a survey splits the big data in a more coarse-grained
way. The survey.meta stored in the session layer maintains the survey metadata along with the
lists of participating sensors and sessions. While sessions cover inspections of roadway infras-
tructure within a relatively small area, a new survey starts when the major geographic location or
the date changes. With repetitive surveys, the time-varying behaviors of roadway infrastructures can
be investigated and accurately pieced together to form the complete life-cycle of the infrastructure.
In addition, a new survey also needs to be started when the hardware configuration changes, such
as a sensor being physically removed from the system through failure or a sensor changing location.
In summary, the survey layer is defined as a data collection of a major area by heterogeneous sen-
sor systems with fixed hardware configurations. To suit the mobile nature of the hybrid
system, it separates datasets at a temporal or geographical granularity (e.g. date, location), such that
a survey can be hour-, day- or city-based.
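The session and survey boundary rules above can be condensed into two predicates; a minimal sketch with illustrative configuration representations:

```python
def needs_new_session(prev_cfg, new_cfg, recording_gap):
    """A session ends when any sensor's software configuration changes,
    or when recording was briefly stopped (e.g. an area of non-interest)."""
    return prev_cfg != new_cfg or recording_gap

def needs_new_survey(prev_hw, new_hw, location_changed, date_changed):
    """A survey ends when the hardware configuration changes (a sensor
    removed or relocated) or the major geographic location or date changes."""
    return prev_hw != new_hw or location_changed or date_changed

cfg = {"camera_resolution": "1080p"}          # illustrative sensor settings
print(needs_new_session(cfg, {"camera_resolution": "720p"}, False))  # True
print(needs_new_survey(("gpr", "camera"), ("gpr", "camera"), False, False))  # False
```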
Each RSS can conduct and manage a group of surveys, so the root node represents
an identifier such as the identification number of the vehicle (RSS). The root layer allows for the
coexistence of data from multiple RSSes and of data collected at the same location on different dates.
Therefore, a fleet of RSSes can be deployed to increase the geographical coverage and enable
assessment at the city scale. It also enables the detection of changes in roadway and bridge
deck conditions over time using all available data.
This hierarchical overlay provides a simple and effective way to handle large volumes
of heterogeneous data with respect to flexibility, scalability and expandability. Defined as a part
of the RTE (see Figure 3.2), this big data layer can be instantiated on any
system component (such as MSA, RSS or FCM). Moreover, as SIROM3 inherently supports verti-
cally scalable and horizontally expandable systems, the HSFO adapts to the same degree of
flexibility in its metafile hierarchy.
4.3 Metadata Definition
The HSFO uses a metafile approach to classify data strategically and manage it in an or-
derly fashion. The metafile preserves abundant information, including the timestamps of samples and
domain-specific knowledge such as sensor settings (e.g. sample rates, resolution), for data processing
and fusion. Moreover, as a structure tailored to infrastructure life-cycle management, the metafile also
abstracts information about physical configurations, such as geo-location, participating sensor types
and survey date, into files that support investigating the time-varying behaviors of roadway infrastructures.
Besides preserving valuable sensor calibration data and configurations, the metafile also
acts as a manifest that keeps track of each file and shows the files that
are available for download or processing during a recording session. It maintains the one-to-many
relationship between neighboring layers and eases the maintenance of the file structure during data
transfer, aggregation and processing. For instance, merging two streams into a session only
requires modifying the streamTypeList maintained in session.meta. All the metadata can
also be used for reconstructing the file structure, restoring existing stream files and querying information
about a survey, session, stream, or file when the system restarts. The metafiles are built in memory and
stored onto the file system in the Libconfig [26] format once a session finishes.
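As an illustration of this textual representation, a session.meta entry might look like the following sketch; the field names follow the metadata definition in this chapter, while the concrete values are hypothetical:

```
# Hypothetical session.meta sketch in Libconfig syntax (values are illustrative):
Session = {
    ID = 2;
    streamTypeList = ( "STREAM_VOO_LOC", "STREAM_SURFACE_IMG_RAW",
                       "STREAM_SURFACE_CON" );
};
```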
This section defines the attributes needed for capturing the manifest. Figure 4.2 gives
an impression of the HSFO metafile definition. It shows the main attributes that must be captured
to describe the data and the corresponding metafiles for permanent storage. The structure is
described bottom-up, corresponding to the overview in the previous section.
[Figure: metafile hierarchy and definition; the survey layer captures ID, Date, Location, subsysList and sessionIdList in survey.meta; the session layer captures ID and streamTypeList in session.meta; the stream layer captures Type, Qualifier and fileIdList in stream.meta; each file carries ID and Time.]
Figure 4.2: HSFO Metafile Definition
4.3.1 Streams and Files
As the basic element in the metafile hierarchy, each binary file is bound to a fileInfo
structure. The fileInfo contains fileID, fileName, timeStart, timeEnd and numSamples.
FileName is the name of the file in the file system, while timeStart and timeEnd record
the timestamps of the first and last sample in the file respectively. The field numSamples records
the total number of samples in the file. Each sample in the stream can be recorded with a timestamp,
but for streams with a constant sample period, the data can be timestamped on a coarser
granularity to avoid redundancy: by recording only timeStart and timeEnd of the file, the time-
stamp of each sample can be calculated from numSamples.
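Recovering per-sample timestamps for such a constant-rate file then reduces to linear interpolation between timeStart and timeEnd; a sketch with a 0-based sample index and names mirroring the fileInfo fields:

```python
def sample_timestamp(time_start, time_end, num_samples, i):
    """Timestamp of sample i (0-based) in a constant-rate file whose fileInfo
    records only timeStart, timeEnd and numSamples."""
    if not 0 <= i < num_samples:
        raise IndexError("sample index out of range")
    if num_samples == 1:
        return time_start
    # num_samples samples span (num_samples - 1) equal intervals.
    return time_start + i * (time_end - time_start) / (num_samples - 1)

print(sample_timestamp(0.0, 1.0, 5, 2))  # 0.5
```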
The fileInfo for each file is recorded in stream.meta, which is a centralized place
to aggregate the meta information of all files. When a new file is created and begins recording
its first sample, a fileInfo with fileID, fileName and timeStart is written into
stream.meta. The timeEnd and numSamples are updated when the file closes. A file within
a stream can be of any type, such as a text file filled with data/timestamp tuples, an image or a
period of acoustic signals. For a text file, the timestamp of each sample can easily be recorded to-
gether with the raw data, but for images and binary files, per-sample timestamps are hard to
log. With the timestamps captured in fileInfo and maintained in stream.meta, the timing
of each file can be logged regardless of file type or format.
A stream groups an arbitrary number of files and is associated with a streamInfo. The
streamInfo has the properties streamType, sensorID, streamQualifier and fileIdList. The
field streamType defines the type as well as the meaning of the data. All files of a stream shall have
the same syntax and semantics; if a sensor writes files of different semantics and formats, they
have to be captured in different streams. The streamType is an enumeration value identifying the
stream. It is automatically generated from the stream definition table, which defines the standard
streams and records detailed information about them, such as the stream's short name, file format
and description. This table is recorded in a centralized configuration file so that the whole system is
aware of the different types of streams available and understands their meaning during data fusion.
Table 4.1 shows the items that have to be defined for each stream.
Table 4.1: Stream Definition
Item              | Description                                                                                                | Example
eStreamType       | Enumeration value identifying the stream numerically                                                       | STREAM_VOO_LOC
ShortName         | Short name of the stream, used for directory and file naming                                               | vooLoc
Description       | Textual description of the stream, displayed on the visualization portal                                   | Fused data defines multi-modal multi-sensor system reference position
FileSuffix        | File extension of each file inside the stream; records the extension and thus defines the data type        | txt
IsRealTimeDisplay | Whether the stream is sampled from the original stream and displayed in real time to validate data quality | true or false
The definition of streams has to be recorded globally in the centralized configuration
file of the RSS for unified definition and search. Figure 4.3 shows an example of a stream
definition table in the Libconfig [26] format. Each entry follows the format: <eStreamType>,
<shortName>, <Description>, <FileSuffix>, <IsRealTimeDisplay>. In the table, a raw unpro-
cessed stream surfaceImgRaw (raw uncorrected video images of the road surface) and a refined
stream surfaceImgCrack (processed from the raw images to show detected cracks) are consid-
ered two separate streams although they originate from the same sensor. The real-time
stream surfaceConditionSub is extracted from the raw stream surfaceCondition (millimeter-
wave radar signals) at some sample rate. Since its attribute IsRealTimeDisplay is set to true, this
stream is automatically uploaded to the RSS through an FTP service, then processed and displayed
to validate the quality of the data in real time.
Streams = (
  ( "STREAM_VOO_LOC", "vooLoc",
    "Fused data defines sensing system reference position", "txt", false ),
  ( "STREAM_VOO_DIS", "vooDis",
    "Fused data defines sensing system travelled distance", "txt", false ),
  ( "STREAM_SURFACE_IMG_RAW", "surfaceImgRaw",
    "Raw uncorrected images taken by video camera facing toward the road", "jpeg", false ),
  ( "STREAM_SURFACE_IMG_CRACK", "surfaceImgCrack",
    "Binary images with detected cracks as 0", "png", false ),
  ( "STREAM_SURFACE_CON", "surfaceCondition",
    "Surface condition data of roadway collected by millimeter-wave radar", "bin", false ),
  ( "STREAM_SURFACE_CON_SUB", "surfaceConditionSub",
    "Sample data of surface condition data", "bin", true )
);
Figure 4.3: Example Stream Definition Table
The sensorID in streamInfo identifies the sensor type that originated a stream.
The field streamQualifier records the processing level of the data, such as raw, fused, or refined.
A refined stream is processed from a raw stream for feature extraction; a fused stream
is the joint result produced from multiple streams. For example, the Pavement Condition Index
(PCI) is fused from streams generated by the Microphone Sensor and the Dynamic Tire Pressure Sensor
(DTPS) [33]. This also implies that two streams with the same sensorID but different
streamQualifier values are two separate streams. The fileIdList preserves a list of the fileIDs of all
files available in the current stream.
The streamInfo handles the heterogeneity and versatility of the data by structuring and grouping
it accordingly. This structure is also recorded in stream.meta. Figure 4.7 shows the contents
of stream.meta for two streams with different sampling rates. It contains a stream setting
and a file setting, recording the streamInfo and a group of fileInfo structures respectively. The time stamp
recorded in stream.meta enables temporal correlation of two streams with different characteristics.
Detailed information on how data is correlated through stream.meta will be explained in Section 4.5.
4.3.2 Sessions
A session is defined as the consecutive (i.e., uninterrupted) measurement of an area.
It is introduced to facilitate managing streams with regard to settings (e.g., qualifier). Session
information is also encapsulated into a sessionInfo structure, like streams and files.
The sessionInfo is composed of sessionID, timeStart, timeEnd and streamTypeList. The
timeStart and timeEnd fields record the start and end time of a session. This coarser-grained time
period allows a quick search for a stream or file generated during a certain time: instead of comparing
against the time stamp of each file in turn, a query can first locate the session that contains the
time stamp and then look inside that session and check each file. This dramatically speeds up the
query if each session contains thousands of files. The streamTypeList lists the streamType
of every stream included in the session. The sessionInfo is recorded in session.meta under each
session and is updated when a new stream is created or merged.
session.meta:

session : {
  sessionId = 0;
  timeStart : { sec = 1385209325L; nsec = 702437775L; };
  timeEnd : { sec = 1385227583L; nsec = 847077204L; };
  streamTypeList = [ 5, 0, 1, 14 ];
};

Stream definition table (excerpt):

Streams = ( ( "STREAM_VOO_LOC", "vooLoc", "Vehicle reference position", "txt", false ), ... );

[Figure: a session directory holding session.meta and the streams vooLoc, vooDis, surfaceImgRaw, and the newly created surfaceImgCorr; each stream contains its files and a stream.meta, and its streamType is registered in the streamTypeList]

Figure 4.4: Stream Creation
Figure 4.4 shows an example of the changes made to the HSFO when creating a new stream. The
sessionInfo maintains a streamTypeList recording the streams available in the current session,
where each stream is represented by its streamType. The system also holds the stream definition
table with the detailed definition of each stream; the streamType enumerator corresponds to the
index of the stream in that table. When a new stream surfaceImgCorr is created, it must register
its streamType in the streamTypeList of session.meta, in addition to creating a corresponding
stream.meta to keep track of the newly created files.
4.3.3 Surveys
To suit the mobile nature of the system as well as its focus on infrastructure monitoring,
surveys can be conveniently separated manually at a temporal or geographical granularity
(i.e., date, location). This information is abstracted into a surveyInfo structure like the other layers.
The surveyInfo is composed of surveyID, startDate, state, city, location, surveyName,
subsysList, sessionIdList, timeStart and timeEnd. The surveyInfo comprehensively
describes the physical conditions and system hardware settings of a survey.
survey : {
  surveyId = 0;
  startDate = 20131124L;
  state = "MA";
  city = "Boston";
  location = "Ruggles";
  surveyName = "20131124_MA_Boston_Ruggles_0";
  subsysList = [ 2, 3, 5, 6 ];
  sessionIdList = [ 0, 1 ];
};
Figure 4.5: Example Survey Meta File
Figure 4.5 demonstrates the content of survey.meta; the time stamps are omitted here since they
serve the same function as the timestamps in sessionInfo. The state, city and location fields identify
the geographical area that has been surveyed. The startDate preserves valuable time information
of the survey, which further enables time-lapse surveys of the same location to investigate the life
cycle of infrastructures. The surveyName is composed of the start date, geo-location information
and surveyID, which together identify a survey. The detailed naming convention will be discussed in
the next section. A new survey is started when a major geographic location or the survey date
changes, such that a survey can be hour/day-based or city-based. In addition, the surveyInfo
manages a list of participating MSAs. If the hardware configuration changes, such as physically
adding a new MSA to the system while going through an area of interest or changing the location
of a sensor in the RSS, a new survey needs to be started and the subsysList updated correspondingly.
The time-variant surveys provide sufficient experimental data to study the time-varying
behaviors and life cycle of civil infrastructures.
4.4 Big Data Storage Location
Since the sensory components require fast sampling in time or dense sampling in space,
the RSS often incurs a processing load that cannot be accommodated by a single processing
element and thus requires a distributed system solution. The distributed MSAs are responsible for
recording their own data as files of a stream. The integrated system has a centralized configuration
file called voo.config, also written in Libconfig format [26], which records dynamic system
configurations (e.g., system IPs, centralized storage and runtime distribution information). The base
directory of data storage is defined in voo.config under the key name bulkDir; the default is
/var/voters, which is also used in the examples below.
The stream files will be located in a directory that follows the naming convention below:
/var/voters/<VooNr>/<Survey>/<SessionNr>/<Stream>_<StreamQual>. The field
names in the directory path are defined in Table 4.2.
Table 4.2: Naming Convention for Big Data Directories

<VooNr> — Numeric identifier of the heterogeneous system in three-digit format (NNN).

<Survey> — Number/name of the survey in the format <Date>_<State>_<City>_<Loc>_<SurveyNr>. As an example, a recording from 11/24/2013 starting in Boston, MA at Northeastern would get the survey name 20131124_Boston_NEU_0.
  Date — Date in format YYYYMMDD, e.g. 20131124 for Nov. 24, 2013.
  State — Two-character state, e.g. MA.
  City — City or area name, e.g. Boston.
  Location — Special description if needed, e.g. NEU.
  SurveyNr — Sequence number of the survey since the start of the day, in format N or NN.

<SessionNr> — Session inside the survey, in format N or NN. A new session is started on:
  • a system configuration change (note: changing the configuration of a single sensor changes the system configuration, thus a new session is started);
  • a brief stop of recording;
  • transitioning into a different geofencing area, thus changing the configuration.

<Stream> — Short stream name, see Table 4.1, e.g. Dis, Loc.

<StreamQual> — Variation on the same stream with a different configuration. For raw data, the stream qualifier is empty.
The above naming convention for directories describes the storage location of each stream.
Each stream file shall only contain data recorded with the same configuration; if the sample rate
changes, a new session has to be started. Within a stream, a file can be identified via:
<BaseDir><FileBaseName><FileEnumerator>.<FileExtension>
<BaseDir> is defined as in the table above. <FileBaseName> is equal to the stream
short name (shortName) defined in Table 4.1. <FileEnumerator> is the sequence number of the file.
<FileExtension> is equal to the FileSuffix defined in the stream definition.
Figure 4.6 shows an example of the big data storage location in the file system. The root directory
of the big data is /var/voters, which is defined by the bulkDir setting in the centralized dynamic
configuration file. During civil infrastructure inspection, a fleet of RSSes may be deployed to
increase the geographical coverage area, so the next level is the identifier number of each vehicle. For
a vehicle whose VooNr is 008, its data is recorded in /var/voters/008. Each vehicle can
take an arbitrary number of surveys. Suppose we take several surveys around Northeastern
University in Boston, MA on Nov. 24, 2013; the first survey is then called 20131124_Boston_NEU_0.
Each session under the survey is indexed as an integer number. A GPS system, a DMI sensor
system and an HD camera system participate in the first session of the first survey, producing raw
streams called Dis, Loc and surfaceImg respectively. The location of the distance stream is defined
as /var/voters/008/20131124_Boston_NEU_0/0/Dis, and the absolute path of a stream file can
be /var/voters/008/20131124_Boston_NEU_0/0/Dis/Dis0.txt.
/var/voters
├── 008
│   ├── 20131124_Boston_NEU_0
│   │   ├── 0
│   │   │   ├── Dis (Dis0.txt, Dis1.txt, ..., stream.meta)
│   │   │   ├── Loc
│   │   │   ├── surfaceImg
│   │   │   ├── ...
│   │   │   └── session.meta
│   │   ├── 1
│   │   ├── 2
│   │   └── survey.meta
│   ├── 20131124_Boston_NEU_1
│   └── ...
├── 016
├── ...
└── bulk.meta

Figure 4.6: Big Data Storage Location
Note that the path for each layer and stream file is automatically generated by combining the
naming convention rules, the information in the centralized configuration file, and the stream definition
table. With these well-defined structures and file paths, HSFO can keep track of each file and randomly
access any file. Path information such as survey name, session index and filename
is also recorded in the corresponding metafiles for reconstructing the file paths and restoring the
existing stream files on disk when the system restarts.
4.5 Metadata Example for Data Correlation
stream : {
sensorId = Video_camera;
streamType = surfaceImgRaw;
streamQualifier = "";
fileIdList = [ 0, 1, 2, 3, 4, 5, 6, 7… ]; };
file = ( {
fileId = 0;
fileName = "surfaceImgRaw0.tiff";
timeStart : {
sec = 1374249022L;
nsec = 789775000L; };
timeEnd : {
sec = 1374249022L;
nsec = 856646000L; };
numSamples = 1L; }, ...);
stream : {
sensorId = Microphone;
streamType = complexDaqMic;
streamQualifier = "";
fileIdList = [ 0, 1, 2... ]; };
file = ( {
fileId = 0;
fileName = "complexDaqMic0.txt";
timeStart : {
sec = 1374249013L;
nsec = 6042665L; };
timeEnd : {
sec = 1374249043L;
nsec = 762286273L; };
numSamples = 1500000L; }, … );
Figure 4.7: Metafile Example for Data Correlation
Figure 4.7 shows an example of stream metafiles for a video image stream and a microphone
acoustic stream. Metafiles are written in the Libconfig format [26], which is more compact
and human-readable than XML; its textual presentation also keeps the overhead low. The stream
metafiles maintain the stream attributes together with all binary files contained in the
stream. The sensorId and streamType in the stream attributes indicate the sensor type that originated
the stream and the current stream name. For example, the two streams surfaceImgRaw and
complexDaqMic represent pavement surface images captured by the video camera and microphone
acoustic data respectively. The empty streamQualifier shows that both streams are raw, unrefined
data (empty is the default for raw streams). The fileIdList lists the IDs of all binary data files
included in each stream. Since each captured picture is a file in the stream, the video stream has an
array of files; similarly, the acoustic stream records a list of 30 s acoustic signal files. Each file has
properties such as fileId and fileName. The two settings timeStart and timeEnd record the
timestamps of data acquisition and are the most critical for data correlation, since they are the common
attribute between different data sets. The numSamples field gives the total number of samples contained
in a file. Since the video camera has a variable sample rate, it stores a single sample per file. The
microphone, however, stores many samples in the same file due to its constant sample rate. Thus, the microphone
stream is time-stamped at a coarser granularity, and the accurate time stamp of each sample is
calculated from timeStart, timeEnd and numSamples. To associate each image with an acoustic
clip, for example, the timestamp of surfaceImgRaw0.tiff should be correlated with the time period of
complexDaqMic0.txt. Since each video image is taken in less than 1 s whereas an acoustic file
lasts 30 s, several images may be correlated to one microphone acoustic file. At a finer granularity,
the timestamp of each sample can be calculated, as each microphone file contains
1,500,000 samples. Therefore, surfaceImgRaw0.tiff can be correlated with multiple samples in
complexDaqMic0.txt.
4.6 Bulk Data Handling Library
The stream applications and plugins need to obtain meta information about surveys and
sessions as well as retrieve stream files from HSFO. However, the diversity of processing tools
makes it challenging to access and process the different streams in HSFO uniformly. Instead of
implementing the access functions again and again in each plugin, a common implementation is desirable
to facilitate interaction with HSFO and simplify access to it for different plugins. The Bulk
Data Handling (BDH) library abstracts and encapsulates the common functionality needed by stream
applications and plugins to ease their operations on HSFO.
The APIs exposed by the BDH library allow plugins to interact with HSFO
seamlessly. By providing common interfaces to operate on HSFO, the bulk data handling library
isolates the heterogeneity of stream processing elements and allows them to process heterogeneous
data uniformly. Plugins can not only retrieve and process raw data from HSFO, but also
generate refined results and commit them to HSFO for permanent storage. The BDH library mainly
supports metadata queries, stream/file retrieval and stream/file creation. It provides facilities to obtain
file names (including their absolute paths) and to read and write meta information. With the BDH
library, a plugin only needs to know how to read a file, not where the file is located.
These interfaces also enable easy integration of domain-specific algorithms for
processing and fusion, since duplicated implementation effort across plugins is avoided.
The BDH library therefore increases overall flexibility and maintainability. In addition, as a part
of HSFO, the BDH library can be used on any system component from MSA to FCM
to facilitate data processing for plugins. Table 4.3 shows a list of commonly used APIs with their
input parameters and functionality during data processing. More functionality can be built on
the existing interfaces. The next chapter describes in more detail how
plugins interact with HSFO and process streams using the BDH library.
Table 4.3: Bulk Data Handling Library APIs

Name | Input Parameters | Functionality
init() | none | Restore all files and metafiles on disk
surveyListGet() | none | Get the total number of surveys and a surveyID list of available surveys
surveyInfoGet() | surveyID | Query survey information; returns the result as a surveyInfo structure
sessionInfoGet() | surveyID, sessionID | Query session information; returns the result as a sessionInfo structure
streamInfoGet() | surveyID, sessionID, streamType | Query stream information; returns the result as a streamInfo structure
streamCreate() | surveyID, sessionID, streamType, sensorID | Create a new stream with its stream information; update session.meta with the new streamType
streamClose() | surveyID, sessionID, streamType | Close the newly created stream; dump stream.meta to disk
fileInfoGet() | surveyID, sessionID, streamType, fileID | Query file information; returns the result as a fileInfo structure
fileNameAbsGet() | surveyID, sessionID, streamType, fileID | Get the absolute path of a file
fileCreateMeta() | surveyID, sessionID, streamType | Create meta information for a new file under a stream; returns the newly generated fileID
fileCommit() | surveyID, sessionID, streamType, fileID, fileInfo | Set meta information for the file and commit the changes back to the data structures, e.g. appending fileInfo to stream.meta
As explained in the table, a plugin can randomly access a file by querying its absolute
file path: by calling the API fileNameAbsGet() with the corresponding parameters
surveyID, sessionID, streamType and fileID, the plugin can fetch the file without knowing
its real location. APIs such as fileInfoGet() and fileCommit() help read or write file
meta information. The BDH library can be used directly from any C++ plugin implementation. It is
also exposed through MATLAB and Python to provide high adaptability to the Plugin Executor (PLEX)
and to enable a wide range of plugin development environments. Figure 4.8 shows the abstraction of
the BDH library, which is designed to simplify access to HSFO for plugins. By calling a particular
interface, the meta information or absolute file path is returned. A MATLAB implementation
needs MEX files that wrap the library access; similarly, a Python implementation utilizes
SWIG as an integration tool to generate glue code and call into the C++ library. Different plugins run
on top of these bindings to fetch information or files from HSFO for processing through the BDH library.
[Figure: plugins call into LibBDH either directly from C++, through MATLAB via MEX wrappers, or through Python via SWIG-generated glue code; LibBDH in turn accesses the files and metadata in HSFO]
Figure 4.8: Abstraction of Bulk Data Handling Library
4.7 Big Data Transfer and Aggregation
With its well-defined hierarchical structure, the HSFO has high scalability and adaptability
and can be instantiated on a wide range of platforms. Thus, an HSFO can be attached to each distributed
MSA in an RSS to provide reliable and efficient local storage for large volumes of streaming data.
A centralized HSFO can also be created on the FCM to aggregate data from the distributed sources and
apply data fusion algorithms to the diverse big data. Figure 4.9 illustrates the bulk data transfer and
aggregation from distributed MSAs to the FCM in SIROM3.
While the vehicle is navigating through traffic and collecting data during a survey, only
a cellular connection is available, so the bulk sensor data is stored locally at each MSA to conserve
bandwidth. As a result, each MSA maintains several streams generated from different sensors. A
centralized metafile session.meta is created on the on-board controller of the RSS for the current session
to maintain the list of available streams; each newly created stream registers its streamType
in the streamTypeList of session.meta. After the RSS completes the survey, the vehicle returns
to its base where a fast network is available, and the bulk data upload to the FCM and the data
aggregation are triggered automatically.
The session.meta located on the RSS is uploaded to the FCM through the FTP service first. The
directory for the file is automatically created on the FCM during upload to ensure an identical HSFO
structure. Since the streamTypeList in session.meta keeps track of all streams in the current session,
the on-board controller of the RSS informs the corresponding MSAs in turn to upload their streams to
the data center on the FCM. Finally, the FCM groups all streams as well as the session metafile into
the corresponding session. In Figure 4.9, the streams Dis, Loc and surfaceImg generated by distributed
MSAs are transferred and aggregated into survey0/session0 with session.meta on the FCM.
[Figure: MSA 1 stores streams Dis and Loc and MSA 2 stores surfaceImg, each in a local HSFO under survey 0 / session 0; the RSS holds session.meta with streamTypeList = [ 5, 0, 1 ]; after the survey, the streams and session.meta are uploaded and aggregated into the HSFO of the FCM data center under survey 0 / session 0]
Figure 4.9: Data Transfer and Aggregation
The HSFO provides an efficient and reliable solution to store, manage and aggregate
heterogeneous bulk sensor data with flexibility, scalability and expandability. Its metafiles
capture rich information about the data collection dynamics (e.g., sensor settings, timestamps)
to describe the raw data with minimal processing overhead, which is critical for data
management and processing. The BDH library eases interaction with the HSFO so that different
kinds of plugins can access and process the data. The next step is flexible data processing on the
HSFO using the Plugin Executor (PLEX) environment. PLEX simplifies the integration of new
algorithms and eliminates human interaction during processing through an automated approach.
Chapter 5
Plugin Executor (PLEX)
The key to big data handling is flexible data processing and fusion to convert
the raw data into meaningful knowledge. This corresponds to the multi-layer multi-modal fusion stage in
Figure 2.1 and works directly on the sensor data storage, HSFO. Facilities are needed to fuse heterogeneous
data streams, correlate them in time and space, and allow for visualization. Furthermore,
flexibility in processing and in chaining processing steps has to be an integral part of the solution.
The current data processing procedure involves substantial human effort, which causes
inevitable delays and consequently lowers productivity. Tools are required to simplify algorithm
integration and to make use of bulk data handling to avoid duplicated work in retrieving files. A
standard definition of plugins is critical to provide a unified way to execute them, which
further automates running plugins in a particular sequence during processing.
The PLEX environment is designed to address these issues. We introduce a modularized
design and break the whole processing procedure into small, manageable plugins. The plugins
integrate various domain-specific algorithms into PLEX and leverage the bulk data handling library
to interact with HSFO uniformly. The running sequence of plugins is predefined by a configurable
scheduler, eliminating human interaction during data processing. This chapter first gives an overview
of the PLEX environment, then elaborates the standard definition and detailed usage of plugins. Lastly,
the plugin scheduler that automates and accelerates the data processing procedure is illustrated.
5.1 Plugin Executor (PLEX) Environment Overview
As part of SIROM3 framework, PLEX is designed with high scalability and adaptability,
which can be run on different levels or different components (e.g. from sensors, MSA to FCM).
CHAPTER 5. PLUGIN EXECUTOR (PLEX)
Since it is platform-agnostic, data processing and fusion can happen on a sensor in real time or on
the centralized storage after aggregation. Figure 5.1 gives an overview of the PLEX environment.
[Figure: the PLEX environment consists of PLEX with a rule-based scheduler (driven by plugin rules), user inputs in the form of Stream Definitions and Plugin Rules, and plugins 1..N that operate on the HSFO, which overlays file systems such as Ext4, QNX4 and NTFS]
Figure 5.1: PLEX Environment
To enable flexible processing, we introduce plugins. A plugin uses streams as inputs and
produces new output streams. Examples include the calculation of crack density from the video
stream, or calculating the Pavement Condition Index (PCI) from fused data on pavement surfaces.
A plugin executor (PLEX) is provided to manage and execute the plugins. Figure 5.1
shows the PLEX environment, containing PLEX itself, two user-defined inputs (Stream Definitions
and Plugin Rules), and an array of systemic and algorithmic plugins with a rule-based scheduler.
Systemic plugins perform system operations such as data retrieval, transfer or tagging, whereas
algorithmic plugins are algorithm realizations operating on streams to produce new results or refine
existing ones (e.g., feature extraction). The rule-based scheduler directs the execution sequence of the plugins.
5.2 Plugins
The plugins are designed to process data from one stream type (e.g., unprocessed data) into
another stream type (e.g., feature-extracted data). Unlike SQL queries that operate
selectively on a partial stream and return data for immediate display [18], the plugins here work
on full data streams for permanent storage and commit their updates to the metafiles.
5.2.1 Plugin Definition
The plugins process data at the granularity of streams. The advantage of operating on
complete streams, as opposed to individual files, is that calling a plugin becomes independent of
file granularity. Video data, for example, might be represented both as individual images and as a
single container file encompassing multiple frames. One plugin example might be a conversion
of individual images to a video stream. In this case, there is no one-to-one relation between input
and output files. Thus, plugins manipulate complete streams and are responsible for their files. Each
plugin defines the input stream types it can accept and the output stream types it produces. To execute
different plugins uniformly, Table 5.1 gives a standard definition of plugins. Table 5.2 shows the
definition of the crack distortion correction plugin as an example.
Table 5.1: Plugin Definition

Field | Description
inStreamList | List of input stream types. A plugin may operate on one or more input streams.
outStreamList | List of output stream types. Note: a plugin may produce multiple output streams in the same run; each produced output stream is listed in this array. The order of output stream types has to match the order of arguments on the command line.
shortName | Plugin short name; corresponds to the name of the directory, so it should be alphanumeric only, with no spaces or special characters.
Name | Plugin name (can have spaces).
Path | Path to the plugin on the file system.
Description | Plugin description (can be a longer description).
Cmd | Command on the file system to run the plugin.
Args | String of constant arguments to call the plugin.
Rules | Processing rules for chaining filters.
Event | List of events that the plugin can trigger.
Table 5.2: Crack Distortion Correction Plugin Example

Field | Value
inStreamList | STREAM_SURFACE_IMG_RAW
outStreamList | STREAM_SURFACE_IMG_CORR
shortName | DistortionCorrection
Name | Distortion Correction for RAW Video Images
Path | /Algo/Plugins
Description | Corrects the raw images for distortion due to angle and zoom. Crops appropriately to remove useless information such as the bumper of the car.
Cmd | MainDistortionCorrection
Args | %SURVEY_ID% %SESSION_ID%
Rules | NA
Event | Can trigger Crack Detection plugin
Table 5.1 summarizes the parameters that need to be captured to characterize the available
plugins. The definition of all plugins is centralized in a Plugin Definitions table, like the Stream
Definitions. It is also written in Libconfig format [26] and comprehensively describes the plugins. In
a plugin definition, both the input and the output can be lists of streams, represented by
enumerators corresponding to the streamType values in the Stream Definitions. The shortName is used to
identify the plugins in the Plugin Rules. When the plugins listed in the Plugin Rules are run, the corresponding
plugin definitions are looked up in the Plugin Definitions by shortName, and the
command lines to execute the plugins are automatically composed from Path, Cmd and Args.
Table 5.2 gives an example definition for the crack distortion correction plugin. The enumerators
identify the stream types defined in the Stream Definitions (Figure 4.3). The plugin
operates on raw uncorrected images as its input stream and creates angle- and zoom-corrected,
cropped images [15]. The plugin is pre-compiled into an executable using the MATLAB compiler. The
Path, Cmd and Args fields result in the following command line call:
/Algo/Plugins/MainDistortionCorrection %SURVEY_ID% %SESSION_ID%
The arguments distinguish System Parameters from User Parameters. Fixed parameters
such as %SURVEY_ID% and %SESSION_ID% are not influenced by users; they identify the
survey and session to be operated on and are replaced by the current values. For example, when a plugin
processes all sessions within a survey in turn, the surveyID and sessionID are retrieved from
the metafiles, and the strings %SURVEY_ID% and %SESSION_ID% are replaced by each combination of
surveyID and sessionID when calling the plugin on the streams under each session. Conversely,
User Parameters, such as a threshold, can be altered by the user at runtime to adapt the desired
outcome. More parameters can be placed in the Args field and passed through the command
line to control the operation of a plugin. Both systemic plugins and algorithmic plugins can be
defined in this way. The Plugin Definitions table maintains the standard definitions of all plugins and
enables running different plugins uniformly.
5.2.2 Interaction with HSFO
Leveraging the bulk data handling library, different plugins can access the HSFO uniformly.
These common utilities avoid duplicated file retrieval implementations in each plugin
and ease the integration of new algorithms. This section elaborates the detailed steps of how plugins
interact with the HSFO using the bulk data handling library and process streams.
To access and operate on streams in HSFO, plugins need to access meta information
from the top of the spanning-tree structure in Figure 4.2 (tracing from survey and session down to
stream). Aided by the bulk data handling library (described in Section 4.6), plugins can interact
with the HSFO uniformly and seamlessly, e.g. retrieving files or querying metadata, via a number
of exposed APIs. For instance, a plugin can retrieve existing streams to operate on through
streamInfoGet(), or create new streams for improved results via streamCreate(). Only the
surveyId, sessionId and streamType of the current stream are needed as input.
[Figure: the plugin obtains the metadata of its input streams (InStreamList) via streamInfoGet() using the stream definition table; in a loop over each file it fetches the input path with fileNameAbsGet(), runs the algorithm, creates output metadata with streamCreate() and fileMetaCreate(), writes results via fileNameAbsGet() and fileCommit(), and finally calls streamClose(). The example plugin crackDetect takes the stream videoRaw and a Threshold parameter as input and produces the stream crackMap]
Figure 5.2: Interaction of Plugins with HSFO
Figure 5.2 demonstrates the interaction with the HSFO using the crack detection plugin.
An overview of the plugin is shown at the bottom of the figure: it operates on raw video
images and outputs binary images with the detected cracks. The threshold is a User Parameter that
decides the crack width. An algorithmic plugin is usually composed of domain-specific algorithms as
well as operations on HSFO such as retrieving stream files or dumping metafiles. The BDH library
encapsulates the common functionality to ease access to HSFO for plugins. As depicted in Figure 5.2,
the plugin retrieves the metadata of its input streams via streamInfoGet() together with the Stream
Definitions. With the fileIdList contained in the stream metadata, the plugin acquires the absolute
path of each file through fileNameAbsGet() and processes the files with its algorithms in a loop. The
stream metadata of the newly created stream is automatically generated by streamCreate() with the
given streamType and sensorID; similarly, the file metadata is composed via fileMetaCreate(). Using
fileNameAbsGet(), the refined stream data is then written to the new files at their absolute
paths. The interface fileCommit() updates the meta information of each new file (e.g., start/end
time stamps, number of samples), while streamClose() dumps the stream metafile to disk.
Table 5.3 presents the corresponding code example for the crack detection plugin, written in
C++. Since the BDH library is also exposed to Matlab and Python through SWIG, plugins can easily be
translated to other languages, enabling various development environments. The examples
show that only the algorithmic parts differ between algorithmic plugins, whereas their interaction
with the HSFO is fixed. This significantly simplifies the integration of new algorithms.
Table 5.3: Crack Detection Plugin Code Example

    VOTERS::tStreamInfo streamInfoOrig = BDHPlg.streamInfoGet(surveyId,
        sessionId, VOTERS::STREAM_SURFACE_IMG_RAW);
    int fileNum = streamInfoOrig.fileIdList.length();
    BDHPlg.streamCreate(surveyId, sessionId,
        VOTERS::STREAM_SURFACE_IMG_CRACK, VOTERS::SENSOR_CAM);
    for (int fileId = 0; fileId < fileNum; fileId++) {
        string fileNameAbsOrig = BDHPlg.fileNameAbsGet(surveyId, sessionId,
            VOTERS::STREAM_SURFACE_IMG_RAW, fileId);
        VOTERS::tFileId fileIdNew = BDHPlg.fileCreateMeta(surveyId, sessionId,
            VOTERS::STREAM_SURFACE_IMG_CRACK);
        string fileNameAbsNew = BDHPlg.fileNameAbsGet(surveyId, sessionId,
            VOTERS::STREAM_SURFACE_IMG_CRACK, fileIdNew);
        // ... specific data processing algorithms ...
        BDHPlg.fileCommit(surveyId, sessionId, VOTERS::STREAM_SURFACE_IMG_CRACK,
            fileIdNew, fileInfoNew);
    }
    streamClose(surveyId, sessionId, VOTERS::STREAM_SURFACE_IMG_CRACK);
5.3 Plugin Executor
As one plugin is often not able to accomplish all processing, multiple plugins are designed
and incorporated to perform incremental, algorithm-specific data processing. The PLEX is implemented
to schedule and automate the execution of a collection of plugins. Because the processing
of plugins complies with the metafile hierarchy of HSFO, the PLEX is adaptable and scalable enough to be
executed on HSFO at different levels of the hierarchical architecture of SIROM3. The integrated
design automation eliminates human interaction and boosts overall productivity.
5.3.1 Rule-based scheduler
Data processing is done by a group of plugins. The Rule-based scheduler defines
a configuration rule to direct the running sequence of the plugins: some plugins run in parallel,
while others must wait for preceding ones to finish because of data dependencies. Figure 5.3
illustrates the running sequences of plugins arranged by the Rule-based scheduler.
[Figure: the Rule-based Scheduler reads the PluginDef Table and streamDef Table together with user-defined Plugin Rules (e.g. Rule1: (Plugin 1 || Plugin 3), Plugin2; Rule2; ...) and triggers Plugin 1 and Plugin 3 in parallel, followed by Plugin 2. The resulting Streams A, B and C are written into HSFO.]
Figure 5.3: Scheduling of Plugins
In Figure 5.3, there is a list of user-defined Plugin Rules. The Rule-based scheduler takes
the input rules and runs the plugins in the given sequence. Note that the plugins are represented
by their shortName in the Plugin Rules; during data processing, these need to be replaced by the
command lines that execute the plugins. The PLEX looks up the full definition of each plugin in the
Plugin Definitions based on its shortName and composes the command line from part of
the settings (Path, Cmd and Args). The command lines are then invoked in the same order as
the sequence of shortName entries listed in the Plugin Rules. In the example, Rule1 defines a sequence in which
Plugin1 first runs in parallel with Plugin3, after which Plugin2 runs sequentially. The Rule-
based scheduler takes Rule1 and runs the plugins automatically in that order. The output streams
of Plugin1 and Plugin3 become the input of Plugin2. All the created streams (StreamA, B, C)
as well as their metafiles are written into HSFO for permanent storage through the HSFO APIs.
In order to integrate a new plugin into PLEX, a user adds a plugin definition
to the Plugin Definitions specifying the plugin name, input and output streams, arguments and so on. The
information about all involved streams must be available in the Stream Definitions for composing
the stream metadata. Finally, the shortName of the plugin is added to the Plugin Rules that
direct the Rule-based scheduler. The modularized design pattern of plugins significantly reduces
the effort of developing and experimenting with new algorithms, while the internal Rule-based scheduler
further enhances the overall automation. For instance, a serialization can be achieved by running a
number of plugins with dependencies; e.g., an image distortion correction plugin is often a necessary
predecessor of the crack detection data mining plugin [15].
5.3.2 Design Automation and Flexibility
The standard definition of plugins offers a unified solution to manage different kinds of
plugins. Benefiting from this, the command line to execute a plugin can be generated uniformly
from its definition following the calling convention. In addition, the Rule-based scheduler eliminates
human operations during execution and automatically triggers plugins in the
predefined order, which avoids erroneous operations and interruptions between plugins. Plugins with no
data dependencies can be arranged to run in parallel to improve performance. The PLEX also
makes it possible to check the execution progress and verify whether a plugin has run successfully during processing.
The modularized design of plugins facilitates the development of algorithms. As algorithm development is a
progressive effort, the algorithms are usually improved and need to be updated every once in
a while. Within the PLEX environment, each algorithm can be changed individually as needed without
interfering with the others. Beyond this design flexibility, the PLEX is platform-agnostic and can
be run at any level of SIROM3. For example, the PLEX can perform algorithmic plugins on
each MSA to fuse or process raw data from a single domain (e.g. real-time location optimization
with real-time streaming data). This level only contains temporal correlation between data points.
Knowledge-level fusion can be achieved with a similar approach yet a different algorithm at the
FCM level to fuse streams from multiple domains; this level adds knowledge of geometry, allowing
for spatial correlation. Therefore, data processing can exist at multiple levels in the distributed
architecture of SIROM3. A three-level data processing solution (Table 5.4) is proposed in [7] to
distribute processing responsibilities across different levels.
Table 5.4: Three Processing Levels

Level   Description                  Real-Time        Scope/Fusion
1       Sensor Processing (MSA)      Possible         Single domain, single sensor
2       On-board Processing (RSS)    Time-delayed     Multiple domains, multiple sensors, local geometry
3       Off-board Processing (FCM)   Post-processing  Multiple domains, multiple sensors, local and global geometry, multiple times
The advantages of PLEX include reusability, understandability, automation, and the al-
lowance for formal software analysis and integration techniques, which also increases the reliability
and power of the overall system. In summary, the HSFO and PLEX are integrated into the SIROM3
framework to provide a unified and efficient method to store, manage, aggregate and process
heterogeneous bulk sensor data. Both are platform-agnostic and run anywhere from sensors to data
centers. They bring great benefits such as minimal overhead for algorithm designers, rapid algo-
rithm exploration and fully automated data aggregation and fusion. The results of using HSFO and
PLEX are demonstrated in the next chapter.
Chapter 6
Experimental Results
To validate the software framework of SIROM3 as well as its real implementation in the VOTERS
project, the system has been subjected to a city-wide roadway condition inspection. This chapter
first briefly examines the time synchronization accuracy. It then focuses on the performance anal-
ysis of HSFO and PLEX during data collection and processing respectively, and discusses their
low overhead. Finally, the overall productivity of SIROM3 is compared with other methods to
demonstrate the benefits of using SIROM3 for roadway assessment.
6.1 Performance Analysis
6.1.1 Time Synchronization Accuracy
[Figure: (a) histogram of PTP clock offset (number of samples vs. offset in ns); (b) cumulative probability of communication latency overhead in µs for MSA1 through MSA5.]
Figure 6.1: Timing Analysis
Timing accuracy is of critical importance for sensor fusion since streams are correlated via
time stamps: different kinds of data need to agree upon a common temporal data point, and a large time
error would erroneously offset a geo-tagged data point in the space domain. Figure 6.1(a)
shows that our realized PTP synchronization stays within a maximum jitter of 12 µs with a standard
deviation of 2.0375 µs, which meets the synchronization requirement (359 µs) defined in [7].
Moreover, the 12 µs jitter results in an offset of only 0.33 mm in distance at the vehicle's roaming speed
of 100 km/h, small enough to achieve the desired spatial correlation (1 cm) between sensors.
In addition, a common time base is essential for correlating data across distributed subsystems;
collaboration across MSAs therefore needs to occur within timing bounds. To evaluate the communica-
tion performance, Figure 6.1(b) plots the cumulative probability of communication latency during
regular operations across MSAs. All MSAs exhibit very low communication latency: the 96th
percentile ranges from 3 µs to 9 µs, which corresponds to a maximum spatial deviation of 0.25 mm at
a speed of 100 km/h. This low latency ensures timely communication
and collaboration. Since each sample is time stamped either by sensors synchronized to the MSAs (e.g. the HD
video camera) or by the MSAs upon arrival, the synchronization services guarantee the accuracy and
correctness of the temporal data correlation.
6.1.2 MSA Performance Analysis
To assess SIROM3 quality, we evaluate the resource utilization and communication over-
head of the MSAs, outlined in Table 6.1, during data collection. The results are averaged over the period
of a survey; they reflect plain system operation (i.e. without extra plugins) and thus capture the overhead of
HSFO for creating meta information (e.g. generating file names and composing absolute file paths).
Table 6.1: MSA Performance Results

Systems                   MSA1      MSA2     MSA3     MSA4     MSA5     MSA5 (DAQ)
CPU [%]                   0.04      2.69     3.25     3.21     52.6     45.4
Network load [KBps]       216.392   83.06    800.97   842.98   212.68   NA
Avg. Comm. Latency [us]   3.7149    4.2916   6.3283   5.5105   4.2325   NA
Overall, Table 6.1 shows that all MSAs operate with fairly low CPU consumption by uti-
lizing asynchronous data acquisition (DAQ) and direct memory access (DMA). During data acqui-
sition, HSFO adds some overhead to system performance since each MSA repeatedly uses the
bulk data handling interfaces to create file meta information and retrieve files to write. The low CPU
utilization indicates that only a small overhead is introduced by HSFO. MSA5, the high-quality
video system, has a high CPU consumption due to simultaneous raw image capture and JPEG
compression. Note that the column MSA5 (DAQ) gives the baseline CPU utilization
of the video system without being integrated into SIROM3 and without using HSFO. Only a 7% difference in
CPU consumption is observed before and after the integration. As the video captures around 20 images
per second, the per-file overhead (e.g. creating meta information to maintain the HSFO structure;
each image is a separate file) caused by HSFO is fairly low. The meta information is kept in
memory during data collection and dumped to metafiles after the session stops, so that there
is no delay during a session. In this way, HSFO does not hinder the speed of data acquisition,
especially for sensors that do not switch to new files very often.
The local Gigabit network interconnecting all MSAs shows low utilization, as stream data
is collected and stored locally on each MSA. The low traffic across MSAs leaves headroom for the essential
collaboration messages. For instance, the video camera on MSA5 captures an image every 50 mm
while the vehicle roams through traffic, and the accurate distance information is provided by the
Distance Measurement Instrument (DMI) sensor on MSA1; they therefore exhibit similar network
load due to the shared data flow. On the other hand, MSA4 and MSA2 present higher network load
because of real-time streams: to validate the real-time data quality, a sample stream of the raw data at reduced
frequency is recorded and uploaded to the on-board controller every second, and then
shown on the real-time monitoring tablet. Such real-time streams are handled automatically
and differently by HSFO, which only uploads the files for real-time usage without dumping
metafiles. The average latency, ranging from 3 µs to 6 µs, guarantees accurate time synchronization
and real-time operation. As a result, HSFO provides reliable storage and handles different streams
flexibly without hindering the overall performance of the MSAs.
6.1.3 Statistics of Data Behaviors
To demonstrate the impact and results of HSFO, Table 6.2 shows selected statis-
tics describing the data behaviors observed in field tests.
As the three survey results show, each survey was conducted with a different number of streams,
as different sensors were active (e.g. survey 1 had the radar system active) or a stream used new
parameters (e.g. survey 2 used a compression parameter). These differences can be triggered
manually or auto-calibrated depending on geo-points of interest (e.g. higher resolution or more
sensors in critical areas, fewer sensors in less important places to save storage).
During survey 0, four streams on the vehicle participated, with a data acquisition speed of
Table 6.2: Statistics of Data Behaviors

                                     survey 0   survey 1   survey 2
Participating stream numbers         4          6          7
Meta size for 1 vehicle/h [MB/h]     10.112     8.363      8.96
Raw size for 1 vehicle/h [GB/h]      307.968    346.02     296.62
Meta size for 3 vehicles/h [MB/h]    30.336     25.089     26.88
Raw size for 3 vehicles/h [GB/h]     923.9      1038       890
Avg. Meta/Raw Ratio [%]              0.0032     0.0024     0.0029
approximately 307 gigabytes per hour. In survey 1, two new streams (HD camera and millimeter-
wave radar) were active during data collection, which increased the data generation rate to 346
gigabytes per hour. Due to the stream parameter adjustment (i.e. JPEG image compression)
introduced in survey 2, the data generation rate dropped to 296 gigabytes per hour. All three
surveys generate around 300-350 gigabytes of data per hour for one RSS. With three RSSs
running in parallel, nearly 1 terabyte of data in total needs to be collected, aggregated and analyzed
per hour. Compared to the terabytes of raw data collected, only a few hundred megabytes of metafiles
are necessary to maintain the extra meta information. With such a small footprint, metafiles can be easily
and quickly processed. The average ratio of metadata to raw data is approximately 0.003%.
As shown in the example surveys, HSFO accommodates the flexibility of SIROM3 well:
multiple distinct components can be inserted or removed seamlessly without interfering with other
operating components or the stored data. The efficient design of the metafiles in HSFO is critical
in balancing the storage space between raw data and extra information. Thanks to this multi-layered
overlay, HSFO offers substantial advantages in data storage, management and processing. A PLEX
is later executed on HSFO to process the data using algorithmic and systemic plugins.
6.1.4 HSFO Performance Overhead
The HSFO provides a reliable and efficient method to store and organize the data during a
survey. After the RSS finishes a survey and returns to the homezone, the data is uploaded and aggregated
into the centralized HSFO on the FCM and processed by PLEX through a set of plugins. To quantify
the impact of HSFO and PLEX, we measured their resource utilization and processing
overheads, shown in Table 6.3.
HSFO is a layered structure placed atop the host file system to categorize and organize
data automatically and uniformly, and it provides the bulk data handling library so that plugins can simply access
the data. This comes with a trade-off between automation and processing time. The time used for processing
Table 6.3: HSFO Performance Results

Data Management Method    With HSFO    Without HSFO
CPU [%]                   52.4         48.6
Execution time [s]        362          359
Processing Speed [GB/s]   0.04198      0.04233
Overhead                  0.19 µs/KB
metafiles introduces computing overhead, while the automation eliminates human-intrusive op-
erations and thereby increases overall productivity. The performance with and without HSFO was
compared in a Linux environment. Since the algorithm itself is independent of PLEX, the choice of
algorithm does not affect the comparison. Here we simply use a down-sampling plu-
gin that samples the data every other line and reduces each file to half its size. The experiment uses a 15 GB
dataset containing one stream with more than 600 separate files; note that the plugin
needs to process metadata for each file.
The results indicate that processing the data with HSFO using the down-sampling plugin takes 3-5
seconds longer than in the absence of HSFO. In other words, the time overhead accounts for
about 0.8%-1.4% of the original execution time. These 3 seconds cover all operations using the bulk data handling library,
from composing meta information to dumping metafiles into HSFO. As mentioned
before, any update to a metafile happens only in memory during processing and is dumped once
the stream is closed, to save time. Expressed differently, processing data with HSFO incurs
an overhead of 0.19 µs/KB compared to processing without it, small enough to be neglected. This is also
explained by the fact that the metafiles for managing the 15 GB of data total only 285.1 KB, which a plugin can
process quickly. Moreover, the difference in CPU utilization between the two methods is less than
5%. This implies that HSFO and PLEX have quite low overhead and do not impede performance.
With the bulk data handling library, different plugins can retrieve files through common
interfaces without duplicated work, which simplifies algorithm integration. Owing to the standard
definition of plugins and the centralized plugin scheduler, PLEX automates data processing by
eliminating human-interactive operations and significantly improves productivity. The system-level
impact of SIROM3 is discussed in the following section.
6.2 Data Fusion and System Impact
Given the accuracy in timing, Figure 6.2 illustrates data fusion opportunities through tem-
poral correlation. Three different data streams are correlated based on sample time to identify
[Figure: radar, acoustic, and pressure data streams aligned on a common time axis.]
Figure 6.2: Temporal Data Fusion Example
abnormalities on the pavement surface. In this example, manholes are differentiated from potholes
using multiple sensor sources to remove false positives. The data fusion process is fully
automated using PLEX, and the developed GIS visualization portal contains multiple data layers
covering a given roadway.
Figure 6.3 shows a snapshot of the GIS visualization portal with the city-
wide pavement condition for the entire road network of Brockton, MA. The overall condition
of a road is represented by the Pavement Condition Index (PCI) on a scale of 0-100 [33]. We surveyed
over 300 miles, gathering inspection data over all lanes using one RSS. Over 20 terabytes of data
have already been collected, aggregated, fused and visualized on the FCM and GIS server. Using the
system, the severity of road conditions can be prioritized so that proper repair and maintenance
actions can be taken under budgetary constraints.
Table 6.4: The Overall Impact of SIROM3

300 Miles Coverage       Data Collection [h]   Data Transfer [h]   Data Processing [h]   Overall [h]   Slowdown
Traditional Methods      640-800               0                   160-320               1120          22.4
Van-based systems [23]   24-32                 14-16               160-320               368           7.4
SIROM3                   16-20                 14-16               14                    50            1
In order to assess the value of the VOTERS project built upon the SIROM3 framework,
we compare the time effort for data collection, transfer and processing in three scenarios: traditional
methods, van-based systems, and SIROM3. The three periods are calculated independently,
without any overlap. For data collection, mobility clearly wins over traditional methods in the
case of SIROM3 and the van-based systems. However, data transfer is significant for
the mobile agents, as transferring large amounts of data is bandwidth-constrained, whereas field engi-
neers can retrieve their data immediately on site. Lastly, SIROM3 excels in data processing time owing
to the unified automation embodied in the plugins and PLEX. The slowdown in the last column
is calculated from the overall time spent from data collection to processing. The over-
all productivity has been increased by nearly 25 times using SIROM3 compared to the traditional
methods. Moreover, the scalability and expandability of SIROM3 create more diverse opportuni-
ties for integrating new sensor techniques, whereas the systems in [23] are less scalable and expandable
in cost-efficient ways.
[Figure: map of the Brockton, MA road network colored by the PCI-based road condition rating scale: Good (100-85), Satisfactory (85-70), Fair (70-55), Poor (55-40), Very Poor (40-25), Serious (25-10), Failed (10-0).]
Figure 6.3: City-wide Infrastructure Performance Inspection
As a result, SIROM3 significantly simplifies the construction of a scalable and efficient
multi-modal multi-sensor mobile sensor system. The HSFO embedded in SIROM3, together with a PLEX,
promotes and automates heterogeneous big sensor data management from storage and transfer to process-
ing and fusion. The VOTERS project, which is based on SIROM3, enables collection of infrastructure
health information at traffic speeds. This allows expanding analysis coverage and repeating inspec-
tions to validate and improve the desperately needed life-cycle models.
Although we have already automated the whole process from data collection and transfer
to processing, the three periods are conducted individually. The next step could be overlapping
these steps to gain even higher productivity. One opportunity lies in on-board data
processing: the responsibility for pre-processing moves from off-board (FCM) to each sensor
node (MSA), so that the files already available during a data collection can be processed on the MSA in real time. In
this way, data collection and processing are parallelized. Another benefit is that the reduced
data size shortens the time needed for data transfer. For instance, the raw images of roadway surface
conditions captured by the HD video camera can be pre-processed by a distortion correction plugin in
real time to correct them and remove useless information (e.g. the bumper of the car); the refined,
smaller data can then be uploaded instead of the raw data. During the data upload, the streams already
available on the FCM can also be processed by PLEX before the whole upload procedure has finished. Such
real-time processing is realized by running a PLEX on each MSA: like HSFO, the PLEX is designed
with high scalability and adaptability and can run on any system component, so a
PLEX on each MSA can perform real-time processing on the available raw data during
data collection.
Future research may revisit the layered design of HSFO, since our design is tailored
to civil infrastructure health monitoring. For example, we compare surveys from the same
location but different inspection dates to explore their deterioration or improvement. The same
mechanism can serve other RMMMS applications such as water quality and air pollution detection. To
extend the usage of SIROM3, the system can be translated into other domains with only
minor adjustments to HSFO. Taking human health care as an example, a survey could maintain
information about age, test site and date, while an upper layer keyed on a personal identifier could
aggregate the repeated surveys of one person.
Chapter 7
Conclusion
Performance monitoring of civil infrastructure is a crucial aspect of maintaining trans-
portation infrastructure. Cyber-Physical Systems (CPS) are a promising approach to acquiring infor-
mation on infrastructure conditions and time-varying behaviors, providing automated inspection
at lower cost with better coverage and time resolution. To handle the domain-specific big data
challenge in this CPS approach, the HSFO with the PLEX environment was introduced.
The hierarchically constructed HSFO provides a unified and efficient solution for storing,
organizing, managing and aggregating the large volumes of heterogeneous streaming data generated
by the distributed sensor systems. The metafiles preserve rich information about sensor hetero-
geneity and data collection dynamics. Moreover, the bulk data handling library provided by HSFO
eases plugins' access to the data. To leverage the power of HSFO, a PLEX is run on top of it to facilitate
data processing, fusion and visualization. The PLEX cooperates with HSFO to automate big
data handling, from data storage and transfer to processing through plugins. The PLEX also offers
a flexible plugin system that is an ideal testbed for developing data fusion and analysis methodologies. Since
both HSFO and PLEX are developed with high scalability and adaptability, they can be executed on
a wide range of platforms, from mobile systems to stationary servers. Our approach addresses big data
management, aggregation, processing and fusion for information discovery and data mining.
As part of the SIROM3 framework, HSFO and PLEX were demonstrated with the
entire framework and realized in the VOTERS project for network-wide roadway infrastructure mon-
itoring. Over 20 terabytes of data covering 300 miles have been collected, stored, and aggregated
so far in the city of Brockton, MA, and subsequently processed, geo-spatially analyzed, and visualized
to investigate infrastructure time-varying behaviors. The performance of HSFO with PLEX was
measured across the different periods, from data collection and storage to processing, confirming a
small overhead (0.19 µs/KB) on the entire system. The automation embodied in SIROM3 increased
the overall productivity by nearly 25 times. The fused big data gives city officials valuable information to
manage the life cycle of their infrastructure and coordinate investments.
Bibliography
[1] Karl Aberer, Manfred Hauswirth, and Ali Salehi. Infrastructure for data processing in large-
scale interconnected sensor networks. In Mobile Data Management, 2007 International Con-
ference on, pages 198–205. IEEE, 2007.
[2] Federal Highway Administration. 2008 status of the nation's highways, bridges, and transit:
Conditions and performance. Report to Congress, U.S. DOT FHWA, 2008.
[3] ASTM. Standard practice for measuring delaminations in concrete bridge decks by sounding:
ASTM D4580-03 (reapproved 2007), 2007.
[4] ASTM. Standard test method for water-soluble chloride in mortar and concrete: ASTM
C1218/C1218M-99, 2008.
[5] ASTM. Standard test method for corrosion potentials of uncoated reinforcing steel concrete:
ASTM C876-09, 2009.
[6] Christopher L. Barnes and Jean-Francois Trottier. Effectiveness of ground penetrating radar
in predicting deck repair quantities. Journal of Infrastructure Systems, 10:69–76, 2004.
[7] Ralf Birken, Gunar Schirner, and Ming Wang. VOTERS: design of a mobile multi-modal
multi-sensor system. In Proceedings of the Sixth International Workshop on Knowledge Dis-
covery from Sensor Data, SensorKDD ’12, pages 8–15, New York, NY, USA, 2012. ACM.
[8] Ralf Birken, Ming Wang, and Sara Wadia-Fascetti. Framework for continuous network-wide
health monitoring of roadways and bridge decks. In Proceedings of Transportation Systems
Workshop 2012, Austin, TX, March 2012.
[9] M. Bocca, J. Toivola, L.M. Eriksson, J. Hollmén, and H. Koivo. Structural health monitoring
in wireless sensor networks by the embedded Goertzel algorithm. In 2011 IEEE/ACM Interna-
tional Conference on Cyber-Physical Systems (ICCPS), pages 206–214, 2011.
[10] Kameswari Chebrolu, Bhaskaran Raman, Nilesh Mishra, Phani Kumar Valiveti, and Raj Ku-
mar. Brimon: a sensor network system for railway bridge monitoring. In Proceedings of
the 6th international conference on Mobile systems, applications, and services, MobiSys ’08,
pages 2–14, New York, NY, USA, 2008. ACM.
[11] M. Daum, M. Fischer, M. Kiefer, and K. Meyer-Wegener. Integration of heterogeneous sensor
nodes by data stream management. In Tenth International Conference on Mobile Data Man-
agement: Systems, Services and Middleware, 2009. MDM ’09, pages 525–530, May 2009.
[12] Jakob Eriksson, Hari Balakrishnan, and Samuel Madden. Cabernet: vehicular content delivery
using WiFi. In Proceedings of the 14th ACM international conference on Mobile computing
and networking, MobiCom ’08, pages 199–210, New York, NY, USA, 2008. ACM.
[13] M.R. Fetterman, T. Hughes, N. Armstrong-Crews, C. Barbu, K. Cole, R. Freking, K. Hood,
J. Lacirignola, M. McLarney, A. Myne, S. Relyea, T. Vian, S. Vogl, and Z. Weber. Distributed
multi-modal sensor system for searching a foliage-covered region. In 2011 IEEE Conference
on Technologies for Practical Robot Applications (TePRA), pages 7–14, 2011.
[14] L. Galehouse, J.S. Moulthrop, and R.G. Hicks. Principles of pavement preservation: Defini-
tions, benefits, issues, and barriers. TR News, September-October 2003, pp. 4-15. Transporta-
tion Research Board (TRB), National Research Council, Washington, D.C., 2003.
[15] Sindhu Ghanta, Ralf Birken, and Jennifer Dy. Automatic road surface defect detection from
grayscale images. In Proceedings of SPIE Symposium on Smart Structures and Materials +
Nondestructive Evaluation and Health Monitoring, March 2012.
[16] N. Gucunski, F. Romero, S. Kruschwitz, R. Feldmann, A. Abu-Hawash, and M. Dunn. Mul-
tiple complementary nondestructive evaluation technologies for condition assessment of con-
crete bridge decks. Transportation Research Record, No. 2201, 2010.
[17] L. Gurgen, C. Labbe, V. Olive, and C. Roncancio. A scalable architecture for heterogeneous
sensor management. In Sixteenth International Workshop on Database and Expert Systems
Applications, 2005. Proceedings, pages 1108–1112, 2005.
[18] Levent Gurgen, Claudia Roncancio, Cyril Labbé, André Bottaro, and Vincent Olive.
SStreaMWare: a service oriented middleware for heterogeneous sensor data management.
In Proceedings of the 5th International Conference on Pervasive Services, ICPS ’08, pages
121–130, New York, NY, USA, 2008. ACM.
[19] G. Hackmann, W. Guo, G. Yan, Z. Sun, C. Lu, and S. Dyke. Cyber-physical codesign of
distributed structural health monitoring with wireless sensor networks. Early Access Online,
2013.
[20] Yo-Ming Hsieh and Yu-Cheng Hung. A scalable IT infrastructure for automated monitoring
systems based on the distributed computing technique using simple object access protocol
web-services. Automation in Construction, 18(4), July 2009.
[21] Bret Hull, Vladimir Bychkovsky, Yang Zhang, Kevin Chen, Michel Goraczko, Allen Miu, Eu-
gene Shih, Hari Balakrishnan, and Samuel Madden. CarTel: a distributed mobile sensor com-
puting system. In Proceedings of the 4th international conference on Embedded networked
sensor systems, SenSys ’06, New York, NY, USA, 2006. ACM.
[22] IEEE. 1588-2008 - IEEE Standard for a Precision Clock Synchronization Protocol for Net-
worked Measurement and Control Systems, 2008.
[23] Pathway Services Inc. Automated road and pavement condition surveys, 2009.
[24] F. Jalinoos, R. Arndt, D. Huston, and J. Cui. Periodic NDE for bridge maintenance. In Proceed-
ings of the Structural Faults and Repair Conference, Edinburgh, June 2010.
[25] R. Kozma, Lan Wang, K. Iftekharuddin, E. McCracken, M. Khan, K. Islam, and R.M. Demirer.
Multi-modal sensor system integrating COTS technology for surveillance and tracking. In
2010 IEEE Radar Conference, pages 1030–1035, 2010.
[26] M. Lindner. libconfig – C/C++ configuration file library.
[27] Xuefeng Liu, Jiannong Cao, Wen-Zhan Song, and ShaoJie Tang. Distributed sensing for high
quality structural health monitoring using wireless sensor networks. In 2012 IEEE 33rd Real-
Time Systems Symposium (RTSS), 2012.
[28] Samuel R. Madden, Michael J. Franklin, Joseph M. Hellerstein, and Wei Hong. TinyDB: an
acquisitional query processing system for sensor networks. ACM Transactions on Database
Systems (TODS), 30(1):122–173, 2005.
[29] K.R. Maser. Bridge deck condition surveys using radar: Case studies of 28 New England decks.
Transportation Research Record, No. 1304, TRB, National Research Council, 1991.
[30] K.R. Maser, J. Doughty, and R. Birken. Characterization and detection of bridge deck deterio-
ration. In Proceedings of the Engineering Mechanics Institute (EMI 2011), Boston, MA, June
2011.
[31] D.L. Mills. Network Time Protocol (Version 3) specification, implementation and analysis.
Network Working Group Report RFC-1305, March 1992.
[32] American Society of Civil Engineers. 2013 report card for America’s infrastructure, 2013.
[33] S. Shahini Shamsabadi, M.L. Wang, and R. Birken. Pavement condition monitoring by fus-
ing data from a mobile multi-sensor inspection system. In Proceedings of the SAGEEP on
Geophysical Data Management, Boston, MA, USA, 2014.
[34] S. Shahini Shamsabadi, R. Birken, and M.L. Wang. PAVEMON: a GIS-based data management
system for pavement monitoring based on large amounts of near-surface geophysical sensor
data. In Proceedings of the IACSM on Structure Control and Monitoring, Barcelona, Spain,
2014.
[35] L.B. Stevens. Road surface management for local governments’ resource notebook. Federal
Highway Administration, Washington, D.C. Publication No. DOT-I-85-37, 1985.
[36] J. Walls and M.R. Smith. Life-cycle cost analysis in pavement design. Federal Highway
Administration, Washington, D.C. FHWA report FHWA-SA-98-079, 1998.
[37] Ming Wang, Ralf Birken, and Salar Shahini Shamsabadi. Framework and implementation of
a continuous network-wide health monitoring system for roadways. In SPIE Smart Structures
and Materials + Nondestructive Evaluation and Health Monitoring, pages 90630H–90630H.
International Society for Optics and Photonics, 2014.
[38] J. Zhang, H. Qiu, S. Shahini Shamsabadi, R. Birken, and G. Schirner. SIROM3 – a scalable
intelligent roaming multi-modal multi-sensor framework. In Proceedings of the 38th Annual
IEEE International Computers, Software, and Applications Conference, Västerås, Sweden, 2014.
[39] J. Zhang, H. Qiu, S. Shahini Shamsabadi, R. Birken, and G. Schirner. WiP: system-level
integration of mobile multi-modal multi-sensor systems. In Proceedings of the ACM/IEEE 5th
International Conference on Cyber-Physical Systems, Berlin, Germany, 2014.
[40] Yang Zhang, B. Hull, H. Balakrishnan, and S. Madden. ICEDB: intermittently-connected
continuous query processing. In IEEE 23rd International Conference on Data Engineering
(ICDE 2007), 2007.