Managing Bulk Sensor Data for Heterogeneous Distributed Sensor
Systems
A Thesis Presented
by
Hanjiao Qiu
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements
for the degree of
Master of Science
in
Electrical and Computer Engineering
Northeastern University
Boston, Massachusetts
April 2014
To my family.
Contents
List of Figures iv
List of Tables v
List of Acronyms vi
Acknowledgments vii
Abstract of the Thesis viii
1 Introduction 1
2 Background and Motivation 5
2.1 CPS Approach 5
2.2 Big Data Challenges 6
2.3 Data Stream Management 8
3 Versatile Onboard Traffic-Embedded Roaming Sensors (VOTERS) System Overview 10
3.1 VOTERS Van Implementation 10
3.2 Data Volumes and Types 12
3.3 The VOTERS Software Solution Scalable Intelligent ROaming Multi-modal Multi-sensor (SIROM3) 13
4 Heterogeneous Stream File system Overlay (HSFO) 17
4.1 Fusion Foundations 18
4.2 HSFO Overview 19
4.3 Metadata Definition 22
4.3.1 Streams and Files 23
4.3.2 Sessions 26
4.3.3 Surveys 27
4.4 Big Data Storage Location 28
4.5 Metadata Example for Data Correlation 30
4.6 Bulk Data Handling Library 31
4.7 Big Data Transfer and Aggregation 33
5 Plugin Executor (PLEX) 35
5.1 Plugin Executor (PLEX) Environment Overview 35
5.2 Plugins 36
5.2.1 Plugin Definition 36
5.2.2 Interaction with HSFO 38
5.3 Plugin Executor 40
5.3.1 Rule-based scheduler 41
5.3.2 Design Automation and Flexibility 42
6 Experimental Results 44
6.1 Performance Analysis 44
6.1.1 Time Synchronization Accuracy 44
6.1.2 Multi-Sensor Aggregator (MSA) Performance Analysis 45
6.1.3 Statistics of Data Behaviors 46
6.1.4 Heterogeneous Stream File-system Overlay (HSFO) Performance Overhead 47
6.2 Data Fusion and System Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7 Conclusion 52
Bibliography 54
List of Figures
1.1 Time-varying behavior of civil infrastructure . . . . . . . . . . . . . . . . . . . . . 2
2.1 Transportation CPS Overview 6
2.2 Big Data Challenge 7
2.3 Automated Data Flow 8
3.1 VOTERS Van with Sensors 11
3.2 SIROM3 Multi-Tier Hierarchical Architecture 13
3.3 SIROM3 Implementation Architecture 15
4.1 HSFO Overview 20
4.2 HSFO Metafile Definition 23
4.3 Example Stream Definition Table 25
4.4 Stream Creation 26
4.5 Example Stream Definition Table 27
4.6 Big Data Storage Location 29
4.7 Metafile Example for Data Correlation 30
4.8 Abstraction of Bulk Data Handling Library 33
4.9 Data Transfer and Aggregation 34
5.1 PLEX Environment 36
5.2 Interaction of plugins with HSFO 39
5.3 Scheduling of Plugins 41
6.1 Timing Analysis 44
6.2 Temporal Data Fusion Example 49
6.3 City-wide Infrastructure Performance Inspection 50
List of Tables
3.1 Data diversity and volume . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.1 Stream Definition 24
4.2 Name Convention for Big Data Directories 28
4.3 Big Data Handling Library APIs 32
5.1 Plugin Definition 37
5.2 Crack Distortion Correction Plugin Example 37
5.3 Crack Detection Plugin Code Example 40
5.4 Three Processing Levels 43
6.1 MSA Performance Results 45
6.2 Statistics of Data Behaviors 47
6.3 HSFO Performance Results 48
6.4 The Overall Impact of SIROM3 49
List of Acronyms
CPS Cyber-Physical Systems.
HSFO Heterogeneous Stream File-system Overlay.
PLEX Plugin Executor.
SIROM3 Scalable Intelligent ROaming Multi-modal Multi-sensor.
MMMS Multi-Modal Multi-Sensor.
RMMMS Roaming Multi-Modal Multi-Sensor.
MSA Multi-Sensor Aggregator.
RSS Roaming Sensor System.
FCM Fleet Management and Control.
VOTERS Versatile Onboard Traffic-Embedded Roaming Sensors.
RTE Run-time Environment.
PTP Precision Timing Protocol.
NTP Network Timing Protocol.
SBC Single Board Computer.
BDH Bulk Data Handling.
Acknowledgments
The work presented here could not have been completed without the contributions of many supportive and knowledgeable people.
Foremost, I want to express my deep gratitude to the committee chair and my advisor, Dr. Gunar Schirner, for his continuous support and mentorship during my master's program. His patience, enthusiasm, and immense knowledge helped and encouraged me throughout my research and the writing of this thesis. From him I have learned not only technical expertise, but also a rigorous attitude towards work and a relentless pursuit of perfection.
My sincere appreciation also goes to the rest of the committee, Dr. Ralf Birken and Dr. Ming Wang, for their guidance and help over these years. Their devotion to the VOTERS project spurred me on in completing my work and in actively participating in this interdisciplinary project.
Special thanks to my colleague, Ph.D. candidate Jiaxing Zhang. Without his referral and recommendations, I would never have had the chance to join this lab and obtain financial support. He encouraged me and offered creative ideas when I was inexperienced, is always willing to lend a hand with his best suggestions, and sets a good example for me as a good friend in my life. I also want to thank my friends, most of whom are also alumni of Northeastern University, for all the fun we had in this city far from home, and for their cheer and care.
Last but not least, I want to thank my family and Guancheng Gu. They are always there for me and offer sincere consolation and love when I am fragile, and they stand by me through the good times and bad. Their arms and hugs are always my safest harbor.
This work is part of the VOTERS project, a joint venture of Northeastern University, the University of Vermont, the University of Massachusetts Lowell, Earth Science System LLC, and Trilion Quality Systems Inc. This project is supported by the U.S. Department of Commerce, National Institute of Standards and Technology, Technology Innovation Program.
Abstract of the Thesis
Managing Bulk Sensor Data for Heterogeneous Distributed Sensor
Systems
by
Hanjiao Qiu
Master of Science in Electrical and Computer Engineering
Northeastern University, April 2014
Gunar Schirner, Adviser
The current U.S. transportation infrastructure requires tremendous investment to maintain due to critical roadway and pavement conditions. To prioritize repair expenditures, Cyber-Physical Systems (CPS) are a promising way to obtain intrinsic knowledge about infrastructure performance, such as roadway surface and subsurface deterioration, through sensors and actuators. However, to characterize and quantify the infrastructure's time-varying behavior (infrastructure health and life cycle) in a cost-efficient and non-intrusive way, an underlying framework to handle domain-specific big data in CPS is needed.
This thesis proposes a holistic approach to managing the domain-specific bulk sensor data generated by heterogeneous distributed sensor systems, addressing the point where CPS meets big data. We focus on big data handling in the Scalable Intelligent ROaming Multi-Modal Multi-Sensor (SIROM3) framework, which collects data about roadway conditions from multiple domains through mobile agents. A Heterogeneous Stream File-system Overlay (HSFO) is proposed as a platform-independent layer to uniformly define, organize, and manage the high volume of heterogeneous streaming data. Additionally, a flexible plugin system (PLEX) is introduced to simplify and automate data feature extraction, correlation, fusion, and visualization. Both HSFO and PLEX are designed for high scalability and adaptability: they can be executed on a wide range of platforms, from mobile systems to mainstream servers, with a common software/hardware stack. Our solution addresses big data collection, storage, and aggregation through to processing and knowledge discovery. The embodied automation eliminates human intervention at every stage and increases overall efficiency.
Over 20 terabytes of data covering 300 miles have been collected, aggregated, and fused to comprehend the pavement dynamics of the entire city of Brockton, MA. The performance of data processing with and without HSFO was compared. The results indicate that processing data
with HSFO adds an overhead of 0.19 µs/KB over processing without HSFO, and the difference in CPU utilization between the two methods is less than 5%. This implies that HSFO and PLEX have quite low overhead and pose a negligible impediment to system performance. The unified automation they provide has demonstrated a significant increase in overall productivity, by nearly 25 times, from data collection to processing. As a result, we established foundational tools for managing big data for distributed multi-modal multi-sensor systems in civil infrastructure monitoring. They provide a rapid and comprehensive understanding of civil infrastructure health and support life-cycle management.
Chapter 1
Introduction
There are more than four million miles of roads and nearly 600,000 bridges in the U.S.
demanding a broad range of maintenance activities [2]. According to the ASCE report in 2013 [32],
the U.S. infrastructure embarrassingly scored only a "D+" due to deterioration and lack of proper care. One in four bridges is either structurally deficient or functionally obsolete; one third of the U.S.'s major roads are in poor or mediocre condition. The monumental challenge of current transportation infrastructure management is the prioritization of expenditures within budgetary constraints as well as the implementation of maintenance and repairs. It is estimated that $3.6 trillion in investments is needed over 5 years to bring the nation's infrastructure up to a good condition.
In current civil infrastructure, pavement deterioration frequently takes place below the surface and cannot be evaluated by visual means [8]. Pavement deteriorates due to internal moisture damage, debonding, and loss of subsurface support. Pavement layers are subjected to extensive abrasion and deterioration from service loading (e.g. traffic) and environmental attacks (e.g. freeze-thaw, rain, road salts). Figure 1.1 conceptually illustrates the dynamic change of roads (life-cycle modeling) and the optimal opportunities for early deterioration identification and responsive repairs.
If certain distresses are repaired before they reach critical levels, at least five times less money will be spent compared to a late overhaul. Worse, the impact is exponentially amplified for large-scale civil infrastructure left without intervention, whereas with early repairs the pavement condition can be maintained over a long span of time. Identifying trouble spots as soon as they appear will save huge amounts of money, time, and effort, and prolong the longevity of the pavement. It is of primary importance to acquire detailed information about the speed of structural health deterioration so that responsive maintenance is possible and the actual infrastructure life cycle can be mapped. Long time intervals between two inspections may miss the optimal window to
Figure 1.1: Time-varying behavior of civil infrastructure. (Plot: pavement condition, from Excellent down to Failed, versus pavement age in years, 1-25. Spending $1 on preventive maintenance early (seal cracks and fill potholes) avoids $4 to $5 of major maintenance later (resurface entire road).)
repair. Therefore, knowing the infrastructure life cycle is necessary to prioritize and direct investments to the areas of most need and benefit. As a result, frequent structural health monitoring is essential to obtain the complete curve of the infrastructure life cycle.
However, current inspection methods often introduce issues such as intrusive data gathering (e.g. stopping traffic), huge amounts of human effort, and consequently infrequent data collection (e.g. at intervals of decades) and limited coverage [32]. Traditional inspection methods [3, 4, 5] are slow, cause traffic delays and road closures, and are often not effective. Even novel technologies such as ground penetrating radar (GPR) and infrared thermography [30, 16, 24, 6, 29] suffer either from the need for traffic closures or from insufficient spatial data coverage. With these methods, it is impossible to obtain the life cycle of the time-varying behavior of civil infrastructure due to the difficulty of continuously monitoring the pavement status. Therefore, a non-intrusive, automated, and cost-effective system with easy-to-deploy sensor technology is crucial for improved roadway inspection methodologies and for comprehensively understanding the dynamics of civil infrastructures. In addition, the ability to manage and jointly consider data from multiple service providers, locations, and inspection dates is highly desired.
A Cyber-Physical System (CPS) is a potential candidate: it integrates computing elements with physical processes to observe, react to, and actuate physical entities, and often involves heterogeneous components, hybrid systems, and design automation methodologies. In particular, applying CPS approaches to infrastructure monitoring is a promising way to address the societal need outlined above through interdisciplinary research (from computer science to civil engineering) and to invent solutions where CPS meets big data. An application of the
CPS solution is a Roaming Multi-Modal Multi-Sensor (RMMMS) system, which embeds heterogeneous sensors and computing onto a mobile agent to collect multi-modal data under roaming conditions [8, 10]. Research efforts have been devoted to applying multi-modal multi-sensor systems to the civil domain in a cyber-physical approach [9, 19, 27, 25].
However, the CPS solution is difficult to develop due to the heterogeneity in multi-modal sensors, synchronization principles, reliable communication, and coordinated operations. Typically, sensor systems are tailored to a specific type of application, which impedes overall adaptability and scalability. In addition, as data-intensive sensors produce large volumes of streaming data in real time, it is challenging to store, access, process, and cross-reference the diverse big data. Complexity and challenges explode further if multiple RMMMS systems are deployed to increase the geographical coverage area and to survey repetitively for a time-varying assessment.
The Versatile Onboard Traffic-embedded Roaming Sensors (VOTERS) [7, 8] system is a real implementation of an RMMMS system. It incorporates the CPS approach and complements periodic localized inspections of roadways and bridge decks with continuous network-wide health monitoring using multi-modal sensors mounted on vehicles. To solve the above challenges in RMMMS systems, the Scalable Intelligent ROaming Multi-Modal Multi-Sensor (SIROM3) framework is proposed in [38] as the software solution of VOTERS. It provides a unified solution to simplify the development, deployment, prototyping, and management of multiple RMMMS systems.
In this work, we focus on managing the bulk sensor data generated by the heterogeneous distributed sensors in SIROM3. The large volume of data from multiple domains must be efficiently stored, aggregated, and manipulated for knowledge discovery. We propose a Heterogeneous Stream File-system Overlay (HSFO) to efficiently store, manage, and aggregate the heterogeneous big data through a metafile hierarchy. The design complexity arising from survey dynamics and sensor versatility is decomposed into the different layers of HSFO, and each layer abstracts its corresponding metadata into metafiles. A Plugin Executor (PLEX) is designed to run on top of HSFO for data correlation and referencing across distributed components and different domains. Both HSFO and PLEX are platform-agnostic and can run on any level or component of SIROM3. They cooperatively automate big data handling from data collection and transfer through processing and fusion.
We used the VOTERS system to conduct a city-wide pavement condition inspection in Brockton, MA. Over 20 terabytes of data covering 300 miles of road were collected, aggregated, fused, and geo-spatially visualized for roadway assessment. The design automation enables frequent monitoring, which offers an ideal platform for investigating the time-varying behaviors of roadways.
The thesis is organized as follows: Chapter 2 introduces the CPS approach and the big data challenges it faces. Chapter 3 overviews the VOTERS system as well as its software solution, the SIROM3 framework. Chapter 4 details how HSFO manages the big data. Chapter 5 elaborates on automated data processing and fusion using PLEX. Chapter 6 presents survey results and evaluates performance from multiple aspects. Chapter 7 draws conclusions from the research.
Chapter 2
Background and Motivation
This chapter examines the cyber-physical approach as a promising solution for understanding the life cycle of infrastructure health conditions. It also raises the challenges of big data handling in multi-modal multi-sensor systems, which drive the need for a unified solution to realize data fusion and an automated strategy to facilitate knowledge discovery. It then introduces and compares current methodologies for distributed data stream analysis and sensor data fusion.
2.1 CPS Approach
Current methods for structural health inspection often require intrusive data collection, obstruction of traffic, prolonged post-analysis, and huge manual effort and expenditure. Therefore, with current technology, it is infeasible to conceive an efficient, flexible, and automated solution to acquire the life cycle with the time-varying behaviors of civil infrastructures. This drives the need for an integrated cyber-physical system (CPS) approach, which incorporates resourceful computational power into the observation, measurement, and actuation of physical processes. In this case, the CPS approach offers automatic control of heterogeneous sensor groups and integration of diverse big data, providing an efficient way to comprehensively evaluate transportation infrastructure performance.
Infrastructure inspection methodologies can be portrayed as a grand loop including construction, usage, deterioration, maintenance, and repair. However, this control loop has been put aside in spite of its criticality, due to long latencies. Figure 2.1 abstracts the whole procedure into a CPS model and presents a conceptual solution. A diversity of sensing applications is embedded in mobile roaming acquisition systems. Huge amounts of data are collected by the mobile heterogeneous sensor systems during surveys, feeding into multi-layer multi-modal data fusion. An asset performance
Figure 2.1: Transportation CPS Overview. (Diagram: mobile sensors observe the transportation infrastructure and produce sensor big data; multi-layer, multi-modal fusion yields asset performance metrics that feed maintenance decisions, which in turn drive the actuators: construction and repair operations and the mobile sensors themselves.)
metric is subsequently yielded as a vital result and imported into a maintenance decision system whose duty is to make maintenance and construction decisions. The actionable information is fed back into the system to control the actuators: repair and construction operations and the mobile sensors. This control loop would accelerate data collection, storage, and processing to guide maintenance and construction decisions. The fine-grained and continuous infrastructure monitoring makes detailed modeling of infrastructure deterioration and life cycle readily available, leading to the possibility of early, affordable care. In short, the CPS approach establishes a way to manage the life cycle and health of civil infrastructures.
2.2 Big Data Challenges
The CPS approach hinges on overcoming big data challenges. It is inevitable that some sensor systems have to perform fast sampling in time (e.g. mm-wave radar) or dense sampling in space (e.g. Ground Penetrating Radar (GPR) arrays). The huge volume of streaming data generated may quickly flood the entire system during data collection. Since a hybrid system with computationally intensive components requires a distributed solution, this issue gets more severe when multiple such components operate simultaneously. Distributed systems entail coordination and collaboration, which can be even more convoluted; this is particularly true given that tight time synchronization is the foundation for sensor fusion in a distributed, fast-moving roaming sensor system. Network and real-time constraints in embedded processing unavoidably affect data migration and demand data compression.
The magnitude of the big data challenge is shown in Figure 2.2. Terabytes of data will be generated from densely distributed sensors of different types and attributes, including acoustic
Figure 2.2: Big Data Challenge. (Diagram: acoustic sensor arrays, radar sensor arrays, and an HD video camera generate over terabytes of data daily, flowing into storage and fusion.)
sensor arrays, HD video cameras, and radar sensor arrays in continuous streams daily; this data is then aggregated in centralized storage for data fusion and knowledge discovery. The problem becomes harder still as more sensor types are introduced and time-lapse surveys are performed. Big data handling poses challenges including managing large data streams, providing efficient data storage and compression, allowing identification and classification, supporting fast and reliable transfer, and enabling data fusion from heterogeneous sensors as well as geo-spatial visualization. The goal of this work is to manage and make use of the bulk data of heterogeneous distributed sensor systems. Consequently, a hierarchical file-system overlay such as the Heterogeneous Stream File-system Overlay (HSFO) is needed to store, categorize, filter, and aggregate big data originating from various sensors. The standard definition of each layer in HSFO benefits the management of streams with different attributes through general solutions. A data processing environment like the Plugin Executor (PLEX) is required as a consumer, focusing on retrieving, refining, cross-referencing, and fusing data to obtain joint and integrated results. It is essential for PLEX to be adaptable to complex streaming applications as well as capable of integrating new fusion algorithms easily. HSFO and PLEX should be platform-agnostic so that they work flexibly and compatibly on a wide range of platforms.
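Conceptually, integrating a new fusion algorithm into such a plugin environment needs only a small contract between the executor and each algorithm. The sketch below is hypothetical: the `Plugin` interface, the `CrackDetector` class, and the stream layout are invented for illustration, not the actual PLEX API described later in this thesis.

```python
from abc import ABC, abstractmethod

class Plugin(ABC):
    """Hypothetical plugin contract: an executor can schedule any
    algorithm that declares its inputs and implements run()."""

    @abstractmethod
    def inputs(self):
        """Names of the data streams this plugin consumes."""

    @abstractmethod
    def run(self, streams):
        """Process the input streams and return a derived result."""

class CrackDetector(Plugin):
    """Toy example of a new fusion algorithm dropped into the system."""

    def inputs(self):
        return ["camera_hd"]

    def run(self, streams):
        # Placeholder analysis: count frames flagged as containing a crack.
        return sum(1 for frame in streams["camera_hd"] if frame.get("crack"))

detector = CrackDetector()
result = detector.run({"camera_hd": [{"crack": True}, {"crack": False}]})
print(result)  # 1
```

Because the executor sees only the interface, adding an algorithm means adding one class, with no changes to the storage or scheduling layers.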
To boost overall productivity, a design automation methodology should also be integrated to facilitate data fusion and reduce human intervention throughout the whole procedure. Figure 2.3 illustrates the integral automated data flow. The physical entities show the real implementation, and the orange arrows reveal the data flow running through them, corresponding to the four phases on the right. The first phase is an array of multi-modal sensors deployed on mobile vehicles to collect data. Data from multiple domains is gathered by these distributed sensors during roaming and managed in their local storages. These local storages will then be
uploaded and aggregated to a control center for centralized storage and management whenever the network is available. Eventually, applications operate on the data for processing, correlation, fusion, and visualization. The whole procedure can be automated, allowing operators to work remotely and make timely decisions.
Figure 2.3: Automated Data Flow. (Diagram of the four phases: sensor collection; local storage; upload and aggregation; processing, fusion, and visualization.)
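The four phases can be sketched as a minimal pipeline. Every function name and data shape below is an illustrative assumption, not the SIROM3 implementation; the point is only that each phase has a clean input/output boundary, which is what makes end-to-end automation possible.

```python
# Phase 1: sensors on a vehicle collect samples (3 dummy samples each).
def collect(sensors):
    return [{"sensor": s, "samples": [i for i in range(3)]} for s in sensors]

# Phase 2: records are kept in the vehicle's local storage.
def store_locally(records):
    return list(records)

# Phase 3: local storages are uploaded and merged at a control center.
def upload_and_aggregate(local_storages):
    central = []
    for storage in local_storages:
        central.extend(storage)
    return central

# Phase 4: applications process and fuse the centralized data.
def process_and_fuse(central):
    return {rec["sensor"]: len(rec["samples"]) for rec in central}

# Two hypothetical vans with different sensor sets.
vans = [["gpr", "camera"], ["acoustic"]]
central = upload_and_aggregate([store_locally(collect(v)) for v in vans])
fused = process_and_fuse(central)
print(fused)  # {'gpr': 3, 'camera': 3, 'acoustic': 3}
```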
The strong processing power and automated design brought by HSFO and PLEX would effectively address the big data challenges and manage the heterogeneous bulk sensor data in the CPS approach, reducing the long latencies and shortening the execution cycle of the control loop. In the context of life-cycle management for roadway infrastructures, the design of HSFO with PLEX significantly benefits the capture of the pavement life cycle, so that preventive maintenance and early repairs can be attained for future civil infrastructures.
2.3 Data Stream Management
RMMMS systems in the civil application domain have recently emerged [21, 12, 13], offering real-time monitoring with more mobility and coverage. However, the large volumes of heterogeneous data generated by an RMMMS system have not received much attention.
In [17, 20], middleware and scalable architectures for processing and managing heterogeneous sensor data are proposed and implemented. Unlike those approaches, which focus on the efficiency of manipulating the data, SIROM3 addresses the big data issue from a file-system overlay perspective. Another line of work handles heterogeneous big data from a database perspective [40, 18]. Although this provides an easy interface for querying the data, certain overhead is associated with database operations. Furthermore, while the database approach is often only suitable on the data center side, the overlay design in SIROM3 is flexible enough to run platform-independently with little overhead.
Many research efforts have been devoted to managing and integrating data streams distributed over heterogeneous sensor nodes. As a wide range of surveillance applications using distributed wireless sensor networks (WSN) is desired, several avenues in data stream management systems (DSMS) have been explored in recent years to aggregate heterogeneous distributed sensor data and acquire complete knowledge of the physical world [11]. Global Sensor Networks (GSN) [1] offers middleware that integrates heterogeneous sensor networks at the network level. Each kind of sensor network is abstracted into a virtual sensor, which helps provide a homogeneous view of heterogeneous sensors. The virtual sensor definition is an XML file containing the stream's name, address, output structure, input streams, and the SQL queries that lead to the output stream. SStreaMware [18] uses three-level query processing. Unlike GSN, which processes in a bottom-up fashion, SStreaMware uses a hierarchical flow in which a centralized control site provides a global sensor query service (SQS) and distributes subqueries to lower-level gateways. Each gateway further sends decomposed subqueries down to the proxies of the different sensors. TinyDB [28] is another middleware approach, built directly on TinyOS. In this method, each sensor node is installed with TinyDB so that all nodes can be viewed homogeneously. Similar to SStreaMware's top-down approach, a base station parses a query and distributes it down to the corresponding sensors in a tree-based manner; the collected data is then retrieved in the reverse direction of query propagation.
However, unlike SIROM3, which keeps the data in permanent storage, these approaches only perform continuous real-time queries. Although they maintain some metadata for system integration or query optimization, that information is not as comprehensive or well-organized as the metafiles in HSFO. Moreover, the SQL-based or graph-based queries of these solutions bring certain overhead because of database operations. The metafiles in HSFO are in a textual representation and are queried through common interfaces implemented in multiple programming languages, which is faster and more adaptable.
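To illustrate why textual metafiles avoid database overhead, consider the minimal sketch below. The `key: value` format and every field name in it are invented for illustration (the actual HSFO metafile schema is defined in Chapter 4); the point is that querying such metadata needs only a few lines of plain text parsing in any language, with no database engine.

```python
# Hypothetical textual metafile for one sensor stream.
metafile_text = """\
stream: gpr_array_0
sensor_type: GPR
sample_rate_hz: 1000
session: 2013-10-02_brockton
"""

def parse_metafile(text):
    """Parse simple 'key: value' lines into a dictionary."""
    meta = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    return meta

meta = parse_metafile(metafile_text)
print(meta["sensor_type"])  # GPR
```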
Chapter 3
VOTERS System Overview
The Versatile Onboard Traffic-Embedded Roaming Sensors (VOTERS) project collects roadway and bridge deck condition information (both surface and subsurface) periodically using a Roaming Multi-Modal Multi-Sensor (RMMMS) system mounted on a vehicle roaming at traffic speeds [7, 37]. The goal is to achieve continuous network-wide infrastructure health monitoring using multiple RMMMS units. This chapter first overviews the real implementation of the VOTERS van system with its sensors and reveals the potential challenges and high complexity of designing such an RMMMS system. It then highlights the varieties and volume of data collected. Finally, the unified Scalable Intelligent ROaming Multi-Modal Multi-Sensor (SIROM3) framework is proposed as the VOTERS software solution to the system integration challenges.
3.1 VOTERS Van Implementation
VOTERS complements periodic localized inspections of roadways and bridge decks with continuous monitoring. It uses a set of homogeneous or heterogeneous sensors (acoustic, electromagnetic, and optical) mounted on a fleet of vehicles and collects inspection data while roaming through daily traffic. An important benefit of using RMMMS systems is the time-lapse survey sets allowing the analysis of time-varying behaviors of roadway infrastructures, thereby providing experimental results to validate and improve existing life-cycle models [14, 35, 36].
Figure 3.1 shows the VOTERS vehicle containing over 30 sensor units in 10 different
domains including: laser height sensors, acoustic microphone arrays, dynamic tire pressure sen-
sors, differential GPS systems, mm-wave radar, GPR arrays, inertial measurement unit systems,
DMI sensor systems, HD camera systems and GPS timing board systems. These data-intensive
sensors are grouped onto 5 Single Board Computers (SBCs) that distribute the computational and
storage resources (Table 3.1). Each SBC has a 2.6 GHz Intel quad-core processor, 4 GB of
memory, an array of solid-state drives, and PC-104 interfaces for hardware extensions,
such as DAQ systems or the GPS timing board. A local Gigabit network interconnects these SBCs for
intra-component communication, collaboration and coordination. To assist on-the-fly data vi-
sualization, the van includes a portable real-time monitoring tablet and an in-vehicle system control
(see Figure 3.1).
Figure 3.1: VOTERS Van with Sensors
However, a number of challenges arise in designing, realizing and implementing
the VOTERS system. Due to the versatility and heterogeneity of sensors and data types, as well as
the sheer number of sensor systems, it is intricate to manage, integrate and operate the system
components uniformly. Meanwhile, these sensor systems are distributed across multiple computing
units to satisfy their high computational demands (e.g. radar sensors, HD video camera). This distribution
brings challenges in intra-component communication, coordination and collaboration, which
carry strict real-time and time-synchronization constraints for data correlation across distributed
elements. The integration complexity increases when adding an arbitrary number of new sensor
systems or deploying multiple VOTERS vans; therefore a scalable and expandable framework is
critical for system integration. In addition, since the data-intensive sensors produce large volumes
of streaming data in real time, it is essential to effectively store, aggregate and process this big data.
Hence, a flexible and adaptable framework covering both software and hardware architecture is needed to
address the above challenges and the overall design complexity in a system-level design approach.
The next section gives the detailed specifications of the VOTERS system; a unified software solution
follows afterwards.
3.2 Data Volumes and Types
The VOTERS van collects data at traffic speed by an array of homogeneous or heteroge-
neous sensor systems from multiple domains (e.g. acoustic, optical and electromagnetic domains).
Table 3.1 indicates the domains and the corresponding recorded data amounts. The multi-modal
sensors require either fast sampling in time or dense sampling in space, and therefore have different
triggering methods and sampling rates.
Table 3.1: Data diversity and volume
Domains               | Max Sensor | Min Trigger Interval | Points / Sensor / Trigger | Size / Point [byte] | Data Rate [GB/h]
Positioning data      | 1          | 0.2 s                | 4                         | 4                   | 0.0003
Acoustic Microphones  | 4          | 25 us                | 1                         | 4                   | 2.1
Dynamic Tire Pressure | 2          | 25 us                | 1                         | 4                   | 1.1
Millimeter-wave radar | 10         | 25 us                | 1                         | 4                   | 5.4
Video Systems         | 1          | 1 m                  | 5018400                   | 1                   | 467.4
GPR systems           | 16         | 0.01 m               | 1024                      | 2                   | 305.2
Total                 |            |                      |                           |                     | 781.1
In Table 3.1, the GPR array and the HD video camera are both distance triggered and compose
most of the big data. Acoustic microphones, dynamic tire pressure sensor systems, and millimeter-
wave radar are time triggered; in the worst-case scenario they sample at an interval of 25 µs,
which leads to comparatively lower data rates. The positioning data stream is the only stream that
carries geo-location information, which is essential to geo-tag the data points from the other domains
and make the whole dataset meaningful. In total, 781 gigabytes are expected per hour when driving
at a maximum velocity of 100 km/h. While the system is roaming through traffic and data
collection is ongoing, the bulk data is stored on the vehicle locally due to the unstable cellular
connection. Once a good network connection is available (e.g. Ethernet), the big data
will be uploaded to central storage for aggregation and fusion. As a result, automatic data aggregation
and centralization is necessary to efficiently store, manage and transfer the big data. Additionally, a
common interface for data management is crucial for processing heterogeneous datasets uniformly
and increasing the overall adaptability and scalability.
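As a sanity check, the per-domain rates in Table 3.1 can be reproduced from the trigger intervals, sample counts and sample sizes. A sketch, assuming the table's GB/h figures are binary GiB/h and a constant 100 km/h speed for the distance-triggered streams:

```python
# Reproduce the per-domain data rates of Table 3.1.
# Assumptions: rates are GiB/h (1024**3 bytes), speed is 100 km/h.

GIB = 1024 ** 3
SPEED_M_PER_H = 100_000  # 100 km/h in metres per hour

def rate_time_triggered(sensors, interval_s, points, size_b):
    """GiB/h of a time-triggered stream sampled every interval_s seconds."""
    triggers_per_hour = 3600 / interval_s
    return sensors * triggers_per_hour * points * size_b / GIB

def rate_distance_triggered(sensors, interval_m, points, size_b):
    """GiB/h of a distance-triggered stream sampled every interval_m metres."""
    triggers_per_hour = SPEED_M_PER_H / interval_m
    return sensors * triggers_per_hour * points * size_b / GIB

print(round(rate_distance_triggered(16, 0.01, 1024, 2), 1))   # GPR: 305.2
print(round(rate_distance_triggered(1, 1, 5_018_400, 1), 1))  # video: 467.4
print(round(rate_time_triggered(10, 25e-6, 1, 4), 1))         # radar: 5.4
```

Both distance-triggered streams match the tabulated values exactly under these assumptions, which supports reading the table's "GB/h" column as GiB/h.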
3.3 The VOTERS Software Solution SIROM3
The Scalable Intelligent ROaming Multi-Modal Multi-Sensor (SIROM3) framework is the software
solution for the VOTERS system and provides the basis for an efficient realization of the real system.
It is a unified framework that simplifies the development, deployment, management and aggregation of
RMMMS systems tailored to civil infrastructure inspection for flexible and time-lapse monitoring.
Figure 3.2 illustrates the overall design of SIROM3 as a multi-tier hierarchical architecture [38, 39].
[Figure: multi-tier hierarchy; the FCM (fleet control, coordination) and the GIS visualization back-end at the top, a fleet of RSSes connected via cellular/LAN, each RSS containing MSAs with sensor drivers, and HSFO, PLEX plugins, stream definitions and the SIROM3-RTE attached to every component.]
Figure 3.2: SIROM3 Multi-Tier Hierarchical Architecture
The SIROM3 framework comprises sensors, MSAs, Roaming Sensor Systems (RSS), a Fleet
Control and Management (FCM) and a visualization back-end. Meanwhile, a Heterogeneous Stream
File-system Overlay (HSFO) and a Plugin Executor (PLEX) environment are tightly integrated into
each component to facilitate big data storage, processing and management. The Run-time Environ-
ment (RTE) is a layered model that defines the essential core services that are common and reusable
among all components in both hardware and software aspects. In this framework, each RSS rep-
resents a VOTERS van equipped with sensors. Similarly, each MSA within an RSS corresponds
to a distributed SBC hosting a group of sensors inside the van. A one-to-many relationship
exists from RSS to MSA, and likewise from FCM to RSS.
The hierarchical architecture enables a control/response mechanism between a higher-level
element and its children. For instance, the FCM centralizes and manipulates the data aggregated
from multiple RSSes to facilitate data fusion and information discovery through temporal/spatial
data correlation, visualized in the GIS visualization server. An array of RSSes can report their lo-
cation or transfer data back to the FCM, upon request or autonomously, via cellular or LAN connection.
Due to the limited cellular connection, only control and configuration messages are transferred
between FCM and RSS to conserve bandwidth while collecting data in the field [7]. Once a
sufficient network environment is available, such as a local Gigabit cable plugged into the sys-
tem, the collected bulk data can be uploaded to the FCM automatically. An RSS contains numerous
MSAs for distributed computational power and sensor heterogeneity. Within an MSA, mul-
tiple homogeneous and heterogeneous sensors are attached as the basic units of the RMMMS system.
Connected in a local Gigabit network, each MSA is aware of the existence of the others, and
can therefore communicate, coordinate and collaborate to function cooperatively.
Benefiting from the hierarchical design, data aggregation is easily achieved by propa-
gating the data streams from lower levels to higher levels. Due to the large volume of data
collected by data-intensive sensors (such as GPR and HD video), the HSFO is attached to each
MSA to provide reliable storage and efficient retrieval of the high-volume streaming data. A
centralized HSFO is also instantiated on the FCM to aggregate data from the distributed sources. A
PLEX environment is implemented to further simplify algorithm integration and eliminate manual
intervention during data processing and fusion with a fully automated approach.
The SIROM3 framework decomposes the high complexity into a hierarchical and modular-
ized design. The RTE encapsulates core services such as hardware requirements, operating systems,
and synchronization and communication middleware that can be reused by any system component.
Scalability is guaranteed in the vertical direction: computing power, storage and data aggregation
scale from the minimal unit (a sensor) up to the server cloud (FCM). Expandability is ensured in
the horizontal direction: at the level of sensors, MSA, RSS and FCM, an arbitrary number of
elements can be integrated seamlessly.
Figure 3.3 shows an overview of the SIROM3 architecture implementation. A hierarchical
system composition is illustrated on the left side, with an array of sensor systems contained in MSAs
mounted onto an RSS, matching the SIROM3 framework shown in Figure 3.2. A
fleet of RSSes is directed by the FCM. To conserve bandwidth, only control/management messages
[Figure: implementation overview; MSAs and RSSes on the left connect over LAN and cellular to the FCM cloud (storage, PLEX with rule-based scheduler and plugins, control & management); Data Intake and Models plugins feed the GIS server (layers, database, Web Adaptor) for visualization and analysis; HSFO sits between the streaming application/PLEX and the native Linux, QNX and Windows file systems.]
Figure 3.3: SIROM3 Implementation Architecture
are transferred between RSS and FCM through a cellular connection during a survey (see Figure 3.3).
After an RSS completes its survey, it returns to its base where a fast network is available (e.g. Gigabit
Ethernet or an 802.11n network). Upon detection of the home location and network, bulk
data upload to the FCM and data aggregation are triggered automatically.
The Fleet Control and Management (FCM) contains centralized data storage, a PLEX
and Control & Management cloud services. An external GIS server is seamlessly integrated into
the FCM for visualization and analysis of the fused results, leveraging the power and flexibility
of the PLEX module. The PLEX is informed when the upload of survey data to the centralized
storage has completed, which triggers it to start processing the available data managed in
HSFO through the contained algorithmic and systemic plugins.
Plugins operate on the uploaded data and import the results into the GIS server (see Figure 3.3) for
visualization. A configurable rule set in PLEX directs the running sequence of all plugins
and plays a key role in automating the whole processing procedure. The Data Intake plugin is
responsible for temporal correlation and fusion of both raw and refined data, attaching
meaningful geo-tags to each data point. Finally, the raw, refined and fused data are
transferred into the GIS server and displayed on different layers. The GIS server, based on ArcGIS,
was developed to enable large-scale geo-referencing and knowledge discovery. It
enables visualizing and comparing citywide pavement conditions. A Web Adaptor enables access
through a thin client application, making the data easily accessible via the Internet [34].
One of the benefits brought by the SIROM3 framework is the automation from data collec-
tion, storage and transmission to processing, which has not been realized in current inspection methods.
The localization services in VOTERS offer geo-fencing for automated start or stop of a survey, and
the survey is conducted from a roaming vehicle cruising in traffic [7]. This work presents an
efficient and automated approach to bulk sensor data handling. We propose an HSFO to facili-
tate the storage, management, transmission and aggregation of heterogeneous bulk sensor data, together
with a PLEX environment for data processing and fusion. The next chapters elaborate the detailed
implementation of HSFO and PLEX.
Chapter 4
Heterogeneous Stream File system
Overlay (HSFO)
The distributed sensor systems in an RSS and the data-intensive nature of some sensors in the
MSAs produce large streams of data (i.e. hundreds of gigabytes per hour, as shown in Table 3.1)
continuously during a data collection. Such high-volume data streams may flood the entire system
during data collection. The issue is particularly severe when multiple such components exist and
operate simultaneously.
The gathered data needs to be stored and processed automatically through a local storage
system attached to the MSAs (see Figure 3.2), and aggregated to the FCM from all MSAs in an
RSS. Therefore, the ability to handle data storage across multiple components (sensor, MSA, RSS
and FCM) at multiple levels of the hierarchy is of particular importance. Moreover, the hetero-
geneity and versatility of the data pose the challenge of cross-referencing big data with distinctive
attributes and characteristics (e.g. correlating camera images with acoustic signals).
In this chapter, a Heterogeneous Stream File system Overlay (HSFO) is proposed and
implemented to manage sensor data and address the above challenges with a metafile approach. The
metafile abstracts information about survey dynamics (e.g. sensor types, software/hardware configura-
tion during data collection) and stores the metadata together with the raw data in the file system. Since
the metafiles are in textual form, they incur lower overhead than a database solution during
data storage and processing. In addition, the HSFO is designed as a platform-independent layer
which can be used in any component of the system hierarchy, adapting to the scalability and
expandability of SIROM3. Thus, it can be executed on a wide range of platforms (from sensors,
embedded computers to data centers). Instead of being an afterthought to data acquisition,
the HSFO is tightly integrated into the SIROM3 framework and offers an efficient, fast and reliable
solution for handling heterogeneous sensor data. It also enables uniform and
automatic access, operation, processing and fusion of the data.
4.1 Fusion Foundations
The goal of data processing is to convert the collected raw datasets into meaningful
knowledge on which to base life-cycle management of infrastructure health and extend infrastructure
life-spans. This is challenging since sensors within an RSS are spatially distributed, and may have differ-
ent triggering requirements (e.g. time triggered or distance triggered) as well as varying sampling
intervals/spacings (see Table 3.1). As the multi-modal data is geographically dispersed during data
collection, the data management should allow each sensor sample to be accurately geo-located
to allow for spatial and temporal comparison and joint decisions.
The ability to determine an accurate position for each acquired data sample is of utmost
importance for data fusion and visualization. To achieve this, we rely on two facts: (a) the position of
each sensor relative to the whole system is known, and (b) tight time synchronization is in place. Each
sample is then timestamped when collected, either at the sensor or at the embedded system upon arrival.
An additional stream records the location of the mobile agent (e.g. the vehicle) during the survey. Each
data sample's location can then be computed from its timestamp, the system location and the sensor offset.
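The per-sample geo-location step can be sketched as follows, with linear interpolation of the positioning stream at the sample's timestamp. Applying the sensor offset in a fixed frame is a simplification (the real system would rotate the offset by the vehicle heading), and all names here are illustrative:

```python
from bisect import bisect_left

def interpolate_position(pos_stream, t):
    """Linearly interpolate the vehicle reference position at time t.
    pos_stream: list of (timestamp, x, y) in a local metric frame, sorted by time."""
    times = [p[0] for p in pos_stream]
    i = bisect_left(times, t)
    if i == 0:
        return pos_stream[0][1:]          # before first fix: clamp
    if i == len(pos_stream):
        return pos_stream[-1][1:]         # after last fix: clamp
    (t0, x0, y0), (t1, x1, y1) = pos_stream[i - 1], pos_stream[i]
    a = (t - t0) / (t1 - t0)
    return (x0 + a * (x1 - x0), y0 + a * (y1 - y0))

def sample_location(pos_stream, t, sensor_offset):
    """Geo-locate one sample: vehicle position at its timestamp plus the
    sensor's known (dx, dy) offset from the vehicle reference point."""
    x, y = interpolate_position(pos_stream, t)
    return (x + sensor_offset[0], y + sensor_offset[1])

pos = [(0.0, 0.0, 0.0), (1.0, 10.0, 0.0)]      # two position fixes, 10 m apart
print(sample_location(pos, 0.5, (1.0, 0.0)))   # (6.0, 0.0)
```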
Temporal correlation across data streams (and consequently spatial correlation) is achieved
by timestamping each sample with microsecond-accurate time. This poses strict require-
ments on availability (across platforms), robustness and reliability (time synchronization budgets
of differing rigidity). Spatial correlation is achieved globally by localization services (GPS)
and the local geometric relationship of the sensors to an absolute reference point on the vehicle (RSS).
Synchronizing streams with the positioning data requires tight time synchronization from the RSS
to all MSAs (within the timing budget). The granularity at which each sample is timestamped depends
on the sampling rate: streams with varying sample periods must stamp each sample, while streams
with a constant sample period can be timestamped at a coarser granularity.
In our implementation (see Chapter 3), we distributed a set of sensors on a vehicle
to collect data from multiple domains (acoustic, optical and electromagnetic). We fuse a
decimeter-accuracy GPS, a Distance Measurement Instrument (DMI) and an Inertial Measurement
Unit (IMU) to obtain a sufficiently accurate position. All the MSAs within an RSS demand strict
timing requirements: the jitter in timestamping data must not exceed 359 µs [7]. This
timing budget is calculated from the maximum vehicle speed and the desired spatial correlation
between two sensors. To achieve time synchronization across the distributed systems, we primarily
use the Precision Time Protocol (PTP) [22], with the Network Time Protocol (NTP) [31] as a backup.
Tests of the software-based time synchronization show a maximum jitter of 12 µs
with a standard deviation of 2.0375 µs, small enough to meet our triggering and timestamping
requirements. Conversely, only very loose requirements exist for global time synchronization,
which is only used to correlate the date and time of recordings between vehicles.
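The timing budget itself follows from dividing the desired spatial correlation by the maximum speed. A sketch (the ~1 cm target used below is back-calculated from the published 359 µs budget at 100 km/h, not stated explicitly in the source):

```python
def jitter_budget_us(max_speed_kmh, spatial_accuracy_m):
    """Maximum timestamp jitter (in µs) such that a sample stamped with that
    jitter is mislocated by at most spatial_accuracy_m at max_speed_kmh."""
    v = max_speed_kmh / 3.6              # vehicle speed in m/s
    return spatial_accuracy_m / v * 1e6  # seconds -> microseconds

# Assuming a ~1 cm desired spatial correlation at 100 km/h:
print(round(jitter_budget_us(100, 0.01)))  # 360, consistent with the 359 µs budget
```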
4.2 HSFO Overview
To make the raw data meaningful, the file system overlay must maintain
adequate information describing the data. For example, the timestamp of each sample needs to be
logged for temporal and spatial correlation, and the software configuration of a sensor (e.g. resolution)
is required for making decisions and improving the accuracy of results. Moreover, the overlay
should have high scalability and adaptability so that it can be instantiated across multiple components
at multiple levels in SIROM3 to store, organize, aggregate and transmit the data. The hierarchical
design of HSFO realizes the high scalability, while the metafiles at each layer describe the details
of the data sources.
As depicted in Figure 4.1, the HSFO is designed as a glue layer connecting the streaming
applications that produce or consume large volumes of streaming data with the native file systems of
Linux, QNX or Windows (e.g. EXT4, QNX4, NTFS and FAT). The overlay design
separates the meta information associated with the data streams from the raw binary data. It preserves
the rich information exhibited by the data, such as the geo-location of a survey or the sampling frequency
of a sensor, in the metafile hierarchy provided by the HSFO, while leveraging the underlying file
system for robust file storage. This design also guarantees platform independence, as the HSFO
sits above the native file system of any OS.
The metafile is designed to organize the heterogeneity exposed by the data collection dy-
namics and the design complexity. For instance, different geo-points of interest (e.g. an important
geographical area) may influence several data characteristics, as the sampling frequency can be ad-
justed. The HSFO exploits the hierarchical nature of the containment relationship between data
objects, aiming to decompose the complexity into manageable granularities. This section focuses
[Figure: the HSFO layer between the streaming application/PLEX and the native file systems (Ext4, QNX4, ..., NTFS); a metadata tree of root, surveys, sessions, streams and files with survey.meta, session.meta, stream.meta and bulk.meta attached, referencing the raw binary files.]
Figure 4.1: HSFO Overview
on the standard definition of each layer in HSFO and explains the relationships between layers. The
content and format of the metafiles will be described in the next section.
The hierarchy is shown in Figure 4.1 as a tree structure. We define multiple hierarchical
elements, namely surveys, sessions, and streams, to accommodate the dynamic data character-
istics. The root node is the top level from which the other elements originate. It may have multiple
surveys, which contain multiple sessions, which in turn contain streams. Streams may include
multiple files that point to the raw binary files on the file system. Note that each layer has
a corresponding metafile attached, which serves as a manifest that keeps track of
all elements at the current layer and stores information describing them. For instance, stream.meta
keeps a list of files, their names, timestamps and other meta information.
As the basic element in the hierarchy, a file is the binary/raw data in the host file system. It
could be a single image or a sequence of acoustic signals. The file here acts as a class that combines
the file from the native file system with the attributes describing it. Stream.meta aggregates and
records this information, such as file names, timestamps and the total number of samples in a file,
preserving the details needed to access each file and geo-tag each data point in it (details are
explained in Section 4.3).
Global data processing and correlation require a common definition for different groups
of files. Capturing the streaming essence of the data, a stream is defined to distinguish between
the different formats and semantic meanings of the diverse big data. A set of data that has the same
format and semantics can be recorded as a stream. This layer handles the heterogeneity and versatility
of the data by structuring and grouping it accordingly. It also serves as the basic processing unit
and the input for the plugin system to perform data correlation and fusion in the next chapter. Each
stream instance is associated with a stream type and an originating sensor type to identify the stream.
This enables the streaming applications to determine how to handle or display the streams. The stream
layer also distinguishes between raw, refined and real-time streams, recorded as the stream
qualifier. The definition of the stream type and other detailed stream metadata will
be described later. The stream metadata is also recorded in stream.meta.
Above the stream layer, which focuses on managing data heterogeneity during a contin-
uous data collection, a session is defined to manage a set of streams. The session.meta located
in the stream layer stores the session metadata and keeps track of all available streams in the current
session. A session is defined as a consecutive (i.e. uninterrupted) measurement of an area
with the same sensor configuration (e.g. camera resolution). It is introduced to facilitate managing
streams with respect to sensor settings. A new session is started each time the system configuration
changes, or after a brief stop of the recording (e.g. when passing over an area of non-interest). If a
single sensor configuration changes during a session, a new session needs to be started. Thus,
streams of the same stream type generated from the same sensor but with different measurement
accuracies are separated into different sessions. This resolves the conflict of a session
containing two identical streams and eliminates ambiguity during data processing. The session layer
also potentially separates the heterogeneous data according to different inspection needs and
geofencing interests.
Composed of a group of sessions, a survey splits the big data in a more coarse-grained
way. The survey.meta stored in the session layer maintains the survey metadata along with the
lists of participating sensors and sessions. While sessions cover inspections of roadway infras-
tructure within a relatively small area, a new survey starts when the major geographic location or
the date changes. With repetitive surveys, the time-varying behaviors of roadway infrastructures can
be investigated and accurately pieced together to form the complete life-cycle of the infrastructure.
In addition, a new survey also needs to be started when the hardware configuration changes, such
as a sensor being physically removed from the system through failure or a sensor changing location.
In summary, the survey layer is defined as a data collection of a major area by heterogeneous sen-
sor systems with fixed hardware configurations. To suit the mobile nature of the hybrid
system, it separates datasets at a temporal or geographical granularity (e.g. date, location), such that
a survey can be hour-, day- or city-based.
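The session and survey boundary rules above can be condensed into two predicates; a minimal sketch with illustrative configuration representations:

```python
def needs_new_session(prev_cfg, new_cfg, recording_gap):
    """A session ends when any sensor's software configuration changes,
    or when recording was briefly stopped (e.g. an area of non-interest)."""
    return prev_cfg != new_cfg or recording_gap

def needs_new_survey(prev_hw, new_hw, location_changed, date_changed):
    """A survey ends when the hardware configuration changes (a sensor
    removed or relocated) or the major geographic location or date changes."""
    return prev_hw != new_hw or location_changed or date_changed

cfg = {"camera_resolution": "1080p"}          # illustrative sensor settings
print(needs_new_session(cfg, {"camera_resolution": "720p"}, False))  # True
print(needs_new_survey(("gpr", "camera"), ("gpr", "camera"), False, False))  # False
```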
Each RSS can conduct and manage a group of surveys, so the root node represents
an identifier such as the identification number of the vehicle (RSS). The root layer allows for the
coexistence of data from multiple RSSes and of data collected at the same location on different dates.
Therefore, a fleet of RSSes can be deployed to increase the geographical coverage and enable
assessment at the city scale. It also enables the detection of changes in roadway and bridge
deck conditions over time using all available data.
This hierarchical overlay provides a simple and effective way to handle large volumes
of heterogeneous data with respect to flexibility, scalability and expandability. Defined as a part
of the RTE (see Figure 3.2), this big data layer can be instantiated on any
system component (such as MSA, RSS or FCM). Moreover, as SIROM3 inherently supports verti-
cally scalable and horizontally expandable systems, the HSFO adapts to the same degree of
flexibility in its metafile hierarchy.
4.3 Metadata Definition
The HSFO uses a metafile approach to classify data strategically and manage it in an or-
derly fashion. The metafile preserves abundant information, including the timestamps of samples and
domain-specific knowledge such as sensor settings (e.g. sample rates, resolution), for data processing
and fusion. Moreover, as a structure tailored to infrastructure life-cycle management, the metafile also
abstracts information about physical configurations, such as geo-location, participating sensor types
and survey date, into files that support investigating the time-varying behaviors of roadway infrastructures.
Besides preserving valuable sensor calibration data and configurations, the metafile also
acts as a manifest that keeps track of each file and shows the files that
are available for download or processing during a recording session. It maintains the one-to-many
relationship between neighboring layers and eases the maintenance of the file structure during data
transfer, aggregation and processing. For instance, merging two streams into a session only
requires modifying the streamTypeList maintained in session.meta. All the metadata can
also be used for reconstructing the file structure, restoring existing stream files and querying information
about a survey, session, stream, or file when the system restarts. The metafiles are built in memory and
stored onto the file system in the Libconfig [26] format once a session finishes.
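As an illustration of this textual representation, a session.meta entry might look like the following sketch; the field names follow the metadata definition in this chapter, while the concrete values are hypothetical:

```
# Hypothetical session.meta sketch in Libconfig syntax (values are illustrative):
Session = {
    ID = 2;
    streamTypeList = ( "STREAM_VOO_LOC", "STREAM_SURFACE_IMG_RAW",
                       "STREAM_SURFACE_CON" );
};
```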
This section defines the attributes needed for capturing the manifest. Figure 4.2 gives
an impression of the HSFO metafile definition. It shows the main attributes that must be captured
to describe the data and the corresponding metafiles for permanent storage. The structure is
described bottom-up, corresponding to the overview in the previous section.
[Figure: metafile hierarchy and definition; the survey layer captures ID, Date, Location, subsysList and sessionIdList in survey.meta; the session layer captures ID and streamTypeList in session.meta; the stream layer captures Type, Qualifier and fileIdList in stream.meta; each file carries ID and Time.]
Figure 4.2: HSFO Metafile Definition
4.3.1 Streams and Files
As the basic element in the metafile hierarchy, each binary file is bound to a fileInfo
structure. The fileInfo contains fileID, fileName, timeStart, timeEnd and numSamples.
FileName is the name of the file in the file system, while timeStart and timeEnd record
the timestamps of the first and last sample in the file respectively. The field numSamples records
the total number of samples in the file. Each sample in the stream can be recorded with a timestamp,
but for streams with a constant sample period, the data can be timestamped on a coarser
granularity to avoid redundancy: by recording only timeStart and timeEnd of the file, the time-
stamp of each sample can be calculated from numSamples.
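Recovering per-sample timestamps for such a constant-rate file then reduces to linear interpolation between timeStart and timeEnd; a sketch with a 0-based sample index and names mirroring the fileInfo fields:

```python
def sample_timestamp(time_start, time_end, num_samples, i):
    """Timestamp of sample i (0-based) in a constant-rate file whose fileInfo
    records only timeStart, timeEnd and numSamples."""
    if not 0 <= i < num_samples:
        raise IndexError("sample index out of range")
    if num_samples == 1:
        return time_start
    # num_samples samples span (num_samples - 1) equal intervals.
    return time_start + i * (time_end - time_start) / (num_samples - 1)

print(sample_timestamp(0.0, 1.0, 5, 2))  # 0.5
```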
The fileInfo for each file is recorded in stream.meta, which is a centralized place
to aggregate the meta information of all files. When a new file is created and begins recording
its first sample, a fileInfo with fileID, fileName and timeStart is written into
stream.meta. The timeEnd and numSamples are updated when the file closes. A file within
a stream can be of any type, such as a text file filled with data/timestamp tuples, an image or a
period of acoustic signals. For a text file, the timestamp of each sample can easily be recorded to-
gether with the raw data, but for images and binary files, per-sample timestamps are hard to
log. With the timestamps captured in fileInfo and maintained in stream.meta, the timing
of each file can be logged regardless of file type or format.
A stream groups an arbitrary number of files and is associated with a streamInfo. The
streamInfo has the properties streamType, sensorID, streamQualifier and fileIdList. The
field streamType defines the type as well as the meaning of the data. All files of a stream shall have
the same syntax and semantics; if a sensor writes files of different semantics and formats, they
have to be captured in different streams. The streamType is an enumeration value identifying the
stream. It is automatically generated from the stream definition table, which defines the standard
streams and records detailed information about them, such as the stream's short name, file format
and description. This table is recorded in a centralized configuration file so that the whole system is
aware of the different types of streams available and understands their meaning during data fusion.
Table 4.1 shows the items that have to be defined for each stream.
Table 4.1: Stream Definition
Item              | Description                                                                                                | Example
eStreamType       | Enumeration value identifying the stream numerically                                                       | STREAM_VOO_LOC
ShortName         | Short name of the stream, used for directory and file naming                                               | vooLoc
Description       | Textual description of the stream, displayed on the visualization portal                                   | Fused data defines multi-modal multi-sensor system reference position
FileSuffix        | File extension of each file inside the stream; records the extension and thus defines the data type        | txt
IsRealTimeDisplay | Whether the stream is sampled from the original stream and displayed in real time to validate data quality | true or false
The definition of streams has to be recorded globally in the centralized configuration
file of the RSS for unified definition and search. Figure 4.3 shows an example of a stream
definition table in the Libconfig [26] format. Each entry follows the format: <eStreamType>,
<shortName>, <Description>, <FileSuffix>, <IsRealTimeDisplay>. In the table, a raw unpro-
cessed stream surfaceImgRaw (raw uncorrected video images of the road surface) and a refined
stream surfaceImgCrack (processed from the raw images to show detected cracks) are consid-
ered two separate streams although they originate from the same sensor. The real-time
stream surfaceConditionSub is extracted from the raw stream surfaceCondition (millimeter-
wave radar signals) at some sample rate. Since its attribute IsRealTimeDisplay is set to true, this
stream is automatically uploaded to the RSS through an FTP service, then processed and displayed
to validate the quality of the data in real time.
Streams = (
  ( "STREAM_VOO_LOC", "vooLoc",
    "Fused data defines sensing system reference position", "txt", false ),
  ( "STREAM_VOO_DIS", "vooDis",
    "Fused data defines sensing system travelled distance", "txt", false ),
  ( "STREAM_SURFACE_IMG_RAW", "surfaceImgRaw",
    "Raw uncorrected images taken by video camera facing toward the road", "jpeg", false ),
  ( "STREAM_SURFACE_IMG_CRACK", "surfaceImgCrack",
    "Binary images with detected cracks as 0", "png", false ),
  ( "STREAM_SURFACE_CON", "surfaceCondition",
    "Surface condition data of roadway collected by millimeter-wave radar", "bin", false ),
  ( "STREAM_SURFACE_CON_SUB", "surfaceConditionSub",
    "Sample data of surface condition data", "bin", true )
);
Figure 4.3: Example Stream Definition Table
The sensorID in streamInfo identifies the sensor type that originated a stream.
The field streamQualifier records the processing level of the data, such as raw, fused, or refined.
A refined stream is processed from a raw stream for feature extraction; a fused stream
is the joint result produced from multiple streams. For example, the Pavement Condition Index
(PCI) is fused from streams generated by the Microphone Sensor and the Dynamic Tire Pressure Sensor
(DTPS) [33]. This also implies that two streams with the same sensorID but different
streamQualifier values are two separate streams. The fileIdList preserves a list of the fileIDs of all
files available in the current stream.
The streamInfo handles the heterogeneity and versatility of the data by structuring and grouping
it accordingly. This structure is also recorded in stream.meta. Figure 4.7 shows the contents
of stream.meta for two streams with different sampling rates. It contains a stream setting
and a file setting, recording the streamInfo and a group of fileInfo structures respectively. The time stamp
recorded in stream.meta enables temporal correlation of two streams with different characteristics.
Detailed information on how data is correlated through stream.meta will be explained in Section 4.5.
4.3.2 Sessions
A session is defined as the consecutive (i.e., uninterrupted) measurement of an area.
It is introduced to facilitate managing streams with regard to settings (e.g., qualifier). Session
information is also encapsulated into a sessionInfo structure, like streams and files.
The sessionInfo is composed of sessionID, timeStart, timeEnd and streamTypeList. The
timeStart and timeEnd fields record the start and end time of a session. This coarser-grained time
period allows a quick search for a stream or file generated during a certain time: instead of comparing
against the time stamp of each file in turn, a query can first locate the session that contains the
time stamp and then look inside that session and check each file. This dramatically speeds up the
query if each session contains thousands of files. The streamTypeList lists the streamType
of every stream included in the session. The sessionInfo is recorded in session.meta under each
session and is updated when a new stream is created or merged.
session.meta:

session : {
  sessionId = 0;
  timeStart : { sec = 1385209325L; nsec = 702437775L; };
  timeEnd : { sec = 1385227583L; nsec = 847077204L; };
  streamTypeList = [ 5, 0, 1, 14 ];
};

Stream definition table (excerpt):

Streams = ( ( "STREAM_VOO_LOC", "vooLoc", "Vehicle reference position", "txt", false ), ... );

[Figure: a session directory holding session.meta and the streams vooLoc, vooDis, surfaceImgRaw, and the newly created surfaceImgCorr; each stream contains its files and a stream.meta, and its streamType is registered in the streamTypeList]

Figure 4.4: Stream Creation
Figure 4.4 shows an example of the changes made to the HSFO when creating a new stream. The
sessionInfo maintains a streamTypeList recording the streams available in the current session,
where each stream is represented by its streamType. The system also holds the stream definition
table with the detailed definition of each stream; the streamType enumerator corresponds to the
index of the stream in that table. When a new stream surfaceImgCorr is created, it must register
its streamType in the streamTypeList of session.meta, in addition to creating a corresponding
stream.meta to keep track of the newly created files.
4.3.3 Surveys
To suit the mobile nature of the system as well as its focus on infrastructure monitoring,
surveys can be conveniently separated manually at a temporal or geographical granularity
(i.e., date, location). This information is abstracted into a surveyInfo structure like the other layers.
The surveyInfo is composed of surveyID, startDate, state, city, location, surveyName,
subsysList, sessionIdList, timeStart and timeEnd. The surveyInfo comprehensively
describes the physical conditions and system hardware settings of a survey.
survey : {
  surveyId = 0;
  startDate = 20131124L;
  state = "MA";
  city = "Boston";
  location = "Ruggles";
  surveyName = "20131124_MA_Boston_Ruggles_0";
  subsysList = [ 2, 3, 5, 6 ];
  sessionIdList = [ 0, 1 ];
};
Figure 4.5: Example Survey Meta File
Figure 4.5 demonstrates the content of survey.meta; the time stamps are omitted here since they
serve the same function as the timestamps in sessionInfo. The state, city and location fields identify
the geographical area that has been surveyed. The startDate preserves valuable time information
of the survey, which further enables time-lapse surveys of the same location to investigate the life
cycle of infrastructures. The surveyName is composed of the start date, geo-location information
and surveyID, which together identify a survey. The detailed naming convention will be discussed in
the next section. A new survey is started when a major geographic location or the survey date
changes, such that a survey can be hour/day-based or city-based. In addition, the surveyInfo
manages a list of participating MSAs. If the hardware configuration changes, such as physically
adding a new MSA to the system while going through an area of interest or changing the location
of a sensor in the RSS, a new survey needs to be started and the subsysList updated correspondingly.
The time-variant surveys provide sufficient experimental data to study the time-varying
behaviors and life cycle of civil infrastructures.
4.4 Big Data Storage Location
Since the sensory components require fast sampling in time or dense sampling in space,
the RSS often incurs a processing load that cannot be accommodated by a single processing
element and thus requires a distributed system solution. The distributed MSAs are responsible for
recording their own data as files of a stream. The integrated system has a centralized configuration
file called voo.config, also written in Libconfig format [26], which records dynamic system
configurations (e.g., system IPs, centralized storage and runtime distribution information). The base
directory of data storage is defined in voo.config under the key name bulkDir; the default is
/var/voters, which is also used in the examples below.
The stream files will be located in a directory that follows the naming convention below:
/var/voters/<VooNr>/<Survey>/<SessionNr>/<Stream>_<StreamQual>. The field
names in the directory path are defined in Table 4.2.
Table 4.2: Naming Convention for Big Data Directories

<VooNr> — Numeric identifier of the heterogeneous system in three-digit format (NNN).

<Survey> — Number/name of the survey in the format <Date>_<State>_<City>_<Loc>_<SurveyNr>. As an example, a recording from 11/24/2013 starting in Boston, MA at Northeastern would get the survey name 20131124_Boston_NEU_0.
  Date — Date in format YYYYMMDD, e.g. 20131124 for Nov. 24, 2013.
  State — Two-character state, e.g. MA.
  City — City or area name, e.g. Boston.
  Location — Special description if needed, e.g. NEU.
  SurveyNr — Sequence number of the survey since the start of the day, in format N or NN.

<SessionNr> — Session inside the survey, in format N or NN. A new session is started on:
  • a system configuration change (note: changing the configuration of a single sensor changes the system configuration, thus a new session is started);
  • a brief stop of recording;
  • transitioning into a different geofencing area, thus changing the configuration.

<Stream> — Short stream name, see Table 4.1, e.g. Dis, Loc.

<StreamQual> — Variation on the same stream with a different configuration. For raw data, the stream qualifier is empty.
The above naming convention for directories describes the storage location of each stream.
Each stream file shall only contain data recorded with the same configuration; if the sample rate
changes, a new session has to be started. Within a stream, a file can be identified via:
<BaseDir><FileBaseName><FileEnumerator>.<FileExtension>
<BaseDir> is defined as in the table above. <FileBaseName> is equal to the stream
short name (shortName) defined in Table 4.1. <FileEnumerator> is the sequence number of the file.
<FileExtension> is equal to the FileSuffix defined in the stream definition.
Figure 4.6 shows an example of the big data storage location in the file system. The root directory
of the big data is /var/voters, which is defined by the bulkDir setting in the centralized dynamic
configuration file. During civil infrastructure inspection, a fleet of RSSes may be deployed to
increase the geographical coverage area, so the next level is the identifier number of each vehicle. For
a vehicle whose VooNr is 008, its data is recorded in /var/voters/008. Each vehicle can
take an arbitrary number of surveys. Suppose we take several surveys around Northeastern
University in Boston, MA on Nov. 24, 2013; the first survey is then called 20131124_Boston_NEU_0.
Each session under the survey is indexed as an integer number. A GPS system, a DMI sensor
system and an HD camera system participate in the first session of the first survey, producing raw
streams called Dis, Loc and surfaceImg respectively. The location of the distance stream is defined
as /var/voters/008/20131124_Boston_NEU_0/0/Dis, and the absolute path of a stream file can
be /var/voters/008/20131124_Boston_NEU_0/0/Dis/Dis0.txt.
/var/voters
├── 008
│   ├── 20131124_Boston_NEU_0
│   │   ├── 0
│   │   │   ├── Dis (Dis0.txt, Dis1.txt, ..., stream.meta)
│   │   │   ├── Loc
│   │   │   ├── surfaceImg
│   │   │   ├── ...
│   │   │   └── session.meta
│   │   ├── 1
│   │   ├── 2
│   │   └── survey.meta
│   ├── 20131124_Boston_NEU_1
│   └── ...
├── 016
├── ...
└── bulk.meta

Figure 4.6: Big Data Storage Location
Note that the path for each layer and stream file is automatically generated by combining the
naming convention rules, the information in the centralized configuration file, and the stream definition
table. With these well-defined structures and file paths, HSFO can keep track of each file and randomly
access any file. Path information such as survey name, session index and filename
is also recorded in the corresponding metafiles for reconstructing the file paths and restoring the
existing stream files on disk when the system restarts.
4.5 Metadata Example for Data Correlation
stream : {
sensorId = Video_camera;
streamType = surfaceImgRaw;
streamQualifier = "";
fileIdList = [ 0, 1, 2, 3, 4, 5, 6, 7… ]; };
file = ( {
fileId = 0;
fileName = "surfaceImgRaw0.tiff";
timeStart : {
sec = 1374249022L;
nsec = 789775000L; };
timeEnd : {
sec = 1374249022L;
nsec = 856646000L; };
numSamples = 1L; }, ...);
stream : {
sensorId = Microphone;
streamType = complexDaqMic;
streamQualifier = "";
fileIdList = [ 0, 1, 2... ]; };
file = ( {
fileId = 0;
fileName = "complexDaqMic0.txt";
timeStart : {
sec = 1374249013L;
nsec = 6042665L; };
timeEnd : {
sec = 1374249043L;
nsec = 762286273L; };
numSamples = 1500000L; }, … );
Figure 4.7: Metafile Example for Data Correlation
Figure 4.7 shows an example of stream metafiles for a video image stream and a microphone
acoustic stream. Metafiles are written in the Libconfig format [26], which is more compact
and human-readable than XML; its textual presentation also keeps the overhead low. The stream
metafiles maintain the stream attributes together with all binary files contained in the
stream. The sensorId and streamType in the stream attributes indicate the sensor type that originated
the stream and the current stream name. For example, the two streams surfaceImgRaw and
complexDaqMic represent pavement surface images captured by the video camera and microphone
acoustic data respectively. The empty streamQualifier shows that both streams are raw, unrefined
data (empty is the default for raw streams). The fileIdList lists the IDs of all binary data files
included in each stream. Since each captured picture is a file in the stream, the video stream has an
array of files; similarly, the acoustic stream records a list of 30 s acoustic signal files. Each file has
properties such as fileId and fileName. The two settings timeStart and timeEnd record the
timestamps of data acquisition and are the most critical for data correlation, since they are the common
attribute between different data sets. The numSamples field gives the total number of samples contained
in a file. Since the video camera has a variable sample rate, it stores a single sample per file. The
microphone, however, stores many samples in the same file due to its constant sample rate. Thus, the microphone
stream is time-stamped at a coarser granularity, and the accurate time stamp of each sample is
calculated from timeStart, timeEnd and numSamples. To associate each image with an acoustic
clip, for example, the timestamp of surfaceImgRaw0.tiff should be correlated with the time period of
complexDaqMic0.txt. Since each video image is taken in less than 1 s whereas an acoustic file
lasts 30 s, several images may be correlated to one microphone acoustic file. At a finer granularity,
the timestamp of each sample can be calculated, as each microphone file contains
1,500,000 samples. Therefore, surfaceImgRaw0.tiff can be correlated with multiple samples in
complexDaqMic0.txt.
4.6 Bulk Data Handling Library
The stream applications and plugins need to obtain meta information about surveys and
sessions as well as retrieve stream files from HSFO. However, the diversity of processing tools
makes it challenging to access and process the different streams in HSFO uniformly. Instead of
implementing the access functions again and again in each plugin, a common implementation is desirable
to facilitate interaction with HSFO and simplify access to it for different plugins. The Bulk
Data Handling (BDH) library abstracts and encapsulates the common functionality needed by stream
applications and plugins to ease their operations on HSFO.
The APIs exposed by the BDH library allow plugins to interact with HSFO
seamlessly. By providing common interfaces to operate on HSFO, the bulk data handling library
isolates the heterogeneity of stream processing elements and allows them to process heterogeneous
data uniformly. Plugins can not only retrieve and process raw data from HSFO, but also
generate refined results and commit them to HSFO for permanent storage. The BDH library mainly
supports metadata queries, stream/file retrieval and stream/file creation. It provides facilities to obtain
file names (including their absolute paths) and to read and write meta information. With the BDH
library, a plugin only needs to know how to read a file, not where the file is located.
These interfaces also enable easy integration of domain-specific algorithms for
processing and fusion, since duplicated implementation effort across plugins is avoided.
The BDH library therefore increases overall flexibility and maintainability. In addition, as a part
of HSFO, the BDH library can be used on any system component from MSA to FCM
to facilitate data processing for plugins. Table 4.3 shows a list of commonly used APIs with their
input parameters and functionality during data processing. More functionality can be built on
the existing interfaces. The next chapter describes in more detail how
plugins interact with HSFO and process streams using the BDH library.
Table 4.3: Bulk Data Handling Library APIs

Name | Input Parameters | Functionality
init() | none | Restore all files and metafiles on disk
surveyListGet() | none | Get the total number of surveys and a surveyID list of available surveys
surveyInfoGet() | surveyID | Query survey information; returns the result as a surveyInfo structure
sessionInfoGet() | surveyID, sessionID | Query session information; returns the result as a sessionInfo structure
streamInfoGet() | surveyID, sessionID, streamType | Query stream information; returns the result as a streamInfo structure
streamCreate() | surveyID, sessionID, streamType, sensorID | Create a new stream with its stream information; update session.meta with the new streamType
streamClose() | surveyID, sessionID, streamType | Close the newly created stream; dump stream.meta to disk
fileInfoGet() | surveyID, sessionID, streamType, fileID | Query file information; returns the result as a fileInfo structure
fileNameAbsGet() | surveyID, sessionID, streamType, fileID | Get the absolute path of a file
fileCreateMeta() | surveyID, sessionID, streamType | Create meta information for a new file under a stream; returns the newly generated fileID
fileCommit() | surveyID, sessionID, streamType, fileID, fileInfo | Set meta information for the file and commit the changes back to the data structures, e.g. appending fileInfo to stream.meta
As explained in the table, a plugin can randomly access a file by querying its absolute
file path: by calling the API fileNameAbsGet() with the corresponding parameters
surveyID, sessionID, streamType and fileID, the plugin can fetch the file without knowing
its real location. APIs such as fileInfoGet() and fileCommit() help read or write file
meta information. The BDH library can be used directly from any C++ plugin implementation. It is
also exposed through MATLAB and Python to provide high adaptability to the Plugin Executor (PLEX)
and to enable a wide range of plugin development environments. Figure 4.8 shows the abstraction of
the BDH library, which is designed to simplify access to HSFO for plugins. By calling a particular
interface, the meta information or absolute file path is returned. A MATLAB implementation
needs MEX files that wrap the library access; similarly, a Python implementation utilizes
SWIG as an integration tool to generate glue code and call into the C++ library. Different plugins run
on top of these bindings to fetch information or files from HSFO for processing through the BDH library.
[Figure: plugins call into LibBDH either directly from C++, through MATLAB via MEX wrappers, or through Python via SWIG-generated glue code; LibBDH in turn accesses the files and metadata in HSFO]
Figure 4.8: Abstraction of Bulk Data Handling Library
4.7 Big Data Transfer and Aggregation
With its well-defined hierarchical structure, the HSFO has high scalability and adaptability
and can be instantiated on a wide range of platforms. Thus, an HSFO can be attached to each distributed
MSA in an RSS to provide reliable and efficient local storage for large volumes of streaming data.
A centralized HSFO can also be created on the FCM to aggregate data from the distributed sources and
apply data fusion algorithms to the diverse big data. Figure 4.9 illustrates the bulk data transfer and
aggregation from distributed MSAs to the FCM in SIROM3.
While the vehicle is navigating through traffic and collecting data during a survey, only
a cellular connection is available, so the bulk sensor data is stored locally at each MSA to conserve
bandwidth. As a result, each MSA maintains several streams generated from different sensors. A
centralized metafile session.meta is created on the on-board controller of the RSS for the current session
to maintain the list of available streams; each newly created stream registers its streamType
in the streamTypeList of session.meta. After the RSS completes the survey, the vehicle returns
to its base where a fast network is available, and the bulk data upload to the FCM and the data
aggregation are triggered automatically.
The session.meta located on the RSS is uploaded to the FCM through the FTP service first. The
directory for the file is automatically created on the FCM during upload to ensure an identical HSFO
structure. Since the streamTypeList in session.meta keeps track of all streams in the current session,
the on-board controller of the RSS informs the corresponding MSAs in turn to upload their streams to
the data center on the FCM. Finally, the FCM groups all streams as well as the session metafile into
the corresponding session. In Figure 4.9, the streams Dis, Loc and surfaceImg generated by distributed
MSAs are transferred and aggregated into survey0/session0 with session.meta on the FCM.
[Figure: MSA 1 stores streams Dis and Loc and MSA 2 stores surfaceImg, each in a local HSFO under survey 0 / session 0; the RSS holds session.meta with streamTypeList = [ 5, 0, 1 ]; after the survey, the streams and session.meta are uploaded and aggregated into the HSFO of the FCM data center under survey 0 / session 0]
Figure 4.9: Data Transfer and Aggregation
The HSFO provides an efficient and reliable solution to store, manage and aggregate
heterogeneous bulk sensor data with flexibility, scalability and expandability. Its metafiles
capture rich information about the data collection dynamics (e.g., sensor settings, timestamps)
to describe the raw data with minimal processing overhead, which is critical for data
management and processing. The BDH library eases interaction with the HSFO so that different
kinds of plugins can access and process the data. The next step is flexible data processing on the
HSFO using the Plugin Executor (PLEX) environment. PLEX simplifies the integration of new
algorithms and eliminates human interaction during processing through an automated approach.
Chapter 5
Plugin Executor (PLEX)
The key to big data handling is flexible data processing and fusion to convert
the raw data into meaningful knowledge. This corresponds to the multi-layer multi-modal fusion stage in
Figure 2.1 and works directly on the sensor data storage, HSFO. Facilities are needed to fuse heterogeneous
data streams, correlate them in time and space, and allow for visualization. Furthermore,
flexibility in processing and in chaining processing steps has to be an integral part of the solution.
The current data processing procedure involves substantial human effort, which causes
inevitable delays and consequently lowers productivity. Tools are required to simplify algorithm
integration and to make use of bulk data handling to avoid duplicated work in retrieving files. A
standard definition of plugins is critical to provide a unified way to execute them, which
further automates running plugins in a particular sequence during processing.
The PLEX environment is designed to address these issues. We introduce a modularized
design and break the whole processing procedure into small, manageable plugins. The plugins
integrate various domain-specific algorithms into PLEX and leverage the bulk data handling library
to interact with HSFO uniformly. The running sequence of plugins is predefined by a configurable
scheduler, eliminating human interaction during data processing. This chapter first gives an overview
of the PLEX environment, then elaborates the standard definition and detailed usage of plugins. Lastly,
the plugin scheduler that automates and accelerates the data processing procedure is illustrated.
5.1 Plugin Executor (PLEX) Environment Overview
As part of SIROM3 framework, PLEX is designed with high scalability and adaptability,
which can be run on different levels or different components (e.g. from sensors, MSA to FCM).
CHAPTER 5. PLUGIN EXECUTOR (PLEX)
Since it is platform-agnostic, data processing and fusion can happen on a sensor in real time or on
the centralized storage after aggregation. Figure 5.1 gives an overview of the PLEX environment.
[Figure: the PLEX environment consists of PLEX with a rule-based scheduler (driven by plugin rules), user inputs in the form of Stream Definitions and Plugin Rules, and plugins 1..N that operate on the HSFO, which overlays file systems such as Ext4, QNX4 and NTFS]
Figure 5.1: PLEX Environment
To enable flexible processing, we introduce plugins. A plugin uses streams as inputs and
produces new output streams. Examples include the calculation of crack density from the video
stream, or calculating the Pavement Condition Index (PCI) from fused data on pavement surfaces.
A plugin executor (PLEX) is provided to manage and execute the plugins. Figure 5.1
shows the PLEX environment, containing PLEX itself, two user-defined inputs (Stream Definitions
and Plugin Rules), and an array of systemic and algorithmic plugins with a rule-based scheduler.
Systemic plugins perform system operations such as data retrieval, transfer or tagging, whereas
algorithmic plugins are algorithm realizations operating on streams to produce new results or refine
existing ones (e.g., feature extraction). The rule-based scheduler directs the execution sequence of the plugins.
5.2 Plugins
The plugins are designed to process data from one stream type (e.g., unprocessed data) into
another stream type (e.g., feature-extracted data). Unlike SQL queries that operate
selectively on a partial stream and return data for immediate display [18], the plugins here work
on full data streams for permanent storage and commit their updates to the metafiles.
5.2.1 Plugin Definition
The plugins process data at the granularity of streams. The advantage of operating on
complete streams, as opposed to individual files, is that calling a plugin becomes independent of
file granularity. Video data, for example, might be represented both as individual images and as a
single container file encompassing multiple frames. One plugin example might be a conversion
of individual images to a video stream. In this case, there is no one-to-one relation between input
and output files. Thus, plugins manipulate complete streams and are responsible for their files. Each
plugin defines the input stream types it can accept and the output stream types it produces. To execute
different plugins uniformly, Table 5.1 gives a standard definition of plugins. Table 5.2 shows the
definition of the crack distortion correction plugin as an example.
Table 5.1: Plugin Definition

Field | Description
inStreamList | List of input stream types. A plugin may operate on one or more input streams.
outStreamList | List of output stream types. Note: a plugin may produce multiple output streams in the same run; each produced output stream is listed in this array. The order of output stream types has to match the order of arguments on the command line.
shortName | Plugin short name; corresponds to the name of the directory, so it should be alphanumeric only, with no spaces or special characters.
Name | Plugin name (can have spaces).
Path | Path to the plugin on the file system.
Description | Plugin description (can be a longer description).
Cmd | Command on the file system to run the plugin.
Args | String of constant arguments to call the plugin.
Rules | Processing rules for chaining filters.
Event | List of events that the plugin can trigger.
Table 5.2: Crack Distortion Correction Plugin Example

Field | Value
inStreamList | STREAM_SURFACE_IMG_RAW
outStreamList | STREAM_SURFACE_IMG_CORR
shortName | DistortionCorrection
Name | Distortion Correction for RAW Video Images
Path | /Algo/Plugins
Description | Corrects the raw images for distortion due to angle and zoom. Crops appropriately to remove useless information such as the bumper of the car.
Cmd | MainDistortionCorrection
Args | %SURVEY_ID% %SESSION_ID%
Rules | NA
Event | Can trigger Crack Detection plugin
Table 5.1 summarizes the parameters that need to be captured to characterize the available
plugins. The definition of all plugins is centralized in a Plugin Definitions table, like the Stream
Definitions. It is also written in Libconfig format [26] and comprehensively describes the plugins. In
a plugin definition, both the input and the output can be lists of streams, represented by
enumerators corresponding to the streamType values in the Stream Definitions. The shortName is used to
identify the plugins in the Plugin Rules. When the plugins listed in the Plugin Rules are run, the corresponding
plugin definitions are looked up in the Plugin Definitions by shortName, and the
command lines to execute the plugins are automatically composed from Path, Cmd and Args.
Table 5.2 gives an example definition for the crack distortion correction plugin. The enumerators
identify the stream types defined in the Stream Definitions (Figure 4.3). The plugin
operates on raw uncorrected images as its input stream and creates angle- and zoom-corrected,
cropped images [15]. The plugin is pre-compiled into an executable using the MATLAB compiler. The
Path, Cmd and Args fields result in the following command line call:
/Algo/Plugins/MainDistortionCorrection %SURVEY_ID% %SESSION_ID%
The arguments distinguish System Parameters from User Parameters. Fixed parameters
such as %SURVEY_ID% and %SESSION_ID% are not influenced by users; they identify the
survey and session to be operated on and are replaced by the current values. For example, when a plugin
processes all sessions within a survey in turn, the surveyID and sessionID are retrieved from
the metafiles, and the strings %SURVEY_ID% and %SESSION_ID% are replaced by each combination of
surveyID and sessionID when calling the plugin on the streams under each session. Conversely,
User Parameters, such as a threshold, can be altered by the user at runtime to adapt the desired
outcome. More parameters can be placed in the Args field and passed through the command
line to control the operation of a plugin. Both systemic plugins and algorithmic plugins can be
defined in this way. The Plugin Definitions table maintains the standard definitions of all plugins and
enables running different plugins uniformly.
5.2.2 Interaction with HSFO
Leveraging the bulk data handling library, different plugins can access the HSFO uniformly.
These common utilities avoid duplicated file retrieval implementations in each plugin
and ease the integration of new algorithms. This section elaborates the detailed steps of how plugins
interact with the HSFO using the bulk data handling library and process streams.
To access and operate on streams in HSFO, plugins need to access meta information
from the top of the spanning-tree structure in Figure 4.2 (tracing from survey and session down to
stream). Aided by the bulk data handling library (described in Section 4.6), plugins can interact
with the HSFO uniformly and seamlessly, e.g. retrieving files or querying metadata, via a number
of exposed APIs. For instance, a plugin can retrieve existing streams to operate on through
streamInfoGet(), or create new streams for improved results via streamCreate(). Only the
surveyId, sessionId and streamType of the current stream are needed as input.
[Figure: the plugin obtains the metadata of its input streams (InStreamList) via streamInfoGet() using the stream definition table; in a loop over each file it fetches the input path with fileNameAbsGet(), runs the algorithm, creates output metadata with streamCreate() and fileMetaCreate(), writes results via fileNameAbsGet() and fileCommit(), and finally calls streamClose(). The example plugin crackDetect takes the stream videoRaw and a Threshold parameter as input and produces the stream crackMap]
Figure 5.2: Interaction of Plugins with HSFO
Figure 5.2 demonstrates the interaction with the HSFO using the crack detection plugin.
An overview of the plugin is shown at the bottom of the figure: it operates on raw video
images and outputs binary images with the detected cracks. The threshold is a User Parameter that
decides the crack width. An algorithmic plugin is usually composed of domain-specific algorithms as
well as operations on HSFO such as retrieving stream files or dumping metafiles. The BDH library
encapsulates the common functionality to ease access to HSFO for plugins. As depicted in Figure 5.2,
the plugin retrieves the metadata of its input streams via streamInfoGet() together with the Stream
Definitions. With the fileIdList contained in the stream metadata, the plugin acquires the absolute
path of each file through fileNameAbsGet() and processes the files with its algorithms in a loop. The
stream metadata of the newly created stream is automatically generated by streamCreate() with the
given streamType and sensorID; similarly, the file metadata is composed via fileMetaCreate(). Using
fileNameAbsGet(), the refined stream data is then written to the new files at their absolute
paths. The interface fileCommit() updates the meta information of each new file (e.g., start/end
time stamps, number of samples), while streamClose() dumps the stream metafile to disk.
Table 5.3 presents the corresponding code example for the crack detection plugin, written in
C++. Since the BDH library is also exposed to Matlab and Python through SWIG, plugins can easily be
translated to other languages, enabling various development environments. The examples
show that only the algorithmic parts differ between algorithmic plugins, whereas their interaction
with the HSFO is fixed. This significantly simplifies the integration of new algorithms.
Table 5.3: Crack Detection Plugin Code Example

    VOTERS::tStreamInfo streamInfoOrig = BDHPlg.streamInfoGet(surveyId,
        sessionId, VOTERS::STREAM_SURFACE_IMG_RAW);
    int fileNum = streamInfoOrig.fileIdList.length();
    BDHPlg.streamCreate(surveyId, sessionId,
        VOTERS::STREAM_SURFACE_IMG_CRACK, VOTERS::SENSOR_CAM);
    for (int fileId = 0; fileId < fileNum; fileId++) {
        string fileNameAbsOrig = BDHPlg.fileNameAbsGet(surveyId, sessionId,
            VOTERS::STREAM_SURFACE_IMG_RAW, fileId);
        VOTERS::tFileId fileIdNew = BDHPlg.fileCreateMeta(surveyId, sessionId,
            VOTERS::STREAM_SURFACE_IMG_CRACK);
        string fileNameAbsNew = BDHPlg.fileNameAbsGet(surveyId, sessionId,
            VOTERS::STREAM_SURFACE_IMG_CRACK, fileIdNew);
        // ... specific data processing algorithms ...
        BDHPlg.fileCommit(surveyId, sessionId, VOTERS::STREAM_SURFACE_IMG_CRACK,
            fileIdNew, fileInfoNew);
    }
    streamClose(surveyId, sessionId, VOTERS::STREAM_SURFACE_IMG_CRACK);
5.3 Plugin Executor
As one plugin is often not able to accomplish all processing, multiple plugins are designed
and incorporated to perform incremental, algorithm-specific data processing. The PLEX is implemented
to schedule and automate the execution of a collection of plugins. Because the processing
of plugins complies with the metafile hierarchy of HSFO, the PLEX is adaptable and scalable enough to be
executed on HSFO at different levels of the hierarchical architecture of SIROM3. The integrated
design automation eliminates human interaction and boosts overall productivity.
5.3.1 Rule-based scheduler
Data processing is done by a group of plugins. The Rule-based scheduler defines
a configuration rule to direct the running sequence of the plugins: some plugins run in parallel,
while others must wait for preceding ones to finish because of data dependencies. Figure 5.3
illustrates the running sequences of plugins arranged by the Rule-based scheduler.
[Figure: the Rule-based Scheduler reads the PluginDef Table and streamDef Table together with user-defined Plugin Rules (e.g. Rule1: (Plugin 1 || Plugin 3), Plugin2; Rule2; ...) and triggers Plugin 1 and Plugin 3 in parallel, followed by Plugin 2. The resulting Streams A, B and C are written into HSFO.]
Figure 5.3: Scheduling of Plugins
In Figure 5.3, there is a list of user-defined Plugin Rules. The Rule-based scheduler takes
the input rules and runs the plugins in the given sequence. Note that the plugins are represented
by their shortName in the Plugin Rules; during data processing, these need to be replaced by the
command lines that execute the plugins. The PLEX looks up the full definition of each plugin in the
Plugin Definitions based on its shortName and composes the command line from part of
the settings (Path, Cmd and Args). The command lines are then invoked in the same order as
the sequence of shortName entries listed in the Plugin Rules. In the example, Rule1 defines a sequence in which
Plugin1 first runs in parallel with Plugin3, after which Plugin2 runs sequentially. The Rule-
based scheduler takes Rule1 and runs the plugins automatically in that order. The output streams
of Plugin1 and Plugin3 become the input of Plugin2. All the created streams (StreamA, B, C)
as well as their metafiles are written into HSFO for permanent storage through the HSFO APIs.
In order to integrate a new plugin into PLEX, a user adds a plugin definition
to the Plugin Definitions specifying the plugin name, input and output streams, arguments and so on. The
information about all involved streams must be available in the Stream Definitions for composing
the stream metadata. Finally, the shortName of the plugin is added to the Plugin Rules that
direct the Rule-based scheduler. The modularized design pattern of plugins significantly reduces
the effort of developing and experimenting with new algorithms, while the internal Rule-based scheduler
further enhances the overall automation. For instance, a serialization can be achieved by running a
number of plugins with dependencies; e.g., an image distortion correction plugin is often a necessary
predecessor of the crack detection data mining plugin [15].
5.3.2 Design Automation and Flexibility
The standard definition of plugins offers a unified solution to manage different kinds of
plugins. Benefiting from this, the command line to execute a plugin can be generated uniformly
from its definition following the calling convention. In addition, the Rule-based scheduler eliminates
human operations during execution and automatically triggers plugins in the
predefined order, which avoids erroneous operations and interruptions between plugins. Plugins with no
data dependencies can be arranged to run in parallel to improve performance. The PLEX also
makes it possible to check the execution progress and verify whether a plugin has run successfully during processing.
The modularized design of plugins facilitates the development of algorithms. As algorithm development is a
progressive effort, the algorithms are usually improved and need to be updated every once in
a while. Within the PLEX environment, each algorithm can be changed individually as needed without
interfering with the others. Beyond this design flexibility, the PLEX is platform-agnostic and can
be run at any level of SIROM3. For example, the PLEX can perform algorithmic plugins on
each MSA to fuse or process raw data from a single domain (e.g. real-time location optimization
with real-time streaming data). This level only contains temporal correlation between data points.
Knowledge-level fusion can be achieved with a similar approach yet a different algorithm at the
FCM level to fuse streams from multiple domains; this level adds knowledge of geometry, allowing
for spatial correlation. Therefore, data processing can exist at multiple levels in the distributed
architecture of SIROM3. A three-level data processing solution (Table 5.4) is proposed in [7] to
distribute processing responsibilities across different levels.
Table 5.4: Three Processing Levels

Level   Description                  Real-Time        Scope/Fusion
1       Sensor Processing (MSA)      Possible         Single domain, single sensor
2       On-board Processing (RSS)    Time-delayed     Multiple domains, multiple sensors, local geometry
3       Off-board Processing (FCM)   Post-processing  Multiple domains, multiple sensors, local and global geometry, multiple times
The advantages of PLEX include reusability, understandability, automation, and the al-
lowance for formal software analysis and integration techniques, which also increases the reliability
and power of the overall system. In summary, the HSFO and PLEX are integrated into the SIROM3
framework to provide a unified and efficient method to store, manage, aggregate and process
heterogeneous bulk sensor data. Both are platform-agnostic and run anywhere from sensors to data
centers. They bring great benefits such as minimal overhead for algorithm designers, rapid algo-
rithm exploration and fully automated data aggregation and fusion. The results of using HSFO and
PLEX are demonstrated in the next chapter.
Chapter 6
Experimental Results
To validate the software framework of SIROM3 as well as its real implementation in the VOTERS
project, the system has been subjected to a city-wide roadway condition inspection. This chapter
first briefly examines the time synchronization accuracy. It then focuses on the performance anal-
ysis of HSFO and PLEX during data collection and processing respectively, and discusses their
low overhead. Finally, the overall productivity of SIROM3 is compared with other methods to
demonstrate the benefits of using SIROM3 for roadway assessment.
6.1 Performance Analysis
6.1.1 Time Synchronization Accuracy
[Figure: (a) histogram of PTP clock offset (number of samples vs. offset in ns); (b) cumulative probability of communication latency overhead in µs for MSA1 through MSA5.]
Figure 6.1: Timing Analysis
Timing accuracy is of critical importance for sensor fusion since streams are correlated via
time stamps: different kinds of data need to agree upon a common temporal data point, and a large time
error would erroneously offset a geo-tagged data point in the space domain. Figure 6.1(a)
shows that our realized PTP synchronization stays within a maximum jitter of 12 µs with a standard
deviation of 2.0375 µs, which meets the synchronization requirement (359 µs) defined in [7].
Moreover, the 12 µs jitter results in an offset of only 0.33 mm in distance at the vehicle's roaming speed
of 100 km/h, small enough to achieve the desired spatial correlation (1 cm) between sensors.
In addition, a common time base is essential for correlating data across distributed subsystems;
collaboration across MSAs therefore needs to occur within timing bounds. To evaluate the communica-
tion performance, Figure 6.1(b) plots the cumulative probability of communication latency during
regular operations across MSAs. All MSAs exhibit very low communication latency: the 96th
percentile ranges from 3 µs to 9 µs, which corresponds to a maximum spatial deviation of 0.25 mm at
a speed of 100 km/h. This low latency ensures timely communication
and collaboration. Since each sample is time stamped either by sensors synchronized to the MSAs (e.g. the HD
video camera) or by the MSAs upon arrival, the synchronization services guarantee the accuracy and
correctness of the temporal data correlation.
6.1.2 MSA Performance Analysis
To assess SIROM3 quality, we evaluate the resource utilization and communication over-
head of the MSAs, outlined in Table 6.1, during data collection. The results are averaged over the period
of a survey; they reflect plain system operation (i.e. without extra plugins) and thus capture the overhead of
HSFO for creating meta information (e.g. generating file names and composing absolute file paths).
Table 6.1: MSA Performance Results

Systems                   MSA1      MSA2     MSA3     MSA4     MSA5     MSA5 (DAQ)
CPU [%]                   0.04      2.69     3.25     3.21     52.6     45.4
Network load [KBps]       216.392   83.06    800.97   842.98   212.68   NA
Avg. Comm. Latency [us]   3.7149    4.2916   6.3283   5.5105   4.2325   NA
Overall, Table 6.1 shows that all MSAs operate with fairly low CPU consumption by uti-
lizing asynchronous data acquisition (DAQ) and direct memory access (DMA). During data acqui-
sition, HSFO adds some overhead to system performance since each MSA repeatedly uses the
bulk data handling interfaces to create file meta information and retrieve files to write. The low CPU
utilization indicates that only a small overhead is introduced by HSFO. MSA5, the high-quality
video system, has a high CPU consumption due to simultaneous raw image capture and JPEG
compression. Note that the column MSA5 (DAQ) gives the baseline CPU utilization
of the video system without being integrated into SIROM3 and without using HSFO. Only a 7% difference in
CPU consumption is observed before and after the integration. As the video captures around 20 images
per second, the per-file overhead (e.g. creating meta information to maintain the HSFO structure;
each image is a separate file) caused by HSFO is fairly low. The meta information is kept in
memory during data collection and dumped to metafiles after the session stops, so that there
is no delay during a session. In this way, HSFO does not hinder the speed of data acquisition,
especially for sensors that do not switch to new files very often.
The local Gigabit network interconnecting all MSAs shows low utilization, as stream data
is collected and stored locally on each MSA. The low traffic across MSAs leaves headroom for the essential
collaboration messages. For instance, the video camera on MSA5 captures an image every 50 mm
while the vehicle roams through traffic, and the accurate distance information is provided by the
Distance Measurement Instrument (DMI) sensor on MSA1; they therefore exhibit similar network
load due to the shared data flow. On the other hand, MSA4 and MSA2 present higher network load
because of real-time streams: to validate the real-time data quality, a sample stream of the raw data at reduced
frequency is recorded and uploaded to the on-board controller every second, and then
shown on the real-time monitoring tablet. Such real-time streams are handled automatically
and differently by HSFO, which only uploads the files for real-time usage without dumping
metafiles. The average latency, ranging from 3 µs to 6 µs, guarantees accurate time synchronization
and real-time operation. As a result, HSFO provides reliable storage and handles different streams
flexibly without hindering the overall performance of the MSAs.
6.1.3 Statistics of Data Behaviors
To demonstrate the impact and results of HSFO, Table 6.2 shows selected statis-
tics describing the data behaviors observed in field tests.
As the three survey results show, each survey was conducted with a different number of streams,
as different sensors were active (e.g. survey 1 had the radar system active) or a stream used new
parameters (e.g. survey 2 used a compression parameter). These differences can be triggered
manually or auto-calibrated depending on geo-points of interest (e.g. higher resolution or more
sensors in critical areas, fewer sensors in less important places to save storage).
During survey 0, four streams on the vehicle participated, with a data acquisition speed of
Table 6.2: Statistics of Data Behaviors

                                     survey 0   survey 1   survey 2
Participating stream numbers         4          6          7
Meta size for 1 vehicle/h [MB/h]     10.112     8.363      8.96
Raw size for 1 vehicle/h [GB/h]      307.968    346.02     296.62
Meta size for 3 vehicles/h [MB/h]    30.336     25.089     26.88
Raw size for 3 vehicles/h [GB/h]     923.9      1038       890
Avg. Meta/Raw Ratio [%]              0.0032     0.0024     0.0029
approximately 307 gigabytes per hour. In survey 1, two new streams (HD camera and millimeter-
wave radar) were active during data collection, which increased the data generation rate to 346
gigabytes per hour. Due to the stream parameter adjustment (i.e. JPEG image compression)
introduced in survey 2, the data generation rate dropped to 296 gigabytes per hour. All three
surveys generate around 300-350 gigabytes of data per hour for one RSS. With three RSSs
running in parallel, nearly 1 terabyte of data in total needs to be collected, aggregated and analyzed
per hour. Compared to the terabytes of raw data collected, only a few hundred megabytes of metafiles
are necessary to maintain the extra meta information. With such a small footprint, metafiles can be easily
and quickly processed. The average ratio of metadata to raw data is approximately 0.003%.
As shown in the example surveys, HSFO accommodates the flexibility of SIROM3 well:
multiple distinct components can be inserted or removed seamlessly without interfering with other
operating components or the stored data. The efficient design of the metafiles in HSFO is critical
in balancing the storage space between raw data and extra information. Thanks to this multi-layered
overlay, HSFO offers substantial advantages in data storage, management and processing. A PLEX
is later executed on HSFO to process the data using algorithmic and systemic plugins.
6.1.4 HSFO Performance Overhead
The HSFO provides a reliable and efficient method to store and organize the data during a
survey. After the RSS finishes a survey and returns to the homezone, the data is uploaded and aggregated
into the centralized HSFO on the FCM and processed by PLEX through a set of plugins. To quantify
the impact of HSFO and PLEX, we measured their resource utilization and processing
overheads, shown in Table 6.3.
HSFO is a layered structure placed atop the host file system to categorize and organize
data automatically and uniformly, and it provides the bulk data handling library so that plugins can simply access
the data. This comes with a trade-off between automation and processing time. The time used for processing
Table 6.3: HSFO Performance Results

Data Management Method    With HSFO    Without HSFO
CPU [%]                   52.4         48.6
Execution time [s]        362          359
Processing Speed [GB/s]   0.04198      0.04233
Overhead                  0.19 µs/KB
metafiles introduces computing overhead, while the automation eliminates human-intrusive op-
erations and thereby increases overall productivity. The performance with and without HSFO was
compared in a Linux environment. Since the algorithm itself is independent of PLEX, the choice of
algorithm does not affect the comparison. Here we simply use a down-sampling plu-
gin that samples the data every other line and reduces each file to half its size. The experiment uses a 15 GB
dataset containing one stream with more than 600 separate files; note that the plugin
needs to process metadata for each file.
The results indicate that processing the data with HSFO using the down-sampling plugin takes 3-5
seconds longer than in the absence of HSFO. In other words, the time overhead accounts for
about 0.8%-1.4% of the original execution time. These 3 seconds cover all operations using the bulk data handling library,
from composing meta information to dumping metafiles into HSFO. As mentioned
before, any update to a metafile happens only in memory during processing and is dumped once
the stream is closed, to save time. Expressed differently, processing data with HSFO incurs
an overhead of 0.19 µs/KB compared to processing without it, small enough to be neglected. This is also
explained by the fact that the metafiles for managing the 15 GB of data total only 285.1 KB, which a plugin can
process quickly. Moreover, the difference in CPU utilization between the two methods is less than
5%. This implies that HSFO and PLEX have quite low overhead and do not impede performance.
With the bulk data handling library, different plugins can retrieve files through common
interfaces without duplicated work, which simplifies algorithm integration. Owing to the standard
definition of plugins and the centralized plugin scheduler, PLEX automates data processing by
eliminating human-interactive operations and significantly improves productivity. The system-level
impact of SIROM3 is discussed in the following section.
6.2 Data Fusion and System Impact
Given the accuracy in timing, Figure 6.2 illustrates data fusion opportunities through tem-
poral correlation. Three different data streams are correlated based on sample time to identify
[Figure: radar, acoustic, and pressure data streams aligned on a common time axis.]
Figure 6.2: Temporal Data Fusion Example
abnormalities on the pavement surface. In this example, manholes are differentiated from potholes
using multiple sensor sources to remove false positives. The data fusion process is fully
automated using PLEX, and the developed GIS visualization portal contains multiple data layers
covering a given roadway.
Figure 6.3 shows a snapshot of the GIS visualization portal with the city-
wide pavement condition for the entire road network of Brockton, MA. The overall condition
of a road is represented by the Pavement Condition Index (PCI) on a scale of 0-100 [33]. We surveyed
over 300 miles, gathering inspection data over all lanes using one RSS. Over 20 terabytes of data
have already been collected, aggregated, fused and visualized on the FCM and GIS server. Using the
system, the severity of road conditions can be prioritized so that proper repair and maintenance
actions can be taken under budgetary constraints.
Table 6.4: The Overall Impact of SIROM3

300 Miles Coverage       Data Collection [h]   Data Transfer [h]   Data Processing [h]   Overall [h]   Slowdown
Traditional Methods      640-800               0                   160-320               1120          22.4
Van-based systems [23]   24-32                 14-16               160-320               368           7.4
SIROM3                   16-20                 14-16               14                    50            1
In order to assess the value of the VOTERS project built upon the SIROM3 framework,
we compare the time effort for data collection, transfer and processing in three scenarios: traditional
methods, van-based systems, and SIROM3. The three periods are calculated independently,
without any overlap. For data collection, mobility clearly wins over traditional methods in the
case of SIROM3 and the van-based systems. However, data transfer is significant for
the mobile agents, as transferring large amounts of data is bandwidth-constrained, whereas field engi-
neers can retrieve their data immediately on site. Lastly, SIROM3 excels in data processing time owing
to the unified automation embodied in the plugins and PLEX. The slowdown in the last column
is calculated from the overall time spent from data collection to processing. The over-
all productivity has been increased by nearly 25 times using SIROM3 compared to the traditional
methods. Moreover, the scalability and expandability of SIROM3 create more diverse opportuni-
ties for integrating new sensor techniques, whereas the systems in [23] are less scalable and expandable
in cost-efficient ways.
[Figure: map of the Brockton, MA road network colored by the PCI-based road condition rating scale: Good (100-85), Satisfactory (85-70), Fair (70-55), Poor (55-40), Very Poor (40-25), Serious (25-10), Failed (10-0).]
Figure 6.3: City-wide Infrastructure Performance Inspection
As a result, SIROM3 significantly simplifies the construction of a scalable and efficient
multi-modal multi-sensor mobile sensor system. The HSFO embedded in SIROM3, together with a PLEX,
promotes and automates heterogeneous big sensor data management from storage and transfer to process-
ing and fusion. The VOTERS project, which is based on SIROM3, enables collection of infrastructure
health information at traffic speeds. This allows expanding analysis coverage and repeating inspec-
tions to validate and improve the desperately needed life-cycle models.
Although we have already automated the whole process from data collection and transfer
to processing, the three periods are conducted individually. The next step could be overlapping
these steps to gain even higher productivity. One opportunity lies in on-board data
processing: the responsibility for pre-processing moves from off-board (FCM) to each sensor
node (MSA), so that the files already available during a data collection can be processed on the MSA in real time. In
this way, data collection and processing are parallelized. Another benefit is that the reduced
data size shortens the time needed for data transfer. For instance, the raw images of roadway surface
conditions captured by the HD video camera can be pre-processed by a distortion correction plugin in
real time to correct them and remove useless information (e.g. the bumper of the car); the refined,
smaller data can then be uploaded instead of the raw data. During the data upload, the streams already
available on the FCM can also be processed by PLEX before the whole upload procedure has finished. Such
real-time processing is realized by running a PLEX on each MSA: like HSFO, the PLEX is designed
with high scalability and adaptability and can run on any system component, so a
PLEX on each MSA can perform real-time processing on the available raw data during
data collection.
Future research may revisit the layered design of HSFO, since our design is tailored
to civil infrastructure health monitoring. For example, we compare surveys from the same
location but different inspection dates to explore their deterioration or improvement. The same
mechanism can serve other RMMMS applications such as water quality and air pollution detection. To
extend the usage of SIROM3, the system can be translated into other domains with only
minor adjustments to HSFO. Taking human health care as an example, a survey could maintain
information about age, test site and date, while an upper layer keyed on a personal identifier could
aggregate the repeated surveys of one person.
Chapter 7
Conclusion
Performance monitoring of civil infrastructure is a crucial aspect of maintaining trans-
portation infrastructure. Cyber-Physical Systems (CPS) are a promising approach to acquiring infor-
mation on infrastructure conditions and time-varying behaviors, providing automated inspection
at lower cost with better coverage and time resolution. To handle the domain-specific big data
challenge in this CPS approach, the HSFO with the PLEX environment was introduced.
The hierarchically constructed HSFO provides a unified and efficient solution for storing,
organizing, managing and aggregating the large volumes of heterogeneous streaming data generated
by the distributed sensor systems. The metafiles preserve rich information about sensor hetero-
geneity and data collection dynamics. Moreover, the bulk data handling library provided by HSFO
eases plugins' access to the data. To leverage the power of HSFO, a PLEX is run on top of it to facilitate
data processing, fusion and visualization. The PLEX cooperates with HSFO to automate big
data handling, from data storage and transfer to processing through plugins. The PLEX also offers
a flexible plugin system that is an ideal testbed for developing data fusion and analysis methodologies. Since
both HSFO and PLEX are developed with high scalability and adaptability, they can be executed on
a wide range of platforms, from mobile systems to stationary servers. Our approach addresses big data
management, aggregation, processing and fusion for information discovery and data mining.
As part of the SIROM3 framework, HSFO and PLEX were demonstrated with the
entire framework and realized in the VOTERS project for network-wide roadway infrastructure mon-
itoring. Over 20 terabytes of data covering 300 miles have been collected, stored, and aggregated
so far in the city of Brockton, MA, and subsequently processed, geo-spatially analyzed, and visualized
to investigate infrastructure time-varying behaviors. The performance of HSFO with PLEX was
measured across the different periods, from data collection and storage to processing, confirming a
small overhead (0.19 µs/KB) on the entire system. The automation embodied in SIROM3 increased
the overall productivity by nearly 25 times. The fused big data gives city officials valuable information to
manage the life cycle of their infrastructure and coordinate investments.
Bibliography
[1] Karl Aberer, Manfred Hauswirth, and Ali Salehi. Infrastructure for data processing in large-
scale interconnected sensor networks. In Mobile Data Management, 2007 International Con-
ference on, pages 198–205. IEEE, 2007.
[2] Federal Highway Administration. 2008 status of the nation's highways, bridges, and transit:
Conditions and performance. Report to Congress, U.S. DOT FHWA, 2008.
[3] ASTM. Standard practice for measuring delaminations in concrete bridge decks by sounding:
ASTM D4580-03 (reapproved 2007), 2007.
[4] ASTM. Standard test method for water-soluble chloride in mortar and concrete: ASTM
C1218/C1218M-99, 2008.
[5] ASTM. Standard test method for corrosion potentials of uncoated reinforcing steel concrete:
ASTM C876-09, 2009.
[6] Christopher L. Barnes and Jean-Francois Trottier. Effectiveness of ground penetrating radar
in predicting deck repair quantities. Journal of Infrastructure Systems, 10:69–76, 2004.
[7] Ralf Birken, Gunar Schirner, and Ming Wang. VOTERS: design of a mobile multi-modal
multi-sensor system. In Proceedings of the Sixth International Workshop on Knowledge Dis-
covery from Sensor Data, SensorKDD ’12, pages 8–15, New York, NY, USA, 2012. ACM.
[8] Ralf Birken, Ming Wang, and Sara Wadia-Fascetti. Framework for continuous network-wide
health monitoring of roadways and bridge decks. In Proceedings of Transportation Systems
Workshop 2012, Austin, TX, March 2012.
[9] M. Bocca, J. Toivola, L.M. Eriksson, J. Hollmén, and H. Koivo. Structural health monitoring
in wireless sensor networks by the embedded Goertzel algorithm. In 2011 IEEE/ACM Interna-
tional Conference on Cyber-Physical Systems (ICCPS), pages 206–214, 2011.
[10] Kameswari Chebrolu, Bhaskaran Raman, Nilesh Mishra, Phani Kumar Valiveti, and Raj Ku-
mar. Brimon: a sensor network system for railway bridge monitoring. In Proceedings of
the 6th international conference on Mobile systems, applications, and services, MobiSys ’08,
pages 2–14, New York, NY, USA, 2008. ACM.
[11] M. Daum, M. Fischer, M. Kiefer, and K. Meyer-Wegener. Integration of heterogeneous sensor
nodes by data stream management. In Tenth International Conference on Mobile Data Man-
agement: Systems, Services and Middleware, 2009. MDM ’09, pages 525–530, May 2009.
[12] Jakob Eriksson, Hari Balakrishnan, and Samuel Madden. Cabernet: vehicular content delivery
using WiFi. In Proceedings of the 14th ACM international conference on Mobile computing
and networking, MobiCom ’08, pages 199–210, New York, NY, USA, 2008. ACM.
[13] M.R. Fetterman, T. Hughes, N. Armstrong-Crews, C. Barbu, K. Cole, R. Freking, K. Hood,
J. Lacirignola, M. McLarney, A. Myne, S. Relyea, T. Vian, S. Vogl, and Z. Weber. Distributed
multi-modal sensor system for searching a foliage-covered region. In 2011 IEEE Conference
on Technologies for Practical Robot Applications (TePRA), pages 7–14, 2011.
[14] L. Galehouse, J.S. Moulthrop, and R.G. Hicks. Principles of pavement preservation: Defini-
tions, benefits, issues, and barriers. TR News, September-October 2003, pp. 4-15. Transporta-
tion Research Board (TRB), National Research Council, Washington, D.C., 2003.
[15] Sindhu Ghanta, Ralf Birken, and Jennifer Dy. Automatic road surface defect detection from
grayscale images. In Proceedings of SPIE Symposium on Smart Structures and Materials +
Nondestructive Evaluation and Health Monitoring, March 2012.
[16] N. Gucunski, F. Romero, S. Kruschwitz, R. Feldmann, A. Abu-Hawash, and M. Dunn. Mul-
tiple complementary nondestructive evaluation technologies for condition assessment of con-
crete bridge decks. Transportation Research Record, No. 2201, 2010.
[17] L. Gurgen, C. Labbe, V. Olive, and C. Roncancio. A scalable architecture for heterogeneous
sensor management. In Sixteenth International Workshop on Database and Expert Systems
Applications, 2005. Proceedings, pages 1108–1112, 2005.
[18] Levent Gurgen, Claudia Roncancio, Cyril Labbé, André Bottaro, and Vincent Olive.
SStreaMWare: a service oriented middleware for heterogeneous sensor data management.
In Proceedings of the 5th International Conference on Pervasive Services, ICPS ’08, pages
121–130, New York, NY, USA, 2008. ACM.
[19] G. Hackmann, W. Guo, G. Yan, Z. Sun, C. Lu, and S. Dyke. Cyber-physical codesign of
distributed structural health monitoring with wireless sensor networks. Early Access Online,
2013.
[20] Yo-Ming Hsieh and Yu-Cheng Hung. A scalable IT infrastructure for automated monitoring
systems based on the distributed computing technique using simple object access protocol
web-services. Automation in Construction, 18(4), July 2009.
[21] Bret Hull, Vladimir Bychkovsky, Yang Zhang, Kevin Chen, Michel Goraczko, Allen Miu, Eu-
gene Shih, Hari Balakrishnan, and Samuel Madden. CarTel: a distributed mobile sensor com-
puting system. In Proceedings of the 4th international conference on Embedded networked
sensor systems, SenSys ’06, New York, NY, USA, 2006. ACM.
[22] IEEE. 1588-2008 - IEEE Standard for a Precision Clock Synchronization Protocol for Net-
worked Measurement and Control Systems, 2008.
[23] Pathway Services Inc. Automated road and pavement condition surveys, 2009.
[24] F. Jalinoos, R. Arndt, D. Huston, and J. Cui. Periodic NDE for bridge maintenance. In Proceed-
ings of the Structural Faults and Repair Conference, Edinburgh, June 2010.
[25] R. Kozma, Lan Wang, K. Iftekharuddin, E. McCracken, M. Khan, K. Islam, and R.M. Demirer.
Multi-modal sensor system integrating COTS technology for surveillance and tracking. In
2010 IEEE Radar Conference, pages 1030–1035, 2010.
[26] M. Lindner. libconfig – C/C++ configuration file library.
[27] Xuefeng Liu, Jiannong Cao, Wen-Zhan Song, and ShaoJie Tang. Distributed sensing for high
quality structural health monitoring using wireless sensor networks. In 2012 IEEE 33rd Real-
Time Systems Symposium (RTSS), 2012.
[28] Samuel R. Madden, Michael J. Franklin, Joseph M. Hellerstein, and Wei Hong. TinyDB: an
acquisitional query processing system for sensor networks. ACM Transactions on Database
Systems (TODS), 30(1):122–173, 2005.
[29] K.R. Maser. Bridge deck condition surveys using radar: Case studies of 28 New England decks.
Transportation Research Record, No. 1304, TRB, National Research Council, 1991.
[30] K.R. Maser, J. Doughty, and R. Birken. Characterization and detection of bridge deck deterio-
ration. In Proceedings of the Engineering Mechanics Institute (EMI 2011), Boston, MA, June
2011.
[31] D.L. Mills. Network Time Protocol (Version 3) specification, implementation and analysis.
Network Working Group Report RFC-1305, March 1992.
[32] American Society of Civil Engineers. 2013 report card for America’s infrastructure, 2013.
[33] S. Shahini Shamsabadi, M.L. Wang, and R. Birken. Pavement condition monitoring by fus-
ing data from a mobile multi-sensor inspection system. In Proceedings of the SAGEEP on
Geophysical Data Management, Boston, MA, USA, 2014.
[34] S. Shahini Shamsabadi, R. Birken, and M.L. Wang. PAVEMON: a GIS-based data management
system for pavement monitoring based on large amounts of near-surface geophysical sensor
data. In Proceedings of the IACSM on Structure Control and Monitoring, Barcelona, Spain,
2014.
[35] L.B. Stevens. Road surface management for local governments’ resource notebook. Federal
Highway Administration, Washington, D.C. Publication No. DOT-I-85-37, 1985.
[36] J. Walls and M.R. Smith. Life-cycle cost analysis in pavement design. Federal Highway
Administration, Washington, D.C. FHWA report FHWA-SA-98-079, 1998.
[37] Ming Wang, Ralf Birken, and Salar Shahini Shamsabadi. Framework and implementation of
a continuous network-wide health monitoring system for roadways. In SPIE Smart Structures
and Materials + Nondestructive Evaluation and Health Monitoring, pages 90630H–90630H.
International Society for Optics and Photonics, 2014.
[38] J. Zhang, H. Qiu, S. Shahini Shamsabadi, R. Birken, and G. Schirner. SIROM3 – a scalable
intelligent roaming multi-modal multi-sensor framework. In Proceedings of the 38th Annual
IEEE International Computers, Software, and Applications Conference, Västerås, Sweden, 2014.
[39] J. Zhang, H. Qiu, S. Shahini Shamsabadi, R. Birken, and G. Schirner. WiP: system-level
integration of mobile multi-modal multi-sensor systems. In Proceedings of the ACM/IEEE 5th
International Conference on Cyber-Physical Systems, Berlin, Germany, 2014.
[40] Yang Zhang, B. Hull, H. Balakrishnan, and S. Madden. ICEDB: intermittently-connected
continuous query processing. In IEEE 23rd International Conference on Data Engineering
(ICDE 2007), 2007.