
CarStream: An Industrial System of Big Data Processing for Internet-of-Vehicles

Mingming Zhang, SKLSDE, Beihang University, Beijing, China ([email protected])
Tianyu Wo, SKLSDE, Beihang University, Beijing, China ([email protected])
Tao Xie, University of Illinois at Urbana-Champaign, Urbana, IL, USA ([email protected])
Xuelian Lin, SKLSDE, Beihang University, Beijing, China ([email protected])
Yaxiao Liu, Tsinghua University, Beijing, China ([email protected])

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected].
Proceedings of the VLDB Endowment, Vol. 10, No. 12. Copyright 2017 VLDB Endowment 2150-8097/17/08.

ABSTRACT

As the Internet-of-Vehicles (IoV) technology becomes an increasingly important trend for future transportation, designing large-scale IoV systems has become a critical task that aims to process big data uploaded by fleet vehicles and to provide data-driven services. The IoV data, especially high-frequency vehicle statuses (e.g., location, engine parameters), are characterized by large volume, a low density of value, and low data quality. Such characteristics pose challenges for developing real-time applications based on such data. In this paper, we address the challenges in designing a scalable IoV system by describing CarStream, an industrial system of big data processing for chauffeured car services. Connected with over 30,000 vehicles, CarStream collects and processes multiple types of driving data, including vehicle status, driver activity, and passenger-trip information. Multiple services are provided based on the collected data. CarStream has been deployed and maintained for three years in industrial usage, collecting over 40 terabytes of driving data. This paper shares our experiences in designing CarStream based on large-scale driving-data streams, and the lessons learned from the process of addressing the challenges in designing and maintaining CarStream.

1. INTRODUCTION

In recent years, the Internet-of-Things (IoT) technology has emerged as an important research and application area. As a major branch of IoT, the Internet-of-Vehicles (IoV) has drawn great research and industry attention. Recently, cloud-based IoV has benefited from the fast development of mobile networking and big data technologies. Different from traditional vehicle networking technologies, which focus on vehicle-to-vehicle (V2V) communication and vehicular networks [28, 36], in a typical cloud-based IoV scenario the vehicles are connected to the cloud data center and upload vehicle statuses to the center through wireless communication. The cloud collects and analyzes the uploaded data and sends the value-added information back to the vehicles.



Similar to other IoT applications, the vehicle data are usually organized in a streaming manner. Although each vehicle uploads only a small stream of data, a big stream is merged in the cloud due to both the high upload frequency and the large fleet scale. Therefore, a core requirement in this cloud-based IoV scenario is to process the stream of big vehicle data in a timely fashion.

However, to satisfy this core requirement, there are three main challenges in designing a large-scale industrial IoV system that supports various data-centric services. First, the system needs to be highly scalable to deal with the big scale of the data. Unlike in other systems, the data in IoV are mostly generated by machines instead of humans, and such data can be massive due to the high sampling frequency. Second, the system needs to make a desirable trade-off between the real-time and accuracy requirements of data-centric services when processing large-scale data with low quality and a low density of value. Remarkable redundancy exists in the data due to the high sampling frequency, which causes the low density of data value. However, such data should not simply be removed, because it is challenging to identify which data will never be used by any supported service. Third, the system needs to be highly reliable to support safety-critical services such as emergency assistance. Once a service is deployed, it needs to run continuously and reliably as the data keep coming in. An unexpected service outage may cause severe property damage or even loss of life.

There can be multiple solutions for designing an IoV system. For example, the traditional OLTP (Online Transaction Processing) architecture uses a mature and stable database as the center around which services are deployed. In such an architecture, the system collects the data uploaded by vehicles, stores them in the database, and provides services based on the analytical ability of the database subsystem. This solution is simple, mature, and can be highly reliable. However, it cannot scale well as the fleet size grows, because the database can easily become the bottleneck of the whole system.



The OLAP (Online Analytical Processing) architecture is often used to deal with large-scale data; in it, a data-processing subsystem, rather than a data-storage subsystem (such as a database subsystem), offers the core analytical ability. In such a system, the computing pressure is mainly on the data-processing subsystem, which often executes complex queries. Designing an OLAP-style processing subsystem for IoV scenarios often requires integrating multiple platforms such as Storm [5] and Spark Streaming [4, 38].

In this paper, we address the challenges in designing a scalable, high-performance big data analytics system for IoV by describing CarStream, an industrial data-processing system for chauffeured car services. CarStream connects over 30,000 vehicles in more than 60 cities, and it has multiple data sources, including vehicle status data (such as speed, trajectories, engine Revolutions Per Minute (RPM), and remaining gasoline), driver activities (such as starting a service or picking up a passenger), and passenger orders. CarStream provides multiple real-time services based on these data. In particular, we address the scalability issue by equipping CarStream with distributed processing and data storage. We then employ a stream-processing subsystem to preprocess the data on the fly, so that the problems of low data quality and low density of value can be addressed. Our solution achieves higher performance at the cost of more storage space. We further accelerate CarStream by integrating an in-memory caching subsystem. Finally, we design a three-layered monitoring subsystem for CarStream to provide high reliability and manageability. The monitoring subsystem provides a comprehensive view of how the overall system runs, covering the cluster layer, the computing-platform layer, and the application layer. This subsystem helps sense the health status of CarStream in real time, and it also provides valuable information about the system to the developers who continuously improve its reliability and performance.

CarStream has been deployed and maintained for three years in industrial usage, during which we have kept improving and upgrading the system and deploying new applications. In this process, we have learned various lessons that are valuable for both the academic and industrial communities, especially for those who are interested in designing a streaming system of big data processing. In this paper, we share our experiences, including the system designs and the lessons learned in this process.

In summary, this paper makes the following main contributions:

• An industrial system of big data processing that integrates multiple technologies to provide high performance for IoV applications.

• A three-layered monitoring subsystem for reliability assurance of long-running and safety-critical stream-processing applications.

• A set of lessons learned throughout the three-year process of designing, maintaining, and evolving such an industrial system of big data processing to support real-world chauffeured car services.

The rest of this paper is organized as follows. Section 2 describes the background of CarStream, including the business background and the data. Section 3 describes the overall architecture of CarStream. Section 4 discusses the implementation details of CarStream, including the stream-processing subsystem, the heterogeneous big-data management subsystem, and the three-layered monitoring subsystem. Section 5 presents the lessons learned. Section 6 summarizes the related work, and Section 7 concludes the paper.

2. BACKGROUND OF CARSTREAM

The concept of IoV originated from IoT, and there can be multiple definitions of IoV. For example, traditional IoV defines a vehicular network connected by vehicles through the Radio Frequency Identification (RFID) technology. A technology known as Vehicle-to-Vehicle (V2V) can further share traffic and vehicle information among running vehicles, providing a local-area solution for transportation safety and efficiency. With the wide adoption of the mobile Internet, traditional IoV has evolved into a wide-area network deployment that combines the in-vehicle network and the inter-vehicle network. In such IoV, the vehicle sensors are connected to the Electronic Control Unit (ECU) through the CAN-BUS, and a cloud-centric vehicular network is constructed with vehicles connected to the backend servers through the mobile network. The backend servers act as a central intelligent unit that collects, analyzes, and stores the data uploaded by vehicles, and further sends the value-added information back to the vehicles.

By successfully fulfilling many requirements in the transportation domain, IoV has been for years one of the most popular applications in the IoT paradigm. But few large IoV deployments exist, due to the high deployment and maintenance cost for individually owned cars. Private car services, such as Uber [9] and Lyft [8], are successful Internet-driven industrial examples. The underlying system supporting such services can be regarded as a simplified version of IoV where smartphones, instead of dedicated sensors, are used to connect passengers, drivers, and vehicles. This technology has given rise to a huge market targeted by many similar companies established in recent years. To implement an IoV platform, the basic functional requirements should at least include vehicle-data collection, storage, analysis, and services. Vehicles can upload driving data, including instant locations and vehicle-engine statuses, to the backend servers. The data can be used in both real-time and offline vehicle-management applications.

In this paper, we discuss the design and implementation issues of a cloud-centric IoV system named CarStream. CarStream is based on a real-world chauffeured car service company in China, named UCAR. Such service is similar to the Uber private-car service, where the drivers take orders from passengers through mobile phones and drive them to their destinations. But the UCAR service also has several unique characteristics: the fleet belongs to the company, and the drivers are employees of the company. This centralized business model provides a good opportunity for deploying an Onboard Diagnostic (OBD) connector, a data-sampling device for vehicles. The OBD is connected to the in-vehicle Controller Area Network bus (CAN-BUS). At a defined interval, the OBD collects vehicle parameter values such as speed, engine RPM, and vehicle error codes. The collected data are transmitted back to the backend server through an integrated wireless-communication module.
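For concreteness, the following Java sketch shows what one such OBD data instance might look like once parsed. The actual wire format and field set used by CarStream are not specified in this paper, so all field names here are illustrative assumptions.

    // A minimal, hypothetical view of one parsed OBD data instance.
    // Field names are assumptions; the real packet holds 70+ parameters.
    public class VehicleStatus {
        public String vehicleId;  // fleet-assigned vehicle identifier
        public long timestamp;    // sampling time, epoch milliseconds
        public double latitude;   // GPS location
        public double longitude;
        public double speedKmh;   // instantaneous speed
        public int engineRpm;     // engine revolutions per minute
        public double fuelLevel;  // remaining gasoline; noisy under vibration
        public String errorCode;  // vehicle error code, possibly empty
    }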



Table 1: Functionalities and applications provided in CarStream.

Data Quality
  - Outlier Detection: Detect outlier data from vehicle-data streams.
  - Data Cleaning: Filter outlier or repeated data.
  - Data Quality Inspection: Analyze data quality from multiple dimensions and generate reports periodically.
  - Data Filling: Fill in missing data or replace wrong data in the streams with various techniques.

Fleet Management
  - Electronic Fence: Based on a user-defined virtual fence (on a map) and vehicle locations, detect fence-entering and fence-leaving activities.
  - Vehicle Tracking: Track vehicle status, including location, speed, engine parameters, errors, and warnings, in real time. Provide fast feedback to drivers and the control department.
  - Fleet Distribution: Quickly index and visualize the dynamic fleet distribution in the space dimension; support vehicle scheduling.
  - Gasoline Anomaly Detection: Detect abnormal gas consumption in case of stealing or leaking.
  - Fleet Operation Report: Produce statistics of the driving data; offer a regular report on the running status of the fleet, such as daily mileage, gas consumption, and valid mileage.
  - Driving Behavior Analysis: Evaluate driving skills in terms of safe and skillful driving.
  - Driving Event Detection: Detect driving events, such as sudden braking, from the data stream.

Decision Making
  - Driver Profit Assessment: Calculate the profitable mileage and gasoline consumption of a certain trip by using historical data.
  - Order Prediction: Based on historical order data, predict the future order distribution so that cars can be dispatched in advance.
  - Dynamic Pricing: Design a flexible price model based on real-time passenger demand to achieve maximum profit, e.g., increasing the price for peak hours.

Other
  - System Monitoring: Provide a comprehensive view of the system-health status.
  - Multi-Stream Fusion: Merge multiple streams, such as driving events and GPS locations, to extract correlated patterns.
  - Trajectory Compression: Losslessly compress huge historical trajectory data.

With the help of such near-real-time vehicle data, we can monitor the driving behavior at a fine granularity and react in a timely manner to ensure the safety and quality of the trip services. The data also provide a direct view of vehicle status, which is valuable in real-time fleet scheduling and long-term operational management.

Besides the OBD connector, each driver also uses a mobile phone to conduct business-related operations such as taking orders, navigating, and changing the service status. Therefore, the mobile phone generates a separate data stream, namely business data, and uploads the data to the server through a different channel.

CarStream is designed to process those vehicle data and business data as a back-end service. In general, CarStream collects, stores, and analyzes the uploaded data in terms of vehicle and driver management. So far, a major part of the fleet, over 30,000 vehicles, is connected to CarStream. Those vehicles are distributed among 60 different cities in China. Each vehicle uploads a data packet to the server every 1 to 3 seconds when the vehicle is running. Thus, the backend system needs to process nearly 1 billion data instances per day. The scale of the fleet expanded from the initial 1,500 vehicles to 30,000 within a few months; this business expansion poses a horizontal-scalability requirement on the underlying platform.
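As a rough sanity check of these figures (the arithmetic here is ours, not from the original text): 30,000 vehicles, each uploading one packet every 1 to 3 seconds, produce 10,000 to 30,000 packets per second; at the low end, 10,000 × 86,400 seconds/day ≈ 0.86 × 10^9 packets per day, which matches the stated order of nearly 1 billion data instances.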

Figure 1 illustrates the multiple data that are collected and processed in CarStream during a typical day. After running for three years, in total 40 TB of vehicle data have been collected by CarStream. The data can be roughly classified into four types: vehicle status, passenger order, driver activity, and trajectory data, as shown in Figure 1. The data in CarStream are not only large in volume but also comprehensive.

[Figure 1: The data processed in CarStream (40 TB in total): vehicle status (speed, direction, RPM, accelerator and brake pedals, gear, fuel, mileage, windows, doors, lights, seat belt, air conditioner, error codes), GPS trajectory (latitude, longitude), driver activity (start service, pickup, arrive, end service), and user demand (pickup point, destination, pickup time, arrive time).]

However, underneath the abundant data are a series of data-quality problems that need to be addressed. For example, a full set of the vehicle-data parameters contains more than 70 items, but because the vehicles are produced by multiple manufacturers, different subsets of the whole parameter set may be collected from different vehicles. Some vehicles have gear parameter data while others do not. Meanwhile, the running of the vehicle always causes vibration, which in turn affects the correctness of sensor readings such as the remaining gasoline level. Other data-quality problems that CarStream faces include disorder and delay, which may be caused by multiple factors such as distributed processing and network problems. Per the statistics of our dataset, nearly 10% of the data are disordered or delayed. Such problems may have a significant influence on different aspects of the system, especially on the processing accuracy.

[Figure 2: The data amount changes with time; the number of data instances (×10^5) per hour over 24 hours, for three days.]

A special characteristic of the data stream in CarStream is that it is subject to the traffic pattern: the data amount uploaded in peak hours can be as much as 80 times larger than that in off-peak hours. For this reason, CarStream also needs to deal with burst streams. Figure 2 illustrates the data amount over 24 hours on three days, and clearly shows a traffic-like pattern with morning and evening peak hours.

Based on the data, CarStream provides multiple functionalities and applications to facilitate fleet management for the chauffeured car service. As shown in Table 1, the applications depend on each other. Data filling, for example, relies on data cleaning, which in turn depends on outlier detection. The fleet operation report application strongly depends on the data cleaning and outlier detection functionalities, because most of the gasoline data are very inaccurate due to the influence of vehicle vibration. Multi-stream fusion is another application that other applications depend on. Because the data are collected from multiple sources with different frequencies, fusion (or matching) becomes a common requirement. For example, an interesting application in CarStream is detecting abnormal gasoline consumption. In this application, we need to merge the data stream of driver activities with the data stream of vehicle statuses so that we can learn whether the gasoline consumption is reasonable. Sudden drops of the remaining gasoline level are detected by the system, but a further analysis of the data suggests that most of these phenomena are caused by the influence of the parking position. However, such phenomena can also be caused by gasoline stealing, which was not expected by fleet managers. A sketch of such a fusion step is shown below.
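The windowed join underlying such fusion can be sketched in Java as follows; it reuses the hypothetical VehicleStatus class from Section 2. The window length, method names, and the correlate step are illustrative assumptions, not the production implementation.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;

    // Joins driver-activity events with recently buffered vehicle statuses
    // of the same vehicle, so that a fuel-level drop can be checked against
    // what the driver was doing (e.g., parked vs. in service).
    public class StreamFusion {
        static final long WINDOW_MS = 60_000; // join window; 60 s is assumed

        private final Map<String, Deque<VehicleStatus>> buffer = new HashMap<>();

        public void onVehicleStatus(VehicleStatus s) {
            buffer.computeIfAbsent(s.vehicleId, k -> new ArrayDeque<>()).addLast(s);
            evictExpired(s.vehicleId, s.timestamp);
        }

        public void onDriverActivity(String vehicleId, long ts, String activity) {
            Deque<VehicleStatus> window = buffer.get(vehicleId);
            if (window == null) return; // status stream delayed or lost
            for (VehicleStatus s : window) {
                if (Math.abs(s.timestamp - ts) <= WINDOW_MS) {
                    correlate(s, activity); // e.g., flag a fuel drop while parked
                }
            }
        }

        private void evictExpired(String vehicleId, long now) {
            Deque<VehicleStatus> window = buffer.get(vehicleId);
            while (!window.isEmpty() && now - window.peekFirst().timestamp > WINDOW_MS) {
                window.removeFirst();
            }
        }

        private void correlate(VehicleStatus s, String activity) {
            // Application-specific logic, e.g., gasoline anomaly detection.
        }
    }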

The applications in CarStream also differ in their performance requirements, as we can infer from Table 1. Some applications have high performance requirements, as they are time-critical or safety-critical missions. The various application requirements increase the complexity of designing an integrated system that fits all of the listed applications.

3. OVERALL ARCHITECTURE

Based on the aforementioned challenges and requirements, we design CarStream, a high-performance big data analytics system for IoV. In this section, we present an overview of CarStream and its comparison with other possible solutions. As illustrated in Figure 4, CarStream is an integrated system with multiple layers involved. The components that are marked by the dotted line are the core parts of CarStream and therefore are discussed in detail in this paper.

There can be multiple design choices for fulfilling IoV requirements. The traditional OLTP/OLAP solutions adopt a database-centric design. In such an architecture, the uploaded driving data are normally stored in a relational database, and the services are implemented based on queries against the database. Such a solution is simple, mature, and easy to deploy and maintain, qualifying it as an option for IoV scenarios with a small-scale fleet. However, for large-scale scenarios, the database-centric architecture suffers from poor throughput of data reading and writing. Since the relational database is designed to fulfill the common requirements of transactional processing, it is hard to tailor the database platform in a fine-grained manner by dropping unnecessary indexing and transactional protection, which consequently causes poor system performance.

Nowadays, as applications related to streaming data become popular, many stream-processing architectures have been proposed. The architecture comprising distributed buffering, stream processing, and big data storage is widely adopted, and such a solution can satisfy many IoV scenarios. However, it is not sufficient to deploy stream processing alone. The comprehensive and evolving application requirements, as well as the varying data characteristics, require both online and offline processing ability. A good design that connects the different loosely coupled components well is key to the success of the system.

3.1 Data Bus Layer

The data bus layer provides a data-exchange hub for system modules to connect different computing tasks. This layer is implemented with a distributed messaging platform that is fault-tolerant, high-performance, and horizontally scalable. This layer also aims at providing a buffer for adjusting the stream volume for downstream data-processing procedures. By eliminating the time-varying spikes of data streams, the downstream processing subsystems can take a routine pace in dealing with the streams, without worrying about the system capacity being exceeded.
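In CarStream this layer is realized with KAFKA (Section 4.1). The hedged sketch below publishes one raw packet with the standard Kafka producer API; the broker address, topic name, payload, and keying by vehicle ID (so that each vehicle's packets stay in one partition, preserving per-vehicle order) are our assumptions for the example.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class DataBusPublisher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka1:9092"); // assumed broker address
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Keying by vehicle ID keeps one vehicle's stream in one partition.
                producer.send(new ProducerRecord<>("vehicle-status", "v00042",
                                                   "{\"speed\":58.5,\"rpm\":2100}"));
            }
        }
    }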

3.2 Processing Layer

The processing layer includes two parts: the online stream-processing subsystem and the offline batch-processing subsystem. The stream-processing subsystem is designed to preprocess the streams in a way that reduces the query pressure on the database. Without preprocessing, applications based on the raw data would put huge read pressure on the database. Since the large volume of raw data will be continuously stored, the writing burden cannot be reduced; the system would suffer from the read and write operations simultaneously if they both applied to a single target. Through preprocessing, an extracted subset of the raw data can be generated and maintained for further query. The result of preprocessing is incrementally updated in a data-driven manner. Such a solution satisfies many IoV applications that depend more on the extracted data subset than on the whole dataset. In this manner, the read and write pressure on the database can be separated, leading to an architecture that scales more easily to a larger data volume and more applications.



[Figure 3: An overview of CarStream. Data flow from a distributed messaging system into real-time stream processing (a JStorm cluster) and batch processing; storage spans HDFS, HBase, Memcached/Redis, a relational database, and an archive file system, under a spatial-temporal big data management layer; web servers form the access layer.]

[Figure 4: The system architecture of CarStream: a data receiving/dispatching layer, a data bus layer, a processing layer, a data management layer, and a query server layer, with a monitor component spanning the layers.]


The batch-processing subsystem is employed to satisfy complex analysis requirements that involve more historical data. Moreover, considering that the low quality of the raw data may affect the accuracy of stream processing, the result of batch processing can also play a role as an auxiliary measurement to evaluate the quality of the stream-processing result. Combining both stream processing and batch processing, CarStream can satisfy the needs of a wide range of workloads and use cases in which low-latency reads and frequent updates of data are required.

3.3 Data Management Layer

The data management layer adopts a heterogeneous structure constructed with multiple databases. In CarStream, both NOSQL and SQL databases are adopted to achieve high performance. The design decision is based on a comprehensive analysis of the IoV applications. We store the data in different storage systems per the access requirements of the applications.

More specifically, the most frequently accessed data are stored in the in-memory caching subsystem so that the data can be accessed and updated frequently. By using an in-memory key-value data structure, the in-memory caching subsystem can achieve millisecond latency and high throughput in data updating and exchanging. There are multiple choices for in-memory caching, such as Memcached [25] and Redis [19]. These platforms were originally designed for caching the objects of a web server to reduce the pressure on the database. The in-memory caching employed here acts as a fast medium of data update and exchange for applications. On one hand, the processing subsystem keeps updating the current results into the cache; on the other hand, the web servers keep fetching the processing results from the cache. An application example here is vehicle tracking, in which the stream-processing subsystem updates the statuses of the vehicles while the users check these statuses simultaneously through the web browser. Then, the structured and preprocessed data, which form much smaller data streams than the raw data, are stored in an RDBMS and accessed with SQL. By leveraging mature object-relational mapping frameworks, it is much easier to integrate an RDBMS when building web user interfaces. Finally, all the data generated in CarStream are archived in the distributed NOSQL big data storage subsystem, which supports only non-real-time, offline data processing. This architecture increases the system complexity, but it also accommodates the various requirements of different applications and can reduce the risk of data loss in a single-system failure.
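The vehicle-tracking pattern just described takes only a few lines with the Jedis client for Redis; the key layout ("vehicle:<id>" hashes) and host name are illustrative assumptions.

    import redis.clients.jedis.Jedis;

    public class VehicleTrackingCache {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("redis-host", 6379)) { // assumed address
                // Producer side (stream-processing task): overwrite the
                // current status of vehicle 42 in O(1).
                jedis.hset("vehicle:42", "lat", "39.9042");
                jedis.hset("vehicle:42", "lon", "116.4074");
                jedis.hset("vehicle:42", "speed", "58.5");
                // Consumer side (web server): fetch all fields for display.
                System.out.println(jedis.hgetAll("vehicle:42"));
            }
        }
    }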

3.4 System Monitoring

System monitoring is a unified name for the three-layered monitoring subsystem of CarStream. We monitor the system from three different layers: infrastructure, computing platform, and application. It is necessary to monitor each of these layers. Infrastructure monitoring is used to view either live or archived statistics covering metrics such as the average CPU load and the network utilization of the cluster. The infrastructure monitoring provides not only a real-time health view of the hardware but also guidance for adjusting the system deployment in terms of load balancing and resource utilization. For the computing-platform layer, we monitor the platforms, such as the stream-processing subsystem, the batch-processing subsystem, and the databases, to track the running status of the platform software. Finally, the applications in CarStream are monitored because the monitoring information collected for the applications can directly reflect the health status of the provided services. Monitoring the application layer can provide useful information for service-failure prevention or recovery.

4. IMPLEMENTATION PARADIGM

The overall structure of CarStream and the connections between its modules are illustrated in Figure 3. In this section, we discuss the detailed implementation of CarStream in three main parts: the processing subsystem, the streaming data management subsystem, and the three-layered monitoring subsystem. We show how to combine different technologies to accomplish a high-performance data-analytics system for IoV. For each part, we present the problems encountered in the implementation along with the insights for solving them. The requirements of the applications in CarStream are summarized in Table 2.



Table 2: Application requirements of CarStream.

  - Scalability: With the increase of the fleet scale, we need to be able to improve the capacity of the processing platform. We also need to be able to add capacity to our data-storage subsystem.
  - Low Latency: After data are uploaded by a vehicle, they must be processed immediately. The processed result should be provided with low latency.
  - High Throughput: On one hand, the system needs to be able to receive and store all the uploaded data. On the other hand, the system needs to be able to provide fast queries for multiple users.
  - High Reliability: CarStream needs to provide a highly reliable data-analytics system to users, including drivers, fleet managers, and passengers, to ensure the dependability of the services.
  - High Availability: CarStream needs to provide sustainable services that are robust enough to tolerate various unstable environmental conditions.



4.1 Processing Subsystem

As introduced in the preceding sections, the data processed in CarStream are uploaded by vehicles. As the vehicles keep running, the data stream formed by their data packets arrives continuously, and thousands of small streams are merged into one big stream in the cloud. Such data are naturally distributed; therefore, a natural solution for processing the data is to use a distributed platform for stream processing. There are multiple technological solutions for data-stream processing, such as Storm [5], JStorm, Flink [1], Spark Streaming [4, 38], Flume [2], Samza [3], and S4 [32].

Apache Storm [5] is a distributed real-time computing platform for processing a large volume of high-velocity data. Based on the master-worker paradigm, Storm uses a topology to construct its computing tasks. Storm is a fast, scalable, fault-tolerant, and reliable platform; it is easy to set up and operate. JStorm is a Java version of Storm, whereas Storm itself is implemented in Clojure, which is inconvenient for big-data developers, who are mostly familiar with Java or C++. Compared with Storm, JStorm is more stable and powerful. Apache Flink [1] is an open-source stream-processing framework for distributed, high-performance, always-available, and accurate data-streaming applications. Spark Streaming [4, 38] is an extension of the core Spark API; this extension enables scalable, high-throughput, fault-tolerant stream processing of live data streams. In Spark Streaming, data can be processed using complex algorithms expressed with high-level functionalities. Apache Flume [2] is a distributed, dependable, and available service for efficiently collecting, aggregating, and moving a large volume of log data or streaming event data. Targeting log-like data, Flume also provides a simple, extensible data model for online analytics applications. Apache Samza [3] is a relatively new framework for distributed stream processing. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. S4 [32] is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. There are also other stream-processing platforms, such as Amazon Kinesis, which runs on Amazon Web Services (AWS), and IBM InfoSphere Streams.

The preceding platforms share some common characteristics: they are all scalable and high-performance, and can process millions of data instances per second. They all provide basic reliability guarantees and support multiple programming languages. It is not easy to assess the pros and cons of each system in general. However, for the given requirements, if the decision has been narrowed down to choosing among those platforms, the choice usually depends on the following considerations:

• System Cost. Cost is an important factor in selecting a technology, since small enterprises or research groups might not be able to afford services that are not free.

• Development Cost. Development cost includes the human cost of deploying the infrastructure platforms and developing the applications. It is easier to develop a comprehensive data-analytics system if the selected platforms have good support for being integrated together. Meanwhile, deployment can also be part of the human cost, as it is easy to sink days or weeks of development time into turning open-source platforms into a scale-ready production environment.

• Upgrade Cost. Business requirements change over time, and the corresponding applications also need to change accordingly. In the early stage of a production system, such evolution is especially frequent. Developers often suffer in such circumstances; however, the problem can hardly be attributed to the processing system if the platforms support an easy upgrade of the applications in a way that lets developers reuse the existing work, so that the upgrade is less painful.

• Language Paradigm. This consideration includes two aspects. The first is which language developers want to use to develop their applications. The second is which language the infrastructure is developed in. Developers prefer to use the language that they are familiar with for developing their applications. And if they want to improve the infrastructure platforms themselves, it is more convenient if the platform is developed in a language that the developers are familiar with.

• Processing Model. Different platforms support different processing models. For example, Storm uses a topology to organize its steps of processing the data streams. The processing model decides which platform fits best for implementing the applications.



[Figure 5: An example of dependency among applications, involving Data Cleaning, Data Filling, Driving Event Detection, Driver Assessment, Gasoline Abnormal, Trip Assessment, Fleet Management, and Stream Fusion.]

After a considerable investigation and experimental exploration, we adopted JStorm as the basic stream-processing platform in CarStream. The main reason is that, besides the basic high-performance, distributed, fault-tolerant characteristics of stream processing, JStorm also uses a topology as its computing-task model; this characteristic is especially suitable for implementing the aforementioned IoV applications. As discussed earlier, most IoV applications either depend on each other or share similar functionalities. An example of such dependency is shown in Figure 5. It is common to develop and deploy an independent computing task for each application, but sometimes we might need to merge different tasks to improve resource utilization and development efficiency. With a topology as the programming model, developers can integrate multiple applications by merging similar processing tasks, or split a huge task into smaller (and thus more reusable) function modules. Samza provides another solution by using KAFKA, a distributed messaging platform, as a data hub: each task outputs its processing results to KAFKA, and such results can be further used by other tasks.
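To make the topology model concrete, the following is a minimal, self-contained sketch using the Storm-compatible API (JStorm retains Storm's backtype.storm packages): one cleaning bolt is shared by two downstream application bolts, instead of each application re-implementing the cleaning step. Component names, parallelism values, and the filter bound are illustrative assumptions; the spout is a stand-in for the KAFKA-fed spout used in production.

    import java.util.Map;
    import java.util.Random;
    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class SharedPipelineTopology {

        // Stand-in for the KAFKA-fed spout; emits random packets for the demo.
        public static class VehicleSpout extends BaseRichSpout {
            private SpoutOutputCollector out;
            private final Random rnd = new Random();
            public void open(Map conf, TopologyContext ctx, SpoutOutputCollector c) { out = c; }
            public void nextTuple() {
                out.emit(new Values("v" + rnd.nextInt(30000), rnd.nextDouble() * 300));
            }
            public void declareOutputFields(OutputFieldsDeclarer d) {
                d.declare(new Fields("vehicleId", "speed"));
            }
        }

        // Shared preprocessing: drop obviously broken packets once, for everyone.
        public static class CleaningBolt extends BaseBasicBolt {
            public void execute(Tuple in, BasicOutputCollector out) {
                double speed = in.getDoubleByField("speed");
                if (speed >= 0 && speed <= 250) { // crude outlier bound (assumed)
                    out.emit(new Values(in.getStringByField("vehicleId"), speed));
                }
            }
            public void declareOutputFields(OutputFieldsDeclarer d) {
                d.declare(new Fields("vehicleId", "speed"));
            }
        }

        // Placeholder for an application task (cf. Table 1).
        public static class AppBolt extends BaseBasicBolt {
            public void execute(Tuple in, BasicOutputCollector out) { /* app logic */ }
            public void declareOutputFields(OutputFieldsDeclarer d) { }
        }

        public static void main(String[] args) {
            TopologyBuilder b = new TopologyBuilder();
            b.setSpout("raw", new VehicleSpout(), 4);
            // fieldsGrouping by vehicle ID routes each vehicle's packets to the
            // same task, which also yields per-vehicle ordering (see Lesson 2).
            b.setBolt("clean", new CleaningBolt(), 8)
             .fieldsGrouping("raw", new Fields("vehicleId"));
            // Two applications reuse the same cleaned stream.
            b.setBolt("event-detection", new AppBolt(), 8).shuffleGrouping("clean");
            b.setBolt("fleet-management", new AppBolt(), 4).shuffleGrouping("clean");
            new LocalCluster().submitTopology("carstream-demo", new Config(),
                                              b.createTopology());
        }
    }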

Another reason for using JStorm is that both its implementation language and its supported programming language are Java, a language that most big data developers are familiar with. Therefore, when it is necessary to modify the platform during development, it is easier to modify JStorm than platforms such as Storm, which is implemented in Clojure. CarStream also employs Hadoop as the offline processing subsystem. Applications in IoV require not only real-time stream processing but also large-scale batch processing. However, such a requirement is not the only reason why batch processing is used here; the data-quality issue is another reason. According to our empirical study of the data, nearly 10% of the data are disordered or delayed, and less than 0.1% of the data are substantially delayed, or even lost. Although such data occupy a minor part of the whole dataset, they still affect stream processing significantly in terms of accuracy and system performance. Consider multi-stream matching as an example. The basic requirement is matching stream A with stream B. In this task, if some data of stream B are delayed or lost, then the data in stream A may not find their matches. As a result, the program has to wait, or at least buffer more data instances of stream A in case the delayed data of stream B come later. This solution is common, but sometimes it can be risky because increasing the buffer length may cause memory issues. In our implementation, we detect the data-quality issues and split the bad part of the data off for offline processing. Online stream processing deals with only the good part of the data. Offline processing runs at a given interval, when most data are ready. When an application launches a request, both the online and offline processed results are queried. This combination of online and offline processing is widely used in CarStream to achieve real-time performance for most data, and CarStream tries its best to offer a better result than performing online processing only.

The data bus layer in CarStream is implemented with KAFKA, one of the most powerful distributed messaging platforms. KAFKA can be easily integrated with most stream-processing platforms and hence is widely used in industrial solutions. In fact, there are also many other similar platforms that can be used as the data bus, such as MetaQ, RocketMQ, ActiveMQ, and RabbitMQ.

4.2 Streaming Data Management Subsystem

CarStream needs to manage terabytes of driving data and further provide services based on data analytics. The stream is formed by a great number of data instances, each of which takes about a hundred bytes. CarStream collects nearly a billion data instances every day. The value density of such data is very low, as little useful information can be extracted, due to the high sampling frequency. Moreover, the data value decreases quickly with time, because most of the applications in IoV, especially the time-critical ones, care more about the newly generated data. The historical data, though, cannot simply be compressed or discarded, because they might still be useful for certain applications.

We address the issue of data storage in IoV from two aspects. The first aspect is data separation. We identify the applications that have high real-time requirements so that we can further identify the corresponding hot data (such as the current vehicle statuses). We then separate the hot data from the cold data and put them into storage platforms with different performance. The second aspect is preprocessing. Scanning a big dataset is painful for a database, but appending (writing) data can be fast. Therefore, we extract a small dataset with a higher density of value by using stream processing, and we separate this part of the data from the raw data. In this manner, the storage subsystem can provide high throughput for data writes, and high performance for the queries that are based on the small, preprocessed dataset. Furthermore, we put some hot data in the in-memory cache to achieve real-time data access.

To construct the data management subsystem, we adopted three different storage platforms in CarStream: in-memory caching, a relational database, and NOSQL big data storage. There are multiple choices for each kind of storage platform. Many NOSQL databases, such as Cassandra, HBase, and MongoDB, can be adopted. In practice, we use HBase as the archive storage platform for the raw data. HBase can easily integrate with Hadoop to achieve big data batch-processing ability. HBase provides high-performance key-value storage, which fits the requirements of storing the raw vehicle data perfectly. The raw data include vehicle statuses, vehicle trajectories, driver-activity-related data, and user-order-related data. Those data use an ID (vehicle ID, user ID, driver ID) and a timestamp as the index key. During the technical decision making, we conducted experiments on managing raw driving data with an RDBMS such as MySQL or PostgreSQL. In the test, we used <ID, timestamp> as the index. The results show that the relational databases have to maintain huge index files to index the big dataset. In other words, the data-access performance of an RDBMS does not scale with the data volume.



However, as a distributed key-value store, HBase does scale with the data volume.
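The <ID, timestamp> indexing scheme maps naturally onto an HBase row key, as in the hedged sketch below; the table name, column family, and key format are illustrative assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RawDataArchiver {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("vehicle_status"))) {
                // Row key = vehicle ID + zero-padded timestamp: one vehicle's
                // rows stay contiguous, so per-vehicle time-range scans are cheap.
                byte[] rowKey = Bytes.toBytes(
                    String.format("%s#%013d", "v00042", 1483228800000L));
                Put put = new Put(rowKey);
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("speed"),
                              Bytes.toBytes("58.5"));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("rpm"),
                              Bytes.toBytes("2100"));
                table.put(put);
            }
        }
    }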

The PostgreSQL database is used to store the relational data, and some of the processed data, to facilitate queries from the web server. Compared with other relational databases such as MySQL, DB2, and Sybase, PostgreSQL has more comprehensive spatial extension functions. This feature is especially important because IoV applications have many location-related processing requirements. Even though such requirements are mostly satisfied by the processing subsystem, the database still needs to provide several spatial query capabilities. PostgreSQL has a spatial database extension named PostGIS, which offers many spatial functions that are rarely found in other competing spatial databases.
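As an example of the kind of query this enables, the sketch below uses PostGIS's ST_DWithin through plain JDBC to find vehicles within 500 meters of a point; the connection string, table schema, and the geography cast are our assumptions about the deployment.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class NearbyVehiclesQuery {
        public static void main(String[] args) throws Exception {
            // Hypothetical schema: vehicle_location(vehicle_id, geom).
            String sql = "SELECT vehicle_id FROM vehicle_location "
                       + "WHERE ST_DWithin(geom::geography, "
                       + "ST_SetSRID(ST_MakePoint(?, ?), 4326)::geography, 500)";
            try (Connection c = DriverManager.getConnection(
                     "jdbc:postgresql://db-host/carstream", "user", "pass");
                 PreparedStatement ps = c.prepareStatement(sql)) {
                ps.setDouble(1, 116.4074); // longitude
                ps.setDouble(2, 39.9042);  // latitude
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) System.out.println(rs.getString("vehicle_id"));
                }
            }
        }
    }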

In-memory caching plays a key role in accelerating the real-time applications in CarStream. Memcached and Redis are popular in-memory caching platforms; they both provide high-performance key-value storage. Memcached provides extremely high performance for key-value-based caching; therefore, an earlier version of CarStream adopted Memcached as the data-buffering and high-performance data-storage medium. Later we found that Memcached has limitations in its functionality and supported data types. In comparison, Redis supports more data types and data-access functionalities; Redis also provides data persistence in case the in-memory data are lost. We then switched to Redis to gain support for more data types. A typical example of using in-memory caching to improve performance is real-time vehicle tracking. In this application, the vehicle uploads data to the server, and the data are further processed by the stream-processing subsystem; the processed results are updated into Redis for further use. Both the data producer (the processing subsystem) and the consumer (the web server) can achieve high performance with this solution.

4.3 Three-Layered Monitoring

Monitoring how the system runs is an essential method for providing highly dependable services [26]. By monitoring the system, maintainers can take corresponding actions in a timely manner when something goes wrong. CarStream can be roughly separated into three layers: the infrastructure layer, which includes the cluster and the operating system; the computing-platform layer, which includes processing platforms such as Hadoop and Storm; and the application layer, which is formed by the applications. To assure high dependability, each of the three layers needs to be monitored.

Typically, for the infrastructure layer, developers use cluster-monitoring tools such as Nagios [29], Ganglia [6], or Zabbix [10] to monitor the memory, network, I/O, and CPU load of the servers. Such monitoring tools also provide a useful dashboard that helps visualize the monitoring. The processing subsystem is formed by multiple processing platforms, which usually provide monitoring tools to track how the platform works. It is also an option to use third-party tools to monitor these platforms [23]. The monitoring of the infrastructure layer and the computing-platform layer provides useful information for assuring service dependability. Infrastructure monitoring has become one of the basic system deployment and maintenance requirements. In CarStream, we monitor the infrastructure with Ganglia. Compared with other monitoring tools, Ganglia provides powerful functions to monitor cluster performance from a comprehensive view; it also enables users to define their own monitoring items. Ganglia collects cluster information at very low cost. Through the dashboard provided by Ganglia, developers can obtain an overall view of the whole cluster. For computing-platform monitoring, our monitoring subsystem simply integrates the monitoring APIs provided by each platform.

[Figure 6: A user interface illustration of the three-layered monitoring subsystem: infrastructure monitoring (CPU, memory, network load, I/O load, disk capacity, overall server load), platform monitoring (KAFKA, JStorm, HBase, Hadoop, database; consumer delay, storage delay, topology throughput per node), and application monitoring (data flow tracking, data lifecycle tracking, TCP server, web server, in-memory cache).]


However, our maintenance experience with CarStream suggests that application monitoring is also essential. The applications in IoV run permanently as the streaming data continuously arrive. Meanwhile, the applications keep evolving with business changes. Such evolution requires abundant information about the old versions of the applications, including how each application ran: Is the parallelism sufficient for a specific application? How does the application handle burst stream spikes? Which task is the bottleneck in the processing topology (i.e., which task caused the congestion)? The answers to these questions are highly related to the monitoring of the application layer. From this perspective, monitoring the application layer is as important as monitoring the infrastructure layer and the computing-platform layer. Therefore, we incorporated application-layer monitoring into the monitoring subsystem of CarStream. In practice, we embedded into each application a module that monitors parameters reflecting the performance and health status of the application, such as the buffer queue length, the streaming speed, and a heartbeat. Those parameters are gathered into a shared in-memory database and further analyzed in near real time by a central monitoring server. Moreover, we employ anomaly-detection algorithms on these collected parameters to predict potential system failures. The monitoring server sends an alert message to maintainers when an anomaly is detected.
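The embedded module can be as simple as a scheduled reporter. Below is a hedged sketch in which counters are sampled every few seconds and written to a shared Redis instance for the central monitoring server to analyze; the key names, the five-second period, and the use of Jedis are illustrative assumptions.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;
    import redis.clients.jedis.Jedis;

    public class AppMonitor {
        private final AtomicLong processed = new AtomicLong();   // processing rate
        private final AtomicLong queueLength = new AtomicLong(); // local buffer length

        public void recordProcessed() { processed.incrementAndGet(); }
        public void setQueueLength(long len) { queueLength.set(len); }

        // Periodically push a heartbeat and counters to the shared store.
        public void start(final String appName) {
            ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
            ses.scheduleAtFixedRate(() -> {
                try (Jedis jedis = new Jedis("monitor-redis", 6379)) {
                    String key = "monitor:" + appName;
                    jedis.hset(key, "heartbeat",
                               Long.toString(System.currentTimeMillis()));
                    jedis.hset(key, "processedPerPeriod",
                               Long.toString(processed.getAndSet(0)));
                    jedis.hset(key, "queueLength",
                               Long.toString(queueLength.get()));
                }
            }, 0, 5, TimeUnit.SECONDS);
        }
    }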

With the help of this monitoring subsystem, the reaction time to system problems in CarStream has been reduced from hours to seconds. Figure 6 illustrates the three-layered monitoring subsystem from a user-interface perspective. The abundant application-running information that has been collected is further used to improve the alert accuracy of the monitoring. For example, we update the alerting thresholds to more appropriate values based on an analysis of the historical monitoring information; therefore, spam alert messages, a pain point for maintainers, can be significantly reduced. The daily number of alert messages in CarStream has been reduced from tens to just a few routine notifications.



[Figure 7: A statistic of the delay in data processing; the percentage of data instances whose end-to-end delay is <1 s, 1-2 s, and >2 s.]

4.4 Performance Evaluation

The performance of each platform used in CarStream has been evaluated by researchers, and the platforms have proved to deliver high performance. Here, to answer the question of whether the integrated system still scales with the data volume and delivers high performance, we conduct an end-to-end performance evaluation of the entire stream-processing procedure, from data receiving to data query. The processing subsystem is deployed on a virtualized cluster with 25 servers; each server has 16GB of memory, a 2-core 2.0 GHz CPU, and 2x10 Gbps Emulex NICs. A dedicated server, which has two Intel Xeon E5-2630v3 CPUs, 64GB of memory, 2x120GB SSDs, and 102TB of HDD storage, is allocated to the database. HBase and HDFS are deployed in a non-virtualized environment on three 12-node clusters. The total capacity of HDFS is above 600TB. We test the system from the following aspects.

Throughput. CarStream easily handles the data stream uploaded by 30,000 vehicles. We further simulate a higher data workload by injecting the historical dataset back into the system at high frequency. With the same deployment, the total processing throughput of CarStream has reached 240,000 data instances per second (each data instance is around 100KB).

End-to-end delay. This test shows whether the system can satisfy the low-delay requirement of time-critical applications. We use real-time vehicle tracking as the test application. In this application, the data are uploaded from the vehicles and go through a series of platforms, including the buffering platform, the stream-processing platform, and the in-memory caching platform, and are finally queried by the web server. For the end-to-end delay, we compare the timestamps between the data being generated and the data being processed. The result, as illustrated in Figure 7, shows that the average delay is less than 2 seconds; such a delay is acceptable in this scenario. We also monitor the length of the queues inside the computing tasks; the queue length indirectly reflects the processing delay. The result is shown in Figure 8. It can be inferred from the evaluation result that the buffering subsystem performs well in buffering the workload of peak hours; the buffering slightly increases the processing delay, but the processing subsystem catches up with the workload quickly.

Processing scalability. Scalability is a critical issue for CarStream, as the fleet scale increases over time. We test the processing scalability of CarStream by increasing the deployment scale of the buffering and processing subsystems. In this test, we use the electronic fence as the test application. We deploy the processing subsystem on one server and increase the number of servers gradually. By injecting a large volume of data into the system, we evaluate the relationship between the throughput and the system scale. The evaluation result, as illustrated in Figure 9, shows a near-linear scalability of the processing subsystem. This performance can be attributed to two factors: the distributed design of the processing platform, and the naturally distributed characteristic of the vehicle data. As the fleet scales up, we can always partition the data into small streams and deploy high-performance processing tasks accordingly.

[Figure 8: The fluctuation of the buffered queue length of the stream-processing subsystem; buffer length (×100) over time.]

[Figure 9: The scalability evaluation of CarStream; throughput (×10,000) versus the number of servers used (1 to 9).]


5. LESSONS LEARNED

Lesson Learned 1: Application-level monitoring is necessary for achieving high reliability.

Infrastructure monitoring and computing-platform monitoring are the most commonly adopted techniques for assuring system reliability. They allow us to react in a timely manner when a problem occurs, and the monitoring information also helps us utilize system resources by adjusting the deployment. However, we learned through maintaining CarStream that infrastructure-level and platform-level monitoring are insufficient for providing highly dependable services, especially in safety-critical scenarios. Our insight is that application-level monitoring is also necessary, because it provides direct information for answering key questions such as "Which processing step is the bottleneck?" or "What caused the cutoff of the data stream?" By monitoring key parameters of an application, such as local buffer length, processing delay, and processing rate, we gain a deeper understanding of the application's runtime behavior. These parameters can be collected and analyzed on the fly, helping operators take actions for service-failure prevention and recovery.
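As a concrete illustration, the sketch below (our simplification with hypothetical names, not CarStream's actual monitoring code) shows a per-task reporter for exactly the parameters named above: local buffer length, processing rate, and processing delay.

```java
import java.util.Queue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of application-level monitoring: each processing task
// periodically reports its buffer length, throughput, and last observed delay.
final class TaskMetrics {
    private final AtomicLong processedSinceLastReport = new AtomicLong();
    private final AtomicLong lastDelayMillis = new AtomicLong();
    private final Queue<?> localBuffer;

    TaskMetrics(Queue<?> localBuffer) {
        this.localBuffer = localBuffer;
        ScheduledExecutorService reporter = Executors.newSingleThreadScheduledExecutor();
        reporter.scheduleAtFixedRate(this::report, 10, 10, TimeUnit.SECONDS);
    }

    /** Called by the task once per processed record. */
    void onRecordProcessed(long delayMillis) {
        processedSinceLastReport.incrementAndGet();
        lastDelayMillis.set(delayMillis);
    }

    private void report() {
        long rate = processedSinceLastReport.getAndSet(0) / 10; // records/sec over 10 s
        // In production these values would be pushed to the monitoring store
        // for on-the-fly analysis rather than printed.
        System.out.printf("bufferLength=%d rate=%d/s delay=%dms%n",
                localBuffer.size(), rate, lastDelayMillis.get());
    }
}
```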

Lesson Learned 2: Low data quality widely exists in IoV. A fixing model extracted from historical data patterns can be helpful.

According to our experience, IoV applications usually face severe data-quality issues, such as data loss, data disorder, data delay, insufficient data, and wrong data. These problems have multiple causes. For example, data disorder can be attributed to the distributed processing mode of the system.



Table 3: Common data-quality issues in IoV scenarios.

Category | Problem | Consequence | Cause | Solution
---------|---------|-------------|-------|---------
Lack of data | Data loss | No result / inaccurate result | Network failure; software fault | Redundant deployment. Interpolation by data patterns.
Lack of data | Insufficient data | Inaccurate result | Physical limitation (vehicle limitation) | Interpolation by data patterns.
Wrong data | Disorder | Wrong result / inaccurate result | Distributed nature of processing platform; network transmission; store-and-forward design of data nodes | Delay and wait. Dropping out-of-date data. Order-guarantee design (in app layer). Fixing with prediction.
Wrong data | Wrong/outlier data | Wrong result | Hardware malfunction; inaccurate sensors; other physical reasons | Outlier detection. Data cleaning. Fixing with data patterns.

Because the data are processed concurrently, later data may be processed earlier than earlier-arrived data, resulting in a disordered sequence. The disorder can also be caused by parallel transmission in the network or by defects in application-layer processing. Similarly, wrong/outlier data can be attributed to multiple factors such as hardware malfunction, sensor jitter, or other physical causes (e.g., vehicle vibration). In IoV scenarios, these problems usually need to be addressed in real time. The common problems and their corresponding causes and solutions are summarized in Table 3.

There can be multiple solutions to these problems. Let us take data disorder as an example. A basic solution is delay-and-wait: processing is suspended until all the data instances within the sliding window are ready. This solution trades real-time performance for accuracy. Another solution is to provide a sequence guarantee in the application when designing the system; for example, the data of each vehicle are processed by the same processing node, so processing is parallel across the fleet but sequential for any specific vehicle. This solution also has its own limitation: it sacrifices the flexibility and scalability of the processing subsystem.
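A minimal sketch of delay-and-wait follows (our illustration of the stated trade-off, not CarStream's implementation): per-vehicle records are buffered in a priority queue ordered by generation timestamp and released only after a grace period, giving late arrivals a chance to slot into place.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.ToLongFunction;

// Minimal delay-and-wait reorder buffer: records are held back for a grace
// period and then emitted in timestamp order. Records arriving later than the
// grace period would be dropped or fixed by prediction, per the other
// strategies discussed in the text.
final class ReorderBuffer<T> {
    private final PriorityQueue<T> pending;
    private final ToLongFunction<T> timestampOf;
    private final long graceMillis;

    ReorderBuffer(ToLongFunction<T> timestampOf, long graceMillis) {
        this.timestampOf = timestampOf;
        this.graceMillis = graceMillis;
        this.pending = new PriorityQueue<>(Comparator.comparingLong(timestampOf));
    }

    void offer(T record) {
        pending.add(record);
    }

    /** Releases, in timestamp order, all records older than the grace period. */
    List<T> drainReady(long nowMillis) {
        List<T> ready = new ArrayList<>();
        while (!pending.isEmpty()
                && timestampOf.applyAsLong(pending.peek()) <= nowMillis - graceMillis) {
            ready.add(pending.poll());
        }
        return ready;
    }
}
```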

In CarStream, we employ a data-driven solution to address the data-quality problem. In particular, we exploit the patterns in the data: we derive a fixing model that represents the data patterns and then apply the fixing model online to improve the data quality. Our empirical study reveals clear patterns in the data, largely due to the regularity of driving. For example, the trajectory generated along a road follows the trend of that road, drivers maintain the vehicle in a stable state when driving on a highway, and most drivers accelerate steadily during the acceleration stage. Extracting the model takes some time, but running it online is fast and efficient. Note that this solution is applicable only to local data-quality problems, where a small part of the stream is missing or disordered.
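As a deliberately simplified illustration of pattern-based fixing, the sketch below fills a missing GPS sample by linear interpolation between its neighbors, reflecting the smoothness patterns mentioned above. CarStream's actual fixing model is derived from historical data patterns and is richer than this linear stand-in.

```java
// Illustrative only: a missing GPS sample is reconstructed from its temporal
// neighbors, exploiting the regularity of driving (trajectories follow the
// road's trend, speeds change smoothly).
record GpsPoint(long timeMillis, double lat, double lon) {}

static GpsPoint interpolate(GpsPoint before, GpsPoint after, long missingTimeMillis) {
    // Fraction of the way from 'before' to 'after' at the missing timestamp.
    double f = (double) (missingTimeMillis - before.timeMillis())
             / (after.timeMillis() - before.timeMillis());
    return new GpsPoint(missingTimeMillis,
            before.lat() + f * (after.lat() - before.lat()),
            before.lon() + f * (after.lon() - before.lon()));
}
```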

Lesson Learned 3: In large-scale scenarios, linked-queue-based sliding-window processing may cause a performance issue.

In stream-processing applications, the sliding window is one of the most commonly used functionalities. Typically, the sliding window is implemented with a linked queue: a new data instance is appended to the queue when it arrives, and once the queue reaches its maximum length, the oldest data instance is removed. The enqueue and dequeue operations perform well when the dataset is relatively small. However, for large-scale data streams, such a linked-queue-based solution may suffer from poor performance due to the frequent memory-management (allocation and deallocation) operations.

To improve the performance, we suggest using a Round-Robin Queue (RRQ) to implement the sliding window. The RRQ reuses memory to improve processing performance. Compared with the linked-queue-based solution, the runtime of processing the same quantity of data with an RRQ can be reduced by 80%.
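The sketch below shows what such an RRQ can look like (a minimal version of ours, assuming a fixed window size; not CarStream's exact implementation): the backing array is allocated once, and slots are overwritten in place.

```java
// Minimal Round-Robin Queue (ring buffer) sliding window. The array is
// allocated once; a full window overwrites its oldest slot instead of
// allocating a new node and freeing an old one, as a linked queue would.
final class RoundRobinWindow<T> {
    private final Object[] slots;
    private int head = 0;   // index of the oldest element
    private int size = 0;

    RoundRobinWindow(int capacity) {
        this.slots = new Object[capacity];
    }

    /** Appends a value; silently evicts the oldest one when the window is full. */
    void add(T value) {
        if (size == slots.length) {            // window full: overwrite oldest
            slots[head] = value;
            head = (head + 1) % slots.length;  // next-oldest becomes the head
        } else {
            slots[(head + size) % slots.length] = value;
            size++;
        }
    }

    @SuppressWarnings("unchecked")
    T get(int i) {                             // i = 0 is the oldest element
        if (i < 0 || i >= size) throw new IndexOutOfBoundsException();
        return (T) slots[(head + i) % slots.length];
    }

    int size() { return size; }
}
```

In the steady state (window full), add() performs no allocation at all, which is where the savings over a linked queue come from.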

Lesson Learned 4: A single storage platform is usually insufficient; a heterogeneous storage architecture is necessary for managing large-scale vehicle data.

As a typical big data processing scenario, IoV needs to manage a huge quantity of vehicle data. Data management here faces severe challenges: the data volume is huge, and the applications are diverse, so different data-access requirements must be satisfied. For example, decision-making applications require high data-access throughput, while time-critical applications require low data-access latency. Based on our experience, there might not be a one-for-all storage platform that satisfies the requirements of all the applications, and a heterogeneous storage architecture is usually necessary for maximizing system performance.

We summarize two tips for improving database performance from an architecture perspective; a sketch of the first tip follows the list.

• Separate hot data from archived (cold) data; use in-memory caching as the exchange medium for the hot data.

• Avoid scanning a big dataset by extracting a smaller dataset with preprocessing.
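A minimal cache-aside sketch of the first tip follows; CacheClient and ArchiveStore are hypothetical stand-ins for, e.g., a Redis client and an HBase table, not CarStream's actual interfaces.

```java
// Cache-aside pattern: hot (recent) vehicle statuses are served from an
// in-memory cache, and only misses fall through to the archival store.
interface CacheClient { String get(String key); void put(String key, String value); }
interface ArchiveStore { String scan(String key); }

final class VehicleStatusReader {
    private final CacheClient hotCache;
    private final ArchiveStore coldStore;

    VehicleStatusReader(CacheClient hotCache, ArchiveStore coldStore) {
        this.hotCache = hotCache;
        this.coldStore = coldStore;
    }

    String latestStatus(String vehicleId) {
        String cached = hotCache.get(vehicleId);
        if (cached != null) return cached;      // hot path: no database touched
        String archived = coldStore.scan(vehicleId);
        if (archived != null) hotCache.put(vehicleId, archived); // warm the cache
        return archived;
    }
}
```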

Lesson Learned 5: A single platform may not be sufficient for the multiple requirements of IoV.

Simplicity is an important principle in system design: the benefits of a simple design include ease of maintenance, upgrade, and use. For a general computing system, high performance often comes with simplicity of design. However, an IoV system has multiple complex application requirements, and a simple design may not be sufficient to fulfill the requirements of each application. It is also hard for a monolithic system design to be flexible enough to keep up with changing service requirements. In CarStream, we therefore adopt a multi-platform design for the IoV applications.



We carefully evaluate the requirements of each application and choose the best-fit platform on which to deploy it. Designing in such a manner indeed increases system complexity and redundancy, but with a robust data bus, the processing of multiple data flows can be loosely coupled while together fulfilling the requirements of IoV.

6. RELATED WORK

IoV has become an increasingly important area in both industry and academia. IoV is, in fact, an integrated technology spanning big data processing, distributed systems, big data management, wireless communication, vehicle design, human-behavior analysis, etc. For cloud-based IoV, big data processing and management are especially critical.

Various high-performance platforms for big data processing have been proposed; some examples are Storm [5], Spark [4], Samza [3], S4 [32], Flume [2], and Flink [1]. However, it is still difficult to smoothly integrate different technologies into a system for complex IoV scenarios. Companies providing private car services, such as Uber [9] and Lyft [8], develop complex data-processing systems that integrate online processing, offline processing, and big data management into an ecosystem serving intelligent transportation.

As a key technology of IoV, system design for stream processing [31, 33, 34] is a challenging task. Cherniack et al. [21] discuss the architectural challenges in designing two large-scale distributed stream-processing systems (Aurora and Medusa [11]) and explore complementary solutions to tackle the challenges. At Google, a dataflow model [13] was designed to process unbounded, out-of-order data streams. Google also provides CloudIoT [7], a data-processing solution targeting IoT scenarios, which aims to provide a unified programming model for both batch and streaming IoT data sources. Twitter has been using Storm to process streaming data for years; Kulkarni et al. [30] discuss the limitations of Storm in processing the increasing volume of data at Twitter, and design and implement Heron, a new stream-processing engine that replaces Storm. For integrated online and offline processing, Twitter also proposes Summingbird [18], a framework that integrates batch processing and online stream processing. Kinesis [16] is an industrial big data processing platform provided by Amazon that runs on Amazon Web Services (AWS); it provides powerful functionality for loading and processing streaming data from IoT and other scenarios.

Chen et al. [20] discuss multiple design decisions made in Facebook's real-time data-processing system and their effects on ease of use, performance, fault tolerance, scalability, and correctness. Borthakur et al. [17] present Facebook Messages, a real-time user-facing application built on the Apache Hadoop platform. Arasu et al. [15] propose STREAM, an online stream-query system that uses the Continuous Query Language (CQL), an SQL-like language, to query continuous streaming data. Multi-stream fusion is an important problem in various stream-processing scenarios, including those in IoV [37]. Ananthanarayanan et al. [14] propose Photon, a fault-tolerant and scalable system for joining continuous data streams. Gedik et al. [27] investigate the scalable execution of windowed stream-join operators on multi-core processors, specifically the Cell processor. The management and analysis of big spatial data are important foundational capabilities for IoV. Fernandes et al. [24] recently presented TrafficDB, a shared-memory data store designed for traffic-aware services. Cudre-Mauroux et al. [22] propose TrajStore, a dynamic storage system optimized for efficiently retrieving all data in a spatio-temporal region. Hadoop-GIS [12] and LocationSpark [35] are two high-performance platforms, based on Hadoop and Spark respectively, designed for large-scale spatial data processing.

7. CONCLUSION

In this paper, we have described our experience in addressing the challenges of designing CarStream, an industrial system of big data processing for IoV, and in constructing multiple data-driven applications for chauffeured car services on top of this system. CarStream provides high-dependability assurance for safety-critical IoV services through a three-layered monitoring subsystem that covers from the application layer down to the infrastructure layer. CarStream further leverages in-memory caching and stream processing to address the issues of real-time processing, large-scale data, low data quality, and low density of value, and it manages the large volume of driving data with a heterogeneous data-storage subsystem. We have also shared the lessons learned in maintaining and evolving CarStream to satisfy the changing application requirements in IoV. So far, the system handles tens of thousands of vehicles. In future work, we plan to evolve CarStream into a micro-service architecture to better maintain the system and to develop new applications as the fleet scales up.

8. ACKNOWLEDGMENT

This work is supported by grants from the China Key Research and Development Program (2016YFB0100902) and the National Natural Science Foundation of China (91118008, 61421003). Tao Xie's work is supported in part by NSF under grants CCF-1409423, CNS-1434582, CNS-1513939, and CNS-1564274. We would like to thank UCAR Inc. for the collaboration and for the data used in this paper.

9. REFERENCES

[1] Apache Flink project. http://flink.apache.org/. Accessed: 2017-02-28.

[2] Apache Flume project. http://flume.apache.org/. Accessed: 2017-02-28.

[3] Apache Samza project. http://samza.apache.org/. Accessed: 2017-02-28.

[4] Apache Spark project. http://spark.apache.org/. Accessed: 2017-02-28.

[5] Apache Storm project. http://storm.apache.org/. Accessed: 2017-02-28.

[6] Ganglia monitor system. http://ganglia.info/. Accessed: 2017-02-28.

[7] Google Internet of Things (IoT) solutions. https://cloud.google.com/solutions/iot/. Accessed: 2017-02-28.

[8] Lyft engineering blog. https://eng.lyft.com/. Accessed: 2017-02-28.


[9] Uber engineering blog. https://eng.uber.com/. Accessed: 2017-02-28.

[10] Zabbix monitoring solution. http://www.zabbix.com/. Accessed: 2017-02-28.

[11] D. J. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik. Aurora: A new model and architecture for data stream management. The VLDB Journal, 12(2):120–139, 2003.

[12] A. Aji, F. Wang, H. Vo, R. Lee, Q. Liu, X. Zhang, and J. Saltz. Hadoop-GIS: A high performance spatial data warehousing system over MapReduce. Proc. VLDB Endow., 6(11):1009–1020, 2013.

[13] T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R. J. Fernandez-Moctezuma, R. Lax, S. McVeety, D. Mills, F. Perry, E. Schmidt, and S. Whittle. The dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proc. VLDB Endow., 8(12):1792–1803, 2015.

[14] R. Ananthanarayanan, V. Basker, S. Das, A. Gupta, H. Jiang, T. Qiu, A. Reznichenko, D. Ryabkov, M. Singh, and S. Venkataraman. Photon: Fault-tolerant and scalable joining of continuous data streams. In Proc. ACM SIGMOD International Conference on Management of Data, pages 577–588, 2013.

[15] A. Arasu, B. Babcock, S. Babu, J. Cieslewicz, M. Datar, K. Ito, R. Motwani, U. Srivastava, and J. Widom. STREAM: The Stanford data stream management system. In Data Stream Management, pages 317–336, 2016.

[16] J. Barr. Amazon Kinesis Analytics: Process streaming data in real-time with SQL. https://aws.amazon.com/cn/blogs/aws/amazon-kinesis-analytics-process-streaming-data-in-real-time-with-sql/. Accessed: 2017-02-28.

[17] D. Borthakur, J. Gray, J. S. Sarma, K. Muthukkaruppan, N. Spiegelberg, H. Kuang, K. Ranganathan, D. Molkov, A. Menon, S. Rash, et al. Apache Hadoop goes realtime at Facebook. In Proc. ACM SIGMOD International Conference on Management of Data, pages 1071–1080, 2011.

[18] O. Boykin, S. Ritchie, I. O'Connell, and J. Lin. Summingbird: A framework for integrating batch and online MapReduce computations. Proc. VLDB Endow., 7(13):1441–1451, 2014.

[19] J. L. Carlson. Redis in Action. Manning Publications Co., 2013.

[20] G. J. Chen, J. L. Wiener, S. Iyer, A. Jaiswal, R. Lei, N. Simha, W. Wang, K. Wilfong, T. Williamson, and S. Yilmaz. Realtime data processing at Facebook. In Proc. ACM SIGMOD International Conference on Management of Data, pages 1087–1098, 2016.

[21] M. Cherniack, H. Balakrishnan, M. Balazinska, D. Carney, U. Cetintemel, Y. Xing, and S. B. Zdonik. Scalable distributed stream processing. In Proc. CIDR, volume 3, pages 257–268, 2003.

[22] P. Cudre-Mauroux, E. Wu, and S. Madden. TrajStore: An adaptive storage system for very large trajectory data sets. In Proc. 26th IEEE International Conference on Data Engineering (ICDE), pages 109–120, 2010.

[23] D. J. Dean, H. Nguyen, P. Wang, and X. Gu. PerfCompass: Toward runtime performance anomaly fault localization for infrastructure-as-a-service clouds. In Proc. HotCloud, 2014.

[24] R. Fernandes, P. Zaczkowski, B. Gottler, C. Ettinoffe, and A. Moussa. TrafficDB: HERE's high performance shared-memory data store. Proc. VLDB Endow., 9(13):1365–1376, 2016.

[25] B. Fitzpatrick. Distributed caching with Memcached. Linux Journal, 2004(124):5, 2004.

[26] Q. Fu, J.-G. Lou, Q.-W. Lin, R. Ding, D. Zhang, Z. Ye, and T. Xie. Performance issue diagnosis for online service systems. In Proc. 31st IEEE Symposium on Reliable Distributed Systems (SRDS), pages 273–278, 2012.

[27] B. Gedik, P. S. Yu, and R. R. Bordawekar. Executing stream joins on the cell processor. In Proc. 33rd International Conference on Very Large Data Bases (VLDB), pages 363–374, 2007.

[28] M. Gerla. Vehicular cloud computing. In Proc. 11th Annual Mediterranean Ad Hoc Networking Workshop (Med-Hoc-Net), pages 152–155, 2012.

[29] D. Josephsen. Building a Monitoring Infrastructure with Nagios. Prentice Hall PTR, 2007.

[30] S. Kulkarni, N. Bhagat, M. Fu, V. Kedigehalli, C. Kellogg, S. Mittal, J. M. Patel, K. Ramasamy, and S. Taneja. Twitter Heron: Stream processing at scale. In Proc. ACM SIGMOD International Conference on Management of Data, pages 239–250, 2015.

[31] D. Namiot. On big data stream processing. International Journal of Open Information Technologies, 3(8):48–51, 2015.

[32] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed stream computing platform. In Proc. IEEE International Conference on Data Mining Workshops (ICDMW), pages 170–177, 2010.

[33] R. Ranjan. Streaming big data processing in datacenter clouds. IEEE Cloud Computing, 1(1):78–83, 2014.

[34] M. Stonebraker, U. Cetintemel, and S. Zdonik. The 8 requirements of real-time stream processing. ACM SIGMOD Record, 34(4):42–47, 2005.

[35] M. Tang, Y. Yu, Q. M. Malluhi, M. Ouzzani, and W. G. Aref. LocationSpark: A distributed in-memory data management system for big spatial data. Proc. VLDB Endow., 9(13):1565–1568, 2016.

[36] M. Whaiduzzaman, M. Sookhak, A. Gani, and R. Buyya. A survey on vehicular cloud computing. Journal of Network and Computer Applications, 40:325–344, 2014.

[37] J. Xie and J. Yang. A survey of join processing in data streams. In Data Streams, pages 209–236, 2007.

[38] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized streams: Fault-tolerant streaming computation at scale. In Proc. 24th ACM Symposium on Operating Systems Principles (SOSP), pages 423–438, 2013.
