
Managing Sensor Data Streams: Lessons Learned from the WeBike Project

Christian Gorenflo, Lukasz Golab and Srinivasan Keshav
University of Waterloo, Canada

[cgorenflo,lgolab,keshav]@uwaterloo.ca

ABSTRACT
We present insights on data management resulting from a field deployment of approximately 30 sensor-equipped electric bicycles (e-bikes) at the University of Waterloo. The trial has been in operation for the last two-and-a-half years, and we have collected and analyzed more than 150 gigabytes of data. We discuss best practices for the entire data management process, spanning data collection, extract-transform-load, data cleaning, and choosing a suitable data management ecosystem. We also comment on how our experiences will inform the design of a future large-scale field trial involving several thousand fully-instrumented e-bikes.

CCS CONCEPTS
• Information systems → Data management systems; • Computer systems organization → Sensor networks;

KEYWORDS
Data management for the Internet of Things (IoT), Data feed management, Time series data management

ACM Reference format:
Christian Gorenflo, Lukasz Golab and Srinivasan Keshav. 2017. Managing Sensor Data Streams: Lessons Learned from the WeBike Project. In Proceedings of SSDBM '17, Chicago, IL, USA, June 27-29, 2017, 11 pages.
DOI: http://dx.doi.org/10.1145/3085504.3085505

1 INTRODUCTION
Internet-connected devices and infrastructure such as smart phones and smart buildings, which collectively constitute the Internet of Things (IoT), generate massive amounts of sensor data about the physical world. IoT-based applications are highly data-driven and face a common set of challenges related to the collection, pre-processing, cleaning, and analysis of data feeds. The data management community has been addressing these challenges over the past decade or so, and has developed solutions ranging from acquisitional data processing in sensor networks [18] and stream and time series database engines [11] to cleaning and querying noisy and uncertain sensor data [15, 16].


In this paper, we complement this body of work by presenting lessons learned from a real IoT deployment: the University of Waterloo WeBike project [3]. In this project, a fleet of approximately 30 sensor-equipped electric bicycles (e-bikes)¹ were distributed to faculty, staff and students for their own use. Since the summer of 2014, our pilot has generated over 150 gigabytes of data, including GPS fixes, acceleration, temperature, and battery charge/discharge current and voltage. Data from these e-bikes were continuously collected and periodically analyzed to understand how e-bikes are being used and how they can be improved.

We believe that the insights and lessons learned from the WeBike project are of broad interest for several reasons. First, this is a real system deployed in the field that has to deal with the unavoidable messiness of the real world. Second, it has significant scale: about 30 e-bikes, and more than 150 gigabytes of collected data over two-and-a-half years. Third, the project encompasses the gamut from sensor system design and deployment to data collection, ingest, and analysis. This breadth of engagement has led to insights into every aspect of the data management process. Other projects that have a similar or more limited scale can benefit from our insights. Finally, though parts are perhaps obvious in 20-20 hindsight, the list of lessons serves as a useful checklist for other researchers working in the space of IoT data management.

Importantly, this project is a pilot for our long-term vision of collecting data streams from several thousand e-bikes deployed in various cities worldwide. Our ultimate goals are to use our distributed testbed not only to study e-bikes as an emerging mode of eco-friendly urban transportation, but also to use e-bikes (and possibly also regular bicycles, taxis, police cars, municipal vehicles, etc.) as a mobile sensing infrastructure for the natural and built environments.

The main contribution of this paper is a set of lessons learned and best practices informed by the WeBike project. In addition to presenting general lessons about the importance of conducting a pilot before proceeding with large-scale deployment, we discuss the following issues:

• Lessons about collecting data from mobile devices, such as choosing an appropriate sampling rate and including metadata with data.

• Best practices for preparing the collected data for ingest into a database, including timestamp standardization, selection of keys and identifiers, and the importance of exactly-once processing of data feeds in the Extract-Transform-Load (ETL) process.

¹ E-bikes are propelled by a combination of pedaling and battery-powered electric motors; the battery detaches from the bike and can be recharged from a regular power outlet.


(a) An eProdigy Whistler electric bicycle with a battery and sensor kit.

(b) Opened sensor box. Shown are the smart phone as the central sensor hub and additional sensors connected to a phidget board [2].

Figure 1: The e-bikes and sensor kits used in the WeBike project

• Selecting an appropriate data management ecosystem for IoT data. We provide a performance comparison of our choice—the InfluxDB time series database—with alternatives including a standard row-oriented DBMS (MySQL), a column-store (MonetDB) and a main-memory time series database used in the finance sector.

• Sensor data quality issues such as imprecise and missing data.

• Lessons about data analysis over sensor streams, including deriving higher-level events from raw data and drawing statistically sound conclusions.

The remainder of this paper is organized as follows: In Section 2, we provide more details about e-bikes, the WeBike project, and our long-term vision. Sections 3 through 8 present our lessons learned, as outlined above. Section 9 discusses related work in stream, sensor and time series data management. Section 10 concludes the paper with directions for future work.

2 THE WEBIKE PROJECT
We begin by describing the WeBike project, including the current pilot and our long-term vision. We argue that the data collection and analysis needs of this project are similar to those arising from a wide range of field-deployed, data-intensive sensor and IoT systems, and therefore our lessons are of broader interest.

2.1 Motivation
E-bikes are an interesting alternative to traditional bicycles and, in some situations, automobiles. Their desirable attributes include lower manufacturing and operating carbon footprint than even electric automobiles; zero carbon, particulate, and pollutant emissions during operation; much lower costs to purchase, own, and operate compared to cars; and potential to improve rider health and mobility since they typically assist rather than replace pedaling. E-bike adoption is accelerating globally, with over 200 million being used in China alone [1]. However, e-bike sales in North America remain relatively low and little is known about e-bike ridership patterns and their implications within North American cities. As a result, municipalities lack data to make evidence-based decisions and recommendations. To fill this gap, the goal of the WeBike project is to collect real-world data to understand e-bikes and their potential benefits as part of a sustainable urban transportation system.

2.2 Overview
In the summer of 2014, we started a pilot project at Waterloo with a fleet of 31 sensor-equipped e-bikes. Over 100 prospective participants were asked to fill out a questionnaire about their current transportation modes and their interest in e-bikes. 31 participants were selected for the field trial: 18 male and 13 female, comprising 16 faculty or staff members and 15 graduate students. The pilot project will run for 3 years, i.e., until August 2017.

Figure 1 illustrates the e-bikes used in our project and our custom-built sensing hardware located in a box attached to the battery. The bikes have the general appearance of regular mountain bikes, with the exception of the motor integrated into the pedal axle, a detachable battery mounted on the frame, and a throttle button and digital speedometer-odometer mounted on the handlebars. Each bike weighs 21 kg (including the 2.5 kg battery).

The bikes may be used in three modes: human-powered (which takes more effort than pedaling a regular bicycle because e-bikes are heavier), all-electric (by pressing the throttle), or hybrid. In hybrid mode, the motor comes on only when the rider is pedaling, to provide assistance. The battery detaches from the bike and can be recharged from empty in about 5 hours using a regular power outlet. The manufacturer-reported maximum range is 45 km.

We mounted the sensor box onto the battery to prevent theft: riders can park their bikes, detach the battery, sensor box in tow, and take it with them for recharging. The main component of the sensor kit is an Android-based Samsung Galaxy S3 smart phone. We chose to use a smart phone because of its integrated sensors (GPS, clock, gyroscope, accelerometer and magnetometer), built-in connectivity, ease of writing custom software and relatively low cost. Additional sensors for measuring battery/ambient temperature and charge/discharge current and voltage are connected to a phidget board [2] and are controlled by the software on the phone.


Figure 2: WeBike dataflow. The smart phones mounted on the e-bikes transfer sensor data in batches to a staging server via wi-fi. There, an import script pre-processes and loads incoming data into a database. The analysis and visualization tools are layered on top of a database system.

Because the WeBike project is long-running and the participants are volunteers, our goal was to design a non-intrusive data collection platform. Thus, the phone automatically collects data from the sensors and uploads it whenever it is within range of the Waterloo campus wi-fi network or a participant's home network.

With the data collected so far, we have recently published results on e-bike usage and battery-charging patterns [7, 20] and range estimation for e-bikes [10].

2.3 Long-Term Vision
Our long-term objective is to use big data to support public policy development in the area of e-bikes. Five subsidiary short-term research objectives underlie this broad theme:

(1) To help to improve sustainable urban transport systems by understanding e-bike riding behaviour

(2) To support the public sector in planning, developing, and effectively governing transportation infrastructure using evidence-based policies

(3) To understand the potential public-health benefits of increased e-bike adoption. This will be based on pre-post and/or case-control study designs with populations who face health and social inequalities, such as older adults, low-income youth and immigrant families

(4) To use e-bikes, regular bikes, taxis, police cars and municipal vehicles for monitoring the urban environment

(5) To understand and improve e-bike and battery performance

To do so, we will scale out our pilot project in two dimensions. First, to compare different population groups, locations and regulatory contexts, we plan to deploy bikes in several cities around the world. Second, in addition to location, movement and battery usage, we will install a wider variety of sensors, leading to more data streams per bike. Examples of new sensors and data sources include cameras (front and rear), proximity, noise and pollution sensors, and heart rate monitors. In addition to usage and charging habits, this will allow us to investigate e-bike safety and health issues; e.g., near-accidents, stressful situations, and health benefits.

2.4 Dataflow
Figure 2 presents the dataflow of the WeBike project. Data collected by the phone and the sensors mounted on the bikes are buffered locally and sent to a staging server whenever a wi-fi connection is available. The staging server runs a script that monitors the staging folder for new files and performs extract-transform-load operations. Raw data is archived on the staging server and processed data is loaded into a database, which includes sensor data streams as well as relational metadata (e.g., the name of the participant using each e-bike). Finally, visualization and analysis software runs on top of the database. Examples of visualization and analysis tasks include real-time status monitoring (e.g., last known timestamp and location of each bike), detection of trip start and end times from raw data and off-line analysis of travel and charging patterns.

In our pilot project, no processing is done locally at the bikes. However, as we scale the number of bikes and the number of sensors per bike, sending all raw data, including video footage, to the staging server will become infeasible. We expect to develop new techniques for deciding which data pre-processing and perhaps even analytical tasks to move to the data sources.

2.5 Generalization
We now argue that the WeBike data platform, both in the pilot and in the long term, is similar to that of a wide variety of IoT and sensor applications:

Data collection: IoT applications involve a multitude of distributed sensors and devices. In WeBike, the bikes form a moving distributed sensor platform.

Data transfer: While some computation may be done at the network edges, most IoT applications transfer raw data to a central server. In case of intermittent connectivity (as in WeBike), data sources accumulate data in appropriately-sized local data buffers. This naturally leads to a batched streaming approach for data collection and transfer.

Data management and analysis: Most data-intensive applications pre-process and load data into a (centralized or distributed) database system, with visualization and analytics applications running on top of the database. Both relational (e.g., sensor metadata, project participant data) and time series data (e.g., sensor streams) must be stored in the database(s).

In the remainder of this paper, we will follow Figure 2 from data collection to visualization and analysis, and describe the lessons we learned along the way. In some cases that are specific to our project setup, we generalize our findings to make them applicable to a wide range of distributed sensor projects.

3 GENERAL LESSONS
Any large-scale IoT project involving field deployments is subject to a considerable degree of uncertainty, not only due to evolving project goals but also due to the rapid evolution of IoT technologies. Indeed, as sensor technologies improve and decrease in price, the increasing ability to collect various types of sensor data often suggests more ambitious project goals, leading to time-varying data requirements. Thus, it is rarely clear at the outset of a project what physical values to measure or how best to do it. For example, one might know that a current is to be measured, but to what precision? Over what allowable range? And using which kind of sensor?

We believe that it is hard to get the design of a large-scale IoT project right on the first try. At the same time, it is very challenging to re-deploy hundreds or thousands of distributed sensors. This leads to an important lesson:

Lesson 1: Do a pilot project first. Every large-scale project should be preceded by a thorough pilot study. In the pilot, it is not sufficient to select a single design path and do a small-scale cost-conscious deployment. Instead, we suggest experimenting with different hardware technologies (e.g., different types of sensors that measure the same thing) and different software approaches. We took this approach in our own work, and what follows are the lessons from the pilot that we will use for the main deployment.

Even during a pilot project, the software running on the distributed devices may need to be upgraded. It is usually impossible to do so by manually collecting the devices and updating them in a lab. This motivates the next lesson.

Lesson 2: Enable easy field updates. To allow software updates to deployed systems, we wrote an Android app that periodically connects to our main server and side-loads onto the device all updates and new apps it finds in a specified folder on the server. Having a robust field-upgrade mechanism turned out to be critical, and we used this functionality extensively (the devices are now running version 24 of our software).

4 DATA COLLECTION LESSONS
We now discuss lessons regarding collecting data from devices, i.e., the part of the dataflow corresponding to the left-most part of Figure 2.

There is a natural tradeoff between energy consumption and sampling frequency. In general, sensing devices have small batteries with limited lifetimes. In WeBike, one advantage is that the sensor kit and the smart phone are powered by the e-bike battery.

Nevertheless, collecting data too often, say, every second, would quickly drain the bike's battery and decrease the bike's range. Additionally, when a bike is parked and not being charged, it may not be necessary to collect any data. This suggests using a variable sampling rate, as discussed next.

Lesson 3: Use variable sampling rates. We suggest the following methodology for variable-rate sampling. Start out with a sampling rate that is as high as reasonably possible with the given sensor setup. The first step is to identify significant events in the raw data stream that should trigger data collection. Second, choose a maximum delay, call it d, that still allows the beginning of these events to be identified. Given these parameters, the default sampling interval is d. However, whenever indicators in the data stream show that an event may be starting, sampling should switch to a shorter active interval a. When the indicators show that the event has ended, there should be a cool-down period c to compensate for inaccurate sensor readings, followed by a return to the default sampling interval d. We summarize this process in Algorithm 1.

Algorithm 1 Variable Sampling Pseudo Code
function VarSampling
    d ← sampling interval during standby
    a ← active sampling interval
    c ← cool down period
    i ← set of event indicators
    loop
        if any(i) == true then
            t ← a
            c ← cool down period
        else if (all(i) == false) ∧ (c > 0) then
            t ← a
            c ← c − a
        else
            t ← d
        end if
        RecordSensorData
        sleep(t)
    end loop
end function

In WeBike, the two events of interest are trips and charging sessions. A trip starts when the battery begins to discharge (i.e., when the discharge current sensor exceeds a nonzero threshold) or when the accelerometer or the gyroscope detect movement. Thus, there are three trip indicators. A charging event starts when the charging current exceeds a nonzero threshold. Our standby sampling interval, d, is one minute: every minute, the phone wakes up for four seconds to collect data from all the sensors². If an event is detected, we increase the sampling rate to once per second. We continue collecting data every second until the sensors do not register any event for 5 minutes, and then return to sampling every minute.

² We found that acquiring a GPS fix may take a few seconds, which is why we collect sensor data for four seconds at a time during standby periods.


The 5-minute cooldown period accounts for missing or inaccurate sensor data, as well as for issues such as an e-bike stopping at a traffic light, which we do not want to conflate with the end of a trip. In general, multiple events might be indicated by independent sensors that may operate on different time scales. In that case, we can define different sampling rates per event and wake only the necessary sensors.
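To make the policy concrete, here is a minimal Python sketch of Algorithm 1 instantiated with the WeBike parameters above (60-second standby interval, 1-second active interval, 5-minute cool-down). The indicator callables and the read_sensors/store functions are placeholders for illustration, not the project's actual Android code.

import time

STANDBY_INTERVAL = 60      # d: seconds between samples while idle
ACTIVE_INTERVAL = 1        # a: seconds between samples during an event
COOL_DOWN = 5 * 60         # c: keep sampling fast this long after indicators go quiet

def variable_sampling(indicators, read_sensors, store):
    """Sample fast while any event indicator fires, then cool down (Algorithm 1)."""
    cool_down_left = 0
    while True:
        if any(ind() for ind in indicators):   # e.g., discharge current, accelerometer, gyroscope
            interval = ACTIVE_INTERVAL
            cool_down_left = COOL_DOWN         # event (re)started: reset the cool-down
        elif cool_down_left > 0:
            interval = ACTIVE_INTERVAL
            cool_down_left -= ACTIVE_INTERVAL  # no indicator firing: count down before slowing
        else:
            interval = STANDBY_INTERVAL        # standby
        store(read_sensors())
        time.sleep(interval)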

Since connectivity between sensors and the data-collection backend can rarely be guaranteed, IoT applications typically buffer data at the sources and send data to a staging server in batches. Each batch may correspond to one or more files. Hardware and software may change during the course of a project, so even if a thorough pilot study has been done, the data format, such as the number of fields, may also change. Thus, we suggest making each data file self-contained and self-documented. This also decouples distributed sensing from data collection.

Lesson 4: Include metadata with data. We recommend including a header with every file indicating column names and possibly also data types. Additionally, file names can encode metadata: in WeBike, file names include the ID of the e-bike generating the data and the timestamp of the most recent record. We further added a format version number to every file and maintained a table with metadata about prior formats in case we need to reload old data.
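As a small illustration of this lesson, the Python sketch below writes a self-documented batch file; the column set, the exact file-naming scheme and the version number are assumptions for the example, not the project's actual format.

import csv

FORMAT_VERSION = 3  # hypothetical format version; bump whenever the schema changes

def write_batch(bike_id, records, columns):
    """Write one upload batch as a CSV file that documents itself.

    records: list of dicts keyed by column name; each must contain a 'timestamp' datetime.
    """
    latest = max(r["timestamp"] for r in records)
    # File name encodes metadata: bike ID, newest record timestamp, format version.
    fname = f"{bike_id}_{latest.strftime('%Y%m%dT%H%M%SZ')}_v{FORMAT_VERSION}.csv"
    with open(fname, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([f"# format_version={FORMAT_VERSION}"])  # header: format version
        writer.writerow(columns)                                 # header: column names
        for r in records:
            writer.writerow([r[c] for c in columns])
    return fname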

5 EXTRACT-TRANSFORM-LOAD LESSONS
Next, we focus on the data pre-processing step at the staging server, after raw data files have arrived and before processed data is loaded into a database. The Extract-Transform-Load (ETL) step is a useful layer of abstraction that decouples data collection from database design.

Lesson 5: Never delete raw data. We recommend archiving raw data in a data warehouse that is independent of the database system. This makes it possible to re-import raw data when necessary, due to changes in the database design or the database management system. In WeBike, we simply saved raw files in the file system, even after ingestion.

Lesson 6: Log error messages. Since different applications may use different sensors and data formats, many IoT projects, WeBike included, use home-grown ETL scripts. We found it critical to have such scripts generate and log informative error messages (e.g., incorrect data type or value out of expected range — these can happen due to sensor failures). Data mining the log files helped us identify systematic errors in data collection.

Lesson 7: Ensure exactly-once processing of the inputs. It is very important that the ETL process is idempotent, that is, all staged records must be loaded into the database and each record must be loaded once. This means that if an ETL script crashes half-way, it should be possible to simply re-run it for full recovery. Likewise, attempting to load a file that has already been loaded should not lead to duplicate data in the database. Idempotency is not difficult to achieve. For example, in WeBike, the ETL process monitors a specific folder on the staging server for new data, and when it arrives, bulk-loads data into the database one file at a time, moving files to an archive folder after they have been loaded. Should this script crash, re-running it uploads only the unprocessed files. Moreover, each file is uploaded using a single transaction, so that at-most-once semantics are achieved.
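The following is a minimal Python sketch of such a loader, assuming a DB-API connection whose context manager wraps a transaction and a hypothetical load_file() that parses and bulk-inserts one file; the paths and names are illustrative, not the project's actual script.

import shutil
from pathlib import Path

STAGING = Path("/data/staging")   # devices upload new files here (illustrative path)
ARCHIVE = Path("/data/archive")   # raw files are kept forever (Lesson 5)

def etl_pass(conn, load_file):
    """Load every staged file once; safe to simply re-run after a crash."""
    for path in sorted(STAGING.glob("*.csv")):
        with conn:                      # one transaction per file: all rows or none
            load_file(conn, path)       # parse, transform and bulk-insert the file
        # The file leaves the staging folder only after the commit succeeds, so a crash
        # before this point just means the file is re-processed on the next run. A crash
        # between commit and move could load a file twice; a unique key on, e.g.,
        # (bike ID, timestamp) in the target table guards against such duplicates.
        shutil.move(str(path), ARCHIVE / path.name)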

The above lessons refer to general properties of the ETL process. We now present two lessons about the transformations themselves.

Lesson 8: Use standardized timestamps. A common property of data streams is that each record includes at least one timestamp. We recommend standardizing all timestamps, e.g., into UTC, to avoid problems with different timezones or transitions to and from daylight savings time.
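As a small illustration (not the project's actual transformation), device-local timestamps can be normalized to UTC during ETL; the device timezone here is an assumed example.

from datetime import datetime
from zoneinfo import ZoneInfo

DEVICE_TZ = ZoneInfo("America/Toronto")  # assumed timezone of the recording device

def to_utc(local_string):
    """Parse a device-local timestamp string and standardize it to UTC."""
    local = datetime.strptime(local_string, "%Y-%m-%d %H:%M:%S").replace(tzinfo=DEVICE_TZ)
    return local.astimezone(ZoneInfo("UTC"))

# Example: to_utc("2016-07-01 14:30:00") yields 2016-07-01 18:30:00+00:00 (EDT is UTC-4).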

Lesson 9: Use opaque keys. We recommend using data keys that do not reveal potentially sensitive information. In WeBike, each bike is associated with a participant, but identified by the IMEI of the phone in its sensor kit throughout the database. Additionally, there is a private table that maps IMEIs to participants. This makes it easier to share non-sensitive data with other researchers while maintaining a mapping to private data about the participants³.
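A minimal sketch of this separation is shown below, with illustrative table and column names (not the project's actual schema); only the private mapping table ever stores participant identities.

# Illustrative DDL, e.g., executed over a MySQL connection.
SHARED_SCHEMA = """
CREATE TABLE trips (
    trip_id     BIGINT PRIMARY KEY,
    imei        VARCHAR(20) NOT NULL,   -- opaque device key, safe to share
    start_ts    DATETIME NOT NULL,
    end_ts      DATETIME NOT NULL,
    distance_km DOUBLE
)
"""

PRIVATE_SCHEMA = """
CREATE TABLE participant_map (          -- kept in a private database, never shared
    imei        VARCHAR(20) PRIMARY KEY,
    participant VARCHAR(100) NOT NULL
)
"""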

6 DATABASE SYSTEM LESSONS
We now move on to issues regarding data management systems for IoT data. We start with an important observation about data variety.

Lesson 10: It's not just time series! In addition to numeric time series with sensor measurements, be prepared to manage relational, row-oriented data such as sensor metadata. Additionally, and perhaps less obviously, higher-level events derived from raw data streams tend to be row-oriented (e.g., bike trips, each having a trip identifier, bike ID, and descriptive attributes such as trip distance and duration).

As a consequence of the above, IoT projects may need to use two different database systems: a time series column-oriented database for efficient ingestion and processing of sensor streams, and a row-oriented relational database such as MySQL for other datasets. However, there is more to selecting a data management infrastructure than just performance:

Lesson 11: It's not just about performance! Select a data management platform that meets application performance needs and provides an ecosystem of tools and applications; e.g., tools for exploring and visualizing time series data. As we will show later, this will make data analysis easier.

In WeBike, we use MySQL for row-oriented data and InfluxDB for time series. We do not use a data stream management system because we also need historical time series processing functionality; however, we revisit this issue in Section 10.

6.1 Performance Comparison
In the remainder of this section, we explain why we selected InfluxDB to manage sensor data, including a performance comparison with other candidates: MySQL (as a representative of traditional relational databases), MonetDB (a column store), and "FinanceDB" (a closed-source main-memory time series database, anonymized per their request). We test single-server versions of these systems, but they are all available in distributed versions.

³ For example, we have shared non-sensitive data such as trip and battery charging durations with other researchers. However, we consider GPS fixes to be sensitive, even if associated with phone IMEIs rather than participant names.


(a) Total table sizes; one week of sensor data with per-second measurements. (Bar chart; y-axis: total table size in GB; series: single table with IMEI as index vs. separate tables per IMEI.)

(b) Average time to insert a one-second batch of data. (Bar chart; y-axis: execution time in s; same two series.)

Figure 3: Comparing size and insert speeds for different databases and two versions of data slicing.


We carry out our performance comparison on our staging server, which has two Intel Xeon E5620 2.4 GHz processors and 100 GB of RAM. Each processor consists of 4 cores, adding up to 16 hyperthreads in total. We created a synthetic workload trace of one week of sensor data as collected by our platform, sampled once per second, from 1000 e-bikes. Each record is 500 bytes long and consists of 30 sensor values; the number of records is:

#rows = #bikes × #seconds per week = 1000 × 604,800 = 604,800,000
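A trace of these dimensions can be produced along the following lines; this rough Python sketch uses an illustrative schema and uniformly random sensor values, not the actual generator used for the experiments.

import csv
import random
from datetime import datetime, timedelta, timezone

N_BIKES = 1000
N_SENSORS = 30
SECONDS_PER_WEEK = 7 * 24 * 3600            # 604,800
START = datetime(2017, 1, 1, tzinfo=timezone.utc)

def generate_trace(path):
    """Write one week of per-second readings for all bikes (604,800,000 rows)."""
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["imei", "timestamp"] + [f"sensor_{i}" for i in range(N_SENSORS)])
        for s in range(SECONDS_PER_WEEK):
            ts = (START + timedelta(seconds=s)).isoformat()
            for bike in range(N_BIKES):
                w.writerow([f"bike{bike:04d}", ts] +
                           [round(random.random(), 3) for _ in range(N_SENSORS)])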

We test two designs: (1) a separate table for each e-bike, identified by the phone's IMEI, and (2) one table for all time series, indexed by IMEI. InfluxDB splits tables by tags, its index equivalent, into separate entities. Thus, it follows design (1) by default, and therefore we did not test it with design (2).

6.1.1 Experiment 1: Database size. Figure 3a shows the size of the different databases for both designs: one table vs. separate tables per bike. In the former, each system has a very similar size; in the latter, MySQL and FinanceDB incur some overhead.

6.1.2 Experiment 2: Data ingest. Figure 3b shows the average insertion time of one-second batches of data (from all bikes). To keep up with the data stream, inserting each batch must take under one second. Using a single table for all the time series, every system was able to insert one-second batches in under a second. However, using separate tables per bike, MySQL and MonetDB took several seconds, while FinanceDB did not terminate in reasonable time (and hence the corresponding bar is missing from Figure 3b). We conclude that appending a batch of data into a single table is much more efficient than separately appending new data from each source to a separate table.

Table 1: Benchmarked queries. Every query was run with and without concurrent inserts.
  Qd   Select GPS (latitude and longitude) data for a specific bike for a single day.
  Qdc  Qd with a constraint on the discharge current value (simulating trip identification).
  Qw   Qd for a specific bike but over the whole dataset (1 week).
  Qwc  Qdc for a specific bike but over the whole dataset (1 week).

6.1.3 Experiment 3: Query performance. Next, we consider typical IoT query workloads. Usually, many different measurements are collected, but only a small number of them (i.e., a small number of columns in the table) are of interest for any particular analysis. Thus, it is important to test queries with projection operations. Additionally, detecting interesting events from raw measurement data typically requires checking whether certain measurements exceed a threshold (recall Lesson 3 and Algorithm 1). Thus, IoT queries are likely to include various selection predicates.

We have created a simple IoT benchmark consisting of four queries, as described in Table 1. The first query selects the GPS latitude and longitude for one bike on the most recent available day. The second query adds a predicate on another column (discharge current > currthresh, to simulate trip identification). The final two queries are as above, but over the entire dataset (1 week) instead of the most recent day. We run each query with and without concurrent inserts, repeated 1000 times (once for each bike).

We used one table for all bikes. In MySQL and MonetDB, we created indices on bike ID and timestamp. In FinanceDB, we only created an index on bike ID, as the timestamp attribute is indexed by default. Similarly, InfluxDB treats timestamps in a special way, so we only tagged each time series with its bike ID, which amounts to creating an index on bike ID.
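The paper does not list the exact query text; in SQL, Qd and Qdc could look roughly as follows, assuming a single table sensor_data(imei, ts, latitude, longitude, discharge_current, ...) and DB-API-style parameter placeholders. Qw and Qwc simply widen the time bounds to the full week.

QD = """
SELECT latitude, longitude
FROM sensor_data
WHERE imei = %s
  AND ts >= %s AND ts < %s          -- bounds of the most recent day
"""

QDC = """
SELECT latitude, longitude
FROM sensor_data
WHERE imei = %s
  AND ts >= %s AND ts < %s
  AND discharge_current > %s        -- currthresh, simulating trip identification
"""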

Figure 4a shows the query performance of MySQL. Given a query-only workload, queries over the most recent day are faster than those over the whole dataset (1 week), which is not surprising. Concurrent inserts affect queries over the most recent day; we ran the queries in parallel to new data being inserted, so the query runtimes likely include the cost of writing new data to disk and updating the indices.

As shown in Figure 4b, for the queries-only workload, MonetDB is about an order of magnitude faster than MySQL (notice the millisecond y-scale). Surprisingly, queries over the whole dataset were a bit faster than those over a single day. A possible explanation is that MonetDB chose not to use the index on the timestamp column (it autonomously assigns indices and takes user input only as a suggestion). Furthermore, our benchmark crashed when we attempted to run queries and concurrent updates in MonetDB; thus, Figure 4b shows no results for concurrent inserts.

Figure 4c illustrates the performance of FinanceDB; notice the millisecond y-scale. Without concurrent inserts, queries in FinanceDB take under 10 milliseconds, which is 3-4 orders of magnitude faster than in MySQL and MonetDB. The performance of single-day queries suffers from concurrent inserts, but is still faster than the other systems.

An additional insight can be gained by comparing the execution times of Qd and Qdc: Qdc performs significantly better even though there is no index on discharge current. This is because, due to FinanceDB's column storage and vectorization, the discharge current column is essentially a vast integer vector on which a comparison with a scalar can be cheaply performed.

Results for InfluxDB are shown in Figure 4d. It turned out to be one of the slowest systems. However, we decided that a runtime of 5 seconds for queries over the current day is acceptable. Also, query performance in InfluxDB was not affected by concurrent updates.

6.2 Selection of Data Management Platform
As per Lesson 11, we selected InfluxDB to store time series data (and MySQL for row-oriented data). FinanceDB was fastest by far, but has a proprietary language and lacks an associated tool ecosystem. On the other hand, InfluxDB insertion times were fast enough. Its query runtimes, although slower than other systems, were deemed acceptable. Additionally, InfluxDB includes a first-party open source toolchain for data importing (Telegraf) and stream processing (Kapacitor), with predefined transformations and alerts and support for external scripts (more on this in Section 7). There are also open-source visualization tools for InfluxDB (Grafana) which enable visual dashboards containing graphs, tables and maps. Grafana is plug-in based, which means it can be extended if the desired visualization is missing.

We conclude this section with a brief description of how we integrated InfluxDB with MySQL. Our ETL scripts load new data into the InfluxDB database, which is monitored by Kapacitor. Kapacitor then streams the newly arrived data through our custom script that detects trips from raw data. As new trips are detected, the script assembles trip tuples (with trip IDs, trip start times, trip end times, etc.) and writes them to the corresponding MySQL table.
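A minimal sketch of the final step is given below, assuming a PyMySQL connection and a hypothetical trips table; it is an illustration, not the project's actual Kapacitor-driven script.

import pymysql

INSERT_TRIP = """
INSERT INTO trips (trip_id, imei, start_ts, end_ts, distance_km)
VALUES (%s, %s, %s, %s, %s)
"""

def write_trip(conn, trip):
    """Persist one detected trip tuple to the row-oriented store (MySQL)."""
    with conn.cursor() as cur:
        cur.execute(INSERT_TRIP, (trip["trip_id"], trip["imei"],
                                  trip["start_ts"], trip["end_ts"],
                                  trip.get("distance_km")))
    conn.commit()

# conn = pymysql.connect(host="localhost", user="webike", password="...", db="webike")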

7 DATA QUALITY LESSONS
We now present several lessons regarding data quality, that is, dealing with bad (incorrect or imprecise) data and missing data.

We suspect that most IoT projects are likely to suffer from bad and missing data. One way to mitigate their impact is to build redundancy into the sensing platforms, by deploying several different sensors to capture similar information:

Lesson 12: Collect redundant data. This can be done by duplicating sensors (e.g., putting several temperature sensors in the same area decreases the impact of a single sensor failure) or by deploying different sensors to measure the same quantity or event. In WeBike, we exploited the latter to detect trips from raw data. We collect battery discharge current as well as movement and acceleration: either one can be used to detect the beginning of a trip.

Next, we zoom in on various causes of bad and missing data:

Lesson 13: Pay attention to sensor precision. Note that sensor precision and accuracy may vary throughout the sensing range. For example, our discharge current sensor has a wide sensing range, but turned out to be imprecise for very low values. It can detect when the e-bike motor is running (high discharge current), but cannot distinguish zero discharge current from the very small discharge current due to the charging of the smart phone. This prevented us from doing precise Coulomb counting (i.e., aggregating the discharge current) to estimate the battery state of charge.
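To see why low-range imprecision matters, recall that Coulomb counting integrates current over time, so even a small constant error accumulates. The sketch below uses illustrative numbers (a 10 Ah battery and a 0.05 A bias), not measured values.

def coulomb_count(soc_start, currents_a, dt_s, capacity_ah):
    """Estimate state of charge by integrating discharge current over time."""
    soc = soc_start
    for i in currents_a:                        # one current sample every dt_s seconds
        soc -= (i * dt_s) / 3600.0 / capacity_ah
    return soc

# A parked bike for 24 hours: the true current is zero, but a 0.05 A spurious reading
# (below the sensor's resolution at low currents) drifts the estimate by
# 0.05 * 86400 / 3600 / 10 = 12% of capacity per day.
biased = [0.05] * 86400
print(coulomb_count(1.0, biased, 1, 10.0))      # ~0.88 instead of 1.0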

Lesson 14: Be careful with zero values. A zero temperature, voltage or current reading could be correct, or it could indicate a malfunctioning sensor. If possible, avoid using zero to denote bad or missing data.

Lesson 15: Assume GPS data is bad. GPS, especially in urban areas, can have considerable error, up to hundreds of meters. Moreover, even if there is nothing wrong with the hardware, it may take several minutes to acquire a GPS signal, and the signal may be lost in a tunnel or around tall buildings.

Figure 5 illustrates the extent of missing GPS data in the WeBike dataset. The figure shows several trips for a particular bike. Each trip includes a start time and an end time, as well as a sequence of markers. One marker corresponds to one minute. A red cross indicates no GPS data during that minute; otherwise there is a green dot. Nearly every trip has red crosses at the beginning, corresponding to the time needed to acquire the GPS signal. There is also missing GPS data in the middle of some trips.

A simple way to impute missing GPS data at the beginning of a trip is to use the position at the end of the previous trip. We refer the interested reader to [10] for further details of reconstructing trip trajectories given missing GPS data (with the help of a street map).
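A minimal sketch of that simple imputation follows, assuming each trip carries a list of per-minute (lat, lon) fixes with None for missing minutes; this illustrates the idea and is not the trajectory reconstruction method of [10].

def impute_trip_starts(trips):
    """Fill missing fixes at the start of each trip with the previous trip's last fix."""
    last_known = None
    for trip in trips:                            # trips of one bike, ordered by start time
        fixes = trip["fixes"]                     # list of (lat, lon) or None, one per minute
        i = 0
        while i < len(fixes) and fixes[i] is None and last_known is not None:
            fixes[i] = last_known                 # the bike likely starts where it last stopped
            i += 1
        known = [f for f in fixes if f is not None]
        if known:
            last_known = known[-1]
    return trips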

We conclude this section by highlighting the importance of data quality dashboards.


(a) MySQL database. Concurrent inserts have a high impact on querying current data. (Bar chart per query Qd, Qdc, Qw, Qwc; y-axis: execution time in s; series: queries only vs. queries + concurrent inserts.)

(b) MonetDB database. Concurrent inserts while querying lead to errors and are therefore omitted. Notice the millisecond timescale.

(c) FinanceDB database. Queries Qw and Qwc with concurrent inserts are identical to queries only, because historical data is in a different database partition than current data and is therefore unaffected by the new inserts. Notice the millisecond timescale.

(d) InfluxDB database. Performance is virtually the same with and without concurrent inserts.

Figure 4: Query performance results for all 4 databases. Queries were run on an idle system without concurrency and on a system that concurrently executed per-second inserts into the database.

Lesson 16: Maintain a status dashboard. To help detect bad and missing data collected from mobile devices, we recommend a visual dashboard that keeps track of each device: its most recent position, the timestamp and value of its most recent measurements, etc.

Figure 6 shows a screenshot of the WeBike status dashboard implemented in Grafana. The dashboard shows the software version and latest GPS coordinates of each bike. The latest timestamp is also shown, indicating when the bike last transmitted data. Other dashboard views include the latest battery level (participants whose batteries are empty may be reminded to recharge them). We found this dashboard essential for managing our field trial.

8 DATA ANALYSIS LESSONS
We now present the last set of lessons, referring to issues we encountered when analyzing the data.

We start with an issue that affects time series data. It is natural to collect data in a way that obeys temporal boundaries such as days or hours. However, interesting events in the data may persist across boundaries. For example, when detecting trips from raw WeBike data, we cannot process each day individually in case there is a trip that begins before midnight and ends after midnight (otherwise, we would incorrectly split this trip into two). Effectively, this means that our trip detection script, which runs continuously as new data arrive, must be stateful. This leads to the following lesson:


Figure 5: Visualization of GPS coverage for a single e-bike. Every sequence of markers is a trip; every marker corresponds to one minute. Green dots indicate at least one GPS fix during that minute. Red crosses indicate no GPS data.


Lesson 17: Be careful with temporal partitioning. When analyzing time series data, avoid partitioning the stream and dealing with each part independently, unless you are certain that events of interest do not span multiple parts.
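For example, a stateful detector can carry an open trip across daily (or per-batch) boundaries. The Python sketch below assumes records with a timestamp and a discharge current and a single illustrative threshold; it is a simplified illustration, not the project's actual detection script.

CURR_THRESH = 0.5   # assumed discharge-current threshold (amps) indicating riding

class TripDetector:
    """Detect trips from a record stream while keeping state across batches (Lesson 17)."""

    def __init__(self):
        self.open_trip = None                    # an unfinished trip survives batch boundaries

    def process_batch(self, records):
        """records: dicts with 'ts' and 'discharge_current', in timestamp order."""
        finished = []
        for r in records:
            riding = r["discharge_current"] > CURR_THRESH
            if riding and self.open_trip is None:
                self.open_trip = {"start_ts": r["ts"]}
            elif not riding and self.open_trip is not None:
                self.open_trip["end_ts"] = r["ts"]
                finished.append(self.open_trip)  # a trip spanning midnight closes in a later batch
                self.open_trip = None
        return finished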

Our final lesson refers to drawing conclusions from data.

Lesson 18: Pay attention to sample size. IoT projects often collect data from human participants or data about human behaviour. It is important to remember that the collected dataset only represents a (biased) sample of a larger population. As a result, not every finding is statistically significant.

For a simple example, our WeBike dataset suggests that people cycle less in August. However, this may not be representative of the general population: our participants are university members who tend to go on holidays in August.

9 RELATED WORK
To the best of our knowledge, this is the first paper to present a variety of data management lessons from a real IoT deployment, ranging from data collection to analysis. However, there is complementary prior work on reviewing data management and mining methods for sensors [4] and IoT [8], and several papers describing IoT deployments (but focusing on software engineering, hardware testing or end-to-end domain-specific solutions) [6, 14, 17]. For example, Langendoen et al. [17] deployed a large-scale sensor network of over 100 sensors in agricultural farming as a pilot project over a one-year period. They too present lessons learned and stress the importance of a pilot, but focus on software engineering and hardware testing. Another work applied to agriculture, by Jayaraman et al. [14], presents an out-of-the-box solution, SmartFarmNet, for data collection and analysis of agricultural data. Finally, Barrenetxea et al. [6] present a deployment guide for wireless sensor network (WSN) researchers. They conducted multiple WSN deployments of environmental and weather sensors in mostly mountainous locations, and focus on deployment preparation and testing of the hardware, environmental on-site influences, and software debugging challenges.


Figure 6: WeBike status dashboard, showing the software version and last known GPS coordinates of each bike.


Extensive literature exists on database comparisons. However, there is little work on comparing databases of different types, and even less so in a sensor streaming context. Ilic et al. [12] compare MonetDB and MySQL based on Smart Grid data, but do so for offline data aggregation and on a much smaller sample size than in this work. Both insertion and aggregation/querying are tested by Pungila et al. [19], but they do not study concurrent insert and query scenarios on streaming data. Furthermore, an IoT analytics benchmark has recently been proposed, but it only presents results for one system: HP Vertica [5]. To the best of our knowledge, no comparisons between relational, column-store and time series databases have been done in a concurrent streaming scenario.

Some of our lessons are related to data quality. In most IoT deployments, some data will be imprecise or missing; we suggested ways to mitigate these problems, e.g., by collecting redundant data. There is also a body of work on sensor and stream data cleaning which can help identify and correct bad data; see, e.g., [15, 16, 21].

In our pilot project, raw data is sent to a staging server without any in-network processing; however, in-network processing may become important as we scale out the project. Galpin et al. [9] study the effects of in-network processing on sensor battery life and propose a benchmark to help make informed design decisions. In one of their lessons learned about energy consumption, they come to the same conclusion as we do: that it is essential to have sensors go back to standby mode whenever possible. Furthermore, Stokes et al. [22] show that a proactive approach to query execution planning can have a significant impact on battery life. They propose a technique that generates multiple query execution plans at compile time and decides when to switch between them in order to distribute load on the nodes of the sensor network.

Our approach for varying data sampling rates (Lesson 3) accommodates arbitrary event indicators but only switches between two predefined sampling rates. If a suitable sampling rate is not known a priori, or if more than two sampling levels are needed, Jain et al. [13] propose a model in which the sampling rate adapts based on the estimation error of a Kalman filter.

10 CONCLUSIONS
In this paper, we presented 18 lessons learned from the WeBike pilot project, in which we are collecting cycling and battery charging data from a fleet of 31 sensor-equipped electric bicycles. We hope that our lessons will be of interest to others involved in collecting and analyzing sensor and IoT data.

Perhaps the most important take-away message from this paper is the importance of a small-scale pilot project to work out the hardware and software bugs. Our pilot has informed the data platform design of a larger-scale project with several thousand e-bikes. We now know what sensors to use and what data to collect. We also have a data management ecosystem that seamlessly handles data ingest, time series processing and traditional row-oriented data storage, as well as visual status dashboards.

Since our long-term vision involves much more data collected from many more devices, an immediate direction for future work is to investigate distributed data processing at the data sources instead of sending all the data to a staging server. We will also investigate new sensing kit designs, including those with a Linux device such as a Raspberry Pi or Omega2 instead of an Android phone. Furthermore, we plan to evaluate using a data stream engine in place of external scripts in our Extract-Transform-Load pipeline.


11 ACKNOWLEDGMENTS
We would like to acknowledge the contributions from many WeBike project team members over the years, including Tommy Carpenter, Simon Fink, Lukas Gebhard, Fiodar Kazhamiaka, Mykhailo Kazhamiaka, Milad Khaki, Costin Ograda-Bratu, Ivan Rios, and Rayman Preet Singh.

REFERENCES
[1] 2013. Consider the e-bike: Can 200 million Chinese be wrong? (2013). http://qz.com/137518/
[2] 2017. Phidgets Inc. - Unique and Easy to Use USB Interfaces. (2017). http://www.phidgets.com/
[3] 2017. WeBike web page. (2017). http://blizzard.cs.uwaterloo.ca/iss4e/?page_id=3661
[4] Charu Aggarwal. 2013. Managing and Mining Sensor Data. Springer.
[5] Martin Arlitt, Manish Marwah, Gowtham Bellala, Amip Shah, Jeff Healey, and Ben Vandiver. 2015. IoTAbench: An Internet of Things Analytics Benchmark. In Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering (ICPE '15). 133–144.
[6] Guillermo Barrenetxea, François Ingelrest, Gunnar Schaefer, and Martin Vetterli. 2008. The hitchhiker's guide to successful wireless sensor network deployments. In Proceedings of the 6th ACM Conference on Embedded Network Sensor Systems (SenSys '08). 43–56.
[7] Tommy Carpenter. 2015. Measuring and Mitigating Electric Vehicle Adoption Barriers. Ph.D. Dissertation. University of Waterloo.
[8] Feng Chen, Pan Deng, Jiafu Wan, Daqiang Zhang, Athanasios Vasilakos, and Xiaohui Rong. 2015. Data Mining for the Internet of Things: Literature Review and Challenges. International Journal of Distributed Sensor Networks (2015), 12:12–12:12.
[9] Ixent Galpin, Alan Stokes, George Valkanas, Alasdair Gray, Norman Paton, Alvaro Fernandes, Kai-Uwe Sattler, and Dimitrios Gunopulos. 2014. SensorBench: Benchmarking Approaches to Processing Wireless Sensor Network Data. In Proceedings of the 26th International Conference on Scientific and Statistical Database Management. 21:1–21:12.
[10] Lukas Gebhard, Lukasz Golab, Srinivasan Keshav, and Hermann de Meer. 2016. Range prediction for electric bicycles. In Proceedings of the Seventh International Conference on Future Energy Systems, Waterloo, ON, Canada, June 21-24, 2016. 21:1–21:11.
[11] Lukasz Golab and M. Tamer Özsu. 2010. Data Stream Management. Morgan & Claypool Publishers.
[12] Dejan Ilić, Stamatis Karnouskos, and Martin Wilhelm. 2013. A Comparative Analysis of Smart Metering Data Aggregation Performance. In 2013 11th IEEE International Conference on Industrial Informatics (INDIN). 434–439.
[13] Ankur Jain and Edward Chang. 2004. Adaptive sampling for sensor networks. In Proceedings of the 1st International Workshop on Data Management for Sensor Networks, in conjunction with VLDB 2004 (DMSN '04). 10–16.
[14] Prem Jayaraman, Ali Yavari, Dimitrios Georgakopoulos, Ahsan Morshed, and Arkady Zaslavsky. 2016. Internet of Things Platform for Smart Farming: Experiences and Lessons Learnt. Sensors 16, 11 (2016), 1884.
[15] Shawn Jeffery, Gustavo Alonso, Michael Franklin, Wei Hong, and Jennifer Widom. 2006. A Pipelined Framework for Online Cleaning of Sensor Data Streams. In Proceedings of the 22nd International Conference on Data Engineering (ICDE). 140.
[16] Nodira Khoussainova, Magdalena Balazinska, and Dan Suciu. 2006. Towards Correcting Input Data Errors Probabilistically Using Integrity Constraints. In Proceedings of the 5th ACM International Workshop on Data Engineering for Wireless and Mobile Access (MobiDE '06). 43–50.
[17] Koen Langendoen, Aline Baggio, and Otto Visser. 2006. Murphy loves potatoes: experiences from a pilot sensor network deployment in precision agriculture. In Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium (IPDPS). 174.
[18] Samuel Madden, Michael Franklin, Joseph Hellerstein, and Wei Hong. 2005. TinyDB: an acquisitional query processing system for sensor networks. ACM Trans. Database Syst. 30, 1 (2005), 122–173.
[19] Ciprian Pungila, Teodor-Florin Fortis, and Ovidiu Aritoni. 2009. Benchmarking Database Systems for the Requirements of Sensor Readings. IETE Technical Review 26, 5 (2009), 342–349.
[20] Ivan Rios, Lukasz Golab, and S. Keshav. 2016. Analyzing the Usage Patterns of Electric Bicycles. In Proceedings of the Workshop on Electric Vehicle Systems, Data, and Applications (EV-SYS '16). 2:1–2:6.
[21] Shaoxu Song, Aoqian Zhang, Jianmin Wang, and Philip Yu. 2015. SCREEN: Stream Data Cleaning Under Speed Constraints. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). 827–841.
[22] Alan B. Stokes, Norman W. Paton, and Alvaro A. A. Fernandes. 2014. Proactive Adaptations in Sensor Network Query Processing. In Proceedings of the 26th International Conference on Scientific and Statistical Database Management (SSDBM '14). 23:1–23:12.

