
2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)

Multi-task Learning for Transit Service Disruption Detection

Taoran Ji1,2,∗, Kaiqun Fu1,2,∗, Nathan Self1,2, Chang-Tien Lu1,2, and Naren Ramakrishnan1,2

1Discovery Analytics Center, Virginia Tech, Arlington, VA 22203, USA
2Department of Computer Science, Virginia Tech, Arlington, VA 22203, USA

Abstract—With the rapid growth in urban transit networks in recent years, detecting service disruptions in a timely manner is a problem of increased interest to service providers. Transit agencies are seeking to move beyond traditional customer questionnaires and manual service inspections to leveraging open source indicators like social media for detecting emerging transit events. In this paper, we leverage Twitter data for early detection of metro service disruptions. Inspired by the multi-task learning framework, we propose the Metro Disruption Detection Model, which captures the semantic similarity between transit lines in Twitter space. We propose novel constraints on feature semantic similarity exploiting prior knowledge about the spatial connectivity and shared tracks of the metro network. An algorithm based on the alternating direction method of multipliers (ADMM) framework is developed to solve the proposed model. We run extensive experiments and comparisons to other models with real world Twitter data and transit disruption records from the Washington Metropolitan Area Transit Authority (WMATA) to justify the efficacy of our model.

Index Terms—Social Media, Twitter, Event Detection, Metro Service Disruption Detection

I. INTRODUCTION

Public transportation plays an important societal role, providing a convenient means of transportation for commuters. The increased development and wide reach of transit networks in recent decades has led more travelers to name public transit as one of their main modes of transportation. According to statistics from the United States Department of Transportation (USDOT) and annual reports from the American Public Transportation Association (APTA) [1], [2], public transportation has seen long-term growth in ridership since the early 1970s, with over 44% more trips in 2015. In 2016, passengers traveled 58.6 billion miles nationally on public transit systems. This paper focuses on metro transit networks, which account for a majority (55%) of total passenger miles in 2016 [2]. In order to provide a higher quality experience to riders on metro transit systems, it is crucial to capture and respond to feedback and complaints from riders and to minimize delays across the system. To handle this problem, recent research in both transit system analysis and social media mining has proposed useful methods from different perspectives.

From the perspective of metro management, higher ridership and longer commutes may increase uncertainty in estimating times of arrival and, inevitably, travel delays for customers. Hence, transit agencies encounter the following challenges. (1) Can we detect service failures in their early stages? Existing solutions mostly rely on manual inspections and customer complaints. Such methods cannot provide real-time service failure detection. Some failures are detected weeks after they arise. (2) Are service failures causing travel delays for riders? Current answers to this question focus on statistical analysis of commuter questionnaires. These solutions suffer from insufficient sample points to represent the entire transit network.

Fig. 1. Tweets related to a station fire at L’Enfant Plaza on January 12, 2015. Twitter data provides timely and near-ubiquitous coverage of metro disruption events across the metro transit network.

∗ Authors have equal contribution.

Twitter data, on the other hand, can capture commuter feedback and complaints in a timely manner and can further span the entire transit network. We aim to improve the quality of metro service with accurate disruption detection via analysis of a broad range of Twitter comments. Compared to traditional media, Twitter has two key properties that are highly suitable for metro disruption detection. Promptness: Unlike traditional media, which may take hours or even days to be published, tweets are often posted rapidly after an event via ever-present mobile devices [3]. Geolocated: According to the latest statistics on Twitter usage, 80% of users post tweets from mobile devices [4]. These features are important for a disruption detection system that is both timely and accurate.

When multiple service complaints are posted near a metro station over a short period of time, we can infer with high confidence that a metro service related incident has occurred. Different metropolitan areas have different watchwords across social media platforms, including different hashtags and influential users who are involved in service disruption communication. Figure 1 shows tweets during a disruption event that were sent from locations near L’Enfant Plaza, a metro station operated by the Washington Metropolitan Area Transit Authority (WMATA) in the Washington DC region. Local hashtags and influential usernames appear in this example, such as “#wmata”, “#wmatafail”, and “@unsuckdcmetro”.

IEEE/ACM ASONAM 2018, August 28-31, 2018, Barcelona, Spain

978-1-5386-6051-5/18/$31.00 © 2018 IEEE

Two challenges arise in leveraging the rich information Twitter provides for metro service disruption detection. (1) Sparsity of metro service features: Among the many text features latent in Twitter data, only a few key features are related to metro services. (2) Modeling semantic similarity across the problem space: Although complaints about metro service usually target one particular metro line or station, it is appropriate to assume that language usage patterns are similar across distinct events. To address these challenges, we propose a multi-task learning based Metro Disruption Detection Model (MDDM). To the best of our knowledge, MDDM is the first work to use a supervised learning scheme for metro service disruption detection. Our main contributions are:

• Formulating a multi-task learning framework for metro disruption detection using online social media. In contrast to existing works, we formulate the problem of disruption detection for metro service as a multi-task supervised learning problem. In the proposed method, models for different metro lines are learned simultaneously by restricting all lines to exploit a common set of features.

• Modeling semantic similarity among metro lines in feature space. Based on extensive analysis of metro related information on social media, specifically designed constraints are proposed to model semantic similarities among data for distinct metro lines. These similarities in feature space are driven by both spatial connectivity and common complaint vocabulary.

• Developing an efficient algorithm to solve the proposed model. The underlying optimization problem of the proposed multi-task model is non-smooth, multi-convex, and inequality-constrained, and is therefore challenging to solve. By introducing auxiliary variables, we develop an effective ADMM-based algorithm to decouple the main problem into several subproblems which can be solved by block coordinate descent and proximal operators.

• Comprehensive experiments to validate the effectiveness and efficiency of the proposed model. We evaluate the proposed model using metro related Twitter data collected from January 2015 to June 2016. For comparison, we implement a range of other algorithms including LOGR, LOG-LASSO, and LOG-RMTFL.

II. RELATED WORK

In this section, we provide a review of current research on social media-based incident detection in transit networks. We break this topic into three subtopics: incident detection in transit networks, local event detection and monitoring with Twitter, and multi-task learning frameworks for spatiotemporal event detection.

A. Incident Detection in Transit Networks

Early detection of emergency incidents on transit and roadway networks is critical for reducing their impact on traffic conditions. This problem has drawn increased attention from researchers in the field of intelligent transportation systems. Previous studies focused on using multiple sources for incident detection including reports from transit operation patrols, commuter calls, traffic sensors, and closed circuit television (CCTV) monitoring [5], [6]. These traditional incident detection methods suffer from two major disadvantages: lack of incident detection sensitivity (e.g. patrol inspection, commuter reports), and limited urban areas that are monitored (e.g. traffic sensors, CCTV). These drawbacks motivate us to explore the application of social media analytics for incident detection and transit network management.

In recent years, there has been some interest in social media based event analysis for transit networks and transportation systems [7], [8], [9]. Most of these papers focus on statistical analysis such as the correlation between the volume of social media posts and the number of occurrences of transportation related events [10]. These methods rely mostly on analysis from human experts and are impractical to automate. Another branch of relevant work focuses on implementing heuristic rule-based models to predict the occurrence of events in transit networks. Ma et al. proposed a mobility analyzer framework [11] which includes a social media based event detection module for transit networks.

B. Local Event Detection and Monitoring on Twitter

Several previous studies have used social media data for local event detection problems. Sakaki et al. [12] trained a prediction model to judge whether a newly posted tweet refers to an earthquake. Zhang et al. [13] used taxi trace records to infer occurrences of social events. Furthermore, they proposed a model for measuring the scale and impact of those events. Santillana et al. [14] explored the feasibility of applying machine learning algorithms to detect and monitor influenza activity by leveraging data from multiple data sources including Google searches, Twitter data, and hospital visit records. Gerber et al. [15] used a logistic regression model to forecast criminal activities in a spatio-temporal setting. Paul et al. [16] and Parker et al. [17] proposed methods to track public health conditions via social media data sources. Alternatives to traditional unsupervised machine learning methods such as latent Dirichlet allocation were proposed by Chen et al. [18] to detect public health related events. They further showed that their methods can better forecast a flu season’s trends as well as flu peaks by aggregating user states in a region over a period.

C. Multi-task Learning for Spatiotemporal Event Detection

Multi-task learning (MTL) refers to models that learn multiple related tasks simultaneously to improve overall performance. Recent decades have witnessed proposals for many MTL approaches [19]. Evgeniou et al. [20] proposed a regularized MTL formulation that constrains the models of each task to be close to each other. Task relatedness can also be modeled by constraining multiple tasks to share a common underlying structure (e.g. a common set of features) [21], or a common subspace [22]. Zhao et al. [23] designed a multi-task learning framework that models forecasting tasks in related geolocations. MTL approaches have been applied in many domains including computer vision and biomedical informatics. Our work, to the best of our knowledge, is the first paper to address the feasibility of combining social media analysis and multi-task learning techniques to resolve disruption detection problems for transit networks.

III. PROBLEM SETUP

Given a collection of tweets $D$, which is collected along a continuous time series, we first filter it using names of metro stations and metro lines operated by WMATA. This produces the target tweet subcollection $D^+$. Then, based on which metro line is referred to in each tweet, $D^+$ is grouped into $\{D^+_c\}_{c \in \Phi}$, where $\Phi = \{\text{blue}, \text{green}, \text{orange}, \text{red}, \text{silver}, \text{yellow}\}$ (the colors refer to the WMATA metro lines).

Our operative question is: given a metro line color $c$, a time slot $t$, and the collection of corresponding tweets $D^+_{c,t}$, is there a delay for metro line $c$ during time period $t$? To answer this question, we cast it as a supervised learning problem using the multi-task learning framework.

Under the assumption that metro delays can be captured by complaints and negative discussion in Twitter space, we adopt a dictionary $F$ of features trained specifically for Twitter [24]. For each subcollection $D^+_{c,t}$, we generate a corresponding matrix $X^c_t$ by counting the frequencies of semantic features in $F$. Now, our problem can be formulated as performing the mapping

$$F_c(X^c_t) \rightarrow Y^c_t, \tag{1}$$

where $Y^c_t \in \{-1, 1\}$ are labels which denote whether there is a delay, and $F_c$ is the model for metro line $c$. (In the case of WMATA, which operates six metro lines, there are six models to learn.)
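The feature construction described above can be sketched as follows. This is a minimal illustration, assuming each time slot is reduced to one row of dictionary-term counts; the function name `count_features`, the three-word dictionary, and the sample tokens are hypothetical stand-ins and are not taken from the paper's dataset or from the dictionary in [24].

```python
from collections import Counter

import numpy as np


def count_features(tweets, dictionary):
    """Count how often each dictionary term occurs in a list of tokenized tweets."""
    index = {term: i for i, term in enumerate(dictionary)}
    x = np.zeros(len(dictionary))
    for tokens in tweets:
        for term, n in Counter(tokens).items():
            if term in index:          # terms outside the dictionary are ignored
                x[index[term]] += n
    return x


# Hypothetical stand-in for the dictionary F and for one subcollection D+_{c,t}.
dictionary = ["delay", "stuck", "fire"]
slot_tweets = [["blue", "line", "delay", "again"], ["stuck", "delay"]]
x_ct = count_features(slot_tweets, dictionary)
print(x_ct)
```

Stacking one such row per time slot would give the design matrix for a single metro line.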

A traditional way to solve this problem is to learn the model for each metro line separately. However, the performance of each model may be adversely affected by ignoring the relatedness among different lines. In our approach, this relatedness is expressed as the semantic similarities among complaints about different metro lines and stations. We consider that two factors contribute to this semantic similarity in Twitter space. (1) Spatial connectivity of metro lines: Metro lines are spatially related (e.g. the orange and silver lines share 83% of their stations). As a result, delays that affect multiple lines may provoke similar complaints. (2) Common complaint vocabulary across metro lines: We assume that the words used by Twitter users to complain of disruption events will be similar across all metro lines. To model semantic similarity caused by these two factors, we design and implement a multi-task learning based metro delay detection model. The details are explained in the next section.

IV. MODELS

Considering that we want to predict if there is a delay for a metro line given a subcollection of tweets $D^+_{c,t}$ which mention metro line $c$ during time slot $t$, our problem fits well into the scope of a classification or regression problem. For instance, learning the function $F_c$ can be modeled as a logistic regression problem and the model parameters $w$ can be learned by solving the following optimization problem:

$$\operatorname*{argmin}_{w} \; L_c = \sum_{t=1}^{m_c} \log\left(1 + \exp\{-Y^c_t (X^c_t w)\}\right), \tag{2}$$

where $m_c$ is the total number of data points in $D^+_{c,t}$. However, as stated in Section III, if $w$ for each metro line is learned separately, these models will fail to reflect the semantic similarity among metro lines in feature space. To address this, we cast the original problem into a multi-task learning framework:

$$\operatorname*{argmin}_{W} \; L = \sum_{c=1}^{|\Phi|} \sum_{t=1}^{m_c} \log\left(1 + \exp\{-Y^c_t (X^c_t W^c)\}\right), \tag{3}$$

where each column of $W$, referred to as $W^c$, denotes the model parameters of $F_c$. In this way, we can further model the relatedness among metro lines with the parameter matrix $W$.
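The multi-task loss in Equation 3 can be evaluated with a few lines of NumPy. The sketch below assumes each line's data is stored as a separate array pair and uses `np.logaddexp` for a numerically stable `log(1 + exp(z))`; the function name and the toy inputs are illustrative only.

```python
import numpy as np


def multitask_logistic_loss(X, Y, W):
    """Sum of per-line logistic losses.

    X: list of (m_c, d) feature arrays, one per metro line.
    Y: list of (m_c,) label arrays with entries in {-1, +1}.
    W: (d, n_lines) parameter matrix; column c is the model for line c.
    """
    total = 0.0
    for c, (Xc, Yc) in enumerate(zip(X, Y)):
        z = -Yc * (Xc @ W[:, c])
        total += np.sum(np.logaddexp(0.0, z))  # log(1 + exp(z)) without overflow
    return float(total)


# Toy data for two lines (hypothetical values).
X = [np.ones((3, 2)), np.ones((2, 2))]
Y = [np.array([1.0, -1.0, 1.0]), np.array([-1.0, 1.0])]
loss_at_zero = multitask_logistic_loss(X, Y, np.zeros((2, 2)))
print(loss_at_zero)
```

With all parameters at zero every sample contributes `log 2`, which gives a quick sanity check for an implementation.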

A. Modeling Spatial Connectivity in Feature Space

In real world metro systems, distinct metro lines are often spatially related. That is, two or more metro lines may share several stations or several segments of track. For instance, in the Washington metro system, the Orange and Silver lines share 83% of their stations, and the Silver and Blue lines share 64% of their stations. This means that if the Orange line has a delay, there is a good chance that the Silver line will also have a delay. This spatial relatedness results in semantic similarity in Twitter space and, therefore, a similar distribution of tweets complaining of delays. For example, one complaint from our dataset mentions three different metro lines at once: “@unsuckmetro. Just spend 30 min @foggy bottom & McPherson #metro. No explanation. Blue/silver/orange #delays. #Transparency much? @wmata”. Thus, our model should be encouraged to capture this form of semantic relatedness in Twitter space. Mathematically, we place constraints on parameters among different tasks:

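$$\begin{aligned}
\operatorname*{argmin}_{W} \;& \sum_{c=1}^{|\Phi|} \sum_{t=1}^{m_c} \log\left(1 + \exp\{-Y^c_t (X^c_t W^c)\}\right) \\
\text{s.t.} \;& \|W^3 - W^5\|_2^2 \le \eta_1, \quad \|W^1 - W^5\|_2^2 \le \eta_2, \\
& \|W^6 - W^2\|_2^2 \le \eta_3, \quad \|W^1 - W^6\|_2^2 \le \eta_4, \\
& \eta_1 \ge 0, \; \eta_2 \ge 0, \; \eta_3 \ge 0, \; \eta_4 \ge 0,
\end{aligned} \tag{4}$$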

where each constraint forces the Euclidean distance between model parameters for a specific pair of metro lines to be within a range. Detailed explanations for each constraint are shown in Table I and their expected effects are explained in constraints #1 to #4 in Figure 2.

Fig. 2. A schematic view of the Metro Disruption Detection Model (MDDM). Semantic similarity among complaints in feature space is modeled by two major factors: spatial connectivity between metro lines and a common complaint vocabulary. In particular, metro spatial connectivity constraints encourage the model to decrease differences between spatially related metro lines in feature space. The common complaint vocabulary constraint encourages the model to identify a core set of words most commonly used for reporting problems with metro service.

TABLE I
MEANING OF CONSTRAINTS.

Constraint                        Meaning
$\|W^3 - W^5\|_2^2 \le \eta_1$    Squared Euclidean distance between (orange, silver) should be less than or equal to $\eta_1$.
$\|W^1 - W^5\|_2^2 \le \eta_2$    Squared Euclidean distance between (blue, silver) should be less than or equal to $\eta_2$.
$\|W^6 - W^2\|_2^2 \le \eta_3$    Squared Euclidean distance between (yellow, green) should be less than or equal to $\eta_3$.
$\|W^1 - W^6\|_2^2 \le \eta_4$    Squared Euclidean distance between (blue, yellow) should be less than or equal to $\eta_4$.

B. Modeling Common Complaint Vocabulary in Feature Space

In addition to similarities caused by the spatial interconnectedness between metro lines, we also consider a hidden pattern in the usage of complaint words over time and across metro lines. For example, consider the following two tweets: “Come on @wmata — why so many delays on the blue line? Been trying to get home since 5:45.” and “Delays every day this week on OL. Was #SafeTrack just a taxpayer money grab? @unsuckdcmetro @wmata”. These were posted at different timestamps and, although they complain about the blue line and the orange line respectively, they share a common complaint word: “delay”. This observation leads us to believe that although we have adopted a large dictionary of semantic keywords, it is possible that only a small subset of them, like “delay,” contribute to the detection of metro disruptions. This means that the learned parameter matrix $W$ should be sparse and have nonzero values for only the most important features. Thus, the proposed model should be encouraged to capture hidden patterns among complaints and to maintain sparsity in feature space. Mathematically, this consideration inspires us to use the $\ell_{2,1}$ norm [25] to perform joint feature selection:

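$$\begin{aligned}
\operatorname*{argmin}_{W} \;& \sum_{c=1}^{|\Phi|} \sum_{t=1}^{m_c} \log\left(1 + \exp\{-Y^c_t (X^c_t W^c)\}\right) \\
\text{s.t.} \;& \|W\|_{2,1} \le \eta_5, \quad \eta_5 \ge 0.
\end{aligned} \tag{5}$$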

The effect of $\|W\|_{2,1}$ is explained in constraint #5 in Figure 2.
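To make the joint feature selection concrete, the sketch below computes the $\ell_{2,1}$ norm under the common convention that it sums the Euclidean norms of the rows of $W$, where each row holds one feature's weights across all metro lines. That convention is an assumption here; the paper defers the definition to [25]. The matrix values are made up.

```python
import numpy as np


def l21_norm(W):
    """Sum of row-wise Euclidean norms (rows = features, columns = metro lines)."""
    return float(np.sum(np.linalg.norm(W, axis=1)))


# Hypothetical 3-feature, 2-line parameter matrix.
W = np.array([[3.0, 4.0],   # feature used by both lines, row norm 5
              [0.0, 0.0],   # feature dropped for every line, row norm 0
              [0.0, 1.0]])  # feature used by one line, row norm 1
print(l21_norm(W))  # 6.0
```

Because the penalty acts on whole rows, shrinking it tends to zero out a feature for all lines at once, which matches the idea of a shared core complaint vocabulary.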

C. Metro Disruption Detection Model

Combining Models (4) and (5), we obtain our proposed metro disruption detection model:

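$$\begin{aligned}
\operatorname*{argmin}_{W} \;& \sum_{c=1}^{|\Phi|} \sum_{t=1}^{m_c} \log\left(1 + \exp\{-Y^c_t (X^c_t W^c)\}\right) \\
\text{s.t.} \;& \|W^3 - W^5\|_2^2 \le \eta_1, \quad \|W^1 - W^5\|_2^2 \le \eta_2, \\
& \|W^6 - W^2\|_2^2 \le \eta_3, \quad \|W^1 - W^6\|_2^2 \le \eta_4, \\
& \|W\|_{2,1} \le \eta_5, \\
& \eta_1 \ge 0, \; \eta_2 \ge 0, \; \eta_3 \ge 0, \; \eta_4 \ge 0, \; \eta_5 \ge 0.
\end{aligned} \tag{6}$$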

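By moving the constraints into the objective function, we obtain an equivalent regularized problem, which is easier to solve:

$$\begin{aligned}
\operatorname*{argmin}_{W} \;& \sum_{c=1}^{|\Phi|} \sum_{t=1}^{m_c} \log\left(1 + \exp\{-Y^c_t (X^c_t W^c)\}\right) + \lambda_5 \|W\|_{2,1} \\
& + \lambda_1 \|W^3 - W^5\|_2^2 + \lambda_2 \|W^1 - W^5\|_2^2 \\
& + \lambda_3 \|W^6 - W^2\|_2^2 + \lambda_4 \|W^1 - W^6\|_2^2,
\end{aligned} \tag{7}$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$, and $\lambda_5$ are trade-off penalties balancing the value of the loss function and the regularizers.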
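The regularization terms of Equation 7 can be sketched as below. The column order (blue, green, orange, red, silver, yellow, zero-based) follows the alphabetical ordering of $\Phi$ implied by Table I; the row-wise reading of the $\ell_{2,1}$ norm, the function name, and the numeric values are assumptions for illustration.

```python
import numpy as np

# Zero-based column pairs for (W3, W5), (W1, W5), (W6, W2), (W1, W6).
PAIRS = [(2, 4), (0, 4), (5, 1), (0, 5)]


def mddm_regularizers(W, lams, lam5):
    """Spatial-connectivity penalties plus the l2,1 sparsity penalty of Equation 7."""
    spatial = sum(
        lam * np.sum((W[:, a] - W[:, b]) ** 2) for lam, (a, b) in zip(lams, PAIRS)
    )
    sparsity = lam5 * np.sum(np.linalg.norm(W, axis=1))  # assumed row-wise l2,1
    return float(spatial + sparsity)


W = np.zeros((3, 6))
W[0, 2] = 1.0  # only the orange line uses feature 0
value = mddm_regularizers(W, lams=[10.0, 100.0, 100.0, 100.0], lam5=100.0)
print(value)  # 10 * 1 (orange vs. silver) + 100 * 1 (one nonzero row) = 110.0
```

Adding this value to the logistic loss of Equation 3 would give the full objective.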

V. ALGORITHM

The objective function in Equation 7 is multi-convex and the $\ell_{2,1}$ regularizer is non-smooth, which makes the problem difficult to solve. A traditional way to solve this kind of problem is proximal gradient descent, but this approach is slow to converge. Recently, the alternating direction method of multipliers (ADMM) [26] has become popular as an efficient algorithmic framework which decouples the original problem into smaller, easier to handle subproblems. Here we propose an ADMM-based algorithm (Algorithm 1) which is able to efficiently optimize the proposed models. In particular, primal variables are updated on Line 4, dual variables on Line 5, and Lagrange multipliers on Line 6. Line 7 calculates both primal and dual residuals.

A. Augmented Lagrangian Scheme

First, we introduce an auxiliary variable $U_w = W$ into the original problem in Equation 7 and obtain the following equivalent problem:

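$$\begin{aligned}
\operatorname*{argmin}_{\Theta} \;& \sum_{c=1}^{|\Phi|} \sum_{t=1}^{m_c} \log\left(1 + \exp\{-Y^c_t (X^c_t W^c)\}\right) \\
& + \lambda_1 \|W^3 - W^5\|_2^2 + \lambda_2 \|W^1 - W^5\|_2^2 \\
& + \lambda_3 \|W^6 - W^2\|_2^2 + \lambda_4 \|W^1 - W^6\|_2^2 \\
& + \lambda_5 \|U_w\|_{2,1} \\
\text{s.t.} \;& U_w = W,
\end{aligned} \tag{8}$$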

where $\Theta = \{U_w, W\}$ is the set of variables to be optimized. Then we transform the above problem into its augmented Lagrangian form as follows:

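$$\begin{aligned}
L_\rho = \;& \sum_{c=1}^{|\Phi|} \sum_{t=1}^{m_c} \log\left(1 + \exp\{-Y^c_t (X^c_t W^c)\}\right) \\
& + \lambda_1 \|W^3 - W^5\|_2^2 + \lambda_2 \|W^1 - W^5\|_2^2 \\
& + \lambda_3 \|W^6 - W^2\|_2^2 + \lambda_4 \|W^1 - W^6\|_2^2 \\
& + \lambda_5 \|U_w\|_{2,1} + \langle U_1, W - U_w \rangle + \frac{\rho}{2} \|W - U_w\|_2^2,
\end{aligned} \tag{9}$$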

where $U_1$ is the Lagrangian multiplier. With this step, we decouple the original problem into two easier to handle problems in which three variables $W$, $U_w$ and $U_1$ will be optimized individually.

B. Parameter Optimization

First, the primal variable $W$ is updated by solving the following subproblem:

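$$\begin{aligned}
W^+ \leftarrow \operatorname*{argmin}_{W} \; Q = \;& \sum_{c=1}^{|\Phi|} \sum_{t=1}^{m_c} \log\left(1 + \exp\{-Y^c_t (X^c_t W^c)\}\right) \\
& + \lambda_1 \|W^3 - W^5\|_2^2 + \lambda_2 \|W^1 - W^5\|_2^2 \\
& + \lambda_3 \|W^6 - W^2\|_2^2 + \lambda_4 \|W^1 - W^6\|_2^2 \\
& + \langle U_1, W - U_w \rangle + \frac{\rho}{2} \|W - U_w\|_2^2.
\end{aligned} \tag{10}$$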

Algorithm 1: An ADMM-based solver for MDDM.
Input: X, Y
Output: W
 1  Initialize ρ = 1, W^(0), U_w^(0), U_1^(0);
 2  Initialize ε_r > 0, ε_s > 0, MAX_ITER;
 3  for k = 1 : MAX_ITER do
 4      Update W^(k) with BCD using Equation 12;
 5      Update U_w^(k) with prox_{f1,1/ρ}(U_1^(k−1) + W^(k));
 6      Update U_1^(k) with Equation 14;
 7      Compute r and s by Equation 15;
 8      if r < ε_r and s < ε_s then
 9          break;
10      end
11  end

The objective function $Q$ is multi-convex. In particular, $Q$ is convex in $W^j$ when all other columns $W^{j'}$, $j' \neq j$, are fixed. This kind of problem can be decoupled into subproblems using block coordinate descent (BCD) [27], in which each $W^j$ is updated by solving the following sub-optimization problem:

$$W^j \leftarrow \operatorname*{argmin}_{W^j} \; Q. \tag{11}$$

$Q$ is smooth and convex in each $W^j$, so each subproblem can be solved by gradient descent using the following partial derivatives:

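$$\begin{aligned}
\frac{\partial Q}{\partial W^1} &= P^{(1)} + 2\lambda_2 (W^1 - W^5) + 2\lambda_4 (W^1 - W^6), \\
\frac{\partial Q}{\partial W^2} &= P^{(2)} + 2\lambda_3 (W^2 - W^6), \\
\frac{\partial Q}{\partial W^3} &= P^{(3)} + 2\lambda_1 (W^3 - W^5), \\
\frac{\partial Q}{\partial W^4} &= P^{(4)}, \\
\frac{\partial Q}{\partial W^5} &= P^{(5)} - 2\lambda_1 (W^3 - W^5) - 2\lambda_2 (W^1 - W^5), \\
\frac{\partial Q}{\partial W^6} &= P^{(6)} + 2\lambda_3 (W^6 - W^2) - 2\lambda_4 (W^1 - W^6),
\end{aligned} \tag{12}$$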

where

$$P^{(c)} = (X^c)^T \left(-Y^c \circ \left(I - I \oslash (I + \exp\{Z^c\})\right)\right) + U_1^c + \rho\,(W^c - U_w^c),$$

where $\circ$ is the element-wise product (Hadamard product), $\oslash$ is element-wise division (Hadamard division), $I$ is an $m_c$-dimensional vector of ones, and $Z^c$ is defined as

$$Z^c = -Y^c \circ (X^c W^c).$$
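As a sanity check on the expression for $P^{(c)}$, the sketch below implements it with NumPy and compares its loss part against a central finite-difference approximation. The data is random and purely illustrative; the function names are not from the paper.

```python
import numpy as np


def loss_grad(Xc, Yc, wc):
    """Gradient of sum_t log(1 + exp(-y_t * x_t.w)), written as in P(c)."""
    z = -Yc * (Xc @ wc)                    # Z^c
    sig = 1.0 - 1.0 / (1.0 + np.exp(z))    # I - I / (I + exp{Z^c}), element-wise
    return Xc.T @ (-Yc * sig)


def P(Xc, Yc, wc, u1c, uwc, rho):
    """Loss gradient plus the multiplier and augmented-Lagrangian terms."""
    return loss_grad(Xc, Yc, wc) + u1c + rho * (wc - uwc)


rng = np.random.default_rng(0)
Xc = rng.normal(size=(5, 3))
Yc = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
wc = rng.normal(size=3)


def loss(w):
    return np.sum(np.logaddexp(0.0, -Yc * (Xc @ w)))


eps = 1e-6
numeric = np.array(
    [(loss(wc + eps * e) - loss(wc - eps * e)) / (2 * eps) for e in np.eye(3)]
)
grad_error = float(np.max(np.abs(numeric - loss_grad(Xc, Yc, wc))))
print(grad_error)
```

For large positive entries of $Z^c$ the call to `np.exp` can overflow; a production implementation would likely use a numerically stable sigmoid instead.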

Now that the primal variable $W$ is taken care of, the dual variable $U_w$ is updated as follows:

$$U_w^+ \leftarrow \operatorname{prox}_{f_1, 1/\rho}(U_1 + W), \tag{13}$$

where $f_1$ is the non-smooth function $\lambda_5 \|U_w\|_{2,1}$. The proximal operator can be solved efficiently using [28].
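For intuition, the proximal operator of a scaled $\ell_{2,1}$ norm has a well-known closed form: each row is shrunk toward zero by a fixed amount in Euclidean norm, and rows whose norm falls below the threshold are set to zero. The sketch below assumes the row-wise definition of the norm and is not necessarily the routine from [28].

```python
import numpy as np


def prox_l21(V, tau):
    """Row-wise group soft-thresholding: argmin_U tau * ||U||_{2,1} + 0.5 * ||U - V||_F^2."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return scale * V


V = np.array([[3.0, 4.0],    # row norm 5, scaled by 1 - 1/5 = 0.8
              [0.3, 0.4]])   # row norm 0.5 < tau, so the whole row is zeroed
U = prox_l21(V, tau=1.0)
print(U)
```

In a standard ADMM derivation the update would be `prox_l21(W + U1 / rho, lam5 / rho)`; with `rho = 1`, as initialized in Algorithm 1, the argument coincides with the $U_1 + W$ used in Equation 13.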

Next, the Lagrangian multiplier $U_1$ is updated as follows:

$$U_1^+ \leftarrow U_1 + \rho\,(W^+ - U_w^+). \tag{14}$$

Finally, primal and dual residuals are computed with

$$r = \|W^+ - U_w^+\|_2, \qquad s = \rho\,\|U_w^+ - U_w\|_2, \tag{15}$$

where $r$ is the primal residual and $s$ is the dual residual.
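Putting the updates together, the loop of Algorithm 1 might look like the sketch below. Several choices are assumptions rather than details from the paper: each BCD block is solved with a fixed number of plain gradient steps, the step size and tolerances are arbitrary, the prox uses the standard argument `W + U1 / rho`, and the data is synthetic.

```python
import numpy as np

# Zero-based column pairs for (W3, W5), (W1, W5), (W6, W2), (W1, W6).
PAIRS = [(2, 4), (0, 4), (5, 1), (0, 5)]


def prox_l21(V, tau):
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12)) * V


def grad_column(j, X, Y, W, Uw, U1, lams, rho):
    """Partial derivative of Q with respect to column j (Equation 12)."""
    z = -Y[j] * (X[j] @ W[:, j])
    g = X[j].T @ (-Y[j] * (1.0 - 1.0 / (1.0 + np.exp(z))))
    g += U1[:, j] + rho * (W[:, j] - Uw[:, j])
    for lam, (a, b) in zip(lams, PAIRS):
        if j == a:
            g += 2.0 * lam * (W[:, a] - W[:, b])
        elif j == b:
            g -= 2.0 * lam * (W[:, a] - W[:, b])
    return g


def admm_mddm(X, Y, lams, lam5, rho=1.0, max_iter=200,
              eps_r=1e-4, eps_s=1e-4, step=1e-2, inner=20):
    d, n_lines = X[0].shape[1], len(X)
    W = np.zeros((d, n_lines))
    Uw = np.zeros((d, n_lines))
    U1 = np.zeros((d, n_lines))
    r = s = np.inf
    for k in range(max_iter):
        for j in range(n_lines):              # block coordinate descent over lines
            for _ in range(inner):            # a few gradient steps per block
                W[:, j] -= step * grad_column(j, X, Y, W, Uw, U1, lams, rho)
        Uw_old = Uw
        Uw = prox_l21(W + U1 / rho, lam5 / rho)   # auxiliary variable update
        U1 = U1 + rho * (W - Uw)                  # multiplier update (Equation 14)
        r = np.linalg.norm(W - Uw)                # primal residual
        s = rho * np.linalg.norm(Uw - Uw_old)     # dual residual
        if r < eps_r and s < eps_s:
            break
    return W, r, s


# Synthetic data: six lines, 30 time slots each, four features.
rng = np.random.default_rng(0)
X = [rng.random((30, 4)) for _ in range(6)]
Y = [np.where(rng.random(30) < 0.5, 1.0, -1.0) for _ in range(6)]
W, r, s = admm_mddm(X, Y, lams=[1.0] * 4, lam5=0.1)
print(W.shape, r, s)
```

The fixed step size is only safe for small, well-scaled inputs like these; a line search or a Lipschitz-based step would be a more robust choice on real feature counts.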

VI. EVALUATION

A. Dataset and Ground Truth

Dataset and Preprocessing: We evaluated our proposed Metro Disruption Detection Model using tweets collected from January 2015 through June 2016 from GNIP’s decahose (an approx. 10% sample of all tweets). This dataset was separated into two parts: (1) data from January 2015 to December 2015, which serves as the training set for supervised comparison methods, and (2) data from January 2016 to June 2016, which serves as the test set for validating our methods. Both the training set and test set were partitioned into hourly intervals. Event detection is performed for each metro line individually based on each hour’s data. Each tweet collection is fed into a preprocessing pipeline during which stopwords are eliminated; enrichment involving tokenization and lemmatization is performed using spaCy.

Sentiment Feature Selection: Our approach derives features from tweet content using a pipeline of pruning operations and social media sentiment data provided by the StaticTwitterSent dictionary [29]. First, we removed non-English words, hashtags, usernames, single letter words, numbers, and English stopwords as defined by the NLTK toolkit. With our dataset, this produced 17,977 features. Second, we removed words with positive frequency or negative frequency less than 10 from the feature set. Here, positive frequency refers to the number of times a word appears in a positive sentiment tweet. Likewise, negative frequency denotes the frequency of a word in tweets with negative sentiment. As a result, a set of 6,600 words with negative sentiment was selected. Third, we used metro related keywords (e.g. metro, wmata) provided by metro experts to gather a sub-collection of tweets from the entire dataset. Only features provided by tweets in this metro-related sub-collection are used. This leaves us with a set of 2,200 features which we use for validating our methods. The most frequent features for each metro line are listed in Table II. Note that several keywords (e.g. interrupted, injury) are shared among different metro lines, further motivating us to train all tasks simultaneously.

Metro Delay Disruption Dataset: The WMATA Daily Service Report keeps track of all delay records in the WMATA Metro system. For this paper, we scraped metro disruption data from this service for each day from January 2015 to June 2016. This dataset serves as ground truth data for training and validation. We scraped data such as the amount of time delayed, metro line color, metro station, and start time of the delay, which are aggregated into six-hour time intervals. This data is separated into training and test sets corresponding to the Twitter dataset’s partition.

TABLE II
TOP 3 MOST FREQUENT WORDS FOR EACH METRO LINE.

Metro Line     Keywords
Blue line      mole, interrupted, enders
Green line     interrupted, injury, spine
Orange line    spine, mouldy, safe
Red line       interrupted, injury, computers
Silver line    injury, interrupted, blister
Yellow line    interrupted, injury, tennant

B. Comparison Methods and Experiment Setup

To quantify and validate model performance, three metrics are adopted. Precision denotes the ratio of predicted disruptions that match the ground truth to all disruptions predicted by a model. Recall designates the percentage of ground truth metro disruptions which are actually detected by the model. F-measure is the harmonic mean of precision and recall, defined as 2 · Precision · Recall / (Precision + Recall).
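The three metrics can be computed directly from label vectors in $\{-1, 1\}$, as in the minimal sketch below. The example labels are made up, and returning 0.0 when a denominator is zero is a convention chosen here, not one stated in the paper.

```python
def precision_recall_f(y_true, y_pred):
    """Precision, recall and F-measure for labels in {-1, +1}, with +1 meaning 'delay'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical ground truth and predictions for five time slots.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, 1, -1, 1]
precision, recall, f_measure = precision_recall_f(y_true, y_pred)
print(precision, recall, f_measure)
```

Here two of three predicted delays are real and two of three real delays are found, so all three metrics equal 2/3.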

There are five tunable parameters in our MDDM model, namely the regularizer penalties $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ and $\lambda_5$. During the experiment, we observed that the value of the loss function is significantly larger than regularizers 2 to 4, which means a large penalty should be used to balance the loss function and the regularizers. Thus, $\lambda_1$ is searched from {1, 10, 100} and the remaining four penalties are searched from {100, 200, . . . , 1000}.
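To give a sense of the size of this search space, the sketch below enumerates the candidate penalty combinations. It assumes an exhaustive grid over all five penalties, which the paper does not state explicitly; the variable names are illustrative.

```python
from itertools import product

lam1_grid = [1, 10, 100]                   # candidates for lambda_1
other_grid = list(range(100, 1001, 100))   # 100, 200, ..., 1000 for lambda_2..lambda_5

# Each tuple is (lambda_1, lambda_2, lambda_3, lambda_4, lambda_5).
grid = list(product(lam1_grid, other_grid, other_grid, other_grid, other_grid))
print(len(grid))   # 3 * 10**4 = 30000 combinations
print(grid[0])     # (1, 100, 100, 100, 100)
```

A full sweep would therefore require 30,000 model fits, so a coarser or randomized search may be preferable in practice.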

We compared MDDM with the following three methods:

• Logistic Regression (LOGR) [30]. For each metro line, LOGR utilizes a logit function to predict the probability of the occurrence of a metro service disruption based on the observation of tweets. Input features are feature counts, and no tunable parameter is needed.

• LASSO with Logistic Regression (LOG-LASSO). LASSO [31] is a classic model which is widely used in the event detection field. For each metro line $c$, a LASSO model is trained and evaluated independently. Input features are feature counts, and the trade-off penalty parameter is searched from {1, 10, 100}.

• Regularized Multi-task Feature Learning Model with Logistic Regression (LOG-RMTFL) [23]. We replace the least squares loss with logistic loss to fit our proposed classification application. The feature set is our set of 2,200 words with negative sentiment. All six models are trained simultaneously and evaluated separately. The trade-off penalty parameter is searched from {1, 10, 100}.

C. Metro Disruption Detection Results

Table III summarizes the comparisons of our proposed method to the competing methods for the task of disruption detection of metro service. From the experimental results, we can justify our application of a multi-task learning framework for detecting disruptions in metro service. In general, MDDM outperforms the single task model on five metro lines on recall and F-measure due to the benefits of multi-task learning, which integrates information from each transit line. LOG-LASSO, LOG-RMTFL, and MDDM outperform LOGR for everything

TABLE III
METRO DISRUPTION DETECTION PERFORMANCE COMPARISONS (PRECISION, RECALL, F-MEASURE)

Method       Blue              Green             Orange            Red               Silver            Yellow
LOGR         0.37, 0.37, 0.37  0.48, 0.48, 0.48  0.56, 0.55, 0.56  0.63, 0.63, 0.63  0.49, 0.49, 0.49  0.40, 0.40, 0.40
LOG-LASSO    0.43, 0.42, 0.42  0.50, 0.50, 0.50  0.56, 0.56, 0.56  0.67, 0.66, 0.67  0.49, 0.49, 0.49  0.43, 0.43, 0.43
LOG-RMTFL    0.36, 0.35, 0.35  0.52, 0.52, 0.52  0.59, 0.59, 0.59  0.68, 0.68, 0.68  0.49, 0.49, 0.49  0.39, 0.39, 0.39
MDDM         0.36, 0.56, 0.44  0.55, 0.55, 0.55  0.63, 0.62, 0.63  0.69, 0.68, 0.69  0.52, 0.52, 0.52  0.40, 0.40, 0.40

Fig. 3. Spatial connectivity component validation (x-axis: iteration, 0 to 80; y-axis: Euclidean distance, in units of 10^−5). To obtain an optimal value for the objective function, the model iteratively decreases the value of the loss function and reduces the penalties of the regularizers. As a result, the differences between spatially related metro lines such as the Orange ($W^3$) and Silver ($W^5$) lines decrease until reaching a stable state.

except precision on the blue line. Considering that these three models account for feature sparsity, these results demonstrate the existence of sparsity in metro service features as introduced in Section I. Table III also shows that model performance for metro service disruption detection is not the same across different transit lines. For instance, the performance of LOG-LASSO, LOG-RMTFL, and MDDM on the red line differs only slightly. Because the red line is spatially independent from other lines in the WMATA network, performance is not greatly affected by the application of a multi-task learning model. Also, LOG-RMTFL and MDDM perform better on the Orange line. The Orange line has unusual properties. For instance, it shares many feature weights with other transit lines and its spatial connectivity to other lines (such as Blue and Silver) is high. This enables multi-task learning based methods to utilize as much information as possible to boost performance on the Orange line. We find that, across metro lines, MDDM outperforms LOG-RMTFL by 1% to 10% on precision and by 2% to 25% on F-measure. These results demonstrate that the consideration of spatial connectivity in transit networks contributes to improved detection.

To further demonstrate the impact of modeling spatial connectivity, as proposed in Section IV, Figure 3 shows the development of the Euclidean distance for each constraint in Table I. As shown in the figure, distance increases at first because $W$ is initialized as a matrix of zeros. But, at higher iteration counts, distance decreases until it reaches a stable state. With regard to balancing the value of the loss function and the regularizers, our MDDM approach does learn to optimize for the similarity between spatially connected transit lines during each iteration step.

D. Case Studies

To justify our proposed method, we present a qualitative analysis of a real world case study. Figure 4 shows disruption events for 2015 on the Orange, Silver, and Blue lines operated by the Washington Metropolitan Area Transit Authority. Disruptions 1, 2, and 3 occurred on the Orange line. Disruptions 4 and 5 occurred on the Silver line. Disruptions 6, 7, and 8 occurred on the Blue line. Our MDDM model successfully detects disruptions 1, 4 and 6. Because MDDM uses a multi-task framework to jointly learn models for all metro lines, it can detect co-occurring events using data from other lines even when a model has few training samples. The LOG-LASSO model only detects disruption 1. Because the training step for each metro line is independent in LOG-LASSO, its performance suffers from a lack of training samples. Although LOG-RMTFL detects disruptions 2, 3, 7, and 8, it does not perform well on the Silver line. This is because it does not model spatial connectivity, which could boost performance on the Silver line by training together with the spatially interconnected Orange line.

VII. CONCLUSION

We have demonstrated a multi-task learning based supervised learning model for the problem of metro delay detection using social media data. We have motivated the need for training all tasks simultaneously instead of building a model for each line separately. Our work considers the unique metro-specific assumptions in feature space, reflected in the two kinds of regularizers proposed in the model. We proposed an efficient algorithm based on the ADMM framework in which the main problem is divided into several sub-problems which can be solved using block coordinate descent and proximal operators. Our empirical results demonstrate that our proposed model can effectively detect metro delays even with very weak signals in social media space and outperform competing methods by a substantial margin on both precision and recall. For future work, we plan to extend our model to use multiple data sources including transportation related data.

ACKNOWLEDGMENT

This work is supported in part by the National Science Foundation via grants DGE-1545362, IIS-1633363, and by the Army Research Laboratory under grant W911NF-17-1-0021. The US Government is authorized to reproduce and distribute reprints of this work for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of NSF, Army Research Laboratory, or the U.S. Government.

Fig. 4. A timeline of metro disruptions on the Orange, Blue, and Silver metro lines in 2015. Events along these spatially interconnected lines often co-occur.

REFERENCES

[1] E. L. Chao, J. Rosen, P. Hu, and R. Schmitt, “Transportation statistics annual report, 2017,” 2018.

[2] J. Neff and M. Dickens, “2016 public transportation fact book,” 2017.

[3] X. Zhang, Z. Chen, W. Zhong, A. P. Boedihardjo, and C.-T. Lu, “Storytelling in heterogeneous twitter entity network based on hierarchical cluster routing,” in Big Data (Big Data), 2016 IEEE International Conference on. IEEE, Conference Proceedings, pp. 1522–1531.

[4] S. Aslam, “Twitter by the numbers: Stats, demographics & fun facts,” Omnicoreagency.com, 2018. [Online]. Available: https://www.omnicoreagency.com/twitter-statistics

[5] S. Hemminki, P. Nurmi, and S. Tarkoma, “Accelerometer-based transportation mode detection on smartphones,” in Proceedings of the 11th ACM Conference on Embedded Networked Sensor Systems. ACM, Conference Proceedings, p. 13.

[6] F. Jiang, J. Yuan, S. A. Tsaftaris, and A. K. Katsaggelos, “Anomalous video event detection using spatiotemporal context,” Computer Vision and Image Understanding, vol. 115, no. 3, pp. 323–333, 2011.

[7] S. Hasan and S. V. Ukkusuri, “Urban activity pattern classification using topic models from online geo-location data,” Transportation Research Part C: Emerging Technologies, vol. 44, pp. 363–381, 2014.

[8] E. Mai and R. Hranac, “Twitter interactions as a data source for transportation incidents,” in Proc. Transportation Research Board 92nd Ann. Meeting, Conference Proceedings.

[9] B. Pender, G. Currie, A. Delbosc, and N. Shiwakoti, “Social media use during unplanned transit network disruptions: A review of literature,” Transport Reviews, vol. 34, no. 4, pp. 501–521, 2014.

[10] V. Guralnik and J. Srivastava, “Event detection from time series data,” in Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, Conference Proceedings, pp. 33–42.

[11] T. Ma, G. Motta, and K. Liu, “Delivering real-time information services on public transit: A framework,” IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 10, pp. 2642–2656, 2017.

[12] T. Sakaki, M. Okazaki, and Y. Matsuo, “Earthquake shakes twitter users: real-time event detection by social sensors,” in Proceedings of the 19th international conference on World wide web. ACM, Conference Proceedings, pp. 851–860.

[13] W. Zhang, G. Qi, G. Pan, H. Lu, S. Li, and Z. Wu, “City-scale social event detection and evaluation with taxi traces,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 6, no. 3, p. 40, 2015.

[14] M. Santillana, A. T. Nguyen, M. Dredze, M. J. Paul, E. O. Nsoesie, and J. S. Brownstein, “Combining search, social media, and traditional data sources to improve influenza surveillance,” PLoS computational biology, vol. 11, no. 10, p. e1004513, 2015.

[15] M. S. Gerber, “Predicting crime using twitter and kernel density estimation,” Decision Support Systems, vol. 61, pp. 115–125, 2014.

[16] M. J. Paul, A. Sarker, J. S. Brownstein, A. Nikfarjam, M. Scotch, K. L. Smith, and G. Gonzalez, “Social media mining for public health monitoring and surveillance,” in Biocomputing 2016: Proceedings of the Pacific Symposium. World Scientific, Conference Proceedings, pp. 468–479.

[17] J. Parker, Y. Wei, A. Yates, O. Frieder, and N. Goharian, “A framework for detecting public health trends with twitter,” in Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. ACM, Conference Proceedings, pp. 556–563.

[18] L. Chen, K. T. Hossain, P. Butler, N. Ramakrishnan, and B. A. Prakash, “Syndromic surveillance of flu on twitter using weakly supervised temporal topic models,” Data Mining and Knowledge Discovery, vol. 30, no. 3, pp. 681–710, 2016.

[19] J. Zhou, J. Chen, and J. Ye, “Malsar: Multi-task learning via structural regularization,” Arizona State University, vol. 21, 2011.

[20] T. Evgeniou and M. Pontil, “Regularized multi–task learning,” in Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, Conference Proceedings, pp. 109–117.

[21] A. Argyriou, T. Evgeniou, and M. Pontil, “Multi-task feature learning,” in Advances in neural information processing systems, Conference Proceedings, pp. 41–48.

[22] R. K. Ando and T. Zhang, “A framework for learning predictive structures from multiple tasks and unlabeled data,” Journal of Machine Learning Research, vol. 6, no. Nov, pp. 1817–1853, 2005.

[23] L. Zhao, Q. Sun, J. Ye, F. Chen, C.-T. Lu, and N. Ramakrishnan, “Multi-task learning for spatio-temporal event forecasting,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015, pp. 1503–1512.

[24] S. Mohammad, S. Kiritchenko, and X. Zhu, “Nrc-canada: Building the state-of-the-art in sentiment analysis of tweets,” in Proceedings of the seventh international workshop on Semantic Evaluation Exercises (SemEval-2013), Atlanta, Georgia, USA, June 2013.

[25] A. Argyriou, T. Evgeniou, and M. Pontil, “Convex multi-task feature learning,” Machine Learning, vol. 73, no. 3, pp. 243–272, 2008.

[26] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine learning, vol. 3, no. 1, pp. 1–122, 2011.

[27] Y. Xu and W. Yin, “A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion,” SIAM Journal on imaging sciences, vol. 6, no. 3, pp. 1758–1789, 2013.

[28] N. Parikh, S. Boyd et al., “Proximal algorithms,” Foundations and Trends® in Optimization, vol. 1, no. 3, pp. 127–239, 2014.

[29] S. M. Mohammad, S. Kiritchenko, and X. Zhu, “Nrc-canada: Building the state-of-the-art in sentiment analysis of tweets,” arXiv preprint arXiv:1308.6242, 2013.

[30] R. Compton, C. Lee, J. Xu, L. Artieda-Moncada, T.-C. Lu, L. De Silva, and M. Macy, “Using publicly visible social media to build detailed forecasts of civil unrest,” Security informatics, vol. 3, no. 1, p. 4, 2014.

[31] N. Ramakrishnan, P. Butler, S. Muthiah, N. Self, R. Khandpur, P. Saraf, W. Wang, J. Cadena, A. Vullikanti, G. Korkmaz et al., “‘beating the news’ with embers: forecasting civil unrest using open source indicators,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014, pp. 1799–1808.
