
Estimating Primary Demand in Bike-sharing Systems

Chong Yang Goh, Operations Research Center, Massachusetts Institute of Technology

[email protected]

Chiwei Yan, Operations Research Center, Massachusetts Institute of Technology

[email protected]

Patrick Jaillet, Operations Research Center, Massachusetts Institute of Technology

[email protected]

We consider the problem of estimating the primary (first-choice) demand in a bike-sharing system. Such

estimates are crucial for various operational decisions, such as the allocation and rebalancing of bikes to

meet demand over a planning horizon. A key challenge in estimating the demand is to account for the

choice substitution effect, where a commuter may substitute a trip for close alternatives when the first-choice

location for picking up or returning a bike is not available. In this paper, we propose a method for estimating

the primary demand using a rank-based demand model. The model accounts for choice substitutions by

treating each observed trip as the best available option in a latent ranking over origin-destination (OD)

pairs. To solve the high-dimensional estimation problem that arises from the combinatorial growth of origin-

destination rankings, we propose algorithms that (i) find sparse representations of the model parameters

efficiently, and (ii) restrict station substitutions spatially according to the bike-sharing network. We prove

consistency results for the model and develop efficient, first-order methods based on difference-of-convex

(DC) programming. Our approach is tractable on a city scale, which we demonstrate on a bike-sharing

system in the Boston metropolitan area. Experimental evaluations on both simulated and real-world data sets

show that our method can significantly reduce biases in ridership prediction and primary demand estimation,

which results in increased ridership when used as inputs to an operational model for bike allocation.

Key words : bike-sharing, rank-based choice models, demand estimation

History : January 6th, 2019, this version

1. Introduction

Over the past decade, bike-sharing services have seen widespread adoption in cities across the world

in response to urban congestion. As of 2018, there are approximately 1,500 bike-sharing services in

operation globally (based on data compiled by www.bikesharingmap.com, as of February 2018), spanning

continents from North America and Europe to Asia. Most bike-sharing services operate by providing a

fleet of rentable bikes that are distributed over a fixed network of

stations, each capable of holding a certain number of bikes at a given time. A commuter can pick

up a bike from any station for temporary use and drop it off later by parking it in an available

dock at the destination station.

In operating a bike-sharing system, a fundamental challenge is to match the supply of bikes

or docks with the demand for rides across location and time. Without timely intervention, the

asymmetry between pick-up and drop-off demand can often result in the quick depletion of either

bikes or docks, rendering the stations out of service during peak hours. At best, such temporary

outage is an inconvenience to the commuters who have to find alternatives at nearby stations;

at worst, it turns them away and results in ridership losses over time. To mitigate this supply-demand

imbalance, operators often carry out overnight or periodic relocation of the bikes from one station

to another—a costly operation known as bike rebalancing. The success of these operations crucially

depends on the operator’s ability to estimate the ridership demand accurately over the planning

horizon, which motivates the topic of our paper.

In this paper, we focus on the problem of estimating the primary demand in a bike-sharing

system. Here, the primary demand refers to the demand that would be observed if the service

never goes out of service anywhere or, equivalently, if every pick-up or drop-off request by a potential

commuter is fulfilled at the station of her first choice. For operational purposes, the demand is usually

quantified by the bike outflow and inflow rates at individual stations, or the flow rates between

origin-destination pairs. Estimating these flow rates by naively aggregating past trip data, however,

can be problematic for two reasons: demand censoring and choice substitution—both can introduce

significant biases in the estimates if not properly corrected for. Demand censoring occurs when

stations are temporarily out of bikes or docks, during which there is apparently zero demand. This

can be corrected, to some degree, by filtering the time periods of outage before aggregating the

trip data, as done in O’Mahony and Shmoys (2015). But it is more difficult to account for choice

substitution, which occurs when commuters substitute their first choices for other alternatives due

to service unavailability. Choice substitution in bike-sharing can happen in two ways: commuters

may decide to use available stations nearby instead of their first choices—in which case, the demand

is recaptured elsewhere; otherwise, commuters may leave the system for external alternatives, such

as other modes of transportation, hence the demand is said to be spilled or lost. Both factors can

contribute to either overestimation (by recaptures) or underestimation (by spills) of the primary

demand.

A standard approach for analyzing substitution behaviors is to use a choice model, which specifies

the likelihood that a customer makes a certain choice when presented with a set of alternatives or

offer set. Analogous to the retail settings in which such models are most commonly applied, we can

view each station in a bike-sharing system as a product to be considered among a set of available


stations (the offer set). But one unique characteristic of a bike-sharing service is that the products

are not differentiated by their prices or other intrinsic attributes, such as brand and color, because

most operators charge identical rates and provide similar bikes regardless of location; instead, they

are largely differentiated by their relative ease of access to the commuter—for example, earlier

estimates showed that a commuter whose point of origin is 300m away from a station is about

60% less likely to use the service than one who is close by (Kabra et al. 2016). Further, commuters

often have real-time access to information regarding the availability of all stations via websites

or smartphone apps, which facilitates timely discovery of alternatives. These factors suggest that

substitution is most likely to occur in parts of the network where stations are densely positioned,

hence the need to consider its effect in demand modeling.

Most earlier works on bike-sharing demand modeling do not take into account choice substitution.

For example, there is a line of work on predicting the observed ridership with regression models

that include covariates such as weather and demographic features (Rixey 2013, Singhvi et al. 2015),

but without isolating the contributions due to spills and recaptures. Others have focused mainly on

optimizing bike rebalancing operations (O’Mahony and Shmoys 2015, Raviv et al. 2013), wherein

the demand is either assumed to be known or estimated by ad-hoc data aggregation. The works

most relevant to ours are those of Kabra et al. (2016) and Pendem (2016), who proposed multinomial

logit models (MNL) that spatially differentiate choices based on the walking distances between

stations. However, our approach differs in that we estimate the primary demand without explicitly

modeling its functional relationship with other variables, such as population density and distances

between stations. While the inclusion of these covariates is useful for analyzing how the demand

varies with the environmental factors, they also impose parametric modeling constraints that are

difficult to justify a priori. In what follows, we instead consider a more general class of choice

models that can flexibly adapt to data with minimal assumptions, while exploiting the network

structure of a bike-sharing service to make its estimation tractable.

We propose a bike-sharing demand model that consists of two parts: (i) a Poisson process that

models commuter arrivals, and (ii) a generic rank-based choice model that defines commuter behav-

iors by their preference rankings of the alternatives. Specifically, we assume that each arriving

commuter has a strict ordering of stations by preference, which is drawn i.i.d. from some probability

distribution over preference rankings, and always chooses the highest ranked alternative from the

offer set (including the option to leave the system). From a modeling perspective, one advantage of

rank-based choice models is that they are “non-parametric”; that is, they are compatible with any

rational choice behaviors and require no prior assumption about the functional form of the choice

probabilities. Furthermore, their complexity can flexibly scale with data as preference rankings are

incrementally added to the model (Farias et al. 2013, van Ryzin and Vulcano 2014). By contrast,


parametric models such as MNL impose certain structural restrictions on substitutions, such as the

independence of irrelevant alternatives (IIA) property, which imply behaviors that may be unrealistic

in practice. Specifically, the IIA property requires that the ratio of ridership shares between any

two available stations stays constant regardless of the availability of other choices. In Figure 1 below,

we illustrate why such a property is likely to be violated in a bike-sharing system. Suppose there are

three stations A, B and C, and three commuters, each located near one of the stations. We assume

that all commuters rank the stations in increasing order of distance to their respective locations,

with their preference rankings shown in Figure 1. At an aggregate level, the ridership shares of

the three stations A, B and C are (1/3,1/3,1/3), respectively. Now, suppose station A runs out

of bikes. Since station B is closest to A, the commuter near station A will pick up a bike from

station B instead, which results in ridership shares of (0,2/3,1/3). The ratio of shares between

stations B and C is thus not preserved. Although there are more complex parametric models that

relax the IIA property, such as nested logit (Ben-Akiva et al. 1985) and mixed logit (Train 2009)

models, specifying a suitable parametric structure a priori is often difficult in practice. Moreover,

fitting a complex but incorrectly specified parametric model to data can lead to poor generalization

performance due to overfitting (Farias et al. 2013).
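
The three-station example above can be checked with a short sketch. The concrete rankings below are assumptions chosen to match the description (each commuter prefers the nearest station, and B is the closest alternative to A); the function name `ridership_shares` is hypothetical.

```python
from fractions import Fraction

# Assumed preference rankings (most preferred first), one per commuter.
rankings = [
    ("A", "B", "C"),  # commuter near station A
    ("B", "A", "C"),  # commuter near station B
    ("C", "B", "A"),  # commuter near station C
]

def ridership_shares(rankings, offer_set):
    """Share of commuters choosing each station when only offer_set is available."""
    counts = {s: 0 for s in ("A", "B", "C")}
    for ranking in rankings:
        # Each commuter picks the highest-ranked station that is available.
        choice = next(s for s in ranking if s in offer_set)
        counts[choice] += 1
    return {s: Fraction(c, len(rankings)) for s, c in counts.items()}

full = ridership_shares(rankings, {"A", "B", "C"})
without_a = ridership_shares(rankings, {"B", "C"})

# Ratio of shares between B and C before and after A becomes unavailable.
ratio_full = full["B"] / full["C"]
ratio_without_a = without_a["B"] / without_a["C"]
```

Under these assumed rankings the B-to-C ratio changes from 1 to 2 once A is removed, which is exactly the behavior that an IIA-compliant model such as MNL cannot reproduce.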

Figure 1 An example with 3 stations (labeled A, B, C) and 3 commuters to illustrate the violation of the IIA property.

Each commuter’s ranking over the choice set {A,B,C} is specified by a 3-tuple in decreasing

order of preference.

[Figure omitted: stations A, B and C with the commuters' preference rankings.]

Rank-based demand models of the kind that we propose have been studied in various domains,

particularly in retail and airline industries (Haensel et al. 2011, van Ryzin and Vulcano 2014, 2017).

But one distinguishing feature of our application is that choice substitutions tend to be spatially

constrained in the bike-sharing network—that is, commuters tend to substitute an unavailable

station for other alternatives that are in close proximity to their points of origin. In this regard,

rank-based models afford us the flexibility to specify exactly how the choices can substitute for

one another depending on the spatial configuration of the bike-sharing network, without necessarily

satisfying the IIA property. In what follows, we summarize the main contributions of this

paper in three areas: modeling, estimation and experiments.

Modeling. We propose to augment the rank-based choice model with a substitution graph, which

precisely characterizes the set of substitutable alternatives for each station in the bike-sharing


network through a set of graphical constraints. The introduction of these substitution constraints

serves two purposes: (i) to model commuters’ tendency to explore only alternatives that are in close

proximity to their first-choice stations, and (ii) to reduce the number of parameters of the model by

eliminating preference rankings that imply unrealistic substitution behaviors. Further, we extend

the rank-based choice model to jointly account for pick-up and drop-off substitutions using paired

rankings of OD choices. This extended model allows us to estimate the primary demand for trips

between all origin-destination (OD) station pairs, which is useful for various inventory planning

and rebalancing operations (e.g., Shu et al. (2013), Ghosh et al. (2017)).

Estimation. Our proposed model is characterized by a set of parameters—the Poisson arrival

rate and the probability mass function of preference rankings—that can be estimated by maxi-

mum likelihood estimation (MLE). One issue, however, is that the number of preference rankings

grows factorially with the number of alternatives. Therefore, it is usually not feasible to solve

the fully parametrized MLE problem using off-the-shelf optimization solvers. Instead, large-scale

optimization methods such as constraint sampling (Farias et al. 2013) and column generation (van

Ryzin and Vulcano 2014) have been applied. These methods begin by solving the MLE problem

with a small set of preference rankings, and then iteratively improve the likelihood objective by

adding new rankings to the model. However, van Ryzin and Vulcano (2014) showed that finding

a ranking that improves the likelihood requires solving an integer programming sub-problem that

is NP-hard, making it intractable when there are hundreds of alternatives, as is the case in many

bike-sharing systems. In this paper, we take a different approach by proposing a dimensionality

reduction technique that is based on sparse representations. More specifically, given a sequence of

observed available stations (offer sets) and choices over time, our algorithm will generate a parsimo-

nious set of preference rankings that sufficiently explain the observations, subject to the constraints

imposed by a substitution graph. For our application, we find that this approach can reduce the

problem dimension by orders of magnitude, making it feasible to solve the MLE problem without

resorting to large-scale optimization methods. Our second contribution is in developing a special-

ized estimation algorithm. We show that the MLE problem, while nonconcave in general, can be

reduced to a difference-of-convex (DC) program. Based on this reduction, we apply an iterative DC

programming technique that estimates the parameters by solving a sequence of convex programs,

where each convex program can be solved efficiently using the Frank-Wolfe method. In particular,

the estimation procedure can be described by a series of closed-form updates that uses only the

first derivative of the likelihood function, which makes it scalable to large bike-sharing systems

with hundreds of stations. We show that the proposed procedure enjoys theoretical convergence

guarantees and is very efficient in practice—often achieving vast but quickly diminishing improve-

ments in the likelihood objective after solving only the first few convex programs. Finally, from a


statistical point of view, we also extend the consistency results of van Ryzin and Vulcano (2014)

by proving the consistency of MLE under a Poisson arrival process and a general non-concave

likelihood function, where one cannot distinguish no arrivals from arrival without pickups (due to

a stockout).

Experiments. We evaluate our methods on synthetic and actual data sets from Hubway, a

bike-sharing service in the Boston metropolitan area. The purpose of our study is two-fold: (i) to

demonstrate the practicality of our method on a city scale, and (ii) to evaluate the effectiveness of

our primary demand estimation procedure, in terms of both its accuracy and operational impact

on ridership. We show that across a wide range of simulated conditions, our method consistently

outperforms several benchmarks, including the independent demand and MNL models. Specifically,

we reduce the error by more than 20% when the average stockout frequency is close to the

actual levels observed in Boston during the evening peak hours. These improvements translate into

ridership increases of up to 3% when bikes are allocated based on the estimated demands, using a

bike deployment model by Shu et al. (2013) with a planning horizon of several hours. When applied

to the actual Hubway data set, our method achieves an overall out-of-sample error of 24.8% in

ridership prediction, which is 10% lower than that of the best-performing benchmark.

The rest of this paper is organized as follows. We first give a literature review in Section 2. We

then formally describe our bike-sharing demand model and derive the MLE problem in Section 3.

Next, we propose an MLE estimation procedure and address the computational issues resulting

from high dimensional parameter space in Section 4. After that, we discuss the statistical properties

of our methods in Section 5. We then extend our model to jointly account for origin and destination

demand using paired preference rankings in Section 6. Finally, in Sections 7 and 8, we present the

results of our empirical studies on Hubway bike-share.

2. Related Work

Bike-sharing Demand. There have been numerous works that analyzed bike-sharing usage

by incorporating data from heterogeneous sources (Rixey 2013, Singhvi et al. 2015, El-Assi et al.

2017), among them being demographic characteristics, built environment factors, weather and

usage data of other connecting transportation (see El-Assi et al. (2017) for a summary of the recent

literature and references therein). By comparison, there is relatively scant work on estimating the

primary demand of bike-sharing systems. Pendem (2016) proposed a stochastic demand model that

assumes a Poisson arrival process and an MNL choice model for pick-ups, where the substitution

probability of one station by another is parametrized by a nonlinear function of the walking distance

between them. Kabra et al. (2016) focused primarily on the question of how the accessibility and

the availability of a bike-sharing service impact ridership. To that end, the authors proposed a


structural demand model in which the commuter arrival rate at a location is assumed to vary

linearly with several covariates, such as the local population density and metro usage. The pick-up

choice model is similar to that of Pendem (2016), except that the choice probability is a piecewise

linear function of distance. Both works differ from ours in two aspects: (i) we adopt a more general

class of nonparametric choice models, and (ii) we do not impose parametric relationships between

the demand and other covariates. Further, our approach can be extended to model the trip demand

for all origin-destination station pairs, whereas earlier works focused only on pick-ups.

Rebalancing Operations and Capacity Allocation. Existing work in bike rebalancing

optimization can generally be classified into two streams: (i) static rebalancing (Raviv et al. 2013,

Chemla et al. 2013), where bikes are repositioned (usually overnight) ahead of major demand spikes,

and (ii) dynamic rebalancing (O’Mahony and Shmoys 2015, Ghosh et al. 2017), which is carried out

continuously during peak usage periods. These operations are complex and multifaceted, involving

problems such as inventory planning (Shu et al. 2013, O’Mahony and Shmoys 2015, Schuijbroek

et al. 2017), vehicle routing (Schuijbroek et al. 2017) and pricing (Singla et al. 2015). Aside from

rebalancing operations, Freund et al. (2017) also address the dock-capacity allocation problem. One

common thread is that all these operations depend on estimated demand for decision support. For

example, O’Mahony and Shmoys (2015) use the pick-up and drop-off rates at individual stations

as inputs to a dynamic rebalancing model, whereas Shu et al. (2013) use the OD flow rates between

all station pairs to optimize bike inventory levels over a planning horizon. However, in all of these

works, the demand is either assumed to be known or estimated in an ad-hoc way without taking

into account choice substitution.

Demand Estimation using Choice Models. Discrete choice models have long been applied

to estimate the primary demand of substitutable products (Vulcano et al. 2010, 2012, Haensel

et al. 2011, van Ryzin and Vulcano 2014, 2017). In the literature, our model is closest to that of

Haensel et al. (2011) and van Ryzin and Vulcano (2014). Both considered rank-based choice models

under a customer arrival process that is either Poisson (Haensel et al. 2011) or Bernoulli (van

Ryzin and Vulcano 2014). But one distinguishing feature of our model is that choice substitutions

can be constrained according to the spatial configuration of the bike-sharing network—a notion

that we formalize using a substitution graph later in Section 4.1.1. To handle the high parameter

dimension of rank-based choice models, earlier work has applied constraint sampling (Farias et al.

2013) and column generation (van Ryzin and Vulcano 2014) for estimation. Farias et al. (2013)

considered a different setup where instead of MLE, the goal is to estimate ranking probabilities

that are robust against a revenue function under a fixed assortment, while assuming that a subset


of choice probabilities are observable and consistent. van Ryzin and Vulcano (2014) solved the high-

dimensional MLE problem by column generation, which incrementally adds rankings to the model

using dual information. But each iteration requires solving an integer program that is NP-hard.

Another related approach was taken by Jagabathula and Rusmevichientong (2016), who identified

certain structures of offer set collections under which the estimation of rank-based models can

be formulated efficiently. Our proposed dimensionality reduction method departs from the ones

by Farias et al. (2013) and van Ryzin and Vulcano (2014) in that it is applied before solving the

estimation problem itself. It also differs from Jagabathula and Rusmevichientong (2016) in that

we do not make any particular assumptions on the structure of the offer sets; instead, we take into

account the observed choices and offer sets to identify a parsimonious set of preference rankings

that are sufficient for solving the MLE problem optimally.

To estimate the parameters of rank-based models, previous works have derived specialized MLE

methods based on the expectation-maximization (EM) procedure. van Ryzin and Vulcano (2017)

derived an EM procedure that has closed form solution iterates, but it is tailored to the Bernoulli

arrival assumption and not directly applicable to our setting. Haensel et al. (2011) proposed an

EM formulation that works with Poisson arrivals, but their approach did not fully address the core

issue of data incompleteness—that we can only observe the choices made by purchasing customers

without knowing their full preference rankings—but instead relied on data imputation heuristics

that are not soundly justified. Our proposed DC programming technique differs from the EM

procedure in that it directly exploits convex structures inherent in the likelihood function itself,

rather than evaluating the conditional expectation of the function with respect to an extended data

space (also known as the E-step in EM procedure). The two approaches, however, can be understood

as special cases of the majorization-minimization (MM) algorithm. The interested reader may refer

to Lipp and Boyd (2016) and Le Thi and Dinh (2018) for a discussion of these connections. A similar

MM algorithm for estimating MNL models was recently proposed by Abdallah and Vulcano (2017),

who demonstrated empirically that it converges much faster than an EM method. To the best of

our knowledge, ours is the first work that applies this approach to estimating rank-based choice

models.

3. Demand Model

We first develop a demand model for pick-up requests, which describes the arrival process of

potential commuters and their choices of where to pick up the bikes from. This approach can

be extended to model demand for both pick-up and return requests, which we explicate later in

Section 6, with an essential difference being that the choices consist of origin-destination station

pairs rather than individual stations.


3.1. Preliminaries

Let us consider a bike-sharing system with $N$ stations, denoted as $\mathcal{N} \triangleq \{1, \dots, N\}$. We define the

commuter’s choice set as $\mathcal{N} \cup \{0\}$, where 0 denotes the choice of not using the service. To model

the demand, we assume that commuters with pick-up requests arrive to the system according to

a Poisson process with rate λ per unit time. Each arriving commuter is characterized by a type

$\sigma : \{0, \dots, N\} \mapsto \{0, \dots, N\}$, a bijection that specifies a strict ordering over the choice set as follows:

for any two choices i, j ∈ {0, . . . ,N}, i is preferred over j if and only if σ(i) < σ(j), i.e., choice i

is ranked higher than choice j. Note that we do not require each commuter to exhaust all of the

N station choices before deciding to leave the system. For example, when N = 3, a commuter of

type σ with σ(1)<σ(2)<σ(0)<σ(3) would consider up to two alternatives, station 1 followed by

station 2, before choosing not to use the service at all. With slight abuse of notation, we sometimes

represent the type in tuple form as σ = (1,2), which lists all stations that are ranked higher than

the option 0 in decreasing order of preference. In general, we denote the length of a type σ as |σ|,

which is the number of alternatives that are ranked higher than the option 0. The set of all types is

denoted Σ. For the model to be useful, we assume that Σ does not include any type of length zero

(i.e., with leaving the system as the first choice) because commuters of such types are unobservable

and irrelevant as far as the demand is concerned.

We assume that the type of each arriving commuter is drawn independently from a probability

distribution over the set of all possible types, $\Sigma \triangleq \{\sigma_1, \dots, \sigma_K\}$. We parametrize the probability

mass function of this distribution with a vector $x \triangleq (x_1, \dots, x_K)$, where $x_k$ is the probability that

the commuter is of the type $\sigma_k$. Overall, the demand model is parametrized by the arrival rate

λ and the type probability vector x, both of which are unknown and have to be estimated from

data. For simplicity, we assume that the model is time-homogeneous, i.e., both λ and x do not

vary as a function of time. In practice, to account for time heterogeneity (for example, to model

arrival rates during peak and off-peak periods differently), one simple approach is to model the

demand separately for different time segments of data, each with a different set of parameters. All

estimation methods that we will discuss later can likewise be applied separately in this manner.

3.2. Observation and Type Compatibility

To set up the estimation problem, we first introduce several definitions that relate an observed

choice to the underlying model parameters. Suppose we observe a pick-up at station j while the

set of available stations is $S \subseteq \mathcal{N}$ ($S$ is called the offer set). We say that a type $\sigma$ is compatible

with the observed choice $j \in S$ given the offer set $S$ if $\sigma(j)$ corresponds to the highest rank among

the alternatives in $S$. The set of all such types, $\mathcal{M}_j(S)$, is defined as follows:

$$\mathcal{M}_j(S) \triangleq \{k \in \{1, \dots, K\} : \sigma_k(j) < \sigma_k(i),\ \forall i \in S,\ i \neq j\}.$$


The definition allows for $j = 0$, so that $\mathcal{M}_0(S)$ is the set of types which are compatible with not

using the service given the offer set S.

For a given probability vector x that parametrizes the type distribution, the probability that a

commuter chooses option j when presented with the offer set S is given by

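$$\mathbb{P}(j \mid S; x) \triangleq \begin{cases} \sum_{k \in \mathcal{M}_j(S)} x_k, & \text{if } j \in S \cup \{0\}, \\ 0, & \text{otherwise.} \end{cases}$$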

The function P(· | · ;x) fully specifies a choice model that predicts the fraction of commuters picking

a bike from any station or not using the service under any service outage pattern. In particular,

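$\mathbb{P}(j \mid \mathcal{N}; x)$ and $\lambda_j \triangleq \lambda\,\mathbb{P}(j \mid \mathcal{N}; x)$ correspond to the first-choice probability and primary demand

for station $j$, respectively, while $\rho(S; x) \triangleq \sum_{j \in S} \mathbb{P}(j \mid S; x) = 1 - \mathbb{P}(0 \mid S; x)$ is the probability that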

given the offer set S, an arriving commuter will stay and use the service.
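
As a minimal sketch of these definitions, the snippet below computes the compatible types, the choice probabilities and the stay probability for a toy model. The types, probabilities and arrival rate are made-up values, the function names are hypothetical, and the sketch assumes that the no-use option 0 is always available, so a station is chosen only if it is ranked above 0.

```python
from fractions import Fraction

# Types in tuple form: stations ranked above option 0, most preferred first.
# Both the types and their probabilities are illustrative assumptions.
types = [(1, 2), (2, 1, 3), (3,), (1,)]
x = [Fraction(4, 10), Fraction(3, 10), Fraction(2, 10), Fraction(1, 10)]

def choice(sigma, offer_set):
    """Option chosen by type sigma: first available station in sigma, else 0."""
    for station in sigma:
        if station in offer_set:
            return station
    return 0

def compatible_types(j, offer_set, types):
    """Indices k of the types that choose option j under offer_set."""
    return [k for k, sigma in enumerate(types) if choice(sigma, offer_set) == j]

def choice_prob(j, offer_set, types, x):
    """P(j | S; x): total probability of the types compatible with j."""
    if j != 0 and j not in offer_set:
        return Fraction(0)
    return sum((x[k] for k in compatible_types(j, offer_set, types)), Fraction(0))

def stay_prob(offer_set, types, x):
    """rho(S; x): probability that an arriving commuter uses the service."""
    return sum((choice_prob(j, offer_set, types, x) for j in offer_set), Fraction(0))

all_stations = {1, 2, 3}
lam = 10  # assumed arrival rate per unit time
primary_demand = {j: lam * choice_prob(j, all_stations, types, x) for j in all_stations}
rho_outage = stay_prob({2, 3}, types, x)  # station 1 is out of service
```

With station 1 unavailable, the type `(1, 2)` is recaptured at station 2 while the type `(1,)` is lost, so the stay probability drops below one.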

In general, $x$ overparametrizes the choice model because its dimension $K$ scales with $N! \sim N^{(N + \frac{1}{2})}$,

whereas the function $\mathbb{P}(\cdot \mid \cdot\,; x)$ can be fully specified by only $O(N 2^N)$ parameters. Therefore,

for rank-based choice models of this kind, it is well known that $x$ cannot be uniquely identified

based on only observed choices and offer sets (van Ryzin and Vulcano 2014). However, as we will

show in Section 5, it is still possible to identify P(· | · ;x) and recover the primary demand under

conditions that are more relaxed than those assumed in a similar result by van Ryzin and Vulcano

(2014).
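
To give a rough sense of this gap, the sketch below compares the two counts for a few values of N. It assumes that types are counted in tuple form (ordered lists of distinct stations of length at least one), and it counts one choice probability per pair of an offer set and a station in that set; both function names are hypothetical.

```python
from math import factorial, perm

def num_types(n):
    """Number of tuple-form types of length >= 1 over n stations."""
    return sum(perm(n, length) for length in range(1, n + 1))

def num_choice_probs(n):
    """Number of (offer set, chosen station) pairs: sum over S of |S| = n * 2^(n-1)."""
    return n * 2 ** (n - 1)

for n in (3, 5, 10, 20):
    print(n, factorial(n), num_types(n), num_choice_probs(n))
```

Even for a few dozen stations, the number of types exceeds the number of choice probabilities by many orders of magnitude, which is why different vectors x can induce the same choice model.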

3.3. Likelihood Function

Suppose we have a series of observations over time periods indexed by t= 1, . . . , T . Without loss

of generality, we assume that each time period is of unit length, so that the number of arrivals in

the time period has a Poisson distribution with parameter λ. For each time period t, we observe

the offer set $S_t$ and the arrival vector $(m_{t1}, \dots, m_{tN})$, where $m_{tj}$ is the number of bikes picked up

at station $j$ within the time period. To avoid difficult modeling issues involving data truncation,

we assume that the offer set $S_t$ remains the same throughout the $t$-th time period. That is, each

station is either available or unavailable for service throughout the entire time period. In practice,

for this assumption to be reasonable, the length of the time period should be chosen to be small

enough (such as one minute or less) such that service availability is unlikely to change in between.

In our data sets, the time granularity of station status is one minute.
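
The observation process described above can be simulated with a short sketch. It assumes types in tuple form (stations ranked above the no-use option, most preferred first) and uses made-up parameter values; `simulate_period` is a hypothetical helper, not part of the paper's method.

```python
import random

def simulate_period(lam, types, x, offer_set, n_stations, rng):
    """Simulate the pick-up counts (m_t1, ..., m_tN) over one unit-length period
    during which the offer set stays fixed."""
    counts = [0] * n_stations
    # Exponential inter-arrival times with rate lam generate a Poisson process.
    clock = rng.expovariate(lam)
    while clock <= 1.0:
        sigma = rng.choices(types, weights=x, k=1)[0]  # draw the commuter's type
        for station in sigma:
            if station in offer_set:  # highest-ranked available station
                counts[station - 1] += 1
                break
        # A commuter who finds no acceptable station leaves without being observed.
        clock += rng.expovariate(lam)
    return counts

rng = random.Random(0)
types = [(1, 2), (2, 1, 3), (3,), (1,)]  # illustrative types
x = [0.4, 0.3, 0.2, 0.1]                 # illustrative type probabilities
observations = [simulate_period(5.0, types, x, {2, 3}, 3, rng) for _ in range(2000)]
mean_pickups = sum(sum(m) for m in observations) / len(observations)
```

With station 1 unavailable, the average number of pick-ups per period should be close to the arrival rate times the stay probability (5 × 0.9 under these assumed values), while lost commuters leave no trace in the data.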

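For each $t$, let us define $M_t \triangleq \sum_{j=1}^{N} m_{tj}$ to be the total number of pick-ups observed over the

time period. The probability of observing the vector $(m_{t1}, \dots, m_{tN})$ under the offer set $S_t$ is

$$\frac{\exp(-\lambda\rho(S_t; x))\,(\lambda\rho(S_t; x))^{M_t}}{M_t!} \cdot \frac{M_t!}{m_{t1}!\, m_{t2}! \cdots m_{tN}!} \prod_{j=1}^{N} \left(\frac{\mathbb{P}(j \mid S_t; x)}{\rho(S_t; x)}\right)^{m_{tj}}.$$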


In the above, recall that $\rho(S_t; x)$ denotes the probability that given the offer set $S_t$, an arriving

commuter will stay and use the service. Thus, $\lambda\rho(S_t; x)$ is the observed rate of arrival to the

system. Taking the logarithm of the probability and disregarding the constant terms, we obtain

the likelihood function $\ell_t(\lambda, x)$ for the $t$-th time period as

$$\begin{aligned}
\ell_t(\lambda, x) &= -\lambda\rho(S_t; x) + M_t \log(\lambda\rho(S_t; x)) + \sum_{j=1}^{N} m_{tj} \log\left(\frac{\mathbb{P}(j \mid S_t; x)}{\rho(S_t; x)}\right) \\
&= -\lambda\rho(S_t; x) + M_t \log\lambda + \sum_{j=1}^{N} m_{tj} \log \mathbb{P}(j \mid S_t; x).
\end{aligned}$$

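By the independent increment property of the Poisson process, the likelihood function $L(\lambda, x)$ for

all observations over time periods $t = 1, \dots, T$ can be expressed as

$$\begin{aligned}
L(\lambda, x) &= \sum_{t=1}^{T} \ell_t(\lambda, x) \\
&= -\lambda \sum_{t=1}^{T} \rho(S_t; x) + \sum_{t=1}^{T} M_t \log\lambda + \sum_{t=1}^{T} \sum_{j=1}^{N} m_{tj} \log \mathbb{P}(j \mid S_t; x) \\
&= -\lambda \sum_{t=1}^{T} \Big(1 - \sum_{k \in \mathcal{M}_0(S_t)} x_k\Big) + M \log\lambda + \sum_{t=1}^{T} \sum_{j \in S_t} m_{tj} \log\Big(\sum_{k \in \mathcal{M}_j(S_t)} x_k\Big), \qquad (1)
\end{aligned}$$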

where $M \triangleq \sum_{t=1}^{T} M_t$ is the total number of pick-ups observed over all time periods. To estimate the parameters

by MLE, our goal is to maximize $L(\lambda, x)$ over $\lambda$ and $x$, subject to the constraints $\lambda \geq 0$ and $x \in \Delta_K$,

where $\Delta_K \triangleq \{x \in \mathbb{R}^K : \sum_{k=1}^{K} x_k = 1,\ x \geq 0\}$ is the $(K-1)$-dimensional simplex. It is helpful to

understand how the difficulty of this optimization problem depends on the service outage patterns.

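For example, if the offer sets are $S_t = \mathcal{N}$ for all $t$ (i.e., there is no service outage over the observation horizon), then the simplex constraint on $x$ implies that $\rho(S_t;x) = 1$, resulting in a linearly separable concave optimization problem,

$$\max_{\lambda\ge 0,\ x\in\Delta_K} L(\lambda,x) = \max_{\lambda\ge 0}\{-\lambda T + M\log\lambda\} + \max_{x\in\Delta_K}\sum_{t=1}^{T}\sum_{j=1}^{N} m_{tj}\log\Big(\sum_{k\in\mathcal{M}_j(\mathcal{N})} x_k\Big),\tag{2}$$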

which gives the following solution.

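Proposition 1. Under complete offer sets, $S_t = \mathcal{N}$ for all $t = 1, \dots, T$, any pair $(\lambda, x)$ that satisfies

$$\lambda = \frac{M}{T}, \qquad \sum_{k\in\mathcal{M}_j(\mathcal{N})} x_k = \frac{\sum_{t=1}^{T} m_{tj}}{M} \quad \forall j = 1, \dots, N,$$

is an optimal solution of the MLE problem.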


Intuitively, $\sum_{k\in\mathcal{M}_j(\mathcal{N})} x_k$ can be interpreted as the first-choice probability estimate, $\mathbb{P}(j \mid \mathcal{N}; x)$ for station $j$, which is given by the fraction of observed pick-ups at $j$.
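Proposition 1 can be checked numerically on a toy instance. The sketch below is only an illustration under stated assumptions: two stations, singleton types only, and made-up pick-up counts; the helper names `first_available` and `log_likelihood` are hypothetical and do not come from the paper. It evaluates (1) directly and confirms that the closed-form pair is not improved by two nearby perturbations.

```python
import math

def first_available(ranking, offer_set):
    """Station chosen by a commuter with this ranking, or 0 if none is offered."""
    for station in ranking:
        if station in offer_set:
            return station
    return 0

def log_likelihood(lam, x, types, offer_sets, counts):
    """Evaluate (1); counts[t][j] is the number of pick-ups at station j in period t."""
    total = 0.0
    for S_t, m_t in zip(offer_sets, counts):
        prob = {}  # prob[j] = P(j | S_t; x), with j = 0 meaning "leave the system"
        for x_k, sigma in zip(x, types):
            j = first_available(sigma, S_t)
            prob[j] = prob.get(j, 0.0) + x_k
        rho = 1.0 - prob.get(0, 0.0)
        total += -lam * rho + sum(m_t.values()) * math.log(lam)
        total += sum(m * math.log(prob[j]) for j, m in m_t.items() if m > 0)
    return total

# Hypothetical data: T = 3 periods, both stations always available.
types = [(1,), (2,)]
offer_sets = [{1, 2}] * 3
counts = [{1: 2, 2: 1}, {1: 1, 2: 0}, {1: 0, 2: 2}]

M = sum(sum(m_t.values()) for m_t in counts)                  # total pick-ups
lam_hat = M / len(counts)                                     # lambda = M / T
x_hat = [sum(m_t[j] for m_t in counts) / M for j in (1, 2)]   # pick-up fractions

best = log_likelihood(lam_hat, x_hat, types, offer_sets, counts)
assert best >= log_likelihood(1.1 * lam_hat, x_hat, types, offer_sets, counts)
assert best >= log_likelihood(lam_hat, [0.6, 0.4], types, offer_sets, counts)
```

The same `log_likelihood` helper also accepts incomplete offer sets, in which case the closed form above no longer applies.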

In general, when the offer sets are incomplete ($S_t \neq \mathcal{N}$), the objective function is non-separable and non-concave, and does not admit a closed-form solution such as the one stated in Proposition 1, due to the existence of a bilinear term $-\lambda\sum_t (1 - y_{S_t,0})$, where $y_{S_t,0} = \sum_{k\in\mathcal{M}_0(S_t)} x_k$. Adding to this difficulty is the high dimension of $x$, which makes it impractical to solve the problem directly using off-the-shelf optimization

solvers. In the next section, we address both of these issues, first by proposing dimensionality

reduction techniques based on network substitution constraints and sparse representations of the

choice probabilities, and then reformulating the objective function into a form that is amenable to

sequential convex optimization methods.

4. Estimation

4.1. Dimensionality Reduction

We propose two approaches for reducing the dimension of x in the MLE problem. Our first approach

is based on a behavioral assumption that commuters may only consider alternatives within a

neighborhood of their first choices. This assumption is grounded on the belief that commuters are

only willing to expend limited amounts of effort to access available alternatives, which, in the case

of a bike-sharing network, usually requires walking from a point of origin to a station to pick up a

bike. By imposing a reasonable distance limit on the accessible alternatives, we effectively constrain

the set of commuter types Σ based on the spatial configuration of the bike-sharing network. Our

second approach does not rely on any behavioral assumption, but is instead based on finding

sparse representations of the choice probabilities P(· | · ;x). Roughly speaking, the goal is to identify a

parsimonious subset of types that could explain the observed data without necessarily enumerating

all of the K =O(N !) types. In what follows, we formalize both approaches and provide real-world

examples to illustrate their effectiveness in reducing the problem dimension.

4.1.1. Network Substitution Constraints. Let us formalize the first approach that con-

strains the set of types based on the spatial configuration of the bike-sharing network. For each

station pair $i \in \mathcal{N}$ and $j \in \mathcal{N}$, we define $d_{ij} \in \mathbb{R}_+$ as the distance between the two stations, satisfying two properties: (i) $d_{ij} = d_{ji}$, and (ii) $d_{ij} = 0$ if and only if $i = j$. For example, $d_{ij}$ may represent the actual walking distance or travel time from station $i$ to $j$. Based on this distance measure, we define the substitution graph as $G = (\mathcal{N}, E)$, where the nodes $\mathcal{N}$ correspond to the set of stations and the edges $E = \{(i,j) : d_{ij} \le D,\ i, j \in \mathcal{N},\ i \neq j\}$ consist of all (unordered) station pairs that are within $D$ units of each other. The graph $G$ imposes constraints on substitution behaviors as

follows: if a commuter's first choice is $i$, then any substitute $j$ must be in the neighborhood of $i$, i.e., $j \in \delta_G(i)$, where $\delta_G(i)$ denotes the neighborhood of node $i$ in $G$. More formally, the set of types constrained by $G$ can be expressed as $\Sigma^G = \{\sigma \in \Sigma : \sigma^{-1}(n) \in \delta_G(\sigma^{-1}(0)),\ \forall n = 1, \dots, |\sigma| - 1\}$, where $|\sigma|$ denotes the length of the type and $\sigma^{-1}$ is the inverse function of $\sigma$ that maps each ordinal number (starting from 0) to the corresponding station that occupies that position in the ranking.

Figure 2 Substitution graphs for Hubway, a Boston-based bike-sharing system, under varying maximum substitution distance D (in km). The nodes correspond to the 171 stations and the edges connect station pairs that are less than D km apart. Panels: (a) D = 0.6, (b) D = 0.8, (c) D = 1.0.

By introducing these network constraints, we reduce the number of types from $|\Sigma| = O(N!)$ to $|\Sigma^G| = \sum_{i\in\mathcal{N}}\sum_{k=0}^{\deg(i)}\binom{\deg(i)}{k}k! = O(N(\Delta(G))!)$, where $\deg(i)$ denotes the degree of node $i$, $\Delta(G)$ is the maximum degree of $G$, and the $k = 0$ term counts the type consisting of the first choice $i$ alone. That is, the dimension now scales linearly instead of factorially with the number of stations $N$, provided that the maximum degree is held constant as the network grows.

In practice, many bike-sharing networks are designed such that stations are spread out to max-

imize service coverage, so the corresponding substitution graph is naturally sparse (i.e., having

low node degrees) under a reasonable distance cutoff D. As a result, imposing network constraints

can often decrease the dimension by orders of magnitude. To illustrate this by example, we show

in Figure 2 the substitution graphs corresponding to Hubway, a bike-sharing network in Boston.

By imposing distance limits of $D = 0.6$ km, $D = 0.8$ km and $D = 1.0$ km, we drastically reduce the number of types from $|\Sigma| \sim 10^{311}$ down to $\sim 10^{4}$, $\sim 10^{7}$ and $\sim 10^{10}$, respectively. However, even after

this reduction, the numbers of parameters under D = 0.8 and D = 1.0 are still too large for full

instances of the MLE problem to be solvable. This motivates our next approach based on sparse

representations.
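To make the effect of the network constraints concrete, the following sketch builds a substitution graph from pairwise distances and enumerates the constrained types directly from the definition of the constrained type set. It is a minimal illustration on a hypothetical three-station line; the distances and the function names `substitution_graph` and `constrained_types` are assumptions made for this example only.

```python
from itertools import combinations, permutations

def substitution_graph(dist, D):
    """Neighborhoods delta_G(i): stations within distance D of i (i itself excluded)."""
    stations = {s for pair in dist for s in pair}
    neighbors = {s: set() for s in stations}
    for (i, j), d in dist.items():
        if i != j and d <= D:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors

def constrained_types(neighbors):
    """All rankings whose substitutes lie in the neighborhood of the first choice."""
    types = []
    for first in sorted(neighbors):
        nbrs = sorted(neighbors[first])
        for k in range(len(nbrs) + 1):          # k = number of substitutes
            for subset in combinations(nbrs, k):
                types.extend((first, *order) for order in permutations(subset))
    return types

# Hypothetical distances (km) for three stations on a line.
dist = {(1, 2): 0.5, (2, 3): 0.5, (1, 3): 1.0}
for D in (0.6, 1.0):
    types = constrained_types(substitution_graph(dist, D))
    print(D, len(types))
```

With the smaller cutoff only adjacent stations are linked, so station 1 never substitutes for station 3; with the larger cutoff the graph is complete and every partial ranking of the three stations is admitted.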

4.1.2. Sparse Representations of Choice Probabilities. Given a substitution graph G

and a series of observed offer sets S1, . . . , ST , we now describe an explicit procedure for generating

types that obey the network constraints induced by G, while exploiting data sparsity to further

reduce the problem dimension. From here on, we will assume that the set of all possible types is

$\Sigma^G \triangleq \{\sigma_1, \dots, \sigma_K\}$.


To motivate our approach, let us denote $P_t \triangleq \{j \in S_t : m_{tj} > 0\}$ as the subset of stations at which at least one pick-up is observed during time period $t$. By representing each choice probability $\mathbb{P}(j \mid S_t; x)$ with an auxiliary variable $y_{S_t,j}$ for all $t$ and all $j \in P_t \cup \{0\}$, we can write the MLE problem in (1) with respect to the constrained set of types $\Sigma^G$ as:

$$\begin{aligned}
\underset{x,\,y,\,\lambda}{\text{maximize}}\quad & -\lambda\sum_{t=1}^{T}(1 - y_{S_t,0}) + M\log\lambda + \sum_{t=1}^{T}\sum_{j\in P_t} m_{tj}\log y_{S_t,j}\\
\text{subject to}\quad & Ax = y,\\
& x \in \Delta_K,\ \lambda \ge 0,
\end{aligned}\tag{3}$$

where the simplex $\Delta_K$ characterizes the set of possible distributions over $\Sigma^G$, $y$ is the vector of choice probabilities and $A \in \{0,1\}^{(\sum_{t=1}^{T}|P_t| + T)\times K}$ is the compatibility matrix that specifies the linear relationships between $y$ and the ranking probabilities $x$, with $y_{S_t,j} = \sum_{k\in\mathcal{M}_j(S_t)} x_k$ holding for each $t$ and $j \in P_t \cup \{0\}$. With this reformulation, our proposed dimensionality reduction technique can be understood as finding an efficient representation of the feasible polytope $\mathcal{Y} = \{y : \exists x \in \Delta_K \text{ s.t. } y = Ax\}$ in (3), which is the convex hull of the columns of $A$. Recall that $A$ comprises $K$ columns, each corresponding to a type in $\Sigma^G$. The key idea is to identify a subset of types $\tilde{\Sigma}^G \subseteq \Sigma^G$ such that the following MLE problem (4) has the same optimal value as the MLE problem (3), in which the types are the full set $\Sigma^G$ that satisfies the network substitution constraints.

$$\begin{aligned}
\underset{x,\,y,\,\lambda}{\text{maximize}}\quad & -\lambda\sum_{t=1}^{T}(1 - y_{S_t,0}) + M\log\lambda + \sum_{t=1}^{T}\sum_{j\in P_t} m_{tj}\log y_{S_t,j}\\
\text{subject to}\quad & A_{\tilde{\Sigma}^G}\,x = y,\\
& x \in \Delta_{|\tilde{\Sigma}^G|},\ \lambda \ge 0.
\end{aligned}\tag{4}$$

Here $A_{\tilde{\Sigma}^G}$ denotes the column submatrix of $A$ that corresponds to the types in $\tilde{\Sigma}^G$. In other words, if such a set $\tilde{\Sigma}^G$ can be identified, then we can effectively discard all types not in $\tilde{\Sigma}^G$ from the original choice model without impacting the maximum likelihood value obtained. Trivially, a representation in which $|\tilde{\Sigma}^G| < K$ always exists in the case where $K > 2^{NT}$, i.e., there are more columns in $A$ (each being a zero-one vector of dimension no more than $NT$) than there are distinct zero-one vectors of that dimension. This occurs, for example, when we have a large network but few observations, in which case at least some of the columns in $A$ are redundant and can be removed.

But such redundancies can also arise in more general settings, which we illustrate by an example as

follows. Suppose we have a substitution graph characterized by nodes $\mathcal{N} = \{1,2,3\}$ and edges $E = \{(1,2), (2,3)\}$, and we observe offer sets $S_1 = \{2,3\}$ and $S_2 = \{1,2\}$, and pick-up stations $P_1 = \{2\}$ and $P_2 = \{1\}$ over two observation periods ($T = 2$). For this network, there are $K = 9$ types that obey the substitution constraints, $\Sigma^G \triangleq \{\sigma_1, \dots, \sigma_9\} = \{(1), (2), (3), (1,2), (2,3), (2,1), (3,2), (2,1,3), (2,3,1)\}$, with corresponding type probabilities $x \triangleq (x_1, \dots, x_9)$. The choice probability vector consists of four components, $y \triangleq (y_{S_1,2}, y_{S_1,0}, y_{S_2,1}, y_{S_2,0})$, which can be written in the form $y = Ax$ as

$$\begin{pmatrix} y_{S_1,2}\\ y_{S_1,0}\\ y_{S_2,1}\\ y_{S_2,0} \end{pmatrix} =
\begin{pmatrix}
0 & 1 & 0 & 1 & 1 & 1 & 0 & 1 & 1\\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
\begin{pmatrix} x_1\\ x_2\\ \vdots\\ x_9 \end{pmatrix}.$$

It is easy to check that there are 5 unique zero-one columns in the compatibility matrix $A$ above, since the columns of $\sigma_2$, $\sigma_5$, $\sigma_6$, $\sigma_8$ and $\sigma_9$ coincide. Without loss of generality, we can choose the set $\tilde{\Sigma}^G = \{\sigma_1, \sigma_2, \sigma_3, \sigma_4, \sigma_7\}$, which corresponds to the types $\{(1), (2), (3), (1,2), (3,2)\}$, such that any feasible $y$ can be represented as a convex combination of these 5 columns. Intuitively, these redundancies exist because we are unable

to distinguish preferences beyond the observable choices in each offer set: for instance, types (2,1)

and (2,1,3) are indistinguishable from each other given the observed pickups over the two time

periods, while they will be distinguishable if we observe another offer set in time period 3, S3 = {3}

and a pickup at station 3, e.g., P3 = {3}, so that only type (2,1,3) is compatible with this new

observation but not type (2,1).
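The compatibility matrix of this example can be rebuilt mechanically from the rankings. The sketch below is a minimal illustration (the function names are hypothetical): each entry records whether a type, faced with a given offer set, picks up at the station in question or leaves (coded as 0), and types are then grouped by identical columns.

```python
def first_available(ranking, offer_set):
    """Station chosen by a commuter with this ranking, or 0 if none is offered."""
    for station in ranking:
        if station in offer_set:
            return station
    return 0

def compatibility_matrix(types, offer_sets, pickup_sets):
    """Rows are ordered as (y_{S_t,j} for j in P_t, then y_{S_t,0}) for t = 1..T."""
    rows = []
    for S_t, P_t in zip(offer_sets, pickup_sets):
        for j in sorted(P_t) + [0]:
            rows.append([int(first_available(sigma, S_t) == j) for sigma in types])
    return rows

def group_by_column(A, types):
    """Map each distinct column of A to the list of types that share it."""
    groups = {}
    for k, sigma in enumerate(types):
        column = tuple(row[k] for row in A)
        groups.setdefault(column, []).append(sigma)
    return groups

types = [(1,), (2,), (3,), (1, 2), (2, 3), (2, 1), (3, 2), (2, 1, 3), (2, 3, 1)]
A = compatibility_matrix(types, [{2, 3}, {1, 2}], [{2}, {1}])
for column, members in group_by_column(A, types).items():
    print(column, members)
```

Types that land in the same group cannot be told apart by the two observation periods, which is the redundancy described above.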

Aside from removing types corresponding to duplicate columns in $A$, we can further reduce the size of $\tilde{\Sigma}^G$ by exploiting a property of the optimal solutions of the MLE problem. We begin with the following lemma, which generalizes this notion of redundancy:

Lemma 1. For any two columns $A_i$ and $A_j$ in the compatibility matrix $A$, if $A_i - A_j \ge 0$, i.e., column $A_i$ is greater than or equal to $A_j$ element-wise, then the type variable $x_j$ corresponding to $A_j$ is redundant, i.e., any optimal solution $x^*$ of the MLE problem (3) satisfies $x_j^* = 0$.

Proof. Suppose we have $x_j^* = \varepsilon > 0$ for some $j$. Let us consider another solution $x^{**}$ where $x_j^{**} = 0$, $x_i^{**} = x_i^* + \varepsilon$ and $x_k^{**} = x_k^*$ for all $k \neq i, j$. We then have $y^{**} = Ax^{**} = \sum_{k\neq i,j} A_k x_k^{**} + A_i x_i^{**} + A_j x_j^{**} \ge \sum_{k\neq i,j} A_k x_k^{*} + A_i x_i^{*} + A_j x_j^{*} = Ax^* = y^*$ because $A_i \ge A_j$. Since the likelihood function is strictly increasing in $y$, we know that $x^{**}$ has a higher objective value than $x^*$, which contradicts $x^*$ being an optimal solution. □

Lemma 1 implies that types whose corresponding columns are element-wise less than or equal to some other column in the compatibility matrix can be effectively dropped without impacting the optimal value of the MLE problem. For instance, $\sigma_5, \sigma_6, \sigma_7, \sigma_8, \sigma_9$ in the above example are all redundant according to this criterion. This leads to a final pick of $\tilde{\Sigma}^G = \{\sigma_1, \sigma_2, \sigma_3, \sigma_4\}$.

Based on this intuition, we propose an algorithm below that generates a parsimonious type set

$\tilde{\Sigma}^G$ given a sequence of observed offer sets $S_1, \dots, S_T$ and pick-up stations $P_1, \dots, P_T$.


4.1.3. Type Enumeration Algorithm. For each offer set $S_t$ and each observed pick-up station $i \in P_t$, the basic operation of our proposed algorithm is to enumerate all types in $\Sigma^G$ of the form $(j_1, \dots, j_k, i)$ and $(j'_1, \dots, j'_k)$, where the first $k$ choices consist of only stations that are unavailable at time $t$. In what follows, we define $\Pi_S(i,j)$ as the set of all types with station $i$ as the first choice and station $j$ as the last choice, while having any permutation of the set $S$ as the intermediate choices. For example, we have $\Pi_{\emptyset}(1,2) = \{(1,2)\}$, $\Pi_{\{3,4\}}(1,2) = \{(1,3,4,2), (1,4,3,2)\}$, $\Pi_{\emptyset}(1,\text{ø}) = \{(1)\}$ and $\Pi_{\{3,4\}}(1,\text{ø}) = \{(1,3,4), (1,4,3)\}$.

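Algorithm 1 Parsimonious type enumeration algorithm

input: Substitution graph $G$, offer sets $\mathcal{S} = \{S_1, S_2, \dots, S_T\}$ and pick-up stations $\mathcal{P} = \{P_1, P_2, \dots, P_T\}$

output: Set of types $\tilde{\Sigma}^G$ satisfying the constraints induced by $G$

 1: procedure SparseEnumerate($G$, $\mathcal{S}$, $\mathcal{P}$)
 2:     $\tilde{\Sigma}^G \leftarrow \emptyset$    ▷ Initialization with empty set.
 3:     for $t = 1, \dots, T$ do
 4:         for each $i \in P_t$ do
 5:             $\tilde{\Sigma}^G \leftarrow \tilde{\Sigma}^G \cup \{(i)\}$    ▷ Add a type comprising only one choice $i$.
 6:             for each $j \in (\mathcal{N} \setminus S_t) \cap \delta_G(i)$ do
 7:                 for each $P \in 2^{(\mathcal{N} \setminus S_t) \cap \delta_G(j)}$ do    ▷ $2^S$ denotes the power set of $S$.
 8:                     $\tilde{\Sigma}^G \leftarrow \tilde{\Sigma}^G \cup \Pi_P(j, i)$    ▷ Add types comprising unavailable stations followed by choice $i$.
 9:         for each $i \in \mathcal{N} \setminus S_t$ do
10:             for each $P \in 2^{(\mathcal{N} \setminus S_t) \cap \delta_G(i)}$ do
11:                 $\tilde{\Sigma}^G \leftarrow \tilde{\Sigma}^G \cup \Pi_P(i, \text{ø})$    ▷ Add types comprising unavailable stations only.
12:     return $\tilde{\Sigma}^G$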
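The enumeration in Algorithm 1 translates almost line by line into code. The following is a sketch rather than the authors' implementation: the graph is assumed to be given as a dictionary of neighbor sets, and the names `sparse_enumerate`, `pi` and `powerset` are hypothetical. Comments refer to the numbered lines of Algorithm 1.

```python
from itertools import chain, combinations, permutations

def powerset(items):
    """All subsets of `items`, as tuples."""
    items = sorted(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

def pi(first, middle, last=None):
    """Types that start with `first`, permute `middle`, and optionally end with `last`."""
    tail = () if last is None else (last,)
    return {(first, *perm, *tail) for perm in permutations(middle)}

def sparse_enumerate(neighbors, stations, offer_sets, pickup_sets):
    types = set()                                      # line 2
    for S_t, P_t in zip(offer_sets, pickup_sets):      # line 3
        unavailable = stations - S_t
        for i in P_t:                                  # line 4
            types.add((i,))                            # line 5
            for j in unavailable & neighbors[i]:       # line 6
                for P in powerset(unavailable & neighbors[j]):   # line 7
                    types |= pi(j, P, i)               # line 8
        for i in unavailable:                          # line 9
            for P in powerset(unavailable & neighbors[i]):       # line 10
                types |= pi(i, P)                      # line 11
    return types                                       # line 12

# The three-station example used earlier: edges (1,2) and (2,3).
neighbors = {1: {2}, 2: {1, 3}, 3: {2}}
result = sparse_enumerate(neighbors, {1, 2, 3}, [{2, 3}, {1, 2}], [{2}, {1}])
print(sorted(result))
```

On this example the procedure returns the four types (1), (2), (3) and (1,2). A length cap could be added by skipping subsets `P` that are too long in lines 7 and 10.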

The proposed algorithm can be easily modified to impose an additional length constraint L on

the types generated, by adding a condition in line 7 of SparseEnumerate that each subset P

satisfies |P | ≤ L− 2 and line 10 that |P | ≤ L− 1. The inclusion of the length parameter L allows

us to tune the complexity of the choice model to prevent overfitting. For instance, in data-scarce settings, it may be favorable to choose a small L so that the model class is less complex and leads

to better generalization performance. In the following theorem, we establish the correctness of

the algorithm in producing feasible types that satisfy the network constraints and are provably

sufficient to represent the polytope Y.

Theorem 1. Given offer sets $S_1, \dots, S_T$ and pick-up stations $P_1, \dots, P_T$, the set $\tilde{\Sigma}^G$ generated by Algorithm 1 satisfies $\tilde{\Sigma}^G \subseteq \Sigma^G$ and the MLE problem (4) has the same optimal value as problem (3).


Another question of interest is how much, if at all, we can benefit from reducing the full set of types $\Sigma^G$ to $\tilde{\Sigma}^G$. We answer this through a simple analysis, followed by a case study with actual data sets. First, observe that each type generated by the algorithm falls into one of two categories: (i) it is of the form $(j_1, \dots, j_k, i)$ for some $i \in P_t$, $j_1 \in (\mathcal{N} \setminus S_t) \cap \delta_G(i)$ and $j_2, \dots, j_k \in (\mathcal{N} \setminus S_t) \cap \delta_G(j_1)$, that is, $j_1$ is an unavailable station in the neighborhood of $i$, whereas $j_2, \dots, j_k$ are unavailable stations in the neighborhood of $j_1$; or (ii) it is of the form $(j_1, \dots, j_k)$ for some $j_1 \in \mathcal{N} \setminus S_t$ and $j_2, \dots, j_k \in (\mathcal{N} \setminus S_t) \cap \delta_G(j_1)$. To derive an upper bound on $k$ for the first category, let us define $Q_t \triangleq \{j \in \mathcal{N} \setminus S_t : \delta_G(j) \cap S_t \neq \emptyset\}$ to be the set of unavailable stations in time period $t$ having at least one available choice in their neighborhood. Then, we must have $k \le U_1 \triangleq \max_{t,\, j \in Q_t} |(\mathcal{N} \setminus S_t) \cap \delta_G(j)| + 1$. Similarly, for the second category, we have $k \le U_2 \triangleq \max_{t,\, j \in \mathcal{N} \setminus S_t} |(\mathcal{N} \setminus S_t) \cap \delta_G(j)| + 1$. We thus obtain an upper bound

$$|\tilde{\Sigma}^G| \le \sum_{k=1}^{U_1}\binom{U_1}{k}k! \times N + \sum_{k=1}^{U_2-1}\binom{U_2-1}{k}k! \times N = O(N\,U!)$$

on the number of possible types that could be generated by the algorithm, where $U = \max\{U_1, U_2 - 1\}$. In particular, if $S_t = \mathcal{N}$ for all $t$ (no service outage is ever observed), then $U = 1$ and we only require $O(N)$ types to parsimoniously account for the observations. More generally, we have a reduction of dimension from $|\Sigma^G| = O(N(\Delta(G))!)$ to $|\tilde{\Sigma}^G| = O(N\,U!)$, where $U \le \Delta(G)$ by definition.

To empirically demonstrate the effectiveness of our approach, we apply the type enumeration

algorithm to real-world data sets for a Boston bike-sharing system, which were collected over a

period of one week in the summer of 2016. In our analysis, we use only the data for the evening

rush hours (5pm-7pm) when the frequency of service outage is at its highest, at 11.5%.² Table 1

shows the dimensionality reduction achieved by applying the type enumeration algorithm under

various choices of distance limit D and type length L. As these results indicate, reductions in variable count by a factor of over 1,000 are possible in practice. The reason is that even

though the overall service outage frequency is high during the rush hours, the number of stations

that are unavailable simultaneously within any neighborhood of the network tends to be fairly low,

thus contributing to a small U . As we demonstrate in Section 7.2 later, by finding sparse sets of

variables in this manner, we are able to solve full instances of the MLE problem efficiently without

resorting to large-scale optimization methods such as column generation or constraint sampling.

² This is defined as the average percentage of time when stations are out of service over the 5pm-7pm period.

Table 1 Dimensionality reduction under varying distance threshold D (in km) and maximum type length L. For each L, the first column is the size of the parsimonious set |Σ̃G| and the second is the size of the full constrained set |ΣG|.

             L = 2            L = 3             L = 4              L = 5                L = ∞
D (km)   |Σ̃G|   |ΣG|     |Σ̃G|   |ΣG|      |Σ̃G|    |ΣG|      |Σ̃G|     |ΣG|      |Σ̃G|      |ΣG|
0.6       326    429      598     925       904    1,705     1,078    2,473      1,078      2,833
0.8       464    661    1,357   2,489     3,837    9,737     9,897   37,337     52,737   3.4×10^6
1.0       612    945    2,381   5,097     9,069   29,139    32,193  168,771   2.4×10^6   3.2×10^9

We end this section with several remarks regarding the proposed method. First, while our discussions thus far have focused on a constrained class of types $\Sigma^G$, the proposed method also applies to the unconstrained setting if we define $G$ as a complete graph ($D = \infty$). Second, the output set of types $\tilde{\Sigma}^G$ is not necessarily the sparsest, i.e., two different types $\sigma, \sigma'$ generated by the algorithm might have identical columns $A_\sigma = A_{\sigma'}$. One way to address this issue is to apply post-processing steps that eliminate any duplicate column in the associated matrix $A_{\tilde{\Sigma}^G}$. This can be done very efficiently due to the fact that $A_{\tilde{\Sigma}^G}$ is a binary matrix. In practice, the rate of redundancy ranges from 0% to 10% in our computational experiments.

4.2. Difference of Concave Formulation

In this subsection, we show that the MLE problem can be reformulated as an optimization problem

involving only x as the decision variables, for which specialized algorithms can be derived. First,

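observe that (1) is equivalent to the following nested optimization problem:

$$\max_{x\in\Delta_K}\ \max_{\lambda\ge 0}\ \left\{-\lambda\sum_{t=1}^{T}\Big(1-\sum_{k\in\mathcal{M}_0(S_t)} x_k\Big) + M\log\lambda + \sum_{t=1}^{T}\sum_{j\in S_t} m_{tj}\log\Big(\sum_{k\in\mathcal{M}_j(S_t)} x_k\Big)\right\}.$$

For any $x$, the inner maximization problem, $\max_{\lambda\ge 0}\big\{-\lambda\sum_{t=1}^{T}\big(1-\sum_{k\in\mathcal{M}_0(S_t)} x_k\big) + M\log\lambda\big\}$, is strictly concave in $\lambda$ and has a unique optimal solution $\lambda(x)$, given by

$$\lambda(x) = \frac{M}{\sum_{t=1}^{T}\Big(1-\sum_{k\in\mathcal{M}_0(S_t)} x_k\Big)}.\tag{5}$$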

By substituting λ(x) into the original objective function in (1), we obtain an equivalent optimization

problem of the form,

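$$\max_{x\in\Delta_K}\ f(x) - g(x),\tag{6}$$

where $f$ and $g$ are two concave functions:

$$f(x) \triangleq \sum_{t=1}^{T}\sum_{j\in S_t} m_{tj}\log\Big(\sum_{k\in\mathcal{M}_j(S_t)} x_k\Big), \qquad g(x) \triangleq M\log\sum_{t=1}^{T}\Big(1-\sum_{k\in\mathcal{M}_0(S_t)} x_k\Big).\tag{7}$$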
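The substitution step can be spelled out as follows. Writing $R(x)$ for the denominator of (5) (a shorthand introduced here only for this derivation), the profiled likelihood and the objective in (6) differ by a term that does not depend on $x$, which is why the two problems share the same maximizers:

```latex
\begin{aligned}
R(x) &\triangleq \sum_{t=1}^{T}\Big(1-\sum_{k\in\mathcal{M}_0(S_t)} x_k\Big),
\qquad \lambda(x) = \frac{M}{R(x)},\\
L(\lambda(x),x) &= -\lambda(x)\,R(x) + M\log\lambda(x) + f(x)\\
&= -M + M\log M - M\log R(x) + f(x)\\
&= f(x) - g(x) + \underbrace{M\log M - M}_{\text{constant in } x}.
\end{aligned}
```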

We have thus reduced the MLE problem into a difference of convex (concave) (DC) program,

which is a convex-constrained optimization problem with an objective that can be written as the

difference of two concave functions. A practical implication of this reformulation is that we can


apply existing DC programming techniques (Le Thi and Dinh 2018) to devise specialized algorithms

for solving the MLE problem. One such technique is the convex-concave procedure (CCCP), an

iterative method for DC programming that solves a sequence of concave programs, each involving

the maximization of a concave surrogate (i.e., a concave lower bound) of the difference of concave

objective function, $f(x) - g(x)$. Starting from an initial solution $x^{(0)}$ in a convex feasible set $\mathcal{X}$, CCCP generates a sequence of solution iterates as follows:

$$x^{(n)} \in \arg\max_{x\in\mathcal{X}}\ f(x) - \bar{g}(x; x^{(n-1)}) \qquad \forall n = 1, 2, 3, \dots,\tag{8}$$

where $\bar{g}(x; x^{(n-1)}) \triangleq g(x^{(n-1)}) + \nabla g(x^{(n-1)})^{\top}(x - x^{(n-1)})$ is a linearization of the concave function $g$ around the previous solution iterate $x^{(n-1)}$. It is easy to check that the function $f(x) - \bar{g}(x; x^{(n-1)})$ is concave and forms a lower bound of the difference of concave objective, $f(x) - g(x) \ge f(x) - \bar{g}(x; x^{(n-1)})$, with equality holding at $x = x^{(n-1)}$. Disregarding the constant term in $\bar{g}$, we can simplify (8) into

$$x^{(n)} \in \arg\max_{x\in\mathcal{X}}\ f(x) - \nabla g(x^{(n-1)})^{\top}x \qquad \forall n = 1, 2, 3, \dots\tag{9}$$

In what follows, we will apply the CCCP method to the reformulated MLE problem in (6).

As we demonstrate below, one particular advantage of decomposing the problem into a sequence

of convex programs is that each of them can be solved efficiently by the Frank-Wolfe method with closed-form iterates. Remarkably, the resulting CCCP algorithm can also be interpreted as an application of block coordinate ascent to the original log-likelihood function in (1) and satisfies global convergence. These properties do not hold for DC programs in general.

4.2.1. Convex-Concave MLE Procedure. By substituting $f$ and $g$ from (7) into the CCCP algorithm in (9), we obtain the following procedure for solving the reformulated MLE problem: starting from an arbitrary point $x^{(0)} \in \Delta_K$, solve for

$$x^{(n)} \in \arg\max_{x\in\Delta_K}\ \sum_{t=1}^{T}\sum_{j\in S_t} m_{tj}\log\Big(\sum_{k\in\mathcal{M}_j(S_t)} x_k\Big) + \frac{M\sum_{t=1}^{T}\Big(\sum_{k\in\mathcal{M}_0(S_t)} x_k\Big)}{\sum_{t=1}^{T}\Big(1-\sum_{k\in\mathcal{M}_0(S_t)} x_k^{(n-1)}\Big)} \qquad \forall n = 1, 2, \dots\tag{10}$$

Intuitively, we can interpret the constant $M \big/ \sum_{t=1}^{T}\big(1-\sum_{k\in\mathcal{M}_0(S_t)} x_k^{(n-1)}\big) = \lambda(x^{(n-1)})$ as an estimate of the arrival rate $\lambda$ based on the previous solution iterate, as was defined in (5). This motivates an alternative derivation of the same algorithm using block coordinate ascent below.

Proposition 2. Let $\{(\lambda^{(n)}, x^{(n)})\}_{n\ge 1}$ be a sequence generated by the following block coordinate ascent procedure based on some initialization point $x^{(0)} \in \Delta_K$,

$$\lambda^{(n)} \in \arg\max_{\lambda\ge 0}\ L(\lambda, x^{(n-1)}),\tag{11}$$

$$x^{(n)} \in \arg\max_{x\in\Delta_K}\ L(\lambda^{(n)}, x).\tag{12}$$

Then, the sequence $\{x^{(n)}\}_{n\ge 1}$ is also optimal for the procedure in (10), provided that it is based on the same initialization point $x^{(0)}$.

We remark that the equivalence between CCCP and block coordinate ascent in the sense defined in Proposition 2 does not hold for general DC programs, but is applicable here only because of the special structure of the likelihood function.

Note that problem (10) can be solved by the Frank-Wolfe (FW) algorithm (Frank and Wolfe 1956) with simple updates as follows. At each iteration, FW solves the linear program $\max_{x\in\Delta_K} c^{\top}x$, where $c$ is the gradient of the function $f(x) - \nabla g(x^{(n-1)})^{\top}x$ evaluated at the current iterate. It is easy to check that an optimal solution of this linear program is a standard unit vector $e_j$ for some $j \in \arg\max_i c_i$. Thus each iteration of FW simply reduces to a find-max operation. FW has gained

widespread interest recently as an efficient approach in solving large-scale convex optimization

problems (Clarkson 2010, Jaggi 2011, Freund and Grigas 2016). Compared to projection-based

first-order methods, FW fits naturally to our problem for a number of reasons: (i) the linear

program sub-problem under simplex constraint can be solved in O(K) number of operations by

simply finding the maximum component of the gradient; (ii) each FW update naturally inherits

sparsity properties as the algorithm is adjusting the current solution towards an extreme point of

the simplex at each iteration: if we initialize with a vector x with m non-zero entries, the solution

that the method produces at iteration n has at most m+n non-zero entries.

We summarize the complete solution approach for the MLE problem in Algorithm 2 below. To

obtain sparser solutions, one can initialize the solution procedure with a vector that has positive entries only on a small subset of rankings while still having a finite likelihood value. For example, a uniform distribution over all singleton rankings $\{(i)\}_{i\in\mathcal{N}}$ can be a natural initial solution.

Algorithm 2 MLE Solution Procedure

1: Initialize $x$ for CCCP iterate
2: Initialize $x'$ for Frank-Wolfe iterate
3: while no convergence on $f(x) - g(x)$ do
4:     while no convergence on $f(x') - \nabla g(x)^{\top}x'$ do
5:         Compute $j \in \arg\max_{i\in\{1,\dots,K\}} (\nabla f(x') - \nabla g(x))_i$    ▷ solving the FW subproblem
6:         $\alpha^* \in \arg\max_{\alpha\in[0,1]} f(x' + \alpha(e_j - x')) - \nabla g(x)^{\top}(x' + \alpha(e_j - x'))$    ▷ line search the step size
7:         $x' \leftarrow x' + \alpha^*(e_j - x')$    ▷ FW update
8:     $x \leftarrow x'$    ▷ CCCP update
9: return $x$, $\lambda = M \big/ \sum_{t=1}^{T}\big(1-\sum_{k\in\mathcal{M}_0(S_t)} x_k\big)$
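The structure of Algorithm 2 can be sketched in plain Python as below. This is an illustrative sketch under several assumptions, not the authors' implementation: the data are passed as index sets of compatible types (the names `pick_rows`, `lost_rows` and `solve_mle` are hypothetical), the loops use fixed iteration caps, the inner stopping rule is the Frank-Wolfe gap, and the line search of line 6 is done by bisection on the derivative of the concave surrogate along the search direction.

```python
def solve_mle(pick_rows, lost_rows, K, outer=30, inner=200, tol=1e-8):
    """CCCP outer loop with a Frank-Wolfe inner loop (pure-Python sketch).

    pick_rows: list of (m, idx), with m a pick-up count m_tj > 0 and idx the
               set of type indices in M_j(S_t).
    lost_rows: one set per period t holding the type indices in M_0(S_t).
    """
    M = sum(m for m, _ in pick_rows)
    T = len(lost_rows)
    n_lost = [sum(k in idx for idx in lost_rows) for k in range(K)]
    x = [1.0 / K] * K  # uniform start keeps every logarithm finite

    for _outer in range(outer):
        served = T - sum(n * xk for n, xk in zip(n_lost, x))
        lam = M / served  # eq. (5) at the previous CCCP iterate
        for _inner in range(inner):
            y = [sum(x[k] for k in idx) for _, idx in pick_rows]
            grad = [lam * n_lost[k]
                    + sum(m / yv for (m, idx), yv in zip(pick_rows, y) if k in idx)
                    for k in range(K)]
            j = max(range(K), key=lambda k: grad[k])       # FW subproblem
            gap = grad[j] - sum(g * xk for g, xk in zip(grad, x))
            if gap <= tol:
                break
            a = [1.0 if j in idx else 0.0 for _, idx in pick_rows]
            lin = lam * (n_lost[j] - sum(n * xk for n, xk in zip(n_lost, x)))

            def slope(alpha):
                """Derivative of the surrogate along x + alpha * (e_j - x)."""
                return lin + sum(m * (ai - yv) / ((1 - alpha) * yv + alpha * ai)
                                 for (m, _), ai, yv in zip(pick_rows, a, y))

            lo, hi = 0.0, 1.0
            for _ in range(50):        # bisection: slope is decreasing in alpha
                mid = 0.5 * (lo + hi)
                if slope(mid) > 0:
                    lo = mid
                else:
                    hi = mid
            x = [(1 - lo) * xk for xk in x]                # FW update
            x[j] += lo
    served = T - sum(n * xk for n, xk in zip(n_lost, x))
    return x, M / served

# Hypothetical data: types (1), (2), (1,2) indexed 0, 1, 2. Ten periods offer
# both stations (12 pick-ups at station 1, 8 at station 2); ten periods offer
# station 2 only (14 pick-ups at station 2).
pick_rows = [(12, {0, 2}), (8, {1}), (14, {1, 2})]
lost_rows = [set()] * 10 + [{0}] * 10
x, lam = solve_mle(pick_rows, lost_rows, K=3)
print([round(v, 3) for v in x], round(lam, 3))
```

The counts in this toy instance were chosen so that an arrival rate of 2 with type probabilities (0.3, 0.4, 0.3) reproduces the three observed pick-up rates exactly, so the iterates should approach these values.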


4.2.2. Convergence Properties. We show that the proposed convex-concave procedure for

solving the MLE problem satisfies the global convergence property. That is, given an arbitrary

feasible starting point x(0), the sequence of solutions generated by (10) converges in objective value

to that of a stationary point of the likelihood function.

Theorem 2. For any initialization point $x^{(0)} \in \Delta_K$, the sequence of solutions $\{x^{(n)}\}_{n\ge 1}$ generated by the procedure in (10) satisfies

$$\lim_{n\to\infty}\ f(x^{(n)}) - g(x^{(n)}) = f(x^*) - g(x^*),$$

where $x^*$ is a stationary point of the likelihood function in (6). Furthermore, any limit point of the sequence $\{x^{(n)}\}_{n\ge 1}$ is a stationary point of the likelihood function.

This result follows directly from Theorem 4 in Sriperumbudur (2009). CCCP also enjoys linear

convergence for general DC programs (Le Thi and Dinh 2018). We find that in practice, this

procedure converges very quickly in an extensive range of numerical experiments. These findings

will be discussed in Section 7.2.

5. Statistical Properties

In this section, we analyze the asymptotic properties of the MLE estimators $(\hat{\lambda}, \hat{x})$ as the observation period $T$ tends to infinity. For our analysis, we will assume that the demand model described in Section 3 is correct and $(\lambda^*, x^*)$ are the ground truth parameters. As we discussed earlier in Section 3, it is generally not possible to uniquely identify $x^*$ based on only the observed choices and offer sets due to over-parametrization of the model. However, we show that it is possible to consistently estimate $\lambda^*$ and the ground truth choice probabilities $y^*_{S,j} \triangleq \mathbb{P}(j \mid S; x^*)$ under certain observability conditions. The key step of our consistency results is to show that the demand model is partially identifiable asymptotically, meaning that $(\lambda^*, y^*)$ is the unique maximizer of the expected likelihood function, although it is non-concave and non-separable. We will assume, without loss of generality, that $y^*_{S,j} > 0$ for all $j \in S$ and $S \in 2^{\mathcal{N}}$ (we can simply drop station $j$ from $S$ if $y^*_{S,j} = 0$). By defining $\hat{y}_{S,j} \triangleq \mathbb{P}(j \mid S; \hat{x})$ to be the choice probability estimates based on $\hat{x}$, we then formally state the following consistency result.

Theorem 3. Assume that every offer set $S \in 2^{\mathcal{N}}$ is observed infinitely often; that is, every $S$ satisfies, for some $\alpha_S > 0$, $\lim_{T\to\infty} (1/T)\sum_{t=1}^{T} \mathbf{1}(S_t = S) = \alpha_S$. Then, with probability one, the MLE estimators satisfy $\hat{\lambda} \to \lambda^*$ and $\hat{y}_{S,j} \to y^*_{S,j}$ for all $S \in 2^{\mathcal{N}}$ and $j \in S$, as $T \to \infty$.

As a corollary to the above theorem, we can also consistently estimate the primary demand for individual choices. That is, we have $\hat{\lambda}\hat{y}_{\mathcal{N},j} \to \lambda^* y^*_{\mathcal{N},j}$ for every choice $j \in \mathcal{N}$. Note that the theorem


extends the consistency results of van Ryzin and Vulcano (2014), which is based on Bernoulli

approximation of the Poisson arrival process and specialized to the case where the likelihood

function is complete (i.e., time periods with no arrival and arrival with no pickups due to a stockout

are distinguishable) and strictly concave. We also remark that the results are general and do not

rely on any particular network G.

6. Extension to Origin-Destination Demand Model

We have thus far developed a model for estimating the primary pick-up demand only. However,

this model does not account for the two-sided dynamics of a bike-sharing system, wherein a pick-up

at one station is always followed by a drop-off at another, hence it is not directly applicable if

we wish to analyze the origin-destination (OD) trip patterns in the system. Such analysis can be

useful, for instance, in planning operations that rely on estimated OD flow rates to decide how to

best allocate bikes to meet the service demand (e.g., Shu et al. (2013)). In what follows, we discuss

an extension of our model that accounts for both the origin (pick-up) and destination (drop-off)

demand.

6.1. Model

The main distinguishing feature of the extended OD model is that each commuter now chooses

a trip from the set $\mathcal{N}_{od} \triangleq \mathcal{N} \times \mathcal{N}$, which consists of all origin-destination station pairs such that

each choice (i, j) ∈ Nod denotes a trip that begins with pick-up at station i and ends with drop-

off at station j. More specifically, we assume as before in Section 3 that commuters arrive to

the system according to a Poisson process with rate λ, but besides having ranked preferences

over the pick-up choices, each commuter also has separately ranked preferences for the drop-off

destinations. We characterize these preferences with a paired type $\sigma = (\sigma^o, \sigma^d)$, where $\sigma^o$ and $\sigma^d$ specify the commuter's rankings over pick-up and drop-off choices, respectively. The paired type fully determines the commuter's behaviors as follows: upon arrival, the commuter first chooses an origin station according to $\sigma^o$. If none of the stations in $\sigma^o$ is available, then the commuter leaves the system; otherwise, the commuter initiates a trip and chooses an available destination according to $\sigma^d$ when dropping off later. Effectively, we are assuming that a commuter's decision of whether

to stay in the system or not depends only on the availability of the origin stations.

Formally, we define the set of all paired types, $\Sigma_{od} = \Sigma_o \times \Sigma_d$, as the Cartesian product of the origin type set $\Sigma_o$ and the destination type set $\Sigma_d$. We parametrize the probability distribution over the paired types $\Sigma_{od}$ with a vector $x$, where each doubly indexed element $x_{kl}$ is the probability assigned to the paired type $(\sigma^o_k, \sigma^d_l) \in \Sigma_{od}$. Based on this probability distribution, we let $\mathbb{P}(i, j \mid S, Q; x)$ be the probability of observing trip $(i, j)$ given the pick-up offer set $S$ (i.e., the set of non-empty stations at pick-up) and the drop-off offer set $Q$ (i.e., the set of non-full stations at drop-off), which is defined as

$$\mathbb{P}(i, j \mid S, Q; x) \triangleq \begin{cases} \sum_{k\in\mathcal{M}^o_i(S)}\sum_{l\in\mathcal{M}^d_j(Q)} x_{kl}, & \text{if } i \in S,\ j \in Q,\\ 0, & \text{otherwise,} \end{cases}$$

where $\mathcal{M}^o_i(S)$ (respectively, $\mathcal{M}^d_j(Q)$) is the set of types in $\Sigma_o$ ($\Sigma_d$) that are compatible with observed pick-up (drop-off) at station $i$ ($j$) under the offer set $S$ ($Q$). Given the arrival rate $\lambda$ and probability vector $x$, we can recover the primary OD demand for trip $(i, j)$ as $\lambda\,\mathbb{P}(i, j \mid \mathcal{N}, \mathcal{N}; x)$, while the primary pick-up and drop-off demand at each station $i$ correspond to $\lambda\sum_{j=1}^{N}\mathbb{P}(i, j \mid \mathcal{N}, \mathcal{N}; x)$ and $\lambda\sum_{j=1}^{N}\mathbb{P}(j, i \mid \mathcal{N}, \mathcal{N}; x)$, respectively.
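The paired-type probability is a double sum over compatible origin and destination rankings, which the following sketch makes explicit. It is a minimal illustration: the distribution `x`, the arrival rate and the function names are hypothetical values chosen for the example.

```python
def first_available(ranking, offer_set):
    """Station chosen under this ranking, or 0 if no ranked station is offered."""
    for station in ranking:
        if station in offer_set:
            return station
    return 0

def od_probability(i, j, S, Q, x):
    """P(i, j | S, Q; x), where x maps paired types (sigma_o, sigma_d) to probabilities."""
    if i not in S or j not in Q:
        return 0.0
    return sum(p for (sigma_o, sigma_d), p in x.items()
               if first_available(sigma_o, S) == i
               and first_available(sigma_d, Q) == j)

# Hypothetical distribution over three paired types on stations {1, 2}.
x = {((1, 2), (2,)): 0.5, ((1,), (2, 1)): 0.3, ((2,), (1,)): 0.2}
stations = {1, 2}
lam = 10.0  # assumed arrival rate

# Primary OD demand for trip (1, 2), and primary pick-up demand at station 1.
primary_od = lam * od_probability(1, 2, stations, stations, x)
primary_pickup = lam * sum(od_probability(1, j, stations, stations, x) for j in stations)
print(primary_od, primary_pickup)

# With station 1 empty, the first paired type substitutes to station 2.
print(od_probability(2, 2, {2}, stations, x))
```

Under full availability both the first and the second paired type start at station 1 and end at station 2, so together they determine the primary demand for that trip.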

6.2. Likelihood Function

Based on the extended choice model, we can now rewrite the original likelihood function in (1) in

terms of the observed trips. Suppose that over time periods $t = 1, \dots, T$, we observe a sequence of $M$ trips, $(i_1, j_1), \dots, (i_M, j_M)$, where each trip $(i_k, j_k)$ begins at time period $t^o_k$ and ends at time period $t^d_k$. Let us define $S_t$ and $Q_t$ as the pick-up and drop-off offer sets at time period $t$, respectively, and let $\rho(S_t; x) \triangleq \sum_{i\in S_t}\sum_{j=1}^{N}\mathbb{P}(i, j \mid S_t, \mathcal{N}; x)$ be the probability that, given offer set $S_t$, a commuter will stay in the system and begin a trip upon arrival. Again, the second summation is over the set of all stations $\mathcal{N}$ because we assume that a commuter's decision of using the system or not depends only on the availability of the origin stations. The OD log-likelihood function is given by

$$L_{od}(\lambda, x) = -\lambda\sum_{t=1}^{T}\rho(S_t; x) + M\log\lambda + \sum_{k=1}^{M}\log\mathbb{P}(i_k, j_k \mid S_{t^o_k}, Q_{t^d_k}; x).\tag{13}$$

Since $\rho(S_t; x)$ and $\mathbb{P}(i_k, j_k \mid S_{t^o_k}, Q_{t^d_k}; x)$ are both linear in $x$, the MLE problem is reduced to an optimization problem that is solvable with the same methods described in Section 4. One major computational issue, however, is that the dimension of $x$ is quadratically larger than in the original model due to the fact that $|\Sigma_{od}| = |\Sigma_o||\Sigma_d| = O((N!)^2)$. To address this complication, we can adapt

the dimensionality reduction techniques from Section 4.1 as follows.

6.3. Dimensionality Reduction

Given a substitution graph $G$, let us denote $\Sigma^G_o$ and $\Sigma^G_d$ as the constrained sets of origin and destination types, respectively, that are induced by $G$. We first apply Algorithm 1 to generate a subset of types $\bar{\Sigma}^G_o \subseteq \Sigma^G_o$ that parsimoniously accounts for the observed pick-ups given the offer sets $S_1, \ldots, S_T$. Separately, we apply the same algorithm to generate another set of types $\bar{\Sigma}^G_d \subseteq \Sigma^G_d$ that explains the observed drop-offs given offer sets $Q_1, \ldots, Q_T$. We then apply Algorithm 3 below to generate a parsimonious set of paired rankings, $\bar{\Sigma}^G_{od}$, by taking the sets $\bar{\Sigma}^G_o$ and $\bar{\Sigma}^G_d$ as inputs. To facilitate our presentation below, we define $M^o_i(S)$ (respectively, $M^d_j(Q)$) as the set of types in


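Algorithm 3 Parsimonious paired type enumeration algorithm

input: pick-up offer sets $\mathcal{S} = \{S_1, S_2, \ldots, S_T\}$, drop-off offer sets $\mathcal{Q} = \{Q_1, Q_2, \ldots, Q_T\}$, trips $\mathcal{T} = \{(i_1, j_1), (i_2, j_2), \ldots, (i_M, j_M)\}$, parsimonious sets of origin types $\bar{\Sigma}^G_o$ and destination types $\bar{\Sigma}^G_d$

output: parsimonious set of paired types $\bar{\Sigma}^G_{od}$

1: procedure SparseEnumeratePaired($G, \mathcal{S}, \mathcal{Q}, \mathcal{T}, \bar{\Sigma}^G_o, \bar{\Sigma}^G_d$)
2:     $\bar{\Sigma}^G_{od} \leftarrow \emptyset$    ▷ Initialization with empty set.
3:     for $k = 1, \ldots, M$ do
4:         for each $(\sigma^o, \sigma^d) \in M^o_{i_k}(S_{t^o_k}) \times M^d_{j_k}(Q_{t^d_k})$ do
5:             $\bar{\Sigma}^G_{od} \leftarrow \bar{\Sigma}^G_{od} \cup \{(\sigma^o, \sigma^d)\}$    ▷ Add a paired type compatible with trip $(i_k, j_k)$.
6:     for $t = 1, \ldots, T$ do
7:         for each $(\sigma^o, \sigma^d) \in M^o_0(S_t) \times \bar{\Sigma}^G_d$ do
8:             $\bar{\Sigma}^G_{od} \leftarrow \bar{\Sigma}^G_{od} \cup \{(\sigma^o, \sigma^d)\}$    ▷ Add a paired type compatible with not using the service at time $t$.
9:     return $\bar{\Sigma}^G_{od}$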

the parsimonious set $\bar{\Sigma}^G_o$ ($\bar{\Sigma}^G_d$) that are compatible with an observed pick-up (drop-off) at station $i$ ($j$) under the offer set $S$ ($Q$).
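The enumeration in Algorithm 3 can be sketched in a few lines of Python. The sketch rests on assumptions that are not spelled out in this section: a type is represented as a tuple of station indices in decreasing order of preference, a type is compatible with a choice when that choice is its highest-ranked station in the offer set, choice 0 stands for not using the service, and the second loop pairs origin types compatible with choice 0 at time $t$ with every destination type. All function names are hypothetical.

```python
from itertools import product

def first_available(sigma, offer_set):
    """Highest-ranked station of type sigma that is in offer_set, or 0 (leave)."""
    for station in sigma:
        if station in offer_set:
            return station
    return 0

def compatible(types, choice, offer_set):
    """Types whose best available option under offer_set equals choice."""
    return [s for s in types if first_available(s, offer_set) == choice]

def sparse_enumerate_paired(S, Q, trips, origin_types, dest_types):
    """S[t], Q[t]: pick-up and drop-off offer sets at period t.
    trips: list of (i, j, t_o, t_d), a trip from i to j starting at t_o, ending at t_d.
    """
    paired = set()
    # Paired types compatible with each observed trip.
    for i, j, t_o, t_d in trips:
        paired.update(product(compatible(origin_types, i, S[t_o]),
                              compatible(dest_types, j, Q[t_d])))
    # Paired types compatible with not using the service at each period.
    for offer_set in S:
        paired.update(product(compatible(origin_types, 0, offer_set), dest_types))
    return paired

# Hypothetical toy instance with two stations and two periods.
S = [{1, 2}, {2}]
Q = [{1, 2}, {1, 2}]
origin_types = [(1,), (1, 2), (2,)]
dest_types = [(2,), (1,)]
trips = [(2, 1, 1, 1)]  # pick-up at station 2 in period 1, drop-off at station 1
paired = sparse_enumerate_paired(S, Q, trips, origin_types, dest_types)
```

In this toy instance the full product of origin and destination types has six elements, while the enumeration keeps four of them.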

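Building on Theorem 1, it can be shown that the set $\bar{\Sigma}^G_{od}$ is sufficient to account for all the observations; that is, we can fix $x_{kl} = 0$ for all $(\sigma^o_k, \sigma^d_l) \in (\bar{\Sigma}^G_o \times \bar{\Sigma}^G_d) \setminus \bar{\Sigma}^G_{od}$ without incurring any loss in optimality when solving the MLE problem. We formalize this in the following corollary.

Corollary 1. The MLE problems $\max_{\lambda \geq 0,\, x \in \Delta^K} L_{od}(\lambda, x)$ with types $\bar{\Sigma}^G_o \times \bar{\Sigma}^G_d$ and $\bar{\Sigma}^G_{od} =$ SparseEnumeratePaired($G, \mathcal{S}, \mathcal{Q}, \mathcal{T}, \bar{\Sigma}^G_o, \bar{\Sigma}^G_d$) of Algorithm 3 have the same optimal value.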

As a distinct feature of our OD estimation problem, we can further show that, for destination rankings, any type that is nested in some other type is redundant. We say that a type $\sigma$ is nested in another type $\sigma'$, written $\sigma \subset \sigma'$, if $|\sigma| < |\sigma'|$ and $\sigma^{-1}(i) = \sigma'^{-1}(i)$ for all $i \in \{0, \ldots, |\sigma| - 1\}$. The following corollary follows mainly from the assumption that the availability of destination stations does not affect commuters' decisions of whether to use the system.

Corollary 2. Define $\tilde{\Sigma}^G_d = \{\sigma \in \bar{\Sigma}^G_d : \exists\, \sigma' \in \bar{\Sigma}^G_d \text{ s.t. } \sigma \subset \sigma'\}$ as the set of types that are nested in some other types. Then the MLE problems $\max_{\lambda \geq 0,\, x \in \Delta^K} L_{od}(\lambda, x)$ with types $\bar{\Sigma}^G_o \times \bar{\Sigma}^G_d$ and $\bar{\Sigma}^G_{od} =$ SparseEnumeratePaired($G, \mathcal{S}, \mathcal{Q}, \mathcal{T}, \bar{\Sigma}^G_o, \bar{\Sigma}^G_d \setminus \tilde{\Sigma}^G_d$) of Algorithm 3 have the same optimal value.
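The nesting relation used in Corollary 2 amounts to a strict prefix test when a type is stored as a tuple of stations in preference order. The sketch below, with hypothetical function names and under that representation assumption, removes the nested destination types from a list.

```python
def is_nested(sigma, sigma_prime):
    """True if sigma is strictly shorter than sigma_prime and agrees with it
    on every one of its own positions (sigma is a strict prefix)."""
    return (len(sigma) < len(sigma_prime)
            and tuple(sigma_prime[:len(sigma)]) == tuple(sigma))

def drop_nested(types):
    """Keep only the types that are not nested in any other type of the list."""
    types = [tuple(s) for s in types]
    return [s for s in types if not any(is_nested(s, other) for other in types)]

# Hypothetical destination types: (1,) is a prefix of (1, 2) and is dropped.
kept = drop_nested([(1,), (1, 2), (2, 3), (3,)])
```

The pairwise comparison is quadratic in the number of types, which is acceptable for a sketch; a prefix tree would be a natural alternative for large type sets.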

7. Simulation Studies

In this section, we present our results for an extensive range of numerical experiments based on

simulations of a bike-sharing network in Boston. These experiments are designed with two goals

in mind. First, we evaluate the accuracy of demand estimation under a range of sample sizes and

service outage scenarios. Second, we quantify the operational impact of the demand estimation


procedure when it is used as the basis for allocating bikes in a system. Specifically, we use the

estimated demand as inputs to an optimization model for allocating bikes with the objective of

maximizing ridership over a planning horizon. We then evaluate the actual ridership with ground-

truth demand profile as a measure of how well the estimation method helps inform operational

decision-making. Throughout this section, we will refer to our proposed rank-based estimation

method using the abbreviation RANK. For comparison, we run the same set of experiments with

the following benchmark methods:

1. Sample Average (SA). We estimate the OD demand $\lambda_{ij}$ for each station pair $(i, j)$ by averaging the number of observed trips over all time periods, $\lambda_{ij} := \frac{1}{T} \sum_{t=1}^{T} m_{tij}$, where $m_{tij}$ denotes the number of trips from station $i$ to $j$ that begin at time step $t$. This corresponds to the MLE estimate under the assumption that stations are never out of service. Based on the estimated OD demand, we obtain the pick-up and drop-off primary demand for each station $i$ as $\lambda^o_i = \sum_{j=1}^{N} \lambda_{ij}$ and $\lambda^d_i = \sum_{j=1}^{N} \lambda_{ji}$, respectively.

2. Filtered Sample Average (FSA). To account for service availability, we modify the SA estimator by considering only the time periods when both origin station $i$ and destination station $j$ are in service, denoted as $T_i \triangleq \{t : i \in S_t\}$, so that $\lambda_{ij} := \frac{1}{|T_i|} \sum_{t=1}^{T} m_{tij}$. This is similar to the approach taken by O’Mahony and Shmoys (2015) to estimate the pick-up and drop-off demand as inputs for bike rebalancing.

3. OD Multinomial Logit Model (MNL). For this model, we define the choice probability of trip $(i, j)$ as

$$P(i, j \mid S, Q; u, v) = \frac{u_i}{\sum_{k \in S} u_k + u_0} \cdot \frac{v_{ij}}{\sum_{k \in Q} v_{ik}},$$

where $u$ and $v$ are nonnegative utilities associated with pick-up and drop-off choices, respectively. Specifically, the first term, $u_i / (\sum_{k \in S} u_k + u_0)$, can be understood as the probability of observing a pick-up at station $i$ given offer sets $(S, Q)$, where $u_1, \ldots, u_N$ are the utilities assigned to the stations and $u_0$ is the utility associated with leaving. The second term is the conditional probability of observing a drop-off at station $j$ given that the trip originated from station $i$, where the utilities assigned to the drop-off destinations are $v_{i1}, \ldots, v_{iN}$. Since we assume that every bike that is picked up will be dropped off, there is zero utility associated with the option to abstain from returning the bike.
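The three benchmarks above can be made concrete with a short NumPy sketch. The array layout (`trips[t, i, j]` for $m_{tij}$ and a boolean `origin_available[t, i]` for $i \in S_t$), the zero-based station indices and the function names are assumptions made for illustration; the FSA function follows the formula $\lambda_{ij} = \frac{1}{|T_i|}\sum_t m_{tij}$ as stated.

```python
import numpy as np

def sample_average(trips):
    """SA: lambda_ij = (1/T) * sum_t m_tij, with trips of shape (T, N, N)."""
    return trips.mean(axis=0)

def filtered_sample_average(trips, origin_available):
    """FSA: lambda_ij = (1/|T_i|) * sum_t m_tij, where T_i = {t : i in S_t}."""
    periods_in_service = origin_available.sum(axis=0)      # |T_i| per station
    totals = trips.sum(axis=0)                             # sum_t m_tij
    safe = np.maximum(periods_in_service, 1)[:, None]      # avoid division by zero
    return np.where(periods_in_service[:, None] > 0, totals / safe, 0.0)

def mnl_probability(i, j, S, Q, u, u0, v):
    """MNL: u_i / (sum_{k in S} u_k + u0) * v_ij / sum_{k in Q} v_ik."""
    pick = u[i] / (sum(u[k] for k in S) + u0)
    drop = v[i][j] / sum(v[i][k] for k in Q)
    return pick * drop

# Hypothetical toy data: 4 periods, 2 stations; station 0 is in service
# only during the first two periods, when 2 trips per period go from 0 to 1.
trips = np.zeros((4, 2, 2))
trips[0, 0, 1] = 2
trips[1, 0, 1] = 2
origin_available = np.array([[True, True], [True, True],
                             [False, True], [False, True]])
sa = sample_average(trips)
fsa = filtered_sample_average(trips, origin_available)
pickup_demand_fsa = fsa.sum(axis=1)   # lambda^o_i = sum_j lambda_ij
p = mnl_probability(0, 1, S={0, 1}, Q={0, 1}, u=[1.0, 3.0], u0=1.0,
                    v=[[1.0, 1.0], [1.0, 1.0]])
```

In the toy data, SA spreads the four trips over all four periods, whereas FSA divides by the two periods in which the origin was in service, which illustrates how SA underestimates demand at stations with outages.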

7.1. Setup

7.1.1. Simulation Platform. To generate synthetic data for our experiments, we built a

simulation platform based on Hubway, a bike-sharing network based in Boston that comprises 171

stations based on its summer 2016 data. As a preprocessing step, we first estimate the primary


pick-up demand at individual stations by applying the FSA method on a real Hubway data set

collected over summer 2016, wherein the timestamps are discretized into units of one minute each.

We then estimate an OD matrix from the data based on the empirical trip frequency of each origin-

destination pair, which determines the probability that a commuter decides to drop off a bike at

a station given the pick-up origin. These estimates will serve as the ground truth parameters on

which the simulation is based, as we detail below.

Arrival Process. Let us denote {λ∗j}j∈N as the ground truth arrival rates (per minute) at individ-

ual stations obtained in the preprocessing step above. We simulate arrivals to the system minute

by minute according to a Poisson process, such that the arrival rate at station j is equal to λ∗j . For

each arriving commuter, we choose the intended drop-off destination probabilistically according to

the ground truth OD matrix. Once a bike is picked up by a commuter, it will only be dropped off

after the duration of the trip time, which is determined by the origin-destination distance divided

by a fixed biking speed of 0.258km per minute.
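A minimal sketch of the arrival process described above is given below. It draws per-minute Poisson arrival counts at each station, samples an intended destination from the OD matrix, and sets the trip duration to distance divided by the fixed speed of 0.258 km per minute. Station availability and substitution are deliberately ignored here, and the function name, argument layout and toy inputs are assumptions made for illustration.

```python
import numpy as np

BIKING_SPEED_KM_PER_MIN = 0.258  # fixed biking speed stated in the text

def simulate_arrivals(rates, od_matrix, distances_km, num_minutes, rng):
    """Return a list of (minute, origin, intended destination, duration in min).

    rates[j]         : Poisson arrival rate per minute at station j
    od_matrix[i]     : destination probabilities for trips starting at station i
    distances_km[i,j]: distance between stations i and j
    """
    trips = []
    num_stations = len(rates)
    for minute in range(num_minutes):
        counts = rng.poisson(rates)  # independent arrival counts for this minute
        for origin, count in enumerate(counts):
            for _ in range(count):
                dest = int(rng.choice(num_stations, p=od_matrix[origin]))
                duration = distances_km[origin, dest] / BIKING_SPEED_KM_PER_MIN
                trips.append((minute, origin, dest, duration))
    return trips

# Hypothetical two-station network in which every trip goes to the other station.
rng = np.random.default_rng(0)
rates = np.array([0.5, 0.2])
od_matrix = np.array([[0.0, 1.0], [1.0, 0.0]])
distances_km = np.array([[0.0, 2.58], [2.58, 0.0]])
simulated = simulate_arrivals(rates, od_matrix, distances_km, 200, rng)
```

With a distance of 2.58 km, every simulated trip in this toy network lasts 10 minutes.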

Substitution Behaviors. In the event that the first-choice station is unavailable at the time of

pick-up or drop-off, a commuter may search for an alternative as follows. To determine the pick-up

substitution, we first assign to the commuter a pair of “exact origin” coordinates, which correspond

to the exact reference point from which the commuter determines the distance to every station.

Examples of such reference points could be the commuter’s home address or workplace. These

coordinates are sampled uniformly in an area surrounding the commuter’s first-choice pick-up sta-

tion. Given that the first choice is unavailable, the commuter will consider all available alternatives

within distance D from his exact origin, where D is sampled from a uniform distribution over [0,1].

If any such alternatives exist, the commuter picks the one that minimizes the total travel distance, that is, the distance between the exact origin and the alternative origin station plus the distance between that station and the destination station; otherwise, the commuter leaves the system. Likewise, when dropping off the bike, the commuter considers alternatives using another random pair of “exact destination” coordinates as the reference point. However, unlike in the case of pick-up, we assume that the commuter always drops off the bike at the available alternative closest to this reference point. By randomizing the reference

point coordinates and maximum search distance D (for pick-up) in this manner, we can generate

a variety of possible commuter types and simulate a complex range of substitution behaviors.
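The substitution rules above can be sketched as two small functions. The sketch assumes planar coordinates with Euclidean distances and a non-empty set of available stations at drop-off; these choices, the function names and the toy coordinates are illustrative assumptions rather than details taken from the simulation platform.

```python
import math

def dist(a, b):
    """Euclidean distance between two (x, y) points (an assumed metric)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def choose_pickup(first_choice, destination, exact_origin, max_search,
                  coords, available):
    """Pick-up rule: take the first choice if available; otherwise consider
    available stations within max_search of the exact origin and pick the one
    minimizing walk distance plus ride distance; return None if none exist."""
    if first_choice in available:
        return first_choice
    candidates = [s for s in available
                  if dist(exact_origin, coords[s]) <= max_search]
    if not candidates:
        return None  # the commuter leaves the system
    return min(candidates,
               key=lambda s: dist(exact_origin, coords[s])
               + dist(coords[s], coords[destination]))

def choose_dropoff(first_choice, exact_destination, coords, available):
    """Drop-off rule: take the first choice if available; otherwise the
    available station closest to the exact destination."""
    if first_choice in available:
        return first_choice
    return min(available, key=lambda s: dist(exact_destination, coords[s]))

# Hypothetical layout: station 1 is the unavailable first choice.
coords = {1: (0.0, 0.0), 2: (0.3, 0.0), 3: (-0.3, 0.0), 4: (3.0, 0.0)}
picked = choose_pickup(1, 4, (0.0, 0.0), 0.5, coords, {2, 3, 4})
lost = choose_pickup(1, 4, (0.0, 0.0), 0.1, coords, {2, 3, 4})
dropped = choose_dropoff(4, (2.9, 0.0), coords, {1, 2, 3})
```

Stations 2 and 3 are equally close to the exact origin, and the pick-up rule breaks the tie in favor of station 2 because it lies nearer to the destination.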

Initial Inventory. We adopt an approach in O’Mahony and Shmoys (2015) to decide the initial

allocation of bikes based on the net balance of pick-up and drop-off demand at individual stations.

Specifically, each station is classified as either consumer, producer or self-balancer, according to

whether it has net inflow, net outflow or approximately equal in and out flow of bikes. The self-

balancers are further divided into high-usage and low-usage stations depending on their combined


Table 2 Mean Absolute Relative Errors (MAREs) of pick-up and drop-off primary demand estimation under varying rebalancing intervals (Tr).

            Outage Frequency (%)    Pick-up Errors (%)            Drop-off Errors (%)
Tr (min)    Pick-up    Drop-off     RANK    SA      FSA     MNL     RANK    SA      FSA     MNL
90          4.0        0.5          12.0    14.4    13.8    13.0    10.7    11.3    12.8    17.8
120         6.6        1.1          12.6    16.6    16.3    15.6    11.0    12.9    16.1    18.7
150         8.8        1.8          12.4    18.3    18.0    15.6    11.3    14.6    18.4    17.7
180         12.7       2.9          14.2    23.4    22.8    19.6    13.7    19.8    24.9    20.5
210         13.3       3.4          13.2    22.5    21.8    18.4    16.1    22.4    27.2    19.8
240         17.1       4.8          14.3    26.6    26.7    22.3    25.2    34.0    42.8    31.4

pick-up and drop-off demand. Then, at the beginning of each simulation, we stock the stations at

the following levels of their capacities: 10% for producers, 90% for consumers, 30% for low-usage

self-balancers and 50% for high-usage self-balancers.

Rebalancing Operations. To modulate the service outage frequencies of our simulated system, we

replenish the stations at regular intervals according to the target inventory levels specified earlier.

This is done by greedily transferring bikes from the most overstocked stations (i.e., with inventory

exceeding the target levels) to the most understocked stations, up to a maximum number of 100

moves in each interval. After every 12 hours or 720 time steps, we simulate a daily reset by restoring

the inventory back to their initial levels. By varying the rebalancing intervals, we can generate a

wide range of service outage conditions that could subsequently impact demand estimation.
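The greedy replenishment step can be sketched as follows. The text does not define what counts as a single move, so the sketch assumes that one move transfers one bike from the currently most overstocked station to the currently most understocked one; the function name and the toy inventories are likewise hypothetical.

```python
def rebalance(inventory, targets, max_moves=100):
    """Greedily move bikes toward target levels, one bike per move (assumed).

    Returns the new inventory list and the number of moves used.
    """
    inventory = list(inventory)
    moves = 0
    while moves < max_moves:
        surplus = [inv - tgt for inv, tgt in zip(inventory, targets)]
        donor = max(range(len(surplus)), key=surplus.__getitem__)
        receiver = min(range(len(surplus)), key=surplus.__getitem__)
        # Stop when no station is overstocked or none is understocked.
        if surplus[donor] <= 0 or surplus[receiver] >= 0:
            break
        inventory[donor] -= 1
        inventory[receiver] += 1
        moves += 1
    return inventory, moves

# Hypothetical inventories and targets for three stations.
balanced, used = rebalance([10, 0, 5], [5, 5, 5])
capped, used_capped = rebalance([10, 0], [5, 5], max_moves=3)
```

Lengthening the interval between calls to such a routine lets imbalances accumulate, which is how the simulation modulates outage frequency.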

7.1.2. MLE Estimation. We solve the MLE problem with the convex-concave procedure in

Algorithm 2. For MNL models, we solve the MLE problem using IPOPT (Wachter and Biegler

2006), a nonlinear optimization solver. To reduce the dimension of the MLE problem, we use

Algorithm 1 to generate a sparse set of types before solving each instance. The substitution distance

D and type length L are selected by cross validation based on minimum mean squared error of

station-wise ridership prediction on the training set over values ranging from 0.5km to 1.0km and

1 to 5, respectively.

7.2. Results

To quantify the errors of demand estimation, we define the Relative Error for a given station as

$$\text{Relative Error} = \frac{\text{Estimated Primary Demand} - \text{True Primary Demand}}{\text{True Primary Demand}}.$$

The Mean Absolute Relative Error (MARE) is then calculated by averaging the absolute values of

the relative errors over all stations.
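The MARE computation is short enough to state directly in code. The sketch below assumes strictly positive true demands; the function name and the toy numbers are hypothetical.

```python
def mean_absolute_relative_error(estimated, true):
    """Average of |estimated - true| / true over stations (true values > 0)."""
    errors = [abs(e - t) / t for e, t in zip(estimated, true)]
    return sum(errors) / len(errors)

# Hypothetical two-station example: +10% and -10% errors average to 10%.
mare = mean_absolute_relative_error([1.1, 1.8], [1.0, 2.0])
```

Because absolute values are taken before averaging, overestimates at some stations cannot cancel underestimates at others.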


7.2.1. Demand Estimation. Our first set of results pertains to the estimation of the pick-up and drop-off demands based on 60 hours (T = 3600) of simulated data, with service outage frequencies that vary between 0.5% and 17%. We summarize these results with three key observations:

1. Across the entire range of simulated outage frequencies, the RANK method consistently out-

performs SA, FSA and MNL in terms of MARE. As shown in Table 2, improvements of RANK over

the best-performing benchmarks are most significant when outage frequencies are high, with about

36% and 20% reduction of pick-up and drop-off demand estimation errors, respectively, when the

rebalancing interval is set to 240 minutes. For comparison with the actual Hubway data sets (on

which our simulations are based), the actual stockout frequency of Hubway is 11.5% during the

evening rush hours, which tracks closely with our simulated stockout frequency of 12.7% when the

rebalancing interval is set to 180 minutes.

2. Breaking down the estimation errors by individual stations in Figure 3, we observe that

the stations with the highest demands tend to benefit most from the RANK method in terms of

error reduction over the benchmarks—these are also the stations that are most frequently out of

service but account for disproportionately large shares of the overall ridership. Therefore, improved

estimates at these high-demand stations tend to have the largest impact on operational decision-

making, as we discuss later in the next section.

3. The proposed CCCP procedure converges very fast in practice, with quickly diminishing

objective value improvements after the first few iterations (here, each iteration involves solving

a concave subproblem derived from the main MLE problem using the Frank-Wolfe (FW) method, as described in

Algorithm 2). In Figure 4, we plot the log likelihood objective values as a function of CCCP itera-

tion, which are averaged over 10 independent simulations with identical setup. After only the first

3 iterations, each subsequent iteration increases the objective value by less than 0.001%, which

corresponds to negligible changes in the MLE estimates thereafter. The average times for solv-

ing the MLE instances in our experiments under Tr = 90,120,150,180,210,240 are approximately

3,4,19,24,15,24 minutes, respectively.

7.2.2. Operational Impact. Using the estimated demand from the earlier methods, we apply

a bike allocation model by Shu et al. (2013) to measure the extent to which more accurate demand

estimates can improve operational decisions. This optimization model allows us to determine,

on the basis of estimated OD demand of all station pairs, an initial allocation of bikes in the

network that maximizes ridership over a planning horizon. However, the model operates under

the assumption that commuters will leave the system immediately when their first choices are

unavailable, and thus it does not account for choice substitution in the allocation. Despite its

limitations, we adopt this optimization model because it is tractable and can be solved optimally


Figure 3 Comparison of estimated and true pick-up demand for 171 Hubway stations. The results are averaged

over 10 independent simulations with a rebalancing interval of Tr = 180 min.

(a) RANK (b) SA

(c) FSA (d) MNL

by linear programming. Furthermore, Shu et al. (2013) have shown that the optimal value of the

linear program closely matches the expected ridership obtained by simulations with a Poisson

arrival model. After configuring the initial inventory according to this model, we evaluate the actual

ridership in terms of the total number of trips taken over the planning horizon, averaged over 10

independent simulations. Figure 4 shows the results based on different methods of estimating OD

demand under Tr = 180. We observe that allocation decisions that are made on the basis of the

RANK method lead to more trips taken compared to the benchmarks, with 3% increase for the

longest planning horizon.

8. Empirical Studies: The Boston Hubway Data Set

We next present our results of applying the four methods introduced earlier, RANK, SA, FSA

and MNL, to the actual Boston Hubway data set. The data comprises two sets of records: (i) the

individual trips taken by commuters, each defined by an origin, a destination and the associated


Figure 4 Left: Convergence of the MLE objective value (in log likelihood) as a function of the iteration of the CCCP procedure. Right: Total ridership achievable when a fleet of 1,000 bikes is allocated according to the demand estimated by each method.

[Right panel: x-axis, Planning Horizon (Hour), from 4 to 7; y-axis, Total Trips, from 0 to 2500; series: RANK, MNL, FSA, SA.]

pick-up and drop-off times, and (ii) the inventory levels of individual stations, which are updated

every minute. As in the simulation studies, we focus on the peak usage hours between 5pm and

7pm on the weekdays of summer 2016, with a total of 69 days. Since there is no “true primary

demand” with which to compare the results, we instead evaluate the methods by their accuracy

in predicting the ridership at individual stations. More specifically, we first fit a demand model to

the data collected in the first 54 days, and then use it to predict the total number of bike pick-ups

and drop-offs at every station over the same 54 days (in-sample evaluation), as well as over

the subsequent 15 days (out-of-sample evaluation). For RANK and MNL methods, the predicted

number of pickups (or drop-offs) at a station given an offer set is defined to be its sum of outflow

(inflow) OD rates over the available destinations (origins). For SA and FSA methods, the prediction

is either equal to the estimated pick-up (drop-off) rate if the station is available, or zero otherwise.

To measure the predictive accuracy, we calculate the relative error for each station as

$$\text{Relative Error} = \frac{\text{Predicted Ridership} - \text{Actual Ridership}}{\text{Actual Ridership}},$$

where ridership is quantified by either the total number of pick-ups or drop-offs at the station

through the evaluation period. Averaging the absolute values of these relative errors, we obtain

the Mean Absolute Relative Error (MARE) of prediction.
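For the SA and FSA methods, the prediction rule described above reduces to summing the estimated rate over the periods in which the station is available. The following sketch, with hypothetical names and toy values, shows that rule together with the signed relative error of the resulting prediction.

```python
def predict_ridership(rate, availability):
    """Sum the per-period rate over periods in which the station is available;
    unavailable periods contribute zero."""
    return sum(rate if available else 0.0 for available in availability)

def relative_error(predicted, actual):
    """Signed relative error; negative values indicate underestimation."""
    return (predicted - actual) / actual

# Hypothetical station: rate 0.5 per period, available in 3 of 4 periods,
# with 2 pick-ups actually observed.
predicted = predict_ridership(0.5, [True, False, True, True])
error = relative_error(predicted, 2.0)
```

Keeping the sign of the error, as in the quartiles reported for Figure 5, makes it possible to distinguish systematic overestimation from underestimation.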

8.1. Results

Our main results are summarized in Table 3, which shows the in-sample and out-of-sample MAREs

of ridership prediction for each method. While the RANK method has a higher in-sample error

than MNL when it comes to pick-up prediction, it generalizes better than MNL out-of-sample

and is nearly on par with the best-performing method, FSA. On the other hand, in terms of


Table 3 Mean Absolute Relative Errors (MAREs) of in-sample and out-of-sample ridership predictions for the Hubway data set.

            In-Sample Errors (%)            Out-of-Sample Errors (%)
            RANK    SA      FSA     MNL     RANK    SA      FSA     MNL
Pick-up     11.6    22.7    11.5    0.3     24.6    32.6    24.5    25.5
Drop-off    2.6     3.9     24.5    16.4    24.9    25.7    38.6    30.0
Overall     7.1     13.3    18.0    8.4     24.8    29.2    31.6    27.8

drop-off prediction, the RANK method outperforms the rest by significant margins both in-sample

and out-of-sample. In Figure 5, we observe that the RANK method has a drop-off error distribu-

tion that is more concentrated around zero compared to the rest, as indicated by its quartiles,

(−0.141,−0.009,0.119). Meanwhile, the poor drop-off predictive accuracy of FSA and MNL can

be attributed to overestimation biases, as evidenced in their large positive error medians, 0.224

and 0.120, respectively. However, when it comes to pick-up prediction, both methods exhibit much

smaller biases, with quartiles of (−0.243,−0.097,0.041) and (−0.146,−0.003,0.152), respectively;

whereas RANK and SA have larger underestimation biases, with quartiles of (−0.257,−0.098,0.033)

and (−0.391,−0.227,−0.036), respectively.

Overall, the relatively poor out-of-sample performance of SA and FSA is hardly surprising given

that both methods ignore the effects of choice substitution altogether. As Table 3 shows, accounting for these effects through the MNL and RANK models can reduce the overall out-of-sample

errors by 5% and 15%, respectively, when compared with the SA method. Furthermore, the 11%

error improvement of RANK over MNL suggests that our proposed rank-based model is better able

to capture the choice behaviors of Hubway commuters overall.
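The percentages quoted in this paragraph are relative reductions computed from the overall out-of-sample errors in Table 3 (RANK 24.8, SA 29.2, MNL 27.8). As a worked check of the arithmetic:

```latex
\[
\frac{29.2 - 27.8}{29.2} \approx 4.8\% \ (\text{MNL vs. SA}), \qquad
\frac{29.2 - 24.8}{29.2} \approx 15.1\% \ (\text{RANK vs. SA}), \qquad
\frac{27.8 - 24.8}{27.8} \approx 10.8\% \ (\text{RANK vs. MNL}).
\]
```

These round to the 5%, 15% and 11% figures reported in the text.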

Finally, we complement our empirical studies in this paper with an interactive web-based visual-

ization, which is accessible online at http://web.mit.edu/cygoh/www/hubway2017. A computer

screenshot of this visualization is shown in Figure 6. On the web page, the user can explore the

historical averages of stockout rates and ridership statistics that are derived from our proposed

rank-based model. In particular, we retrospectively estimate the number of rides that were dis-

placed (i.e., substituted by other choices) or lost over the 69-day period. These insights could help

bike-sharing operators better understand the experience of commuters on the ground and highlight

specific stations where the bikes were not adequately stocked.

9. Conclusions and Future Work

We presented a new approach for estimating the primary demand for a bike-sharing service. Our

approach departs from earlier work in that we adopted a rank-based demand model whose complex-

ity can flexibly scale with data, allowing us to infer commuter choice behaviors without imposing


Figure 5 Out-of-sample relative errors of pick-up (left) and drop-off (right) ridership predictions at individual stations for the Hubway data set.

[Each side shows four histogram panels (RANK, SA, FSA, MNL); x-axis: Relative Error, from −2.0 to 2.0; y-axis: Relative Frequency, from 0.0 to 0.18.]

Note. The quartiles of pick-up relative errors (left) for RANK, SA, FSA and MNL are (−0.257,−0.098,0.033), (−0.391,−0.227,−0.036), (−0.243,−0.097,0.041) and (−0.146,−0.003,0.152), respectively. For drop-off prediction (right), the corresponding error quartiles are (−0.141,−0.009,0.119), (−0.212,−0.047,0.064), (0.034,0.224,0.381) and (−0.011,0.120,0.287).

parametric assumptions a priori. The proposed model comprises a Poisson process describing the

arrivals of commuters, whose choice behaviors are fully determined by their preference rankings over

the stations. We then jointly estimate the arrival rate and probability distribution over preference

rankings by maximum likelihood estimation. To make the estimation problem tractable, we intro-

duced dimensionality reduction techniques that are based on network substitution constraints and

sparsity. In practice, these techniques are able to reduce the parameter dimension by orders of

magnitude, allowing us to scale our model to a city-sized network. We also derived a DC programming method for solving the MLE problem, which reduces it to a sequence of convex optimization problems with provable convergence guarantees, each of which can be solved efficiently by the Frank-Wolfe (FW) method. Finally, we demonstrated the scalability and effectiveness of our approach through

a simulation study. Our results showed that the proposed methods are able to overcome the estima-

tion biases due to demand censoring and choice substitution, leading to increased ridership when

used as the basis for inventory planning.

Although the focus of this paper is on bike-sharing, our proposed demand model and estimation

methods are applicable more generally. For example, the OD demand model is appropriate for

other systems that share similar characteristics with a bike-sharing service, such as one-way car rental

services that allow pick-ups and drop-offs at designated parking lots. In other domains such as


Figure 6 An interactive visualization of the station stockout distribution of Hubway and its estimated impact on

ridership. Available on http://web.mit.edu/cygoh/www/hubway2017.

retail, our proposed substitution graph is also applicable if products that differ in certain quality-

defining attributes are very unlikely to be substitutable by each other. Last but not least, we tackled

the core computational issues in estimating rank-based choice models—following the work of Farias

et al. (2013) and van Ryzin and Vulcano (2014)—by proposing a new iterative MLE procedure

and a sparsity-based type enumeration algorithm that achieved good empirical performance.

There are several issues we did not address in this paper that merit consideration in future work.

First, our model assumed a static commuter arrival rate without accounting for covariates

such as weather and seasonality factors. Combining our approach with other streams of work in

structural demand modeling could yield a model that adapts better to changing scenarios. Second,

we focused on the demand of dock-based bike-sharing service in this paper. With the increasing

adoption of dockless bike-sharing systems, which allow commuters to pick up and drop off bikes

anywhere, it is worth exploring an extension of our approach to this setting. Finally, developing

tailored operational models for capacity planning and rebalancing that account for choice

substitutions could be a natural next step on top of this work.

Acknowledgments


The authors would like to thank Boston Hubway Bikes (now Blue Bikes Boston) for providing all the data

sets used in this study and for their clarifications of the data specifications. We also express our gratitude

to Weijun Ding, who was affiliated with the Georgia Institute of Technology at the time, for contributing to

early discussions of this research project.

References

Abdallah, Tarek, Gustavo Vulcano. 2017. Demand estimation under the multinomial logit model from sales

transaction data. Working paper, Columbia University, New York City, NY.

Ben-Akiva, Moshe E, Steven R Lerman. 1985. Discrete Choice Analysis: Theory and

Application to Travel Demand , vol. 9. MIT Press.

Chemla, Daniel, Frederic Meunier, Roberto Wolfler Calvo. 2013. Bike Sharing Systems: Solving the Static

Rebalancing Problem. Discrete Optimization 10(2) 120–146.

Clarkson, Kenneth L. 2010. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM

Transactions on Algorithms (TALG) 6(4) 63.

El-Assi, Wafic, Mohamed Salah Mahmoud, Khandker Nurul Habib. 2017. Effects of Built Environment And

Weather on Bike Sharing Demand: A Station Level Analysis of Commercial Bike Sharing In Toronto.

Transportation 44(3) 589–613.

Farias, Vivek F, Srikanth Jagabathula, Devavrat Shah. 2013. A Nonparametric Approach to Modeling Choice

with Limited Data. Management Science 59(2) 305–322.

Frank, Marguerite, Philip Wolfe. 1956. An algorithm for quadratic programming. Naval research logistics

quarterly 3(1-2) 95–110.

Freund, Daniel, Shane G Henderson, David B Shmoys. 2017. Minimizing multimodular functions and

allocating capacity in bike-sharing systems. International Conference on Integer Programming and

Combinatorial Optimization. Springer, 186–198.

Freund, Robert M, Paul Grigas. 2016. New analysis and results for the Frank–Wolfe method. Mathematical

Programming 155(1-2) 199–230.

Ghosh, Supriyo, Pradeep Varakantham, Yossiri Adulyasak, Patrick Jaillet. 2017. Dynamic repositioning to

reduce lost demand in bike sharing systems. Journal of Artificial Intelligence Research 58 387–430.

Haensel, Alwin, Ger Koole, Jeroen Erdman. 2011. Estimating Unconstrained Customer Choice Set Demand:

A Case Study on Airline Reservation Data. Journal of Choice Modelling 4(3) 75–87.

Jagabathula, Srikanth, Paat Rusmevichientong. 2016. The limit of rationality in choice modeling: Formula-

tion, computation, and implications. Management Science .

Jaggi, Martin. 2011. Sparse convex optimization methods for machine learning. Ph.D. thesis, ETH Zurich.

Kabra, Ashish, Elena Belavina, Karan Girotra. 2016. Bike-share Systems: Accessibility and Availability.

Working paper, INSEAD, Fontainebleau, France.


Le Thi, Hoai An, Tao Pham Dinh. 2018. DC programming and DCA: thirty years of developments. Mathe-

matical Programming 1–64.

Lipp, Thomas, Stephen Boyd. 2016. Variations and Extension of the Convex–concave Procedure. Optimiza-

tion and Engineering 17(2) 263–287.

O’Mahony, Eoin, David B Shmoys. 2015. Data Analysis and Optimization for (Citi) Bike Sharing. Proceedings

of the Twenty-Ninth AAAI Conference on Artificial Intelligence. 687–694.

Pendem, V. Deshpande, P. 2016. Maximizing Ridership in Bike-sharing Systems Using Empirical Data and

Stochastic Models. Working paper, University of North Carolina, Chapel Hill, NC.

Raviv, Tal, Michal Tzur, Iris A Forma. 2013. Static Repositioning in a Bike-sharing System: Models and

Solution Approaches. EURO Journal on Transportation and Logistics 2(3) 187–229.

Rixey, R. 2013. Station-level forecasting of bike sharing ridership: Station network effects in three U.S.

systems. Transportation Research Record: Journal of the Transportation Research Board (2387) 46–55.

Schuijbroek, Jasper, Robert C Hampshire, W-J Van Hoeve. 2017. Inventory rebalancing and vehicle routing

in bike sharing systems. European Journal of Operational Research 257(3) 992–1004.

Shapiro, Alexander, Darinka Dentcheva, Andrzej Ruszczynski. 2009. Lectures On Stochastic Programming:

Modeling and Theory . SIAM.

Shu, Jia, Mabel C Chou, Qizhang Liu, Chung-Piaw Teo, I-Lin Wang. 2013. Models for Effective Deployment

and Redistribution of Bicycles Within Public Bicycle-Sharing Systems. Operations Research 61(6)

1346–1359.

Singhvi, Divya, Somya Singhvi, Peter I Frazier, Shane G Henderson, Eoin O’Mahony, David B Shmoys,

Dawn B Woodard. 2015. Predicting Bike Usage for New York City’s Bike Sharing System. AAAI

Workshop On Computational Sustainability .

Singla, Adish, Marco Santoni, Gabor Bartok, Pratik Mukerji, Moritz Meenen, Andreas Krause. 2015. Incen-

tivizing Users for Balancing Bike Sharing Systems. Proceedings of the Twenty-Ninth AAAI Conference

on Artificial Intelligence. 723–729.

Sriperumbudur, Bharath K. 2009. On the Convergence of the Concave-Convex Procedure. Advances in

Neural Information Processing Systems 1759–1767.

Train, Kenneth E. 2009. Discrete Choice Methods with Simulation. Cambridge University Press.

van Ryzin, Garrett, Gustavo Vulcano. 2014. A Market Discovery Algorithm to Estimate a General Class of

Nonparametric Choice Models. Management Science 61(2) 281–300.

van Ryzin, Garrett, Gustavo Vulcano. 2017. Technical Note - An Expectation-Maximization Method to Esti-

mate a Rank-Based Choice Model of Demand. Operations Research 65(2) 396–407.

Vulcano, Gustavo, Garrett Van Ryzin, Wassim Chaar. 2010. OM Practice - Choice-based revenue management:

An empirical study of estimation and optimization. Manufacturing & Service Operations Management

12(3) 371–392.


Vulcano, Gustavo, Garrett Van Ryzin, Richard Ratliff. 2012. Estimating Primary Demand for Substitutable

Products from Sales Transaction Data. Operations Research 60(2) 313–334.

Wachter, Andreas, Lorenz T Biegler. 2006. On the implementation of an interior-point filter line-search

algorithm for large-scale nonlinear programming. Mathematical Programming 106(1) 25–57.


10. Appendices

10.1. Appendix A: Proofs

10.1.1. Proof of Proposition 1

Proof. The optimality of $\lambda$ follows directly from strict concavity and first-order optimality conditions. To derive the optimality conditions for $x$, let us denote $y_j \triangleq \sum_{k \in M_j(\mathcal{N})} x_k$ for all $j = 1, \ldots, N$. Since $\sum_{j=1}^{N} y_j = \sum_{j=1}^{N} P(j \mid \mathcal{N}) = 1$, we can rewrite the second maximization problem in (2) in terms of $y$ as follows:

$$\max_{y \in \Delta^N} \left\{ M \sum_{j=1}^{N} \frac{\sum_{t=1}^{T} m_{tj}}{M} \log y_j \right\}.$$

By Gibbs’ inequality, the optimal solution of the above problem is given by $y_j = \frac{\sum_{t=1}^{T} m_{tj}}{M}$ for all $j$. Further, a feasible $x$ that corresponds to $y$ can be constructed as follows: for all $k = 1, \ldots, K$, let $x_k = y_j$ if $\sigma_k = (j)$ (a type comprising only one choice $j$), and zero otherwise. $\square$
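The step that invokes Gibbs’ inequality can be checked numerically: for nonnegative counts, the weighted log objective over the simplex is maximized at the empirical frequencies. The sketch below compares the objective at that point with a few other feasible points; the counts, the comparison points and the function name are hypothetical.

```python
import math

def weighted_log_objective(counts, y):
    """sum_j counts_j * log(y_j), skipping stations with zero count."""
    return sum(c * math.log(v) for c, v in zip(counts, y) if c > 0)

counts = [6, 3, 1]                       # hypothetical pick-up counts per station
total = sum(counts)                      # plays the role of M
y_star = [c / total for c in counts]     # empirical frequencies

# A few other points on the probability simplex, for comparison.
alternatives = [
    [1 / 3, 1 / 3, 1 / 3],
    [0.5, 0.4, 0.1],
    [0.6, 0.25, 0.15],
]
best_value = weighted_log_objective(counts, y_star)
```

Every alternative yields a strictly smaller objective value than `y_star`, in line with the optimal solution stated in the proof.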

10.1.2. Proof of Proposition 2

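Proof. For a fixed $x^{(n-1)}$, the optimal solution for (11) is given by

$$\lambda^{(n)} = \frac{M}{\sum_{t=1}^{T} \left(1 - \sum_{k \in \mathcal{M}_0(S_t)} x^{(n-1)}_k\right)} = \frac{M}{\sum_{t=1}^{T} \rho(S_t; x^{(n-1)})},$$

as derived in (5). To solve for $x^{(n)}$ in (12), we substitute $\lambda^{(n)}$ into the likelihood function in (1) and then obtain

$$\arg\max_{x \in \Delta_K} \; -\frac{M \sum_{t=1}^{T} \rho(S_t; x)}{\sum_{t=1}^{T} \rho(S_t; x^{(n-1)})} + M \log \frac{M}{\sum_{t=1}^{T} \rho(S_t; x^{(n-1)})} + \sum_{t=1}^{T} \sum_{j \in S_t} m_{tj} \log \sum_{k \in \mathcal{M}_j(S_t)} x_k$$

$$= \arg\max_{x \in \Delta_K} \; -\frac{M \sum_{t=1}^{T} \rho(S_t; x)}{\sum_{t=1}^{T} \rho(S_t; x^{(n-1)})} + \sum_{t=1}^{T} \sum_{j \in S_t} m_{tj} \log \sum_{k \in \mathcal{M}_j(S_t)} x_k,$$

where the equality is due to the fact that the term $M \log\left(M / \sum_{t=1}^{T} \rho(S_t; x^{(n-1)})\right)$ does not depend on $x$. Since this maximization problem is identical to the CCCP sub-problem in (10), we conclude by induction that any sequence $\{x^{(n)}\}_{n \ge 1}$ generated by the above procedure is optimal for (10) given a common initialization point $x^{(0)}$. □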
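The closed-form update for $\lambda^{(n)}$ can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: types are represented as tuples of positive station ids, `0` is reserved for leaving the system, and all names and numbers are hypothetical.

```python
def chosen_station(ranking, offer_set):
    """Return the highest-ranked available station, or 0 for 'leave'."""
    for station in ranking:
        if station in offer_set:
            return station
    return 0

def rho(offer_set, types, x):
    """Probability of staying: 1 minus the mass of types that leave under offer_set."""
    leaving_mass = sum(
        xk for sigma, xk in zip(types, x)
        if chosen_station(sigma, offer_set) == 0
    )
    return 1.0 - leaving_mass

def lambda_update(num_trips, offer_sets, types, x_prev):
    """lambda = M / sum_t rho(S_t; x_prev)."""
    return num_trips / sum(rho(S, types, x_prev) for S in offer_sets)

# Hypothetical instance: two stations, four ranking types, three periods.
types = [(1,), (2,), (1, 2), (2, 1)]
x_prev = [0.4, 0.1, 0.3, 0.2]
offer_sets = [{1, 2}, {1}, {2}]

lam = lambda_update(50, offer_sets, types, x_prev)
```

Here the staying probabilities are 1.0, 0.9 and 0.6 for the three periods, so 50 observed trips correspond to an estimated arrival rate of 50 / 2.5 = 20 per period.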

10.1.3. Proof of Theorem 1

Proof. The relation $\hat{\Sigma}^G \subseteq \Sigma^G$ clearly holds because every type generated by Algorithm 1 satisfies the network substitution constraints. Denote $L^*$ as the optimal value of problem (3), and $\hat{L}^*$ as the optimal value of problem (4). We first show $L^* \ge \hat{L}^*$. Denote $(\hat{x}^*, \hat{\lambda}^*)$ as an optimal solution of problem (4). We construct another solution $\lambda^* = \hat{\lambda}^*$ and $x^* = [\hat{x}^*, 0]$, where $x^*_i = \hat{x}^*_i$ for all $i \in \hat{\Sigma}^G$ and $x^*_i = 0$ for all $i \in \Sigma \setminus \hat{\Sigma}^G$. Clearly $(x^*, \lambda^*)$ is a feasible solution for problem (3) and has objective value $\hat{L}^*$. Thus the optimal value of (3), $L^*$, is at least $\hat{L}^*$.

We then show $L^* \le \hat{L}^*$. To prove that, we provide a method to convert any optimal solution $(x^*, \lambda^*)$ of problem (3) to a feasible solution of problem (4) with the same objective value. Let us consider a type $\sigma_1 = (j_1, \ldots, j_k) \in \Sigma^G$ that is not in $\hat{\Sigma}^G$. For every $t = 1, \ldots, T$, the type $\sigma_1$ must be compatible with either some choice $i \in S_t$ or not using the service (choice 0) given offer set $S_t$. We consider the following three disjoint cases:

Case 1: If $\sigma_1$ is compatible with not using the service for any offer set $S_t$, $t \in \{1, \ldots, T\}$, it must be generated within steps (9) to (11) in Algorithm 1 at time period $t$. This contradicts $\sigma_1 \notin \hat{\Sigma}^G$.

Case 2: For the remaining types, suppose there exist some $s \in \{1, \ldots, k\}$ and a corresponding $t \in \{1, \ldots, T\}$ such that $j_s \in P_t$ and the type $\sigma_1$ is compatible with the choice $j_s$ given offer set $S_t$, i.e., $\sigma_1 \in \Gamma_{j_s}(S_t) \triangleq \{\sigma_k : k \in \mathcal{M}_{j_s}(S_t)\}$. Let us choose the largest possible integer $s \le k$ for which this condition holds, and construct another type $\sigma_2 = (j_1, \ldots, j_s)$. Roughly speaking, our goal is to show that (i) $\sigma_2 \in \hat{\Sigma}^G$, and (ii) $A_{\sigma_1} = A_{\sigma_2}$. This means that a solution $x'$ constructed from $x$ with $x'_{\sigma_1} = 0$, $x'_{\sigma_2} = x_{\sigma_2} + x_{\sigma_1}$ and $x'_{\sigma} = x_{\sigma}$ for all $\sigma \in \Sigma$ with $\sigma \ne \sigma_1, \sigma_2$ has the same objective value as $x$.

Clearly, $\sigma_2 \in \hat{\Sigma}^G$ because it is enumerated in the $t$-th iteration of the algorithm, where we have $j_1 \in (\mathcal{N} \setminus S_t) \cap \delta_G(j_s)$, $\{j_2, \ldots, j_{s-1}\} \subset (\mathcal{N} \setminus S_t) \cap \delta_G(j_1)$ and $\sigma_2 \in \Pi_{\{j_2, \ldots, j_{s-1}\}}(j_1, j_s)$. We now show that $A_{\sigma_1} = A_{\sigma_2}$. Recall that $A_{\sigma}$ comprises individual rows that, for every $t'$ and every $i \in P_{t'} \cup \{0\}$, take value 1 if $\sigma \in \Gamma_i(S_{t'})$, and 0 otherwise. Thus it suffices to prove that $\sigma_1 \in \Gamma_i(S_{t'}) \iff \sigma_2 \in \Gamma_i(S_{t'})$ for every $t'$ and every $i \in P_{t'} \cup \{0\}$. Since we have already ruled out case 1, we have $\sigma_1, \sigma_2 \notin \Gamma_0(S_{t'})$. We thus only focus on proving the equivalence for every $i \in P_{t'}$ and every $t'$. One direction of this equivalence is trivial: if $\sigma_2 \in \Gamma_i(S_{t'})$ for some $i \in P_{t'}$, then $i$ must be the highest ranked available alternative among $(j_1, \ldots, j_s)$ given $S_{t'}$. Since $\sigma_1$ and $\sigma_2$ have identical rankings of the first $s$ choices, we also have $\sigma_1 \in \Gamma_i(S_{t'})$. For the other direction, observe that if $\sigma_1 \in \Gamma_i(S_{t'})$, then $i \in \{j_1, \ldots, j_s\}$ holds because of how we have chosen $s$. Again, since the first $s$ choices of $\sigma_1$ and $\sigma_2$ are identically ranked, we also have $\sigma_2 \in \Gamma_i(S_{t'})$.

Case 3: For the remaining types, $\sigma_1$ must be compatible with some unobserved choice $i \in S_t \setminus P_t$ given offer set $S_t$ for some $t \in \{1, \ldots, T\}$. Since $\sigma_1$ does not belong to cases 1 and 2, we have $A_{\sigma_1} = 0$. According to Lemma 1, any optimal solution $x^*$ of problem (3) has $x^*_{\sigma_1} = 0$.

These analyses suggest that for any optimal solution $(x^*, \lambda^*)$ of problem (3) with $x^*_{\sigma} > 0$ for some $\sigma \in \Sigma^G \setminus \hat{\Sigma}^G$, the type $\sigma$ must belong to case 2. Applying the transformation method we describe in case 2, we can get a solution $x^{**}$ with $x^{**}_{\sigma} = 0$ for all $\sigma \in \Sigma^G \setminus \hat{\Sigma}^G$. Moreover, $x^*$ and $x^{**}$ have the same objective value. Now consider the sub-vector $x^{**}_{\hat{\Sigma}^G}$ obtained by selecting only the elements corresponding to types in $\hat{\Sigma}^G$. It is clear that $(x^{**}_{\hat{\Sigma}^G}, \lambda^*)$ is a feasible solution of problem (4) and has the same objective value. This leads to our conclusion that $L^* \le \hat{L}^*$.

We thus have $L^* = \hat{L}^*$, which completes the proof. □
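The transformation in case 2 can be illustrated on a toy instance. The sketch below is hypothetical and does not reproduce Algorithm 1: the station ids, offer sets, observed pickups and helper names are all assumptions made for illustration. It shows that a type and its truncation can share the same compatibility column, so moving probability mass from one to the other leaves every modeled choice probability unchanged.

```python
def chosen_station(ranking, offer_set):
    """Return the highest-ranked available station, or 0 for 'leave'."""
    for station in ranking:
        if station in offer_set:
            return station
    return 0

def compatibility_column(ranking, periods):
    """One row per period and per i in (observed pickups + [0]); 1 if the type picks i."""
    column = []
    for offer_set, observed in periods:
        choice = chosen_station(ranking, offer_set)
        for i in sorted(observed) + [0]:
            column.append(1 if choice == i else 0)
    return column

def merge_mass(x, source, target):
    """Move all probability mass from type `source` to type `target`."""
    merged = dict(x)
    merged[target] = merged.get(target, 0.0) + merged.get(source, 0.0)
    merged[source] = 0.0
    return merged

def choice_probability(x, offer_set, station):
    """Total mass of types whose best available option is `station`."""
    return sum(p for ranking, p in x.items()
               if chosen_station(ranking, offer_set) == station)

# Hypothetical data: (offer set S_t, observed pickup stations P_t) per period.
periods = [({2, 3}, [2]), ({1, 2, 3}, [1])]
sigma_1, sigma_2 = (1, 2, 3), (1, 2)  # sigma_2 truncates sigma_1 after position s = 2

x = {sigma_1: 0.25, sigma_2: 0.5, (3,): 0.25}
x_merged = merge_mass(x, sigma_1, sigma_2)
```

In this instance station 3 is never the observed choice of `sigma_1`, which is why dropping it from the ranking changes nothing in the rows of the compatibility matrix.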


10.1.4. Proof of Corollary 1

Proof. We start by considering two compatibility matrices, $A \in \{0,1\}^{(T+M) \times |\Sigma^G_o|}$ for origin types and $B \in \{0,1\}^{(T+M) \times |\Sigma^G_d|}$ for destination types, where $T$ is the number of time periods, $M$ is the number of observed trips, and $|\Sigma^G_o|$ and $|\Sigma^G_d|$ are the numbers of types considered for origin and destination station rankings. The relation $y^o = Au$ specifies the linear relationship between the origin station choice vector $y^o$ and the origin station ranking probabilities $u$, with $y^o_{S_{t^o_k}, i_k} = \sum_{l \in \mathcal{M}^o_{i_k}(S_{t^o_k})} u_l$ for all observed trip origin stations $\{i_k\}_{k=1,\ldots,M}$ and $y^o_{S_t, 0} = \sum_{l \in \mathcal{M}^o_0(S_t)} u_l$ for all $t = 1, \ldots, T$ for the choices corresponding to not using the system. Similarly, $y^d = Bv$ specifies the linear relationship between the destination station choice vector $y^d$ and the destination station ranking probabilities $v$, with $y^d_{Q_{t^d_k}, j_k} = \sum_{m \in \mathcal{M}^d_{j_k}(Q_{t^d_k})} v_m$ for all observed trip destination stations $\{j_k\}_{k=1,\ldots,M}$. However, for all choices corresponding to not using the system, we have $y^d_{Q_t, 0} = \sum_{m \in \Sigma^G_d} v_m$ for all $t = 1, \ldots, T$. This is counter-intuitive at first glance. The idea is that a commuter's decision to leave depends only on the availability of origin stations. All destination ranking types are thus specified to be compatible with not using the system.

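We now consider another compatibility matrix $C \in \{0,1\}^{(T+M) \times |\Sigma_o||\Sigma_d|}$ constructed by the Hadamard product (element-wise product) of all pairs of columns in $A$ and $B$:

$$C = [A_l \circ B_m]_{l \in \{1,\ldots,|\Sigma_o|\},\, m \in \{1,\ldots,|\Sigma_d|\}} = \left[ \begin{pmatrix} A_{1,l} B_{1,m} \\ A_{2,l} B_{2,m} \\ \vdots \\ A_{T+M,l} B_{T+M,m} \end{pmatrix} \right]_{l \in \{1,\ldots,|\Sigma_o|\},\, m \in \{1,\ldots,|\Sigma_d|\}},$$

where $A_l$ and $B_m$ are the $l$th and $m$th columns of matrices $A$ and $B$, respectively. We can now check that the joint origin-destination choice vector, with entries

$$y^{od}_{(S_{t^o_k}, Q_{t^d_k}), (i_k, j_k)} = \sum_{l \in \mathcal{M}^o_{i_k}(S_{t^o_k})} \sum_{m \in \mathcal{M}^d_{j_k}(Q_{t^d_k})} x_{lm} \quad \text{and} \quad y^{od}_{(S_t, Q_t), 0} = \sum_{l \in \mathcal{M}^o_0(S_t)} \sum_{m \in \Sigma^G_d} x_{lm},$$

can be specified compactly by $y^{od} = Cx$.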

Based on Lemma 1 we know that identical columns and zero columns in matrix $C$ are redundant for maximizing the likelihood. Further note that the following two facts hold:

• For any two identical columns in $A$, $A_i$ and $A_j$, their Hadamard products with any column in $B$ are identical, i.e., $A_i \circ B_k = A_j \circ B_k$ for all $k \in \{1, \ldots, |\Sigma_d|\}$. Similarly, for any two identical columns in $B$, $B_i$ and $B_j$, their Hadamard products with any column in $A$ are identical, i.e., $B_i \circ A_k = B_j \circ A_k$ for all $k \in \{1, \ldots, |\Sigma_o|\}$.

• For any zero column in $A$ ($B$), its Hadamard product with any column in $B$ ($A$) is still a zero column.

Together, these two bullet points reveal that for any origin type $\sigma_o \in \Sigma^G_o \setminus \hat{\Sigma}^G_o$ and any destination type $\sigma_d \in \Sigma^G_d$, the OD type variable corresponding to the paired type $(\sigma_o, \sigma_d)$ is redundant. Similarly, for any destination type $\sigma_d \in \Sigma^G_d \setminus \hat{\Sigma}^G_d$ and any origin type $\sigma_o \in \Sigma^G_o$, the OD type variable corresponding to origin type $\sigma_o$ and destination type $\sigma_d$ is also redundant. This means that we can restrict ourselves to types in $\hat{\Sigma}^G_o \times \hat{\Sigma}^G_d$. From the description of Algorithm 3, we can see that $\hat{\Sigma}^G_{od} = \text{SparseEnumeratePaired}(G, S, Q, T, \hat{\Sigma}^G_o, \hat{\Sigma}^G_d) \subseteq \hat{\Sigma}^G_o \times \hat{\Sigma}^G_d$. What is left is to show that any $(\sigma_o, \sigma_d) \in (\hat{\Sigma}^G_o \times \hat{\Sigma}^G_d) \setminus \hat{\Sigma}^G_{od}$ can be effectively dropped. This is true because such a pair $(\sigma_o, \sigma_d)$ can explain neither the observed pickups and dropoffs nor the choice of not using the service. Thus its corresponding column in matrix $C$ is a zero column. This completes the proof. □
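The column-pairing argument can be made concrete with plain lists. The following sketch uses small hypothetical 0/1 columns (not derived from any data set in the paper) and illustrative helper names. It forms every Hadamard product $A_l \circ B_m$ and flags the pairs whose column is zero or duplicates an earlier one, which are the columns that Lemma 1 treats as redundant.

```python
def hadamard(col_a, col_b):
    """Element-wise product of two equally long 0/1 columns."""
    return [a * b for a, b in zip(col_a, col_b)]

def paired_columns(a_cols, b_cols):
    """Map each (l, m) index pair to the column A_l * B_m (element-wise)."""
    return {
        (l, m): hadamard(a, b)
        for l, a in enumerate(a_cols)
        for m, b in enumerate(b_cols)
    }

def redundant_pairs(c_cols):
    """Pairs whose column is all zeros or repeats a previously seen column."""
    seen = set()
    redundant = []
    for pair, col in c_cols.items():
        key = tuple(col)
        if not any(key) or key in seen:
            redundant.append(pair)
        else:
            seen.add(key)
    return redundant

# Hypothetical origin columns: columns 0 and 1 are identical, column 2 is zero.
a_cols = [[1, 0, 1], [1, 0, 1], [0, 0, 0]]
# Hypothetical destination columns.
b_cols = [[1, 1, 0], [0, 1, 1]]

c_cols = paired_columns(a_cols, b_cols)
dropped = redundant_pairs(c_cols)
```

Every pair built from the duplicate origin column or from the zero origin column is flagged, matching the two bullet points in the proof.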

10.1.5. Proof of Corollary 2

Proof. We first show that for any two destination types $\sigma^d_l \subset \sigma^d_m$, we have $B_l \le B_m$, where $B$ is the destination type compatibility matrix. To see that, for a row $r$ corresponding to not using the service under available destination stations $Q_t$, we have $B_{r,l} = B_{r,m} = 1$ by definition; for a row $r$ corresponding to returning a bike at station $j_k$ under available destination stations $Q_{t^d_k}$, we have $B_{r,l} \le B_{r,m}$ because if $\sigma^d_l \in \mathcal{M}^d_{j_k}(Q_{t^d_k})$ then $\sigma^d_m \in \mathcal{M}^d_{j_k}(Q_{t^d_k})$, but not vice versa.

We further know that for any column $A_k$ in the origin compatibility matrix $A$, $A_k \circ B_l \le A_k \circ B_m$ given $B_l \le B_m$. Applying Lemma 1, we know that all type variables corresponding to $A_k \circ B_l$ are zero in the optimal solution. Combined with Corollary 1, this completes the proof. □
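The monotonicity step, $B_l \le B_m$ implies $A_k \circ B_l \le A_k \circ B_m$, holds entry by entry because all entries are nonnegative. A minimal sketch with hypothetical columns (the values and helper names are assumptions for illustration only):

```python
def hadamard(col_a, col_b):
    """Element-wise product of two equally long 0/1 columns."""
    return [a * b for a, b in zip(col_a, col_b)]

def dominated(col_small, col_large):
    """True if col_small <= col_large in every row."""
    return all(s <= g for s, g in zip(col_small, col_large))

# Hypothetical destination columns with B_l <= B_m row by row.
b_l = [1, 0, 1, 0]
b_m = [1, 1, 1, 0]
# Hypothetical origin column.
a_k = [0, 1, 1, 1]

paired_l = hadamard(a_k, b_l)
paired_m = hadamard(a_k, b_m)
```

Multiplying both columns by the same nonnegative origin column preserves the ordering, so the paired column built from the shorter destination type is dominated as well.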

10.1.6. Proof of Theorem 3

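Proof. We first show that under the observability conditions, the parameters $\lambda^*$ and $y^*$ can be uniquely identified by maximizing the expected log-likelihood function under a fixed time period $T$. That is, any optimal solution $\hat{\lambda}$ and $\hat{x}$ of the problem

$$\max_{\lambda \ge 0,\; x \in \Delta_K} \frac{1}{T} E[L_T(\lambda, x)] \tag{14}$$

necessarily satisfies $\hat{\lambda} = \lambda^*$ and $P(j \mid S; \hat{x}) = y^*_{S,j}$ for all $S \in 2^{\mathcal{N}}$ and all $j \in S$. To show this, let us denote $2^{\mathcal{N}} = \{S_1, \ldots, S_L\}$ and assume that $T$ is sufficiently large such that each offer set has been observed at least once, i.e., for each $l = 1, \ldots, L$, we have $\frac{1}{T} \sum_{t=1}^{T} \mathbb{1}(S_t = S_l) \triangleq \hat{\alpha}_{S_l} > 0$ (here, $\hat{\alpha}_{S_l}$ is not necessarily equal to $\alpha_{S_l}$ for any finite $T$). We have

$$\begin{aligned}
\frac{1}{T} E[L_T(\lambda, x)]
&= -\frac{\lambda}{T} \sum_{t=1}^{T} \rho(S_t; x) + \frac{1}{T} \sum_{t=1}^{T} E[M_t] \log \lambda + \frac{1}{T} \sum_{t=1}^{T} \sum_{j \in S_t} E[m_{tj}] \log P(j \mid S_t; x) \\
&= -\frac{\lambda}{T} \sum_{t=1}^{T} \rho(S_t; x) + \frac{1}{T} \sum_{t=1}^{T} \lambda^* \sum_{j \in S_t} y^*_{S_t,j} \log \lambda + \frac{1}{T} \sum_{t=1}^{T} \sum_{j \in S_t} \lambda^* y^*_{S_t,j} \log P(j \mid S_t; x) \\
&= -\sum_{l=1}^{L} \hat{\alpha}_{S_l} \lambda \rho(S_l; x) + \sum_{l=1}^{L} \hat{\alpha}_{S_l} \lambda^* \sum_{j \in S_l} y^*_{S_l,j} \log \lambda + \sum_{l=1}^{L} \hat{\alpha}_{S_l} \lambda^* \sum_{j \in S_l} y^*_{S_l,j} \log P(j \mid S_l; x).
\end{aligned} \tag{15}$$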

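Let us define $\lambda^*_{S_l} \triangleq \lambda^* \sum_{j \in S_l} y^*_{S_l,j}$ to be the true arrival rate to the system given the offer set $S_l$, and $\lambda_{S_l} \triangleq \lambda \rho(S_l; x)$ to be its corresponding estimate based on variables $\lambda$ and $x$. Also, let us define the variables $y_{S_l,j} \triangleq P(j \mid S_l; x)$ for every $l$ and $j \in S_l$. We can rewrite (15) as a function of the variables $\{\lambda_{S_l}\}_l$ and $\{y_{S_l,j}\}_{l,j}$ as follows:

$$\begin{aligned}
\frac{1}{T} E[L_T(\lambda, y)]
&= -\sum_{l=1}^{L} \hat{\alpha}_{S_l} \lambda_{S_l} + \sum_{l=1}^{L} \hat{\alpha}_{S_l} \lambda^*_{S_l} \log \frac{\lambda_{S_l}}{\sum_{j' \in S_l} y_{S_l,j'}} + \sum_{l=1}^{L} \hat{\alpha}_{S_l} \lambda^*_{S_l} \sum_{j \in S_l} \frac{y^*_{S_l,j}}{\sum_{j' \in S_l} y^*_{S_l,j'}} \log y_{S_l,j} \\
&= \sum_{l=1}^{L} \hat{\alpha}_{S_l} \left( -\lambda_{S_l} + \lambda^*_{S_l} \log \lambda_{S_l} \right) + \sum_{l=1}^{L} \hat{\alpha}_{S_l} \lambda^*_{S_l} \sum_{j \in S_l} \frac{y^*_{S_l,j}}{\sum_{j' \in S_l} y^*_{S_l,j'}} \log \frac{y_{S_l,j}}{\sum_{j' \in S_l} y_{S_l,j'}}.
\end{aligned}$$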

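In the above equation, we can interpret $p^*_{l,j} \triangleq \frac{y^*_{S_l,j}}{\sum_{j' \in S_l} y^*_{S_l,j'}}$ as the conditional probability that an arriving commuter chooses station $j$ from the offer set $S_l$ given that she stays in the system, and $p_{l,j} \triangleq \frac{y_{S_l,j}}{\sum_{j' \in S_l} y_{S_l,j'}}$ as the corresponding estimate based on $y$. Given our assumption that $y^*_{S_l,j} > 0$, we have $p^*_{l,j} > 0$. Note that for any feasible choice probability vector $y$, we must have $p_{l,j} \ge 0$ for all $l, j$ and $\sum_{j \in S_l} p_{l,j} = 1$ for all $l$. We then consider the optimization problem over $\{\lambda_{S_l}\}_l$ and $\{p_{l,j}\}_{l,j}$ below:

$$\begin{aligned}
\text{maximize} \quad & \sum_{l=1}^{L} \hat{\alpha}_{S_l} \left( -\lambda_{S_l} + \lambda^*_{S_l} \log \lambda_{S_l} \right) + \sum_{l=1}^{L} \hat{\alpha}_{S_l} \lambda^*_{S_l} \sum_{j \in S_l} p^*_{l,j} \log p_{l,j} \\
\text{subject to} \quad & \sum_{j \in S_l} p_{l,j} = 1, \quad \forall l = 1, \ldots, L \\
& p_{l,j} \ge 0, \quad \forall l = 1, \ldots, L,\ \forall j \in S_l \\
& \lambda_{S_l} \ge 0, \quad \forall l = 1, \ldots, L
\end{aligned} \tag{16}$$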

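By Gibbs' inequality, it has a unique optimal solution $\lambda_{S_l} = \lambda^*_{S_l}$ for all $l$ and $p_{l,j} = p^*_{l,j}$ for all $l, j$. Now, observe that the original MLE problem in (14) is equivalent to (16), but with auxiliary variables $\lambda$, $x$ and $y$ that are subject to additional constraints:

$$\begin{aligned}
& \lambda_{S_l} = \lambda \sum_{j \in S_l} y_{S_l,j}, \quad \forall l = 1, \ldots, L \\
& p_{l,j} = \frac{y_{S_l,j}}{\sum_{j' \in S_l} y_{S_l,j'}}, \quad \forall l = 1, \ldots, L,\ \forall j \in S_l \\
& y = Ax \\
& x \in \Delta_K.
\end{aligned} \tag{17}$$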

Note that since $y_{S_l,j}$ appears for all $l \in \{1, \ldots, L\}$ and $j \in S_l$ in this formulation, we can simply replace $y_{S_l,0}$ with $1 - \sum_{j \in S_l} y_{S_l,j}$, and matrix $A$ does not need to contain rows corresponding to not using the system for each offer set. To be specific, $A$ only specifies the linear relationships between $y$ and the ranking probabilities $x$, with $y_{S_l,j} = \sum_{k \in \mathcal{M}_j(S_l)} x_k$ holding for each $l$ and $j \in S_l$.

Let us fix $\lambda_{S_l} = \lambda^*_{S_l}$ and $p_{l,j} = p^*_{l,j}$ on the LHS of (17). Clearly, the above equations are satisfied if we assign $\lambda = \lambda^*$, $y_{S_l,j} = y^*_{S_l,j}$ for all $l, j$, and any $x \in \Delta_K$ satisfying $Ax = y^*$. Since we can find a feasible variable assignment $(\lambda^*, y^*)$ for the above equations that results in the same objective value as the relaxed problem in (16), it follows that the assignment must be optimal. We next show that $(\lambda^*, y^*)$ is the unique assignment that corresponds to the optimal objective value. Given that we observe every possible offer set, including the complete set $\mathcal{N}$, we obtain from the first set of equations in (17) that $\lambda^*_{\mathcal{N}} = \lambda^* = \lambda \sum_{j \in \mathcal{N}} y_{\mathcal{N},j} = \lambda$, which uniquely determines $\lambda$. As a result, the first two sets of equations in (17) (with $\lambda_{S_l} = \lambda^*_{S_l}$, $p_{l,j} = p^*_{l,j}$ and $\lambda = \lambda^*$ now fixed) become:

$$\begin{aligned}
& \lambda^* \sum_{j \in S_l} y^*_{S_l,j} = \lambda^* \sum_{j \in S_l} y_{S_l,j}, \quad \forall l = 1, \ldots, L \\
& \frac{y^*_{S_l,j}}{\sum_{j' \in S_l} y^*_{S_l,j'}} = \frac{y_{S_l,j}}{\sum_{j' \in S_l} y_{S_l,j'}}, \quad \forall l = 1, \ldots, L,\ \forall j \in S_l
\end{aligned} \tag{18}$$

Since we have assumed that all possible offer sets are observable and $y^* > 0$, the above set of equations has the unique solution $y = y^*$.
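The two building blocks of the relaxed problem (16) can be checked numerically for a single offer set. The sketch below is illustrative only: the true rate, the true conditional probabilities, the search grid and the function names are hypothetical. It confirms that $-\lambda_{S_l} + \lambda^*_{S_l} \log \lambda_{S_l}$ peaks at the true rate and that $\sum_j p^*_{l,j} \log p_{l,j}$ is largest at $p = p^*$.

```python
import math

def rate_objective(lam, lam_true):
    """Poisson-type term for one offer set: -lam + lam_true * log(lam)."""
    return -lam + lam_true * math.log(lam)

def cross_entropy_objective(p, p_true):
    """Choice term for one offer set: sum_j p_true[j] * log(p[j])."""
    return sum(q * math.log(v) for q, v in zip(p_true, p) if q > 0)

# Hypothetical true arrival rate and a coarse grid of candidate rates.
lam_true = 4.0
grid = [0.5 + 0.25 * i for i in range(40)]
best_lam = max(grid, key=lambda lam: rate_objective(lam, lam_true))

# Hypothetical true conditional choice probabilities and two competitors.
p_true = [0.2, 0.5, 0.3]
candidates = [[0.25, 0.5, 0.25], [1 / 3, 1 / 3, 1 / 3]]
value_at_truth = cross_entropy_objective(p_true, p_true)
values_elsewhere = [cross_entropy_objective(p, p_true) for p in candidates]
```

The first term is strictly concave in the rate with stationary point at the true rate, and the second is the Gibbs' inequality used in the proof; the grid and the two candidates merely illustrate both facts.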

Having proven that maximizing the expected log-likelihood function in (14) results in an optimal solution that corresponds to the true parameters $\lambda^*$ and $y^*$, we only have to show that, with probability 1, it is equivalent to maximizing the empirical log-likelihood function in the limit $T \to \infty$:

$$\lim_{T \to \infty} \max_{\lambda, y} \frac{1}{T} L_T(\lambda, y) \equiv \max_{\lambda, y} \lim_{T \to \infty} \frac{1}{T} E[L_T(\lambda, y)],$$

where the equivalence above holds in the sense that both optimization problems have the same optimal objective value and the same set of optimal solutions in the limit. Indeed, this equivalence can be proven by checking that the conditions for strong uniform convergence hold (see Theorem 5.3 from Chapter 5 of Shapiro et al. (2009)). The most critical step of the proof is to find a compact set $C$ such that $E[L_T(\lambda, y)]$ is finite valued on $C$, the optimal solution of $E[L_T(\lambda, y)]$ is contained in $C$, and, with probability one, for $T$ large enough, the optimal solution of $L_T(\lambda, y)$ is also contained in $C$. Note that the difficulty here is that $C$ cannot be the original feasible region because any $y_{S_l,j} = 0$ makes the likelihood $-\infty$. Here we provide a proof which shows that we can find a constant $\varepsilon$ such that, for sufficiently large $T$, optimal solutions of $L_T(\lambda, y)$ can be restricted to $y_{S_l,j} \ge \varepsilon$. With a change of variable similar to (6), we have the following sample likelihood function in the variable $y$ only:

$$L_T(y) = \left( \sum_{l=1}^{L} \sum_{j \in S_l} m_{lj} \log(y_{S_l,j}) \right) - M \log \left( \sum_{l=1}^{L} T \hat{\alpha}_{S_l} \sum_{j \in S_l} y_{S_l,j} \right),$$

where $m_{lj} = \sum_{t=1}^{T} m_{tj} \mathbb{1}(S_t = S_l)$.

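Now consider a feasible solution $y^0 = Ax$ constructed from a uniform probability vector $x_i = 1/K$ for all $i = 1, \ldots, K$. Let $\hat{y}^*$ be an optimal solution of the sample likelihood problem (the hat distinguishes it from the true parameter $y^*$). We then have

$$\begin{aligned}
L_T(\hat{y}^*) \ge L_T(y^0)
&= \left( \sum_{l=1}^{L} \sum_{j \in S_l} m_{lj} \log \left( \frac{|\mathcal{M}_j(S_l)|}{K} \right) \right) - M \log \left( \sum_{l=1}^{L} T \hat{\alpha}_{S_l} \sum_{j \in S_l} \frac{|\mathcal{M}_j(S_l)|}{K} \right) \\
&\ge \sum_{l=1}^{L} \sum_{j \in S_l} m_{lj} \log \left( \frac{1}{K} \right) - M \log(T) = M \log \left( \frac{1}{K} \right) - M \log(T).
\end{aligned}$$

Suppose $\min_{l \in \{1, \ldots, L\},\, j \in S_l} \hat{y}^*_{S_l,j} = \varepsilon$, and let $(l, j)$ denote an index pair attaining this minimum. We have

$$\begin{aligned}
L_T(\hat{y}^*)
&= \left( \sum_{l'=1}^{L} \sum_{j' \in S_{l'}} m_{l'j'} \log(\hat{y}^*_{S_{l'},j'}) \right) - M \log \left( \sum_{l'=1}^{L} T \hat{\alpha}_{S_{l'}} \sum_{j' \in S_{l'}} \hat{y}^*_{S_{l'},j'} \right) \\
&\le m_{lj} \log(\varepsilon) - M \log \left( \sum_{l'=1}^{L} T \hat{\alpha}_{S_{l'}} \mathbb{1}(S_{l'} = \mathcal{N}) \sum_{j' \in S_{l'}} \hat{y}^*_{S_{l'},j'} \right) \\
&\le m_{lj} \log(\varepsilon) - M \log \left( T \hat{\alpha}_{\mathcal{N}} \sum_{j' \in \mathcal{N}} \hat{y}^*_{\mathcal{N},j'} \right) \\
&= m_{lj} \log(\varepsilon) - M \log(T \hat{\alpha}_{\mathcal{N}}).
\end{aligned}$$

In order to have $L_T(\hat{y}^*) \ge L_T(y)$ for all $y$, the following needs to hold:

$$m_{lj} \log(\varepsilon) - M \log(T \hat{\alpha}_{\mathcal{N}}) \ge M \log \left( \frac{1}{K} \right) - M \log(T),$$

$$\log(\varepsilon) \ge \frac{\log(1/K) + \log(\hat{\alpha}_{\mathcal{N}})}{m_{lj} / M}.$$

Note that as $T \to \infty$, $\hat{\alpha}_{\mathcal{N}} \to \alpha_{\mathcal{N}}$ and $(m_{lj}/M) \to \alpha_{S_l} y^*_{S_l,j}$. This means that for any $\varepsilon_1 > 0$ such that $\alpha_{\mathcal{N}} - \varepsilon_1 > 0$ and any $\varepsilon_2 > 0$, for sufficiently large $T$ we have

$$\log(\varepsilon) \ge \frac{\log(1/K) + \log(\alpha_{\mathcal{N}} - \varepsilon_1)}{\alpha_{S_l} y^*_{S_l,j} + \varepsilon_2}.$$

We thus identify such a compact set $C$ as

$$\exp \left( \frac{\log(1/K) + \log(\alpha_{\mathcal{N}} - \varepsilon_1)}{\alpha_{S_l} y^*_{S_l,j} + \varepsilon_2} \right) \le y_{S_l,j} \le 1, \quad \forall l \in \{1, \ldots, L\},\ j \in S_l.$$

To avoid tedious presentation, we omit the details of checking the rest of the regularity conditions for strong uniform convergence. □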
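The finite-sample lower bound on $\varepsilon$ derived above, $\log \varepsilon \ge (\log(1/K) + \log \hat{\alpha}_{\mathcal{N}}) / (m_{lj}/M)$, is easy to evaluate. The following is a minimal sketch with hypothetical inputs; the function and argument names are introduced here for illustration and do not come from the paper.

```python
import math

def epsilon_lower_bound(num_types, alpha_full, share):
    """Evaluate exp((log(1/K) + log(alpha_N)) / share).

    num_types  -- K, the number of ranking types
    alpha_full -- empirical frequency of the complete offer set, in (0, 1]
    share      -- m_lj / M for the minimizing pair (l, j), in (0, 1]
    """
    if num_types < 1 or not 0 < alpha_full <= 1 or not 0 < share <= 1:
        raise ValueError("inputs outside the range assumed by the bound")
    numerator = math.log(1.0 / num_types) + math.log(alpha_full)
    return math.exp(numerator / share)

# Hypothetical values: K = 4 types, full offer set seen half the time,
# and a quarter of all trips fall on the minimizing pair (l, j).
eps = epsilon_lower_bound(num_types=4, alpha_full=0.5, share=0.25)
```

With these inputs the bound equals $(1/8)^4$. It shrinks quickly as the share $m_{lj}/M$ decreases, which is consistent with the proof needing only some strictly positive constant rather than a tight one.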
