mining spectrum usage data a large-scale

Upload: laerciomos

Post on 04-Apr-2018

220 views

Category:

Documents


0 download

TRANSCRIPT

  • 7/29/2019 Mining Spectrum Usage Data a Large-Scale

    1/14

    Mining Spectrum Usage Data: A Large-ScaleSpectrum Measurement Study

    Sixing Yin, Dawei Chen, Student Member , IEEE , Qian Zhang, Fellow , IEEE ,

    Mingyan Liu, Senior Member , IEEE , and Shufang Li

    Abstract Dynamic spectrum access has been a subject of extensive study in recent years. The increasing volume of literatures callsfor a deeper understanding of the characteristics of current spectrum utilization. In this paper, we present a detailed spectrummeasurement study, with data collected in the 20 MHz to 3 GHz spectrum band and at four locations concurrently in Guangdongprovince of China. We examine the statistics of the collected data, including channel vacancy statistics, channel utilization within eachindividual wireless service, and the spectral and spatial correlation of these measures. Main findings include that the channel vacancydurations follow an exponential-like distribution, but are not independently distributed over time, and that significant spectral and spatialcorrelations are found between channels of the same service. We then exploit such spectrum correlation to develop a 2D frequentpattern mining algorithm that can predict channel availability based on past observations with considerable accuracy.

    Index Terms Spectrum measurement, channel vacancy duration, service congestion rate, spectrum usage prediction, frequentpattern mining, FPM-2D, spectral correlation, spatial correlation.

    1 INTRODUCTION

    RECENT advances in software defined radio (SDR) [11]and cognitive radio (CR) [6], [22] combined with ever-increasing demand for wireless spectrum resources haveled to the notion of dynamic spectrum access; wirelessdevices with ability to detect spectrum availability andflexibility to adjust operating frequencies can opportunisti-cally access underutilized spectrum. Such type of access isaimed at significantly improving the spectrum efficiency inlight of existence of abundant spectrum availability [13],[14] in current allocation. This has also led to the notion of open access, whereby unlicensed users or devices areencouraged to access licensed spectrum bands such thatspectrum opportunity can be fully exploited. 1

    These concepts have motivated extensive studies on both technical and policy issues related to dynamicspectrum access. With this comes the need for better andquantitative understanding of current spectrum utilization, beyond the qualitative knowledge of the existence of amplespectrum opportunities. With such understanding one canhelp 1) validate spectrum/channel models for analysis,

    2) provide grounds for more realistic channel models with better predictive competence, and ultimately, 3) allowdevices to adaptively make more effective dynamic spec-trum access decisions.

    To achieve these goals, we recently conducted acomprehensive spectrum measurement study in the20 MHz to 3 GHz spectrum band in Guangdong province,China. This paper reports our methodology and findings

    from this study. There has been a number of spectrummeasurement studies published in recent years, like [13],[14] conducted in the US, one in Singapore [8], one in NewZealand [2], and one in Germany [20]. A common findingamong these studies is that spectrum resources is indeedheavily underutilized at the moment.

    Compared to the prior work, the salient features of thedata sets we collected are:

    . Our measurements are carried out in four locationsconcurrently;

    . Our measurement locations are specifically selected(two urban and two suburban locations) in order to

    study the potential spatial correlation of spectrumusage between similar and different types of locations.

    There are two parts in this study. In the first part, weexamine statistics of the collected data and use a variety of models to fit the data. These include:

    1. channel vacancy statistics (precisely defined later),over time, across channels, and for different wirelessservices (that group multiple channels),

    2. service congestion rate (SCR) that reveals how wellchannels assigned to a particular wireless service areutilized, and

    3. the use of a subscriber model to explain thedynamics in channel utilization for specific services,

    4. spectral and spatial correlation of spectrum usage. Inthe second part of the study, we exploit the spectrum

    IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 11, NO. 6, JUNE 2012 1033

    . S. Yin and S. Li are with the Beijing University of Posts andTelecommunications, Mailbox 171, No. 10 Xitucheng Road, HaidianDistrict, Beijing 100876, China.E-mail: [email protected], [email protected].

    . D. Chen and Q. Zhang are with the Hong Kong University of Science andTechnology, Academic Building, Clear Water Bay, Kowloon, Hongkong.E-mail: {dwchen, qianzh}@cse.ust.hk.

    . M. Liu is with the University of Michigan, 1301 Beal Ave, 4427 EECS, Ann Arbor, MI 48109-2122. E-mail: [email protected].

    Manuscript received 18 Nov. 2010; revised 16 Apr. 2011; accepted 14 May2011; published online 7 June 2011.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference IEEECS Log Number TMC-2010-11-0527.Digital Object Identifier no. 10.1109/TMC.2011.128.

    1. For instance, the FCC on November 4, 2008 approved unlicensedwireless devices that operate in the empty white space between TVchannels, after four years of effort.

    1536-1233/12/$31.00 2012 IEEE Published by the IEEE CS, CASS, ComSoc, IES, & SPS

  • 7/29/2019 Mining Spectrum Usage Data a Large-Scale

    2/14

    correlations to develop a 2-dimensional frequentpattern mining (FPM) algorithm that can predictchannel availability based on past observations withconsiderable accuracy.

    The key findings and contributions of this work aresummarized as follows:

    1. The channel vacancy duration (CVD) distribution isshown to have an exponential tail (but not exactly anexponential distribution; this is empirically obtained but with very high statistical significance), in allchannels and locations we studied. To a certainextent, this evidence supports some widely usedchannel models (e.g., the 0-1 Gilbert-Eliot model)under which such vacancy durations are exponen-tially distributed. On the other hand, statisticalanalysis on our data reveals that these vacancydurations are not independently distributed overtime, as is commonly assumed. This finding sug-gests that spectrum usage is inherently morepredictable than current models imply, and that better and more sophisticated models may allow usto exploit such predictability.

    2. Service Congestion Rate, the spectrum utilizationwithin a specific wireless service, can be well fitted by autoregressive model, which makes possiblehigh-precision prediction for SCR such that degreeof congestion for a service can be known about inadvance.

    3. Spectral correlation of spectrum utilization is sig-nificant between channels within the same service but quite insignificant otherwise. This reflects differ-

    ent usage patterns of these services in nature.4. There is a very significant spatial correlation between the SCRs of the same service (e.g. ,GSM900 uplink) at different locations. The spatialcorrelation is even higher when the two locations areof the same type (both in urban or both in suburbanareas). This suggests that usage patterns are heavilyinfluenced by the nature/type of the wirelessservice, rather than the location. There is a popula-tion factor (high versus low density), but overallsimilarities in collective usage pattern of the sameservice are significant in different regions.

    5. Motivated by the strong correlation in spectral and

    spatial dimensions, we propose an effective 2Dfrequent pattern mining algorithm, which canpredict spectrum usage with the accuracy exceeding95 percent.

    While our conclusions are drawn from data collected atspecific locations, the approaches investigated in this paperis more generally applicable to similar data sets collected

    elsewhere. We hope that these findings will lead to morediscussions on how to better model current spectrumutilization, which can help us design better and moreefficient spectrum sensing and access schemes for second-ary users.

    The remainder of the paper is organized as follows:Section 2 presents how our data are collected andprocessed. Then, Section 3 presents a comprehensivestatistical analysis on the measurement data includingCVD, SCR, and subscriber model. Spectrum correlation(spectral and spatial) results are presented in Section 4. InSection 5, a 2D frequent pattern mining technique to predictspectrum availability is developed in detail. How theseresults can be useful in spectrum sensing and access isdiscussed in Section 6. Related work is presented in Section7, and Section 8 concludes the paper.

    2 DATA COLLECTION AND P REPROCESS2.1 Data CollectionThe results presented in this paper are based on analysis of measurement data collected at four different locations inGuangdong province, China, from 15:00 Feb 16, 2009 to15:00 Feb 23, 2009. Locations 1 and 2, which are roughly10 kilometers apart, are downtown area in a metropolis of Guangdong province. Locations 3 and 4, which are roughly45 kilometers apart, are suburban areas in two under-developed cities. These locations are listed in Table 1.

    We are primarily interested in spectrum usage of thefrequency band between 20 MHz and 3 GHz. Within thisrange, the list of wireless services along with their spectrumassignment in the local region are provided in Table 2.

    The measurement equipment we used is an R&S EM550VHF/UHF Digital Wideband Receiver. EM550 is a super-heterodyne receiver that covers a wide frequency range,from 20 MHz to 3.6 GHz. The measurement resolution isone per 0.2 MHz, resulting in a total of 14,900 frequency

    readings per time slot (or sweep time), roughly 75 seconds.There are 8,058 (7 days/75 s) time slots. As a result, thereare 14 ;900 8 ;058 readings in the data set (roughly 2 GB insize) per location. Please note that, the shorter a time slot

    1034 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 11, NO. 6, JUNE 2012

    TABLE 1Location Overview

    TABLE 2Spectrum Allocation of Popular Services

  • 7/29/2019 Mining Spectrum Usage Data a Large-Scale

    3/14

    is, the better the process of energy level changing can bereconstructed. The energy level may change in the middleof a time slot but this type of change cannot be recordedvia the measurement equipments, no matter how short atime slot is. And we will show that 75 seconds is smallenough for a time slot to reveal the channel usage patternsin Section 3.

    Here, we would like to briefly compare our data setswith those reported on the Shared Spectrum Company(SSC) website. 2 Judging by the published reports, our datasets are of a similar nature and have been collected in asimilar way, e.g., the antennas are placed outdoor on theroof of a building while the receivers are placed indoor.

    For illustration purpose, Fig. 1 shows a 3D depiction of the data set at Location 3. The color coding (energy levelfrom low to high on a scale from blue to bright red) on thefigure is an attempt to make the figure easier to visualize,

    but does not provide extra information, as the verticaldimension already shows the energy reading (in dB V).

    2.2 PreprocessingTo conduct the sequence of analysis presented in latersections, we will first convert the above measurement data(absolute energy reading) into a binary sequence of 0 and1 s, through a thresholding process, with 0 denoting achannel being unused, idle or available, and 1 denoting theopposite (i.e., used, busy, or unavailable).

    We begin with definition of the following terms.

    . Channel: a channel is an interval of radio frequency

    of bandwidth 200 KHz. Channels are indexedsequentially; Channel X is the frequency interval20 0 :2 X 1 MHz ; 20 0 :2 X MHz ; X > 0 . Since200 KHz is the resolution of our measurementdevices, a channel is the smallest unit at which wecan distinguish energy.

    . Service: a service is a set of channels that have beenassigned to the same application/service, as listed inTable 2. Without ambiguity we will use this term tomean both the service and the set of channelsassigned to the service.

    . Channel state (CS): this is a function of time andchannel. CS t; c 0 indicates that Channel c is idleat time slot t , and CS t; c 1 otherwise.

    . Channel vacancy: this is the period in which achannel remains idle.

    Converting energy level to CS (0 or 1) is essentially a binary hypotest process. For lack of a priori knowledge onchannel statistics, it is done through a simple thresholdingprocedure: for each channel, a threshold is set to be 3 dBhigher than the minimum value in this channel over theentire duration of the trace collected. At time slot t if theenergy level is lower than the threshold, then CS t; c 0 ;otherwise CS t; c 1 . The reason for this is as follows:Fig. 2 shows the maximum, minimum, and average energylevel of some noise channels (those higher than 2 GHz but below the ISM band) at Location 2. They are called noisechannels because they are currently assigned to satellite-to-satellite communications (the signal does not reach theground and thus the only energy present on the ground is

    due to noise). We see that for these channels, the maximumand minimum power levels are all within a 3 dB range.Assume that noise behaves similarly across channels(which is not exactly true, but probably close), anythingmore than 3 dB above the minimum power level suggeststhe presence of signal, hence the above thresholding isreasonable. Decreasing this threshold will improve signaldetection probability, but the false positive will becomehigher, while lowering the threshold increases falsenegative. 3 While an important subject in its own, calibrat-ing the error in such a process is out of the scope of thepresent paper. Note that the above threshold process is for

    the purpose of detecting the primary users presence;accordingly our subsequent analysis aims at characterizingthe features of primary users activities. This is differentfrom a thresholding procedure that an individual second-ary user may employ for the purpose of determining itsaccess strategy and appropriate transmit power, which canvary from one secondary user to another.

    The result of this process is a sequence of CS s (0 and 1 s)for each spectrum channel of 200 KHz wide, representingits availability over a time resolution of approximately75 seconds. This is shown in Fig. 3 where black dots indicate busy channels. In subsequent sections, we will try to

    uncover the properties inherent in these sequences.

    YIN ET AL.: MINING SPECTRUM USAGE DATA: A LARGE-SCALE SPECTRUM MEASUREMENT STUDY 1035

    2. http://www.sharedspectrum.com/measurements, one taken inMaine, one in Chicago, and one in Ireland.

    Fig. 1. 3D view of the energy level over all bands.

    3. We did try increasing this threshold from 3 to 4.5 dB and found theresulting 0-1 sequence to be nearly the same. The same thresholding processwas used in a measurement study conducted in Aachen, Germany [20].

    Fig. 2. The maximum, minimum, and average energy levels of noisechannels at Location 2.

  • 7/29/2019 Mining Spectrum Usage Data a Large-Scale

    4/14

    3 S TATISTICAL P ROPERTIES OF THE MEASUREMENTDATA

    In this section, we perform comprehensive statisticalanalysis on spectrum usage based on measurement data,including channel vacancy durations, service congestionrate series, as well as a subscriber model to explain channel

    utilization dynamics.3.1 Channel Vacancy Duration DistributionTo make better spectrum access decision, we are moreinterested in how long a channel will remain idle. CVD isdefined for this. In this section, we show that CVDdistribution has an exponential tail, but is not exactly anexponential distribution, nor is it independently distributedover time.

    As defined earlier, channel vacancy is the period inwhich a channel remains idle. If we use the CS time series of a given channel, then the channel vacancy duration willalways be a nonnegative integer (i.e., the number of consecutive 0s) since the CS is defined for discrete timeslots. In reality, however, the channel state does not ingeneral change at slot boundaries. In order to obtain a better, fractional estimate of the CVD, we use the originalenergy measurement data: by determining the threshold-crossing times, we can obtain CVD estimates in realnumbers rather than integers. This is illustrated in Fig. 4.

    On average each channel CS time sequence (more than8,000 time slots) contains hundreds of channel vacanciessamples, which turns out to be too small to approximateCVD distribution. To increase the sample size, we collectCVDs across all channels within the same service. This is

    done based on the observation that spectrum usages of channels within the same service are statistically similar(shown in Section 5.2). For example, spectrum usages inchannels within GSM900 uplink service (885-915 MHz) are

    nearly the same in terms of occupancy, periodicity, average

    energy level, etc. This ensures enough samples to obtainempirical distribution of channel vacancies.

    Fig. 5 shows the statistical histogram of the CVD inGSM900 uplink service at Location 2. Then, we plot theGaussian-kernelled density graph [12] such that the curveof its probability density function (PDF) can be approxi-mately restored as shown in Fig. 6. Here, we only focus onthe falling part of the curve shown in Fig. 6, where CVDsamples that exceed two time slots, because a longervacancy period can provide access opportunities for avariety of services, regardless of service type, real-time, anddelay-sensitive services such as push-to-talk and video

    teleconference, or best-effort services such as file transfer-ring. We then apply the least-square regression analysis tothe falling part of the curve in Fig. 6. The significance of thefitting is measured by the coefficient of determination r 2 ,defined as

    r 2 1 Piyi y2

    Piyi f i2 ; 1

    where yi is the sample value with mean y and f i is themodeled/fitted value.

    As shown in Fig. 7, the CVD distribution is very wellapproximated ( r 2 > 0 :95 ) by an exponential-like distribu-

    tion y a becx . We repeated this exercise in all theservices (GSM900/1800 uplink/downlink, broadcasting TV,CDMA, ISM) at all four locations and obtained similarresults. The regression results at Location 2 are shown in

    1036 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 11, NO. 6, JUNE 2012

    Fig. 3. CS map at Location 3.

    Fig. 4. Extract channel vacancy durations from raw data.

    Fig. 5. Histogram of CVD distribution for GSM900 uplink at Location 2.

    Fig. 6. Approximated PDF of CVD for GSM900 uplink at Location 2.

    Fig. 7. Fitting curve of CVD for GSM900 uplink at Location 2.

  • 7/29/2019 Mining Spectrum Usage Data a Large-Scale

    5/14

    Table 3, where y denotes Pr CV D x and x 1 ; 2 ; 3 ;:::10time slot(s).

    It should be noted that y a becx has an exponentialtail, but is not an exponential distribution. Consequently itdoes not have the memoryless property, i.e., how long achannel is going to remain in a certain state is a function of its history, rather than being independent of it. This latterindependence assumption has been commonly used inchannel access studies, see, for example, [16], [18], [23].More precisely, these studies assume a two-state Markovchain model for the channel (i.e., an Eliot-Gilbert model),which implies that duration of channel vacancy aregeometrically distributed (the discrete equivalent of anexponential distribution), and that these durations areindependently distributed. Our results here indicate thatsuch an assumption is inaccurate and there is significantly

    more memory in the channel state information.The above observations indicate that we cannot describethe CS series using first order Markov model, wherebyfuture state is independent of history given present state. Toillustrate, we empirically obtain the following conditionalprobabilities for the GSM1800 uplink band at Location 2

    Pr CS t 1 ; c 0 jCS t; c 0 0 :91895 ;

    while

    Pr CS t 1 ; c 0 jCS t; c 0 ; CS t 1 ; c 1 0 :5545 :

    Clearly the channel state is highly history dependent, whichthis type of Markov model fails to capture. We tried higherorder Markov models, by defining a higher dimensionalstate space (a higher dimension state consists of a sequenceof channel states, which results in an increase the statespace), without much success. Besides, since CVD in ourcase is a statistic for the whole service rather than a singlechannel such that CVD samples cannot constitute a timeseries, we did not analyze CVD samples in any time-seriesway, such as autoregressive model fitting.

    All the above indicate that the channel state informationpossess some far richer properties. Technically, any discretesystem can be modeled as a Markov chain provided we

    embed sufficient memory into the state, but the resultingexpansion in the state space is in general computationallyprohibitive. In Section 5, we use a frequent pattern miningtechnique to get around this problem. This technique

    exploits the potential correlation in CS s and providesaccurate prediction.

    In order to show that the time scanning resolution(roughly 75 s) is sufficiently small to reveal the CVDdistribution, we increase the sampling interval to inspecthow the regression significance varies correspondingly because statistically less frequent sampling will downgradethe regression accuracy, as shown in Fig. 8. On the otherhand, if the regression is already accurate, increasing thesampling frequency will improve little. We can see that inFig. 8, when sampling interval increases to more than200 time slots, which makes the sampling not sufficientlyfrequent to keep a high regression accuracy, determinationcoefficient starts to rapidly drop. The result implies that, forthe sampling interval less than 200 time slots, the statisticalproperty of CVD remains such that the scanning timeresolution in our measurement (75 seconds) is testified to be a reasonable time scale.

    3.2 Service Congestion Rate

    The channel state CS t; cis a microscopic level measure of the channel utilizationindeed a single time slot togetherwith a single channel is the finest resolution we can obtainfrom the measurement data. Examining it over time foreach channel results in CVD, a measure we examined in theprevious section. In this section, we will examine it acrosschannels within the same service for a given time slot, ameasure captured in the service congestion rate. We willthen show how this measure evolves over time.

    SCR for service S at time t, denoted by SCR t; S isdefined as the ratio between the number of busy channelsin S at time t and the total number of channels in S . Thus,SCR t; S

    Pc2 S CS t; c=n, where n is number of chan-

    nels in service S . Ranging from 0 to 1, SCR is thus a measureof the level of congestion in a service; the larger the SCR of aservice is, the fewer idle channels there are.

    YIN ET AL.: MINING SPECTRUM USAGE DATA: A LARGE-SCALE SPECTRUM MEASUREMENT STUDY 1037

    TABLE 3Channel Vacancy Duration Distribution

    Regression Results at Location 2

    Regression equation: y a becx .

    Fig. 8. Determination coefficient for CVD fitting of GSM1800 uplink atLocation 2 as sampling interval increases.

    Fig. 9. Determination coefficient for MSU distribution of GSM1800 uplinkat Location 2 as sample interval increases.

  • 7/29/2019 Mining Spectrum Usage Data a Large-Scale

    6/14

    Fig. 10 shows SCR series of service GSM1800 uplink andGSM900 uplink at Location 2. We can see that the SCRseries are periodic in their outline, as expected. Also notethat the SCR series of the two services are highly correlated:the rises and drops are very much in synchrony.

    Of particular interest is the high SCR regions in Fig. 10,i.e., those plateau regions. Within these regions, the SCRas a random process appears to be stationary. This turns outto be true: it passes augmented Dickey-Fuller (ADF) testwith significance 0.01, verifying its stationarity [12] (Thistest was performed for each of the plateau regions). Wefurther calculate the correlogram for the plateau part of the SCR series of service GSM900 uplink at Location 2 andresults are shown in Figs. 11 and 12. Here, correlogramrefers to autocorrelation function (ACF), denoted by t;s ,and partial autocorrelation function (PACF), denoted by t;sof a time series Y , which are, respectively, given by

    t;s Corr Y t ; Y ts;

    t;s Corr Y t ; Y ts jY t1 ; Y t2 ; . . . ; Y ts 1 ;

    where Corr refers to the correlation coefficient.We see that the correlation coefficient decreases at annegligible rate, while the partial correlation coefficient tailsoff rapidly. This suggests that the SCR series may be

    potentially well modeled as an autoregressive process [17].The autoregressive model can be written as

    SCR t; S XO

    m1cm SCR t m; S nt; 2

    where O and nt are order and residual of the model,respectively.

    A third-order autoregressive model of the above form is

    applied to the SCR series of all services at Location 2.4

    Theregression results are shown in Table 4. We did not includethe regression results of CDMA service as the existence of CDMA signal cannot be verified simply using energydetection [4]. We see that in all cases the third-orderautoregressive model achieved very high accuracy, asindicated by the coefficients of determination shown in (1).

    Since autoregressive model can fit the SCR series quitewell, we can leverage it into temporal prediction for SCRseries, that is, to use its past to predict its future. Therefore,again we resort to third-order autoregressive model. Forillustration purpose, we only show the prediction results for

    the service of GSM900 uplink at Location 2 as shown inFig. 13, where the second days data are tested with the firstdays data as training set. Results for other services are quitesimilar. Wealso show residualof theprediction model anditsstatistical histogramas shown in Figs. 14 and15, respectively.The probability distribution of the residual passes theAnderson-Darling hypotest with confidence level 0.1 suchthat it is testified to comply with a normal distribution withzeromean,which indicates thattheautoregressivepredictionmodel is adequate for the SCR series [17].

    1038 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 11, NO. 6, JUNE 2012

    TABLE 4Third-Order Autoregressive ModelRegression Results at Location 2

    Fig. 10. SCR series at Location 2. GSM1800 uplink versus GSM900uplink.

    Fig. 11. Autocorrelation coefficient of SCR of GSM900 uplink atLocation 2.

    Fig. 12. Partial autocorrelation coefficient of SCR of GSM900 uplink atLocation 2.

    4. Regarding selecting the right order for this model: a higher ordergenerally results in better approximation; on the other hand, the ordercannot be arbitrarily high based on Akaike Information Criterion, in whicha high order will improve regression quality but will also lead to thereduction of degree of freedom [1]. The choice of third-order in ourexperiment seems to be appropriate; in particular we were not able toobtain significant fitting improvement by orders higher than 3.

    Fig. 13. SCR prediction with the third-autoregressive model.

  • 7/29/2019 Mining Spectrum Usage Data a Large-Scale

    7/14

    3.3 Dynamic Utilization of GSM ServicesOf all the services listed in Table 2, mobile services arespecial with high dynamics. This is because they aresubscriber based and energy level present in thesechannels are driven by the arrivals and departures of subscribers (calls). Understanding how channel utilizationchanges over time as a function of subscriber dynamicscan be very helpful in marketing and operating strategies

    including pricing and promotions. In this section, weshow that channel utilization given by the measurementdata can be well described using a simple queuing model.Due to the difficulty in energy-detecting CDMA signal asmentioned previously, we will only consider GSM servicesin this section.

    To this end, we note that the received energy at ameasurement location is the summation of all transmittedsignals, whose strength gets attenuated over the propaga-tion distance. This means that different users/callers willhave a different contribution to this sum depending onwhere they are located. However, if we assume that the

    locations of most of the users/callers are approximatelyuniformly distributed, and that their activities (call arrivalsand departures) are statistically similar regardless of theirlocations, then one would expect that the sum of receivedenergy changes proportional to the change in the activecaller population within an applicable region (i.e., theregion in which a transmitted signal results in nonnegligiblereception at the measurement location). This suggests thatwe could use the change in energy levels from one time slotto another as a measure of the change in arriving/departingcalls. Below we will specifically consider the GSM uplinkservice while excluding downlink services as the GSMdownlink traffic contains not only user traffic but also thatof signaling applications.

    We define the mobile service utilization (MSU) of a GSMservice S , denoted by MSU t; S , as the time-varyingdifference in the total received energy, expressed as

    MSU t; S Xc2 S et; c Xc2 S et 1 ; c:Fig. 16 shows the probability distribution of MSU series of GSM900 uplink at Location 3 for the whole week.

    We next show that this distribution can be well describedusing a simple queuing model. Suppose that in each time slotthe arrival and departure of calls each follows a Poissondistribution withrate a and h , respectively.Assume furtherthat they are independently distributed (note that this isequivalent of saying that the call durations are exponentially

    distributed). Denote the number of active callers at the beginning of time slot t by N t. Then, we have N t 1 N t At Dt, where At (Dt) is the arrival (depar-ture) within time slot t . Then, the change in the number of active callers in the system, denoted by X t N t 1 N t, is given by

    P X t m e a h ma X

    1

    k1

    ka

    kh

    k!k m!m ! 0

    e a h mh X1

    k1

    ka

    kh

    k!k m!m < 0 :

    8>>>>>:

    3

    The probability distribution curve in Fig. 16 is centeredaround zero, suggesting that a and h are approximatelyequal. Equation (3) can therefore be simplified to

    P X m e2 m X

    1

    k1

    2 k

    k!k m!m ! 0

    e2 m X1

    k1

    2 k

    k!k m!m < 0 :

    8>>>>>:

    4

    We then compare (4) with the distribution curves of MSUof the GSM uplink services at four locations. For brevity, weonly show the result of GSM900 uplink at Location 3 asshown in Fig. 17 while noting the other results are similar.As shown in Fig. 17, the two are very close, and thecoefficient of determination is over 0.97.

    We also examined whether the above utilizationdynamics change significantly over time. Figs. 18 and 19show the MSU distributions obtained using measurementdata collected over two different weeks, for the GSM900

    YIN ET AL.: MINING SPECTRUM USAGE DATA: A LARGE-SCALE SPECTRUM MEASUREMENT STUDY 1039

    Fig. 14. SCR prediction residual with the third-order autoregressivemodel.

    Fig. 15. Histogram of prediction residual with the third-autoregressivemodel.

    Fig. 16. Distribution curve of MSU of GSM900 uplink at Location 3.

    Fig. 17. Fitting result of MSU of GSM900 uplink at Location 4.

  • 7/29/2019 Mining Spectrum Usage Data a Large-Scale

    8/14

    uplink service and the GSM1800 uplink service, respec-tively. In both cases, we see that this distribution stayroughly the same from week to week, indicating a rathersteady collective calling pattern in terms of arrival anddeparture statistics.

    In addition, we carried out experiments as what we didin CVD study to verify the validity of the scanning timeresolution in our measurement. We increase the samplinginterval to multiple time slots and obtain determinationcoefficient between MSU distribution curve with samplinginterval of multiple time slots and the regression function(4). As shown in Fig. 9, MSU distribution can be wellfitted by (4) with sampling interval less than 10 time slots,while a higher sampling interval leads to rapid drop of determination coefficient. It implies that with samplinginterval less than 10 time slots, MSU distribution remainsunchanged statistically. Therefore, the scanning timeresolution in our measurement (75 seconds) is implicitlytestified to be a reasonable time scale, which is shortenough to reveal the MSU.

    4 SPECTRAL AND

    SPATIAL

    CORRELATIONS

    The more we know about spectrum usage characteristics (of the primary users), the better we can predict spectrumopportunity, and the better we can make dynamic spectrumsensing and access decisions. Much of this predictive powerlies in the spectral and spatial dependence of spectrumusage. For instance, if everything is independently dis-tributed, then knowing the past does not offer informationfor the future. On the other hand, Fig. 10 shown in theprevious section suggests that measuring/sensing channelsin GSM900 uplink provides ample information aboutchannel availability in GSM1800 uplink. We thus set out

    to take a more in-depth look at the dependence character-istics of our data sets in this section. Specifically, we willanalyze the spectral, and spatial correlation of the CS seriesand the SCR series, respectively.

    We will use the following two measures of correlation,the first one defined for two random variables and the

    second one defined for two 0-1 random sequences. Thecorrelation coefficient X;Y between two random variablesX and Y with sample mean X and Y and sample standarddeviations X and Y is defined as

    X;Y covX; Y

    X Y E X X Y Y

    X Y ; 5

    where cov is the covariance operator ranging from 1 to 1,extreme values indicating (inversely) fully correlation. Ingeneral, correlation is higher as absolute value of thecoefficient is closer to 1.

    The correlation between two discrete-time 0-1 series X tand Y t is defined as follows:

    Corr X t;Y t Pt I f X t Y tg Pt I f X t 6 Y tg

    Pt I f X t Y tg

    Pt I f X t 6 Y tg

    ; 6

    where I f Ag is the indicator function: I f Ag 1 if A is trueand 0 otherwise. The two summations in the above equationare the total number of positions that the two sequencescoincide and differ, respectively. This is commonly used forevaluating the correlation between two binary sequences.

    4.1 Spectral CorrelationIn order for investigation on spectral correlation, we takethe SCR and CS series and cross correlate them with theircounterparts from a different service and different channelwithin the same service, using (5) and (6), respectively.

    Fig. 20 shows the CS correlation coefficients between

    every two channels in GSM900 uplink at Location 4. We seethat these coefficients are extremely high for almost everytwo channels within the same service. This is because inmost cases the channels are either all busy or all idle. Thereare also cases where the spectral correlation coefficientsare closed to 1 . This is because some channels are alwaysidle while some others are always busy. For instance, thereare channels in GSM that are kept idle to avoid interchannelinterference. These results are representative of what wefound in other services.

    Table 5 shows the spectral correlation coefficients between two SCR series at Location 1; these results are

    also representative of what we found at the other locations.From the results shown in Table 5, we see that there issignificant correlation among these services, for none of the coefficients falls below 0.55, even when between a

    1040 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 11, NO. 6, JUNE 2012

    Fig. 18. Comparison of MSU of GSM900 uplink at Location 1 betweentwo measurements.

    Fig. 19. Comparison of MSU of GSM1800 uplink at Location 1 betweentwo measurements.

    Fig. 20. Spectral Correlation Coefficients of CS series within GSM900uplink at Location 4.

  • 7/29/2019 Mining Spectrum Usage Data a Large-Scale

    9/14

    broadcasting TV service and a GSM service. In addition,services of the same type are particularly correlated, ashigh as 0.952 in the case between GSM900 uplink andGSM1800 uplink.

    4.2 Spatial CorrelationFig. 21 shows four SCR series of the same service (GSM900downlink) at all four locations. At a high level, they appearall correlated with each other; they share common changesin value. In particular, Locations 1 and 2 are very similar, upto a constant shift, and Locations 3 and 4 are very similar,also up to a constant shift. This suggests spatial correlationacross different locations. We thus cross correlate SCR =CS series from the same service/channel at different locations.

    Table 6 shows the spatial correlation coefficients of theSCR series in GSM900 downlink service among fourlocations. These are very high and none of the values falls below 0.8. It appears that within the same service, thespectrum utilization is highly correlated across differentlocations and different types of locations.

    Results for other services are quite similar and are thusnot presented separately. For instance between Location 1and 2, spatial correlation coefficient of SCR series is as highas 0.962.

    Its easy to understand the reason behind such highcorrelation: for the same service, such as GSM voice calls,

    subscribers at different locations share common behavioralpatterns (e.g., same peak calling hours of the day), eventhough the actual SCR values are different.

    5 P REDICTION USING FREQUENT P ATTERN MINING

    The correlation structure presented in the previous sectionsuggests that it can be exploited to help us better predictchannel state or spectrum opportunity based on measure-ments made in the past, in adjacent channels, or at similarlocations. More precisely, we are interested in the questionof whether one could accurately predict the value of CS t; cbased on the knowledge of CS t k; c0; k > 0 , andif so how many past observations (what values of k) andover what set of channels (what values of c0) are needed.

    There are different approaches one could take to modelsuch correlation. For instance, as pointed out in Section 3,memory in channel state information could conceptually becaptured by a sufficiently high-ordered Markov chain (andone can use measurement data to collect statistics on itstransition probabilities), but with significant technicaldifficulty due to the exponential increase of the state space.

    To overcome this difficulty, in this section we present atechnique based on frequent pattern mining ([3], [15] and asurvey [5]) through an efficient pattern identificationprocess over the spectrum data, which generates predic-tions of future channel state based on a collection of pastobservations in a set of channels. It leverages both temporaland spectral correlations examined in earlier sections andits unique feature is that it automatically adjusts thealgorithm to the appropriate size of past observations

    (temporally and spectrally).5.1 FPM and PredictionWe begin by illustrating how FPM can be used for spectrumusage prediction and the challenges in doing so. Supposewe know the CS s of Channels c; c 1 of previous 8,000 time

    YIN ET AL.: MINING SPECTRUM USAGE DATA: A LARGE-SCALE SPECTRUM MEASUREMENT STUDY 1041

    TABLE 5Spectral Correlation Coefficients of SCR at Location 1

    TABLE 6SCR Spatial Correlation Coefficients for

    GSM900 Downlink among Four Locations

    Fig. 21. SCR Series of GSM900 Downlink at four Locations.

  • 7/29/2019 Mining Spectrum Usage Data a Large-Scale

    10/14

    slots (from time slot t 7 ;999 to t) and we would like topredict the CS of the next time slot of Channel c and c 1(i.e., CS t 1 ; c and CS t 1 ; c 1 ). The known CS s can be written into a single matrix shown below

    a t7999 ;c a t7998 ;c ::: at;ca t7999 ;c 1 a t7998 ;c 1 ::: at;c 1 ;

    where a i;j CS i; j .Define a submatrix as a pattern if it appears no less than200 times throughout the CS series of these channels. Forinstance, if the submatrix 1 0 11 1 0 appears 1,000 times, itsconsidered a pattern. We may find another pattern 1 0 1 11 1 0 0which appears 990 times. Comparing these two patterns wecan predict that CS t 1 ; c 1 and CS t 1 ; c 1 0with probability 99 percent (990/1,000) if CS t 2 ; c 1 ,CS t 1 ; c 0 , CS t; c 1 , CS t 2 ; c 1 1 , CS t 1 ;c 1 1 , and CS t; c 1 0 .

    Clearly, this needs to successfully handle two issues: oneis to find frequent patterns, referred to as frequent pattern

    mining. The other is to find associations among thesepatterns, referred to as pattern association rules mining.In our setting dimension (number of rows and columns)

    of the patterns of interest are not fixed in advance, i.e., wedo not know in advance how much history and how manyneighboring channels are needed in order to have accurateprediction. Rather, this has to be learned during miningprocess. The 2D (of patterns) learning element is a uniquechallenge in mining process compared to existing FPMliterature, e.g., [3], [5], [15]. In addition, in all these studiespatterns are 1D and can be written in a row, although datasets can be in multiple rows. In our problem, patterns are in2D, which is another unique challenge. To summarize, both

    width and height of the patterns are variable, and theirappropriate sizes need to be automatically identified in thisprocess. In the following, we will refer to our problem as anFPM-2D problem.

    5.2 FPM-2D Problem DefinitionThe goal of the FPM-2D problem is to find all relevant 2Dpatterns. Once this is achieved, it is fairly easy to compute theprobabilities of future channel state (spectrum prediction).Table 7 contains a list of terminologies used in FPM-2D.

    Formally, the FPM-2D problem is stated as follows:Given the input set IM , min area , and min rep , find all valid block patterns and the corresponding match numbers.

    5.3 Proposed AlgorithmWe start by scanning the input matrix IM from left toright, top to bottom to find all the blocks with size x y.This is done for all x and y such that x y ! min area .We use a hash table to store the blocks for efficientlysearch since it takes O1 time to search an item. Apotential problem is the overabundance of such blocks; itis 2 xy in the worst case. This problem is addressed by thefollowing simple property.

    Consider a block A xy. We say block B is A xys parentblockif: 1) B is A xys subblock, 2) B s size is x 1 y or

    x y 1 , and 3) B s area is not less than min area . Asimple yet key property concerning a valid pattern is thatfor any block A xy, it has at most four parent blocks and it isa valid pattern only if all its parent blocks are valid patterns.

    Thus, if any block has a parent block that is not valid,then we can simply skip this block because it cannot bevalid. By checking parent blocks, a large number of blockscan be ignored, which significantly reduces the memoryconsumption and improves the performance. The pseudo-code of the proposed algorithm is given in Algorithm 1where T x;y is the hash table to store patterns with size x y.

    Algorithm 1. FPM-2Dfor N 2 to 1 :

    flag falsefor each x; y s.t. (x y N , x y ! min area ):

    T x;y new empty hash tableif T x1 ;y and T x;y1 are both empty:

    continue;for each block B xy in IM :

    if one of B

    xys parent blocksis not a valid pattern:continue

    if B xy is in T x;y:T x;yB xy T x;yB xy 1

    else:T x;yB xy 1

    for each block P xy in T x;y:if T x;yP xy < min rep :

    remove P xy from T x;yif T x;y is not empty:

    flag true

    output the valid patterns in T x;y andthe corresponding match numbersif flag == false :

    break

    1042 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 11, NO. 6, JUNE 2012

    TABLE 7Notations/Concepts in FPM-2D

  • 7/29/2019 Mining Spectrum Usage Data a Large-Scale

    11/14

    After all valid patterns have been identified, theprediction rules are extracted as follows: A prediction rule

    is defined as P 1 P 2 , where P 1 and P 2 are all validpatterns. P 2 has the form [ P 1 V ], where V is a columnvector v1 ; v2 ; v3 ; :::vkT ; vi 2 . Thus, P 1 is one of the parent blocks of P 2 . Let M P 1 denote the match number of P 1 ,then the transferring rate RP 1 P 2 M P 2 =M P 1 What this rule says is that if the current CS appears tomatch P 1 , then the CS in the next time slot will match Vwith probability RP 1 P 2 . Clearly, a similar procedurecan be used to predict the CS over multiple future slots, bysimply considering V as multiple column vectors.

    5.4 Experiment ResultTo test our algorithm we split the measurement data intotwo part, one as a training set on which we run Algorithm 1and find the prediction rules, while the other as the testingset on which we apply the prediction rules and performspectrum usage prediction. We set min rep 200 ;min area 4 .

    In our experiment, we adopted the following predictionmethods: 1) We only consider those rules whose transferringrates are larger than 0.9. 2) If the current CS s appears tomatch the pattern P 1 in a rule P 1 P 2 , we predict the CS smatches P 2 in the next time slot. 3) If the current CS s do notmatch any patterns, we do not predict and count it as a miss.

    We are interested in answering the following questions:

    1. what is the prediction accuracy if the training set andcorresponding testing set are from the same service(self-service prediction)?

    2. what is the missing rate of the prediction?3. can the mining result in one service be used to

    predict the CS s in another service (cross-serviceprediction)?

    4. can the mining result in a service at one location beused to predict the CS s in the same service at adifferent location (cross-location prediction)?

    5. how large the training set needs to be for accurate

    prediction?We define prediction accuracy to be the ratio between thenumber of correctly predicted CS s and the total number of predicted CS s, and define missing rate as the ratio between

    the number of CS s that cannot be predicted and the totalnumber of CS s.

    We first study the case where the training set and thecorresponding testing set are from the same service atLocation 1, i.e., both from TV1 or both from GSM900 uplink.The training sets are CS series over durations from 1 hourto 3 days. The testing sets are CS s of the last four days. Theresults are shown in Fig. 22. We observe that:

    . The prediction accuracy is larger than 0.95, a veryencouraging sign.

    . The missing rate is around 5 percent for TV1 service,which is very low. It is higher, around 15 percent forGSM900 uplink. The reason why the missing rate onGSM900 uplink is higher than TV1 is that TV service

    is a pure broadcast service, while GSM service is aninteractive service whose patterns are more compli-cated to match.

    . The training set cannot be less than 2 hours. Other-wise, FPM-2D cannot find sufficient prediction rules.

    . A 3-hour training set appears sufficient and perfor-mance of the algorithm saturates at this level.Training data more than that do not seem to helpimprove the prediction accuracy or reduce themissing rate.

    For other services of Location 1, the results are similarexcept for the missing rate, which is listed in Table 8. We

    see that the prediction accuracy is consistently high butthe missing rate varies from 4 to 25 percent. We also givethe overall occupancy of the services as a reference. If theoverall occupancy is a, then the prediction does not help if its accuracy is lower than max{ a, 1 a}. This is becausewe can always achieve this accuracy simply by guessing.

    For comparison, we have also shown the predictionaccuracy using the first-order Markov Chain model (firstMC), which is one of the most commonly used models inthe existing papers [16], [18], [23]. We could see that theprediction accuracy of the first MC is only around80 percent, which is much lower than that of FPM-2D,

    except for those services whose occupancy is high. In theselatter cases, the prediction accuracy can be naturally higheven by guessing. The results of the other locations aresimilar, and thus not repeated.

    YIN ET AL.: MINING SPECTRUM USAGE DATA: A LARGE-SCALE SPECTRUM MEASUREMENT STUDY 1043

    Fig. 22. The experiment results of intralocation intraservice spectrumusage prediction at Location 1. The prediction accuracy is larger than0.95 if the training set size is no less than 3 hours.

    TABLE 8The Spectrum Usage Prediction Results at Location 1

    The training/testing sets are from the same service.

  • 7/29/2019 Mining Spectrum Usage Data a Large-Scale

    12/14

    We next study the case where the training set ( CS s of three days) and the testing set ( CS s of four days) are fromdifferent services. The results are shown in Table 9.

    We see that the accuracy of cross-service prediction ismuch lower than that of self-service prediction, and themissing rate is quite high. This is because the patterns indifferent services collide, i.e., patterns and prediction rulesfound in one service might lead to wrong prediction resultsin other services. This is another manifestation of theconclusion drawn earlier that the spectral correlation is highfor channels within the same service but low acrossdifferent services.

    Consistent with earlier correlation result, we alsoobserve that the prediction accuracy is high if we use theprediction rules at one location to predict the CS s of thesame service at another location, due to the high spatialcorrelation of channels within the same service. Table 10shows the experiment results of this cross-location self-service prediction.

    To summarize this section, we conclude that:

    1. The self-service self-location spectrum usage predic-tion accuracy is higher than 95 percent, which issignificantly larger than that of the commonly usedfirst-order Markov model.

    2. The missing rate varies from 4 to 25 percent, anoverall acceptable range.

    3. The cross-service prediction accuracy ranges from60 to 80 percent, much lower than the self-serviceprediction. The corresponding missing rate is

    above 30 percent, sometimes over 70 percent,which is too high.4. The accuracy and missing rate of cross-location self-

    service prediction are nearly as high as that of self-location self-service prediction.

    5. CS s of 3 hours appear sufficient for trainingpurpose.

    6 DISCUSSIONSIn this section, we briefly summarize how findings andresults presented in previous sections may be used toward both the theory and practice in spectrum sensing and access

    within the context of cognitive radio networks.Broadly speaking, these results contribute to two

    aspects of dynamic spectrum access: 1) the constructionof better channel models (as a way of describing the

    spectrum usage of the primary users), that may be moregenerally applicable in other environments, and 2) theprediction of channel availability in a similar wirelessspectrum environment.

    Channel models are an essential component in an arrayof spectrum access studies, especially theoretical analysis.Our study has shown certain weaknesses of existingchannel models (e.g., insufficient capture of history depen-dence, inability to describe spectral, and spatial depen-

    dence). Our results here can help build better channelmodels that more accurately reflect the primary usersactivity. In particular, the statistics we collected on channeloccupancy/vacancy, its rich dependency property, as wellas statistics on the CS series may motivate the constructionof certain type of discrete event models (e.g., a Petri net) todescribe channel behavior. This may allow us to modelmemory structure as well as the spatial and spectralcorrelation while avoiding a large state space.

    The frequent pattern mining technique introduced herecan be a very powerful tool in analyzing spectrum usagedata. For specific wireless environments where such dataare available for training, we have shown that using thistechnique can generate very accurate predictions onchannel availability (especially in the TV broadcast chan-nels in our study). This allows a secondary user to make far better channel sensing and access decisions, and exploitmuch more effectively underutilized spectrum opportunity.

    7 RELATED WORKSThe SSC performed wideband and long-period spectrummeasurements at seven locations in the US and one outsidethe US between 2004 and 2007 [13], [14] for a betterunderstanding of the actual utilization of spectrum in ruraland urban environments. According to their reports, amongthose seven locations in the US, the lowest averageoccupancy is 1 percent at Greenbank, West Virginia whilethe highest is 17.4 percent at Chicago, Illinois. In addition toSSC, there have been a few similar measurement studiesrecently. In 2008, Institute for Infocomm Research ( I 2 R )published their spectrum measurement results in Singapore[8]. They scanned from 80 to 5,850 MHz for 12 weekdays.They found the average occupancy to be 4.54 percent andmost of the allocated frequencies were heavily under-utilized except the TV broadcast channels and cell phonechannels. Similar results were reported from Auckland,

    New Zealand according to [2]. The above cited workprimarily focused on collecting statistics on the averageutilization of wireless spectrum, and they all confirmed thatit is indeed heavily underutilized.

    1044 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 11, NO. 6, JUNE 2012

    TABLE 9The Spectrum Usage Prediction Results at Location 1

    The training/testing sets are from different services.

    TABLE 10The Cross-Location Spectrum Usage Prediction Results

  • 7/29/2019 Mining Spectrum Usage Data a Large-Scale

    13/14

    There has also been other measurement studies forqualitative analyses. Kamakaris et al. [10] had extensivespectrum utilization measurements in existing cellularnetworks aimed at characterizing feasibility of improvingspectrum efficiency via coordinated DSA and showed thatStatistically Multiplexed Access technique can increasespectrum utilization. Wellens et al. [19] presented a

    specifically developed measurement setup for a detailedevaluation of cooperative sensing in practical scenarios andfound that cooperation gain increased with the distance between nodes and cooperation could help to alleviatesensitivity requirements. Willkomm et al. [21] presented alarge-scale measurement-driven characterization of primaryusage in cellular networks focused on cellular spectrum andinvestigated a call arrival model and a random walk modelthat directly models the aggregate load.

    Besides, other work are focused on correlation of spectrum usage. A spectrum measurement was carriedout during the 2006 Football World Cup in Germany, in the

    cities of Kaiserslautern and Dortmund [7] and they foundthat the change of spectrum usage was clearly related tospecific events. Later in 2007 another measurement wasconducted in the US on the public safety band (around800 MHz) [9]. The authors collected data concurrently attwo locations, with a total of five pairs of locations withdistance ranging from 5 meters to a few kilometers. Theyinvestigated the adjacent channel interference and spatialcorrelation and revealed that very different energy detec-tion results were obtained at closely located detectionstations (5 meters apart); this was attributed to thedifference in sensitivity in the sensing devices used.

    Compared to this set of studies, our analysis alsoexplored spectral correlation within the same service andacross services. The service congestion rate is a uniquenotion to examine spectrum usage service by service. Inaddition, our measurement involves the most concurrentlocations (4), is over a fairly long duration (7 days), andscans from 20 MHz to 3 GHz. This allowed us to conduct avery detailed analysis on both the first and second orderstatistics of these data sets.

    8 CONCLUSIONS

    In this paper, we carried out a set of spectrum measure-ments in the 20 MHz to 3 GHz spectrum band at fourlocations concurrently in Guangdong province of China.Using these data sets we conducted a set of detailedanalysis of the first and second order statistics of thecollected data, including channel occupancy/vacancy sta-tistics, channel utilization within each individual wirelessservice, also spectral and spatial correlation of thesemeasures. Moreover, we also utilized such spectrumcorrelation to develop a 2Dimensional frequent patternmining algorithm that can accurately predict channelavailability based on past observations. We believe our

    findings will spur more discussions on how to better modelcurrent spectrum utilization and help us design moreefficient spectrum sensing and accessing schemes forsecondary users.

    ACKNOWLEDGMENTSThis research was supported in part by grants from RGCunder contracts CERG 623209 and 622410, a grant fromHuawei-HKUST Joint Lab, the National Natural ScienceFoundation of China with grant nos. 60933012 and 60632030,the Major State Basic Research Development Program of China (973 Program) (No. 2011CB302900), the National

    High Technology Research and Development Program of China (863 Program) (No. 2006AA04A106), and US NationalScience Foundation grant CCF-0910765. The authors wouldalso like to express their gratitude to the Radio MonitoringCenter of Guangzhou, Jiangmen, and Zhongshan forproviding them measurement devices and helping themcollect the measurement data.

    REFERENCES[1] H. Akaike, A New Look at the Statistical Model Identification,

    IEEE Trans. Automatic Control,vol. 19, no. 6, pp. 716-723, Dec. 1974.[2] R. Chiang, G. Rowe, and K. Sowerby, A Quantitative Analysis of

    Spectral Occupancy Measurements for Cognitive Radio, Proc.IEEE 65th Vehicular Technology Conf. (VTC 07-Spring),pp. 3016-3020, Apr. 2007.

    [3] G. Cong, K.-L. Tan, A. Tung, and F. Pan, Mining Frequent ClosedPatterns in Microarray Data, Proc. IEEE Fourth Intl Conf. Data Mining (ICDM 04), pp. 363-366, Nov. 2004.

    [4] W. Gardner and S. CA, Cyclostationarity in Communications andSignal Processing. IEEE, 1994.

    [5] J. Han, H. Cheng, D. Xin, and X. Yan, Frequent Pattern Mining:Current Status and Future Directions, Data Mining and KnowledgeDiscovery, vol. 15, no. 1, pp. 55-86, 2007.

    [6] S. Haykin, Cognitive Radio: Brain-Empowered Wireless Com-munications, IEEE J. Selected Areas of Comm.,vol. 23, no. 2,pp. 201-220, Feb. 2005.

    [7] O. Holland, P. Cordier, M. Muck, L. Mazet, C. Klock, and T. Renk,Spectrum Power Measurements in 2G and 3G Cellular PhoneBands during the 2006 Football World Cup in Germany, Proc.IEEE Second Intl Symp. New Frontiers in Dynamic Spectrum AccessNetworks (DySPAN 07), pp. 575-578, Apr. 2007.

    [8] M. Islam, C. Koh, S. Oh, X. Qing, Y. Lai, C. Wang, Y.-C. Liang, B.Toh, F. Chin, G. Tan, and W. Toh, Spectrum Survey in Singapore:Occupancy Measurements and Analyses, Proc. Third Intl Conf.Cognitive Radio Oriented Wireless Networks and Comm. (CrownCom08), pp. 1-7, May 2008.

    [9] S. Jones, E. Jung, X. Liu, N. Merheb, and I.-J. Wang, Character-ization of Spectrum Activities in the U.S. Public Safety Band forOpportunistic Spectrum Access, Proc. IEEE Second Intl Symp.New Frontiers in Dynamic Spectrum Access Networks (DySPAN 07),pp. 137-146, Apr. 2007.

    [10] T. Kamakaris, M. Buddhikot, and R. Iyer, A Case for CoordinatedDynamic Spectrum Access in Cellular Networks, Proc. IEEE FirstIntl Symp. New Frontiers in Dynamic Spectrum Access Networks

    (DySPAN 05), pp. 289-298, Nov. 2005.[11] J. Kennedy and M. Sullivan, Direction Finding and SmartAntennas Using Software Radio Architechtures, IEEE Comm. Magazine, vol. 33, no. 5, pp. 62-68, May 1995.

    [12] P.S. Mann, Introductory Statistics. John Wiley&Sons, 2003.[13] M.A. McHenry, NSF Spectrum Occupancy Measurements Project

    Summary, technical report, Shared Spectrum Company, Aug.2005.

    [14] M.A. McHenry, P.A. Tenhula, D. McCloskey, D.A. Roberson, andC.S. Hood, Chicago Spectrum Occupancy Measurements &Analysis and a Long-Term Studies Proposal, Proc. First IntlWorkshop Technology and Policy for Accessing Spectrum,2006.

    [15] A. Ng and A. Fu, Mining Frequent Episodes for RelatingFinancial Events and Stock Trends, Proc. Seventh Pacific-AsiaConf. Knowledge Discovery and Data Mining (PAKDD),2003.

    [16] H. Su and X. Zhang, Cross-Layer Based Opportunistic MAC

    Protocols for QoS Provisionings over Cognitive Radio WirelessNetworks, IEEE J. Selected Areas in Communications,vol. 26, no. 1,pp. 118-129, Jan. 2008.

    [17] R.S. Tsay, Analysis of Financial Time Series,pp. 28-42. John Wiley &Sins, Inc., 2001.

    YIN ET AL.: MINING SPECTRUM USAGE DATA: A LARGE-SCALE SPECTRUM MEASUREMENT STUDY 1045

  • 7/29/2019 Mining Spectrum Usage Data a Large-Scale

    14/14

    [18] M. Wellens, A. de Baynast, and P. Mahonen, ExploitingHistorical Spectrum Occupancy Information for Adaptive Spec-trum Sensing, Proc. IEEE Wireless Comm. and Networking Conf.(WCNC 08), pp. 717-722, Apr. 2008.

    [19] M. Wellens, J. Riihijarvi, M. Gordziel, and P. Mahonen, Evalua-tion of Cooperative Spectrum Sensing Based on Large ScaleMeasurements, Proc. IEEE Third Symp. New Frontiers in DynamicSpectrum Access Networks (DySPAN 08),pp. 1-12, Oct. 2008.

    [20] M. Wellens, J. Wu, and P. Mahonen, Evaluation of SpectrumOccupancy in Indoor and Outdoor Scenario in the Context of Cognitive Radio, Proc. Second Intl Conf. Cognitive Radio OrientedWireless Networks and Comm. (CrownCom 07),pp. 420-427, Aug.2007.

    [21] D. Willkomm, S. Machiraju, J. Bolot, and A. Wolisz, PrimaryUsers in Cellular Networks: A Large-Scale Measurement Study,Proc. IEEE Third Symp. New Frontiers in Dynamic Spectrum AccessNetworks (DySPAN 08) pp. 1-11, Oct. 2008.

    [22] Q. Zhang, J. Jia, and J. Zhang, Cooperative Relay to ImproveDiversity in Cognitive Radio Networks, IEEE Comm. Magazine,vol. 47, no. 2, pp. 111-117, Feb. 2009.

    [23] Q. Zhao, L. Tong, and A. Swami, Decentralized Cognitive MACfor Dynamic Spectrum Access, Proc. IEEE First Intl Symp. NewFrontiers in Dynamic Spectrum Access Networks (DySPAN 05),pp. 224-232, Nov. 2005.

    Sixing Yin received the BS, MS, and PhDdegrees from the Department of Informationand Communication Engineering at the BejiingUniversity of Posts and Telecommunications,China, in 2003, 2006, and 2010, respectively. Heis currently a postdoctoral fellow of the Depart-ment of Information and Communication Engi-neering at the Bejiing University of Posts andTelecommunications. His research interests arein cognitive radio networks, dynamic spectrum

    management, spectrum measurement, and the Internet of things.

    Dawei Chen received the BS degree in theDepartment of Computer Science and Technol-ogy from Tsinghua University, Beijing, China, in2007. He is currently working toward the PhD

    degree in the Department of Computer Scienceand Engineering at the Hong Kong University ofScience and Technology. His research interestsinclude spectrum management, spectrum mea-surement, and the testbed in cognitive radionetworks. He is a student member of the IEEE.

    Qian Zhang received the BS, MS, and PhDdegrees from Wuhan University, China, in 1994,1996, and 1999, respectively, all in computerscience. She joined the Hong Kong University ofScience and Technology in 2005 and is nowa professor in the Department of ComputerScience and Engineering. Before that, she wasat Microsoft Research Asia, Beijing, China, fromJuly 1999, where she was the research managerof the Wireless and Networking Group. She has

    published more than 200 refereed papers in international leading journals and key conferences in the areas of wireless/Internet multi-media networking, wireless communications and networking, andoverlay networking. She is the inventor of about 30 pending patents.She is an associate editor for the IEEE Transactions on Mobile Computing , IEEE Transactions on Multimedia , IEEE Wireless Commu- nications Magazine , Computer Networks , and Computer Communica- tions . She received the TR 100 (MIT Technology Review) worlds topyoung innovator award in 2004, the Best Asia Pacific (AP) YoungResearcher Award elected by the IEEE Communication Society in 2004,the Best Paper Award by the Multimedia Technical Committee (MMTC)of the IEEE Communications Society, and Best Paper Awards fromQShine 2006, IEEE GlobeCom 2007, IEEE ICDCS 2008, and IEEE ICC2010. She is a fellow of the IEEE.

    Mingyan Liu received the PhD degree inelectrical engineering from the University ofMaryland, College Park, in 2000. She joinedthe Department of Electrical Engineering andComputer Science at the University of Michigan,Ann Arbor, in September 2000, where she iscurrently an associate professor. Her researchinterests are in optimal resource allocation,performance modeling and analysis, and energyefficient design of wireless, mobile ad hoc, and

    sensor networks. She was the recipient of the 2002 US National ScienceFoundation (NSF) CAREER Award, the University of Michigan ElizabethC. Crosby Research Award in 2003, and the 2010 EECS DepartmentOutstanding Achievement Award. She serves on the editorial boards ofIEEE/ACM Transactions of Networking , IEEE Transactions of Mobile Computing , and ACM Transactions of Sensor Networks . She is a senior

    member of the IEEE.Shufang Li received the PhD degree in theDepartment of Electrical Engineering, TsinghuaUniversity, Beijing, China, in 1997. She iscurrently the director of the Ubiquitous Electro-magnetic Environment Center of EducationMinistry, China, and the director of the JointLab of BUPT & State Radio Monitoring Center(SRMC), China. Her research interests includethe theory and design technology of radiofrequency circuits in wireless communication,

    EMI/EMC, simulation technology and optimization for radiation interfereon high-speed digital circuit, etc. She received the Young ScientistsReward sponsored by the International Union of Radio Science (URSI).She has published hundreds of papers in China and overseas, as well

    as several textbooks, translation works, and patents.

    . For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.

    1046 IEEE TRANSACTIONS ON MOBILE COMPUTING, VOL. 11, NO. 6, JUNE 2012