OPPORTUNISTIC SPECTRUM ACCESS LEARNING PROOF OF CONCEPT · 2014-02-24



OPPORTUNISTIC SPECTRUM ACCESS LEARNING PROOF OF CONCEPT

Clément ROBERT1, Christophe MOY1, Honggang ZHANG1,2

1SUPELEC/IETR, Avenue de la Boulaie, 35576 Cesson-Sévigné, France 2Université Européenne de Bretagne & SUPELEC

[email protected]

ABSTRACT This paper presents the results of the first implementation of learning algorithms for opportunistic spectrum access (OSA) in lab conditions. The OSA scheme consists of two USRP N210 platforms, one acting as the primary users of a primary network and the other as a secondary user. The primary user network generates carriers with a pre-defined probability of occupancy through an OFDM modulation scheme implemented in the GNU Radio Companion (GRC) environment. The secondary user, implemented on another USRP platform programmed through the Simulink™ environment, learns and predicts the channels' (carriers') occupancy with the help of learning algorithms. Two reinforcement learning algorithms, UCB (Upper Confidence Bound) and WD (Weight Driven), are used by the secondary user to learn and predict the spectrum occupancy. These algorithms have been chosen because they are capable of acting and learning in highly unpredictable conditions such as those met in the cognitive radio context. This proof-of-concept validates the decision making capabilities of reinforcement learning algorithms for OSA in real wireless conditions. Finally, a performance comparison between the two learning algorithms is also given.

1. INTRODUCTION

The radio frequency spectrum is a rare and expensive resource. During the 20th century, the spectrum scarcity issue was repeatedly solved by exploiting newly accessible, higher frequency bands, thanks to progress in electronics. Over the same period, the radio research community improved radio techniques, enabling more bits per second per Hz to be sent. Progress was obtained mainly on analogue circuits during a first period, and then mainly on digital signal processing: channel coding, modulation schemes, cellular networking, etc. Despite this progress, however, the demand from wireless applications is growing faster than the new spectrum opportunities, and consequently spectrum scarcity is getting worse every year. We are facing such a limit today that the spectrum access paradigm needs to change. It has historically relied on FDMA (Frequency Division Multiple Access), improved by combination with TDMA (Time Division Multiple Access) and CDMA (Code Division Multiple Access). We should now move from incremental improvements to a real break with the past. Indeed, many measurements have recently shown that even though all spectrum bands are reserved for a specific service or application, many of them are underutilized in time, depending on the location [1][2]. An opportunity therefore exists for new spectrum usage if spectrum sharing is done differently [3]. Opportunistic Spectrum Access (OSA) is one proposed solution, where Secondary Users (SUs) are allowed to use the spectrum left vacant by licensed Primary Users (PUs). Cognitive radio [4], with its permanent adaptability to varying conditions, is foreseen as a key technology for implementing such new schemes in the commercial spectrum.

2. LEARNING SPECTRUM FOR COGNITIVE RADIO

A. Cognitive Radio decision making and learning

The facilities that a cognitive radio equipment (or, by extension, a cognitive network) should include in addition to conventional radio processing can be summarized as [5]:

- Sensing units,
- Learning and decision making units,
- Reconfigurable units.

Let us look into the OSA scheme. Time is slotted in iterations. At each iteration, the SU radio system senses a channel. If the channel is detected vacant, the SU transmits over that channel. If the channel is occupied, no transmission is done in that iteration. In the latter case, the SU system must wait for the next iteration to sense


another channel. Hence, the main task of the learning and decision making algorithm is to decide which channel to target in the next iteration, depending upon the results of past trials. The final goal is to select at each iteration the channel with the highest probability of success (i.e. a vacant channel) in order to maximize transmission opportunities. In this paper, we focus on the decision making and learning attributes. These two attributes are not independent but intimately related. Paper [6] presents an overview of decision making strategies that were studied during the first ten years of CR. To sum up, in order to select which decision strategy is appropriate for a given CR context, it is necessary to analyze the degree of a priori knowledge about the environment that is available to the system (in the widest sense of [4] and [5]). If the system perfectly knows the environmental state (which can be derived from many parameters), expert schemes can be used. In this scenario, the system configuration corresponding to each known environment state can be pre-defined a priori. On the other hand, when the environment state is instantaneously unpredictable, or at best statistically predictable, learning is helpful. This paper deals with this latter case and focuses on learning.
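The slotted sense-then-learn loop described above can be sketched in simulation. This is an illustrative sketch, not the paper's testbed code: `GreedyLearner`, `run_osa` and the error-free sensing model are assumptions made for the sketch (the paper uses UCB or WD as the learner and a real energy detector):

```python
import random

class GreedyLearner:
    """Toy learner standing in for UCB/WD: it tracks the empirical vacancy
    of each channel and always picks the best-looking one."""
    def __init__(self, n_channels):
        self.trials = [0] * n_channels
        self.wins = [0] * n_channels

    def update(self, k, vacant):
        self.trials[k] += 1
        self.wins[k] += 1 if vacant else 0

    def choose(self):
        # favor channels with the highest (smoothed) empirical vacancy
        return max(range(len(self.trials)),
                   key=lambda k: (self.wins[k] + 1) / (self.trials[k] + 1))

def run_osa(p_vacant, n_iter, seed=0):
    """Slotted OSA loop: at each iteration, sense one channel; transmit
    (here: count a success) only if it is vacant; then let the learner
    pick the channel to sense at the next iteration."""
    rng = random.Random(seed)
    learner = GreedyLearner(len(p_vacant))
    k, successes = 0, 0
    for _ in range(n_iter):
        vacant = rng.random() < p_vacant[k]   # idealized, error-free sensing
        if vacant:
            successes += 1                     # the SU would transmit here
        learner.update(k, vacant)
        k = learner.choose()
    return successes
```

`run_osa` returns the number of slots in which the sensed channel was found vacant, i.e. the number of transmission opportunities the SU would have exploited.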

B. Reinforcement Learning for OSA Reinforcement Learning (RL) is based on the "try and evaluate" principle: it iteratively tries a set of solutions, evaluates their results and derives a quality factor for each trial. The goal is to rank the solutions, given a quality objective, so that the best one is used at the next iteration. In other words, this aims at predicting which solution will give the best opportunity at the next iteration. Figure 1 shows how this can be applied in the OSA context, at the output of a sensing algorithm detecting the presence of a PU signal (energy detector, cyclostationarity detector, etc.). The learning and decision process aims at:

1. Deciding to transmit or not,
2. Updating the learning information,
3. Deciding which channel to sense, and choose for transmission, at the next iteration.

The decision to transmit is taken only if the channel is sensed vacant. Learning, as well as the decision on which channel to sense in the next iteration, are done independently of the detection result.

Figure 1 - Learning and decision making processes based on a reinforcement learning approach in the OSA context.

The OSA context can be modeled as a MAB (Multi-Armed Bandit) problem [7]. In the OSA context, the MAB model is as follows: each frequency channel is equivalent to a gambling machine, or bandit arm. The figure of merit of a channel is its probability of being vacant, i.e. not used by a PU, which is equivalent to the probability that a gambling machine, or arm, wins.

3. REINFORCEMENT LEARNING ALGORITHMS FOR OSA

The aim of the work presented in this paper is to study and compare the performance of two RL algorithms previously proposed in the literature. The first one is UCB, which has been theoretically addressed in [8], and the other one is the Weight Driven (WD) algorithm [9].

A. RL model for OSA The spectrum is divided into K channels denoted by k ∈ {1, 2, …, K}, each having the same bandwidth and representing one arm for the MAB algorithm. We suppose that time is discrete, slotted in iterations, and that only one channel is sensed in each iteration. The temporal occupancy of every channel k follows a Bernoulli distribution θk, for which the expected value, μk = E[θk], can be set independently. The SUs are supposed to be synchronous with the PUs. We define t as the discrete time index representing the total number of times (or iterations) that the algorithm has been played. The cumulative number of times that channel k has been chosen in previous iterations is Tk(t).
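Under this model, each channel's availability is an independent Bernoulli draw per slot. A minimal simulation of the PU network statistics can make this concrete (the function name is illustrative; the vacancy vector used below is the one chosen later in the experiments):

```python
import random

def simulate_pu_network(p_vacant, n_slots, seed=1):
    """Slot-by-slot PU occupancy model: channel k is vacant in a slot with
    probability p_vacant[k] (an independent Bernoulli draw per slot per
    channel). Returns the per-channel empirical vacancy over the run."""
    rng = random.Random(seed)
    K = len(p_vacant)
    vacant_counts = [0] * K
    for _ in range(n_slots):
        for k in range(K):
            if rng.random() < p_vacant[k]:
                vacant_counts[k] += 1
    return [c / n_slots for c in vacant_counts]
```

Over a long run, the empirical vacancy of each channel converges to its parameter μk, which is exactly what a learning SU tries to estimate from far fewer, one-channel-per-slot observations.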

B. Upper Confidence Bound algorithm for OSA UCB is a reinforcement learning algorithm that can solve problems modeled as a MAB. We have analyzed, at a theoretical level, the potential capabilities of UCB algorithms as a means for a CR equipment to learn


about spectrum opportunities [6, 8, 10]. In [6], we explained that a learning algorithm based approach is efficient in a context of high uncertainty, i.e. where the cognitive system has a priori no knowledge about the environment conditions. In [8], UCB was indeed identified as a solution for decentralized learning for cognitive radio. Then, the performance of UCB was analyzed in the presence of sensing errors and cooperative mechanisms. It has been proven theoretically in [10] that UCB still converges to the best solution(s) if sensing errors are committed, which confirmed its validity in the cognitive radio context. The validity of the theoretical results has been confirmed by simulating a radio chain taking sensing errors into account. One goal of this paper is to check it experimentally. Several UCB algorithms exist [11], but that does not change the generality of the proposed approach. Let Sk(t) denote the empirical sample mean of the independent realizations of the statistical distribution θk described previously, observed on channel k. If we define Ak(t) as a bias added to the empirical sample mean Sk(t), we can compute the UCB coefficients Bk(t) as in [13] for UCB1 with:

Ak(t) = sqrt( α · ln(t) / Tk(t) ), ∀k   (1)

and

Bk(t) = Sk(t) + Ak(t), ∀k   (2)

At each iteration, the decision based on the UCB algorithm returns the index of the maximum value of Bk(t). Indeed, as shown in equation (2), the Bk(t) index of each channel consists of the empirical mean (obtained from the trials made on that channel) upper bounded by the bias Ak(t) (also specific to each channel). The higher the Bk(t) index of a channel, the higher the probability that the corresponding channel is vacant. So the SU will choose the channel having the highest Bk(t) index for transmission at the next iteration. A consequence is that the best channels are sensed more than the others, and hence the knowledge of their availability is better, i.e. closer to reality. Note also that the coefficient α is related to the convergence speed of the algorithm. In other words, α sets the relative ratio between exploration (learn more about less known cases, having a low Bk index) and exploitation (rely on well known cases, having a high Bk index).
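Equations (1) and (2) translate directly into code. The following is a generic UCB1 sketch under the notation above, not the authors' implementation; the default value of α is an assumption:

```python
import math

def ucb_choose(S, T, t, alpha=2.0):
    """UCB1 index of equations (1)-(2): B_k = S_k + sqrt(alpha*ln(t)/T_k).
    S[k] is the empirical vacancy mean of channel k, T[k] the number of
    times it has been sensed, t the total iteration count. Untried
    channels are sensed first."""
    for k, n in enumerate(T):
        if n == 0:
            return k                       # sense every channel once first
    B = [S[k] + math.sqrt(alpha * math.log(t) / T[k]) for k in range(len(T))]
    return max(range(len(B)), key=B.__getitem__)

def ucb_update(S, T, k, vacant):
    """Incremental update of channel k's empirical mean after one sensing
    (vacant = 1 if the channel was detected free, 0 otherwise)."""
    T[k] += 1
    S[k] += (int(vacant) - S[k]) / T[k]
```

With equal trial counts, the bias terms are equal and the channel with the higher empirical mean wins; with unequal counts, the sqrt(ln t / Tk) term boosts rarely-sensed channels, which is the exploration behavior discussed above.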

C. Weight Driven algorithm for OSA The WD algorithm is structured in the same way as UCB, except for three aspects:

- the Bk index of each channel is replaced by a weight Wk. The weight Wk is not based on the empirical mean (and hence on the probability of occupancy of the channel) but is a direct function of the number of times the corresponding channel has been sensed vacant,

- the introduction of a preferred set of channels limits the channel selection to a subset of the total number of channels. This subset consists of the best channels and is formed after a given initialization step [9],

- the decision to transmit is not only based on the presence or absence of the primary user [9]; it is also a function of the quality of the considered channel. However, this latter aspect will not be considered in this paper.

After each iteration, the weight of the channel k that has been chosen for transmission is updated as follows:

Wt+1,k = Wt,k + f   (3)

where f = +1 if the channel is rewarded, i.e. detected vacant, and f = −1 if it is punished, i.e. detected occupied. Note that if Wk is null and channel k is detected occupied, Wk stays null. Each channel is ranked thanks to its weight, which reflects the quality of the resource. The WD algorithm is directly derived from the two-stage RL algorithm proposed in [12]. The decision process is based on a statistical distribution constructed from the weights, as in equation (4):

Pk(t) = Wt,k / Σj∈{1,…,K} Wt,j   (4)

where Pk(t) is the probability that channel k is chosen. Moreover, in WD, if the weight of a channel rises above a given threshold, the channel is selected to enter the preferred set. Once the preferred set is full, the choice is restricted to the channels of the preferred set only. A new channel may be included in the preferred set only if another channel leaves it, when its Wk falls below the threshold value. The threshold value and the size of the preferred set are the parameters which dimension the WD algorithm in terms of exploration and exploitation.
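The three WD ingredients above — the weight update of equation (3) floored at zero, the weighted draw of equation (4), and the preferred set — can be sketched as follows. The default threshold and set-size values are assumed tuning parameters for the sketch, not values taken from [9]:

```python
import random

class WeightDriven:
    """Sketch of the WD decision rule: +1/-1 weight updates floored at
    zero, weighted-random channel choice, and a preferred set that a
    channel enters when its weight exceeds a threshold."""
    def __init__(self, n_channels, threshold=5, set_size=3, seed=0):
        self.rng = random.Random(seed)
        self.W = [0] * n_channels
        self.threshold = threshold
        self.set_size = set_size
        self.preferred = set()

    def update(self, k, vacant):
        # equation (3): W_{t+1,k} = W_{t,k} + f, f = +1 if vacant, -1 if
        # occupied; per the text, W_k never goes below zero
        self.W[k] = max(0, self.W[k] + (1 if vacant else -1))
        if self.W[k] > self.threshold and len(self.preferred) < self.set_size:
            self.preferred.add(k)
        elif k in self.preferred and self.W[k] < self.threshold:
            self.preferred.discard(k)   # leaves room for another channel

    def choose(self):
        # equation (4): P_k = W_k / sum_j W_j; once the preferred set is
        # full, the draw is restricted to its channels
        pool = (sorted(self.preferred)
                if len(self.preferred) == self.set_size
                else list(range(len(self.W))))
        weights = [self.W[k] for k in pool]
        if sum(weights) == 0:
            return self.rng.choice(pool)      # all weights null: uniform
        return self.rng.choices(pool, weights=weights)[0]
```

Once the preferred set is full, `choose` never samples outside it, which is exactly the exploitation-heavy behavior (and the divergence risk) discussed in the experimental sections.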


4. PROOF OF CONCEPT AND EXPERIMENTAL RESULTS

A. Primary user network platform

In our experiments, the number of channels (K) is set to 8, i.e. a primary network of 8 channels is generated. The probability of occupancy of each channel by the primary users can be set beforehand. The probability of vacancy of the channels has been set to {0.5; 0.3; 0.4; 0.5; 0.6; 0.7; 0.8; 0.9} in the following experiments, which means that the probability of occupancy of channel 1 by PUs is 0.5, of channel 2 is 0.7, of channel 3 is 0.6, and so on. Instead of using 8 platforms that should be synchronized to coordinate the PUs' frequency jumps, an OFDM signal generation with 8 carriers has been chosen [14]. Only one platform is necessary, and the synchronization between users is straightforward as all the channels' traffic is generated through OFDM symbols. The chosen design environment for the primary network radio signal generation is GNU Radio Companion (GRC), and the hardware platform is made of a USRP platform from Ettus Research [15] connected to a laptop running Linux, as shown on the left hand side of Figure 2. For simplicity purposes, the OFDM symbol rate is set to one symbol per second. This means that the channel occupancy varies once a second and can be followed by the human eye. Nothing technically prevents accelerating this rate (sensing accuracy indeed is a function of sensing duration, but this is out of the scope of this paper). Algorithms converge as a function of the number of trials, so the learning algorithms' convergence speed is directly a function of this rate. If frames were 1 ms long, we could directly conclude on a learning speed 1000 times faster than in the current experiment.

Figure 2 – Experimental testbed for learning in an OSA context. Left hand side (laptop + USRP) is playing the role of the primary network transmission, with the visualization of the generated traffic on 8 channels. Right hand side (laptop + USRP) is playing the role of the secondary user learning algorithm, implementing an energy detector as a sensor. In between the two platforms, we can see the generated radio signal on a spectrum analyzer with an antenna at its input.

B. Secondary user platform

The right hand side platform of Figure 2, consisting of a computer and a USRP platform, represents a SU link. Only sensing and learning are implemented here. In other words, the decision to transmit, after detecting that the channel of interest is empty, is not implemented, and no SU transmission occurs. This is out of the scope of our experiment, which focuses on learning validation. The detector used for sensing is an energy detector, but any other detector could be used without any loss of generality on the learning results. Sensing quality may have an influence, but what is important in the experiment is that there may be detection errors, i.e. deciding a PU is present when it is not (false alarm), and deciding a PU is not present when it is (misdetection). The first channel is used as a synchronization means for the SU platform. Channel 1's probability of occupancy is 0.5, as it switches from vacant to occupied and vice-versa at each OFDM symbol.


This enables the secondary user to detect when the transition between OFDM symbols occurs, and to synchronize the energy detection phase on an entire OFDM symbol of the primary network. A snapshot of the frequency responses in Figure 2 shows the shape of the PU network signals. Note that there is a shift between the 3 spectra: the primary network on the left hand side displays the spectrum of iteration n+1, the spectrum analyzer measures the current PU activity on the air, and the SU on the right hand side displays the result of the analysis of the previous iteration n−1. The sensing and learning signal processing is so simple that it can be done in real time using Simulink™ on a conventional laptop. On other radio applications, we have usually experienced a factor of 60 between Simulink™ and GRC execution speed, in favor of GRC. Most of the processing is used for sensing and display purposes. Learning only consists of updating a set of sixteen variables: eight of them for UCB, requiring a square root, a logarithm, 3 multiplications, 3 additions and a comparison operator, and eight for WD, needing an addition of ±1. So both UCB and WD learning algorithms can easily be implemented in parallel for real-time execution.

5. COMPARISON OF THE TWO MACHINE LEARNING ALGORITHMS FOR OSA

A. Comparison context

The goal of this experiment is to evaluate, in real radio conditions, the speed of convergence of RL algorithms in a laboratory environment, and to see if it makes sense for radio applications. Comparing two RL algorithms is then expected to give information both on convergence speed and on the eventuality of divergence. UCB and WD are compared in these experiments in terms of learning only, based on the presence or absence of a PU. Other parameters concerning the quality of the channel (SINR), considered in [12], cannot be included here as transmission by the SU is not implemented. The following figures help compare the UCB and WD algorithms. For both algorithms, the tables are ordered from channel 1 to 8, starting at the top. The WD algorithm table is above the UCB one, with the following information:

- the central column is the number of times the WD algorithm played each channel previously,

- the right column is the weight Wk of each channel, consequently a function of the difference between the number of times the channel was detected vacant and the number of times it was detected occupied,

- the left column is the empirical probability of vacancy of the channels obtained with the WD algorithm. It is given here for information purposes only, in order to compare with the real probability predefined in section 4.A and with the one obtained with UCB. It is not used by the WD algorithm.

The preferred set has a size of 3 in these experiments. Concerning the UCB algorithm, two columns are given:

- the right column is the number of times the UCB algorithm played each channel,

- the left column is the empirical probability of vacancy Sk(t) for the 8 channels, from which the Bk index value is derived by addition of the Ak bias.

Channels with a higher Bk index are played more than those with a lower index. This means that the channels which are the most likely to be vacant are sensed more, in order to obtain more transmission opportunities, as only one channel is sensed at each iteration and only this one can be used for transmission. Note that in this experiment, both algorithms are executed in parallel, on exactly the same experimental data in terms of carrier randomness and radio channel conditions. However, the channel selected at each iteration may differ, since these algorithms follow different strategies.
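This parallel execution can be reproduced in simulation: both learners see the same per-slot channel realizations but sense the channels of their own choosing. A compact self-contained sketch follows (WD is reduced to its weight rule, without the preferred set, to keep it short; α, the seed and the probabilities are assumptions for the sketch):

```python
import math
import random

def compare(p_vacant, n_iter, alpha=2.0, seed=7):
    """Run a simplified UCB and a simplified WD side by side on the same
    per-slot vacancy draws. Returns per-channel play counts (UCB, WD)."""
    rng = random.Random(seed)
    K = len(p_vacant)
    S, T = [0.0] * K, [0] * K          # UCB: empirical means, trial counts
    W = [0] * K                         # WD: channel weights
    plays_ucb, plays_wd = [0] * K, [0] * K
    for t in range(1, n_iter + 1):
        vacant = [rng.random() < p for p in p_vacant]  # shared realization
        # UCB choice: untried channels first, then argmax of B_k (eq. 1-2)
        untried = [k for k in range(K) if T[k] == 0]
        ku = untried[0] if untried else max(
            range(K),
            key=lambda k: S[k] + math.sqrt(alpha * math.log(t) / T[k]))
        T[ku] += 1
        S[ku] += (int(vacant[ku]) - S[ku]) / T[ku]
        plays_ucb[ku] += 1
        # WD choice: proportional to weights (uniform while all are zero)
        kw = (rng.choices(range(K), weights=W)[0] if sum(W)
              else rng.randrange(K))
        W[kw] = max(0, W[kw] + (1 if vacant[kw] else -1))
        plays_wd[kw] += 1
    return plays_ucb, plays_wd
```

Because both learners consume the same vacancy realizations, any difference in their play counts is due to their strategies alone, mirroring the experimental setup.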

B. Early learning experimental results snapshot Figure 3 shows the results of a first experiment on real signals after 340 iterations. They clearly show how differently the two considered algorithms behave. WD has only tried a few channels. The first times it sensed them, they were vacant sufficiently often to obtain a weight above the threshold, so they entered the preferred set. Once the preferred set is full, only these channels are played, unless they are found occupied so many times that their weight decreases below the threshold value. Then another channel may replace them in the preferred set. Once the preferred set is full, the other channels are not explored, but we can see in Figure 3 that this makes WD very efficient. As the 3 best channels indeed are in the preferred set, almost half of


the trials (192 over 351) are made on the best channel, which is channel number 8, at the bottom of each table. The UCB algorithm, on its side, makes an exhaustive exploration of all the channels. However, it gives priority to the channels it has mostly found vacant. We can see in Figure 3 that half of the trials have been made on the 2 best channels, which is good after only that small number of iterations. From the UCB algorithm results, we can derive the approximated probability of vacancy of each channel: over time in Figure 4, and after 340 iterations at the bottom of Figure 3. After 340 iterations in Figure 3, which corresponds to the end of Figure 4, we do not obtain exactly the transmitted set of probabilities {0.5; 0.3; 0.4; 0.5; 0.6; 0.7; 0.8; 0.9}; however, the channels have been correctly ordered (except channels 4 and 3). Figure 4 illustrates, over time, the empirical probability of non-occupation by PUs derived by the UCB algorithm on the eight channels. We can see how the estimation evolves as trials are made during the first 340 iterations.

Figure 3 - Learning results on the eight channels after 340 iterations – top for WD and bottom for UCB algorithm

Figure 4 – Empirical probability of vacancy Sk(t) on the eight channels derived by the UCB algorithm during 340 iterations: channel #8 is violet / #7 yellow / #6 blue / #5 green / #4 red / #3 light blue / #2 purple / #1 light yellow

Finally, quite good results are obtained even before 100 iterations, i.e. only a little more than 10 times the number of channels to learn about. Note that the channels with a lower empirical mean are those which are tried less, so each trial causes large variations of the empirical value. The role of the Ak bias in the UCB formula is to upper bound the empirical mean of channel k with a value inversely proportional to the (square root of the) number of trials Tk on this channel. It therefore gives badly rated channels a chance to be selected from time to time, and to recover if they finally turn out to be often vacant.

C. Experimental results evolution in time We now consider the results of another experiment on new real signals. We can see in Figure 5 the very beginning of the learning phase, i.e. after 80 iterations (a mean of 10 trials per channel). We first consider the best channel. Whereas the UCB algorithm is only theoretically guaranteed to converge at infinity, we can see that it has already chosen the most available channel (channel 8) almost one third of the time, maximizing in that way the transmission opportunities, compared to a uniform rule which would have selected this channel only 1/8 of the time. The WD algorithm, on its side, has just finished filling its preferred set. Channel 1 has been tried 10 times whereas W1 is only 1, which


means that it has been found vacant 5 times and occupied 4 times, so that it has either never been accepted in the preferred set, or has been rejected from it, to the profit of channel 5. But channel 5 is weak (W5 = 4) and may soon be ejected. WD has not yet selected the best channel in its preferred set. We can see that WD tried the best channel twice, but as it was occupied once, it has been over-balanced by channel 5, which was vacant in the early trials. Figure 6 confirms, after 1500 iterations, that WD did not select the best channel in its preferred set. It is now very hard to change the preferred set, as channel 5 would have to be detected occupied as many times as it has previously been detected vacant in order to leave a chance to channel 8. UCB, on its side, has chosen the best channel a mean of 2 iterations over 5. The probabilities of vacancy are perfectly ordered, even if a little optimistic in general, but what matters most for choosing a channel is the relative probability between bands, not the absolute one. In Figure 7, 7000 iterations have been done. UCB selected the best channel half of the time, and the 2 best channels 3/4 of the time, showing that learning keeps improving, thanks to the accumulated knowledge.

Figure 5 - Learning results on the eight channels after 80 iterations – top for WD and bottom for UCB algorithm

Figure 6 - Learning results on the eight channels after 1500 iterations – top for WD and bottom for UCB algorithm

WD has not recovered the best channel in its preferred set. However, it has selected the second best channel almost 1/3 of the time, which is quite good too.

Figure 7 - Learning results on the eight channels after 7000 iterations – top for WD and bottom for UCB algorithm


D. Results interpretation

Figures 3 to 7 show how, on real radio signals, the SU learns the occupation rate of the primary bands. This lets the SU favor, for its next transmission (not implemented here), the channel with the highest probability of being vacant, which guarantees the best transmission opportunities for the SU. The speed of convergence in real radio conditions is surprisingly good considering that RL algorithms rest on a theory whose guarantees are asymptotic. In a context where a trial is made at each 1 ms frame, the results of Figure 7, i.e. the best channel selected half of the time by UCB, would be obtained in 7 seconds. A slightly lower rate (2 out of 5) would be obtained after just 1.5 seconds, as shown in Figure 6, which is compatible with the constraints of radio applications: a few seconds of a communication are enough to benefit from spectrum opportunities. When WD admits the best channel into its preferred set, it is very efficient, as shown in the first experimental results. When it does not, it is considered divergent in the machine-learning sense, and simulations have also shown this divergence possibility for WD. From the CR point of view, however, since WD then chooses the second and third best channels, its average behavior remains very good.
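The convergence behavior discussed above can be reproduced in a self-contained Monte Carlo sketch: a UCB1-style learner runs over eight Bernoulli channels and we measure how often it selects the best one. The vacancy probabilities and the exploration constant `alpha` are invented for illustration, so the exact fractions will differ from the experimental figures.

```python
import math
import random

def simulate_ucb(probs, horizon, alpha=2.0, seed=0):
    """Run a UCB1-style learner over channels with vacancy
    probabilities `probs` for `horizon` iterations; return the
    fraction of iterations in which the best channel was chosen."""
    rng = random.Random(seed)
    n = len(probs)
    vacant_counts = [0] * n
    trials = [0] * n
    best = max(range(n), key=lambda k: probs[k])
    best_picks = 0
    for t in range(1, horizon + 1):
        def index(k):
            if trials[k] == 0:
                return float("inf")  # sense untried channels first
            # empirical vacancy rate plus exploration bonus
            return (vacant_counts[k] / trials[k]
                    + math.sqrt(alpha * math.log(t) / trials[k]))
        ch = max(range(n), key=index)
        if ch == best:
            best_picks += 1
        # simulated sensing outcome: vacant with probability probs[ch]
        trials[ch] += 1
        vacant_counts[ch] += rng.random() < probs[ch]
    return best_picks / horizon

# eight channels with distinct, purely illustrative vacancy probabilities
probs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]
```

With a 1 ms frame per iteration, a horizon of 7000 in this sketch corresponds to the 7 seconds of real time mentioned above.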

6. CONCLUSION

The first experimental results of learning algorithms for OSA are provided in this paper. This proof of concept shows that reinforcement learning provides realistic and accurate solutions for seeking spectrum availability in real conditions. The paper also compares two reinforcement learning algorithms, UCB and WD, and the experiments illustrate how each behaves in different circumstances. The balance between exploration and exploitation performance is what mainly differentiates WD from UCB and can serve as a selection criterion; UCB provides a good compromise between exhaustive exploration and exploitation performance. These experiments will be extended to a multi-user context in order to analyze how SUs behave with respect to one another and how they compete or cooperate.

7. ACKNOWLEDGMENT

This work has received French state support granted to the CominLabs excellence laboratory and managed by the National Research Agency in the “Investing for the Future” program under reference ANR-10-LABX-07-01. The authors would also like to thank the Région Bretagne, France, for its support of this work, as well as Paul Sutton for pointing out the Trinity College Dublin papers dealing with hole generation in OFDM spectrum.

8. REFERENCES

[1] FCC Spectrum Policy Task Force, "Report of the Spectrum Efficiency Working Group," 15 November 2002.

[2] M. López-Benítez, F. Casadevall, A. Umbert, J. Pérez-Romero, J. Palicot, C. Moy, R. Hachemani, "Spectral occupation measurements and blind standard recognition sensor for cognitive radio networks," CrownCom, June 2009.

[3] Q. Zhao, A. Swami, "A survey of dynamic spectrum access: signal processing and networking perspectives," IEEE ICASSP, special session on Signal Processing and Networking for Dynamic Spectrum Access, Hawaii, USA, 15-20 April 2007.

[4] J. Mitola, "Cognitive Radio: An Integrated Agent Architecture for Software Defined Radio," Ph.D. dissertation, KTH, Sweden, 2000.

[5] J. Palicot, Radio Engineering: From Software Radio to Cognitive Radio, Wiley, 2011, ISBN 978-1-84821-296-1.

[6] W. Jouini, C. Moy, J. Palicot, "Decision making for cognitive radio equipment: analysis of the first 10 years of exploration," EURASIP Journal on Wireless Communications and Networking, 2012:26, 2012.

[7] R. Agrawal, "Sample mean based index policies with O(log n) regret for the multi-armed bandit problem," Advances in Applied Probability, 27:1054-1078, 1995.

[8] W. Jouini, D. Ernst, C. Moy, J. Palicot, "Upper confidence bound based decision making strategies and dynamic spectrum access," IEEE ICC, Cape Town, South Africa, May 2010.

[9] T. Jiang, D. Grace, P. D. Mitchell, "Efficient exploration in reinforcement learning-based cognitive radio spectrum sharing," IET Communications, vol. 5, no. 10, July 2011.

[10] W. Jouini, C. Moy, J. Palicot, "Upper confidence bound algorithm for opportunistic spectrum access with sensing errors," CrownCom, Osaka, Japan, 1-3 June 2011.

[11] J.-Y. Audibert, R. Munos, C. Szepesvári, "Tuning bandit algorithms in stochastic environments," International Conference on Algorithmic Learning Theory, 2007.

[12] T. Jiang, D. Grace, Y. Liu, "Two stage reinforcement learning based cognitive radio with exploration control," IET Communications, vol. 5, no. 5, March 2011.

[13] P. Auer, N. Cesa-Bianchi, P. Fischer, "Finite-time analysis of the multi-armed bandit problem," Machine Learning, 47(2-3):235-256, 2002.

[14] I. Macaluso, B. Özgül, T. K. Forde, P. Sutton, L. Doyle, "Spectrum and energy efficient block edge mask-compliant waveforms for dynamic environments," IEEE Journal on Selected Areas in Communications, vol. 32, no. 12, December 2014.

[15] Ettus Research, "Products," http://www.ettus.com/products, accessed 02/04/2012.