
      Deep Q-Learning Based Computation Offloading Strategy for Mobile Edge Computing

Computers, Materials & Continua, 2019, Issue 4

Yifei Wei, Zhaoying Wang, Da Guo and F. Richard Yu

Abstract: To reduce transmission latency and mitigate the backhaul burden of centralized cloud-based network services, mobile edge computing (MEC) has recently been drawing increased attention from both industry and academia. This paper focuses on the computation offloading problem of mobile users in wireless cellular networks with mobile edge computing, with the purpose of optimizing the computation offloading decision-making policy. Since wireless network states and computing requests have stochastic properties and the environment's dynamics are unknown, we use the model-free reinforcement learning (RL) framework to formulate and tackle the computation offloading problem. Each mobile user learns through interactions with the environment and estimates its performance in the form of a value function, then chooses the overhead-aware optimal computation offloading action (local computing or edge computing) based on its state. Because the state space in our work is high-dimensional, the value function is unrealistic to estimate directly. Consequently, we use a deep reinforcement learning algorithm, which combines the RL method Q-learning with a deep neural network (DNN) to approximate the value functions for complicated control applications; the optimal policy is obtained when the value function converges. Simulation results show the effectiveness of the proposed method in comparison with baseline methods in terms of the total overheads of all mobile users.

      Keywords: Mobile edge computing, computation offloading, resource allocation, deep reinforcement learning.

      1 Introduction

As smartphones become more and more popular, a variety of new mobile applications such as face recognition, natural language processing and augmented reality are becoming an increasing part of daily life, and thus people need high-rate computation and a large amount of computational resources [Wang, Liang, Yu et al. (2017)]. As we all know, cloud computing depends on its powerful centralized computing capability, which meets the demands of resource-limited end users for effective computation. However, moving all the distributed data and computation-intensive applications to the cloud server results in a heavy burden on network performance and long latency for resource transmission between users and cloud computing devices, which degrades the quality of service [Shi, Cao, Zhang et al. (2016); Bao and Ding (2016)].

In order to further reduce latency and enhance network performance while providing powerful computational capability for end users, mobile edge computing (MEC) has been proposed to deploy computing resources closer to the end users. As a remedy to the problems of cloud computing, mobile edge computing places the functions of the cloud at the edge of the network, which achieves a tradeoff between computation-intensive and latency-critical requirements for mobile users [Mao, You, Zhang et al. (2017)]. Mobile edge computing enables mobile user equipment (UEs) to execute computation offloading by sending computation tasks to the MEC server through wireless cellular networks [Wang, Liang, Yu et al. (2017)], which means the MEC server executes the computational task on behalf of the UE. In mobile edge computing, network edge devices such as base stations, access points and routers are empowered with computing and storage capabilities to serve users' requests as a substitute for clouds [Patel, Naughton, Chan et al. (2014)]. In this paper, we consider an edge system as the combination of an edge device (macro cell) and the associated edge servers, which provides IT services, environments and cloud computing capabilities to meet mobile users' low-latency and high-bandwidth service requirements.

A survey of computation offloading identified two objectives: reducing the execution time and mitigating the energy consumption [Kumar, Liu, Lu et al. (2013)]. Computation offloading decisions are classified into two parts: what computation to offload, and where to offload computation. The decision regarding what computation to offload is generally considered a partitioning problem, in which a computation task is partitioned into different components and a decision is made on whether or not to offload each component [Huang, Wang and Niyato (2012); Alsheikh, Hoang, Niyato et al. (2015)]. The other decision, where to offload computation tasks, focuses on the binary choice between local computation and offloading the computation task to computing devices, which is similar to the decision of what to offload.

A number of previous works have discussed the computation offloading and resource allocation problem in mobile edge computing scenarios [Yu, Zhang and Letaief (2016); Mao, Zhang, Song et al. (2017); Mao, Zhang and Letaief (2016)]. Wang et al. [Wang, Liang, Yu et al. (2017)] considered the computation offloading decision, physical resource block (PRB) allocation, and MEC computation resource allocation as optimization problems in wireless cellular networks solved by a graph coloring method. Xu et al. [Xu and Ren (2017)] presented the optimal policy of dynamic workload offloading and edge server provisioning to minimize the long-term system cost (including both service delay and operational cost) by using online learning algorithms, including value iteration and reinforcement learning (RL). Liu et al. [Liu, Mao, Zhang et al. (2016)] proposed a two-timescale stochastic optimization problem formulated as a Markov decision process in the MEC scenario and solved it as a linear programming problem.

In this paper, we focus on the computation offloading decision-making problem of whether to compute on local equipment or to offload the task to the MEC server, and propose an efficient deep reinforcement learning (DRL) scheme. By making the right computation offloading decision, a mobile user can enhance computation efficiency and decrease energy consumption. Each agent learns through interactions with the environment and evaluates its performance in the form of a value function. Since wireless network states and computing requests have stochastic properties, the value function is intractable to evaluate with traditional RL algorithms, so we apply a deep neural network (DNN) to approximate the action-value function with the reinforcement learning method deep Q-learning. Each agent chooses an action in its state and receives an immediate reward, then uses the DNN to approximate the value function. After the value function converges, the user is able to select the overhead-aware optimal computation offloading strategy based on its state and learning results. We aim to minimize the total overheads, in terms of computational time and energy consumption, of all users. Simulation results show that the proposed deep reinforcement learning based computation offloading policy performs effectively compared with the baseline methods in this work.

      2 System models and problem formulation

In this section we will introduce the system models, including the network model, communication model and computation model adopted in this work.

      2.1 Network model

An environment with one macro cell and N small cells, in the terminology of the LTE standards, is considered here. An MEC server is placed at the macro eNodeB (MeNB), and all the small cell eNodeBs (SeNBs) are connected to the MeNB as well as to the MEC server. In this paper, it is assumed that the SeNBs are connected to the MeNB in a wired manner [Jafari, López-Pérez, Song et al. (2015)]. The set of small cells is denoted as N = {1, 2, …, N}, and we let M = {1, 2, …, M} denote the set of mobile user equipment, where each single-antenna UE is associated with one SeNB. We assume that each UE has a computation-intensive and latency-sensitive task to be completed at each time slot t. Each UE can execute the computation task locally, or offload the computation task to the MEC server via the SeNB to which it is connected. The MEC server can handle all computing tasks because of its multi-tasking capability. Similar to many previous works in mobile cloud computing [Barbera, Kosta, Mei et al. (2013)] and mobile networking [Iosifidis, Gao, Huang et al. (2013)], to obtain a tractable analysis we consider a quasi-static scenario where the set of mobile users remains unchanged during a computation offloading period (e.g., within several seconds), whereas it may change between different periods. The network model is shown in Fig. 1.

      Figure 1: Network model

      2.2 Communication model

Because every SeNB is connected to the MEC server, a UE can offload computation tasks to the MEC server through the SeNB to which it is connected. The computation offloading decision of UE m is denoted as a_m ∈ {0, 1}, ∀m. Specifically, we set a_m = 0 when UE m decides to compute its task on its local equipment, and a_m = 1 when UE m decides to offload its computation task to the MEC server in a wireless manner. There are K orthogonal FDM (Frequency Division Multiplexing) sub-channels without interference to each other between the UEs and SeNBs, and each sub-channel bandwidth is assumed to be w. Given the computation offloading decision profile of all the UEs as a = {a_1, a_2, …, a_m}, we describe the Signal to Interference plus Noise Ratio (SINR) γ_m(t) and uplink data rate r_m(t) of UE m at time slot t as

where k_m(t) ∈ {0, 1, …, K} denotes the number of sub-channels allocated by the SeNB to user m, and p_m(t) is the transmission power of UE m. G_{m,n}(t) and G_{i,n}(t) denote the channel gains between UE m and SeNB n, and between UE i and SeNB n, respectively, and σ(t) is the additive white Gaussian noise power. For the sake of simplicity, we omit (t) in the following expressions, e.g., r_m stands for r_m(t), unless the time slot t is emphasized.
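For concreteness, a plausible form of the SINR and uplink rate consistent with the quantities defined above, assuming that interference comes only from the other offloading UEs and that the rate follows the Shannon capacity over the allocated sub-channels (these are modelling assumptions, not the paper's exact expressions), is:

\[
\gamma_m(t) = \frac{p_m(t)\, G_{m,n}(t)}{\sigma(t) + \sum_{i \ne m,\; a_i = 1} p_i(t)\, G_{i,n}(t)}, \qquad
r_m(t) = k_m(t)\, w\, \log_2\bigl(1 + \gamma_m(t)\bigr).
\]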

      2.3 Computation model

We consider that each mobile user m has a computational task J_m at each time slot t, which can be computed either locally on the mobile user's equipment or remotely on the MEC server by computation offloading, as in Chen [Chen (2014)]. B_m (in KB) denotes the size of the computation input data, including program codes and input parameters, and D_m (in Megacycles) stands for the total number of CPU cycles required to complete the computational task J_m; a maximum tolerable delay for executing the computation task J_m is also specified. A user can apply the methods in Yang et al. [Yang, Cao, Tang et al. (2012)] to obtain the information of B_m, D_m and the tolerable delay. Next we will discuss the overhead in terms of computation time and energy consumption for both the local computation and MEC computation offloading cases.

      2.3.1 Local computing

In this case, the computational task J_m is executed on the local mobile equipment. Each UE m has its own computational capacity (e.g., CPU cycles per second), and different UEs may have different computational capabilities. The computational time for executing the task locally on UE m is expressed as

      and the energy consumption for computation is given as

where ρ_m is the coefficient representing the energy consumed by each CPU cycle.

According to the realistic measurements in Wen et al. [Wen, Zhang, Luo et al. (2012)], we set the value of ρ_m accordingly.

According to the computational time in Eq. (3) and the energy consumption in Eq. (4), we can describe the total overheads of the local computation by UE m as
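A plausible form of Eqs. (3)-(5), consistent with the definitions above, is shown below; here t_m^l, E_m^l and Z_m^l (local execution time, energy and total overhead), f_m^l (local computational capacity of UE m) and λ_t^m, λ_e^m (the time and energy weighting factors discussed in the simulation section) are assumed symbols rather than the paper's original notation:

\[
t_m^{l} = \frac{D_m}{f_m^{l}}, \qquad
E_m^{l} = \rho_m D_m, \qquad
Z_m^{l} = \lambda_t^{m}\, t_m^{l} + \lambda_e^{m}\, E_m^{l}.
\]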

      2.3.2 MEC server computing

In this subsection we consider the case where the computational task J_m is offloaded to the MEC server. UE m incurs the extra overhead of transmission time and energy consumption for offloading the computation input data to the MEC server and downloading the computation outcome data back to the local equipment. The transmission time and energy consumption of UE m are computed respectively as

When the computation input data have been uploaded to the MEC server, the MEC server executes the computation task on behalf of the UE. The MEC server assigns a portion of its computational capability (i.e., CPU cycles per second) to UE m, and we assume that the sum of the computing resources allocated to all users cannot exceed the total computational capability of the MEC server, f^c. The computation time and corresponding energy consumption of the MEC server for task J_m are then given as

After the MEC server finishes executing the computing task, the computation outcome data need to be transmitted back to the mobile user. Therefore, the downlink transmission time and energy consumption are given as

Combining these terms, we can compute the total overheads of offloading the computational task of UE m to the MEC server as
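As an illustration of how the two overheads compare, the following Python sketch computes the weighted time-plus-energy overhead for the local and offloading cases under simplifying assumptions: the downlink return time and the MEC server's energy term are omitted, and all names (f_local, f_mec, rho, lam_t, lam_e, etc.) are illustrative choices rather than the paper's notation.

```python
def local_overhead(D_m, f_local, rho, lam_t, lam_e):
    """Weighted overhead of computing D_m CPU cycles on the local device."""
    t_local = D_m / f_local          # local execution time (s)
    e_local = rho * D_m              # energy = energy-per-cycle * cycles (J)
    return lam_t * t_local + lam_e * e_local


def offload_overhead(B_m, D_m, rate, p_tx, f_mec, lam_t, lam_e):
    """Weighted overhead of offloading: uplink transmission plus remote execution.

    The downlink return of results and the server-side energy are ignored
    here for simplicity; the paper's full model includes them.
    """
    t_up = B_m / rate                # time to upload B_m bits at the uplink rate (s)
    e_up = p_tx * t_up               # UE transmission energy (J)
    t_exec = D_m / f_mec             # remote execution time (s)
    return lam_t * (t_up + t_exec) + lam_e * e_up


# Example: a 4-Megacycle task with 2 Mbit of input data.
z_local = local_overhead(D_m=4e6, f_local=0.5e9, rho=1e-9, lam_t=0.5, lam_e=0.5)
z_edge = offload_overhead(B_m=2e6, D_m=4e6, rate=5e6, p_tx=0.1,
                          f_mec=10e9, lam_t=0.5, lam_e=0.5)
print(f"local={z_local:.4f}, edge={z_edge:.4f}")  # offload if z_edge < z_local
```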

      2.4 Problem formulation

Our goal is to optimize the expected long-term utility performance of all users, where each time slot t is a decision step and each user has only one task to perform per slot. Specifically, we aim at minimizing the total overheads of all the users, who can either execute tasks on their local equipment or perform computation offloading with mobile edge computing. By minimizing the total overheads, users can make the overhead-aware optimal computation offloading decision, which is of great importance in improving computational efficiency and reducing latency. We can model the optimization formulation of the problem as follows:

      subject to

The first term of Eq. (15) is the overhead generated by local computing and the second term is the overhead due to computation offloading. The first constraint ensures that an overhead-aware solution can be obtained by finding the optimal values of the offloading decision profile a. The second constraint means that the delay for performing each computation task cannot exceed the maximum tolerable delay. The third constraint states that the computing resources allocated to all users for offloading computation tasks cannot exceed the total amount of computing resources of the MEC server. And the last constraint specifies that the bandwidth allocated to all users cannot exceed the total spectrum bandwidth W. However, the objective is difficult and impractical to solve directly because a is a vector of binary variables, the feasible set of the problem is non-convex and the objective is not a convex function. From another perspective, the problem can be viewed as a sequential decision problem, which requires making continuous decisions to achieve the ultimate goal. In the following section, we propose a deep reinforcement learning algorithm to optimize the computation offloading problem.
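Based on the descriptions above, a plausible reconstruction of problem (15) and its constraints is the following, using assumed notation: Z_m^l and Z_m^c for the local and offloading overheads, T_m(a) for the completion delay of task J_m under decision profile a, T_m^max for its maximum tolerable delay, and f_m^c for the MEC computing resource assigned to UE m.

\[
\begin{aligned}
\min_{a}\ & \sum_{m \in \mathcal{M}} \bigl[(1 - a_m)\, Z_m^{l} + a_m\, Z_m^{c}\bigr] \\
\text{s.t.}\ & a_m \in \{0, 1\}, \quad \forall m \in \mathcal{M}, \\
& T_m(a) \le T_m^{\max}, \quad \forall m \in \mathcal{M}, \\
& \sum_{m \in \mathcal{M}} a_m f_m^{c} \le f^{c}, \\
& \sum_{m \in \mathcal{M}} a_m k_m w \le W.
\end{aligned}
\]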

      3 Computation offloading algorithm based on deep RL

The reinforcement learning algorithm aims at solving sequential decision problems, and general sequential decision problems can be expressed in the framework of the Markov decision process (MDP). The MDP describes a stochastic decision process of an agent interacting with an environment or system. At each decision time, the system stays in a certain state s and the agent chooses an action a that is available in this state. After the action is performed, the agent receives an immediate reward R and the system transits to a new state s′ according to the transition probability. The goal of an MDP or RL is to find an optimal policy, i.e., a mapping from states to actions that maximizes or minimizes a certain objective function [Alsheikh, Hoang, Niyato et al. (2015)].

      3.1 Definitions using RL

      To model this problem using RL, we set the following definitions:

Agent: the mobile user m who has computation-intensive and delay-sensitive tasks to complete.

State: the state of agent m, which comprises the SINR and the computational capability of agent m. Let s_t denote the system state at time slot t, where s(t) = {s_1(t), s_2(t), …, s_n(t)}.

Action: a_m ∈ {0, 1}, where a_m = 0 represents that UE m chooses to compute the task on its local equipment, while a_m = 1 means that UE m chooses to offload the computation task to the MEC server. a_t = {a_1(t), a_2(t), …, a_n(t)} denotes the computation offloading decision profile of all UEs at time slot t.

Reward: the reward of all mobile users with computation tasks at time slot t, in which the first term denotes the negative of the total overhead of local computation by UE m and the second term denotes the negative of the total overhead of computation offloading of UE m with MEC.
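Given this description, a plausible form of the reward (again using the assumed overhead symbols Z_m^l and Z_m^c) is the negative total overhead of all users,

\[
R_t = -\sum_{m \in \mathcal{M}} \bigl[(1 - a_m(t))\, Z_m^{l} + a_m(t)\, Z_m^{c}\bigr],
\]

so that maximizing the cumulative reward corresponds to minimizing the total overheads in Eq. (15).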

An agent chooses an action a in a particular state s, and evaluates its performance in the form of a state-value function based on the received immediate reward R and its estimate of the value of the state to which it transitions. After the convergence of the state-value functions, it learns the optimal policy π*, judged by the long-term discounted reward [Watkins and Dayan (1992)]. The discounted expected reward is defined by the Bellman expectation equation as follows [Wei, Yu and Song (2010); Wei, Yu, Song et al. (2018)]

where R(s, a) is the immediate reward received by the agent when it selects action a in state s, γ ∈ (0, 1) is a discount factor, and P(s′|s, a) is the transition probability from state s to s′ when the agent chooses action a. The discounted expected reward for taking action a includes the immediate reward and the future expected return.
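A plausible form of this Bellman expectation equation, using the quantities just defined and with V^π(s) as an assumed symbol for the state-value function under policy π, is:

\[
V^{\pi}(s) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{\pi}(s'), \qquad a = \pi(s).
\]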

According to the theory of Bellman's optimality equation, if we denote V*(s) as the maximum total discounted expected reward at each state, it can be solved recursively from the following equation:

then the optimal policy π* can be obtained when the total discounted expected reward is maximized, as follows:
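Plausible forms of these two equations (the recursion for V*(s) and the resulting optimal policy), consistent with the Bellman expectation equation above, are:

\[
V^{*}(s) = \max_{a}\Bigl[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \Bigr],
\qquad
\pi^{*}(s) = \arg\max_{a}\Bigl[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \Bigr].
\]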

However, the reward and transition probability are unknown in the RL method, which means it is a model-free policy. For a finite-state MDP, action-value functions are usually stored in a lookup table and can be learned recursively. So we have to learn the Q-value, which is defined as

The Q-value stands for the discounted expected reward for taking action a in state s and following policy π thereafter. The update of the Q-values for an optimal policy π* in the conventional RL method Q-learning is performed as

where Q_t is the target value, including the current reward r and the maximum Q-value max_a Q(s_{t+1}, a) in the next state, Q(s_t, a_t) is the estimated value, and α ∈ [0, 1] is the learning rate.
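Plausible forms of the Q-value definition and the Q-learning update referred to above are the standard ones (with Q^π(s, a) the action-value under policy π and r_t the immediate reward at step t):

\[
Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{\pi}(s'),
\]
\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Bigl[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \Bigr].
\]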

      3.2 Value function representation and approximation using DNN

In the conventional RL method Q-learning, a Q-table can be used to store the Q-value of each state-action pair when the state and action spaces are discrete and low-dimensional. However, since the state space is high-dimensional in our work, it is unrealistic to use the Q-table mentioned in the previous section. Accordingly, a function Q_w(s, a) is used to represent and approximate the value function Q(s, a) in RL to reduce the dimensionality of our problem. Deep neural networks have the advantage of extracting complex features in feature learning or representation learning [Bengio, Courville and Vincent (2013); Khatana, Narang and Thada (2018)], so we use a DNN, which is a nonlinear approximator, to approximate the value function and improve the Q-learning method.

Deep reinforcement learning combines RL with deep learning (DL). The Q-value can be represented as Q_w(s, a) using a DNN with two convolutional layers and two fully connected layers that are parameterized by a set of parameters w = {w_1, w_2, …, w_n}. Each hidden layer is composed of nonlinear neurons, each of which transforms a linear combination of its inputs into an output value using a non-linear activation function (e.g., sigmoid, tanh, ReLU, etc.). The output of the jth neuron in layer i can be formulated as
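A plausible form of this neuron output, under the usual weighted-sum-plus-bias assumption, is the following, where y_k^{i-1} are the outputs of the previous layer, w_{j,k}^{i} and b_j^{i} are the weight and bias of the jth neuron in layer i, and f(·) is the activation function (these symbol names are assumptions, not taken from the paper):

\[
y_j^{i} = f\Bigl( \sum_{k} w_{j,k}^{i}\, y_k^{i-1} + b_j^{i} \Bigr).
\]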

The DNN can be trained to update the value function by updating the parameters w, including the weights and biases. The best fitting weights can be learned by iteratively minimizing the loss function L(w), which is the mean-squared error (MSE) between the estimated value and the target value, i.e.,

where w are the parameters of the neural network and Q_t is the target value. The error between the target value and the estimated value Q_w(s_t, a_t) is called the temporal-difference (TD) error, denoted as
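Plausible forms of the loss function and the TD error, consistent with the target value Q_t defined earlier (δ_t is an assumed symbol for the TD error), are:

\[
L(w) = \mathbb{E}\bigl[ (Q_t - Q_w(s_t, a_t))^2 \bigr], \qquad
\delta_t = Q_t - Q_w(s_t, a_t).
\]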

Since the DNN may make the training of the RL algorithm unstable and cause it to diverge, due to the non-stationary targets and the correlations between samples, we adopt a target network with fixed parameters w⁻ that is updated on a slower cycle, together with experience replay, which stores the experience (s_t, a_t, r_t, s_{t+1}) in a replay buffer D and randomly samples a mini-batch of experience to train the network. The target value and loss function then become:

where the parameters w used for approximating the estimated value are updated at every step, while the fixed parameters w⁻ used for approximating the target value are updated every fixed number of steps. The stochastic gradient descent method is applied to minimize the loss function, and the update of the parameters w is defined as follows:

where ∇_w Q_w(s_t, a_t) is the gradient of Q_w(s_t, a_t) with respect to w.
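Plausible forms of the target value and loss with the target network and replay buffer, together with the resulting stochastic gradient update (the symbols follow the assumptions above), are:

\[
Q_t = r_t + \gamma \max_{a} Q_{w^{-}}(s_{t+1}, a), \qquad
L(w) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim D}\bigl[ (Q_t - Q_w(s_t, a_t))^2 \bigr],
\]
\[
w \leftarrow w + \alpha \bigl( Q_t - Q_w(s_t, a_t) \bigr)\, \nabla_w Q_w(s_t, a_t).
\]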

The deep reinforcement learning algorithm performed by mobile users for computation offloading decision making is presented in Tab. 1. For each state, the agent chooses an action randomly with probability 1−ε, and with probability ε chooses the action with the maximum action-value function, which is called the ε-greedy strategy. When the agent performs the action in state s_t and receives the immediate reward r, it observes the subsequent state s_{t+1} and approximates the action-value function by the DNN. After the convergence of the action-value functions, each mobile user can select the overhead-aware optimal computation offloading action based on its state to minimize the total overheads of all users.

      Table 1: Deep reinforcement learning based computation offloading algorithm
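Since Tab. 1 is referenced above, the following Python sketch illustrates one way such a training loop could look. It assumes a small fully connected Q-network (the paper describes a network with two convolutional and two fully connected layers), a replay buffer, the ε-greedy rule as worded in the text, and a periodically updated target network; the environment interface (env.reset, env.step), network sizes and hyper-parameters are illustrative assumptions, not the paper's exact algorithm.

```python
import random
from collections import deque

import torch
import torch.nn as nn
import torch.optim as optim


class QNetwork(nn.Module):
    """Small fully connected Q-network: state -> Q-values for {local, offload}."""

    def __init__(self, state_dim: int, num_actions: int = 2, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, x):
        return self.net(x)


def train(env, state_dim, episodes=1000, gamma=0.9, lr=0.01, eps=0.9,
          batch_size=32, buffer_size=10000, target_update=100):
    """DQN-style training loop for the offloading decision (illustrative sketch)."""
    q_net = QNetwork(state_dim)
    target_net = QNetwork(state_dim)
    target_net.load_state_dict(q_net.state_dict())   # fixed parameters w-
    optimizer = optim.SGD(q_net.parameters(), lr=lr)
    replay = deque(maxlen=buffer_size)                # replay buffer D
    step = 0

    for _ in range(episodes):
        state = torch.as_tensor(env.reset(), dtype=torch.float32)
        done = False
        while not done:
            # epsilon-greedy as worded in the paper: greedy with probability eps
            if random.random() < eps:
                action = int(q_net(state).argmax())
            else:
                action = random.randrange(2)

            next_state, reward, done = env.step(action)
            next_state = torch.as_tensor(next_state, dtype=torch.float32)
            replay.append((state, action, reward, next_state, done))
            state = next_state
            step += 1

            if len(replay) >= batch_size:
                batch = random.sample(replay, batch_size)
                s, a, r, s2, d = zip(*batch)
                s, s2 = torch.stack(s), torch.stack(s2)
                a = torch.tensor(a)
                r = torch.tensor(r, dtype=torch.float32)
                d = torch.tensor(d, dtype=torch.float32)

                # target value uses the slowly-updated target network
                with torch.no_grad():
                    q_target = r + gamma * (1 - d) * target_net(s2).max(dim=1).values
                q_est = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

                loss = nn.functional.mse_loss(q_est, q_target)  # MSE loss L(w)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            if step % target_update == 0:
                target_net.load_state_dict(q_net.state_dict())

    return q_net
```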

      4 Simulation results and discussion

In this section, we assess the performance of the proposed deep RL based computation offloading decision method compared with two baseline schemes. In the simulation scenario, there are 10 small cells randomly deployed. The transmission power of UE m is set to p_m = 100 mW. The spectrum bandwidth is set to W = 10 MHz, and the additive white Gaussian noise power is σ = -100 dBm. The channel gain model presented in the 3GPP standardization is adopted here. We use face recognition as the computation task [Soyata, Muraleedharan, Funai et al. (2012)]. The size of the computation input data B_m (KB) and the total number of CPU cycles D_m (Megacycles) are randomly distributed in the range [1000, 10000]. The computational capability of a mobile user m is assigned at random from the set {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} GHz, which reflects the heterogeneity of mobile users' computational capabilities. The total computational capability of the MEC server is f^c = 100 GHz. The weighting factors of computation time and energy consumption are set correspondingly.

First, we demonstrate the convergence of the proposed deep RL algorithm. Fig. 2 shows the total reward of all UEs in every episode for different learning rates. As we can see, with a learning rate of 0.01 the proposed learning strategy obtains a reward per episode that fluctuates around -400 after 1000 episodes, while with learning rates of 0.001 and 0.0001 the reward per episode fluctuates around -500 and -450, respectively, after 1000 episodes. As expected, different learning rates result in different convergence performance, and the algorithm with a learning rate of 0.01 outperforms the other learning rates. The fluctuation of the curve after the algorithm converges is due to the ε-greedy strategy adopted here, under which users do not always choose the action with the maximum action-value function but sometimes choose an action at random.

      Figure 3: Computational capability versus total overheads with different schemes

We now show the performance of the proposed scheme in comparison with the baseline methods, including the local computation policy, which executes all the computational tasks on the local mobile users' equipment, and the edge computation policy, which offloads all the tasks of the UEs to the MEC server for edge computing. Fig. 3 demonstrates that as the computational capability increases, the total overheads of the edge computation policy and the proposed learning algorithm decrease, because the change in the MEC server's computational capability influences the computation offloading policy of the mobile users. With increasing computational capability of the MEC server, the edge computation strategy performs better than local computation due to the server's multi-tasking capability. However, the baseline methods are less effective than the proposed learning method, because the proposed method can obtain the optimal overhead-aware policy according to its learning results.

Fig. 4 shows the relationship between the number of mobile users and the total overheads of all the mobile users. The total overheads increase gradually as the number of users grows. The overhead generated by the edge computation method gradually becomes less than that of the local computation method as the number of users, and thus the number of computation tasks to execute, increases. The local computation policy consumes more time and energy than the other schemes on account of the limited local computational capability when the number of users increases. Compared with the baseline methods, the proposed learning algorithm always obtains the minimum overhead, which means the proposed scheme can achieve the optimal computation offloading decision for reducing the latency and energy consumption and improving the efficiency.

      Figure 4: Number of mobile users versus total overheads with different schemes

The assignment of the weighting factors represents different states of the users. A mobile user that is sensitive to delay will assign a larger proportion to the time weighting factor, while a user in a low-battery state will assign a larger proportion to the energy weighting factor in the overhead computation. Fig. 5 shows that when the weighting factor of time increases from 0 to 1 (while the proportion of energy decreases from 1 to 0 accordingly), the total overheads rise, due to the fact that the computational and transmission time occupy a larger proportion of the total overheads. As we can see from the above results, the decision-making performance of the proposed learning algorithm is better than that of the baseline methods in terms of the total overheads of all the mobile users.

      Figure 5: Weighting factor of time versus total overheads with different schemes

      5 Conclusion

In this paper, we propose a deep reinforcement learning approach for the computation offloading decision problem in mobile edge computing. The problem is formulated as minimizing the total overheads of all the users, who can either execute tasks on their local devices or offload the computation to the MEC server. In order to solve this problem, we apply a deep neural network within the RL framework to approximate the action-value function and obtain the overhead-aware optimal computation offloading strategy based on the deep Q-learning method. The performance of the proposed method is compared with two baseline methods. Simulation results show that the proposed policy achieves better performance than the baseline methods in terms of total overheads, which reduces the latency and energy consumption and enhances the computation efficiency.

Acknowledgement: This work was supported by the National Natural Science Foundation of China (61571059 and 61871058).
