
A Real-time Job-shop Scheduling Method Based on Adaptive Agent

Xinli Xu, Xiangli Wang, Wanliang Wang∗
College of Information Engineering, Zhejiang University of Technology,

Zhaohui LiuQu, Hangzhou 310014, P. R. China

Abstract: - Combining the intelligent ant with reinforcement learning, an on-line job-shop scheduling model based on an adaptive agent is proposed. In the learning process, the intelligent ant makes decisions according to the past rewards and the immediate reward. When the production environment changes, e.g. the machines or the orders change, the adaptive agent adjusts itself accordingly and the optimal assignment of resources is finally realized. The simulation results show that the method is effective.

Key-Words: - job-shop scheduling, agent, reinforcement learning, ant colony algorithm

∗ Corresponding author. Tel: +86-571-88320163; Fax: +86-571-88320101

1 Introduction
In classical job-shop scheduling, every available machine in the job shop is assumed to be in a normal state; that is, the state of a machine does not change during scheduling and machining. In the actual plant environment, however, it is difficult to guarantee that the state of a machine does not vary with its performance and with external factors. Moreover, if an urgent order arrives at the job shop, the machining plan must be interrupted and rescheduled or adjusted. How to adjust the scheduling scheme automatically in response to changes in the production environment, and how to strengthen the adaptability of the system, are therefore highly important questions. With the development of Distributed Artificial Intelligence (DAI) and the appearance of the multi-agent concept, it has become possible to solve dynamic and complicated scheduling problems by virtue of the autonomy and cooperation of multiple agents [1, 2]. Negotiation and cooperation among agents, each with its own objective and autonomy, have been used to solve such problems [3]. Current research on multi-agent technology and its application to scheduling includes dynamic job-shop scheduling using reinforcement learning agents [4] and agent-based dynamic scheduling and rescheduling negotiation algorithms [5], among others. This paper analyzes the basic framework of an agent based on reinforcement learning and proposes a modeling method that combines the intelligent ant with reinforcement learning. Finally, a simulation example demonstrates the validity of the method.

2 Reinforcement Learning
Drawing on related subjects such as cybernetics, statistics and psychology, reinforcement learning has developed over a long period, but it was not widely studied in artificial intelligence and machine learning until the late eighties and early nineties. Because it can adapt without a teacher, it is one of the kernel techniques for designing intelligent agents [6, 7].

2.1 Model of Reinforcement Learning
Learning is called reinforcement learning (Watkins 1992, Yen 1996) if it has the following characteristics for adapting to the environment: (1) the agent does not wait passively but perceives the environment on its own; (2) the reward the environment gives for an action is an estimate (an award or a penalty); and (3) the agent can obtain knowledge from these actions and estimates, improve its action scheme to adapt to the environment, and reach the expected aim. A typical model of reinforcement learning is shown in Fig.1. Generally, it includes a discrete set of states $S$, a discrete set of actions $A$, and a set of rewards $R$. At each interaction step $t$, the agent perceives its state $s_t$ in the environment and the immediate reward $r_t$ for the previous action. According to a certain strategy, an action $a_t$ is chosen. As a result of action $a_t$, the state $s_t$ changes into $s_{t+1}$ and an immediate reward $r_{t+1}$ is returned by the environment.


Notably, the environment is not fixed: the same action selected in the same state may result in a different state and/or a different reward, although it is usually stable in probability. The final objective of learning is to obtain an action function, namely, which action to choose in each state of the environment.

Fig.1 Model of Reinforcement Learning
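As an illustrative sketch only (not part of the original paper), the interaction cycle of Fig.1 can be written as follows in Python; the environment and agent objects and their methods are assumptions made for illustration:

    # Minimal sketch of the reinforcement learning loop in Fig.1.
    # The env/agent objects and their methods are assumptions for illustration.
    def run_episode(env, agent, max_steps=1000):
        s = env.reset()                      # initial state s_0
        r = 0.0                              # no reward before the first action
        for t in range(max_steps):
            a = agent.choose_action(s, r)    # strategy uses state s_t and reward r_t
            s_next, r, done = env.step(a)    # environment returns s_{t+1} and r_{t+1}
            agent.update(s, a, r, s_next)    # improve the action scheme
            s = s_next
            if done:
                break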

2.2 General Reinforcement Learning
Common learning algorithms mainly include the Monte Carlo method [8] and the temporal-difference method [9, 10]. The Monte Carlo method is a purely trial-and-error method and is also the simplest and most intuitive form of reinforcement learning. It requires no prior knowledge and can learn entirely from direct experience, such as sampled states, actions and rewards. In the Monte Carlo method, a model is not built step by step during exploration. Lacking a model of the environment, the state-value function $V(s)$ is not enough to decide the action (because the successor states are hardly known), so it is replaced with the action-value function $Q(s,a)$ to guide the action of the agent. The temporal-difference method can be regarded as a combination of dynamic programming and the Monte Carlo method, namely a combination of bootstrapping and sampling. Like the Monte Carlo method, the temporal-difference method needs no model of the environment and can learn directly from interaction with it, so only the value of $Q$ is used to guide the decision. Like dynamic programming, however, it updates the estimate of the value function based on the estimates of other states, without waiting for the final result.
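For reference, the temporal-difference idea described above is commonly written as the standard one-step Q-value update; this generic form is added here for illustration and is not taken from the paper (the learning rate alpha and discount gamma are assumed values):

    # Standard one-step temporal-difference (Q-learning) update, generic form.
    # Q is a dict of dicts: Q[state][action] -> estimated value.
    def td_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
        best_next = max(Q.get(s_next, {0: 0.0}).values())   # bootstrap from next state
        q_old = Q.setdefault(s, {}).get(a, 0.0)
        Q[s][a] = q_old + alpha * (r + gamma * best_next - q_old)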

All of the above methods are based on Markov decision processes and are not well suited to practical production scheduling: they aim at the optimal strategy, and their speed of convergence is hard to guarantee for large-scale problems. Therefore, combining the ant colony algorithm with reinforcement learning, this paper presents an adaptive agent model to solve the dynamic and complicated job-shop scheduling problem.

3 Job-shop Scheduling Method Based on Adaptive Agent
In reinforcement learning, the agent takes an action on the environment in a certain state, and the environment then gives an estimate of that action. The bigger the reward, the higher the probability that the action is adopted; the value of the reward is used to guide the selection of actions [11]. There is a strong resemblance between this and the ant, which selects a path according to pheromone information in the ant algorithm [12]. Combining intelligent ants with reinforcement learning, this paper presents an adaptive agent model to solve the dynamic and complicated job-shop scheduling problem.

3.1 Model of Adaptive Agent
Firstly, a Resource Agent, a Task Agent and a Harmony Agent are defined for job-shop scheduling. The Resource Agent assigns the machining task to the resource and ensures that the task can be completed normally. The Task Agent looks for a resource on which the task can be accomplished and supervises its successful completion. The Harmony Agent is in charge of the management, negotiation and control of the scheduling process for the whole job shop. It must make correct assignments according to the machining environment and, at the same time, be adaptable: when the environment changes, the Harmony Agent must adjust the assignment strategy to satisfy the requirements of the new environment. Through the interaction of the Harmony Agent, Resource Agent and Task Agent, the adaptive agent model can learn continually and realize the optimal assignment of tasks. The circumstances of the resources and tasks are regarded as the environment; when the environment changes, the Harmony Agent learns in order to adjust the task assignment strategy automatically. The decision-making process of the adaptive agent involves four elements: a set of states $S$, a set of actions $A$, the mapping from states and actions to immediate rewards $R: S \times A \rightarrow R$, and a state transfer function $T$. The set of tasks is regarded as $S$. An arrangement step is described as an action {task, machine}, where a task is composed of a job and one of its operations, so the objective of agent-based scheduling is to realize the task assignment {job, operation, machine} in a certain state. In the course of assignment, the Harmony Agent is regarded as the intelligent ant, which makes a decision according to the state and the task, selects the corresponding action, realizes the transfer of the state, and then obtains an immediate reward.


Once one pass of learning is finished, the path traversed by the ant is recorded; at the same time the objective value is computed, a global reward is obtained, and the global coefficients are updated. For a given state, the action with the bigger reward value has a bigger probability in the next selection. If the better action is chosen in every state, the objective value will be better and the reward values along the path will be bigger in the global update. Repeated in this way, the process is a positive feedback process. For an $n$-job $m$-machine problem, every job $i$ includes $k_i$ operations, where $i$ represents the $i$th job. The set of tasks is $\{1, 2, \ldots, t\}$ where $t = \sum_{i=1}^{n} k_i$. The set of states is $\{s_1, s_2, \ldots, s_t\}$ and the set of actions is $\{a_i = \{\text{job, operation, machine}\},\ i = 1, 2, \ldots, t \times m\}$. Based on the resources and tasks, the task of the adaptive agent in state $s_i$ is to take action $a_i$ and to distribute the task to the corresponding machine resource. When one pass of learning is over, the assignment strategy $\{s_1(a_1), s_2(a_2), \ldots, s_t(a_t)\}$ is obtained.

3.2 Reward Rules
In the decision-making process, the intelligent ant chooses the corresponding action based on the values of award and penalty, so this estimate must be correlated with the optimization objective. To converge to the optimum in the learning process, a method combining local and global rewards is applied. Local reward rules are given according to whether the buffer and the machine are busy or idle. The state of the buffer is described by emptyflag: if emptyflag = true the buffer is empty, and if emptyflag = false it is not; emptydegree represents the degree of emptiness. The state of the machine is denoted by busy: if busy = true the machine is busy, and if busy = false it is idle; the degree of busyness of the machine is described by busydegree. Let $f_1$ and $f_2$ denote the mapping functions from the busyness and idleness of the buffer and machine to the reward value. The local reward rule is then

    if (emptyflag = false and busy = true)  r = f1(emptydegree);
    else if (emptyflag = true and busy = true)  r = f2(busydegree);
    else if (emptyflag = true and busy = false)  r = max_reward;
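As a minimal sketch of this local reward rule (the concrete forms of f1 and f2 are not specified in the paper, so simple linear mappings are assumed here):

    # Sketch of the local reward rule; f1 and f2 are assumed linear mappings.
    def local_reward(emptyflag, busy, emptydegree, busydegree, max_reward=1.0):
        f1 = lambda e: max_reward * e           # emptier buffer -> larger reward
        f2 = lambda b: max_reward * (1.0 - b)   # less busy machine -> larger reward
        if not emptyflag and busy:
            return f1(emptydegree)
        if emptyflag and busy:
            return f2(busydegree)
        if emptyflag and not busy:
            return max_reward
        return 0.0  # remaining case (buffer occupied, machine idle) is not covered by the rules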

In the global update rules, a global reward is given according to whether the strategy obtained after one pass of learning is good or bad, and this reward is then assigned to every state of the strategy. The global reward $r_g$ is computed as

    $r_g = c_1 Q_1 / \text{totaltime} + c_2 Q_2 \cdot \text{balance} + c_3 Q_3 \cdot \text{utility}$    (1)

where $c_1$, $c_2$ and $c_3$ are weight coefficients; totaltime, balance and utility are the whole processing time, the ratio of load balance and the machine utility, respectively; and $Q_1$, $Q_2$ and $Q_3$ are adjustment parameters.
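A direct transcription of equation (1), given only as an illustrative sketch; the default coefficient values below are assumptions that mirror the simulation setting of Section 4, where only the finishing time is used ($c_1 = 1.0$, $Q_1 = 8$):

    # Global reward of equation (1); default coefficients are placeholders
    # reflecting the finishing-time-only objective used in Section 4.
    def global_reward(totaltime, balance, utility,
                      c1=1.0, c2=0.0, c3=0.0, Q1=8.0, Q2=1.0, Q3=1.0):
        # Shorter total processing time, better load balance and higher machine
        # utility all increase the global reward.
        return c1 * Q1 / totaltime + c2 * Q2 * balance + c3 * Q3 * utility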

3.3 State Transfer Rules
If the system is in state $s$, the intelligent ant makes a decision based on the resources and on the results of the past few passes of learning, and selects a better action. To avoid getting stuck in a local optimum, the probability of every action is computed and roulette-wheel selection is used: if the probability is big the chances of selection are many, and if the probability is small the chances are few. The transfer probability of every action is

    $p(s, a_i) = \dfrac{x_i}{\sum_j x_j}$    (2)

where $p(s, a_i)$ is the probability of taking action $a_i$ in state $s$, and $x_i$ is the reward value of selecting action $a_i$,

    $x_i = x_{i-1} + \alpha (r - x_{i-1}), \qquad 0 \le \alpha \le 1$    (3)

where $x_{i-1}$ is the reward value before this local update, $\alpha$ is the learning ratio, and $r$ is the immediate reward for selecting action $a$.

3.4 Updating the Reward Coefficients
Updating the reward coefficients includes a local and a global update. If the intelligent ant in state $s$ selects action $a$, the local update is executed for the pair $(s, a)$ according to equation (3). When one pass of learning is completed, a global reward given by equation (1) is obtained; at the same time the strategy $\{(s_1, a_1), (s_2, a_2), \ldots, (s_t, a_t)\}$ of this pass is recorded and the rewards along the path are updated according to equation (3).
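A minimal sketch of the roulette-wheel selection of equation (2) and the local update of equation (3); the data layout (a dictionary of reward values per candidate action) is an assumption made for illustration:

    import random

    # Roulette-wheel selection over reward values x (equation (2)).
    def select_action(x_values):
        # x_values: dict mapping each candidate action to its reward value x_i
        total = sum(x_values.values())
        if total <= 0:
            return random.choice(list(x_values))   # fall back to a uniform choice
        pick, acc = random.uniform(0, total), 0.0
        for action, x in x_values.items():
            acc += x
            if pick <= acc:
                return action
        return action                              # numerical edge case

    # Local update of the selected action's reward value (equation (3)).
    def local_update(x_prev, r, alpha=0.5):
        return x_prev + alpha * (r - x_prev)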


3.5 Tradeoff between Exploration and Exploitation
In reinforcement learning, the environment must be explored continually in order to find the optimal action. The question is whether to search for a new action or to keep adopting the best of the known actions. The former is risky, because a new action may not be as good as the known optimum; the latter may settle on a near-optimal action and never find the global optimum. To address this problem, this paper keeps a set of better records. In the initial stages of learning, the intelligent ant explores new paths, and each path is saved until the number of saved paths reaches a given number $L$. Once the number of records reaches $L$, before each subsequent pass the ant decides, according to a probability $P$, whether to explore a new path or to select a strategy from the historical records. If a historical record is to be selected, the probability of choosing each record is computed as

    $p_i = \dfrac{R_i}{\sum_{i=1}^{L} R_i}$    (4)

where $R_i$ is the global reward of the $i$th historical record. If new exploration is chosen, the objective value of each pass is compared with the objective values of the historical records: if it is better than the worst known record, the worst record is replaced by it; otherwise the set of historical records is left unchanged.
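A sketch of this record-keeping scheme, added only for illustration; the record structure is an assumption, and the mapping of $P$ (records reused with probability $P$, new exploration with probability $1-P$) follows the remark in Section 4 that the exploration probability drops from 1 to $1-P$:

    import random

    # Keep the L best strategies found so far; each record is (global_reward, strategy).
    def update_records(records, global_reward, strategy, L):
        if len(records) < L:
            records.append((global_reward, strategy))
        else:
            worst = min(range(len(records)), key=lambda i: records[i][0])
            if global_reward > records[worst][0]:
                records[worst] = (global_reward, strategy)

    # Decide between exploring a new path and reusing a historical record.
    # Records are chosen with probabilities proportional to R_i (equation (4)).
    def choose_strategy(records, L, P):
        if len(records) < L or random.random() >= P:
            return None                               # signal: explore a new path
        rewards = [r for r, _ in records]
        return random.choices(records, weights=rewards, k=1)[0][1]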

3.6 Realization of the Adaptive Agent with Reinforcement Learning
In this paper the idea of the ant colony algorithm is combined with reinforcement learning. Through the intelligent ant, task assignment is realized in the course of learning, and the convergence of the algorithm is accelerated by the method of saving better records. Based on the environment, the intelligent ant makes decisions and realizes the optimal assignment of tasks; at the same time, past learning experience is saved as reward values for subsequent decision-making, and the optimal scheduling of the shop is finally realized. The flow chart of the algorithm is shown in Fig.2 and the detailed steps are as follows.

Step 1 Initialization. Read the information of the tasks and machine resources and set the initial reward value of every decision to 0.

[Fig.2 here: flow chart with the stages Initialization, Path Exploration, Local Update, Study Ends?, Global Update, Records Update, Select History Record?, New Exploration?, and Selecting Path]

Fig.2 The flow chart of the reinforcement learning algorithm

Step 2 Path Exploration. In the course of exploration, the better action is selected according to the historical rewards and the immediate reward, and the reward value is then updated.

Step 3 Global Update. When one pass of learning is finished, the path the ant has traversed is recorded and the global reward is updated.

Step 4 Tradeoff between Exploration and Exploitation. If the number of historical records has reached L, decide according to the probability P whether to search a new path or to select a historical record.

Step 5 Records Update. If a historical record is chosen, machining is carried out according to that record. Otherwise, if the objective value obtained is better than that of the worst historical record, the worst record is replaced by it.

Step 6 Termination. Because the algorithm learns on-line and its maximum number of iteration steps is decided by the number of tasks, the maximum step is regarded as infinity.

4 Simulation
Here a flexible job-shop scheduling problem is given as an example of on-line learning based on adaptive agents. Considering that the ratio of load and the utility of machines are complicated in practice, the optimization objective is only the finishing time. For the case of unchanged machines and tasks shown in Table 1, the finishing time obtained for the given order is shown in Fig.3. The parameters are set as follows: $\alpha = 1.0$, $c_1 = 1.0$, $Q_1 = 8$, the maximum number of iteration steps is 100, and $0 < P < 0.8$.


During learning, the intelligent ant makes decisions based on the historical rewards and the immediate reward, and the task is then distributed to the appropriate machine resource. With the probability of exploration decreasing from 1 to $1-P$, the curve in Fig.3 becomes stable at about the 70th step and finally converges to 23.

Table 1 Machining Task

Job  Operation  Machine  Processing time
 1       1         1           5
 1       1         2           6
 1       1         3           6
 1       2         3           3
 1       3         1           3
 2       1         1           5
 2       1         2           6
 2       1         3           6
 2       2         1           3
 2       3         4           6
 3       1         1           4
 3       1         5           5
 3       2         2           4
 3       2         3           8
 3       3         3           3
 4       1         4           5
 4       2         5           4
 4       3         4           4
 5       1         5           3
 5       2         4           6
 5       3         1           5
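For use with the sketches above, the rows of Table 1 could be encoded as simple (job, operation, machine, processing_time) tuples; this encoding is only an illustrative assumption, not a format from the paper:

    # First rows of Table 1 as (job, operation, machine, processing_time) tuples.
    tasks = [
        (1, 1, 1, 5), (1, 1, 2, 6), (1, 1, 3, 6),   # operation 1 of job 1 may run on machines 1-3
        (1, 2, 3, 3), (1, 3, 1, 3),
        (2, 1, 1, 5), (2, 1, 2, 6), (2, 1, 3, 6),
        # ... remaining rows of Table 1 omitted here
    ]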

Fig.3 Convergence of learning

In addition, the machine resources and the orders are changed in order to validate the adaptability of the agents. If the machines change, for example machine 2 breaks down, then the operations in Table 1 that require machine 2 (the ticked records in Table 1) can no longer be processed. In the course of on-line learning, if the machine breaks down at the 100th step while the other parameters remain unchanged, the adaptive agents adjust themselves automatically; with this automatic adjustment the curve becomes stable again at 25 at about the 180th iteration in Fig.4(a). If the order changes, the historical reward values are no longer useful for decision-making and the agent must learn anew. Supposing that the order becomes the one in Table 2 at the 100th step, the convergence curve shown in Fig.4(b) is obtained; it converges again at about the 180th step.

Fig.4 Convergence Figure with Changed Environment

Table 2 Changed Order

Job  Operation  Machine  Processing time
 3       1         1           2
 3       1         1           4
 3       1         5           5
 3       2         2           4
 3       2         3           8
 3       3         3           7
 1       2         3           3
 1       3         1           5
 2       2         1           3
 1       1         2           5
 2       1         2           6
 2       3         4           6
 4       1         4           7
 4       2         5           4
 5       1         5           3
 5       2         4           2
 4       3         4           4
 5       3         1           5
 1       1         3           4
 2       1         3           6

5 Conclusion
In this paper, reinforcement learning is combined with the intelligent ant, and a job-shop scheduling model based on the adaptive agent is established to realize on-line learning. During learning, the intelligent ant takes the corresponding action based on the historical rewards and the immediate reward, and the machining task is reasonably arranged on the corresponding machine resource. Because the manufacturing environment changes continually, for example the machine resources or the orders often change, the adaptive agents adjust themselves automatically to realize the optimal assignment of resources.


The model has been set up and the simulation example validates the method. Further study is still needed, however, for example on parameter settings and on the parallel decision-making of multiple ants.

Acknowledgement
This work is supported by the National Natural Science Foundation of China (No.60374056), the Significant Science and Technology Item of Zhejiang Province (No.2004C11011) and the Science and Technology Developing Foundation of Zhejiang University of Technology.

References:
[1] Hon Wai Chun and Rebecca Y.M. Wong, An

agent-based negotiation algorithm for dynamic scheduling and rescheduling, Advanced Engineering Informatics, No.17, 2003, pp.1-22.

[2] P.C. Pendharkar, A computational study on design and performance issues of multi-agent intelligent systems for dynamic scheduling environments, Expert Systems with Applications, No.16, 1999, pp. 121-133.

[3] Rao Yunqing, Xie Chang and Li Shuxia, Research on Multi-Agent-Based Scheduling Approach for Job Shop Scheduling, China Mechanical Engineering, Vol.15, No.10, 2004, pp.873-877.

[4] M. Emin Aydin and Ercan Öztemel, Dynamic job-shop scheduling using reinforcement learning agents, Robotics and Autonomous Systems, No.33, 2000, pp. 169-178.

[5] Jihad Reaidy, Pierre Massottea and Daniel Diep, Comparison of negotiation protocols in dynamic agent-based manufacturing systems, Int. J. Production Economics, 2005.

[6] B. Hayes-Roth, Architectural foundations for real-time performance in intelligent agents, The Real-Time Systems, No.2, 1990, pp. 99-125.

[7] L.J. Lin, Self-improving reactive agents based on reinforcement learning, planning and teaching, Machine Learning, No.8, 1992, pp. 293-321.

[8] Carlos D. Paternina-Arboleda and Tapas K. Das, A multi-agent reinforcement learning approach to obtaining dynamic control policies for stochastic lot scheduling problem, Simulation Modeling Practice and Theory, No.13, 2005, pp.389-406.

[9] W. Zhang and T. Dietterich, A reinforcement learning approach to job shop scheduling, Proceedings of the IJCAI-95, 1995, pp.1114-1120.

[10] Attila Lengyel, Itsuo Hatono and Kanji Ueda, Scheduling for on-time completion in job shops using feasibility function, Computers & Industrial Engineering, No.45, 2003, pp. 215-229.

[11] Wang Xiaofang and Yang Jiaben, Self-Adaptive Agent Model for Task Allocation in Manufacturing System, Computer Integrated Manufacturing System, Vol.7, No.8, 2001, pp.17-21.

[12] Chen Wen, Wang Shilong and Huang He, Research on ant colony algorithm for multi-agent based workshop dynamic scheduling, Modular Machine Tool & Automatic Manufacturing Technique, No.6, 2004, pp.56-59.
