Reinforcement Learning for the Game of Tetris Using Cross Entropy


Playing Tetris with RL using CE

Roee Zinaty and Sarai Duek
Supervisor: Sofia Berkovich
Reinforcement Learning for the game of Tetris using Cross Entropy

Tetris Game
The Tetris game is composed of a 10x20 board and 7 types of blocks that can spawn (the standard I, O, T, S, Z, J and L tetrominoes).

Each block can be rotated and translated to the desired placement. Points are given upon completion of rows.
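As a concrete illustration of these board mechanics, here is a minimal sketch (not our actual game engine) in which a NumPy array stands in for the 10x20 board and full rows are detected and collapsed:

    import numpy as np

    BOARD_WIDTH, BOARD_HEIGHT = 10, 20

    def new_board():
        # 0 = empty cell, 1 = occupied cell; row 0 is the top of the board
        return np.zeros((BOARD_HEIGHT, BOARD_WIDTH), dtype=np.int8)

    def clear_full_rows(board):
        """Remove completed rows and return (new_board, rows_cleared)."""
        keep = ~board.all(axis=1)                    # rows that are not yet full
        rows_cleared = int(BOARD_HEIGHT - keep.sum())
        if rows_cleared:
            empty = np.zeros((rows_cleared, BOARD_WIDTH), dtype=np.int8)
            board = np.vstack([empty, board[keep]])  # surviving rows drop down
        return board, rows_cleared

Filling the bottom row of a fresh board and calling clear_full_rows would return an empty board together with a count of 1; that count of cleared rows is the quantity our scoring tracks.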

Our Tetris Implementation
We used a version of the Tetris game which is common in many computer applications (various machine learning competitions and the like). We differ from the known game in several ways:
- We rotate the pieces at the top and then drop them straight down, simplifying the game and removing some possible moves.
- We don't award extra points for combos; we simply record how many rows were completed in each game.

Reinforcement Learning
A form of machine learning where each action is evaluated and then awarded a certain grade: good actions are awarded points, while bad actions are penalized. Mathematically, it is defined as follows:

    V(s) = R(s, a) + V(s')

where V(s) is the value given to the state s, R(s, a) is the reward function, which depends on the state s and the action a, and V(s') is the value of the next resulting state.

(Diagram: the Agent selects an Action on the World, and the World returns an Input to the Agent.)

Cross Entropy
A method for achieving a rare-occurrence result from a given distribution in a minimal number of steps/iterations. We need it to find the optimal weights of the given features in the Tetris game (our value function), because the chance of success in Tetris is much smaller than the chance of failure, i.e. a rare occurrence.
This is an iterative method, using the last iteration's results and improving on them. We add noise to the CE result to prevent an early convergence to a wrong result.

CE Algorithm
For iteration t, with distribution N(mu_t, sigma_t^2):
- Draw sample vectors w^(1), ..., w^(N) and evaluate their values V(w^(1)), ..., V(w^(N)).
- Select the best samples, and denote their set of indices by I.
- Compute the parameters of the next iteration's distribution by:

    mu_{t+1} = (1 / |I|) * sum_{i in I} w^(i)
    sigma_{t+1}^2 = (1 / |I|) * sum_{i in I} (w^(i) - mu_{t+1})^2 + Z_{t+1}

Here Z_{t+1} is a constant vector of noise (dependent on the iteration).

We've tried different kinds of noise, eventually using a noise term that changes with the iteration index t.
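A minimal sketch of one such CE iteration in Python, assuming an independent Gaussian per weight; the sample count, elite fraction and noise value here are illustrative defaults, not necessarily the ones we used:

    import numpy as np

    def ce_iteration(mu, sigma, evaluate, n_samples=100, elite_frac=0.1, noise=0.0, rng=None):
        """One cross-entropy step: draw weight vectors from N(mu, sigma^2), keep the
        best-scoring ones, and refit the Gaussian (adding noise to the variance)."""
        rng = rng or np.random.default_rng()
        samples = rng.normal(mu, sigma, size=(n_samples, len(mu)))  # draw sample vectors
        scores = np.array([evaluate(w) for w in samples])           # e.g. rows cleared in one game
        n_elite = max(1, int(elite_frac * n_samples))
        elite = samples[np.argsort(scores)[-n_elite:]]              # the best samples (indices I)
        new_mu = elite.mean(axis=0)
        new_sigma = np.sqrt(elite.var(axis=0) + noise)              # noise delays premature convergence
        return new_mu, new_sigma

An iteration-dependent schedule can then simply be passed in as noise=z(t), for whatever function z the experimenter chooses.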

RL and CE in the Tetris Case
We use a certain number of in-game parameters (features) and, using the CE method, generate a corresponding weight for each, starting from a base Normal distribution. Our reward function is derived from the weights and features:

    V = sum_i w_i * f_i

where w_i is the weight of the matching feature f_i. Afterwards, we run games using the above weights, sort those weight vectors according to the number of rows completed in each game, and compute the next iteration's distributions according to the best results.
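To make this concrete, here is a sketch of how a weight vector is turned into play, using two illustrative features (maximum pile height and number of holes; our actual feature sets, listed on the next slides, are larger). The helpers legal_placements and apply_placement are assumed to exist in the game engine and are hypothetical names here:

    import numpy as np

    def features(board):
        """Two illustrative features: maximum pile height and number of holes
        (empty cells lying below a filled cell in the same column)."""
        heights = np.where(board.any(axis=0),
                           board.shape[0] - np.argmax(board, axis=0), 0)
        pile_height = heights.max()
        holes = sum(int(heights[c]) - int(board[:, c].sum())
                    for c in range(board.shape[1]))
        return np.array([pile_height, holes], dtype=float)

    def board_value(board, weights):
        # the reward/value function: a weighted sum of the feature values
        return float(np.dot(weights, features(board)))

    def best_placement(board, piece, weights, legal_placements, apply_placement):
        """Evaluate every legal rotation/column of the current piece and keep the
        placement whose resulting board scores highest under the current weights."""
        best, best_value = None, -np.inf
        for placement in legal_placements(board, piece):
            value = board_value(apply_placement(board, piece, placement), weights)
            if value > best_value:
                best, best_value = placement, value
        return best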

Our Parameters: First Try
We initially used a set of parameters detailing the following features:
- Max pile height.
- Number of holes.
- Individual column heights.
- Difference of heights between the columns.
Results from using these features were bad and did not match the original paper they were taken from.

Our Parameters: Second Try
Afterwards, we tried the following features, whose results are displayed next:
- Pile height
- Max well depth (width of one)
- Sum of wells
- Number of holes
- Row transitions (occupied to unoccupied, summed over all rows)
- Altitude difference between reachable points
- Number of vertically connected holes
- Column transitions
- Weighted sum of filled cells (a higher row counts more than a lower one)
- Removed lines (in the last move)
- Landing height (of the last move)

2-Piece Strategy
However, we tried using a 2-piece strategy (look at the next piece and plan accordingly) with the first set of parameters. We thus achieved superb results: after ~20 iterations of the algorithm, we scored 4.8 million rows on average!
The downside was running time: approx. 1/10 of the speed of our normal algorithm which, coupled with the longer games produced by the better weights, resulted in very long running times. Only two games were run using the 2-piece strategy, and they ran for about 3-4 weeks before ending abruptly (the computer restarted).

The Tetris Algorithm
1. A new Tetris block spawns.
2. Use two blocks for the strategy?
   - Yes: compute the best action using both blocks and the feature weights.
   - No: compute the best action using the current block only and the feature weights.
3. Move the block according to the best action.
4. Update the board if necessary (collapse full rows, handle a loss).
5. Upon loss, return the number of completed rows.

Past Results

    # of Games Played During Learning | Mean Score | Method / Reference
    Non-reinforcement learning:
    n.a.                              | 631,167    | Hand-coded (P. Dellacherie) [Fahey, 2003]
    3,000                             | 586,103    | GA [Bohm et al., 2004]
    Reinforcement learning:
    120                               | ~50        | RRL-KBR [Ramon and Driessens, 2004]
    1,500                             | 3,183      | Policy iteration [Bertsekas and Tsitsiklis, 1996]
    ~17                               | < 3,000    | LSPI [Lagoudakis et al., 2002]
    n.a.                              | 4,274      | LP+Bootstrap [Farias and van Roy, 2006]
    ~10,000                           | ~6,800     | Natural policy gradient [Kakade, 2001]
    10,000                            | 21,252     | CE+RL [Szita and Lorincz, 2006]
    10,000                            | 72,705     | CE+RL, constant noise
    5,000                             | 348,895    | CE+RL, decreasing noise

Results
Following are some results from running our algorithm with the aforementioned features (second try). Each run takes approx. two days, with an arbitrary 50 iterations of the CE algorithm. Each iteration includes 100 randomly generated weight vectors, with one game played for each to evaluate it, and then 30 games of the best result (most rows completed) for statistics.

Results: Sample Output
Below is a sample output as printed by our program.

    Performance & Weight Overview (min, max, avg weight values):
    Iteration 46, average is 163667.57 rows, best is 499363 rows
      min: -41.56542, average: -13.61646, max: 5.54374
    Iteration 47, average is 138849.43 rows, best is 387129 rows
      min: -38.91538, average: -12.93479, max: 4.42429
    Iteration 48, average is 251081.03 rows, best is 806488 rows
      min: -38.60941, average: -11.88640, max: 11.97776
    Iteration 49, average is 251740.57 rows, best is 648248 rows
      min: -38.41177, average: -11.81831, max: 7.05757

    Feature Weights & Matching STD (of the Normal distribution):
    -10.748  -20.345  -7.5491  -11.033  7.0576  -8.9337  -12.211  -0.063724  -11.804  -15.959  -38.412
     2.9659   1.4162   1.3864   1.6074  1.4932  0.93831   1.0166   0.34907    1.1931    0.7918    2.536

Results
[Figure: Weights vs. Iteration, Game 0]

Results
[Figure: STD vs. Iteration, Game 0]

Results
Each graph shows a feature's weight, averaged over the different simulations, versus the iteration number.

Results
Each graph shows the STD of a feature's weight (derived from the CE method), averaged over the different simulations, versus the iteration number.

Results
Final weight vectors per simulation, reduced to 2D space, with the average row count achieved by each weight vector (averaged over 30 games each).

Conclusions
We currently see a lack of progress in the games. Games quickly reach a good result, but swing back and forth in subsequent iterations (e.g. 100K rows in one iteration, 200K in the next, then 50K).
We can see from the graphs that the STD of the weights did not drop to near zero, meaning we might gain more from further iterations.
Another observation is that the final weight vectors are all different, meaning there was no convergence to a similar weight vector; there is room for improvement.

Possible Directions
We might try different approaches to the noise. We currently use a noise model that depends on the iteration index t, so we can try either smaller or bigger noise and check for changes.
We can try updating the distributions only partly with the new parameters, and partly from the last iteration's parameters.
We can use a certain threshold for the weights' variance: once a weight passes it, we lock that weight from changing, and thus randomize fewer weight combinations.
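A sketch of the last two ideas, assuming the same per-weight Gaussian parameters as in the CE step; alpha and the variance threshold below are illustrative values, not tuned ones:

    import numpy as np

    def smoothed_update(mu_old, sigma_old, mu_new, sigma_new, alpha=0.7):
        # take only part of the step suggested by the new elite samples,
        # keeping part of the previous iteration's parameters
        mu = alpha * mu_new + (1 - alpha) * mu_old
        sigma = alpha * sigma_new + (1 - alpha) * sigma_old
        return mu, sigma

    def lock_converged_weights(mu, sigma, threshold=0.05):
        """Freeze weights whose STD has dropped below the threshold: their sigma is
        set to 0, so later sampling no longer randomizes those components."""
        locked = sigma < threshold
        return mu, np.where(locked, 0.0, sigma), locked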
