TRANSCRIPT
Deep Imitation Learning for Playing Real-Time Strategy Games
Chuanbo Pan
Jeffrey Barratt
Motivation
Competitive computer games, despite recent progress in the area, still remain a largely unexplored application of Machine Learning (ML), Artificial Intelligence (AI), and Computer Vision. Real-Time Strategy games such as StarCraft II provide an ideal testing environment for AI and ML techniques, as:
• They run in real time and involve incomplete information (the player cannot see the whole battlefield at once).
• Users must balance making units, controlling those units, executing a strategy, and hiding information from their enemy to successfully win a game of StarCraft II.
• Simple actions made early on in the game can greatly impact later stages of the game.
Challenges
Apart from the reasons mentioned in the motivation, StarCraft II is a challenging game because:
• The action space is huge – O(n^4) possible selections, and O(n^2) possible places to move to (see the back-of-envelope count after this list).
• The game is in real-time – we can't spend forever thinking!
• The game is huge and multi-layered.
  o Have to consider attacking, producing, gathering, etc.
  o Our mini-game already includes tasks such as moving to the beacon (enemy), attacking, and moving back.
• Overfitting the data is an easy trap to fall into.
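The O(n^4) and O(n^2) figures above are easy to sanity-check with a back-of-envelope count (our arithmetic, assuming n = 84 to match the PySC2 feature-layer resolution; a rectangle selection is defined by two screen corners, a move by one screen point):

    # Back-of-envelope size of the action space on an n x n screen.
    n = 84  # PySC2 screen feature-layer resolution used in this project

    # A rectangle selection is defined by two (x, y) corners: O(n^4).
    selections = (n ** 2) ** 2

    # A move command targets a single (x, y) point: O(n^2).
    moves = n ** 2

    print(f"{selections:,} possible selections")  # 49,787,136
    print(f"{moves:,} possible move targets")     # 7,056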
Problem Definition
Figure panels:
1) The starting position of the PySC2-provided mini-game.
2) The marines can start by attacking the more dangerous targets first.
3) Splitting the marines mitigates the effect of splash damage.
4) Kiting can be used to trade more efficiently with the Zerglings.
We focus our efforts on one mini-game that involves "splitting": engaging area-of-effect units with a group of low-health ranged units, where the low-health units are moved apart so that the area-of-effect damage hits as few units at once as possible, together with other management tactics such as kiting and targeting. We collected data by playing the mini-game repeatedly and recording every action taken and the state it was made in, yielding around 10,000 state-action pairs.
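The poster does not reproduce the recording code; the following is a minimal sketch of the kind of logger that could collect such pairs during human play (the Recorder class, its method names, and the pickle format are our assumptions, not the authors' actual tooling):

    import pickle

    class Recorder:
        """Accumulates (state, action) pairs from human play for behavioral cloning."""

        def __init__(self):
            self.pairs = []

        def log(self, obs, action):
            # obs: the PySC2 observation at the moment the action was issued.
            # action: the function call (action id + arguments) the player made.
            self.pairs.append((obs.observation, action))

        def save(self, path):
            with open(path, "wb") as f:
                pickle.dump(self.pairs, f)

Calling log() inside the agent's step loop and save() after each mini-game episode would accumulate the roughly 10,000 pairs described above.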
Features and Model
We use the newly-released PySC2 [1] framework to interface with the StarCraft II environment. This API gives us the following channels to communicate with StarCraft II:
• A state feature vector that represents what the agent currently sees on the screen.
  o 17 84x84 feature layers, each representing a different aspect of the game (health, alliance status, etc.).
  o Because they preserve the spatial layout of the screen, feature layers act as "images" of what is currently seen.
• A set of available actions that the agent can legally make in the current state.
We wrote code to convert the features and actions from StarCraft II to the network and vice versa (a sketch of this conversion follows below). Our approach utilizes Deep Behavioral Cloning: we take advantage of the image-like nature of the state by using a Convolutional Neural Network (CNN).
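As one illustration of that conversion, here is a sketch assuming the 2017-era PySC2 observation layout, where obs.observation["screen"] holds the 17 stacked 84x84 feature layers and obs.observation["available_actions"] lists the legal function ids (field names differ in later PySC2 versions):

    import numpy as np
    from pysc2.lib import actions as sc2_actions

    NUM_ACTIONS = len(sc2_actions.FUNCTIONS)  # total number of PySC2 action functions

    def obs_to_features(obs):
        """Convert a PySC2 observation into a CNN input plus a legality mask."""
        # Stacked feature layers (health, alliance status, etc.) preserve the
        # screen's spatial layout, so they can be treated as image channels.
        screen = np.asarray(obs.observation["screen"], dtype=np.float32)  # (17, 84, 84)

        # 0/1 mask over the action ids the agent may legally take in this state.
        legal = np.zeros(NUM_ACTIONS, dtype=np.float32)
        legal[obs.observation["available_actions"]] = 1.0
        return screen, legal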
The network outputs scores for each possible action (select, attack, move, no-op, etc.) as well as a best guess for each action's parameters in the current state; for the most part, these parameters are the coordinates of where to attack, select, move, etc.
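The poster does not specify the architecture beyond this description, but one way to realize "scores per action plus a coordinate guess" is a small two-headed CNN. The PyTorch sketch below (the framework, layer sizes, and names are our choices, not the authors') scores every action function and every screen pixel:

    import torch.nn as nn

    class BCNet(nn.Module):
        """Behavioral cloning net: feature layers in, action and coordinate scores out."""

        def __init__(self, num_actions, in_layers=17, screen=84):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(in_layers, 32, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            )
            # Scores for each action function (select, attack, move, no-op, ...).
            self.action_head = nn.Sequential(
                nn.Flatten(), nn.Linear(32 * screen * screen, num_actions)
            )
            # One score per screen pixel; the argmax is the predicted (x, y) argument.
            self.spatial_head = nn.Conv2d(32, 1, kernel_size=1)

        def forward(self, screen):  # screen: (B, 17, 84, 84)
            h = self.conv(screen)
            return self.action_head(h), self.spatial_head(h).flatten(1)

Training would then minimize the cross-entropy of both heads against the recorded human action id and coordinate, masking out illegal action ids with the available-actions vector from the conversion sketch above.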
Results
Discussion and Future Work
We can observe several things from our results:
• The theoretical score ranges from -9 to 750+. Obviously, we are not near the maximum potential.
• The learning curve indicates that we are improving our score over a random baseline but still suffer from overfitting.
• The agent sometimes gets stuck moving back and forth between two states.
• The agent doesn't seem to have a full understanding of what it's doing; it is just mimicking what looks right.
Future work, were we to have 6 months to work on it:
• Use an LSTM Recurrent Neural Network (RNN) to help determine actions, especially in sequence.
• Refine our abstraction to ensure stability during training.
• Gather more training data to cover more possible cases.
The graphs on the top show training loss during training with and without regularization. The graphs on the bottom show average scores for the mini-game during training.
References
[1] O. Vinyals et al., "StarCraft II: A new challenge for reinforcement learning," arXiv preprint arXiv:1708.04782, 2017. [Online]. Available: https://arxiv.org/abs/1708.04782