
APPLICATION OF DEEP REINFORCEMENT LEARNING

FOR BATTERY DESIGN

A Thesis presented to

the Faculty of the Graduate School

at the University of Missouri

In Partial Fulfillment

of the Requirements for the Degree

Master of Science

by

DONGPENG LIU

Dr. Dong Xu, Thesis Supervisor

JULY 2020


The undersigned, appointed by the Dean of the Graduate School, have examined the thesis entitled:

APPLICATION OF DEEP REINFORCEMENT LEARNING FOR BATTERY DESIGN

presented by Dongpeng Liu, a candidate for the degree of Master of Science, and hereby certify that, in their opinion, it is worthy of acceptance.

Dr. Dong Xu

Dr. Jianlin Cheng

Dr. Jian Lin


ACKNOWLEDGMENTS

I would like to thank Dr. Dong Xu for his supportive instruction. He pointed out the direction of my research, helped me open up research ideas, and fostered my research taste. His guidance taught me how to do good work and contribute my own strength to society, which not only helped me in my Master's study but also gave me enlightenment for my future career.

I want to thank my colleagues and mentors at Automat Inc. for their professional support in battery science, robotic control, and machine learning. Their domain knowledge helped me extend this project during my internship at the company. I also want to thank my parents, who have always supported me. Finally, I want to thank my classmates and labmates: we solved many problems together, and my amiable fellows in the lab led me onto the path of research.


TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 Introduction
  1.1 Background
    1.1.1 Problem of material research and development
  1.2 Related works
  1.3 Machine Learning
    1.3.1 Steps of machine learning application project
  1.4 Problem formulation of battery recipe generation
    1.4.1 Prediction problem
    1.4.2 Generation problem
  1.5 Materials Artificial Intelligence Robotics-driven System (MARS)

2 Data preprocessing and representation
  2.1 Data collection
  2.2 Data cleaning
  2.3 Data visualization
  2.4 Feature engineering
    2.4.1 Vector representation

3 Prediction models and experiments
  3.1 Modeling
    3.1.1 Conductivity prediction model
    3.1.2 State-of-the-art methods on structural data prediction
  3.2 Implementation
  3.3 Evaluation

4 Generation models and experiments
  4.1 Models for structural data generation
    4.1.1 Formulated as an optimization problem
    4.1.2 Markov Decision Process for battery recipe generation
  4.2 Model training
    4.2.1 Bayesian optimization setting
    4.2.2 Training Reinforcement Learning model
  4.3 Evaluation

5 Summary

BIBLIOGRAPHY


LIST OF TABLES

2.1 Parameter and target inputs
2.2 Formulation examples of 6 domains, selected from database
2.3 One-hot-like example recipe
3.1 Formulations generated by machine learning


LIST OF FIGURES

1.1 MARS workflow
1.2 MARS system architecture
1.3 MARS machine learning model
1.4 MARS machine learning flow chart
2.1 Data flow and data QA/QC
2.2 Box plot for each dimension of input
2.3 Conductivity distribution plot
2.4 Feature correlation plot
3.1 Deep neural network architecture
3.2 Scatter plot of LightGBM prediction result and ground truth
3.3 Scatter plot of XGBoost prediction result and ground truth
3.4 Scatter plot of Neural Network prediction result and ground truth
4.1 DDPG framework
4.2 Generated conductivity result w.r.t. iteration
4.3 Generated conductivity plot
4.4 Bayesian optimization conductivity plot
4.5 Standard deviation of generated recipes


ABSTRACT

Conventional material research and development is mainly driven by human intuition, labor, and manual decision making, which is ineffective and inefficient. Due to the complexity of material design and the magnitude of experimental and computational work, the discovery of materials with conventional methods usually takes very long development cycles (10-20 years) with enormous labor and costs. To address this challenge, we proposed a machine-learning framework called the Materials Artificial Intelligence Robotics-driven System (MARS), aiming to reduce these costs with the help of machine learning techniques.

We applied advanced deep-learning networks to better predict conductivity. We explored neural network models and tree-based models such as LightGBM. In particular, we made the models more interpretable and identified the relationships between the electrolyte's composition and the ionic conductivity. To search for the optimal conductivity, we developed a deep reinforcement learning (RL) model based on DDPG (Deep Deterministic Policy Gradient) to explore novel recipes that reach much higher conductivity. DDPG carries out the RL process by entering new states through actions, where each action at a specific state (a one-hot vector representing selections of electrolyte components) yields a reward Q computed by the predictor developed in the previous step. After the optimal compositions have been found for maximum conductivity, voltage stability, and modulus, new measurements are conducted to confirm these compositions. The new measurement data are then fed back to improve the prediction model, so the prediction model is constantly updated by each RL prediction. Once a successful update has been made to the prediction model, the whole process iterates. A well-trained DDPG model combines the benefits of both Q-learning and the policy gradient method: it is faster, simpler, more robust, and able to achieve much higher conductivity than conventional search methods.

Finally, the model could provide compositions that lead to higher conductivities than the highest conductivity in the training data. We then generated more training data according to these compositions to retrain the prediction model. The generated recipes have been validated both by machine learning metrics and by wet-lab experiments. The best generated conductivity (2.51e-3 S/cm) met our expectations for battery recipes.


Chapter 1

Introduction

1.1 Background

1.1.1 Problem of material research and development

Mainstream material R&D today is mainly driven by human intuition, labor, and manual decision making. It is ineffective and inefficient given the complexity of material design and the magnitude of experimental and computational work. Thus, material discovery sometimes looks like a "treasure hunt", with long development cycles (10-20 years) and occasional lucky breakthroughs. Besides, it is very difficult to solve most of the complicated problems in materials exploration by using only mechanics and statistical mechanics (so-called first principles), although many mechanics approaches are truly helpful in materials discovery and optimization [1].

Electrolyte materials discovery has attracted technical interest because of its possible applications in various electrochemical devices such as fuel cells and solid-state batteries. It is estimated that chemical materials R&D spending is $50 billion yearly, but the application of software technologies currently accounts for less than 1% of it [2]. The technical challenges that need to be addressed include: (1) achieving parallelized mechanical measurements, which otherwise are usually devised in a serial fashion; (2) designing sample holders that are compatible with complex formulated samples such as the polymer electrolyte; and (3) collecting meaningful data that is interpretable and convertible to mechanical property values.

This work primarily develops a model for battery materials, with the first focus on polymer electrolytes. The electrolyte is a key component in next-generation lithium-metal batteries that double energy density and halve the cost with improved safety [3]; such batteries are urgently needed, for example in electric vehicles (EVs), where battery performance is one major bottleneck. There is an urgent need to accelerate battery material innovation that leads to batteries with long mileage, durability, low cost, safety, and fast charging.

Specifically, the machine learning-based material discovery workflow combines: (1) initial knowledgebase collection, including parameters (e.g., material compositions and physical properties) and objective functions (e.g., conductivity and durability); (2) AI model training and learning using the knowledgebase; (3) experimental design aimed toward the optimal solution suggested by the AI model; (4) parallelized experimentation via a high-throughput platform; and (5) knowledgebase updates based on the new results. This process iterates toward the global maximum of material performance. One analogy would be the self-driving car, which uses algorithms and automation to make its driving decisions.

Our contribution includes a system and a method for a Materials Artificial Intelligence Robotics-driven System (MARS). MARS includes a machine learning framework, a knowledge database that includes training data, a robotic preparation module, and a robotic testing module. These provide the advantage of accelerating advanced materials and device research and development. Various embodiments of MARS are centralized, autonomous, combinatorial, and closed-loop, combining machine learning and robotic high-throughput automation. According to various embodiments, MARS can be implemented to discover new high-performance battery materials and improve existing battery materials.

1.2 Related works

There are several existing robotic and high-throughput systems that help to design and execute R&D experiments more rapidly and efficiently. There are also artificial intelligence applications used to predict material properties such as lattice thermal conductivity [4] and band gap [5], thereby speeding up the testing of product properties such as battery life cycles. We have witnessed tremendous progress and great benefits in combining machine learning with robotic automation and materials science [6], both in academia and in industry. However, there are only a few studies on conductivity prediction by AI. Among them, the most relevant one is [7], which employed a multi-layer neural network for only one fixed type of polymer electrolyte. To move beyond this limitation on polymer type, we consider employing deep learning with large amounts of data, together with generative models for data generation.

High-throughput experimentation allows many experiments to be parallelized at one time via automation, thus greatly compressing research time-to-market [8] [9]. In materials science, for example, companies such as Wildcat Discovery [10] and Intermolecular have applied high-throughput robotic tools to prepare large numbers of samples and screen them for properties such as battery coulombic efficiency and dielectric constant. We employ high-throughput experimentation to cope with challenge (1).

Recently emerged materials informatics approaches aim to accelerate materials discovery with the help of machine learning and big data [11]. As a strong supplement to the first-principles strategy, machine learning uses known data about the properties and descriptors (including both computational and experimental parameters) of some materials to find semi-empirical rules, explicitly or implicitly, and then uses these rules to predict and evaluate the properties of unknown materials [12]. Conventional machine learning methods have been applied as conductivity classifiers; among them, tree-based methods show promising results over other methods [13]. The authors further analyzed the factors (or nodes) in the tree-based methods, such as Li-salt content or temperature thresholds. However, they have not deeply investigated the continuous space, nor further optimization in that continuous space. Neural networks are inherently suitable for continuous problems, and their feasibility for conductivity prediction has been tested [14]. However, that model was trained on only a few samples, whereas real-world material information offers a huge amount of data for models. We should leverage the greater capacity of deep learning models for better prediction.

Many generative algorithms, including reinforcement learning, have found applications in new areas such as drug development [15] [16], materials discovery [17], and supply chain management [18]. However, research on the generation of electrochemistry formulations such as battery recipes is relatively scarce. A close application relevant to battery recipes is AI-aided Li-salt structural design [19]. Robotic systems are widely expanding into manufacturing, R&D, and the Internet of Things (IoT) [20] [21]. Machine learning and robotic arms are combined to tackle challenges (2) and (3).

1.3 Machine Learning

Machine learning refers to methods and algorithms that, given plenty of data, use statistical techniques and computer algorithms to extract and find the hidden patterns within the data. Machine learning explores the study and construction of algorithms that can learn from and make predictions on data: such algorithms make data-driven predictions or decisions by building a model from sample inputs. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms with good performance is difficult or infeasible; example applications include email filtering, detection of network intruders, and computer vision.

1.3.1 Steps of machine learning application project

Machine learning is rooted in computational statistics. The development flow includes data preprocessing, modeling, and evaluation, explained as follows.

Data preprocessing: This is the first and most important step of the whole machine learning project. The performance of a machine learning model depends on the size and quality of the dataset. We should apply appropriate processing to our data, such as normalization and vectorization of data characteristics, to enhance its representational ability and to keep the machine learning model from becoming overly complex. We also need to split the dataset into training, validation, and test sets.
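As a concrete illustration of the split described above, the following is a minimal sketch in Python; the 70/15/15 ratio and the synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the encoded recipes (X) and conductivity scores (y).
X = np.random.rand(1000, 102)
y = np.random.rand(1000)

# 70% training, 15% validation, 15% test (assumed ratios).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```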

Modeling: We need to formulate our problem in a specific, machine-solvable form. Every machine learning problem should specify an input "X" composed of features (the "input data") and a target feature "Y" that we want the machine learning algorithm to yield. Model selection is based on data type, dataset size, and similar considerations; our tabular, categorical data is naturally suited to tree-based methods. We train and fine-tune the model using the data pre-processed in the previous step.

Evaluation: Evaluation is the final step of the whole project, in which we need to find the most suitable metric for our output. Since our work involves multiple models solving different problems, each problem should have its own criteria. Our prediction model employs the commonly used mean squared error (MSE) metric, and our generation model is assessed with a domain-specific metric. We use test data to evaluate our model and compare our results with other models.

Upon finishing the above steps, a complete loop is done. We then analyze the results to decide whether or not to conduct another experiment-analysis loop.

1.4 Problem formulation of battery recipe generation

Generating new battery recipes is a long-standing and hard-to-fulfill requirement. The difficulty in solving it lies in the huge variance of the intrinsic information of structured data, i.e., data that fits within fixed fields and columns in relational databases and spreadsheets.

For data generation, there is much research on image, audio, and text generation. By contrast, the generation of structured data is much less studied than its prediction. Common practice is to learn from structured data for other goals; for example, researchers apply neural networks to generate text from structured data [22], with potential applications in auto-generating news articles, weather reports, and industry reports. Here we consider generating structured data from source structured data, in other words, going from an electrolyte materials database to a battery recipe.

What makes the generation problem complex is that the structured input data may have columns with tangled inter-relationships. Taking the house price prediction problem as an example, the number of family members is associated with the number of rooms, and the location can nearly determine the house price. Moreover, the generated data should sometimes meet some standard or preference. For human face generation, a more natural, human-like face is preferred; for battery recipe generation, a higher conductivity is preferred. Searching for higher conductivity naturally forms an optimization problem, so optimization methods, with or without constraints, are worth exploring [23]. If we treat the target output (conductivity) as one of the constraints, other unsupervised machine learning methods are also feasible, for example Variational Autoencoders (VAE) [24], Generative Adversarial Networks (GAN) [25], and Reinforcement Learning (RL) [26].

Optimization algorithms describe how a combination and variation of x can minimize (or maximize) the output y [23] [27]. Almost all machine learning algorithms ultimately come down to maximizing or minimizing an objective function. For example, in supervised learning we need to find an optimal mapping function f(x) that minimizes the loss function (empirical risk or structural risk) over the training samples, or to find an optimal probability density function p(x) that maximizes the log-likelihood of the training samples (maximum likelihood estimation).
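Restating the two examples above in standard notation (the symbols here are generic and not defined elsewhere in this thesis), the supervised-learning and maximum-likelihood objectives can be written as:

$$\hat{f} = \arg\min_{f} \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, f(x_i)\big), \qquad \hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{N} \log p_{\theta}(x_i).$$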

VAE- and GAN-based methods [28] generate data by estimating and mimicking the distribution inside the input data, mostly assuming a Gaussian as the initial distribution. However, the distributions of some attributes are counterintuitive and contrary to this assumption; columns such as polymer kind and solvent kind could be chosen arbitrarily. In addition, the common way to add a constraint to a GAN is to devise a new loss for an extra generator update, which also requires the constrained variable, conductivity, to function as a label or output, not as an input. AlphaGo [29] has demonstrated the capability of reinforcement learning to search a space as large as 10^170 for optimal solutions. We believe such algorithms can solve materials problems and greatly enhance performance.

These approaches lead to the same conclusion: an explicit map from x to y should be determined before we employ either optimization methods or unsupervised learning methods. A function or model that adequately describes the relationship between x and y is needed. Thus, we resolve the whole generation process into a two-step pipeline: first we build and refine the prediction model, then we use it within the subsequent generation model.

1.4.1 Prediction problem

For the prediction model, there is a lot of structured data available in Kaggle competitions [30] that we can learn from and analyze. Among the winners' solutions, the most common and efficient way to pre-process structured data is feature engineering [31]. Overall, we focused on the prediction model as preparation, then compared different generation models with or without the prediction. Because prediction can be a completely different sub-field from generation, we dedicate a whole chapter to building and refining it. We have successfully run two rounds of sample preparation using our machine learning model for the electrolyte materials in lithium-metal batteries. The electrolyte components are encoded as a one-hot vector that serves as the input to a regression model (the predictor). The model was trained using over 1000 experimental samples, and the validation results show prediction accuracy at the level required for industrial application. The prediction model was further applied in our RL generation model as the environment for exploring novel formulations. We believe this process could generalize to other structured-data generation problems, which are under-examined in the research literature.

1.4.2 Generation problem

We can intuitively model the chemical process of recipe generation as learning a reinforced agent that performs small-step actions of addition or removal within a chemistry-aware Markov Decision Process (MDP). Here we include the assumption that the chemical process of recipe generation has the Markov property. MDPs are a classical formalization of sequential decision making, where actions influence not only the immediate reward of the current state but also subsequent states and, through them, future rewards [26]. Thus MDPs involve delayed reward and the need to trade off immediate and delayed reward. They are useful for studying optimization problems, here defined as optimizing the conductivity of recipes. We then employ reinforcement learning to solve this MDP.

The MDP M formally comprises states, actions, and rewards, M = (S, A, R), where each term is defined as follows:

S = {S_t} is the set of states, whose values range over all possible intermediate and final generated recipes. Each S_t is a tuple of a state and its corresponding time step, denoted (s, t). Here we consider a finite MDP, in which the sets of states, actions, and rewards all have a finite number of elements. All three components are defined discretely with respect to time, with each depending on the preceding one. The initial state S_0 is chosen randomly from a combination of our battery material recipe database and already-generated recipes. MDP modeling typically requires an ending state that terminates the series of states forming an episode. We adopt the episodic case by limiting the maximum number of time steps T in our tabular, chemical-reaction-based MDP; after T time steps the episode ends and a new episode starts.

A = {A_t} denotes the set of actions that describe the modification made to the current state (intermediate recipe) at each time step t. The action space here has the same form as the state space; the only difference is that the modifications expressed by actions are micro-scale in comparison with the state space. We enforce this because we want to simulate a chemical environment in which each component is added gradually, following suggestions from [32]. Therefore, this space is also continuous, represented by a distribution over each component.

R_t is the reward function that specifies the reward received after reaching state S_t, with discount factor γ; this hyper-parameter is set to 0.9 in our study. In our framework, the state is post-processed into a valid and complete compositional form at each step: the contents of all components must sum to 1, and each component content must be greater than or equal to 0. Note that in our virtual environment a reward is given not just at the terminal state but after each action step. Both intermediate and final rewards are used to guide the behavior of the reinforcement learning (RL) agent, avoiding the delayed or sparse reward issue that many other reinforcement learning frameworks suffer from [33]. Furthermore, to ensure that the last state is rewarded the most, we use γ to discount the value of the rewards at state S_t. In addition, our reward function considers the similarity of recipes, in order to avoid generating many repeated recipes.
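A minimal sketch of this MDP as a simulation environment is given below, assuming a trained conductivity predictor with a scikit-learn-style predict method; the class and parameter names are illustrative, and the similarity term of the reward is omitted for brevity.

```python
import numpy as np

class RecipeEnv:
    """Sketch of the recipe-generation MDP: states are recipe vectors,
    actions are micro-scale additions/removals, rewards come from the predictor."""

    def __init__(self, predictor, n_components=102, max_steps=50, gamma=0.9):
        self.predictor = predictor    # conductivity regressor from Chapter 3
        self.n = n_components
        self.max_steps = max_steps    # episode length T
        self.gamma = gamma            # discount factor (0.9 in this study)
        self.state, self.t = None, 0

    def reset(self, seed_recipe):
        # S0 is drawn from the recipe database or previously generated recipes
        self.state = self._normalize(np.asarray(seed_recipe, dtype=float))
        self.t = 0
        return self.state

    def step(self, action):
        # post-process to a valid composition: non-negative, summing to 1
        self.state = self._normalize(self.state + action)
        self.t += 1
        reward = float(self.predictor.predict(self.state[None, :])[0])
        done = self.t >= self.max_steps
        return self.state, reward, done

    @staticmethod
    def _normalize(x):
        x = np.clip(x, 0.0, None)
        s = x.sum()
        return x / s if s > 0 else x
```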

Reinforcement learning

To solve an MDP, conventional approaches such as Dynamic Programming (DP) and Monte Carlo methods harness the iterative nature of the MDP problem. Reinforcement learning is an iterative process in which each iteration solves two problems: evaluating the value function of the current policy, and updating the policy according to that value function. Reinforcement learning methods can be seen as achieving an effect similar to DP while weakening the assumption of a known, accurate environment model or requiring less computation. The DP method is generally used for finite MDP problems, where the sets of states, actions, and returns are finite; for continuous state-action-space problems, optimal solutions are obtained only in special cases.

DP-based methods require an environment model, while Monte Carlo- and TD- (Temporal Difference) based methods do not. The former are called model-based methods [26] and use a model of the environment for planning, while the latter are model-free methods, which learn from the experience of directly interacting with the environment. When a model is used to improve the policy, the biggest benefit of introducing environment modeling is that it makes better use of prior knowledge and improves learning efficiency. Model-free methods do not try to learn the environment dynamics and reward function, which gives them an advantage in saving computation and memory for more trials.

Reinforcement learning algorithms can be divided into three categories: value-based, policy-based, and actor-critic models. A commonly used value-based algorithm such as DQN has only a value function network and no policy network, while an actor-critic algorithm such as DDPG (Deep Deterministic Policy Gradient) has both a value function network (the critic) and a policy network (the actor). DDPG is model-free and off-policy, and it also uses deep neural networks for function approximation. However, unlike DQN, whose vanilla version is only capable of solving problems with discrete, low-dimensional action spaces, DDPG can solve continuous action-space problems by explicitly modeling the action policy.
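A minimal actor-critic sketch in the spirit of DDPG is shown below (PyTorch is assumed; network sizes and the soft-update rate are illustrative, not the settings used in this work).

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: maps a recipe state to a bounded, micro-scale action."""
    def __init__(self, state_dim=102, action_dim=102):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                                 nn.Linear(256, action_dim), nn.Tanh())
    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Value network: estimates Q(s, a) for a state-action pair."""
    def __init__(self, state_dim=102, action_dim=102):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1))
    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def soft_update(target, source, tau=0.005):
    # target networks slowly track the learned actor/critic
    for t, s in zip(target.parameters(), source.parameters()):
        t.data.copy_(tau * s.data + (1.0 - tau) * t.data)
```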

1.5 Materials Artificial Intelligence Robotics-driven System (MARS)

We employ our MARS platform on the polymer electrolyte. The electrolyte consists of polymers, lithium salts, plasticizers, and solvents. Experimentally, the formulations recommended by the model are prepared by:

1) Weighing the components according to the composition recommended by the model. For example, a formulation consists of polymer A at 50 wt%, lithium salt B at 40 wt%, plasticizer C at 9 wt%, and additive D at 1 wt%. To prepare this formulation, we weigh 5 g of polymer A, 4 g of salt B, 0.9 g of plasticizer C, and 0.1 g of additive D (see the short sketch after this list).

2) Dispensing the components of each formulation into a vial. Note that the plasticizer is not added in this step.

3) Dissolving, blending, and suspending the components in a solvent by magnetic mixing. The ratio between the sum of all components and the solvent is 50:1 to 2:1 by weight.

4) Removing the solvent by heating.

5) Dispensing the plasticizer(s) onto the polymer electrolyte film and letting them diffuse homogeneously.
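A small sketch of the weighing arithmetic in step 1, for an assumed 10 g batch, is:

```python
# Convert the model's wt% recommendation into masses for a 10 g batch.
recipe_wt = {"polymer A": 50, "lithium salt B": 40, "plasticizer C": 9, "additive D": 1}
batch_g = 10.0
masses_g = {name: batch_g * wt / 100.0 for name, wt in recipe_wt.items()}
print(masses_g)  # {'polymer A': 5.0, 'lithium salt B': 4.0, 'plasticizer C': 0.9, 'additive D': 0.1}
```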

Currently, 8 solvents are used to achieve acceptable solubility. In this step, viscosity is a key processing parameter to optimize. In addition, stock solutions are used to shorten preparation time. Figure 1.1 outlines the experimental workflow for polymer electrolyte preparation.

Figure 1.1: MARS workflow.

Then, the samples are loaded onto characterization modules. High-throughput characterization is carried out and data is collected. The data is fed back to the AI model for it to learn and improve. The closed loop iterates and converges toward materials that meet our requirements.

Figure 1.2 shows a block diagram of the MARS example system, which includes a machine learning model with an input training-data database that is constantly updated to adapt the machine learning model based on our actual experimental test results.

Figure 1.2: MARS architecture diagram, illustrating an exemplary environment in which some embodiments may operate. [The diagram shows the machine learning model module, training and tuning module, robotic preparation module, robotic testing module, UI & API module, training database, application engine, and user device; the machine learning model block contains reinforcement learning, DNN, and LightGBM components.]

The system includes a machine learning module, a robotic preparation module, a robotic testing module, a training and tuning module, and a user interface (UI) module. The system can communicate with the user device through the user interface generated by the application engine to display output (such as suggested recipes and test results). Machine learning models and databases may further become part of the system. The way the database is deployed may affect retrieval and storage efficiency and/or data security; taken together, we use MongoDB [34] as the database.

The machine learning module, as Figure 1.3 shows, uses a machine learning model to generate one or more suggested recipe outputs. The machine learning module sends the proposed recipes to the robotic preparation module, which mixes and prepares instances of each proposed recipe and stores each prepared recipe instance as part of an electrochemical module. The robotic preparation module provides the connection between the robotic testing module and the electrochemical module. The robotic testing module performs one or more tests on any stored recipe instance and generates test results. The robotic testing module feeds the proposed formulations and test results back into the training data, and the training and tuning module can further tune the machine learning model according to them.

Figure 1.3: MARS machine learning model diagram, illustrating an exemplary environment in which some embodiments may operate. [The diagram shows the machine learning model module generating proposed recipes, the robotic preparation module depositing them into the electro-chemical module, and the robotic testing module providing testing feedback as training data.]

As shown in Figure 1.4, MARS receives requests for one or more optimized objective functions for a selected battery portion (such as the polymer electrolyte, liquid electrolyte, cathode, or anode). Using machine learning models, MARS generates a number of different formulations of the battery material that optimize at least the objective function. A machine learning model may include a training optimization module that trains the model based on the training data, including one or more parameter types, such as: one or more chemicals, one or more components, one or more physical properties, and one or more processes. Machine learning models can also be trained on training data for one or more objective functions corresponding to the parameter types. Machine learning models are tuned by the training and tuning module to identify one or more combinations of parameter types that are compatible to a certain extent to create the desired optimization of one or more objective functions.

Figure 1.4: MARS machine learning flow chart, illustrating an exemplary method that may be performed in some embodiments. [The flow chart steps are: generate, by a machine learning network, a plurality of proposed different recipes of battery materials optimizing at least one objective function; prepare an instance of at least one of the proposed recipes via the robotic preparation module; deposit the instance into an electrochemical module via the robotic preparation module; execute a plurality of formulation characteristic tests on each deposited instance in the electrochemical module via the robotic testing module, which is loaded with a plurality of different tests for one or more of the battery materials; and update the machine learning model, via the robotic testing module, with the result of at least one of the formulation characteristic tests.]

According to various embodiments, machine learning model training is carried out for polymer electrolyte formulation components such as polymer components, lithium salt components, plasticizer components, and additive components. However, according to various embodiments, the plasticizer and/or additive components may be optional formulation components. Additional training data may include physical properties such as composition and viscosity, which encode prediction signals based on the direct relationship between composition (i.e., the formula) and viscosity; for example, a higher concentration of electrolyte formulation components (without plasticizer components) results in a higher viscosity.

MARS performs multiple formulation characteristic tests on each deposited instance in the electrochemical module using the robotic testing module, which carries multiple tests for one or more battery materials. For example, the robotic testing module may have one or more stored experiments and test protocols that will be applied to one or more stored instances of proposed recipes. It can apply different experiments and test protocols to different stored instances of the proposed recipes and determine the results of each experiment and test protocol. Through the robotic testing module, MARS updates the machine learning model with the result of at least one formulation performance test. For example, the robotic testing module can add the proposed formulation and its corresponding test results to the training data, and the training and tuning module can update and tune the machine learning model based on the updated training data.

Understandably, the operations in Figure 1.4 can be repeated in a different order or executed in parallel. In addition, the behavior of the models and methods in this example might occur on two or more computers, for example in an online network environment; some operations may occur on a local computer while others occur on a remote computer. It is further understood that the operations of the flow chart in Figure 1.4 can be performed iteratively in order to converge to the final formulations.


Chapter 2

Data preprocessing and representation

2.1 Data collection

We plan to search a vast space of materials via our MARS system designs and processes. The space consists of material and component data such as chemical structure and particle size. Regarding chemical structure, in the polymer electrolyte use case there are four categories of chemical structure represented in the training data, namely the polymer, the lithium salt, the additive, and the plasticizer. Regarding their functionality, the polymer conducts lithium ions and provides mechanical integrity to the polymer electrolyte. The lithium salt is the source of lithium ions. The additive improves electrolyte properties such as ionic conductivity and voltage stability, and battery performance such as cycle life, safety, and charging rate. The plasticizer makes the polymer flexible and supports ion conduction.

To supplement the chemical list (see Table 2.2), other chemicals that fall into the four categories above but have not yet been used in polymer electrolytes form the second set (the Chemical Database) for the machine learning algorithms to explore. For example, from the chemical supplier Sigma-Aldrich's website we collected about 250 different chemicals, with data on, for example, linear formula, price, SMILES representation, molecular weight, melting point, density, toxicity, and flash point.

For machine learning exploration in this Chemical Database, the chemicals in the training set (Table 2.2) are part of the Chemical Database, and we keep the chemical names and SMILES representations consistent between the two sets. The Chemical Database has the same four chemical categories. In this way, the machine learning algorithms know there are new chemicals in the Chemical Database for them to explore. Figure 2.1 depicts this.

Figure 2.1: Data flow chart and data QA/QC.

In terms of particular chemical structures that would work as components in the polymer electrolyte, the guiding principles are as follows:

(1) On the polymer, there need to be functional groups that can interact with lithium ions and thus "dissolve" and conduct them. For example, polyethylene oxide is a common polymer for the polymer electrolyte; the ether oxygen in polyethylene oxide has lone electron pairs that can be shared with lithium ions. Other suitable functional groups include nitrile, carbonyl, carboxylate ester, fluoride, and amine groups.

(2) On the plasticizer, chemicals with the following attributes are preferred. It should be able to dissolve salts to sufficient concentration; in other words, it should have a high dielectric constant. It should be fluid (low viscosity), so that facile ion transport can occur. It should remain inert to all cell components, especially the charged surfaces of the cathode and the anode, during cell operation. It should remain liquid over a wide temperature range; in other words, its melting point (Tm) should be low and its boiling point (Tb) high. It should also be safe (high flash point Tf), nontoxic, and economical.

(3) On the lithium salt, it should be able to completely dissolve and dissociate in the nonaqueous media, and the solvated ions (especially the lithium cation) should be able to move in the media with high mobility. The anion should be stable against oxidative decomposition at the cathode and inert to electrolyte solvents. Both the anion and the cation should remain inert toward the other cell components such as the separator, electrode substrate, and cell packaging materials. The anion should be nontoxic and remain stable against thermally induced reactions with electrolyte solvents and other cell components.

(4) On the additive, there are those used for improving the ion conduction properties in the bulk electrolyte, those used for SEI chemistry modification, and those used for preventing overcharging of the cells. On the other hand, it is possible that the machine learning algorithm will discover new and novel polymer electrolyte mechanisms among the chemicals available. In addition, the liquid electrolyte in lithium batteries serves as another use case. In this case, the polymer is absent and there are three categories: the lithium salt, the additive, and the plasticizer (which is usually referred to as the solvent in a liquid electrolyte).

2.2 Data cleaning

In the polymer electrolyte use case, the polymer(s) and the lithium salt(s) are essential, and the additive(s) and the plasticizer(s) are optional (also notice that we have recipes in which the polymer is optional). Table 2.1 lists the representative chemical types in the training set; these chemicals have been reported in the literature. The required objective functions used as inputs to train the model for polymer electrolyte development are ionic conductivity, voltage stability, and Young's modulus. Table 2.1 shows the complete list of inputs used to train the model for polymer electrolyte development in our current knowledgebase, including the required ones discussed above. In the liquid electrolyte use case, the complete list of inputs to train the model is similar to Table 2.1, except that the polymer parameters, free-standing parameters, and mechanical parameters are not applicable.

Table 2.1: Parameter and objective function inputs to train the model in the use case of polymer electrolyte.

Parameters

  Electrolyte - Polymer:
    Polymer 1 Type; Polymer 1 CAS #; Polymer 1 Vendor; Polymer 1 Product #; Polymer 1 Repeat unit; Polymer 1 Repeat unit MW; Polymer 1 MW; Polymer 1 Mw/Mn; Polymer 1 Density; Polymer 1 Voltage stability range; Polymer 1 wt%; Copolymer Type

  Electrolyte - Li Salt:
    Li Salt 1 Type; Li Salt 1 CAS #; Li Salt 1 Vendor; Li Salt 1 Product #; Li Salt 1 Structure; Li Salt 1 MW; Li Salt 1 wt%; Molar ratio of (repeat unit + solvent) and Li salt; Li Salt 2 conc. in liquid electrolyte / M; Li Salt 2 Solubility / g/g; Li Salt 2 pH in water

  Electrolyte - Plasticizer:
    Plasticizer 1 Type; Plasticizer 1 CAS #; Plasticizer 1 Vendor; Plasticizer 1 Product #; Plasticizer 1 Structure; Plasticizer 1 Repeat unit MW; Plasticizer 1 MW; Plasticizer 1 Density / g/cm3; Plasticizer 1 wt%

  Electrolyte - Additive:
    Additive 1 Type; Additive 1 CAS #; Additive 1 Vendor; Additive 1 Product #; Additive 1 Size / nm; Additive 1 Solubility / g/g; Additive 1 Structure; Additive 1 MW; Additive 1 wt%

  Electrolyte - Process:
    Formulation Method; Solvent; Synthesis Method

  Electrolyte - Formulation properties:
    Tg1; Tm1; Tpc; Tg2; Tm2; Xc %; Tf; Td; Viscosity; Color; Solubility; Surface tension; Temperature effect

Objective functions

  Formulation objective functions - Electrolyte:
    Electrolyte Conductivity / S/cm; Electrolyte Conductivity temperature / C; Electrolyte Feedstock Miscibility; Electrolyte Film visual uniformity; Electrolyte Film visual color; Electrolyte Free-standing; Electrolyte Li transference number; Electrolyte Li TN temperature / C; Li diffusion coefficient; Electrolyte Cathodic stability vs Li / V; Electrolyte Anodic stability vs Li / V; Electrolyte Anodic stability temperature; Electrolyte Li deposition onset potential / V vs Li; Electrolyte Modulus / MPa; Electrolyte Adhesion; Electrolyte Tensile Strength / MPa; Electrolyte Elongation / %

2.3 Data visualization

Normalization. First, we want to know whether the target (conductivity) has a proper distribution for machine learning models. Because our data is collected from manual experiments and academic reports, the distribution may be highly imbalanced (e.g., a long-tailed distribution, as some input dimensions in Figure 2.2 show); both our target y and our input X have imbalance problems to some degree. Feeding in an imbalanced distribution would hurt the accuracy of the machine learning model. Thus, we decided to transform the target y into a common distribution such as the normal distribution, which leads to better model performance [35]. For our conductivity y, the original values span a very wide range, from 1e-3 to 1e-11, so we cannot use them directly as a reward. We therefore applied a logarithmic transform followed by normalization, which reduces the range to [0, 1] and yields the new conductivity score shown in the distribution plot, Figure 2.3.
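A minimal sketch of this transform is shown below, assuming a base-10 logarithm followed by min-max scaling (the exact form used in the thesis may differ).

```python
import numpy as np

def conductivity_score(sigma):
    """Map raw conductivities (roughly 1e-11 .. 1e-3 S/cm) to a [0, 1] score."""
    log_sigma = np.log10(np.asarray(sigma, dtype=float))
    return (log_sigma - log_sigma.min()) / (log_sigma.max() - log_sigma.min())

print(conductivity_score([1e-3, 2.5e-5, 1e-8, 1e-11]))  # 1.0 for the best, 0.0 for the worst
```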


Figure 2.2: Box plot for each dimension of the input. The feature distributions need normalization for easier modeling.

Analysis of numerical data. In addition to examining our current input features, we also want to disentangle the relations among input dimensions. Correlation is a commonly used metric for uncovering the relationship between two continuous variables, and certain reactions are expected between some components (with respect to type and size). We measured the inter-relations of the numerical dimensions, with results shown in Figure 2.4. We conclude that the conductivity enhancement depends on the filler type and size. Strong relationships include: dimension 12 ('Solvent 1 wt%') with dimension 13 ('Solvent 2 wt%'), and dimension 10 ('Li Salt wt%') with dimension 11 ('molar ratio of (repeat unit + solvent) and Li salt'). We tested some feature combinations in our prediction model, such as the weight percentage of Li salt multiplied by the weight percentage of all solvents, as discussed in a later chapter. Note that we prefer a solid battery, which means the less solvent, the better a recipe would be. Thus, after investigation we still use the original one-hot vector, multiplied by the corresponding weight percentages as described below.
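The correlation analysis behind Figure 2.4 can be sketched as follows; the column names and values here are illustrative assumptions, not the actual data.

```python
import pandas as pd

df = pd.DataFrame({
    "Li Salt wt%":   [17.4, 17.6, 16.7, 12.0],
    "Solvent 1 wt%": [5.0, 4.0, 6.0, 8.0],
    "Solvent 2 wt%": [15.0, 14.0, 16.5, 20.0],
})
print(df.corr())  # pairwise Pearson correlations between numerical dimensions
```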


Figure 2.3: Conductivity distribution plot. After transformation, the conductivity score is nearly normally distributed.

2.4 Feature engineering

Table 2.2 provides lists of categories and chemicals of the training data according to some embodiments. According to various embodiments, the training and tuning module trains the machine learning model on any portion(s) of the training data. Following these forms, the machine learning model returns different formulations of multiple battery materials, such as polymer electrolytes, liquid electrolytes, anodes, or cathodes. All entries in the input are features that the model can learn from. Some features may be more important, so we should emphasize them in the input vector. There also exist inter-related features with redundancy or special connections among them; disentangling and handling these connections is the job of the feature engineering step.

For electrolyte formulations, machine learning models suggest polymers, plasticizers, lithium salts, and additives, and weight percentages for each component. According to various embodiments, the formulation presented in Table 2.2 is a polymer-electrolyte formulation proposed by machine learning models in response to input requests for optimized objective functions such as conductivity. That is, each formulation is based on machine learning models that predict which combination of ingredients (polymers, lithium salts, plasticizers, and additives) will converge to the best conductivity.

Figure 2.4: Feature correlation plot. Linear relations among these numerical inputs stand out.

Table 2.2: Formulation examples of 6 domains, selected from the database.

Component            | Recipe 1 | Recipe 2 | Recipe 3 | Recipe 4
---------------------|----------|----------|----------|---------
Polymer A Type       | PBP      | PBP      | LPE16    | LPE7
Polymer A wt%        | 17.4     | 17.6     | 16.7     | 17.5
Polymer B Type       | LAPPE1   | LAPPE1   | LAPPE4   | LAPPE1
Polymer B wt%        | 17.4     | 17.6     | 16.7     | 17.5
Li Salt Type         | S2ALP    | S2ALP    | LBI      | LBI
Li Salt wt%          | 17.4     | 17.6     | 16.7     | 17.5
Plasticizer A Type   | DO       | DO       | Et       | Pr
Plasticizer A wt%    | 17.4     | 17.6     | 16.7     | 17.5
Plasticizer B Type   | N        | N        | D        | D
Plasticizer B wt%    | 12.8     | 11.9     | 16.7     | 12.4
Additive Type        | AO       | SMMS     | LMMS     | LMMS
Additive wt%         | 17.4     | 17.6     | 16.7     | 17.5

The machine learning model determines all of the components. For example, the first formulation includes polymer A type "PBP" at 17.4 wt%, polymer B type "LAPPE1" at 17.4 wt%, lithium salt "S2ALP" at 17.4 wt%, plasticizer A type "DO" at 17.4 wt%, plasticizer B type "N" at 12.8 wt%, and additive "AO" at 17.4 wt%. The second formulation includes polymer A type "PBP" at 17.6 wt%, polymer B type "LAPPE1" at 17.6 wt%, lithium salt "S2ALP" at 17.6 wt%, plasticizer A type "DO" at 17.6 wt%, plasticizer B type "N" at 11.9 wt%, and additive "SMMS" at 17.6 wt%. The wt% of the polymer in a formulation ranges from 2 wt% to 99 wt%, ideally from 50 wt% to 85 wt%. The wt% of salt ranges from 2 wt% to 80 wt%, ideally from 10 wt% to 60 wt%. The plasticizer wt% ranges from 0 wt% to 90 wt%, preferably from 0 wt% to 30 wt%. The wt% of the additive ranges from 0 wt% to 40 wt%, preferably from 0 wt% to 15 wt%.


2.4.1 Vector representation

Our raw data (see Table 2.2) is a mix of categorical and numerical data. We need to find a vector representation of the categorical data to match the model's input requirement. We adopt a one-hot representation for our chemical components: each chemical corresponds to one element of the whole vector, and that element is marked "1" while the others remain "0". We then multiply the element by the chemical's weight percent (wt%). After that chemical is processed, the next chemical in the recipe is processed, and so on. Finally we obtain a weight-percent-masked vector; here we still call it a one-hot vector.

Table 2.3: One-hot-like example recipe. For each formula, the sum of all contents should be 1.0 (i.e., 100%).

polymer 1 type x content   | 0, 0, ..., 0.25, 0
polymer 2 type x content   | 0, ..., 0.3, 0, 0
Li salt type x content     | 0, 0, 0, ..., 0.1
solvent 1 type x content   | 0, 0.05, ..., 0, 0, 0, 0
solvent 2 type x content   | 0, 0, ..., 0, 0.15, 0, 0
additive 1 type x content  | 0, 0, 0, 0.25, ..., 0

The one-hot vector representation has 102 components: 39 of them are polymers, 26 are lithium salts, 19 are plasticizers, and 17 are additives, plus 1 dimension of conductivity as the output. Each one-hot element in the input is then multiplied by the corresponding quantitative content. In the polymer electrolyte use case, the model input thus includes the chemical names of the components and the components' quantitative content (that is, wt%), which are the two parameters used in the model. In terms of the chemical components, Table 2.2 lists examples in each chemical category, and an example one-hot-like vector is shown in Table 2.3.

In the following chapters, all machine learning models use this vector representation in order to keep the input/output consistent. In particular, the recipe generator model also yields a 102-dimensional vector. For the post-processing of a generated recipe, we have an inverse transform function that projects the 102-dimensional vector back into the simple 6-domain representation shown in Table 2.2.
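A minimal sketch of this weight-percent-masked one-hot encoding is shown below; the component vocabularies are truncated stand-ins (the thesis uses 39 polymers, 26 lithium salts, 19 plasticizers, and 17 additives), and the example recipe is illustrative.

```python
import numpy as np

POLYMERS     = ["PBP", "LAPPE1", "LPE16", "LPE7"]
LI_SALTS     = ["S2ALP", "LBI"]
PLASTICIZERS = ["DO", "Et", "Pr", "N", "D"]
ADDITIVES    = ["AO", "SMMS", "LMMS"]
VOCAB = POLYMERS + LI_SALTS + PLASTICIZERS + ADDITIVES

def encode_recipe(recipe):
    """recipe: dict mapping chemical name -> weight fraction (contents sum to 1)."""
    vec = np.zeros(len(VOCAB))
    for name, wt in recipe.items():
        vec[VOCAB.index(name)] = wt   # the "1" entry is masked by its weight percent
    return vec

example = {"PBP": 0.174, "LAPPE1": 0.174, "S2ALP": 0.174,
           "DO": 0.174, "N": 0.128, "AO": 0.176}
print(encode_recipe(example))
```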


Chapter 3

Prediction models and experiments

3.1 Modeling

3.1.1 Conductivity prediction model

The machine learning framework can include, but is not limited to, models based on neural-network algorithms, such as Artificial Neural Networks and Deep Learning; robust linear regression algorithms, such as Random Sample Consensus [36], Huber Regression [37], or Bayesian Regression [38]; tree-based algorithms, such as Classification and Regression Trees, Random Forests [39], Gradient Boosting Machines and Gradient Boosting Decision Trees [40]; the Naïve Bayes Classifier [41]; and other suitable machine learning algorithms, such as XGBoost [42] and LightGBM [43].

These models can capture different aspects of the data. For example, artificial neural networks are adept at capturing spatial features in images, while tree-based algorithms are skillful with categorical data. Because our chemical recipe data is structured data mixing categorical and numerical values, we investigated XGBoost, LightGBM, and CNN models to find the best model for capturing the latent information, as described in the following sections. Their ensemble combinations have also been evaluated.

3.1.2 State-of-the-art methods on structural data prediction

Convolutional Neural Network (CNN)

Convolutional Neural Networks (CNNs) have shown extraordinary performance in many computer vision tasks. This is in contrast to the earlier paradigm in computer vision, where hand-crafted features had to be designed for each specific task [44]. By using repetitive blocks of neurons in the form of convolutional layers, pooling layers, and fully connected layers, a CNN not only acquires image feature representations but also outperforms many conventional hand-crafted feature techniques [45]. In this study, we applied a typical CNN architecture: we built our neural network by stacking layers, mainly convolutional layers, pooling layers, and fully connected layers. These layers function together to yield a conductivity prediction.

Convolutional Layer Convolutional layers are the core process for extracting features from the input. The kernels are initialized randomly and learned with the back-propagation algorithm. The output of this layer is computed as:

$y^{\ell}_{ij} = \sigma(x^{\ell}_{ij}) = \sigma\Big(\sum_{a=0}^{m-1}\sum_{b=0}^{m-1} \omega_{ab}\, y^{\ell-1}_{(i+a)(j+b)}\Big)$    (3.1)

where $x^{\ell}_{ij}$ is the $(i, j)$-th input unit of the $\ell$-th layer. The convolutional operation is expressed as a matrix multiplication and $\omega_{ab}$ is the convolution kernel. The bias term of the $\ell$-th layer is omitted. We used the rectified linear unit (ReLU, $\sigma(x) = \max(0, x)$) as the non-linear activation function.


Pooling Layer The pooling layer is responsible for reducing the size of the feature maps. Max-pooling and mean-pooling are the two standard methods used in pooling layers. The primary function of this layer is to reduce the dimensions of the output of the convolutional layers and the number of parameters to learn, which helps prevent overfitting. In this study, we chose max-pooling for the pooling layers. Max-pooling picks only the maximum value in each region instead of computing its arithmetic mean as in mean-pooling. Max-pooling has demonstrated faster convergence and has outperformed average pooling and other variants [46].

Fully Connected Layer After flattening, each value in the vector gets a vote through one or more fully connected layers [47]. In this study, a softmax function ($\sigma$) is used in the output layer, defined as:

$y^{\ell}_{i} = \sigma(x^{\ell}_{i}) = \sigma\Big(\sum_{j} w^{\ell-1}_{ji}\, y^{\ell-1}_{j}\Big)$    (3.2)

where $x^{\ell}_{i}$ is the score from the fully connected layer. The softmax layer compresses the scores to values between zero and one that sum to one.

Tree-based model

Common machine learning algorithms, such as neural networks, can be trained with mini-batches, so the size of the training data is not limited by memory. GBDT, in contrast, needs to traverse the entire training set many times in each iteration. Keeping the whole training set in memory limits the amount of data that can be used, while repeatedly reading and writing the training data from disk consumes a very large amount of time. Especially for industrial-scale data, the ordinary GBDT algorithm cannot meet these requirements. XGBoost and LightGBM (Light Gradient Boosting Machine) are frameworks that implement the GBDT algorithm with the following advantages: faster training speed, lower memory consumption, better accuracy, and distributed support for fast processing of massive data.

XGBoost XGBoost improves on the gradient boosting machine with system-level optimizations: it adds regularization to prevent overfitting, supports column sub-sampling similar to random forests, offers both an exact greedy and an approximate split-finding algorithm, handles sparse data, and stores features in column blocks for distributed learning, making full use of the cache and otherwise idle CPU resources.

LightGBM Gradient Boosting Decision Tree (GBDT) is a long-standing model in machine learning. Its main idea is to train weak classifiers (decision trees) iteratively to obtain an optimal model, which tends to train well and is relatively resistant to overfitting. GBDT is widely used in industry, often for click-through rate prediction, search ranking and other tasks, and it is also a powerful tool in data mining competitions such as Kaggle.

XGBoost uses a pre-sorted decision tree algorithm, while LightGBM uses a histogram-based decision tree algorithm. The pre-sort algorithm needs to compute the split gain every time it traverses a feature value, while the histogram algorithm only needs to compute it k times (k can be considered a constant), so the time complexity is reduced from O(#data × #features) to O(k × #features). XGBoost grows trees level-wise, whereas LightGBM uses a leaf-wise growth algorithm with a depth limit. The disadvantage of leaf-wise growth is that deeper decision trees may grow, resulting in overfitting.


3.2 Implementation

All models were implemented in the Python programming language. Neural network models were built with the TensorFlow framework [48] and the PyTorch framework [49]; tree-based models were implemented with the Scikit-learn framework [50]. The Pandas toolkit [51] was used in the data pre-processing part. Given that only a few studies have applied machine learning methods to battery recipe data for conductivity, we decided to select and examine models that may be suitable for this problem. Here we compare various neural network models and tree-based models, as well as their ensembles. Their parameter settings are listed below.

LightGBM We tested 600-750 estimators and found the best performance with 720, along with a maximum tree depth of 7 and 48 leaves. The learning rate was set to 0.05. We also set the bagging and feature sub-sampling rates to 0.9 to avoid overfitting.
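As a reference, the following sketch shows how these settings map onto the LightGBM scikit-learn interface; the training arrays X_train and y_train are assumed to hold the 102-D recipe vectors and normalized conductivities:

```python
# A minimal sketch of the LightGBM setting described above (data loading is assumed).
import lightgbm as lgb

model = lgb.LGBMRegressor(
    n_estimators=720,        # best of the 600-750 range tested
    max_depth=7,
    num_leaves=48,
    learning_rate=0.05,
    subsample=0.9,           # bagging fraction
    subsample_freq=1,        # enable bagging
    colsample_bytree=0.9,    # feature sub-sampling
)
model.fit(X_train, y_train)  # X_train: 102-D recipe vectors, y_train: conductivity
```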

XGBoost We tested maximum tree depths from 5 to 25, with the learning rate set to 0.05. We also conducted a grid search on the maximum depth from 5 to 10 and picked 7 for the following experiments. We also set the subsample rate to 0.85 for XGBoost.
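A hedged sketch of this grid search, using scikit-learn's GridSearchCV with the xgboost regressor (X_train and y_train are again assumed):

```python
# A minimal sketch of the XGBoost grid search over max_depth described above.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {"max_depth": list(range(5, 11))}        # 5 to 10, as in the text
base = XGBRegressor(learning_rate=0.05, subsample=0.85)
search = GridSearchCV(base, param_grid, scoring="neg_root_mean_squared_error", cv=5)
search.fit(X_train, y_train)                          # hypothetical training arrays
print(search.best_params_)                            # picked max_depth = 7 in our runs
```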

Neural Network Model Since our data have a special format and pattern, we built our neural network model accordingly. The details of the network are shown in Figure 3.1. The first multiplier layer is inspired by DeepFM [52] and the attention mechanism [53] to learn feature interactions from the input features. We built a bottleneck 1D convolutional block with a skip connection; each block embeds 3 convolutional layers with a window size of 3. The blocks are stacked, followed by batch normalization and a dense layer. The network is optimized by Adam [54] with an MSE loss and a learning rate of 1e-3.
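The following PyTorch sketch approximates the described architecture; the exact channel counts and the form of the multiplier (feature-interaction) layer are simplified assumptions rather than the exact implementation:

```python
# A simplified PyTorch sketch of the architecture in Figure 3.1; layer sizes and the
# "multiplier" feature-interaction layer are approximations of the described design.
import torch
import torch.nn as nn

class BottleneckConvBlock(nn.Module):
    """Bottleneck 1D convolutional block (3 conv layers, kernel size 3) with a skip connection."""
    def __init__(self, channels, bottleneck):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(channels, bottleneck, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(bottleneck, bottleneck, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(bottleneck, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))   # skip connection

class ConductivityNet(nn.Module):
    def __init__(self, in_dim=102, channels=8, n_blocks=2):
        super().__init__()
        # "Multiplier" layer: element-wise feature interactions, loosely inspired by DeepFM/attention.
        self.interaction = nn.Linear(in_dim, channels * in_dim)
        self.blocks = nn.Sequential(*[BottleneckConvBlock(channels, channels // 2)
                                      for _ in range(n_blocks)])
        self.bn = nn.BatchNorm1d(channels)
        self.head = nn.Sequential(nn.Flatten(),
                                  nn.Linear(channels * in_dim, 512), nn.ReLU(),
                                  nn.Linear(512, 1))

    def forward(self, x):                                        # x: (batch, 102)
        h = self.interaction(x).view(x.size(0), -1, x.size(1))   # (batch, channels, 102)
        h = self.bn(self.blocks(h))
        return self.head(h)                                      # conductivity estimate

model = ConductivityNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
```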


Figure 3.1: Deep neural network architecture. Conv layer denotes the convolutional block. Concatenation operations are omitted.

3.3 Evaluation

We employed statistical metrics to evaluate our prediction model. Commonly used

metrics for a regression task such as RMSE is evaluated, as well as correlation metrics

including Spearman correlation and Pearson correlation. These metrics could help

us find factors in failed predictions and better fine-tune our model during model

validation time.
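For reference, a small helper along these lines can compute the three metrics with NumPy and SciPy (the function name is illustrative):

```python
# A short sketch of the evaluation metrics (RMSE, Spearman and Pearson correlation).
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    pearson_r, pearson_p = pearsonr(y_true, y_pred)
    spearman_r, spearman_p = spearmanr(y_true, y_pred)
    return {"rmse": rmse, "pearson": pearson_r, "pearson_p": pearson_p,
            "spearman": spearman_r, "spearman_p": spearman_p}
```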

We have demonstrated the feasibility of our neural network prediction model. As shown in Table 3.1, we report both the real conductivity scale and the normalized conductivity score scaled over the whole range from 0 to 1. The difference between the predicted score and the corresponding measured score was no more than 0.08 (around 10^-2 on the real scale) for each of the 9 electrolytes studied. The smallest difference was 0.01 (around 3e-7 on the real scale).


Table 3.1: Formulations generated by machine learning

with their predicted and measured conductivity scores.

Formulation # Predicted score Measurement score

1 1.12e-03 3.02e-03

2 9.55e-07 5.00e-06

3 1.20e-06 1.55e-06

4 6.02e-06 3.00e-07

5 8.29e-07 3.30e-08

6 7.41e-06 4.05e-05

7 3.23e-04 1.30e-03

8 6.87e-07 3.50e-07

9 3.79e-06 7.22e-05

Figure 3.2, Figure 3.3 and Figure 3.4 show that the prediction model predicted the conductivity accurately enough for industrial application. LightGBM ranks highest among the three models. The Pearson correlation coefficient between the predicted and measured conductivities for LightGBM is 0.92 with a p-value of 1.36e-17, which means our model is able to predict the correct ordinal information. LightGBM also performs well in Spearman correlation, with 0.96 and a p-value of 1.91e-64. The Root Mean Square Error (RMSE) between the predicted and measured values is acceptable as well, at 0.06. Thus, we obtained the prediction model and further applied it as the environment model in our generation model.


Figure 3.2: Scatter plot of LightGBM. With RMSE of 0.07, std of 0.003, Spearman correlation of 0.91 and Pearson correlation of 0.92.

Figure 3.3: Scatter plot of XGBoost. With RMSE of 0.08, std of 0.005, Spearman correlation of 0.89 and Pearson correlation of 0.91.

Figure 3.4: Scatter plot of our Neural Network model. With RMSE of 0.08, std of 0.007, Spearman correlation of 0.83 and Pearson correlation of 0.93.


Chapter 4

Generation models and experiments

4.1 Models for structural data generation

4.1.1 Formulated as an optimization problem

An optimization problem is the process of finding an extremum, which is a common problem in data science, so the problem of finding the maximum conductivity can be naturally cast as an optimization problem. The usual way to find an extremum is to take the derivative, that is, to optimize based on the gradient; this requires the functional form to be known so that the derivative can be computed, and ideally the function should be convex. However, in most cases the problem does not meet these two conditions. One example is the inversion problem, which refers to determining the parameters (or model parameters) that characterize a problem from observed results and some general principles (or models). In such cases gradient-based optimization cannot be used.


Bayesian optimization Bayesian optimization was proposed to solve such inversion problems. Its advantage is that it only needs to sample the function repeatedly to estimate its maximum, while keeping the number of required sampling points small. Bayesian optimization applies when the specific functional form is unknown, but y can be computed for any given x. A surrogate such as Gaussian Process Regression can be used for this; if enough (x, y) pairs are collected, the trend of the function is essentially known. Bayesian optimization is especially suitable for optimization over small spaces [55].

4.1.2 Markov Decision Process for battery recipe generation

We intuitively model the chemical reaction process for recipe generation as learning a reinforced agent, which performs discrete actions of small-step addition or removal in a chemistry-aware Markov Decision Process (MDP). Herein, we include the assumption that the chemical reaction process for recipe generation has the Markov property. MDPs are a classical formalization of sequential decision making, where actions influence not only the immediate reward of the current state, but also subsequent states and, through them, future rewards [26]. Thus MDPs involve delayed reward and the need to trade off immediate and delayed reward. They are useful for studying optimization problems, here defined as optimizing the conductivity of recipes. We then employ reinforcement learning to solve this MDP.

The MDP M formally has the components states, actions and rewards (M = (S, A, R)), where each term is defined as follows:

S = {S_t} is the set of states, whose values range over all possible intermediate and final generated recipes. Each S_t is a tuple of a state and its corresponding time step, denoted as (s, t). Here we consider the case of a finite MDP, in which the sets of states, actions and rewards all have a finite number of elements. All three components are also defined to be discrete with regard to time, each depending on the preceding component. The initial state S_0 is randomly chosen from a combination of our battery material recipe database and already generated recipes. MDP modeling typically requires an ending state that terminates the series of states forming an episode. We explore the episodic case by limiting the maximum number of time steps T in our tabular chemical-reaction-based MDP; after T time steps the episode ends and a new episode starts.

A = {A_t} denotes the set of actions that describe the modification made to the current state (intermediate recipe) at each time step t. The action space here is the same as the state space; the only difference is that the modifications encoded in an action are micro-scale compared with the state space. We enforce this because we want to simulate a chemical reaction environment in which each component is added gradually. The space is therefore also continuous, represented by a distribution over each component.

R_t is the reward function that specifies the reward received after reaching state S_t, with discount factor γ. This hyper-parameter is set to 0.9 in our study. In our framework, the state is post-processed into a valid and complete structural form at each step: all component contents must sum to 1 and every component content must be larger than or equal to 0. Note that in our virtual environment, a reward is given not just at the terminal states but after each action step. Both intermediate rewards and final rewards are used to guide the behavior of the reinforcement learning (RL) agent, avoiding the delayed or sparse reward issues that many other reinforced frameworks suffer from [33]. Furthermore, to ensure that the last state is rewarded the most, we use γ to discount the value of the rewards at state S_t. In addition, our reward function considers the similarity of recipes in order to avoid generating many repeated recipes.

Reinforcement learning

To solve an MDP, conventional approaches such as Dynamic Programming (DP) and Monte Carlo methods harness the iterative nature of the MDP problem. Reinforcement learning is an iterative process in which each iteration solves two problems: evaluating the value function of a given policy, and updating the policy according to that value function. Reinforcement learning methods can be seen as achieving an effect similar to DP while weakening the assumption of a known, accurate environment model or requiring less computation. The DP method is generally used for finite MDP problems, where the sets of states, actions and returns are finite. For continuous state-action space problems, optimal solutions are obtained only in special cases.

The DP-based method requires an environment model, while Monte Carlo and TD (Temporal Difference) based methods do not. The former are called model-based methods [26] and use a model of the environment for planning, while the latter are model-free methods, which learn from the experience of directly interacting with the environment. If a model is used to improve the policy, the biggest benefit of introducing environment modeling is that it can make better use of prior knowledge and improve learning efficiency. Model-free methods do not try to learn the environment dynamics and reward function, which has the advantage of saving computation and space for more trials.

Reinforcement learning algorithms can be divided into three categories: value-based, policy-based and actor-critic models. Commonly used value-based algorithms, such as DQN, have only a value function network and no policy network, while actor-critic algorithms, such as the Asynchronous Advantage Actor-Critic framework (A3C) [56] and DDPG (Deep Deterministic Policy Gradient), maintain both a value function network and a policy network. DDPG is model-free and off-policy, and also uses deep neural networks for function approximation. Unlike DQN, whose vanilla version can only handle discrete and low-dimensional action spaces, DDPG can solve continuous action space problems by explicitly modeling the action policy. In addition, DQN is a value-based method, i.e. it has only a value function network, while DDPG is an actor-critic method, i.e. it has both a value function network (the critic) and a policy network (the actor).

Environment model

In our RL framework, the chemical environment receives the action A_t from the agent and yields a scalar reward R_t and a state S_{t+1} back to the agent. We define the state of the chemical recipe content S_t as the intermediate generated tabular recipe at time step t, which is fully observable by the RL agent. For the task of battery recipe generation, the environment incorporates rules from domain knowledge. Therefore, our environment must handle the state transition from S_t to S_{t+1} and evaluate the action to produce the reward.

Our environment mimics the chemical experiment process, allowing components to be added gradually while the current status of the battery solution is monitored. Thus, we model the state transition as a simple additive operation: when a new action arrives, the new state is the current state plus the new action. Noise is included to model operational error.

The reward function is calculated from an extra model, namely the conductivity prediction model built earlier. As discussed in the last chapter, hereinafter we use our trained LightGBM as the environment model.
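A minimal sketch of this simulated environment, assuming a generic predictor object with a scikit-learn-style predict method (class and parameter names are illustrative):

```python
# A minimal sketch of the simulated chemical environment (names are hypothetical):
# the state transition simply adds the action, Gaussian noise models operational error,
# and the trained LightGBM predictor provides the reward signal.
import numpy as np

class RecipeEnvironment:
    def __init__(self, predictor, noise_scale=0.01):
        self.predictor = predictor          # e.g. the trained LightGBM model
        self.noise_scale = noise_scale

    def step(self, state, action):
        new_state = state + action + np.random.normal(0, self.noise_scale, size=state.shape)
        new_state = np.clip(new_state, 0.0, None)            # contents must be >= 0
        if new_state.sum() > 0:
            new_state = new_state / new_state.sum()           # contents must sum to 1
        reward = float(self.predictor.predict(new_state.reshape(1, -1))[0])
        return new_state, reward
```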

Reward design Instead of simply focusing on the diversity of recipes, we explore the possibility of generating novel recipes based on the existing knowledge base. We designed a reward function that consists of the final property score, containing the conductivity score and other constraints:

$Rew = \omega(s_t) + \alpha\,\frac{1}{\sqrt{2\pi} \times 8.2}\, e^{-\frac{1}{2}\left(\frac{temp-25}{\sigma}\right)^{2}}$    (4.1)

where ω(s_t) represents the prediction model. We include the temperature as a constraint because, according to domain knowledge, temperatures that are too high or too low are unfavourable. The model therefore generates more room-temperature recipes while preserving some probability for other temperatures, controlled by the weight α.
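A small sketch of this reward, following Eq. (4.1) with the predictor score plus the Gaussian temperature term (parameter names and defaults are illustrative):

```python
# A sketch of the reward in Eq. (4.1): predictor score plus a Gaussian bonus centered
# at room temperature (25 degrees C); alpha and sigma are tunable weights.
import numpy as np

def reward(state, temp, predictor, alpha=1.0, sigma=8.2):
    score = float(predictor.predict(state.reshape(1, -1))[0])            # omega(s_t)
    temp_bonus = alpha / (np.sqrt(2 * np.pi) * 8.2) \
        * np.exp(-0.5 * ((temp - 25) / sigma) ** 2)                      # temperature constraint
    return score + temp_bonus
```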

Model design


Figure 4.1: Framework of DDPG model in our approach.

Our model is built on DDPG. The input of the critic network is the action and the observation, and its output is the value function estimate Q(s, a). In addition, a neural network is used to approximate the policy function; this is the actor network, whose input is the observation s and whose output is the action a. The critic and actor networks are represented as Q(s, a; ω) and a = π(s; θ), where ω and θ denote the parameters of these models. An asynchronously updated target network is used in DDPG to ensure parameter convergence. The whole network architecture is shown in Fig. 4.1.

The connection between the two networks is as follows: first the environment gives an observation; the agent takes an action based on the output of the actor network (with noise added to the action); the environment receives the action and returns a reward R and a new observation. This process is one time step of an iteration. At this point we update the critic network according to the reward R, and then update the actor network in the direction suggested by the critic. We then move on to the next step and keep iterating until we have trained a good actor network. The goal of recipe generation is equivalent to fitting a Q function so that the agent generates an action a_t at state s_t that maximizes the future expected cumulative reward under the policy a = π_θ(s).

The critic network is used for value function approximation and is updated using gradient descent. Notice that both the actor and the critic use target networks:

$\mathrm{target}_t = R_{t+1} + \gamma\, Q\big(S_{t+1}, \pi(S_{t+1}; \theta^{-}); \omega^{-}\big)$    (4.2)

and the loss function:

$Loss = \frac{1}{N}\sum_{t=1}^{N}\big(\mathrm{target}_t - Q(S_t, a_t; \omega)\big)^{2}$    (4.3)

To evaluate the policy, we need an objective to optimize, called the policy objective function J(θ). We want the best policy parameters θ that make J(θ) optimal; its derivative leads to the policy gradient $\nabla_\theta J(\theta)$. We should update the policy parameters in a way that makes the value function larger. The Deterministic Policy Gradient Theorem [57] provides a method to update a deterministic policy. Given the agent's policy π and the TD error δ, the value of the state-action pair $Q_\pi(S_t, A_t)$ and the policy parameters are updated as:

$\delta_t = R_{t+1} + \gamma\, Q(S_{t+1}, \pi_\theta(S_{t+1}); \omega) - Q(S_t, a_t; \omega)$    (4.4)

$\omega_{t+1} = \omega_t + \alpha_\omega \cdot \delta_t \cdot \nabla_\omega Q(S_t, a_t; \omega)$    (4.5)

$\theta_{t+1} = \theta_t + \alpha_\theta \cdot \nabla_\theta \pi_\theta(S_t) \cdot \nabla_a Q(S_t, a_t; \omega)$    (4.6)

where α is the learning rate, a hyper-parameter that controls the scale of the update. Here, the TD error δ records the difference at the last time step and is kept for the next time step's update.
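For concreteness, a condensed PyTorch sketch of one DDPG update step along the lines of Eqs. (4.2)-(4.6) is shown below; the actor, critic and target networks, optimizers and replay batch are assumed to exist, and the critic is assumed to take (state, action) pairs:

```python
# A condensed PyTorch sketch of one DDPG update step (the networks, optimizers and
# replay batch are assumed); this is an illustration, not the exact implementation.
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.9, tau=0.01):
    s, a, r, s_next = batch                                   # tensors from the replay buffer

    # Critic update: y = r + gamma * Q'(s', pi'(s'))  (Eq. 4.2), MSE loss (Eq. 4.3)
    with torch.no_grad():
        target_q = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: follow the deterministic policy gradient (Eq. 4.6)
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft (Polyak) update of the target networks
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)
```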


4.2 Model training

4.2.1 Bayesian optimization setting

We implement Bayesian optimization in this project using the "bayes_opt" toolkit [58]. The parameter search scope is set from 0.10 to 0.99. The likelihood function is the default Gaussian function. The number of initial points is 400 and the number of iterations is set to 500.
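A minimal sketch of this setup with the bayes_opt toolkit is given below; the objective wrapping the conductivity predictor and the per-dimension parameter names are assumptions for illustration:

```python
# A minimal sketch of the Bayesian optimization setup with the bayes_opt toolkit;
# the "predictor" object and the 102 parameter names are hypothetical.
import numpy as np
from bayes_opt import BayesianOptimization

pbounds = {f"x{i}": (0.10, 0.99) for i in range(102)}      # search scope per dimension

def objective(**kwargs):
    vec = np.array([kwargs[f"x{i}"] for i in range(102)])
    vec = vec / vec.sum()                                   # keep contents summing to 1
    return float(predictor.predict(vec.reshape(1, -1))[0])  # hypothetical trained predictor

optimizer = BayesianOptimization(f=objective, pbounds=pbounds, random_state=1)
optimizer.maximize(init_points=400, n_iter=500)
print(optimizer.max)
```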

4.2.2 Training Reinforcement Learning model

Model-environment relations

We use the experience simulated by the model to replace actual experience in the learning method. In our approach, we build an environment model, but only to simulate the real chemical reactions in a battery. This environment model sits outside the Reinforcement Learning (RL) model and tries to mimic the response of a true wet-lab chemical reaction. Note that in a model-based method, the agent could predict which action would be more worthwhile to take, whereas in our simulation environment model each step's reward can be calculated. Our model is therefore classified as a model-free method, while the simulation model helps the agent perform better.

Adoption of RL

For the actor network, the output dimension is the dimension of our target artifact. For the critic network, the output is a 1-dimensional vector: for each sample, a corresponding y (conductivity) estimate is output. Their input dimensions are the fixed recipe dimension (102-D).

We found that the output of the actor network is a dense vector, whereas in actual experiments we want the generated vector to be sparse, that is, only a few dimensions have values and the rest are 0. Our way to achieve sparseness is a max-out operation: for each domain, we keep only the largest 1-2 dimensions as the final reserved dimensions and reduce the others to 0. In practice, we implemented the max-out in two places: in the actor network and in the environment model. For the former, we extended and rewrote the final activation function of the actor network so that the argmax is computed in both the forward and backward passes (argmax is not differentiable, so the backward pass here explicitly uses the mean as a surrogate). For the latter, we explicitly check whether the dimensionality of the state meets our expectation when applying the action to the state.

Another case that requires post-processing is the dense vector from the actor network. If the dense vector is nearly evenly distributed, the argmax operation can hardly determine which chemical should be selected from the 102-D vector; in other words, if two elements of the output vector differ only slightly, the argmax operation is likely to pick the wrong one. Besides, by using argmax we already assume that the elements of the generated vector denote the probability of each position's chemical; however, after the argmax operation the wt% of the chemicals is also taken from the same elements. Our current solution is to let argmax keep the same highest element for both the probability and the wt%, and to process them by normalization.
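A sketch of this max-out post-processing, with hypothetical domain boundaries matching the 102-D encoding of Chapter 2:

```python
# A sketch of the max-out post-processing: within each chemical domain only the top-k
# entries of the actor's dense output are kept, then the kept weights are re-normalized.
import numpy as np

# Hypothetical domain boundaries inside the 102-D vector (see the encoding in Chapter 2).
DOMAINS = [(0, 39), (39, 65), (65, 84), (84, 101)]

def sparsify(dense_vec, k=1):
    sparse = np.zeros_like(dense_vec)
    for start, stop in DOMAINS:
        segment = dense_vec[start:stop]
        top = np.argsort(segment)[-k:]                        # largest 1-2 dimensions per domain
        sparse[start + top] = segment[top]
    if sparse[:101].sum() > 0:
        sparse[:101] = sparse[:101] / sparse[:101].sum()      # normalize kept wt% to sum to 1
    return sparse
```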

Training setting

The parameters in the RL model are three-fold: environment-related parameters, critic-model-related parameters and actor-model-related parameters. We generally follow the plain DDPG settings but adjust the feature-related parts. Our action and state spaces lie in the same-dimensional numerical space. Our environment and problem define a continuing task, so an episode termination condition must be set (we tested limits from 1 to 64 steps).

We applied 1e-4 as the learning rate for both the actor and the critic network, with the batch size set to 32 and the hidden size set to 512. The critic network is optimized with an MSE loss. As in the original version of DDPG, the replay memory and noise mechanisms are also included, with a memory capacity of 5000. The default temperature is set to 25 degrees.

4.3 Evaluation

After harvesting the generated recipes, we still have to post-process them. Common post-processing steps, such as availability filtering and ranking-based selection, are shared among the generation models we used.

Figure 4.2: Line plot: generated recipe conductivity results versus iterations. Bar plot: sample number w.r.t. bins of conductivity.

For Bayesian optimization, we need to take care of the distribution of the vectors. Because Bayesian optimization is prone to getting stuck in local minima, the results may end up congregated and the vector distributions may be similar. We observed that the output vectors sometimes look similar, and after a distribution check we discard such recipes.

For RL, as discussed in the reward design part, our reward function is devised with recipe conductivity thresholds (e.g. a score of more than 0.8), where the score is predicted by our environment model. In each episode, 2 recipes were automatically generated, and all of them matched the defined rules even without pre-training the model. Then, as the common filtering and selection steps, we applied our filter to screen out the available recipes. Finally, these recipes are ranked by their estimated conductivities, and the top recipes are verified by wet-lab experiments.
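A minimal sketch of this filtering and ranking step (the predictor object and thresholds are assumed):

```python
# A sketch of the common post-processing: filter generated recipes for validity,
# then rank by predicted conductivity and keep the top candidates for wet-lab tests.
import numpy as np

def filter_and_rank(recipes, predictor, threshold=0.8, top_n=10):
    valid = [r for r in recipes
             if np.all(r >= 0) and abs(r[:101].sum() - 1.0) < 1e-3]   # availability filter
    scored = [(float(predictor.predict(r.reshape(1, -1))[0]), r) for r in valid]
    scored = [item for item in scored if item[0] >= threshold]        # conductivity threshold
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_n]
```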

First we want to verify that the RL model works properly with respect to iteration. Figure 4.2 shows the average conductivity per 100 epochs during model training. We can observe that the model jitters all the time, showing no clear sign of convergence even after 400 epochs. The jittering may come from the frequent restarts of episodes. In addition, the figure shows a histogram-like bar plot (red). We can conclude that most generated recipes have a score of around 0.8, and the numbers of samples with higher and lower scores are similar. Based on this evidence, we can say the model did work on our problem.

Figure 4.3 and Figure 4.4 portray representative predicted normalized ionic conductivities of polymer electrolytes generated by the two models. The X axis represents the electrolyte sample number and the Y axis represents the normalized ionic conductivity. The reinforcement learning model generates and optimizes formulations (i.e. recipes) with ionic conductivity higher than the maximum conductivity represented in the training data (i.e. normalized conductivity less than 1, or ionic conductivity as high as 3.7 × 10^-2 S/cm). By comparison, our RL model is preferred as the recipe generator.

In addition, we also conducted studies on controlling the variation of the generation, shown in Figure 4.5. Inspired by the earth mover's distance [59] between distributions, we also included this mechanism in our loss function for comparison. The standard deviation (std) over iterations also indicates that our model learned from the data while keeping a low generative std.

The limitation of the generative model is the rate at which it generates satisfactory recipes; it can take many iterations to obtain a recipe valuable enough for testing. We can draw a line at y = 1 to see whether a prediction exceeds the maximum score in the database. The predictor, acting as the environment, is in this case the bottleneck of the generative model. We will devise a metric for it and tune it in the future.


Figure 4.3: Line chart showing the generated recipes' transformed conductivity scores (Y-axis) with respect to the iteration time step (X-axis). Points above the line y=1 are conductivities larger than all existing recipes' conductivities in our database.

Figure 4.4: Line chart showing the generated recipes' transformed conductivity scores (Y-axis) with respect to the iteration time step (X-axis). There are no points above the line y=1; in this situation the RL model outperformed the Bayesian optimization method.


Figure 4.5: Standard deviation of the generated recipes. The std decreases slowly because the generator learned the environment. Our model successfully controlled the std.


Chapter 5

Summary

We developed a transformative machine learning framework, MARS, to accelerate advanced material R&D. The framework comprises a machine learning predictor that predicts an objective function from a recipe, and a reinforcement learning model that generates a plurality of proposed different recipes of battery materials that optimize the objective function. This framework spotlights generative machine learning models using structural input data.

The whole process of our proposed framework includes: predicting, by a machine learning model (LightGBM), the conductivity of given recipes; generating, by a reinforcement learning model (DDPG), a plurality of proposed different recipes of battery materials by optimizing the conductivity and other objective functions; and preparing an instance of at least one of the proposed recipes of battery materials via a robotic preparation module. The framework further integrates with a high-throughput robotic platform. Instances of the different proposed recipes of battery materials are prepared and deposited into an electrochemical module by the robotic preparation module. A robotic testing module executes a plurality of formulation characteristic tests on each deposited recipe instance and updates the database and the machine learning model.

Based on the results of the RL model, we believe our approach of combining AI machine learning and robotic high-throughput automation will greatly reduce the cost and time to market for new and improved materials. It can potentially cut the discovery time for new material solutions by a factor of 10, from 10-20 years down to 1-2 years.

We made the assumption that the direct tabular information can be exploited by a machine learning model, and our experiments on the prediction model confirm this assumption. Other information, such as chemical formulas in SMILES format, would also help the model. The generated recipes focus only on conductivity; other objectives have not yet been included. We plan to follow the experience from the conductivity project, predicting and then generating with extra information. In the future, when multiple objective functions are considered, we can combine them into a single weighted total objective function, so that the overall objective is optimized with the same search algorithm. In some other cases, a certain objective function may be used as a constraint, for example, searching for the optimal ionic conductivity given a certain range of Young's modulus. In this case, the search will reject solutions that do not satisfy the constraints.

In addition, after some number of iterations, we noticed that all of these models showed less improvement, or even decline. Generative-model researchers have recently raised the concern that a generation model may learn to hack its environment model, so this phenomenon may be caused by the generative model fooling the predictor. This would also lead to similarity among the generated recipes. Our t-SNE analysis also indicates that the recipes are grouped in clusters, while their scores do not follow any obvious pattern. Investigating the interactions between environments and generation models will be our future work.


Bibliography

[1] Wencong Lu, Ruijuan Xiao, Jiong Yang, Hong Li, and Wenqing Zhang. Data

mining-aided materials discovery and optimization. Journal of Materiomics,

3(3):191 – 201, 2017. High-throughput Experimental and Modeling Research

toward Advanced Batteries.

[2] Propagator Ventures. Why we invested in kebotix, 2018. [Online; posted 8-Nov-

2018].

[3] Christophe Pillot. The rechargeable battery market and main trends 2016–2025.

2017.

[4] Cormac Toher, Jose Plata, Ohad Levy, Maarten Jong, Mark Asta, Marco Buon-

giorno Nardelli, and Stefano Curtarolo. High-throughput computational screen-

ing of thermal conductivity, debye temperature, and gruneisen parameter using

a quasiharmonic debye model. Physical Review B, 90, 11 2014.

[5] Yuan Dong, Chuhan Wu, Chi Zhang, Yingda Liu, Jianlin Cheng, and Jian Lin.

Bandgap prediction by deep learning in configurationally hybridized graphene

and boron nitride. npj Computational Materials, 5:26, 02 2019.

[6] Rama Vasudevan, Kamal Choudhary, Apurva Mehta, Ryan Smith, Gilad Kusne,

Francesca Tavazza, Lukas Vlcek, Maxim Ziatdinov, Sergei Kalinin, and Ja-

son Hattrick-Simpers. Materials science in the artificial intelligence age: high-


throughput library generation, machine learning, and a pathway from correla-

tions to the underpinning physics. MRS Communications, 9:1–18, 07 2019.

[7] Suriani Ibrahim and Mohd Johan. Conductivity, thermal and neural network model nanocomposite solid polymer electrolytes (LiPF6). International Journal of Electrochemical Science, 6, 11 2011.

[8] Fang Ren, Logan Ward, Travis Williams, Kevin Laws, Christopher Wolverton,

Jason Hattrick-Simpers, and Apurva Mehta. Accelerated discovery of metallic

glasses through iteration of machine learning and high-throughput experiments.

Science Advances, 4:eaaq1566, 04 2018.

[9] Krishna Rajan. Combinatorial materials sciences: Experimental strategies for

accelerated knowledge discovery. Annual Review of Materials Research, 38:299–

322, 08 2008.

[10] Wildcat Discovery Technologies. Wildcat Discovery Technologies discloses fundamental advances in rechargeable battery materials technology, March 2011. https://www.businesswire.com/news/home/20110314005427/en/wildcat-discovery-technologies-discloses-fundamental-advances-rechargeable.

[11] Xiao Wan, Wentao Feng, Yunpeng Wang, Haidong Wang, Xing Zhang,

Chengcheng Deng, and Nuo Yang. Materials discovery and properties prediction

in thermal transport via materials informatics: A mini review. Nano Letters,

19(6):3387–3395, 2019. PMID: 31090428.

[12] Gerbrand Ceder. Opportunities and challenges for first-principles materials de-

sign and applications to li battery materials. MRS bulletin, 35(9):693–701, 2010.


[13] Ao Huang, Yanzhu Huo, Juan Yang, and Guangqiang Li. Computational simu-

lation and prediction on electrical conductivity of oxide-based melts by big data

mining. Materials, 12, 2019.

[14] Tulay Ekemen Keskin, Emre Ozler, Emrah Sander, Muharrem Dugenci, and

Mohammed Ahmed. Prediction of electrical conductivity using ann and mlr: a

case study from turkey. Acta Geophysica, 05 2020.

[15] Antonio Lavecchia. Machine-learning approaches in drug discovery: methods

and applications. Drug Discovery Today, 20(3):318–331, March 2015.

[16] Bowen Tang, Fengming He, Dongpeng Liu, Meijuan Fang, Zhen Wu, and Dong

Xu. Ai-aided design of novel targeted covalent inhibitors against sars-cov-2.

bioRxiv, 2020.

[17] Rafael Gomez-Bombarelli, Jennifer N. Wei, David Duvenaud, Jose Miguel

Hernandez-Lobato, Benjamın Sanchez-Lengeling, Dennis Sheberla, Jorge

Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, and Alan Aspuru-

Guzik. Automatic chemical design using a data-driven continuous representation

of molecules. ACS Central Science, 4(2):268–276, January 2018.

[18] Arpan Kar. Machine learning applications in supply chain management. CII Con-

ference on E2E TrimodalSupply chain: Envisioning Collaborative, Cost Centric,

Digital Cognitive Supply Chain, 07 2016.

[19] Kan Hatakeyama-Sato, Toshiki Tezuka, Momoka Umeki, and Kenichi Oyaizu.

Ai-assisted exploration of superionic glass-type li+ conductors with aromatic

structures. Journal of the American Chemical Society, 142(7):3301–3305, 2020.

PMID: 31939282.

[20] M. L. Green, C. L. Choi, J. R. Hattrick-Simpers, A. M. Joshi, I. Takeuchi, S. C.

Barron, E. Campo, T. Chiang, S. Empedocles, J. M. Gregoire, A. G. Kusne,


J. Martin, A. Mehta, K. Persson, Z. Trautt, J. Van Duren, and A. Zakutayev.

Fulfilling the promise of the materials genome initiative with high-throughput

experimental methodologies. Applied Physics Reviews, 4(1):011105, March 2017.

[21] Juan J. de Pablo, Nicholas E. Jackson, Michael A. Webb, Long-Qing Chen,

Joel E. Moore, Dane Morgan, Ryan Jacobs, Tresa Pollock, Darrell G. Schlom,

Eric S. Toberer, James Analytis, Ismaila Dabo, Dean M. DeLongchamp, Gre-

gory A. Fiete, Gregory M. Grason, Geoffroy Hautier, Yifei Mo, Krishna Rajan,

Evan J. Reed, Efrain Rodriguez, Vladan Stevanovic, Jin Suntivich, Katsuyo

Thornton, and Ji-Cheng Zhao. New frontiers for the materials genome initiative.

npj Computational Materials, 5(1), April 2019.

[22] Tianyu Liu, Kexiang Wang, Lei Sha, Baobao Chang, and Zhifang Sui. Table-

to-text generation by structure-aware seq2seq learning. In Thirty-Second AAAI

Conference on Artificial Intelligence, 2018.

[23] Nansi Xue, Wenbo Du, Amit Gupta, Wei Shyy, Ann Sastry, and Joaquim Mar-

tins. Optimization of a single lithium-ion battery cell with a gradient-based

algorithm. Journal of The Electrochemical Society, 160:A1071–A1078, 05 2013.

[24] Diederik P Kingma and Max Welling. Auto-encoding variational bayes, 2013.

[25] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-

Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adver-

sarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and

K. Q. Weinberger, editors, Advances in Neural Information Processing Systems

27, pages 2672–2680. Curran Associates, Inc., 2014.

[26] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learn-

ing. MIT Press, Cambridge, MA, USA, 1st edition, 1998.


[27] RE l Perez and K Behdinan. Particle swarm approach for structural design

optimization. Computers & Structures, 85(19-20):1579–1588, 2007.

[28] C. Zhang and Y. Peng. Stacking vae and gan for context-aware text-to-image

generation. In 2018 IEEE Fourth International Conference on Multimedia Big

Data (BigMM), pages 1–5, 2018.

[29] Steffen Holldobler, Sibylle Mohle, and Anna Tigunova. Lessons learned from

alphago. 06 2017.

[30] kaggle, an online community of data scientists and machine learning practition-

ers. https://www.kaggle.com.

[31] mlcourse.ai. mlcourse.ai – open machine learning course, feature engineering and feature selection. https://mlcourse.ai.

[32] Piotr Gromski, Alon Henson, Jaroslaw Granda, and Leroy Cronin. How to ex-

plore chemical space using algorithms and automation. Nature Reviews Chem-

istry, 3, 01 2019.

[33] Mariya Popova, Olexandr Isayev, and Alexander Tropsha. Deep reinforcement

learning for de novo drug design. Science Advances, 4(7):eaap7885, July 2018.

[34] Kyle Banker. MongoDB in Action. Manning Publications Co., USA, 2011.

[35] Andrius Vabalas, Emma Gowen, Ellen Poliakoff, and Alexander J. Casson. Ma-

chine learning algorithm validation with a limited sample size. PLOS ONE,

14(11):1–20, 11 2019.

[36] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm

for model fitting with applications to image analysis and automated cartography.

Commun. ACM, 24(6):381–395, June 1981.


[37] Peter J. Huber. Robust estimation of a location parameter. Ann. Math. Statist.,

35(1):73–101, 03 1964.

[38] Thomas P. Minka. Bayesian linear regression. Technical report, 3594 Security

Ticket Control, 1999.

[39] Leo Breiman. Random forests. Mach. Learn., 45(1):5–32, October 2001.

[40] Alexey Natekin and Alois Knoll. Gradient boosting machines, a tutorial. Fron-

tiers in Neurorobotics, 7:21, 2013.

[41] I. Rish. An empirical study of the naive bayes classifier. Technical report, 2001.

[42] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In

Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge

Discovery and Data Mining, KDD ’16, page 785–794, New York, NY, USA, 2016.

Association for Computing Machinery.

[43] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qi-

wei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision

tree. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish-

wanathan, and R. Garnett, editors, Advances in Neural Information Processing

Systems 30, pages 3146–3154. Curran Associates, Inc., 2017.

[44] Suraj Srinivas, Ravi Kiran Sarvadevabhatla, Konda Reddy Mopuri, Nikita

Prabhu, Srinivas S. S. Kruthiventi, and R. Venkatesh Babu. A taxonomy of

deep convolutional neural nets for computer vision. Frontiers in Robotics and

AI, 2:36, 2016.

[45] Joseph Walsh, Niall O’ Mahony, Sean Campbell, Anderson Carvalho, Lenka

Krpalkova, Gustavo Velasco-Hernandez, Suman Harapanahalli, and Daniel Ri-

ordan. Deep learning vs. traditional computer vision. 04 2019.


[46] Dominik Scherer, Andreas Muller, and Sven Behnke. Evaluation of pooling

operations in convolutional architectures for object recognition. In Konstantinos

Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural

Networks – ICANN 2010, pages 92–101, Berlin, Heidelberg, 2010. Springer Berlin

Heidelberg.

[47] Convolutional Neural Networks for Image and Video Processing, Technische Universität München. Layers of a convolutional neural network, 2014.

[48] Martın Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey

Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al.

Tensorflow: A system for large-scale machine learning. In 12th {USENIX} Sym-

posium on Operating Systems Design and Implementation ({OSDI} 16), pages

265–283, 2016.

[49] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gre-

gory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga,

Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Rai-

son, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai,

and Soumith Chintala. Pytorch: An imperative style, high-performance deep

learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlche-Buc,

E. Fox, and R. Garnett, editors, Advances in Neural Information Processing

Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.

[50] Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel,

Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron

Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal

of machine learning research, 12(Oct):2825–2830, 2011.


[51] Wes McKinney et al. Data structures for statistical computing in python. In

Proceedings of the 9th Python in Science Conference, volume 445, pages 51–56.

Austin, TX, 2010.

[52] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He.

Deepfm: A factorization-machine based neural network for ctr prediction. pages

1725–1731, 08 2017.

[53] Dzmitry Bahdanau, Kyunghyun Cho, and Y. Bengio. Neural machine translation

by jointly learning to align and translate. ArXiv, 1409, 09 2014.

[54] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization.

International Conference on Learning Representations, 12 2014.

[55] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes

for Machine Learning (Adaptive Computation and Machine Learning). The MIT

Press, 2005.

[56] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Tim-

othy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous

methods for deep reinforcement learning. In International conference on machine

learning, pages 1928–1937, 2016.

[57] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and

Martin Riedmiller. Deterministic policy gradient algorithms. 2014.

[58] Fernando Nogueira. Bayesian Optimization: Open source constrained global

optimization tool for Python, 2014–.

[59] Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian Meyer, and Steffen

Eger. Moverscore: Text generation evaluating with contextualized embeddings

and earth mover distance, 09 2019.
