Experiments with Distributed Training of Neural Networks on the Grid
Maciej Malawski (1), Marian Bubak (1,2), Elżbieta Richter-Wąs (3,4), Grzegorz Sala (3,5), Tadeusz Szymocha (3)

(1) Institute of Computer Science AGH, Mickiewicza 30, 30-059 Kraków, Poland
(2) Academic Computer Centre CYFRONET, Nawojki 11, 30-950 Kraków, Poland
(3) Institute of Nuclear Physics, Polish Academy of Sciences, Kraków, Poland
(4) Institute of Physics, Jagiellonian University, Kraków, Poland
(5) Faculty of Physics and Applied Computer Science AGH, Kraków, Poland

{bubak,malawski}@agh.edu.pl, [email protected], [email protected], [email protected]
Testbed for our experiments: EGEE project
• Virtual Organization for Central Europe
• Grid sites: CYFRONET Kraków, PSNC Poznań, KFKI Budapest, CESNET Prague, TU Košice
• Support for MPI applications
Why neural networks
• Once trained, they are efficient and accurate
• Applicable for classification and prediction
• Proven in a wide range of applications
Challenges
• Neural network training is a highly compute-intensive task – may need High Performance Computing
• Finding an optimal configuration may be time-consuming: many experiments with various parameters – may need High Throughput Computing
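The High Throughput side of the problem can be pictured as a parameter sweep: each combination of network settings becomes one independent training job. The sketch below is a minimal illustration (the parameter names and value grids are hypothetical, not taken from the experiments):

```python
from itertools import product

# Hypothetical parameter grid: every combination of settings
# becomes one independent training job that can run on the Grid.
hidden_neurons = [5, 10, 20]
learning_rates = [0.01, 0.1]
momenta = [0.0, 0.9]

jobs = [
    {"hidden": h, "lr": lr, "momentum": m}
    for h, lr, m in product(hidden_neurons, learning_rates, momenta)
]
# 3 * 2 * 2 = 12 independent training jobs to run concurrently
```

Because the jobs are mutually independent, they can be submitted to many Grid sites at once, which is exactly where High Throughput Computing pays off.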
Solution: The Grid
• Distributing the computation over a cluster of machines can significantly reduce the computation time.
• Utilizing resources (multiple clusters) available on the Grid makes this task less time-consuming for the researcher.
Target application: High Energy Physics
• Discrimination between signal and background events coming from the particle detector (simulation)
• ROOT and Athena as the basic data analysis tools
Observation
Training of neural networks on the Grid requires many repeated tasks:
• job preparation,
• submission,
• status monitoring,
• gathering of results.
Performing them manually is time-consuming for the researcher → tools that automate such tasks can facilitate the whole process considerably.
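For illustration, a single training job on the EGEE testbed would be described by a JDL (Job Description Language) file along the lines of the sketch below. The attribute names follow the EGEE/gLite JDL conventions; the script and file names are purely illustrative, not taken from the experiments:

```
Executable    = "train_network.sh";
Arguments     = "--hidden 10 --epochs 500";
JobType       = "MPICH";
NodeNumber    = 4;
StdOutput     = "train.out";
StdError      = "train.err";
InputSandbox  = {"train_network.sh", "training_data.root"};
OutputSandbox = {"train.out", "train.err", "weights.dat"};
```

Preparing, submitting, and tracking many such descriptions by hand is exactly the repetitive work that the tools are meant to automate.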
Our goals
• Develop tools facilitating the usage of the Grid for multiple classification experiments
• Investigate and validate algorithms for distributed neural network training
• Allow seamless integration with data analysis tools such as ROOT
Backpropagation algorithm (master-worker scheme)
[Diagram: The Master reads the training data and the network details, then distributes the data to the nodes. Each node (Node 1, Node 2, ..., Node i) calculates the error and updates the weights. The loop repeats until the stopping criterion is met.]
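The master-worker loop above can be sketched in a few lines of Python. This is a minimal simulated sketch, not the actual implementation: the worker nodes run sequentially in one process (the real experiments used MPI across Grid sites), the "network" is a single sigmoid neuron, and the data are a synthetic toy set:

```python
import numpy as np

# Master reads the training data and the network details.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))          # toy training patterns
y = (X.sum(axis=1) > 0).astype(float)  # toy binary targets
w = np.zeros(4)                        # weights of a single sigmoid neuron

def node_gradient(x_shard, y_shard, w):
    """Each node calculates the error gradient on its local data shard."""
    pred = 1.0 / (1.0 + np.exp(-x_shard @ w))        # sigmoid output
    return x_shard.T @ (pred - y_shard) / len(y_shard)

# Master distributes the data to the nodes (3 simulated nodes here).
shards = np.array_split(np.arange(len(X)), 3)

lr = 0.5
for epoch in range(200):               # "repeat until stopping criterion is met"
    grads = [node_gradient(X[idx], y[idx], w) for idx in shards]
    w -= lr * np.mean(grads, axis=0)   # master averages gradients, updates weights

pred = (1.0 / (1.0 + np.exp(-X @ w)) > 0.5).astype(float)
accuracy = float((pred == y).mean())
```

In a real MPI deployment the shard distribution would be a scatter and the gradient averaging an all-reduce; the loop structure stays the same.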
Levenberg-Marquardt algorithm (master-worker scheme)
[Diagram: The Master reads the training data and the network details, then distributes the data to the nodes. Each node (Node 1, Node 2, ..., Node i) computes its Jacobian and Hessian contributions; the Master updates the weights. The loop repeats until the stopping criterion is met.]
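The distributed Levenberg-Marquardt step can also be sketched in simulated form. In this minimal sketch (not the actual implementation) each node computes its local Jacobian and the resulting J^T J (approximate Hessian) and J^T e contributions; the master sums them and solves the damped system to update the weights. The nodes run sequentially, and the model is a toy noiseless linear fit so the example stays short:

```python
import numpy as np

# Master reads the training data; toy model: y = 2*x - 0.5 (noiseless).
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(90, 1))
y = 2.0 * X[:, 0] - 0.5
w = np.zeros(2)                        # parameters [slope, intercept]

def node_contribution(x_shard, y_shard, w):
    """Each node computes its local Jacobian -> J^T J and J^T e parts."""
    J = np.column_stack([x_shard, np.ones(len(x_shard))])  # d(residual)/dw
    e = (J @ w) - y_shard                                  # residuals
    return J.T @ J, J.T @ e

# Master distributes the data to the nodes (3 simulated nodes here).
shards = np.array_split(np.arange(len(X)), 3)

lam = 1e-3                             # damping factor (illustrative value)
for step in range(20):                 # "repeat until stopping criterion is met"
    H = np.zeros((2, 2))
    g = np.zeros(2)
    for idx in shards:                 # gather per-node Jacobian/Hessian parts
        Hi, gi = node_contribution(X[idx], y[idx], w)
        H += Hi
        g += gi
    # Master solves the damped normal equations and updates the weights.
    w -= np.linalg.solve(H + lam * np.eye(2), g)
```

The key point of the scheme is that only the small J^T J matrices and J^T e vectors travel to the master, not the raw data, which keeps the per-iteration communication independent of the training-set size.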
Grid job submission
[Diagram: The User calls Submit(); the job is dispatched to the Grid, which distributes the work across Cluster A, Cluster B, and Cluster C.]