for natural sandstone on gpu supercomputer · for natural sandstone on gpu supercomputer fig.2...

Two-phase Porous Flow Simulation for Natural Sandstone on GPU Supercomputer

Distributed Computing for Machine Learning on Large-Scale Image Dataset

Electronic structure calculation of large nano carbon molecules using multi-GPU massively parallel cluster system

We investigated multiphase flow characteristics inside 3D porous rocks by a highly efficient multi-phase lattice Boltzmann method (LBM). Using GPU supercomputer (TSUBAME 2.5), we can conduct two-phase LBM simulation for extremely large-scale digital rock model (1000 3 grids) reconstructed from micro-CT scanned images. These efforts enable us to enlarge the size of digital rock model from the pore-scale ( μm scale ) to the core-scale (mm scale). As a result, we can compare the results derived from LBM simulation with laboratory experiments. In Carbon Capture and Storage (CCS) project, the CO2-water (two-phase fluid) migration processes in geologic reservoir are influenced by many reservoir parameters (e.g., temperature, interfacial tension, pore structure, wettability and pressure). Using the two-phase LBM, we calculated CO2–water behavior under various reservoir conditions and identified suitable conditions for effective CO2 storage. We further modeled mineral precipitation process of CO2 inside the rock pore by considering CO2-water flow.

Takeshi Tsuji Fei JiangInternational Institute for Carbon-Neutral Energy Research (I2CNER), Kyushu University

The characteristics of fluid flow in porous media have been

subject of great interest in a wide range of scientific and

engineering disciplines. In particular, the displacement of one

fluid by another (i.e., two-phase flow) is a process that is central

to enhanced oil recovery (EOR), carbon capture and storage

(CCS), remediation of non-aqueous phase l iquids (NAPL),

and geothermal system. Here we focus on CCS project as an

application of multiphase flow simulations.

CCS project involves capturing CO2 produced by large

industrial plants and injecting it deep into a geological formation.

This project prevents large amounts of CO2 emission into the

atmosphere [1]. In this project, the behavior of the CO2 injected

into a reservoir can be characterized as two-phase flow in a

porous media—specifically, the displacement of formation water

(wetting phase) by supercritical CO2 (nonwetting phase) during

CO 2 injection (i.e., drainage process), and CO 2 displacement

by water (i.e., imbibition process) after CO 2 injection. Pore-

scale interfacial instabilities, such as Haines jumps, snap-off and

fingering phenomena, relate to the stability, injectivity, mobility

and saturation of CO 2 in the reservoir. Therefore, understanding

pore-scale CO 2 f low is crucia l to est imating cr i t ical CO 2

characteristics in reservoir, including storage capacity, leakage

risk, and storage efficiency.

Recently, numerical models based on the lattice

Boltzmann method (LBM) have been proposed and are growing

in popularity because they provide a convenient means to

simulate both single-fluid and multiphase flow behavior within

a porous medium system [2]. Unlike traditional computational

fluid dynamics (CFD) methods, which solve the conservation

equations of macroscopic properties numerically, the basic idea

of LBM is that it considers a many-fictive-particle system obeying

the same conservation laws. Those particles perform consecutive

propagation and collision processes over a discrete lattice mesh.

The conventional multiphase models, such as the

level set [3], volume of fluid [4] and phase-field [5] methods, usually

need complicated auxiliary algorithms to track and handle the

fluid interfaces. Whereas phase separations can be generated

automatically from particle dynamics, and no special treatment

is needed to manipulate the interfaces in LBM. Moreover,

LBM has advantages in dealing with complex boundaries and

incorporating microscopic interactions for multi-physics. These

features make the LB model well suited to simulate multiphase

flow directly on digital rock models with complex pore structure

(Fig. 1).

However, multiphase LBM simulation for porous

medium systems with sufficient resolution and large grid-

number remains computationally challenging [6,7]. Since LB

models are computationally expensive, without efficient parallel

implementations of the algorithms, large amounts of computer

time will be required due to the number and complexity of

computations required.

Recent ly, a powerful para l le l co-processor , the

graphics processor unit (GPU), has drawn much attention

due to its outstanding performance for accelerating general

scientific computation. GPU computing can be performed using

parallel computing architecture CUDA. Over the past few years

implementation of single-phase LBM on GPUs has provided

excellent performance when compared with their central

processing unit (CPU) equivalent [8,9,10].

In this study, we extend and implement mult i -

phase LBM by adopting GPU computing techniques in order

to investigate multiphase flow characteristics inside 3D porous

rocks (Fig. 1). We used the micro-CT to scan the rock sample

with enough resolution to reveal the correct pore structures

in 2D slices. These 2D images are segmented and interpolated

to reconstruct the 3D digital rock model (Fig. 1). This binarized

porous media data is mapped to a uniform Cartesian grid.

Introduction 1

Fig. 1 Pore geometry of the sandstone extracted from micro-X CT images. Black parts indicate pore space. In CO2 storage, we inject CO2 into the pore space.

Multiphase lattice Boltzmann models have been proposed as an

efficient method to solve immiscible two-phase flow systems

in porous media. In the present study, we adopt an optimized

version of the RK-model [11] which uses a multiple-relaxation-time

lattice Boltzmann approach and is able to allow simulations with

variations of density and viscosity. The bounce-back scheme is

implemented for the non-slip boundary condition at the fluid–

solid interfaces. Different wetting conditions such as wetting

with a non-zero contact angle or mixed wetting can be easily

incorporated by changing the order parameter on the solid wall.

The multi-phase LBM used here has three remarkable

advantages. First, almost all the calculations are local; second,

complicated procedures for interface tracking are not necessary;

and third, the scheme is purely explicit without solving any

Poisson equations. These characteristics are very favorable for

parallel computing.

The parallel implementations are carried out at two

levels: the CUDA core level and the inter-GPU level. Because the

microprocessors of the GPU chips are thread-block organized

at the logical level, parallel computing inside a single GPU

board (CUDA core level) is to essentially define a relationship

between thread/block IDs (threadIdx.x/y, blockIdx.x/y) and

lattice nodes. Here, we define a grid of blocks in the x- and

y-directions, and assign each block a number of threads in each

direction corresponding to the number of lattice nodes (Fig. 2).

Consequently, one thread is responsible for calculating a column

of lattice nodes along the z-direction. The block size (blockDimx,

blockDimy) should be determined at the beginning of the

calculation, and the numbers of blocks in the x- and y-directions

are therefore Nx/blockDimx, Ny/blockDimy, respectively.

Consequently, there is no need to loop in the x- and y-direction

in the kernel code and all the threads are automatically ordered

by the GPU controller and executed simultaneously. We obtain

the following relations between thread/block IDs and nodes

index jx, jy in the x- and y-direction (Fig. 2):

Another important point to consider is the data layout

of the array to store the particle distribution function. Because

the threads do not access the memory individually, but in groups

of 32, the GPU will coalesce access to the contiguous memory

elements. As a result, continuous memory access is extremely

important to exploit the maximum memory bandwidth. Here,

to store the particle distribution, a one-dimensional data array is

adopted in which the data layout is defined as

To reduce the data memory access, the collision

step and the stream step are combined. As a consequence, the

computational efficiency is increased by 30% .

To simulate a large-scale system, two memory-saving

schemes have been incorporated into our code due to the

limited memory space in a single GPU board. First, we adopted a

sparse memory storage method to avoid storing the information

of solid nodes. Second, both collision and propagation processes

are designed to evolve in a single array with the size of a grid cell

at one time step.

Moreover, for extremely large-scale simulation, multi-

GPU implementation (inter-GPU level) is necessary. A simple

domain decomposition along the z-direction is performed, and

each GPU is assigned a subdomain of the total computational

volume. For each interface between these subdomains, we use

one additional layer of ghost cells (Fig. 3) which are synchronized

at every time step to ensure the correct calculation of the color

gradient.

Data communications have to be carried out between

neighboring GPU boards at each time step. There are two

patterns of data transfer mode (Fig. 3). One is the data transfer

between GPUs on different motherboards, and the other

is between GPUs connected to the same PCI-E bus on one

motherboard. For the first case, the data transfer has to be

conducted through host CPU sides, whereas the second can be

carried out by using direct peer-to-peer communication features

GPU implemented Lattice Boltzmann method 2

(Fig. 3). Ghost cell data that need to be exchanged between hosts

are transferred by an Message Passing Interface (MPI) library. The

data exchange between device and host is the most limiting

factor. To improve the efficiency, we used pinned-host memory

and only exchanged the values that were absolutely necessary

corresponding to five distribution functions for the velocity field

and two densities for the phase field.

With this highly efficient multi-phase LBM code, large-

scale computation with sufficiently high resolution for digital

models of natural rocks becomes possible. Using 20 nodes of

TSUBAME 2.5 (equipped with total 60 NVIDIA K20X GPUs), we

successfully carried out the two-phase flow behaviors in a large-

scale digital rock model with a system size of 10003 (Fig. 4). These

efforts enable us to enlarge the size of digital rock from the

pore-scale ( μm scale ) to the core-scale ( mm scale ). As a result,

we can compare the results derived from LBM simulation with

laboratory experiments.

Fig.2 Domain distribution in the CUDA framework. (a) Physical node assignment. (b) Logical block/thread partition. Each thread is assigned to a single (x,y) value and works on all z values.

Fig.3 Communication scheme showing data transfer between dif ferent GPU boards: Unified virtual addressing (UVA) CUDA memory copy between host and device; Peer-to- peer memory copy between devices in the same node; MPI data exchange between hosts. Ghost cells are represented by black filled squares.

Fig.4 Example of CO2 behavior within pore space of Berea sandstone (Fig. 1) using two-phase lattice Boltzmann method. The left is inf low side. White indicates the injected CO2. The solid grain and water are transparent in this figure. The grid size is 10003.

Here we showed a few applications of multiphase LBM scheme;

(1) multiphase flow behavior under various reservoir conditions,

(2) optimum reservoir conditions for effective residual CO 2

trapping and solubility CO 2 trapping, and (3) mineralization

modeling and its influence on the hydrologic properties.

3. 1 Multi-phase flow behavior controlled by reservoir conditions

The CO 2 - water behaviors in reservoir are influenced by many

reservoir parameters (e.g., temperature, interfacial tension,

pore structure, wettability and pressure). Here we quantify

hydrological properties under several reservoir conditions (e.g.,

interfacial tension) from CO 2 behavior within pore space (Fig. 5) [12,13].

In the conditions of lower interfacial tension (Fig. 5a),

the CO 2 moves faster. Indeed, the relative permeability calculated

from the fluid behavior decreases with increasing IFT because

of growing capillary trapping intensity (Fig. 5c). The relative

permeability estimated for several reservoir conditions is crucial

information in conventional field-scale reservoir fluid simulation.

In contrast, equilibrium CO 2 saturation is higher at higher

interfacial tension case (Fig. 5b), indicating the CO 2 storage

capacity is higher in this condition. This simulation demonstrates

that we can estimate CO 2 saturation and permeability at any

reservoir conditions using this approach.

3. 2 Optimum conditions for residual and solubility CO2 trapping

We calculate residually trapped CO 2 clusters to quantify CO 2

trapping mechanisms in natural sandstone (Fig. 1). The residual

CO 2 distribution is generated following simulation of the

drainage and imbibition processes (Fig. 6) [14]. The characteristics

of the residual CO 2 cluster in terms of size distribution, major

length, interfacial area, and sphericity are investigated under

conditions of different interfacial tension (IFT).

Our results indicate that high interfacial tension

increases the residual CO 2 saturation and leads to a large size

distribution of residual CO 2 clusters (Fig. 6c). In contrast, low

interfacial tension results in a larger interfacial area, which would

be beneficial for dissolution and reaction processes during

geological CO 2 storage (Fig. 6a). By using this method, we can

obtain residual CO 2 distributions under different conditions for

optimizing the CO 2 storage capacity.

Applications 3

Fig.5 Supercritical CO 2 infiltration into rock as a function of IFT [12]. CO 2 behavior displayed in panel (a) was calculated at lower IFT, compared to panel (b). (c) Relative permeability as a function of IFT at 50% CO 2 saturation.

Fig.6 Residual CO2 distributions [14]. IFT increases from left to right.

Fig.7 Mineral precipitation calculated from the CO2 f low within the rock [15]. Red and white parts represent precipitated mineral (carbonate) and pore space, respectively. Left is original rock model.

We have successfully developed two-phase LBM scheme for

large-scale digital rock models. Multi-GPU implementation

is crucial for the extremely large-scale simulation. Using

this approach, we (1) characterize the influence of reservoir

conditions upon multiphase flow behavior (permeability) and

saturation, (2) reveal optimum conditions for residual and

solubility CO2 trapping in CCS project, and (3) calculate CO 2

mineralization (or carbonate precipitation) by considering CO2

behavior, and evaluate its influence upon the permeability.

Recently, we have calculated elastic properties of CO 2 saturated

and mineralized rocks (i.e., digital rock) using dynamic wave

propagation simulations [17].

Using these numerical simulations for the large-scale

digital rock models, we can accurately model CO2 behaviors

and reactions occurred in rock pore. In near future, large-scale

geological formation at CO2 storage sites could be digitalized and

accurately controlled (or managed) by these modeling studies.

Acknowledgements

This study was supported by JSPS through a Grant-in-Aid for

Scientific Research on Innovative Areas (no. 15H01143), Grant-

in-Aid for Scientific Research (A) (no. 24246148), Grant-in-

Aid for Research Activity Startup (no. 26887028) and Bilateral

Joint Research Projects with MOSR, and JICA/JST through

SATREPS project. We gratefully acknowledge support of I2CNER,

sponsored by the World Premier International Research Center

Initiative (WPI), Ministry of Education, Culture, Sports, Science,

and Technology (MEXT), Japan.

References

[1] Metz, B., et al. eds. (2005), IPCC Special Report, Carbon

Dioxide Capture and Storage. Cambridge University Press,

Cambridge.

[2] Boek, E.S., M. Venturoli (2010), Lattice -Boltzmann studies

of fluid flow in porous media with realistic rock geometries.

Comput. Math. Appl. 59, 2305–2314.

[3] Sussman, M., A.S. Almgren, J.B. Bell, P. Colella, L.H. Howell,

M.L. Welcome (1999), An adaptive level set approach for

incompressible two-phase flows. J. Comput. Phys. 148,

81–124.

[4] Hirt , C.W., B.D. Nichols (1981), Volume of f luid ( VOF)

method for the dynamics of free boundaries. J. Comput.

Summary 43. 3 CO2 mineralization modeling

In CCS project, the injected CO2 would be precipitated in rock

pore. Here we calculate the mineralization processes of injected

CO2 and its influence upon the hydrological properties [15]. The

calcite deposition within the pore space is calculated by using an

advection–reaction formulation solved by finite volume method

(FVM); we model the precipitated rock by transferring the fluid

node to solid node according to the calcium concentration

level. The clogging model updates the solid phase according

to precipitat ion process and consequently changes the

geometry and porosity of the sample rock. Then fluid solver

is called to recalculate the flow field based on the evolved

pore microstructures. To validate our method, the carbonate

precipitation simulation is carried out on a beads pack model

and compared with laboratory experiment. Since the simulation

results are in consistent well with the laboratory data, our

method can simulate realistic mineralization process.

Our ca lculat ion shows that locat ion of mineral

deposition depends on not only the fluid velocity field but also the

rock structure (Fig. 7). The calculated permeability variation due to

the carbonate precipitation demonstrates that evolution of pore

structure significantly influences the absolute permeability, while

it only affects the relative permeability of non-wetting phase.

Because the movement of non-wetting fluid is suppressed due

to the increased capillary pressure associated with mineralization,

permeability of non-wetting phase is significantly influenced by

mineralization (pore size reduction) [16]. These observations can be

used in the long-term reservoir simulation including geochemical

reactions.

Phys. 39(1), 201–225.

[5] Bada lass i , V .E . , H .D . Ceniceros , S . Baner jee (2003) ,

Computation of multiphase systems with phase field

models. J. Comput. Phys. 190(2), 371–397.

[6] Ahrenholz. B, J. Tölke, P. Lehmann, A. Peters, A. Kaestner, M.

Krafczyk, W. Durner (2008), Prediction of capillary hysteresis

in a porous material using lattice-Boltzmann methods and

comparison to experimental data and a morphological

pore network model, Advances in Water Resources, 31 (9),

1151–1173.

[7] Ramstad. T, N. Idowu, C. Nardi, P.E. Øren (2012), Relative

permeability calculations from two-phase flow simulations

directly on digital images of porous rocks, Transport in

Porous Media, 94 (2), 487-504.

[8] Tölke, J., M. Krafczyk (2008), TeraFLOP computing on a

desktop PC with GPUs for 3D CFD. Int. J. Comput. Fluid

Dyn. 22(7), 443–456.

[9] Tölke, J. (2010), Implementation of a lattice Boltzmann

kernel using the compute unified device architecture

developed by nVIDIA. Comput. Vis. Sci. 13(1), 29–39.

[10] Kuznik, F., C. Obrecht, G. Rusaouen, J.J. Roux (2010), LBM

based flow simulation using GPU computing processor.

Comput. Math. Appl. 59(7), 2380–2392.

[11] Tölke, J., S. Freudiger, M. Krafczyk (2006), An adaptive

scheme using hierarchical grids for lattice Boltzmann multi-

phase flow simulations. Comput. Fluid 35, 820–830.

[12] Jiang, F., T. Tsuji, and C. Hu (2014), Elucidating the role

of interfacial tension for hydrological properties of two-

phase flow in natural sandstone by an improved lattice

Boltzmann method, Transport in Porous Media, 104, 1, 205-

[13] Yamabe, H., T. Tsuji, Y. Liang, and T. Matsuoka (2015),

Lattice Boltzmann simulations of supercritical CO 2- water

drainage displacement in porous media: CO2 saturation

and displacement mechanism, Environmental Science &

Technology, 49 (1), 537-543.

[14] Jiang, F., and T. Tsuji (2015), Impact of interfacial tension on

residual CO2 clusters in porous sandstone, Water Resources

Research, 51, 1710-1722.

[15] Jiang, F., and T. Tsuji (2014), Changes in pore geometry and

relative permeability caused by carbonate precipitation in

porous media, Physical Review E 90, 053306.

[16] Kitamura, K., F. Jiang, A.J. Valocchi, S. Chiyonobu, T. Tsuji,

and K.T. Christensen (2014), The study of heterogeneous

two-phase f low around small-scale heterogeneity in

porous sandstone by measured elastic wave velocities and

lattice Boltzmann method simulation, J. Geophys. Res. 119,

7564-7577.

[17] Yamabe, H., T. Tsuji, Y. Liang, T. Matsuoka (2016), Influence

of fluid displacement patterns on seismic velocity during

supercritical CO2 injection: Simulation study for evaluation

of the relationship between seismic velocity and CO 2

saturation, International Journal of Greenhouse Gas

Control, 46, 197-204.

Distributed Computing for Machine Learning on Large - Scale Image Dataset

Intensive researches have been revealing that machine-learning methods known as Deep Neural Networks (DNNs) show great classification capabilities through supervised training on massive datasets. This research aims to quantify a condition that primarily controls classification capabilities, and generate a high-performing ensemble classifier consisting of plural DNN models. As for the training, we used node-distributed machine-learning program that we developed from scratch. As many as 9 6 GPUs are used to train a single DNN model. Most models are trained during the TSUBAME Ground Challenge, using 1146 GPUs simultaneously at peak, reaching about 1 TFLOPS (single) per GPU in the cost derivative parts.

Image classif ication is one of the main research f ields in

Information Technology. Many industrial applications have come

out over the years, such as person detection with surveillance

cameras, appearance inspection in production equipment, and

object detection for automatic emergency braking systems

equipped with high-end automobiles, to name a few. Especially

for applications that ensure human safety classification accuracy

is crucial; therefore, technological progresses or breakthroughs

are still in high demand.

Performance of a machine-learning based classifier

in general depends on the algorithm, the size of the training

dataset, and the computational resources. Krizhevsky et al.

showed that far better classification performance is obtained

by DNNs, which learn the image features from data, rather than

conventional approaches that use predefined image features

and only train feature classifier [1]. It is important to know the

fact that a large-scale image dataset named as ImageNet was

made available behind this success [2]. In fact, there are cases

where DNNs perform rather poorly when the size of the dataset

is not large [3]. Since optimization of DNN requires a large amount

of numerical operations, therefore use of GPUs is a standard

practice today. Still, it is not rare to spend a month or even

longer for network training on a single GPU.

We have developed a node-distributed, machine-

learning program for fast DNN training as a part of project of

driving safety application development. We have confirmed that

our program runs a few tens of times faster than a single GPU on

TSUBAME 2.5.

We applied the above-mentioned program to a large-

scale image dataset, generate a high-performing image classifier

made of an ensemble of several DNN models selected from a

There are in general two types of approaches in distributed

machine learning programs: model parallel and data parallel.

Model parallel means to split model parameters and hidden

layer representation into two or more GPUs. (Here, we assume

the use of GPUs for training phase.) The forward and backward

steps require GPU-GPU communications in a synchronous way

to compute the cost derivatives [1, 4, 5]. Model parallel is only

meaningful when the memory required for all variables necessary

to compute cost derivatives exceeds the GPU memory. In other

words, it is a technique for training relatively large networks. The

other approach, data parallel, is to split data samples into two or

more GPUs. In this approach different GPUs work on different

images. Only data parallel is implemented in our program. One

can implement only model parallel [1] or both [4, 5].

Data parallel approach can further be categorized

into synchronous types [6] and asynchronous types [4]. In the

former types, once all worker GPUs reads in different images

and compute corresponding cost derivatives, the derivatives

get added up through communications to update weight

parameters and updated parameters are distributed to all

workers. This cycle is repeated until convergence. The merits

of this method are two-fold: many more images are processed

for a given time compared to single machine training, and

larger set of trained DNN models in a relatively short period of

time. We also quantify the relation between the number of free

parameters and classification performance from the set of trained

DNN. The latter experiment is preliminary at the moment, but it

still shows an intriguing trend behind the classification accuracy

and the DOF.

Introduction 1

Software 2

Ikuro Sato* Ryutaro Watanabe* Hiroki Nishimura** Akihiro Nomura*** Satoshi Matsuoka**** DENSO IT LABORATORY, INC.　** DENSO CORPORATION　*** Tokyo Institute of Technology

We used ILSVRC 2012 dataset [2], which is a subset of ImageNet.

The ILSVRC 2012 dataset consists of 128M training samples, 50K

validation samples and 100K test images. This dataset is one of

the largest datasets among public image dataset. Each image

has a class label out of 1000 class. (Number of images per class

is not unique.) The word “sample” means a pair of image and

the true class label. No true classes for the test set are given.

Classification accuracy on the test set can be only obtained by

submitting the result to the test server.

We describe our network designing strategy here. The fraction of

the number of network parameters over the number of training

samples is chosen from (1, 30) to study the relationship between

numbers of parameters and the classification scores. Total

number of convolutional layers in a network is chosen from 11

up to 48. The filter size is fixed to be 3 × 3, as is seen in [8]. The

numbers of feature maps in hidden layers are set by hand. A

fully-connected layer is placed at the very last layer before taking

softmax normalization in each model. Input image size is 200

× 200 pixel2 or larger for good visibility. This resolution can be

realized by inserting pooling layers appropriately. The number

of pooling layers in a model ranges from 2 to 5. One of the

networks trained in this work is shown in Table 1.

A few more sentences on fully-connected layers are in order. It

is a regular practice to place a couple of fully-connected layers

after map resolution is sufficiently reduced (for example, see [1, 8]).

In these layers particular weight parameter is linked to particular

pair of neurons, indicating that many parameters are required.

As a result, overfitting tends to take place under some conditions.

It is a common practice to use a regularization technique such as

dropout [10] in fully-connected layers to prevent such behavior. In

our model construction, a network has one and only one fully-

connected layer and the last feature map has relatively small

spatial dimension 3 × 3. We expected that this construction

poses structural regularization to some extent, preventing severe

overfitting, while reducing the number of parameters. In our

preliminary experiment, we confirmed the network similar to

that of Table 1 has better generalization ability, while it has fewer

parameters, than a network with more than one fully-connected

layer.

We summarize a l l the detai ls about the way of

optimization in Table 2. As many (few) as 96 GPUs (24 GPUs) are

used to train a single model. Most of the DNN models are trained

during the TSUBAME Ground Challenge. Training took roughly 7

days. During the Ground Challenge, we used 1146 GPUs at peak

to train 13 models simultaneously. The computational efficiency

that we measured on the derivative computation part was 0.98

TFLOPS (single) per GPU, whereas the theoretical computational

efficiency of the GPU (Tesla K20X) is 3.95 TFLOPS (single).

convergence is guaranteed. On the other hand, asynchronous

data parallel is more advantageous in training speed because it

prevent hardware resources from idling, though convergence

is not guaranteed. Reference [4] reports asynchronous training

cause no problem in convergence in practice.

Throughout this research we used an in-house,

machine-learning program, categorized into asynchronous data-

parallel approach without special nodes (known as parameter

servers). We will discuss the detail of the algorithm in other

occasion.

Dataset 3

DNN models 4

Table 1. An example of neural networks that we used in this work. It has 16M parameters and consists of 15 convolutional, 5 pooling, and 1 fully-connected layers. “C” means convolution, “CP” means convolution and pooling, and “S” means softmax normalization. Activation function symbols are omitted.

Distributed Computing for Machine Learning on Large -Scale Image Dataset

Table 2. Optimization detail.

Figure 1 shows some of validation images classified successfully.

The ground truth of Fig. 1 (a) is “unicycle”. What makes the

prediction difficult is that the wheel part is out of the frame, and

only a part of the shaft is visible. Nevertheless, the model can

predict the class with high confidence. Interpretation is that

crowd of people looking above and the pose of the performer in

the center tell much about the target object. From this sample,

DNNs likely capture scene context including the background.

Similar interpretation holds in Fig. 1 (b), where the ground truth

is “paddle”. Figure 1 (c) shows image of “bubble”, and again the

classifier correctly predicts the class. It used to be a hard problem

to recognize semi-transparent object because the background

image features admixes through the transparent part. The fact

that our model successfully recognizes this sample with high

confidence indicates that DNNs easily overcome this type of

difficulty.

Figure 2 shows some of validation images classified

wrongly. The ground truth of Fig. 2 (a) is “Arabian camel”, but

the model prediction is “dog sled”. The interpretation is that the

DNNs over-evaluate the context and belittle the shape of the

head. Figure 2 (b) shows image of “triceratops”, but our model

fails to answer this. Since our model fails to recognize frontal

shape of triceratops, the generalization ability is somehow

limited.

Qualitative evaluation 5

Fig. 1 Validation images correctly classified by our model.

(a) unicycle

(b) paddle

(c) bubble

Figure 3 shows the relationship between the classification

score and the number of training parameters. It indicates that

classification performance is more-or-less equally good when

the number of parameters is in the range of ( 8 × 106, 4 0 ×

106 ). It also indicates that classification performance drops

drastically when the number of parameters goes below 8 × 106.

It is interesting that this kind of trend is seen even though the

network architectures vary. This fact implies that the number

of parameters is the major factor for classification performance.

When distributed machine learning platform is taken into

consideration, use of models with less parameters is desirable in

terms of training speed. As long as this study is concerned, it is

reasonable to use of models with around 8 ×10 6 parameters.

We have tested every possible combinations of trained

DNNs to generate an ensemble classifier. It ended up with six

models summarized in Table 3. Here, ensemble classification

simply means to take an expectation value of the logarithm of

softmax outputs. We confirmed that about 2% reduction in top-

5 error rate on the validation set. It means that every model

extracts slightly different image features.

Quantitative evaluation 6

Fig. 3 The relationship between the classification accuracy and the number of training parameters. The blue line indicates the number of training samples.

Fig.2 Validation images falsely classified by our model.

(a) Arabian camel

(b) triceratops

Distributed Computing for Machine Learning on Large -Scale Image Dataset

Table 3. Constituents of the ensemble classifier.

Table 4. Evaluation of our ensemble model with representative ones. Column 3 and 4 show the largest numbers in the constituent models. (*We estimate the #parameters from their network architecture.)

We compare the result with representative ones.

Classification scores on the test set are summarized in Table

4. The state - of - the -art score is 3.57 % [9] , which is better than

human classification score 5.1 % [2]. Interestingly their networks

have connections that skip some layers, and they claim that

these architectures allow fast training even though the network

is deeper than 100 layers. GoogLeNet [7] has so-called “Inception

Modules” and authors claim that such networks yield high

classification performance with considerably reducing the number

of parameters. OxfordNet [8] proposes relatively simple architecture

with repeated 3 × 3 convolutional layers and dropout-regularized,

fully-connected layers. AlexNet [1] is famous for its breakthrough

in 2012. GoogLeNet has the lowest number of parameters among

these models, and ours is the second. AlexNet has the fewest

convolutional layers, and ours is the second. The score of our

model is better than AlexNet, lying in the reasonable range.

Many researches have been revealing the techniques for

effective optimization of deep (sometimes extremely deep)

neural networks. We are interested in finding ways to exploit

representation ability of neural networks while keeping the

amount of parameters as low as possible. Reduction of the

number of parameters is very important for training speedup.

Today’s DNNs requires a very large number of parameters

compared to the number of training samples. This fact makes

us believe there is some room to make the parameter space

compact.

Acknowledgements

The node-distr ibuted machine-learning program used in

this work was developed during one-year TSUBAME Trial Use

Program. Most of the DNN models used in this work were

trained during the TSUBAME Ground Challenge 2015. We

express our gratitude for such generous opportunities.

References

[1] A. Krizhevsky, et al., “ImageNet Classification with Deep

Convolutional Neural Networks”, NIPS 2012.

[2] O. Russakovsky, et a l . , “ ImageNet Large Sca le V isua l

Recognition Challenge”, IJCV 2015.

[3] M. Oquab, et al., "Learning and transferring mid-level image

representions using convolutional neural networks", CVPR

[4] J. Dean, et al., “Large Scale Distributed Deep Networks”,

NIPS 2012.

[5] R. Wu, et al., “Deep Image: Scaling up Image Recognition”,

arxiv:1501.02876, 2015.

[6] F. N. Iandola, et al. , “FireCaffe: Near-Linear Acceleration

of Deep Neural Network Training on Compute Clusters”,

arxiv:1511.00175, 2015.

[7] C. Szegedy, et al., “Going Deeper with Convolutions”, CVPR

[8] K. Simonyan and A. Zisserman, “Very Deep Convolutional

Networks for Large-Scale Image Recognition”, ICLR 2015.

[9] K. He, et al., “Deep Residual Learning for Image Recognition”,

arxiv:1512.03385, 2015.

[10] G. E. Hinton, et al., “Improving neural networks by preventing

co-adaptation of feature detectors”, arxiv:1207.0580, 2012.

[11] V. Nair and G. E. Hinton, “Rectified Linear Units Improve

Outlook 7

Restricted Boltzmann Machines”, ICML 2010.

[12] I. Sato, et al., “APAC: Augmented PAttern Classification with

Neural Networks”, arxiv:1505.03229, 2015.

Nano carbon molecules have been studied actively with

experimental, theoretical, and computational approach for the

application of organic molecular function materials. Elucidation

of local structure and stabil ity of nano carbon molecular

assemblers is essential to develop and evaluate those materials.

Electronic structure calculation is recognized as the powerful tool

for the investigation of those properties. Second order Møller-

Plesset perturbation (MP2) theory [1] is the robust method for the

accurate treatment of electronic correlations that is important

for the description of weak inter-molecular interactions such as

the van der Waals interaction, and applicable to the application

of calculations of large nano carbon molecular assemblers.

However, the computation costs of MP2 calculations are much

higher than the Hartree-Fock (HF) calculations, and the sizes of

molecules applicable to the MP2 calculations are limited. Thus,

the use of large computation systems such as K computer and

TSUBAME 2.5 is essential for the allocation of MP2 calculations of

large carbon molecular assemblers.

Our research team has been performed the NTChem[2,3]

software project for the massively paral lel computing of

electronic structures of large-scale molecules utilizing the

petascale supercomputers such as the K computer. Various

efficient theories and algorithms have been implemented into

NTChem (Fig. 1) such as (1) electronic structure theories of the

ground and excited states of molecules based on HF theory

and density functional theory (DFT) and (2) accurate electron

correlation theories for ground and excited states such as MP2

theory, coupled-cluster theory, and quantum Monte Carlo

method.

Message passing interface (MPI)/OpenMP hybrid

parallelization of HF, DFT, and MP2 codes are performed for

the massively parallel computing on the K computer. Massively

parallel computations using 10-100 thousand nodes of K

computer are possible for those calculations. Recently, the

authors have been developed an efficient massively parallel

algorithm and code for the calculations of electron correlation

energy of large molecules based on the resolution-of-identity

MP2 (RI-MP2) method[4,5]. The RI-MP2 calculation of large

molecules containing 360 atoms and 9,840 atomic orbitals

(AOs) was performed successfully using 8,911 nodes of the K

computer.[6]

Furthermore, number of heterogeneous system in

which the accelerators such as graphics processing units (GPUs)

are installed has been increasing year by year in the field of

supercomputers. Demands for electronic structure calculations

using accelerators have been increasing day by day. In the

previous study, we have been developed codes for the general

purpose computing on GPU (GPGPU) in NTChem. Recently, we

have performed the improvement of parallel algorithm and the

GPGPU implementation of RI-MP2 energy calculations utilizing

the multi-node and multi-GPU systems.[7]

In this article, we report the results of the multi-node

and multi-GPU computation of the RI-MP2 energies of large

Introduction 1

In this study, we performed electronic structure calculations of large nano carbon molecular assemblers on TSUBAME 2 .5 using newly developed multi general purpose computing on graphics processing units (GPGPU) program of the resolution-of-identity second order Møller-Plesset perturbation (RI-MP 2 ) energy calculations. We report the results of performance evaluation on TSUBAME 2 . 5 . The peak performance of the multi-GPU computations of largest nano carbon molecules (nanographene dimer (C9 6 H2 4 )2 ) using 1 , 3 4 9 nodes and 4 , 0 4 7 GPUs of TSUBAME 2.5 are 514.7 TFLOPs. We also report the inter-molecular interaction analysis of two-layer π - π stacked nanographene dimers.

Michio Katouda* Akira Naruse** Takahito Nakajima* * RIKEN Advanced Institute for Computational Science　** NVIDIA Japan

Fig.1 Main feature of NTChem program

nano carbon molecules on TSUBAME 2.5. The overview of GPGPU

code of RI-MP2 calculations used in this study is introduced in

Section 2. The results of performance evaluation of our GPGPU

code of RI-MP2 calculations are presented in Sections 3 and 4.

We also report the inter-molecular interaction analysis of π - π

stacked nanographene dimers as an example of nano carbon

molecular assemblers in Section 5.

2. 1 RI-MP2 method

The RI-MP2 method [4,5] is an efficient computation method to

speed up the MP2 calculations introducing the RI approximation

(or the density fitting approximation) of four-center electron

repulsion integrals (ERIs) as

introducing the auxil iary basis functions (indices l and n )

expanding the products of AOs. This approximation reduces

t h e c o m p u ta t i o n c o s t s a n d t h e m e m o r y r e q u i r e m e n t s

considerably due to the reduction of prefactors. The main

computation bottleneck of RI-MP2 calculations is the matrix-

matrix multiplication in eq. (1). But, the efficient computation

and thread-based parallelization is possible utilizing the DGEMM

routine in the optimized Basic Linear Algebra Subroutines (BLAS)[8] library for systems. The MP2 correlation energy is evaluated by

where the indices ( i and j ) of occupied MOs are distributed[5,9].

We also apply the in-core storage scheme for utilization of

distributed memory by intermediate data of three-center

integrals in eqs. (1) and (2) to eliminate the input-output (I/O)

overhead demanding in the conventional MP2 code. Recently,

we have improved the MPI parallel scalability applying the two-

dimensional (2D) hierarchical MPI parallelization utilizing the

local MPI communicator. This improvement attains the massively

parallelization utilizing more than 10 thousand processes. In

this scheme, we parallelized the matrix-matrix multiplication in

eqs. (1) and (2) with the block data distribution of matrices as

the second dimension of MPI parallelization as well as the first

dimension of MPI parallelization of virtual MO indices ( a and

b ). Using the improved code, the RI-MP2 calculation of large

molecules containing 360 atoms and 9,840 AOs was performed

successfully up to 80,199 nodes of the K computer.[7]

2. 3 Multi GPGPU implementation of RI-MP2 method

Before this study, we have performed multi GPGPU implementation[7]

into RI-MP2 parallel code mentioned in Section 2.2 using the

compute unified device architecture (CUDA)[10]. We apply the

offloading model in the multi-GPU computing where the matrix-

matrix multiplication of eqs. (1) and (2) is offloaded to the multi-

GPUs (Fig. 2).

The task distribution to the multi-GPUs are attained

assigning each MPI process for the partial matrix-matrix

multiplication of the blocked matrices mentioned in Section

2.2. The matrix-matrix multiplication is implemented using

cublasDgemm routine in cuBLAS[11] library optimized by NVIDIA.

The data transfer between CPUs and GPUs becomes serious

bottleneck in this GPU computing. To reduce this overhead,

we have implemented the batch processing of matrix-matrix

using the approximated integrals of eqs (1) and (2).

2. 2 Massively parallel implementation of RI-MP2 method

Before this study, we have performed the development of

massively parallel algorithm and MPI/OpenMP hybrid parallel

implementation of RI-MP2 energy calculations. [6,7] In this

implementation, the matrix-matrix multiplication in eq. (1) is

parallelized distributing the indices ( a and b ) of virtual molecular

orbitals (MOs) to the MPI processes. This scheme improves the

parallel scalability considerably than the previous schemes

Multi GPGPU implementation of RI-MP2 energy calculations on NTChem program 2

Fig.2 Scheme of GPGPU implementation of eq. (1) in RI-MP2 energy calculation

multiplication for different indices of virtual MOs ( a and b ) where

the data in the same batch is transfer between CPUs and GPUs

at once utilizing the page-locked (pinned) memory that enables

a direct memory access on GPUs to request transfers to and

from the host memory without the involvement of CPUs. The

CPU code and GPU code were compiled with Intel compiler and

CUDA 5.0 compiler, respectively. Intel Math Kernel Library was

used for BLAS and LAPACK library and cuBLAS library was used.

We first investigated the elapsed times and parallel performance

of the multi-GPU jobs of RI-MP2 energies on TSUBAME 2.5. We

performed the RI-MP2 energy calculations of nanographene

C96H24 using 64-512 nodes of TSUBAME 2.5. cc-pVTZ basis set [12]

and corresponding auxiliary basis set [13] are employed (120

atoms, 3,216 AOs, 300 occupied MOs, 2,916 virtual MOs, 8,496

auxiliary basis functions, and no symmetry). HF calculations

before MP2 calculations were performed on the K computer, and

the results of MOs and MO energies were used for input data.

3 MPI processes and 4 OpenMP threads per node and 1 MPI

process and 12 OpenMP threads per node are employed for GPU

and CPU jobs, respectively. The elapsed times, parallel scalability,

and speedup using GPUs are summarized in Table 1.

number of nodes increases up to 512 nodes (148 and 234 times of

speedups for the GPU and CPU jobs, respectively, using 512 nodes).

The GPU jobs are 4.1-6.6 times faster than the CPU jobs

using 64-512 nodes. Although the parallel performance of GPU jobs

is lower than that of CPU jobs, it is good for both cases with the

3Performance evaluation of multi-GPU computation of RI -MP2 energy on TSUBAME 2.5: speedups using GPU and parallel performance

Table 1. Elapsed time and parallel speedups of RI-MP2/cc-pVTZ calculation of nanographene C96H24 on TSUBAME 2.5

We then performed the multi-GPU computation of large

carbon molecules using nearly entire system of TSUBAME2.5.

We performed the RI-MP2 calculations of nanographene dimer

(C96H24) 2 using 1,349 nodes and 4,047 GPUs of TSUBAME 2.5.

cc-pVTZ basis set and corresponding auxiliary basis functions

are employed (240 atoms, 6,432 AOs, 600 occupied MOs, 5,832

virtual MOs, 16,992 auxiliary basis functions, and no symmetry).

3 MPI processes and 4 OpenMP threads per node and 1 MPI

process and 12 OpenMP threads per node are employed for GPU

and CPU jobs, respectively.

Fig. 3 presents a detailed profile of the elapsed time of

GPU and CPU jobs. Considerable improvement of performance

is attained using GPUs. The GPU computing is 5.9 times faster

than the CPU computing due to the reduction of time for matrix-

matrix multiplication in DGEMM routine decreases to very short

times, as shown in Fig. 3 (see the red color for this part). The

elapsed times of GPU job and CPU job are, respectively, 419 s

and 2,467 s. Considerable improvement of the measured peak

performance value was attained using GPUs. The measured peak

performance values of GPU and CPU codes are, respectively,

514.7 TFLOPs, and 87.5 TFLOPs. However, the ratio of measured

Multi -GPU computation of RI-MP2 energy of large carbon molecules 4

Fig.3 Detailed profile of elapsed time of RI-MP2/ cc-pVTZ calculations of nanographene dimer (C96 H24)2 using GPUs and CPUs on TSUBAME 2.5

We analyzed the π - π stacking interaction energies of two-layer

nanographene dimers from the results of calculations performed

in this study. The model molecules of dimers employed in this

analysis are (C24H12)2, (C54H18)2, and (C96 H24)2 (Fig. 4). We consider

the model of AB stacking structure (Fig. 4 (a)). Experimental C – C

bond (1.45 Å) and inter-layer (3.35 Å) distances of bulk graphite,

and C – H bond distance (1.1 Å) of benzene are used to build the

models.

The interaction energies are calculated using the

MP2 method, the spin-component-scaled MP2 (SCS-MP2)[14] method, and the double-hybrid DFT with the empirical

dispersion correction (B2PLYP-D3[15] ). The cc-pVTZ basis set and

corresponding auxiliary basis set are used. Basis set superposition

errors were corrected using the counterpoise method [16]. The

interaction energies per one carbon atom are summarized in

Table 2.

These results demonstrated that the π - π stacking

interaction energies converge to the value of bulk two-layer

graphite sheets with the size of model increases. The results

of SCS-MP2 (68.0 meV/atom) and B2PLYP-D3 (61.2 meV/atom)

reproduces the experimentally estimated dispersion interaction

energy of bulk graphite [17] ( 52 ± 5 meV/atom) well, although

that of MP2 (85.7 meV/atom) yielded a much overestimated

interaction energy than the other methods and the experimental

value.

In this study, we performed the RI-MP2 energy calculations of

nano carbon molecular assemblers using up to 1,349 nodes and

4,047 GPUs on TSUBAME 2.5. Speedups using GPUs and parallel

scalability were investigated on TSUBAME 2.5 using up to 512

nodes. The GPU computation speeds up considerably (4.1- 6.6

times) the RI-MP2 calculations. Parallel scalability of present GPU

implementation is good with the number of nodes. 514.7 TFLOPs

of the measured peak performance is attained for the GPU job

of (C96H24)2 using 1,349 nodes and 4,047 GPUs of TSUBAME 2.5,

which is much higher than that of CPU jobs (87.5 TFLOPs).

We also analyzed the π - π stacking interaction

energies of two-layer nanographene dimers. The π - π stacking

interaction energies converge to the value of bulk two-layer

graphite sheets with the size of model increases. The SCS-

MP2 and B2PLYP-D3 methods reproduced the experimentally

estimated dispersion interaction energy of bulk graphite well,

although the MP2 method yielded much larger dispersion

energies than the other methods.

To date, we have not yet performed the GPGPU

implementation of the three-center AO ERIs routine and

optimized our code for network communications to match the

network of TSUBAME 2.5. We are planning these developments

to match the TSUBAME 2.5 architecture.

and theoretical peak performances of the GPU implementation

(9.3% ) is lower than that of the CPU implementation (42% ). This

is because the most time consuming part of the GPU jobs is no

longer matrix-matrix multiplication performed on GPUs but other

operations such as the intra-node collective communication

of four-center MO ERIs and the inter-node point-to-point

communication of three-center ERIs.

Analysis of π - π stacking interactions of nanographene dimers 5

Concluding remarks 6

Fig.4 Model molecules of two-layer nanographene dimers.

Table 2. π - π stacking interaction energies of nanographene dimers per one carbon atom (meV/atom)

Acknowledgements

The authors would like to thank Mr. Yukihiko Hirano at NVIDIA

Japan for fruitful discussions and his warm encouragement.

This work was supported by by the TSUBAME grand challenge

program, category A and the MEXT, Japan, Next-Generation

Supercomputer project (the K computer project). Benchmark

calculations were performed on TSUBAME 2.5 system (Tokyo

Institute of Technology, Tokyo, Japan). Preliminary calculations

were performed on the K computer (RIKEN, Kobe, Japan), the

supercomputer system of the Research Center for Computational

Science (National Institute of Natural Sciences, Okazaki, Japan).

References

[1] C. Møller and M. Plesset: Note on an Approximation

Treatment for Many-Electron Systems, Phys. Rev., Vol. 46,

pp. 618-622 (1934)

[2] T. Nakajima, M. Katouda, M. Kamiya, and Y. Nakatsuka:

NTChem: A high-performance software package for

quantum molecular simulation, Int. J. Quant. Chem., Vol.

115, pp. 349-359 (2015)

[3] NTChem, http://labs.aics.riken.jp/nakajimat_top/ntchem_

e.html

[4] M. Feyereisen, G. Fitzgerald, and A. Komornicki: Use of

approximate integrals in ab initio theory. An application

in MP2 energy calculations, Chem. Phys. Lett., Vol. 208, pp.

359-363 (1993)

[5] D. E. Bernholdt and R. J. Harrison, Large-scale correlated

electronic structure calculations: the RI-MP2 method on

parallel computers, Chem. Phys. Lett., Vol. 250, pp. 477-484

(1996)

[6] M. Katouda and T. Nakajima: MPI/OpenMP hybrid parallel

algorithm of resolution of identity second-order Møller–

Plesset perturbation calculation for massively parallel

multicore supercomputers, J. Chem. Theory Comput., Vol. 9,

pp. 5373-5380 (2013)

[7] M. Katouda, A. Naruse, and T. Nakajima: Massively parallel

algorithm and implementation of RI-MP2 energy calculation

for peta-scale many-core supercomputers, manuscript in

preparation.

[8] BLAS, http://www.netlib.org/blas/

[9] M. Katouda and S. Nagase: Efficient parallel algorithm of

second-order Møller–Plesset perturbation theory with

resolution-of-identity approximation (RI-MP2), Int. J. Quant.

Chem., Vol. 109, pp. 2121-2130 (2009)

[10] NVIDIA CUDA ZONE, https://developer.nvidia.com/cuda-

[11] cuBLAS http://docs.nvidia.com/cuda/cublas/

[12] T. H. Dunning Jr.: Gaussian basis sets for use in correlated

molecular calculations. I. The atoms boron through neon

and hydrogen, J. Chem. Phys., Vol. 90, pp. 1007-1022 (1989)

[13] F. Weigend, A. Köhn, and C. Hättig: Efficient use of the

correlation consistent basis sets in resolution of the identity

MP2 calculations, J. Chem. Phys., Vol. 116, pp. 3175-3183

(2002)

[14] S . Gr imme: Improved second-order Møl ler–Plesset

perturbation theory by separate scaling of parallel-and

antiparallel-spin pair correlation energies, J. Chem. Phys.,

Vol. 118, pp. 9095-9102 (2003)

[15] L. Goerigk and S. Grimme: Efficient and Accurate Double-

Hybrid-Meta-GGA Density Functionals—Evaluation with

the Extended GMTKN30 Database for General Main Group

Thermochemistry, Kinetics, and Noncovalent Interactions,

J. Chem. Theory Comput., Vol. 7, pp. 291-309 (2011)

[16] S. F. Boys and F. Bernardi: The calculation of small molecular

interactions by the differences of separate total energies.

Some procedures with reduced errors, Mol. Phys., Vol. 19,

pp. 553-566 (1970)

[17] R. Zacharia, H. Ulbricht, and T. Hertel: Interlayer cohesive

energy of graphite from thermal desorption of polyaromatic

hydrocarbons, Phys. Rev. B, Vol. 69, pp. 155406 (2004)

● TSUBAME e-Science Journal vol.14Published 2/29/2016 by GSIC, Tokyo Institute of Technology ©ISSN 2185-6028Design & Layout: Kick and PunchEditor: TSUBAME e-Science Journal - Editorial room Takayuki AOKI, Toshio WATANABE, Atsushi SASAKI, Yuki ITAKURAAddress: 2-12-1-E2-6 O-okayama, Meguro-ku, Tokyo 152-8550Tel: +81-3-5734-2085　Fax: +81-3-5734-3198 E-mail: tsubame_ j@sim.gsic.titech.ac.jpURL: http://www.gsic.titech.ac.jp/

International Research Collaboration

Application Guidance

Inquiry

Please see the following website for more details.http://www.gsic.titech.ac.jp/en/InternationalCollaboration

The high performance of supercomputer TSUBAME has been extended to the international arena. We promote international research collaborations using TSUBAME between researchers of Tokyo Institute of Technology and overseas research institutions as well as research groups worldwide.

Recent research collaborations using TSUBAME

1. Simulation of Tsunamis Generated by Earthquakes using Parallel　Computing Technique

2. Numerical Simulation of Energy Conversion with MHD Plasma-fluid Flow

3. GPU computing for Computational Fluid Dynamics

Candidates to initiate research collaborations are expected to conclude MOU (Memorandum of Understanding) with the partner organizations/departments. Committee reviews the “Agreement for Collaboration” for joint research to ensure that the proposed research meet academic qualifications and contributions to international society. Overseas users must observe rules and regulations on using TSUBAME. User fees are paid by Tokyo Tech’s researcher as part of research collaboration. The results of joint research are expected to be released for academic publication.

vol. 14

for natural sandstone on gpu supercomputer · for natural sandstone on gpu supercomputer fig.2...

Documents

navajo sandstone

sandstone worktops

scalable sparse optimization in dense wireless cooperative...

cemented sandstone

sandstone bolting

appro supercomputer solutions - gpu technology...

for natural sandstone on gpu supercomputer two-phase porous...

gpu supercomputer simulation of complete h1n1 virus

gpu gpu and supercomputer - university of rochester€¦ ·...

the juropa supercomputer

building a supercomputer

stanford streaming supercomputer

baumberg calcareous sandstone and obernkirchener...

supercomputer architecture

supercomputer benchmarking

the gpu advantage...the gpu advantage tianhe-1a at nsc...

using gpus to generate reproducible workflows to ... ·...

a life-cycle inventory of sandstone quarrying and...

supercomputer performance

sandstone geochemistry