

Paper:

Seismic Hazard Visualization from Big Simulation Data:
Construction of a Parallel Distributed Processing System for Ground Motion Simulation Data

Takahiro Maeda and Hiroyuki Fujiwara

National Research Institute for Earth Science and Disaster Prevention (NIED)
3-1 Tennodai, Tsukuba, Ibaraki 305-0006, Japan
E-mail: [email protected]

[Received October 2, 2015; accepted January 3, 2016]

We have developed a parallel distributed processing system for data mining that is applicable to large-scale, high-resolution numerical simulations of ground motion: it transforms simulated waveforms into ground motion indices and their statistical values, and then visualizes those values as seismic hazard information. Seismic waveforms calculated at many locations for many possible earthquake scenarios can be used as input data. The system utilizes Hadoop to calculate ground motion indices, such as peak ground velocity (PGV), and statistical values, such as the maximum, minimum, average, and standard deviation of PGV, by parallel distributed processing with MapReduce. The computation results are output as GIS (Geographic Information System) data files for visualization, and these GIS data are made available via the Web Map Service (WMS). In this study, we perform two benchmark tests by applying three-component synthetic waveforms at about 80,000 locations for 10 possible scenarios of a great earthquake in the Nankai Trough to our system: one for PGV calculation processing and one for PGV data mining processing. A maximum of 10 parallel-running tasks is tested in both cases. We find that our system maintains its performance even when the total number of tasks is larger than 10. Because it combines data mining and visualization of hazard information from a large amount of large-scale numerical simulation data, this system can enable us to study such data effectively and distribute the results widely to communities for disaster mitigation.

Keywords: seismic hazard, visualization, simulation, parallel distributed processing

1. Introduction

In a seismic hazard assessment, which precedes a seismic risk assessment, it is important to appropriately evaluate the variation of ground motions over all possible earthquakes. In a ground motion assessment, the generation of seismic waves (source effect), their propagation (propagation path effect), and the site amplification effect are first modeled appropriately, and then the seismic wave propagation is calculated by numerical simulation from the model. Because the propagation path effect and the site amplification effect arise mainly from the velocity structure, the model uncertainty can be reduced by improving the model precision through surveys. It is also possible to improve the source effect by creating a high-precision source model based on observation records. However, developing a source model with high prediction precision in advance is difficult, as learned from the 2011 Tohoku earthquake, since observations of great earthquakes are limited. On the other hand, methods of constructing a source model based on theory or empirical laws have been proposed [1, 2] to improve the prediction of ground motion. It is important to analyze the variation of possible ground motions and the uncertainty of model parameters, as well as to evaluate an average ground motion based on an average source model in the conventional way. Therefore, many numerical simulations based on many possible source models are necessary to take this uncertainty into account in the ground motion evaluation.

Recent advances in computer performance have enabled large-scale, high-resolution numerical simulations, such as computations of great earthquakes for various earthquake scenarios. In fact, Maeda et al. [3] and Iwaki et al. [4] showed that simulated long-period ground motions can vary significantly depending on the source model for the Nankai Trough and the Sagami Trough, respectively. In order to widely and effectively utilize assessment results for reducing the disruptive impacts of natural disasters on communities, it is critical to develop a data mining system that quickly extracts useful hazard information from large data sets, such as simulation results, and transforms it into visual presentations.

This study aims to develop a system for compiling a large amount of data from large-scale ground motion simulations through parallel distributed processing and for visualizing useful seismic hazard information extracted from these big data. In particular, the system is applied to long-period ground motion simulations of the Nankai Trough earthquake to calculate the statistical values of PGV at each location and visualize the results. The parallel performance of the system is also evaluated.

Journal of Disaster Research Vol.11 No.2, 2016 265

https://doi.org/10.20965/jdr.2016.p0265

© Fuji Technology Press Ltd. Creative Commons CC BY-ND: This is an Open Access article distributed under the terms of the Creative Commons Attribution-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-nd/4.0/).


Fig. 1. Overview of the developed parallel distributed processing system.

2. Construction of Parallel Distributed Processing System

The core of this parallel distributed processing system is Hadoop with Hive [5]. Hadoop is a software framework that supports parallel distributed processing of very large data sets on computer clusters, using the MapReduce distributed algorithm [6]. Hadoop is composed of the following three components: (1) a distributed file system (HDFS: Hadoop Distributed File System) [7]; (2) a parallel distributed processing framework (the MapReduce framework); and (3) a resource management framework (YARN: Yet Another Resource Negotiator) [8]. Component (1) stores a large amount of data, on the terabyte or petabyte scale, by combining multiple physical disks into a single large virtual disk. Component (2) solves highly parallelizable problems with large data sets by parallel processing on a computer cluster or grid of many computers (nodes). Component (3) executes and controls the parallel distributed processing framework in an efficient and versatile manner, and manages the resources used by distributed applications built with its APIs (Application Programming Interfaces). Hive executes MapReduce jobs through a domain-specific language (DSL) called HiveQL over data stored on HDFS. While Hadoop has been applied to the analysis of observed seismic data [9], we apply it to simulation data in this study.

Fig. 1 shows a flowchart of the system developed in this study, and Tables 1 and 2 give a list of the system functions and an explanation of the data handling, respectively. The input data of this system consist of three-component (up-down, north-south, and east-west) synthetic velocity waveforms at many locations for possible earthquake scenarios.

As shown in Fig. 1, we first prepare the data set, in which each record is associated with the mesh code of its location, for distributed processing. Next, the PGV value at each location is calculated as the maximum of the vector sum of the three-component waveforms. This PGV calculation is executed as the Map processing in MapReduce. The calculation of the statistical values (maximum, minimum, average, and standard deviation) of PGV at each output location over the many scenarios is then executed by distributed processing with Hive. Finally, the statistical values are converted into raster-format GIS files and provided via WMS for visualization.
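The PGV definition used in the Map step can be sketched as follows. This is a minimal illustration, not the authors' implementation; the record layout and the function names `pgv` and `map_pgv` are assumptions made here for clarity.

```python
import math

def pgv(ud, ns, ew):
    """Peak ground velocity: the maximum amplitude of the
    three-component velocity vector over the time series."""
    return max(math.sqrt(u * u + n * n + e * e)
               for u, n, e in zip(ud, ns, ew))

def map_pgv(records):
    """Map step sketch. records: iterable of
    (mesh_code, scenario_code, (ud, ns, ew)).
    Emits (mesh_code, scenario_code, PGV) rows, mirroring the CSV
    output of function No. 2 in Table 1."""
    for mesh, scenario, (ud, ns, ew) in records:
        yield mesh, scenario, pgv(ud, ns, ew)
```

For example, a record whose north-south and up-down peaks occur at the same time step yields the amplitude of their vector sum at that step, not the larger single component.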

In this study, PGV is used as the ground motion index, and the maximum, minimum, average, and standard deviation are used as the statistical values. However, the ground motion indices and statistical values can be changed or extended by modifying the system functions (Table 1) in future studies. This system was developed on the open-type cloud system operated by the National Research Institute for Earth Science and Disaster Prevention (NIED). The system consists of one master server and two slave servers; each virtual server has 8 GB of memory and four CPU cores (2.70 GHz).
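The Hive aggregation step amounts to a group-by over the mesh code. A pure-Python sketch of the same statistics follows; it is an illustration of the aggregation logic, not the actual HiveQL, and the function and column names are assumed here. Note that it uses the population standard deviation; whether the system uses population or sample standard deviation is not stated in the text.

```python
import statistics
from collections import defaultdict

def mine_pgv(pgv_rows):
    """pgv_rows: iterable of (mesh_code, scenario_code, pgv), i.e. the
    PGV database rows of Table 2. Returns, per mesh code, the
    (min, max, average, standard deviation) over all scenarios,
    mirroring the statistic database."""
    by_mesh = defaultdict(list)
    for mesh, _scenario, value in pgv_rows:
        by_mesh[mesh].append(value)
    return {
        mesh: (min(v), max(v), statistics.mean(v), statistics.pstdev(v))
        for mesh, v in by_mesh.items()
    }
```

In the actual system this reduction runs distributed over HDFS via HiveQL rather than in a single process.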


Table 1. Overview of system functions.

No. 1 — Creation of input data for PGV calculation (Conversion): Many scenario waveform data files are loaded and converted into a format suitable as input for the Map processing of Hadoop. This is executed as preprocessing for the distributed processing.
[Input] Scenario waveform data / [Output] Full scenario waveform data

No. 2 — PGV calculation processing (Calculation): The full scenario waveform data produced in No. 1 are distributed according to the Hadoop resource setting, and the maximum of the vector sum of the three-component waveforms is calculated as the PGV value. The results are output in CSV format, associated with the mesh code and the scenario code.
[Input] Full scenario waveform data / [Output] PGV database

No. 3 — PGV data mining processing (Data mining): Statistical values (minimum, maximum, average, and standard deviation) are calculated from the PGV database by HiveQL using the scenario code, and associated with the mesh code.
[Input] PGV database / [Output] Statistic database

No. 4 — GIS conversion (Conversion): The statistic database is loaded, and a GIS file corresponding to the specified statistical value is created.
[Input] Statistic database / [Output] GIS-readable statistic data

No. 5 — WMS service (Service): A service that receives a WMS request and sends a WMS response by referring to the corresponding GIS-readable statistic data.
[Input] Statistical value code, WMS request parameters / [Output] WMS response

Table 2. Input/output data files in the system.

No. 1 — Scenario waveform data (K-NET ASCII): File containing the one-component time series for each earthquake scenario and each location obtained from the simulation.

No. 2 — Full scenario waveform data (Java serialized): File in which an object holding the three-component time-series data is serialized using <mesh code> as the key.

No. 3 — PGV database (CSV): File containing <mesh code>, <scenario code>, and PGV.

No. 4 — Statistic database (CSV): File containing <mesh code>, <scenario code>, minimum, maximum, average, and standard deviation.

No. 5 — GIS-readable statistic data (GeoTIFF): Minimum, maximum, average, and standard deviation value raster files.

3. Application to Ground Motion Simulation Data

In this study, we use the long-period ground motion waveforms calculated by seismic wave propagation simulations of a hypothetical megathrust earthquake expected to occur in the Nankai Trough. The simulations were conducted in a previous study [10] by the authors. The seismic source area of the Nankai Trough extends over a wide area from Suruga Bay to Hyuga-nada [11], so the target area covers a large region from Kyushu to Tohoku. To simulate the seismic wave propagation, we adopted GMS (Ground Motion Simulator) [12], a practical tool for wave propagation simulation using a discontinuous grid.

The velocity structure used is a three-dimensional model covering over 950 km north-south, 1,150 km east-west, and down to 100 km in depth, taken from the Japan integrated velocity structure model [13]. The three-dimensional mesh is discretized with a grid interval of 200 m in the horizontal direction and 100 m in the depth direction (a three times larger grid interval is used for depths greater than 8 km). The total number of grid points is about 3.2 billion. Further, a characterized source model based on the "recipe" [2] for predicting strong ground motion is used. The rupture area given by this source model is represented by about 290,000 point sources, which are placed on the finite-difference grid. Across the simulations, different rupture starting points were used, while the fault plane and the asperities, where seismic waves are strongly


Fig. 2. Example of the visualization interface: (a) average and (b) standard deviation of PGV. Colored areas denote the target area of the simulation. The locations and names of the major plains are given in (a). Small values at the edges of the areas are artifacts of the boundary conditions of the simulation.

excited, were kept consistent among the 10 characterized source models. The simulation outputs were the ground motion velocities at about 80,000 observation points on land, with an interval of about 2 km. The three components at each point are output in separate files (each file is about 60 KB).

These simulated long-period ground motion data were used as input to our system, and the statistical values at each point were obtained and converted into GIS files. The average and the standard deviation of PGV are overlaid on the GSI Maps released by the Geospatial Information Authority of Japan (GSI) (Fig. 2) using QGIS (2.8.1 Wien). The figure illustrates the uncertainty of the possible ground motions arising from the diversity of earthquake occurrence. It is worth mentioning that the average and the standard deviation tend to be large not only along the Pacific coast facing the Nankai Trough but also in the Oita Plain, the Osaka Plain, the Nobi Plain, and the Kanto Plain, as well as the Toyama Plain and the Niigata Plain on the Sea of Japan side. This result implies that the amplitude can be extremely large depending on the location of the rupture starting point. A caveat of this map is that small amplitudes appear at the edges of the calculation area because of the boundary conditions used in the simulations. From many earthquake scenarios, we can thus generate seismic hazard information that reflects the uncertainty of the possible ground motions. Applying simulated data based on many possible earthquake scenarios is necessary for this purpose, and our system can execute this task efficiently.
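Once the statistic rasters are published over WMS, any standards-compliant client (such as QGIS above) can retrieve a rendered map with a GetMap request. A sketch of building such a request follows; it assumes a generic WMS 1.1.1 server, and both the base URL `example.org` and the layer name `pgv_average` are placeholders, not the actual NIED endpoint.

```python
from urllib.parse import urlencode

def wms_getmap_url(base_url, layer, bbox, width=800, height=600):
    """Build a WMS 1.1.1 GetMap request URL for one statistic layer.
    bbox is (min_lon, min_lat, max_lon, max_lat) in EPSG:4326."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",
        "SRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return base_url + "?" + urlencode(params)

# Hypothetical request for a PGV-average raster over western Japan.
url = wms_getmap_url("http://example.org/wms", "pgv_average",
                     (130.0, 30.0, 140.0, 37.0))
```

The same parameter set, with different LAYERS values, would select the minimum, maximum, or standard deviation rasters listed in Table 2.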

4. Scalability of Parallel Distributed Processing

4.1. PGV Calculation Processing

We performed a benchmark test of parallel computing for the PGV calculation processing. Ten files (about 9 GB per file) of the full scenario waveform data (Table 2), one for each earthquake scenario, were used as input data, and the number of parallel-running tasks was set to range from 1 to 10 by adjusting the YARN settings. The processing time was measured twice for each number of parallel-running tasks, and the speed-up rate and parallelization efficiency were evaluated using the average of the measured times. The processing time was read from the Web interface of Hadoop.

Fig. 3 shows the relation between the processing time and the number of parallel-running tasks, and Fig. 4 shows the relation between the parallelization efficiency and the number of parallel-running tasks. Fig. 4 indicates that the parallelization efficiency drops significantly, from 100% to 65%, when going from one to two running tasks, while it remains around 70% when the number of parallel-running tasks is ≥2. This may be because a single task incurs only a small overhead from inter-node data communication, disk I/O, and job execution management, whereas these overheads become significant with two or more tasks. If this overhead is assumed to be about 3,000 s and is added to the single-task processing time, the parallelization efficiency for two or more parallel-running tasks falls within the range of 90%–110%. Therefore, efficient computation can be achieved by increasing the number of parallel-running tasks even when larger data sets need to be processed in the future.
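The overhead argument above can be made concrete with a small calculation. The times below are hypothetical, chosen only to reproduce the reported ~65% raw efficiency; the actual benchmark values are those plotted in Fig. 3.

```python
def efficiency(t1, tn, n):
    """Parallelization efficiency in %: the speed-up (t1 / tn)
    divided by the number of tasks n."""
    return 100.0 * t1 / (tn * n)

# Hypothetical processing times in seconds (not the measured values).
t1, t2, overhead = 7000.0, 5385.0, 3000.0

# Raw efficiency at two tasks is ~65%, as in Fig. 4...
raw = efficiency(t1, t2, 2)
# ...but adding the assumed 3,000 s overhead to the single-task time
# brings it into the 90%-110% range described in the text.
adjusted = efficiency(t1 + overhead, t2, 2)
```

This is why the efficiency plateaus near 70% rather than degrading further: the fixed overhead is paid once the job becomes parallel, while the remaining work continues to divide across tasks.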

Fig. 3. Relation between PGV calculation processing time and the number of parallel-running tasks. (Plot: 1st and 2nd runs and their average, 0–8,000 s, for 1–10 tasks.)

Fig. 4. Parallelization efficiency of the PGV calculation processing time. (Plot: efficiency in %, 0–100, for 1–10 tasks.)

4.2. PGV Data Mining Processing

We performed a benchmark test of the calculation of the statistical values with Hive. The number of parallel-running tasks was set to range from 1 to 10 by adjusting the YARN settings. The processing time was measured three times for each number of parallel-running tasks, and the speed-up rate and parallelization efficiency were evaluated using the average of the measured times. The processing time was read from the Web interface of Hadoop. The relation between the processing time and the number of parallel-running tasks is shown in Fig. 5, and the parallelization efficiency is shown in Fig. 6. The difference from the previous benchmark test is that PGV data for 1,270 scenarios were added to the original 10 scenarios. These 1,270 scenarios are duplicates of the original 10; adding these dummy scenarios was necessary because the input data size would otherwise have been too small for the benchmark test.

In Fig. 6, the parallelization efficiency falls below 60% for seven parallel-running tasks and to around 50% for eight. The speed-up rate was compared with the theoretical one given by Amdahl's law:

S(N) = 1 / ((1 − P) + P/N) . . . (1)

Here, P denotes the fraction of the program execution time that can be parallelized, N represents the number of processors (the number of parallel tasks), and (1 − P) is the fraction of the execution time that cannot be parallelized. Fig. 7 shows the comparison of our benchmark

Fig. 5. Relation between PGV data mining processing time and the number of parallel-running tasks. (Plot: 1st, 2nd, and 3rd runs and their average, 0–350 s, for 1–10 tasks.)

Fig. 6. Parallelization efficiency of the PGV data mining processing time. (Plot: efficiency in %, 0–100, for 1–10 tasks.)

Fig. 7. Relation between the parallelization efficiency of the PGV data mining processing and the speed-up rate. The curves indicate the theoretical speed-up rates obtained using Amdahl's law for P = 0.85, 0.90, 0.95, and 0.99, with P being the fraction of the execution time of the parallelizable part of the program.

result with the theoretical values of Eq. (1). The figure indicates that the effective P value of our system lies between 0.85 and 0.90. Therefore, we can


expect the speed-up rate of this system to remain within the P range of 0.85–0.90 shown in Fig. 7 even when the number of parallel-running tasks is increased. Moreover, the benchmark results differ significantly between the PGV calculation processing and the PGV data mining processing, particularly in the overhead between one and two tasks. Although these differences have not been studied in detail, they could be attributed to Hive, which is used in the PGV data mining processing but not in the PGV calculation processing.
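Eq. (1) can be evaluated directly; the following short sketch reproduces the theoretical curves plotted in Fig. 7.

```python
def amdahl_speedup(p, n):
    """Theoretical speed-up S(N) = 1 / ((1 - P) + P/N) for a program
    whose fraction p of execution time is parallelizable,
    run on n parallel tasks."""
    return 1.0 / ((1.0 - p) + p / n)

# Speed-up at N = 10 for the P values of the curves in Fig. 7.
s10 = {p: amdahl_speedup(p, 10) for p in (0.85, 0.90, 0.95, 0.99)}
```

At N = 10, P = 0.90 gives a speed-up of only about 5.3, which illustrates why the serial fraction (here, presumably the non-parallelizable parts of the Hive job) dominates the attainable performance as the task count grows.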

5. Summary and Discussion

We have developed a data mining system for seismic hazard information. Our system calculates information such as ground motion indices from large-scale, high-resolution simulation data for various earthquake scenarios. We applied simulation data from 10 earthquake scenarios to test the system and will include more earthquake scenarios in the near future. The parallel distributed processing used in our system is effective in handling such big data. In particular, our system is well suited to tackling the following issues: (1) the hazard information extracted by this system depends on the numerical simulations; (2) PGV is used as the index, while other indices may be as useful as PGV; (3) information-sharing and information-utilization technologies need to be improved for even larger-scale, higher-resolution information.

For issue (1), because simulation results for many possible scenarios can easily be processed by our system, investigating whether a given set of scenarios is suitable for seismic hazard assessment becomes more manageable. It is important to note that, beyond investigating a suitable source model, we also need to examine many more parameters, such as the occurrence probability and an accurate velocity structure, and to improve the simulation method itself.

For issue (2), data mining for extracting useful information is necessary not only to study PGV but also to find other appropriate indices. In particular, it is important for future work to focus on developing effective visualization methods that adopt not only PGV but also various other pieces of information.

For issue (3), the large amount of highly advanced information produced by large-scale, high-resolution simulations must be distributed effectively and widely for disaster mitigation. Our system can contribute to enhancing the functions of existing systems such as the seismic hazard information system "Japan Seismic Hazard Information Station (J-SHIS)"*1, and the systems developed as mutual operation-type information-sharing platforms for disaster information, the "E-community platform"*2 and the "Cloud system for joint public-private crisis management"*3, all developed by NIED.

*1. http://www.j-shis.bosai.go.jp/en/
*2. http://ecom-plat.jp/
*3. http://ecom-plat.jp/k-cloud/index.php

Acknowledgements
We thank the anonymous reviewers for providing comments. This research was supported by CREST, JST. Many of the figures in this paper were made using GMT [14].

References:
[1] K. Irikura and H. Miyake, "Prediction of strong ground motions for scenario earthquakes," Journal of Geography, Vol.110, pp. 849-875, 2001 (in Japanese with English abstract).
[2] Earthquake Research Committee, "Strong Ground Motion Prediction Method ("Recipe") for Earthquakes with Specified Source Faults," 2009 (in Japanese), http://www.jishin.go.jp/main/chousa/09_yosokuchizu/g_furoku3.pdf [accessed March 4, 2016]
[3] T. Maeda, N. Morikawa, A. Iwaki, S. Aoi, and H. Fujiwara, "Finite-Difference Simulation of Long-Period Ground Motion for the Nankai Trough Megathrust Earthquakes," Journal of Disaster Research, Vol.8, No.5, pp. 912-925, 2013.
[4] A. Iwaki, N. Morikawa, T. Maeda, S. Aoi, and H. Fujiwara, "Finite-Difference Simulation of Long-Period Ground Motion for the Sagami Trough Megathrust Earthquakes," Journal of Disaster Research, Vol.8, No.5, pp. 926-940, 2013.
[5] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy, "Hive – A Warehousing Solution Over a Map-Reduce Framework," Proc. of Very Large Data Bases, Vol.2, No.2, pp. 1626-1629, August 2009.
[6] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Proc. of OSDI'04, 6th Symposium on Operating Systems Design and Implementation, pp. 137-150, 2004.
[7] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," Proc. of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pp. 1-10, May 2010.
[8] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, and E. Baldeschwieler, "Apache Hadoop YARN: Yet Another Resource Negotiator," Proc. of the 4th ACM Symposium on Cloud Computing (SoCC), 2013.
[9] T. G. Addair, D. A. Dodge, W. R. Walter, and S. D. Ruppert, "Large-scale seismic signal analysis with Hadoop," Computers & Geosciences, Vol.66, pp. 145-154, doi: 10.1016/j.cageo.2014.01.014, 2014.
[10] T. Maeda, N. Morikawa, S. Aoi, and H. Fujiwara, "Long-period ground motion evaluation for the Nankai Trough megathrust earthquakes," Proc. of JpGU, SSS23-13, 2014.
[11] Earthquake Research Committee, "On the long-term evaluation of earthquakes in the Nankai Trough (2nd edition)," 2013 (in Japanese), http://www.jishin.go.jp/main/chousa/13may_nankai/index.htm [accessed March 4, 2016]
[12] S. Aoi, T. Hayakawa, and H. Fujiwara, "Ground motion simulator: GMS," Butsuri Tansa, Vol.57, pp. 651-666, 2004 (in Japanese with English abstract).
[13] Earthquake Research Committee, "Long-period ground motion hazard maps for Japan," 2012 (in Japanese), http://www.jishin.go.jp/main/chousa/12_choshuki/index.htm [accessed March 4, 2016]
[14] P. Wessel and W. H. F. Smith, "New version of the Generic Mapping Tools released," Eos Transactions, American Geophysical Union, Vol.76, p. 329, 1995.


Name: Takahiro Maeda

Affiliation: Senior Researcher, National Research Institute for Earth Science and Disaster Prevention (NIED)

Address: 3-1 Tennodai, Tsukuba, Ibaraki 305-0006, Japan

Brief Career:
2004- Postdoctoral Fellow, Hokkaido University
2009- Postdoctoral Fellow, University of California, Santa Barbara
2010- Research Fellow, National Research Institute for Earth Science and Disaster Prevention
2014- Senior Researcher, National Research Institute for Earth Science and Disaster Prevention

Selected Publications:
• T. Maeda, N. Morikawa, A. Iwaki, S. Aoi, and H. Fujiwara, "Finite-Difference Simulation of Long-Period Ground Motion for the Nankai Trough Megathrust Earthquakes," JDR, Vol.8, No.5, pp. 912-925, 2013.

Academic Societies & Scientific Organizations:
• Seismological Society of Japan (SSJ)
• Japan Association of Earthquake Engineering (JAEE)
• Architectural Institute of Japan (AIJ)
• American Geophysical Union (AGU)

Name: Hiroyuki Fujiwara

Affiliation: Director, Department of Integrated Research on Disaster Prevention, National Research Institute for Earth Science and Disaster Prevention (NIED)

Address: 3-1 Tennodai, Tsukuba, Ibaraki 305-0006, Japan

Brief Career:
1989- Researcher, NIED
2001- Head of Strong Motion Observation Network Laboratory, NIED
2006- Project Director, Disaster Prevention System Research Center, NIED
2011- Director, Department of Integrated Research on Disaster Prevention, NIED

Selected Publications:
• "Seismic Hazard Assessment for Japan: Reconsideration After the 2011 Tohoku Earthquake," JDR, Vol.8, No.5, pp. 848-860, 2013.

Academic Societies & Scientific Organizations:
• Seismological Society of Japan (SSJ)
• Japan Association for Earthquake Engineering (JAEE)
