Post on 05-Jun-2020
Investigating the Effectiveness of Cloud Services for Geodetic and Seismic Data Management
2018 Fall AGU Meeting
C. Meertens1, C. Trabant2, D. Phillips1, T. Ahern2, D. Mencin1, M. Stults2, D. Ertz1, and S. Baker1
NSF Award Numbers: UNAVCO ICER-1639709, IRIS ICER 1639719 EarthCube Program, IRIS/SAGE EAR 1724509, UNAVCO/GAGE EAR 1724794
Joint Project: UNAVCO1 and IRIS Data Management Center2
The IRIS Data Management Center
● NSF funded (SAGE)
● Global, multi-decade data coverage
● 80,214 sites
● 477 terabytes, selectable at the sample level
● Seismological focus
UNAVCO Geodetic Data Services
● NSF funded (GAGE)
● Global, multi-decade data coverage
● 13,700 GNSS stations; 87 borehole (seismic, strain, pore pressure, tilt); 350 terrestrial laser scanning campaigns; 123 TB of Synthetic Aperture Radar scenes
● 323 terabytes
● Geodetic focus
Data streams (red arrows) and file downloads (blue arrows) from remote instruments to data centers for distribution, processing and archiving
The EarthCube GeoSciCloud Project
Exploring the advantages and disadvantages of operating a data center in cloud environments
The IRIS DMC and UNAVCO GDS as a proxy for general geoscience data centers and repositories
● Deploy a subset of data resources in two cloud environments
● Compare performance
● Compare costs
● Compare operational details
GeoSciCloud high level plan
● Chosen cloud providers:
○ XSEDE Jetstream + Wrangler, NSF funded
○ Amazon AWS
● Test data set: IRIS ~41 terabytes (full archive is 447 TB and growing); UNAVCO ~130 terabytes (full archive 323 TB and growing)
● Deploy core data access and processing services
○ Plus other key systems (real time collection, GeoWS services)
○ Implement new technologies for elasticity (Docker + Kubernetes)
● Benchmark testing, data center comparison
● Evaluation by scientist users and facility domain experts
Example 1: IRIS Web service performance comparison, high concurrency (XSEDE/Jetstream & AWS)
Figure panels: I/O-focused test; CPU-focused test
UNAVCO Synthetic Aperture Radar (SAR) Archive - Fully operational on XSEDE!
● SAR Data Archive
○ Data from foreign space agencies (JAXA, DLR, EU, CA)
○ 100,000 individual scenes
○ 120 TB
○ Production archive and services have been running on XSEDE Jetstream for about two years
○ Growing need for big data resources, e.g. the openly available EU Sentinel SAR data already total 2.5 petabytes!
Synthetic Aperture Radar (SAR) Archive
● SAR Archive XSEDE Benefits:
○ Access to the OpenStack API allows easy VM and network provisioning
○ SAR bandwidth and storage costs for multi-GB files would be prohibitive on AWS but are available at no cost from XSEDE
○ Help desk support is available and was budgeted for
○ The problem of insufficient fast storage access on Jetstream was solved by IU, which made an NFS mount to Wrangler
● Challenges:
○ Not yet ideal for full production 24/7 data center operations
○ Currently XSEDE is more intended for run and run (away), but XSEDE and UNAVCO are learning how to optimize for continuous operations
○ Maintenance on Wrangler storage causes outages approximately once a month
○ Need for annual reapplication for allocations
Cost comparison:
On Premises
● Capitalized
● Rent, maintenance
XSEDE/Jetstream
● No direct costs
● Time for allocation proposal
Cost comparison cont.:
UNAVCO AWS
● Full annual cost estimate not possible (we are only testing a subset of tasks)
● See next slides
IRIS AWS, ~1 year GeoSciCloud
● $125,000 total
○ $94K storage*
○ $31K compute and other*
● Ingress/egress costs
○ Usage for tests within free tier
○ Estimate for actual shipments would be $7,000+/month
IRIS data shipments: 635 TB in 2018
Opportunities, go cloud native:
● S3 storage, much cheaper (> 13X)
● Lambda compute, likely cheaper
Changes require significant adaptation
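A minimal sketch of the arithmetic behind the IRIS AWS figures above. It only combines numbers reported on the slide; the variable names are illustrative, and treating the "$7,000+/month" shipment estimate as a flat lower bound over 12 months is an assumption.

```python
# Figures reported on the slide (US dollars, ~1 year of GeoSciCloud on AWS).
STORAGE_COST = 94_000
COMPUTE_AND_OTHER_COST = 31_000
EGRESS_ESTIMATE_PER_MONTH = 7_000  # slide says "$7,000+", so this is a lower bound

# Total reported for the test deployment.
total = STORAGE_COST + COMPUTE_AND_OTHER_COST

# Fraction of the test-deployment cost that went to storage.
storage_share = STORAGE_COST / total

# Assumption: the monthly egress estimate holds for all 12 months.
annual_egress = EGRESS_ESTIMATE_PER_MONTH * 12

# Test-deployment cost plus estimated production-scale egress.
total_with_egress = total + annual_egress

print(f"Total (storage + compute/other): ${total:,}")
print(f"Storage share of total: {storage_share:.1%}")
print(f"Annualized egress estimate (lower bound): ${annual_egress:,}")
print(f"Total with estimated egress (lower bound): ${total_with_egress:,}")
```

Under these assumptions storage accounts for about three quarters of the reported total, which is why the cheaper S3 storage option is listed first among the cloud-native opportunities.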
Cost UNAVCO GPS Data FTP Service
FTP Service Performance (AWS)
● Can be faster for downloads than current UNAVCO systems.
FTP Cost Comparison
● At full capability, an AWS FTP service (using the full required 40 TB of less expensive EBS magnetic volume storage and 3.5 TB/month of download bandwidth) has an estimated annual cost of $29,100.
● UNAVCO's comparable current offsite redundant server costs are an initial $25,000 for 60 TB of SAN storage plus $18,000 per year for colocation services. Amortizing the UNAVCO-owned SAN over 5 years, the current annual cost is $23,000.
XSEDE: A major setback is that standard ftp is NOT supported on XSEDE. We are testing sftp (anonymous) and https, but this has taken some time to implement.
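The on-premises figure in the FTP cost comparison can be reproduced with a short sketch. It assumes straight-line amortization of the SAN purchase, which is consistent with the $23,000 quoted on the slide; the function name is hypothetical.

```python
def amortized_annual_cost(capital_cost, amortization_years, annual_operating_cost):
    """Straight-line amortization of a one-time purchase plus recurring costs."""
    return capital_cost / amortization_years + annual_operating_cost

# UNAVCO offsite redundant server: $25,000 SAN over 5 years + $18,000/yr colocation.
on_prem_annual = amortized_annual_cost(25_000, 5, 18_000)

# AWS FTP service estimate from the slide (40 TB EBS magnetic, 3.5 TB/month downloads).
aws_annual = 29_100

difference = aws_annual - on_prem_annual
ratio = aws_annual / on_prem_annual

print(f"On premises: ${on_prem_annual:,.0f}/yr")
print(f"AWS:         ${aws_annual:,.0f}/yr")
print(f"AWS costs ${difference:,.0f}/yr more ({ratio:.2f}x)")
```

Note that the two options are not sized identically: the on-premises SAN provides 60 TB, while the AWS estimate covers the 40 TB actually required.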
GPS Data Processing: Cost and performance for processing of daily data pulled from AWS ftp show reasonable annual costs on AWS, but reprocessing would be hugely expensive. GIPSY processing performance is poor and could not be optimized on AWS.
Reprocessing of 20 years of daily data from ~1,500 stations is performed every few years
Central Washington University (GIPSY software)
● Processing: $3.86/day
● Storage: $1.95/day
● Total: $5.81/day = $2,120/yr of data*
● Performance: Repro time < 1 month (local cluster)
● Performance: Repro time ~19 months (AWS)
● Repro cost approximately $40,000
New Mexico Tech (GAMIT software)
● Processing: $9.69/day
● Storage: $3.50/day
● $35,388 $12,782
● Total: $13.19/day = $4,800/yr of data*
● Repro time ~4 months (local cluster)
● Repro time << 1 month (AWS)
● Repro cost approximately $100,000
*extrapolated
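The per-day, per-year and reprocessing figures above hang together arithmetically. The sketch below shows one way to reproduce them; it assumes a 365-day year and assumes that the reprocessing cost is simply the annual cost multiplied by the 20 years of data, a formula the slides do not state explicitly. All names are illustrative.

```python
DAYS_PER_YEAR = 365
YEARS_OF_DATA = 20  # "Reprocessing of 20 years of daily data"

# Daily AWS costs in US dollars, as reported on the slide.
centers = {
    "Central Washington University (GIPSY)": {"processing": 3.86, "storage": 1.95},
    "New Mexico Tech (GAMIT)": {"processing": 9.69, "storage": 3.50},
}

def daily_total(costs):
    """Processing plus storage cost per day of data."""
    return round(costs["processing"] + costs["storage"], 2)

def annual_cost(daily):
    """Cost per year of daily data."""
    return daily * DAYS_PER_YEAR

def reprocessing_cost(daily, years=YEARS_OF_DATA):
    """Assumed: annual cost times the number of years being reprocessed."""
    return annual_cost(daily) * years

results = {}
for name, costs in centers.items():
    daily = daily_total(costs)
    results[name] = (daily, annual_cost(daily), reprocessing_cost(daily))
    print(f"{name}: ${daily:.2f}/day, "
          f"${results[name][1]:,.0f}/yr, "
          f"repro ~${results[name][2]:,.0f}")
```

Under these assumptions the results land close to the rounded slide values: roughly $2,120/yr and about $42,000 for reprocessing with GIPSY, and roughly $4,800/yr and about $96,000 with GAMIT.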
Real-time GNSS Processing for Earthquake Early Warning
Nov 30, 2018 M7.0 Earthquake, Anchorage, AK, showing rapid magnitude estimate using 1 Hz sampled continuous real-time GNSS data streams from 800 stations
Cost for Real-time GNSS Processing for Earthquake Early Warning testing on AWS
● Cost for PPP processing for one core/station is ~$1.05/day
● This scales to 20 stations/core per instance which is similar to on premises.
● For 800 stations this scales to about $15k/year.
● Ingress data cost is negligible.
● Egress data cost may be substantial as ~10,000 connections are supported each day to real time streams.
● VPN costs may also be impactful in a true failover system. Estimated AWS costs are $0.05/hour/station ~ $350k/yr. We still need to test this VPN.
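The scaling in the bullets above can be checked with a short sketch. It reads the ~$1.05/day figure as the cost of one core, with 20 stations handled per core, and it assumes round-the-clock operation for the VPN estimate; both readings are interpretations of the slide rather than stated facts, and the function names are hypothetical.

```python
import math

def ppp_annual_cost(n_stations, stations_per_core, cost_per_core_day, days=365):
    """Annual PPP processing cost, assuming whole cores are provisioned."""
    cores = math.ceil(n_stations / stations_per_core)
    return cores * cost_per_core_day * days

def vpn_annual_cost(n_stations, cost_per_station_hour, hours=24 * 365):
    """Annual VPN cost, assuming every station is connected all year."""
    return n_stations * cost_per_station_hour * hours

ppp = ppp_annual_cost(n_stations=800, stations_per_core=20, cost_per_core_day=1.05)
vpn = vpn_annual_cost(n_stations=800, cost_per_station_hour=0.05)

print(f"PPP processing: ${ppp:,.0f}/yr")  # slide: about $15k/year
print(f"VPN:            ${vpn:,.0f}/yr")  # slide: ~ $350k/yr
```

With these inputs the VPN estimate is more than twenty times the processing cost, so the networking design rather than the compute would dominate a true failover deployment.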
Operational comparison, IRIS and UNAVCO on premises versus cloud providers
On Premises Facility
● Own systems, total control
● Fully autonomous
○ With UW and LLNL support (IRIS)
Operating in the cloud
● No hardware control
● No maintenance schedule control
● Limited support
Level of system administration is similar
Operational comparison, key points
XSEDE/Jetstream
● Somewhat known OpenStack
● No-cost, high performance connectivity
● Possible to get highly valuable direct support
● No VM host with large storage
● No guaranteed uptime, 95% w/NSF
● Saturated storage system path
● Short-term allocation model
● No support for ftp
AWS
● The Amazon way
● Free consulting for startup, EDU?
● Base level of support $50/day, next-day
● Lots of options including global replication of data, cost sharing, etc.
● Anything significant costs
● No insight into internals
● Limits (e.g. CPU) and complexity
● Potential for vendor lock in
Preliminary conclusions
Pros
● Both XSEDE/Jetstream and AWS can host a facility-like repository
○ UNAVCO SAR primary archive fully operational on XSEDE
● Impacts on research could be significant
○ Increased data shipment capabilities through scale-out
○ Increased capability to process data on researcher’s behalf
○ Data directly in a high performance compute environment
Cons
● Operating in the cloud requires adaptation
● FTP, the primary access to UNAVCO data, not available at XSEDE
● XSEDE’s current allocation model is not an ideal fit for long-term facilities
● AWS, as used, is quite expensive, in particular with large data
○ Significant cost savings possible, with significant adaptation*
Collaborative Research: NSCI: HDR: Framework: Data: GeoSCIFramework: Scalable Real-time Streaming Analytics and Machine Learning for Geoscience and Hazards Research
An experimental computational framework enables natural hazards research and enhanced earthquake, tsunami and volcano early warning systems. Real-time streaming analytics on continuous integrated data streams from thousands of continental and oceanic high-rate sensors, when combined with satellite radar time series, give a coherent high-resolution global-scale view of the motions of the earth.
● Participating Institutions: UNAVCO/GAGE (Meertens, Lead), Rutgers University (Ocean Observatories Initiative - OOI), University of Colorado, University of Oregon
● Collaborating Institutions: IRIS/SAGE, University of Texas Arlington (TACC/XSEDE)
● Integrated data access: The framework leverages and provides seamless access to considerable NSF investments in EarthScope (GAGE and SAGE) and OOI in situ sensor networks, internationally-operated space radar systems, and NSF XSEDE computational and data storage resources.
● Algorithm development: An interactive environment allows users to test, modify, and implement their ideas as they integrate the large variety and volume of data into new algorithms and products.
● Broader Impacts Activities: Resources for internal and external capacity building are integral to the project, including support for students and technical workshops, development of supportive materials such as online notebooks, and access to open software development platforms and computational resources.
4-yr project starts 1 Jan 2019
Next steps - finish XSEDE tests and ->
Talk, Session, Meeting...thanks for staying to the end!