Accessing Data in the Cloud
Using SAS to read data from Amazon Simple Storage Service (S3)
seleritysas.com
What is Amazon Simple Storage Service (S3)?
• An object store, not a file system
• Write once, read many (WORM)
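• Eventually consistent (at the time of this presentation; S3 has provided strong read-after-write consistency since December 2020)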
• 99.999999999% durability
• Unlimited storage capacity
• Highly scalable and available data storage
• Low latency and high throughput performance
What Public Data is Available in S3?
• AWS Public Datasets
  • https://aws.amazon.com/public-datasets/
  • Geospatial and Environmental Datasets
  • Genomics and Life Science Datasets
  • Datasets for Machine Learning
  • Regulatory and Statistical Data
• awesome-public-datasets
  • https://github.com/caesar0301/awesome-public-datasets
• NYC Taxi and Limousine Commission
  • http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
What is the typical workflow to use raw data from S3?
• Download the data file from S3 to your PC using HTTP/HTTPS
• Upload/Import the data to SAS
What would make this more efficient?
• Cutting out the middle-man (your local PC)
How can we have S3 communicate directly with the SAS Server?
• Use the FILENAME URL access method
  ✓ Easy to implement
  ✗ File is retrieved serially over HTTP
  ✗ The slowest of all options, and subject to timeouts for very large files
• Use PROC S3 to download files to the SAS Server’s filesystem
  ✓ Very fast, as it uses parallel downloads
  ✗ Only available from SAS 9.4M4 onward
  ✗ Only works with secure S3 files, not public S3 files
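The serial versus parallel distinction above can be illustrated with a small sketch. The deck does not say how PROC S3 parallelises its transfers internally, so this is only a generic illustration of one common approach: splitting an object into byte ranges that separate workers could each request with an HTTP `Range` header. The function names `split_byte_ranges` and `range_header` are hypothetical, and no network call is made.

```python
from typing import List, Tuple


def split_byte_ranges(total_size: int, parts: int) -> List[Tuple[int, int]]:
    """Split bytes 0..total_size-1 into contiguous inclusive (start, end) ranges."""
    if total_size <= 0 or parts <= 0:
        raise ValueError("total_size and parts must be positive")
    parts = min(parts, total_size)  # never create empty ranges
    base, extra = divmod(total_size, parts)
    ranges = []
    start = 0
    for i in range(parts):
        # The first `extra` ranges carry one additional byte each.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length - 1))
        start += length
    return ranges


def range_header(start: int, end: int) -> str:
    """Format an inclusive byte range as an HTTP Range header value."""
    return f"bytes={start}-{end}"


if __name__ == "__main__":
    # Illustrative only: a 10-byte object split across 3 workers.
    for start, end in split_byte_ranges(10, 3):
        print(range_header(start, end))
```

A serial download is the special case of a single range covering the whole object, which is why one slow or interrupted connection affects the entire transfer.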
How can we have S3 communicate directly with the SAS Server? (continued)
• Use the AWS CLI to download files to the SAS Server’s filesystem
  ✓ Very fast, as it uses parallel downloads
  ✗ Need to install the AWS CLI on the SAS Server
  ✗ Need the ability to run X commands on the SAS Server
• “Mount” the S3 storage on the SAS Server
  ✓ Treat it like a local disk
  ✗ S3 is not designed for block storage/access
  ✗ Potential issues with current storage driver implementations
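For the AWS CLI option, the command that would be passed to the operating system might look like the one built below. This is a sketch under assumptions: the bucket and key are the ones from the NYC trip data example in this deck, the local path `/tmp/yellow_tripdata_2017-01.csv` is hypothetical, and `--no-sign-request` is included on the assumption that the object is public and no credentials are configured. The code only assembles and prints the command; it does not run it.

```python
import shlex
from typing import List

BUCKET = "nyc-tlc"
KEY = "trip data/yellow_tripdata_2017-01.csv"
LOCAL_PATH = "/tmp/yellow_tripdata_2017-01.csv"  # hypothetical destination


def build_cp_command(bucket: str, key: str, local_path: str,
                     public: bool = True) -> List[str]:
    """Build the argument list for `aws s3 cp` from S3 to a local file."""
    argv = ["aws", "s3", "cp", f"s3://{bucket}/{key}", local_path]
    if public:
        # Skip request signing; assumed appropriate for a public object.
        argv.append("--no-sign-request")
    return argv


if __name__ == "__main__":
    # shlex.join quotes the S3 URI because the key contains a space.
    print(shlex.join(build_cp_command(BUCKET, KEY, LOCAL_PATH)))
```

The quoting matters because the object key contains a space; an unquoted command would be split into the wrong arguments by the shell.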
Example: NYC Trip Data in S3
• NYC Yellow Cab trip data for January 2017
  • 9,710,124 records
  • CSV format
  • 815 MB
• Location
  • Bucket: nyc-tlc
  • Object Key: trip data/yellow_tripdata_2017-01.csv
• HTTP Protocol: https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-01.csv
• S3 Protocol: “s3://nyc-tlc/trip data/yellow_tripdata_2017-01.csv”
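The two addresses differ only in scheme and in how the space in the object key is written. A minimal sketch of deriving both from the bucket and key, assuming the path-style `s3.amazonaws.com` endpoint shown on the slide (the helper names are hypothetical):

```python
from urllib.parse import quote_plus


def s3_https_url(bucket: str, key: str) -> str:
    """Path-style HTTPS URL; spaces in the key become '+', slashes are kept."""
    return f"https://s3.amazonaws.com/{bucket}/{quote_plus(key, safe='/')}"


def s3_uri(bucket: str, key: str) -> str:
    """S3 protocol URI; the key is used verbatim, so quote it in a shell."""
    return f"s3://{bucket}/{key}"


if __name__ == "__main__":
    bucket = "nyc-tlc"
    key = "trip data/yellow_tripdata_2017-01.csv"
    print(s3_https_url(bucket, key))
    print(s3_uri(bucket, key))
```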
FILENAME URL Access Method
NOTE: The data set WORK.YELLOW_TRIPDATA_2017_01 has 9710124 observations and 17 variables.
real time 36.09 seconds
cpu time 33.85 seconds
PROC S3
NOTE: PROCEDURE S3 used (Total process time):
real time 3.77 seconds
cpu time 6.31 seconds
NOTE: PROCEDURE IMPORT used (Total process time):
real time 26.75 seconds
cpu time 26.75 seconds
AWS CLI
NOTE: DATA statement used (Total process time):
real time 5.80 seconds
cpu time 0.00 seconds
NOTE: PROCEDURE IMPORT used (Total process time):
real time 26.59 seconds
cpu time 26.59 seconds
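The three logs can be compared by summing the real time of each method's steps. The sketch below uses only the figures shown in the logs and assumes that the single FILENAME URL timing covers both the transfer and the import, as its one log entry suggests. Transfer throughput is derived from the 815 MB file size and is only indicative, since each method was timed once.

```python
FILE_SIZE_MB = 815  # size of yellow_tripdata_2017-01.csv, from the example slide

# Real time in seconds for each step, taken from the SAS logs.
STEP_TIMES = {
    "FILENAME URL": [36.09],        # assumed: transfer and import in one step
    "PROC S3": [3.77, 26.75],       # PROC S3 download, then PROC IMPORT
    "AWS CLI": [5.80, 26.59],       # DATA step running the CLI, then PROC IMPORT
}


def total_real_time(steps):
    """Sum the real time of all steps for one method, rounded to 2 decimals."""
    return round(sum(steps), 2)


def transfer_throughput_mb_s(transfer_seconds):
    """Approximate MB/s for the download step alone."""
    return FILE_SIZE_MB / transfer_seconds


if __name__ == "__main__":
    for method, steps in STEP_TIMES.items():
        print(f"{method}: {total_real_time(steps):.2f} s end to end")
    for method in ("PROC S3", "AWS CLI"):
        rate = transfer_throughput_mb_s(STEP_TIMES[method][0])
        print(f"{method}: about {rate:.0f} MB/s for the download step")
```

On these figures the download itself is only a few seconds for PROC S3 and the AWS CLI, and PROC IMPORT accounts for most of the elapsed time in both cases.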