
Post on 23-Feb-2022


Page 1: EECS E6893 Big Data Analytics Cong Han, ch3212@columbia

EECS E6893 Big Data Analytics
Intro to Big Data Analytics on GCP

Cong Han, ch3212@columbia

Agenda
● GCP
  ○ Setup
  ○ Interaction
● Services
  ○ Cloud Storage
  ○ BigQuery
  ○ Dataproc (Spark)
● HW0

Google Cloud Platform (GCP)

GCP
● Cloud computing platform
  ○ Flexibility: on-demand, scale as you need
  ○ Efficiency: no need to maintain infrastructure
● Services (relevant to this assignment)
  ○ Compute
    ■ Compute Engine: VMs / servers (automatically created by Dataproc)
  ○ Big data products
    ■ BigQuery: data warehouse for analytics
    ■ Dataproc: Hadoop and Spark
  ○ Storage
    ■ Cloud Storage: object storage system
  ○ Much more at https://cloud.google.com/products/

GCP Setup
● Create a Google account (you can use your Columbia account)
● Apply for the $300 credit for the first year: https://cloud.google.com/free/
● Go to Console dashboard -> Billing to check that the credit is there


GCP: Create project
● Project: basic unit for creating, enabling, and using all GCP services
  ○ Managing APIs, billing, permissions
  ○ Adding and removing collaborators
● Visit the console dashboard or Cloud Resource Manager
● Click on “create project / new project” and complete the flow
● Ensure billing points to the $300 credit


GCP: Interaction
● Graphical UI / console: useful for creating VMs, setting up clusters, provisioning resources, managing teams, etc.
● Command line tools / Cloud SDK: useful for interacting from your local host and using the resources once provisioned, e.g. ssh into instances, submit jobs, copy files
● Cloud Shell: same as the command line, but web-based and pre-installed with the SDK and tools

GCP: console

Search for services here

Manage / Enable APIs

GCP: Cloud SDK
● Install the SDK that suits your local environment: https://cloud.google.com/sdk/docs/quickstarts
● Some checks after installation:
  ○ gcloud info
  ○ gcloud auth list
  ○ gcloud components list
● Change the default config:
  ○ gcloud init


GCP: Cloud Shell

persistent home directory :)

GCP: Cloud Shell Code Editor


Cloud Storage
● Online file storage system
● Graphical UI through the console
● Command line tool: gsutil


URI (Uniform Resource Identifier): works like a file path on GCP, in the form gs://<bucket>/<object>. Use it in your program.
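A small sketch of how a program might take such a URI apart, assuming the usual `gs://<bucket>/<object path>` form. The helper name `split_gcs_uri` and the bucket and object names are made up for illustration; only the standard library is used.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split a Cloud Storage URI into (bucket, object_name)."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a Cloud Storage URI: {uri!r}")
    # urlparse keeps the leading slash on the path; object names do not have one.
    return parsed.netloc, parsed.path.lstrip("/")


if __name__ == "__main__":
    # "my-bucket" and "data/words.txt" are hypothetical names.
    bucket, obj = split_gcs_uri("gs://my-bucket/data/words.txt")
    print(bucket, obj)
```

Spark jobs on Dataproc can usually be given the whole `gs://` string directly as an input path, so splitting is only needed when the bucket and object name are required separately.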

Cloud Storage - gsutil
● Interact with Cloud Storage through the command line
● Works similarly to the Unix command line
● Useful commands:
  ○ Concatenate object content to stdout: gsutil cat [-h] url…
  ○ Copy file: gsutil cp [OPTION]... src_url dst_url
  ○ List files: gsutil ls [OPTION]... url…
● Explore more at https://cloud.google.com/storage/docs/gsutil

BigQuery
● Data warehouse for analytics
● SQL-like language to interact with the database
● RESTful APIs / client libraries for programmatic access
● Graphical UI
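As a rough illustration of the programmatic access mentioned above, the sketch below builds a SQL string and, optionally, runs it through the Python client library. It assumes the `google-cloud-bigquery` package is installed and that credentials and a billing-enabled project are configured; the public sample table `bigquery-public-data.samples.shakespeare` and the helper names are choices made for this example, not something taken from the slides.

```python
def top_words_sql(limit: int) -> str:
    """Build a query for the most frequent words in a public sample table."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        "SELECT word, SUM(word_count) AS total "
        "FROM `bigquery-public-data.samples.shakespeare` "
        "GROUP BY word ORDER BY total DESC "
        f"LIMIT {int(limit)}"
    )


def run_query(sql: str):
    """Run the query with the client library and return the rows as a list."""
    # Imported lazily so the SQL helper works without the optional dependency.
    from google.cloud import bigquery

    client = bigquery.Client()  # picks up the default project and credentials
    return list(client.query(sql).result())


if __name__ == "__main__":
    # Only prints the SQL; call run_query(...) once credentials are set up.
    print(top_words_sql(10))
```

The same SQL string can also be pasted into the BigQuery graphical UI, which is often the quicker way to check a query before calling it from code.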


Dataproc

What is Dataproc?

Why Dataproc?

Dataproc - graphical UI

Dataproc - Cloud SDK

Cluster creation (using Cloud SDK):

Submit a job - Pi calculation
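For intuition about what the Pi job computes, here is a plain-Python sketch of a Monte Carlo estimate. It assumes the job is the sampling-based estimator shipped with Spark's examples; on a cluster the samples are spread across workers, while this version runs in a single process. The function name and the fixed seed are choices made for this illustration.

```python
import random


def estimate_pi(num_samples: int, seed: int = 0) -> float:
    """Estimate pi by sampling points in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        # Count the points that land inside the unit circle.
        if x * x + y * y <= 1.0:
            inside += 1
    # Area of circle / area of square = pi / 4.
    return 4.0 * inside / num_samples


if __name__ == "__main__":
    print(estimate_pi(200_000))
```

The estimate tightens slowly as the sample count grows, which is why the computation is a convenient toy workload for a cluster.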

Dataproc
● On-demand, fully managed cloud service for running Apache Hadoop and Spark on GCP
● Cluster creation (using Cloud SDK):
  ○ Automatically creates VMs with Spark pre-installed
  ○ Cloud Storage bucket: where your Jupyter notebooks are saved
  ○ Install Jupyter Notebook
  ○ Works like pip install <your package>

Dataproc - Spark execution / submit jobs
● Jupyter notebook
● Cloud SDK:
  ○ gcloud dataproc jobs submit pyspark <your_program.py> --cluster=<cluster-name>
    - Program could be a Cloud Storage URI / local path / Cloud Shell path
    - Data should be on Cloud Storage
  ○ View your jobs in the console
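When jobs are submitted from a script rather than typed by hand, the command above can be assembled as an argument list. This is only a sketch: the helper name, the bucket, cluster, and region values are hypothetical, and the optional `--region` flag and the `--` separator for job arguments are included on the assumption that your gcloud version accepts them as it normally does.

```python
import shlex
from typing import List, Optional, Sequence


def pyspark_submit_command(
    program: str,
    cluster: str,
    region: Optional[str] = None,
    job_args: Sequence[str] = (),
) -> List[str]:
    """Build the argv for `gcloud dataproc jobs submit pyspark`."""
    cmd = [
        "gcloud", "dataproc", "jobs", "submit", "pyspark",
        program,  # Cloud Storage URI or local path to the program
        f"--cluster={cluster}",
    ]
    if region:
        cmd.append(f"--region={region}")
    if job_args:
        # Everything after "--" is passed to the PySpark program itself.
        cmd.append("--")
        cmd.extend(job_args)
    return cmd


if __name__ == "__main__":
    # All names below are made up; pass the list to subprocess.run to execute.
    cmd = pyspark_submit_command(
        "gs://my-bucket/wordcount.py",
        "my-cluster",
        region="us-east1",
        job_args=["gs://my-bucket/input.txt"],
    )
    print(shlex.join(cmd))
```

Keeping the command as a list avoids shell-quoting problems if it is later run with `subprocess.run(cmd, check=True)`.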

Dataproc - Spark execution / submit jobs (cont'd)
● Spark shell
  ○ ssh into the master node
  ○ pyspark

HW0
1. Read documentation and tutorials
   a. Set up GCP and Cloud SDK
   b. Get familiar with BigQuery
   c. Run Spark examples on Dataproc: Pi calculation and word count
2. Two light programming questions
   a. BigQuery
   b. Spark program: find the top k most frequent words
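To make the top-k task concrete, here is a single-machine sketch of the counting logic in plain Python. It is not the Spark solution: the tokenization rule (lowercase letters and apostrophes) and the alphabetical tie-breaking are assumptions made for this example, and the homework may specify different rules.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def top_k_words(lines: Iterable[str], k: int) -> List[Tuple[str, int]]:
    """Return the k most frequent words with their counts."""
    counts: Counter = Counter()
    for line in lines:
        # Lowercase the line and keep runs of letters/apostrophes as words.
        counts.update(re.findall(r"[a-z']+", line.lower()))
    # Sort by descending count, breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


if __name__ == "__main__":
    sample = ["the cat and the hat", "The cat sat"]
    print(top_k_words(sample, 2))
```

A Spark version would follow the same three steps (split lines into words, count per word, take the largest counts), but with the work distributed across the cluster.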

Remember to delete your Dataproc clusters when your jobs have finished, to save money.