![Page 1: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster](https://reader030.vdocument.in/reader030/viewer/2022040409/5ec5fe4e1ef0f36d12360f56/html5/thumbnails/1.jpg)
CHARLES UNIVERSITY IN PRAGUE
http://d3s.mff.cuni.cz
faculty of mathematics and physics
Python for PracticeNPRG067
Tomas BuresPetr HnetynkaLadislav Peška
{bures,hnetynka}@[email protected]
![Page 2: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster](https://reader030.vdocument.in/reader030/viewer/2022040409/5ec5fe4e1ef0f36d12360f56/html5/thumbnails/2.jpg)
2
Spark
Open-source cluster computing frameworkPart of Apache’s family of big-data tools
Runs as standalone or on top of Hadoop Yarn and Apache MesosCan access data stored in HDFS, Cassandra, Hbase, …
Main API in ScalaBinding to other languages Java, Python, R
![Page 3: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster](https://reader030.vdocument.in/reader030/viewer/2022040409/5ec5fe4e1ef0f36d12360f56/html5/thumbnails/3.jpg)
3
Spark Architecture
Image from https://spark.apache.org/docs/latest/cluster-overview.html
![Page 4: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster](https://reader030.vdocument.in/reader030/viewer/2022040409/5ec5fe4e1ef0f36d12360f56/html5/thumbnails/4.jpg)
4
Types of API
RDD (Resilient distributed dataset)Collections of serializable black-box objectsMore low-level operations
DataFramesTables of rows with schemaQueries mimic SQLQuery optimization before executionTwo main variants
SQL-like statements (strings)Query builder API
Seamless interoperability between RDDs and DataFramesNote that there is no DataSet API (as in Scala)
![Page 5: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster](https://reader030.vdocument.in/reader030/viewer/2022040409/5ec5fe4e1ef0f36d12360f56/html5/thumbnails/5.jpg)
5
First Example
Exercise: e01
Uses the DataFrames API
![Page 6: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster](https://reader030.vdocument.in/reader030/viewer/2022040409/5ec5fe4e1ef0f36d12360f56/html5/thumbnails/6.jpg)
6
Different APIs
Exercise: e03
“Show max 10 reviewers which have the salary above average. Sort them by ascending age.”
Examples:DataFrame APISQL APIRDD API
![Page 7: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster](https://reader030.vdocument.in/reader030/viewer/2022040409/5ec5fe4e1ef0f36d12360f56/html5/thumbnails/7.jpg)
7
DataFrame APIQuery building functions
Projection: select, selectExpr, drop, withColumn, withColumnRenamed, agg, replaceSelection: filter, where Deduplication of rows: distinct, dropDuplicatesHandling null values: dropna, fillnaJoins: join, crossJoinSet operations: intersect, subtractGrouping: groupByOperations for online analytical processing (OLAP): cube, rollup, crosstabOrdering, limiting: limit, orderBy / sort, sortWithinPartitionsStatistics over the whole dataframe: approxQuantile, cov, corr, …
Actionscoalesce, repartitioncollect, take, first, headtoLocalIteratorcountforeach, foreachPartitionwrite.xxx
![Page 8: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster](https://reader030.vdocument.in/reader030/viewer/2022040409/5ec5fe4e1ef0f36d12360f56/html5/thumbnails/8.jpg)
8
DataFrame APIRDD
rdd, toJSONExploratory analysis
describesample, sampleByrandomSplit
Debuggingshow, printSchemaexplain
Miscellaneous functionscache, persist
![Page 9: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster](https://reader030.vdocument.in/reader030/viewer/2022040409/5ec5fe4e1ef0f36d12360f56/html5/thumbnails/9.jpg)
9
Aggregations
Exercise: e05
“Compute the sum of review lengths by every reviewer and make summary statistics of the lengths along with breakdown by gender and age”
![Page 10: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster](https://reader030.vdocument.in/reader030/viewer/2022040409/5ec5fe4e1ef0f36d12360f56/html5/thumbnails/10.jpg)
10
Cube operator
Used in online analytical processingReturns aggregated data for different combinationsof aggregationsRuns faster thandoing the aggregationsone-by-one
Picture from https://blog.xlcubed.com/wp-content/uploads/2010/06/cube.png
![Page 11: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster](https://reader030.vdocument.in/reader030/viewer/2022040409/5ec5fe4e1ef0f36d12360f56/html5/thumbnails/11.jpg)
11
Column Mapping
A number of functions pre-defined for operations over columns:
http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#module-pyspark.sql.functions
User defined mappingVia transformation to RDD and then using RDD.mapOr via user-defined functions (UDF)
Exercise: e09
![Page 12: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster](https://reader030.vdocument.in/reader030/viewer/2022040409/5ec5fe4e1ef0f36d12360f56/html5/thumbnails/12.jpg)
12
Windows
Per-row functions in the context of a windowAka partition by
Window functions:row_number, rank, dense_rank, lag, lead, ntile, percent_rank, rank, cume_dist
Exercise: e11Exercise: e13
![Page 13: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster](https://reader030.vdocument.in/reader030/viewer/2022040409/5ec5fe4e1ef0f36d12360f56/html5/thumbnails/13.jpg)
13
Determining Trends
“Is there a trend in time for the length of reviews per reviewer? E.g. does anyone tend to write longer and longer reviews?”
Exercise: e16
Math on the next two slides…
![Page 14: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster](https://reader030.vdocument.in/reader030/viewer/2022040409/5ec5fe4e1ef0f36d12360f56/html5/thumbnails/14.jpg)
14
Simple Linear Regression
Simple Linear regression
![Page 15: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster](https://reader030.vdocument.in/reader030/viewer/2022040409/5ec5fe4e1ef0f36d12360f56/html5/thumbnails/15.jpg)
15
Quality of Fit
Pearson’s correlation coefficientdetermines how dependent data are
![Page 16: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster](https://reader030.vdocument.in/reader030/viewer/2022040409/5ec5fe4e1ef0f36d12360f56/html5/thumbnails/16.jpg)
16
Statistical Testing
“Check whether there is a statistical difference between the length of reviews written by man and women.”Welch’s t-testwith 𝜈𝜈 degreesof freedom
![Page 17: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster](https://reader030.vdocument.in/reader030/viewer/2022040409/5ec5fe4e1ef0f36d12360f56/html5/thumbnails/17.jpg)
17
t-distribution
![Page 18: Python for Practice NPRG067...Python for Practice NPRG067 Tomas Bures Petr Hnetynka Ladislav Pe ška {bures,hnetynka}@d3s.mff.cuni.cz peska@ksi.mff.cuni.cz 2 Spark Open-source cluster](https://reader030.vdocument.in/reader030/viewer/2022040409/5ec5fe4e1ef0f36d12360f56/html5/thumbnails/18.jpg)
18
Structured columns
Columns may contain structured data or listsThey can be indexed with the usual Pythonic way
df.select(x.y.z)df.select(x[0][1])
Exercise: e19