
Page 1: Introduction to DataFrame

Workshop on Introduction to DataFrame: working with CSV, JSON, Parquet, XML data files

Natalia Myronova, Scientific Researcher,
University of Applied Sciences and Arts Dortmund, Germany, [email protected]

Online Spring School on Data Science, 17.05.2021 – 21.05.2021

Page 2: Introduction to DataFrame

Content Overview

• Introduction to DataFrame abstraction

• Working with CSV data

• Working with JSON data

• Working with Parquet data

• Working with XML data

• Examples

• Practical task

• Summary


Page 3: Introduction to DataFrame

Introduction to DataFrame abstraction

• A DataFrame is a single abstraction for representing structured data in Spark.

• A DataFrame represents rows, each of which consists of a number of observations. Rows can contain a variety of data formats (heterogeneous), whereas a column holds data of a single data type (homogeneous).

• A DataFrame usually contains some metadata in addition to the data, for example column and row names.

• We can say that a DataFrame is nothing but a 2-dimensional data structure, similar to a SQL table or a spreadsheet.

Source: https://dzone.com/articles/pyspark-dataframe-tutorial-introduction-to-datafra

Description     Column One  Column Two  Column Three  Column Four
First Feature   value11     value12     value13       value14
Second Feature  value21     value22     value23       value24
Third Feature   value31     value32     value33       value34
Fourth Feature  value41     value42     value43       value44
Fifth Feature   value51     value52     value53       value54

Page 4: Introduction to DataFrame

Approaches to create DataFrame

• You can create a DataFrame using the toDF() and createDataFrame() methods; both functions take different signatures in order to create a DataFrame from an existing RDD, a list, or another DataFrame (see the sketch after this list).

• You can also create a DataFrame from data sources such as TXT, CSV, JSON, ORC, Avro, Parquet, or XML files by reading from HDFS, S3, DBFS, Azure Blob storage, and other file systems.

• A DataFrame can also be created by reading data from RDBMS and NoSQL databases, e.g. Hive or Cassandra.
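The code from the original slide is not included in the transcript; here is a minimal PySpark sketch of both approaches, using small hypothetical example data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("create-dataframe").getOrCreate()
    data = [("Alice", 29), ("Bob", 35)]          # hypothetical example rows

    # createDataFrame(): build a DataFrame directly from a Python list
    df1 = spark.createDataFrame(data, ["name", "age"])

    # toDF(): convert an existing RDD into a DataFrame
    rdd = spark.sparkContext.parallelize(data)
    df2 = rdd.toDF(["name", "age"])

    df1.printSchema()
    df2.show()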

Source: https://sparkbyexamples.com/pyspark/different-ways-to-create-dataframe-in-pyspark/,

https://dzone.com/articles/pyspark-dataframe-tutorial-introduction-to-datafra


Page 5: Introduction to DataFrame

Create DataFrame from Data source: CSV(1/5)

• In practice you mostly create DataFrames from data source files such as CSV, TXT, JSON, XML, etc. PySpark supports many data formats out of the box without importing any extra libraries; to create a DataFrame you use the appropriate method of the DataFrameReader class.

What is CSV file format

• A CSV (Comma-Separated Values) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.

• Example CSV file
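The example file on the original slide is an image; a minimal illustrative CSV (hypothetical content) looks like this:

    name,age,city
    Alice,29,Dortmund
    Bob,35,Berlin
    Carol,41,Hamburg

The first line is a header record with column names; the remaining lines are data records whose fields are separated by commas.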

Sources: https://sparkbyexamples.com/pyspark/different-ways-to-create-dataframe-in-pyspark/, https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/#write-csv-file

Page 6: Introduction to DataFrame

Create DataFrame from Data source: CSV(2/5)

• Using csv("path") or format("csv").load("path") of DataFrameReader, you can read a CSV file into a PySpark DataFrame. These methods take a file path to read from as an argument. When you use format("csv") method, you can also specify the Data sources by their fully qualified name.

• Creating DataFrame from CSV without header: using csv("path")
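The slide's code screenshot is not part of the transcript; a minimal sketch, assuming a hypothetical file data/people.csv:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-csv").getOrCreate()

    # read a CSV file that has no header line
    df = spark.read.csv("data/people.csv")
    df.printSchema()   # columns are named _c0, _c1, ... and typed as string by default
    df.show()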

Source: https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/#write-csv-file

Page 7: Introduction to DataFrame

Create DataFrame from Data source: CSV(3/5)

• Creating DataFrame from CSV without header: using format("csv").load("path")
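Again, the screenshot is missing from the transcript; the same read expressed through the generic reader API, reusing the SparkSession spark from the previous sketch:

    # specify the data source with format() and the path with load()
    df = spark.read.format("csv").load("data/people.csv")
    df.show()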

Source: https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/#write-csv-file

Page 8: Introduction to DataFrame

Create DataFrame from Data source: CSV(4/5)

• Creating a DataFrame from CSV with a header: if your input file has a header line with column names, you need to explicitly set the header option to True using option("header", True).
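A minimal sketch of reading with a header, reusing the SparkSession spark and the hypothetical data/people.csv from the earlier sketches:

    # treat the first line of the file as column names
    df = spark.read.option("header", True).csv("data/people.csv")
    df.printSchema()   # columns now carry the names from the header row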

Source: https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/#write-csv-file

Page 9: Introduction to DataFrame

Create DataFrame from Data source: CSV(5/5)

• Options while reading a CSV file: the PySpark CSV data source provides multiple options for working with CSV files (combined in the sketch after this list).

• delimiter is used to specify the column delimiter of the CSV file. By default it is the comma (,) character, but it can be set to any character such as pipe (|), tab (\t), or space.

• inferSchema: the default value of this option is False; when it is set to True, column types are inferred automatically from the data. Note that this requires reading the data one more time to infer the schema.

• header: this option reads the first line of the CSV file as column names. By default its value is False, and all column types are assumed to be string.
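The three options can be combined in one read; a minimal sketch under the same assumptions as the previous examples:

    # explicit delimiter, schema inference, and header handling in a single read
    df = (spark.read
          .option("delimiter", ",")
          .option("inferSchema", True)
          .option("header", True)
          .csv("data/people.csv"))
    df.printSchema()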

Source: https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/#write-csv-file

Page 10: Introduction to DataFrame

Write DataFrame to file: CSV

• Use the write property of a PySpark DataFrame, which returns a DataFrameWriter object, to write the DataFrame to a CSV file.

• While writing a CSV file you can use several options, for example header to output the DataFrame column names as a header record and delimiter to specify the delimiter in the CSV output file.

• Saving modes: DataFrameWriter also has a mode() method to specify the saving mode (combined with the options in the sketch after this list):

• overwrite – overwrites the existing file;

• append – adds the data to the existing file;

• ignore – ignores the write operation when the file already exists;

• error – the default option; when the file already exists, it returns an error.

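The write examples on the original slide are screenshots; a minimal sketch, assuming the DataFrame df from the reading examples and a hypothetical output folder:

    # write with a header and a custom delimiter, overwriting any existing output
    (df.write
       .option("header", True)
       .option("delimiter", "|")
       .mode("overwrite")
       .csv("output/people_csv"))

    # equivalent form using the generic writer API
    df.write.format("csv").mode("overwrite").save("output/people_csv")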

Source: https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/

Page 11: Introduction to DataFrame

Create DataFrame from Data source: JSON(1/2)

What is JSON file format

• JSON, JavaScript Object Notation, is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language Standard.

• Example JSON file
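The example file on the original slide is an image; a minimal illustrative JSON file (hypothetical content) in the single-line form that Spark reads by default, one JSON object per line:

    {"name": "Alice", "age": 29, "city": "Dortmund"}
    {"name": "Bob", "age": 35, "city": "Berlin"}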

Source: https://www.json.org/json-en.html

Page 12: Introduction to DataFrame

Create DataFrame from Data source: JSON(2/2)

• Using spark.read.json("path") or spark.read.format("json").load("path") you can read a JSON file into a Spark DataFrame; these methods take a file path as an argument (single-line mode, where each line contains one JSON object).

• To read a multi-line JSON file, set the multiline option to true (both forms are sketched below).
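A minimal sketch of both forms, assuming hypothetical files data/people.json (one object per line) and data/people_multiline.json (a single multi-line JSON document):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-json").getOrCreate()

    # single-line mode (default): each line of the file is one JSON object
    df = spark.read.json("data/people.json")

    # equivalent generic form
    df = spark.read.format("json").load("data/people.json")

    # multi-line mode: the whole file is one JSON document (or an array of objects)
    df_multi = spark.read.option("multiline", "true").json("data/people_multiline.json")
    df_multi.show()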

Source: https://sparkbyexamples.com/spark/spark-read-and-write-json-file/

Page 13: Introduction to DataFrame

Write DataFrame to file: JSON

• Use the Spark DataFrameWriter object, obtained via write on a DataFrame, to write a JSON file.

• Spark options while writing JSON files: while writing a JSON file you can use several options, such as nullValue, dateFormat, and others.

• Saving modes: the Spark DataFrameWriter also has a mode() method to specify the SaveMode; the argument to this method is either one of the strings below or a constant from the SaveMode class (see the sketch after this list):

• overwrite – overwrites the existing file; alternatively, you can use SaveMode.Overwrite;

• append – adds the data to the existing file; alternatively, you can use SaveMode.Append;

• ignore – ignores the write operation when the file already exists; alternatively, you can use SaveMode.Ignore;

• errorifexists or error – the default option; when the file already exists, it returns an error; alternatively, you can use SaveMode.ErrorIfExists.
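A minimal sketch of writing JSON, assuming the DataFrame df from the reading sketch above and a hypothetical output folder:

    # save mode and writer options can be combined on the same write
    (df.write
       .mode("overwrite")                     # or: append, ignore, errorifexists
       .option("dateFormat", "yyyy-MM-dd")
       .json("output/people_json"))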

Source: https://sparkbyexamples.com/spark/spark-read-and-write-json-file/

Page 14: Introduction to DataFrame

Working with data file: Parquet(1/3)

What is Parquet file format

• Apache Parquet is a free and open-source column-oriented data storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

• Apache Parquet is a far more efficient file format than CSV or JSON.

• Apache Parquet Advantages

• Below are some of the advantages of using Apache Parquet. Combining these benefits with Spark improves performance and gives the ability to work with structured files:

• Reduces IO operations;

• Fetches specific columns that you need to access;

• It consumes less space;

• Supports type-specific encoding.

Sources: http://parquet.apache.org/, https://sparkbyexamples.com/spark/spark-read-write-dataframe-parquet-example/

Page 15: Introduction to DataFrame

Working with data file: Parquet(2/3)

• Spark write DataFrame to Parquet file format: using the parquet() function of the DataFrameWriter class, we can write a Spark DataFrame to a Parquet file (sketched below).

• Writing a Spark DataFrame to Parquet format preserves the column names and data types, and all columns are automatically converted to be nullable for compatibility reasons. Notice that all part files Spark creates have the .parquet extension.
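The slide's code is a screenshot; a minimal sketch with small hypothetical example data (the university column is reused in the partitioning example on the next page):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

    df = spark.createDataFrame(
        [("Alice", "FH", 29), ("Bob", "TU", 35)],
        ["name", "university", "age"])          # hypothetical rows

    # column names and data types are preserved in the Parquet output
    df.write.mode("overwrite").parquet("output/people.parquet")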

Sources: http://parquet.apache.org/, https://sparkbyexamples.com/spark/spark-read-write-dataframe-parquet-example/

Page 16: Introduction to DataFrame

Working with data file: Parquet(3/3)

• Spark read Parquet file into DataFrame: DataFrameReader provides the parquet() function (spark.read.parquet) to read Parquet files and create a Spark DataFrame. In the example snippet we read data from the Apache Parquet file written before (see the sketch after this list).

• Append to an existing Parquet file: Spark provides the capability to append a DataFrame to existing Parquet files using the "append" save mode. If you want to overwrite instead, use the "overwrite" save mode.

• Spark Parquet partitioning – improving performance: partitioning is a feature of many databases and data processing frameworks, and it is key to making jobs work at scale. We can partition a Parquet file using Spark's partitionBy() function. Parquet partitioning creates a folder hierarchy for each partition.

• Spark read a specific Parquet partition: this code snippet retrieves the data from the partition where university has the value "FH".
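A minimal sketch of all four operations, reusing the SparkSession spark and the output/people.parquet files written in the previous sketch:

    # read the Parquet files back into a DataFrame
    df = spark.read.parquet("output/people.parquet")

    # append additional rows to the existing Parquet files
    more = spark.createDataFrame([("Carol", "FH", 41)], ["name", "university", "age"])
    more.write.mode("append").parquet("output/people.parquet")

    # partition the output by a column; one sub-folder is created per value (university=FH, ...)
    df.write.partitionBy("university").mode("overwrite").parquet("output/people_by_university.parquet")

    # read only the partition where university = "FH"
    fh = spark.read.parquet("output/people_by_university.parquet/university=FH")
    fh.show()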

Source: https://sparkbyexamples.com/spark/spark-read-write-dataframe-parquet-example/

Page 17: Introduction to DataFrame

Working with data file: XML(1/2)

What is XML file format

• Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.

• The design goals of XML emphasize simplicity, generality, and usability across the Internet. It is a textual data format with strong support via Unicode for different human languages. Although the design of XML focuses on documents, the language is widely used for the representation of arbitrary data structures such as those used in web services.

• Several schema systems exist to aid in the definition of XML-based languages, while programmers have developed many application programming interfaces (APIs) to aid the processing of XML data.

• Example XML file
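The example file on the original slide is an image; a minimal illustrative XML file (hypothetical content, reused in the sketches on the next pages):

    <?xml version="1.0" encoding="UTF-8"?>
    <books>
      <book>
        <title>Learning Spark</title>
        <author>Jules S. Damji</author>
      </book>
      <book>
        <title>Spark: The Definitive Guide</title>
        <author>Bill Chambers</author>
      </book>
    </books>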

Source: https://en.wikipedia.org/wiki/XML

Page 18: Introduction to DataFrame

Working with data file: XML(2/2)

We can work with XML files using the following approaches:

• XML Data Source for Apache Spark (spark-xml) – a library for parsing and querying XML data with Apache Spark, for Spark SQL and DataFrames. This package supports processing format-free XML files in a distributed way, unlike the JSON data source in Spark, which is restricted to in-line JSON format. Library link: https://github.com/databricks/spark-xml (see the first sketch after this list).

• Writing a program that processes XML files to extract the required records, transforms them into a DataFrame, and then writes them as CSV files (or any other format) to the destination (see the example program on the next page). For this it is necessary to implement the following steps:

• Step 1: Read XML files into RDD

• Step 2: Parse XML files, extract the records, and expand into multiple RDDs

• Step 3: Convert RDDs into DataFrame

• Step 4: Save DataFrame as CSV files
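A minimal sketch of the first approach, assuming the spark-xml package has been added to the session (for example via spark-submit --packages com.databricks:spark-xml_2.12:&lt;version&gt;) and the hypothetical books.xml shown earlier:

    from pyspark.sql import SparkSession

    # requires the spark-xml package on the classpath
    spark = SparkSession.builder.appName("spark-xml-demo").getOrCreate()

    # rowTag selects which XML element becomes one DataFrame row
    df = (spark.read
          .format("com.databricks.spark.xml")
          .option("rowTag", "book")
          .load("data/books.xml"))
    df.printSchema()
    df.show()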

Source: https://q15928.github.io/2019/07/14/parse-xml/

Page 19: Introduction to DataFrame

Example of program for processing XML
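The program on the original slide is a screenshot and is not reproduced in the transcript; below is a minimal sketch of the second approach (steps 1–4), assuming hypothetical XML files like the earlier example under data/books/:

    import xml.etree.ElementTree as ET
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("xml-parse-demo").getOrCreate()

    def parse_books(xml_string):
        # Step 2 helper: parse one XML document and extract the required fields
        root = ET.fromstring(xml_string)
        return [(b.findtext("title"), b.findtext("author")) for b in root.iter("book")]

    # Step 1: read whole XML files into an RDD of (path, content) pairs
    raw = spark.sparkContext.wholeTextFiles("data/books/*.xml")

    # Step 2: parse each file and expand it into multiple records
    records = raw.flatMap(lambda kv: parse_books(kv[1]))

    # Step 3: convert the RDD of tuples into a DataFrame
    df = records.toDF(["title", "author"])

    # Step 4: save the DataFrame as CSV files
    df.write.option("header", True).mode("overwrite").csv("output/books_csv")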

Source: https://q15928.github.io/2019/07/14/parse-xml/

Page 20: Introduction to DataFrame

Datasets

• FIFA 2021 Complete Player Dataset: this dataset contains data on the players in FIFA 2021

• https://www.kaggle.com/aayushmishra1512/fifa-2021-complete-player-data

• Coronavirus Dataset: World Health Organization Coronavirus Disease (COVID-19) Dashboard

• https://covid19.who.int/table

• Amazon Top 50 Bestselling Books 2009 – 2019: dataset on Amazon's Top 50 bestselling books from 2009 to 2019. It contains 550 books; the data has been categorized into fiction and non-fiction using Goodreads

• https://www.kaggle.com/sootersaalu/amazon-top-50-bestselling-books-2009-2019

Page 21: Introduction to DataFrame

Self-study assignment

• Task 1. FIFA 2021 Complete Player Dataset: create a DataFrame from FIFA-21 Complete.csv and split the player position into separate columns.

• Task 2. FIFA 2021 Complete Player Dataset: create a DataFrame from FIFA-21 Complete.json and add a year-of-birth column (filled in using the information from the age column).

• Task 3. Coronavirus Dataset: create a DataFrame from WHO COVID-19 global table data November 30th 2020.csv and add a Risk column: if the number of "Cases - cumulative total per 1 million population" is more than 10000, then Risk = "High", otherwise Risk = "Low".

• Task 4. Coronavirus Dataset: create a DataFrame from WHO COVID-19 global table data November 30th 2020.json and add a Risk column: if the number of "Cases - newly reported in last 24 hours" is more than 5000, then Risk = "High", otherwise Risk = "Low".

• Task 5. Amazon's Top 50: create a DataFrame from bestsellers with categories.csv and add a Star column (filled in using the information from the User Rating column: if User Rating > 4, then Star = "*****", else Star = "****").

• Task 6. Amazon's Top 50: create a DataFrame from bestsellers with categories.json. Then create the file book.parquet using the partitionBy() function, partitioned by Genre (Non Fiction and Fiction).

Page 22: Introduction to DataFrame

Summary

Now you…

• … can explain the DataFrame abstraction of Apache Spark

• … understand how to work with different types of data files: CSV, JSON, Parquet and XML

• … can write programs using Spark

• … can run programs using Colab with Spark

Source: https://towardsdatascience.com/apache-spark-a-conceptual-orientation-e326f8c57a64/

Page 23: Introduction to DataFrame

Thank you for your attention!

Page 24: Introduction to DataFrame

Books and References

Books

• J. Aven, “Sams Teach Yourself Apache Spark™ in 24 Hours”, Pearson Education, 2017

• Bill Chambers & Matei Zaharia, "Spark: The Definitive Guide: Big Data Processing Made Simple", O'Reilly Media, 2018

• Jean-Georges Perrin, “Spark in Action, Second Edition: Covers Apache Spark 3 with Examples in Java, Python, and Scala”, Manning Publications, 2020

• Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee, “Learning Spark: Lightning-fast Data Analytics”, O'Reilly Media, 2020, 2nd Edition

Other References:

• Apache Spark https://spark.apache.org/

• PySpark Tutorial https://sparkbyexamples.com/pyspark-tutorial/

Online course

• Big Data Essentials: HDFS, MapReduce and Spark RDD https://www.coursera.org/learn/big-data-essentials

• Cloud Computing Applications, Part 2: Big Data and Applications in the Cloud https://www.coursera.org/learn/cloud-applications-part2

• CCA 175 Spark and Hadoop Developer Certification using Scala https://www.udemy.com/course/cca-175-spark-and-hadoop-developer-certification-using-scala