apache spark usage in the open source ecosystem

28
Apache Spark Usage in the Open Source Ecosystem Hossein Falaki @mhfalaki

Upload: databricks

Post on 14-Apr-2017

1.704 views

Category:

Software


1 download

TRANSCRIPT

Page 1: Apache Spark Usage in the Open Source Ecosystem

Apache Spark Usage in the Open Source Ecosystem

Hossein Falaki @mhfalaki

Page 2: Apache Spark Usage in the Open Source Ecosystem

About me

• Software Engineer / part-time Data Scientist at Databricks• I started using Apache Spark since version 0.6 • Developed first version of Apache Spark CSV data source• Worked on SparkR and R notebooks at Databricks

2

Page 3: Apache Spark Usage in the Open Source Ecosystem

Stackoverflow 2016 trending tech

3

Page 4: Apache Spark Usage in the Open Source Ecosystem

Apache Spark PhilosophyUnified engineSupport end-to-end applications

High-level APIsEasy to use, rich optimizations

Integrate broadlyStorage systems, libraries, etc

SQLStreaming ML Graph

1

2

3

Page 5: Apache Spark Usage in the Open Source Ecosystem

Databricks Community Edition

• In February Databricks launched a free version of its cloud based platform in beta

• Since then more than 8,000 users registered• Users created over 61,000 notebooks in different languages• This is an analysis of third party libraries that our beta users

imported to complement Apache Spark in Scala, Python, and R

5

Page 6: Apache Spark Usage in the Open Source Ecosystem

What % of users use other librariesLanguage %usersimporting externallibs Average#libs Median#libsPython 75% 9 2Scala 55% 3 1R 57% 6 1

6

Page 7: Apache Spark Usage in the Open Source Ecosystem

Installing libraries is easy

7

Page 8: Apache Spark Usage in the Open Source Ecosystem

Python Packages

8

Page 9: Apache Spark Usage in the Open Source Ecosystem

Most popular Python packages

9

Page 10: Apache Spark Usage in the Open Source Ecosystem

What is test_helper?

10

Page 11: Apache Spark Usage in the Open Source Ecosystem

What are these?

ETL• re• datetime• pandas• json• csv• string• math / operator• urllib / urllib2

11

Visualization• matplotlib• ggplot• seaborn

Advanced analytics• numpy• sklearn• graphframes• tensorflow• scipy

Other• test_helper• os• md5

Page 12: Apache Spark Usage in the Open Source Ecosystem

Python package categories

12

Page 13: Apache Spark Usage in the Open Source Ecosystem

What packages go together?

13

Page 14: Apache Spark Usage in the Open Source Ecosystem

Scala Packages

14

Page 15: Apache Spark Usage in the Open Source Ecosystem

Most popular Scala libraries

15

Page 16: Apache Spark Usage in the Open Source Ecosystem

What are these?

ETL• java/scala util• scala.collection• scala.math• java.{io, nio}• java.text• o.a.commons• kafka• twitter4j

16

Visualization• ?

Advanced analytics• spark.ml• graphframes

Other• java.net• scala.sys

Page 17: Apache Spark Usage in the Open Source Ecosystem

Scala package categories

17

Page 18: Apache Spark Usage in the Open Source Ecosystem

What libraries go together?

18

Page 19: Apache Spark Usage in the Open Source Ecosystem

R Packages

19

Page 20: Apache Spark Usage in the Open Source Ecosystem

Most popular R packages

20

Page 21: Apache Spark Usage in the Open Source Ecosystem

What are these?

ETL• dplyr• plyr• reshape2• jsonlite• tidyr• lubridate• httr• data.table

21

Visualization• ggplot2• beanplot• plotly• ...

Advanced analytics• sparkr• h2o• caret• e1071

Other• devtools• magrittr

Page 22: Apache Spark Usage in the Open Source Ecosystem

R package categories

22

Page 23: Apache Spark Usage in the Open Source Ecosystem

Comparing Python, Scala & R

23

Page 24: Apache Spark Usage in the Open Source Ecosystem

Languages have unique features

24

Scala / Python / R R / Python Scala / Python / R

• 25 % of users, use multiple languages• 3% of notebooks mix different languages

Page 25: Apache Spark Usage in the Open Source Ecosystem

Summary

• Spark users extensively mix it with other packages in different languages– One of goals of Spark project is working well with other projects

• ETL related libraries are the most popular category– Opportunities for new data sources

• Notebooks are being used for “small data” as well as “big data.”• Languages and their ecosystems have diverse capabilities. Users seem to

be mixing languages to their advantage– Scala is missing visualization libraries

25

Page 26: Apache Spark Usage in the Open Source Ecosystem

Try your favorite library in Databricks

26

http://databricks.com/ceTry latest version of Apache Spark and preview of Spark 2.0

Page 27: Apache Spark Usage in the Open Source Ecosystem

Thank you!

Page 28: Apache Spark Usage in the Open Source Ecosystem

What packages are used together?

28