Exploring Language Classification with Apache Spark and the Spark Notebook
A practical introduction to interactive Data Engineering
Gerard Maas
Gerard Maas
Lead Engineer @ Kensu
Computer Engineer
Scala Programmer
Early Spark Adopter
Spark Notebook Dev
Cassandra MVP (2015, 2016)
Stack Overflow Top Contributor(Spark, Spark Streaming, Scala)
Wannabe IoT HackerArduino Enthusiast
@maasg
https://github.com/maasg
https://www.linkedin.com/in/gerardmaas/
https://stackoverflow.com/users/764040/maasg
DATA SCIENCE GOVERNANCE
Adalog helps enterprises ensure that data pipelines continually deliver their value, by combining the contextual information from when a pipeline was created with the evolving environment in which it executes.
CONNECT - COLLECT - LEARN
Language Classification
Language Classification
Some inspiration...
What is a language? How is it composed?
Letter Frequency
Could we characterize a language by calculating the relative frequency of letters in some text?
Spanish vs English letter frequency
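The idea can be sketched in plain Scala (a minimal sketch, separate from the Spark code in the notebooks; the summed-absolute-difference comparison of two profiles is an illustrative choice, not the notebook's method):

```scala
// Relative letter frequency of a text: the "profile" that could
// characterize a language.
def letterFrequencies(text: String): Map[Char, Double] = {
  val letters = text.toLowerCase.filter(_.isLetter)
  val total = letters.length.toDouble
  letters.groupBy(identity).map { case (c, occurrences) =>
    c -> occurrences.length / total
  }
}

// A simple distance between two profiles: the sum of absolute
// differences over all letters seen in either text.
def profileDistance(a: Map[Char, Double], b: Map[Char, Double]): Double =
  (a.keySet ++ b.keySet).toSeq
    .map(c => math.abs(a.getOrElse(c, 0.0) - b.getOrElse(c, 0.0)))
    .sum
```

A new document would then be classified as the language whose reference profile minimizes this distance.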
n-grams
"cavnar and trenkle"
bi-grams: ca,av,vn,na,ar,r_,_a,an,nd,d_,_t,tr,re,en,nk,kl,le,e_
tri-grams: cav,avn,vna,nar,ar_,r_a,_an,and,nd_,d_t,_tr,tre,ren,enk,nkl,kle,le_
quad-grams: cavn,...
http://odur.let.rug.nl/~vannoord/TextCat/textcat.pdf
Could we characterize a language by calculating the relative frequency of sequences of letters in some text?
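The n-grams above can be reproduced in plain Scala with `sliding`: whitespace is mapped to `_` and a trailing `_` marks the end of the text, matching the example.

```scala
// Extract the n-grams of a text, Cavnar & Trenkle style:
// whitespace becomes '_' and a trailing '_' closes the last word.
def nGrams(text: String, n: Int): Seq[String] =
  (text.toLowerCase.replaceAll("\\s+", "_") + "_").sliding(n).toSeq

val bigrams  = nGrams("cavnar and trenkle", 2)  // ca, av, vn, ..., e_
val trigrams = nGrams("cavnar and trenkle", 3)  // cav, avn, ..., le_
```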
Tech
Spark APIs
RDD -> Resilient Distributed Datasets
- Lazy, functional-oriented, low-level API
- Basis for the execution of all high-level libraries

DataFrames

- Column-oriented, SQL-inspired DSL
- Many optimizations under the hood (Catalyst, Tungsten)
Dataset
- Best of both worlds (except …)
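The three APIs side by side on a toy dataset (a hedged sketch: it assumes a local `SparkSession`, and the sample words are made up; the `Word` case class must be defined at the top level for its encoder to resolve):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("spark-apis")
  .getOrCreate()
import spark.implicits._

case class Word(word: String, len: Int)

// RDD: lazy, functional, low-level
val rdd = spark.sparkContext
  .parallelize(Seq("spark", "notebook"))
  .map(w => (w, w.length))

// DataFrame: column-oriented DSL, optimized by Catalyst/Tungsten
val df = rdd.toDF("word", "len")
val longDf = df.where($"len" > 5)

// Dataset: typed like an RDD, optimized like a DataFrame
val ds = df.as[Word]
val longDs = ds.filter(_.len > 5)
```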
Spark Notebook
A dynamic and visual web-based notebook for Spark with Scala
Spark Notebook - Open Source Roadmap
2017, Q1-Q3: Git, Kerberos, Project Generator
Announcements: blog.kensu.io
Notebooks
Notebooks for this presentation are located at:
https://github.com/maasg/spark-notebooks
- have fun!
Notebook 1: Naive Language Classification
https://github.com/maasg/spark-notebooks/languageclassification/language-detection-letter-freq.snb
Implements the idea of using a letter-frequency model to classify the language of a document.
Uses the dataset found in https://github.com/maasg/spark-notebooks/languageclassification/data/
It produces a training set of sampled strings that is also used for the n-gram classifier.
(Note: this notebook is missing a function that is left as an exercise to the reader. The folder /solutions contains the full working version.)
Notebook 2: n-gram Language Classification
https://github.com/maasg/spark-notebooks/languageclassification/n-gram-language-classification.snb
Implements the n-gram algorithm described in the paper.
Uses the dataset found in https://github.com/maasg/spark-notebooks/languageclassification/data/
Uses the resulting classifier to implement a custom Spark ML Transformer that can be easily used to classify new texts. Transformers can be combined into Spark ML Pipelines of arbitrary complexity.
(Note: this notebook is missing a function that is left as an exercise to the reader. The folder /solutions contains the full working version.)
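A custom Transformer along the lines the notebook describes can be sketched with Spark ML's `UnaryTransformer` (a hedged sketch: `detectLanguage` and its toy accented-character heuristic are stand-ins for the notebook's n-gram classifier, which is not reproduced here, and the class and column names are made up):

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

// Stand-in for the n-gram classifier built in the notebook:
// given a text, return a language label. Toy heuristic, not the real model.
def detectLanguage(text: String): String =
  if (text.exists(c => "áéíóúñ".indexOf(c) >= 0)) "es" else "en"

// A Transformer mapping one String column to another: it can be
// composed with other stages in a Spark ML Pipeline.
class LanguageDetector(override val uid: String)
    extends UnaryTransformer[String, String, LanguageDetector] {

  def this() = this(Identifiable.randomUID("langDetect"))

  override protected def createTransformFunc: String => String = detectLanguage

  override protected def outputDataType: DataType = StringType
}
```

Once defined, `new LanguageDetector().setInputCol("text").setOutputCol("lang")` can be dropped into a `Pipeline` alongside other stages.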