Apache Spark MLlib Guide for Pipelining (cis.csuohio.edu/~sschung/cis612/)


Here is the Apache Spark MLlib guide for pipelining:

http://spark.apache.org/docs/latest/ml-pipeline.html

An example pipeline can include the following stages: Tokenizer, HashingTF, and LogisticRegression.

The first two stages, Tokenizer and HashingTF, are Transformers, which implement the method transform() that converts one DataFrame into another by appending one or more columns.
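The transform() contract can be illustrated with a plain-Python sketch (not the Spark API; "rows" here stands in for a DataFrame as a list of dicts, and the function names are hypothetical). Each transform step reads one column and returns a new dataset with an extra column appended:

```python
# Plain-Python sketch of the Transformer contract (not Spark itself):
# each step takes a "DataFrame" (a list of row dicts) and returns a
# new one with one extra column appended.

def tokenize(rows, input_col="text", output_col="words"):
    """Tokenizer-like step: append a column of lowercase tokens."""
    return [{**row, output_col: row[input_col].lower().split()} for row in rows]

def hashing_tf(rows, input_col="words", output_col="features", num_features=16):
    """HashingTF-like step: append a term-frequency vector column
    using the hashing trick (index = hash(term) % num_features)."""
    out = []
    for row in rows:
        vec = [0] * num_features
        for term in row[input_col]:
            vec[hash(term) % num_features] += 1
        out.append({**row, output_col: vec})
    return out

df = [{"text": "Spark ML pipelines"}]
df = hashing_tf(tokenize(df))  # original column is kept; new ones are appended
```

Note that each step leaves the input columns in place and only appends, which is what lets later pipeline stages read the outputs of earlier ones.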

A feature transformer might take a DataFrame, read a column (e.g., text), map it into a new column (e.g., feature vectors), and output a new DataFrame with the mapped column appended.

A learning model might take a DataFrame, read the column containing feature vectors, predict the label for each feature vector, and output a new DataFrame with the predicted labels appended as a column.

Logistic Regression is an Estimator. An Estimator abstracts the concept of a learning algorithm, or any algorithm that fits or trains on data. Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer. For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer (from the Apache Spark pipeline website).
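The Estimator/Model relationship can be sketched in plain Python (hypothetical class names, not the real Spark API): fit() accepts a dataset and produces a Model, and the fitted Model is itself a Transformer.

```python
# Plain-Python sketch of the Estimator contract described above
# (toy names, not the Spark API).

class MeanThresholdEstimator:
    """Toy 'learning algorithm': fit() learns the mean of a numeric column."""
    def __init__(self, input_col="value", output_col="prediction"):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows):
        # An Estimator's fit() accepts a dataset and produces a Model.
        mean = sum(r[self.input_col] for r in rows) / len(rows)
        return MeanThresholdModel(mean, self.input_col, self.output_col)

class MeanThresholdModel:
    """The fitted Model is a Transformer: transform() appends a label column."""
    def __init__(self, threshold, input_col, output_col):
        self.threshold = threshold
        self.input_col, self.output_col = input_col, output_col

    def transform(self, rows):
        return [{**r, self.output_col: 1.0 if r[self.input_col] > self.threshold else 0.0}
                for r in rows]

train = [{"value": 1.0}, {"value": 3.0}]
model = MeanThresholdEstimator().fit(train)   # Estimator -> Model
scored = model.transform([{"value": 5.0}])    # the Model acts as a Transformer
```

This mirrors the LogisticRegression case: LogisticRegression plays the role of the estimator class, and LogisticRegressionModel the role of the fitted model that transforms new data.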

The bottom line is the workflow of data through the pipeline after each stage: we begin with raw text; after Tokenizer we have words; after HashingTF we have feature vectors; and after we fit the model on the training set and apply it to both the training and test sets, we arrive at our predictions.
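The whole workflow above can be sketched end to end in plain Python (hypothetical names and a toy final estimator, not the Spark API): a Pipeline holds the stages in order, fit() passes training data through the transformers and fits the final estimator, and the fitted pipeline can then transform a test set.

```python
# Plain-Python sketch of the pipeline workflow: raw text -> words ->
# feature vectors -> predictions (toy classes, not the Spark API).

class Tokenizer:
    def transform(self, rows):
        return [{**r, "words": r["text"].lower().split()} for r in rows]

class HashingTF:
    def __init__(self, num_features=16):
        self.num_features = num_features
    def transform(self, rows):
        out = []
        for r in rows:
            vec = [0] * self.num_features
            for w in r["words"]:
                vec[hash(w) % self.num_features] += 1
            out.append({**r, "features": vec})
        return out

class MajorityClassEstimator:
    """Toy final stage: fit() learns the most common training label."""
    def fit(self, rows):
        labels = [r["label"] for r in rows]
        return MajorityClassModel(max(set(labels), key=labels.count))

class MajorityClassModel:
    def __init__(self, label):
        self.label = label
    def transform(self, rows):
        return [{**r, "prediction": self.label} for r in rows]

class Pipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # Estimator: fit it, keep the Model
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # pass the data on to the next stage
            fitted.append(stage)
        return Pipeline(fitted)            # fitted pipeline: all Transformers
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

train = [{"text": "good movie", "label": 1.0},
         {"text": "bad movie", "label": 0.0},
         {"text": "great film", "label": 1.0}]
pipeline = Pipeline([Tokenizer(), HashingTF(), MajorityClassEstimator()])
fitted = pipeline.fit(train)                          # train once on the training set
predictions = fitted.transform([{"text": "new movie"}])  # apply to a test set
```

The key point the sketch shows: after fitting, every stage (including the trained model) is a Transformer, so running the test set through the fitted pipeline is one uniform chain of transform() calls.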

This introduces both speed and efficiency. This methodology provides a flexible way to transform text columns and map them to feature vectors, outputting a final DataFrame of these mapped feature vectors.


Our model:

We have a DataFrame consisting of tweet_text (cleaned) and sentiment that has been run through our sentiment algorithm. The pipeline transforms the tweet_text column into feature vectors, predicts the label (sentiment) for each feature vector, and outputs a new DataFrame with the predicted labels appended as a column (prediction).
