

MADlib Analytics Library Contributions

Babak Alipour, Aditya Nain, Giang Nguyen

CISE Department, University of Florida

Background

• The philosophy behind MADlib is to avoid moving data between multiple runtimes while still offering advanced machine learning capabilities.

• MADlib exposes its algorithms through SQL, which makes it straightforward to adopt on the millions of systems that already run a SQL database.

• MADlib is open source and has an active developer community.

• MADlib is designed to run on massively parallel processing (MPP) databases, namely Greenplum and HAWQ, to exploit data parallelism.

• Many popular machine learning algorithms are supported, implemented with C++ for per-record processing, Python for driver functions, and PL/pgSQL.

• The module anatomy allows flexible algorithm implementations and tuning for different database backend engines.

• We focused on contributing new modules.

K-Nearest Neighbors

k-nearest neighbors (kNN) is a popular algorithm in machine learning and data analytics. As requested in the corresponding JIRA issue, our implementation uses a linear search to find the k nearest neighbors. The algorithm was first developed using a combination of Python, C++, and SQL, but all parts were later moved into SQL to simplify debugging and to let the MPP engine partition the data efficiently. A design goal of this project was a generalizable interface, so that users can plug in different tables, distance functions, and weighting functions, which makes the module flexible and applicable to a wider range of scenarios. The JIRA issue does not specify the interface or its details, so we settled on some assumptions through discussion and implemented the module based on them.

• The input is two tables: a training set and a test set.

• For every row i in the test set, its k nearest neighbors in the training set are returned in a results table.

• A distance function must be provided as input, with the signature DOUBLE[] x DOUBLE[] -> DOUBLE.

• Input validation is performed, and appropriate error messages are produced for invalid input parameters.

• The distance functions supported by default are Manhattan, Euclidean, and Minkowski (with an arbitrary choice of p), all implemented purely in PL/pgSQL.

• The longer-term vision is to add support for data structures such as KD-trees or ball trees to improve performance, and to support more advanced kNN queries such as kNN-join.
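The linear-search approach with a pluggable distance function can be illustrated with a minimal Python sketch. This is not the PL/pgSQL code contributed to MADlib: the names `minkowski`, `manhattan`, `euclidean`, and `knn_linear` are hypothetical, and plain dictionaries stand in for the training and test tables.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
# Mirrors the DOUBLE[] x DOUBLE[] -> DOUBLE signature described on the poster.
Distance = Callable[[Vector, Vector], float]


def minkowski(p: float) -> Distance:
    """Return a Minkowski distance function of order p (assumes p >= 1)."""
    if p < 1:
        raise ValueError("p must be >= 1")

    def dist(a: Vector, b: Vector) -> float:
        if len(a) != len(b):
            raise ValueError("vectors must have the same length")
        return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

    return dist


manhattan = minkowski(1)  # Minkowski distance with p = 1
euclidean = minkowski(2)  # Minkowski distance with p = 2


def knn_linear(
    train: Dict[int, Vector],
    test: Dict[int, Vector],
    k: int,
    distance: Distance = euclidean,
) -> Dict[int, List[Tuple[int, float]]]:
    """For each test row, scan every training row and keep the k closest.

    Hypothetical stand-in for the table interface: train and test map a
    row id to a feature vector. Returns, per test id, a list of
    (train_id, distance) pairs ordered from nearest to farthest.
    """
    if k < 1 or k > len(train):
        raise ValueError("k must be between 1 and the training set size")
    result = {}
    for test_id, point in test.items():
        # Linear search: compute the distance to every training row.
        scored = ((distance(point, vec), train_id) for train_id, vec in train.items())
        nearest = heapq.nsmallest(k, scored)
        result[test_id] = [(train_id, d) for d, train_id in nearest]
    return result
```

A different metric is selected by passing another function, for example `knn_linear(train, test, k=2, distance=manhattan)`. Any callable with the same two-vector signature can be plugged in the same way.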

Figure: Merge step in an MPP database.

Figure: An example of a blog enhanced by NLP named-entity extraction.

Gaussian Mixture Models

CONCLUSIONS

• We explored the anatomy of MADlib modules and produced several introductory documents on getting started with MADlib, installing it on different platforms, and applying it in some scenarios.

• Per JIRA MADLIB-927, we implemented the k-nearest neighbors algorithm, which is now being tested.

• Per JIRA MADLIB-410, we integrated an implementation of Gaussian Mixture Models into MADlib, which is now being tested.

• We explored an application that uses the advanced features of a MADlib-enabled database backend.

REFERENCES

1. J. M. Hellerstein, F. Schoppmann, D. Z. Wang, E. Fratkin, and C. Welton, “The MADlib Analytics Library or MAD Skills, the SQL,” Proc. VLDB Endow., pp. 1700–1711, 2012.

2. MADlib methods, retrieved from: http://pivotal.io/madlib

Supported by Dr. Daisy Zhe Wang as part of the Projects in Data Science course.

Web Application with MADlib-enabled Database Backend

Motivation

• Use MADlib NLP to improve web application features such as information extraction, summarization, recommendation, and sentiment analysis.

• Django is an efficient Python web application framework that supports PostgreSQL.

• A Django model represents the application's data: each model is automatically mapped to a single database table, and each attribute of the model represents a database field.

• SQL queries can be executed directly from Django with connection.cursor().
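The raw-SQL pattern mentioned in the last point can be sketched as follows. To keep the example runnable without a Django project, an in-memory SQLite database stands in for PostgreSQL; in Django the cursor would instead come from `django.db.connection.cursor()`. The table name `blog_post`, its columns, and the helper `fetch_posts` are assumptions made for illustration.

```python
import sqlite3


def fetch_posts(cursor):
    """Run raw SQL through a DB-API cursor and return (id, body) rows."""
    cursor.execute("SELECT id, body FROM blog_post ORDER BY id")
    return cursor.fetchall()


# Stand-in database so the sketch runs anywhere. In a Django project the
# cursor would be obtained from django.db.connection.cursor() instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blog_post (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO blog_post (id, body) VALUES (?, ?)",
    [
        (1, "Gainesville is in Florida."),
        (2, "MADlib runs inside the database."),
    ],
)

posts = fetch_posts(conn.cursor())
```

Because `fetch_posts` only relies on `execute` and `fetchall`, the same function should work with any DB-API style cursor, which is the kind of object Django hands back.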

Idea behind our web blog application

• Each user's post (text) is stored in a table in a PostgreSQL database.

• We can call the MADlib NLP (CRF) function on this table to extract noun entities.

• For each noun entity, check whether it has a Wikipedia page and, if so, turn it into a Wikipedia link.

• This check is easily done by sending an HTTP request and testing whether the status code is 200.
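The Wikipedia check described above could look roughly like the sketch below. The URL pattern, the use of a HEAD request, and the names `wiki_url`, `http_status`, and `linkify` are assumptions, not details taken from the poster. The status lookup is passed in as a parameter so the logic can be tried without network access.

```python
import html
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable

# Assumed URL pattern for English Wikipedia articles.
WIKI_BASE = "https://en.wikipedia.org/wiki/"


def wiki_url(entity: str) -> str:
    """Build the candidate article URL for an entity name."""
    return WIKI_BASE + urllib.parse.quote(entity.strip().replace(" ", "_"))


def http_status(url: str) -> int:
    """Return the HTTP status code for url (performs a real network request)."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "wiki-link-sketch"}
    )
    try:
        with urllib.request.urlopen(request, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Non-2xx answers such as 404 arrive as exceptions carrying the code.
        return err.code


def linkify(entity: str, get_status: Callable[[str], int] = http_status) -> str:
    """Wrap entity in a link if its candidate Wikipedia URL answers 200."""
    url = wiki_url(entity)
    if get_status(url) == 200:
        return '<a href="{}">{}</a>'.format(url, html.escape(entity))
    return html.escape(entity)
```

For example, `linkify("University of Florida", get_status=lambda url: 200)` yields an anchor tag pointing at the `University_of_Florida` article URL, while a lookup that returns 404 leaves the entity as plain text. In a live application, caching the status per entity would avoid repeating the request for every post.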