
Large scale IP filtering using Apache Pig and case study

Kaushik Chandrasekaran, Nabeel Akheel


What is Apache Pig?

Platform for analyzing large data sets.

Merging data sets, filtering them, and applying functions to records or groups of records

Allows you to create user-defined functions (UDFs)


Pig Infrastructure

Consists mainly of two layers:

Infrastructure layer: a compiler that produces sequences of Map-Reduce programs

Language layer: currently a textual language called Pig Latin


Nested Data Model

Pig Latin has a fully nestable data model with atomic values, tuples, bags (lists), and maps

More natural to programmers than flat tuples

Avoids expensive joins

Example (from figure): Computers: Desktops, Laptops, Netbooks
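As a rough illustration of the nested model, written in Python rather than Pig Latin: the sketch below collapses flat (category, item) tuples into a single tuple holding a bag. The category names come from the slide; the `nest` helper and the flat input layout are assumptions made for this example.

```python
# Flat representation: one tuple per (category, item) pair.
# Only the names come from the slide; the layout is assumed.
flat_rows = [
    ("Computers", "Desktops"),
    ("Computers", "Laptops"),
    ("Computers", "Netbooks"),
]


def nest(rows):
    """Collapse flat (key, value) tuples into one (key, bag) tuple per key."""
    bags = {}
    for key, value in rows:
        bags.setdefault(key, []).append(value)
    return [(key, bag) for key, bag in bags.items()]


nested = nest(flat_rows)
print(nested)
```

With the nested form, all items of a category travel together in one record, so a later step can read them without joining back to an items table.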


Pig Latin vs. SQL

SQL:

• Little control over execution method
• Query optimization is hard
• Parallel environment
• Little or no statistics
• Lots of UDFs

Pig Latin:

• Ease of programming
• Optimization opportunities
• Extensibility


JOIN vs. COGROUP
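The slide's figure did not survive extraction. As a hedged stand-in, the Python sketch below mimics the general distinction: COGROUP groups each input by key and keeps one bag per input, while an inner JOIN behaves like a COGROUP whose bags are then flattened into their cross product. The data sets and the `cogroup` and `join` helpers are hypothetical, not Pig's implementation.

```python
from collections import defaultdict
from itertools import product

# Hypothetical inputs: (url, user) visits and (url, category) page info.
visits = [("a.com", "amy"), ("a.com", "bob"), ("b.com", "amy")]
url_info = [("a.com", "news"), ("b.com", "sports"), ("c.com", "news")]


def cogroup(left, right):
    """Group both inputs by their first field, keeping one bag per input."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[0]][0].append(row)
    for row in right:
        groups[row[0]][1].append(row)
    return dict(groups)


def join(left, right):
    """Inner equi-join: flatten each cogroup into the cross product of its bags."""
    out = []
    for left_bag, right_bag in cogroup(left, right).values():
        for l_row, r_row in product(left_bag, right_bag):
            out.append(l_row + r_row)
    return out


grouped = cogroup(visits, url_info)
joined = join(visits, url_info)
```

Here `grouped["c.com"]` keeps an empty bag on the visits side, whereas `joined` has no row for `c.com` at all: the cogroup preserves the nesting, the join discards it.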


Using Pig on cloud

Pig Latin programs run in a distributed fashion on a cluster

Programs are compiled into Map/Reduce jobs and executed using Hadoop

Pig Latin programs can also run in "local mode" without a cluster


Data Flow

Load Visits

Group by url

Foreach url generate count

Load Url Info

Join on url

Group by category

Foreach category generate top10 urls
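The steps above can be traced on toy data. This is a single-machine Python sketch, not Pig Latin; the field layouts of the Visits and Url Info inputs and the `top_urls_by_category` helper are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy inputs standing in for the Visits and Url Info data sets.
visits = [("amy", "a.com"), ("bob", "a.com"), ("amy", "b.com"), ("cat", "c.com")]
url_info = [("a.com", "news"), ("b.com", "news"), ("c.com", "sports")]


def top_urls_by_category(visits, url_info, n=10):
    # Load Visits -> Group by url -> Foreach url generate count
    counts = Counter(url for _user, url in visits)
    # Load Url Info -> Join on url (inner join: unvisited urls drop out)
    joined = [(url, cat, counts[url]) for url, cat in url_info if url in counts]
    # Group by category
    by_category = defaultdict(list)
    for url, cat, count in joined:
        by_category[cat].append((url, count))
    # Foreach category generate the top-n urls by visit count
    return {
        cat: sorted(rows, key=lambda r: r[1], reverse=True)[:n]
        for cat, rows in by_category.items()
    }


result = top_urls_by_category(visits, url_info)
print(result)
```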


Map-Reduce on Data Flow

Load Visits

Group by url

Foreach url generate count

Load Url Info

Join on url

Group by category

Foreach category generate top10 urls

Stages (from figure): Map1, Reduce1, Map2, Reduce2, Map3, Reduce3

Every group or join operation forms a map-reduce boundary
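A small Python sketch of that rule, assuming the plan is written as a plain list of step names (the `PLAN` list and `count_boundaries` helper are hypothetical): counting the group and join steps gives the number of map-reduce boundaries, which for this data flow is three, consistent with the three Map/Reduce pairs on the slide.

```python
# The seven steps from the slide, in order.
PLAN = [
    "load visits",
    "group by url",
    "foreach url generate count",
    "load url info",
    "join on url",
    "group by category",
    "foreach category generate top10 urls",
]


def count_boundaries(plan):
    """Count the steps that force a shuffle: every group or join."""
    return sum(1 for step in plan if step.split()[0] in ("group", "join"))


boundaries = count_boundaries(PLAN)
print(boundaries)
```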


Implementation

[Figure: implementation stack with the labels user, SQL, Pig, Hadoop Map-Reduce, and cluster; annotation: automatic rewrite + optimize]


IP Filtering

Internet companies are swimming in data

Huge volumes of data must be analyzed to filter out bot IPs

A high-level language in a cloud environment would be useful for filtering out these IPs efficiently
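The slides do not say how a bot IP is recognized. As a minimal sketch only, the Python below assumes a bot is any IP with more than `max_requests` log entries and that the client IP is the first whitespace-separated field of each log line. Both the criterion and the log format are assumptions, and the function names are hypothetical.

```python
from collections import Counter


def find_bot_ips(log_lines, max_requests):
    """Return the set of IPs with more than max_requests log entries."""
    # Assumption: the client IP is the first field of each non-empty line.
    counts = Counter(line.split()[0] for line in log_lines if line.strip())
    return {ip for ip, n in counts.items() if n > max_requests}


def filter_bots(log_lines, max_requests):
    """Drop every log line whose IP was flagged by find_bot_ips."""
    bots = find_bot_ips(log_lines, max_requests)
    return [
        line for line in log_lines
        if line.strip() and line.split()[0] not in bots
    ]


# Hypothetical log lines using documentation-reserved addresses.
log = [
    "192.0.2.1 GET /a",
    "192.0.2.1 GET /b",
    "192.0.2.1 GET /c",
    "198.51.100.7 GET /a",
]
clean = filter_bots(log, max_requests=2)
print(clean)
```

The count-then-filter shape is a group followed by a filter, so it maps naturally onto a high-level data-flow language of the kind the slide argues for.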


Objective

Understand how Pig Latin works

Implement an IP Address filter using Apache Pig

Implement a similar IP filter using Hadoop alone

Compare and analyze the two implementations

Conduct a case study of the pros and cons of other high-level languages compared with Pig


Timeline

Milestone | Schedule

Understand how Pig Latin works; read through the tutorial | 11/07/2011

Implement IP filter using Apache Pig and perform analysis to figure out best scenarios for specific optimizations | 11/14/2011

Implement IP filter using Hadoop alone and compare it to the Pig implementation | 11/28/2011

Conduct a case study on the pros and cons of high-level languages | 12/05/2011

Final report | 12/12/2011




Thank you