Post on 03-Jan-2016
Large scale IP filtering using Apache Pig and case study
Kaushik Chandrasekaran, Nabeel Akheel
What is Apache Pig?
Platform for analyzing large data sets.
Merging data sets, filtering them, and applying functions to records or groups of records
Allows you to create user-defined functions (UDFs)
Pig Infrastructure
Consists mainly of two layers:
Infrastructure layer: a compiler that produces sequences of Map-Reduce programs
Language layer: currently a textual language called Pig Latin
Nested Data Model
Pig Latin has a fully-nestable data model with:
◦ Atomic values, tuples, bags (lists), and maps
More natural to programmers than flat tuples
Avoids expensive joins
Example: Computers: Desktops, Laptops, Netbooks
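A minimal sketch of why a nested model can avoid a join, using plain Python types as stand-ins for Pig's tuples and bags. The category ids and the helper names `join_flat` and `subcategories_of` are hypothetical and exist only for illustration.

```python
# Flat model: two relations linked by a category id, so reading the
# subcategories of "Computers" requires a join on that id.
categories = [(1, "Computers")]
subcategories = [(1, "Desktops"), (1, "Laptops"), (1, "Netbooks")]


def join_flat(categories, subcategories):
    """Join the two flat relations on the category id."""
    return [
        (name, sub)
        for cat_id, name in categories
        for sub_cat_id, sub in subcategories
        if cat_id == sub_cat_id
    ]


# Nested model: one tuple whose second field is a bag (modeled as a list),
# so the subcategories travel with the category and no join is needed.
nested = ("Computers", ["Desktops", "Laptops", "Netbooks"])


def subcategories_of(record):
    """Read the nested bag directly from the record."""
    return list(record[1])
```

Both representations carry the same information; the nested one simply keeps related values in a single record.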
Pig Latin vs. SQL
SQL
• Little control over execution method
• Query optimization is hard
• Parallel environment
• Little or no statistics
• Lots of UDFs
Pig Latin
• Ease of programming
• Optimization opportunities
• Extensibility
JOIN vs. COGROUP
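The difference can be sketched in Python, which is used here only as a stand-in for Pig Latin. COGROUP collects the tuples of each input into one bag per key, while an inner JOIN is equivalent to a COGROUP followed by flattening the cross product of the two bags. The sample relations and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def cogroup(left, right, key=lambda t: t[0]):
    """Return {key: (bag_from_left, bag_from_right)}, keeping the bags nested."""
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[key(t)][0].append(t)
    for t in right:
        groups[key(t)][1].append(t)
    return dict(sorted(groups.items()))


def join(left, right, key=lambda t: t[0]):
    """Inner join: flatten the cross product of the two bags for each key."""
    out = []
    for _k, (left_bag, right_bag) in cogroup(left, right, key).items():
        out.extend(l + r for l in left_bag for r in right_bag)
    return out


# Hypothetical sample relations keyed on url.
visits = [("cnn.com", "Amy"), ("cnn.com", "Fred"), ("bbc.com", "Amy")]
url_info = [("cnn.com", "News"), ("espn.com", "Sports")]
```

Keys that appear in only one input survive a COGROUP with an empty bag on the other side, but they produce no rows in the inner join.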
Using Pig on cloud
Pig Latin programs run in a distributed fashion on a cluster
Programs are compiled into Map/Reduce jobs and executed using Hadoop
Pig Latin programs can also run in "local mode" without a cluster
Data Flow
Load Visits
Group by url
Foreach url, generate count
Load Url Info
Join on url
Group by category
Foreach category, generate top 10 urls
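The data flow above can be sketched on one machine in Python, with each step marked in a comment. The record layouts (`(user, url)` for visits and `(url, category)` for url info) and the function name are assumptions; the slides do not specify the schemas.

```python
from collections import Counter, defaultdict


def top_urls_per_category(visits, url_info, n=10):
    """Return {category: [(url, visit_count), ...]} with at most n urls each."""
    # Load Visits, group by url, foreach url generate count
    counts = Counter(url for _user, url in visits)

    # Load Url Info, join on url (inner join: unvisited urls are dropped)
    joined = [
        (url, category, counts[url])
        for url, category in url_info
        if url in counts
    ]

    # Group by category
    by_category = defaultdict(list)
    for url, category, count in joined:
        by_category[category].append((url, count))

    # Foreach category, generate top n urls (ties broken by url name)
    return {
        category: sorted(rows, key=lambda r: (-r[1], r[0]))[:n]
        for category, rows in by_category.items()
    }
```

In Pig the same steps would be written as a sequence of Pig Latin statements and run on a cluster; this sketch only mirrors the logical flow.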
Map-Reduce on Data Flow
Load Visits
Group by url
Foreach url, generate count
Load Url Info
Join on url
Group by category
Foreach category, generate top 10 urls
Stages: Map1, Reduce1, Map2, Reduce2, Map3, Reduce3
Every group or join operation forms a map-reduce boundary
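To see why a group forms a boundary, here is a toy single-process model of one map-reduce job in Python. The map phase emits key/value pairs, the shuffle collects all values for a key in one place, and only then can the reduce phase run. The driver `run_map_reduce` and the `(user, url)` record layout are hypothetical; this is not Hadoop's API.

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    """Run one map phase, one shuffle (group by key), and one reduce phase."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)  # shuffle: all values for a key meet here
    output = []
    for key in sorted(shuffled):
        output.extend(reducer(key, shuffled[key]))
    return output


def map_visit(visit):
    """Map side of 'group by url': emit the grouping key with a count of 1."""
    _user, url = visit
    yield url, 1


def reduce_count(url, ones):
    """Reduce side of 'foreach url, generate count'."""
    yield url, sum(ones)
```

A later group or join on a different key needs its own shuffle, which is why it starts a new map-reduce job.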
Implementation
Figure labels: user, SQL, Pig (automatic rewrite + optimize), Hadoop Map-Reduce, cluster
IP Filtering
Internet companies are swimming in data
Analysis of huge data sets is needed to filter out bot IPs
A high-level language in a cloud environment would be useful to filter out these IPs efficiently
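As a rough illustration of the kind of filter meant here, the Python sketch below flags an IP as a bot when its request count exceeds a threshold, then drops its lines from the log. The slides do not state the detection criterion, so the threshold rule, the log format (IP address as the first whitespace-separated field), and both function names are assumptions.

```python
from collections import Counter


def find_bot_ips(log_lines, max_requests=100):
    """Return the set of IPs with more than max_requests requests (assumed rule)."""
    counts = Counter(line.split()[0] for line in log_lines if line.strip())
    return {ip for ip, n in counts.items() if n > max_requests}


def filter_log(log_lines, bot_ips):
    """Keep only the non-empty log lines whose IP is not in bot_ips."""
    return [
        line
        for line in log_lines
        if line.strip() and line.split()[0] not in bot_ips
    ]
```

The counting step is a group by IP followed by a count, and the filtering step is a join against the flagged set, which are the operations the earlier slides describe for Pig.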
Objective
Understand how Pig Latin works
Implement an IP Address filter using Apache Pig
Implement a similar IP filter using purely Hadoop
Comparison & Analysis of the two implementations
Conduct a case study of the pros and cons of other high-level languages compared with Pig
Timeline
Milestone | Schedule
Understand how Pig Latin works; read through the tutorial | 11/07/2011
Implement IP filter using Apache Pig and perform analysis to figure out best scenarios for specific optimizations | 11/14/2011
Implement IP filter using purely Hadoop and compare it to the Pig implementation | 11/28/2011
Conduct a case study on the pros and cons of high-level languages | 12/05/2011
Final report | 12/12/2011
References
A. Pig Latin: A not-so-foreign language for data processing. In Proceedings of the ACM SIGMOD 2008 International Conference on Management of Data (Vancouver, Canada, June 2008).
https://cwiki.apache.org/confluence/display/PIG/Index
http://pig.apache.org/
Thank you