mining big data

Mining Big Data for Ranking System Mentor :- Dr. Shirshu Verma Associate Professor Team :- Aman Kr. Raj (IIT2011041) Khushal Gautam (IIT2011054) Neelesh Kr. Nirmal (IIT2011033) Nikhil Passey (IIT2011159 ) Shivam Chaudhary (IIT2011047) Sudheer Singh (IIT2011064) Vishal Chaudhary (IIT2011042)

Upload: khushal-gautam

Post on 02-Jun-2018


Page 1: Mining Big Data

8/11/2019 Mining Big Data

http://slidepdf.com/reader/full/mining-big-data 1/24

Mining Big Data for Ranking System

Mentor :- Dr. Shirshu Verma, Associate Professor

Team :-
Aman Kr. Raj (IIT2011041)
Khushal Gautam (IIT2011054)
Neelesh Kr. Nirmal (IIT2011033)
Nikhil Passey (IIT2011159)
Shivam Chaudhary (IIT2011047)
Sudheer Singh (IIT2011064)
Vishal Chaudhary (IIT2011042)

Introduction

Ranking is required whenever items must be compared by relevance.

The complexity of modern analytics needs is outstripping the available computing power of legacy systems.

Introduction (contd.)

In legacy environments, traditional tools and batch processes can take hours, days, or even weeks, in a world where businesses require access to data in minutes, seconds, or even sub-seconds.

For example, to rank the users of Stack Overflow by their level of expertise in various fields, we need to analyze a huge amount of data.

Introduction (contd.)

Hadoop is a good choice for this kind of problem.

Hadoop is used not only for handling historically accumulated big data, but also for meeting the high-performance needs of an application.

What is Hadoop?

Hadoop is an open-source Apache project.

The Hadoop framework is written in Java. It is scalable and can therefore support applications with high performance demands. Hadoop makes it possible to store very large amounts of data across the file systems of multiple computers. It can scale from a single node to thousands of nodes, with each node contributing its own local storage, CPU, memory, and processing power.

Hadoop (contd…) 

Problem definition

This project addresses the problem of ranking users for every topic on stackoverflow.com. Stack Overflow is an openly accessible website where anybody can ask or answer questions, either with an assigned user ID and password or as a guest user. Questions and answers are categorized by the area to which they are most relevant. Hashtags (tags) are assigned to every question and answer.

Problem definition (contd.)

A hashtag indicates the relevance of a question or answer to a particular area. Our main objective is to assign a level of expertise to each user in every field, based on the responses to the user's questions and on the positive and negative responses that other users give to the user's answers.
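The slides do not give a scoring formula, so the following is only a minimal sketch of how per-topic expertise could be derived from positive and negative responses. It assumes, purely for illustration, that a user's score for a tag is the sum of upvotes minus downvotes over that user's answers; the names `expertise_scores` and `rank_users` and the sample data are hypothetical.

```python
from collections import defaultdict


def expertise_scores(answers):
    """Sum (upvotes - downvotes) per (user, tag) pair.

    `answers` is an iterable of (user_id, tags, upvotes, downvotes).
    The net-vote formula is an assumption, not the project's method.
    """
    scores = defaultdict(int)
    for user_id, tags, upvotes, downvotes in answers:
        for tag in tags:
            scores[(user_id, tag)] += upvotes - downvotes
    return dict(scores)


def rank_users(scores, tag):
    """Return (user, score) pairs for one tag, best score first."""
    ranked = [(user, s) for (user, t), s in scores.items() if t == tag]
    # Sort by descending score; break ties by user name for stable output.
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical sample data.
sample = [
    ("alice", ["java", "hadoop"], 10, 1),
    ("bob", ["java"], 4, 0),
    ("alice", ["java"], 2, 3),
]
scores = expertise_scores(sample)
print(rank_users(scores, "java"))
```

Keeping the score keyed by (user, tag) means the same aggregation yields a separate ranking for every topic.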

Problem definition (contd.)

We consider all available features of valid users of the website, such as positive/negative responses, location preference, and whether the user is a student or a professional. We use the Stack Overflow data dump, which is a tremendous amount of data (20 GB).
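As a rough illustration of reading such a dump, the sketch below parses one record at a time. It assumes the dump stores each post as a single-line `<row ... />` XML element with attributes such as `Id`, `OwnerUserId`, `Score`, and `Tags`, and that tags are written as `<java><hadoop>`; these layout details, the sample line, and the helper name `parse_post_row` are assumptions rather than something the slides state.

```python
import re
import xml.etree.ElementTree as ET


def parse_post_row(line):
    """Parse one `<row ... />` line into a small dict, or return None.

    Attribute names and the `<tag1><tag2>` tag encoding are assumed.
    """
    line = line.strip()
    if not line.startswith("<row"):
        return None  # skip the XML header and the enclosing element
    attrs = ET.fromstring(line).attrib
    return {
        "id": int(attrs["Id"]),
        "owner": attrs.get("OwnerUserId"),
        "score": int(attrs.get("Score", "0")),
        "tags": re.findall(r"<([^<>]+)>", attrs.get("Tags", "")),
    }


# Hypothetical sample record.
sample_line = (
    '<row Id="7" PostTypeId="1" OwnerUserId="42" Score="5" '
    'Tags="&lt;java&gt;&lt;hadoop&gt;" />'
)
print(parse_post_row(sample_line))
```

Parsing line by line avoids loading the whole file into memory, which matters at this data size.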

Problem definition (contd.)

We will analyze all the data for every user and develop a recommendation system based on the users' levels of expertise in various areas. Processing this volume of data with a conventional heuristic approach would take a huge amount of time: months or even years. So we need a specialized technique.

Problem definition (contd.)

To achieve this, we bring "computation to the data instead of data to the computation", because bringing the data to the computation requires extra I/O operations to load this huge volume of data into memory, and if memory is limited, we need additional memory and resources to process it.
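The idea of moving computation to the data can be sketched with a toy simulation: each partition is summarized where it is stored, and only the small per-partition summaries are combined. The partitions, the vote data, and the function names `local_aggregate` and `merge` are hypothetical, and this plain-Python version only imitates what a cluster would do across machines.

```python
from collections import Counter


def local_aggregate(partition):
    """Summarize one partition; conceptually runs on the node storing it."""
    partial = Counter()
    for user_id, votes in partition:
        partial[user_id] += votes
    return partial


def merge(partials):
    """Combine the small per-partition summaries into global totals."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts
    return total


# Hypothetical data: each inner list stands for records held by one node.
partitions = [
    [("alice", 3), ("bob", 1), ("alice", 2)],
    [("bob", 4), ("carol", 5)],
]
partials = [local_aggregate(p) for p in partitions]
totals = merge(partials)
print(dict(totals))
```

Only one entry per user leaves each partition, so the data crossing the network shrinks from the raw records to the summaries.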

WHY HADOOP?

You can't have a conversation about Big Data for very long without talking about Hadoop. But what exactly is Hadoop, and what makes it so special? Basically, it's a way of storing enormous data sets across distributed clusters of servers and then running "distributed" analysis applications in each cluster. It's designed to be robust, in that your Big Data applications will continue to run even when individual servers (or clusters) fail. And it's also designed to be efficient, because it doesn't require your applications to shuttle huge volumes of data across your network.

Hadoop vs. RDBMS

RDBMS and Hadoop are different approaches to storing, processing, and retrieving information. DBMS and RDBMS have been in the literature for a long time, whereas Hadoop is a comparatively new concept. As storage capacities and customer data sizes have increased enormously, processing this information within a reasonable amount of time has become crucial.

Hadoop vs. RDBMS (contd.)

Especially in data warehousing applications, business intelligence reporting, and various kinds of analytical processing, it becomes very challenging to perform complex reporting within a reasonable amount of time as the size of the data grows exponentially, along with customers' growing demands for complex analysis and reporting.

Hadoop vs. RDBMS (contd.)

The Hadoop framework works very well with both structured and unstructured data. It also supports a variety of data formats, such as XML, JSON, and text-based flat files. An RDBMS, however, works well only when the entity-relationship (ER) model is well defined; otherwise the database schema or structure can grow unmanageable. In other words, an RDBMS works well with structured data.

Hadoop vs. RDBMS (contd.)

Hadoop is a good choice in environments that need big data processing where the data being processed does not have consistent relationships. When the data is too big for complex processing, or the relationships between the data are not easy to define, it becomes difficult to save the extracted information in an RDBMS with coherent relationships.

POINTS TO CONSIDER:

An RDBMS is a relational database management system. Hadoop is a node-based flat structure.

An RDBMS is generally used for OLTP processing, whereas Hadoop is currently used for analytical processing, especially big data processing.

Points (contd.)

Maintenance on storage or data files typically requires downtime in an RDBMS.

In standalone database systems, adding processing power such as more CPU or physical memory in a non-virtualized environment requires downtime for an RDBMS such as DB2, Oracle, or SQL Server. Hadoop systems, however, consist of individual independent nodes that can be added on an as-needed basis.

Points (contd.)

In RDBMS systems, the database cluster uses the same data files stored in shared storage, whereas in Hadoop the data can be stored independently on each processing node.

Performance tuning of an RDBMS can become a nightmare, even in a proven environment. Hadoop, however, enables hot tuning by adding extra nodes, which are self-managed.

Overview of Hadoop… 

Implementations of Hadoop..

HDFS

Hadoop, including HDFS, is well suited for distributed storage and distributed processing using commodity hardware. It is fault tolerant, scalable, and extremely simple to expand. MapReduce, well known for its simplicity and applicability to a large set of distributed applications, is an integral part of Hadoop.

HDFS is highly configurable, with a default configuration well suited for many installations. Most of the time, configuration needs to be tuned only for very large clusters.
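To make the MapReduce model mentioned above concrete, here is a small single-process sketch of its three stages: map, shuffle (group by key), and reduce. It only imitates the programming model in plain Python and does not use Hadoop itself; the function names and the sample posts are hypothetical, and counting posts per tag is just an illustrative job.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and yield its (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def tag_mapper(post):
    """Emit (tag, 1) for every tag on a post."""
    for tag in post["tags"]:
        yield tag, 1


def count_reducer(key, values):
    """Add up the ones emitted for a tag."""
    return sum(values)


# Hypothetical sample posts.
posts = [{"tags": ["java", "hadoop"]}, {"tags": ["java"]}, {"tags": ["python"]}]
counts = reduce_phase(shuffle(map_phase(posts, tag_mapper)), count_reducer)
print(counts)
```

Because the mapper and reducer are the only job-specific parts, swapping them changes the job without touching the surrounding stages.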

HDFS (contd.)

Hadoop is written in Java and is supported on all major platforms.

Hadoop supports shell-like commands to interact with HDFS directly.

The NameNode and DataNodes have built-in web servers that make it easy to check the current status of the cluster.
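The shell-like commands mentioned above are exposed through the `hdfs dfs` file system shell, with subcommands such as `-put` and `-ls`. The sketch below builds two such command lines from Python and runs them only if an `hdfs` binary is found; the file name `Posts.xml`, the directory `/data/stackoverflow`, and the helper names are hypothetical.

```python
import shutil
import subprocess


def hdfs_command(*args):
    """Build an argument list for the `hdfs dfs` file system shell."""
    return ["hdfs", "dfs", *args]


def run_if_available(argv):
    """Run the command only when its executable is on PATH; else return None."""
    if shutil.which(argv[0]) is None:
        return None
    return subprocess.run(argv, capture_output=True, text=True, check=False)


# Hypothetical paths: copy a local file into HDFS, then list the directory.
upload = hdfs_command("-put", "Posts.xml", "/data/stackoverflow/")
listing = hdfs_command("-ls", "/data/stackoverflow")
print(" ".join(upload))
print(" ".join(listing))
```

On a machine with Hadoop installed, `run_if_available(listing)` would execute the listing and return the completed process; elsewhere it returns `None` without raising.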
