mining big data

8/11/2019 Mining Big Data

http://slidepdf.com/reader/full/mining-big-data 1/24

Mining Big Datafor RankingSystem Mentor :-

Dr. Shirshu VermaAssociate Professor

Team :-Aman Kr. Raj (IIT2011041)Khushal Gautam (IIT2011054)

Neelesh Kr. Nirmal (IIT2011033)Nikhil Passey (IIT2011159)Shivam Chaudhary (IIT2011047)Sudheer Singh (IIT2011064)Vishal Chaudhary (IIT2011042)



Introduction

Ranking is required whenever there isrequirement for comparing relevancy.

The complexity of modern analytics needsis outstripping the available computingpower of legacy systems.



Introduction (continue…)

In legacy environments, traditional toolsand batch processes can take hours,

days, or even weeks, in a world wherebusinesses require access to data inminutes, or seconds – or even sub-seconds.

For example, to rank the user’s ofstackoverflow on the basis of level ofexpertise in various fields , we need toanalyze huge amount of data.



Introduction (continue…)

Hadoop is a great choice for these kind ofproblems.

Hadoop is used not only for handling thehistorically grown BIG data, but also usedfor meeting high performance needs for

an application.



what is hadoop?? Hadoop is an open source Apache project.

Hadoop framework was written in Java. It isscalable and therefore can support highperformance demanding applications.Storing very large amounts of data on the filesystems of multiple computers are possible inHadoop framework. It is configured to enablescalability from single node or computer tothousands of nodes or independent systems insuch a way that the individual nodes uselocal computer storage, CPU, memory andprocessing power.



Hadoop (contd…)



Problem definition

Ranking the users for every topic onstackoverflow.com is approached in this

project.Stackoverflow is an open-sourcewebsite where any body can ask questionor answer any question with assigned user-id and password or as a guest user.Questions and answers are categorized on

the basis of the area in which they are morerelevant. Hashtags are assigned to everyquestion and answers.



Problem definition..

. Hashtag tells us the relevancy of thequestion or answer in the particular

area. Our main objective is to assignthe level of expertize to the user inevery field on the basis of the responseon the user's questions and positive and

negative responses on the answersgiven by other users.



Problem definition…

We take all the features of the user'sresponse like positive/negative

response, location preference, studentor professional which are valid users ofthe website. We have taken the data-dump of the stackoverflow, which is

trememdous amount of data(20 GB).



Problem definition(contd…)

We will analyze all the data for every userand develop a recommendation system

on the basis of level of expertize of theusers in various areas. To process this hugevolume of data, heuristic approach toanalyze this data will take huge amount

of time, months or even years. So weneed something special technique



Problem definition…

In order to achieve this we will bring―computation to the data instead of data

to the computational method‖, becauseto bring the data for computation weneed extra I/O operations to load thishuge volume data into memory and if the

memory is limited then we need extramemory and resources to process thisdata



WHY HADOOP…??? You can't have a conversation about Big Data for

very long without talking about Hadoop. But whatexactly is Hadoop, and what makes it so special?Basically, it's a way of storing enormous data setsacross distributed clusters of servers and thenrunning "distributed" analysis applications in eachcluster. It's designed to be robust, in that your BigData applications will continue to run even whenindividual servers — or clusters — fail. And it's alsodesigned to be efficient, because it doesn'trequire your applications to shuttle huge volumesof data across your network



Hadoop VS Rdbms..

RDBMS and Hadoop are differentconcepts of storing, processing and

retrieving the information. DBMS andRDBMS are in the literature for a long timewhereas Hadoop is a new conceptcomparatively. As the storage capacitiesand customer data size are increased

enormously, processing this informationwith in a reasonable amount of timebecomes crucial.



Hadoop VS Rdbms(contd…)

Especially when it comes to datawarehousing applications, business

intelligence reporting, and variousanalytical processing, it becomes verychallenging to perform complex reportingwithin a reasonable amount of time as thesize of the data grows exponentially as

well as the growing demands ofcustomers for complex analysis andreporting.




Hadoop framework works very well with

structured and unstructured data. This also

supports variety of data formats in real timesuch as XML, JSON and text based flat file

formats. However, RDBMS only work with

better when an entity relationship model (ER

model) is defined perfectly and therefore, the

database schema or structure can grow andunmanaged otherwise,i.e., an RDBMS works

well with structured data.




Hadoop will be a choice in environmentssuch as when there are needs for BIG

data processing on which the data beingprocessed does not have consistentrelationships. Where the data size is tooBIG for complex processing, or not easyto define the relationships between the

data, then it becomes difficult to save theextracted information in an RDBMS with acoherent relationship



POINTS TO CONSIDER:

RDBMS is relational databasemanagement system. Hadoop is node

based flat structure.

RDMS is generally used for OLTPprocessing whereas Hadoop is currentlyused for analytical and especially for BIG

DATA processing.



Points contd..

Any maintenance on storage, or data files, a

downtime is needed for any available RDBMS.

In standalone database systems, to addprocessing power such as more CPU, physical

memory in non-virtualized environment, a

downtime is needed for RDBMS such as DB2,

Oracle, and SQL Server. However, Hadoop

systems are individual independent nodesthat can be added in an as needed basis.



Points contd..

The database cluster uses the same datafiles stored in shared storage in RDBMS

systems, whereas the storage data canbe stored independently in eachprocessing node.

The performance tuning of an RDBMS cango nightmare. Even in proven

environment. However, Hadoop enableshot tuning by adding extra nodes whichwill be self-managed.



Overview of Hadoop…



Implementations of Hadoop..



HDFS… Hadoop, including HDFS, is well suited for

distributed storage and distributed processing

using commodity hardware. It is fault tolerant,scalable, and extremely simple to expand.MapReduce, well known for its simplicity andapplicability for large set of distributedapplications, is an integral part of Hadoop.

HDFS is highly configurable with a defaultconfiguration well suited for manyinstallations. Most of the time, configurationneeds to be tuned only for very large clusters.



HDFS…

Hadoop is written in Java and issupported on all major platforms.

Hadoop supports shell-like commands tointeract with HDFS directly.

The NameNode and Datanodes havebuilt in web servers that makes it easy to

check current status of the cluster.

mining big data

Documents