mining big data
TRANSCRIPT
8/11/2019 Mining Big Data
http://slidepdf.com/reader/full/mining-big-data 1/24
Mining Big Data for Ranking System
Mentor: Dr. Shirshu Verma, Associate Professor
Team: Aman Kr. Raj (IIT2011041), Khushal Gautam (IIT2011054), Neelesh Kr. Nirmal (IIT2011033), Nikhil Passey (IIT2011159), Shivam Chaudhary (IIT2011047), Sudheer Singh (IIT2011064), Vishal Chaudhary (IIT2011042)
Introduction
Ranking is required whenever relevance has to be compared.
The complexity of modern analytics needs is outstripping the available computing power of legacy systems.
Introduction (continued…)
In legacy environments, traditional tools and batch processes can take hours, days, or even weeks, in a world where businesses require access to data in minutes, seconds, or even sub-seconds.
For example, to rank the users of Stack Overflow by their level of expertise in various fields, we need to analyze a huge amount of data.
Introduction (continued…)
Hadoop is a good choice for these kinds of problems.
Hadoop is used not only for handling historically accumulated big data, but also for meeting the high-performance needs of an application.
What is Hadoop?
Hadoop is an open-source Apache project.
The Hadoop framework is written in Java. It is scalable and can therefore support applications with demanding performance requirements. It makes it possible to store very large amounts of data across the file systems of multiple computers. It is designed to scale from a single node to thousands of nodes, with each node using its own local storage, CPU, memory and processing power.
Hadoop (contd…)
Problem definition
This project approaches the problem of ranking the users for every topic on stackoverflow.com. Stack Overflow is an openly accessible website where anybody can ask or answer a question, either with an assigned user ID and password or as a guest user. Questions and answers are categorized by the area in which they are most relevant. Hashtags are assigned to every question and answer.
Problem definition (contd…)
A hashtag tells us how relevant a question or answer is to a particular area. Our main objective is to assign each user a level of expertise in every field, based on the responses to the user's questions and on the positive and negative responses that other users give to the user's answers.
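The objective above can be illustrated with a minimal sketch. It assumes that expertise is simply the sum of positive minus negative responses per user and hashtag; the slides do not state the actual scoring formula, and the function name `rank_users_by_tag`, the tuple layout and the sample data are hypothetical.

```python
from collections import defaultdict


def rank_users_by_tag(votes):
    """Rank users within each tag.

    votes: iterable of (user_id, tag, up_votes, down_votes) tuples.
    The score (up minus down) is an assumed, illustrative formula.
    """
    scores = defaultdict(int)
    for user_id, tag, up, down in votes:
        scores[(tag, user_id)] += up - down

    ranking = defaultdict(list)
    for (tag, user_id), score in scores.items():
        ranking[tag].append((user_id, score))
    for tag in ranking:
        # Highest score first; ties are broken by user id for a stable order.
        ranking[tag].sort(key=lambda pair: (-pair[1], pair[0]))
    return dict(ranking)


# Hypothetical sample data, not taken from the real data dump.
sample_votes = [
    ("alice", "java", 10, 1),
    ("bob", "java", 4, 0),
    ("alice", "python", 2, 3),
    ("bob", "python", 5, 0),
    ("alice", "java", 3, 0),
]
ranking = rank_users_by_tag(sample_votes)
print(ranking["java"])
```

A real scoring function would probably weight questions and answers differently; the point here is only that the ranking is a per-hashtag aggregation over responses.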
Problem definition (contd…)
We take all the features of a user's responses, such as positive/negative responses, location preference, and whether the user is a student or a professional, for the valid users of the website. We use the Stack Overflow data dump, which is a tremendous amount of data (20 GB).
Problem definition (contd…)
We will analyze all the data for every user and develop a recommendation system based on the users' level of expertise in various areas. Processing this huge volume of data with a heuristic approach would take a huge amount of time: months or even years. So we need a special technique.
Problem definition (contd…)
To achieve this we will bring "computation to the data instead of data to the computation". Bringing the data to the computation requires extra I/O operations to load this huge volume of data into memory, and if memory is limited we need extra memory and resources to process it.
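A rough way to picture "computation to the data" is sketched below. The lists in `partitions` stand in for data blocks stored on different nodes, and `local_aggregate` stands in for the work that would run on the node holding each block; all names and records are hypothetical, and this is a single-process simulation rather than Hadoop code.

```python
from collections import defaultdict


def local_aggregate(partition):
    """Runs where the partition lives and returns a small partial result."""
    partial = defaultdict(int)
    for user_id, tag, score in partition:
        partial[(user_id, tag)] += score
    return dict(partial)


def merge(partials):
    """Combine the partial results; only these small dicts need to move."""
    total = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)


# Each inner list plays the role of one node's local data (made-up records).
partitions = [
    [("alice", "java", 3), ("bob", "java", 1), ("alice", "java", 2)],
    [("alice", "java", 4), ("bob", "python", 5)],
]
partials = [local_aggregate(p) for p in partitions]
totals = merge(partials)
print(totals)
```

The raw records never leave their partition; only one entry per distinct (user, hashtag) pair is sent to the merge step, which is what keeps network and memory costs down as the input grows.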
WHY HADOOP?
You can't have a conversation about Big Data for very long without talking about Hadoop. But what exactly is Hadoop, and what makes it so special? Basically, it's a way of storing enormous data sets across distributed clusters of servers and then running "distributed" analysis applications in each cluster. It's designed to be robust, in that your Big Data applications will continue to run even when individual servers, or clusters, fail. And it's also designed to be efficient, because it doesn't require your applications to shuttle huge volumes of data across your network.
Hadoop vs RDBMS
RDBMS and Hadoop are different approaches to storing, processing and retrieving information. DBMS and RDBMS have been in the literature for a long time, whereas Hadoop is a comparatively new concept. As storage capacities and customer data sizes have increased enormously, processing this information within a reasonable amount of time has become crucial.
Hadoop vs RDBMS (contd…)
Especially in data warehousing applications, business intelligence reporting, and various kinds of analytical processing, it becomes very challenging to perform complex reporting within a reasonable amount of time, because the size of the data grows exponentially along with customers' demands for complex analysis and reporting.
Hadoop vs RDBMS (contd…)
The Hadoop framework works very well with both structured and unstructured data. It also supports a variety of data formats in real time, such as XML, JSON and text-based flat files. An RDBMS, however, works well only when an entity-relationship (ER) model is well defined; otherwise the database schema or structure can grow unmanaged. In other words, an RDBMS works well with structured data.
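As a small illustration of handling several formats without a fixed schema, the sketch below normalizes one XML row, one JSON line and one tab-separated line into the same record. The attribute and field names (`OwnerUserId`, `Score`, `user`, `score`) and the sample lines are assumptions chosen for the example; they are not taken from the slides.

```python
import json
import xml.etree.ElementTree as ET


def parse_xml_row(line):
    """Parse a single self-closing XML row with assumed attribute names."""
    row = ET.fromstring(line)
    return {"user": row.attrib["OwnerUserId"], "score": int(row.attrib["Score"])}


def parse_json_line(line):
    """Parse one JSON object per line with assumed field names."""
    obj = json.loads(line)
    return {"user": str(obj["user"]), "score": int(obj["score"])}


def parse_tsv_line(line):
    """Parse a flat 'user<TAB>score' text line."""
    user, score = line.split("\t")
    return {"user": user, "score": int(score)}


def parse_record(line):
    """Pick a parser from the first character of the line."""
    stripped = line.strip()
    if stripped.startswith("<"):
        return parse_xml_row(stripped)
    if stripped.startswith("{"):
        return parse_json_line(stripped)
    return parse_tsv_line(stripped)


# Hypothetical input lines in three different formats.
lines = [
    '<row Id="1" OwnerUserId="42" Score="7" />',
    '{"user": 42, "score": 3}',
    "42\t-1",
]
records = [parse_record(line) for line in lines]
print(records)
```

The structure is imposed when the data is read rather than when it is stored, which is the practical difference from loading everything into a predefined relational schema first.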
Hadoop vs RDBMS (contd…)
Hadoop is a good choice in environments that need big data processing where the data being processed does not have consistent relationships. When the data is too big for complex processing, or the relationships within the data are not easy to define, it becomes difficult to save the extracted information in an RDBMS with coherent relationships.
POINTS TO CONSIDER:
An RDBMS is a relational database management system. Hadoop is a node-based flat structure.
An RDBMS is generally used for OLTP processing, whereas Hadoop is currently used for analytical processing, and especially for big data processing.
Points contd..
Any maintenance on storage or data files requires downtime for any available RDBMS. In standalone database systems, adding processing power such as more CPU or physical memory in a non-virtualized environment requires downtime for RDBMSs such as DB2, Oracle and SQL Server. Hadoop systems, however, consist of individual independent nodes that can be added on an as-needed basis.
Points contd..
In RDBMS systems, a database cluster uses the same data files stored in shared storage, whereas in Hadoop the data can be stored independently on each processing node.
Performance tuning of an RDBMS can become a nightmare, even in a proven environment. Hadoop, however, enables hot tuning by adding extra nodes, which are self-managed.
Overview of Hadoop…
Implementations of Hadoop..
HDFS…
Hadoop, including HDFS, is well suited for distributed storage and distributed processing using commodity hardware. It is fault tolerant, scalable, and extremely simple to expand. MapReduce, well known for its simplicity and applicability to a large set of distributed applications, is an integral part of Hadoop.
HDFS is highly configurable, with a default configuration well suited for many installations. Most of the time, configuration needs to be tuned only for very large clusters.
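To give a feel for the MapReduce model mentioned above, here is a minimal sketch in the style of Hadoop Streaming, where the mapper and reducer exchange tab-separated key/value lines and the reducer sees its input sorted by key. The framework's shuffle is simulated with `sorted()`, and the task (counting how often each hashtag appears) and the sample questions are made up for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'tag<TAB>1' line for every tag on every input line."""
    for line in lines:
        for tag in line.split():
            yield f"{tag}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; the input must already be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for tag, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{tag}\t{sum(int(count) for _, count in group)}"


# Each string stands for the space-separated tags of one question (made up).
questions = ["java hadoop", "python", "hadoop hdfs java"]
shuffled = sorted(mapper(questions))  # stands in for the shuffle/sort phase
result = list(reducer(shuffled))
print(result)
```

On a cluster the same two functions would read from standard input and write to standard output, with many mapper and reducer instances running in parallel on different nodes.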
HDFS…
Hadoop is written in Java and is supported on all major platforms.
Hadoop supports shell-like commands to interact with HDFS directly.
The NameNode and DataNodes have built-in web servers that make it easy to check the current status of the cluster.