data mining on big data


Upload: swapnil-chaudhari

Post on 12-Apr-2017


Page 1: Data mining on big data

May 3, 2023

DATA MINING ON BIG DATA

Presented by - Swapnil H. Chaudhari

Guided by Prof. B. R. Mandre

DEPARTMENT OF COMPUTER ENGINEERING
SSVPS’s B. S. DEORE COLLEGE OF ENGINEERING, DHULE

2013 - 2014

Page 2: Data mining on big data


OBJECTIVE

- Brief introduction to Big Data
- What is Data Mining
- Rise of Big Data
- Big Data Characteristics: HACE Theorem
- Data Mining Challenges with Big Data
- A Big Data processing framework

Page 3: Data mining on big data


BIG DATA AND DATA MINING

Big Data concerns large-volume, complex, growing data sets with multiple, autonomous sources.

Data Mining is the process of semi-automatically analyzing large databases to find patterns that are:
- valid: hold on new data with some certainty
- useful: it should be possible to act on the item
- understandable: humans should be able to interpret the pattern

Also known as Knowledge Discovery in Databases (KDD).
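The "valid" criterion can be illustrated with a small sketch: a pattern found in the mined data is re-measured on data that was not used to find it. The baskets, the rule "bread implies butter", and the 0.6 threshold below are all hypothetical values chosen for illustration, not figures from the presentation.

```python
def rule_confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    hits = sum(1 for t in with_antecedent if consequent in t)
    return hits / len(with_antecedent)

# Hypothetical shopping baskets: one set used for mining, one set of new data.
mined_data = [{"bread", "butter"}, {"bread", "butter", "milk"}, {"bread", "jam"}, {"milk"}]
new_data = [{"bread", "butter"}, {"bread", "milk"}, {"bread", "butter", "jam"}, {"jam"}]

MIN_CONFIDENCE = 0.6  # assumed threshold for "holds with some certainty"

conf_mined = rule_confidence(mined_data, "bread", "butter")
conf_new = rule_confidence(new_data, "bread", "butter")

# The pattern counts as valid here only if it still holds on the new data.
is_valid = conf_new >= MIN_CONFIDENCE
```

In this toy case the rule keeps the same confidence on the new baskets, so it passes the check; a rule that only fit the mined data would fail it.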

Page 4: Data mining on big data

HOW BIG IS THE BIG DATA?


- What is big today may not be big tomorrow

- Fast-growing Big Data can challenge our current technology in:
  - Volume
  - Communication
  - Speed of generation
  - Meaningful analysis

Page 5: Data mining on big data

BIG DATA VECTORS (4 Vs)

- Volume: amount of data
- Velocity: rate of collecting, acquiring, generating, or processing data
- Variety: different data types such as audio, video, and image data (mostly unstructured data)
- Variability: semantics, or the variability of meaning in language

[Gartner 2012]

Page 6: Data mining on big data


EXAMPLES

Government
- On 4 October 2012, the first presidential debate between President Barack Obama and Governor Mitt Romney triggered more than 10 million tweets within 2 hours.

Private Sector
- Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data.
- Facebook handles 40 billion photos from its user base.
- Flickr, a public picture-sharing site, received 1.8 million photos per day, on average, from February to March 2012 [5]. Assuming the size of each photo is 2 megabytes (MB), this requires 3.6 terabytes (TB) of storage every single day.
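The figures above can be checked with a few lines of arithmetic. This sketch reproduces the 3.6 TB/day Flickr estimate and converts the tweet and Walmart figures into per-second rates. It assumes decimal units (1 TB = 1,000,000 MB), and the variable names are illustrative; because the sources say "more than", the rates are lower bounds.

```python
# Flickr: 1.8 million photos per day at an assumed 2 MB each.
photos_per_day = 1_800_000
mb_per_photo = 2
MB_PER_TB = 1_000_000  # assumption: decimal units, 1 TB = 10^6 MB

flickr_tb_per_day = photos_per_day * mb_per_photo / MB_PER_TB

# Debate tweets: more than 10 million in 2 hours (lower bound on the rate).
tweets = 10_000_000
window_seconds = 2 * 60 * 60
tweets_per_second = tweets / window_seconds

# Walmart: more than 1 million transactions per hour (lower bound on the rate).
walmart_tx_per_second = 1_000_000 / 3600

print(f"Flickr storage: {flickr_tb_per_day:.1f} TB/day")
print(f"Tweets: at least {tweets_per_second:.0f} per second")
print(f"Walmart: at least {walmart_tx_per_second:.0f} transactions per second")
```

The per-second view makes the velocity aspect easier to compare across the examples than the raw totals do.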

Page 7: Data mining on big data


BIG DATA CHARACTERISTICS: HACE THEOREM

HACE Theorem. Big Data starts with large-volume, Heterogeneous, Autonomous sources with distributed and decentralized control, and seeks to explore Complex and Evolving relationships among data [1].

Page 8: Data mining on big data


Fig. The blind men and the giant elephant: the localized (limited) view of each blind man leads to a biased conclusion.

Page 9: Data mining on big data


BIG DATA CHARACTERISTICS

- Huge Data with Heterogeneous and Diverse Dimensionality
- Autonomous Sources with Distributed and Decentralized Control
- Complex and Evolving Relationships

Page 10: Data mining on big data


CONCEPTUAL VIEW OF THE BIG DATA PROCESSING FRAMEWORK

Fig. A Big Data processing framework

Page 11: Data mining on big data


A BIG DATA PROCESSING FRAMEWORK

- Tier I: focuses on low-level data accessing and computing.
- Tier II: concentrates on high-level semantics, application domain knowledge, and user privacy issues.
- Tier III: addresses the challenges of the actual mining algorithms.

Page 12: Data mining on big data


TIER I: BIG DATA MINING PLATFORM

Tier I focuses on low-level data accessing and computing.

One of the most important characteristics of Big Data is the need to carry out computing on petabyte (PB)-level, or even exabyte (EB)-level, data with a complex computing process.

Page 13: Data mining on big data


- Small-scale data mining tasks: a single desktop computer, which contains a hard disk and CPU processors, is sufficient to fulfill the data mining goals.

- Medium-scale data mining tasks: common solutions rely on parallel computing [3], [4] or collective mining [2], carried out with parallel computing programming.

- Big Data mining tasks: the data mining task is deployed by running parallel programming tools, such as MapReduce or Enterprise Control Language (ECL), on a large number of computing nodes (i.e., clusters).
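The step from a single machine to parallel computing can be sketched as "partition, compute locally, aggregate". The example below is a minimal single-machine illustration using Python's standard `concurrent.futures` module; the function names (`partition`, `local_stats`, `parallel_summary`) and the input data are hypothetical, and a real medium-scale or cluster deployment would distribute the partitions across machines rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, n_parts):
    """Split a non-empty list into at most n_parts contiguous chunks."""
    size = (len(data) + n_parts - 1) // n_parts
    return [data[i:i + size] for i in range(0, len(data), size)]

def local_stats(chunk):
    """Work done independently on one partition: (count, sum, max)."""
    return len(chunk), sum(chunk), max(chunk)

def parallel_summary(data, n_workers=4):
    """Compute count, mean, and max by combining per-partition results."""
    chunks = partition(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(local_stats, chunks))
    count = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return {"count": count, "mean": total / count, "max": max(p[2] for p in partials)}

# Hypothetical input: the integers 1..100.
summary = parallel_summary(list(range(1, 101)))
```

Threads are used here only to keep the sketch portable; in CPython they do not speed up CPU-bound work, and `ProcessPoolExecutor` from the same module offers the same `map` interface for that case.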

Page 14: Data mining on big data


MAPREDUCE TECHNIQUE

- MapReduce is a programming model for distributed systems.
- A MapReduce program executes in three stages:
  1. Map
  2. Shuffle
  3. Reduce
- MapReduce is a batch-oriented parallel computing model [7].
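The three stages can be sketched in a few lines of plain Python. This is an in-memory, single-process illustration of the programming model, not how Hadoop or any other framework implements it; `run_mapreduce`, the mapper and reducer names, and the sales records are hypothetical.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run the Map, Shuffle, and Reduce stages over an in-memory list."""
    # Map: each input record yields zero or more (key, value) pairs.
    mapped = []
    for record in records:
        mapped.extend(mapper(record))

    # Shuffle: bring together all values that share the same key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce: collapse each key's list of values into one result.
    return {key: reducer(key, values) for key, values in groups.items()}

# Hypothetical input: (store, sale amount) records.
sales = [("store_a", 12.0), ("store_b", 5.0), ("store_a", 8.0), ("store_b", 1.5)]

def sale_mapper(record):
    store, amount = record
    yield store, amount

def total_reducer(store, amounts):
    return sum(amounts)

totals = run_mapreduce(sales, sale_mapper, total_reducer)
```

Because the mapper sees one record at a time and the reducer sees one key at a time, both stages can in principle be spread over many nodes; only the shuffle has to move data between them.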

Page 15: Data mining on big data


MAPREDUCE ALGORITHM

Page 16: Data mining on big data


FLOW OF MAPREDUCE FUNCTION

Fig. MapReduce Technique [IBM.COM]

Page 17: Data mining on big data


EXAMPLE: WORD COUNT

Fig. MapReduce Technique for word count [IBM.COM]
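Since the figure itself is not reproduced in this transcript, here is a minimal word-count sketch that keeps the intermediate (word, 1) pairs visible at each stage. The three input splits are a hypothetical example, not the text used in the IBM figure, and the function names are illustrative.

```python
from collections import defaultdict

def map_split(text):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in text.split()]

def shuffle(pairs):
    """Shuffle: group the emitted 1s by word."""
    groups = defaultdict(list)
    for word, one in pairs:
        groups[word].append(one)
    return groups

def reduce_counts(groups):
    """Reduce: sum the 1s for each word to get its total count."""
    return {word: sum(ones) for word, ones in groups.items()}

# Hypothetical input, already divided into three splits (one per mapper).
splits = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

pairs = [pair for split in splits for pair in map_split(split)]
word_counts = reduce_counts(shuffle(pairs))

for word, count in sorted(word_counts.items()):
    print(word, count)
```

Each split is mapped independently, so on a cluster the three `map_split` calls could run on three different nodes before the shuffle brings equal words together.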

Page 18: Data mining on big data


Page 19: Data mining on big data


Page 20: Data mining on big data


TIER II: BIG DATA SEMANTICS AND APPLICATION KNOWLEDGE

Information Sharing and Data Privacy

To protect privacy, two common approaches are to:

1. Restrict access to the data, such as adding certification or access control to the data entries, so that sensitive information is accessible only to a limited group of users.

2. Anonymize data fields, so that sensitive information cannot be pinpointed to an individual record [15].
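Both approaches can be sketched on a single record. Everything named below is hypothetical: the field names, the `analyst_cleared` role, and the generalization rules (age to a ten-year range, ZIP code to a three-digit prefix) are assumptions chosen for illustration, not a scheme described in the presentation.

```python
SENSITIVE_FIELDS = {"name", "diagnosis"}   # hypothetical sensitive columns
ALLOWED_ROLES = {"analyst_cleared"}        # hypothetical role with full access

def read_record(record, role):
    """Approach 1 (access control): hide sensitive fields from other roles."""
    if role in ALLOWED_ROLES:
        return dict(record)
    return {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}

def anonymize(record):
    """Approach 2 (anonymization): drop the name and coarsen age and ZIP."""
    decade = (record["age"] // 10) * 10
    return {
        "age_range": f"{decade}-{decade + 9}",
        "zip_prefix": record["zip"][:3] + "**",
        "diagnosis": record["diagnosis"],
    }

# Hypothetical record.
patient = {"name": "A. Example", "age": 34, "zip": "42400", "diagnosis": "flu"}

guest_view = read_record(patient, "guest")
cleared_view = read_record(patient, "analyst_cleared")
shared_view = anonymize(patient)
```

Coarsening fields like this does not by itself guarantee anonymity: if only one person falls into a given age range and ZIP prefix, the record can still be pinpointed, which is why stronger criteria such as k-anonymity are usually checked on the whole data set.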

Page 21: Data mining on big data


TIER II: BIG DATA SEMANTICS AND APPLICATION KNOWLEDGE

Domain and Application Knowledge

Domain and application knowledge [28] provides essential information for designing Big Data mining algorithms and systems.

- Helps identify the right features for modeling the underlying data.
- Helps design achievable business objectives by using Big Data analytical techniques.

Page 22: Data mining on big data


TIER III: BIG DATA MINING ALGORITHMS

- Local Learning and Model Fusion for Multiple Information Sources
- Mining from Sparse, Uncertain, and Incomplete Data
- Mining Complex and Dynamic Data
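The first item, local learning and model fusion, can be illustrated with a deliberately simple stand-in: each source summarizes its own data, and only those summaries are merged into a global model. The nearest-centroid classifier, the function names, and the two sources below are assumptions made for this sketch; they are not the algorithms proposed in the cited work.

```python
def local_model(samples):
    """Local learning: per-class (count, sum) computed at one source.

    `samples` is a list of (feature_value, label) pairs held by that source.
    """
    stats = {}
    for x, label in samples:
        count, total = stats.get(label, (0, 0.0))
        stats[label] = (count + 1, total + x)
    return stats

def fuse(local_models):
    """Model fusion: merge local summaries into global class centroids."""
    merged = {}
    for stats in local_models:
        for label, (count, total) in stats.items():
            c, t = merged.get(label, (0, 0.0))
            merged[label] = (c + count, t + total)
    return {label: total / count for label, (count, total) in merged.items()}

def predict(centroids, x):
    """Assign x to the class whose fused centroid is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

# Hypothetical data held by two autonomous sources.
source_a = [(1.0, "low"), (2.0, "low"), (9.0, "high")]
source_b = [(3.0, "low"), (10.0, "high"), (11.0, "high")]

centroids = fuse([local_model(source_a), local_model(source_b)])
label_for_4 = predict(centroids, 4.0)
```

The raw samples never leave their source; only counts and sums are exchanged, which fits the autonomous, decentralized sources described by the HACE theorem and avoids the biased view any single source would give.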

Page 23: Data mining on big data


CONCLUSION

To explore Big Data, we have analyzed several challenges at the data, model, and system levels.

To support Big Data mining, high-performance computing platforms are required, which impose systematic designs to unleash the full power of Big Data.

Page 24: Data mining on big data

REFERENCES

1. Xindong Wu, Xingquan Zhu, Gong-Qing Wu, Wei Ding, “Data Mining With Big Data,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 1, January 2014.
2. B. Brown, M. Chui and J. Manyika, “Are you ready for the era of Big Data?” McKinsey Quarterly, Oct 2011, McKinsey Global Institute.
3. C. Bizer, P. Boncz, M. L. Brodie and O. Erling, “The Meaningful Use of Big Data: Four Perspectives, Four Challenges,” SIGMOD, vol. 40, no. 4, December 2011.
4. D. Boyd and K. Crawford, “Six Provocations for Big Data,” A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society, September 2011, Oxford Internet Institute.
5. D. Agrawal, S. Das and A. E. Abbadi, “Big Data and Cloud Computing: Current State and Future Opportunities,” EDBT 2011, Uppsala, Sweden.
6. D. Agrawal, S. Das and A. E. Abbadi, “Big Data and Cloud Computing: New Wine or Just New Bottles?” VLDB 2010, vol. 3, no. 2.
7. F. J. Alexander, A. Hoisie and A. Szalay, “Big Data,” IEEE Computing in Science and Engineering journal, 2011.
8. O. Trelles, P. Prins, M. Snir and R. C. Jansen, “Big Data, but are we ready?” Nature Reviews, Feb 2011.
9. K. Bakhshi, “Considerations for Big Data: Architecture and Approach,” Aerospace Conference, 2012 IEEE.
10. S. Lohr, “The Age of Big Data,” The New York Times, February 2012.
11. M. Nielsen, “A guide to the day of big data,” Nature, vol. 462, December 2009.

Page 25: Data mining on big data


Thank you