Big Data: Data Wrangling Boot Camp
What is Big Data?

Chuck Cartledge, PhD

27 January 2017



Table of contents I

1 What is Big Data

2 What sets BD apart

3 Real-world definitions

4 Q & A

5 Conclusion

6 References


And, why is it interesting?


Big data has emerged as a technology term and trend that is complementary to and considered to be equally as transformational as the cloud computing model. . . . represented as an “old” or “new” capability depending on the perspective of those defining it, . . .

Lee Badger [5]

Big Data can be characterized by the three V’s: volume (large amounts of data), variety (includes different types of data), and velocity (constantly accumulating new data).

Jules J. Berman [2]


Statistics and BD

Important ideas from statistics

How “good” an answer do you want?

Questions that need to be answered:

How accurately do you need the answer?

What level of confidence do you intend to use?

What is your current estimate of the answer you’re after?

Image from [4].

The greater the tolerance for error, the fewer samples needed.
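
The three questions above line up with the inputs of the usual sample-size formula for estimating a proportion, n = z² p(1 − p) / E², where E is the tolerated error, z follows from the confidence level, and p is the current estimate. The slide does not name a formula, so the sketch below is an illustration under that assumption; the function name `sample_size` and its defaults are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size(margin, confidence=0.95, estimate=0.5):
    """Samples needed to estimate a proportion within +/- margin.

    Assumes simple random sampling from a large population.
    estimate=0.5 is the worst case (no prior idea of the answer).
    """
    # Two-sided critical value, roughly 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z**2 * estimate * (1 - estimate) / margin**2)


# A looser tolerance for error needs fewer samples.
for margin in (0.01, 0.03, 0.05):
    print(f"margin {margin:.2f}: n = {sample_size(margin)}")
```

Passing a prior `estimate` far from 0.5 (say 0.1) lowers the required n further, which is one way prior knowledge of the population pays off.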


If you have some pre-knowledge of the “population” then you only need to sample a very small number of “individuals” to get a good enough answer. [7]
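
As a rough illustration of how few “individuals” can be enough, the sketch below draws 1,000 members from a made-up population of one million and estimates the share that would answer “yes”. Every number here (population size, true share, seed) is hypothetical and chosen only for the demonstration.

```python
import random

POPULATION = 1_000_000   # hypothetical population size
TRUE_YES = 300_000       # hypothetical: individuals 0..299_999 answer "yes"
SAMPLE = 1_000           # number of individuals actually asked


def estimate_proportion(seed):
    """Estimate the 'yes' share from a small simple random sample."""
    rng = random.Random(seed)
    # Draw SAMPLE distinct individuals without materializing the population.
    picked = rng.sample(range(POPULATION), SAMPLE)
    return sum(1 for person in picked if person < TRUE_YES) / SAMPLE


estimate = estimate_proportion(seed=2017)
print(f"true share 0.300, estimated from {SAMPLE:,} of {POPULATION:,}: {estimate:.3f}")
```

Only 0.1% of the population is examined, yet the estimate typically lands within a few percentage points of the true share.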


How sampling differs from “Big Data”

Sampling – start with a preconceived idea of the outcome

Sampling – few data points, extremely valuable (n = 1000)

Big data – you don’t know what the data holds

Big data – many data points, extremely cheap (n = all)

Leadership role changes from investigator to data [6].

Large data sets are messy, incomplete, inconsistent, and error prone. They require lots of data munging and data wrangling.
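
To make “messy, incomplete, inconsistent, and error prone” concrete, here is a minimal wrangling sketch. The records, field names, and alias table are all hypothetical; real data sets need rules tailored to their own quirks.

```python
RAW_ROWS = [  # hypothetical messy records
    {"user": " Alice ", "country": "us", "likes": "12"},
    {"user": "bob", "country": "USA", "likes": ""},        # incomplete
    {"user": "ALICE", "country": "U.S.", "likes": "12"},   # inconsistent duplicate
    {"user": "carol", "country": "de", "likes": "n/a"},    # error in a numeric field
]

# Hypothetical mapping of spelling variants to one canonical code.
COUNTRY_ALIASES = {"us": "US", "usa": "US", "u.s.": "US", "de": "DE"}


def clean(row):
    """Normalize one record; unusable numbers become None."""
    likes = row["likes"].strip()
    return {
        "user": row["user"].strip().lower(),
        "country": COUNTRY_ALIASES.get(row["country"].strip().lower()),
        "likes": int(likes) if likes.isdigit() else None,
    }


def wrangle(rows):
    """Clean every row and drop exact duplicates, keeping first-seen order."""
    seen, result = set(), []
    for row in map(clean, rows):
        key = tuple(row.items())
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


cleaned = wrangle(RAW_ROWS)
for row in cleaned:
    print(row)
```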


We’ll be covering virtually “bleeding edge” stuff.

Data too big for a single machine.

Processing too long for a single machine.

Question/analysis is parallelizable.
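
When all three conditions hold, a common pattern is to split the data, process each piece independently, and merge the partial results. The sketch below shows that shape with a word count; the threads on one machine merely stand in for separate machines, and the data and function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

TWEETS = [  # hypothetical stand-in for a data set too large for one machine
    "big data is big",
    "data wrangling is messy",
    "big questions need data",
    "wrangling takes time",
]


def count_words(chunk):
    """Map step: count words in one chunk, independent of all other chunks."""
    counts = Counter()
    for text in chunk:
        counts.update(text.split())
    return counts


def parallel_word_count(texts, workers=2):
    """Split texts into chunks, count each chunk, then merge the counts."""
    chunks = [texts[i::workers] for i in range(workers)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Reduce step: partial counts can be merged in any order.
        for partial in pool.map(count_words, chunks):
            total.update(partial)
    return total


totals = parallel_word_count(TWEETS)
print(totals.most_common(3))
```

The merged result equals what a single pass over all the data would give, which is what makes the question parallelizable.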


Lots of places, lots of it, and fast.

We are “drowning” in Big Data.

230,000,000 tweets per day [3]

2,700,000,000 Facebook likes per day [1]

100 hours of YouTube video every minute [8]

Clickstream left on servers

Our wearable devices are contributing to this avalanche of data.
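
Converting the daily figures above into per-second rates gives a feel for the velocity involved. The arithmetic below uses only the numbers quoted on the slide; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

tweets_per_day = 230_000_000       # from the slide, citing [3]
likes_per_day = 2_700_000_000      # from the slide, citing [1]
video_hours_per_minute = 100       # from the slide, citing [8]

tweets_per_second = tweets_per_day / SECONDS_PER_DAY
likes_per_second = likes_per_day / SECONDS_PER_DAY
video_hours_per_day = video_hours_per_minute * 60 * 24

print(f"{tweets_per_second:,.0f} tweets per second")
print(f"{likes_per_second:,.0f} likes per second")
print(f"{video_hours_per_day:,} hours of video uploaded per day")
```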


With all this data, what kinds of questions can we ask?

How is data from one dataset related to data in another?

Are the relationships one-to-one, one-to-many, or many-to-many?

Is the data “clean” or not?

What are we trying to find from the data?

The details of the questions depend on the data and what we are interested in finding.
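
The relationship question can often be checked mechanically by counting how often each shared key appears on each side. The sketch below is one possible approach, not a prescribed method; the function name, the “unrelated” label, and the sample keys are hypothetical.

```python
from collections import Counter


def relationship(left_keys, right_keys):
    """Classify how two datasets relate on a shared key column.

    Only keys present on both sides are considered.
    """
    left, right = Counter(left_keys), Counter(right_keys)
    shared = left.keys() & right.keys()
    if not shared:
        return "unrelated"
    left_repeats = any(left[key] > 1 for key in shared)
    right_repeats = any(right[key] > 1 for key in shared)
    if left_repeats and right_repeats:
        return "many-to-many"
    if left_repeats or right_repeats:
        return "one-to-many"
    return "one-to-one"


# Hypothetical data: user ids in a profile table and in a tweet table.
profiles = ["u1", "u2", "u3"]
tweets = ["u1", "u1", "u2", "u3", "u3"]
print(relationship(profiles, tweets))
```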


Some questions are easily stated, . . .

Which of these questions are amenable to Big Data processing (and why)?

1 a[i] = b[i] + c[i]

2 a[i] = f(b)

3 a[i] = a[i − 1] + b[i − 1]

4 a = b + c
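
One way to explore the question, offered as a sketch rather than as the slide's answer, is to look at what each element needs before it can be computed. In item 1 no element depends on another, so chunks can be evaluated in any order (or on different machines). In item 3, as written, a[i] cannot be produced until a[i − 1] is known. The helper names and data below are hypothetical.

```python
def elementwise_in_chunks(b, c, chunk_size):
    """Item 1, a[i] = b[i] + c[i], computed chunk by chunk in reverse order."""
    a = [None] * len(b)
    starts = range(0, len(b), chunk_size)
    for start in reversed(starts):  # the order of chunks does not matter
        stop = min(start + chunk_size, len(b))
        a[start:stop] = [b[i] + c[i] for i in range(start, stop)]
    return a


def recurrence(a0, b):
    """Item 3, a[i] = a[i - 1] + b[i - 1]: each step waits for the previous one."""
    a = [a0]
    for i in range(1, len(b) + 1):
        a.append(a[i - 1] + b[i - 1])
    return a


b = [1, 2, 3, 4, 5]
c = [10, 20, 30, 40, 50]
print(elementwise_in_chunks(b, c, chunk_size=2))
print(recurrence(0, b))
```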


Does the tweet sentiment change over time?


What sends what type of tweet?


Where do tweets come from?


Pragmatic and practical

A pragmatic definition

“. . . big data refers to things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value, in ways that change markets, organizations, the relationship between citizens and governments, and more.”

Mayer-Schonberger and Cukier [6]


A practical definition based on “people” time.

If:

your data won’t fit into one machine or application, or

you are waiting too long for an answer

then:

You have a Big Data problem that requires Big Data tools and techniques.


Q & A time.

Q: Name two families whose kids won’t join the Marines.

A: The Halls of Montezuma and the Shores of Tripoli.


What have we covered?

Big Data is all around us.

Big Data is about volume, variety, velocity, and getting answers quickly.

Some Big Data questions are easy to state, but impossible to answer.

Next: Digging into Big Data overview and concepts.


References

[1] Anson Alexander, Facebook user statistics 2012 [infographic], ansonAlex.com (2012).

[2] Jules J Berman, Principles of big data: Preparing, sharing, and analyzing complex information, Newnes, 2013.

[3] Joab Jackson, The big promise of big data, Business Software (2012).

[4] James Klurfeld, Making sense of the campaign: The truth about polling, http://drc.centerfornewsliteracy.org/resource/making-sense-campaign-truth-about-polling, 2016.


[5] Lee Badger, David Bernstein, and Robert Bohn, US government cloud computing technology roadmap, volume I, Tech. report, National Institute of Standards and Technology, 2014.

[6] Viktor Mayer-Schonberger and Kenneth Cukier, Big data: A revolution that will transform how we live, work, and think, Houghton Mifflin Harcourt, 2013.

[7] Mario F Triola, Essentials of statistics, Pearson Addison Wesley, Boston, MA, USA, 2008.

[8] YouTube, Statistics, http://www.youtube.com/yt/press/statistics.html.