Two Years of Big Data

Post on 26-Jan-2015

Category:

Technology

Big Data: The Second Year.

Joerg Blumtritt

The Future of Market Research

Hardware

Traditional
• exotic hardware
• big central servers
• SAN
• RAID
• hardware reliability
• expensive
• limited scalability

Big Data
• commodity HW
• racks of pizza boxes
• Ethernet
• JBOD
• unreliable HW
• cost effective
• scales further

Software

Traditional
• monolithic
• centralized storage
• RDBMS
• schema first
• proprietary

Big Data
• distributed
• storage & compute
• nodes
• raw data
• open source

Quantification

Volume, Velocity, Variety

Data Science

1. Volume
– Very large data sets
– Data Center → Data Warehouse → Internet Scale
– Typical dimensions: billions or trillions of records, millions or billions of variables
– e.g. Twitter: > 400 M Tweets per day
– Technologies: MapReduce, HDFS, Project Voldemort

... the first V

Map-Reduce

http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html#Example%3A+WordCount+v2.0
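The WordCount idea behind the linked tutorial can be sketched in a few lines of plain Python. This is not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, and everything runs in a single process rather than on a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every whitespace-separated token."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as a framework would do between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, a real framework can spread these calls across many commodity nodes.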

1. Volume
2. Velocity
– Very fast data streams
– Sensor data, smartphones, social media
– Typical dimensions: 15k-300k/s
– Real time inputs / real time outputs
– Stream/event processing
– Technologies: Storm, S4, Esper, HBase, Kafka

... the second V

Storm

http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html
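To give a feel for stream/event processing, here is a minimal sketch of a rolling count over a time window, the kind of per-event computation a stream processor performs. It does not use Storm's API; the class `RollingCounter` and its methods are hypothetical, and it assumes events arrive with non-decreasing timestamps.

```python
from collections import Counter, deque
from typing import Deque, List, Tuple


class RollingCounter:
    """Counts events per key over the last `window` seconds."""

    def __init__(self, window: float) -> None:
        self.window = window
        self.events: Deque[Tuple[float, str]] = deque()
        self.counts: Counter = Counter()

    def add(self, timestamp: float, key: str) -> None:
        """Register one event, then evict everything that fell out of the window."""
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Events are ordered by time, so only the left end can be stale.
        while self.events and self.events[0][0] <= now - self.window:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def top(self, n: int) -> List[Tuple[str, int]]:
        """Return the n most frequent keys inside the current window."""
        return self.counts.most_common(n)
```

Each event is handled once on arrival and once on eviction, so the result is available in real time without rescanning the whole data set.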

1. Volume
2. Velocity
3. Variety / Variability
– Manifold and highly variable data structures
– Data market places, e.g. Datasift, GNIP, Enigma.io
– No schema / NoSQL
– Distributed storage
– Immutability

... and the last V

{"created_at":"Sat Apr 13 08:07:34 +0000 2013", "id":322984390491774976, "id_str":"322984390491774976", "text":"getr\u00e4umt, ich h\u00e4tte \u00fcber den Skandal geblogt, dass wir immernoch geschirrsp\u00fchlen, genau wie zu Car\u00eames Zeiten.", "source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e", "truncated":false, "in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":10177792,"id_str":"10177792", "name":"Joerg Blumtritt", "screen_name":"jbenno", "location":"Stockdorf", "url":"http:\/\/slow-media.net", "description":"I just coined the word panfuturistic because it sounds cool. http:\/\/memeticturn.com\/declaration-of-liquid-culture", "protected":false,"followers_count":2671,"friends_count":1599,"listed_count":141,"created_at":"Mon Nov 12 11:16:15 +0000 2007", "favourites_count":3582,"utc_offset":3600, "time_zone":"Berlin", "geo_enabled":true,"verified":false,"statuses_count":30140,"lang":"en", "contributors_enabled":false,"is_translator":false,"profile_background_color":"FFFFFF", "profile_background_image_url":"http:\/\/a0.twimg.com\/profile_background_images\/816896285\/688fcbc8df9391dfd71012d06ca34002.jpeg", "profile_background_image_url_https":"https:\/\/si0.twimg.com\/profile_background_images\/816896285\/688fcbc8df9391dfd71012d06ca34002.jpeg", "profile_background_tile":false,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/3315156408\/db719e7db02772e468179545fb06e7f9_normal.jpeg", "profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/3315156408\/db719e7db02772e468179545fb06e7f9_normal.jpeg", "profile_banner_url":"https:\/\/si0.twimg.com\/profile_banners\/10177792\/1365261531", "profile_link_color":"0000FF", "profile_sidebar_border_color":"FFFFFF", "profile_sidebar_fill_color":"E0FF92", "profile_text_color":"000000", 
"profile_use_background_image":true,"default_profile":false,"default_profile_image":false, "following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"favorited":false,"retweeted":false,"lang":"de"}

Instead of fixing the consistency of the data in the structure up front, a function is defined that checks every record against the given criteria:

function IsConsistent(Record, Schema) as Boolean
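One way to read this signature is sketched below in Python. The representation of a schema as a mapping from field name to check function is an assumption, as are the criteria in `tweet_schema`; the slide only gives the function signature.

```python
from typing import Any, Callable, Mapping

# Assumption: a "schema" maps each required field name to a check function.
Schema = Mapping[str, Callable[[Any], bool]]


def is_consistent(record: Mapping[str, Any], schema: Schema) -> bool:
    """Return True if every field named in the schema is present and passes its check.

    Fields that the schema does not mention are ignored, so raw records
    may carry any amount of extra data.
    """
    for field, check in schema.items():
        if field not in record or not check(record[field]):
            return False
    return True


# Hypothetical criteria for a tweet record; not taken from the slides.
tweet_schema: Schema = {
    "id": lambda v: isinstance(v, int),
    "text": lambda v: isinstance(v, str) and len(v) > 0,
    "lang": lambda v: v in {"de", "en"},
}
```

The check runs when the data is read, so the same raw records can be validated against different schemas for different analyses.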

Operation          SQL
Create             INSERT
Read (Retrieve)    SELECT
Update (Modify)    UPDATE
Delete (Destroy)   DELETE

"mutable"

"Each event happens at a particular time and is always true"

• Just C+R (create and read); nothing ever gets "updated"

• Records are stored as files. Each record is a new file.
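The two bullets above can be sketched as a small append-only store. This is a minimal illustration under stated assumptions: the class `EventStore`, the file naming scheme, and the event fields `user`, `followers` and `observed_at` are all hypothetical.

```python
import json
import time
import uuid
from pathlib import Path
from typing import Any, Dict, List


class EventStore:
    """Append-only store: supports create and read, never update or delete."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def create(self, event: Dict[str, Any]) -> Path:
        """Write the event to a brand-new file; existing files are never touched."""
        path = self.root / f"{time.time_ns()}-{uuid.uuid4().hex}.json"
        # Mode "x" fails if the file already exists, so nothing is overwritten.
        with path.open("x", encoding="utf-8") as handle:
            json.dump(event, handle)
        return path

    def read_all(self) -> List[Dict[str, Any]]:
        """Return every stored event, ordered by file name."""
        files = sorted(self.root.glob("*.json"))
        return [json.loads(p.read_text(encoding="utf-8")) for p in files]


def current_followers(events: List[Dict[str, Any]], user: str) -> int:
    """Derive the "current" value from the latest observation; history stays intact."""
    observations = [e for e in events if e["user"] == user]
    return max(observations, key=lambda e: e["observed_at"])["followers"]
```

What would be an UPDATE in the mutable model becomes a new record with a later timestamp, and the current state is computed at read time.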

"immutable"

Query

Precomputed view (batch mode)

Data Stream

All Data

Precomputed realtime view
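The labels of this diagram resemble a layout in which a query is answered by merging a batch view with a realtime view. The sketch below is one possible reading, not the author's implementation; the `Event` shape, the `batch_cutoff` parameter and all function names are assumptions.

```python
from collections import Counter
from typing import Iterable, Tuple

Event = Tuple[int, str]  # (timestamp, key); hypothetical record shape


def build_batch_view(all_data: Iterable[Event], batch_cutoff: int) -> Counter:
    """Recompute counts from scratch over all data up to the cutoff (batch mode)."""
    return Counter(key for ts, key in all_data if ts <= batch_cutoff)


def build_realtime_view(stream: Iterable[Event], batch_cutoff: int) -> Counter:
    """Count only the events that arrived after the last batch run."""
    return Counter(key for ts, key in stream if ts > batch_cutoff)


def query(key: str, batch_view: Counter, realtime_view: Counter) -> int:
    """Answer a query by merging the two precomputed views."""
    return batch_view[key] + realtime_view[key]
```

The batch view can be slow but complete, while the realtime view only has to cover the short gap since the last batch run.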

known knowns, known unknowns, unknown unknowns

"data puking" (dashboards)

"analysis throwing" (modelling)

"data democracy" (Big Data)

Avinash Kaushik

As we know, there are known knowns. There are things we know we know. We also know there are known unknowns. That is to say: we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know.

Donald Rumsfeld

Data Science

• Text comparison of party programmes

• Cosine distance between term vectors
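A minimal sketch of such a text comparison, assuming a plain bag-of-words representation with raw term counts; the slides do not say which weighting or tokenization was actually used, and the function names are illustrative.

```python
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    """Turn a text into a bag-of-words vector of raw term counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)


def cosine_distance(text_a: str, text_b: str) -> float:
    """Distance of 0 means same word proportions, 1 means no shared words."""
    return 1.0 - cosine_similarity(term_vector(text_a), term_vector(text_b))
```

Because the measure depends only on the angle between the vectors, a long programme and a short one can still come out as close if they use words in similar proportions.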

[Chart: DSDS and Tatort; x-axis ticks 0 to 22 for each of Fri 8 Mar., Sat 9 Mar., Sun 10 Mar.; y-axis 0 to 1500]

Persona
http://twitter.com/FlaviaReil/statuses/308321057499144193
http://twitter.com/froschmann1968/statuses/308321920200364034
http://twitter.com/VeronikaTangen/statuses/308322141676388352
http://twitter.com/froschmann1968/statuses/308322188501602304
http://twitter.com/QWallyTy/statuses/308322522863128576
http://twitter.com/Duftlavendel/statuses/308322911444406272
http://twitter.com/kakakiri/statuses/308323144836456448
http://twitter.com/Chake/statuses/308323468179566592
http://twitter.com/RegulaAeppli/statuses/308323570386350083
http://twitter.com/Imissmycat1/statuses/308323602342764544
http://twitter.com/WorldNewsGerman/statuses/308323834749140995
http://twitter.com/Zoran2010/statuses/308324446035386368

[Chart legend: male, female, n/a]

http://www.jasondavies.com/parallel-sets/

http://www.nytimes.com/interactive/2012/05/17/business/dealbook/how-the-facebook-offering-compares.html?_r=0

http://www.senchalabs.org/philogl/PhiloGL/examples/winds/

D3

Quantified Self

Digital Darwinism is the evolution of consumer behavior when society & technology evolve faster than the ability to adapt

Brian Solis

{
  "name": "Joerg Blumtritt",
  "jobs": [
    {"title": "Strategy Consultant", "startdate": "2005", "enddate": null},
    {"title": "Chairman", "company": "Arbeitsgemeinschaft Social Media e.V.", "startdate": "2008", "enddate": null}
  ],
  "email": "joerg.blumtritt@mediagnosis.de",
  "twitter": "@jbenno",
  "blogs": ["http://beautifuldata.net", "http://slow-media.net", "http://kuirjeo.net", "http://memeticturn.net"],
  "website": "http://mediagnosis.de",
  "image": "http://slow-media.net/wp-content/uploads/jb_creeper.jpg",
  "bio": "http://beautifuldata.net/Joerg-blumtritt/"
}
