Two Years of Big Data
Big Data: The Second Year.
Joerg Blumtritt
The Future of Market Research
Hardware
Traditional
• exotic hardware
• big central servers
• SAN
• RAID
• hardware reliability
• expensive
• limited scalability
Big Data
• commodity HW
• racks of pizza boxes
• Ethernet
• JBOD
• unreliable HW
• cost effective
• scales further
Software
Traditional
• monolithic
• centralized storage
• RDBMS
• schema first
• proprietary
Big Data
• distributed
• storage & compute
• nodes
• raw data
• open source
Quantification
Volume, Velocity, Variety
Data Science
1. Volume
– Very large data sets
– Data Center → Data Warehouse → Internet Scale
– Typical dimensions: billions or trillions of records, millions or billions of variables
– e.g. Twitter: > 400 M Tweets per day
– Technologies: MapReduce, HDFS, Project Voldemort
... the first V
Map-Reduce
http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html#Example%3A+WordCount+v2.0
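The WordCount example linked above is the usual first MapReduce program. The following is only a single-process sketch of the same idea in Python, not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for illustration, and a real cluster would run the map and reduce steps on many nodes in parallel.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(document: str) -> Iterator[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group emitted values by key, the step a framework does between map and reduce."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle and reduce in sequence over all documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread over commodity machines, which is what makes the approach fit the "Volume" case.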
1. Volume
2. Velocity
– Very fast data streams
– Sensor data, smartphones, social media
– Typical dimensions: 15k-300k/s
– Real-time inputs / real-time outputs
– Stream/event processing
– Technologies: Storm, S4, Esper, HBase, Kafka
... the second V
Storm
http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html
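Stream/event processing means each event is handled as it arrives rather than in a later batch. As a rough illustration (plain Python, not Storm's API; the class name `SlidingWindowCounter` is hypothetical), this sketch keeps a running count of the events seen in the last few seconds. It assumes timestamps arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events whose timestamps fall within the last `window_seconds`."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._timestamps: deque = deque()

    def _evict(self, now: float) -> None:
        # Drop every event that has fallen out of the window ending at `now`.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()

    def record(self, timestamp: float) -> None:
        """Register one incoming event and discard expired ones."""
        self._timestamps.append(timestamp)
        self._evict(timestamp)

    def count(self, now: float) -> int:
        """Return the number of events still inside the window at time `now`."""
        self._evict(now)
        return len(self._timestamps)
```

The state kept per counter is bounded by the event rate times the window length, so output is available in real time without storing the whole stream.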
1. Volume
2. Velocity
3. Variety / Variability
– Manifold and highly variable data structures
– Data marketplaces, e.g. Datasift, GNIP, Enigma.io
– No schema / NoSQL
– Distributed storage
– Immutability
... and the last V
{"created_at":"Sat Apr 13 08:07:34 +0000 2013", "id":322984390491774976, "id_str":"322984390491774976", "text":"getr\u00e4umt, ich h\u00e4tte \u00fcber den Skandal geblogt, dass wir immernoch geschirrsp\u00fchlen, genau wie zu Car\u00eames Zeiten.", "source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e", "truncated":false, "in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":10177792,"id_str":"10177792", "name":"Joerg Blumtritt", "screen_name":"jbenno", "location":"Stockdorf", "url":"http:\/\/slow-media.net", "description":"I just coined the word panfuturistic because it sounds cool. http:\/\/memeticturn.com\/declaration-of-liquid-culture", "protected":false,"followers_count":2671,"friends_count":1599,"listed_count":141,"created_at":"Mon Nov 12 11:16:15 +0000 2007", "favourites_count":3582,"utc_offset":3600, "time_zone":"Berlin", "geo_enabled":true,"verified":false,"statuses_count":30140,"lang":"en", "contributors_enabled":false,"is_translator":false,"profile_background_color":"FFFFFF", "profile_background_image_url":"http:\/\/a0.twimg.com\/profile_background_images\/816896285\/688fcbc8df9391dfd71012d06ca34002.jpeg", "profile_background_image_url_https":"https:\/\/si0.twimg.com\/profile_background_images\/816896285\/688fcbc8df9391dfd71012d06ca34002.jpeg", "profile_background_tile":false,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/3315156408\/db719e7db02772e468179545fb06e7f9_normal.jpeg", "profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/3315156408\/db719e7db02772e468179545fb06e7f9_normal.jpeg", "profile_banner_url":"https:\/\/si0.twimg.com\/profile_banners\/10177792\/1365261531", "profile_link_color":"0000FF", "profile_sidebar_border_color":"FFFFFF", "profile_sidebar_fill_color":"E0FF92", "profile_text_color":"000000", 
"profile_use_background_image":true,"default_profile":false,"default_profile_image":false, "following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"favorited":false,"retweeted":false,"lang":"de"}
Instead of fixing the consistency of the data in the structure up front, a function is defined that checks every record against the given criteria:
function IsConsistent(Record, Schema) as Boolean
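One way to read the signature above, sketched in Python: the "schema" is just a mapping from field name to a check function, applied to each record when it is read. The type alias `Schema`, the function `is_consistent` and the example `tweet_schema` are assumptions for illustration, not part of the original slide.

```python
from typing import Any, Callable, Mapping

# Hypothetical schema representation: field name -> predicate on the field's value.
Schema = Mapping[str, Callable[[Any], bool]]


def is_consistent(record: Mapping[str, Any], schema: Schema) -> bool:
    """Return True if every field named in the schema exists and passes its check."""
    return all(
        field in record and check(record[field])
        for field, check in schema.items()
    )


# Example criteria for a tweet record; fields not listed here are simply ignored.
tweet_schema: Schema = {
    "id": lambda value: isinstance(value, int),
    "text": lambda value: isinstance(value, str) and len(value) > 0,
    "lang": lambda value: isinstance(value, str) and len(value) == 2,
}
```

Records that fail the check can be skipped or routed elsewhere at read time, so the raw data never has to be reshaped to fit a table definition.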
Operation         SQL
Create            INSERT
Read (Retrieve)   SELECT
Update (Modify)   UPDATE
Delete (Destroy)  DELETE
"mutable"
"Each event happens at a particular time and is always true"
• Just C+R; nothing ever gets "updated"
• Records are stored as files. Each record is a new file.
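A minimal sketch of such a create-and-read-only store, assuming one JSON file per record in a local directory. The class `AppendOnlyStore`, the helper `latest` and the field names `followers_count` and `at` are illustrative assumptions; a real system would use a distributed file system rather than a local folder.

```python
from __future__ import annotations

import json
import tempfile
import uuid
from pathlib import Path


class AppendOnlyStore:
    """Store every record as its own new file; no update, no delete."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self.directory.mkdir(parents=True, exist_ok=True)

    def create(self, record: dict) -> str:
        """Write one record to a fresh file and return its generated id."""
        record_id = uuid.uuid4().hex
        path = self.directory / f"{record_id}.json"
        # Mode "x" raises if the file already exists, so nothing is overwritten.
        with path.open("x", encoding="utf-8") as handle:
            json.dump(record, handle)
        return record_id

    def read_all(self) -> list[dict]:
        """Load every stored record (file order carries no meaning)."""
        paths = sorted(self.directory.glob("*.json"))
        return [json.loads(path.read_text(encoding="utf-8")) for path in paths]


def latest(records: list[dict], key: str = "at") -> dict:
    """Derive the 'current' state at read time: the record with the newest timestamp."""
    return max(records, key=lambda record: record[key])


if __name__ == "__main__":
    store = AppendOnlyStore(Path(tempfile.mkdtemp()))
    # A changed follower count is a new event, not an update of the old one.
    store.create({"followers_count": 2671, "at": "2013-04-13T08:07:34"})
    store.create({"followers_count": 2672, "at": "2013-04-14T09:00:00"})
    print(latest(store.read_all()))
```

Since old records are never touched, every past state can be recomputed from the files.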
"immutable"
Query
Precomputed View (Batch Mode)
Data Stream
All Data
Precomputed realtime view
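One plausible reading of the diagram labels above, sketched in Python: a batch job precomputes a view over all stored data, a second view is updated per incoming event, and a query combines both. The function and class names and the `hashtag` field are assumptions for illustration; the slide itself only gives the labels.

```python
from collections import Counter
from typing import Iterable, Mapping


def build_batch_view(all_data: Iterable[Mapping[str, str]]) -> Counter:
    """Batch mode: recompute hashtag counts from the complete data set."""
    return Counter(event["hashtag"] for event in all_data)


class RealtimeView:
    """Incrementally updated counts for events not yet covered by the batch view."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def on_event(self, event: Mapping[str, str]) -> None:
        self.counts[event["hashtag"]] += 1


def query(hashtag: str, batch_view: Counter, realtime_view: RealtimeView) -> int:
    """Answer a query by merging the batch result with the realtime increment."""
    return batch_view[hashtag] + realtime_view.counts[hashtag]
```

When the next batch run has absorbed the recent events, the realtime view can be reset, so errors in the fast path do not accumulate.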
Quantification
Volume, Velocity, Variety
Data Science
known knowns, known unknowns, unknown unknowns
"data puking" (Dashboards)
"analysis throwing" (Modelling)
"data democracy" (Big Data)
Avinash Kaushik
As we know,
There are known knowns.
There are things we know we know.
We also know
There are known unknowns.
That is to say:
We know there are some things
We do not know.
But there are also unknown unknowns,
The ones we don't know
We don't know.
Donald Rumsfeld
Data Science
• Text comparison of party programmes
• Cosine vector distance
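As a sketch of how such a comparison could work (the slide does not say how the texts were vectorised, so plain word counts are an assumption here), each text becomes a term-frequency vector and the distance is one minus the cosine of the angle between the two vectors:

```python
import math
from collections import Counter


def term_vector(text: str) -> Counter:
    """Turn a text into a bag-of-words vector of raw term frequencies."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention: an empty text is similar to nothing.
    return dot / (norm_a * norm_b)


def cosine_distance(a: Counter, b: Counter) -> float:
    """0.0 for identical word distributions, 1.0 for texts sharing no words."""
    return 1.0 - cosine_similarity(a, b)
```

Because the measure depends on the direction of the vectors and not their length, a long programme and a short one can still be compared directly.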
[Chart: DSDS and Tatort by hour of day, Fri 8.3., Sat 9.3. and Sun 10.3.; vertical axis 0 to 1500]
Persona
http://twitter.com/FlaviaReil/statuses/308321057499144193
http://twitter.com/froschmann1968/statuses/308321920200364034
http://twitter.com/VeronikaTangen/statuses/308322141676388352
http://twitter.com/froschmann1968/statuses/308322188501602304
http://twitter.com/QWallyTy/statuses/308322522863128576
http://twitter.com/Duftlavendel/statuses/308322911444406272
http://twitter.com/kakakiri/statuses/308323144836456448
http://twitter.com/Chake/statuses/308323468179566592
http://twitter.com/RegulaAeppli/statuses/308323570386350083
http://twitter.com/Imissmycat1/statuses/308323602342764544
http://twitter.com/WorldNewsGerman/statuses/308323834749140995
http://twitter.com/Zoran2010/statuses/308324446035386368
male, female, n/a
http://www.jasondavies.com/parallel-sets/
http://www.nytimes.com/interactive/2012/05/17/business/dealbook/how-the-facebook-offering-compares.html?_r=0
http://www.senchalabs.org/philogl/PhiloGL/examples/winds/
Quantification
Volume, Velocity, Variety
Data Science
D3
Quantified Self
Digital Darwinism is the Evolution of Consumer Behavior when Society & Technology Evolve Faster Than the Ability To Adapt
Brian Solis
{"name": "Joerg Blumtritt",
 "jobs": [
  {"title": "Strategy Consultant", "startdate": "2005", "enddate": null},
  {"title": "Chairman", "company": "Arbeitsgemeinschaft Social Media e.V.", "startdate": "2008", "enddate": null}
 ],
 "email": "[email protected]",
 "twitter": "@jbenno",
 "blogs": ["http://beautifuldata.net", "http://slow-media.net", "http://kuirjeo.net", "http://memeticturn.net"],
 "website": "http://mediagnosis.de",
 "image": "http://slow-media.net/wp-content/uploads/jb_creeper.jpg",
 "bio": "http://beautifuldata.net/Joerg-blumtritt/"
}