whynosql

Big Data and Why NoSQL. Andy Cobley, School of Computing, University of Dundee. Twitter: @andycobley

Upload: andy-cobley

Post on 17-May-2015


Category:

Technology



DESCRIPTION

Talk to techmeetup Aberdeen on big data and NoSQL. Some links seem to be missing from the onscreen presentation, particularly http://www.dbshards.com/dbshards/ for the sharding diagram.

TRANSCRIPT

Page 1: Whynosql

Big Data and Why NoSQL

Andy Cobley
School of Computing
University of Dundee
Twitter: @andycobley

Page 2: Whynosql

Who am I?
Lecturer at University of Dundee
Program director of Business Intelligence and new program Data Science (http://goo.gl/ljl6N and http://goo.gl/uwHSi)
Geek and Hacker

Page 3: Whynosql

So what is Big Data?

Page 4: Whynosql

From evil Wikipedia:
“In information technology, big data consists of datasets that grow so large that they become awkward to work with using on-hand database management tools.”
Which doesn’t tell us much.
Any definition that relies on data “size” will become obsolete very quickly as data storage capabilities grow.

Page 5: Whynosql

Let’s try something different: the Three V’s
Volume
    How big is the data? Terabytes? Petabytes?
Variety
    Is it the same sort of data? What about blobs? Does it change?
Velocity
    How fast is it coming in? Can we store it fast enough and then use it?

http://nosql.mypopescu.com/post/5547192335/bigdata-the-three-vs-volume-variety-velocity

Page 6: Whynosql

The Twitter problem
Twitpocalypse
    Overflow of status ids for 32-bit signed integers
But beyond that, can we physically store data fast enough?
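
The overflow behind the Twitpocalypse can be sketched in a few lines. This is only an illustration of the 32-bit signed limit, not Twitter's code; the helper names `fits_in_int32` and `wrap_int32` are made up for this example.

```python
import struct

# Largest value a 32-bit signed integer can hold.
INT32_MAX = 2**31 - 1


def fits_in_int32(status_id: int) -> bool:
    """Return True if status_id can be packed as a 32-bit signed integer."""
    try:
        struct.pack("<i", status_id)
        return True
    except struct.error:
        return False


def wrap_int32(value: int) -> int:
    """Simulate the value a silently wrapping 32-bit signed field would hold."""
    return (value + 2**31) % 2**32 - 2**31


print(fits_in_int32(INT32_MAX))      # True
print(fits_in_int32(INT32_MAX + 1))  # False
print(wrap_int32(INT32_MAX + 1))     # -2147483648
```

A client that stored status ids in a 32-bit signed field would either reject the next id or see it wrap to a negative number, depending on the language.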

Page 7: Whynosql

Suppose we are storing 16 columns of 16 bytes each (256 bytes per row)
At 100 rows per second, that’s about 0.7 terabytes per year
And at 1 million rows per second, that’s about 7 petabytes per year
This is volume
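
The arithmetic on this slide can be checked with a short calculation. It assumes a 365-day year and counts only the raw payload (no indexes, replication, or storage overhead); the function names are invented for this sketch. The slide's figures match when sizes are read in binary units (TiB and PiB).

```python
# Seconds in a 365-day year, ignoring leap years.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def bytes_per_year(rows_per_second: int, columns: int = 16,
                   bytes_per_column: int = 16) -> int:
    """Raw bytes written in one year at a steady insert rate."""
    return rows_per_second * columns * bytes_per_column * SECONDS_PER_YEAR


def to_tib(n_bytes: int) -> float:
    """Convert bytes to tebibytes (2**40 bytes)."""
    return n_bytes / 2**40


def to_pib(n_bytes: int) -> float:
    """Convert bytes to pebibytes (2**50 bytes)."""
    return n_bytes / 2**50


print(f"100 rows/s:       {to_tib(bytes_per_year(100)):.2f} TiB per year")
print(f"1,000,000 rows/s: {to_pib(bytes_per_year(1_000_000)):.2f} PiB per year")
```

In decimal units the same rates come to roughly 0.8 TB and 8 PB per year, so the order of magnitude holds either way.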

Page 8: Whynosql

Variability
Data is sparse and can be different sizes
Over time the type of data changes
Consider click-through data: as pages evolve, new data types and fields need to be stored

Page 9: Whynosql

What about

id    MassSpec    Meta data    Meta data
1
2

Page 10: Whynosql

We need UDF
User Defined Functions inside the dB
Or a different way of dealing with it, such as Hadoop or MRSQL.

Page 11: Whynosql

So what is NoSQL?
Throws away everything you know about databases
Is a family of different databases
Lots of different “products”
BUT!
http://nosql.mypopescu.com/post/1016320617/mongodb-is-web-scale (warning: might offend)
They should only be used when it’s sensible; they are not magic sauce.

Page 12: Whynosql

NoSQL types
Key-Value
Column-family
Document databases
    Allow sharding across nodes
Graph
    Fast for graph-like data and operations

Page 13: Whynosql

Some NoSQL databases
CouchDB
MongoDB
Cassandra
Riak
HBase
Neo4j

http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis

Page 14: Whynosql

Sharding?
Distribution of data across nodes
Allows the load to be spread across multiple machines
SQL databases can be sharded
Not all NoSQL databases can be sharded
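
One simple way to distribute data across nodes is to hash each key and take the result modulo the number of shards. The sketch below is an assumption-laden toy (in-memory dicts stand in for machines, and `ShardedStore` and `shard_for` are invented names), not how any particular database does it.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    """Toy key-value store that spreads keys over several in-memory shards."""

    def __init__(self, shard_count: int):
        # Each dict stands in for a separate machine.
        self.shards = [dict() for _ in range(shard_count)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(shard_count=4)
for i in range(1000):
    store.put(f"user-{i}", i)
print([len(shard) for shard in store.shards])  # keys per shard
```

Plain modulo hashing moves most keys when the shard count changes, which is one reason distributed stores tend to use a hash ring instead.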

Page 15: Whynosql

CAP Theorem
CAP (or Brewer’s) theorem says:
It’s impossible for a web service to provide all three of the following guarantees at the same time:
    Consistency
    Availability
    Partition tolerance

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf

But see : http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed and http://codahale.com/you-cant-sacrifice-partition-tolerance/

Page 16: Whynosql

http://blog.nahurst.com/visual-guide-to-nosql-systems

Page 17: Whynosql

Partitions?
Essentially, failing to achieve consistency within a set time causes a partition.
You can sacrifice availability to ensure consistency
Partitions are rare and, if you have one server, almost never happen
Partitions are caused by network faults and failed nodes

Page 18: Whynosql

Eventual Consistency
Eventually all nodes will tell the same story
Isn’t this a mad idea?
Facebook (actually not)
The Internet is based on an Eventual Consistency dB: DNS
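
"Eventually all nodes will tell the same story" can be illustrated with a toy model. The sketch assumes a last-write-wins rule based on timestamps, which is only one possible reconciliation strategy; the `Replica` class and `sync` function are invented for this example and do not describe DNS or any specific database.

```python
class Replica:
    """Toy replica holding key -> (timestamp, value)."""

    def __init__(self, name: str):
        self.name = name
        self.data = {}

    def write(self, key: str, value, timestamp: int) -> None:
        """Accept a write only if it is newer than what is already stored."""
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key: str):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a: Replica, b: Replica) -> None:
    """Exchange entries both ways, keeping the newest version of each key."""
    for key, (ts, value) in list(a.data.items()):
        b.write(key, value, ts)
    for key, (ts, value) in list(b.data.items()):
        a.write(key, value, ts)


r1, r2 = Replica("r1"), Replica("r2")
r1.write("status", "old", timestamp=1)
r2.write("status", "new", timestamp=2)
print(r1.read("status"), r2.read("status"))  # replicas disagree for a while
sync(r1, r2)
print(r1.read("status"), r2.read("status"))  # after syncing they agree
```

Between the write and the sync a reader may see stale data, which is the trade-off being accepted.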

Page 19: Whynosql

Introducing Cassandra
Distributed / Decentralized
Column Orientated
Key Value Store
Fault Tolerant

Page 20: Whynosql

Network topology of a Cassandra db
Multiple nodes
Cassandra can be Rack Aware
Keys are replicated across nodes
It’s essentially a DHT (Distributed Hash Table)
Think BitTorrent
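
The idea of a distributed hash table with replicated keys can be sketched as a hash ring: each node gets a position on the ring, and a key is stored on the next few nodes clockwise from its own position. This is a simplified illustration under stated assumptions (one token per node, no rack awareness, invented `Ring` and `token` names), not Cassandra's actual partitioner.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Position of a node name or key on the ring (a stable 64-bit hash)."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class Ring:
    """Toy hash ring that picks replica nodes for a key."""

    def __init__(self, nodes, replication_factor: int = 3):
        self.replication_factor = min(replication_factor, len(nodes))
        # Sort nodes by their ring position.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas_for(self, key: str):
        """Return the nodes that hold a key: the next N nodes clockwise."""
        start = bisect.bisect_right(self.tokens, token(key)) % len(self.entries)
        return [self.entries[(start + i) % len(self.entries)][1]
                for i in range(self.replication_factor)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
print(ring.replicas_for("Jsmith"))
```

Because every node can compute the same ring, any node can work out where a key lives without asking a central coordinator.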

Page 21: Whynosql

CQL
Version 0.8 introduced CQL (Cassandra Query Language)
Almost looks like SQL!
http://crlog.info/2011/09/17/cassandra-query-language-cql-v2-0-reference/ (language ref)

http://www.datastax.com/docs/0.8/dml/using_cql

Page 22: Whynosql

Demo
Start Cassandra
Open CQLSH
Create Keyspace
Create a columnfamily
Now we can insert!

Page 23: Whynosql

So why does this work?

Jsmith
    Password: ch@ngem3a
Jbrown
    Gender: Male
    Phone: 01382 345078

Column store: keys with name: value pairs underneath
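
A rough mental model of "keys with name: value pairs underneath" is a dictionary of dictionaries, where each row key carries only the columns it actually has. The sketch below models just that shape in plain Python; the `insert` and `get` helpers are invented for illustration and are not a Cassandra client API.

```python
# Row key -> {column name: value}. Rows need not share the same columns.
column_family = {
    "Jsmith": {"Password": "ch@ngem3a"},
    "Jbrown": {"Gender": "Male", "Phone": "01382 345078"},
}


def insert(cf: dict, row_key: str, column: str, value: str) -> None:
    """Add or overwrite one column under a row key, creating the row if needed."""
    cf.setdefault(row_key, {})[column] = value


def get(cf: dict, row_key: str, column: str, default=None):
    """Read one column; missing rows or columns simply return the default."""
    return cf.get(row_key, {}).get(column, default)


insert(column_family, "Jsmith", "Phone", "01382 000000")  # made-up number
print(get(column_family, "Jsmith", "Phone"))
print(get(column_family, "Jbrown", "Password"))  # None: Jbrown has no such column
```

No schema change is needed to give one row a new column, which is why sparse and evolving data fits this layout.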

Page 24: Whynosql

Interfacing to Cassandra
Based on Thrift
    http://thrift.apache.org/
Large number of languages supported
    http://wiki.apache.org/cassandra/ClientOptions
I’ve used Java and Hector
    http://prettyprint.me/
Although there is a C# version
    http://hectorsharp.com/

Page 25: Whynosql

Cassandra JDBC
Very new, difficult to know how stable it is
Needs compiling and libraries not in Cassandra!
http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/

Page 26: Whynosql

Astyanax
From Netflix
Based on Hector but said to be a lot simpler!
https://github.com/Netflix/astyanax/wiki

Page 27: Whynosql

jBloggyAppy, a demo app of Cassandra
All source code on GitHub
https://github.com/acobley/jBoggyAppy
Feel free to use and abuse
Simple blogging app

Page 28: Whynosql

A word on using open source software
Versioning!
Things change!
Documentation is wrong!
    http://prettyprint.me/
End up reading unit tests to actually program.

Page 29: Whynosql

One last thing
Dundee DDD, 17th November, Big Data track
Anyone interested in speaking?