whynosql
DESCRIPTION
Talk to techmeetup Aberdeen on bigdata and nosql. Some links seem to be missing from the onscreen presentation, particularly http://www.dbshards.com/dbshards/ for the sharding diagram.
TRANSCRIPT
Big Data and Why NoSQL
Andy Cobley
School of Computing
University of Dundee
Twitter: @andycobley
Who am I?
Lecturer at University of Dundee
Program director of Business Intelligence and new program Data Science (http://goo.gl/ljl6N and http://goo.gl/uwHSi)
Geek and Hacker
So what is Big Data?
From evil Wikipedia:
“In information technology, big data consists of datasets that grow so large that they become awkward to work with using on-hand database management tools.”
Which doesn’t tell us much
Any definition that relies on data “size” will become obsolete very quickly as data storage capabilities grow.
Let’s try something different
The Three V’s
Volume: How big is the data? Terabytes? Petabytes?
Variety: Is it the same sort of data? What about blobs? Does it change?
Velocity: How fast is it coming in? Can we store it fast enough and then use it?
http://nosql.mypopescu.com/post/5547192335/bigdata-the-three-vs-volume-variety-velocity
The Twitter problem
Twitpocalypse: overflow of status ids for 32 bit signed integers
But beyond that, can we physically store data fast enough?
Suppose we are storing 16 columns of 16 bytes
At 100 per second: 0.7 Terabyte per year
And at 1 million per second that’s 7 petabytes per year
This is volume
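The figures on this slide can be checked with a few lines of arithmetic. This is a back-of-envelope sketch only: it assumes binary units (2^40 bytes per terabyte, 2^50 per petabyte) and ignores replication, indexes and any storage overhead, none of which the slide specifies.

```python
# Back-of-envelope storage estimate for the row size quoted on the slide.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31,536,000 seconds
ROW_BYTES = 16 * 16                     # 16 columns of 16 bytes = 256 bytes

TIB = 2 ** 40   # assumption: binary terabyte
PIB = 2 ** 50   # assumption: binary petabyte


def bytes_per_year(rows_per_second):
    """Raw bytes written in one year at a constant insert rate."""
    return rows_per_second * ROW_BYTES * SECONDS_PER_YEAR


slow = bytes_per_year(100) / TIB          # roughly 0.7 terabytes
fast = bytes_per_year(1_000_000) / PIB    # roughly 7 petabytes

# The Twitpocalypse limit: the largest id a 32 bit signed integer can hold.
INT32_MAX = 2 ** 31 - 1

print(f"{slow:.2f} TiB per year at 100 rows/s")
print(f"{fast:.2f} PiB per year at 1,000,000 rows/s")
print(f"largest 32 bit signed id: {INT32_MAX:,}")
```

With decimal units instead, the same rates come out closer to 0.8 TB and 8 PB per year, so the slide's numbers depend on which definition of terabyte you pick.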
Variability
Data is sparse and can be different sizes
Over time the type of data changes
Consider click through data: as pages evolve, new data types and fields need to be stored
What about
id MassSpec Meta data Meta data
1
2
We need UDF: User Defined Functions inside the dB
Or a different way of dealing with it, such as Hadoop or MRSQL.
So what is NoSql?
Throws away everything you know about Databases
Is a family of different databases
Lots of different “products”
BUT!
http://nosql.mypopescu.com/post/1016320617/mongodb-is-web-scale (warning might offend)
They should only be used when it’s sensible, they are not magic sauce.
NoSql types
Key-Value
Column-family
Document databases
Allow sharding across nodes
Graph: fast for graph like data and operations
Some NoSQL databases
CouchDb
MongoDb
Cassandra
Riak
Hbase
Neo4j
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
Sharding?
Distribution of data across nodes
Allows performance to be spread across multiple machines
SQL databases can be sharded
Not all NoSQL databases can be sharded
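As a rough illustration of distributing data across nodes, here is a minimal sketch of hash-based sharding. It is not how any particular database does it: `ShardedStore` and `shard_for` are hypothetical names, each Python dict stands in for one machine, and the user names are sample data.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard number using a stable hash (not Python's hash())."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Hypothetical in-memory store; each dict stands in for one node."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(4)
for user in ["jsmith", "jbrown", "acobley"]:
    store.put(user, {"name": user})

print(store.get("jsmith"))
print([len(shard) for shard in store.shards])
```

Reads and writes for one key always go to the same shard, so load spreads across machines. The catch with plain modulo placement is that changing the number of shards remaps most keys, which is one reason distributed stores tend to use ring-based schemes instead.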
Cap Theorem
CAP (or Brewer’s) theorem says:
It’s impossible for a web service to provide all three of the following at once:
Consistency
Availability
Partition tolerance
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf
But see : http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed and http://codahale.com/you-cant-sacrifice-partition-tolerance/
http://blog.nahurst.com/visual-guide-to-nosql-systems
Partitions?
Essentially failing to achieve consistency within a set time causes a partition.
You can sacrifice availability to ensure consistency
Partitions are rare and, if you have one server, almost never happen
Partitions are caused by networks, failed nodes
Eventual Consistency
Eventually all nodes will tell the same story
Isn’t this a mad idea?
Facebook (Actually not)
The Internet is based on an Eventual Consistency dB: DNS
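To make "eventually all nodes will tell the same story" concrete, here is a toy model of two replicas that disagree for a while and then converge. It assumes a simple last-write-wins rule based on timestamps. That is a deliberate simplification, not a description of how DNS or any real database resolves conflicts, and the `Replica` class, `sync` function and addresses are all made up for the example.

```python
class Replica:
    """One node holding key -> (timestamp, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, timestamp):
        # Keep the write only if it is newer than what we already hold.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(a, b):
    """Exchange entries both ways so both replicas keep the newest version."""
    for key, (ts, value) in list(a.data.items()):
        b.write(key, value, ts)
    for key, (ts, value) in list(b.data.items()):
        a.write(key, value, ts)


n1, n2 = Replica(), Replica()
n1.write("www.example.org", "192.0.2.1", timestamp=1)
n2.write("www.example.org", "192.0.2.7", timestamp=2)  # later update hits another node

before = (n1.read("www.example.org"), n2.read("www.example.org"))  # nodes disagree
sync(n1, n2)
after = (n1.read("www.example.org"), n2.read("www.example.org"))   # nodes agree

print(before)
print(after)
```

Between the second write and the sync, a reader can get either answer depending on which node it asks. Whether that is acceptable depends on the application.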
Introducing Cassandra
Distributed / Decentralized
Column Orientated
Key Value Store
Fault Tolerant
Network topology of a Cassandra db
Multiple nodes
Cassandra can be Rack Aware
Keys are replicated across nodes
It’s essentially a DHT (Distributed Hash Table)
Think BitTorrent
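The DHT idea can be sketched as a hash ring: every node and every key gets a position on the ring, and a key lives on the first node at or after its position plus the next few nodes clockwise. This is a simplified model and not Cassandra's actual code. The node names, the `Ring` class and the replication factor are assumptions for illustration, MD5 is used only as a convenient stable hash, and rack awareness is not modelled.

```python
import bisect
import hashlib


def token(value):
    """Position on the ring: a stable integer derived from an MD5 digest."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")


class Ring:
    """Hypothetical hash ring that places each key on several nodes."""

    def __init__(self, nodes, replication_factor=2):
        self.replication_factor = min(replication_factor, len(nodes))
        # Sort nodes by token so the ring can be walked clockwise.
        self.ring = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    def replicas_for(self, key):
        """First node at or after the key's token, plus the next ones clockwise."""
        start = bisect.bisect_left(self.tokens, token(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(self.replication_factor)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=2)
owners = ring.replicas_for("jsmith")
print(owners)
```

Any client can work out where a key lives from the key alone, with no central lookup table. Removing a node only moves the keys that node was responsible for, and the rest stay where they are.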
CQL
Version 0.8 introduced CQL (Cassandra Query Language)
Almost looks like SQL!
http://crlog.info/2011/09/17/cassandra-query-language-cql-v2-0-reference/ Language ref
http://www.datastax.com/docs/0.8/dml/using_cql
Demo
Start Cassandra
Open CQLSH
Create Keyspace
Create a columnfamily
Now we can insert!
So why does this work?
Jsmith
Password: ch@ngem3a
Jbrown
Gender: Male
Phone: 01382 345078
Column store, keys with name: value pairs underneath
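One way to picture "keys with name: value pairs underneath" is a dictionary of dictionaries. The sketch below is only a mental model, not a Cassandra client: `insert` and `get_column` are hypothetical helpers, and the extra `Twitter` column is made-up sample data.

```python
# A column family modelled as: row key -> {column name: value}.
users = {
    "Jsmith": {"Password": "ch@ngem3a"},
    "Jbrown": {"Gender": "Male", "Phone": "01382 345078"},
}


def insert(column_family, row_key, columns):
    """Add or overwrite columns on a row, creating the row if needed."""
    column_family.setdefault(row_key, {}).update(columns)


def get_column(column_family, row_key, name, default=None):
    """Read one column; a missing column is simply absent, not NULL-padded."""
    return column_family.get(row_key, {}).get(name, default)


# A new kind of column can be added to one row without touching the others.
insert(users, "Jsmith", {"Twitter": "@jsmith"})

print(get_column(users, "Jbrown", "Phone"))
print(get_column(users, "Jsmith", "Gender"))
```

Each row carries only the columns it actually has, so sparse data costs nothing and new fields need no schema change.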
Interfacing to Cassandra
Based on Thrift
http://thrift.apache.org/
Large number of Languages supported
http://wiki.apache.org/cassandra/ClientOptions
I’ve used Java and Hector
http://prettyprint.me/
Although there is a Csharp version
http://hectorsharp.com/
Cassandra JDBC
Very new, difficult to know how stable it is
Needs compiling and libraries not in Cassandra!
http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/
Astyanax
From Netflix
Based on Hector but said to be a lot simpler!
https://github.com/Netflix/astyanax/wiki
jBloggyAppy a demo app of Cassandra
All Source code on Github
https://github.com/acobley/jBoggyAppy
Feel free to use and abuse
Simple blogging App
A word on using OpenSource software
Versioning!
Things Change!
Documentation is wrong!
http://prettyprint.me/
End up reading unit tests to actually program.
One Last thing
Dundee DDD 17th November, Big Data track
Anyone interested in speaking?