TRANSCRIPT
Databases
A dynamic web site needs permanent data, and a database is usually the place to store it
choices
embedded
nosql
sql
injection
scale
1
The choices
There are two main choices to make:
an embedded versus a served database system
an SQL versus a NoSQL database system
The short summary for this unit is going to be that you should choose an embedded database system, and you should choose an SQL database system
Don't just follow fashion, and beware that textbooks copy each other and are full of incredibly obsolete garbage
2
Fashion
Most under-educated amateurs choose database systems according to fashion, i.e. they follow the crowd and copy stuff out of tutorials without knowing why it works or whether it is the right choice
As ever, tutorials show you how but don't tell you why
As with all issues in this course, you should try to understand more deeply, and make informed choices
2a
Embedded versus served
A database is stored in a file (or sometimes a set of files)
Only one process/thread should access the file (sqlite)
In embedded databases, this is the web server, and in served databases, it is a separate database server:
3
Sqlite sharing
Sqlite is rare in allowing multiple processes/threads to access the same file
It uses whole-file locking, which is limited in its efficiency in highly concurrent situations, so you probably shouldn't use this feature
It is extremely unfair to criticize sqlite for this
3a
Choosing embedded/served
An embedded database is simpler, easier to run, and more efficient (no extra network overhead for transporting data)
A served database allows multiple web server processes/threads to share the same database
you should choose embedded for easy marking
upgrading to a served database to scale up is easy
served databases aren't always best for scaling up
Maintenance is an issue, and scaling up is complex - see later
4
Embedded is simpler
A served database provides a complex protocol for handling queries over a network connection
This protocol is not standardized - it is different for each database system
On top of that, there is usually a system for caching the data in the web server, so there is complexity at both ends
4a
Embedded is easier
With a served database, you have to manage two servers (web and db), and worry about the configuration, especially security, of both servers
In a commercial setting, there are extra operational problems (what happens if one server goes down, or if the web server comes up before the database server is ready to accept queries)
For marking, I am not prepared to run two servers to mark each submitted web site
4b
Embedded is faster
With a served database, each network request to the web server may result in another network request to the database server
This can effectively 'double' the network load, and network load is often the efficiency bottleneck
When the two servers are running on the same computer, there should be no actual network traffic, but there are still overheads with using the query protocol
4c
Maintenance
With an embedded database, you can't normally use a command line tool to update the database while the server is running (with a served database, the tool can connect to the running server)
That's because two processes can't access the same database file
Either you have to shut down the server, or you can provide a privileged web page for data maintenance on your site
4d
NoSQL databases
NoSQL databases are simpler than SQL databases, so let's deal with them first
They are special purpose databases
There are different types for different special purposes
The most common type is the 'document store'
MongoDB, the current fashion in JavaScript, is a document store
object oriented databases
5
Object oriented databases
The idea of an object oriented database is that it lives in memory as a collection of (normal) objects
There is no overhead in fetching the data or converting it into object format
It is stored on disk like a write-through cache to make the data permanent (so it survives a server restart)
NoSQL database systems are a long way from being OO database systems, and OO database systems are not very popular or successful anyway - see next few slides
5a
NoSQL is not OO
NoSQL databases are not OO because they lack several features of conventional SQL databases which are well known, and which need to be implemented in memory to form a proper database
These are concurrency (thread-safety and thread-efficiency), transactions (ACID consistency), and indexes (to speed up searches)
Roughly speaking, these require versioning (for read consistency) and locking (for write efficiency)
5b
Why not OO databases?
Why are OO database systems unpopular and unsuccessful?
One problem is that OO databases don't seem to fit well with refactoring
In classed languages like Java, if you update the classes that the database objects belong to, the data becomes invalid, so it needs to be massaged before it can be used again
Also, OO databases can only be used by programmers; they make no sense to non-experts
5c
Future of OO databases?
In my opinion, OO databases make a great deal of sense, and deserve to be re-thought
One thing that needs doing is to build OO database support into the programming language (Java's JPA system, bolted on from the outside, is incredibly complex)
Another is that databases need to be included in the program development process, so that refactoring of the data is done naturally alongside refactoring of the code
5d
When does NoSQL work?
NoSQL databases of the document store type work best when the data consists of separate objects which may have internal structure, but which have no relationships or cross-references between them, and no important consistency requirements
For example, if you set up a shopping site, and your data objects are customers and products, everything is fine
But as soon as you add sales, you are in trouble, because a sale object references a customer and a product
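As a sketch of the problem (the field names are my invention), here is what those documents might look like in a store:

```javascript
// Hypothetical documents in a document store.
// The customer and product are self-contained, so a document store copes fine:
var customer = { username: "cd67890", name: "Chris" };
var product = { id: 53, breed: "fish" };
// The sale must cross-reference both; the store gives no guarantee that the
// referenced customer or product still exists when the sale is read later:
var sale = { username: "cd67890", productId: 53, price: 50 };
```

Nothing is duplicated, but nothing enforces that the references stay valid either.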
6
Eventual consistency
MongoDB is popular because it is simple, so it works well for undereducated amateurs who don't know or care about relationships or consistency
It can also work well for experts in very large distributed projects, if the data is suitable and if they realize that they (and not the database system) are responsible for the implementation of consistency issues
These large projects require eventual consistency, the hardest kind to implement
7
Advice
In my opinion it is usually a mistake to start with a NoSQL database
That's because, by the time you realize that it is inadequate for your kind of data, it will probably be exceedingly difficult to upgrade to an SQL database
8
NoSQL horror stories
Here are some things that go wrong with NoSQL:
relations (TV shows)
transactions (bitcoins)
resources
queries
security
You should not use MongoDB in this unit (for ease of marking, you should use an embedded system)
If you must use the MongoDB API (highly non-recommended), then use something like TingoDB (an embedded version)
9
TV shows
In one famous case (a minor example within this story), MongoDB was used to set up a system for accessing TV shows online
MongoDB worked well, because each episode was a self-contained object, and there were no cross-references
Then, one day, an extra feature was requested - to be able to click on an actor listed in an episode, and see all the other TV shows the actor was in
See the next two slides for more details
9a
The data
First, there was a problem with the mass of data already collected
Each actor was listed in different TV episodes using slightly different strings - extra spaces, or full stops, or initials versus first names, or ...
So how do you know when it is the same actor?
String comparison is not good enough
A lot of effort had to be put into cleaning up the data, giving each actor a unique id
9b
The outcome
As well as cleaning up the data, the relationships between the actors and the episodes had to be set up
Although there are lots of claims about being able to represent relationships in MongoDB, it isn't usually viable
So, with a lot more effort, the data was moved to an SQL database, where it should have been in the first place
9c
Bitcoins
In another case, a bitcoin site was set up
It is rather like a bank, with accounts, where a transaction consists of taking some money out of one account and putting it into another
A system was set up using MongoDB, and then it was discovered that the data was becoming inconsistent, i.e. money was evaporating or being generated out of nowhere
See next slide for more details
9d
Transactions
The point is that the transfer of money between accounts involves updating two objects
And the pair of updates together must act as an indivisible transaction, i.e. both updates must work or neither
MongoDB is completely incapable of making such a guarantee - the system had to be switched to an SQL database
9e
Resources
Some people are horrified to discover that the files that MongoDB creates are huge, or that the amount of memory it reserves when you start it up is huge
This is even if the MongoDB database is empty
It is because MongoDB is optimized for big data
It is a solvable problem, but why should you have to solve it?
9f
Queries
Many people find they have a need to make complex queries on MongoDB objects
So, they add a module which allows them to make SQL queries on MongoDB objects, without realizing how utterly stupid this is - MongoDB is a NoSQL database - the clue is in the name
What they are doing is handling relationships with MongoDB in a completely implicit and non-robust way - relationships require consistency guarantees
9g
Security
It is very common to create an insecure system using MongoDB
See survey
Or see study
This is a solvable problem, and in fact the MongoDB suppliers have been persuaded to improve the defaults, but the fundamental problem is that people generally don't know or care about security - a professional must take care over security
9h
SQL databases
SQL databases are general purpose, and they are all essentially the same
They are also called relational databases because they fit the theory of relations (roughly)
It takes a little knowledge and skill to (a) fit your data to the relational model, and (b) use the SQL language to extract data
But it is not rocket science (and if your data is trivial enough for MongoDB to be suitable, then the SQL approach is easy too!)
10
Criticisms
SQL is not as relational as it should be - see criticism and the third manifesto
It produces results in implementation-dependent order, the results may include duplicates, problems arise with limit clauses etc which depend on order, and there is a three-valued logic based on null which is incomplete and doesn't work, and so on
See notes on purifying it
10a
Pet store
A simple pet store database might have three tables, for animals, customers, and sales:

animals
id  breed
42  dog
53  fish

customers
username  name
ab12345   Pat
cd67890   Chris

sales
id  username  price
42  ab12345   100
53  cd67890   50
11
Relationships
Each row of a table is an object, and each column is a field
The animals and customers are simple because they have no relationships
The sales table is more subtle, because it uses ids and usernames to represent relationships (cross-references) between the sales table and the other two tables, to avoid duplicating any information
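As a sketch (the column types are my assumption; the tables above leave them out), the three tables could be declared like this, with the cross-references made explicit:

```sql
create table animals (id integer, breed text, primary key (id));
create table customers (username text, name text, primary key (username));
create table sales (id integer, username text, price integer,
  foreign key (id) references animals(id),
  foreign key (username) references customers(username));
```

The foreign key clauses record exactly which columns in sales refer to which columns in the other two tables.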
12
sql
SQL
A statement in SQL to fetch all the animal objects is:
select * from animals
The result, in JavaScript, is a set of objects, where each object has fields id and breed
SQL is designed to be programming-language-neutral, i.e. it can be used inside C, C++, Java, JavaScript or any other language, object oriented or not
It is also designed to be used as an interactive command language, so that databases can be handled manually, outside of the web server
13
The SQL language
SQL is a narrow-purpose declarative scripting language for searching and sorting database records
SQL is supposed to be standardised across database systems, but (a) the standard is inaccessible and old and woolly and (b) database systems don't always stick to it
The result is that SQL is reasonably standard as long as it is a human who is moving from system to system, but if you move actual SQL statements from system to system, you should expect to have to adjust them
13a
SQLite
The recommended system for this unit is SQLite (pronounced s-q-l-lite or sequelite)
With the sqlite3 node module, the callback style has to be used, so these notes assume that the sqlite node module is installed instead, which uses promises so that the async/await keywords can be used.
command line tools
why lite?
documentation
other SQL systems
14
Command line tools
Instead of accessing the database via a program with SQL strings in it, you can install a command line tool which allows you to type SQL commands interactively
You could use the sqlite3 command (not the node module), or SQLite Browser, or SQuirreL (which accesses any type of database)
This can be useful to maintain the database, e.g. initialize it, but you can't safely access the database this way while the web server is using it, so the usefulness is limited
14a
Why lite?
What does the 'lite' in SQLite refer to?
It means that SQLite is small and simple, suitable for small devices such as mobiles, so what do you lose?
It is embedded and not served, it has loose types in the JavaScript style, some features have to be 'switched on', and you have to think hard about scaling up
But, contrary to rumour, it doesn't compromise on concurrency (though Node is single-threaded anyway)
14b
SQLite documentation
Most database systems are written in C or C++, including SQLite (and MongoDB)
So, for SQLite, there are two parts to the documentation, one not language specific, one JavaScript
General SQLite documentation
Node sqlite3 module documentation
14c
Other SQL systems
SQL systems are all very similar, and the SQL language is portable (with a few changes) between systems
Desirable features of an SQL database system are zero configuration and robustness
See the next few slides for details
14d
Configuration
Configuration doesn't mean things like adding an index to a table for efficiency - any database system needs that - it means things like creating data files, choosing file sizes, choosing limits, choosing text encodings, switching on foreign key support, upgrading, ...
Configuring databases can be a nightmare - because of complexity and/or poor documentation
Good modern designs just work, because all the configuration is automatic
14e
Robustness
1) data must not be corrupted by a program crash, operating system crash, or power failure - use a database system which has had extensive crash testing
2) transactions should be serializable (any other isolation level, such as read committed, is a fudge)
3) there should be automatic consistency checking, e.g. for foreign keys
Most common database systems are OK for robustness
14f
Stable storage
Beware that no database system can guarantee robustness unless running on stable storage
And normal stable storage (RAID) isn't perfect, because (for efficiency) the hardware confirms that data has been written when it hasn't - it is still in a hardware cache waiting to be written
True robustness only comes from inefficiently configured specialist hardware
14g
Opinions
Based on experience, I would say:
Oracle: incredibly reliable, incredibly expensive and obsolete, requires a training course to configure
MySQL: incredibly popular, too proprietary, configuration and documentation very poor
MariaDB: newer open version of MySQL, better but configuration still poor
PostgreSQL: OK
Derby: excellent all round, but effectively Java only
14h
js
Creating a database
Here is a program to create a database file:

"use strict";
var sqlite = require("sqlite");
create();

async function create() {
    try {
        var db = await sqlite.open("./db.sqlite");
        await db.run("create table animals (id, breed)");
        await db.run("insert into animals values (42,'dog')");
        await db.run("insert into animals values (53,'fish')");
        var as = await db.all("select * from animals");
        console.log(as);
    }
    catch (e) { console.log(e); }
}

interactive
database files
semicolons
sample data
meta commands
scripts
callback style
15
sh
Interactive creation
Using the sqlite3 command:

> sqlite3 data.db
> create table animals (id, breed);
> insert into animals values (42, "dog");
> insert into animals values (53, "fish");
> select * from animals;
> .quit
15a
Database files
When you open data.db, that says you want to use the file data.db to hold the data
SQLite will create the file the first time, if it doesn't exist
If it does exist, SQLite will connect to it and give you access to all the data stored permanently in it
The file can be copied, renamed or moved as you like
15b
Semicolons
When you type SQL commands interactively, you need semicolons
Leaving out a semicolon indicates that you want to continue the command on the next line
When you use SQL commands in a program, you don't need semicolons, because each command is a self-contained string
15c
Sample data
The two insert commands are just there to insert example data to try out SQL
You can delete the sample data with delete from animals; if you want to, before using the database in your server
15d
Meta command
A meta command is a command provided for interactive use which goes beyond the SQL language
For example, .exit is clearly never needed in a program
Other useful meta commands are .help, .tables, .headers on
Meta commands don't need semicolons at the end
The SQL command pragma foreign_keys = on; is useful for enabling automatic consistency checks
15e
create.sql
Scripts
A script is a file containing SQL commands, e.g.

create table animals (id, breed);
insert into animals values (42, "dog");
insert into animals values (53, "fish");
select * from animals;

This allows you to remember, edit and redo the commands again later, if you feel the need to start again from scratch
You can redo by copying and pasting the commands, or with .read create.sql
15f
create.js
Callback style
If you use the sqlite3 module, the create program is:

"use strict";
var sql = require("sqlite3");
var db = new sql.Database("data.db");
db.serialize(create);

function create() {
    db.run("create table animals (id, breed)");
    db.run("insert into animals values (42,'dog')");
    db.run("insert into animals values (53,'fish')");
}
See next slide for serialize
15g
js
Serialize method

db.serialize(create);
function create() {
    db.run("...");
    db.run("...");
    ...
}

The serialize method calls a function in sequential mode; the calls to run are chained, as if each had a callback function which triggered the next
serialize returns straight away, with the queries still outstanding, but automatically linked by callbacks so they happen in order
15h
fetch.js
Fetching rows

"use strict";
var sqlite = require("sqlite");
fetch();

async function fetch() {
    try {
        var db = await sqlite.open("./db.sqlite");
        var as = await db.all("select * from animals");
        console.log(as);
    }
    catch (e) { console.log(e); }
}

[ { id: 42, breed: 'dog' }, { id: 53, breed: 'fish' } ]

closing
callback style
16
Closing a database
Tutorials often show a call db.close() to close the connection to the database
Don't do this in a server; open the database once at the start, and keep it open for every request
You would only use db.close() if you have implemented a clean way to shut down the server; but most people just use CTRL/C (which is OK, because the system does frequent commits)
16a
fetch.js
Fetching rows

"use strict";
var sql = require("sqlite3");
var db = new sql.Database("./db.sqlite");
db.all("select * from animals", show);

function show(err, rows) {
    if (err) throw err;
    console.log(rows);
}
[ { id: 42, breed: 'dog' }, { id: 53, breed: 'fish' } ]
16b
single.js
Fetching a single row

"use strict";
var sqlite = require("sqlite");
fetch();

async function fetch() {
    try {
        var db = await sqlite.open("./db.sqlite");
        var as = await db.get("select * from animals where id=42");
        console.log(as);
    }
    catch (e) { console.log(e); }
}

{ id: 42, breed: 'dog' }

If you only want one object, you can use get instead of all (to avoid a list with one entry)
callback style
17
single.js
Callback style

"use strict";
var sql = require("sqlite3");
var db = new sql.Database("./db.sqlite");
db.get("select * from animals where id=42", show);

function show(err, row) {
    if (err) throw err;
    console.log(row);
}
{ id: 42, breed: 'dog' }
17a
js
js
Inserting, updating

"use strict";
var sqlite = require("sqlite");
insert();

async function insert() {
    var db = await sqlite.open("./db.sqlite");
    await db.run("insert into animals values (64,'cat')");
}

"use strict";
var sqlite = require("sqlite");
update();

async function update() {
    var db = await sqlite.open("./db.sqlite");
    await db.run("update animals set breed='terrier' where id=42");
}
callback style
18
js
js
Callback style

"use strict";
var sql = require("sqlite3");
var db = new sql.Database("data.db");
db.run("insert into animals values (64,'cat')");

"use strict";
var sql = require("sqlite3");
var db = new sql.Database("data.db");
db.run("update animals set breed='terrier' where id=42");
18a
sql
Joins
A statement in SQL to load up all the sales, together with the data about each animal and customer, is:

select * from sales
join animals using (id)
join customers using (username)

The column name id must match in sales and animals, and similarly for username
The result, in JavaScript, is a set of objects, where each object has fields price, id, breed, username, name
natural joins
other alternatives
19
sql
Natural joins
An alternative is

select * from sales
natural join animals
natural join customers

This automatically finds the matching column names - so you have to be very careful about naming the columns correctly
In this case id is the only matching column in sales and animals, and the same with username in sales and customers
19a
sql
sql
Alternatives
Other alternatives which are more explicit about column names are:

select * from sales
join animals on animals.id = sales.id
join customers on customers.username = sales.username

select * from sales, animals, customers
where animals.id = sales.id
and customers.username = sales.username

select * from sales as s, animals as a, customers as c
where a.id = s.id and c.username = s.username
19b
ORMs
If you really don't like SQL, you can use an ORM (Object-Relational Mapping) such as sequelize
This allows you to access the data as objects, instead of using SQL
But I don't really recommend this approach, because (a) it doesn't give you the full power of SQL, only a simplified subset, and (b) if you are only going to need a simplified subset, the SQL shouldn't be that difficult
20
js
SQL Injection
SQL injection, also called SQL poisoning, means attacking a site as a hacker, using clever text to pervert SQL queries
Suppose you join strings to build an SQL query:

a = url.query.address;
q = "update people set address='" + a + "' where ...";

Suppose an attacker types this address into your form:

anywhere', auth='admin
They could give themselves admin privileges
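To see exactly why, here is a minimal runnable sketch of what the concatenation produces (the people table and the where clause here are hypothetical, standing in for the elided query above):

```javascript
// The attacker's form input:
var a = "anywhere', auth='admin";
// The naive string concatenation from the slide above:
var q = "update people set address='" + a + "' where username='ab12345'";
// q now sets two columns instead of one:
// update people set address='anywhere', auth='admin' where username='ab12345'
console.log(q);
```

The attacker's quote characters close the address string early, so the rest of their input becomes part of the SQL itself.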
21
Prepared statements
The best way to guard against SQL injection, and to make your queries more efficient, is to use prepared statements; instead of db.run, do this:

var ps = await db.prepare("update animals set breed=? where id=?");
await ps.run("terrier", 42);
await ps.finalize();

The first line can be done once, when the server starts up, and the third line can be done once when the server shuts down (if you have a clean shutdown)
Note: this needs to be done for all query types, including select
22
Keys
A primary key is an id which identifies a row/record, e.g. student username or unit code

create table pets (name text, kind text, primary key (name))

No two records can have the same primary key (and keys can't be null)
A secondary key is a column (or group) which is kept unique by the database, e.g. student candidate number
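As a sketch (the table and column names are my invention), a student table with both kinds of key might look like:

```sql
create table students (
  username text,       -- primary key: identifies the row
  candidate integer,   -- secondary key: kept unique by the database
  name text,
  primary key (username),
  unique (candidate)
);
```

The unique constraint makes the database reject a second student with the same candidate number, just as the primary key rejects a duplicate username.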
23
Foreign keys
There are no pointers; references are via keys
A foreign key is a column (or group) in one table which references a primary key in another:

animals
id  breed
42  dog
53  fish

sales
id  username
42  ab12345
53  cd67890

Foreign key consistency means the foreign key column always refers to an existing primary key
24
sql
Setting up foreign keys

create table animals (id, ...
  primary key (id)
);
create table sales (id, username, ...
  foreign key (id) references animals(id)
);
pragma foreign_keys = on;
25
Blobs and Clobs
An SQL database system usually has the ability to support Blobs (Binary large objects) and Clobs (Character large objects)
I suggest not using these features
For example, with photos, it is usually better to store each photo as a file, and put the filename into the database
Generally, the database should hold the data which you use to make decisions, not structureless data
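For instance, a photos table might hold just the filename (a sketch; the table and column names are mine):

```sql
-- The photo bytes stay on disk; the database records where they are
create table photos (id integer primary key, filename text);
insert into photos (filename) values ('photos/42.jpg');
```

The server then serves the file directly, and the database row stays small and searchable.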
26
Concurrency
A server is highly concurrent, able to handle many simultaneous requests
A node server handles concurrency in a particular way
In the callback style, each request is divided into function calls, each ending with an I/O operation with a callback
The function calls from separate requests can be interleaved, but two function calls are never executed at the same time, and each function call is never interrupted
27
Concurrency example
A server which does template-filling might look like:

function getTemplate(file, response) {
    fs.readFile(file, "utf8", ready);
    function ready(err, text) { getData(text, response); }
}
function getData(text, response) {
    db.get("select ...", ready);
    function ready(err, row) { fillTemplate(text, row, response); }
}
function fillTemplate(text, row, response) {
    // put row data into text and deliver response
}
27a
Example continued
Imagine two requests A and B arriving almost together
The sequence of function calls might be:

A getTemplate
B getTemplate
A getData
B getData
A fillTemplate
B fillTemplate

Any interleaved sequence is possible, depending on timings, and the server code must handle any of them, i.e. it must be thread-safe
27b
Global variables
Global variables in the server, shared by all the functions, are not generally thread-safe

var sqlite = require("sqlite");
var db;

The sqlite variable is safe because it is constant
The db variable is safe because functions are never interrupted, so a shared database connection is OK (but not if you have multi-query transactions!)
In languages other than nodejs, where interruptions may happen, each simultaneous request needs its own database connection (taken from a pool)
28
Example: DIY auto-ids
If you want a table where each row has an auto-generated id, you might try a global variable:

var nextId = 1;
async function insert(...) {
    var query = "insert into ... values (id=?, ...";
    var ps = await db.prepare(query);
    await ps.run(nextId);
    nextId++;
}

The global variable nextId would be OK if it was incremented in the same function call as using it, but technically this is two function calls separated by await, so it is not safe
29
Example: DB auto-ids
Suppose instead you get the database to do it
In sqlite, that's done by defining an integer primary key column id, say
You only add the autoincrement keyword (which reduces efficiency) if you want to forbid the reuse of old id numbers
Then you do insert queries without providing the id column
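A sketch of what that looks like (the types and names are my choice):

```sql
-- id is an integer primary key, so SQLite generates it automatically
create table animals (id integer primary key, breed text);
-- no id column is provided; the database picks the next id
insert into animals (breed) values ('dog');
```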
30
Example continued
And suppose, just after the insert, you want to know what the id number is:

async function insert(...) {
    await db.run("insert ...");
    var id = await db.get("select last_insert_rowid() as id ...");
}

This is not safe, because another request could insert a row after the run, before the get
31
Example solved
So what is the best way to sort out auto-ids?
Answer: scour the database documentation to find out how to insert-and-fetch-id as an indivisible operation:

async function insert(...) {
    var x = await db.run("insert ...");
    var id = x.lastID;
}
32
In-memory data
Suppose you want data available instantly as objects in memory
This is more than just a caching system inside the database, because it eliminates translation between data formats
In the JavaScript world, systems like MongoDB can do this, but you lose consistency guarantees in the process
You need a proper OODB, but there aren't any good ones
33
Scale
Scaling up is difficult
The most important thing is overall design
It is difficult to give good advice, because every project has different requirements, so let's just look at a few issues in the context of some case studies
34
Facebook
Facebook is huge, so how do they organize their DB?
They use many different database systems for different purposes, but they mostly use SQL - their most important data is relationships(!), so only SQL will do
They have one giant central database, and local copies
When you update, the central DB is updated, and your browser is sent to the central database for 20 seconds, to give time (they hope!) for your update to reach your nearest local copy
35
Yahoo
Yahoo use the PNUTS system, which works the other way round
Your data is updated in your local database, which provides consistency guarantees for local data
Then the data is shared between the local databases to form a virtual global database
In that virtual database, consistency isn't guaranteed, but inconsistencies are extremely unlikely to matter
36
SAFE
The SAFE system in the faculty was initially designed and set up by me, with a lot of help from Martin Baker and later other people, though I no longer have any responsibility for it
Previously, the computer science web site (when it existed) did all that SAFE does and more, and SAFE was the result of a challenge to do something similar at the faculty level
SAFE is hideously complex (too complex, really) - it does all sorts of behind-the-scenes stuff
It is being retired at the same time as me!
37
Distributed or not?
Perhaps the biggest question we faced was whether SAFE would run on one computer, or on several for robustness or load balancing or database size reasons
We looked into the robustness issue, and our conclusion was that multiple servers would make the system more complex and less robust
(True robustness is only possible with special and expensive hardware, and a full time person to run the system - as 'amateurs' we couldn't afford either)
38
Robustness
Although we couldn't use multiple servers for robustness, we needed to do something
We opted for (a) keeping the database continually and safely backed up and (b) having a backup computer ready to take over
It is a cold backup site, not a hot one, because we can't afford the hardware or staff
39
Load Balancing
Did we need multiple servers for load balancing?
Load balancing is used, for example, in the university's Blackboard system
But the lack of responsiveness of Blackboard is mainly due to the depth of each request (the number of layers of software and networking it goes through to get down to the data), and load balancing doesn't help there
Multiple servers, and the probably necessary networked access to the database that results, may make that worse
40
Load
Assuming that network load is the efficiency bottleneck, could a single server cope with the load?
On the computer science system, the maximum load we measured was a quarter of a million hits in a day, i.e. 5 per second
We would expect SAFE to peak at around 25 hits per second
Given the university's good infrastructure, that isfeasible on a single server
41
Database size
That leaves database size as the only likely reason to need multiple servers
Distributed databases are where NoSQL databases are supposed to come into their own, and it is why they are advertized as 'scalable'
But there is a problem known as the CAP theorem, which means, essentially, that NoSQL databases cannot provide consistency
42
Kind of data
NoSQL databases are brilliantly scalable if your data is suitable
They are scalable precisely because they give no consistency guarantees
The type of data stored in SAFE (students, programmes, units, assignments, marks, ...) has huge amounts of cross-referencing and huge consistency requirements
A NoSQL database could not possibly work
43
Local versus shared
But the data held in the university as a whole is too big to avoid distributed data
So, the university has a strategy: divide the data into coherent local SQL databases, with each piece of data only being updated within its relevant database, and with consistency guarantees within each database
The data which must be shared is duplicated in every database by overnight copying (which works very well, though some data may be up to a day out of date)
44
Requirements for SAFE
SAFE is almost entirely responsible for day-to-day data concerning our faculty's students and units
But it needs central info about students from other faculties taking our units, and units from other faculties taken by our students
And it needs to report some of our faculty's data for central storage, to be shared by other faculties
This is all done by automatic overnight processes
45
Cross-database queries
Previously, we had done some cross-database queries in order to fetch data overnight
An example might be "fetch the records from a central table which are relevant to the students listed in our local table"
These queries only work when both databases use the same system, and we wanted to use a different database system for SAFE
46
Experiment
To check the feasibility of switching to a different database system, we did an experiment
We took our biggest cross-database query and replaced it with a query on the remote database alone, of the form "fetch all the data from the central table", and we wrote code to check each record as it arrived, discard it if not relevant, and update our own database if relevant
To our surprise, this was 100 times more efficient
Cross-database queries are an exceedingly crude mechanism, of use only to non-programmers
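The fetch-all-and-filter replacement can be sketched as follows. This is a minimal illustration, not the actual SAFE code: `centralRecords`, `localStudentIds` and `updateLocal` are stand-ins, and real code would stream rows from the remote database rather than hold them in an array:

```javascript
// Sketch of the replacement: instead of a cross-database join, fetch
// every record from the central table and filter locally as rows arrive.
function syncFromCentral(centralRecords, localStudentIds, updateLocal) {
  let kept = 0;
  for (const record of centralRecords) {
    if (!localStudentIds.has(record.studentId)) continue; // discard if not relevant
    updateLocal(record); // update our own database with the relevant record
    kept++;
  }
  return kept;
}
```

The filtering is trivial work per row, so shipping all the rows and discarding most of them locally can still beat a cross-database query, as the experiment found.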
47
Database size
The SAFE database is exceedingly complex (hundreds of tables with hundreds of cross-reference fields) and large (hundreds of thousands of records in the largest tables)
But, by moving all the bulk unstructured data out into files, the total size is about a gigabyte, which easily fits in one server's memory, let alone one server's disk
So, the final design decision was: use a single server with a single SQL database
48
Embedded or served?
Our previous experience had been with served databases, because that's what the university uses (it has a site-wide Oracle licence)
But we knew from experience that the network is the efficiency bottleneck, and that using a served database can roughly double the network load
So one of the main reasons for wanting to stick to a single server and database was so that the database could be embedded
49
Queries
The SAFE database is a Derby one, within a Java server
How efficient are queries?
If you use prepared statements (as you always should), the first time you execute a query it is slow, and after that it is blindingly fast
That's because Derby compiles the query to Java bytecode, and then Java's JIT compiler compiles it to machine code!
50
Query bottleneck
But we knew from experience that database queries would probably form a second efficiency bottleneck
So, we planned to do better
On a system like SAFE, the number of read-only accesses vastly outweighs the number of update accesses, which gives an opportunity for improvement
The improvement is to hold the data in memory as objects, and use the database as a write-through cache
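The write-through idea can be sketched like this. It is a minimal sketch, not SAFE's implementation: the `db` argument is a stand-in for a real embedded database, assumed to offer `loadAll` and `save` operations:

```javascript
// Write-through cache sketch: all data lives in memory as objects,
// reads never touch the database, and every update goes to the
// database before the in-memory copy changes.
class WriteThroughStore {
  constructor(db) {
    this.db = db;
    this.cache = new Map(db.loadAll()); // load everything into memory at startup
  }
  get(key) {
    return this.cache.get(key); // read-only access: no database round trip
  }
  set(key, value) {
    this.db.save(key, value);   // persist first, so no update can be lost
    this.cache.set(key, value); // then refresh the in-memory object
  }
}
```

Because reads dominate on a system like SAFE, almost all traffic is served straight from memory, and the database is only touched on the rare update.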
51
OODB
The result is effectively an Object Oriented Database, with these features:
read-only accesses involve no copying of data
read-only accesses involve no memory turnover
ACID transactions are supported in memory
indexes are implemented in memory
We did a survey, and found that there were no suitable OODB systems (either they involved copying, or they left transactions up to the programmer)
52
DIY
I designed an in-memory database (called Adapt, now defunct)
Since SAFE was set up, JPA has come along, which has the right properties
But if you look into it, it is hideously complex (more than Adapt, which is already too complex)
After a lot of thought and experiment, my conclusion is that Java's flaws prevent a simple approach (and no other OO language is better)
53
Threads
Although we went for a single server, we still wanted threads
JavaScript's asynchronous approach effectively provides pseudo-threads, but real threads are needed in Java - SAFE has 200
The original SAFE server was an 8x8 processor (8 cores, 8 threads each), which speeds up the server by a factor of between 8 and 64
There are no problems because Java, Derby and Adapt are all designed to handle many threads
54
JavaScript
But what if you want to go multi-threaded in the JavaScript world?
Node is single-threaded, though it does an excellent job of handling simultaneous queries with a single thread (effectively providing pseudo-threads)
But it can't make use of multiple cores - it only ever uses one of them - so how can you speed things up?
55
Node
The standard way to go multi-threaded with Node is to run several copies of your server, one on each processor core (technically, these are separate processes)
You need a load balancer to direct each incoming request to one of the copies
You can use an off-the-shelf one such as nginx, but you can write a simple one yourself in five lines of code
56
Sessions
It is important to keep the communication between server copies to a minimum
One issue is sessions; a session is a way of remembering a login, plus some details of the logged-in user
A session can be treated as volatile, i.e. stored only in memory while the server is running, and discarded when the server goes down
57
Local Sessions
Sharing sessions between server copies is a severe complication
You can avoid it by arranging for the load balancer to send requests from the same user to the same copy of the server - the only one which holds that user's session data
What's needed is for the load balancer to hold a map from session ids to server copy ids (so now the load balancer will be ten lines of code instead of five!)
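The sticky-session map can be sketched like this. It is a minimal illustration, assuming the session id has already been extracted from a request cookie; known sessions go to the copy holding them, and new sessions are assigned round-robin:

```javascript
// Sticky-session chooser sketch: the balancer remembers which server
// copy holds each session, so a user's requests always reach the
// copy with that user's session data
function makeStickyChooser(copies) {
  let next = 0;
  const sessionToCopy = new Map(); // session id -> server copy
  return sessionId => {
    if (sessionId && sessionToCopy.has(sessionId)) {
      return sessionToCopy.get(sessionId); // known session: same copy as before
    }
    const copy = copies[next++ % copies.length]; // new session: round-robin
    if (sessionId) sessionToCopy.set(sessionId, copy);
    return copy;
  };
}
```

One consequence of this scheme is that sessions held by a crashed copy are lost, which is acceptable precisely because sessions are treated as volatile.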
58
Sharing the database
The only issue left is for the server copies to share access to the database
Anything that the servers need to share can be put in the database
The simplest option is to switch to a served database, e.g. from sqlite3 to postgresql
This isn't a big deal, especially compared to the overall amount of work needed to design and build a scaled-up system
59
Data APIs
There is another option which is worth considering
Lots of sites provide a data API (Google Maps, social networking sites, ...)
You send a request asking for data, and you get back the data you asked for
Each uses a customized protocol, based on the design of the data which the site holds
Each protocol is extremely simple, nothing like the general querying protocol provided by a served database
60
Custom database
Following the API model, what you can do is to run one database process alongside the server processes
It isn't a served database, but instead an embedded database with code which you write to provide a data API for your data
So the database may as well be sqlite3
The server copies each use the data API to interact with the database process
This is probably as simple and efficient as you can get
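The heart of such a data API is a small request handler inside the database process. This is a sketch only: the `db` object here is a stand-in for an embedded database such as sqlite3, the operation names are hypothetical, and in practice the handler would sit behind a small HTTP or socket server:

```javascript
// Sketch of a custom data API: server copies send small JSON requests
// naming an operation, and get JSON data back - nothing like the
// general querying protocol of a served database
function handleApiRequest(db, request) {
  switch (request.op) {
    case "getStudent": // fetch one student's record
      return { ok: true, data: db.students.get(request.id) ?? null };
    case "putMark":    // record a mark for a student
      db.marks.set(request.id, request.mark);
      return { ok: true };
    default:
      return { ok: false, error: "unknown op" };
  }
}
```

Because the protocol is designed around your data, each operation does exactly one job, and the single database process keeps all the concurrency control in one place.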
61