
Databases

A dynamic web site needs permanent data, and a database is usually the place to store it


The choices

There are two main choices to make:

an embedded versus a served database system

an SQL versus a NoSQL database system

The short summary for this unit is going to be that you should choose an embedded database system, and you should choose an SQL database system

Don't just follow fashion and beware that textbooks copy each other and are full of incredibly obsolete garbage

Fashion

Most under-educated amateurs choose database systems according to fashion, i.e. they follow the crowd and copy stuff out of tutorials without knowing why it works or whether it is the right choice

As ever, tutorials show you how but don't tell you why

As with all issues in this course, you should try to understand more deeply, and make informed choices

Embedded versus served

A database is stored in a file (or sometimes a set of files)

Only one process/thread should access the file (sqlite is an exception)

In embedded databases, this is the web server, and in served databases, it is a separate database server

Sqlite sharing

Sqlite is rare in allowing multiple processes/threads to access the same file

It uses whole-file locking, which is limited in its efficiency in highly concurrent situations, so you probably shouldn't use this feature

It is extremely unfair to criticize sqlite for this

Choosing embedded/served

An embedded database is simpler, easier to run, and more efficient (no extra network overhead for transporting data)

A served database allows multiple web server processes/threads to share the same database

you should choose embedded for easy marking

upgrading to a served database to scale up is easy

served databases aren't always best for scaling up

Maintenance is an issue, and scaling up is complex - see later

Embedded is simpler

A served database provides a complex protocol for handling queries over a network connection

This protocol is not standardized - it is different for each database system

On top of that, there is usually a system for caching the data in the web server, so there is complexity at both ends

Embedded is easier

With a served database, you have to manage two servers (web and db), and worry about the configuration, especially security, of both servers

In a commercial setting, there are extra operational problems (what happens if one server goes down, or if the web server comes up before the database server is ready to accept queries)

For marking, I am not prepared to run two servers to mark each submitted web site

Embedded is faster

With a served database, each network request to the web server may result in another network request to the database server

This can effectively 'double' the network load, and network load is often the efficiency bottleneck

When the two servers are running on the same computer, there should be no actual network traffic, but there are still overheads with using the query protocol

Maintenance

With an embedded database, you can't normally use a command line tool to update the database while the server is running (with a served database, the tool can connect to the running server)

That's because two processes can't access the same database file

Either you have to shut down the server, or you can provide a privileged web page for data maintenance on your site

NoSQL databases

NoSQL databases are simpler than SQL databases, so let's deal with them first

They are special purpose databases

There are different types for different special purposes

The most common type is the 'document store'

MongoDB, the current fashion in JavaScript, is a document store


Object oriented databases

The idea of an object oriented database is that it lives in memory as a collection of (normal) objects

There is no overhead in fetching the data or converting it into object format

It is stored on disk like a write-through cache to make the data permanent (so it survives a server restart)

NoSQL database systems are a long way from being OO database systems, and OO database systems are not very popular or successful anyway - see next few slides

NoSQL is not OO

NoSQL databases are not OO because they lack several features of conventional SQL databases which are well known, and which need to be implemented in memory to form a proper database

These are concurrency (thread-safety and thread-efficiency), transactions (ACID consistency), and indexes (to speed up searches)

Roughly speaking, these require versioning (for read consistency) and locking (for write efficiency)

Why not OO databases?

Why are OO database systems unpopular and unsuccessful?

One problem is that OO databases don't seem to fit well with refactoring

In classed languages like Java, if you update the classes that the database objects belong to, the data becomes invalid, so needs to be massaged before it can be used again

Also, OO databases can only be used by programmers; they make no sense to non-experts

Future of OO databases?

In my opinion, OO databases make a great deal of sense, and deserve to be re-thought

One thing that needs doing is to build OO database support into the programming language (Java's JPA system, bolted on from the outside, is incredibly complex)

Another is that databases need to be included in the program development process, so that refactoring of the data is done naturally alongside refactoring of the code

When does NoSQL work?

NoSQL databases of the document store type work best when the data consists of separate objects which may have internal structure, but which have no relationships or cross-references between them, and no important consistency requirements

For example, if you set up a shopping site, and your data objects are customers and products, everything is fine

But as soon as you add sales, you are in trouble, because a sale object references a customer and a product
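To make the trouble concrete, here is a minimal sketch in plain JavaScript with no database involved. The object shapes and the findDangling helper are assumptions made up for illustration. Customers and products are self-contained, but a sale holds references to both, and nothing stops a reference from pointing at something that does not exist:

```javascript
"use strict";

// Self-contained documents: no cross-references, so no trouble.
var customers = [{ username: "ab12345", name: "Pat" }];
var products = [{ id: 42, breed: "dog" }, { id: 53, breed: "fish" }];

// A sale refers to a customer and a product by key.
var sales = [
    { product: 42, customer: "ab12345", price: 100 },
    { product: 53, customer: "cd67890", price: 50 }   // no such customer
];

// Hypothetical helper: list the sales whose references do not resolve.
function findDangling(sales, customers, products) {
    var usernames = new Set(customers.map(function (c) { return c.username; }));
    var ids = new Set(products.map(function (p) { return p.id; }));
    return sales.filter(function (s) {
        return !usernames.has(s.customer) || !ids.has(s.product);
    });
}

var dangling = findDangling(sales, customers, products);
console.log(dangling);
```

With a document store, a check like this is the programmer's job, and it has to be repeated wherever the data is updated.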

Eventual consistency

MongoDB is popular because it is simple, so it works well for undereducated amateurs who don't know or care about relationships or consistency

It can also work well for experts in very large distributed projects if the data is suitable and if they realize that they (and not the database system) are responsible for the implementation of consistency issues

These large projects require eventual consistency, the hardest kind to implement

Advice

In my opinion it is usually a mistake to start with a NoSQL database

That's because, by the time you realize that it is inadequate for your kind of data, it will probably be exceedingly difficult to upgrade to an SQL database

NoSQL horror stories

Here are some things that go wrong with NoSQL:

relations (TV shows)

transactions (bitcoins)

resources

queries

security

You should not use MongoDB in this unit (for ease of marking, you should use an embedded system)

If you must use the MongoDB API (highly-non-recommended), then use something like TingoDB (an embedded version)

TV shows

In one famous case (a minor example within this story), MongoDB was used to set up a system for accessing TV shows online

MongoDB worked well, because each episode was a self-contained object, and there were no cross-references

Then, one day, an extra feature was requested - to be able to click on an actor listed in an episode, and see all the other TV shows the actor was in

See the next two slides for more details

The data

First, there was a problem with the mass of data already collected

Each actor was listed in different TV episodes using slightly different strings - extra spaces, or full stops, or initials versus first names, or ...

So how do you know when it is the same actor?

String comparison is not good enough

A lot of effort had to be put into cleaning up the data, giving each actor a unique id
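A minimal sketch of why string clean-up only goes so far. The normalize function and the sample names are made up for illustration; they are not taken from the real case:

```javascript
"use strict";

// Hypothetical clean-up: trim, drop full stops, collapse runs of spaces,
// and ignore case.
function normalize(name) {
    return name.trim().replace(/\./g, "").replace(/\s+/g, " ").toLowerCase();
}

var a = normalize("  Chris  Jones. ");
var b = normalize("chris jones");
var c = normalize("C. Jones");

console.log(a === b);   // true: spacing and punctuation differences disappear
console.log(a === c);   // false: initials versus first names still do not match
```

Cases like the last one need a human decision, or extra information, before two strings can be given the same id.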

The outcome

As well as cleaning up the data, the relationships between the actors and the episodes had to be set up

Although there are lots of claims about being able to represent relationships in MongoDB, it isn't usually viable

So with a lot more effort the data was moved to an SQL database, where it should have been in the first place

Bitcoins

In another case, a bitcoin site was set up

It is rather like a bank, with accounts, where a transaction consists of taking some money out of one account and putting it into another

A system was set up using MongoDB, and then it was discovered that the data was becoming inconsistent, i.e. money was evaporating or being generated out of nowhere

See next slide for more details

Transactions

The point is that the transfer of money between accounts involves updating two objects

And the pair of updates together must act as an indivisible transaction, i.e. both updates must work or neither
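The difference can be sketched in plain JavaScript with in-memory accounts. The account names, amounts and function names are made up, and this illustrates the all-or-nothing idea only, not how any database system implements it:

```javascript
"use strict";

var accounts = { alice: 100, bob: 20 };

// Hypothetical update step, which fails if the account does not exist.
function credit(accounts, name, amount) {
    if (!(name in accounts)) throw new Error("no such account: " + name);
    accounts[name] += amount;
}

// Two separate updates: if the second fails, the first has already happened.
function transferUnsafe(accounts, from, to, amount) {
    credit(accounts, from, -amount);
    credit(accounts, to, amount);
}

// All-or-nothing: work on a copy, and publish it only if both updates succeed.
function transferAtomic(accounts, from, to, amount) {
    var copy = Object.assign({}, accounts);
    credit(copy, from, -amount);
    credit(copy, to, amount);
    Object.assign(accounts, copy);
}

function total(accounts) {
    return Object.keys(accounts).reduce(function (sum, k) {
        return sum + accounts[k];
    }, 0);
}

var unsafe = Object.assign({}, accounts);
try { transferUnsafe(unsafe, "alice", "carol", 30); } catch (e) { /* transfer failed */ }

var atomic = Object.assign({}, accounts);
try { transferAtomic(atomic, "alice", "carol", 30); } catch (e) { /* transfer failed */ }

console.log(total(unsafe));   // 90: money has evaporated
console.log(total(atomic));   // 120: nothing changed
```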

MongoDB was, at the time, completely incapable of making such a guarantee (multi-document transactions were only added in version 4.0) - the system had to be switched to an SQL database

Resources

Some people are horrified to discover that the files that MongoDB creates are huge, or that the amount of memory it reserves when you start it up is huge

This is even if the MongoDB database is empty

It is because MongoDB is optimized for big data

It is a solvable problem, but why should you have to solve it?

Queries

Many people find they have a need to make complex queries on MongoDB objects

So, they add a module which allows them to make SQL queries on MongoDB objects, without realizing how utterly stupid this is - MongoDB is a NoSQL database - the clue is in the name

What they are doing is handling relationships with MongoDB in a completely implicit and non-robust way - relationships require consistency guarantees

Security

It is very common to create an insecure system using MongoDB

See survey

Or see study

This is a solvable problem; in fact the MongoDB suppliers have been persuaded to improve the defaults. But the fundamental problem is that people generally don't know or care about security - a professional must take care over security

SQL databases

SQL databases are general purpose, and they are all essentially the same

They are also called relational databases because they fit the theory of relations (roughly)

It takes a little knowledge and skill to (a) fit your data to the relational model, and (b) use the SQL language to extract data

But it is not rocket science (and if your data is trivial enough for MongoDB to be suitable, then the SQL approach is easy too!)

Criticisms

SQL is not as relational as it should be - see criticism and the third manifesto

It produces results in implementation-dependent order, and the results may include duplicates. Problems arise with limit clauses and other features which depend on order. There is a three-valued logic based on null which is incomplete and doesn't work, and so on

See notes on purifying it

Pet store

A simple pet store database might have three tables, for animals, customers, and sales:

animals
id  breed
42  dog
53  fish

customers
username  name
ab12345   Pat
cd67890   Chris

sales
id  username  price
42  ab12345   100
53  cd67890   50

Relationships

Each row of a table is an object, each column is a field

The animals and customers are simple because they have no relationships

The sales table is more subtle, because it uses ids and usernames to represent relationships (cross-references) between the sales table and the other two tables, to avoid duplicating any information

SQL

A statement in SQL to fetch all the animals objects is:

select * from animals

The result, in JavaScript, is a set of objects, where each object has fields id and breed

SQL is designed to be programming-language-neutral, i.e. it can be used inside C, C++, Java, JavaScript or any other language, object oriented or not

It is also designed to be used as an interactive command language, so that databases can be handled manually, outside of the web server

The SQL language

SQL is a narrow-purpose declarative scripting language for searching and sorting database records

SQL is supposed to be standardised across database systems, but (a) the standard is inaccessible and old and woolly and (b) database systems don't always stick to it

The result is that SQL is reasonably standard as long as it is a human who is moving from system to system, but if you move actual SQL statements from system to system, you should expect to have to adjust them

SQLite

The recommended system for this unit is SQLite (pronounced s-q-l-lite or sequelite)

If the sqlite3 node module is installed, the callback style has to be used. These notes assume instead that the sqlite node module is installed, which uses promises so that the async/await keywords can be used


Command line tools

Instead of accessing the database via a program with SQL strings in it, you can install a command line tool which allows you to type SQL commands interactively

You could use the sqlite3 command (not the node module), or SQLite Browser, or SQuirrel (which accesses any type of database)

This can be useful to maintain the database, e.g. initialize it, but you can't safely access the database this way while the web server is using it, so the usefulness is limited

Why lite?

What does the 'lite' in SQLite refer to?

It means that SQLite is small and simple, suitable for small devices such as mobiles, so what do you lose?

It is embedded and not served, it has loose types in the JavaScript style, some features have to be 'switched on', and you have to think hard about scaling up

But, contrary to rumour, it doesn't compromise on concurrency (though Node is single-threaded anyway)

SQLite documentation

Most database systems are written in C or C++, including SQLite (and MongoDB)

So, for SQLite, there are two parts to the documentation, one not language specific, one JavaScript

General SQLite documentation

Node sqlite3 module documentation

Other SQL systems

SQL systems are all very similar, and the SQL language is portable (with a few changes) between systems

Desirable features of an SQL database system are zero configuration and robustness

See the next few slides for details

Configuration

Configuration doesn't mean things like adding an index to a table for efficiency - any database system needs that - it means things like creating data files, choosing file sizes, choosing limits, choosing text encodings, switching on foreign key support, upgrading, ...

Configuring databases can be a nightmare - because of complexity and/or poor documentation

Good modern designs just work, because all the configuration is automatic

Robustness

1) data must not be corrupted by a program crash, operating system crash, or power failure - use a database system which has had extensive crash testing

2) transactions should be serializable (any other isolation level, such as read committed, is a fudge)

3) there should be automatic consistency checking, e.g. for foreign keys

Most common database systems are OK for robustness

Stable storage

Beware that no database system can guarantee robustness unless running on stable storage

And normal stable storage (RAID) isn't perfect because (for efficiency) the hardware confirms that data has been written when it hasn't - it is still in a hardware cache waiting to be written

True robustness only comes from inefficiently configured specialist hardware

Opinions

Based on experience, I would say:

Oracle: incredibly reliable, incredibly expensive and obsolete, requires a training course to configure

MySQL: incredibly popular, too proprietary, configuration and documentation very poor

MariaDB: newer open version of MySQL, better but configuration still poor

PostgreSQL: OK

Derby: excellent all round, but effectively Java only

Creating a database

Here is a program to create a database file:

"use strict";
var sqlite = require("sqlite");
create();

async function create() {
    try {
        var db = await sqlite.open("./db.sqlite");
        await db.run("create table animals (id, breed)");
        await db.run("insert into animals values (42,'dog')");
        await db.run("insert into animals values (53,'fish')");
        var as = await db.all("select * from animals");
        console.log(as);
    } catch (e) { console.log(e); }
}


Interactive creation

Using the sqlite3 command:

> sqlite3 data.db
> create table animals (id, breed);
> insert into animals values (42, 'dog');
> insert into animals values (53, 'fish');
> select * from animals;
> .quit

Database files

When you open data.db, that says you want to use the file data.db to hold the data

SQLite will create the file the first time, if it doesn't exist

If it does exist, SQLite will connect to it and give you access to all the data stored permanently in it

The file can be copied, renamed or moved as you like

Semicolons

When you type SQL commands interactively, you need semicolons

Leaving out a semicolon indicates that you want to continue the command on the next line

When you use SQL commands in a program, you don't need semicolons because each command is a self-contained string

Sample data

The two insert commands are just there to insert example data to try out SQL

You can delete the sample data with delete from animals; if you want to, before using the database in your server

Meta command

A meta command is a command provided for interactive use which goes beyond the SQL language

For example .exit is clearly never needed in a program

Other useful meta commands are .help, .tables and .headers on

Meta commands don't need semicolons at the end

The SQL command pragma foreign_keys = on; is useful for enabling automatic consistency checks

create.sql

Scripts

A script is a file containing SQL commands, e.g.

create table animals (id, breed);
insert into animals values (42, 'dog');
insert into animals values (53, 'fish');
select * from animals;

This allows you to remember, edit and redo the commands again later, if you feel the need to start again from scratch

You can redo by copying and pasting the commands, or with .read create.sql

create.js

Callback style

If you use the sqlite3 module, the create program is:

"use strict";
var sql = require("sqlite3");
var db = new sql.Database("data.db");
db.serialize(create);

function create() {
    db.run("create table animals (id, breed)");
    db.run("insert into animals values (42,'dog')");
    db.run("insert into animals values (53,'fish')");
}

See next slide for serialize

Serialize method

db.serialize(create);

function create() {
    db.run("...");
    db.run("...");
    ...
}

The serialize method calls a function in sequential mode; the calls to run are chained, as if each had a callback function which triggered the next

serialize returns straight away, with the queries still outstanding, but automatically linked by callbacks so they happen in order

fetch.js

Fetching rows

"use strict";
var sqlite = require("sqlite");
fetch();

async function fetch() {
    try {
        var db = await sqlite.open("./db.sqlite");
        var as = await db.all("select * from animals");
        console.log(as);
    } catch (e) { console.log(e); }
}

[ { id: 42, breed: 'dog' }, { id: 53, breed: 'fish' } ]


Closing a database

Tutorials often show a call db.close() to close the connection to the database

Don't do this in a server; open the database once at the start, and keep it open for every request

You would only use db.close() if you have implemented a clean way to shut down the server; but most people just use CTRL/C (which is OK, because the system does frequent commits)

fetch.js

Fetching rows

"use strict";
var sql = require("sqlite3");
var db = new sql.Database("./db.sqlite");
db.all("select * from animals", show);

function show(err, rows) {
    if (err) throw err;
    console.log(rows);
}

[ { id: 42, breed: 'dog' }, { id: 53, breed: 'fish' } ]

single.js

Fetching a single row

"use strict";
var sqlite = require("sqlite");
fetch();

async function fetch() {
    try {
        var db = await sqlite.open("./db.sqlite");
        var as = await db.get("select * from animals where id=42");
        console.log(as);
    } catch (e) { console.log(e); }
}

{ id: 42, breed: 'dog' }

If you only want one object, you can use get instead of all (to avoid a list with one entry)


single.js

Callback style

"use strict";
var sql = require("sqlite3");
var db = new sql.Database("./db.sqlite");
db.get("select * from animals where id=42", show);

function show(err, row) {
    if (err) throw err;
    console.log(row);
}

{ id: 42, breed: 'dog' }

Inserting, updating

"use strict";
var sqlite = require("sqlite");
insert();

async function insert() {
    var db = await sqlite.open("./db.sqlite");
    await db.run("insert into animals values (64,'cat')");
}

"use strict";
var sqlite = require("sqlite");
update();

async function update() {
    var db = await sqlite.open("./db.sqlite");
    await db.run("update animals set breed='terrier' where id=42");
}


Callback style

"use strict";
var sql = require("sqlite3");
var db = new sql.Database("data.db");
db.run("insert into animals values (64,'cat')");

"use strict";
var sql = require("sqlite3");
var db = new sql.Database("data.db");
db.run("update animals set breed='terrier' where id=42");

Joins

A statement in SQL to load up all the sales, together with the data about each animal and customer, is:

select * from sales join animals using (id) join customers using (username)

The column name id must match in sales and animals, and similarly for username

The result, in JavaScript, is a set of objects, where each object has fields price, id, breed, username, name
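To see what the join produces, here is a sketch in plain JavaScript which computes the same result from the pet store data held in arrays. The joinUsing helper is made up for illustration; it shows what the result means, not how SQLite computes it:

```javascript
"use strict";

var animals = [{ id: 42, breed: "dog" }, { id: 53, breed: "fish" }];
var customers = [
    { username: "ab12345", name: "Pat" },
    { username: "cd67890", name: "Chris" }
];
var sales = [
    { id: 42, username: "ab12345", price: 100 },
    { id: 53, username: "cd67890", price: 50 }
];

// Hypothetical helper: pair up the rows whose values in the named column
// match, merging each pair into one object (the shared column appears once).
function joinUsing(left, right, column) {
    var result = [];
    left.forEach(function (l) {
        right.forEach(function (r) {
            if (l[column] === r[column]) result.push(Object.assign({}, l, r));
        });
    });
    return result;
}

var rows = joinUsing(joinUsing(sales, animals, "id"), customers, "username");
console.log(rows);
```

Each of the two resulting objects combines one sale with its animal and its customer.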


Natural joins

An alternative is

select * from sales natural join animals natural join customers

This automatically finds the matching column names - so you have to be very careful about naming the columns correctly

In this case id is the only matching column in sales and animals, and the same with username in sales and customers

Alternatives

Other alternatives which are more explicit about column names are:

select * from sales join animals on animals.id = sales.id join customers on customers.username = sales.username

select * from sales, animals, customers where animals.id = sales.id and customers.username = sales.username

select * from sales as s, animals as a, customers as c where a.id = s.id and c.username = s.username

ORMs

If you really don't like SQL, you can use an ORM (Object-Relational Mapping) such as sequelize

This allows you to access the data as objects, instead of using SQL

But I don't really recommend this approach, because (a) it doesn't give you the full power of SQL, only a simplified subset and (b) if you are only going to need a simplified subset, the SQL shouldn't be that difficult

SQL Injection

SQL injection = SQL poisoning means attacking a site as a hacker, using clever text to pervert SQL queries

Suppose you join strings to build an SQL query:

a = url.query.address;
q = "update people set address='" + a + "' where ...";

Suppose an attacker types this address into your form:

anywhere', auth='admin

They could give themselves admin privileges
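A minimal sketch showing the query that results from the attacker's text. The buildQuery function and the where id=7 condition are made up for illustration, standing in for the elided part of the query above:

```javascript
"use strict";

// Builds the query by joining strings, as in the vulnerable code above.
// The "where id=7" condition is a hypothetical placeholder.
function buildQuery(address) {
    return "update people set address='" + address + "' where id=7";
}

var honest = buildQuery("1 High Street");
var attack = buildQuery("anywhere', auth='admin");

console.log(honest);
console.log(attack);
```

The second query comes out as update people set address='anywhere', auth='admin' where id=7, so the attacker's text has become part of the SQL and sets a second column.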

Prepared statements

The best way to guard against SQL injection, and to make your queries more efficient, is to use prepared statements. Instead of db.run, do this:

var ps = await db.prepare("update animals set breed=? where id=?");
await ps.run("terrier", 42);
await ps.finalize();

The first line can be done once, when the server starts up, and the third line can be done once when the server shuts down (if you have a clean shutdown)

Note: this needs to be done for all query types, including select

Keys

A primary key is an id which identifies a row/record, e.g. student username or unit code

create table pets (name text, kind text, primary key (name))

No two records can have the same primary key (and keys can't be null)

A secondary key is a column (or group) which is kept unique by the database, e.g. student candidate number

id  breed
42  dog
53  fish

Foreign keys

There are no pointers; references are via keys

A foreign key is a column (or group) in one table which references a primary key in another:

id  username
42  ab12345
53  cd67890

Foreign key consistency means the foreign key column always refers to an existing primary key

Setting up foreign keys

create table animals (id, ...
    primary key (id)
);
create table sales (id, username, ...
    foreign key (id) references animals(id)
);
pragma foreign_keys = on;

Blobs and Clobs

An SQL database system usually has the ability to support Blobs (Binary large objects) and Clobs (Character large objects)

I suggest not using these features

For example, with photos, it is usually better to store each photo as a file, and put the filename into the database

Generally, the database should hold the data which you use to make decisions, not structureless data

Concurrency

A server is highly concurrent, able to handle many simultaneous requests

A node server handles concurrency in a particular way

In the callback style, each request is divided into function calls, each ending with an I/O operation with a callback

The function calls from separate requests can be interleaved, but two function calls are never executed at the same time, and each function call is never interrupted

Concurrency example

A server which does template-filling might look like:
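A minimal sketch of such a server in the callback style. The function names getTemplate, getData and fillTemplate, the template text, and the readLater stand-in for real I/O are all assumptions made for illustration:

```javascript
"use strict";

var log = [];      // records the order in which the function calls run
var pages = {};    // finished pages, keyed by request name

// Stand-in for real I/O (file or database access): delivers the value on a
// later turn of the event loop.
function readLater(value, callback) {
    setImmediate(function () { callback(null, value); });
}

// Each step logs itself, starts one I/O operation, and returns.
function getTemplate(request) {
    log.push(request + " getTemplate");
    readLater("<p>Hello NAME</p>", function (err, template) {
        getData(request, template);
    });
}

function getData(request, template) {
    log.push(request + " getData");
    readLater({ name: request }, function (err, data) {
        fillTemplate(request, template, data);
    });
}

function fillTemplate(request, template, data) {
    log.push(request + " fillTemplate");
    pages[request] = template.replace("NAME", data.name);
}

// Two requests arriving almost together.
getTemplate("A");
getTemplate("B");
```

Once both requests have finished, log holds the function calls in the order they ran. Each call runs to completion without interruption, but calls belonging to the two requests alternate.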

Example continued

Imagine two requests A and B arriving almost together

The sequence of function calls might be:

A getTemplate
B getTemplate
A getData
B getData
A fillTemplate
B fillTemplate

Any interleaved sequence is possible, depending on timings, and the server code must handle any of them, i.e. it must be thread-safe

Global variables

Global variables in the server, shared by all the functions, are not generally thread-safe

var sqlite = require("sqlite");
var db;

The sqlite variable is safe because it is constant

The db variable is safe because functions are never interrupted, so a shared database connection is OK (but not if you have multi-query transactions!)

In languages other than nodejs, where interruptions may happen, each simultaneous request needs its own database connection (taken from a pool)

Example: DIY auto-ids

If you want a table where each row has an auto-generated id, you might try a global variable:

var nextId = 1;

async function insert(...) {
    var query = "insert into ... values (id=?, ...";
    var ps = await db.prepare(query);
    await ps.run(nextId);
    nextId++;
}

The global variable nextId would be OK if it was incremented in the same function call as using it, but technically this is two function calls separated by await, so it is not safe
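The problem can be reproduced without a database. In this sketch, fakeRun is a made-up stand-in for ps.run which just records the id it is given:

```javascript
"use strict";

var nextId = 1;
var used = [];     // ids actually handed to the (simulated) database

// Hypothetical stand-in for ps.run: records the id, completes asynchronously.
function fakeRun(id) {
    used.push(id);
    return Promise.resolve();
}

async function insert() {
    await fakeRun(nextId);   // the function is suspended here ...
    nextId++;                // ... so the increment happens in a later call
}

// Two requests arriving almost together.
var finished = Promise.all([insert(), insert()]).then(function () {
    console.log(used);       // [ 1, 1 ]: both rows were given the same id
});
```

The second request reads nextId before the first request has resumed to increment it.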

Example: DB auto-ids

Suppose instead you get the database to do it

In sqlite, that's by defining an integer primary key column id, say

You only add the autoincrement keyword (which reduces efficiency) if you want to forbid the reuse of old id numbers

Then you do insert queries without providing the id column

Example continued

And suppose, just after the insert, you want to know what the id number is:

async function insert(...) {
    await db.run("insert ...");
    var id = await db.get("select last_insert_rowid() as id ...");
}

This is not safe because another request could insert a row after the run, before the get

Example solved

So what is the best way to sort out auto-ids?

Answer: scour the database documentation to find out how to insert-and-fetch-id as an indivisible operation:

async function insert(...) {
    var x = await db.run("insert ...");
    var id = x.lastID;
}

In-memory data

Suppose you want data available instantly as objects in memory

This is more than just a caching system inside the database, because it eliminates translation between data formats

In the JavaScript world, systems like MongoDB can do this, but you lose consistency guarantees in the process

You need a proper OODB, but there aren't any good ones

Scale

Scaling up is difficult

The most important thing is overall design

It is difficult to give good advice, because every project has different requirements, so let's just look at a few issues in the context of some case studies

Facebook

Facebook is huge, so how do they organize their DB?

They use many different database systems for different purposes, but they mostly use SQL - their most important data is relationships(!), so only SQL will do

They have one giant central database, and local copies

When you update, the central DB is updated, and your browser is sent to the central database for 20 seconds, to give time (they hope!) for your update to reach your nearest local copy
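The routing idea can be sketched in a few lines. This is only an illustration of the idea as described above, not Facebook's actual code; the function names and the in-memory lastUpdate table are assumptions:

```javascript
"use strict";

var STICKY_MS = 20 * 1000;   // the 20 seconds mentioned above
var lastUpdate = {};         // user -> time (in ms) of that user's latest update

// Called whenever a user's update has been written to the central database.
function recordUpdate(user, now) {
    lastUpdate[user] = now;
}

// Decide which copy should serve a read for this user at time now.
function chooseDatabase(user, now) {
    var t = lastUpdate[user];
    return (t !== undefined && now - t < STICKY_MS) ? "central" : "local";
}

recordUpdate("pat", 1000);
console.log(chooseDatabase("pat", 5000));     // central: updated 4 seconds ago
console.log(chooseDatabase("pat", 21000));    // local: the 20 seconds are up
console.log(chooseDatabase("chris", 5000));   // local: no recent update
```

Users who have just updated see their own changes, while everyone else keeps reading from the nearest local copy.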

Yahoo

Yahoo use the PNUTS system, which works the other way round

Your data is updated in your local database, which provides consistency guarantees for local data

Then the data is shared between the local databases to form a virtual global database

In that virtual database, consistency isn't guaranteed, but inconsistencies are extremely unlikely to matter

SAFE

The SAFE system in the faculty was initially designed and set up by me, with a lot of help from Martin Baker and later other people, though I no longer have any responsibility for it

Previously, the computer science web site (when it existed) did all that SAFE does and more, and SAFE was the result of a challenge to do something similar at the faculty level

SAFE is hideously complex (too complex, really) - it does all sorts of behind-the-scenes stuff

It is being retired at the same time as me!

Distributed or not?

Perhaps the biggest question we faced was whether SAFE would run on one computer, or several for robustness or load balancing or database size reasons

We looked into the robustness issue, and our conclusion was that multiple servers would make the system more complex and less robust

(True robustness is only possible with special and expensive hardware, and a full time person to run the system - as 'amateurs' we couldn't afford either)

Robustness

Although we couldn't use multiple servers for robustness, we needed to do something

We opted for (a) keeping the database continually and safely backed up and (b) having a backup computer ready to take over

It is a cold backup site, not a hot one, because we can't afford the hardware or staff

Load Balancing

Did we need multiple servers for load balancing?

Load balancing is used, for example, in the university's Blackboard system

But the lack of responsiveness of Blackboard is mainly due to the depth of each request (the number of layers of software and networking it goes through to get down to the data) and load balancing doesn't help there

Multiple servers, and the probably necessary networked access to the database that results, may make that worse

Load

Assuming that network load is the efficiency bottleneck, could a single server cope with the load?

On the computer science system, the maximum load we measured was a quarter of a million hits in a day, i.e. 5 per second
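A quick check of the arithmetic. Spread evenly over 24 hours, a quarter of a million hits is a little under 3 per second, so the figure of 5 per second presumably allows for the hits being concentrated in the busier part of the day (that reading is an assumption, not something stated above):

```javascript
"use strict";

var hitsPerDay = 250000;

// Average rate if the hits were spread evenly over the whole day.
var perSecondAllDay = hitsPerDay / (24 * 60 * 60);

// Number of hours the same hits would take at a steady 5 per second.
var hoursAtFivePerSecond = hitsPerDay / 5 / 3600;

console.log(perSecondAllDay.toFixed(1));        // 2.9
console.log(hoursAtFivePerSecond.toFixed(1));   // 13.9
```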

We would expect SAFE to peak at around 25 hits per second

Given the university's good infrastructure, that is feasible on a single server

Database size

That leaves database size as the only likely reason to need multiple servers

Distributed databases are where NoSQL databases are supposed to come into their own, and it is why they are advertised as 'scalable'

But there is a problem known as the CAP theorem which means essentially that NoSQL databases cannot provide consistency

Kind of data

NoSQL databases are brilliantly scalable if your data is suitable

They are scalable precisely because they give no consistency guarantees

The type of data stored in SAFE (students, programmes, units, assignments, marks, ...) has huge amounts of cross-referencing and huge consistency requirements

A NoSQL database could not possibly work

Local versus shared

But the data held in the university as a whole is too big to avoid distributed data

So, the university has a strategy: divide the data into coherent local SQL databases, with each piece of data only being updated within its relevant database, and with consistency guarantees within each database

The data which must be shared is duplicated in every database by overnight copying (which works very well, though some data may be up to a day out of date)

Requirements for SAFE

SAFE is almost entirely responsible for day-to-day data concerning our faculty's students and units

But it needs central info about students from other faculties taking our units, and units from other faculties taken by our students

And it needs to report some of our faculty's data for central storage, to be shared by other faculties

This is all done by automatic overnight processes

Cross-database queries

Previously, we had done some cross-database queries in order to fetch data overnight

An example might be "fetch the records from a central table which are relevant to the students listed in our local table"

These queries only work when both databases use the same system, and we wanted to use a different database system for SAFE

Experiment

To check the feasibility of switching to a different database system, we did an experiment

We took our biggest cross-database query and replaced it with a query on the remote database alone of the form "fetch all the data from the central table", and we wrote code to check each record as it arrived, discard it if not relevant, and update our own database if relevant

To our surprise, this was 100 times more efficient

Cross-database queries are an exceedingly crude mechanism, of use only to non-programmers
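
The approach used in the experiment can be sketched as follows. This is only an illustration: the names `importRelevant`, `centralRows`, `localStudentIds` and `localTable` are hypothetical, an array stands in for the records arriving from the remote database, and a Map stands in for our own database.

```javascript
// Hypothetical sketch: fetch everything from the central table and
// filter locally, instead of running a cross-database query.
// `centralRows` is any iterable of records from the remote database;
// `localStudentIds` is a Set built from our own table.
function importRelevant(centralRows, localStudentIds, localTable) {
  let kept = 0;
  for (const row of centralRows) {
    // Discard records that do not concern our students.
    if (!localStudentIds.has(row.studentId)) continue;
    // Update our own copy (a Map keyed by student id stands in
    // for the local database here).
    localTable.set(row.studentId, row);
    kept++;
  }
  return kept;
}

const central = [
  { studentId: "s1", name: "Ada" },
  { studentId: "s2", name: "Bob" },
  { studentId: "s3", name: "Cy" },
];
const ours = new Set(["s1", "s3"]);
const local = new Map();
console.log(importRelevant(central, ours, local)); // prints 2
```

Each record is handled once as it arrives, and the membership test against a Set is a constant-time lookup, so the work is a single pass over the central table.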

Database size

The SAFE database is exceedingly complex (hundreds of tables with hundreds of cross-reference fields) and large (hundreds of thousands of records in the largest tables)

But, by moving all the bulk unstructured data out into files, the total size is about a gigabyte, which easily fits in one server's memory, let alone one server's disk

So, the final design decision was: use a single server with a single SQL database

Embedded or served?

Our previous experience had been with served databases, because that's what the university uses (it has a site-wide Oracle licence)

But we knew from experience that the network is the efficiency bottleneck, and that using a served database can roughly double the network load

So one of the main reasons for wanting to stick to a single server and database was so that the database could be embedded

Queries

The SAFE database is a Derby one, within a Java server

How efficient are queries?

If you use prepared statements (as you always should), the first time you execute a query it is slow, and after that it is blindingly fast

That's because Derby compiles the query to Java bytecode, and then Java's JIT compiler compiles it to machine code!

Query bottleneck

But we knew from experience that database queries would probably form a second efficiency bottleneck

So, we planned to do better

On a system like SAFE, the number of read-only accesses vastly outweighs the number of update accesses, which gives an opportunity for improvement

The improvement is to hold the data in memory as objects, so that reads never touch the database, and to write every update through to the database as well (a write-through cache in front of the database)
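
The write-through idea can be sketched in JavaScript as below. This is a minimal illustration and not the SAFE implementation (which is in Java): the class name `WriteThroughStore` and the `save` method on the database object are hypothetical, and a stub stands in for the real database.

```javascript
// Hypothetical write-through store: reads are served from memory,
// every update is written to the backing database as well.
// `db` is any object with a `save(key, value)` method; here a stub
// stands in for the real (e.g. SQL) database.
class WriteThroughStore {
  constructor(db, initialRows = []) {
    this.db = db;
    this.objects = new Map(initialRows); // loaded once at start-up
  }
  // Read-only access: no query, no copying, just a lookup.
  get(key) {
    return this.objects.get(key);
  }
  // Update: write to the database first, then to memory, so the
  // in-memory copy never gets ahead of the permanent one.
  set(key, value) {
    this.db.save(key, value);
    this.objects.set(key, value);
  }
}

const writes = [];
const stubDb = { save: (key, value) => writes.push([key, value]) };
const store = new WriteThroughStore(stubDb, [["u1", { title: "Web" }]]);
store.set("u2", { title: "Databases" });
console.log(store.get("u2").title, writes.length); // prints "Databases 1"
```

Writing to the database before updating memory is a design choice in this sketch: if the database write throws, the in-memory object is left unchanged.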

OODB

The result is effectively an Object Oriented Database, with these features:

read-only accesses involve no copying of data
read-only accesses involve no memory turnover
ACID transactions are supported in memory
indexes are implemented in memory

We did a survey, and found that there were no suitable OODB systems (either they involved copying, or they left transactions up to the programmer)

DIY

I designed an in-memory database (called Adapt, now defunct)

Since SAFE was set up, JPA has come along, which has the right properties

But if you look into it, it is hideously complex (more than Adapt, which is already too complex)

After a lot of thought and experiment, my conclusion is that Java's flaws prevent a simple approach (and no other OO language is better)

Threads

Although we went for a single server, we still wanted threads

JavaScript's asynchronous approach effectively provides pseudo-threads, but real threads are needed in Java - SAFE has 200 of them

The original SAFE server was an 8x8 processor (8 cores, 8 threads each), which speeds up the server by a factor of between 8 and 64

There are no problems because Java, Derby and Adapt are all designed to handle many threads

JavaScript

But what if you want to go multi-threaded in the JavaScript world?

Node is single-threaded, though it does an excellent job of handling simultaneous queries with a single thread (effectively providing pseudo-threads)

But it can't make use of multiple cores, since it only ever uses one of them, so how can you speed things up?

Node

The standard way to go multi-threaded with Node is to run several copies of your server, one on each processor core (technically, these are separate processes)

You need a load-balancer to direct each incoming request to one of the copies

You can use an off-the-shelf one such as nginx, but you can write a simple one yourself in five lines of code
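
The core of such a home-made load balancer is the decision about which copy gets the next request. A minimal sketch of a round-robin chooser follows; the function name `makeRoundRobin` and the addresses are hypothetical, and the code that actually forwards the request to the chosen copy is left out.

```javascript
// Hypothetical round-robin chooser: hands out the server copies
// in turn. `backends` is a list of addresses of the server copies.
function makeRoundRobin(backends) {
  let next = 0;
  return function pick() {
    const backend = backends[next];
    next = (next + 1) % backends.length;
    return backend;
  };
}

const pick = makeRoundRobin(["localhost:8081", "localhost:8082"]);
console.log(pick(), pick(), pick());
// prints "localhost:8081 localhost:8082 localhost:8081"
```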

Sessions

It is important to keep the communication between server copies to a minimum

One issue is sessions; a session is a way of remembering a login, plus some details of the logged-in user

A session can be treated as volatile, i.e. stored only in memory while the server is running, and discarded when the server goes down
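
A volatile session store of this kind can be sketched as a Map held in the server's memory. The names `createSessionStore`, `start`, `lookup` and `end` are hypothetical, and the counter used to generate ids is for illustration only, since real session ids must be unguessable.

```javascript
// Hypothetical in-memory session store. Sessions live in a Map,
// so they vanish when the server process stops.
// `newId` generates session ids; a real server needs unguessable
// ids, the counter used below is for illustration only.
function createSessionStore(newId) {
  const sessions = new Map();
  return {
    // Called after a successful login.
    start(user) {
      const id = newId();
      sessions.set(id, { user, started: Date.now() });
      return id;
    },
    // Called on each request with the id sent by the browser.
    lookup(id) {
      return sessions.get(id);
    },
    // Called on logout.
    end(id) {
      sessions.delete(id);
    },
  };
}

let counter = 0;
const sessionStore = createSessionStore(() => "session-" + ++counter);
const sid = sessionStore.start("alice");
console.log(sid, sessionStore.lookup(sid).user); // prints "session-1 alice"
```

In Node, `crypto.randomUUID()` from the built-in `node:crypto` module is one way to produce ids that are hard to guess.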

Local sessions

Sharing sessions between server copies is a severe complication

You can avoid it by arranging for the load balancer to send requests from the same user to the same copy of the server - the only one which holds that user's session data

What's needed is for the load balancer to hold a map from session ids to server copy ids (so now the load balancer will be ten lines of code instead of five!)
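
The map from session ids to server copies can be sketched like this. The name `makeStickyBalancer` and the copy ids are hypothetical, and the sketch assumes the load balancer can read the session id from each request; how the id is extracted and how the request is forwarded are left out.

```javascript
// Hypothetical sticky chooser: a map from session id to server copy.
// Requests without a known session are spread round-robin.
function makeStickyBalancer(backends) {
  const owner = new Map(); // session id -> server copy
  let next = 0;
  return function pick(sessionId) {
    if (sessionId !== undefined && owner.has(sessionId)) {
      return owner.get(sessionId);
    }
    const backend = backends[next];
    next = (next + 1) % backends.length;
    if (sessionId !== undefined) owner.set(sessionId, backend);
    return backend;
  };
}

const route = makeStickyBalancer(["copy-A", "copy-B"]);
console.log(route("s1"), route("s2"), route("s1"));
// prints "copy-A copy-B copy-A"
```

This sketch records the owner the first time a session id is seen; a fuller version would also remove entries when sessions end, so the map does not grow without limit.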

Sharing the database

The only issue left is for the server copies to share access to the database

Anything that the servers need to share can be put in the database

The simplest option is to switch to a served database, e.g. from sqlite3 to postgresql, say

This isn't a big deal, especially compared to the overall amount of work needed to design and build a scaled-up system

Data APIs

There is another option which is worth considering

Lots of sites provide a data API (Google Maps, social networking sites, ...)

You send a request asking for data, and you get back the data you asked for

Each uses a customized protocol, based on the design of the data which the site holds

Each protocol is extremely simple, nothing like the general querying protocol provided by a served database

Custom database

Following the API model, what you can do is to run one database process alongside the server processes

It isn't a served database, but instead an embedded database with code which you write to provide a data API for your data

So the database may as well be sqlite3

The server copies each use the data API to interact with the database process

This is probably as simple and efficient as you can get
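
To give a feel for how small such a data API can be, here is a sketch of the request handling inside the database process. Everything in it is hypothetical: the operation names `getMarks` and `addMark`, the JSON request shape, and the Map which stands in for the embedded database. The transport between the server copies and the database process is not shown.

```javascript
// Hypothetical data API for the database process. Instead of
// accepting arbitrary queries, it accepts a few named requests.
// A Map stands in for the embedded database (e.g. sqlite3).
const marks = new Map([["s1", [{ unit: "web", mark: 68 }]]]);

const handlers = new Map([
  ["getMarks", ({ studentId }) => marks.get(studentId) ?? []],
  ["addMark", ({ studentId, unit, mark }) => {
    const list = marks.get(studentId) ?? [];
    list.push({ unit, mark });
    marks.set(studentId, list);
    return { ok: true };
  }],
]);

// Entry point: one JSON request in, one JSON response out.
function handleRequest(json) {
  const { op, ...args } = JSON.parse(json);
  const handler = handlers.get(op);
  if (!handler) return JSON.stringify({ error: "unknown operation" });
  return JSON.stringify(handler(args));
}

console.log(handleRequest('{"op":"getMarks","studentId":"s1"}'));
// prints [{"unit":"web","mark":68}]
```

Because only the named operations are accepted, the protocol stays far smaller than a general query language, and the server copies never send SQL to the database process.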
