Introduction to Databases: Query Optimizations for MySQL

Databases
Kodok Márton
simb.ro | kodokmarton.eu | twitter.com/martonkodok | facebook.com/marton.kodok | stackoverflow.com/users/243782/pentium10
23 May, 2013 @Sapientia



Relational Databases

● A relational database is essentially a group of tables (entities).
● Tables are made up of columns and rows.
● Tables have constraints, and relationships are defined between them.
● Relational databases are queried using SQL.
● Multiple tables accessed in a single query are "joined" together, typically on a criterion defined in the table relationship columns.
● Normalization is a data-structuring model used with relational databases that ensures data consistency and removes data duplication.
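The normalization point above can be sketched with a runnable example, using Python's built-in sqlite3 as a stand-in for MySQL (table and column names are invented for the illustration): customer data lives in one place, and a JOIN on the relationship column reassembles it.

```python
import sqlite3

# Normalized schema sketch: customer data is stored once, orders reference
# it by key, and a JOIN removes the need to duplicate the customer's name.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, "
            "customer_id INTEGER REFERENCES customers(id), total REAL)")
cur.execute("INSERT INTO customers VALUES (1, 'Anna')")
cur.executemany("INSERT INTO orders VALUES (?, 1, ?)", [(1, 9.5), (2, 20.0)])

# The query re-joins the two tables on the relationship column.
rows = cur.execute(
    "SELECT c.name, o.total FROM orders o "
    "JOIN customers c ON c.id = o.customer_id ORDER BY o.id"
).fetchall()
print(rows)  # [('Anna', 9.5), ('Anna', 20.0)]
```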

Non-Relational Databases (NoSQL)

● Key-value stores are the simplest NoSQL databases. Every item in the database is stored as an attribute name, or key, together with its value. Examples of key-value stores are Riak and MongoDB. Some key-value stores, such as Redis, allow each value to have a type, such as "integer", which adds functionality.
● Document databases can contain many different key-value pairs, key-array pairs, or even nested documents.
● Graph stores are used to store information about networks, such as social connections.
● Wide-column stores such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together, instead of rows.

SQL vs NoSQL

The purpose of this presentation is NOT to settle SQL vs NoSQL.

Let's be blunt: neither of them is difficult; we need both of them. We evolve.

Spreadsheets/Frontends

Excel and Access are not databases.

Let's be blunt: Excel does not need more than 256 columns and 65,536 rows.

DDL (Data Definition Language)

CREATE TABLE employees (
  id INTEGER(11) PRIMARY KEY,
  first_name VARCHAR(50) NULL,
  last_name VARCHAR(75) NOT NULL,
  dateofbirth DATE NULL
);

ALTER TABLE sink ADD bubbles INTEGER;
ALTER TABLE sink DROP COLUMN bubbles;

DROP TABLE employees;

RENAME TABLE My_table TO Tmp_table;

TRUNCATE TABLE My_table;

CREATE, ALTER, DROP, RENAME, TRUNCATE

SQL (Structured Query Language)

INSERT INTO My_table (field1, field2, field3) VALUES ('test', 'N', NULL);

SELECT Book.title AS Title, COUNT(*) AS Authors
FROM Book
JOIN Book_author ON Book.isbn = Book_author.isbn
GROUP BY Book.title;

UPDATE My_table SET field1 = 'updated value' WHERE field2 = 'N';

DELETE FROM My_table WHERE field2 = 'N';

CRUD (CREATE, READ, UPDATE, DELETE)

Indexes (fast lookup + constraints)

● An index reduces the amount of data the server has to examine
● can speed up reads but can slow down inserts and updates
● is used to enforce constraints

Constraint:
a. PRIMARY
b. UNIQUE
c. FOREIGN KEY

Type:
1. BTREE
   - can be used for look-ups and sorting
   - can match the full value
   - can match a leftmost prefix ( LIKE 'ma%' )
2. HASH
   - only supports equality comparisons: =, IN()
   - can't be used for sorting
3. FULLTEXT
   - only MyISAM tables
   - compares words or phrases
   - returns a relevance value

Some Queries That Can Use a BTREE Index

● point look-up
  SELECT * FROM students WHERE grade = 100;
● open range
  SELECT * FROM students WHERE grade > 75;
● closed range
  SELECT * FROM students WHERE grade BETWEEN 70 AND 80;
  (note: WHERE 70 < grade < 80 is legal MySQL but parses as (70 < grade) < 80, which is not a range check)
● special range
  SELECT * FROM students WHERE name LIKE 'ma%';
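A minimal, runnable illustration of an index serving a point look-up, using Python's sqlite3 as a stand-in (sqlite's plan wording differs from MySQL's EXPLAIN, but the idea that the look-up becomes an index search is the same; table and index names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE students "
            "(id INTEGER PRIMARY KEY, name TEXT, grade INTEGER)")
cur.execute("CREATE INDEX grade_idx ON students(grade)")

# EXPLAIN QUERY PLAN is sqlite's equivalent of MySQL's EXPLAIN; the last
# column of each row is a human-readable description of the access path.
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM students WHERE grade = 100"
).fetchall()
detail = plan[0][-1]
print(detail)  # an index search via grade_idx, not a full table scan
```

Range look-ups (`grade > 75`) can walk the same B-tree; the planner's choice there depends on its row estimates.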

Multi Column Indexes

Useful for sorting/WHERE:

CREATE INDEX `salary_name_idx` ON emp(salary, name);

SELECT salary, name FROM emp ORDER BY salary, name;

(5000, 'john') < (5000, 'michael')
(9000, 'philip') < (9999, 'steve')
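The ordering a multi-column index maintains is exactly lexicographic tuple comparison, which Python can demonstrate directly (the (salary, name) pairs are the ones from the slide):

```python
# Compare on the first column; fall back to the second only on ties -
# precisely how a (salary, name) index orders its entries.
rows = [(9000, 'philip'), (5000, 'michael'), (9999, 'steve'), (5000, 'john')]

assert (5000, 'john') < (5000, 'michael')   # tie on salary, name decides
assert (9000, 'philip') < (9999, 'steve')   # salary alone decides

ordered = sorted(rows)  # the order the index stores and returns rows in
print(ordered)  # [(5000, 'john'), (5000, 'michael'), (9000, 'philip'), (9999, 'steve')]
```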

Indexing InnoDB Tables

● data is clustered by primary key
● the primary key is implicitly appended to all secondary indexes

CREATE INDEX fname_idx ON emp(firstname);
actually creates KEY(firstname, id) internally

Avoid long primary keys!
TS-09061982110055-12345
349950002348857737488334
supercalifragilisticexpialidocious

How MySQL Uses Indexes

● looking up data
● joining tables
● sorting
● avoiding reading data

MySQL typically chooses only ONE index per table (index_merge being the exception).

DON'Ts

● don't follow optimization rules blindly
● don't create an index for every column in your table thinking that it will make things faster
● don't create duplicate indexes

ex. BAD:
create index firstname_ix on Employee(firstname);
create index lastname_ix on Employee(lastname);

GOOD:
create index first_last_ix on Employee(firstname, lastname);
create index id_ix on Employee(id);

DOs

● use indexes for optimizing look-ups, sorting and retrieval of data
● use short primary keys if possible when using the InnoDB storage engine
● extend an existing index if you can, instead of creating new indexes
● validate the performance impact as you make changes
● remove unused indexes

Speeding it up

● proper table design (3NF)
● understand the query cache (internal to MySQL)
● EXPLAIN syntax
● proper indexes
● MySQL server daemon optimization
● Slow Query Logs
● Stored Procedures
● Profiling
● Redundancy -> Master : Slave
● Sharding
● mix database servers based on business logic -> Memcache, Redis, MongoDB

EXPLAIN

EXPLAIN SELECT * FROM attendees
WHERE conference_id = 123 AND registration_status > 0

table      possible_keys  key   rows
attendees  NULL           NULL  14052

The three most important columns returned by EXPLAIN:
1) Possible keys
   ● All the possible indexes which MySQL could have used
   ● Based on a series of very quick lookups and calculations
2) Chosen key
3) Rows scanned
   ● Indication of the effort required to identify your result set

-> Interpreting the results

Interpreting the results

EXPLAIN SELECT * FROM attendees
WHERE conference_id = 123 AND registration_status > 0

table      possible_keys  key   rows
attendees  NULL           NULL  14052

● No suitable indexes for this query
● MySQL had to do a full table scan
● Full table scans are almost always the slowest queries
● Full table scans, while not always bad, are usually an indication that an index is required

-> Adding indexes

Adding indexes

ALTER TABLE attendees ADD INDEX conf (conference_id);
ALTER TABLE attendees ADD INDEX reg (registration_status);

EXPLAIN SELECT * FROM attendees
WHERE conference_id = 123 AND registration_status > 0

table      possible_keys  key   rows
attendees  conf, reg      conf  331

● MySQL had two indexes to choose from, but discarded "reg"
● "reg" isn't sufficiently unique
● The spread of values can also be a factor (e.g. when 99% of rows contain the same value)
● Index "uniqueness" is called cardinality
● There is scope for some performance increase... lower server load, quicker response

-> Choosing a better index

Choosing a better index

ALTER TABLE attendees ADD INDEX reg_conf_index (registration_status, conference_id);

EXPLAIN SELECT * FROM attendees
WHERE registration_status > 0 AND conference_id = 123

table      possible_keys              key             rows
attendees  reg, conf, reg_conf_index  reg_conf_index  204

● reg_conf_index is a much better choice
● Note that the other two keys are still available, just not as effective
● Our query is now served well by the new index

-> Using it wrong

Watch for WHERE column order

ALTER TABLE attendees DROP INDEX conf;
ALTER TABLE attendees DROP INDEX reg;

EXPLAIN SELECT * FROM attendees WHERE conference_id = 123

table      possible_keys  key   rows
attendees  NULL           NULL  14052

● Without the "conf" index, we're back to square one
● The order in which fields were defined in a composite index affects whether it is available for use in a query
● Remember, we defined our index as (registration_status, conference_id)

Potential workaround:

EXPLAIN SELECT * FROM attendees
WHERE registration_status >= -1 AND conference_id = 123

table      possible_keys   key             rows
attendees  reg_conf_index  reg_conf_index  204
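The column-order effect can be reproduced with Python's sqlite3 as a stand-in (names follow the slides; sqlite's plan wording differs from MySQL's EXPLAIN, and here an equality on the leading column plays the role of the `>= -1` trick):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE attendees (id INTEGER PRIMARY KEY, "
            "conference_id INTEGER, registration_status INTEGER)")
# Composite index with registration_status as the leading column.
cur.execute("CREATE INDEX reg_conf_index "
            "ON attendees(registration_status, conference_id)")

# Filtering only on the second column: no indexed search is possible,
# so the plan is a SCAN (every entry examined).
scan = cur.execute("EXPLAIN QUERY PLAN SELECT * FROM attendees "
                   "WHERE conference_id = 123").fetchall()[0][-1]
# Adding a condition on the leading column unlocks an indexed SEARCH.
search = cur.execute("EXPLAIN QUERY PLAN SELECT * FROM attendees "
                     "WHERE registration_status = 1 "
                     "AND conference_id = 123").fetchall()[0][-1]
print(scan)    # a SCAN
print(search)  # a SEARCH using reg_conf_index
```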

JOINs

● JOINing together large data sets (>= 10,000 rows) is really where EXPLAIN becomes useful
● Each JOIN in a query gets its own row in EXPLAIN
● Make sure each JOIN condition is FAST
● Make sure each joined table gets to its result set as quickly as possible
● The benefits compound if each join requires less effort

Simple JOIN example

EXPLAIN SELECT * FROM conferences c
JOIN attendees a ON c.conference_id = a.conference_id
WHERE c.location_id = 2
AND c.topic_id IN (4,6,1)
AND a.registration_status > 1

table        type  possible_keys     key               rows
conferences  ref   conference_topic  conference_topic  15
attendees    ALL   NULL              NULL              14052

● Looks like I need an index on attendees.conference_id
● "type" is another indication of effort, aside from rows scanned
● Here, "ALL" is bad; we should be aiming for "ref"
● There are 13 different values for "type"
● Common values are: const, eq_ref, ref, fulltext, index_merge, range, ALL

http://dev.mysql.com/doc/refman/5.0/en/using-explain.html

The "extra" column

With every EXPLAIN, you get an "extra" column, which shows additional operations invoked to get your result set.

Some example "extra" values:
● Using index
● Using where
● Using temporary
● Using filesort

There are many more "extra" values which are discussed in the MySQL manual:
Distinct, Full scan, Impossible HAVING, Impossible WHERE, Not exists

http://dev.mysql.com/doc/refman/5.0/en/explain-output.html#explain-join-types

table      type  possible_keys  key   rows  extra
attendees  ref   conf           conf  331   Using where; Using filesort

Using filesort

Avoid, because:
● Doesn't use an index
● Involves a full scan of your result set
● Employs a generic (i.e. one-size-fits-all) algorithm
● Creates temporary tables
● Uses the filesystem (seek)
● Will get slower with more data

It's not all bad...
● Perfectly acceptable provided you get to your result set as quickly as possible, and keep it predictably small
● Sometimes unavoidable - ORDER BY RAND()
● ORDER BY operations can use indexes to do the sorting!

EXPLAIN SELECT * FROM attendees
WHERE conference_id = 123 ORDER BY surname

table      possible_keys  key            rows  extra
attendees  conference_id  conference_id  331   Using filesort

MySQL is using an index, but it's sorting the results slowly.

ALTER TABLE attendees ADD INDEX conf_surname (conference_id, surname);

table      possible_keys               key           rows  extra
attendees  conference_id,conf_surname  conf_surname  331

We've avoided a filesort!
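The same before/after can be reproduced with Python's sqlite3, whose plan says "USE TEMP B-TREE FOR ORDER BY" where MySQL's EXPLAIN says "Using filesort" (schema names follow the slides):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE attendees (id INTEGER PRIMARY KEY, "
            "conference_id INTEGER, surname TEXT)")

q = "SELECT * FROM attendees WHERE conference_id = 123 ORDER BY surname"

# Without a suitable index: the plan includes an explicit sort step.
before = [r[-1] for r in cur.execute("EXPLAIN QUERY PLAN " + q)]

# The composite (conference_id, surname) index returns rows pre-sorted
# by surname within each conference, so the sort step disappears.
cur.execute("CREATE INDEX conf_surname ON attendees(conference_id, surname)")
after = [r[-1] for r in cur.execute("EXPLAIN QUERY PLAN " + q)]

print(before)  # includes "USE TEMP B-TREE FOR ORDER BY"
print(after)   # index search only; no sort step
```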

NoSQL engines

● Redis
● MongoDB
● Cassandra
● CouchDB
● DynamoDB
● Riak
● Membase
● HBase

Data is created, updated, deleted, and retrieved using API calls. All application and data-integrity logic is contained in the application code.

Redis

● Written in: C/C++
● Main point: Blazing fast
● License: BSD
● Protocol: Telnet-like
● Disk-backed in-memory database
● Master-slave replication
● Simple values or hash tables by keys, but complex operations like ZREVRANGEBYSCORE
● INCR & co. (good for rate limiting or statistics)
● Has sets (also union/diff/inter)
● Has lists (also a queue; blocking pop)
● Has hashes (objects of multiple fields)
● Sorted sets (high-score table, good for range queries)
● Redis has transactions (!)
● Values can be set to expire (as in a cache)
● Pub/Sub lets one implement messaging (!)

Best used: For rapidly changing data with a foreseeable database size (should fit mostly in memory).

For example: Stock prices. Analytics. Real-time data collection. Real-time communication.

SET uid:1000:username antirez
uid:1000:followers
uid:1000:following

GET foo => bar
INCR foo => 11

LPUSH mylist a (now mylist holds the one-element list 'a')
LPUSH mylist b (now mylist holds 'b,a')
LPUSH mylist c (now mylist holds 'c,b,a')

SADD myset a
SADD myset b
SADD myset foo
SADD myset bar
SCARD myset => 4
SMEMBERS myset => bar,a,foo,b
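A pure-Python sketch of the list and set semantics above, no Redis server involved (deque.appendleft mimics LPUSH's prepend; set.add mimics SADD's duplicate-free insert):

```python
from collections import deque

# LPUSH mylist a / b / c - each push prepends, so the newest element
# ends up at the head of the list.
mylist = deque()
for item in ("a", "b", "c"):
    mylist.appendleft(item)
print(list(mylist))  # ['c', 'b', 'a']

# SADD myset ... - a set silently ignores duplicates; SCARD is its size.
myset = set()
for member in ("a", "b", "foo", "bar"):
    myset.add(member)
print(len(myset))  # 4, like SCARD myset => 4
```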

MongoDB

● Written in: C++
● Main point: Retains some friendly properties of SQL (queries, indexes)
● License: AGPL (Drivers: Apache)
● Protocol: Custom, binary (BSON)
● Master/slave replication (auto failover with replica sets)
● Sharding built-in
● Queries are JavaScript expressions / run arbitrary JavaScript functions server-side
● Uses memory-mapped files for data storage
● Performance over features
● Journaling (with --journal) is best turned on
● On 32-bit systems, limited to ~2.5 GB
● An empty database takes up 192 MB
● Has geospatial indexing

Best used: If you need dynamic queries. If you prefer to define indexes, not map/reduce functions. If you need good performance on a big DB.

For example: For most things that you would do with MySQL or PostgreSQL, but where having predefined columns really holds you back.

MongoDB -> JSON

The MongoDB examples assume a collection named users that contains documents of the following prototype:

{
  _id: ObjectId("509a8fb2f3f4948bd2f983a0"),
  user_id: "abc123",
  age: 55,
  status: 'A'
}

MongoDB -> insert

SQL:
CREATE TABLE users (
  id MEDIUMINT NOT NULL AUTO_INCREMENT,
  user_id VARCHAR(30),
  age INT,
  status CHAR(1),
  PRIMARY KEY (id)
)

MongoDB:
db.users.insert( { user_id: "abc123", age: 55, status: "A" } )

The collection is implicitly created on the first insert operation. The primary key _id is automatically added if the _id field is not specified.

MongoDB -> Alter, Index, Select

SQL:
ALTER TABLE users ADD join_date DATETIME

MongoDB:
db.users.update( { }, { $set: { join_date: new Date() } }, { multi: true } )

SQL:
CREATE INDEX idx_user_id_asc ON users(user_id)

MongoDB:
db.users.ensureIndex( { user_id: 1 } )

SQL:
SELECT user_id, status FROM users WHERE status = "A"

MongoDB:
db.users.find( { status: "A" }, { user_id: 1, status: 1, _id: 0 } )

SQL:
SELECT * FROM users WHERE status = "A" OR age = 50

MongoDB:
db.users.find( { $or: [ { status: "A" }, { age: 50 } ] } )
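How the $or filter behaves can be sketched in plain Python over documents matching the prototype above (an illustration of the predicate only, not the MongoDB driver; the extra users are invented):

```python
# Documents shaped like the users prototype from the slides.
users = [
    {"user_id": "abc123", "age": 55, "status": "A"},
    {"user_id": "def456", "age": 50, "status": "B"},  # invented example row
    {"user_id": "ghi789", "age": 30, "status": "C"},  # invented example row
]

# db.users.find({ $or: [ {status: "A"}, {age: 50} ] }) keeps a document
# when at least one of the listed predicates matches.
matches = [u for u in users if u["status"] == "A" or u["age"] == 50]
print([u["user_id"] for u in matches])  # ['abc123', 'def456']
```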

Job Trends from Indeed.com

Thank you.

Questions?