Post on 12-Jun-2020


Copyright © 2017 Oracle and/or its affiliates. All rights reserved.

Collations in MySQL 8.0

Bernt Marius Johnsen, Senior QA Engineer

⚠ Warning: This presentation uses unicode graphemes, even for ellipsis ('…' U+2026)


Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.


Agenda

● Why Unicode
● What is character set/collation etc.
● What’s new in MySQL 8.0
● How to migrate and some issues to consider


Why Unicode?

● The whole world is moving towards Unicode as digital devices are used by more and more people across all cultures around the globe.
– Approximate number of users, in billions, of the most used writing systems: Latin¹: ~5, Chinese: ~1.5, Arabic: ~0.7, Devanagari: ~0.5, Cyrillic: ~0.25, Bengali: ~0.22, Kana: ~0.12
● One driving force is emojis.
– Smileys, hearts, roses etc., and all the stuff people are sending to each other when communicating these days.
– “Useful” example: Unicode character U+1F574, MAN IN BUSINESS SUIT LEVITATING: 🕴

¹ This is way more letters than just ASCII!


Why Unicode in a database?

● You may use one character set for all your data, for all purposes.
– E.g. if you make an application using utf8mb4 for a column with names, it may hold Russian names, Chinese names, Japanese names etc.
– Even esoteric extinct writing systems are covered, e.g. that of the Phaistos disc (look it up...)
– But not Klingon, nor Tengwar


What is Unicode?

● Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. (Wikipedia)
● ISO/IEC 10646
● Unicode covers most existing and extinct writing systems known to man in one standard.
● The standard has allocated 17 planes, and blocks of characters are allocated into the planes.
● Six planes assigned so far:
– Plane 0: U+0000 - U+FFFF: Basic Multilingual Plane (BMP)
– Plane 1: U+10000 - U+1FFFF: Supplementary Multilingual Plane (SMP)
– Plane 2: U+20000 - U+2FFFF: Supplementary Ideographic Plane (SIP)
– Plane 14: U+E0000 - U+EFFFF: Supplementary Special-Purpose Plane (SSP)
– Plane 15 & 16: U+F0000 - U+10FFFF: Supplementary Private Use Area A and B (PUA-A and PUA-B)
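Since every plane spans 0x10000 code points, the plane of a character can be read off the bits above the lowest 16. The following is a minimal Python sketch of that arithmetic; the helper name `plane_of` is made up for illustration and is not part of MySQL or of any Unicode library.

```python
def plane_of(code_point: int) -> int:
    """Return the Unicode plane (0..16) that a code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Each plane holds 0x10000 code points, so the plane is the value
    # that remains after dropping the lowest 16 bits.
    return code_point >> 16


for ch in ("A", "人", "\U0001F574"):
    print(f"U+{ord(ch):04X} is in plane {plane_of(ord(ch))}")
```

Everything in plane 1 and above needs 4 bytes in UTF-8, which is why emojis such as U+1F574 only fit in a 4-byte character set like utf8mb4.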


What is a CHARACTER SET?

● A character set is defined by:

– A repertoire of characters/graphemes

– A value given to each character/grapheme (codepoint)

– An encoding which defines the binary representation of the values


What is Encoding?

● The binary representation of a character/grapheme.
– The simplest ones: 1:1. A character is a byte and a byte is a character (ASCII, ISO-8859-1/Latin-1 etc.)
● Unicode defines 3 encodings:
– UTF-8 (1-4 bytes per character)
– UTF-16 (2 or 4 bytes per character)
– UTF-32 (4 bytes per character)


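Character set examples

+-----------+----------------------+----------+----------+------------+
| Character | Character set        | Value    | Encoding | Encoded as |
+-----------+----------------------+----------+----------+------------+
| A         | ASCII                | 41       | 1:1      | 41         |
| A         | ISO-8859-1 (Latin-1) | 41       | 1:1      | 41         |
| A         | Unicode              | U+0041   | UTF-8    | 41         |
| A         | Unicode              | U+0041   | UTF-16   | 0041       |
| Ä         | ISO-8859-1 (Latin-1) | C4       | 1:1      | C4         |
| Ä         | Unicode              | U+00C4   | UTF-8    | C384       |
| Ä         | Unicode              | U+00C4   | UTF-16   | 00C4       |
| д         | KOI8-R               | C4       | 1:1      | C4         |
| д         | ISO-8859-5           | D4       | 1:1      | D4         |
| д         | Unicode              | U+0434   | UTF-8    | D0B4       |
| д         | Unicode              | U+0434   | UTF-16   | 0434       |
| 人        | GB-18030             | C8CB     | 1:1      | C8CB       |
| 人        | Unicode              | U+4EBA   | UTF-8    | E4BABA     |
| 人        | Unicode              | U+4EBA   | UTF-16   | 4EBA       |
| 人        | Big5                 | A448     | 1:1      | A448       |
| 人        | JIS X 0208 (SJIS)    | 906C     | 1:1      | 906C       |
| 🕴        | Unicode              | U+1F574  | UTF-8    | F09F95B4   |
| 🕴        | Unicode              | U+1F574  | UTF-16   | D83DDD74   |
| 🕴        | GB-18030             | 9439EE36 | 1:1      | 9439EE36   |
+-----------+----------------------+----------+----------+------------+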
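The byte sequences in the table can be reproduced outside the database. Below is a small Python sketch that uses Python's built-in codecs as stand-ins for the character sets named on the slide; the helper `hex_bytes` and the choice of the big-endian UTF-16/UTF-32 codecs (to avoid a byte order mark) are assumptions made for this illustration.

```python
def hex_bytes(text: str, codec: str) -> str:
    """Encode text with the given Python codec and return upper-case hex."""
    return text.encode(codec).hex().upper()


examples = [
    ("A", ["ascii", "latin-1", "utf-8", "utf-16-be", "utf-32-be"]),
    ("Ä", ["latin-1", "utf-8", "utf-16-be"]),
    ("д", ["koi8_r", "iso8859_5", "utf-8", "utf-16-be"]),
    ("人", ["gb18030", "big5", "shift_jis", "utf-8", "utf-16-be"]),
    ("\U0001F574", ["gb18030", "utf-8", "utf-16-be", "utf-32-be"]),
]

for char, codecs in examples:
    for codec in codecs:
        print(f"U+{ord(char):04X}  {codec:<10} {hex_bytes(char, codec)}")
```

Note how the same byte value C4 means 'Ä' in Latin-1 and 'д' in KOI8-R: the bytes alone do not say which character is meant, only the bytes together with the character set do.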


What is collation

● Collation is the assembly of written information into a standard order (Wikipedia)
● Collation may consider
– Case (e.g. 'A' vs. 'a')
– Accents (e.g. 'E' vs. 'É')
– Locale-specific rules (e.g. 'A' vs. 'Å' vs. 'AA' in Danish and Norwegian)
– Numeric characters (e.g. '2' vs. 'ⅱ')
– Punctuation (e.g. 'blackbird' vs. 'black-bird')
– Etc.


What is a COLLATION in (My)SQL?

● In (My)SQL, a COLLATION is a set of rules for a given character set which defines an order and affects:
– ORDER BY
– LIKE
– Primary keys and indexes
– Unique constraints
– Comparison operators
– Some string functions
● All strings in MySQL have a character set and a collation


Character sets in MySQL

+----------+---------------------------------+---------------------+--------+
| Charset  | Description                     | Default collation   | Maxlen |
+----------+---------------------------------+---------------------+--------+
| ascii    | US ASCII                        | ascii_general_ci    |      1 |
| latin1   | cp1252 West European            | latin1_swedish_ci   |      1 |
| utf8     | UTF-8 Unicode                   | utf8_general_ci     |      3 |
| utf8mb4  | UTF-8 Unicode                   | utf8mb4_0900_ai_ci  |      4 |
+----------+---------------------------------+---------------------+--------+

Get all by typing:

mysql> show character set;

The rest of them are:

armscii8, big5, binary, cp1250, cp1251, cp1256, cp1257, cp850, cp852, cp866, cp932, dec8, eucjpms, euckr, gb18030, gb2312, gbk, geostd8, greek, hebrew, hp8, keybcs2, koi8r, koi8u, latin2, latin5, latin7, macce, macroman, sjis, swe7, tis620, ucs2, ujis, utf16, utf16le, utf32


New in MySQL 8.0

● Default character set: utf8mb4 with default collation: utf8mb4_0900_ai_ci
● Three language independent collations: utf8mb4_0900_ai_ci, utf8mb4_0900_as_ci, utf8mb4_0900_as_cs
– may be used for German dictionary order, English, French¹, Irish Gaelic, Indonesian, Italian, Luxembourgian, Malay, Dutch, Portuguese, Swahili and Zulu
● A lot of new collations based on Unicode v. 9.0.0
– UCA (Unicode Collation Algorithm)
– DUCET (Default Unicode Collation Element Table)
– CLDR v.30 (Common Locale Data Repository)
● All utf8mb4_*_0900_* collations are NO PAD

¹ Canadian French may not use the utf8mb4_0900_as_cs/utf8mb4_0900_as_ci collations due to differences from the standard accent order.


New in MySQL 8.0

● We have gone to great lengths to make the new utf8mb4_*_0900_* collations correct and complete.
● Accent insensitive/case insensitive (ai_ci) and accent sensitive/case sensitive (as_cs) utf8mb4 collations have been implemented for:
– Classical Latin (la), Croatian (hr), Czech (cs), Danish/Norwegian (da), Esperanto (eo), Estonian (et), German phone book order (de_pb), Hungarian (hu), Icelandic (is), Latvian (lv), Lithuanian (lt), Polish (pl), Romanian (ro), Russian (ru), Slovak (sk), Slovenian (sl), Modern Spanish (es), Traditional Spanish (es_trad), Swedish (sv), Turkish (tr), Vietnamese (vi)
● Accent/case sensitive (as_cs) and accent/case/kana sensitive (as_cs_ks) utf8mb4 collations for:
– Japanese (ja)


MySQL 8.0 collation name scheme

● <charset>[_<language> [_<variant>]]_<unicodeversion>(_<attribute>)+

– <charset> = utf8mb4

– <language>, an ISO 639-1 language code (or ISO 639-2 if needed)

– <variant>, a variant to the standard collation for the language. Per today: utf8mb4_de_pb_0900_* and utf8mb4_es_trad_0900_*.

– <unicodeversion> = 0900

– <attribute>: accent sensitivity (ai, as), case sensitivity (ci, cs), kana sensitivity (ks) and possible future ones.
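To make the scheme concrete, here is a rough Python sketch that splits a collation name into the parts listed above. The regular expression and the function `parse_collation` are illustrative assumptions written for this note; they only cover names that follow the scheme exactly as stated on this slide and are not how MySQL itself resolves collation names.

```python
import re

# <charset>[_<language>[_<variant>]]_<unicodeversion>(_<attribute>)+
COLLATION_NAME = re.compile(
    r"^(?P<charset>utf8mb4)"
    r"(?:_(?P<language>[a-z]{2,3})(?:_(?P<variant>[a-z]+))?)?"
    r"_(?P<unicodeversion>\d{4})"
    r"(?P<attributes>(?:_(?:ai|as|ci|cs|ks))+)$"
)


def parse_collation(name):
    """Return the parts of a new-style collation name, or None if it does not fit."""
    match = COLLATION_NAME.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    # "_as_cs_ks" -> ["as", "cs", "ks"]
    parts["attributes"] = parts["attributes"].strip("_").split("_")
    return parts


for name in ("utf8mb4_0900_ai_ci", "utf8mb4_de_pb_0900_as_cs",
             "utf8mb4_ja_0900_as_cs_ks", "utf8mb4_general_ci"):
    print(name, "->", parse_collation(name))
```

Old-style names such as utf8mb4_general_ci carry no Unicode version and therefore do not match, which is one quick way to tell the two generations of collations apart.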


Why not ...

● Fix utf8mb4_general_ci instead of introducing utf8mb4_0900_ai_ci, or fix utf8mb4_german2_ci instead of introducing utf8mb4_de_pb_0900_ai_ci?
– Because that might break existing applications using the old collations (the most serious issue for large databases: indexes would have to be rebuilt). Our policy: Collations don't change!
● Have a simpler name scheme?
– Because we prepare for more languages and new Unicode versions (Unicode 10.0.0 is expected in 2018)
– ISO-639-1/ISO-639-2 language codes are well defined


How to migrate?

● When migrating from 5.7 tables:
– Just convert the table:
ALTER TABLE foo CONVERT TO CHARACTER SET utf8mb4;
● This will change the default character set of the table (so that future new columns get utf8mb4) and the character set of all applicable columns.
● In theory, all character data in MySQL may be converted to utf8mb4 without loss of data.

That was easy … is that all there is to it …?


Upgrading to MySQL 8.0

● When upgrading to 8.0:
– Schemas (databases) keep their default character set/collation.
– Tables keep their default character set/collation.
– Columns keep their character set/collation.
● To take advantage of utf8mb4, you need to migrate.


… not quite … column by column

● If you have more complex tables with different character sets:
– Change the default character set of the table:
ALTER TABLE foo DEFAULT CHARACTER SET utf8mb4;
– Modify all relevant columns:
ALTER TABLE foo MODIFY bar VARCHAR(100) CHARACTER SET utf8mb4;
● Generally we recommend doing it column by column.
– ALTER TABLE … CONVERT … will e.g. change TEXT to MEDIUMTEXT when you convert from latin1 to utf8mb4, and that won't necessarily be what you want.


… not quite … the schema too

● A schema (aka database) in MySQL has a default character set, which will be the default character set of new tables in the schema:

mysql> show create schema bar;
+----------+----------------------------------------------------------------+
| Database | Create Database                                                |
+----------+----------------------------------------------------------------+
| bar      | CREATE DATABASE `bar` /*!40100 DEFAULT CHARACTER SET latin1 */ |
+----------+----------------------------------------------------------------+
1 row in set (0.00 sec)

● Change the default character set of the schema (database):
ALTER SCHEMA bar DEFAULT CHARACTER SET utf8mb4;


… not quite … collation differences

Collations are not equal, so converting from one collation to another may break UNIQUE constraints (e.g. PRIMARY KEY).

● Default collation:
– latin1_swedish_ci vs. utf8mb4_0900_ai_ci. E.g. 'o'='ö' is false in the first, but true in the other.
– Possible solution: Stick to Swedish or another suitable collation depending on your application:
ALTER TABLE foo CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_sv_0900_ai_ci;
– Generally, if you don't care about case insensitivity (just got it by default), utf8mb4_0900_as_cs should be safe.
● There's a huge number of possibilities depending on your data and the collations used, partly because pre MySQL 8.0 collations were not complete (and in some cases not correct).
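Before converting a keyed column, it can help to look for values that an accent and case insensitive collation would treat as equal. The Python sketch below is only a rough approximation of that idea: it strips combining marks and case-folds, which is not the real UCA/DUCET weighting behind utf8mb4_0900_ai_ci, so it may miss collisions that MySQL would report (and the reverse). The names `rough_ai_ci_key` and `find_collisions` are hypothetical.

```python
import unicodedata
from collections import defaultdict


def rough_ai_ci_key(text: str) -> str:
    """Approximate an accent/case insensitive key (NOT the real UCA weights)."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks such as the diaeresis in 'ö', then ignore case.
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_marks.casefold()


def find_collisions(values):
    """Group values whose approximate keys are equal; return groups of 2 or more."""
    groups = defaultdict(list)
    for value in values:
        groups[rough_ai_ci_key(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


cities = ["København", "Orebro", "Oslo", "Stockholm", "Örebro"]
print(find_collisions(cities))
```

For a definitive answer, the comparison has to be done by the server itself under the target collation; this sketch is just a cheap first pass over exported key values.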


… not quite … index and key issues

● If you change the collation of a column, indexes on that column will be regenerated.
– This takes time for large data, and the table is locked during that time.
– And the conversion may fail due to changed space consumption.
● Max key length is 3072 bytes¹, which implies that the max length of a utf8mb4 varchar column which is also a key is 768 characters (worst case scenario: 4 bytes per character).

mysql> create table foo (v varchar(1000) character set latin1 primary key);
Query OK, 0 rows affected (0.01 sec)
mysql> alter table foo modify v varchar(1000) character set utf8mb4;
ERROR 1071 (42000): Specified key was too long; max key length is 3072 bytes

¹ For default InnoDB row format and default innodb_page_size in MySQL 8.0. See the documentation for details.
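The 768 figure is plain integer division of the key limit by the worst-case bytes per character (the Maxlen column of `show character set`). A minimal Python sketch of that arithmetic follows; the function name and its default arguments are assumptions taken from the numbers on this slide, not values read from a server.

```python
def max_indexable_chars(max_key_bytes: int = 3072, max_bytes_per_char: int = 4) -> int:
    """Longest fully indexed string column, in characters, for a given key limit."""
    return max_key_bytes // max_bytes_per_char


for charset, maxlen in (("latin1", 1), ("utf8", 3), ("utf8mb4", 4)):
    print(f"{charset:<8} {max_indexable_chars(3072, maxlen)} characters")
```

This is why varchar(1000) is fine as a latin1 key (1000 bytes) but fails after conversion to utf8mb4 (up to 4000 bytes).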


Upgrade example

mysql> show create table cities;
+--------+----------------------
| Table  | Create Table
+--------+----------------------
| cities | CREATE TABLE `cities` (
  `name` varchar(1024) NOT NULL,
  `population` int(11) DEFAULT NULL,
  PRIMARY KEY (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
+--------+----------------------
1 row in set (0.00 sec)

mysql> select * from cities;
+------------+------------+
| name       | population |
+------------+------------+
| København  |    1246611 |
| Orebro     |     107380 |
| Oslo       |     666759 |
| Stockholm  |     935619 |
| Örebro     |     107380 |
+------------+------------+
5 rows in set (0.00 sec)

mysql> alter table cities modify column name varchar(1024) charset utf8mb4;
ERROR 1062 (23000): Duplicate entry 'Örebro' for key 'PRIMARY'


Upgrade example contd.

mysql> alter table cities modify column name varchar(768) charset utf8mb4;
Query OK, 4 rows affected (0.01 sec)
Records: 4 Duplicates: 0 Warnings: 0

mysql> insert into cities values('東京',13617445);
Query OK, 1 row affected (0.00 sec)

mysql> select * from cities;
+------------+------------+
| name       | population |
+------------+------------+
| København  |    1246611 |
| Oslo       |     666759 |
| Örebro     |     107380 |
| Stockholm  |     935619 |
| 東京       |   13617445 |
+------------+------------+
5 rows in set (0.00 sec)


⚠ 文字化け (Mojibake)

(… or what you see is not what you get...)

mysql> create table foo(v varchar(10) character set latin1);
mysql> insert into foo values('å');
mysql> set names latin1;
mysql> insert into foo values('å');
mysql> set names utf8mb4;
mysql> select * from foo;
+------+
| v    |
+------+
| å    |
| Ã¥   |
+------+
2 rows in set (0.00 sec)

mysql> select hex(v) from foo;
+--------+
| hex(v) |
+--------+
| E5     |
| C3A5   |
+--------+
2 rows in set (0.00 sec)
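The two hex values can be reproduced without a database. The Python sketch below mimics what happens when a UTF-8 terminal sends 'å' over a connection declared as latin1; it uses Python's `latin-1` codec as a stand-in for MySQL's latin1 (which is really cp1252), an assumption that holds for the bytes involved here. The variable and function names are made up for illustration.

```python
def as_hex(data: bytes) -> str:
    return data.hex().upper()


# Case 1: the connection character set matches the client, so 'å' is
# converted properly and stored as the single latin1 byte E5.
stored_correctly = "å".encode("latin-1")

# Case 2: a UTF-8 terminal sends the two bytes C3 A5, but the connection
# claims latin1, so the server sees two latin1 characters: 'Ã' and '¥'.
bytes_on_the_wire = "å".encode("utf-8")
what_the_server_sees = bytes_on_the_wire.decode("latin-1")
stored_wrongly = what_the_server_sees.encode("latin-1")

print(as_hex(stored_correctly), as_hex(stored_wrongly), what_the_server_sees)
```

The data looked fine as long as the same wrong connection setting was used for reading, which is why mojibake often goes unnoticed until a migration.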


Fixing Ã¥

mysql> select v from foo;
+------+
| v    |
+------+
| Ã¥   |
+------+
1 row in set (0.01 sec)

mysql> update foo set v=convert(convert(convert(v using binary) using utf8mb4) using latin1);
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0

mysql> select v from foo;
+------+
| v    |
+------+
| å    |
+------+
1 row in set (0.00 sec)


Fixing æ–‡å —åŒ–ã ‘

mysql> select v from foo;
+-------------------------------+
| v                             |
+-------------------------------+
| æ–‡å —åŒ–ã ‘                  |
+-------------------------------+
1 row in set (0.01 sec)

mysql> alter table foo modify column v varchar(128) charset binary;
Query OK, 1 row affected (0.14 sec)
Records: 1 Duplicates: 0 Warnings: 0

mysql> alter table foo modify column v varchar(128) charset utf8mb4;
Query OK, 1 row affected (0.14 sec)
Records: 1 Duplicates: 0 Warnings: 0

mysql> select v from foo;
+--------------+
| v            |
+--------------+
| 文字化け     |
+--------------+
1 row in set (0.00 sec)
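Both fixes rest on the same idea: take the stored characters back to their raw bytes, then reinterpret those bytes as UTF-8. A minimal Python sketch of that round trip is shown here; `make_mojibake` and `repair_mojibake` are hypothetical helpers, and Python's `latin-1` codec is again used as an approximation of MySQL's latin1.

```python
def make_mojibake(text: str) -> str:
    """Simulate UTF-8 bytes being misread as latin1 characters."""
    return text.encode("utf-8").decode("latin-1")


def repair_mojibake(text: str) -> str:
    """Undo the misreading: back to raw bytes, then decode them as UTF-8."""
    return text.encode("latin-1").decode("utf-8")


broken = make_mojibake("文字化け")
print(len("文字化け"), "characters became", len(broken))
print(repair_mojibake(broken))
```

Applying the repair to text that was stored correctly raises a UnicodeDecodeError or UnicodeEncodeError in most cases, which is a useful safety net: only rows that are actually broken should be converted.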


Space consumption

● utf8mb4 uses

– 1 byte for ASCII characters (U+0000 - U+007F),

– 2 bytes for most alphabets/abjads (U+0080 - U+07FF),

– 3 bytes for Indic scripts, Hangul, Kana, the most used CJK Ideographs (U+0800 - U+FFFF),

– 4 bytes for the rest: Archaic scripts, Emojis, Rarely used CJK extensions, Variant selectors etc. (U+10000 -)
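The ranges above translate directly into a size estimate for real data. The following Python sketch encodes those four ranges and sums them over a string; `utf8mb4_bytes` and `storage_bytes` are illustrative names, and the result covers character data only, not MySQL's per-column length bytes or row overhead.

```python
def utf8mb4_bytes(code_point: int) -> int:
    """Bytes needed for one code point, following the ranges listed above."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


def storage_bytes(text: str) -> int:
    """Total bytes of character data for a string stored as utf8mb4."""
    return sum(utf8mb4_bytes(ord(ch)) for ch in text)


for name in ("Oslo", "København", "東京", "\U0001F574"):
    print(f"{name}: {len(name)} characters, {storage_bytes(name)} bytes")
```

For mostly Latin text the growth over latin1 is small, since only the accented letters go from 1 byte to 2.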


Speed issues

● Operations on multibyte character sets are inherently slower than on single-byte character sets (e.g. latin1 vs. utf8mb4)

● We have done a lot of code improvements.

– New code for the new utf8mb4 collations

– New collations are NO PAD (which gives faster algorithms)

– But expect a performance degradation on the order of 10-20% for sorting when you migrate from e.g. latin1 to utf8mb4, depending on your data of course.

● Some collations are inherently slower than others (e.g. utf8mb4_0900_ai_ci vs. utf8mb4_ja_0900_as_cs_ks)


Truly usable for global purposes.....


Q&A

● Check out my blogs at http://mysqlserverteam.com/author/bernt/

● The 8.0 documentation (if everything else fails … 😠): https://dev.mysql.com/doc/refman/8.0/en/charset.html

● The Unicode documents (for those truly interested … 😇): http://unicode.org/

😴 U+1F634