PoC: Using a Group Communication System to Improve MySQL Replication HA


PoC: MySQL HA improved

Ulf Wendel, MySQL/Oracle

The speaker says...

If you were on a sailing boat on the wide, wide ocean and your captain was the only one who knew how to sail, would you feel safe? If a police helicopter, which will eventually lose sight of you or fail, monitored your ship, would you feel safer? No? You are right. Both the captain and the helicopter are Single Points of Failure.

BTW, does your MySQL Replication cluster have a proper high availability configuration with no single point of failure? No? Because it is too complicated? Hint: use a GCS for MySQL HA!

Tip of the day

A Single Point of Failure cannot cure a SPOF

The speaker says...

MySQL Replication has a Single Point of Failure: the master server. The master is, by design, the weak spot of every primary copy based replication cluster*. Replication breaks when the master fails. At best, read-only queries can be served from the slaves until a new master has been elected and set up. Read: downtime, service outage, costs. Primary copy is still a valid design choice. It is simple. It is fast. But the Single Point of Failure (SPOF) remains a fact.

A client's view on existing solutions and a Proof-of-Concept mashup based on a recent Group Communication System.

* http://www.slideshare.net/nixnutz/diy-a-distributed-database-cluster-or-mysql-cluster gives a technical overview of database clustering theory (MySQL Cluster, 3rd party...)

Things to care a lot about

Master database process and/or master host monitoring

How to identify a failover candidate

How not to lose transactions, ever

The Servers' worries

[Diagram: a Monitor watches Master (Primary), Slave (Copy) and Slave (Copy); labels in that order: GTID = 12, GTID = 9, GTID = 12]

The speaker says...

Let's recap. If the master of a MySQL Replication system fails, a slave must be promoted to become the new master. All slaves are examined to identify the most recent one. Finding the most recent slave only recently became less troublesome with the introduction of Global Transaction Identifiers (GTIDs) in MySQL 5.6. Then, the candidate must be promoted to master and all other slaves must be updated to continue replicating from the new master. In heterogeneous deployments with older MySQL versions, searching for the latest transactions and applying them on all slaves can be quite demanding. Hence, use a tool for it!
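The candidate selection described above can be sketched in a few lines of Python. This is an illustration only: it treats a GTID as a single increasing transaction number, as the slide diagram does, whereas real MySQL GTIDs are sets of server UUIDs with transaction ranges. The `Slave` class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Slave:
    name: str
    gtid: int  # simplified: highest transaction number applied so far


def pick_failover_candidate(slaves):
    """Return the slave that has applied the most transactions."""
    if not slaves:
        raise ValueError("no slave available for promotion")
    return max(slaves, key=lambda s: s.gtid)


def missing_transactions(candidate, slave):
    """Transaction numbers a slave must still apply to catch up."""
    return list(range(slave.gtid + 1, candidate.gtid + 1))


# The two slaves from the slide: one at GTID 9, one at GTID 12.
slaves = [Slave("slave1", 9), Slave("slave2", 12)]
candidate = pick_failover_candidate(slaves)
print(candidate.name, missing_transactions(candidate, slaves[0]))
```

The slave at GTID 12 is promoted, and the slave at GTID 9 has to fetch three transactions before it can replicate from the new master.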

Introduced as MySQL 5.6 Utility

Health monitoring, Failover

Aims for 99.9% HA: roughly 8.8 hours downtime per year

Example: mysqlfailover utility

[Diagram: mysqlfailover heartbeating Master (Primary), Slave (Copy) and Slave (Copy); labels in that order: GTID = 12, GTID = 9, GTID = 12]

The speaker says...

For years, MySQL has recommended using 3rd party monitoring solutions for MySQL Replication.

MySQL 5.6 finally introduces the mysqlfailover command line utility. It sends heartbeats to the nodes of a MySQL Replication cluster and monitors their health. If required, it performs a failover. Given the complexity of a failover, it is very welcome to see it automated.

BTW, the utility is now GA, and can be run as a daemon.

Common design because of its simplicity

See mysqlfailover

See 3rd party, for example, MHA (MySQL High Availability)

Result: SPOFs doubled

[Diagram: SPOF: Master, with Slave (Copy) and Slave (Copy); labels in that order: GTID = 12, GTID = 9, GTID = 12; SPOF: Network; SPOF: Monitor]

The speaker says...

If the MySQL Replication Master is a Single Point of Failure, what is a single health monitor? Be it MHA or mysqlfailover, this approach introduces a new Single Point of Failure: the health monitor.

Given how rare a failover is, and given how unlikely it is that two systems, the master and the monitor, fail at the same time, it is still a valid design. In the worst case, you can still manually (re)start the monitor. However, an unnecessary failover may happen if the monitor unilaterally loses contact with the master. Generic HA cluster solutions such as Windows Clustering or its Linux counterparts address such issues.

Developed and pushed by major Linux vendors

Pacemaker, Corosync/Heartbeat, DRBD

Aims for 99.99% HA: roughly 53 minutes downtime per year

Generic HA Cluster solution

[Diagram: Master (Active), Slave (Active) and Master (Standby), each running Pacemaker (CRM) and Corosync (CCM); two DRBD boxes]

The speaker says...

Higher HA levels require a significantly more complex architecture, such as a combination of Pacemaker, Heartbeat/Corosync and DRBD. Pacemaker is a Cluster Resource Manager (CRM) that manages arbitrary services, for example, MySQL servers. The managed services are monitored by a Cluster Communication Manager (CCM), such as Corosync. Everything is broken into small, independent programs. There are no SPOFs because all the programs run on all the cluster nodes. A Distributed Replicated Block Device (DRBD) mirrors the MySQL master to lower the risk of transaction loss and speed up failover.

A tad complicated, maybe...
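The availability targets quoted on the slides (99.9% for mysqlfailover, 99.99% for the Pacemaker stack) translate into yearly downtime budgets with simple arithmetic. A small sketch, assuming a year of 365 days; the function name is made up for illustration:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Yearly downtime budget implied by an availability target."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for target in (99.9, 99.99):
    hours = downtime_hours_per_year(target)
    print(f"{target}% -> {hours:.2f} hours ({hours * 60:.0f} minutes) per year")
```

Three nines leave about 8.76 hours per year, four nines about 52.6 minutes. Each extra nine cuts the budget by a factor of ten.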

Things to care a lot about

Available servers, their roles, and possibly replication lag

Partitioning and sharding hints, if needed

Real-time server load

The Clients' worries

[Diagram: Master (Primary): Load 80/100, all tables: A, B, C, Lag: 0s. Slave (Copy): Load 5/100, tables: A, B, Lag: 1 second. Slave (Copy): Load 95/100, tables: C, Lag: 32 seconds.]

The speaker says...

Given enough information about the nature and status of a database cluster, an intelligent client can blur the line between connecting to a single database and connecting to a database cluster. The mission statement of PECL/mysqlnd_ms*, a load balancer plugin for the PHP MySQL driver, is to hide the complexity of database clusters from the application developer: load balancing, read-write splitting, read-your-writes (consistency), sharding and partitioning support, connection pool management, automatic caching of selected slave requests, GTID, all done by the driver! Dear database and/or HA cluster, just tell the driver.

* http://www.slideshare.net/nixnutz/load-mysq-clusterin-balancing-peclmysqlndms-14
(it got even more feature loaded meanwhile)
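To make the client's worries concrete, here is a minimal Python sketch of how a driver-side load balancer could use role, load, lag and table distribution to pick a server. It is not how PECL/mysqlnd_ms is implemented; the `Server` class, `pick_server` and the least-load policy are assumptions for illustration. The numbers are the ones from the slide.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    role: str          # "master" or "slave"
    load: int          # 0..100
    lag: int           # replication lag in seconds
    tables: frozenset  # tables stored on this server


CLUSTER = [
    Server("master", "master", 80, 0, frozenset("ABC")),
    Server("slave1", "slave", 5, 1, frozenset("AB")),
    Server("slave2", "slave", 95, 32, frozenset("C")),
]


def pick_server(cluster, table, is_write, max_lag=None):
    """Pick a server for one statement; writes always go to the master."""
    if is_write:
        candidates = [s for s in cluster if s.role == "master"]
    else:
        candidates = [s for s in cluster if table in s.tables
                      and (max_lag is None or s.lag <= max_lag)]
    if not candidates:
        raise LookupError("no suitable server")
    # Assumed policy: prefer the least loaded candidate.
    return min(candidates, key=lambda s: s.load)


print(pick_server(CLUSTER, "A", is_write=False).name)
print(pick_server(CLUSTER, "C", is_write=False, max_lag=5).name)
```

A read on table A goes to the lightly loaded slave. A read on table C that tolerates at most 5 seconds of lag falls back to the master, because the only slave holding C is 32 seconds behind.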

Typical failover procedure

Master failure: switch virtual IP, no client deployment

Slave failure: deploy all clients

Role, status, lag, load, distribution, ...: hacks, at best

The Server/Client clash

[Diagram: Failed Master, New Master, Slave and Client; the Virtual IP switches from the failed master to the new master. Client configuration: Master: Virtual IP, Slave: 192.168.128.11]

The speaker says...

All too often, HA solutions stop caring about clients once they have performed a virtual IP switch as part of a MySQL replication master failover. But failover is only a fraction of the task: the cloud era demands elastic clusters.

At best, HA solutions support the use of proxy servers to redirect client requests. Proxies add complexity to the stack, they add latency to all requests, they can become bottlenecks or evolve into SPOFs, and their failure affects many clients, not just one. Driver-integrated load balancers, such as PECL/mysqlnd_ms, don't have these disadvantages*!

* http://www.slideshare.net/nixnutz/load-balancing-for-php-and-mysql
(Pro/Con discussion of different load balancing approaches)

Server plugin for monitoring and management

Monitoring based on Group Communication System plugin

Management may utilize external scripts

Clients read I_S on any node for automatic self-deployment

A mashup for the mess

[Diagram: Master, Slave and Slave, each running Plugin: CCM/CRM]

The speaker says...

Could a MySQL server plugin using a Group Communication System (GCS) offer the robustness of a Pacemaker/Corosync approach (no SPOF), beat everything else, including mysqlfailover, on ease of use, and be the best friend of driver-based proxies for near-zero administration? At any time, a GCS can report its members and share state information among all members. A failed member (MySQL server) can be detected automatically by the GCS and appropriate action can be taken, for example, running the mysqlfailover command line utility to perform failover. If a client fails to connect to a node, it queries the INFORMATION_SCHEMA on any of the remaining nodes to learn about changes.
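The client-side fallback described in the last sentence could look like the following Python sketch. The function and parameter names are hypothetical, and the actual query against INFORMATION_SCHEMA is hidden behind a `fetch` callable so that the sketch stays independent of any database driver.

```python
import random


def refresh_server_list(nodes, failed, fetch):
    """Ask the remaining nodes, in random order, for the current server list.

    nodes  -- node addresses the client knows about
    failed -- the node the client just failed to connect to
    fetch  -- callable(node) that returns the server list or raises OSError
    """
    remaining = [n for n in nodes if n != failed]
    random.shuffle(remaining)  # spread the queries over all survivors
    for node in remaining:
        try:
            return fetch(node)
        except OSError:
            continue  # this node is unreachable too, try the next one
    raise ConnectionError("no cluster node reachable")
```

Because every node can answer, there is no single server that all clients contact after a failure, and the random order keeps one surviving node from taking all the queries.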

Compared to Pacemaker/Corosync

Similar no-SPOF design for the monitor

Aims for: out-of-the-box experience, smaller installations

Aims for: continuous, automatic client reconfiguration

Recap: simplified, client focus

[Diagram: MySQL with Plugin: CCM/CRM (GTID based) compared to MySQL with Pacemaker (CRM), Corosync (CCM) and DRBD]

The speaker says...

A GCS not only helps with failover. It can also report newly added nodes. Clients can periodically check the list of nodes and start to use the new ones automatically.

In the proposed solution, clients use plain SQL to learn about changes. Clients learn from MySQL nodes. State information is exposed through the INFORMATION_SCHEMA and exchanged (synchronously) with the help of a GCS. Data may include load, replication lag, whatever. Clients reconfigure themselves continuously. That's the idea.

(Basically, it's a system that uses lazy primary copy for replication but is inspired by update everywhere for HA.)
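Periodic reconfiguration boils down to comparing the server list a client already knows with the one it has just read. A minimal Python sketch, assuming the client keeps a dictionary that maps (host, port) to role; the function name is hypothetical and the address 192.168.2.2 is made up for the example:

```python
def diff_servers(known, current):
    """Compare two {(host, port): role} snapshots of the cluster."""
    added = {k: v for k, v in current.items() if k not in known}
    removed = {k: v for k, v in known.items() if k not in current}
    changed = {k: (known[k], current[k])
               for k in known.keys() & current.keys()
               if known[k] != current[k]}
    return added, removed, changed


known = {("127.0.0.1", 3400): "master", ("192.168.2.1", 3306): "slave"}
current = {("192.168.2.1", 3306): "master", ("192.168.2.2", 3306): "slave"}
added, removed, changed = diff_servers(known, current)
print(added, removed, changed)
```

In this example the old master has left, a former slave has become master and a new slave has joined. The client can start using the new node and redirect writes without any DBA action.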

Compared to super-sized central management monitor

No SPOF design for the monitor

No central server that can get overloaded

No new communication channels for client reconfiguration

Recap: simplified, client focus

[Diagram: MySQL servers with Plugin (CCM/CRM) serving both Client (SQL) and Client (Reconf) traffic, compared to MySQL servers plus a separate Monitor (CRM) that handles the Client (Reconf) traffic]

The speaker says...

The proposed system is better than a super-sized monitor that manages state (nodes, roles, load, ...) centrally.

A super-sized monitor can easily become a single point of failure. If clients notice the failure of a MySQL node and thousands of clients almost concurrently query the one centralized monitor to learn about cluster state changes, the central monitor is likely to get overloaded. Finally, a centralized monitor is likely to force clients to learn a new, additional protocol for communication with the monitor. Not so with the GCS approach: SQL for everything. No overloading: load is distributed over all remaining MySQL servers.

nixnutz@linux-dstv:~/src/isis_201245> rm isis_deamon.exe ; dmcs isis_deamon.cs Isis.cs ; mono ./isis_deamon.exe
Isis: Searching for the Isis ORACLE...
[IsisDaemonMain] Connecting to ISIS...
[GroupConnector][view change][ incoming multicasts delivery thread] Some view change (e.g. join), no action required
[IsisDaemonMain] Starting daemon for communication with MySQL.
[GroupDaemon] Server is waiting on socket 127.0.0.1:2200
[IsisDaemonMain] Started. Listening to client requests.
Q: 31 'join 127.0.0.1 3400 on master'
A: 4 '0 OK'
[GroupConnector][remote mysql register][RemoteRegisterMySQLServer mysql] Added (50851)-127.0.0.1-3400-on-master-07.08.2013 14:13:07-30 New server count 1
LastUpdate before heartbeat 07.08.2013 14:13:07
LastUpdate after heartbeat 07.08.2013 14:13:17
[GroupConnector][heartbeat][ incoming multicasts delivery thread] Thread.CurrentThread.Name Ignoring Hearbeat message to ourselves to avoid deadlock.
Q: 29 'heartbeat 127.0.0.1 3400 on'
A: 56 '(50851) 127.0.0.1 3400 on master 30 07.08.2013 14:13:17

HA MySQL cluster using: ORACLE Rendezvous Service

The speaker says...

Let's hack it?! The first steps are trivial. The biggest challenge is to find a free and open source C/C++ group communication system that can be embedded in a MySQL daemon server plugin. Corosync has a client/server daemon design, which is not a perfect match for the task. A brother, the Spread Toolkit, is somewhat limited to ~40 nodes and the API is not that appealing. LibPaxos is intentionally licensed under GPLv3 to make clear it's experimental. The rest: old, Java, ... The OSS world lacks a cool C/C++ GCS :-/.

Then came Isis2! C#, wrong language; but what a nice API! Isis2 is designed to make distributed cloud computing easy. Thus, it's ideal for a PoC that hides all the gory details ;-)

++Isis, Ken Birman's 1980s/1990s masterpiece, improved

https://isis2.codeplex.com/

Virtual Synchrony Model* merged with Paxos ideas

Most easy to use yet powerful API

Distributed (programming language) objects

Distributed key-value store/hashtable

From low-level unreliable messaging for gossip protocols to high-level globally ordered reliable messaging: it's all there

New BSD license, C#/.NET (pure C++ port considered)

Aims to support cloud sized clusters (thousands of nodes)

* http://de.slideshare.net/nixnutz/diy-a-distributed-database-cluster-or-mysql-cluster (Virtual Synchrony, slide 34+)

Isis2 Cloud Computing Library

The speaker says...

Isis2 is almost a perfect choice for the PoC. The API is perfect, the documentation is great, but it's written in C#/.NET. This not only means I had to use Mono on my preferred Linux platform, but there is also a language barrier that destroys parts of the beauty of the proposed solution. MySQL and its plugins are written in C/C++. One cannot call/link the Isis2 library from a MySQL plugin and use Isis2's neat distributed (programming language) objects feature or any of its other services directly. A proxy/connector, a socket server, is required to communicate between MySQL and Isis2. It is an undesired extra layer. Still... yippie yeahh!

MySQL APIs and language barriers require a compromise

C#/.NET socket server becomes Isis2 client

C/C++ MySQL daemon plugin sends heartbeat to socket server

C/C++ MySQL INFORMATION_SCHEMA plugin

Sad but true: compromises...

[Diagram: MySQL with Plugin: I_S tables and Plugin: heartbeat to Isis2 client, next to an Isis2 client w. socket server; in short, MySQL with Plugin: Connector talking to Isis2 client: CCM/CRM]

The speaker says...

The C#/.NET to C/C++ language barrier does kill some of the simplicity and beauty of the proposal. A MySQL plugin cannot be an Isis2 client. Instead, a C#/.NET Isis2 client (node) has to run as a socket server and communicate with MySQL through a socket. The process model gets complicated. On the MySQL side we now need a daemon plugin that sends a heartbeat to the local Isis2 client and another MySQL plugin that implements the I_S tables a SQL client can query to learn about the cluster's members and state. Think of the plugins as Connectors. The Isis2 client takes the CCM/CRM role as planned.

For every MySQL server do:

Start Isis2 client: mono ./isis_deamon.exe

Configure Connector Plugins, e.g. Isis2 client address

Heartbeat to Isis2: INSTALL PLUGIN isis2d SONAME 'libisis2.so'

I_S Plugin: INSTALL PLUGIN isis2is SONAME 'libisis2.so'

Teach your clients to monitor INFORMATION_SCHEMA

The proposed user manual

mysql> SELECT * FROM INFORMATION_SCHEMA.ISIS2IS\G
*************************** 1. row ***************************
               ERROR:
               ERRNO: 0
        ISIS2_MEMBER: (50249)
          MYSQL_HOST: 127.0.0.1
MYSQL_PORT_OR_SOCKET: 3400
        MYSQL_STATUS: on
          MYSQL_ROLE: master
 MYSQL_HEARTBEAT_TTL: 30
MYSQL_HEARTBEAT_LAST: 07.08.2013 13:50:22
1 row in set (0,01 sec)

The speaker says...

DBA instructions. Start a daemon (only required because of the C# to C/C++ mismatch; otherwise it would be part of the plugin!). Then install some plugins, let the MySQL server announce its availability, and you are done. The servers communicate with each other and jointly discover new servers entering the cluster and servers leaving it. If a master leaves, a failover script can be called to reconfigure the cluster.

On any of the MySQL nodes in your cluster, a client can get a list of currently connected MySQL servers and their state (e.g. role). The state is replicated synchronously. HA built into MySQL, job done ;-). Let's talk about details...

Isis2 client socket server start

[Diagram: a Joining Isis2 client contacts the ORACLE Rendezvous Service of the Isis2 Group, which consists of the connected Isis2 clients, one of them the current leader]

Isis: Searching for the Isis ORACLE...
Isis: Found the Isis.ORACLE service, attempting to connect.
[IsisDaemonMain] Connecting to ISIS...

[state transfer] received: (51971)-127.0.0.1-3400-master-on-30
[state transfer] received: (51971)-192.168.2.1-3306-slave-on-30

The speaker says...

When the DBA starts the Isis2 client socket server on a MySQL host, the client tries to connect to a virtual, distributed group in the cloud. The Isis2 library calls the code that manages virtual groups the ORACLE, no joke!

Once the client is connected, a checkpoint is done to transfer the state of already connected clients to the joining one. The state transferred is the list of the MySQL servers, if any, that contacted their local Isis2 clients to register themselves in the group. All this takes less than 100 lines of C#.

Isis2 client socket server start

[Diagram: the joined (new) Isis2 client is now part of the Isis2 Group (ORACLE Rendezvous Service) together with the already connected Isis2 clients, and starts its local socket server]

[IsisDaemonMain] Starting server for communication with MySQL.
[GroupDaemon] Server is waiting on socket 127.0.0.1:2200

[GroupConnector][view change][ incoming multicasts delivery thread] Some view change (e.g. join), no action required

The speaker says...

As soon as an Isis2 client joins the group, a view change message is sent to all group members. In our case it gets ignored. A joining Isis2 client does not change the list of MySQL servers that have registered themselves in the cluster.

Isis2d heartbeat MySQL plugin

[Diagram: the MySQL Isis2d plugin talks to the socket server of the local Isis2 client, which is a member of the Isis2 Group (ORACLE Rendezvous Service) alongside the remote Isis2 clients]

Q: 31 'join 127.0.0.1 3400 on master'

[GroupConnector][remote mysql register][RemoteRegisterMySQLServer mysql] Added (54830)-127.0.0.1-3400-on-master-07.08.2013 16:28:26-30 New server count 1

The speaker says...

MySQL does not appear on stage before the DBA loads the first of the two Connector plugins, the Isis2 daemon plugin, into MySQL. When the plugin, and with it the MySQL server, starts, it sends a join message to the local Isis2 client. The client parses it and sends a message to all group members. This way, all connected Isis2 clients on all hosts learn (virtually) synchronously about the MySQL server.

The use of the Isis2 SafeSend() function ensures globally ordered and reliable messaging. This is by far the slowest Send() variant, but it simplifies our job. Either all or no Isis2 group members add the MySQL server to their list of servers. Once more: far less than 100 lines of C# ...
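The request strings visible in the logs ('join 127.0.0.1 3400 on master', 'heartbeat 127.0.0.1 3400 on', 'leave 127.0.0.1 3400', 'serverlist') suggest a simple space-separated text protocol between the plugins and the local Isis2 client. The following Python sketch parses such requests. It is not the PoC's C# code: the field names are guesses derived from the INFORMATION_SCHEMA column names, and the number printed after "Q:" in the logs is not interpreted.

```python
def parse_request(text):
    """Parse one request sent by a MySQL plugin to its local Isis2 client."""
    parts = text.split()
    if not parts:
        raise ValueError("empty request")
    command, args = parts[0], parts[1:]
    if command == "join" and len(args) == 4:
        host, port, status, role = args
        return {"cmd": "join", "host": host, "port_or_socket": port,
                "status": status, "role": role}
    if command == "heartbeat" and len(args) == 3:
        host, port, status = args
        return {"cmd": "heartbeat", "host": host, "port_or_socket": port,
                "status": status}
    if command == "leave" and len(args) == 2:
        host, port = args
        return {"cmd": "leave", "host": host, "port_or_socket": port}
    if command == "serverlist" and not args:
        return {"cmd": "serverlist"}
    raise ValueError(f"malformed request: {text!r}")


print(parse_request("join 127.0.0.1 3400 on master"))
```

The port is kept as a string because the matching INFORMATION_SCHEMA column is called MYSQL_PORT_OR_SOCKET, so it may not always be a number.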


[Diagram: the MySQL Isis2d plugin sends a heartbeat to the socket server of the local Isis2 client, which is a member of the Isis2 Group (ORACLE Rendezvous Service) alongside the remote Isis2 clients]

Q: 29 'heartbeat 127.0.0.1 3400 on'
A: 56 '(54830) 127.0.0.1 3400 on master 30 07.08.2013 16:28:36

[GroupConnector][heartbeat][RemoteHeartbeatmysql] heartbeat server (54830)-127.0.0.1-3400-master-on-30-07.08.2013 16:28:36

The speaker says...

Unfortunately, due to the C# to C/C++ language barrier, our MySQL servers are not members of the Isis2 group, but the local Isis2 clients are. Thus, Isis2's own group membership service cannot monitor the MySQL processes directly. A hack, heartbeating, is needed to tell each local Isis2 client about the state of its associated MySQL server. Every now and then the Isis2 heartbeat plugin sends a heartbeat. The local Isis2 client then increases the TTL of the server's entry in the group's server list. It is also the local Isis2 client that may drop a server from the list if it fails to send heartbeat messages.
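The heartbeat and TTL bookkeeping can be illustrated with a small Python sketch. It is not the PoC's C# implementation. The class and method names are hypothetical, and the TTL of 30 shown in the logs is assumed to be in seconds.

```python
import time


class ServerRegistry:
    """Tracks MySQL servers and drops those whose heartbeat TTL ran out."""

    def __init__(self, ttl=30, clock=time.monotonic):
        self.ttl = ttl      # assumed to be seconds
        self.clock = clock  # injectable so that tests can control time
        self.expires = {}   # (host, port) -> time at which the entry expires

    def heartbeat(self, host, port):
        """Register a server or extend the lifetime of its entry."""
        self.expires[(host, port)] = self.clock() + self.ttl

    def expire(self):
        """Remove and return all servers that missed their heartbeat."""
        now = self.clock()
        dead = [key for key, deadline in self.expires.items() if deadline <= now]
        for key in dead:
            del self.expires[key]
        return dead


registry = ServerRegistry(ttl=30)
registry.heartbeat("127.0.0.1", "3400")
print(registry.expire())  # nothing has expired right after a heartbeat
```

In the logs, heartbeats arrive about ten seconds apart, so a server would have to miss several in a row before a 30 second TTL expires.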


[Diagram: the MySQL Isis2d plugin sends a leave message to the socket server of the local Isis2 client, which is a member of the Isis2 Group (ORACLE Rendezvous Service) alongside the remote Isis2 clients]

Q: 22 'leave 127.0.0.1 3400'

[GroupConnector][unregister][RemoteUnregisterMySQLServermysql] removing server (53743)-127.0.0.1-3400-master-on-07.08.2013 16:20:29-30

The speaker says...

Upon shutdown of the plugin, and with it the MySQL server, the Isis2d daemon plugin sends a leave message to its local Isis2 client. The leave message is handled like the join message: parse message, forward to all using SafeSend().

Neither a missed heartbeat nor a leave command triggers any failover logic in the PoC, although this is what should be done. I didn't want to spend more than a few days fooling and toying around with the code. Also, the main message should come across even without triggering a failover script, which would be rather trivial to do, given a proper failover script.
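The missing failover hook could be as small as the following Python sketch. It only shows where a failover script would be triggered. The function name and the callback are hypothetical, and the question of which group member actually runs the script is left open here.

```python
def handle_removal(servers, key, on_master_lost):
    """Drop a server after a 'leave' message or a missed heartbeat.

    servers        -- dict mapping (host, port) to role
    key            -- the (host, port) entry to drop
    on_master_lost -- callable invoked with a copy of the remaining servers
    """
    role = servers.pop(key, None)
    if role == "master":
        on_master_lost(dict(servers))  # e.g. start an external failover script
    return role


servers = {("127.0.0.1", "3400"): "master", ("192.168.2.1", "3306"): "slave"}
handle_removal(servers, ("127.0.0.1", "3400"),
               lambda remaining: print("failover needed, candidates:", remaining))
```

Losing a slave only shrinks the list. Losing the master additionally invokes the callback with the servers that remain as failover candidates.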

Isis2is I_S MySQL plugin

[Diagram: the MySQL Isis2is plugin queries the socket server of the local Isis2 client]

Q: 12 'serverlist'
A: 58 '0 (54830) 127.0.0.1 3400 on master 30 07.08.2013 17:36:58'

mysql> SELECT * FROM INFORMATION_SCHEMA.ISIS2IS\G
***************** 1. row *****************
               ERROR:
               ERRNO: 0
        ISIS2_MEMBER: (54830)
          MYSQL_HOST: 127.0.0.1
MYSQL_PORT_OR_SOCKET: 3400
        MYSQL_STATUS: on
          MYSQL_ROLE: master
 MYSQL_HEARTBEAT_TTL: 30
MYSQL_HEARTBEAT_LAST: 07.08.2013 17:36:58

The speaker says...

The Isis2is INFORMATION_SCHEMA plugin provides a list of all MySQL servers in the cluster through the table INFORMATION_SCHEMA.ISIS2IS. Clients can check the table periodically for news, e.g. new servers or (not implemented) server load, to adapt their load balancing dynamically. Whenever a client fails to connect to a MySQL server, it can ask any of the other MySQL servers it knows about for an update. If required, the client can automatically adapt to membership changes: use new servers, switch to a different master. No DBA action required to deploy clients.

The PECL/mysqlnd_ms code was prepared for zero-deployment/runtime configuration changes years ago. Search the source for hotloading...
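Since the PoC ships no client code, here is a hedged Python sketch of what a client could do with the rows of INFORMATION_SCHEMA.ISIS2IS once it has fetched them with any MySQL driver. The column names are taken from the query output shown on the slides. The function name and the rule that every status other than `on` means "unavailable" are assumptions.

```python
def build_config(rows):
    """Turn rows of INFORMATION_SCHEMA.ISIS2IS into a load balancer config."""
    config = {"master": [], "slave": []}
    for row in rows:
        if row["MYSQL_STATUS"] != "on":
            continue  # assumption: anything but "on" means unavailable
        role = row["MYSQL_ROLE"]
        if role in config:
            config[role].append((row["MYSQL_HOST"], row["MYSQL_PORT_OR_SOCKET"]))
    return config


rows = [
    {"MYSQL_HOST": "127.0.0.1", "MYSQL_PORT_OR_SOCKET": "3400",
     "MYSQL_STATUS": "on", "MYSQL_ROLE": "master"},
    {"MYSQL_HOST": "192.168.2.1", "MYSQL_PORT_OR_SOCKET": "3306",
     "MYSQL_STATUS": "on", "MYSQL_ROLE": "slave"},
]
print(build_config(rows))
```

A client that rebuilds this structure after every failed connection attempt, or on a timer, never needs a redeployed configuration file.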

No more than an illustration of the idea

NOT stable, NOT complete, NO sketch of client code

Because: my first three days ever with C#/Mono hacking

Because: the C#/Mono approach raises question marks

Because: PoC... just fun, just spreading ideas!

Neat, simple, no SPOF, zero-administration approach ?!

Only 509 lines of C# for the Isis2 client socket server

Only 462 lines of C/C++ for the two MySQL server plugins

Available on blog.ulf-wendel.de

Proof of Concept Code

The speaker says...

The code of the Isis2 client socket server and the two plugins is available at blog.ulf-wendel.de. It is PoC code to illustrate the basic idea, no less, no more. The code is neither complete (e.g. join is not repeated if it fails initially, and there is no failover logic) nor free of bugs (e.g. UNINSTALL PLUGIN may crash under certain circumstances).

This, however, does not matter much when the focus is on discussing different designs for HA solutions and suggesting a mashup of two existing ones. A mashup with strong client support for zero-administration.

Happy hacking!

THE END

Contact: [email protected]

The speaker says...

Thank you for your attendance!
Upcoming shows:

PHP Unconference
Hamburg, September 2013

PHP Summit
Munich, December 2013
