HW8: MySQL Query Processing CS245 Winter 2017
Due: Mar 7, 2017, 23:59pm
MySQL Query Processing [1]
Homework 8, CS245 Winter 2017
In this assignment, we will see how MySQL parses, normalizes, and rewrites queries by looking at
its code with the help of a debugger.
Note that you will submit this homework electronically via Gradiance:
http://www.newgradiance.com/services
Taking notes while following the instructions and questions in this document will help you solve
the problems in Gradiance. However, you do NOT have to submit any cleanly written answers to
the questions in this document.
Please start early! At least download the necessary files first, even if you plan to work on the
assignment later. The virtual machine disk image is large and may take some time to download. Also
note that because you need to understand and operate real software, this assignment can take longer
than a typical Gradiance homework.
A. Setup
If you have already followed the setup instructions in the “MySQL/InnoDB B+Tree” assignment,
only the last subsection about GDB and DDD is new, and the first few can be safely skipped.
A.1. Downloading
● VirtualBox disk image http://www.stanford.edu/class/cs245/data/cs245.vdi.zip (1.3GiB)
○ A smaller 7zip file (856MiB) is also available: replace .zip with .7z in the URL above to
download it if you can handle 7zip files or are having trouble downloading the larger one.
A.2. VirtualBox
After downloading the files, set up a virtual machine using the VirtualBox disk image (cs245.vdi).
1. Install VirtualBox using the provided installer, or you may choose to download the
installer instead from https://www.virtualbox.org/wiki/Downloads.
2. Create a new virtual machine:
a. Start VirtualBox, click the “New” button, and then enter the following values:
[1] This assignment was initially created by Dennis Sidharta and revised by Jaeho Shin.
■ Name: cs245 (or whatever name you like)
■ Type: Linux
■ Version: Red Hat (64 bit)
b. Set other settings like memory size, etc. to your liking.
c. For the hard drive, make sure to point to the provided cs245.vdi.
Click “Start” to launch the virtual machine. To login, use the following username and password:
● username: root
● password: cs_245
VirtualBox’s Guest Additions have been installed in the virtual machine. Guest Additions add a few
features such as mouse-pointer integration with the host OS, shared folder support, etc.
A.3. MySQL
We will be using MySQL Community Server version 5.6.10, which has been installed in the virtual
machine.
Here are various important folders and files that you may need to know:
● Source location: /root/workspace/mysql-5.6.10
● Installation location: /usr/local/mysql
● Configuration file: /etc/my.cnf
● Other files:
○ /var/lib/mysql/mysql.sock
○ /var/log/mysqld.log
○ /var/run/mysqld/mysqld.pid
A.4. GDB and DDD
DDD has also been installed for you. DDD (http://www.gnu.org/software/ddd) provides a graphical
front-end to GDB (http://www.gnu.org/software/gdb/gdb.html). To launch it, type ddd in a
terminal.
Inspecting a program via DDD requires two steps (more on this later):
1. Loading the program.
2. Attaching DDD to the program’s running process.
B. Starting MySQL and Attaching DDD to It
B.1. Launching MySQL Server
The commands to start and to stop MySQL server are, respectively:
● start: mysqld --debug
● stop: mysqladmin shutdown
Open a terminal, and then start MySQL server. Once the server is started, find out its process id by
executing the following command (the prompt and the output are shown):
# ps aux | grep mysqld
9323 pts/0 Sl+ 0:00 ./mysqld --debug
In the example above, the process id is 9323. Make a note of the id printed in your terminal; we
will use it soon. Do not shut down the server yet.
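If you prefer to script this lookup, the following is a minimal sketch. It assumes the process listing has the process id as its first whitespace-separated field, as in the sample line above; the column layout of ps on your system may differ. The helper name find_mysqld_pids and the second sample line are hypothetical, for illustration only.

```python
def find_mysqld_pids(ps_output):
    """Return the process ids of lines that run mysqld, skipping the grep itself."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # header line or malformed line
        # Assumption: the command starts at the fifth field, as in the sample above.
        command = " ".join(fields[4:])
        if "mysqld" in command and "grep" not in command:
            pids.append(int(fields[0]))
    return pids


# Illustrative input: the first line is the sample from this section,
# the second line is a made-up entry for the grep process itself.
sample = """\
9323 pts/0 Sl+ 0:00 ./mysqld --debug
9401 pts/1 S+ 0:00 grep mysqld
"""
print(find_mysqld_pids(sample))
```

The grep line is filtered out because it also contains the word mysqld and would otherwise be reported as a second candidate.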
B.2. MySQL Client
In a separate terminal, start a MySQL client by executing mysql -uroot. Once it is started, set
the default DB to employee_db by executing use employee_db;
Do not terminate the client for the moment, but if you need to, simply execute exit.
B.3. DDD
B.3.1. Attaching to MySQL
In a separate terminal, start DDD by typing ddd. Once started, attach it to the currently running
MySQL server process:
1. Select “File” from the menu, “Open Program...”, and then enter the following in the
“Program” field: /usr/local/mysql/bin/mysqld
Alternatively, you can start DDD with the path to the program as an argument to the
command:
ddd /usr/local/mysql/bin/mysqld
2. In the GDB console (at the bottom window), type the following:
attach process_id
where process_id is the one you noted in section B.1, e.g., 9323. If the GDB console is not
shown at the bottom of DDD’s screen, activate it by selecting “View...” from the menu and
then “GDB Console.”
Because the MySQL server process runs as a different user (mysql) than you (root), you
cannot use DDD’s “Attach to Process...” feature in its “File” menu to simply select mysqld
from the list; you must find the process id by running a separate command.
B.3.2. Looking up Source Code Elements
You can easily bring up the part of MySQL’s source code you want to see using DDD’s Lookup
feature. Enter a query in the text input box at the top of DDD’s window, then press the Return or
Enter key, or click the “Lookup” button right next to it. The query can be many things including the
following ones:
● File name with an optional line number, e.g., sql_parse.cc:1134
● Function name, e.g., dispatch_command
● Global variable name, e.g., thread_scheduler
● Class name, e.g., THD
● Type name (structs, enums, typedefs, etc.), e.g., scheduler_functions and
enum_server_command
Unfortunately, DDD cannot look up macro or enum constant names, e.g., MYSQL_CALLBACK or
COM_QUERY, as such information is not available as debug symbols in the compiled binary.
Although looking up such information will not be critical for this assignment, using dedicated
analysis tools will definitely help you quickly navigate through the source code. Please consider
using cscope or ctags with your favorite editor if you want to dive deeper into the source code.
A quick web search will teach you how to use them.
B.3.3. Setting Breakpoints
You can now set breakpoints via the GDB console. For example, to put a breakpoint in sql_parse.cc
at line 1134:
break /root/workspace/mysql-5.6.10/sql/sql_parse.cc:1134
Alternatively, from the source code you brought up in DDD’s source window, right-clicking on a
line will let you set or delete a breakpoint on it. Displaying the line numbers in the source window
by selecting “Source” from the menu and enabling “Display Line Numbers” will help.
You may have noticed that attaching to a running process pauses its execution. In order to let it
continue running, and eventually reach the breakpoints, we must resume the process afterwards.
Remember to always resume the MySQL server process after setting up your breakpoints by
either typing cont (or simply c) into the GDB console, or using the “Cont” button in the middle of
the floating window.
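The order of the steps in B.3 matters: attach first, set breakpoints while the server is paused, and resume last. As a minimal sketch, the helper below (gdb_commands is a hypothetical name, not part of GDB or DDD) only builds the text of the console commands described above so you can see the sequence; it does not talk to GDB itself.

```python
SOURCE_ROOT = "/root/workspace/mysql-5.6.10/sql"  # source directory named in this document


def gdb_commands(pid, breakpoints):
    """Build the GDB console commands: attach, set breakpoints, then resume."""
    commands = [f"attach {pid}"]
    for file_name, line in breakpoints:
        commands.append(f"break {SOURCE_ROOT}/{file_name}:{line}")
    commands.append("cont")  # resume the paused server only after all breakpoints are set
    return commands


for command in gdb_commands(9323, [("sql_parse.cc", 1134)]):
    print(command)
```

Typing the printed lines into the GDB console one at a time has the same effect as following sections B.3.1 and B.3.3 by hand.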
1. Debugging Basics
The goal of this problem is to familiarize yourself with the environment to debug MySQL server.
We will inspect how MySQL server creates threads and handles connections. Unless noted
otherwise, the source files that we will look at in this problem are all located in
/root/workspace/mysql-5.6.10/sql. Also, we will use the following convention to refer to a function:
file_name:function_name().
MySQL server creates a thread for each new connection.
1. In mysqld.cc:handle_connections_sockets(), you will find an infinite loop,
listening for new connections. Take a few minutes to look inside this function. You do not
have to understand everything that is going on. What is the name of the static function
in mysqld.cc that creates the new thread?
2. Eventually, the thread’s control is handed over to
sql_connect.cc:do_handle_one_connection(), sql_parse.cc:do_command(), and
then to sql_parse.cc:dispatch_command(), etc. Take a peek at those functions. Notice
that thd is heavily used, and is passed around between many functions. It is of the type
THD, which is defined in sql_class.h. What are the direct parent classes of THD?
3. thd holds, among other things, all of the information related to the current thread. We
will inspect the contents of this object, and so set a breakpoint at the first statement in
sql_parse.cc:do_command(). Report the command that you used. Because you may see
other commands sent by MySQL client while it starts up, it is recommended to set the
breakpoint after you have a MySQL client ready to accept your input.
4. To get to the breakpoint, first continue the execution of MySQL server, and run the
following SQL query in the MySQL client:
select * from employee limit 1;
Once the execution stops at the breakpoint, double-click thd to show that variable in
DDD’s graphical data window. Notice that double clicking a variable generates a
command in the GDB console. What was the generated command?
5. Move the execution forward by typing n in the GDB console (or clicking the “Next” button)
until you reach the statement that accesses net->read_pos. Here, thd->net.read_pos
stores the raw query read from the network. Show its value by executing the following in
the GDB console: p thd->net.read_pos. What is thd->net.read_pos’s value? Another way
to discover a variable’s value is by double-clicking the variable in the graphical data
window. Explore thd by double-clicking it further.
6. Notice the extra byte prefixing the raw query stored in thd->net.read_pos. That
first byte tells us the query’s type. In fact, it is an enum enum_server_command,
defined in /root/workspace/mysql-5.6.10/include/mysql_com.h. Execute the following
command:
p (enum enum_server_command) ((uchar) thd->net.read_pos[0])
What is the type of the query we executed in question 4?
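To make the layout of the raw buffer concrete, here is a minimal sketch that splits a packet into its leading command byte and the query text. The list SERVER_COMMANDS holds only the first few enum_server_command names in declaration order as recalled from include/mysql_com.h; treat that ordering as an assumption and verify it against the header in the virtual machine. The function name split_packet and the sample packet are hypothetical.

```python
# First few values of enum_server_command, in declaration order.
# Assumption: ordering as recalled from include/mysql_com.h; verify it there.
SERVER_COMMANDS = ["COM_SLEEP", "COM_QUIT", "COM_INIT_DB", "COM_QUERY", "COM_FIELD_LIST"]


def split_packet(packet):
    """Split a raw packet into (command name, payload text)."""
    if not packet:
        raise ValueError("empty packet")
    code = packet[0]  # indexing bytes yields an int
    if code < len(SERVER_COMMANDS):
        name = SERVER_COMMANDS[code]
    else:
        name = f"unknown({code})"
    return name, packet[1:].decode("utf-8")


# Illustrative packet: one command byte followed by the SQL text.
print(split_packet(b"\x03select * from employee limit 1"))
```

The cast in the GDB command above does the same thing as the list lookup here: it interprets the first byte as an enum value so that the debugger prints a name instead of a number.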
2. Query Parsing
In this problem, we will look at MySQL’s context-free-grammar rules for its SQL statements, and
then we will generate a parse tree of a valid statement.
We will use the employee table from the employee_db, whose schema looks like:
MySQL uses GNU Bison (http://www.gnu.org/software/bison/) to generate parsers of SQL
statements. The context free grammar rules for the statements are defined in
/root/workspace/mysql-5.6.10/sql/sql_yacc.yy. Briefly skim this file.
You will notice various defines, function declarations, terminal symbols (those declared with
%token), and grammar rules. Look at the rules for parsing a select statement. The first two
such rules are reproduced below:
create_select:
    SELECT_SYM
    {
      // ...
    }
    select_options select_item_list
    {
      // ...
    }
    opt_select_from
    {
      // ...
    }
    ;
select_options:
    /* empty */
    | select_option_list
    {
      // ...
    }
    ;
Here are a few notes about the above rules:
1. SELECT_SYM is a terminal symbol, and it is defined earlier in the file as %token
SELECT_SYM. It represents a reserved MySQL keyword. The value of the reserved
keyword, which in this case is “SELECT,” can be found in lex.h.
2. In the two rules reproduced above (create_select and select_options), we did
not show the semantics definitions (also called “actions”), which are C statements
declared inside the curly braces. An action defined for a particular rule is executed
whenever that rule matches. For this problem, you do not need to understand actions.
3. Therefore, create_select consists of SELECT_SYM followed by select_options,
select_item_list, and opt_select_from.
4. “|” signifies alternatives. And so, select_options is either an empty string or a
select_option_list.
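The notes above can be illustrated with a toy recursive-descent parser. This is only a sketch of how a rule with an empty alternative behaves, not MySQL's Bison-generated parser. The definitions of select_option_list, select_item_list, and opt_select_from below are hypothetical simplifications (the real rules in sql_yacc.yy are much richer), and the class name ToyParser is made up.

```python
OPTION_KEYWORDS = {"DISTINCT", "ALL"}  # hypothetical subset of select options


class ToyParser:
    """Toy parser for: SELECT select_options select_item_list opt_select_from."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        if token is None or token in {",", "FROM"}:
            raise SyntaxError(f"unexpected token {token!r}")
        self.pos += 1
        return token

    def create_select(self):
        if self.peek() != "SELECT":
            raise SyntaxError("expected SELECT")
        self.pos += 1
        children = ["SELECT", self.select_options(),
                    self.select_item_list(), self.opt_select_from()]
        if self.peek() is not None:
            raise SyntaxError(f"trailing token {self.peek()!r}")
        return ("create_select", children)

    def select_options(self):
        # Either the empty alternative (no children) or a list of options.
        options = []
        while self.peek() in OPTION_KEYWORDS:
            options.append(self.take())
        return ("select_options", options)

    def select_item_list(self):
        items = [self.take()]
        while self.peek() == ",":
            self.pos += 1
            items.append(self.take())
        return ("select_item_list", items)

    def opt_select_from(self):
        if self.peek() != "FROM":
            return ("opt_select_from", [])  # empty alternative
        self.pos += 1
        return ("opt_select_from", ["FROM", self.take()])


print(ToyParser(["SELECT", "NULL"]).create_select())
print(ToyParser(["SELECT", "DISTINCT", "id", "FROM", "employee"]).create_select())
```

In the first call, both select_options and opt_select_from match their empty alternatives, so they appear in the tree with no children.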
For example, the figure below shows the parse tree of the following SQL statement:
select NULL;
Explore the related grammar rules, and then draw a parse tree of the following SQL statement:
select * from employee where dept = 'engr' limit 10;
You may immediately replace IDENT, TEXT_STRING, and NUM symbols with any identifiers, text
strings, and numeric values, respectively.
3. Query Normalization
The goal of this problem is to see MySQL’s internal representation of a query. As in problem 2, we
will use the employee table from the employee_db.
As you explore MySQL’s source code, you will encounter the Item class very often. Item is defined in
item.h along with many other classes derived from it. It is used to manipulate various items, such
as fields, functions, etc.
In the previous problem, we generated the parse tree of the following statement:
select * from employee where dept = 'engr' limit 10;
The enum_server_command of the statement above is COM_QUERY. Take a moment to skim
sql_parse.cc:dispatch_command() and then sql_parse.cc:mysql_parse(). Notice that the
former calls the latter.
Start MySQL server, MySQL client, and DDD. And then, set a breakpoint in
sql_parse.cc:mysql_parse() at the line immediately after the following statement:
err = parse_sql(thd, parser_state, NULL)
which should be at around line 6055. Once set, execute the query above.
The parsed result is stored in thd->lex->select_lex. And so, when the program stops at the
breakpoint, inspect the contents of select_lex. You may navigate to select_lex via DDD’s
visualization by double-clicking thd, and then lex (hint: lex is defined in the Statement
class, which is THD’s parent class), or you may execute the following command in DDD’s console:
graph display `p thd->lex->select_lex`
Notice the backquotes in the command.
By looking at the select_lex’s fields, can you tell which of them store the table name, the
column names (in this case “*”), the where clause, and the limit clause for the query you just
executed?
1. Questions on table_list:
a. It is often helpful to know what sort of object you are dealing with by printing its
pointer type in the GDB console. Execute:
p &(thd->lex->select_lex.table_list)
What is table_list’s type? Note that “&” is the address-of operator, i.e., you
just printed the address of table_list.
b. The definition of table_list’s type, which is a linked list, can be found in
sql_list.h. Briefly skim the class. Notice that it has a member named first. So, let us
display the first element of the linked list by executing:
graph display `p thd->lex->select_lex.table_list.first`
Navigate into the displayed variable by double clicking it. Into which fields are
the DB and table names stored?
2. Questions on item_list:
a. Similarly, what is item_list’s type? Report the command you used.
b. Just like table_list, item_list is also a linked list. Its definition can also be
found in sql_list.h. Briefly skim the class. Notice that it has a head() method. So,
let us display the head element by executing:
graph display `call thd->lex->select_lex.item_list.head()`
Into which field is the column name (in this case “*”) stored?
3. Questions on select_limit:
a. When you execute p thd->lex->select_lex.select_limit, you will
find that select_limit’s type is Item*. However, select_limit is an
instance of Item_int, which is derived from Item. Here, Item::type()
provides a hint. Execute:
call thd->lex->select_lex.select_limit->type()
What is the select_limit’s enum Type?
b. Execute:
graph display `p (Item_int*) thd->lex->select_lex.select_limit`
What is the value of value?
4. Questions on where:
a. Similar to 3.a, what is where’s enum Type? Report the command you used.
b. Similar to 3.b, graph display where. What is the value of arg_count? Report
the command you used.
c. Navigate into the fields of where to see how the where clause of the query you
executed is stored internally. In which fields are the arguments (in this case
“dept” and “engr”) stored?
d. Can you tell how args, which is a pointer to a pointer, is related to tmp_arg, an
array of two pointers? What is the role of the next pointer in the Items
pointed to by args and tmp_arg?
e. You could tell where is an instance of Item_func by calling its type().
However, it is actually an instance of a subclass that is derived from Item_func,
and Item_func::functype() can provide a hint. Execute:
call ((Item_func*) thd->lex->select_lex.where)->functype()
What is where’s function type?
f. Take a look inside item_cmpfunc.h, and then based on your answer to the
previous question 4.e, of which class is where an instance? Hint: For instance,
Item_func_xor is the class for Item_func::XOR_FUNC.
g. And, at precisely which line of the source code do you think where was
instantiated as that class? Hint: where_clause in sql_yacc.yy.
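The pattern used throughout this problem (hold a pointer of the general Item type, ask type() or functype() which subclass it really is, and only then cast and read subclass-specific fields) can be sketched with a toy model. The classes ToyInt and ToyFunc and the tags TOY_INT, TOY_FUNC, and TOY_LT are hypothetical stand-ins, deliberately not the real MySQL names, and Python needs no casts, so the tag check plays the role of the cast in the GDB commands above.

```python
class Item:
    """Base class: callers usually hold references of this general type."""

    def type(self):
        raise NotImplementedError


class ToyInt(Item):
    def __init__(self, value):
        self.value = value  # subclass-only member

    def type(self):
        return "TOY_INT"


class ToyFunc(Item):
    def __init__(self, functype, *args):
        self._functype = functype
        self.args = list(args)  # operands of the function
        self.arg_count = len(self.args)

    def type(self):
        return "TOY_FUNC"

    def functype(self):
        return self._functype


def describe(item):
    """Check the tag first, then touch members that only the subclass has."""
    if item.type() == "TOY_INT":
        return f"int literal {item.value}"
    if item.type() == "TOY_FUNC":
        return f"{item.functype()} with {item.arg_count} args"
    return "unknown item"


limit = ToyInt(5)
where = ToyFunc("TOY_LT", "salary", ToyInt(100))
print(describe(limit))
print(describe(where))
```

As in questions 3 and 4, the coarse tag from type() tells you which family an object belongs to, and a second, finer tag such as functype() narrows it down to the concrete subclass.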
4. Query Rewriting
In this problem, we will look at how MySQL rewrites queries, in particular, how it removes
conditions that always evaluate to true or false.
For this problem, we will execute the following statements:
set @var := (select id from employee order by id asc limit 1);
select * from employee where @var = 1 limit 10;
We want to see how MySQL replaces the condition, @var = 1, with Item::COND_TRUE or
Item::COND_FALSE. Take a moment to skim sql_optimizer.cc:JOIN::optimize(). Pay
attention to the statements that call optimize_cond() [2], particularly to the following
statement:
conds = optimize_cond(thd, conds, &cond_equal,
join_list, true, &select_lex->cond_value);
You can find the definition of JOIN::conds in sql_optimizer.h. It refers to the same Item object
referred to by thd->lex->select_lex.where [3].
The value that results from evaluating where is stored in select_lex->cond_value, which
can be one of the following: Item::COND_UNDEF, Item::COND_TRUE, Item::COND_FALSE,
or Item::COND_OK. The semantics of those values can be found in the comments immediately
preceding conds’ declaration, in sql_optimizer.h.
In order to remove the redundant condition in our query, the control will move from
optimize_cond() to remove_eq_conds(), and then to internal_remove_eq_conds().
And so, briefly skim those functions as well.
Start MySQL server, MySQL client, and DDD. And then, execute the following statement in MySQL
client:
[2] optimize() will also eventually call simplify_joins(). Although we will not be discussing how joins are rewritten in this assignment, it will be good to know how MySQL does it, and so make sure to explore that function and to read its documentation.
[3] If you are curious, you can find the assignment of select_lex->where to conds in
sql_resolver.cc:JOIN::prepare(). It is a good exercise to trace the function calls to get there from
main(), and then to sql_optimizer.cc:JOIN::optimize(). Hint: You may start at
sql_select.cc:handle_select().
set @var := (select id from employee order by id asc limit 1);
Afterwards, put a breakpoint at the first statement in internal_remove_eq_conds(). Notice
that, from the function signature, conds is now referred to as cond.
Once the breakpoint is set, execute the following query:
select * from employee where @var = 1 limit 10;
1. When the program stops at the breakpoint,
a. Execute call ((Item_func*) cond)->functype() in the GDB console.
What is cond’s function type?
b. Execute call cond->const_item() in the GDB console. Is cond a const
item?
c. And then, execute call cond->is_expensive(). Does cond represent an
expression that is expensive to compute?
2. Evaluating cond involves comparing two values. Execute the following command:
graph display `p (Item_func_eq*) cond`
Then, inspect cond’s cmp member. The two values mentioned above are stored in cmp::a and
cmp::b.
a. Like cond, cmp::a is also of the type Item::FUNC_ITEM. Execute:
call ((Item_func*)(*(((Item_func_eq*)cond)->cmp.a)))->functype()
What is cmp::a’s function type?
b. Take a look inside item_func.h, and then based on your answer to question 2.a., of
which class is cmp::a an instance?
c. Execute the following command:
call (*((Item_func_eq*) cond)->cmp.a)->val_int()
What is cmp::a’s integer value?
d. Unlike cond, cmp::b is not a function. What is its type, i.e., the value returned
by type()? Also, report the command you used.
e. What is cmp::b’s integer value? Report the command you used.
3. Move the program’s execution forward until it reaches the following statement:
*cond_value = eval_const_cond(cond) ? Item::COND_TRUE : Item::COND_FALSE;
Then, execute p eval_const_cond(cond). What boolean value does cond evaluate
to?
Thus, cond_value stores either Item::COND_TRUE or Item::COND_FALSE. And,
JOIN::conds is, therefore, set to (Item*) 0, i.e., null.
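The outcome described above can be summarized in a toy model of constant-condition removal. This is a sketch under simplifying assumptions, not MySQL's implementation: ToyEq and remove_const_cond are hypothetical names, operands are plain Python ints (constants) or strings (column names), and only the three cond_value outcomes relevant here are modeled.

```python
COND_OK, COND_TRUE, COND_FALSE = "COND_OK", "COND_TRUE", "COND_FALSE"


class ToyEq:
    """Toy equality condition; operands are ints (constants) or str column names."""

    def __init__(self, left, right):
        self.left, self.right = left, right

    def const_item(self):
        # The condition is constant only if neither side depends on a row.
        return isinstance(self.left, int) and isinstance(self.right, int)

    def evaluate(self):
        return self.left == self.right


def remove_const_cond(cond):
    """Return (remaining condition, cond_value) for a single toy condition."""
    if cond.const_item():
        value = COND_TRUE if cond.evaluate() else COND_FALSE
        return None, value  # the condition itself is dropped
    return cond, COND_OK  # must still be checked for every row


print(remove_const_cond(ToyEq(1, 1)))
print(remove_const_cond(ToyEq(2, 1)))
remaining, value = remove_const_cond(ToyEq("id", 1))
print(remaining is not None, value)
```

A constant condition is evaluated once and replaced by its truth value, which is why the remaining condition becomes null in that case, while a condition that refers to a column has to be kept.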