Binlog 事件格式解析 一文中,笔者对 Format_description_event 的格式进行了解析,并在文末给出了 sysbench write-only test 的火焰图。本文将对火焰图中 MySQL 的行日志事件进行解析。

FlameGraph

首先看下火焰图中的变更修改:

handler::ha_write_row

图 1: handler::ha_write_row

handler::ha_update_row

图 2: handler::ha_update_row

handler::ha_delete_row

图 3: handler::ha_delete_row

如上三种操作分别对应 Write_rows_log_event, Update_rows_log_event 和 Delete_rows_log_event,除此之外,每种变更事件都伴随着一个 Table_map_log_event。

事件的类结构

binlog 中事件的继承结构非常相似,通常都是 *_event 会有一个对应的 *_log_event, *_event 继承自 Binary_log_event 定义该事件的数据布局,*_log_event 则继承 *_eventLog_eventLog_event 定义了读写接口。如:

        Binary_log_event
               ^
               |
               |
    Table_map_event  Log_event
                \       /
                 \     /
                  \   /
           Table_map_log_event

特殊的是 Write_rows_log_event/Update_rows_log_event/Delete_rows_log_event 三种事件都是 row_based logging,其数据布局非常相似,主要区别在于 row 字段存储内容的不同:

  • Write_rows_log_event 存储插入的数据
  • Update_rows_log_event 存储修改前和修改后的数据
  • Delete_rows_log_event 存储删除前的数据

binlog 为此抽象出一个 Rows_log_event,该事件满足上述的继承结构:

        Binary_log_event
               ^
               |
               |
         Rows_event  Log_event
                \       /
          <<vir>>\     /
                  \   /
              Rows_log_event

对于上面的三种变更事件,由于多了一层抽象,使得其继承结构变得有些复杂,如 Write_rows_log_event:

                         Binary_log_event
                                  ^
                                  |
                                  |
                                  |
                Log_event   B_l:Rows_event
                     ^            /\
                     |           /  \
                     |   <<vir>>/    \ <<vir>>
                     |         /      \
                     |        /        \
                     |       /          \
                  Rows_log_event    B_l:Write_rows_event
                             \          /
                              \        /
                               \      /
                                \    /
                                 \  /
                                  \/
                        Write_rows_log_event

  B_l: Namespace Binary_log

通过对三个变更事件菱形继承结构右侧的代码(如 Write_rows_event),发现将 Rows_log_event 实现为模板类会更简洁。

Table_map_log_event

Table_map_log_event 记录了数据库表的元数据信息,包括数据库名,表名及字段类型等信息,具体的数据布局信息定义在 Table_map_event 类:

字段名 格式 简要描述
table_id 6 字节无符号整型 表示该表的 id 值
flags 2 字节 代码注释中说是预留未来使用,其实现在已经使用了
表 1: Post-Header for Table_map_event
字段名 格式 简要描述
database_name 1 字节长度 + null 结尾字符串 因为行日志事件很多时候并不指定数据库名,所以在该事件中有必要记录数据库名。长度不含 null
table_name 1 字节长度 + null 结尾字符串 行日志事件所发生的表名
column_count Packed Integer (当有大量小数值的时候用于节省存储空间的一种编码格式,详见 net_store_length) 数据表所含列数
column_type column_count 长度的字节数组 每字节代表一列数据的类型,枚举值见 enum_field_types
metadata_length Packed Integer 元数据长度
metadata metadata_length 个字节 有些列的类型需要存储元数据,详见代码注释
null_bits 字节数组,数组长度 = int((column_count + 7) / 8) 每一位表示对应列是否可以为 NULL
optional metadata fields Type(byte), Length(Packed Interge), Value(Length byte) 详见代码注释
表 2: Body for Table_map_event

了解了 Table_map_event 的结构之后,下面看一个 binlog 文件中具体的表映射事件。

首先创建数据库并定义表结构:

create database zhjwpku;
use zhjwpku;
CREATE TABLE t
(
  id   INT NOT NULL,
  name VARCHAR(20) NOT NULL,
  date DATE NULL
) ENGINE = InnoDB;

写入一条数据 INSERT INTO t VALUES(1, 'apple', NULL); 然后观察 binlog 文件:

root@ubuntu-bionic:/data/data# mysqlbinlog --start-position=931646961 --hexdump binlog.000001
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 156
#201101  5:58:30 server id 1  end_log_pos 125 CRC32 0x52873723
# Position  Timestamp   Type   Master ID        Size      Master Pos    Flags
# 0000009c 86 4e 9e 5f   0f   01 00 00 00   79 00 00 00   7d 00 00 00   01 00
# 000000af 04 00 38 2e 30 2e 32 32  00 00 00 00 00 00 00 00 |..8.0.22........|
# 000000bf 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 |................|
# 000000cf 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 |................|
# 000000df 00 00 00 00 86 4e 9e 5f  13 00 0d 00 08 00 00 00 |.....N..........|
# 000000ef 00 04 00 04 00 00 00 61  00 04 1a 08 00 00 00 08 |.......a........|
# 000000ff 08 08 02 00 00 00 0a 0a  0a 2a 2a 00 12 34 00 0a |.............4..|
# 0000010f 28 01 23 37 87 52                                |...7.R|
#       Start: binlog v 4, server v 8.0.22 created 201101  5:58:30 at startup
# Warning: this binlog is either in use or was not closed properly.
ROLLBACK/*!*/;
BINLOG '
hk6eXw8BAAAAeQAAAH0AAAABAAQAOC4wLjIyAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAACGTp5fEwANAAgAAAAABAAEAAAAYQAEGggAAAAICAgCAAAACgoKKioAEjQA
CigBIzeHUg==
'/*!*/;
# at 931646961
#201107 14:12:16 server id 1  end_log_pos 931647020 CRC32 0xda52b5cf
# Position  Timestamp   Type   Master ID        Size      Master Pos    Flags
# 3787cdf1 40 ab a6 5f   13   01 00 00 00   3b 00 00 00   2c ce 87 37   00 00
# 3787ce04 8c 00 00 00 00 00 01 00  07 7a 68 6a 77 70 6b 75 |.........zhjwpku|
# 3787ce14 00 01 74 00 03 03 0f 0a  02 50 00 04 01 01 00 02 |..t......P......|
# 3787ce24 03 fc ff 00 cf b5 52 da                          |......R.|
#       Table_map: `zhjwpku`.`t` mapped to number 140

使用 mysqlbinlog 可以很清楚的看出 Table_map_event 的 Common-Header 代表的各项,将之后的数据逐位解析,得到

  • table_id: 8c 00 00 00 00 => 140
  • flags: 01 00 => TM_BIT_LEN_EXACT_F
  • database_name: 07 7a 68 6a 77 70 6b 75 00 => “zhjwpku”
  • table_name: 01 74 00 => “t”
  • column_count: 03 => 三列
  • column_type: 03 0f 0a => 分别对应 MYSQL_TYPE_LONG、MYSQL_TYPE_VARCHAR、MYSQL_TYPE_DATE
  • metadata_length: 02 => 只有 MYSQL_TYPE_VARCHAR 需要两字节的元数据
  • metadata: 50 00 => 表示 varchar 最大长度为 80。定义表的时候定义的 varchar(20),留个疑问后面确认
  • null_bits: 04 => 第三字段可以为 NULL,即 date
  • optional metadata fields:
    • SIGNEDNESS: 01 01 00 => 所有字段都没有 unsign 标志
    • DEFAULT_CHARSET: 02 03 fc ff 00
  • crc: cf b5 52 da

Rows_log_event

Rows_log_event 可以用来表示 Write_rows_log_event/Update_rows_log_event/Delete_rows_log_event 三种事件,其数据布局定义在 Rows_event 中:

字段名 格式 简要描述
table_id 6 字节无符号整型 表示该表的 id 值
flags 2 字节 代码注释中说是预留未来使用,其实现在已经使用了
表 3: Post-Header for Rows_event
字段名 格式 简要描述
extra_row_info Extra_row_info 本文忽略该字段,有兴趣读者自行查阅
width Packed Integer 代表列的个数
cols 字节数组,数组长度 = int((width + 7) / 8) 每个列是否被使用,每个位代表一个字段
rows 0行或多行数据 对于一次修改多行的语句,该事件会有多行数据
表 4: Body for Rows_event

代码中的注释和实际实现上有所不同,如上表格以代码实现为准。我们以上面的插入语句为例,分析 binlog 文件中的 Write_rows_log_event 内容:

root@ubuntu-bionic:/data/data# mysqlbinlog --start-position=931647020 --hexdump binlog.000001
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 156
#201101  5:58:30 server id 1  end_log_pos 125 CRC32 0x52873723
# Position  Timestamp   Type   Master ID        Size      Master Pos    Flags
# 0000009c 86 4e 9e 5f   0f   01 00 00 00   79 00 00 00   7d 00 00 00   00 00
# 000000af 04 00 38 2e 30 2e 32 32  00 00 00 00 00 00 00 00 |..8.0.22........|
# 000000bf 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 |................|
# 000000cf 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 |................|
# 000000df 00 00 00 00 86 4e 9e 5f  13 00 0d 00 08 00 00 00 |.....N..........|
# 000000ef 00 04 00 04 00 00 00 61  00 04 1a 08 00 00 00 08 |.......a........|
# 000000ff 08 08 02 00 00 00 0a 0a  0a 2a 2a 00 12 34 00 0a |.............4..|
# 0000010f 28 01 23 37 87 52                                |...7.R|
# 	Start: binlog v 4, server v 8.0.22 created 201101  5:58:30 at startup
ROLLBACK/*!*/;
BINLOG '
hk6eXw8BAAAAeQAAAH0AAAAAAAQAOC4wLjIyAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAACGTp5fEwANAAgAAAAABAAEAAAAYQAEGggAAAAICAgCAAAACgoKKioAEjQA
CigBIzeHUg==
'/*!*/;
# at 931647020
#201107 14:12:16 server id 1  end_log_pos 931647066 CRC32 0xe0258212
# Position  Timestamp   Type   Master ID        Size      Master Pos    Flags
# 3787ce2c 40 ab a6 5f   1e   01 00 00 00   2e 00 00 00   5a ce 87 37   00 00
# 3787ce3f 8c 00 00 00 00 00 01 00  02 00 03 ff 04 01 00 00 |................|
# 3787ce4f 00 05 61 70 70 6c 65 12  82 25 e0                |..apple....|
# 	Write_rows: table id 140 flags: STMT_END_F

BINLOG '
QKumXx4BAAAALgAAAFrOhzcAAIwAAAAAAAEAAgAD/wQBAAAABWFwcGxlEoIl4A==
'/*!*/;
  • table_id: 8c 00 00 00 00 00 => 140
  • flags: 01 00 => STMT_END_F
  • extra_row_info: 02 00 => EXTRA_ROW_INFO_HEADER_LENGTH
  • width: 03 => 三列
  • cols: ff
  • rows: 04 01 00 00 00 05 61 70 70 6c 65 => id = 1; name = apple
  • crc: 12 82 25 e0

Write_rows_log_event/Delete_rows_log_event 与上面的分析类似,此不赘述。

Event Lifetime

一个时间在构造的时候通过传递特定的参数来填充其数据布局中的各字段,然后在 flush event 阶段调用 Log_event::write 将数据写到 binlog 文件中。

使用 gdb 来看 Rows_log_event 构造函数的调用栈:

(gdb) bt
#0  Rows_log_event::Rows_log_event (this=0x7f41180f8750,
    __vtt_parm=0x55be8d466dd8 <VTT for Write_rows_log_event+8>, thd_arg=0x7f4118001270,
    tbl_arg=0x7f4118157bf0, tid=..., cols=0x7f4118157d60, using_trans=true,
    event_type=binary_log::WRITE_ROWS_EVENT, extra_row_ndb_info=0x0, __in_chrg=<optimized out>)
    at /opt/vagrant/mysql-server/sql/log_event.cc:7654
#1  0x000055be89dce96a in Write_rows_log_event::Write_rows_log_event (this=0x7f41180f8750,
    thd_arg=0x7f4118001270, tbl_arg=0x7f4118157bf0, tid_arg=..., is_transactional=true,
    extra_row_ndb_info=0x0, __in_chrg=<optimized out>, __vtt_parm=<optimized out>)
    at /opt/vagrant/mysql-server/sql/log_event.cc:11638
#2  0x000055be89d9ef96 in THD::binlog_prepare_pending_rows_event<Write_rows_log_event> (
    this=0x7f4118001270, table=0x7f4118157bf0, serv_id=1, needed=11, is_transactional=true,
    extra_row_info=0x0, source_part_id=2147483647) at /opt/vagrant/mysql-server/sql/binlog.cc:10776
#3  0x000055be89d92783 in THD::binlog_write_row (this=0x7f4118001270, table=0x7f4118157bf0,
    is_trans=true, record=0x7f4118098138 "\377\001", extra_row_info=0x0)
    at /opt/vagrant/mysql-server/sql/binlog.cc:11049
#4  0x000055be89dcea53 in Write_rows_log_event::binlog_row_logging_function (thd_arg=0x7f4118001270,
    table=0x7f4118157bf0, is_transactional=true, before_record=0x0,
    after_record=0x7f4118098138 "\377\001") at /opt/vagrant/mysql-server/sql/log_event.cc:11646
#5  0x000055be88e8c753 in binlog_log_row (table=0x7f4118157bf0, before_record=0x0,
    after_record=0x7f4118098138 "\377\001",
    log_func=0x55be89dcea18 <Write_rows_log_event::binlog_row_logging_function(THD*, TABLE*, bool, unsigned char const*, unsigned char const*)>) at /opt/vagrant/mysql-server/sql/handler.cc:7725
#6  0x000055be88e8d0dc in handler::ha_write_row (this=0x7f4118158fe8, buf=0x7f4118098138 "\377\001")
    at /opt/vagrant/mysql-server/sql/handler.cc:7825
#7  0x000055be891b785c in write_record (thd=0x7f4118001270, table=0x7f4118157bf0, info=0x7f419478a650,
    update=0x7f419478a6d0) at /opt/vagrant/mysql-server/sql/sql_insert.cc:2148
#8  0x000055be891b2ff4 in Sql_cmd_insert_values::execute_inner (this=0x7f41180f0ce0, thd=0x7f4118001270)
    at /opt/vagrant/mysql-server/sql/sql_insert.cc:633
#9  0x000055be88b56296 in Sql_cmd_dml::execute (this=0x7f41180f0ce0, thd=0x7f4118001270)
    at /opt/vagrant/mysql-server/sql/sql_select.cc:725
#10 0x000055be88ad4f7f in mysql_execute_command (thd=0x7f4118001270, first_level=true)
    at /opt/vagrant/mysql-server/sql/sql_parse.cc:3357

再看 Log_event::write 的调用栈:

(gdb) bt
#0  Log_event::write (this=0x7f41180f8750, ostream=0x7f41180f85a8)
    at /opt/vagrant/mysql-server/sql/log_event.h:905
#1  0x000055be89d9c55f in binary_event_serialize<Log_event> (ev=0x7f41180f8750, ostream=0x7f41180f85a8)
    at /opt/vagrant/mysql-server/sql/log_event.h:4375
#2  0x000055be89d742b8 in binlog_cache_data::write_event (this=0x7f41180f8550, ev=0x7f41180f8750)
    at /opt/vagrant/mysql-server/sql/binlog.cc:1389
#3  0x000055be89d862c7 in MYSQL_BIN_LOG::flush_and_set_pending_rows_event (
    this=0x55be8d8c6680 <mysql_bin_log>, thd=0x7f4118001270, event=0x0, is_transactional=true)
    at /opt/vagrant/mysql-server/sql/binlog.cc:6866
#4  0x000055be89d93426 in THD::binlog_flush_pending_rows_event (this=0x7f4118001270, stmt_end=true,
    is_transactional=true) at /opt/vagrant/mysql-server/sql/binlog.cc:11254
#5  0x000055be89d93d8a in THD::binlog_query (this=0x7f4118001270, qtype=THD::ROW_QUERY_TYPE,
    query_arg=0x7f41180eeff8 "INSERT INTO t VALUES(1, 'apple', NULL)", query_len=38, is_trans=true,
    direct=false, suppress_use=false, errcode=0) at /opt/vagrant/mysql-server/sql/binlog.cc:11466
#6  0x000055be891b32e9 in Sql_cmd_insert_values::execute_inner (this=0x7f41180f0ce0, thd=0x7f4118001270)
    at /opt/vagrant/mysql-server/sql/sql_insert.cc:706
#7  0x000055be88b56296 in Sql_cmd_dml::execute (this=0x7f41180f0ce0, thd=0x7f4118001270)
    at /opt/vagrant/mysql-server/sql/sql_select.cc:725
#8  0x000055be88ad4f7f in mysql_execute_command (thd=0x7f4118001270, first_level=true)
    at /opt/vagrant/mysql-server/sql/sql_parse.cc:3357

从调用栈可以看出在 handler::ha_write_row 阶段只是完成 Write_rows_log_event 的生成(对于一条语句修改多行的会将事件保存在 binlog_cache_data 的 m_pending 字段以便将后面的数据添加到这个事件中),具体的刷盘操作由 MYSQL_BIN_LOG::flush_and_set_pending_rows_event 完成。这里的逻辑有些复杂,留给后面的文章来具体分析这块的逻辑。

题外话:

mysqlbinlog 在使用 –start-position 参数时 FDE 的显示地址有误,笔者已提 bug #101450 及相关修改给 MySQL 社区。