mysql实现原理 - 字符集和排序规则

发布于:2025-02-21 ⋅ 阅读:(22) ⋅ 点赞:(0)

介绍

字符集

在计算机中字符是以二进制的格式存储的,每个字符在字符编码系统中都有一个唯一的二进制代码,这个过程确保了字符在存储和传输时不会丢失或混淆。将一个字符映射成一个二进制数据的过程就称为字符编码,而将一个二进制数据映射到一个字符的过程就称为字符解码。而字符集就是一套字符和二进制数据映射关系的集合。

一些重要的字符集

ASCII 字符集

共收录128个字符,包括空格、标点符号、数字、大小写字母和一些不可见字符。由于总共才128个字符,所以可以使用1个字节来进行编码。

在这里插入图片描述

ISO 8859-1 字符集

共收录256个字符,是在 ASCII 字符集的基础上又扩充了128个西欧常用字符(包括德法两国的字母),也可以使用1个字节来进行编码。这个字符集也有一个别名 latin1 。

GB2312 字符集

收录了汉字以及拉丁字母、希腊字母、日文平假名及片假名字母、俄语西里尔字母。其中收录汉字6763个,其他文字符号682个。

GBK 字符集

GBK 字符集只是在收录字符范围上对 GB2312 字符集作了扩充,编码方式上兼容 GB2312 。

utf8 字符集

收录地球上能想到的所有字符,而且还在不断扩充。这种字符集兼容 ASCII 字符集,采用变长编码方式,编码一个字符需要使用1~4个字节。

实准确的说,utf8只是Unicode字符集的一种编码方案,Unicode字符集可以采用utf8、utf16、utf32这几种编码方案,utf8使用1~4个字节编码一个字符,utf16使用2个或4个字节编码一个字符,utf32使用4个字节编码一个字符

比较规则

比较规则就是我们查询数据库中的数据,数据以什么样的顺序返回,不同的比较规则,返回的顺序也是区别很大的,可能直接影响到我们对数据的使用

比如:A的ascii码是65,B的ascii码是66,a的ascii码是97,如果以二进制比较规则,则是直接比较字符的ascii码,结果就是 A < B < a ,但是如果二进制比较基础上再加上一个不区分大小写,则又是另外一个顺序了A = a < B

所以了解一下数据库的字符集和比较规则对我们正确使用数据库是很重要的,接下来我们展开看看。

数据库中的字符集和比较规则

说明

  • 演示使用的数据库版本为[8.0.13]

查看

查看字符集

show character set ;show charset ;
字符集 描述 默认比较规则 最大长度(字节)
armscii8 ARMSCII-8 Armenian armscii8_general_ci 1
ascii US ASCII ascii_general_ci 1
big5 Big5 Traditional Chinese big5_chinese_ci 2
binary Binary pseudo charset binary 1
cp1250 Windows Central European cp1250_general_ci 1
cp1251 Windows Cyrillic cp1251_general_ci 1
cp1256 Windows Arabic cp1256_general_ci 1
cp1257 Windows Baltic cp1257_general_ci 1
cp850 DOS West European cp850_general_ci 1
cp852 DOS Central European cp852_general_ci 1
cp866 DOS Russian cp866_general_ci 1
cp932 SJIS for Windows Japanese cp932_japanese_ci 2
dec8 DEC West European dec8_swedish_ci 1
eucjpms UJIS for Windows Japanese eucjpms_japanese_ci 3
euckr EUC-KR Korean euckr_korean_ci 2
gb18030 China National Standard GB18030 gb18030_chinese_ci 4
gb2312 GB2312 Simplified Chinese gb2312_chinese_ci 2
gbk GBK Simplified Chinese gbk_chinese_ci 2
geostd8 GEOSTD8 Georgian geostd8_general_ci 1
greek ISO 8859-7 Greek greek_general_ci 1
hebrew ISO 8859-8 Hebrew hebrew_general_ci 1
hp8 HP West European hp8_english_ci 1
keybcs2 DOS Kamenicky Czech-Slovak keybcs2_general_ci 1
koi8r KOI8-R Relcom Russian koi8r_general_ci 1
koi8u KOI8-U Ukrainian koi8u_general_ci 1
latin1 cp1252 West European latin1_swedish_ci 1
latin2 ISO 8859-2 Central European latin2_general_ci 1
latin5 ISO 8859-9 Turkish latin5_turkish_ci 1
latin7 ISO 8859-13 Baltic latin7_general_ci 1
macce Mac Central European macce_general_ci 1
macroman Mac West European macroman_general_ci 1
sjis Shift-JIS Japanese sjis_japanese_ci 2
swe7 7bit Swedish swe7_swedish_ci 1
tis620 TIS620 Thai tis620_thai_ci 1
ucs2 UCS-2 Unicode ucs2_general_ci 2
ujis EUC-JP Japanese ujis_japanese_ci 3
utf16 UTF-16 Unicode utf16_general_ci 4
utf16le UTF-16LE Unicode utf16le_general_ci 4
utf32 UTF-32 Unicode utf32_general_ci 4
utf8 UTF-8 Unicode utf8_general_ci 3
utf8mb4 UTF-8 Unicode utf8mb4_general_ci 4

当前版本支持的字符集为41个。

utf8与utf8mb4

正常来说,utf8字符集表示一个字符需要1~4个字节,在 MySQL 5.5.3 及之前的版本中,utf8 字符集实际上是一个1~3的 UTF-8 变种。支持常规字符,并不包括所有 Unicode 字符,特别是那些位于扩展区(如表情符号、一些罕见汉字等)的字符。后来在MySQL 5.5.3 之后,utf8mb4 被引入以支持完整的四字节 UTF-8 编码。MySQL 官方也推荐在需要存储全 Unicode 字符集时使用 utf8mb4。考虑到兼容性,utf8字符也保留了下来。

查看比较规则

-- 查看utf8mb4支持的比较规则
show collation like "utf8mb4%";
比较规则 字符集 ID
utf8mb4_0900_ai_ci utf8mb4 255
utf8mb4_0900_as_ci utf8mb4 305
utf8mb4_0900_as_cs utf8mb4 278
utf8mb4_0900_bin utf8mb4 309
utf8mb4_bin utf8mb4 46
utf8mb4_croatian_ci utf8mb4 245
utf8mb4_cs_0900_ai_ci utf8mb4 266
utf8mb4_cs_0900_as_cs utf8mb4 289
utf8mb4_czech_ci utf8mb4 234
utf8mb4_danish_ci utf8mb4 235
utf8mb4_da_0900_ai_ci utf8mb4 267
utf8mb4_da_0900_as_cs utf8mb4 290
utf8mb4_de_pb_0900_ai_ci utf8mb4 256
utf8mb4_de_pb_0900_as_cs utf8mb4 279
utf8mb4_eo_0900_ai_ci utf8mb4 273
utf8mb4_eo_0900_as_cs utf8mb4 296
utf8mb4_esperanto_ci utf8mb4 241
utf8mb4_estonian_ci utf8mb4 230
utf8mb4_es_0900_ai_ci utf8mb4 263
utf8mb4_es_0900_as_cs utf8mb4 286
utf8mb4_es_trad_0900_ai_ci utf8mb4 270
utf8mb4_es_trad_0900_as_cs utf8mb4 293
utf8mb4_et_0900_ai_ci utf8mb4 262
utf8mb4_et_0900_as_cs utf8mb4 285
utf8mb4_general_ci utf8mb4 45
utf8mb4_german2_ci utf8mb4 244
utf8mb4_hr_0900_ai_ci utf8mb4 275
utf8mb4_hr_0900_as_cs utf8mb4 298
utf8mb4_hungarian_ci utf8mb4 242
utf8mb4_hu_0900_ai_ci utf8mb4 274
utf8mb4_hu_0900_as_cs utf8mb4 297
utf8mb4_icelandic_ci utf8mb4 225
utf8mb4_is_0900_ai_ci utf8mb4 257
utf8mb4_is_0900_as_cs utf8mb4 280
utf8mb4_ja_0900_as_cs utf8mb4 303
utf8mb4_ja_0900_as_cs_ks utf8mb4 304
utf8mb4_latvian_ci utf8mb4 226
utf8mb4_la_0900_ai_ci utf8mb4 271
utf8mb4_la_0900_as_cs utf8mb4 294
utf8mb4_lithuanian_ci utf8mb4 236
utf8mb4_lt_0900_ai_ci utf8mb4 268
utf8mb4_lt_0900_as_cs utf8mb4 291
utf8mb4_lv_0900_ai_ci utf8mb4 258
utf8mb4_lv_0900_as_cs utf8mb4 281
utf8mb4_persian_ci utf8mb4 240
utf8mb4_pl_0900_ai_ci utf8mb4 261
utf8mb4_pl_0900_as_cs utf8mb4 284
utf8mb4_polish_ci utf8mb4 229
utf8mb4_romanian_ci utf8mb4 227
utf8mb4_roman_ci utf8mb4 239
utf8mb4_ro_0900_ai_ci utf8mb4 259
utf8mb4_ro_0900_as_cs utf8mb4 282
utf8mb4_ru_0900_ai_ci utf8mb4 306
utf8mb4_ru_0900_as_cs utf8mb4 307
utf8mb4_sinhala_ci utf8mb4 243
utf8mb4_sk_0900_ai_ci utf8mb4 269
utf8mb4_sk_0900_as_cs utf8mb4 292
utf8mb4_slovak_ci utf8mb4 237
utf8mb4_slovenian_ci utf8mb4 228
utf8mb4_sl_0900_ai_ci utf8mb4 260
utf8mb4_sl_0900_as_cs utf8mb4 283
utf8mb4_spanish2_ci utf8mb4 238
utf8mb4_spanish_ci utf8mb4 231
utf8mb4_sv_0900_ai_ci utf8mb4 264
utf8mb4_sv_0900_as_cs utf8mb4 287
utf8mb4_swedish_ci utf8mb4 232
utf8mb4_tr_0900_ai_ci utf8mb4 265
utf8mb4_tr_0900_as_cs utf8mb4 288
utf8mb4_turkish_ci utf8mb4 233
utf8mb4_unicode_520_ci utf8mb4 246
utf8mb4_unicode_ci utf8mb4 224
utf8mb4_vietnamese_ci utf8mb4 247
utf8mb4_vi_0900_ai_ci utf8mb4 277
utf8mb4_vi_0900_as_cs utf8mb4 300

这些比较规则的命名还挺有规律的:

  • 比较规则名称以与其关联的字符集的名称开头
  • 后边紧跟着该比较规则主要作用于哪种语言,比如 utf8_polish_ci 表示以波兰语的规则比较,
    utf8_spanish_ci 是以西班牙语的规则比较, utf8_general_ci 是一种通用的比较规则
  • 名称后缀意味着该比较规则是否区分语言中的重音、大小写啥的,具体可以用的值如下:
后缀 英文释义 描述
_ai accent insensitive 不区分重音
_as accent sensitive 区分重音
_ci case insensitive 不区分大小写
_cs case sensitive 区分大小写
_bin binary 以二进制方式比较

比如 utf8_general_ci 这个比较规则是以 ci 结尾的,说明不区分大小写

字符集和比较规则级别

MySQL 有4个级别的字符集和比较规则,分别是:

  • 服务器级别
  • 数据库级别
  • 表级别
  • 列级别

服务器级别

character_set_server 服务器级别的字符集
collation_server 服务器级别的比较规则

查看

mysql> SHOW VARIABLES LIKE 'character_set_server';
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| character_set_server | utf8mb4 |
+----------------------+---------+
1 row in set (0.00 sec)
mysql> SHOW VARIABLES LIKE 'collation_server';
+------------------+--------------------+
| Variable_name    | Value              |
+------------------+--------------------+
| collation_server | utf8mb4_0900_ai_ci |
+------------------+--------------------+
1 row in set (0.02 sec)

可以看到这台mysql服务器默认的字符集为utf8mb4,默认的比较规则为utf8mb4_0900_ai_ci

修改

  • 通过配置文件修改
[server]
character_set_server=gbk
collation_server=gbk_chinese_ci

再次查看服务器的字符集和比较规则

mysql> SHOW VARIABLES LIKE 'character_set_server';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| character_set_server | gbk   |
+----------------------+-------+
1 row in set (0.01 sec)

mysql> SHOW VARIABLES LIKE 'collation_server';
+------------------+----------------+
| Variable_name    | Value          |
+------------------+----------------+
| collation_server | gbk_chinese_ci |
+------------------+----------------+
1 row in set (0.02 sec)

已经修改成我们的预期值了。

数据库级别

创建数据库时,可以指定字符集和比较规则,如果不指定,则继承服务器级别字符集和比较规则

语法:

-- 创建数据库
CREATE DATABASE 数据库名
 [[DEFAULT] CHARACTER SET 字符集名称]
 [[DEFAULT] COLLATE 比较规则名称];

-- 修改数据库
ALTER DATABASE 数据库名
 [[DEFAULT] CHARACTER SET 字符集名称]
 [[DEFAULT] COLLATE 比较规则名称];

示例:

  • 创建数据库时,不指定字符集和比较规则
-- 不指定字符集和比较规则
mysql> create database db1;
Query OK, 1 row affected (0.01 sec)

mysql> use db1;
Database changed
mysql> SHOW VARIABLES LIKE 'character_set_database';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| character_set_database | gbk   |
+------------------------+-------+
1 row in set (0.00 sec)

mysql> SHOW VARIABLES LIKE 'collation_database';
+------------------+----------------+
| Variable_name    | Value          |
+------------------+----------------+
| collation_server | gbk_chinese_ci |
+------------------+----------------+
1 row in set (0.01 sec)
mysql> 
  • 创建数据库时,指定字符集和比较规则
-- 指定字符集和比较规则
mysql> create database db2 character set utf8mb4 collate utf8mb4_general_ci;
Query OK, 1 row affected (0.01 sec)

mysql> use db2
Database changed
mysql> SHOW VARIABLES LIKE 'character_set_database';
+------------------------+---------+
| Variable_name          | Value   |
+------------------------+---------+
| character_set_database | utf8mb4 |
+------------------------+---------+
1 row in set (0.00 sec)

mysql> SHOW VARIABLES LIKE 'collation_database';
+--------------------+--------------------+
| Variable_name      | Value              |
+--------------------+--------------------+
| collation_database | utf8mb4_general_ci |
+--------------------+--------------------+
1 row in set (0.00 sec)

mysql> 
  • 修改数据库字符集和比较规则
mysql> alter database db2 character set utf8mb3 collate utf8mb3_general_ci;
Query OK, 1 row affected, 2 warnings (0.01 sec)

mysql> use db2;
Database changed
mysql> SHOW VARIABLES LIKE 'character_set_database';
+------------------------+---------+
| Variable_name          | Value   |
+------------------------+---------+
| character_set_database | utf8mb3 |
+------------------------+---------+
1 row in set (0.00 sec)

mysql> SHOW VARIABLES LIKE 'collation_database';
+--------------------+--------------------+
| Variable_name      | Value              |
+--------------------+--------------------+
| collation_database | utf8mb3_general_ci |
+--------------------+--------------------+
1 row in set (0.00 sec)
  • 如果字符集和比较规则不匹配,会报错
mysql> alter database db2 character set utf8mb3 collate utf8mb4_general_ci;
ERROR 1253 (42000): COLLATION 'utf8mb4_general_ci' is not valid for CHARACTER SET 'utf8mb3'
mysql> 

表级别

创建数据表时,可以指定字符集和比较规则,如果不指定,则继承数据库级别字符集和比较规则

语法:

CREATE TABLE 表名 (列的信息)
 [[DEFAULT] CHARACTER SET 字符集名称]
 [COLLATE 比较规则名称]]
ALTER TABLE 表名
 [[DEFAULT] CHARACTER SET 字符集名称]
 [COLLATE 比较规则名称]

示例

  • 创建数据表时,不指定字符集和比较规则
mysql> create table t1 (col VARCHAR(10));
Query OK, 0 rows affected (0.02 sec)

mysql> SELECT 
    ->     TABLE_SCHEMA, 
    ->     TABLE_NAME, 
    ->     TABLE_COLLATION, 
    ->     CCSA.character_set_name AS CHARACTER_SET_NAME
    -> FROM 
    ->     information_schema.TABLES T,
    ->     information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA
    -> WHERE 
    ->     T.TABLE_SCHEMA = 'db1' 
    ->     AND T.TABLE_NAME = 't1'
    ->     AND CCSA.collation_name = T.TABLE_COLLATION;
+--------------+------------+-----------------+--------------------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION | CHARACTER_SET_NAME |
+--------------+------------+-----------------+--------------------+
| db1          | t1         | gbk_chinese_ci  | gbk                |
+--------------+------------+-----------------+--------------------+
1 row in set (0.00 sec)
  • 创建数据表时,指定字符集和比较规则
mysql> create table t2 (col VARCHAR(10)) character set utf8mb4 collate utf8mb4_general_ci;
Query OK, 0 rows affected (0.02 sec)

mysql> SELECT 
    ->     TABLE_SCHEMA, 
    ->     TABLE_NAME, 
    ->     TABLE_COLLATION, 
    ->     CCSA.character_set_name AS CHARACTER_SET_NAME
    -> FROM 
    ->     information_schema.TABLES T,
    ->     information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA
    -> WHERE 
    ->     T.TABLE_SCHEMA = 'db1' 
    ->     AND T.TABLE_NAME = 't2'
    ->     AND CCSA.collation_name = T.TABLE_COLLATION;
+--------------+------------+--------------------+--------------------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION    | CHARACTER_SET_NAME |
+--------------+------------+--------------------+--------------------+
| db1          | t2         | utf8mb4_general_ci | utf8mb4            |
+--------------+------------+--------------------+--------------------+
1 row in set (0.01 sec)
  • 修改字符集和比较规则
mysql> alter table t2  character set utf8mb3 collate utf8mb3_general_ci;
Query OK, 0 rows affected, 2 warnings (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 2

mysql> SELECT 
    ->     TABLE_SCHEMA, 
    ->     TABLE_NAME, 
    ->     TABLE_COLLATION, 
    ->     CCSA.character_set_name AS CHARACTER_SET_NAME
    -> FROM 
    ->     information_schema.TABLES T,
    ->     information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA
    -> WHERE 
    ->     T.TABLE_SCHEMA = 'db1' 
    ->     AND T.TABLE_NAME = 't2'
    ->     AND CCSA.collation_name = T.TABLE_COLLATION;
+--------------+------------+--------------------+--------------------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION    | CHARACTER_SET_NAME |
+--------------+------------+--------------------+--------------------+
| db1          | t2         | utf8mb3_general_ci | utf8mb3            |
+--------------+------------+--------------------+--------------------+
1 row in set (0.00 sec)
  • 如果字符集和比较规则不匹配,会报错
mysql> alter table t2  character set utf8mb3 collate utf8mb4_general_ci;
ERROR 1253 (42000): COLLATION 'utf8mb4_general_ci' is not valid for CHARACTER SET 'utf8mb3'

列级别

对于存储字符串的列,同一个表中的不同的列也可以有不同的字符集和比较规则。我们在创建和修改列定义的时候可以指定该列的字符集和比较规则。

语法:

CREATE TABLE 表名(
 列名 字符串类型 [CHARACTER SET 字符集名称] [COLLATE 比较规则名称],
 其他列...
);
ALTER TABLE 表名 MODIFY 列名 字符串类型 [CHARACTER SET 字符集名称] [COLLATE 比较规则名称];

示例:

  • 同一个表中的不同的列指定不同的字符集和比较规则
mysql> create table t3 (
    -> c1 varchar(100) character set utf8mb3 collate utf8mb3_general_ci,
    -> c2 varchar(100) character set gbk collate gbk_chinese_ci,
    -> c3 varchar(100)
    -> ) character set utf8mb4 collate utf8mb4_general_ci;
Query OK, 0 rows affected, 2 warnings (0.02 sec)

mysql> show create table t3\G;
*************************** 1. row ***************************
       Table: t3
Create Table: CREATE TABLE `t3` (
  `c1` varchar(100) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci DEFAULT NULL,
  `c2` varchar(100) CHARACTER SET gbk COLLATE gbk_chinese_ci DEFAULT NULL,
  `c3` varchar(100) COLLATE utf8mb4_general_ci DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci
1 row in set (0.00 sec)

可以看到 c1和 c2 都是自定义字符集和比较规则,c3则是继承了表t3的字符集和比较规则

  • 修改列的字符集和比较规则
-- 修改c2列的字符集和比较规则,使之继承表t3的字符集和比较规则
mysql> alter table t3 modify column `c2` varchar(100);
Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> show create table t3\G;
*************************** 1. row ***************************
       Table: t3
Create Table: CREATE TABLE `t3` (
  `c1` varchar(100) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci DEFAULT NULL,
  `c2` varchar(100) COLLATE utf8mb4_general_ci DEFAULT NULL,
  `c3` varchar(100) COLLATE utf8mb4_general_ci DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci
1 row in set (0.00 sec)

补充说明

  • 在转换列的字符集时需要注意,如果转换前列中存储的数据不能用转换后的字符集进行表示会发生错误。比方说原先列使用的字符集是utf8,列中存储了一些汉字,现在把列的字符集转换为ascii的话就会出错,因为ascii字符集并不能表示汉字字符。

例如:

-- t3.c1的当前字符集为utf8mb3,比较规则为utf8mb3_general_ci
mysql> insert into t3(c1) values("列1");
Query OK, 1 row affected (0.00 sec)

-- 修改t3.c1列的字符集会报错
mysql> alter table db1.t3 modify column c1 varchar(100) CHARACTER SET `ascii` COLLATE ascii_general_ci;
ERROR 1366 (HY000): Incorrect string value: '\xE5\x88\x971' for column 'c1' at row 1
  • 如果只修改字符集,则比较规则会自动修改为字符集支持的默认比较规则
-- 查看表t3的当前字符集和比较规则
mysql> SELECT 
    ->      TABLE_SCHEMA, 
    ->      TABLE_NAME, 
    ->      TABLE_COLLATION, 
    ->      CCSA.character_set_name AS CHARACTER_SET_NAME
    ->  FROM 
    ->      information_schema.TABLES T,
    ->      information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA
    ->  WHERE 
    ->      T.TABLE_SCHEMA = 'db1' 
    ->      AND T.TABLE_NAME = 't3'
    ->      AND CCSA.collation_name = T.TABLE_COLLATION;
+--------------+------------+--------------------+--------------------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION    | CHARACTER_SET_NAME |
+--------------+------------+--------------------+--------------------+
| db1          | t3         | utf8mb4_general_ci | utf8mb4            |
+--------------+------------+--------------------+--------------------+
1 row in set (0.00 sec)

-- 修改表t3的字符集
mysql> alter table t3 CHARACTER SET utf8mb3;
Query OK, 0 rows affected, 1 warning (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 1

-- 查看表t3的修改后的字符集和比较规则
mysql> SELECT 
    ->      TABLE_SCHEMA, 
    ->      TABLE_NAME, 
    ->      TABLE_COLLATION, 
    ->      CCSA.character_set_name AS CHARACTER_SET_NAME
    ->  FROM 
    ->      information_schema.TABLES T,
    ->      information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA
    ->  WHERE 
    ->      T.TABLE_SCHEMA = 'db1' 
    ->      AND T.TABLE_NAME = 't3'
    ->      AND CCSA.collation_name = T.TABLE_COLLATION;
+--------------+------------+--------------------+--------------------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION    | CHARACTER_SET_NAME |
+--------------+------------+--------------------+--------------------+
| db1          | t3         | utf8mb3_general_ci | utf8mb3            |
+--------------+------------+--------------------+--------------------+
1 row in set (0.00 sec)

比较规则确实自动随着字符集的修改而发生了修改

  • 如果只修改比较规则,则字符集也会自动修改为跟比较规则对应的字符集
-- 查看表t3的当前字符集和比较规则
mysql> SELECT 
    ->      TABLE_SCHEMA, 
    ->      TABLE_NAME, 
    ->      TABLE_COLLATION, 
    ->      CCSA.character_set_name AS CHARACTER_SET_NAME
    ->  FROM 
    ->      information_schema.TABLES T,
    ->      information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA
    ->  WHERE 
    ->      T.TABLE_SCHEMA = 'db1' 
    ->      AND T.TABLE_NAME = 't3'
    ->      AND CCSA.collation_name = T.TABLE_COLLATION;
+--------------+------------+--------------------+--------------------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION    | CHARACTER_SET_NAME |
+--------------+------------+--------------------+--------------------+
| db1          | t3         | utf8mb3_general_ci | utf8mb3            |
+--------------+------------+--------------------+--------------------+
1 row in set (0.01 sec)

-- 修改表t3的比较规则
mysql> alter table t3 COLLATE utf8mb4_general_ci;
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

-- 查看表t3的修改后的字符集和比较规则
mysql> SELECT 
    ->      TABLE_SCHEMA, 
    ->      TABLE_NAME, 
    ->      TABLE_COLLATION, 
    ->      CCSA.character_set_name AS CHARACTER_SET_NAME
    ->  FROM 
    ->      information_schema.TABLES T,
    ->      information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA
    ->  WHERE 
    ->      T.TABLE_SCHEMA = 'db1' 
    ->      AND T.TABLE_NAME = 't3'
    ->      AND CCSA.collation_name = T.TABLE_COLLATION;
+--------------+------------+--------------------+--------------------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION    | CHARACTER_SET_NAME |
+--------------+------------+--------------------+--------------------+
| db1          | t3         | utf8mb4_general_ci | utf8mb4            |
+--------------+------------+--------------------+--------------------+
1 row in set (0.01 sec)
  • 我们假设一种级别的父子级关系服务器级别>数据库级别>表级别>列级别,父级的字符集和比较规则发生变化,不影响子级及以后级别的字符集和比较规则,或者说子级的字符集和比较规则在创建时就已经指定好了,不会受其他级别字符集和比较规则修改的影响
-- 查看服务器的字符集和比较规则
mysql> SHOW VARIABLES LIKE 'character_set_server';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| character_set_server | gbk   |
+----------------------+-------+
1 row in set (0.01 sec)

mysql> SHOW VARIABLES LIKE 'collation_server';
+------------------+----------------+
| Variable_name    | Value          |
+------------------+----------------+
| collation_server | gbk_chinese_ci |
+------------------+----------------+
1 row in set (0.00 sec)

-- 创建数据库,继承服务器级别的字符集和比较规则
mysql> create database db5;
Query OK, 1 row affected (0.01 sec)

-- 查看数据库的字符集和比较规则
mysql> use db5;
Database changed
mysql> SHOW VARIABLES LIKE 'character_set_database';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| character_set_database | gbk   |
+------------------------+-------+
1 row in set (0.01 sec)

mysql> SHOW VARIABLES LIKE 'collation_database';
+--------------------+----------------+
| Variable_name      | Value          |
+--------------------+----------------+
| collation_database | gbk_chinese_ci |
+--------------------+----------------+
1 row in set (0.01 sec)

------------------------
-- 修改服务器字符集和比较规则配置 并重启服务器
[server]
character_set_server=utf8mb4
collation_server=utf8mb4_general_ci
------------------------

-- 查看服务器的字符集和比较规则
mysql> SHOW VARIABLES LIKE 'character_set_server';
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| character_set_server | utf8mb4 |
+----------------------+---------+
1 row in set (0.01 sec)

mysql> SHOW VARIABLES LIKE 'collation_server';
+------------------+--------------------+
| Variable_name    | Value              |
+------------------+--------------------+
| collation_server | utf8mb4_general_ci |
+------------------+--------------------+
1 row in set (0.00 sec)

-- 查看数据库的字符集和比较规则
mysql> use db5;
Database changed
mysql> SHOW VARIABLES LIKE 'character_set_database';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| character_set_database | gbk   |
+------------------------+-------+
1 row in set (0.01 sec)

mysql> SHOW VARIABLES LIKE 'collation_database';
+--------------------+----------------+
| Variable_name      | Value          |
+--------------------+----------------+
| collation_database | gbk_chinese_ci |
+--------------------+----------------+
1 row in set (0.00 sec)

字符集的应用

上边我们介绍的是在数据存储方面的编码解码,接下来我们介绍在数据传输方便的编码解码

准备

  • mysql服务器启动配置设置
[server]
character_set_server=utf8mb4
  • 常用的mysql客户端连接到mysql服务器后,都会设置客户端字符集,以便数据可以正确传输,但是这样也给我们查看底层的系统变量造成了干扰,为了避免这个情况,这里采用php连接mysql服务器

关键系统变量

客户端连接mysql服务器,查询获取返回数据,会有多次的字符集编码转换,这个过程中会用到以下三个系统变量:

  • character_set_client: 服务器解码请求时使用的字符集
  • character_set_connection: 服务器处理请求时会把请求字符串从 character_set_client 转为 character_set_connection
  • character_set_results: 服务器向客户端返回数据时使用的字符集

查看这几个变量的默认值

<?php

// 创建连接
$conn = new mysqli("127.0.0.1", "root", "123456", "db1");
// 检测连接
if ($conn->connect_error) {
    die("连接失败: " . $conn->connect_error);
}
echo "连接成功\n";
$ret = $conn->execute_query("SHOW VARIABLES LIKE 'character_set_client';");
$res = $ret->fetch_all()[0];
echo $res[0]." : ".$res[1].PHP_EOL;

$ret = $conn->execute_query("SHOW VARIABLES LIKE 'character_set_connection';");
$res = $ret->fetch_all()[0];
echo $res[0]." : ".$res[1].PHP_EOL;

$ret = $conn->execute_query("SHOW VARIABLES LIKE 'character_set_results';");
$res = $ret->fetch_all()[0];
echo $res[0]." : ".$res[1].PHP_EOL;
?>

查询结果:

连接成功
character_set_client : utf8mb4
character_set_connection : utf8mb4
character_set_results : utf8mb4

可以看到,这几个变量的默认值继承了mysql服务器级别的字符集

客户端连接成功后,可以指定字符集来修改这三个系统变量的值

$conn->set_charset("gbk");

查询结果:

连接成功
character_set_client : gbk
character_set_connection : gbk
character_set_results : gbk

也可以通过客户端单个设置系统变量的值

这是设置是非持久的

$conn->set_charset("gbk");
$ret = $conn->execute_query("SET character_set_client = utf8;");

查询结果:

连接成功
character_set_client : utf8mb3
character_set_connection : gbk
character_set_results : gbk

编码和解码使用不同字符集的后果

正确的查询:

<?php

// 创建连接
$conn = new mysqli("127.0.0.1", "root", "123456", "db1");
// 检测连接
if ($conn->connect_error) {
    die("连接失败: " . $conn->connect_error);
}
echo "连接成功\n";

$ret = $conn->execute_query("select * from db1.t3 where c1 like'%列%'");
print_r($ret->fetch_all());

查询结果:

连接成功
Array
(
    [0] => Array
        (
            [0] => 列1
            [1] => 
            [2] => 
        )

)

  • 客户端单独设置了character_set_client,之后:
// 创建连接
$conn = new mysqli("127.0.0.1", "root", "123456", "db1");

// 检测连接
if ($conn->connect_error) {
    die("连接失败: " . $conn->connect_error);
}
echo "连接成功\n";

// 设置 character_set_client
$ret = $conn->execute_query("SET character_set_client = gbk");

// 查询
$ret = $conn->execute_query("select * from db1.t3 where c1 like'%列%'");
print_r($ret->fetch_all());
?>

查询结果:

连接成功
Array
(
)

结果查询为空,说明了mysql服务端通过gbk解码请求中的%列%时,中文字符不能正确解析出来,导致查询不到正确的结果

  • 客户端设置了跟服务端不兼容的字符集后
<?php

// 创建连接
$conn = new mysqli("127.0.0.1", "root", "123456", "db1");

// 检测连接
if ($conn->connect_error) {
    die("连接失败: " . $conn->connect_error);
}
echo "连接成功\n";

// 设置字符集为gbk
$conn->set_charset("gbk");
$ret = $conn->execute_query("select * from db1.t3 where c1 like'%列%'");
print_r($ret->fetch_all());

查询报错:

连接成功
PHP Fatal error:  Uncaught mysqli_sql_exception: Illegal mix of collations (utf8mb3_general_ci,IMPLICIT) and (gbk_chinese_ci,COERCIBLE) for operation 'like' in /Users/xxx/code/php/test/test.php:18
Stack trace:
#0 /Users/xxx/code/php/test/test.php(18): mysqli->execute_query('select * from d...')
#1 {main}
  thrown in /Users/xxx/code/php/test/test.php on line 18

还有很多种情况,这里就不一一列举,但是字符集不一致导致的编码和解码出错的情况,导致的后果还是很严重的,我们需要特别关注

参考下面这张图,我们能更清晰的感觉到一次request,底层需要做很多次的字符串编码解码和字符集的转换,有一个环节出问题,都可能会返回一个错误的结果。
在这里插入图片描述
这样绕来绕去看着就很晕,为了简单起见,我们通常都把 character_set_clientcharacter_set_connectioncharacter_set_results 这三个系统变量设置成和客户端使用的字符集一致的情况,这样减少了很多无谓的字符集转换
为了方便设置,mysql提供了一条非常简单的语句:

SET NAMES 字符集名;

这样一次性就可以把这三条设置成统一的字符集

比较规则的应用

比较规则的作用通常体现比较字符串大小的表达式以及对某个字符串列进行排序中,所以有时候也称为排序规则,影响的是结果集的顺序

示例:

首先看一下db1.t3.c1列的排序规则:

CREATE TABLE `t3` (
  `c1` varchar(100) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci DEFAULT NULL,
  `c2` varchar(100) COLLATE utf8mb4_general_ci DEFAULT NULL,
  `c3` varchar(100) COLLATE utf8mb4_general_ci DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci

c1列目前的排序规则是 utf8mb3_general_ci,不区分大小写
插入几条数据:

mysql> insert into db1.t3(c1) values('A'),('B'),('a'),('b');
Query OK, 4 rows affected (0.04 sec)
Records: 4  Duplicates: 0  Warnings: 0

查询一下看看返回结果:

mysql> select * from db1.t3 order by c1 asc;
+------+------+------+
| c1   | c2   | c3   |
+------+------+------+
| 1    | NULL | NULL |
| A    | NULL | NULL |
| a    | NULL | NULL |
| B    | NULL | NULL |
| b    | NULL | NULL |
|1  | NULL | NULL |
+------+------+------+
6 rows in set (0.00 sec)

我们可以看到 第二行到第五行,是大小写交叉返回的,顺序是26个英文字母的先后顺序,跟我们预期的是一样的

接下来我们修改一下排序规则

utf8mb3_bin 排序规则是直接比较字符编码,是区分大小写的

mysql> alter table db1.t3 modify column c1 varchar(200) collate utf8mb3_bin;
Query OK, 0 rows affected, 1 warning (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 1

再次执行同样的查询语句:

mysql> select * from db1.t3 order by c1 asc;
+------+------+------+
| c1   | c2   | c3   |
+------+------+------+
| 1    | NULL | NULL |
| A    | NULL | NULL |
| B    | NULL | NULL |
| a    | NULL | NULL |
| b    | NULL | NULL |
|1  | NULL | NULL |
+------+------+------+
6 rows in set (0.00 sec)

第二行到第五行的返回排序果然发生了调整,跟我们预期的也是一直的。

总结

对字符集和排序规则的深入了解,有利于我们正确使用mysql,以及针对相关问题进行快速定位。

本篇时间和篇幅都有限,先整理到这里,后续进行了更深入的整理,会通过版本迭代式地补充到这里。