MySQL Character encoding – part 1
Breaking and unbreaking your data
Recently at FOSDEM, Maciej presented “Breaking and unbreaking your data”, a talk about the problems you can run into with character encoding when working with MySQL. In short, there are a myriad of places where character encoding can be controlled, which gives ample opportunity for the system to break and for text to become unrecoverable.
The slides from the presentation, “Character Encoding – MySQL DevRoom – FOSDEM 2015” from mushupl, are available on SlideShare.
Since slides don’t tell the whole story, we decided to create a series of blog posts to demonstrate how easy it is to go wrong, how to fix some of the issues and how to avoid such issues in the future.
What is character encoding?
A character encoding is the binary representation of characters, where each character is represented by one or more bytes. Popular schemes include ASCII and Unicode, alongside language-specific character sets such as Latin US, Latin1 and Latin2, which are commonly used in America and Europe, and EUC-KR or GB18030, which support East Asian languages. The same character can be represented by several different codes, and one code may correspond to several different characters, depending on the encoding scheme used.
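As a small illustration of that last point, the Python sketch below encodes one city name under two encodings and then reads one of the byte sequences back under the wrong encoding. Python is used here purely for demonstration and is not part of the original talk. Note also that Python's latin-1 codec is ISO-8859-1, which is an assumption standing in for MySQL's latin1; the two agree for the characters used here.

```python
# Illustration only: one character, several byte representations.
city = "Kraków"

utf8_bytes = city.encode("utf-8")      # 'ó' becomes two bytes: 0xC3 0xB3
latin1_bytes = city.encode("latin-1")  # 'ó' becomes one byte: 0xF3

print(len(utf8_bytes), utf8_bytes)
print(len(latin1_bytes), latin1_bytes)

# The same bytes read under a different encoding yield different characters:
# the two UTF-8 bytes of 'ó' are interpreted as two separate latin-1 characters.
misread = utf8_bytes.decode("latin-1")
print(misread)
```

The misread string has one more character than the original, because each byte of the two-byte UTF-8 sequence is treated as a character in its own right.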
Where do you set character sets in MySQL?
Here is the core of the problem: the character encoding can be controlled from the application, the database, or even on a per-table or per-column basis. Together with a set of inheritance rules, this makes it easy to have one layer of the system configured for one character set whilst the data actually being stored uses a different one.
In MySQL, the following settings can all affect the character encoding used:
- Session settings
- Schema level defaults
- Table level defaults
- Column charsets
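The inheritance between the storage levels can be modelled roughly as follows. This is a hypothetical Python model rather than anything MySQL executes: the function name `effective_charset` and its parameters are invented for illustration. It captures only the rule that a column falls back to its table's default, the table to the schema's, and the schema to the server's. Session settings are left out because they describe the connection rather than the stored data.

```python
from typing import Optional


def effective_charset(server: str,
                      schema: Optional[str] = None,
                      table: Optional[str] = None,
                      column: Optional[str] = None) -> str:
    """Return the most specific charset that was explicitly set.

    Hypothetical model: in MySQL these defaults are applied when an
    object is created, so changing a schema default later does not
    alter tables that already exist.
    """
    for level in (column, table, schema):
        if level is not None:
            return level
    return server


# Nothing overridden: everything inherits the server default.
print(effective_charset("latin1"))
# Schema default overrides the server default for new tables.
print(effective_charset("latin1", schema="utf8"))
# An explicit column charset wins over every default above it.
print(effective_charset("latin1", schema="utf8", column="latin2"))
```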
Character encoding in MySQL.
As Maciej pointed out in the presentation, where MySQL is concerned we are all born Swedish: MySQL starts configured for the latin1 character set, with the collation set to latin1_swedish_ci. This is even the case in MySQL 5.7, meaning that by default your system expects only characters in the latin1 set and, when comparing characters, assumes the Swedish language is being used.
Let's look at how this manifests itself in a new application, where server, client and table are set to the default latin1.
mysql> SELECT @@global.character_set_server, @@session.character_set_client;
+-------------------------------+--------------------------------+
| @@global.character_set_server | @@session.character_set_client |
+-------------------------------+--------------------------------+
| latin1                        | latin1                         |
+-------------------------------+--------------------------------+
1 row in set (0.00 sec)

mysql> CREATE SCHEMA fosdem;
Query OK, 1 row affected (0.00 sec)

mysql> USE fosdem;
mysql> CREATE TABLE locations (city VARCHAR(30) NOT NULL);
Query OK, 0 rows affected (0.15 sec)

mysql> SHOW CREATE TABLE locations\G
*************************** 1. row ***************************
       Table: locations
Create Table: CREATE TABLE `locations` (
  `city` varchar(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
So what happens when you try to save some data that is not latin1 encoded?
The city of Tokyo is displayed.
The application returned and rendered the new city correctly, however inside the database there is some confusion.
mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from locations;
+--------------------+
| city               |
+--------------------+
| Berlin             |
| KrakÃ³w            |
| æ±äº¬éƒ½          |
+--------------------+
3 rows in set (0.00 sec)
The data being saved was UTF8 encoded; however, if an application queries the database as UTF8, it receives garbage. Instead, the application must ask for Latin1 to receive the original data.
mysql> SET NAMES latin1;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from locations;
+-----------+
| city      |
+-----------+
| Berlin    |
| Kraków    |
| 東京都    |
+-----------+
3 rows in set (0.00 sec)
The new city was saved, and from the application the result looked correct. What is actually happening is that the connection to the database stored the binary data without any conversion. Hence it returned the same bytes, and the browser was able to do the right thing and display them correctly, as did the terminal, which was set to UTF8. Inside the database, though, the data cannot be interpreted in the correct context.
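The mechanism can be simulated outside the database. The sketch below is a simplified Python model, not what MySQL literally executes: the `store` and `fetch` helpers are hypothetical, and Python's latin-1 codec stands in for MySQL's latin1. It assumes the server converts text only between the declared connection character set and the column character set.

```python
def store(client_bytes: bytes, connection_charset: str, column_charset: str) -> bytes:
    """Model an INSERT: the server trusts the declared connection charset."""
    text = client_bytes.decode(connection_charset)
    return text.encode(column_charset)


def fetch(stored: bytes, column_charset: str, connection_charset: str) -> bytes:
    """Model a SELECT: convert from the column charset to the connection charset."""
    return stored.decode(column_charset).encode(connection_charset)


# The application sends UTF-8 bytes over a connection declared as latin1.
sent = "Kraków".encode("utf-8")
stored = store(sent, "latin-1", "latin-1")  # same charset on both sides: bytes kept as-is

# Read back over a latin1 connection: the same bytes come back,
# and a UTF-8 terminal or browser renders them correctly.
as_latin1 = fetch(stored, "latin-1", "latin-1").decode("utf-8")

# Read back over a utf8 connection: each stored byte is treated as a
# latin1 character and re-encoded, so the UTF-8 client sees mojibake.
as_utf8 = fetch(stored, "latin-1", "utf-8").decode("utf-8")

print(as_latin1)
print(as_utf8)
```

In this model the data only looks right as long as every reader repeats the original mistake of declaring latin1 on the connection.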
In the next blog post we will look at how to correctly configure character sets, as well as demonstrating some of the problems we have encountered in production systems and how we fixed those.