SQL Info: What is utf8_general

utf8_general_ci is a very simple — and on Unicode, very broken — collation, one that gives incorrect results on general Unicode text.

What it does is:

converts to Unicode normalization form D for canonical decomposition
removes any combining characters
converts to upper case

This does not work correctly on Unicode, because it does not understand Unicode casing. Unicode casing alone is much more complicated than an ASCII-minded approach can handle. For example:

The lowercase of “ẞ” is “β”, but the uppercase of “β” is “SS”.
There are two lowercase Greek sigmas, but only one uppercase one; consider “Σίσυφος”.
Letters like “ø” do not decompose to an “o” plus a diacritic, meaning that it won’t correctly sort.

There are many other subtleties.

utf8_unicode_ci uses the standard Unicode Collation Algorithm, supports so called expansions and ligatures, for example: German letter ß (U+00DF LETTER SHARP S) is sorted near "ss" Letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".

utf8_general_ci does not support expansions/ligatures, it sorts all these letters as single characters, and sometimes in a wrong order.

utf8_unicode_ci is generally more accurate for all scripts. For example, on Cyrillic block: utf8_unicode_ci is fine for all these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian. While utf8_general_ci is fine only for Russian and Bulgarian subset of Cyrillic. Extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are sorted not well.

The cost of utf8_unicode_ci is that it is a little bit slower than utf8_general_ci. But that’s the price you pay for correctness. Either you can have a fast answer that’s wrong, or a very slightly slower answer that’s right. Your choice. It is very difficult to ever justify giving wrong answers, so it’s best to assume that utf8_general_ci doesn’t exist and to always use utf8_unicode_ci.

Source: http://forums.mysql.com/read.php?103,187048,188748#msg-188748

SQL Info

Featured Post

What is the purpose of the php.ini file?

What is utf8_general_ci?

No comments:

Post a Comment

Popular Posts