Normalization of Unicode Equivalent Animals

A while back, a customer reported that something was off when searching in our application. When searching for “Akilleshäl” (Swedish word for “Achilles heel”, which is a query I just made up) they only got a result for “Akilleshäl 1” and not “Akilleshäl 2”. Weird!

I started to look into it and noticed that both results were returned when searching for the part of the word leading up to the very non-English letter of “ä”, but once that letter was added, the second result disappeared.

The horror slowly began to dawn on me: I was dealing with a character encoding issue.

Time to dig in. I launched the Rails console, fetched the strings and started to cut into them.

"Akilleshäl" == "Akilleshäl" returned false. Some more cutting. "ä" == "ä" returned false.

Since I was pretty sure this was encoding related, I used the String#bytes method on both characters.

"ä".bytes =>  [195, 164]
"ä".bytes =>  [97, 204, 136]

As the saying goes:

If it looks like a duck, swims like a duck, and quacks like a duck, then it is either a duck or a unicode equivalent animal.

Yes, Unicode equivalence is a thing. The byte sequences above will render identical characters, even though they mean slightly different things. In UTF-8, the bytes 195 and 164 are interpreted as “Latin Small Letter A with Diaeresis”, or “ä”.

The other byte sequence means “Latin Small Letter A” (97) followed by “Combining Diaeresis” (204+136). More simply put, take the letter “a” and slap two dots on top of it. Or “ä”.
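The difference is easier to see when both spellings are written with explicit escapes, since the two forms look the same in most editors. This is just a small sketch of my own (the variable names are made up for illustration), using nothing beyond core String methods:

```ruby
# Build both spellings explicitly so the difference survives copy-paste.
precomposed = "\u00E4"   # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
decomposed  = "a\u0308"  # U+0061 LATIN SMALL LETTER A + U+0308 COMBINING DIAERESIS

# Code points show what each string is actually made of.
precomposed.codepoints.map { |cp| format("U+%04X", cp) } # => ["U+00E4"]
decomposed.codepoints.map  { |cp| format("U+%04X", cp) } # => ["U+0061", "U+0308"]

# The UTF-8 bytes match what String#bytes reported in the console.
precomposed.bytes # => [195, 164]
decomposed.bytes  # => [97, 204, 136]

# Ruby compares strings byte by byte, so the two are not equal.
precomposed == decomposed # => false
```

Note that even String#length disagrees (1 versus 2), because Ruby counts code points rather than what a human would call a character.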

Great, now I understand why it doesn’t work, but I still have no idea what to do about it. I tried to string words together to get Google to lead me to a solution - only marginally more evolved than a monkey in the infinite monkey theorem - until I finally stumbled on the magic word: normalization.

As it turns out, Unicode normalization is standardized by the Unicode Consortium. There are four normalization forms (NFC, NFD, NFKC and NFKD), whose specifications I did not have the mental energy to read, but the important point is that each of them will turn both byte sequences above into the same byte sequence.

Luckily, Ruby introduced String#unicode_normalize in version 2.2. Let’s take it for a spin!

[97, 204, 136].pack("C*").force_encoding("UTF-8").unicode_normalize.bytes # => [195, 164]

Boom!

Let’s just quickly go through what happened here. The 3-byte sequence is packed into a string, then Ruby is told to interpret this as UTF-8 before applying the default Unicode normalization form, NFC. Finally, the byte sequence is returned, which now matches “Latin Small Letter A with Diaeresis”. Or “ä”.
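To make the direction of the conversion a bit more concrete, here is a rough sketch of the composed and decomposed forms side by side, plus a comparison that ignores which form a string arrived in. The helper name unicode_equal? is hypothetical and not something Ruby or Rails provides:

```ruby
decomposed = "a\u0308" # "a" followed by a combining diaeresis

# With no argument, unicode_normalize uses NFC (the composed form).
decomposed.unicode_normalize.bytes       # => [195, 164]
decomposed.unicode_normalize(:nfc).bytes # => [195, 164]

# NFD goes the other way and splits precomposed characters apart.
"\u00E4".unicode_normalize(:nfd).bytes   # => [97, 204, 136]

# Hypothetical helper: compare two strings after bringing both to NFC.
def unicode_equal?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

unicode_equal?("Akillesh\u00E4l", "Akillesha\u0308l") # => true
```

Which form you pick matters less than picking one and applying it consistently on both sides of the comparison.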

It was now possible to fix the specific issue that got this snowball rolling. I was interested in finding all records where the name was not normalized and normalizing it, which looked something like this:

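# pluck returns [id, name] pairs; _2 is the name (numbered block parameters need Ruby 2.7+).
Model.pluck(:id, :name).reject { _2.unicode_normalized? }.each do |id, name|
  # Only the records that are not already normalized are loaded and updated.
  Model.find(id).update(name: name.unicode_normalize)
end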

The reason for plucking id and name first was to not crash the Rails console running on Heroku due to excessive memory use (a full ActiveRecord object uses more memory than just a number and a string). If you have even more data and you’re running PostgreSQL 13 or later, a more efficient approach (I assume) is to use the built-in IS NORMALIZED predicate (for example, WHERE name IS NOT NORMALIZED) and only query the affected records from the database.

To ensure we avoid this problem down the line, we are also applying the normalization on relevant fields for all future records in our ActiveRecord models by extending the String type used by the attributes API - but that deserves a blog post of its own.
