Ruby 1.9 encoding gotcha, Retreat to ASCII-8BIT
By taylor luk on August 14, 2009 @ 02:52 AM
(Note: this is not a rant)
Why should I care?
I never thought I would actually write about character encoding in Ruby 1.9. It never interested me as long as I could get my code to do whatever I needed it to do. Lately, though, I have been backed into a corner a few times, for example when I needed to parse a large number of XML feeds without knowing their character encoding. I keep asking myself why I should even bother, but the fact is:
You can't trust anyone to encode their files properly. I actually ran into a case where an XML feed claimed to be UTF-8 (per its Content-Type) while the feed itself was encoded in UTF-16LE.
One very nice Ruby 1.9 feature is encoding-aware strings. They bring a lot of good stuff such as M17N, character encoding conversion, and (limited) detection. I have learned more about Ruby 1.9 and how character encoding works, and I recommend going through James Edward Gray II's series of posts about character encoding.
This is an error (along with other similar ones) that Ruby will spring on you when either you aren't doing things properly or the library you are using isn't in great shape:
incompatible character encodings: ASCII-8BIT and UTF-8
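For anyone who has not met this error yet, here is a minimal sketch of one way to trigger it. The helper name try_concat is made up for illustration, and the exact wording of the message can differ between Ruby versions.

```ruby
# Two strings holding the very same bytes, but tagged with different encodings.
utf8   = "\u83EF\u4EBA"                        # "華人", tagged UTF-8
binary = utf8.dup.force_encoding("ASCII-8BIT") # same bytes, tagged as raw binary

# try_concat is a hypothetical helper: it returns the joined string,
# or the error message if Ruby refuses to combine the two encodings.
def try_concat(a, b)
  a + b
rescue Encoding::CompatibilityError => e
  e.message
end

result = try_concat(binary, utf8)
puts result
```

As far as I can tell, the error only shows up when both strings contain non-ASCII bytes. Strings that are pure ASCII mix with anything, which is why this kind of bug can hide until real-world data arrives.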
Most of the time things are good and peaceful. Sometimes things are dreadful, but easily fixable.
Set your source file encoding
# encoding: utf-8
Make sure your regexes are Unicode friendly
/#{some_pattern}/u
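A small sketch of what the /u flag changes, assuming the interpolated pattern is itself a UTF-8 string. The variable names are only for illustration.

```ruby
# Build a pattern from a UTF-8 string, escaping any regex metacharacters.
some_pattern = Regexp.escape("\u83EF\u4EBA")   # "華人"

plain   = /abc/                  # no flag: an ASCII-only literal stays US-ASCII
unicode = /#{some_pattern}/u     # /u pins the regex to UTF-8

text  = "xx\u83EF\u4EBAyy"
index = (text =~ unicode)        # character index of the match, not byte index

puts plain.encoding
puts unicode.encoding
puts unicode.fixed_encoding?
puts index
```

With the flag set, the regex has a fixed UTF-8 encoding, so matching it against a string in an incompatible encoding fails loudly instead of quietly matching the wrong bytes.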
Encoding fail
There are some cases where Ruby's (limited) character detection fails to get the correct file encoding. Well, okay, I open the file with the correct encoding:
File.open(file_path, 'r:utf-8')
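Opening a file with 'r:utf-8' only tags what you read as UTF-8; as far as I understand, Ruby does not check that the bytes really are UTF-8. Here is a rough sketch of reading with the declared encoding, verifying it with valid_encoding?, and falling back when the declaration turns out to be a lie. The helper read_with_fallback and the choice of UTF-16LE as the fallback are my own assumptions for this example, not part of any library.

```ruby
require "tempfile"

# Hypothetical helper: trust the declared encoding only if the bytes are valid
# in it, then try a fallback, and finally give up and return raw bytes.
def read_with_fallback(path, declared, fallback)
  text = File.open(path, "r:#{declared}") { |f| f.read }
  return text if text.valid_encoding?

  text.force_encoding(fallback)
  text.valid_encoding? ? text : text.force_encoding("ASCII-8BIT")
end

# Simulate a feed that claims UTF-8 but is really UTF-16LE.
feed = Tempfile.new("feed")
feed.binmode
feed.write("<feed>\u91AB\u5B78</feed>".encode("UTF-16LE"))   # "醫學"
feed.close

text = read_with_fallback(feed.path, "utf-8", "UTF-16LE")
puts text.encoding
puts text.encode("UTF-8")
```

Be warned that this check is necessary but not sufficient: some UTF-16 byte sequences also happen to be valid UTF-8, so a string can pass valid_encoding? and still be tagged wrongly.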
But what about the times when Ruby's encoding detection fails and you have no knowledge about the external file (again, think external feeds)?
Then you are simply out of luck.
How about character encoding libraries?
I came across the rCharDet gem. It is a port of Python's UniversalDetector, which is itself a port of Mozilla's charset detectors, a library that takes a statistical approach to intelligently detect the correct encoding.
Sounds good and scientific, but then I found out:
It is not Ruby 1.9 compatible
How about I fix it?
When I tried to fix rCharDet, it looked and smelled like Python code, and yet again something failed epically with an error about "encoding: ASCII-8BIT" that I can't remember exactly.
Ruby fails because it doesn't know the encoding of the input string, and isn't that exactly why I am using this library to start with? (hair pulling)
I started to realize
There are cases where you HAVE to work with an unknown encoding (you just don't know), and that is precisely the reason this non-standard encoding, ASCII-8BIT, was invented.
Give me back my good old byte stream.
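# Ruby 1.9+ only: retag the string as raw bytes without transcoding it.
# (encode! would try to convert and raise on non-ASCII characters.)
if "1.9".respond_to?(:encoding)
  source.force_encoding('ASCII-8BIT')
end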
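To make "back to a byte stream" concrete, this sketch contrasts transcoding a string to ASCII-8BIT (encode) with merely retagging it (force_encoding). The sample string and variable names are mine; the respond_to? guard mirrors the one above so the same code would be skipped on Ruby 1.8.

```ruby
source       = "\u83EF\u4EBA\u91AB\u5B78"   # "華人醫學", tagged UTF-8
bytes_before = source.bytes.to_a

# Transcoding asks Ruby to convert each character, which cannot work here:
# there is no ASCII-8BIT equivalent for these characters.
transcode_error = begin
  source.encode("ASCII-8BIT")
  nil
rescue Encoding::UndefinedConversionError => e
  e.class
end

# Retagging leaves every byte alone and only changes the label.
binary = source.dup
binary.force_encoding("ASCII-8BIT") if "1.9".respond_to?(:encoding)

puts transcode_error.inspect
puts binary.encoding
puts binary.bytes.to_a == bytes_before
puts source.length   # counted in characters
puts binary.length   # counted in bytes
```

The bytes are identical before and after; only the way Ruby counts and compares them changes, which is all a byte-oriented detector needs.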
Fortunately, I have successfully ported rCharDet to Ruby 1.9, and it now lives on my GitHub account. This is also how I fixed the issues when I ported my h2o-template to Ruby, and it's also how the Rails core team fixed this in the ActionView ERB template.
So please, Rack, can you do the same (just accept the ASCII-8BIT patch) so my app can run again? These Chinese characters (華人醫學) in the URL aren't letting me go to sleep tonight.