Monday, March 5, 2007

i18n'ing it up!

I plan on working on some multi-lingual websites, so what better encoding to use than UTF-8, right? You get dozens of languages and character sets all supported by a single encoding, like on this site.
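To make that concrete, here's a small standalone sketch: one Java string mixing Latin, Cyrillic, and Japanese text, all of it representable in a single UTF-8 byte stream (the sample strings are just ones I picked for illustration).

```java
import java.io.UnsupportedEncodingException;

public class Utf8Demo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Three scripts in one string -- no code pages, no switching.
        String mixed = "hello, привет, こんにちは";
        byte[] utf8 = mixed.getBytes("UTF-8");
        // UTF-8 is variable-width: ASCII is 1 byte per char,
        // Cyrillic 2, and Japanese kana 3.
        System.out.println(mixed.length() + " chars, " + utf8.length + " bytes");
    }
}
```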

So what's it take to get all this set up? Let's start with the most obvious thing you can do: sticking this into your HTML <head>.

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
That's not nearly enough, however, because most browsers will also check your HTTP Content-Type header, and if it doesn't agree with your meta tag, the header value takes precedence. This is fairly easy to correct. In JSP I can do it with a page directive:

<%@ page contentType="text/html; charset=utf-8" %>
What about sending/receiving form data?

Most browsers will submit form data in the same encoding the page was served in, but for an added guarantee you can specify one with the accept-charset attribute on your form tag:

<form accept-charset="utf-8" method="post" action="postArticle.do">
...
</form>
Things can get a little trickier on the receiving end, however. A servlet will check the character encoding with request.getCharacterEncoding(), and if it's null the container falls back to its default (ISO-8859-1), so you won't get what you're expecting. I'm not enough of a container expert to understand everything going on behind the scenes, but in my case it was necessary to tell Java to use UTF-8 explicitly. I did this with a simple modification to my ActionForm bean:

public void setContent(String content) throws Exception {
    // The container decoded the UTF-8 bytes as Latin-1; undo that here.
    this.content = new String(content.getBytes("8859_1"), "UTF8");
}
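To see why this round-trip works, here's a standalone sketch (plain Java, no container needed). Latin-1 maps every byte 0x00–0xFF straight to the code point of the same value, so decoding UTF-8 bytes as Latin-1 garbles the text but loses no information, and re-encoding as Latin-1 hands back the original bytes.

```java
import java.io.UnsupportedEncodingException;

public class RecodeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String original = "héllo wörld";
        // Simulate what the container does: UTF-8 bytes wrongly decoded as Latin-1.
        byte[] utf8Bytes = original.getBytes("UTF-8");
        String garbled = new String(utf8Bytes, "8859_1"); // "hÃ©llo wÃ¶rld"
        // The trick from the ActionForm: re-encode as Latin-1, decode as UTF-8.
        String recovered = new String(garbled.getBytes("8859_1"), "UTF8");
        System.out.println(recovered.equals(original)); // true
    }
}
```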
Granted, there are probably much better ways to do this (request filters, etc.), and I'd be more than willing to listen to anyone else's expertise on the subject.
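For what it's worth, the filter route would mean calling request.setCharacterEncoding("UTF-8") in a servlet Filter before any form parameter is read, so the fix applies to every request instead of one bean setter. A sketch of the web.xml registration that would go with it, assuming a filter class (CharsetFilter here is a hypothetical name, not a real library class):

```xml
<!-- Hypothetical registration; com.example.CharsetFilter would be your own
     Filter whose doFilter() calls request.setCharacterEncoding("UTF-8")
     before invoking the rest of the chain. -->
<filter>
    <filter-name>charsetFilter</filter-name>
    <filter-class>com.example.CharsetFilter</filter-class>
</filter>
<filter-mapping>
    <filter-name>charsetFilter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>
```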