When you use the latin2 (or perhaps any non-latin1) encoding, you will probably bump into the issue of the character set not having the punctuation you need. In Hungarian, for example, such characters are:
- the en dash – (the &ndash; HTML entity)
- the quotation characters „ and ”
So how come you haven't noticed it yet? It's because your browser recognizes that you aren't using utf-8, so it encodes those characters to their HTML entity equivalents before sending the request to your server. Your server may store these values in the database, and when rendering an HTML page they simply get handed back to the browser in the HTML source and appear properly on the user interface. Or do they?!
You certainly want to use htmlspecialchars() to properly encode database values before outputting them to the browser (especially if your DB contains any custom user input). The problem with htmlspecialchars() is that it also converts the first character of any HTML entity (&) to its own HTML entity equivalent, which is &amp;. Thus, for example, &ndash; becomes &amp;ndash;. The result is what we might call "double encoding".
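To see the double encoding in action, here is a minimal sketch. The sample value in $stored is made up for illustration; it stands for a string that was saved with the entity already in it.

```php
<?php
// Hypothetical value as it might sit in the database after a latin2
// form submission: the en dash is already stored as an entity.
$stored = 'Budapest &ndash; Szeged';

// Default behaviour: the leading "&" of the entity is escaped again.
$escaped = htmlspecialchars($stored);

echo $escaped, "\n"; // Budapest &amp;ndash; Szeged
```

The browser then displays the literal text "&ndash;" instead of the dash.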
In some special cases you may trust the data you are displaying and remove htmlspecialchars() so that no escaping happens. This isn't the desired behavior either, though, because then you cannot enter less-than (<) or greater-than (>) signs, for example: they end up in the HTML source unencoded and the browser treats them as HTML tag markers, with undesired results. Sure, you can still type their HTML entity equivalents, &lt; and &gt;, but that's not too convenient.
The issue was addressed in PHP 5.2.3 by a quick fix: it introduced a fourth htmlspecialchars() parameter, bool $double_encode, which is true by default to stay backwards-compatible. You can set it to false, so that any HTML entity occurrence in the string you pass in won't get encoded again, while the rest of the special characters still will.
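A small sketch of the difference the fourth parameter makes. The sample string is made up. Note the charset argument: as far as I know, htmlspecialchars() does not accept ISO-8859-2 as a charset name, so ISO-8859-1 is passed here as a single-byte stand-in, on the assumption that only ASCII characters get escaped anyway.

```php
<?php
// Hypothetical stored value: one existing entity, one bare "&", one "<".
$stored = 'Tom & Jerry &ndash; 1 < 2';

// Default ($double_encode = true): the existing entity is escaped again.
$double = htmlspecialchars($stored, ENT_COMPAT, 'ISO-8859-1');

// $double_encode = false: existing entities are left alone,
// while the bare "&" and the "<" are still escaped.
$single = htmlspecialchars($stored, ENT_COMPAT, 'ISO-8859-1', false);

echo $double, "\n"; // Tom &amp; Jerry &amp;ndash; 1 &lt; 2
echo $single, "\n"; // Tom &amp; Jerry &ndash; 1 &lt; 2
```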
This seems to work fine: the browser recognizes the use of a single-byte character set which, in our example, lacks some special punctuation characters (like „ and ”), so it converts them to their HTML entity equivalents, and the server, when displaying the value, converts any "standard" special HTML characters (like < and >) to their HTML entity equivalents. This is fine for most applications.
There are still (at least) two issues with this approach though: special punctuation characters will end up encoded in your database, and, because of this:
a) your searches will not be consistent: if you submit a search for "ash", you also get back every record that contains an &ndash; "character"
b) your field length isn't accurate anymore: if your field is declared as varchar(100) in MySQL, for example, you can't store 100 characters in it unless the value contains no special punctuation.
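Both problems are easy to reproduce with plain string functions. This is only a sketch: the stored value is made up, and strpos() and strlen() stand in for what the database would do with a LIKE search and a varchar length limit.

```php
<?php
// Hypothetical stored value; the user typed "1990 – 2000" (11 characters).
$stored = '1990 &ndash; 2000';

// (a) A search for "ash" matches inside the entity name "ndash",
//     although the visible text contains no such substring.
$falseHit = strpos($stored, 'ash') !== false;

// (b) The entity takes 7 bytes where the user typed a single character.
$storedLength  = strlen($stored);                              // 17
$visibleLength = strlen(str_replace('&ndash;', '-', $stored)); // 11

var_dump($falseHit);                 // bool(true)
echo $storedLength, ' vs ', $visibleLength, "\n"; // 17 vs 11
```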
So, if you want to create websites and/or applications which are totally perfect from a linguistic point of view as well (and you do, because you want to Set Higher Standards for yourself and those around you), you need to use utf-8 even if there's a native character set available for the language you are using.
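For comparison, here is a minimal sketch of the same kind of value handled as utf-8 end to end. It assumes the page and the form are served as utf-8, so the browser sends the real characters instead of entities. The "\u{...}" escapes need PHP 7.0 or later and are used only to keep the sample independent of the source file's own encoding.

```php
<?php
// Hypothetical user input, received as real utf-8 characters:
// „1 < 2” – ok
$input = "\u{201E}1 < 2\u{201D} \u{2013} ok";

// Only the HTML special characters are escaped; the punctuation
// stays as it is, so nothing can be double encoded.
$html = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo $html, "\n"; // „1 &lt; 2” – ok
```

The value stored in the database is then the text the user typed, so searches and field lengths behave as expected.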