Some strings you can shove into your application to see if you at least sort of handle Unicode correctly. For example, if you have a form which stores a text value in a database and then shows it later on a webpage, you can stick these in the form and see if they come out correctly at the other end.
The first one includes many non-ASCII characters. The second one starts with a multi-byte character.
Today I tried to find out how to convert Windows-1252 encoded code files to UTF-8 without messing up the Norwegian characters. Couldn’t really find anything good other than Linux tools and PHP stuff. Finally, *facepalm*, I remembered it might be possible using Notepad… And sure enough, it seems to work great. Just open the Windows-1252 encoded file in Notepad, choose ‘Save as’, and set the encoding to UTF-8.
Hopefully I won’t forget this the next time I need it… *sigh*
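For the times when Notepad isn't at hand, the same conversion can be sketched in a few lines of Java. This is only a rough sketch, not something from the original workflow: the class name `Cp1252ToUtf8` and the two command-line arguments are made up for illustration, and it assumes the runtime ships the `windows-1252` charset (common JDKs do, but the Java spec doesn't strictly guarantee it).

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class, named for this example only.
public class Cp1252ToUtf8 {
    // Assumption: the JDK in use provides this charset.
    private static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    /** Decodes the bytes as Windows-1252 and re-encodes the text as UTF-8. */
    public static byte[] convert(byte[] cp1252Bytes) {
        String text = new String(cp1252Bytes, WINDOWS_1252);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path source = Paths.get(args[0]); // Windows-1252 encoded input file
        Path target = Paths.get(args[1]); // UTF-8 encoded output file
        Files.write(target, convert(Files.readAllBytes(source)));
    }
}
```

Run it as `java Cp1252ToUtf8 old.php new.php` (file names are placeholders). The whole file is read into memory, which should be fine for source files but not for anything huge.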
After another reply to a question I’ve had on StackOverflow for a while, I decided that I should perhaps add another level of security to my method of providing JSONP from PHP. The way I did it before, I didn’t do any checking of the provided callback. This means that someone could technically put whatever they wanted in there, including malicious code. So it might be a good idea to check whether the callback, which should be a function name, actually is a valid function name. But,
Ok, so I was happily reading CSV files from an SFTP server. The file content was returned as an InputStream, and I used a BufferedReader to read it line by line. Each line contained either a header or an order. The header lines started with the string “HDR”.
However, I suddenly discovered that my code was consistently skipping the first header (and, as a result, the orders belonging to it). The reason, I found, was simple. The first header, on the first line, didn’t start with “HDR”; it started with “□HDR”! And that undisplayable square turned out to be a Unicode Byte Order Mark (BOM).
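One way to get rid of such a leading BOM can be sketched like this. It is a minimal sketch and not necessarily the approach the full post ends up with: it assumes the file is UTF-8 encoded (so the BOM is the bytes EF BB BF), and the class name `BomSkipper`, the method `skipUtf8Bom`, and the sample data are all made up for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for this example only.
public class BomSkipper {
    // Assumption: the stream is UTF-8, whose BOM is these three bytes.
    private static final int[] UTF8_BOM = {0xEF, 0xBB, 0xBF};

    /** Returns a stream positioned just after a leading UTF-8 BOM, if there is one. */
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];

        // Read up to three bytes; the stream may be shorter than that.
        int read = 0;
        while (read < head.length) {
            int n = pushback.read(head, read, head.length - read);
            if (n < 0) break;
            read += n;
        }

        boolean isBom = read == UTF8_BOM.length
                && (head[0] & 0xFF) == UTF8_BOM[0]
                && (head[1] & 0xFF) == UTF8_BOM[1]
                && (head[2] & 0xFF) == UTF8_BOM[2];

        // Not a BOM: put the bytes back so the caller sees the stream unchanged.
        if (!isBom && read > 0) {
            pushback.unread(head, 0, read);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample: a BOM followed by a header line and an order line.
        byte[] data = "\uFEFFHDR;1\nORD;42\n".getBytes(StandardCharsets.UTF_8);
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(data)), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine().startsWith("HDR")); // prints true
        }
    }
}
```

Working on the raw bytes before they reach the reader means the check for “HDR” doesn't need to know about the BOM at all. A UTF-16 or UTF-32 file would need different byte patterns, which this sketch doesn't handle.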