Tuesday, October 20, 2009

Reading UTF-8 data from asynchronous sockets to the file system


Using asynchronous sockets, data is generally read into ByteBuffer objects. The usual pattern is to read repeatedly until there is no more data, and each time the ByteBuffer fills up, to transfer its contents to a larger, growable buffer such as a ByteArrayOutputStream.
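As a sketch of that read-and-accumulate pattern (an in-memory channel stands in for a real non-blocking SocketChannel here, so the example is self-contained; the class and method names are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

public class SocketReadSketch {

    // Drain a channel into a growable buffer, reading through a small ByteBuffer.
    public static ByteArrayOutputStream collect(ReadableByteChannel channel) {
        ByteBuffer buffer = ByteBuffer.allocate(8);  // deliberately small to force refills
        ByteArrayOutputStream outStrm = new ByteArrayOutputStream();
        try {
            while (channel.read(buffer) != -1) {     // -1 signals end-of-stream
                buffer.flip();                       // switch the buffer from writing to reading
                outStrm.write(buffer.array(), buffer.position(), buffer.remaining());
                buffer.clear();                      // make room for the next read
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return outStrm;
    }

    public static void main(String[] args) throws IOException {
        byte[] incoming = "hello from the socket".getBytes("UTF-8");
        ByteArrayOutputStream outStrm =
                collect(Channels.newChannel(new ByteArrayInputStream(incoming)));
        System.out.println(outStrm.size() + " bytes collected");
    }
}
```

With a real selector-driven socket the reads would be interleaved with readiness events, but the flip/drain/clear bookkeeping around the ByteBuffer is the same.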

Now, if you want to manipulate the collected data (which now sits in the ByteArrayOutputStream) as a String, it has to be decoded. This can be done with a CharsetDecoder like this:

ByteArrayOutputStream outStrm;

// read data into outStrm using NIO

byte[] ba = outStrm.toByteArray();
ByteBuffer byteBuffer = ByteBuffer.wrap(ba);
CharBuffer charBuffer = CharBuffer.allocate(ba.length);
Charset charset = Charset.forName("UTF-8");
CharsetDecoder decoder = charset.newDecoder();
CoderResult res = decoder.decode(byteBuffer, charBuffer, true);
res = decoder.flush(charBuffer);
charBuffer.flip();  // prepare the CharBuffer for reading
String out = charBuffer.toString();


However, all this decoding does is translate the UTF-8 byte sequences into their respective code points, stored as Java chars. As a result, we can't simply save this decoded data to a file (OutputStream) and expect correct UTF-8.

If you write the out string to an OutputStream using the platform's default encoding, it is not guaranteed to produce valid UTF-8. It will work for single-byte (ASCII) characters, but not necessarily for multi-byte ones. For example, the byte pair 0xC2 0xA0 is the UTF-8 encoding of the non-breaking space, code point 0xA0. The decoding above collapses those two bytes into the single char 0xA0; if you then write that char through a default-encoding writer, the result is no longer UTF-8, because the decoding stripped the UTF-8 byte structure and left only the code point.
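A small, self-contained demonstration of that byte-versus-code-point distinction (the class name and sample bytes are illustrative; ISO-8859-1 stands in for a single-byte default platform encoding):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

public class NbspDemo {
    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");

        // The UTF-8 encoding of the non-breaking space U+00A0 is the byte pair 0xC2 0xA0.
        ByteArrayOutputStream outStrm = new ByteArrayOutputStream();
        outStrm.write(0xC2);
        outStrm.write(0xA0);

        // Decoding collapses those two bytes into the single char '\u00A0'.
        String out = new String(outStrm.toByteArray(), utf8);
        System.out.println("decoded length: " + out.length());     // 1

        // A single-byte charset emits the lone byte 0xA0 -- not valid UTF-8
        // for this character.
        byte[] latin1 = out.getBytes(Charset.forName("ISO-8859-1"));
        System.out.println("ISO-8859-1 bytes: " + latin1.length);  // 1

        // Explicitly re-encoding as UTF-8 restores the two-byte sequence.
        byte[] utf8Bytes = out.getBytes(utf8);
        System.out.println("UTF-8 bytes: " + utf8Bytes.length);    // 2
    }
}
```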

So the correct approach is to simply write the collected bytes to an output stream directly, like this:

outStrm.writeTo(System.out);


This writes the raw UTF-8 bytes to the output stream, so the file ends up containing correct UTF-8 data.
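For a file rather than System.out, the same writeTo call works against a FileOutputStream; a minimal sketch (the temp file and sample text are illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;

public class SaveUtf8 {
    public static void main(String[] args) throws IOException {
        // Sample data: "caf\u00e9" is 5 bytes in UTF-8 (the accented e takes two).
        ByteArrayOutputStream outStrm = new ByteArrayOutputStream();
        outStrm.write("caf\u00e9".getBytes(Charset.forName("UTF-8")));

        // Writing the collected bytes straight to the file keeps them as UTF-8;
        // no decode/re-encode round trip is involved.
        File file = File.createTempFile("utf8-demo", ".txt");
        file.deleteOnExit();
        FileOutputStream fos = new FileOutputStream(file);
        outStrm.writeTo(fos);
        fos.close();

        System.out.println(file.length() + " bytes on disk");  // 5
    }
}
```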
