Lately I've been working with Netbeans and OpenESB on a project that requires a number of features of the relatively new JBI technology. It's been interesting, to say the least. Here's an issue I uncovered this morning.
Several of the project's actions can be triggered by XML files appearing in directories that the server polls. I've used the file binding component for these with no difficulty, at least until I got to one for which the input file and XSD schema are provided for me. They're part of another system with which the new server must interact.
The sample XML is UTF-16, little endian, which Windows calls "Unicode". I built the module to poll the directory and read this file just as I had for the other operations, but validating the XML produced an error: "Content is not allowed in prolog."
I made sure that the schema declared UTF-16, that the input file declared UTF-16, and that everything really was UTF-16LE, but had no luck. Experiments revealed that the file worked perfectly if I opened it in Windows Notepad and saved it with the "ANSI" encoding. In fact, if I did that, the file would read in fine regardless of the encoding stated in the header of either the XML or the XSD.
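The Notepad experiment can be approximated in code. The sketch below is only an illustration and not part of the project: the class name `AnsiResave` and its method are made up for the example, and ISO-8859-1 stands in for the Windows "ANSI" code page (the two agree for plain ASCII content). Java's `UTF-16` charset reads the byte order from the first two bytes while decoding, so only the text itself reaches the output.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: re-encode UTF-16 input as single-byte text,
// roughly what Notepad does when saving a "Unicode" file as "ANSI".
class AnsiResave {
    static byte[] toSingleByte(byte[] utf16Bytes) {
        // The UTF-16 decoder consumes the leading byte order bytes
        // and uses them to pick little or big endian.
        String text = new String(utf16Bytes, StandardCharsets.UTF_16);
        // Assumption: ISO-8859-1 as a stand-in for the ANSI code page.
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // First bytes of a little-endian UTF-16 file starting with "<?xml".
        byte[] sample = {(byte) 0xFF, (byte) 0xFE,
                         0x3C, 0x00, 0x3F, 0x00, 0x78, 0x00, 0x6D, 0x00, 0x6C, 0x00};
        byte[] out = toSingleByte(sample);
        System.out.println(new String(out, StandardCharsets.ISO_8859_1));
    }
}
```

Whether the file binding component would accept such output is untested here; the sketch only mirrors the manual Notepad step.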
Further reading led me to the Byte Order Mark (BOM). The BOM is the Unicode character U+FEFF, which otherwise denotes a zero-width, non-breaking space. A file designated as UTF-16 normally starts with a BOM at position zero, and its two bytes appear as either FE FF (big endian) or FF FE (little endian). The order of these bytes indicates the file's endianness. Hex-dumping my sample file revealed the BOM:
00000000: FF FE 3C 00 3F 00 78 00 6D 00 6C 00 20 00 76 00 ■<.?.x.m.l. .v.
00000010: 65 00 72 00 73 00 69 00 6F 00 6E 00 3D 00 22 00 e.r.s.i.o.n.=.".
I get the error about content in the prolog because the BOM character appears before the XML header.
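To make the byte-level check concrete, here is a minimal sketch that inspects the first two bytes the same way the hex dump does. The class `BomSniffer` and its methods are hypothetical names for illustration, not anything from OpenESB or the file binding component, and the sketch only handles the two UTF-16 marks discussed above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class BomSniffer {
    // Returns the charset implied by a leading UTF-16 BOM, or null if there is none.
    static Charset detectUtf16Bom(byte[] data) {
        if (data.length < 2) {
            return null;
        }
        int b0 = data[0] & 0xFF;
        int b1 = data[1] & 0xFF;
        if (b0 == 0xFF && b1 == 0xFE) {
            return StandardCharsets.UTF_16LE; // FF FE: little endian
        }
        if (b0 == 0xFE && b1 == 0xFF) {
            return StandardCharsets.UTF_16BE; // FE FF: big endian
        }
        return null;
    }

    // Decodes the bytes after the BOM, so the text starts at "<?xml".
    static String decodeSkippingBom(byte[] data) {
        Charset cs = detectUtf16Bom(data);
        if (cs == null) {
            throw new IllegalArgumentException("no UTF-16 BOM found");
        }
        return new String(data, 2, data.length - 2, cs);
    }

    public static void main(String[] args) {
        // The first twelve bytes from the hex dump above.
        byte[] head = {(byte) 0xFF, (byte) 0xFE,
                       0x3C, 0x00, 0x3F, 0x00, 0x78, 0x00, 0x6D, 0x00, 0x6C, 0x00};
        System.out.println(detectUtf16Bom(head));
        System.out.println(decodeSkippingBom(head));
    }
}
```

For the sample file this reports little endian, and the decoded text begins directly with the XML header rather than with the BOM character.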
This may be related to this Netbeans bug, declared fixed earlier this year:
http://www.netbeans.org/issues/show_bug.cgi?id=83321
Maybe not quite... I'm not sure what I'll have to do about it.