Jeff Sexton

Thursday, October 16, 2008

BOMed

Lately I've been working with NetBeans and OpenESB on a project that requires a number of features of this relatively new JBI technology. It's been interesting, to say the least. Here's an issue I uncovered this morning.

Several of the project's actions can be triggered by XML files appearing in directories that the server polls. I've used the file binding component for these with no difficulty, at least until I got to one for which the input and the XSD schema are provided for me. They're part of another system with which the new server must interact.

The sample XML is UTF-16, little endian, which Windows calls "Unicode". I built the module to poll the directory and read this file just as I had with the other operations, but there was an error validating the XML: "Content is not allowed in prolog."

I made sure that the schema declared UTF-16, that the input file declared UTF-16, and that everything really was UTF-16LE. No luck. Experiments revealed that the file worked perfectly if I opened it in Windows Notepad and saved it with the "ANSI" encoding. In fact, once I did that, the file would read in fine regardless of the encoding stated in the header of either the XML or the XSD.
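For what it's worth, the Notepad round trip can be approximated in Java. This is only a sketch of the workaround, not a fix and not anything from the project: the class and method names are made up for illustration, and ISO-8859-1 stands in for Notepad's "ANSI", which is really the system code page. It relies on Java's generic UTF-16 charset consuming a leading BOM while decoding. The XML declaration inside the text is left untouched.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: re-encodes UTF-16 text (with BOM) into a BOM-less encoding.
final class Utf16Transcoder {

    // Decoding with the generic UTF-16 charset reads the BOM to pick the
    // byte order and does not keep it in the resulting string.
    static byte[] transcode(byte[] utf16WithBom, Charset target) {
        String text = new String(utf16WithBom, StandardCharsets.UTF_16);
        return text.getBytes(target);
    }

    // File-to-file variant of the same idea.
    static void transcodeFile(Path in, Path out, Charset target) throws IOException {
        Files.write(out, transcode(Files.readAllBytes(in), target));
    }

    public static void main(String[] args) {
        // FF FE followed by "<a>" in UTF-16LE.
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00, 0x61, 0x00, 0x3E, 0x00};
        byte[] out = transcode(sample, StandardCharsets.ISO_8859_1);
        // prints "3 bytes: <a>"
        System.out.println(out.length + " bytes: " + new String(out, StandardCharsets.ISO_8859_1));
    }
}
```

Characters that the target encoding cannot represent would be replaced during `getBytes`, so this only makes sense for files that are effectively ASCII, as my sample appears to be.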

Further reading led me to the Byte Order Mark (BOM). The BOM is the Unicode character U+FEFF, a single two-byte unit in UTF-16 that otherwise denotes a zero-width, non-breaking space. A file designated as UTF-16 will typically have a BOM at position zero, stored as either the bytes FE FF or FF FE (the reverse). The order of these bytes indicates the file's endianness: FE FF means big endian, FF FE means little endian. Hex-dumping my sample file revealed the BOM.

00000000: FF FE 3C 00 3F 00 78 00 6D 00 6C 00 20 00 76 00 ..<.?.x.m.l. .v.
00000010: 65 00 72 00 73 00 69 00 6F 00 6E 00 3D 00 22 00 e.r.s.i.o.n.=.".
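The same check I did by eye on the dump can be sketched in a few lines of Java. The class and method names here are hypothetical, and the sketch looks only at the two UTF-16 marks discussed above.

```java
// Hypothetical helper: inspects the first two bytes of a file for a UTF-16 BOM.
final class BomSniffer {

    // Returns "UTF-16LE", "UTF-16BE", or null when no UTF-16 BOM is present.
    // Caveat: FF FE 00 00 is also how a UTF-32LE BOM starts; this sketch
    // does not try to tell the two apart.
    static String detectUtf16Bom(byte[] head) {
        if (head == null || head.length < 2) {
            return null;
        }
        int b0 = head[0] & 0xFF; // mask to get the unsigned byte value
        int b1 = head[1] & 0xFF;
        if (b0 == 0xFF && b1 == 0xFE) {
            return "UTF-16LE";
        }
        if (b0 == 0xFE && b1 == 0xFF) {
            return "UTF-16BE";
        }
        return null;
    }

    public static void main(String[] args) {
        // First bytes of the sample file from the hex dump: FF FE 3C 00 3F 00
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00, 0x3F, 0x00};
        System.out.println(detectUtf16Bom(sample)); // prints UTF-16LE
    }
}
```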

I get the error about content in the prolog because the BOM character appears before the XML declaration.
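I don't know how the file binding component hands the text to its parser, so the following is only a guess at the mechanism. With the standard JAXP parser, this error can show up when a document arrives as characters with U+FEFF still sitting in front of the declaration. A minimal sketch of dropping that character before parsing, with hypothetical class and method names:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: removes a leading BOM character before XML parsing.
final class BomStripper {

    // Returns the text without a leading U+FEFF, or unchanged if there is none.
    static String stripBom(String text) {
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            return text.substring(1);
        }
        return text;
    }

    // Parses XML that was already decoded to a String, tolerating a leading BOM.
    // Because the parser is given a character stream, the encoding named in
    // the XML declaration is not used to decode the input.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(stripBom(xml))));
    }

    public static void main(String[] args) throws Exception {
        String xml = "\uFEFF<?xml version=\"1.0\"?><order/>";
        Document doc = parse(xml);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Whether something like this can be hooked into the file binding component is a separate question that I haven't answered.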

This may be related to this NetBeans bug, which was declared fixed earlier this year.

http://www.netbeans.org/issues/show_bug.cgi?id=83321

Maybe not quite... I'm not sure what I'll have to do about it.



