XML is your friend

I've been working a lot recently with data imports and exports between systems.  Obviously different systems have a variety of ways they can export data, and some even have multiple ways of importing data, but none of them seem to share a standard file format.  In recent weeks I have worked with:

  • Files where fields are delimited with a #
  • Files where fields are delimited with a ^
  • Files where fields are delimited with a , (typical CSV files)
  • Files where everything is nicely structured XML

Out of all of those, my favorite is the XML file.  Yes, XML gets a lot of bad press (it certainly used to), but I think it's just misunderstood.  Okay, an XML file is going to be several times larger than a file delimited by a single character, but with today's computing that shouldn't matter.  Disk space is very cheap, and data transfer speeds online are quick enough for it not to matter that much.  I also find XML a lot easier to work through if there are issues.

For example, I recently had to create an export where the fields needed to be separated by a #, with over 180 possible items of data per row.  The file was missing some fields, which made each object too short, so the file failed during the import.  The process for solving the issue went something along the lines of:

  1. Load the export specification
  2. Open up the export generator
  3. Look line by line at what it was producing
  4. Determine where the missing fields should be
  5. Add in the missing fields
  6. Produce a new export
  7. Count the number of data items per line
  8. Run the file through the import

Because each object was on one line with just a single character between fields, the import couldn't tell me where the missing fields were.  As far as it was concerned, the fields were missing from the end of the object because the row was too short.  The fields were actually missing about two thirds of the way through the object line, and it was a long and very manual process to check each data item against the specification.  Had the file been an XML file, the system importing the data would (or at the very least should) have been able to determine which fields were missing.  That, combined with a quick check of the specification, would have allowed a much faster resolution of the issue.
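
To illustrate how named fields help, here is a small Python sketch using the standard library's xml.etree.ElementTree.  The tag names (items, item, f01 to f03) and the function itself are hypothetical, since the real specification isn't shown here.  The point is only that an importer can name the missing fields instead of just noticing a short row.

```python
import xml.etree.ElementTree as ET


def missing_fields(xml_text, expected_names, item_tag="item"):
    """Return (item_number, [missing field names]) for each incomplete item."""
    root = ET.fromstring(xml_text)
    report = []
    for number, item in enumerate(root.iter(item_tag), start=1):
        # Collect the names of the child nodes actually present in this item.
        present = {child.tag for child in item}
        missing = [name for name in expected_names if name not in present]
        if missing:
            report.append((number, missing))
    return report


# Hypothetical export: the second item has lost a field from the middle.
export = """
<items>
  <item><f01>a</f01><f02>b</f02><f03>c</f03></item>
  <item><f01>a</f01><f03>c</f03></item>
</items>
"""
print(missing_fields(export, ["f01", "f02", "f03"]))
```

Because every value carries its own name, a gap in the middle of an object is reported as that field, not as a row that ends too early.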

The downside, however, is that if the file were XML it would be much larger.  I created a simple file for the object with 180 data items, each item having the value "test", and separated the items with a # in a text file.  Total size: 935 bytes.  Less than 1k.  The same data in XML, including a header, a parent group () and then the object as a type item (), where each node name was 3 characters long, was 4582 bytes, ~5k.  5 times the size on a very basic test.  In reality the fields wouldn't have had 3 character names, and some of the data would have been longer, some shorter.  That's a large difference in file size, but nothing that modern disks should need to worry about.
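
A comparison like this can be reproduced with a few lines of Python.  This is a sketch under assumptions: the tag names (items, item, and generated 3-character field names such as a00) are invented, and the layout is one node per line, so the byte counts will not match the figures above exactly.

```python
def field_name(index):
    """Hypothetical 3-character node names: a00..a99, then b00..b79."""
    return chr(ord("a") + index // 100) + f"{index % 100:02d}"


def delimited_row(values, delimiter="#"):
    """One object as a single delimited line."""
    return delimiter.join(values)


def xml_document(values):
    """The same object as XML (values are assumed to need no escaping)."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<items>", "  <item>"]
    for index, value in enumerate(values):
        name = field_name(index)
        lines.append(f"    <{name}>{value}</{name}>")
    lines += ["  </item>", "</items>"]
    return "\n".join(lines)


values = ["test"] * 180
flat_size = len(delimited_row(values).encode("utf-8"))
xml_size = len(xml_document(values).encode("utf-8"))
print(f"delimited: {flat_size} bytes, XML: {xml_size} bytes, "
      f"ratio: {xml_size / flat_size:.1f}x")
```

Changing the indentation, the node names or the values changes the ratio, so treat any single number as a rough guide and not a rule.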

I'll admit that when I was younger and first getting into computing in a more serious way than Geo Pages, XML always looked horrible and clunky, but I think I was too young to understand the uses of it, of which there are many.  Back then I didn't have to worry about transferring data in a structured way, or getting it into a system with just the right fields to be able to complete an import.  Back then I just wondered why this thing that looked like HTML always seemed like random garbage, and why none of the tags were the same from one example to the next.  I think that's where XML gets its bad reputation, and it's completely unjustified.

I never thought I would become an advocate of XML, but when it comes to a mass of data imports and exports, it should be seriously considered rather than laughed at.  It makes tracking errors a lot easier, and therefore fixing issues easier, all the more so for objects with a lot of data.  XML isn't something to be laughed at, but embraced like an old friend, in the right circumstances.