Wednesday, July 9, 2014

PKZip through rose-colored glasses

PKZip is a piece of software that I've looked fondly upon - until recently.  Most computer users are aware of PKZip (a.k.a. just "zip"), which is probably the most common file compression format around.  It's been in use since the late 80s and shows up everywhere in computing, so even if you don't use the format directly, I can pretty much guarantee that much of the software and many of the services you rely on do.  And what's not to love about this format?  It offers very good compression, it's fast, and it's royalty free unlike some other compression software.

Well, it turns out not everything is peachy-keen in Zip-land.  Recently at work I've had the need to directly read and write zip files.  I cannot rely on existing code to read and write the zip file for me; I must do it myself.  Fortunately I don't have to write the actual compression and decompression code, but all the metadata inside the zip file I must write myself.  Basically the internal structure of a zip file is a lot of smaller structures that contain file info such as compressed/uncompressed sizes, filename, attributes, etc.  These structures also point to the relative positions of other structures in the file.  This is all standard stuff if you've ever written code to process a binary file.  So what's the problem then?  Simple: the way these structures are laid out is horrible.  You have to start parsing the file from the end, which is counter-intuitive; some structures have signature IDs whereas others do not; the contents of structures vary depending on bit flags; and so on.  But by far the biggest headache is the zip64 extensions.  The original zip format stores sizes and offsets in 32-bit fields, so it cannot handle large files, and the format had to be extended to support 64-bit sizes and file offsets.  I understand they wanted to maintain backwards compatibility, but what they should have done was create a new format and let zip/unzip tools adjust accordingly.  It would actually be the same code to maintain backwards compatibility; just where that code goes would have been different.
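To make the "start from the end" point concrete, here is a rough Python sketch of the first step a reader has to take: scanning backwards for the end-of-central-directory record and unpacking it.  The field layout follows the published zip format, but the function name `find_eocd` and the dictionary keys are my own for illustration, and the naive backwards search assumes the signature bytes don't also appear inside the archive comment.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"             # 0x06054b50, stored little-endian
EOCD_FORMAT = "<4sHHHHIIH"                 # fixed part of the record
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)   # 22 bytes
MAX_COMMENT = 0xFFFF                       # comment length is a 16-bit field


def find_eocd(data: bytes) -> dict:
    """Locate and unpack the end-of-central-directory record (hypothetical helper)."""
    # The record sits at the end of the file, followed only by an optional
    # comment, so only the last 22 + 65535 bytes need to be searched.
    start = max(0, len(data) - EOCD_SIZE - MAX_COMMENT)
    pos = data.rfind(EOCD_SIGNATURE, start)
    if pos < 0 or pos + EOCD_SIZE > len(data):
        raise ValueError("no end-of-central-directory record found")

    (_sig, disk, cd_disk, entries_disk, entries_total,
     cd_size, cd_offset, comment_len) = struct.unpack_from(EOCD_FORMAT, data, pos)

    return {
        "eocd_pos": pos,
        "entries_total": entries_total,
        "cd_size": cd_size,
        "cd_offset": cd_offset,
        "comment_len": comment_len,
        # Saturated fields mean the real values live in the zip64 records,
        # which have to be located and parsed separately.
        "needs_zip64": (entries_total == 0xFFFF
                        or cd_size == 0xFFFFFFFF
                        or cd_offset == 0xFFFFFFFF),
    }


if __name__ == "__main__":
    # Build a tiny archive in memory with the standard library, then read
    # its trailer back by hand.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("hello.txt", b"hello, zip")
    print(find_eocd(buf.getvalue()))
```

Only once this record has been read do you know where the central directory starts, and only the central directory tells you where each file's data lives.  The `needs_zip64` check hints at the zip64 headache: when a 16-bit or 32-bit field is saturated, the reader has to go hunting for a second set of records to get the real value.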

Oh well, I can't fault them too much.  I mean the zip format became wildly popular, probably more popular than they had ever anticipated.  They might have put more thought into the design had they known.  Also, it would have been hard to foresee the need for 64-bit support back in a time when hard drives were measured in tens of megabytes.

PKZip has definitely stood the test of time.  But its age is showing.  Newer formats like RAR and 7-Zip's 7z offer better compression ratios.  Personally I would recommend 7-Zip.  But zip is so ubiquitous it's going to be around for a while.  I just wish the internals of the file weren't so bad.
