I had the displeasure of using multiple versions of Apache Pig (0.9, 0.11, 0.12, 0.13 and 0.14) in different capacities. Why was it so unpleasant, you ask?
My scripts ran quickly and efficiently on Pig 0.9.2. I was using globs in my LOAD statement (e.g. "a = LOAD '/files/*/*type_v4*.lzo';") to find tens to hundreds of thousands of files in an HDFS structure like this:
files/
-- 2014-10-01/
---- 2014-10-01-15-45-00.log_type_v4.log.lzo
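To make the glob concrete, here is a minimal Python sketch of how a pattern like the one in my LOAD statement lines up with this layout. It assumes each "*" stays within a single directory level, which is how I understand Hadoop-style globs to behave. The glob_match helper and the second sample path are made up for illustration; this is not Pig's actual matching code.

```python
from fnmatch import fnmatchcase


def glob_match(pattern: str, path: str) -> bool:
    """Match a path against a glob one directory level at a time.

    Assumption: '*' never crosses a '/' boundary, so the pattern and
    the path must have the same number of components.
    """
    pattern_parts = pattern.strip("/").split("/")
    path_parts = path.strip("/").split("/")
    if len(pattern_parts) != len(path_parts):
        return False
    return all(
        fnmatchcase(part, pat) for pat, part in zip(pattern_parts, path_parts)
    )


PATTERN = "/files/*/*type_v4*.lzo"

paths = [
    "/files/2014-10-01/2014-10-01-15-45-00.log_type_v4.log.lzo",
    # Hypothetical v5 file, only here to show a non-match.
    "/files/2014-10-02/2014-10-02-09-00-00.log_type_v5.log.lzo",
]

# Keep only the paths the glob would pick up.
matched = [p for p in paths if glob_match(PATTERN, p)]
```

With tens to hundreds of thousands of files under files/, every one of those paths has to be listed and checked like this before the job can even start.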
The log types could vary by day, but the version would change by at most 1 on any given day, and most days kept the same version number. So on 2014-10-02 there might be log types v4 and v5, but for the rest of October there would only be log type v5.
The log types were effectively cumulative: a new log version had the same fields as the old version plus some new ones. Unfortunately… new fields were sometimes added to the MIDDLE of a new log version, so log version 4 would have 2 columns, name and phone, in that order, but log version 5 would have name, email and phone, in that order. This meant a new parser was needed for every log version.
However, while the versions would change, my aggregations only used fields that existed in all logs, and I wanted to look back through many log versions in my analysis.
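The per-version parsers plus a shared set of fields can be sketched roughly as follows. This is Python rather than my actual Pig scripts, and it rests on a few assumptions: the column layouts are the two-version example from above (version 4 with name and phone, version 5 with email inserted in the middle), the delimiter is a tab, and every function name here is hypothetical.

```python
import re

# Assumed column layouts, one per log version (illustrative only).
FIELDS_BY_VERSION = {
    4: ["name", "phone"],
    5: ["name", "email", "phone"],
}

# The version is taken from the "type_vN" part of the file name.
VERSION_RE = re.compile(r"type_v(\d+)")


def version_from_filename(filename: str) -> int:
    """Extract the log version from a name like '...log_type_v4.log.lzo'."""
    match = VERSION_RE.search(filename)
    if match is None:
        raise ValueError(f"no log version in {filename!r}")
    return int(match.group(1))


def parse_line(line: str, version: int, delimiter: str = "\t") -> dict:
    """Parse one line using the column order of its log version."""
    fields = FIELDS_BY_VERSION[version]
    values = line.rstrip("\n").split(delimiter)
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} columns, got {len(values)}")
    return dict(zip(fields, values))


def common_fields() -> list:
    """Fields present in every version, in the oldest version's order."""
    shared = set.intersection(*(set(f) for f in FIELDS_BY_VERSION.values()))
    oldest = FIELDS_BY_VERSION[min(FIELDS_BY_VERSION)]
    return [f for f in oldest if f in shared]


def project(record: dict) -> dict:
    """Reduce a parsed record to the fields shared by all versions."""
    return {f: record[f] for f in common_fields()}
```

Because fields are looked up by name instead of by position, a column landing in the middle of a new version only means adding one more entry to the layout table, and the aggregation code that works on the shared fields stays the same.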
So we’re back to the real issue at hand. Once I got past the individual parsers for each log version and had the scripts running regularly on Pig 0.9.2, I started running them on other clusters and immediately noticed a huge problem. The LOAD statements would not finish, and even when they did, the job would almost always fail. I went through the verbose output and tried to figure out why it was hanging on every version newer than 0.9.2, but never came up with a solution. I need to get in touch with Pig’s maintainers about this one… For now, I begrudgingly use Pig 0.9.2 exclusively, because on every later version I tried, the LOAD statements could not seem to handle the massive number of small files I was passing to them.