As a followup to my initial post regarding Apache Pig’s inability to quickly load many small files in Pig 0.10 and newer, I wanted to share a simple fix that worked for me courtesy of in-depth research by Amazon Support Team (+ Engineers).
Basically around Pig 0.10.0, PigStorage builds a hidden schema file in an attempt to determine your file’s schema. By passing the ‘-noschema’ flag to PigStorage, we see far improved performance.
a = LOAD '/files/*' USING PigStorage('\t','-noschema') AS (field1:int, field2:chararray);
Much better.