As a follow-up to my initial post about Apache Pig’s inability to quickly load many small files in Pig 0.10 and newer, I wanted to share a simple fix that worked for me, courtesy of in-depth research by the Amazon Support team and its engineers.

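Starting around Pig 0.10.0, PigStorage looks for a hidden schema file (.pig_schema) alongside the input when loading, in an attempt to determine your file’s schema. That lookup appears to be what slows down loads over many small files. Passing the ‘-noschema’ flag to PigStorage tells it to ignore any stored schema, and with that flag set we see far better load performance.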

a = LOAD '/files/*' USING PigStorage('\t','-noschema') AS (field1:int, field2:chararray);

Much better.

