Split metadata size exceeded 10000000. was the error I got when trying to process ~20 TB of highly compressed logs (~100 TB uncompressed) on my 64-node Amazon EMR cluster. Naturally I found some good resources recommending a quick fix: modifying the mapred-site.xml file in /home/hadoop/conf/
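The original post doesn't reproduce the snippet, so the entry below is a sketch of the change: setting the split-metainfo limit to -1 in mapred-site.xml. On Hadoop 1.x the key is mapreduce.jobtracker.split.metainfo.maxsize; on Hadoop 2.x/YARN the equivalent is mapreduce.job.split.metainfo.maxsize.

```xml
<!-- mapred-site.xml: Hadoop 1.x property name shown;
     on Hadoop 2.x/YARN use mapreduce.job.split.metainfo.maxsize instead. -->
<property>
  <name>mapreduce.jobtracker.split.metainfo.maxsize</name>
  <value>-1</value>
</property>
```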

Warning: setting this configuration to -1 effectively removes the limit on split metadata. Be aware that this may have unintended consequences if your cluster does not have the resources to handle the resulting job.
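For a rough sense of why a job this size trips the default 10,000,000-byte limit: the split metainfo file grows with the number of input splits. The bytes-per-split figure below is an illustrative assumption, not a measured constant.

```shell
# Back-of-envelope: how many splits fit under the default limit?
LIMIT=10000000        # default split metainfo limit in bytes
BYTES_PER_SPLIT=150   # assumed average metadata per split (illustrative only)
echo $((LIMIT / BYTES_PER_SPLIT))
```

At roughly that many splits the default limit trips, and a job reading hundreds of thousands of small files sails past it, which is why raising or removing the limit is the usual workaround.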


Then, since I was modifying an existing running EMR cluster on Hadoop 1.0.4, I needed to restart my JobTracker. A quick sudo service --status-all lists the available services; the one we want is hadoop-jobtracker:
sudo service --status-all
sudo service hadoop-jobtracker restart

To restart the YARN ResourceManager on an EMR cluster running Hadoop 2.4.0, stop and start the service:
sudo service yarn-resourcemanager stop
sudo service yarn-resourcemanager start

Source post:

Apache Pig chokes on many small files
