java.io.IOException: Split metadata size exceeded 10000000. was the error I got when trying to process ~20TB of highly compressed logs (~100TB uncompressed) on my 64-node Amazon EMR cluster. Naturally I found some good resources recommending a quick fix: add the following property to the mapred-site.xml file in /home/hadoop/conf/:

Warning: By setting this configuration to -1, you are effectively removing the limit on the splits’ metadata. Be aware this may cause unintended consequences should your cluster not have the resources to handle the actual job.



<property>
  <name>mapreduce.jobtracker.split.metainfo.maxsize</name>
  <value>-1</value>
</property>
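
You can also try passing the same key per job instead of changing the cluster-wide config. I haven't tested this on that cluster, so treat it as a sketch: the jar name, driver class, and S3 paths below are placeholders, and it assumes your driver runs through ToolRunner/GenericOptionsParser so the -D generic option actually lands in the job configuration. On Hadoop 1 the JobTracker seems to read this limit from its own mapred-site.xml (hence the restart below), so the per-job route is mainly worth trying on Hadoop 2/YARN, where as far as I can tell the limit is checked against the job's own configuration.

hadoop jar my-etl-job.jar com.example.LogCruncher \
  -D mapreduce.jobtracker.split.metainfo.maxsize=-1 \
  s3://my-bucket/compressed-logs/ s3://my-bucket/processed/
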
Then, since I was modifying an existing, running EMR cluster on Hadoop 1.0.4, I needed to restart the JobTracker for the change to take effect:
sudo service --status-all    # list services to confirm the jobtracker service name
sudo service hadoop-jobtracker restart

To restart the YARN ResourceManager on a running EMR cluster on Hadoop 2.4.0:
sudo service yarn-resourcemanager stop
sudo service yarn-resourcemanager start    # EMR's service nanny may bring it back up on its own, but an explicit start doesn't hurt
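
If you're spinning up a fresh cluster anyway, you can skip the edit-and-restart dance and bake the setting in at launch with EMR's configure-hadoop bootstrap action (the one available on the 2.x/3.x AMIs this post is about). The fragment below is from memory, so treat it as a sketch: I believe the -m flag writes a key=value pair into mapred-site.xml, but double-check the flag against the EMR docs for your AMI version.

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-m,mapreduce.jobtracker.split.metainfo.maxsize=-1"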

Source post: http://blog.dongjinleekr.com/my-hadoop-job-crashes-with-split-metadata-size-exceeded/

