java.io.IOException: Split metadata size exceeded 10000000.
was the error I got when trying to process ~20TB of highly compressed logs (~100TB uncompressed) on my 64-node Amazon EMR cluster. The job generated so many input splits that their combined metadata blew past the default 10,000,000-byte (~10 MB) limit. Naturally I found some good resources recommending a quick fix: modifying the mapred-site.xml file in /home/hadoop/conf/
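Before changing anything, it helps to confirm the scale of the problem by counting the input files — if the logs are gzipped, each file is at least one split, since gzip isn't splittable. For example (the S3 path is just a placeholder):

hadoop fs -count s3://my-bucket/logs/    # prints directory count, file count, total bytes, path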
Warning: By setting this configuration to -1, you are effectively removing the limit on the splits’ metadata. Be aware this may cause unintended consequences should your cluster not have the resources to handle the actual job.
mapreduce.jobtracker.split.metainfo.maxsize -1
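In mapred-site.xml that setting takes the standard Hadoop property form, something like:

<property>
  <name>mapreduce.jobtracker.split.metainfo.maxsize</name>
  <value>-1</value>
</property>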
Then, since I was modifying an existing, running EMR cluster on Hadoop 1.0.4, I needed to restart my JobTracker for the change to take effect:
sudo service --status-all    # list all services to find the exact daemon name
sudo service hadoop-jobtracker restart
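If the restart went through cleanly, the init script should report the daemon as running again (assuming it supports the standard status action):

sudo service hadoop-jobtracker status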
To restart the YARN ResourceManager on an EMR cluster running Hadoop 2.4.0:
sudo service yarn-resourcemanager stop
sudo service yarn-resourcemanager start
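On Hadoop 2 the equivalent limit appears to be governed by mapreduce.job.split.metainfo.maxsize and is read by the MapReduce ApplicationMaster rather than a central JobTracker, so you may be able to lift it per job instead of editing cluster config. A minimal sketch of a driver doing that (class name, job name, and paths are placeholders, and the mapper/reducer setup is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitMetaInfoExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Lift the ~10 MB split-metadata cap for this job only (-1 = no limit).
    conf.setLong("mapreduce.job.split.metainfo.maxsize", -1L);

    Job job = Job.getInstance(conf, "log-crunch");          // hypothetical job name
    job.setJarByClass(SplitMetaInfoExample.class);
    // Mapper/reducer classes would be set here in a real job.
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. s3://my-bucket/logs/
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}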
Source post: http://blog.dongjinleekr.com/my-hadoop-job-crashes-with-split-metadata-size-exceeded/