Split metadata size exceeded 10000000

java.io.IOException: Split metadata size exceeded 10000000. was the error I got when trying to process ~20TB of highly compressed logs (~100TB uncompressed) on my 64 node Amazon EMR cluster. Naturally I found some good resources recommending a quick file by modifying the mapred-site.xml file in /home/hadoop/conf/

Warning: By setting this configuration to -1, you are effectively removing the limit on the splits’ metadata. Be aware this may cause unintended consequences should your cluster not have the resources to handle the actual job.

<!-- In: conf/mapred-site.xml -->
<property>
<name>mapreduce.jobtracker.split.metainfo.maxsize</name>
    <value>-1</value>
</property>

Then since I was modifying an existing running EMR cluster on Hadoop 1.0.4, I needed to restart my jobtracker:
sudo service --status-all
sudo service hadoop-jobtracker restart

Restart Yarn ResourceManager for running EMR cluster on Hadoop 2.4.0:
sudo service yarn-resourcemanager stop

Source post: http://blog.dongjinleekr.com/my-hadoop-job-crashes-with-split-metadata-size-exceeded/

Now using S4CMD

S3CMD’s distinct lack of multi-threading led me to hunt for alternatives. While I tried many alternatives, such as s3-multipart (great when I did use it), s3funnel and s3cp among others, none quite fit the bill of supporting the key features I found important.

1) Listing/Downloading/Uploading/etc of files and “folders”
2) Multi-threaded
3) Synchronization handled so as to avoid re-downloading an existing file

S4CMD fit all the requirements and did it at an even higher performance than I anticipated. It did not require me to set the number of threads I wanted it to use (for better or for worse), and it seemed to err on the side of more threads (e.g. 32 threads for a single process). This even considering Python Global Interpreter Lock (GIL). The downloading performance was superb. It was saturating a good portion of the available bandwidth (~30-50mbps – ~4-6MB/s). If it saturated the bandwidth anymore, my coworkers may not have appreciated the return to dialup :).

What makes S4CMD so awesome to me is the fact that it takes multi-threading very seriously, using it for more than just GET requests, but also even listing of files/directories. The multi-threading was a real pain point for me because I use logs from a third party that are split into 15 minute chunks and there are anywhere from 1 to numerous files per 15 minute chunk. For better or worse, each file is miniscule in size. Unfortunately this makes downloading the files a nightmare as the overhead to GET the file is excessive for any serial downloaders. S4CMD’s multi-threading makes downloading the files a breeze.

I am in no way affiliated with S4CMD besides having a code crush on it :).

Use regex to select only certain files via s3cmd

Recently I wanted to download only a small subset of thousands of files from an S3 “folder.”

The s3cmd docs were lacking with regard to how to select only a subset.

This is how I downloaded only log files from 11PM:
s3cmd get --recursive --rinclude ".*2014-10-28-23.*" --rexclude ".*" s3://my-bucket-of-logs/folder/2014-10-28/

Note that I used both rinclude and rexclude. If you only use rinclude (or include), it won’t restrict only to what you specified, but it will ensure what you specified is taken. So you have to exclude everything first (--rexclude ".*"), THEN include only what you want (--rinclude ".*2014-10-28-23.*")

My experience setting up a router as a repeater

Wired access is better than wireless generally, so when I wanted to have two machines in opposite areas with wired internet, something had to give. My initial setup before using two routers was to have the modem and main router in one area and just use wireless everywhere else. But that was cramping my speed testing style; I was consistently seeing poor speeds far from the router.

I got a new router, retiring – or so I thought – my trusty WRT54GL to backup duty. But when I installed the new wireless N router (sorry, not AC yet :)), I was happier with the performance far from the router, but not happy enough. So I decided to setup the WRT54GL as a repeater. And this is where it gets interesting.

Ingredients:
Linksys WRT54GL
DD-WRT firmware
Netgear WNR2000-100NAR

The WNR2000-100NAR became my new main router, because it has wireless N (not dual band, but good enough for this setup). Setup using WPA2-PSK (AES) and ensured internet access worked and all devices could connect. All was good so far.

When I switched my WRT54GL to “Repeater” and configured it using the same WPA2-PSK security and SSID as the WNR2000, it completed one task, which was to pickup the signal from the WNR2000 and bridge it to the WRT54GL’s ports. So now I had successfully attained wired internet access in both areas simultaneously. Sure, I lost some performance by using wireless as the repeating medium, but I had to start somewhere…

What I didn’t do was configure the repeater correctly so it wasn’t rebroadcasting the wireless signal and it was still acting as a DHCP server.

I remedied this by switching the WRT54GL from “Repeater” to “Repeater Bridge,” then configuring a new Virtual Interface. Next, I setup the wireless security specifically for that new virtual interface and ensured the DHCP server was disabled. Voila, the WNR2000 was now transmitting the network connection both wireless and through its wired interface. The WRT54GL was picking up the WNR2000’s wireless signal, rebroadcasting the signal wirelessly and on the wired interface. So now I had achieved both machines being wired and far superior wireless coverage.

Boo ya

Integer Ranges in SQL

Sometimes we need to store integers, but we have too many to store for a traditional relational database like MySQL, and Hadoop wouldn’t bat an eye at the amount. However we want to use SQL, because it poses a new challenge (and may be necessary for other reasons we won’t go into).

A typical table to store orders with their corresponding cities may have this structure:

CREATE TABLE order_cities
(
  order_num int unsigned,
  city varchar(255),
  KEY (order_num)
)
INSERT INTO order_cities VALUES (7,'Seattle'),(8,'Seattle'),(9,'San Francisco'),(10,'San Francisco'),(11,'New York')

Say we want to store 2^32 (~4.3B) orders. While some storage engines (i.e. TokuDB) can handle exceptionally large tables, their performance will be mostly undesirable. So let’s use an integer range to represent this data instead!

order_range_cities
(
  order_start int unsigned,
  order_end int unsigned,
  city varchar(255),
  PRIMARY KEY (order_start),
  UNIQUE KEY (order_end)
)
INSERT INTO order_range_cities VALUES (7,8,'Seattle'),(9,10,'San Francisco'),(11,'New York')

Now normally when using integers, we could simply ask for all rows with num between 7 and 11.

SELECT *
FROM order_cities
WHERE order_num BETWEEN 7 AND 11

Now to get all orders between 7 and 11, the solution seems straightforward:

SELECT *
FROM order_cities
WHERE order_start BETWEEN 7 AND 11

But that would be wrong for when the range boundaries do not align.

Say that instead of Seattle starting at 7 and ending at 8, it started at 5 and ended at 8. The above query would miss it entirely. So what can we do to find the range boundaries?

The easy (and slow way):

SELECT *
FROM order_cities
WHERE 7 BETWEEN order_start AND order_end

Find the first range boundary at the low end:

SELECT IF(order_end >= 7, order_start, NULL)
FROM order_cities
WHERE order_start <= 7
ORDER BY order_start DESC
LIMIT 1

Let me explain…

Let’s find any ranges that are less than or equal to 7 (in our case, it will find “5”)

WHERE order_start <= 7

Here’s a little trick with integer ranges. We want to see the highest value order_start that is closest to 7 first. We’ll assume ranges are contiguous and do not overlap at all, therefore the highest value would also be most likely to be the range containing 7.

ORDER BY order_start DESC

Many ranges could be less than 7, but not contain 7, so we don’t really care about those.

LIMIT 1

Only return the order_start if the order_end also includes 7 (this will protect us from grabbing ranges that are lower the 7, but do not contain 7).

SELECT IF(order_end >= 7, order_start, NULL)

Then since we already know the max value (11), we can use 11 as the highest inclusive order_start.

To recap, we wanted all rows whose range contained 7 through 11, and we found the low end range boundary to be 5, left the high end range boundary at 11 and ended up with this query:

SELECT *
FROM order_cities
WHERE order_start BETWEEN 5 AND 11

Going through and manually finding the low end range boundary may not be possible/desired every time, so we can create a stored function that retrieves this value for us:

CREATE FUNCTION get_order_start(v_order_num int unsigned) returns int unsigned
BEGIN
...
RETURN ifnull( -- Return 0 if the result is null
	(
		SELECT if(order_end >= v_order_num,order_start,NULL)
		FROM order_range_cities 
		WHERE order_start <= v_order_num
		ORDER BY order_start DESC 
		LIMIT 1
	)
,0);
...
END

In MySQL, when we want to use this automatically generated value, we cannot use it directly in a SQL clause (ie a WHERE clause). We have to wrap it in a derived table like this:

SELECT *
FROM
(
  SELECT get_order_start(7) AS first_order_start
) lower_bounary
JOIN order_range_cities orc ON (orc.order_start BETWEEN lower_boundary.first_order_start AND 11) -- first_order_start will be = 5