Integer Ranges in SQL

Sometimes we need to store integers, but we have too many to store for a traditional relational database like MySQL, and Hadoop wouldn’t bat an eye at the amount. However we want to use SQL, because it poses a new challenge (and may be necessary for other reasons we won’t go into).

A typical table to store orders with their corresponding cities may have this structure:

CREATE TABLE order_cities
(
  order_num int unsigned,
  city varchar(255),
  KEY (order_num)
)
INSERT INTO order_cities VALUES (7,'Seattle'),(8,'Seattle'),(9,'San Francisco'),(10,'San Francisco'),(11,'New York')

Say we want to store 2^32 (~4.3B) orders. While some storage engines (i.e. TokuDB) can handle exceptionally large tables, their performance will be mostly undesirable. So let’s use an integer range to represent this data instead!

order_range_cities
(
  order_start int unsigned,
  order_end int unsigned,
  city varchar(255),
  PRIMARY KEY (order_start),
  UNIQUE KEY (order_end)
)
INSERT INTO order_range_cities VALUES (7,8,'Seattle'),(9,10,'San Francisco'),(11,'New York')

Now normally when using integers, we could simply ask for all rows with num between 7 and 11.

SELECT *
FROM order_cities
WHERE order_num BETWEEN 7 AND 11

Now to get all orders between 7 and 11, the solution seems straightforward:

SELECT *
FROM order_cities
WHERE order_start BETWEEN 7 AND 11

But that would be wrong for when the range boundaries do not align.

Say that instead of Seattle starting at 7 and ending at 8, it started at 5 and ended at 8. The above query would miss it entirely. So what can we do to find the range boundaries?

The easy (and slow way):

SELECT *
FROM order_cities
WHERE 7 BETWEEN order_start AND order_end

Find the first range boundary at the low end:

SELECT IF(order_end >= 7, order_start, NULL)
FROM order_cities
WHERE order_start <= 7
ORDER BY order_start DESC
LIMIT 1

Let me explain…

Let’s find any ranges that are less than or equal to 7 (in our case, it will find “5”)

WHERE order_start <= 7

Here’s a little trick with integer ranges. We want to see the highest value order_start that is closest to 7 first. We’ll assume ranges are contiguous and do not overlap at all, therefore the highest value would also be most likely to be the range containing 7.

ORDER BY order_start DESC

Many ranges could be less than 7, but not contain 7, so we don’t really care about those.

LIMIT 1

Only return the order_start if the order_end also includes 7 (this will protect us from grabbing ranges that are lower the 7, but do not contain 7).

SELECT IF(order_end >= 7, order_start, NULL)

Then since we already know the max value (11), we can use 11 as the highest inclusive order_start.

To recap, we wanted all rows whose range contained 7 through 11, and we found the low end range boundary to be 5, left the high end range boundary at 11 and ended up with this query:

SELECT *
FROM order_cities
WHERE order_start BETWEEN 5 AND 11

Going through and manually finding the low end range boundary may not be possible/desired every time, so we can create a stored function that retrieves this value for us:

CREATE FUNCTION get_order_start(v_order_num int unsigned) returns int unsigned
BEGIN
...
RETURN ifnull( -- Return 0 if the result is null
	(
		SELECT if(order_end >= v_order_num,order_start,NULL)
		FROM order_range_cities 
		WHERE order_start <= v_order_num
		ORDER BY order_start DESC 
		LIMIT 1
	)
,0);
...
END

In MySQL, when we want to use this automatically generated value, we cannot use it directly in a SQL clause (ie a WHERE clause). We have to wrap it in a derived table like this:

SELECT *
FROM
(
  SELECT get_order_start(7) AS first_order_start
) lower_bounary
JOIN order_range_cities orc ON (orc.order_start BETWEEN lower_boundary.first_order_start AND 11) -- first_order_start will be = 5