Sometimes we need to store integers, but we have too many to store for a traditional relational database like MySQL, and Hadoop wouldn’t bat an eye at the amount. However we want to use SQL, because it poses a new challenge (and may be necessary for other reasons we won’t go into).
A typical table to store orders with their corresponding cities may have this structure:
CREATE TABLE order_cities ( order_num int unsigned, city varchar(255), KEY (order_num) )
INSERT INTO order_cities VALUES (7,'Seattle'),(8,'Seattle'),(9,'San Francisco'),(10,'San Francisco'),(11,'New York')
Say we want to store 2^32 (~4.3B) orders. While some storage engines (i.e. TokuDB) can handle exceptionally large tables, their performance will be mostly undesirable. So let’s use an integer range to represent this data instead!
order_range_cities ( order_start int unsigned, order_end int unsigned, city varchar(255), PRIMARY KEY (order_start), UNIQUE KEY (order_end) )
INSERT INTO order_range_cities VALUES (7,8,'Seattle'),(9,10,'San Francisco'),(11,'New York')
Now normally when using integers, we could simply ask for all rows with num between 7 and 11.
SELECT * FROM order_cities WHERE order_num BETWEEN 7 AND 11
Now to get all orders between 7 and 11, the solution seems straightforward:
SELECT * FROM order_cities WHERE order_start BETWEEN 7 AND 11
But that would be wrong for when the range boundaries do not align.
Say that instead of Seattle starting at 7 and ending at 8, it started at 5 and ended at 8. The above query would miss it entirely. So what can we do to find the range boundaries?
The easy (and slow way):
SELECT * FROM order_cities WHERE 7 BETWEEN order_start AND order_end
Find the first range boundary at the low end:
SELECT IF(order_end >= 7, order_start, NULL) FROM order_cities WHERE order_start <= 7 ORDER BY order_start DESC LIMIT 1
Let me explain...
Let's find any ranges that are less than or equal to 7 (in our case, it will find "5")
WHERE order_start <= 7
Here's a little trick with integer ranges. We want to see the highest value order_start that is closest to 7 first. We'll assume ranges are contiguous and do not overlap at all, therefore the highest value would also be most likely to be the range containing 7.
ORDER BY order_start DESC
Many ranges could be less than 7, but not contain 7, so we don't really care about those.
LIMIT 1
Only return the order_start if the order_end also includes 7 (this will protect us from grabbing ranges that are lower the 7, but do not contain 7).
SELECT IF(order_end >= 7, order_start, NULL)
Then since we already know the max value (11), we can use 11 as the highest inclusive order_start.
To recap, we wanted all rows whose range contained 7 through 11, and we found the low end range boundary to be 5, left the high end range boundary at 11 and ended up with this query:
SELECT * FROM order_cities WHERE order_start BETWEEN 5 AND 11
Going through and manually finding the low end range boundary may not be possible/desired every time, so we can create a stored function that retrieves this value for us:
CREATE FUNCTION get_order_start(v_order_num int unsigned) returns int unsigned BEGIN ... RETURN ifnull( -- Return 0 if the result is null ( SELECT if(order_end >= v_order_num,order_start,NULL) FROM order_range_cities WHERE order_start <= v_order_num ORDER BY order_start DESC LIMIT 1 ) ,0); ... END
In MySQL, when we want to use this automatically generated value, we cannot use it directly in a SQL clause (ie a WHERE clause). We have to wrap it in a derived table like this:
SELECT * FROM ( SELECT get_order_start(7) AS first_order_start ) lower_bounary JOIN order_range_cities orc ON (orc.order_start BETWEEN lower_boundary.first_order_start AND 11) -- first_order_start will be = 5