Currently in a Relational Database such as MySQL, Oracle, SQL Server, etc, the two most common schools of thought are Normalized vs Denormalized database designs.

Essentially, Normalized Database design entails grouping similar dimensions into a single table, such as the ephemeral orders, customers, and products tables. Normalized design might have an orders table with order_id, order_date, customer_id, and product_id then a customers table identified by customer_id and products table identified by product_id. Denormalized designs mean the data is combined into one table. So in this case, it might be an orders table with products and customers listed in the same table.

Normalized designs are better on storage space, but require join statements. Denormlized statements put everything together, but require more complex insertions/updates.

What I propose is an alternate database storage engine concept that would automatically build a memory-mapped normalization under the hood. Picture a denormalized orders table with order_id, customer name, new_customer flag, and product name. Let’s say order_id is high cardinality (very uncommon/unique), customer name is normal cardinality (fairly uncommon/unique), with product name and new_customer flag are low cardinality (repeated frequently).

Example Table:
Order ID: 5
Customer Name: Jerry
Product Name: Box of Cookies
New Customer: True

A pseudo-normalized table might abstract the product name into a memory map, leaving an integer in the product name’s strings place under the hood.

Example Table:
Order ID: 5
Customer Name: Jerry
Product Name: 1
New Customer: True

Example Memory Map:
Product Id: 1
Product Name: Box of Cookies

Now when the data is being retrieved, the engine would look in the source table for the raw data, then map the integer for product name to the actual product name string.

By mapping low cardinality columns to high performance in-memory maps, we are able to drastically reduce the amount of storage space used and potentially increase performance simultaneously.

Note this design would involve only changing the underlying structure of the data, so anytime the data is accessed, it would go through a translation layer to convert the mapped value to the actually defined value.

Github repo with Proof of Concept: https://github.com/gstaubli/pndb


ELI5 - Jelly Bean Analogy to MapReduce (Hadoop) Apache Pig chokes on many small files

Leave a Reply

Your email address will not be published.