Split STDIN to multiple Compressed Files on highly skewed data

Sometimes I have data that is simply too large to store uncompressed on one volume, but I still need to run processes on it. In a recent use case, I had one large gzipped tar file that decompressed into many smaller gzipped tar files, which in turn decompressed into two-column text (both columns strings). I did not… Continue reading
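
As a rough illustration of the general idea in the title, here is a minimal Python sketch that reads lines from a stream and writes them into numbered gzip files. It is not the approach from the full post, which is cut off above. The function name `split_to_gzip`, the `prefix` naming scheme, and the fixed lines-per-file cutoff are all assumptions made for this example, and it does nothing special about skewed data.

```python
import gzip
import sys
from typing import Iterable, List


def split_to_gzip(lines: Iterable[str], prefix: str, lines_per_file: int) -> List[str]:
    """Write lines into numbered gzip files, at most lines_per_file lines each.

    Hypothetical helper for illustration; returns the paths it created.
    """
    paths: List[str] = []
    out = None
    try:
        for i, line in enumerate(lines):
            # Start a new compressed chunk every lines_per_file lines.
            if i % lines_per_file == 0:
                if out is not None:
                    out.close()
                path = f"{prefix}{len(paths):05d}.gz"
                out = gzip.open(path, "wt", encoding="utf-8")
                paths.append(path)
            out.write(line)
    finally:
        if out is not None:
            out.close()
    return paths


if __name__ == "__main__" and len(sys.argv) == 3:
    # Assumed invocation: <producer> | python split_gzip.py chunk_ 1000000
    split_to_gzip(sys.stdin, sys.argv[1], int(sys.argv[2]))
```

Because only one chunk is open at a time and lines are streamed, the uncompressed data never has to exist on disk in full.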


Simple parallelized processing with GIL languages (Python, Ruby, etc)

Sometimes we want parallel processing, but don’t want to pay the cost of ensuring proper multi-threaded handling. After all, who wants to spend an extra 30 minutes setting up threads, ensuring race-condition safety, and so on, just to save a few minutes? If you have access to “xargs,” you have access to a built-in utility that… Continue reading
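
One way this pattern is commonly set up, sketched here under assumptions since the full post is cut off above: keep the script single-threaded and let xargs start several copies of it, each handling its own input file. The script name `worker.py`, the `count_lines` stand-in task, and the invocation in the comment are hypothetical. The `-P` (parallel processes) and `-0` (NUL-delimited input) options are available in GNU and BSD xargs but are not part of POSIX.

```python
import sys
from typing import List


def count_lines(path: str) -> int:
    """Stand-in for real per-file work: count the lines in one file."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle)


def main(argv: List[str]) -> int:
    # Each process handles only the paths it was given, so there is no
    # shared state between workers and no threading code to get wrong.
    for path in argv[1:]:
        print(f"{path}\t{count_lines(path)}")
    return 0


if __name__ == "__main__":
    # Assumed invocation, four processes with one file per process:
    #   find . -name '*.txt' -print0 | xargs -0 -P 4 -n 1 python worker.py
    main(sys.argv)
```

Since every worker is a separate interpreter process, the GIL never comes into play. The trade-off is that output lines from different workers can arrive in any order.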



Pseudo-Normalized Database Engine Concept

In relational databases such as MySQL, Oracle, and SQL Server, the two most common schools of thought are normalized and denormalized design. Essentially, normalized design entails grouping similar dimensions into a single table, such as the classic orders, customers, and products tables. A normalized design might have an orders table with order_id,… Continue reading