December 2014 – Garren's [Big] Data Blog

Split STDIN to multiple Compressed Files on highly skewed data

Posted by Garren on 2014/12/31

Sometimes I have data that is simply too large to store uncompressed on one volume, but I still need to run processes on it. In my recent use case, I had one large gzipped tar file that decompressed into many smaller gzipped tar files, which in turn decompressed into two-column text (both columns strings). I did not…
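The excerpt is cut off before it describes the actual approach, so the following is only a minimal sketch of one way to split a line stream across several gzip-compressed files. It assumes round-robin assignment by line, which keeps the output files roughly equal in size even when the keys are heavily skewed; the function name `split_to_gzip` and the output paths are hypothetical and not taken from the post.

```python
import gzip


def split_to_gzip(lines, paths):
    """Write each incoming line to one of several gzip files, round-robin.

    `lines` is any iterable of text lines (for example sys.stdin).
    `paths` is a list of output file names; both are illustrative assumptions.
    """
    # Open every output once and keep it open so nothing is stored uncompressed.
    outputs = [gzip.open(path, "wt", encoding="utf-8") for path in paths]
    try:
        for index, line in enumerate(lines):
            # Round-robin ignores the key, so a skewed key cannot overload one file.
            outputs[index % len(outputs)].write(line)
    finally:
        for out in outputs:
            out.close()
```

To use it on a pipe, one could call `split_to_gzip(sys.stdin, ["part-0.gz", "part-1.gz"])` after `import sys`. If rows with the same key must land in the same file, a hash of the key would replace the line index, at the cost of uneven file sizes on skewed data.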

Simple parallelized processing with GIL languages (Python, Ruby, etc)

Posted by Garren on 2014/12/22

Sometimes we want parallel processing but don’t want to pay the cost of ensuring proper multi-threaded handling. Because who wants to spend an extra 30 minutes setting up threads, ensuring race condition safety, and so on, just to save a few minutes? If you have access to “xargs,” you have access to a built-in utility that…

ELI5 – Jelly Bean Analogy to MapReduce (Hadoop)

Posted by Garren on 2014/12/18

A simple and tasty explanation of the MapReduce process: Start with a bowl of jelly beans in four colors (red, green, blue, and yellow). You don’t know exactly how many jelly beans are in the bowl, nor how many of each color. But naturally you want to know. Because why…
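As a rough illustration of the analogy, here is a minimal single-process sketch of counting jelly beans by color in MapReduce style. The handful size, the function names, and the sample bowl are all assumptions made for the example; a real Hadoop job would run the map and reduce steps on separate workers.

```python
from collections import defaultdict


def map_phase(handful):
    # Map: emit a (color, 1) pair for every bean in one handful.
    return [(color, 1) for color in handful]


def shuffle(pairs):
    # Shuffle: group all the emitted counts by color.
    grouped = defaultdict(list)
    for color, count in pairs:
        grouped[color].append(count)
    return grouped


def reduce_phase(grouped):
    # Reduce: add up the counts for each color.
    return {color: sum(counts) for color, counts in grouped.items()}


def count_beans(bowl, handful_size=3):
    # Split the bowl into handfuls, as if each went to a different helper.
    handfuls = [bowl[i:i + handful_size] for i in range(0, len(bowl), handful_size)]
    pairs = []
    for handful in handfuls:
        pairs.extend(map_phase(handful))
    return reduce_phase(shuffle(pairs))


# Hypothetical bowl, purely for illustration.
bowl = ["red", "green", "blue", "yellow", "red", "blue", "red"]
counts = count_beans(bowl)
```

Each handful is mapped independently, which is what lets the work be spread over many helpers; only the shuffle step needs to see pairs from every handful.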

Pseudo-Normalized Database Engine Concept

Posted by Garren on 2014/12/14

In a relational database such as MySQL, Oracle, or SQL Server, the two most common schools of thought are normalized and denormalized database design. Essentially, normalized design entails grouping similar dimensions into a single table, such as the classic orders, customers, and products tables. A normalized design might have an orders table with order_id,…
