A simple and tasty explanation of the MapReduce process: Start with a bowl of 4 colored Jelly Beans (Red, Green, Blue, and Yellow). You don’t know exactly how many JBs are in the bowl, nor do you know how many of each JBs are in the bowl. But naturally you want to know. Because why… Continue reading→
Currently in a Relational Database such as MySQL, Oracle, SQL Server, etc, the two most common schools of thought are Normalized vs Denormalized database designs. Essentially, Normalized Database design entails grouping similar dimensions into a single table, such as the ephemeral orders, customers, and products tables. Normalized design might have an orders table with order_id,… Continue reading→
I had the displeasure of using multiple versions of Apache Pig (0.9, 0.11, 0.12, 0.13 and 0.14) in different capacities. Why was it so unpleasant you ask? My scripts were running quickly and efficiently on Pig 0.9.2. I was using globs in my LOAD statement (e.g. “a = LOAD ‘/files/*/*type_v4*.lzo”) to find tens to hundreds… Continue reading→
java.io.IOException: Split metadata size exceeded 10000000. was the error I got when trying to process ~20TB of highly compressed logs (~100TB uncompressed) on my 64 node Amazon EMR cluster. Naturally I found some good resources recommending a quick file by modifying the mapred-site.xml file in /home/hadoop/conf/ Warning: By setting this configuration to -1, you are… Continue reading→