Garren's [Big] Data Blog – Page 7 – Discussing data, big, small and in between

Apache Pig chokes on many small files [part 2]

Posted by Garren on 2015/01/09

As a follow-up to my initial post regarding Apache Pig’s inability to quickly load many small files in Pig 0.10 and newer, I wanted to share a simple fix that worked for me, courtesy of in-depth research by the Amazon Support Team (and engineers). Basically, around Pig 0.10.0, PigStorage builds a hidden schema file in an… Continue reading→


Split STDIN to multiple Compressed Files on highly skewed data

Posted by Garren on 2014/12/31

Sometimes I have data that is simply too large to store uncompressed on one volume, but I still need to run processes on it. In my recent use case, I had one large gzipped tar file that decompressed into many smaller gzipped tar files, which in turn decompressed into two-column text (both columns strings). I did not… Continue reading→


Simple parallelized processing with GIL languages (Python, Ruby, etc)

Posted by Garren on 2014/12/22

Sometimes we want parallel processing but don’t want to pay the cost of ensuring proper multi-threaded handling. Because who wants to spend an extra 30 minutes setting up threads, ensuring race-condition safety, etc., just to save a few minutes? If you have access to “xargs,” you have access to a built-in utility that… Continue reading→
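To make the idea concrete, here is a minimal sketch of the kind of single-process worker that xargs can fan out across several processes. The script name, the `count_lines` helper, and the line-counting task are all hypothetical stand-ins for illustration; they are not taken from the full post.

```python
import sys


def count_lines(path):
    """Hypothetical unit of work: count the lines in one file."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def main(paths):
    # xargs appends one or more paths to the command line;
    # each invocation handles only the paths it was given.
    for path in paths:
        print(f"{path}\t{count_lines(path)}")
    return 0


if __name__ == "__main__":
    main(sys.argv[1:])
```

Assuming the script is saved as `count_lines.py` (a made-up name), something like `find . -name '*.log' -print0 | xargs -0 -n 1 -P 4 python count_lines.py` would run up to four copies at once. Each copy is an independent process, so no thread or lock handling is needed inside the script itself.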


ELI5 – Jelly Bean Analogy to MapReduce (Hadoop)

Posted by Garren on 2014/12/18

A simple and tasty explanation of the MapReduce process: Start with a bowl of Jelly Beans in 4 colors (Red, Green, Blue, and Yellow). You don’t know exactly how many JBs are in the bowl, nor do you know how many JBs of each color are in the bowl. But naturally you want to know. Because why… Continue reading→
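The counting problem in this analogy can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, shuffle, and reduce steps, not Hadoop code; the function names and the `num_mappers` parameter are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(beans):
    """Emit a (color, 1) pair for every bean, as a mapper would."""
    return [(color, 1) for color in beans]


def shuffle_phase(pairs):
    """Group the emitted values by key (the bean color)."""
    grouped = defaultdict(list)
    for color, count in pairs:
        grouped[color].append(count)
    return grouped


def reduce_phase(grouped):
    """Sum each color's values to get the final counts."""
    return {color: sum(counts) for color, counts in grouped.items()}


def count_beans(bowl, num_mappers=2):
    # Split the bowl into handfuls, one per hypothetical mapper.
    handfuls = [bowl[i::num_mappers] for i in range(num_mappers)]
    pairs = []
    for handful in handfuls:
        pairs.extend(map_phase(handful))
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    bowl = ["red", "green", "blue", "yellow", "red", "blue", "red"]
    print(count_beans(bowl))
```

Each handful is mapped independently, which is the part a real cluster would run in parallel; only the grouping and summing need to see results from every mapper.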
