Files sometimes come in (whether via Hadoop or other processes) as big globs of data with inter-related parts. Often I want to process these globs concurrently, but the dilemma unfolds quickly: I could a) write the code to process the file serially and be done in 1 hour, or b) write the code to process it concurrently and be done in 1.5 hours, because the added overhead of verifying the output, ensuring thread safety, etc. exceeds the serial processing time. This made me sad, because concurrent processes are awesome. But self-managed, thread-safe concurrent processes are even more awesome!
I thought: what if I could split an input file on keys and group the similarly keyed lines into separate files for processing? Aha!
So naturally I first tried finding existing solutions, and to be honest, awk has a pretty killer one-liner, as noted here on Stack Overflow:
awk '{ print >> $5 }' yourfile
This one-liner is likely great for many folks (especially with small files). But for me, awk threw a “too many open files” error, because awk keeps an output file handle open for every unique key it has seen, and my file had more distinct keys than the OS open-file limit allows. Again, sad face :(.
So… I wrote my own Python command-line utility to take an input file and split it into any number of output files by the unique keys in that file. All of your input is maintained, just sorted/segregated into different files on the keys you provide.
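To make the idea concrete, here is a minimal sketch of the approach. It is not the actual split_file_by_key code; the function name, flags, and the bounded-handle eviction below are my own illustration. It appends each line to a file named after its key column, and caps how many output handles stay open at once so it never hits the same “too many open files” wall that awk did:

import argparse


def split_by_key(in_path, key_col=4, max_open=100):
    # Append each line of in_path to a file named after its key column.
    # Keys are assumed to be safe as filenames (no slashes and so on).
    # At most max_open output handles stay open at once, so we stay well
    # under the OS open-file limit that tripped up the awk one-liner.
    handles = {}  # key -> open file handle (a small, bounded cache)
    with open(in_path) as src:
        for line in src:
            # Whitespace split, like awk's default field splitting;
            # key_col=4 corresponds to awk's $5.
            key = line.split()[key_col]
            fh = handles.get(key)
            if fh is None:
                if len(handles) >= max_open:
                    # Evict everything; append mode means no data is lost
                    # when a key's file is re-opened later.
                    for h in handles.values():
                        h.close()
                    handles.clear()
                fh = handles[key] = open(key + ".out", "a")
            fh.write(line)
    for h in handles.values():
        h.close()


if __name__ == "__main__":
    p = argparse.ArgumentParser(description="Split a file into per-key files")
    p.add_argument("infile")
    p.add_argument("--key-col", type=int, default=4,
                   help="0-based key column (4 matches awk's $5)")
    p.add_argument("--max-open", type=int, default=100)
    args = p.parse_args()
    split_by_key(args.infile, key_col=args.key_col, max_open=args.max_open)

Opening the per-key files in append mode is what makes the eviction safe: a key whose handle was closed just gets re-opened later and keeps accumulating lines.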
While my naming conventions may lack panache, they at least make each utility's intent clear. (But seriously, if you have a better name, I'm all ears.)
Without further ado, I give you
split_file_by_key