What is Apache Parquet? It is a compressible, binary, columnar data format used in the Hadoop ecosystem. We’ll talk about it primarily in relation to the Hadoop Distributed File System (HDFS) and Spark 2.x. What role does it fill? It is a fast, efficient data format well suited to scalable big data analytics. Optimization… Continue reading→
In a similar vein to my prior Python decorator for function metadata (“meta_func” => github | PyPI | blog), this decorator is intended to surface aggregates of the number of calls and the time taken per call. It keeps track of each function by its uniquely assigned Python object identifier, the total number of function… Continue reading→
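A minimal sketch of the idea described above, not the published decorator itself: the names `track_calls`, `CALL_STATS`, and `stats_key` are hypothetical and chosen only for illustration. It keys aggregates on `id(func)` and accumulates a call count and total wall-clock time, from which a per-call average can be derived.

```python
import functools
import time

# Hypothetical registry: maps id(function) -> aggregate statistics.
CALL_STATS = {}


def track_calls(func):
    """Record the call count and cumulative wall-clock time for func."""
    key = id(func)  # Python's object identifier for the undecorated function
    CALL_STATS[key] = {"name": func.__name__, "calls": 0, "total_seconds": 0.0}

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Update the aggregates even if func raised an exception.
            stats = CALL_STATS[key]
            stats["calls"] += 1
            stats["total_seconds"] += time.perf_counter() - start

    wrapper.stats_key = key  # lets callers look up the aggregates later
    return wrapper


@track_calls
def square(x):
    return x * x


for n in range(3):
    square(n)

stats = CALL_STATS[square.stats_key]
average_seconds = stats["total_seconds"] / stats["calls"]
print(stats["name"], stats["calls"], average_seconds)
```

Storing the key on the wrapper is one design choice among several; it is needed here because `id(square)` after decoration refers to the wrapper, not to the original function.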
echo -e "\u0001" | cat -v   # ^A
cat -v 000001 | tr '^A' '\t' | head
Inspiration: http://stackoverflow.com/questions/31460818/creating-a-ctrl-a-delimiter-file
Note: Within the same day, this strategy both worked and then failed. YMMV. A more reliable approach is to work outside a screen session and type the literal character with "ctrl-v then a".
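The Ctrl-A to tab conversion above can also be sketched in Python, which avoids having to type the control character in a terminal at all. This is an illustrative alternative, not the shell approach from the post; the function name `ctrl_a_to_tabs` and the sample rows are made up for the example.

```python
CTRL_A = "\x01"  # the Ctrl-A (SOH) character that cat -v displays as ^A


def ctrl_a_to_tabs(lines):
    """Yield each line with every Ctrl-A delimiter replaced by a tab."""
    for line in lines:
        yield line.replace(CTRL_A, "\t")


# Made-up sample rows standing in for the contents of a delimited file.
sample = ["id\x01name\x01city\n", "1\x01Ada\x01London\n"]
converted = list(ctrl_a_to_tabs(sample))
print(converted)
```

Because the function accepts any iterable of strings, an open file object could be passed in place of the sample list, assuming the file is opened in text mode.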
Files sometimes come in (whether via Hadoop or other processes) as big globs of data with inter-related parts. Many times I want to process these globs concurrently, but I quickly run into a dilemma. I could a) write the code to process it serially and be done with it in 1 hour or b) write code… Continue reading→