Garren's [Big] Data Blog – Page 5 – Discussing data, big, small and in between

Tips for using Apache Parquet with Spark 2.x

Posted by Garren on 2017/04/08

What is Apache Parquet? It is a compressible, binary, columnar data format used in the Hadoop ecosystem. We’ll talk about it primarily in relation to the Hadoop Distributed File System (HDFS) and Spark 2.x contexts. What role does it fill? It is a fast and efficient data format well suited to scalable big data analytics. Optimization… Continue reading→

Apache Spark Parquet, spark 2 Comments

Runtime Stats for Functions | Python Decorator

Posted by Garren on 2016/10/21

In a similar vein to my prior Python decorator metadata for functions (“meta_func” => github | PyPi | blog), this decorator is intended to report aggregate statistics for each function: the number of calls and the time taken per call. It will keep track of each function by its uniquely assigned Python object identifier, the total number of function… Continue reading→
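As a rough illustration of the idea described in this excerpt, here is a minimal sketch of a decorator that aggregates call counts and elapsed time, keyed by the function's `id()`. It is not the published decorator: the names `runtime_stats`, `_STATS`, and `report` are hypothetical, and the real implementation may track different fields.

```python
import functools
import time
from collections import defaultdict

# Hypothetical registry (illustrative only): id(func) -> aggregate stats.
_STATS = defaultdict(lambda: {"name": None, "calls": 0, "total_seconds": 0.0})


def runtime_stats(func):
    """Record the call count and cumulative wall-clock time for func."""
    key = id(func)  # the Python object identifier of the undecorated function
    _STATS[key]["name"] = func.__name__

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Count the call even if func raised an exception.
            entry = _STATS[key]
            entry["calls"] += 1
            entry["total_seconds"] += time.perf_counter() - start

    return wrapper


def report():
    """Return (name, calls, total_seconds, mean_seconds) for each function."""
    rows = []
    for entry in _STATS.values():
        calls = entry["calls"]
        mean = entry["total_seconds"] / calls if calls else 0.0
        rows.append((entry["name"], calls, entry["total_seconds"], mean))
    return rows


@runtime_stats
def add(a, b):
    return a + b
```

Calling `add(1, 2)` a few times and then `report()` would list one row for `add` with its call count and timings.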


Replace CTRL-A in a file while in a screen session

Posted by Garren on 2015/11/02

echo -e "\u0001" | cat -v   # ^A
cat -v 000001 | tr '^A' '\t' | head

Inspiration: http://stackoverflow.com/questions/31460818/creating-a-ctrl-a-delimiter-file

Note: Within the same day, this strategy both worked and then failed. YMMV. A more reliable approach would be to get into a non-screen session and type “ctrl-v then a”.
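If typing the control character is the unreliable part, one alternative is to do the replacement in a small script where CTRL-A is written as the escape `\x01`. This is a sketch of that swapped-in approach, not the method from the post; the helper name `ctrl_a_to_tab` and the sample data are made up for illustration.

```python
import io

CTRL_A = "\x01"  # the SOH character that `cat -v` displays as ^A


def ctrl_a_to_tab(lines):
    """Yield each line with every CTRL-A delimiter replaced by a tab."""
    for line in lines:
        yield line.replace(CTRL_A, "\t")


# Illustrative in-memory input standing in for a CTRL-A delimited file.
sample = io.StringIO("id\x01name\n1\x01alice\n")
converted = list(ctrl_a_to_tab(sample))
```

For a real file, the same generator can be fed an open file handle instead of the `io.StringIO` sample.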


Split file by keys

Posted by Garren on 2015/04/02

Files sometimes come in (whether via Hadoop or other processes) as big globs of data with inter-related parts. I often want to process these globs concurrently, but the dilemma unfolds quickly. I could a) write the code to process it serially and be done with it in 1 hour or b) write code… Continue reading→
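The excerpt cuts off before describing the approach, so the following is only a sketch of one way to split a file by key so that each piece can be processed independently. It assumes tab-delimited lines with the key in the first field; the function name `split_by_key`, its parameters, and the `<key>.txt` naming are all hypothetical, and it keeps one open handle per key, which only suits a modest number of keys that are safe to use as file names.

```python
import os
import tempfile


def split_by_key(lines, out_dir, delimiter="\t", key_index=0):
    """Write each line to out_dir/<key>.txt, where key is one delimited field."""
    handles = {}
    try:
        for line in lines:
            key = line.rstrip("\n").split(delimiter)[key_index]
            if key not in handles:
                # Assumes keys are safe to use as file names.
                handles[key] = open(os.path.join(out_dir, f"{key}.txt"), "w")
            handles[key].write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)


# Illustrative input: three records spread over two keys.
out_dir = tempfile.mkdtemp()
keys = split_by_key(["a\t1\n", "b\t2\n", "a\t3\n"], out_dir)
```

Each resulting file holds all the related lines for one key, so the files can then be handed to separate workers.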
