Runtime Stats for Functions | Python Decorator

In a similar vein to my prior Python decorator for function metadata (“meta_func” => github | PyPi | blog), this decorator is intended to illuminate how many times a function is called and aggregate timing statistics across those calls.

It keeps track of each function by its uniquely assigned Python object identifier, along with the total number of calls, the total time taken across all calls, and the min, max, and average time per call.
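For the curious, here's a minimal sketch of how such a decorator could work. The name func_runtime_stats is an assumption (inferred from the get_func_runtime_stats() accessor in the sample below), as is tracking times in milliseconds:

import functools
import time

def func_runtime_stats(func):
    # Aggregate stats for this function; 'func_uid' is the id() of the
    # wrapped function object, as described above.
    stats = {
        'func_uid': id(func),
        'func_name': func.__name__,
        'total_calls': 0,
        'total_time': 0.0,
        'min': float('inf'),
        'max': 0.0,
        'avg': 0.0,
    }

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        elapsed = (time.time() - start) * 1000.0  # milliseconds (assumed unit)
        stats['total_calls'] += 1
        stats['total_time'] += elapsed
        stats['min'] = min(stats['min'], elapsed)
        stats['max'] = max(stats['max'], elapsed)
        stats['avg'] = stats['total_time'] / stats['total_calls']
        return result

    # Expose the aggregates on the wrapped function itself
    wrapper.get_func_runtime_stats = lambda: dict(stats)
    return wrapper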

Sample usage:
import time

@func_runtime_stats  # decorator name assumed (see the sketch above)
def self_mult(n):
    time.sleep(0.2)  # assumed: the ~200 ms per call in the stats below implies the example did some work
    return n * n

print(self_mult(10)) # => 100
print(self_mult(7)) # => 49
print(self_mult.get_func_runtime_stats()) # => {'total_time': 401.668, 'avg': 200.834, 'func_uid': 4302206808, 'func_name': 'self_mult', 'min': 200.445, 'max': 201.223, 'total_calls': 2}

Replace CTRL-A in a file while in a screen session

# print a literal ctrl-A (\u0001) to see how cat -v renders it
echo -e "\u0001" | cat -v
# ^A

# cat -v renders each ctrl-A as the two characters "^A"; tr then maps the
# characters '^' and 'A' individually to tabs (so any stray '^' or 'A'
# elsewhere in the data gets hit too)
cat -v 000001 | tr '^A' '\t' | head


Note: Within the same day, this strategy both worked and then failed. YMMV

More reliable is to get out of the screen session entirely (screen binds ctrl-A as its command prefix, which is the whole problem) and type the character literally with “ctrl-v then a”
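Alternatively, tr understands octal escapes, so you can target \001 directly without ever typing the control character, sidestepping screen (and the cat -v detour) entirely:

# translate every literal ctrl-A (octal 001) to a tab
tr '\001' '\t' < 000001 | head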

Split file by keys

Files sometimes come in (whether via Hadoop or other processes) as big globs of data with inter-related parts. Many times I want to process these globs concurrently, but I see my dilemma unfolding quickly: I could a) write the code to process the glob serially and be done with it in 1 hour, or b) write code to process it concurrently and be done in 1.5 hours, because the added overhead of verifying the output, ensuring thread safety, etc. exceeds the serial processing time. This made me sad, because concurrent processes are awesome. But self-managed, thread-safe concurrent processes are even more awesome!

I thought, what if I could split an input file on keys and group those similarly keyed lines into separate files for processing. Aha!

So I naturally first tried finding existing solutions and, to be honest, awk has a pretty killer one-liner, as noted here on Stack Overflow:

# appends each line to a file named after its 5th whitespace-delimited field
awk '{ print >> $5 }' yourfile

This one-liner is likely great for many folks (especially on files with a small number of distinct keys). But for me, awk threw a “too many open files” error, since it keeps every output file it writes to open. Again, sad face :(.

So… I wrote my own Python command-line utility to take an input file and split it into any number of output files by the unique keys in the file. All of your input is maintained, just spread across separate files segregated by the keys you provide.
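The gist of the approach is below. This is just a minimal sketch, not the actual utility (the function name, handle cap, and defaults are placeholders): it caps the number of simultaneously open files with a small LRU cache of handles, which is exactly what awk's >> approach lacks.

import sys
from collections import OrderedDict

MAX_OPEN = 100  # assumed cap; keep this below your OS open-file limit

def split_by_key(in_path, key_col=0, delim='\t'):
    # LRU cache of open file handles so we never exceed MAX_OPEN at once
    handles = OrderedDict()

    def get_handle(key):
        if key in handles:
            handles.move_to_end(key)  # mark as most recently used
            return handles[key]
        if len(handles) >= MAX_OPEN:
            _, oldest = handles.popitem(last=False)  # evict least recently used
            oldest.close()
        fh = open(key, 'a')  # output file named after the key, like the awk one-liner
        handles[key] = fh
        return fh

    with open(in_path) as infile:
        for line in infile:
            key = line.rstrip('\n').split(delim)[key_col]
            get_handle(key).write(line)

    for fh in handles.values():
        fh.close()

if __name__ == '__main__':
    split_by_key(sys.argv[1])

Opening in append mode is what makes the eviction safe: a key's file can be closed and reopened later without clobbering the lines already written to it.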

While my naming conventions may be lacking panache, the names at least make each utility's intent clear. (But seriously, if you have a better name, I'm all ears.)

Without further ado, I give you

Converting CSVs with Headers to AVRO

Recently I wanted to quickly convert some CSV files to AVRO: recent logging changes meant we were receiving AVRO logs instead of CSV, and I wanted some AVRO files to test my logic on while getting more familiar with the format. After looking around for a while and trying a few different CSV-to-AVRO conversion utilities, none of them worked as expected. I just wanted a simple conversion; why was this so hard to find? The closest I came to a working CSV2AVRO utility was avroutils.

Unfortunately, avroutils hasn’t been updated in 5 years, and the code was very restrictive: it offered a command-line argument for “headers,” but regardless of whether that argument was passed, it used Python’s [in my opinion, flawed] csv.Sniffer to confirm there was a header. For me, this meant that even though I was explicitly telling it my file had a header, it still refused to convert because the Sniffer claimed there was no header.

To reiterate, I really just wanted a simple utility to convert CSV to AVRO, if for no other reason than to get some hands-on experience with the AVRO format. Why was I having such a hard time?

So… I wrote my own [very basic] utility to handle CSV to AVRO conversions. It does no validation (read: it assumes you know what you’re doing) and is hard-coded to treat all fields as strings. Remember, this was primarily a learning exercise!
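For a sense of how little is involved, here's a minimal sketch of the same conversion using the third-party fastavro library (this isn't csv2avro itself; the file paths and record name are placeholders). Every column becomes a string field, just like the utility:

import csv
import sys

from fastavro import parse_schema, writer

def csv_to_avro(csv_path, avro_path):
    with open(csv_path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)  # first row is assumed to be the header
        # Hard-code every field as a string, mirroring the utility's behavior
        schema = parse_schema({
            'name': 'CsvRecord',  # placeholder record name
            'type': 'record',
            'fields': [{'name': col, 'type': 'string'} for col in header],
        })
        records = (dict(zip(header, row)) for row in reader)
        with open(avro_path, 'wb') as out:
            writer(out, schema, records)

if __name__ == '__main__':
    csv_to_avro(sys.argv[1], sys.argv[2])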

Without further ado, I present to you csv2avro

Questions/Comments/Bugs/Requests/etc are encouraged!