Metadata for Functions | Python Decorator

Often I find myself wanting more information about the Python functions I’m running, whether to debug, log, or time their completion. All of these are relatively well-defined problems (debugging excepted). Unfortunately, in my research no tool made it simple to see the input, output, time elapsed, errors, warnings, etc. of a function through one simple interface. So I wrote a simple Python decorator compatible with Python 2.7+ (and probably earlier versions), including Python 3.

What does the meta_func decorator actually do?

For every function call it stores the arguments (both positional and keyword), error information (including the option to catch errors rather than raise them), warnings, timing details (time started, ended, and elapsed), and the return value, all in a standard Python dictionary.

What’s the point of tracking all this [meta]data?

Debugging, logging, timing… the use cases are nearly endless, because the decorator captures much of what’s going on in one easily interpreted structure.

Important Notes

This decorator should be expected to add a good deal of overhead to function calls, since it captures so many dimensions of each call.

Arguments (positional and keyword), the return value, warnings, and exceptions are stored in their raw form, so any transformations (such as stringifying errors and tracebacks) need to be done in post-processing.

The error_info field will contain the tuple returned by sys.exc_info() with the error details.
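To make the description above concrete, here is a minimal sketch of how such a decorator could be built. This is illustrative only, not the actual meta_func implementation; the dictionary keys, the raise_errors flag, and the func.meta attribute are my own naming.

```python
import functools
import sys
import time
import warnings


def meta_func(raise_errors=True):
    """Illustrative decorator: record metadata for each call in func.meta."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            record = {
                "args": args,            # raw positional arguments
                "kwargs": kwargs,        # raw keyword arguments
                "return_value": None,
                "error_info": None,      # raw sys.exc_info() tuple on failure
                "warnings": [],          # raw warnings.WarningMessage objects
                "time_started": time.time(),
                "time_ended": None,
                "time_elapsed": None,
            }
            try:
                with warnings.catch_warnings(record=True) as caught:
                    warnings.simplefilter("always")
                    record["return_value"] = func(*args, **kwargs)
                record["warnings"] = caught
            except Exception:
                record["error_info"] = sys.exc_info()
                if raise_errors:
                    raise
            finally:
                record["time_ended"] = time.time()
                record["time_elapsed"] = record["time_ended"] - record["time_started"]
                wrapper.meta.append(record)
            return record["return_value"]
        wrapper.meta = []  # one dictionary per call, in call order
        return wrapper
    return decorator
```

With raise_errors=False, a failing call returns None instead of raising, while the last entry of func.meta still holds the raw (type, value, traceback) tuple in error_info.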

Github Repo

The Power of Hadoop in under 10 lines!

Okay, okay, I may have oversold it a bit, but here are fewer than 10 lines of bash that resemble (if you squint really hard) Hadoop/MapReduce.

split -d -a 5 -l 100000 $in_file $in_file"_" && \
ls $in_file"_"* | xargs -P8 -n1 -I file $code_to_run file file.out && \
cat $in_file"_"*.out > $out_file && \
rm $in_file"_"*

What will this do?
It takes 3 arguments:

  • code_to_run is a path to an executable
  • in_file is a path to a single input file
  • out_file is a path to the single output file
split -d -a 5 -l 100000 $in_file $in_file"_"

Split the in_file into 100,000-line chunks, each named with an underscore and a zero-padded number appended (e.g. for in_file = “file.tsv”, the temp files are file.tsv_00000, file.tsv_00001, etc.).

ls $in_file"_"* | xargs -P8 -n1 -I file $code_to_run file file.out

List all the numbered temporary input files and pipe them into xargs, which runs up to 8 parallel processes of your code_to_run executable, each one reading a chunked input file and writing a corresponding chunked output file.

cat $in_file"_"*.out > $out_file

Concatenate the chunked output files into the single output file you expect.

rm $in_file"_"*

Clean up (read: remove) all temporary files; both the input and output chunks are deleted.

For the sake of data safety, each line ends with “&&” so that a command runs only if all prior commands succeeded.
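For comparison, the same chunk → parallel map → concatenate pattern can be sketched in Python. Here process_chunk is a hypothetical stand-in for your $code_to_run executable, and I use a thread pool for brevity, whereas xargs -P8 launches separate processes.

```python
from multiprocessing.pool import ThreadPool


def process_chunk(lines):
    """Hypothetical stand-in for the $code_to_run executable."""
    return [line.upper() for line in lines]


def run_pipeline(in_lines, chunk_size=100000, workers=8):
    # "split -l 100000": break the input into fixed-size chunks
    chunks = [in_lines[i:i + chunk_size]
              for i in range(0, len(in_lines), chunk_size)]
    # "xargs -P8": map over the chunks with a pool of workers
    with ThreadPool(workers) as pool:
        out_chunks = pool.map(process_chunk, chunks)
    # "cat ... > $out_file": concatenate the chunked outputs in order
    return [line for chunk in out_chunks for line in chunk]
```

Like the bash version, pool.map preserves chunk order, so the concatenated output lines up with the input.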

ELI5 – The Human analogy to Multi-threading (Concurrency) and how it can go wrong

Ever feel like your brain is processing multiple things at once? An example might be when you’re reading a book and listening to music. While you may not realize it, your brain is processing both what you’re reading and what you’re hearing simultaneously.

A computer operates similarly; it is constantly working in the background doing multiple tasks (such as drawing the windows on your screen, checking for new e-mails, etc) all while juggling its specific user-requested tasks (such as opening a document then writing in it). The computer is doing these things concurrently just as your brain was reading a book and listening to music concurrently. It’s not sacrificing one task to complete another one; it is working on the tasks simultaneously without interruption.* This stands in contrast to processing information serially; you read a book, then you listen to music.

When we talk to another person, we expect them to understand what we’re saying because we formulate a thought, then compose our thoughts into words and finally speak those words. Likewise, a computer program generally tries to ensure the user understands what it means by handling background tasks out of sight (no windows, no warnings, etc) and only notifying the user when it is instructed to do so.

We speak with one mouth and are therefore locked to saying one thing at any given moment. That doesn’t mean we don’t make mistakes in our speech sometimes, such as by responding to the question “What would you like to eat?” with “Yes, I’d like a cheese refrigerator pizza” when we meant to say “Yes, I’d like a cheese pizza” while thinking concurrently about how we still need to fix our refrigerator. A computer can make a similar mistake when it’s instructed to tell the user “Your system is experiencing a disk space error” and “FREE VACATION – CALL NOW” simultaneously. Those two messages are unrelated and undesirable to be displayed together, let alone erroneously intertwined.

Unfortunately, the bad news is that both humans and computers will likely continue to make these mistakes for the foreseeable future. The solutions that exist for both are flawed: humans must focus extensively [not to mention consciously] to ensure only the words we wish to speak are spoken, and computers likewise must be instructed by humans to output only the information the user is meant to receive.

The good news is that multi-threading/concurrency allows information to be processed simultaneously and therefore faster than when each action blocks subsequent actions from being processed.
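To make the “cheese refrigerator pizza” failure concrete, here is a small Python sketch (the messages and variable names are my own): two threads share one output buffer, and the lock is what keeps their messages from intertwining. Remove the lock, and the words of the two messages may interleave mid-sentence.

```python
import threading

output = []              # shared buffer, like the screen in the analogy
lock = threading.Lock()


def say(message):
    # Holding the lock makes each message atomic: without it, the two
    # threads below could interleave their words mid-sentence.
    with lock:
        for word in message.split():
            output.append(word)


t1 = threading.Thread(target=say, args=("Your system is experiencing a disk space error",))
t2 = threading.Thread(target=say, args=("FREE VACATION - CALL NOW",))
t1.start()
t2.start()
t1.join()
t2.join()
```

Either message may come first depending on scheduling, but with the lock held each message always appears as one contiguous run of words.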

Apache Pig chokes on many small files [part 2]

As a followup to my initial post on Apache Pig’s inability to quickly load many small files in Pig 0.10 and newer, I wanted to share a simple fix that worked for me, courtesy of in-depth research by the Amazon Support team (and engineers).

Around Pig 0.10.0, PigStorage began building a hidden schema file in an attempt to determine your file’s schema. By passing the ‘-noschema’ flag to PigStorage, we see far better performance.

a = LOAD '/files/*' USING PigStorage('\t','-noschema') AS (field1:int, field2:chararray);

Much better.