S3CMD’s lack of multi-threading led me to hunt for alternatives. I tried several, including s3-multipart (great when I did use it), s3funnel and s3cp, but none supported all of the features I considered important:

1) Listing, downloading, uploading, etc. of files and “folders”
2) Multi-threaded
3) Synchronization handled so as to avoid re-downloading an existing file
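
To make the third requirement concrete, here is a minimal sketch of the kind of check a sync-aware downloader might run before fetching a file. This is not S4CMD’s actual logic: `needs_download`, `remote_size` and `remote_md5` are hypothetical names, and the remote checksum is assumed to be a plain MD5 hex digest, which is not always what S3 reports (multipart uploads, for example, get a different kind of ETag).

```python
import hashlib
import os


def local_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a local file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(local_path, remote_size, remote_md5):
    """Decide whether a remote object has to be fetched (hypothetical helper).

    remote_size and remote_md5 are assumed to come from a listing or HEAD
    request made elsewhere; this function never touches the network.
    """
    if not os.path.exists(local_path):
        return True  # nothing local yet
    if os.path.getsize(local_path) != remote_size:
        return True  # cheap check first: sizes differ
    # Same size, so compare content hashes before skipping the download.
    return local_md5(local_path) != remote_md5
```

Checking the size first keeps the common case cheap: the file only gets hashed when the sizes already match.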

S4CMD met all three requirements and performed even better than I anticipated. It did not require me to set the number of threads (for better or for worse), and it seemed to err on the side of more threads (e.g. 32 threads for a single process). That held up even with Python’s Global Interpreter Lock (GIL) in play. Download performance was superb: it saturated a good portion of the available bandwidth (~30-50 Mbps, or roughly 4-6 MB/s). Had it saturated any more, my coworkers might not have appreciated the return to dialup :).

What makes S4CMD so awesome to me is that it takes multi-threading very seriously, using it not just for GET requests but also for listing files and directories. The lack of multi-threading was a real pain point for me because I use logs from a third party that are split into 15-minute chunks, with anywhere from one to many files per chunk. For better or worse, each file is minuscule. Unfortunately, this makes downloading them a nightmare with any serial downloader, because the per-request overhead of each GET dwarfs the actual transfer time. S4CMD’s multi-threading makes downloading the files a breeze.
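
To illustrate why threads help so much with many tiny files, here is a small self-contained sketch. It is not S4CMD’s code: `fake_get` is a hypothetical stand-in for an S3 GET that just sleeps for an assumed per-request latency, and the 32-worker default only mirrors the thread count I observed. Threads blocked on I/O (or, here, on `time.sleep`) release the GIL, which is why this pattern still pays off in Python.

```python
import time
from concurrent.futures import ThreadPoolExecutor

REQUEST_OVERHEAD_S = 0.02  # assumed per-request latency, purely illustrative


def fake_get(key):
    """Hypothetical stand-in for an S3 GET: wait, then return a tiny payload."""
    time.sleep(REQUEST_OVERHEAD_S)
    return key, b"log line\n"


def download_serial(keys):
    """Fetch keys one after another; total time grows with len(keys)."""
    return [fake_get(key) for key in keys]


def download_threaded(keys, workers=32):
    """Fetch keys concurrently; results come back in the same order as keys."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_get, keys))


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    keys = ["logs/chunk-%03d.log" % i for i in range(32)]
    _, serial_s = timed(download_serial, keys)
    _, threaded_s = timed(download_threaded, keys)
    print("serial:   %.2fs" % serial_s)
    print("threaded: %.2fs" % threaded_s)
```

With payloads this small, the serial version spends nearly all of its time waiting on per-request latency, while the threaded version overlaps those waits.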

I am in no way affiliated with S4CMD besides having a code crush on it :).

