Oh, do you mean to simulate a stream? That's a pretty cool idea. My approach, which worked, is a little different: I used Python's zipfile module. Since I had the files stored locally, I was able to use Python to append JSON files together into bigger files of at least 128 MB and at most 1 GB. So far it seems to have worked: I went from ~2M mini JSON files down to 28 appropriately sized JSON files (~300 MB each). Your blog was super helpful, thanks!
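For anyone wanting to do something similar, here's a minimal sketch of the append-and-roll-over idea. It's my own illustration, not the commenter's actual script: it assumes the small files are newline-delimited JSON (so byte-level concatenation stays valid), and the `part-NNNNN.json` naming and `combine_json_files` helper are made up for the example.

```python
from pathlib import Path


def combine_json_files(src_dir, dst_dir, max_size=1024 ** 3):
    """Concatenate many small newline-delimited JSON files into larger ones.

    Starts a new output file whenever appending the next input would push
    the current output past ``max_size`` bytes (1 GB by default); with
    enough input, most outputs land between that cap and the previous
    rollover point.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    out_file, out_size, out_index = None, 0, 0
    for path in sorted(Path(src_dir).glob("*.json")):
        size = path.stat().st_size
        # Roll over to a fresh output file if this input would overflow it.
        if out_file is None or out_size + size > max_size:
            if out_file is not None:
                out_file.close()
            out_file = (dst / f"part-{out_index:05d}.json").open("wb")
            out_index += 1
            out_size = 0
        out_file.write(path.read_bytes())
        out_size += size
    if out_file is not None:
        out_file.close()
```

Note this only works as-is for JSON Lines; concatenating complete JSON documents would need a wrapping array or a records-style rewrite instead.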
Mohammad, if you already have an army of tiny files, you can do a batch conversion to Parquet by reading the data set in chunks (e.g. one 1-hour partition of a 24-hour day at a time).
Probably related to your Python version (e.g. 2.7 versus 3.6). Hopefully you got it resolved.