Streaming multipart/form-data parser for Python
May 06, 2017 • #cython, #programming, #python, #tornado
Recently at work we were building a feature that lets our users upload files. We're a Python shop using tornado, and tornado handles this really well by parsing the form data and making the uploaded file available directly inside the request handler. The catch is that by the time the handler function is called, the complete file is in memory, regardless of its size. The obvious fix is to set a limit on the request size in your reverse proxy server (nginx/apache/...) so that any request exceeding the threshold is rejected and never reaches tornado.
That works really well in practice. However, sometimes it's nicer not to be forced to load the complete file into memory (even if there's a limit on the file size) and instead handle the file as data chunks are read from the HTTP request. Again, tornado handles this really well with the stream_request_body decorator, which provides an interface to handle each chunk of the request body while it's being read from the reverse proxy.
The catch here is that stream_request_body works on the request level, while file uploads are usually encoded using the multipart/form-data encoding. This means you can't just throw in stream_request_body and expect things to work. You need to parse the form body to be able to see the file.
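To make the problem concrete, here is a minimal sketch of what a multipart/form-data body looks like on the wire and what happens when it arrives in pieces. The boundary string, field names, and chunk size are made up for illustration. The chunk size is deliberately tiny so that no chunk can hold a whole delimiter; with realistic chunk sizes the same thing happens whenever a delimiter straddles two chunks.

```python
# Hand-built multipart/form-data body; boundary and field names are made up.
BOUNDARY = b"example-boundary-1234"

body = b"\r\n".join([
    b"--" + BOUNDARY,
    b'Content-Disposition: form-data; name="title"',
    b"",
    b"holiday photo",
    b"--" + BOUNDARY,
    b'Content-Disposition: form-data; name="file"; filename="photo.bin"',
    b"Content-Type: application/octet-stream",
    b"",
    b"\x00\x01raw file bytes\x02\x03",
    b"--" + BOUNDARY + b"--",
    b"",
])


def chunked(data, size):
    """Yield successive slices of `data`, mimicking chunks read off a socket."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


chunks = list(chunked(body, 16))
delimiter = b"--" + BOUNDARY

# The delimiter is easy to find when the whole body is in memory...
occurrences_in_body = body.count(delimiter)

# ...but no single 16-byte chunk contains a complete 23-byte delimiter,
# so a streaming parser has to carry state from one chunk to the next.
chunks_with_full_delimiter = sum(delimiter in chunk for chunk in chunks)

print(occurrences_in_body, chunks_with_full_delimiter)
```

The file's bytes sit between the part headers and the next delimiter, which is why the raw chunks can't simply be written to disk as they arrive.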
This seemed like one of those problems that make you think "I bet someone already ran into this and wrote a solution for it". But surprisingly, I found nothing that could parse multipart/form-data encoded data in a streaming manner.
So I spent some of my spare time writing streaming-form-data. This small module provides a Python parser which expects successive request body chunks, and lets you define what should be done with each field in the data. The way it's done is that you initialize the main parser class, and associate one of the *Target classes with each field to define what should be done with that field once it has been extracted from the request body. For instance, FileTarget would stream the data to a file on disk, while SHA256Target calculates the sha256 hash of the content seen so far.
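The idea of a target can be sketched with the standard library alone. The classes below are hypothetical stand-ins written for this post (SketchTarget, SketchFileTarget, SketchSHA256Target and the feed helper are not the module's actual classes or method names), and the parser is left out: feed simply hands one field's chunks to a target, the way a parser might after it has located that field in the body.

```python
import hashlib
import os
import tempfile


class SketchTarget:
    """Hypothetical base class: receives the bytes of one form field."""

    def start(self):
        pass

    def data_received(self, chunk):
        raise NotImplementedError

    def finish(self):
        pass


class SketchFileTarget(SketchTarget):
    """Streams field data to a file on disk without buffering it all."""

    def __init__(self, path):
        self.path = path
        self._fd = None

    def start(self):
        self._fd = open(self.path, "wb")

    def data_received(self, chunk):
        self._fd.write(chunk)

    def finish(self):
        self._fd.close()


class SketchSHA256Target(SketchTarget):
    """Keeps a running SHA-256 digest of the data seen so far."""

    def __init__(self):
        self._hash = hashlib.sha256()

    def data_received(self, chunk):
        self._hash.update(chunk)

    @property
    def value(self):
        return self._hash.hexdigest()


def feed(target, chunks):
    """Stand-in for the parser: hands one field's chunks to its target."""
    target.start()
    for chunk in chunks:
        target.data_received(chunk)
    target.finish()


field_chunks = [b"first part, ", b"second part, ", b"last part"]

path = os.path.join(tempfile.mkdtemp(), "upload.bin")
file_target = SketchFileTarget(path)
sha_target = SketchSHA256Target()

feed(file_target, field_chunks)
feed(sha_target, field_chunks)

print(sha_target.value)
```

Neither target ever holds more than one chunk at a time, which is the point of the design: memory use stays flat no matter how large the uploaded file is.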
Writing this parser also made me realize that byte-by-byte processing in pure Python can be slow. So slow that running a 500k file through this parser took ~3 seconds and hogged the CPU the whole time. I spent some time trying to move the core parsing code into a C extension, but the amount of work involved in defining a new type and handling bytes inputs (not to mention manual memory management) always made me procrastinate.
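A rough way to get a feel for that cost is to compare a hand-written byte-by-byte search with the built-in bytes.find, which does the same scan in C. This is only an illustration and not the parser's actual code; find_bytewise, the payload, and the needle are made up, and the timings will vary from machine to machine.

```python
import timeit


def find_bytewise(data, needle):
    """Return the index of the first `needle` in `data`, one byte at a time."""
    n, m = len(data), len(needle)
    for start in range(n - m + 1):
        offset = 0
        # Compare byte by byte, as a pure Python parser would have to.
        while offset < m and data[start + offset] == needle[offset]:
            offset += 1
        if offset == m:
            return start
    return -1


# Roughly 500k of filler followed by something that looks like a delimiter.
needle = b"--boundary"
payload = b"x" * 500_000 + needle

slow_index = find_bytewise(payload, needle)
fast_index = payload.find(needle)

slow_seconds = timeit.timeit(lambda: find_bytewise(payload, needle), number=1)
fast_seconds = timeit.timeit(lambda: payload.find(needle), number=1)

print(f"byte-by-byte: {slow_seconds:.4f}s, bytes.find: {fast_seconds:.6f}s")
```

Both calls return the same index; the difference is that every step of the Python loop goes through the interpreter.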
Luckily, I found Cython, which lets me write Python code (well, a superset of Python) and translates it to C source that I could then compile into an extension module (a shared library). This let me keep a Python interface to the module while the performance-intensive code ran as compiled C. This was so cool, it almost felt like cheating.
Anyway, I released the module to PyPI under the name streaming-form-data. Hope it's useful.