Streaming uploads to S3 with Python and Poster

Every Amazon S3 library I can lay my hands on (for Python at least) seems to read the entire file to be uploaded into memory before sending it. This might be alright when uploading lots of small files, but I have needed to upload a lot of very large files, and my poor old server would creak under the weight of that kind of memory usage.

I managed to bolt a solution together using urllib2 and poster that has been working reliably for me for the past few months. I'm going to show you:

  1. A little about how S3 works
  2. How to use Poster
  3. A simple script to stream uploads to S3

A little about how S3 works

S3 is essentially a big Python dictionary in the cloud: you give it a key and a value (a file) to store, and later on you can read it back out again. S3 has a nice HTTP API, so you can read and write to the store using standard HTTP libraries.

The area you put your files into is called a bucket. Bucket names (which have some restrictions) are globally unique; that is, if you make a bucket called holiday_photos, then no one else using S3 can have a bucket called holiday_photos. That might sound weird, but it has its advantages: your bucket gets its own hostname, http://holiday_photos.s3.amazonaws.com/. If you set the permissions up so anyone can read the contents of the bucket, the whole world can see your files via http://holiday_photos.s3.amazonaws.com/.
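For example, once a file is in a publicly readable bucket, anyone can fetch it back with nothing more than a plain HTTP GET. A minimal sketch using urllib2 (the bucket and file names here are just the examples from above):

import urllib2

# read a publicly readable object back over plain HTTP
response = urllib2.urlopen("http://holiday_photos.s3.amazonaws.com/meonthebeach.jpg")
photo = response.read()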

The flip side of this is that you can upload your files, let's say "meonthebeach.jpg", by using HTTP PUT; in this case you PUT to http://holiday_photos.s3.amazonaws.com/meonthebeach.jpg.

When uploading to S3, we need to provide a few HTTP headers along with our file data when we PUT: a Date, the Content-Type and Content-Length, an x-amz-acl header to set the permissions, and an Authorization header containing an HMAC-SHA1 signature of the request. The script at the end of this post builds all of these; an illustration follows below.
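Spelled out, the headers on the PUT end up looking something like this (the values here are purely illustrative, and the placeholders are not real credentials):

headers = {
    'Date': 'Tue, 26 Oct 2010 15:42:00 GMT',
    'Content-Type': 'image/jpeg',
    'Content-Length': '4194304',
    'x-amz-acl': 'public-read',
    'Authorization': 'AWS <ACCESS_KEY_ID>:<base64 HMAC-SHA1 signature>',
}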

How to use Poster

Poster is a small library that works with urllib2 to allow streaming uploads. All you need to do is import it and call a single function, which registers Poster's custom URL openers with urllib2, and you are good to go.

import urllib2
from poster.streaminghttp import register_openers

register_openers()

Secondly, we need to tell urllib2 to use HTTP PUT rather than POST. We do this by creating a request object and overriding its get_method:

request = urllib2.Request(url, data=data)
request.get_method = lambda: 'PUT'

And then we can make our request and read the response:

response = urllib2.urlopen(request).read()

The last step for Poster is that, rather than data being the file object to be uploaded, it should be an iterator that provides the file data chunk by chunk. For example:

def read_data(file_object):
    # yield the file contents in 64 KB chunks rather than reading it all at once
    while True:
        r = file_object.read(64 * 1024)
        if not r:
            break
        yield r

f = open("text.txt", "rb")  # binary mode, so the upload isn't mangled
data = read_data(f)

data is now a generator that will return our file in 64 KB chunks.
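If you want to convince yourself it behaves as expected, a quick (purely illustrative) check is to add up the chunks and compare the total against the size on disk:

import os

# total bytes yielded by the generator should equal the file size
total = sum(len(chunk) for chunk in read_data(open("text.txt", "rb")))
print total, os.path.getsize("text.txt")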

A simple script to stream uploads to S3

Below is the source for a simple command line tool that takes a filename, a bucket name, and Amazon credentials, and uploads the file to the bucket, making it publicly readable.

import os
import sys
import time
import base64
import hmac
import mimetypes
import urllib2
from hashlib import sha1

from poster.streaminghttp import register_openers

def read_data(file_object):
    # yield the file in 64 KB chunks so it is never held in memory all at once
    while True:
        r = file_object.read(64 * 1024)
        if not r:
            break
        yield r

def upload_file(filename, bucket, AWS_ACCESS_KEY_ID,
                AWS_SECRET_ACCESS_KEY):
    length = os.stat(filename).st_size
    # fall back to a generic content type if mimetypes can't guess one
    content_type = mimetypes.guess_type(filename)[0] or 'application/octet-stream'
    resource = "/%s/%s" % (bucket, filename)
    url = "http://%s.s3.amazonaws.com/%s" % (bucket, filename)
    date = time.strftime("%a, %d %b %Y %X GMT", time.gmtime())

    # S3 string-to-sign: verb, MD5 (blank), content type, date, amz headers, resource
    sig_data = "PUT\n\n%s\n%s\nx-amz-acl:public-read\n%s" % (
        content_type, date, resource)
    signature = base64.encodestring(
        hmac.new(AWS_SECRET_ACCESS_KEY, sig_data, sha1).digest()).strip()
    auth_string = "AWS %s:%s" % (AWS_ACCESS_KEY_ID, signature)

    # register Poster's streaming handlers with urllib2, then stream the file
    register_openers()
    input_file = open(filename, 'rb')  # binary mode: we are sending raw bytes
    data = read_data(input_file)

    request = urllib2.Request(url, data=data)
    request.add_header('Date', date)
    request.add_header('Content-Type', content_type)
    request.add_header('Content-Length', length)
    request.add_header('Authorization', auth_string)
    request.add_header('x-amz-acl', 'public-read')
    request.get_method = lambda: 'PUT'

    urllib2.urlopen(request).read()

if __name__ == "__main__":
    filename = sys.argv[1]
    bucket = sys.argv[2]
    AWS_ACCESS_KEY_ID = sys.argv[3]
    AWS_SECRET_ACCESS_KEY = sys.argv[4]

    upload_file(filename, bucket, AWS_ACCESS_KEY_ID,
                AWS_SECRET_ACCESS_KEY)
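To use it, save the script (any name will do; upload_to_s3.py below is just an example) and pass the filename, bucket name, access key and secret key on the command line in that order, or import upload_file and call it directly. The values here are placeholders, not real credentials:

# from another Python script or an interactive session
from upload_to_s3 import upload_file

upload_file("meonthebeach.jpg", "holiday_photos",
            "YOUR_AWS_ACCESS_KEY_ID", "YOUR_AWS_SECRET_ACCESS_KEY")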

Comments

Nice post. Might need a ruby one of those soon.

Steve Quinlan 10:07 Monday the 25th of January 2010 #

Well done, thank you, I had the very same problem and this is the right solution.

Miki 16:50 Thursday the 11th of February 2010 #

There's a simpler way - use simples3 and the very same library, poster. sendapatch.se/projects/simples3/streaming.html

Ludvig Ericson 15:06 Tuesday the 26th of October 2010 #

Thanks Ludvig, I don't think simples3 was around when I originally cooked this up, but it looks great. I still do a lot of work with S3, and I imagine I'll be using it a lot from now on.

Sean O'Donnell 15:42 Tuesday the 26th of October 2010 #

Thanks for this post! I've got a question: do you know how to make it work with Python 3?

With approximately the same code (and poster modified with 2to3 and a few fixes for Python 3) I've got this error:
"TypeError: 'generator' does not support the buffer interface", on the line "urllib.request.urlopen(request).read()", and I didn't find a solution...

Regards,
Sam.

Samuel 08:00 Thursday the 10th of May 2012 #
