Streaming uploads to S3 with Python and Poster

Every Amazon S3 library I can lay my hands on (for Python at least) seems to read the entire file to be uploaded into memory before sending it. This might be OK when uploading lots of small files, but I have needed to upload a lot of very large files, and my poor old server would creak under the weight of that kind of memory usage.

I managed to bolt a solution together using urllib2 and Poster that has been working reliably for me for the past few months. I'm going to show you:

  1. A little about how S3 works
  2. How to use Poster
  3. A simple script to stream uploads to S3

A little about how S3 works

S3 is essentially a big Python dictionary in the cloud: you give it a key and a value (a file) to store, and later on you can read it back out again. S3 has a nice HTTP API, so you can read and write to the store using standard HTTP libraries.

The area you put your files into is called a bucket. Bucket names (which have restrictions) are globally unique: if you make a bucket called holiday_photos, then no one else using S3 can have a bucket called holiday_photos. That might sound weird, but it has its advantages, because you can now access your files from http://holiday_photos.s3.amazonaws.com/. If you set the permissions up so anyone can read the contents of the bucket, the whole world can see your files via that URL.

The flip side of this is that you can upload your files, let's say "meonthebeach.jpg", by using HTTP PUT, in this case a PUT to http://holiday_photos.s3.amazonaws.com/meonthebeach.jpg.

When we PUT a file to S3, we need to provide a few HTTP headers along with the file data: Date, Content-Type, Content-Length, Authorization (a signature built from our secret key) and, to make the file publicly readable, x-amz-acl.
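
As a minimal sketch of how a bucket and key map onto that URL, the helper below builds the address to PUT to. The name object_url is hypothetical, and percent-encoding the key with quote() is an assumption added here so keys containing spaces still produce a valid URL. The import fallback is only there so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib import quote  # Python 2
except ImportError:
    from urllib.parse import quote  # Python 3


def object_url(bucket, key):
    """Build the bucket-as-hostname URL for a key (hypothetical helper)."""
    # quote() percent-encodes characters such as spaces; "/" is left
    # alone, so keys that look like paths keep their structure.
    return "http://%s.s3.amazonaws.com/%s" % (bucket, quote(key))


print(object_url("holiday_photos", "meonthebeach.jpg"))
# http://holiday_photos.s3.amazonaws.com/meonthebeach.jpg
```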
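
To make those headers concrete, here is a rough sketch that builds them for a publicly readable upload, using the same string-to-sign layout as the full script further down. The function names sign_put and put_headers and the example credentials are hypothetical. The explicit .encode() calls are an assumption added so the snippet also runs on Python 3, where hmac needs bytes.

```python
import base64
import hmac
import time
from hashlib import sha1


def sign_put(secret_key, content_type, date, resource, acl="public-read"):
    """Return the base64 HMAC-SHA1 signature for a PUT (hypothetical helper)."""
    # Layout: method, empty Content-MD5 line, content type, date,
    # the x-amz-acl header, then the /bucket/key resource.
    string_to_sign = "PUT\n\n%s\n%s\nx-amz-acl:%s\n%s" % (
        content_type, date, acl, resource)
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"), sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def put_headers(access_key, secret_key, bucket, key, content_type, length):
    """Build the headers sent along with the file data."""
    date = time.strftime("%a, %d %b %Y %H:%M:%S GMT", time.gmtime())
    resource = "/%s/%s" % (bucket, key)
    signature = sign_put(secret_key, content_type, date, resource)
    return {
        "Date": date,
        "Content-Type": content_type,
        "Content-Length": str(length),
        "Authorization": "AWS %s:%s" % (access_key, signature),
        "x-amz-acl": "public-read",
    }


# Made-up credentials, purely for illustration.
headers = put_headers("EXAMPLEKEYID", "example-secret", "holiday_photos",
                      "meonthebeach.jpg", "image/jpeg", 1024)
for name in sorted(headers):
    print("%s: %s" % (name, headers[name]))
```

The Date header and the date inside the signed string must be identical, which is why both come from the same variable.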

How to use Poster

Poster is a small library that works with urllib2 to allow streaming uploads. All you need to do is import it and call a single function, which registers Poster's custom URL openers with urllib2, and you are good to go.

import urllib2

from poster.streaminghttp import register_openers
register_openers()

Secondly, we need to tell urllib2 to use HTTP PUT rather than POST. We do this by creating a request object and overriding its get_method:

request = urllib2.Request(url, data=data)

request.get_method = lambda: 'PUT'
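
If you want to convince yourself the override took effect without touching the network, a small check along these lines should do. The import fallback and the placeholder body are assumptions added so the snippet also runs on Python 3, where the same Request class lives in urllib.request.

```python
try:
    import urllib2 as urlrequest  # Python 2
except ImportError:
    import urllib.request as urlrequest  # Python 3

url = "http://holiday_photos.s3.amazonaws.com/meonthebeach.jpg"

# Building a Request does not send anything; urlopen() does that.
request = urlrequest.Request(url, data=b"example")

# With data set, the default method would be POST. Replacing get_method
# on this one instance makes the opener send PUT instead.
request.get_method = lambda: 'PUT'

print(request.get_method())
```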

And then we can make our request and read the response

response = urllib2.urlopen(request).read()

The last step when using Poster is that data should not be the file object (or its contents). Instead, it should be an iterator that provides the file data chunk by chunk. For example:

def read_data(file_object):

    while True:
        r = file_object.read(64 * 1024)

        if not r:
            break
        yield r

f = open("text.txt", "rb")  # binary mode, so the bytes are sent unmodified

data = read_data(f)

data is now a generator that will return our file one 64 KB chunk at a time.
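
A quick way to see what the generator produces is to run it over an in-memory file. This is only an illustrative sketch: the chunk_size parameter and the use of io.BytesIO in place of a real file are additions made here, not part of the original function.

```python
import io


def read_data(file_object, chunk_size=64 * 1024):
    # Same loop as above, with the chunk size exposed as a parameter.
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:
            break
        yield chunk


# 150 KB of dummy data standing in for a large file on disk.
payload = b"x" * (150 * 1024)
chunks = list(read_data(io.BytesIO(payload)))

print([len(c) for c in chunks])  # [65536, 65536, 22528]
```

Only one chunk is held in memory at a time when the generator is consumed lazily, which is what Poster does as it writes to the socket. The list() call here is just for inspection.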

A simple script to stream uploads to S3

Below is the source for a simple command-line tool that takes a filename, a bucket name and Amazon credentials, uploads the file to the bucket and makes it publicly readable:

import os
import sys
import time
import base64
import hmac
import mimetypes
import urllib2

from hashlib import sha1

from poster.streaminghttp import register_openers


def read_data(file_object):
    # Yield the file in 64 KB chunks so it is never held in memory whole.
    while True:
        r = file_object.read(64 * 1024)
        if not r:
            break
        yield r


def upload_file(filename, bucket, AWS_ACCESS_KEY_ID,
                AWS_SECRET_ACCESS_KEY):
    length = os.stat(filename).st_size
    # guess_type returns None for unknown extensions; fall back to a
    # generic type so the header and the signature stay valid.
    content_type = (mimetypes.guess_type(filename)[0]
                    or 'application/octet-stream')
    resource = "/%s/%s" % (bucket, filename)
    url = "http://%s.s3.amazonaws.com/%s" % (bucket, filename)

    # Spell out %H:%M:%S, because %X depends on the locale.
    date = time.strftime("%a, %d %b %Y %H:%M:%S GMT", time.gmtime())

    # String to sign: method, (empty) Content-MD5, content type, date,
    # the x-amz-acl header and the resource.
    sig_data = "PUT\n\n%s\n%s\nx-amz-acl:public-read\n%s" % (
        content_type, date, resource)
    signature = base64.b64encode(
        hmac.new(AWS_SECRET_ACCESS_KEY, sig_data, sha1).digest())

    auth_string = "AWS %s:%s" % (AWS_ACCESS_KEY_ID, signature)

    register_openers()

    # Binary mode, so the bytes sent match the Content-Length.
    input_file = open(filename, 'rb')
    try:
        data = read_data(input_file)
        request = urllib2.Request(url, data=data)

        request.add_header('Date', date)
        request.add_header('Content-Type', content_type)
        request.add_header('Content-Length', str(length))
        request.add_header('Authorization', auth_string)
        request.add_header('x-amz-acl', 'public-read')
        request.get_method = lambda: 'PUT'

        urllib2.urlopen(request).read()
    finally:
        input_file.close()


if __name__ == "__main__":
    filename = sys.argv[1]
    bucket = sys.argv[2]
    AWS_ACCESS_KEY_ID = sys.argv[3]
    AWS_SECRET_ACCESS_KEY = sys.argv[4]

    upload_file(filename, bucket, AWS_ACCESS_KEY_ID,
                AWS_SECRET_ACCESS_KEY)

Comments

Nice post. Might need a ruby one of those soon.

Steve Quinlan 10:07 Monday the 25th of January 2010

Well done, thank you, I had the very same problem and this is the right solution.

Miki 16:50 Thursday the 11th of February 2010