Sean O'Donnells Weblog
Every Amazon S3 library I can lay my hands on (for Python at least) seems to read the entire file to be uploaded into memory before sending it. This might be all right when uploading lots of small files, but I have needed to upload a lot of very large files, and my poor old server would creak under the weight of that kind of memory usage.
I managed to bolt a solution together using urllib2 and poster that has been working reliably for me for the past few months. Here is how it works.
S3 is essentially a big Python dictionary in the cloud: you give it a key and a value (a file) to store, and later on you can read it back out again. S3 has a nice HTTP API, so you can read and write to the store using standard HTTP libraries.
The area you put your files into is called a bucket. Bucket names (which have restrictions) are globally unique: if you make a bucket called holiday_photos, then no one else using S3 can have a bucket called holiday_photos. That might sound weird, but it has its advantages: your bucket gets its own address, http://holiday_photos.s3.amazonaws.com/, and if you set the permissions up so anyone can read the contents of the bucket, the whole world can see your files there.
The flip side of this is that you can upload your files, let's say "meonthebeach.jpg", by using HTTP PUT, in this case a PUT to http://holiday_photos.s3.amazonaws.com/meonthebeach.jpg.
When uploading to S3, we need to provide a few HTTP headers along with our file data when we PUT.
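To make the addressing concrete, here is a minimal sketch (written for Python 3) that derives both the PUT URL and the matching resource path from a bucket and key. The helper name s3_put_target is made up for illustration, and percent-encoding the key with quote is my assumption; the original script uses the filename as-is.

```python
from urllib.parse import quote


def s3_put_target(bucket, key):
    """Return (url, resource) for a PUT to bucket.s3.amazonaws.com.

    Hypothetical helper: the name and the quoting step are assumptions,
    not part of the original script.
    """
    # Percent-encode characters such as spaces, but keep "/" so that
    # keys that look like paths stay readable.
    quoted_key = quote(key, safe="/")
    url = "http://%s.s3.amazonaws.com/%s" % (bucket, quoted_key)
    # The resource is the bucket plus key, as used later when signing.
    resource = "/%s/%s" % (bucket, quoted_key)
    return url, resource


url, resource = s3_put_target("holiday_photos", "meonthebeach.jpg")
print(url)
print(resource)
```

The same pair of values is what the upload code needs: the URL to send the request to, and the resource string that goes into the signature.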
Authorization - This is the tricky one. S3 requires that your PUT request be accompanied by an authorization string in the following format:
AWS AWS_ACCESS_KEY_ID:SIGNATURE
The AWS_ACCESS_KEY_ID is the one provided to you when you signed up for S3.
The signature is built from a string consisting of several of the headers you are sending, concatenated with the resource you are putting, which is then signed (HMAC-SHA1) with your AWS secret access key. Constructing the signature is quite complicated in the general case, so I am going to show a method of generating it for the specific type of upload request we will be making. If you need to send headers that we are not using here, see Amazon's documentation for how to create the authentication header.
The signature string consists of:
PUT\n\n<content-type>\n<date>\nx-amz-acl:public-read\n<resource>
A code example of creating this:
sig_data = "PUT\n\n%s\n%s\nx-amz-acl:public-read\n%s" % ( content_type, date, resource)
We then take this string, compute an HMAC-SHA1 digest of it keyed with your secret access key, and base64 encode the result.
signature = base64.encodestring( hmac.new( settings.AWS_SECRET_ACCESS_KEY, sig_data, sha1).digest() ).strip()
And that's your signature.
Poster is a small library that works with urllib2 to allow streaming uploads. All you need to do is import it and call a single function, which registers poster's custom URL openers with urllib2, and you are good to go.
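For anyone on Python 3, where hmac wants bytes and base64.encodestring is gone, the same signing steps might look roughly like the sketch below. The function names sign_put and authorization_header are my own, the key and secret are placeholder values, and using email.utils.formatdate for the Date value is an assumption on my part rather than what the original script does.

```python
import base64
import hmac
from email.utils import formatdate
from hashlib import sha1


def sign_put(secret_key, content_type, date, resource):
    """Sign a public-read PUT; hypothetical helper mirroring the steps above."""
    sig_data = "PUT\n\n%s\n%s\nx-amz-acl:public-read\n%s" % (
        content_type, date, resource)
    # HMAC-SHA1 keyed with the secret access key, then base64 encoded.
    digest = hmac.new(secret_key.encode("utf-8"),
                      sig_data.encode("utf-8"), sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def authorization_header(access_key_id, signature):
    """Format the header value as AWS AWS_ACCESS_KEY_ID:SIGNATURE."""
    return "AWS %s:%s" % (access_key_id, signature)


# Placeholder credentials, for illustration only.
date = formatdate(usegmt=True)  # e.g. "Tue, 27 Mar 2007 19:36:42 GMT"
signature = sign_put("example-secret", "image/jpeg", date,
                     "/holiday_photos/meonthebeach.jpg")
print(authorization_header("EXAMPLEKEYID", signature))
```

Whatever date string you sign must be exactly the one you send in the Date header, so compute it once and reuse it.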
import urllib2
from poster.streaminghttp import register_openers

register_openers()
Next, we need to tell urllib2 to use HTTP PUT rather than POST. We do this by creating a request object and overriding its get_method:
request = urllib2.Request(url, data=data)
request.get_method = lambda: 'PUT'
And then we can make our request and read the response:
response = urllib2.urlopen(request).read()
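As an aside for Python 3 users: there, urllib.request.Request takes a method argument, so the get_method override should not be needed. A small sketch, which only builds the request and sends nothing (the helper name make_put_request is made up for illustration):

```python
import urllib.request


def make_put_request(url, data, headers=None):
    """Build (but do not send) a PUT request; hypothetical helper."""
    return urllib.request.Request(
        url, data=data, headers=headers or {}, method="PUT")


request = make_put_request(
    "http://holiday_photos.s3.amazonaws.com/meonthebeach.jpg",
    b"example bytes")
print(request.get_method())
```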
The last step for use with poster is that, rather than data being the file object to be uploaded, it should be an iterator that provides the file data chunk by chunk. For example:
def read_data(file_object):
    while True:
        r = file_object.read(64 * 1024)
        if not r:
            break
        yield r

f = open("text.txt", "rb")
data = read_data(f)
data is now a generator that will return our file in chunks of up to 64 KB at a time.
Below is the source for a simple command line tool that will take a filename, a bucket name, and Amazon credentials, and upload the file to the bucket, making it publicly readable.
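To see the chunking in action without touching a real file, here is a small sketch (Python 3) that feeds the generator an in-memory buffer. The chunk_size parameter is an addition of mine for illustration; the version above hard-codes 64 * 1024.

```python
import io


def read_data(file_object, chunk_size=64 * 1024):
    """Yield the contents of file_object in pieces of at most chunk_size."""
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:
            break
        yield chunk


# 150 KB of dummy data standing in for a large file.
payload = b"x" * (150 * 1024)
chunks = list(read_data(io.BytesIO(payload)))
print([len(c) for c in chunks])  # [65536, 65536, 22528]
```

Only one chunk is held at a time when the generator is consumed lazily; the list() call here is just so the sizes can be printed.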
import os
import sys
import time
import base64
import hmac
import mimetypes
import urllib2
from hashlib import sha1
from poster.streaminghttp import register_openers


def read_data(file_object):
    # Yield the file in 64 KB chunks so it is never held in memory whole.
    while True:
        r = file_object.read(64 * 1024)
        if not r:
            break
        yield r


def upload_file(filename, bucket, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY):
    length = os.stat(filename).st_size
    # guess_type returns None for unknown extensions, so fall back to a generic type.
    content_type = mimetypes.guess_type(filename)[0] or 'application/octet-stream'
    resource = "/%s/%s" % (bucket, filename)
    url = "http://%s.s3.amazonaws.com/%s" % (bucket, filename)
    # %H:%M:%S rather than %X, whose output depends on the locale.
    date = time.strftime("%a, %d %b %Y %H:%M:%S GMT", time.gmtime())
    sig_data = "PUT\n\n%s\n%s\nx-amz-acl:public-read\n%s" % (
        content_type, date, resource)
    signature = base64.encodestring(
        hmac.new(AWS_SECRET_ACCESS_KEY, sig_data, sha1).digest()).strip()
    auth_string = "AWS %s:%s" % (AWS_ACCESS_KEY_ID, signature)
    # Register poster's streaming openers with urllib2.
    register_openers()
    # Binary mode, so the bytes sent match the Content-Length.
    input_file = open(filename, 'rb')
    data = read_data(input_file)
    request = urllib2.Request(url, data=data)
    request.add_header('Date', date)
    request.add_header('Content-Type', content_type)
    request.add_header('Content-Length', length)
    request.add_header('Authorization', auth_string)
    request.add_header('x-amz-acl', 'public-read')
    request.get_method = lambda: 'PUT'
    urllib2.urlopen(request).read()
    input_file.close()


if __name__ == "__main__":
    filename = sys.argv[1]
    bucket = sys.argv[2]
    AWS_ACCESS_KEY_ID = sys.argv[3]
    AWS_SECRET_ACCESS_KEY = sys.argv[4]
    upload_file(filename, bucket, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
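If you are on Python 3, the standard library may be enough on its own: as far as I can tell from the urllib.request documentation, the data argument may be an iterable of bytes, provided you supply Content-Length yourself. The sketch below only prepares such a request against a temporary file and never contacts S3. The build_streaming_put helper is hypothetical, the signing headers are left out, and I have not run this against a real bucket.

```python
import mimetypes
import os
import tempfile
import urllib.request


def read_data(file_object, chunk_size=64 * 1024):
    """Yield the file in chunks so it is never read into memory whole."""
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:
            break
        yield chunk


def build_streaming_put(url, filename, extra_headers=None):
    """Prepare (but do not send) a PUT whose body is read chunk by chunk.

    Hypothetical helper. Returns the request and the open file so the
    caller can close the file once the upload has finished.
    """
    content_type = (mimetypes.guess_type(filename)[0]
                    or "application/octet-stream")
    headers = {
        "Content-Type": content_type,
        # Set explicitly, since the body is an iterable of unknown length.
        "Content-Length": str(os.path.getsize(filename)),
    }
    headers.update(extra_headers or {})
    input_file = open(filename, "rb")
    request = urllib.request.Request(
        url, data=read_data(input_file), headers=headers, method="PUT")
    return request, input_file


# Demonstration with a throwaway file; nothing is sent over the network.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(b"hello world")
request, input_file = build_streaming_put(
    "http://holiday_photos.s3.amazonaws.com/hello.txt", tmp.name)
body = b"".join(request.data)  # what would be streamed to the server
input_file.close()
os.remove(tmp.name)
print(request.get_method(), request.get_header("Content-length"), len(body))
```

For a real upload you would pass the Date, Authorization and x-amz-acl headers through extra_headers and hand the request to urllib.request.urlopen instead of joining the chunks yourself.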
Comments
Nice post. Might need a ruby one of those soon.
Well done, thank you, I had the very same problem and this is the right solution.
There's a simpler way - use simples3 and the very same library, poster. sendapatch.se/projects/simples3/streaming.html
Thanks Ludvig, I don't think SimpleS3 was around when I originally cooked this up, but it looks great. I still do a lot of work with S3, and I imagine I'll be using it a lot from now on.
Thanks for this post! I've got a question: do you know how to make it work with Python 3?
With approximately the same code (and poster modified with 2to3 and a few fixes for Python 3) I get this error:
"TypeError: 'generator' does not support the buffer interface", on the line "urllib.request.urlopen(request).read()", and I haven't found a solution...
Regards,
Sam.