Sean O'Donnells Weblog
Every Amazon S3 library I can lay my hands on (for Python at least), seems to read the entire file to be uploaded into memory before sending it. This might by ok when uploading lots of small files, but I have needed to upload a lot of very large files, and my poor old server would creak under the weight of that kind of memory usage.
I managed to bolt a solution together using urllib2 and poster that has been working reliably for me for the past few months. Im going to show you:
S3 is essentially a big python dictionary in the cloud, you give it a key and a value(file) to store, and later on you can read it back out again. S3 has a nice HTTP API, so you can read and write to the store using standard HTTP libraries.
The area you put your files into is called a bucket. Bucket names (which have restrictions) are globally unique, that is, if you make a bucket called holiday_photos, then no one else using s3 can have a bucket called holiday_photos, which might sound weird, but it has its advantages, you can now access your files from http://holiday_photos.s3.amazonaws.com/. If you set the permissions up so anyone can read the contents of the bucket, the whole world can see you files via http://holiday_photos.s3.amazonaws.com/.
The flip side of this, is that you can upload your files, lets say "meonthebeach.jpg" by using HTTP PUT, in this case PUT to http://holiday_photos.s3.amazonaws.com/meonthebeach.jpg.
When uploading to S3, we need provide a few HTTP headers along with our file data when we PUT.
Authorization - This is the tricky one, S3 requires that your PUT request be accompanied by an authorization string in the following format: AWS AWS_ACCESS_KEY_ID:SIGNATURE The AWS_ACCESS_KEY_ID is the one provided to you when you signed up to S3
The signature is a string consisting of several of the headers you are sending, along with the resource you are putting concatenated, and hashed with your AWS Secret access key. Constructing the signature is quite complicated in the general case, so I am going to show a method of generating it for the specific type of upload request we will be making, if you need to send headers that we are not using here, see Amazons Documentation for how to create the Authentication Header.
The signature string consists of
PUT\n\n<content-type>\n<date>\nx-amz-acl:public-read\n<resource>
a code example of creating this
sig_data = "PUT\n\n%s\n%s\nx-amz-acl:public-read\n%s" % (
content_type, date, resource)
We then take this string and create an sha1 hash of it and your secret access key, and base 64 encode it.
signature = base64.encodestring( hmac.new( settings.AWS_SECRET_ACCESS_KEY, sig_data, sha1).digest() ).strip()
and thats your signature.
Poster is a small library that works with urllib2 to allow streaming uploads. All you need to do is import it and call a single function which registers posters custom url openers with urllib2 and you are good to go.
import urllib2 from poster.streaminghttp import register_openers register_openers()
Secondly we need to tell urllib to use HTTP PUT rather than POST. We do this by creating a request object and overriding the get_method
request = urllib2.Request(url, data=data) request.get_method = lambda: 'PUT'
And then we can make our request and read the response
response = urllib2.urlopen(request).read()
The last step for use in poster is that rather than data containing the file object to be uploaded, it should return an iterator that provides the file data chunk by chunk. For example.
def read_data(file_object): while True: r = file_object.read(64 * 1024) if not r: break yield r f = open("text.txt","r") data = read_data(f)
data is now a generator that will return our file a line at a time.
Below is the source for a simple command line tool that will take a filename bucket name, and amazon credentials and upload the file to the bucket making it publicly readable
import os import sys import time import base64 import hmac import mimetypes import urllib2 from hashlib import sha1 from poster.streaminghttp import register_openers def read_data(file_object): while True: r = file_object.read(64 * 1024) if not r: break yield r def upload_file(filename, bucket, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY): length = os.stat(filename).st_size content_type = mimetypes.guess_type(filename)[0] resource = "/%s/%s" % (bucket, filename) url = "http://%s.s3.amazonaws.com/%s" % (bucket, filename) date = time.strftime("%a, %d %b %Y %X GMT", time.gmtime()) sig_data = "PUT\n\n%s\n%s\nx-amz-acl:public-read\n%s" % ( content_type, date, resource) signature = base64.encodestring( hmac.new( AWS_SECRET_ACCESS_KEY, sig_data, sha1).digest()).strip() auth_string = "AWS %s:%s" % (AWS_ACCESS_KEY_ID, signature) register_openers() input_file = open(filename, 'r') data = read_data(input_file) request = urllib2.Request(url, data=data) request.add_header('Date', date) request.add_header('Content-Type', content_type) request.add_header('Content-Length', length) request.add_header('Authorization', auth_string) request.add_header('x-amz-acl', 'public-read') request.get_method = lambda: 'PUT' urllib2.urlopen(request).read() if __name__ == "__main__": filename = sys.argv[1] bucket = sys.argv[2] AWS_ACCESS_KEY_ID = sys.argv[3] AWS_SECRET_ACCESS_KEY = sys.argv[4] upload_file(filename, bucket, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
I used to work with a guy (Hi Daniel) who got everyone he knew to send him OPML files from their RSS readers so he could find new gems to subscribe to. Im feeling kind of bored at the moment. So I am going to repeat his experiment. Anyone who reads this, or sees the related tweet, please send me your OPML file. If your RSS reader makes it difficult to export a list of links, then by all means send them in whatever format you like.
In a weeks time, I'll take the results, crunch em a little, and put them up for all to see. So you can get the benefit too. My email address can be grabbed from the contact link to the left. Come on, send me your links!
For the curious, here is my current list of feeds.
Readability is a bookmarklet that removes clutter from webpages to make them more readable. I read from computer screens a lot, but when it comes to longer text I actually prefer to read from the tiny screen on my mobile phone than from a laptop monitor.
I recently began a little reading on Typography, and learned of the concept of the comfortable measure. Essentially, approximately 66 characters per line is regarded as the ideal width for readable text.
While Readability does not hit that mark exactly, its a lot closer than the average over wide web layout. Give it a try, it can return a lot of the pleasure of reading to computers.
Kablingy Software just released a fun widget generator to display what platform your web app is running on. I imagine they will probably come up with some interesting statistics from running it as well.
Go on over to http://whatareyourunning.com and take it for a spin. If you are running something they don't have on the list yet, drop them a quick email and it will be up in not time. See my widget below.
I'm really fed up with hearing questions about "how does it scale". The simple fact, borne out by historical evidence, is that absolutely bloody everything scales. From what history tells us, you could write your web app in Cobol and run it on potato powered web servers, and still find a way to make it scale. What has not scaled? I can't think of a single web technology that was abandoned because it could not scale.
They all scaled, and they all had to face the whiners and naysayers who claimed they would not. Anyone who claims a new Technology will not scale before they have tried it is committing the mortal sin of premature optimization and should repent immediately.