Caffeine Fueled Dreams

Sean O'Donnell's Weblog

  • Archive
  • Contact
  • RSS Feed
  • The Future comes to pass 00:30 Tuesday the 27th of April 2010 0 Comments

    All the shares are owned by those companies in equal measure, and I can tell you that their regulations are written in Python.

    Charles Stross - Accelerando 2005

    We are proposing to require that most ABS issuers file a computer program that gives effect to the flow of funds, or “waterfall,” provisions of the transaction. We are proposing that the computer program be filed on EDGAR in the form of downloadable source code in Python. …

    SECURITIES AND EXCHANGE COMMISSION - 17 CFR Parts 200, 229, 230, 232, 239, 240, 243 and 249 Release Nos. 33-9117; 34-61858; File No. S7-08-10 RIN 3235-AK37 ASSET-BACKED SECURITIES - 2010

    via Sean McGrath

  • Streaming uploads to S3 with Python and Poster 16:25 Sunday the 24th of January 2010 2 Comments

    Every Amazon S3 library I can lay my hands on (for Python at least) seems to read the entire file to be uploaded into memory before sending it. This might be OK when uploading lots of small files, but I have needed to upload a lot of very large files, and my poor old server would creak under the weight of that kind of memory usage.

    I managed to bolt a solution together using urllib2 and poster that has been working reliably for me for the past few months. I'm going to show you:

    1. A little about how S3 works
    2. How to use Poster
    3. A simple script to stream uploads to S3

    A little about how S3 works

    S3 is essentially a big Python dictionary in the cloud: you give it a key and a value (a file) to store, and later on you can read it back out again. S3 has a nice HTTP API, so you can read and write to the store using standard HTTP libraries.

    The area you put your files into is called a bucket. Bucket names (which have restrictions) are globally unique: if you make a bucket called holiday_photos, then no one else using S3 can have a bucket called holiday_photos. That might sound weird, but it has its advantages: you can now access your files from http://holiday_photos.s3.amazonaws.com/. If you set the permissions up so anyone can read the contents of the bucket, the whole world can see your files via http://holiday_photos.s3.amazonaws.com/.

    The flip side of this is that you can upload your files, let's say "meonthebeach.jpg", by using HTTP PUT, in this case a PUT to http://holiday_photos.s3.amazonaws.com/meonthebeach.jpg.

    When uploading to S3, we need to provide a few HTTP headers along with our file data when we PUT.
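
    As a small illustration of the URL scheme just described, here is a Python 3 sketch that builds the address of an object from a bucket name and a key. The helper name s3_object_url is made up for this example, and percent-encoding the key with urllib.parse.quote is my own assumption for keys that contain spaces or similar characters; it is not something the upload script in this post does.

```python
from urllib.parse import quote


def s3_object_url(bucket, key):
    # Bucket-as-hostname addressing, as in the holiday_photos example.
    # quote() leaves "/" alone and percent-encodes characters such as spaces.
    return "http://%s.s3.amazonaws.com/%s" % (bucket, quote(key))


print(s3_object_url("holiday_photos", "meonthebeach.jpg"))
```

    The same URL serves for both directions: a GET reads the object (if permissions allow) and a PUT writes it.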

    • Date - The current date and time in a specific format, e.g. Wed, 01 Mar 2006 12:00:00 GMT. I generate it with time.strftime("%a, %d %b %Y %X GMT", time.gmtime()). Note that %X is locale dependent; it gives HH:MM:SS in the default C locale.
    • Content-Type - The MIME type of the file being uploaded, e.g. text/html. Python's mimetypes module does a good job of guessing this for any given file based on its extension: mimetypes.guess_type(filename)[0]
    • Content-Length - The length of the data to be uploaded, as defined in RFC 2616. If you are uploading the file from disk you can get this with the os module's stat function: os.stat(filename).st_size
    • x-amz-acl - Optional. This tells S3 which access control policy to apply. By default the file is available only to the owner of the bucket; to make it publicly readable, set this to public-read.
    • Authorization - This is the tricky one. S3 requires that your PUT request be accompanied by an authorization string in the following format: AWS AWS_ACCESS_KEY_ID:SIGNATURE. The AWS_ACCESS_KEY_ID is the one provided to you when you signed up for S3.

      The signature is built from a string consisting of several of the headers you are sending, concatenated with the resource you are putting, and then signed with your AWS secret access key. Constructing the signature is quite complicated in the general case, so I am going to show a method of generating it for the specific type of upload request we will be making. If you need to send headers that we are not using here, see Amazon's documentation on how to create the authentication header.

      The signature string consists of

      PUT\n\n<content-type>\n<date>\nx-amz-acl:public-read\n<resource>

      Here is a code example of creating this string:

      sig_data = "PUT\n\n%s\n%s\nx-amz-acl:public-read\n%s" % (
      content_type, date, resource)

      We then compute an HMAC-SHA1 of this string, keyed with your secret access key, and base64 encode the result.

      signature = base64.encodestring(
          hmac.new(settings.AWS_SECRET_ACCESS_KEY, sig_data, sha1).digest()
      ).strip()

      And that's your signature.
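
    Pulling the header list and the signing steps together, a minimal Python 3 sketch might look like the following. The function names (string_to_sign, sign, put_headers) are hypothetical, the fallback to application/octet-stream when the MIME type cannot be guessed is my own assumption, and email.utils.formatdate is swapped in for the strftime call because it does not depend on the locale. It only covers the public-read PUT case described above.

```python
import base64
import hashlib
import hmac
import mimetypes
from email.utils import formatdate


def string_to_sign(content_type, date, resource, acl="public-read"):
    # Verb, an empty Content-MD5 line, content type, date,
    # the x-amz-acl header, then the resource path.
    return "PUT\n\n%s\n%s\nx-amz-acl:%s\n%s" % (content_type, date, acl, resource)


def sign(secret_key, data):
    # HMAC-SHA1 keyed with the secret access key, then base64 encoded.
    digest = hmac.new(
        secret_key.encode("utf-8"), data.encode("utf-8"), hashlib.sha1
    ).digest()
    return base64.b64encode(digest).decode("ascii")


def put_headers(key, length, bucket, key_id, secret_key, timestamp=None):
    # timestamp=None means "now"; passing a fixed value makes this testable.
    date = formatdate(timestamp, usegmt=True)
    # Assumption: fall back to a generic type when guessing fails.
    content_type = mimetypes.guess_type(key)[0] or "application/octet-stream"
    resource = "/%s/%s" % (bucket, key)
    signature = sign(secret_key, string_to_sign(content_type, date, resource))
    return {
        "Date": date,
        "Content-Type": content_type,
        "Content-Length": str(length),
        "x-amz-acl": "public-read",
        "Authorization": "AWS %s:%s" % (key_id, signature),
    }


# Placeholder credentials, for illustration only.
headers = put_headers("meonthebeach.jpg", 1024, "holiday_photos",
                      "EXAMPLEKEYID", "example-secret")
for name, value in headers.items():
    print("%s: %s" % (name, value))
```

    The Date and Content-Type values that go into the signed string must be exactly the ones sent as headers, which is why both are computed once and reused.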

    How to use Poster

    Poster is a small library that works with urllib2 to allow streaming uploads. All you need to do is import it and call a single function, which registers Poster's custom URL openers with urllib2, and you are good to go.

    import urllib2
    
    from poster.streaminghttp import register_openers
    register_openers()
    

    Secondly, we need to tell urllib2 to use HTTP PUT rather than POST. We do this by creating a request object and overriding its get_method method.

    request = urllib2.Request(url, data=data)
    
    request.get_method = lambda: 'PUT'
    

    And then we can make our request and read the response

    response = urllib2.urlopen(request).read()
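
    For readers on Python 3, where urllib2 has become urllib.request, the get_method override is not needed because Request accepts a method argument. The sketch below only builds the request and does not send it; the helper name make_put_request and the example payload are made up for illustration.

```python
import urllib.request


def make_put_request(url, data, headers):
    # method="PUT" plays the role of the get_method override used with urllib2.
    return urllib.request.Request(url, data=data, headers=headers, method="PUT")


request = make_put_request(
    "http://holiday_photos.s3.amazonaws.com/meonthebeach.jpg",
    data=b"example bytes",
    headers={"Content-Length": "13"},
)
print(request.get_method())

# Sending is left out of the sketch; it would be:
# response = urllib.request.urlopen(request).read()
```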
    
    

    The last step when using Poster is that, rather than passing the file object itself as data, we pass an iterator that yields the file data chunk by chunk. For example:

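    def read_data(file_object):
        # Yield the file in 64 KB chunks until it is exhausted.
        while True:
            r = file_object.read(64 * 1024)
            if not r:
                break
            yield r

    # Open in binary mode so the bytes are sent unmodified.
    f = open("text.txt", "rb")
    data = read_data(f)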

    data is now a generator that will return our file one 64 KB chunk at a time.
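
    To see the chunking behaviour without touching the disk, here is a small Python 3 sketch that runs the same generator over an in-memory buffer. The chunk_size parameter and the 150 KB payload are assumptions added for the demonstration.

```python
import io


def read_data(file_object, chunk_size=64 * 1024):
    # Same loop as above, with the chunk size made configurable.
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:
            break
        yield chunk


# 150 KB of dummy data standing in for a file opened in binary mode.
payload = io.BytesIO(b"x" * (150 * 1024))
sizes = [len(chunk) for chunk in read_data(payload)]
print(sizes)  # [65536, 65536, 22528]
```

    Only one chunk is held in memory at a time, which is the whole point of streaming the upload.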

    A simple script to stream uploads to S3

    Below is the source for a simple command line tool that takes a filename, a bucket name, and Amazon credentials, and uploads the file to the bucket, making it publicly readable.

    import os
    import sys
    import time
    import base64
    import hmac
    import mimetypes
    import urllib2

    from hashlib import sha1

    from poster.streaminghttp import register_openers


    def read_data(file_object):
        # Yield the file in 64 KB chunks so it is never fully in memory.
        while True:
            r = file_object.read(64 * 1024)
            if not r:
                break
            yield r


    def upload_file(filename, bucket, AWS_ACCESS_KEY_ID,
                    AWS_SECRET_ACCESS_KEY):
        length = os.stat(filename).st_size
        content_type = mimetypes.guess_type(filename)[0]
        # The filename is used as the S3 key as-is.
        resource = "/%s/%s" % (bucket, filename)
        url = "http://%s.s3.amazonaws.com/%s" % (bucket, filename)
        date = time.strftime("%a, %d %b %Y %X GMT", time.gmtime())

        # Build the string to sign and sign it with HMAC-SHA1.
        sig_data = "PUT\n\n%s\n%s\nx-amz-acl:public-read\n%s" % (
            content_type, date, resource)
        signature = base64.encodestring(
            hmac.new(AWS_SECRET_ACCESS_KEY, sig_data, sha1).digest()).strip()
        auth_string = "AWS %s:%s" % (AWS_ACCESS_KEY_ID, signature)

        # Install poster's streaming handlers into urllib2.
        register_openers()

        # Binary mode, so the bytes sent match the Content-Length.
        input_file = open(filename, 'rb')
        try:
            data = read_data(input_file)
            request = urllib2.Request(url, data=data)
            request.add_header('Date', date)
            request.add_header('Content-Type', content_type)
            request.add_header('Content-Length', length)
            request.add_header('Authorization', auth_string)
            request.add_header('x-amz-acl', 'public-read')
            request.get_method = lambda: 'PUT'
            urllib2.urlopen(request).read()
        finally:
            input_file.close()


    if __name__ == "__main__":
        filename = sys.argv[1]
        bucket = sys.argv[2]
        AWS_ACCESS_KEY_ID = sys.argv[3]
        AWS_SECRET_ACCESS_KEY = sys.argv[4]

        upload_file(filename, bucket, AWS_ACCESS_KEY_ID,
                    AWS_SECRET_ACCESS_KEY)
    
  • Send me your OPML 15:22 Sunday the 7th of June 2009 1 Comments

    I used to work with a guy (Hi Daniel) who got everyone he knew to send him OPML files from their RSS readers so he could find new gems to subscribe to. I'm feeling kind of bored at the moment, so I am going to repeat his experiment. Anyone who reads this, or sees the related tweet, please send me your OPML file. If your RSS reader makes it difficult to export a list of links, then by all means send them in whatever format you like.

    In a week's time, I'll take the results, crunch 'em a little, and put them up for all to see, so you can get the benefit too. My email address can be grabbed from the contact link to the left. Come on, send me your links!

    For the curious, here is my current list of feeds.

  • Readability 00:30 Thursday the 21st of May 2009 0 Comments

    Readability is a bookmarklet that removes clutter from webpages to make them more readable. I read from computer screens a lot, but when it comes to longer text I actually prefer to read from the tiny screen on my mobile phone than from a laptop monitor.

    I recently began a little reading on Typography, and learned of the concept of the comfortable measure. Essentially, approximately 66 characters per line is regarded as the ideal width for readable text.

    While Readability does not hit that mark exactly, it's a lot closer than the average over-wide web layout. Give it a try; it can return a lot of the pleasure of reading to computers.

  • What are you running? 17:40 Friday the 30th of January 2009 1 Comments

    Kablingy Software just released a fun widget generator to display what platform your web app is running on. I imagine they will probably come up with some interesting statistics from running it as well.

    Go on over to http://whatareyourunning.com and take it for a spin. If you are running something they don't have on the list yet, drop them a quick email and it will be up in no time. See my widget below.

© Copyright 2004-2010 Sean O'Donnell