Sean O'Donnell's Weblog
The sad news that Yahoo plans to shut down del.icio.us reached me this week (although there's still hope). I use del.icio.us pretty much every day and was a little traumatized upon hearing this. Once I had finished wailing and gnashing my teeth, I set out looking for somewhere to go.
There are many bookmarking sites/services out there, but I fear change, and pinboard.in seemed like the closest thing to a straight replacement. It even supports the same API as del.icio.us. There's a small charge for signing up, but no recurring fee, so I broke out the credit card and joined up.
The next step was to figure out how to migrate my bookmarks. del.icio.us provides an export-to-HTML feature in its settings area, but a quick look at the export revealed some data was missing (mostly extended descriptions). Rabid googling revealed a lesser-known XML export mechanism. To use it, visit https://api.del.icio.us/v1/posts/all, enter your username and password, and save the resulting XML file.
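If you would rather script that step than do it in a browser, a rough sketch like the one below should fetch the same export; it assumes the endpoint accepts standard HTTP basic auth with your del.icio.us username and password (swap in your own credentials).

import urllib2

endpoint = "https://api.del.icio.us/v1/posts/all"

#set up basic auth with your del.icio.us credentials
password_manager = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_manager.add_password(None, endpoint, "myusername", "mypassword")
opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(password_manager))

#save the export next to the script as backup.xml
xml = opener.open(endpoint).read()
open("backup.xml", "w").write(xml)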
Now to get my bookmarks into pinboard.in. I broke out my trusty text editor and battered together the script below, which works just fine. A few hours later all my bookmarks are in pinboard.in, their bookmarklets are installed in my browser, and I'm loving their read-later features. Sean is a happy geek again.
You can download my migration script. To use it:
python delmigrate.py backup.xml username password
Here's the source for the curious.
from xml.dom import minidom
import sys
import urllib
import urllib2
import time

user = sys.argv[2]
password = sys.argv[3]

endpoint = "https://api.pinboard.in"
url = "/v1/posts/add?"

#open the xml file to import from and parse it
f = open(sys.argv[1], "r")
doc = minidom.parse(f).documentElement

#keep count of how many urls have been imported
urlcount = 0
count = 0
ellength = len(doc.childNodes)
failcount = 0

while count < ellength:
    e = doc.childNodes[count]
    if e.nodeType == e.ELEMENT_NODE:
        print "import url %s" % urlcount

        #get the attributes from the xml
        href = e.getAttribute("href")
        description = e.getAttribute("description")
        extended = e.getAttribute("extended")
        tags = e.getAttribute("tag")
        dt = e.getAttribute("time")
        rargs = dict(url=href, description=description, extended=extended,
                     tags=tags, dt=dt)
        shared = e.getAttribute("shared")
        if shared.strip() == 'no':
            rargs['shared'] = 'no'

        #encode the values as utf-8 so urlencode can handle them
        rargs = dict([k, v.encode('utf-8')] for k, v in rargs.items())
        print rargs

        #set up http auth for pinboard.in
        #doing this for every request may seem wasteful, but urllib2
        #seems to forget the auth details after a half dozen requests
        #if you don't
        password_manager = urllib2.HTTPPasswordMgrWithDefaultRealm()
        password_manager.add_password(None, endpoint, user, password)
        auth_handler = urllib2.HTTPBasicAuthHandler(password_manager)
        opener = urllib2.build_opener(auth_handler)
        urllib2.install_opener(opener)

        #build the request to send
        request = urllib2.Request(endpoint + url + urllib.urlencode(rargs))
        #set the user agent
        request.add_header('User-Agent', 'SeansDeliciousMigrater')

        try:
            #send the request and read the response
            r = opener.open(request)
            response = minidom.parse(r).documentElement.getAttribute("code")
        except Exception, e:
            response = str(e)

        #if we get an invalid response, we were probably throttled; retry,
        #and abort after too many failures in a row
        if response != "done":
            failcount += 1
            print "Failure: Invalid response: %s" % response
            if failcount > 4:
                print "Aborting: Invalid response %s" % response
                break
            else:
                print "waiting for 30 seconds and retrying"
                time.sleep(30)
        else:
            failcount = 0
            count += 1
            #put in a delay between requests to reduce odds of throttling
            time.sleep(1)
            urlcount += 1
    else:
        count += 1

print "%s urls imported" % urlcount
All the shares are owned by those companies in equal measure, and I can tell you that their regulations are written in Python.
We are proposing to require that most ABS issuers file a computer program that gives effect to the flow of funds, or “waterfall,” provisions of the transaction. We are proposing that the computer program be filed on EDGAR in the form of downloadable source code in Python. …
via Sean McGrath
Every Amazon S3 library I can lay my hands on (for Python at least) seems to read the entire file to be uploaded into memory before sending it. This might be OK when uploading lots of small files, but I have needed to upload a lot of very large files, and my poor old server would creak under the weight of that kind of memory usage.
I managed to bolt a solution together using urllib2 and poster that has been working reliably for me for the past few months. I'm going to show you how.
S3 is essentially a big Python dictionary in the cloud: you give it a key and a value (a file) to store, and later on you can read it back out again. S3 has a nice HTTP API, so you can read and write to the store using standard HTTP libraries.
The area you put your files into is called a bucket. Bucket names (which have some restrictions) are globally unique; that is, if you make a bucket called holiday_photos, then no one else using S3 can have a bucket called holiday_photos. That might sound weird, but it has its advantages: you can now access your files from http://holiday_photos.s3.amazonaws.com/. If you set the permissions up so anyone can read the contents of the bucket, the whole world can see your files via that URL.
The flip side of this is that you can upload your files, let's say "meonthebeach.jpg", using HTTP PUT, in this case a PUT to http://holiday_photos.s3.amazonaws.com/meonthebeach.jpg.
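Reading a public file back is then just a plain GET on that URL. As a quick sketch (assuming the bucket and file above actually exist and are publicly readable):

import urllib2

#fetch the public object over plain HTTP
photo = urllib2.urlopen(
    "http://holiday_photos.s3.amazonaws.com/meonthebeach.jpg").read()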
When uploading to S3, we need to provide a few HTTP headers along with our file data when we PUT: Date, Content-Type, Content-Length, x-amz-acl (which sets the permissions), and Authorization.
Authorization is the tricky one. S3 requires that your PUT request be accompanied by an authorization string in the following format: AWS AWS_ACCESS_KEY_ID:SIGNATURE. The AWS_ACCESS_KEY_ID is the one provided to you when you signed up to S3.
The signature is a string consisting of several of the headers you are sending, along with the resource you are putting, concatenated and signed with your AWS secret access key. Constructing the signature is quite complicated in the general case, so I am going to show a method of generating it for the specific type of upload request we will be making. If you need to send headers that we are not using here, see Amazon's documentation on how to create the Authentication header.
The string to sign consists of the HTTP verb, the Content-MD5 (which we leave blank), the Content-Type, the Date, the x-amz-acl header, and the resource path, each separated by a newline. Here is a code example of creating this:
sig_data = "PUT\n\n%s\n%s\nx-amz-acl:public-read\n%s" % (
content_type, date, resource)
We then take this string, create an HMAC-SHA1 of it using your secret access key, and base64 encode the result.
signature = base64.encodestring(
    hmac.new(AWS_SECRET_ACCESS_KEY, sig_data, sha1).digest()).strip()
And that's your signature.
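The header value itself is then just the access key id and the signature glued together in the format mentioned above; using the AWS_ACCESS_KEY_ID from earlier and the signature we just built:

#this goes into the Authorization header of the PUT request
auth_string = "AWS %s:%s" % (AWS_ACCESS_KEY_ID, signature)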
Poster is a small library that works with urllib2 to allow streaming uploads. All you need to do is import it and call a single function, which registers poster's custom URL openers with urllib2, and you are good to go.
import urllib2
from poster.streaminghttp import register_openers

register_openers()
Secondly, we need to tell urllib2 to use HTTP PUT rather than POST. We do this by creating a request object and overriding its get_method:
request = urllib2.Request(url, data=data)
request.get_method = lambda: 'PUT'
And then we can make our request and read the response
response = urllib2.urlopen(request).read()
The last step for using poster is that, rather than data being the file object to be uploaded, it should be an iterator that provides the file data chunk by chunk. For example:
def read_data(file_object):
    while True:
        r = file_object.read(64 * 1024)
        if not r:
            break
        yield r

f = open("text.txt", "r")
data = read_data(f)
data is now a generator that will return our file 64 KB at a time.
Below is the source for a simple command line tool that takes a filename, bucket name, and Amazon credentials, and uploads the file to the bucket, making it publicly readable.
import os
import sys
import time
import base64
import hmac
import mimetypes
import urllib2
from hashlib import sha1
from poster.streaminghttp import register_openers


def read_data(file_object):
    #yield the file in 64 KB chunks so it never sits in memory all at once
    while True:
        r = file_object.read(64 * 1024)
        if not r:
            break
        yield r


def upload_file(filename, bucket, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY):
    length = os.stat(filename).st_size
    content_type = mimetypes.guess_type(filename)[0]
    resource = "/%s/%s" % (bucket, filename)
    url = "http://%s.s3.amazonaws.com/%s" % (bucket, filename)
    date = time.strftime("%a, %d %b %Y %X GMT", time.gmtime())

    #build the string to sign and the Authorization header
    sig_data = "PUT\n\n%s\n%s\nx-amz-acl:public-read\n%s" % (
        content_type, date, resource)
    signature = base64.encodestring(
        hmac.new(AWS_SECRET_ACCESS_KEY, sig_data, sha1).digest()).strip()
    auth_string = "AWS %s:%s" % (AWS_ACCESS_KEY_ID, signature)

    #register poster's streaming handlers and stream the file chunk by chunk
    register_openers()
    input_file = open(filename, 'rb')
    data = read_data(input_file)

    request = urllib2.Request(url, data=data)
    request.add_header('Date', date)
    request.add_header('Content-Type', content_type)
    request.add_header('Content-Length', str(length))
    request.add_header('Authorization', auth_string)
    request.add_header('x-amz-acl', 'public-read')
    request.get_method = lambda: 'PUT'
    urllib2.urlopen(request).read()


if __name__ == "__main__":
    filename = sys.argv[1]
    bucket = sys.argv[2]
    AWS_ACCESS_KEY_ID = sys.argv[3]
    AWS_SECRET_ACCESS_KEY = sys.argv[4]
    upload_file(filename, bucket, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
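Assuming you save it as, say, s3upload.py (the name is up to you), usage looks like this:

python s3upload.py meonthebeach.jpg holiday_photos YOUR_ACCESS_KEY_ID YOUR_SECRET_ACCESS_KEY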
I used to work with a guy (Hi Daniel) who got everyone he knew to send him OPML files from their RSS readers so he could find new gems to subscribe to. I'm feeling kind of bored at the moment, so I am going to repeat his experiment. Anyone who reads this, or sees the related tweet, please send me your OPML file. If your RSS reader makes it difficult to export a list of links, then by all means send them in whatever format you like.
In a week's time, I'll take the results, crunch 'em a little, and put them up for all to see, so you can get the benefit too. My email address can be grabbed from the contact link to the left. Come on, send me your links!
For the curious, here is my current list of feeds.
Readability is a bookmarklet that removes clutter from web pages to make them more readable. I read from computer screens a lot, but when it comes to longer text I actually prefer reading from the tiny screen on my mobile phone rather than from a laptop monitor.
I recently began a little reading on typography and learned of the concept of the comfortable measure: essentially, approximately 66 characters per line is regarded as the ideal width for readable text.
While Readability does not hit that mark exactly, it's a lot closer than the average over-wide web layout. Give it a try; it can return a lot of the pleasure of reading to computers.