From Delicious to Pinboard with Python

The sad news that Yahoo plans to shut down Delicious reached me this week (although there's still hope). I use Delicious pretty much every day and was a little traumatized upon hearing this. Once I had finished wailing and gnashing my teeth, I set out looking for somewhere to go.

There are many bookmarking sites/services out there, but I fear change, and Pinboard seemed like the closest thing to a plain Delicious replacement. It even supports the same API as Delicious. There's a small charge for signing up, but no recurring fee, so I broke out the credit card and joined up.

The next step was to figure out how to migrate my bookmarks. Delicious provides an export-to-HTML feature in its settings area, but a quick look at the export revealed some data was missing (mostly extended descriptions). Rabid googling revealed a lesser-known XML export mechanism. To use it, visit https://api.del.icio.us/v1/posts/all, enter your username and password, and save the resulting XML file.
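The export is a flat XML document: a root `posts` element containing one `post` element per bookmark, with the data carried as attributes (`href`, `description`, `extended`, `tag`, `time`, `shared`). Here's a minimal parsing sketch in Python 3 syntax; the sample document is made up for illustration, not a real export:

```python
from xml.dom import minidom

# Illustrative sample only -- a real export has one <post> per bookmark
sample = (
    '<posts user="example">'
    '<post href="http://example.com/" description="An example" '
    'extended="Longer notes survive in the XML export" '
    'tag="example testing" time="2010-12-20T17:12:00Z" shared="no"/>'
    '</posts>'
)

doc = minidom.parseString(sample).documentElement
# Skip any non-element nodes; each remaining node is one bookmark
posts = [n for n in doc.childNodes if n.nodeType == n.ELEMENT_NODE]
for p in posts:
    print(p.getAttribute("href"), p.getAttribute("tag"))
```

Note that the extended description is right there as an attribute, which is exactly the data the HTML export drops.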

Now to get my bookmarks into Pinboard. I broke out my trusty text editor and battered together the script below, which works just fine. A few hours later all my bookmarks are in, the bookmarklets are installed in my browser, and I'm loving the read-later features. Sean is a happy geek again.

You can download my migration script. To use it:

python backup.xml username password

Here's the source for the curious.

from xml.dom import minidom
import sys

import urllib
import urllib2
import time

user = sys.argv[2]
password = sys.argv[3]

endpoint = "https://api.pinboard.in"
url = "/v1/posts/add?"

#open the xml file to import from and parse it
f = open(sys.argv[1], "r")
doc = minidom.parse(f).documentElement

#keep count of how many urls have been imported
urlcount = 0

count = 0
ellength = len(doc.childNodes)
failcount = 0

while count < ellength:
    e = doc.childNodes[count]

    if e.nodeType == e.ELEMENT_NODE:
        print "importing url %s" % urlcount

        #get the attributes from the xml
        href = e.getAttribute("href")
        description = e.getAttribute("description")
        extended = e.getAttribute("extended")
        tags = e.getAttribute("tag")
        dt = e.getAttribute("time")
        rargs = dict(url=href, description=description, extended=extended,
                        tags=tags, dt=dt)

        shared = e.getAttribute("shared")
        if shared.strip() == 'no':
            rargs['shared'] = 'no'

        #encode the values as utf-8 for the query string
        rargs = dict([k, v.encode('utf-8')] for k, v in rargs.items())

        print rargs

        #set up http auth for pinboard
        #doing this for every request may seem wasteful, but urllib2
        #seems to forget the auth details after a half dozen requests
        #if you dont
        password_manager = urllib2.HTTPPasswordMgrWithDefaultRealm()
        password_manager.add_password(None, endpoint, user, password)

        auth_handler = urllib2.HTTPBasicAuthHandler(password_manager)
        opener = urllib2.build_opener(auth_handler)

        #build the request to send
        request = urllib2.Request(endpoint + url + urllib.urlencode(rargs))

        #set the user agent (any identifying string will do)
        request.add_header('User-agent', 'delicious-to-pinboard-import')

        try:
            #send the request and read the response
            r = opener.open(request)
            response = minidom.parse(r).documentElement.getAttribute("code")
        except Exception, e:
            response = str(e)

        #if we get an invalid response we are probably being throttled
        if response != "done":
            failcount += 1
            print "Failure: Invalid response: %s" % response
            if failcount > 4:
                print "Aborting: too many failures in a row"
                sys.exit(1)
            print "waiting for 30 seconds and retrying"
            time.sleep(30)
            continue

        failcount = 0
        urlcount += 1

        #put in a delay between requests to reduce the odds of throttling
        time.sleep(1)

    count += 1

print "%s urls imported" % urlcount
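The script above is written for Python 2 (urllib2, print statements). If you want to run something like it under Python 3, the request-building side maps onto urllib.parse and urllib.request. Here's a hedged sketch with placeholder credentials and bookmark data, and the network call itself left commented out so nothing is actually sent:

```python
import urllib.parse
import urllib.request

endpoint = "https://api.pinboard.in"

# Placeholder bookmark data standing in for the attributes read from the XML
rargs = {
    "url": "http://example.com/",
    "description": "An example bookmark",
    "tags": "example testing",
}

# urlencode handles the utf-8 escaping that the Python 2 script did by hand
request_url = endpoint + "/v1/posts/add?" + urllib.parse.urlencode(rargs)

# Basic auth via a password manager, mirroring the Python 2 script
password_manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_manager.add_password(None, endpoint, "username", "password")
opener = urllib.request.build_opener(
    urllib.request.HTTPBasicAuthHandler(password_manager))

# opener.open(request_url) would send the request; left out here so the
# sketch runs without a Pinboard account
print(request_url)
```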


Aside from the fun of using Python to do this, why not simply follow the import instructions on the Pinboard site? No scripting needed! :-)

Tim 17:12 Monday the 20th of December 2010 #

Hi Tim, take a look at the output from the HTML backup vs the XML backup; not as much information is preserved and carried across.

Sean O'Donnell 02:21 Tuesday the 21st of December 2010 #

Ah, cool - that answers that :)

Tim 12:26 Tuesday the 21st of December 2010 #
