HTML Comments

After I got the new Blog up and running, I quickly noticed that plain text comments kinda suck. I have never been a big fan of Textile, Markdown, or any of the other simplified markup languages, so I decided to stick with plain old HTML.

Plain old HTML is unfortunately not a very safe thing to allow people to stick in your comments. Malicious JavaScript, random CSS, all these things can mess you up in a hurry.

The second problem is that plenty of people out there don't know HTML and don't want to know it. For them, I decided a rich editor was in order.

I needed to figure out how to sanitize the HTML. Bold, italic and underlined text, paragraphs and hyperlinks seem to be about all you really want in the average Blog comment. I wanted a way to allow only these tags and strip out everything else.

Tom Insam's recipe using Beautiful Soup seemed to fit the bill perfectly; I only needed to modify his tag list a little.

Here's my ever so slightly modified version:

from BeautifulSoup import BeautifulSoup
import re

def sanitize(html):
    # allow these tags. Other tags are removed, 
    # but their child elements remain
    whitelist = ['em', 'i', 'strong', 'u', 'a', 'b', 'p',
                    'br', 'code', 'pre' ]

    # allow only these attributes on these tags. 
    # No other tags are allowed any
    # attributes.
    attr_whitelist = { 'a':['href','title','hreflang']}

    # remove these tags, complete with contents.
    blacklist = [ 'script', 'style' ]

    attributes_with_urls = [ 'href', 'src' ]

    # BeautifulSoup is catching out-of-order and unclosed tags, so markup
    # can't leak out of comments and break the rest of the page.
    soup = BeautifulSoup(html)

    # now strip HTML we don't like.
    for tag in soup.findAll():
        if tag.name.lower() in blacklist:
            # blacklisted tags are removed in their entirety
            tag.extract()
        elif tag.name.lower() in whitelist:
            # tag is allowed. Make sure all the attributes are allowed.
            # iterate over a copy, since attributes are removed inside the loop
            for attr in list(tag.attrs):
                # allowed attributes are whitelisted per-tag
                if tag.name.lower() in attr_whitelist and \
                    attr[0].lower() in attr_whitelist[ tag.name.lower() ]:
                    # some attributes contain urls..
                    if attr[0].lower() in attributes_with_urls:
                        # ..make sure they're nice urls
                        if not re.match(r'(https?|ftp)://', attr[1].lower()):
                            tag.attrs.remove( attr )
                    # ok, then
                    pass
                else:
                    # not a whitelisted attribute. Remove it.
                    tag.attrs.remove( attr )
        else:
            # not a whitelisted tag. I'd like to remove it from the tree
            # and replace it with its children. But that's hard. It's much
            # easier to just replace it with an empty span tag.
            tag.name = "span"
            tag.attrs = []

    # stringify back again
    safe_html = unicode(soup)

    # HTML comments can contain executable scripts, depending on the browser,
    # so we'll be paranoid and just get rid of all of them
    # e.g. <!--[if lt IE 7]><script type="text/javascript">h4x0r();</script><![endif]-->
    # TODO - I rather suspect that this is the weakest part of the operation..
    # DOTALL lets "." match newlines; inside [...] a dot would be literal.
    safe_html = re.compile(r'<!--.*?-->', re.DOTALL).sub('', safe_html)
    return safe_html
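
A note on the comment-stripping regex, since it is easy to get wrong: inside a character class, `.` is a literal dot, so a pattern like `<!--[.\n]*?-->` only matches comments made up of dots and newlines. The quick check below compares that with a DOTALL pattern. The sample string and variable names are made up for illustration.

```python
import re

html = 'before<!--[if lt IE 7]><script>h4x0r();</script><![endif]-->after'

# Inside [...] the dot is literal: this only matches comment bodies
# made of dots and newlines, so the comment above survives untouched.
literal_dot = re.compile(r'<!--[.\n]*?-->')
print(literal_dot.sub('', html) == html)  # True

# With DOTALL, "." matches any character, including newlines.
any_char = re.compile(r'<!--.*?-->', re.DOTALL)
print(any_char.sub('', html))  # beforeafter
```

A regex is still a blunt tool for this job, so treat it as a last line of defense rather than the only one.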

All comments are run through this sanitizer before being saved. If a tag is not allowed, but contains valid child tags, they are preserved (wrapped in a span instead of the original container).
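
The per-attribute check is the part that does most of the work, so here is a small standalone sketch of it. `filter_attrs` is a hypothetical helper, not part of Tom's recipe or of Beautiful Soup. It assumes attributes arrive as a list of (name, value) tuples, as the `attr[0]` / `attr[1]` access in the code above suggests, and it builds a new list instead of removing items from the list it is looping over.

```python
import re

# Same allowlists as in sanitize(); the constant names are illustrative.
ATTR_WHITELIST = {'a': ['href', 'title', 'hreflang']}
ATTRIBUTES_WITH_URLS = ['href', 'src']
SAFE_URL = re.compile(r'(https?|ftp)://')


def filter_attrs(tag_name, attrs):
    """Return only the (name, value) pairs allowed on tag_name."""
    allowed = ATTR_WHITELIST.get(tag_name.lower(), [])
    kept = []
    for name, value in attrs:
        if name.lower() not in allowed:
            continue  # attribute is not whitelisted for this tag
        if name.lower() in ATTRIBUTES_WITH_URLS and \
                not SAFE_URL.match(value.lower()):
            continue  # rejects javascript:, data: and relative URLs
        kept.append((name, value))
    return kept


attrs = [('onclick', 'evil()'), ('style', 'color:red'),
         ('href', 'javascript:alert(1)'), ('title', 'hi')]
print(filter_attrs('a', attrs))                               # [('title', 'hi')]
print(filter_attrs('a', [('href', 'http://example.com/')]))   # [('href', 'http://example.com/')]
print(filter_attrs('p', [('class', 'x')]))                    # []
```

Note that the URL check only accepts absolute http, https and ftp URLs, so relative links in comments get their href dropped as well.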

Now I needed a rich editor. I have used TinyMCE. It's very configurable and can be used for simple editors like mine, or all the way up to a very rich word processor.

To use it, include the main tiny_mce.js script on your page, followed by a second script that initializes and configures TinyMCE.

<script type="text/javascript" src="/static/blog/js/tiny_mce/tiny_mce.js"></script>
<script type="text/javascript" src="/static/blog/js/commenteditor.js"></script>

Here's the code from commenteditor.js:

tinyMCE.init(
  {
    //just turn one specific textarea into a tiny mce editor
    mode:"exact",  
    //the specific textarea has id="id_comment"
    elements : "id_comment", 
    //use the advanced theme so we can configure the exact appearance
    theme: "advanced",
    //the first row of buttons in the editor, 
    //these are the only functions I want
    theme_advanced_buttons1 : "bold,italic,underline,link,unlink", 
    theme_advanced_buttons2 : "", //make the other 2 rows empty
    theme_advanced_buttons3 : "",
    //tell tiny_mce I am working with xhtml strict (default is transitional)
    doctype: '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">', 
    //don't use inline styles, this makes sanitizing the html a lot easier
    inline_styles: false, 
    //css styles for the content in the editor
    content_css : "/static/blog/css/commenteditor.css", 
    width: "528" //the width of the control
  });

The commenteditor.css file contains the CSS styles used for the content of the editor. This allows you to set its background, font size, etc. to match how you style the rendered comments, giving a real WYSIWYG experience.

And voilà, you can see the nice rich editor at the bottom of this page. Leave a comment to try it out :)

Comments

Yo. To strip blacklisted tags use:

            tag.replaceWith(tag.contents[0])

tezro 22:47 Sunday the 30th of August 2009 #

Here is a more reliable way to remove comments:

from BeautifulSoup import Comment

comments = soup.findAll(text=lambda text:isinstance(text, Comment))

[comment.extract() for comment in comments]

Chase Seibert 15:39 Friday the 26th of February 2010 #

Thanks, Chase. In my case I wanted to allow some tags, not simply strip the text in every case, but I had not seen the Comment class before. Very handy.

Sean O'Donnell 15:51 Friday the 26th of February 2010 #

To strip non-whitelisted tags, use the following.

In the sanitize(html) function, at this comment:
# not a whitelisted tag. I'd like to remove it from the tree
# and replace it with its children. But that's hard.

This is easily solved with:
tag.hidden = True

Good luck and thanks for sharing this script with us.

lucas 13:07 Tuesday the 14th of September 2010 #
