How to change a whitelist into a blacklist

At work we had an interesting requirement a few days ago: we display user input in descriptions and comments and want to allow both Markdown and HTML. The Markdown rendering is done in JavaScript on the client side. But sanitizing the input, which means fixing the HTML (users do forget to close tags) and disallowing certain tags like script, is done on the server side before it reaches the clients' browsers.

Bleach

So we came to bleach, an HTML sanitizer. And bleach is great! It takes a list of allowed tags and a dictionary of allowed attributes per tag, and it can do more, like filtering the allowed classes and styles per element. It also accepts a function that decides whether an attribute is allowed for a given tag.

Sadly it didn't accept a function to decide whether a tag is allowed at all. So I went to GitHub, forked the repo and created a patch/pull request.

My patch let you pass bleach.clean a function to check for the allowed tags. I also wrote unit tests for passing functions as allowed_attributes and allowed_tags. Unfortunately the maintainer declined my pull request because it makes it very easy to change the behaviour of bleach from a whitelist to a blacklist. Well duh, that was the intention. But I do understand and respect that decision!

So is there another way to solve our little problem?

Whitelist in bleach

Let's take a look at how bleach does the whitelisting. Don't worry, it's really easy: bleach uses the given list of allowed tags as allowed_elements:

if element in allowed_elements:
    # Do the rest of the parsing

But what if allowed_elements isn't a list? What if it's a custom object that just happens to implement __contains__()?
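
To see why this can work, here is a small sketch that does not use bleach at all. It assumes only that the check is a plain "in" test, as in the snippet above. The names AnyTag and filter_elements are made up for this illustration and are not part of bleach.

```python
class AnyTag(object):
    """Hypothetical container that claims to contain every value."""

    def __contains__(self, value):
        # Python calls this method for "value in AnyTag()".
        return True


def filter_elements(elements, allowed_elements):
    """Stand-in for a whitelist check: keep what is 'in' allowed_elements."""
    return [element for element in elements if element in allowed_elements]


elements = ['p', 'script', 'em']

# With a real list, only the listed tags pass the check.
kept_with_list = filter_elements(elements, ['p', 'em'])

# With the custom object, the same "in" test lets everything through.
kept_with_object = filter_elements(elements, AnyTag())
```

The filtering code never asks what kind of object it was given. It only asks whether an element is "in" it, and the object is free to answer however it likes.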

Blacklist script-tag instead of whitelisting everything else

Let's write a little blacklist object that inverts the behaviour of __contains__.

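import bleach

class BlackList(object):
    # "in" calls __contains__, so report every tag except script as allowed.
    def __contains__(self, value):
        return value not in ['script']

# html holds the user-supplied markup to sanitize.
html = bleach.clean(html, tags=BlackList())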

Done. This turns the whitelist of bleach.clean() into a blacklist that allows every tag except script.

Our version is a little more advanced: it also takes a list argument in __init__ to set an extendable list of forbidden tags.
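
A minimal sketch of what such an extended version could look like follows. It is not our production code: the attribute name forbidden and the default of ['script'] are assumptions made for this example.

```python
class BlackList(object):
    """Pretends to contain every tag except the forbidden ones."""

    def __init__(self, forbidden=None):
        # Assumption for this sketch: forbid only script by default.
        if forbidden is None:
            forbidden = ['script']
        # Copy the argument so later changes do not touch the caller's list.
        self.forbidden = list(forbidden)

    def __contains__(self, value):
        # Inverted membership: a tag is "allowed" unless it is forbidden.
        return value not in self.forbidden


tags = BlackList(['script', 'iframe'])
# The list stays extendable after construction.
tags.forbidden.append('object')
```

The object is then passed as the tags argument exactly like the simple version.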

Of course this depends on the way bleach works internally, which might change in future versions. But that is one of the reasons we have unit tests. Not only do they protect against developers changing one part of the code and breaking a completely different corner they didn't think of. Unit tests also protect against changed behaviour in your dependencies...

Update 2014-10-14: Gave a small talk (slides) about this at the Leipzig Python User Group.