Shouldn't we allow users to write HTML?

Say you have a website that allows users to make posts or update data that other users can see. Obviously, if we let them write HTML, there's going to be all sorts of XSS concerns, right?

Hello fellow users, I'm just writing to say <script> sendMeAllYourCookies(); </script>

So we tend to block all HTML by stripping out all the tags and thus the above becomes something like ...

Hello fellow users, I'm just writing to say sendMeAllYourCookies();

... and boom! You're protected from all sorts of XSS attacks.

But sometimes we want users to use HTML.

Sometimes we need to allow styling -- bold, underline, bullet points and links.

There are a bunch of solutions for this, but the most common solution is to use a different markup language that gets parsed into HTML.

So, which markup language should we use?

My first thought was markdown. It seemed (initially) like a great option because it ticks all the boxes of what you'd need.

Large community, so lots of updates.
Widely used, so less training for users.
Supports all the styles we'd want users to use.

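The sad part is that Markdown is actually very vulnerable to XSS attacks. There's an obvious reason for this that I'll get to, but the simple reason is that it allows users to write JavaScript: raw inline HTML is valid Markdown, and unless the renderer filters them, links can carry javascript: URLs.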

You can configure most of the issues away, but I don't want to be saddled with the potential zero-day vulnerabilities and configuration know-how associated with that.

Also in my case I'd be importing a large library with lots of functions that I would never use.

Alright then, let's create our own markup language.

I'm a sucker for re-inventing the wheel, so this is what I wanted to do all along. Sadly, it's a bad idea, mostly for maintenance reasons, but also because it won't guarantee the protection from XSS that you really want.

In fact, there's only one way to get the most protection and that's just to allow users to write HTML.

Back to life, Back to HTML

Think about it. If you create your own markup language, or use a markup language that's available out there, that language will be compiled down into HTML and HTML is where the real vulnerabilities are. Using a whole other markup language only abstracts away the danger and makes it harder to see the end product.
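
To make that concrete, here is a minimal sketch of a made-up markup converter. The function name toyMarkupToHtml and its two-rule syntax are hypothetical and exist only for illustration. The point is that the attacker's javascript: URL passes straight through into the generated HTML, just hidden behind friendlier syntax.

```javascript
// Hypothetical toy markup: **bold** and [label](url), nothing else.
function toyMarkupToHtml(source) {
    return source
        // **text** becomes <strong>text</strong>
        .replace(/\*\*(.+?)\*\*/g, '<strong>$1</strong>')
        // [label](url) becomes <a href="url">label</a>, with no check on the URL
        .replace(/\[([^\]]+)\]\((\S+)\)/g, '<a href="$2">$1</a>');
}

const payload = 'Click [this link](javascript:sendMeAllYourCookies())';
console.log(toyMarkupToHtml(payload));
// → Click <a href="javascript:sendMeAllYourCookies()">this link</a>
```

The custom syntax did nothing to remove the danger. The check still has to happen on the HTML that comes out the other end.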

In the end, the <strong> tag without any attributes can never ever do you any harm. We know this because that's exactly what everyone is doing all the time when we write HTML for our sites.

Essentially all we want is a strict subset of HTML.

Let's do it

Firstly, you'll probably want to strip all unwanted tags on the server side before the data makes it into your database. Note: regex is inadequate for this, use a DOM parser. I'll not get into this too much as there are too many different flavours of back end code.
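
As one illustration of why a regex falls short, here is a rough sketch of a single-pass tag stripper being defeated. naiveStripScripts is a hypothetical helper written for this example, not a recommendation, and real bypasses come in many more shapes than this one.

```javascript
// Naive approach: delete anything that looks like an opening or closing <script> tag.
function naiveStripScripts(input) {
    return input.replace(/<\/?script[^>]*>/gi, '');
}

// The attacker nests one tag inside another, so that removing the inner
// tags reassembles the outer ones.
const payload = '<scr<script>ipt>sendMeAllYourCookies();</scr</script>ipt>';
console.log(naiveStripScripts(payload));
// → <script>sendMeAllYourCookies();</script>
```

A DOM parser avoids this class of problem because it works on the parsed tree rather than on the raw characters.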

The front end is where an XSS attack actually executes, so the front end code is the most important line of defence.

We'll be using the DOMParser web API to process HTML.

DOMParser is extremely fast and safe. It will parse most HTML strings in microseconds and won't execute any code contained within.

So, let's say a user submitted the following good HTML:

My name is <strong>Jake</strong>

Let's parse this:

const parser = new DOMParser();
const dom = parser.parseFromString("My name is <strong>Jake</strong>", 'text/html');

Using the dom variable, we're going to have to iterate through each element's childNodes to get the result we want.

Note: I'm using lit-html here, because it's awesome. If you want a vanilla JS example see further below.

function processElement(element) {
    const result = [];
    for (let i = 0; i < element.childNodes.length; i++) {
        result.push(outputNode(element.childNodes[i]));
    }
    return result;
}

The above is our main iterator function. It just walks through the element's child nodes and collects the output for each one. (lit-html supports rendering arrays of values.)

import { html } from 'lit-html';

function outputNode (element) {
    const nodeName = element.nodeName.toLowerCase();
    switch (nodeName) {
        case 'strong':
            return html`<strong>
                ${processElement(element)}
            </strong>`;
        default:
            if (element.nodeType === Node.TEXT_NODE) return element.nodeValue;
            return element.innerText;
    }
}

Here is where the magic happens. We check the name of the HTML tag. If it's in our explicit list of allowed elements, then we output that element as if we created it ourselves. Otherwise we output the innerText of the element, which strips all the inner tags.

Note: You could use innerHTML instead of innerText without opening a hole, because lit-html inserts interpolated strings as text rather than parsing them as HTML. The disallowed tags would then show up as literal text on the page instead of being stripped.

import { render } from 'lit-html';
const result = processElement(dom.body);
render(result, document.body);

Then we're done.

Note: We start with dom.body because dom is essentially a reference to the html document. The html document holds the <html> tag, which then holds the <head> and <body> tags. Our initial string was automatically wrapped with these.

Vanilla JS Example

With vanilla JS we want to pass in the output element so we can control when to create elements and when to append plain text to them. We also want to allow nested nodes to pass themselves as the second argument to processElement so that their child nodes can be rendered appropriately.

function processElement(element, outputElement) {
    for (let i = 0; i < element.childNodes.length; i++) {
        outputNode(element.childNodes[i], outputElement);
    }
}

function outputNode (element, outputElement) {
    const nodeName = element.nodeName.toLowerCase();
    switch (nodeName) {
        case 'strong': {
            // Allowed tag: create a fresh element and recurse into its children.
            const elem = document.createElement("strong");
            outputElement.append(elem);
            processElement(element, elem);
            break;
        }
        default:
            // Append text nodes rather than assigning innerText, which would
            // replace any child elements that were already appended.
            if (element.nodeType === Node.TEXT_NODE) {
                outputElement.append(document.createTextNode(element.nodeValue));
            } else if (element.nodeType === Node.ELEMENT_NODE) {
                outputElement.append(document.createTextNode(element.textContent));
            }
    }
}

Note: We don't want to update innerHTML of ANY element on the page with user values. That way, unless you do something insane like include the "script" tag in your switch statement of allowed tags, we can be pretty sure we're secure.
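
If the list of allowed tags grows, a switch statement gets repetitive. Below is a rough sketch of the same idea driven by a Set. The names ALLOWED_TAGS and copyAllowed, the particular tags listed, and the extra doc parameter are my own assumptions for illustration and are not part of the code above.

```javascript
// Same numeric values as Node.ELEMENT_NODE and Node.TEXT_NODE in browsers.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical allowlist of attribute-free tags; adjust to your needs.
const ALLOWED_TAGS = new Set(['strong', 'em', 'u', 'ul', 'ol', 'li']);

// Copies the children of `source` into `target`, rebuilding allowed elements
// from scratch and flattening everything else to plain text.
// `doc` is the document used to create nodes (normally `document`).
function copyAllowed(source, target, doc) {
    for (const child of Array.from(source.childNodes)) {
        if (child.nodeType === TEXT_NODE) {
            target.appendChild(doc.createTextNode(child.nodeValue));
        } else if (child.nodeType === ELEMENT_NODE) {
            const tag = child.nodeName.toLowerCase();
            if (ALLOWED_TAGS.has(tag)) {
                // Fresh element: no attributes are carried over.
                const clean = doc.createElement(tag);
                target.appendChild(clean);
                copyAllowed(child, clean, doc);
            } else {
                // Disallowed tag: keep only its text.
                target.appendChild(doc.createTextNode(child.textContent));
            }
        }
        // Comments and other node types are dropped.
    }
}
```

In a browser this would be called along the lines of copyAllowed(dom.body, outputElement, document), where dom is the document returned by DOMParser.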

Why, tho?

The problem with XSS vulnerabilities and markup languages is that everything in the end has to be compiled into HTML. You're probably not going to get away from that by using some other markup language that promises to be secure. You're mostly going to import their vulnerabilities as their library expands and requires more functionality for more users.

You're better off just allowing users to use a subset of HTML.

There are a few pitfalls to be aware of.

HTML Attributes

The moment you start allowing HTML attributes is where things get dicey.

Let's say we add functionality for the <a> tag.

function outputNode (element) {
    const nodeName = element.nodeName.toLowerCase();
    switch (nodeName) {
        case 'a':
            // Vulnerable: the href value is passed through unchecked.
            return html`<a href=${element.getAttribute('href')}>
                ${processElement(element)}
            </a>`;
        case 'strong':
            return html`<strong>
                ${processElement(element)}
            </strong>`;
        default:
            if (element.nodeType === Node.TEXT_NODE) return element.nodeValue;
            return element.innerText;
    }
}

Doing this will introduce an XSS vulnerability where an attacker could include the following code:

Hey there, click <a href="javascript: sendMeAllYourCookies();">this link</a> to get a million dollars.

So, when the user clicks the link, an XSS attack will occur.

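It's easy to mitigate this kind of attack by adding sanitization:

function sanitizeHref(hrefString) {
    // Only pass through absolute https links; anything else becomes a harmless "#".
    if (typeof hrefString === "string" && hrefString.toLowerCase().startsWith("https://")) {
        return hrefString;
    }
    return "#";
}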

Then updating that bit of code to:

        case 'a':
            return html`<a href=${sanitizeHref(element.getAttribute('href'))}>
                ${processElement(element)}
            </a>`;

And that will mitigate the attack. In fact, it's probably good to look at the OWASP XSS Prevention Cheat Sheet before allowing other tags and attributes.
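
If you would rather not rely on string matching, one possible variation is to let the platform's URL parser decide what the scheme is. This is only a sketch: safeHref is a hypothetical name, and returning null for rejected links (so the caller can omit the attribute) is my own assumption about how you'd want to handle them.

```javascript
// Hypothetical stricter variant: parse the value instead of string matching.
// Returns a normalized https URL, or null when the link should be dropped.
function safeHref(rawHref) {
    if (typeof rawHref !== 'string') return null;
    let url;
    try {
        // Without a base URL, relative links throw and are rejected.
        url = new URL(rawHref);
    } catch {
        return null;
    }
    return url.protocol === 'https:' ? url.href : null;
}

console.log(safeHref('https://example.com/page'));            // → https://example.com/page
console.log(safeHref('javascript: sendMeAllYourCookies();')); // → null
console.log(safeHref('/relative/path'));                      // → null
```

The URL constructor is available in browsers and in Node, so the same check can run on both the front end and the back end.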

That said, I still think, despite vulnerabilities like these, that this is a better method for allowing users to add custom styles to pages. The vulnerability above is not unique to this method. It is something that is present throughout all links that possibly contain stored or reflected data on your site already. Thus, it introduces nothing new and you should already be doing a lot to protect yourself from these kinds of attacks.

Hopefully, this doesn't need to be said.

You should never allow HTML attributes that give potential attackers direct access to the user's JavaScript engine, such as onclick, onload or onerror.
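
One way to enforce that is to treat attributes exactly like tags: nothing gets through unless it is on an explicit list. The sketch below assumes a hypothetical ALLOWED_ATTRIBUTES map and pickAllowedAttributes helper, neither of which appears in the code above.

```javascript
// Hypothetical per-tag attribute allowlist; anything not listed is dropped.
const ALLOWED_ATTRIBUTES = new Map([
    ['a', ['href']],
    ['strong', []],
]);

// `attributes` is an array of [name, value] pairs, which in a browser could be
// built with Array.from(element.attributes, (attr) => [attr.name, attr.value]).
function pickAllowedAttributes(tagName, attributes) {
    const allowed = ALLOWED_ATTRIBUTES.get(tagName.toLowerCase()) || [];
    return attributes.filter(([name]) => allowed.includes(name.toLowerCase()));
}

console.log(pickAllowedAttributes('a', [
    ['href', 'https://example.com'],
    ['onclick', 'sendMeAllYourCookies()'],
]));
// → [ [ 'href', 'https://example.com' ] ]
```

This only filters attribute names. The values that survive, such as href, still need their own checks.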

Adham Jongsma's T-shaped Journey