Markdown-Z

#tech#zig

For the past month I've been working on a Markdown parser variant I call Markdown-Z (MDZ), which adds many quality-of-life improvements over CommonMark, including commonly requested Markdown extensions. It's written in Zig but compiles to WASM, so it can be used either as a standalone binary or as a JavaScript package. It's available now on JSR - go check it out!


Motivation

I use Markdown daily for my personal notes, journals, and websites (this very page you're reading is generated from Markdown). To say that I use Markdown often would be an understatement, which is why these pain points are so evident to me. My motivation for creating yet another Markdown-like syntax boils down to a few things:

Ambiguous Syntax

Markdown is a plain text format originally designed to make writing HTML documents simpler for technical and non-technical people alike. When the idea was devised in 2004, there was no formal specification and there were no extensions. In later years the CommonMark specification was created to standardize the different parser implementations, but it still falls short in a number of cases. The CommonMark-defined syntax can be inconsistent and unintuitive at times, leading to strange and unexpected behavior if you're not familiar with it. Here are some examples that come to mind:

This doesn't even take into account the fact that the landscape of the web has changed dramatically in the past 20 years. There are so many new HTML elements and CSS styles that allow us to style our web pages in unique ways. There have been numerous Markdown extensions created over the years to address these feature gaps, but it would be simpler to incorporate these into the official specification.

Lacking Performance

When analyzing the performance of my static site generator ss, I noticed that one of the larger bottlenecks was markdown-it parsing time. The bottleneck remained even after removing my configured extensions. markdown-it appears reasonably fast for small numbers of files, but performance drops off significantly beyond 500 files.

I even considered writing my own CommonMark-compliant parser, but due to the nature of the syntax, a lot of input backtracking and memory retention is involved. The spec even recommends parsing the input in two passes, effectively doubling the minimum time required to produce output. Two-pass parsing also eliminates the potential for streamed inputs and outputs, further reducing efficiency.

I decided at this point that it would be simpler to design a new syntax that avoids these problems entirely.

Learning

I also have personal reasons for wanting to write a new parser. Lately I've been loving Zig and Web Assembly and I was itching to build something with both as an escape from the stresses of life (I'm not stressed, who's stressed!?). This is the perfect project for me!

Why Not Djot?

I can't finish this post without mentioning Djot, another plain text format syntax designed by CommonMark's creator for the same reasons I mention above. Djot learns from the mistakes of Markdown and creates a brand new syntax designed for the modern web. Although it looks very promising to me, there are two reasons keeping me from switching to Djot:

  1. The ubiquity of Markdown syntax is hard to leave. Djot syntax is different enough from Markdown that if I ever needed to use a Djot input in a Markdown parser, I would probably need to rewrite the whole input.

2. Over-complication of features. Some extensions of Markdown are much easier to read than their Djot counterparts, defeating the purpose of Markdown's simplicity. For example, I hate that I have to wrap raw HTML in a dedicated code block with =FORMAT off to the side. It's just unnecessary noise when in actuality I just want my Markdown to integrate seamlessly with HTML.

For all of these reasons, I decided to write my own Markdown-like specification and parser. Let's walk through how it works!

Goals

Markdown-Z has a few goals it adheres to:

  • Simplicity. There should be exactly one syntax to write a "block".
  • Modernization. There's no need to support older syntax styles merely because they hold historical significance.
  • Efficiency. Implementations should be able to parse documents with no backtracking and zero heap allocations in O(n) time.
  • WYSIWYG (What You See Is What You Get). Whitespace should be passed through as-is, without omission.
  • Close compatibility. Excluding extensions, the defined markup syntax should produce similar output to other CommonMark implementations to preserve relative backwards compatibility with existing Markdown. In most cases, an MDZ document is just a more strictly-formatted CommonMark document.

Architecture

Markdown-Z is written primarily in Zig with a small amount of JavaScript compatibility code to generate the WASM build output. I originally tried writing a recursive parser that operated on data in ArrayLists, but I realized I could clean up the code and reduce memory allocation to zero by processing the input one line at a time. This also means that inputs and outputs can be progressively streamed, improving throughput.

The main function parseMDZ accepts an input Reader and an output Writer and streams from one to the other. I had a bit of trouble figuring out how to read from the Reader line by line, handle both LF and CRLF line endings, and gracefully treat the end of the stream as the end of the input (rather than returning an error). The Zig standard library comes with Reader.takeDelimiterExclusive, but it doesn't handle these scenarios, so I ended up writing my own:

/// Custom implementation of `std.io.Reader.takeDelimiterExclusive` to
/// account for different line endings (LF/CRLF) and optional EOF LF.
fn takeNewlineExclusive(r: *Reader) Reader.DelimiterError![]u8 {
    const result = r.peekDelimiterInclusive('\n') catch |err| switch (err) {
        Reader.DelimiterError.EndOfStream, Reader.DelimiterError.StreamTooLong => {
            const remaining = r.buffer[r.seek..r.end];
            if (remaining.len == 0) return error.EndOfStream;
            r.toss(remaining.len);
            return remaining;
        },
        else => |e| return e,
    };
    r.toss(result.len);

    if (result.len > 1 and result[result.len - 2] == '\r') {
        @branchHint(.cold); // stop using Windows!
        return result[0 .. result.len - 2];
    }

    return result[0 .. result.len - 1];
}

We can then use this to gracefully loop over the lines:

while (takeNewlineExclusive(r)) |line| {
    // do something with line
} else |e| switch (e) {
    Reader.DelimiterError.EndOfStream => {}, // end of input
    Reader.DelimiterError.ReadFailed,
    Reader.DelimiterError.StreamTooLong,
    => |err| return err,
}

I use this function to call processLine on each line, which parses the line and writes its HTML output.

Concepts

Before I go further, I need to explain conceptually how a Markdown document is constructed. The abstract syntax tree is composed of blocks and inline blocks. Blocks are top-level markers that translate to HTML tags, can span multiple lines, and can usually be nested. Inline blocks are markers placed within the text content inside top-level blocks. For example, > hello translates to a blockquote block with the inline text content "hello". This means that a Markdown document can be thought of as a tree of blocks.

State

Let's talk about how internal state is managed. Because Markdown is composed of nested blocks, some form of state must persist across lines to know whether a block is opening or closing and what children live inside it. I use a static integer-backed array that acts as a stack to capture the current state of open blocks for each line.

pub const Block = enum(u4) {
    /// Used for initialization and falsy checks only
    nil,
    // line blocks
    block_quote,
    unordered_list,
    ordered_list,
    // leaf blocks
    paragraph,
    paragraph_hidden,
    code_block,
    html_block,
    footnote_reference,
    table,
};

var stack = [_]Block{.nil} ** 16;

The array has a fixed length of 16 to bound nesting depth and fail early for invalid documents. Because the length is static, no allocation is necessary. It's also extremely unlikely that a real document would ever go past 16 nested blocks...

> > > > > > > > > > > > > > > > > is this real markdown?

We also need to keep track of inline block state. Inline blocks are more complicated to handle, as some of them can carry multiple values. For example, the link [my picture](https://example.com) contains both the link target (https://example.com) and the link text (my picture). Inline blocks are also self-contained within their parent and usually within a single line. For inline blocks I use a packed struct bitfield to track whether each inline type is open. I don't check whether the inline block has a valid closing tag, as that should be the responsibility of the writer.

const InlineFlags = packed struct {
    is_em: bool = false,
    is_strong: bool = false,
    is_code: bool = false,
    is_link: bool = false,
    is_footnote_citation: bool = false,
    is_img: bool = false,
    is_strike: bool = false,
    is_del: bool = false,
    is_ins: bool = false,
    is_mark: bool = false,
};

var flags = InlineFlags{};
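The flags then get flipped as marker characters are consumed. Here's a rough, self-contained sketch of the idea (illustrative only, with a trimmed-down InlineFlags; the real marker-handling logic in MDZ is more involved):

```zig
const std = @import("std");

const InlineFlags = packed struct {
    is_em: bool = false,
    is_code: bool = false,
};

pub fn main() void {
    // Toggle a flag each time its marker is seen; the open/close state
    // tells the writer whether to emit an opening or closing tag.
    var flags = InlineFlags{};
    for ("some *emphasized* text") |c| switch (c) {
        '*' => flags.is_em = !flags.is_em, // <em> or </em>
        '`' => flags.is_code = !flags.is_code, // <code> or </code>
        else => {},
    };
    std.debug.assert(!flags.is_em); // markers were balanced
}
```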

I also keep track of footnotes and the number of references of each footnote in a static array [128]u8. Each index is a footnote reference number and each value is the number of times that number has been referenced. When the footnote citations are rendered, I print the same number of back reference links as the value at the footnote's index in the array. This implementation has a few limitations of course:

  • you can only write a footnote with the symbol being a number from 0-127.
  • you can only reference that symbol up to 255 times.
  • if you write the footnote citation before all footnote references, they won't be counted.

In my use case, none of these limits are likely to be hit. I've never written a footnote with an alphabetic reference, and I've never referenced a footnote more than 255 times; in most papers I read, that's more than enough. I've also never referenced a footnote after writing the footnote citation in Markdown, since it only makes sense to put citations at the end.
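As a rough sketch of the counting scheme described above (illustrative only, not the actual MDZ code):

```zig
const std = @import("std");

// Index = footnote reference number, value = times it has been referenced.
var footnote_refs = [_]u8{0} ** 128;

pub fn main() void {
    // Each time a reference like [^3] is parsed, bump its counter.
    footnote_refs[3] += 1;
    footnote_refs[3] += 1;
    // When rendering citation 3, emit this many back-reference links.
    std.debug.assert(footnote_refs[3] == 2);
}
```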

These are the core components of Markdown-Z's internal state management. Let's get back to the implementation.

processLine

processLine is the function that parses the block structure of a line. Once the block structure has been established, processInlines reads the contents within the block and writes them to the output.

processLine starts by validating the existing block stack against the block markers present on the new line. For example, if we had the following state:

const stack = [_]Block{ .block_quote, .paragraph } ++ [_]Block{.nil} ** 14;

We would know that for the following line:

> more content

The blockquote has already been opened and a paragraph has already been started, so the text content "more content" should be appended to the contents of the previous paragraph with a newline character.

Once the existing blocks have been validated, the function checks whether the line is empty. If so, we close the blocks at that level. If not, we check each block type to see whether it can open based on the line's contents.
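The dispatch described above can be sketched roughly like this (a hypothetical, heavily simplified helper; the real processLine also updates the stack and handles many more block types):

```zig
const std = @import("std");

// Decide what a line means for the block structure.
fn lineAction(line: []const u8) []const u8 {
    if (line.len == 0) return "close open blocks";
    if (std.mem.startsWith(u8, line, "> ")) return "open/continue blockquote";
    if (std.mem.startsWith(u8, line, "# ")) return "open heading";
    return "append paragraph text";
}

pub fn main() void {
    std.debug.assert(std.mem.eql(u8, lineAction("> Hi."), "open/continue blockquote"));
    std.debug.assert(std.mem.eql(u8, lineAction(""), "close open blocks"));
}
```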

For example, consider a new document that looks like this:

# Hello, world!

> Hi.
> Bye.

On line 1, we identify the # heading 1 marker and create a heading with the content "Hello, world!". We ignore line 2 because blank lines only close open blocks, and none are open. On line 3, we identify a new <blockquote>, open a <p> within it by default, and append the contents "Hi.". Finally, on the last line, we validate that a blockquote and paragraph are already open, append a newline and the contents "Bye.", then close the opened blocks on the stack from top to bottom. The end result looks like this:

<h1>Hello, world!</h1>
<blockquote>
<p>Hi.
Bye.</p>
</blockquote>

Because the output is progressively written to the output Writer, it's possible to stream it or batch the write calls to improve efficiency and throughput.

Improvements

Heading IDs

One quality-of-life improvement made in Markdown-Z is built-in auto-generation of heading IDs for headings other than <h1>. I added this because my static site generator was parsing each document into a DOM just to add heading IDs, even though we already know the contents of each heading while converting the Markdown. This gave me an opportunity to improve on my original implementation and write a more efficient heading ID generator. Before, I was using chained loops with notoriously slow regex:

const id = String(x)
  .replace(/[^\w\s]/g, " ")
  .trim()
  .toLocaleLowerCase()
  .replace(/(\s|-|_)+/g, "-");

But I've since rewritten it to work efficiently in a single loop:

pub fn slugify(x: []const u8, output: []u8) usize {
    std.debug.assert(output.len >= x.len);
    var i: usize = 0;
    var is_prev_delimiter = false;

    for (x) |c| switch (c) {
        'A'...'Z' => {
            output[i] = std.ascii.toLower(c);
            i += 1;
            is_prev_delimiter = false;
        },
        '0'...'9', 'a'...'z' => {
            output[i] = c;
            i += 1;
            is_prev_delimiter = false;
        },
        else => {
            if (i > 0 and !is_prev_delimiter) {
                output[i] = '-';
                i += 1;
                is_prev_delimiter = true;
            }
        },
    };

    // Guard against i == 0 (input with no alphanumeric characters at all),
    // and drop a trailing '-' if one was emitted.
    return if (i > 0 and output[i - 1] == '-') i - 1 else i;
}

This implementation loops over every character individually and decides how to transform the character before writing it to the output buffer (if at all).
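For example (a hypothetical usage of the function above):

```zig
var input = "Hello, World!".*; // mutable copy, so it coerces to a slice
var output: [32]u8 = undefined;
const len = slugify(&input, &output);
// output[0..len] is "hello-world"
```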

Backslash Escaping

One complicated quirk of CommonMark that I've fixed in Markdown-Z is backslash escaping: I've made it universal. In CommonMark, backslashes usually escape the characters they precede:

Hello, \*world\*! => <p>Hello, *world*!</p>

But this doesn't work universally for some reason:

Hello, `code is \`cool`! => <p>Hello, <code>code is \</code>cool`!</p>

I decided to make backslashes work universally in all cases because it feels the most intuitive to me. The only case where they don't apply is code blocks, because I don't modify the contents of code blocks at all except for HTML escaping.

Now this outputs what you might expect:

Hello, `code is \`cool`! => <p>Hello, <code>code is `cool</code>!</p>

Conclusion

Writing this parser taught me a lot about both Zig and WASM, and I had a lot of fun programming it. It's the most efficient Markdown parser I've ever used, and I'm excited to keep tinkering with it as time goes on. Go check it out and leave me a star on GitHub if you like it!