Saturday 3 September 2022

Finite State Machine Tokenizers

"If you solve a problem with a regex, now you have two problems"

I have a dream! That one day software developers who have to do string processing will be just as likely to reach for a finite state machine as they are to start writing regular expressions. FSM's are not a magic bullet, but for some problems they're perfect: Easy to write, easy to understand, easy to test.

For the sake of this article, lets say that we want to grab the comments out of a c or java source file. These are the rules:

  • A line comment starts with a double slash and ends at the first newline
  • A block comment starts with slash star and ends at the first star slash
  • A double slash within a string literal does not start a comment
  • Slash star within a string literal does not start a comment
  • A string literal starts with doublequote and ends at the first double quote
  • A string literal does not end at a double quote preceded by a backslash
This is the definition of a FSM tokenizer that grabs the comments. This is not a real format, but I think it might be valid YAML. I also think it could be a fascinating project in java to implement a general purpose FSM tokenizer where the FSM is defined in yaml.

Start in the CODE state and for each character in the string do what your current state tells you, including moving to a new state.

---
CODE: {
'/': ONESLASH
'"': INQUOTE
?: CODE
}

ONESLASH: {
'/': LINECOMMENT
'*': BLOCKCOMMENT
?: CODE
}

LINECOMMENT: {
'\n': [endToken, CODE]
?: [appendTheChar, LINECOMMENT]
}

BLOCKCOMMENT: {
'*': BLOCKSTAR
?: [appendTheChar, BLOCKCOMMENT]
}

BLOCKSTAR: {
'/': [endToken, CODE]
?: [append:'*', appendTheChar; BLOCKCOMMENT]
}

INQUOTE: {
'"': CODE
'\\': ESCAPE
?: INQUOTE
}

ESCAPE: {
'"': INQUOTE
?: INQUOTE
}

Some of you are already familiar with this method of string processing. Some of you think that this type of FSM is better expressed with the characters at the root level. That is, all the state transitions that can be triggered by (say) a slash are grouped together, rather than all the transitions that start in (say) a quoted string being grouped together. "You're wrong and I hate you".

One of my teams has recently started to use the language Kotlin, so over the bank holiday weekend I gave myself a personal coding project: To write a FSM tokenizer in Kotlin and to make it as easy to read as I could. This what I ended up with.

/**
 * base class for Finite State Machine based tokenizers (lexers)
 */
abstract class FSMTokenizer {
    var token = StringBuilder();    // the current token
    var tokens = ArrayList<String>(); // all the tokens

    operator fun plus(c: Char?){    // add a character to the current token
        token.append(c);
    }

    operator fun plus(s: String){   // add a string to the current token
        token.append(s);
    }

    fun split() {                   // end one token and start the next
        tokens.add(token.toString());
        token = StringBuilder();
    }

    fun tokenize(s: String): List<String> { // feed the characters of a string into a Finite State Machine
        for (c in s) {
            step(c)
        }
        step(null) // the FSM is given a null after the last character - it might have some tidying up to do
        return tokens;    
    }

    abstract fun step(c: Char?)     // deal with one character. You need to implement this yourself
}

/*
 * tokeniser which grabs only the comments out of a java source file
 */
class FSMComments(): FSMTokenizer() {
    var state: States = States.CODE; // start the machine in the CODE state

    override fun step(c: Char?) = state.step(this, c) // pass this tokenizer into the enum class

    enum class States {                    // FSM model
        CODE, ONESLASH, LINECOMMENT, BLOCKCOMMENT, BLOCKSTAR, INQUOTE, ESCAPE;
        fun step(tokens: FSMComments, c: Char?) {
            tokens.state = when(tokens.state) { // our implementation of step() updates the state of the machine
            CODE ->
                when (c) {
                    '/' -> ONESLASH
                    '"' -> INQUOTE
                    else -> CODE
                }
            ONESLASH ->
                when (c) {
                    '/' -> LINECOMMENT
                    '*' -> BLOCKCOMMENT
                    else -> CODE
                }
            LINECOMMENT ->
                when (c) {
                    '\n' -> {tokens.split(); CODE}
                    else -> {tokens+c; LINECOMMENT}
                }
            BLOCKCOMMENT ->
                when (c) {
                    '*' -> BLOCKSTAR
                    else -> {tokens+c; BLOCKCOMMENT}
                }
            BLOCKSTAR ->
                when (c) {
                    '/' -> {tokens.split(); CODE}
                    else -> {tokens+'*'; tokens+c; BLOCKCOMMENT}
                }
            INQUOTE ->
                when (c) {
                    '"' -> CODE
                    '\\' -> ESCAPE
                    else -> INQUOTE
                }
            ESCAPE ->
                when (c) {
                    '"' -> INQUOTE
                    else -> INQUOTE
                }
            }
        }
    }
}

/*
 * tokeniser which splits a string on commas. It eats leading whitespace and multiple commas
 */
class FSMCommas(): FSMTokenizer() {
    var state: States = States.COMMA;    // start the machine in the COMMA state

    // extra operators can be implemented here

    override fun step(c: Char?) = state.step(this, c)

    enum class States {                    // FSM model
        TEXT, COMMA;
        fun step(tokens: FSMCommas, c: Char?) {
            tokens.state = when(tokens.state) {
            TEXT ->
                when (c) {
                    ',', null -> {tokens.split(); COMMA}
                    else -> {tokens+c; TEXT}
                }
            COMMA ->
                when (c) {
                    ',', ' ', '\t', '\n' -> COMMA
                    else -> {tokens+c; TEXT}
                }
            }
        }
    }
}

fun main() {
    val helloWorld = """
        /* The code below will print the words Hello World
        to the screen, and it is amazing */
        #include <stdio.h>
        int main() {
            // printf() displays the string inside quotation
            printf("Hello, World!");
            printf("//shhh this isn\"t a comment");
            return 0;
        }
        /* this block comment has a * in it */
    """.trimIndent();
    val list = """
            one, two,
            three, four""";
    for (t in FSMCommas().tokenize(list)) {
        println(t);    
    }
    println("---")
    for (t in FSMComments().tokenize(helloWorld)) {
        println(t);    
    }
}

Richard "Blogger is rubbish for code samples" B

No comments:

Post a Comment