bolingblog: September 2022: September 2022

Thursday 29 September 2022

Are You Sure?

Posted by rjb

A computer app will often throw up a message that says something like "Are you sure?" and you can click yes or no. I wanted to take some money out of a building society and the "yes" option of "are you sure?" opened up a boring side-quest IRL and the more I think about it the more ridiculous it seems.

Are you sure?
Download this pdf file and take it somewhere with a printing machine. Print the file onto a sheet of paper. Write your name briskly with an ink pen on the paper. Obtain another sheet of paper that has been folded and glued into a very specific shape. Put the first sheet of paper inside the second. Mark the second piece of paper with this exact string of symbols. Obtain a small sticky piece of paper that has a portrait of the last Queen on it. Stick the third piece of paper onto the second. Drop the whole bundle into a giant red cast-iron cupboard that you will find bolted down in the street.

Richard "paperless office" B

Tuesday 20 September 2022

Mimosa

Posted by rjb

In the early 90s my father built a very accurate model of the boat that he owned at the time. It was an 18' Plymouth Pilot called Mimosa. The building of the model was significant to me because it divided my life into two parts: The part where I didn’t understand the lines drawing of a hull shape, and the part where I could read them quite easily.

The model has been gathering dust and dirt and pieces have got broken or gone missing since my father's death. I inherited this model as part of my mother's estate and I have been cleaning and restoring it. The boat has a cuddy at the front and there are grab rails on its roof. On the model the clearance between the bottom of the grab rail and the top of the roof is so small that it was very hard to clean. The gap was far too small for a cotton bud, and a bit too small for a pipe cleaner. The answer was a small size interdental brush, but when I was wandering around the house trying to find something to squeeze into this tiny gap I remembered something. The gap on the real boat was also too small! If you walked up the sidedeck and quickly took hold of the grab rail you would skin you knuckles on the top of the cuddy. That's really fantastic model building!

Richard "repair shop" B

Saturday 3 September 2022

Finite State Machine Tokenizers

Posted by rjb

"If you solve a problem with a regex, now you have two problems"

I have a dream! That one day software developers who have to do string processing will be just as likely to reach for a finite state machine as they are to start writing regular expressions. FSM's are not a magic bullet, but for some problems they're perfect: Easy to write, easy to understand, easy to test.

For the sake of this article, lets say that we want to grab the comments out of a c or java source file. These are the rules:

A line comment starts with a double slash and ends at the first newline
A block comment starts with slash star and ends at the first star slash
A double slash within a string literal does not start a comment
Slash star within a string literal does not start a comment
A string literal starts with doublequote and ends at the first double quote
A string literal does not end at a double quote preceded by a backslash

This is the definition of a FSM tokenizer that grabs the comments. This is not a real format, but I think it might be valid YAML. I also think it could be a fascinating project in java to implement a general purpose FSM tokenizer where the FSM is defined in yaml.

Start in the CODE state and for each character in the string do what your current state tells you, including moving to a new state.

---
CODE: {
'/': ONESLASH
'"': INQUOTE
?: CODE
}

ONESLASH: {
'/': LINECOMMENT
'*': BLOCKCOMMENT
?: CODE
}

LINECOMMENT: {
'\n': [endToken, CODE]
?: [appendTheChar, LINECOMMENT]
}

BLOCKCOMMENT: {
'*': BLOCKSTAR
?: [appendTheChar, BLOCKCOMMENT]
}

BLOCKSTAR: {
'/': [endToken, CODE]
?: [append:'*', appendTheChar; BLOCKCOMMENT]
}

INQUOTE: {
'"': CODE
'\\': ESCAPE
?: INQUOTE
}

ESCAPE: {
'"': INQUOTE
?: INQUOTE
}

Some of you are already familiar with this method of string processing. Some of you think that this type of FSM is better expressed with the characters at the root level. That is, all the state transitions that can be triggered by (say) a slash are grouped together, rather than all the transitions that start in (say) a quoted string being grouped together. "You're wrong and I hate you".

One of my teams has recently started to use the language Kotlin, so over the bank holiday weekend I gave myself a personal coding project: To write a FSM tokenizer in Kotlin and to make it as easy to read as I could. This what I ended up with.

/**

* base class for Finite State Machine based tokenizers (lexers)

abstract class FSMTokenizer {

var token = StringBuilder(); // the current token

var tokens = ArrayList<String>(); // all the tokens

operator fun plus(c: Char?){ // add a character to the current token

token.append(c);

}

operator fun plus(s: String){ // add a string to the current token

token.append(s);

}

fun split() { // end one token and start the next

tokens.add(token.toString());

token = StringBuilder();

}

fun tokenize(s: String): List<String> { // feed the characters of a string into a Finite State Machine

for (c in s) {

step(c)

}

step(null) // the FSM is given a null after the last character - it might have some tidying up to do

return tokens;

}

abstract fun step(c: Char?) // deal with one character. You need to implement this yourself

}

* tokeniser which grabs only the comments out of a java source file

class FSMComments(): FSMTokenizer() {

var state: States = States.CODE; // start the machine in the CODE state

override fun step(c: Char?) = state.step(this, c) // pass this tokenizer into the enum class

enum class States { // FSM model

CODE, ONESLASH, LINECOMMENT, BLOCKCOMMENT, BLOCKSTAR, INQUOTE, ESCAPE;

fun step(tokens: FSMComments, c: Char?) {

tokens.state = when(tokens.state) { // our implementation of step() updates the state of the machine

CODE ->

when (c) {

'/' -> ONESLASH

'"' -> INQUOTE

else -> CODE

}

ONESLASH ->

when (c) {

'/' -> LINECOMMENT

'*' -> BLOCKCOMMENT

else -> CODE

}

LINECOMMENT ->

when (c) {

'\n' -> {tokens.split(); CODE}

else -> {tokens+c; LINECOMMENT}

}

BLOCKCOMMENT ->

when (c) {

'*' -> BLOCKSTAR

else -> {tokens+c; BLOCKCOMMENT}

}

BLOCKSTAR ->

when (c) {

'/' -> {tokens.split(); CODE}

else -> {tokens+'*'; tokens+c; BLOCKCOMMENT}

}

INQUOTE ->

when (c) {

'"' -> CODE

'\\' -> ESCAPE

else -> INQUOTE

}

ESCAPE ->

when (c) {

'"' -> INQUOTE

else -> INQUOTE

}

* tokeniser which splits a string on commas. It eats leading whitespace and multiple commas

class FSMCommas(): FSMTokenizer() {

var state: States = States.COMMA; // start the machine in the COMMA state

// extra operators can be implemented here

override fun step(c: Char?) = state.step(this, c)

enum class States { // FSM model

TEXT, COMMA;

fun step(tokens: FSMCommas, c: Char?) {

tokens.state = when(tokens.state) {

TEXT ->

when (c) {

',', null -> {tokens.split(); COMMA}

else -> {tokens+c; TEXT}

}

COMMA ->

when (c) {

',', ' ', '\t', '\n' -> COMMA

else -> {tokens+c; TEXT}

}

fun main() {

val helloWorld = """

/* The code below will print the words Hello World

to the screen, and it is amazing */

#include <stdio.h>

int main() {

// printf() displays the string inside quotation

printf("Hello, World!");

printf("//shhh this isn\"t a comment");

return 0;

}

/* this block comment has a * in it */

""".trimIndent();

val list = """

one, two,

three, four""";

for (t in FSMCommas().tokenize(list)) {

println(t);

}

println("---")

for (t in FSMComments().tokenize(helloWorld)) {

println(t);

}

Richard "Blogger is rubbish for code samples" B

bolingblog

Thursday 29 September 2022

Are You Sure?

Tuesday 20 September 2022

Mimosa

Saturday 3 September 2022

Finite State Machine Tokenizers

Followers

Contributors