Grep from First Principles, in Golang

Disclaimer: Whilst the tile mentions 'grep', I'm using it in the colloquial sense rather than relating to grep(1), the acutal unix program. In this article we'll build a toy version of a string searching program solely for the purposes of education. The intention is not to fully replicate the grep tool (and this post actually explains a different algorithm to the one grep uses). What we'll build will almost certainly be slow and incomplete (but fun!).


Most programmers are likely familiar with the concept of the regular expression and most of us probably use them one way or another on a daily basis, but when I asked my colleagues, most people seemed to not have any idea about how they might work. I really enjoy taking things which we use every day and pulling back the curtain as means to show programmers that there is no magic. You just have to go deeper.

The rest of this article will assume you have a passing familiarity with regular expressions. Namely that you understand what the symbols *, + mean. If you can explain the expressions a, a+, a*, (ab)*, we're pretty much good to go!

The Regular in Regular Expression.

To understand how regular expressions work, we first need to understand what a regular language is. The definition of a regular language is often given as "any language which can be expressed with a regular expression", which sounds like a tautology to me. The actual formal langauge theory here is a bit esoteric, so I'm going to instead try and demonstrate regular and non-regular languages with a few examples.

Given any alphabet, say the latin alphabet, or maybe just one which constists of two symbols {a,b}, some regular languages are:

  • The empty string ""
  • The symbol a or the symbol b
  • Other langauges combined, e.g. aa or ab.
  • Any language 0 or more times, like abab

These simple examples consitute the base cases of regular languages. From those, you can build up some slightly more interesting things like:

  • "Any string with an even number of a's" -> (aa)*
  • "One or more a, followed by one or more bs -> aabb (or a+b+)

To understand where the limits of regular languages lie, let's look at some examples of languages which are not regular. If I asked you to write me a regular expression which matched any valid Go program, you'd rightfully tell me to get stuffed, but what is it about a Go program that is inherently more complex than the languages we defined above?

Basically (and this is a massive over-simplification), it comes down to memory. For a language to be regular, we need to be able to recognise (i.e. check its validity) it without storing anything in memory. To properly understand a Go program, we need to keep track of state as we're parsing it to check if the next token we're looking at is valid. With regular expressions, we have a current state, but we don't have any way to remember things (in fact, this is why modern regular expression engines which support things like backracking can parse languages which are not formally regular).

To demonstrate this, consider the language defined as "3 a's followed by 3 b's" which we could represent as aaabbb or a{3}b{3}. This is a totally valid regular language (and therefore a valid regular expression). But what if I said we wanted a language which was "N a's followed by N b's"? The most obvious step of inference would be to write a{N}b{N}. If you give me a value for N, I can generate an expression which matches a language, but importantly we can't have a single regular language which matches for every value of N.

So why not? Well, to do that, we'd need some memory. We'd have to count all the a's to discover our N, record that number somewhere, and then make sure there are the same number of b's.

Another canonical example of something a regular expression can't do is to test whether a string has 'balanced brackets', e.g. match (()(())) but not ()((). To do that, we'd need to keep track of every time we saw an opening bracket to ensure that we closed it. And with no memory, no keeping track.

String search with regular expressions

We know that if we have a regular language, we can design a regular expression to match it. So how might we build an algorithm to actually put this into practise for string searching? A general, high level approach might look something like this:

  1. Parse the regular expression
  2. Create a 'program' based on that expression which can return true (match) or false (no match) for any given string.
  3. Open our input file and break into lines
  4. Feed each line into the 'program' and print the line if it returns 'match'

Let's not worry about the parsing of the expression for now (we'll see that a bit later on), but instead, let's focus on this mystery 'program' that we're creating. Basically, we need to design a function which can take a parsed regular expression and 'run' it against a given string. There are different ways to solve this problem, each with different capabilities, but the one we're going to focus on now is by using something called a "Deterministic Finite Automaton" or DFA.

Deterministic Finite Automatons

DFA's are really just a slightly fancy kind of state machine with an even fancier name. Many of you are probably familiar with the concept of a state machine in general, but to understand how they apply in this sitation, let's start with something simple and build our way up.

First, let's consider the state machine for a simple concept like a door. A door can either be open, or it can be closed. You can open a closed door, and vice versa, but you can't close a door that's already closed. This simple state machine represents that model. The circles are states which the door can be in, and the lines between them are the events.

Now, one thing that makes a DFA more complex than any old state machine is that it has the concept of an 'accepting' state. We can define which states signify that the series of events is acceptable. This is how we're going to answer if a string 'matches' or not when we do our search.

Let's extend our example slightly by adding a new event 'kick door' and and a new state 'broken'. Here we're saying that if a door is closed, we can also kick it, but once we've broken it, we can no longer close it. Now, in my house - and this wasn't the case when I was a student - kicking the door and breaking it leaves it in an unacceptable state. So let's mark our other states as 'accepting' by using a double circle notaion.

Simple state machine

Let's take a look at how we can cook up a simple state machine to model the above concepts. First, we start by defining a state struct, which corresponds to the circles in the diagram:

type State struct {
	value       string
	isAccepting bool
}

Note: For brevity, I'll omit most of the NewX functions and simple functions attached to structs, but you can see the full code on github if you like

Now we can model our Transitions - the lines between the states:

type Transition struct {
	Event     string
	Source    State
	NextState State
}

A transition is simply something that goes from one state to another based on a specific event. For example: Transition{ Event: "kick door", Source: State{"closed", true}, NextState: State{"broken", false}.

Now with our building blocks, we need a little bit of machinery to transition between states based on events that we're observing. To do this, we can simply require an initial state and the set of all possible transitions as input.

type StateMachine struct {
	initialState State
	currentState State
	transitions  []Transition
}

func New(initial State, transitions []Transition) *StateMachine {
	return &StateMachine{
		initialState: initial,
		currentState: initial,
		transitions:  transitions,
	}
}

Then we can run by playing all events through the state machine, and record the resulting state.

// Run checks if the machine is in an accepting state after processing events.
func (m *StateMachine) Run(events []string) bool {
	for _, event := range events {
		transition := m.findTransition(event)
		if transition == nil {
			// If no transition exists between current state and next character, we're done.
			break
		}
		m.currentState = transition.NextState
	}

	// After all input, check if the state we finished in is in the set of final states.
	return m.currentState.Accepting()
}

func (m *StateMachine) findTransition(event string) *Transition {
	for _, t := range m.transitions {
		if t.Source.Equal(m.currentState) && t.Event == event {
			return &t
		}
	}
	return nil
}

For simplicity's sake, we've made a few assumptions here that it's worth pointing out. Firstly, we're coutning on the fact that the (event, source) pair that we're given for each transition is unique, as this is the property which adds the 'deterministic' part to the DFA. Secondly, and importantly for our string searching use-case, we're short-circuiting the machine when we don't find a matching transition. The more correct alternative is to put the machine into a non-accepting state when it encounters an unknown event, but as we're not going to implement wildcards (.), doing that would limit the usefullness of our toy program.

Now, let's try running something simple through our state machine, in the form of a unit test:

// States
var (
	openState    = NewAcceptingState("open")
	closedState  = NewAcceptingState("closed")
	brokenState = NewState("broken")
)

// Events
var (
	open    = "open"
	close   = "close"
	kick   = "kick"
	invalid = "foo"
)

func TestSimpleMachine(t *testing.T) {
	initial := openState
    machine, err := New(initial, []Transition{
        {Event: open, Source: closedState, NextState: openState},
        {Event: close, Source: openState, NextState: closedState},
        {Event: kick, Source: closedState, NextState: brokenState},
    })
    assert.NoError(t, err)
    result := machine.Run([]string{close, open, close})
    assert.True(t, result)
}

Here we close and open our door a few times and verify that we're in an accepting state. I'll omit the test which shows using a "kick" event, but we simply pass something like machine.Run([]string{close, open, close, kick}) and assert that the result is false (not accepting).

Putting it together: Regex as DFAs

So we've built our simple DFA-analogue, but how do we apply this tool to string searching and regular expressions? The key is to realise that we can express any regular expression as a DFA. Our 'events' are simply characters appearing in a stream (e.g. the line 'foo' would be a stream of events f, o, o).

Let's take a look at some examples for the subset of regular expressions that we'll support, and then how we can combine them into more useful expressions.

Literal

A literal is any simple string of characters. If we wanted to represent the simple expression ab, that is, one a followed by one b, we can draw:

Alternate

An alternate is any expression or another expression. For example, we could have an alternate between two literals; ab|cd.

Here you can see that we can go down either path to achieve an accepting state.

Kleene Star

The Kleene star allows us to express zero or more of any other expression. To keep our example simple, we'll only support the star when applied to a single literal. For example, a* would be:

Because we want zero or more, our initial state is also accepting. We then create a loop from state 1 back to itself. This allows us to transition any number of times when we see an a and stay in the same state.

Plus

A plus signifies one or more, which is really just zero or more with a preceding event. We can acheive this pretty simply, as you can see with a+:

Character class

A character class is a way to express a lot of characters without having to construct a giant alternate tree. For example, we can use [a-z] instead of (a|b|c|...). The simple implementation of this, which isn't the most efficient, is just to create a transition for every literal within the class. So [a-c] would become:

Concatenation

Finally, a concatenation is any two expressions joined together. Whilst it's a simple concept, this is what allows us to create useful expressions. For example, a simple concatenation of ab* would be:

Which is simply connecting the DFA for a to the DFA for b*. We can also concatenate multiple expressions. For example, a more complex combination of a character class, kleene star, literal and alternate combined into a single expression like [a-c]d*e|f would be:

Building the DFA

Now that we've designed the components of our state machine, we need to somehow take any regular expression as input, parse it and create our DFA. Helpfully, the Go standard library already includes a regex parser in the regexp/syntax package which "parses regular expressions into parse trees and compiles parse trees into programs". We're going to use this package to generate the "parse trees", but we're going to handle the "compiling into programs" part ourself (that's the DFA).

We're going to create a regex2fsm package, which contains a Parser. This parser will have a Convert method which takes a regex pattern, and returns an fsm.StateMachine, by creating one with the initial state and all transitions.

type Parser struct {
  // stateGenerator returns states with monotonically increasing numbers
  // when next() is called, e.g. "0", "1", "2".
  stateGenerator fsm.StateGenerator
}

func (g Parser) getNextState(isAccepting bool) fsm.State {
	if isAccepting {
		return g.stateGenerator.NextAccepting()
	}
	return g.stateGenerator.Next()
}

func (g Parser) Convert(pattern string) (*fsm.StateMachine, error) {
	regexTree, err := syntax.Parse(pattern, syntax.POSIX)
	if err != nil {
		return nil, err
	}

    initialState := g.stateGenerator.Next()
	transitions := g.parseTree(initialState, regexTree, true)

	return fsm.New(initialState, transitions), nil
}

The key here is of course the parseTree function. To understand how to implement this, we need to think about regular expressions in the form a syntax tree. Without going into too much detail, the package is generating a tree where each node represents an operator like Concat or Star, zero or more literals like a and zero or more subexpressions, which are also valid trees.

For example, the pattern ab|[xy]z+ can be broken down into this tree:

Walking the tree

To build transitions from the tree, we need to walk the tree, that is, visit every node, and create transistions based on each operation we find. As each node can have any number of subexpressions, each also a valid tree, we're going to walk in a recursive, depth-first manor. Our top level parsing function is actually quite simple:

func (g Parser) parseTree(currentState fsm.State, tree *syntax.Regexp, isAccepting bool) []fsm.Transition {
	switch tree.Op {
	case syntax.OpAlternate:
		return g.parseAlternate(currentState, tree, isAccepting)
	case syntax.OpLiteral:
		return g.parseLiteral(currentState, tree, isAccepting)
	case syntax.OpStar:
		return g.parseStar(currentState, tree, isAccepting)
	case syntax.OpPlus:
		return g.parsePlus(currentState, tree, isAccepting)
	case syntax.OpConcat:
		return g.parseConcat(currentState, tree, isAccepting)
	case syntax.OpCharClass:
		return g.parseCharClass(currentState, tree, isAccepting)
	default:
		panic(fmt.Sprintf("unsupported operation: %s", tree.Op))
	}
}

Here, we simply look at the type of operation for the current node, and delegate to a function to handle that specific node. I'll get on to why we're passing isAccepting down a little later.

So let's take a look at parseAlternate, which will be called first in our example tree above. We know an alternate has no literals, it's simply a subexpression on the left and on the right, so our program is pretty simple:

func (g Parser) parseAlternate(currentState fsm.State, alternate *syntax.Regexp, isAccepting bool) []fsm.Transition {
	left := g.parseTree(currentState, alternate.Sub[0], isAccepting)
	right := g.parseTree(currentState, alternate.Sub[1], isAccepting)
	return append(left, right...)
}

We're recursively calling the parseTree function, because each expression is its own tree. We're then just appending all the transitions we get from each tree to our final list.

So let's continue down our tree. On the left, we have the literal ab. In this node, we need to create a transition from our current state to a and from a to b. If we should accept on this literal, we also need to mark b as an accepting state:

func (g Parser) parseLiteral(currentState fsm.State, literal *syntax.Regexp, isAccepting bool) []fsm.Transition {
	transitions := []fsm.Transition{}
	last := currentState
	for i, c := range literal.Rune {
		isLast := i == len(literal.Rune)-1
		nextState := g.getNextState(isAccepting && isLast)
		transitions = append(transitions, fsm.Transition{
			Event:     string(c),
			Source:    fsm.State(last),
			NextState: nextState,
		})
		last = nextState
	}
	return transitions
}

This completes the left side of our tree, as a literal cannot have any subexpressions. Let's take a look at what happens down the right side, where we reach our Concat operator:

func (g Parser) parseConcat(currentState fsm.State, concat *syntax.Regexp, isAccepting bool) []fsm.Transition {
	// Link current state to first state
	source := currentState
	transitions := []fsm.Transition{}
	for i, expression := range concat.Sub {
		isLast := i == len(concat.Sub)-1
		subTransitions := g.parseTree(source, expression, isAccepting && isLast)
		source = subTransitions[len(subTransitions)-1].NextState
		transitions = append(transitions, subTransitions...)
	}
	return transitions
}

This is very similar to literal parser. However, with literals we knew that we could just create transitions and join them together, but with a Concat, we're not linking literals, we're linking entire subexpressions. So here, for each subexpression we see, we call parseTree to parse it, then link the last state of what we parsed to the first state of the next expression.

Continuing on, we have our CharClass operation on the left, which I'll omit because it's not that interesting. In essence, we just creating a transition from the node we're given to a new node, for every literal in the char class. On the right, we have our Plus operation. If we look back at the DFAs we built before, we know a Plus operation can be modelled with 3 states:

func (g Parser) parsePlus(currentState fsm.State, plus *syntax.Regexp, isAccepting bool) []fsm.Transition {
	midState := g.getNextState(isAccepting)
	repeatingState := g.getNextState(isAccepting)
	return []fsm.Transition{
		{Event: string(plus.Sub[0].Rune[0]), Source: currentState, NextState: midState},
		{Event: string(plus.Sub[0].Rune[0]), Source: midState, NextState: repeatingState},
		{Event: string(plus.Sub[0].Rune[0]), Source: repeatingState, NextState: repeatingState},
	}
}

That's from the state we're given to to our 'middle state', from the middle to our repeating state, and from the repeating state to itself.

We're done parsing our expression! Let's take a look all the transitions we created to ensure that we've built the right DFA. I wrote a little bit of code using the gographviz package, which takes the transitions of the machine and generates a graphviz string, which can be used to make the DFA diagrams. Running our transitions from the machine we build for ab|[xy]z+ through the visualizer, we get:

Everything looks good there!

Running the machine

We've built our DFA from a regular expression, so now we can actually use it to do some grepping! Let's write a crummy program which builds a machine from input and runs it against a list of words:

func main() {
	pattern := os.Args[1]
	filename := os.Args[2]

	converter := regex2fsm.New()
	machine, err := converter.Convert(pattern)
	if err != nil {
		log.Fatal(err)
	}
	searchInFile(filename, machine)
}

func searchInFile(filename string, machine *fsm.StateMachine) {
	file, err := os.Open(filename)
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		line := scanner.Text()
		tokens := strings.Split(line, "")
		result := machine.Run(tokens)
		if result {
			fmt.Println(line)
		}
		machine.Reset()
	}

	err = scanner.Err()
	if err != nil {
		log.Fatal(err)
	}
}

To test it out with a real-ish world case, let's download a list of ~470k English words and search with the expression t[wo]o:

> go run main.go "t[wo]o" words.txt
...
tootsies
tootsy-wootsy
tootsy-wootsies
toozle
toozoo
two
two-a-cat
two-along
two-angle
two-arched
two-armed
...

I'm not sure I'd class most of those as words, but they certainly match our regular expression, so mission acomplished!

Comparing to grep

The most obvious way in which this is different to grep from a functional perspective is that we're only searching for lines which match our expression from the start (i.e. we're really searching for ^t[wo]o in regex parlance). There's a lot of useful things that we didn't implement, like wildcards (.) and other regex features. Finally, as I mentioned at the start, we're also using a different algorithm to the one grep uses.

Still, we've got a basic approximation. To wrap up, let's take a look at how fast our program is compared to grep.

> perf stat -r 10 ./main "t[wo]o" words.txt
0.2718 +- 0.0176 seconds time elapsed  ( +-  6.46% )

An average of 270ms to search 470k words for our pattern. Not bad! What about grep?

> perf stat -r 10 grep "t[wo]o" words.txt
0.025055 +- 0.000663 seconds time elapsed  ( +-  2.65% )

An order of magnitude faster! We're not even close in (features or in time). Well, we totally failed to live up to grep, but we did implement a pretty quick string searching program from first principles.

I hope you learnt something along the way. Next time you come across a program you use every day, but have no idea how it works, take a minute to think about how you might build something like it from first principles. Remember, there's no magic, you just have to go deeper.

The code from the article is available on GitHub