Algebraic Data Types and Pattern Matching

What may not be clear to readers in a lot of the previous discussions is the use of Algebraic Data Types in combination with patterm matching to define functions. It’s really quite simple, conceptually (implementation may be a different matter, we’ll see.) Here’s an example we’ve seen before, I’ll just be more descriptive this time:

typedef list(t) { cons(t, list(t)) | null }

1	typedef list(t) { cons(t, list(t)) \| null }

This declaration achieves two things:

It defines a type list(t) (list of t) where t is a type variable that can stand for any type.
It creates two constructor functions, called cons and null, that accept arguments of the specified types (none in the case of null,) and return data of type list(t).

Reading it aloud, it says define a type list of some unspecified type t which is either a cons of a t and a list of t, or a null.

Once defined, we use these type costructors to create lists of a concrete type:

def a = cons(true,
             cons(false,
                  cons(false, null)));

def a = cons(true,

cons(false,

cons(false, null)));

After the above definition, a has type list(bool). The following, on the other hand, would fail to type check:

def a = cons(true, cons('x', null));

1	def a = cons(true, cons('x', null));

It fails because:

cons('x', null) is of type list(char) .
The outer cons expects arguments <t> and list(<t>) , but it gets bool and list(char) .
The outer cons cannot reconcile <t> = bool with <t> = char so the type check fails.

That’s all very nice, but how can we use Algeraic Data Types? It turns out that they become very useful in combination with pattern matching in case statements. Consider:

fn length(a) {
    switch(a) {
        case (cons(head, tail)) {
            1 + length(tail)
        }
        case (null) {
            0
        }
    }
}

fn length(a) {

switch(a) {

case (cons(head, tail)) {

1 + length(tail)

}

case (null) {

}

In that case statement, a must match either cons(head, tail) or null. Now if it matches cons(head, tail), the (normal) variables head and tail are automatically created and instantiated as the relevant components of the cons in the body of the case statement. This kind of behaviour is so commonplace in languages like ML that special syntax for functions has evolved, which I’m borrowing for F♮:

fn length {
    (cons(head, tail)) {
        1 + length(tail)
    }
    (null) {
        0
    }
}

fn length {

(cons(head, tail)) {

1 + length(tail)

}

(null) {

}

This version of length, instead of having a single formal argument list outside the body, has alternative formal argument lists inside the body, with mini bodies of their own, just like a case statement. It’s functionally identical to the previous version, but a good deal more concise and readable.

One thing to bear in mind, in both versions, is that length has type list(t) → int. That is to say, each of the formal argument lists inside the body of a function, or the alternative cases in a case statement, must agree in the number and types of the arguments, and must return the same type of result.

Now, it becomes obvious that, just as we can rewrite a let to be a lambda, this case statement is in fact just syntactic sugar for an anonymous function call. The earlier definition of length above, using a case statement, can be re-written as:

fn length(a) {
    fn {
        (cons(head, tail)) {
            1 + length(tail)
        }
        (null) {
            0
        }
    }(a)
}

fn length(a) {

fn {

(cons(head, tail)) {

1 + length(tail)

}

(null) {

}

}(a)

}

so we get case statements: powerful, pattern matching ones, allowing more than one argument, for free if we take this approach.

length is polymorphic. It does not do anything to the value of head so does not care about its type. Therefore the type of length, namely list(t) → int actually contains a type variable t.

Here’s a function that does care about the type of the list:

fn sum_strlen {
    (cons(head, tail)) {
        strlen(head) + sum_strlen(tail)
    }
    (null) {
        0
    }
}

fn sum_strlen {

(cons(head, tail)) {

strlen(head) + sum_strlen(tail)

}

(null) {

}

Assuming strlen has type string → int, that would constrain sum_strlen to have type list(string) → int. Of course that’s a rather silly function, we would be better passing in a function like this:

fn sum {
    (fn, cons(h, t)) {
        fn(h) + sum(fn, tail)
    }
    (fn, null) {
        0
    }
}

fn sum {

(fn, cons(h, t)) {

fn(h) + sum(fn, tail)

}

(fn, null) {

}

That would give sum a type:

(t -> int) -> list(t) -> int

1	(t -> int) -> list(t) -> int

and we could call it like:

def a = cons("hello",
             cons(" ",
                  cons("there", null)));
def len = sum(strlen, a);

def a = cons("hello",

cons(" ",

cons("there", null)));

def len = sum(strlen, a);

or even, with a Curried application:

def sum_strlen = sum(strlen);

1	def sum_strlen = sum(strlen);

This is starting to look like map-reduce. More on that later.

Real-World Applications

Algebraic Data Types really come in to their own when it comes to tree walking. Consider the following definitions:

typedef expr(t) {
    value(t)
  | add(expr(t), expr(t))
  | sub(expr(t), expr(t))
  | mul(expr(t), expr(t))
  | div(expr(t), expr(t))
}

typedef expr(t) {

value(t)

| add(expr(t), expr(t))

| sub(expr(t), expr(t))

| mul(expr(t), expr(t))

| div(expr(t), expr(t))

}

Given that, we can write an evaluator for arithmetic expressions very easily:

fn eval {
    (value(i))  { i }
    (add(l, r)) { eval(l) + eval(r) }
    (sub(l, r)) { eval(l) - eval(r) }
    (mul(l, r)) { eval(l) * eval(r) }
    (div(l, r)) { eval(l) / eval(r) }
}

fn eval {

(value(i)) { i }

(add(l, r)) { eval(l) + eval(r) }

(sub(l, r)) { eval(l) - eval(r) }

(mul(l, r)) { eval(l) * eval(r) }

(div(l, r)) { eval(l) / eval(r) }

}

So eval has type expr(int) → int . We can call it like:

eval(add(value(2),
         mul(value(3),
             value(5))));

eval(add(value(2),

mul(value(3),

value(5))));

to get 17.

Pattern matching not only covers variables and type constructors, it can also cope with constants. For example here’s a definition offactorial:

fn factorial {
    (0) { 1 }
    (n) { n * factorial(n - 1) }
}

fn factorial {

(0) { 1 }

(n) { n * factorial(n - 1) }

}

For this and other examples to work, the cases must be checked in order and the first case that matches is selected. so the argument to factorial would only match n if it failed to match .

As another example, here’s member:

fn member {
    (item, item @ tail) { true }
    (item, _ @ tail)    { member(item, tail) }
    (item, [])          { false }
}

fn member {

(item, item @ tail) { true }

(item, _ @ tail) { member(item, tail) }

(item, []) { false }

}

Here I’m using F♮’s built-in list type constructors @, (pronounced cons,) and [] (pronounced null,) and a wildcard _ to indicate a don’t care variable that always unifies, but apart from that it’s just the same as the cons and null constructors. Anyway, the cases say:

member(item, list) is true if item is at the head of the list.
member(item, list) is true if item is a member of the tail of the list.
item is not a member of the empty list.

Problems and Solutions

You’ve probably realised that given a type like list(t) above, it’s not possible to directly create lists of mixed type. That is because it is usually a very bad idea to do so. However if you need to do so, you can get around the restriction without breaking any rules, as follows:

Create a container type for your mixed types:

typedef either(t, u) { first(t) | second(u) }

1

typedef either(t, u) { first(t) | second(u) }
Create lists of that type:

def a = [ first("a string") , second(20) ]

1

def a = [ first("a string") , second(20) ]

After the above definition, a has type list(either(string, int)), and you can’t get at the data without knowing its type:

fn sum_numbers {
   (first(_) @ tail)  { sum_numbers(tail) }
   (second(n) @ tail) { n + sum_numbers(tail) }
   ([])               { 0 }
}

fn sum_numbers {

(first(_) @ tail) { sum_numbers(tail) }

(second(n) @ tail) { n + sum_numbers(tail) }

([]) { 0 }

}

Here, sum_numbers has type [either(<t>, int)] → int. e.g. it doesn’t care what type first holds. We could have written first(s) instead of first(_), but the use of a wildcard _explicitly says we don’t care, stops any potential warnings about unused variables, and is more efficient.

Bill Hails

Algebraic Data Types and Pattern Matching

Real-World Applications

Problems and Solutions

Leave a Reply Cancel reply

Random Blogs