Déjà Vu

If you have studied natural language processing (NLP) then generation of binary trees might have given you a feeling of déjà vu. Here’s why:

NLP often involves using a context-free grammar (CFG) to decompose a sentence. We use the rules (“productions”) of the grammar to generate a parse tree, also called a derivation. In general, CFGs are non-deterministic or ambiguous, meaning multiple derivations can be produced. (It is certainly for a CFG to be deterministic, in which case a single derivation can be produced—computer programming languages, unlike natural languages, conform to a deterministic CFG.)

Shift-reduce parsing is a flexible method for generating derivations of a CFG from an input text. Remember the choices in the generation of binary trees: “lazy” to put something on the stack, and “eager” to create new non-terminals and put them on copies of the stack. The “lazy choice” directly corresponds to the “shift” step of a shift-reduce parser. The “eager” choice corresponds to the “reduce” step.

In the latest version of the Rust code, I’ve replaced use of the term eager with reduce, and use of the term lazy with shift.

From Characters to Tokens

The code currently generates binary trees from sequences of characters. In NLP, derivations of CFGs are usually generated from sequences of words. We’d like to accommodate these possibilities—and more.

For example, we might expect a speech recognition system to produce a sequence of words from an audio signal. However, there can be uncertainty about which word was uttered, because different words can sound similar, especially in the presence of background noise. Therefore, the speech recognition system is likely to produce a sequence of probability distributions over words.

For this reason, I’ve modified the code to treat the input as a sequence of tokens. In Rust, we can allow tokens to be of non-character types using generics. However, we do need to display and clone tokens. Therefore, we introduce the Token trait. This trait itself does not require implementation of any methods, but it refers to types that implement the required Clone and Display methods:

trait Token: Clone + Display {}

The terminals of the binary tree will now consist of tokens, any type that implements the Token trait:

enum BinaryTree<T: Token> {
    Terminal(T),
    Nonterminal(Box<BinaryTree<T>>, Box<BinaryTree<T>>),
}

One way to interpret this is: the Rust compiler will generate, on the fly, versions of the BinaryTree storing different types of terminals, as long as those types implement the Token trait. (We say they “are Token.”) It will also generate versions of any methods on BinaryTree that depend on what is stored.

The built-in char and str types already implement Display and Clone. We still need to “mark” them as implementing Token with a single line each:

impl Token for char {}
impl Token for &str {}

generate_stacks was calling the chars method, but that method is only suitable for character sequences. We replace its argument with any sequence that implements an iterator over Token types. We generalize the argument of generate similarly:

fn generate_stacks<T: Token>(input_sequence: impl Iterator<Item = T>)
-> Vec<Stack<T>> {
    input_sequence.fold(
        vec![Stack::default()],
        |stacks, input| {
            stacks.into_iter().fold(
                Vec::default(),
                shift_reduce_fn(input)
            )
        }
    )
}

fn generate<T: Token>(input_sequence: impl Iterator<Item = T>)
-> Vec<BinaryTree<T>> {
    tops(filter_stacks(generate_stacks(input_sequence)))
}

In main, we test this capability by passing sequences of characters as well as words:

fn main() {
    let x = generate("abcde".chars());
    println!("{} trees", x.len());
    for t in x {
        println!("{}", t);
    }
    let word_sequence = vec!["the", "cat", "sat", "on", "the", "mat"];
    let x = generate(word_sequence.into_iter());
    println!("{} trees", x.len());
    for t in x {
        println!("{}", t);
    }
}

This version successfully generates 42 trees for the sequence of words!