Saturday, July 22, 2017

Knuth-Bendix Completion on Non-Abelian Group Theory

Applying Knuth-Bendix completion to the theory of non-abelian groups to arrive at a rewrite system that reflects the decidability of the theory it implements, in which semantic equality is equivalent to syntactic equality of normal forms.

Introduction


It turns out that the way we were taught group theory in the lower grades of high school was not optimal, at least not from a computational point of view. Simple statements in group theory can be rather tricky to prove, even though some theories are decidable and, consequently, problems formulated in these theories can be solved mechanically.

In this post, we will use a procedure referred to as Knuth-Bendix completion to transform the traditional axioms for non-abelian group theory into a deterministic rewrite system with the same expressive power. The different steps of the procedure are enumerated exhaustively. Intermediate theories will introduce new axioms and make older ones obsolete.

Being exhaustive makes the post somewhat verbose. In fact, the text has largely been generated as a side effect of automating the Knuth-Bendix completion procedure. Skip the sections you feel you already get, and use them for reference if you are interested in the details.

The whole Knuth-Bendix completion procedure on non-abelian group theory is presented graphically using the graphviz dot language. Go for the graphics if you want to scan through quickly.

Non-Abelian Groups


A non-abelian group is defined as a collection of elements, a binary operation on these elements and four axioms: closure, identity, invertibility and associativity. We will consider the collection of all well-formed terms over an alphabet using a typical algebraic term syntax, meaning closure is implicit.

The theory of non-abelian groups can then be formulated using the axioms:
[1] a*1 = a
[2] a*i(a) = 1
[3] a*b*c = a*(b*c)
Here, the binary operation is represented by an asterisk (*), and any letter that is not used as a functor denotes a variable.

Briefly explained:
  • Axiom [1] states that there is an element 1 which is the right-side identity for the operation.
  • Axiom [2] states that the unary functor i denotes the right-side inverse element of its argument.
  • Axiom [3] expresses that the binary operation is associative. The operation is assumed to associate to the left, but that is mere notational convention.

A Simple Exercise


Semantic equality in non-abelian groups is decidable, meaning that there is an effective procedure to determine semantic equality within finite space and time. However, using the axioms provided above, decidability may not be apparent.

Let's illustrate the problem with a simple exercise, proving that the right-side inverse is also the left-side inverse, a property known to be valid for non-abelian groups. In other words, the exercise is to prove, for all a, the following equation:
i(a)*a = 1
Before reading on, we urge you to give it a try, to appreciate how tricky the exercise is, given the axioms provided above.

The proof is given using the following derivation:
i(a)*a
  [1]= i(a)*a*1
  [2]= i(a)*a*(i(a)*i(i(a)))
  [3]= i(a)*a*i(a)*i(i(a))
  [3]= i(a)*(a*i(a))*i(i(a))
  [2]= i(a)*1*i(i(a))
  [1]= i(a)*i(i(a))
  [2]= 1
As one can see, the proof requires making the term i(a)*a considerably more complex before reducing it again to obtain the desired result. Proof search using the axioms above is only semi-decidable: if you cannot find a proof, there is no way of knowing whether equality does not hold or whether you simply did not try hard enough.

Rewrite Systems


So far, we have been using the term theory as referring to a collection of axioms, and the term axiom as referring to an equality of two terms, which is symmetric. In what follows, we will be using the term rewrite rule to refer to a directed axiom, which can be used only from left to right. We will refer to a collection of rewrite rules as a rewrite system.

A rewrite system may suffer the same problem regarding decidability as a theory does. However, under the right conditions a rewrite system may reflect the decidability of the theory it implements.

Rewriting a term using a rewrite system is referred to as reduction. A term that cannot be reduced any further is called a normal form. If every term reduces to a unique normal form, the rewrite system is called convergent: it is terminating and confluent, where confluence means that any two reductions of a term can be joined again.

This post is about transforming the theory of non-abelian groups described above to a rewrite system that implements the semantics of that theory and that is decidable.

Knuth-Bendix Completion


The Knuth-Bendix completion procedure starts off by restricting the use of the axioms to a single direction, from complex to simple. Typically, syntactic complexity is used as a metric.

Obviously, the rewrite system consisting of rewrite rules (directed axioms) does not implement the full semantics of the theory we started off with. One can prove equality for many more terms with the theory than with the rewrite system obtained from it by directing the axioms.

The Knuth-Bendix completion procedure will therefore try adding new rewrite rules to the rewrite system in order to restore the expressive power. It will do so by looking for overlaps of the left-hand side of one rewrite rule with a non-variable subterm of the left-hand side of another, using unification. When applying the substitution obtained by the unification to the left-hand side of the latter rewrite rule, we get a term that can be reduced (rewritten) in two distinct ways, once using each of the two rewrite rules involved.

If these reductions result in syntactically different terms, a so-called critical pair is found. A new rewrite rule consisting of these two terms is added to the rewrite system, again directed from complex to simple. As new rewrite rules are added, existing rewrite rules can potentially be reduced, and may even become trivial.

If, by repeating this process of adding new rewrite rules and reducing existing ones, we eventually reach a fixed-point, where no more new rewrite rules can be added, the resulting rewrite system will have the same expressive power as the original theory consisting of non-directed axioms. We may also conclude that the theory is decidable, and the rewrite system provides a mechanism that realizes that decidability. Using the rewrite system, semantic equality of terms is equivalent to syntactic equality of the normal forms of these terms.

If on the other hand no such fixed-point is reached, no conclusion can be made on the decidability of the theory. In fact, Knuth-Bendix completion itself is semi-decidable.

From Theory to Rewrite System


The axioms of the theory of non-abelian groups are directed to obtain a rewrite system of directed rewrite rules. The criterion used for directing an axiom is the syntactic complexity of the terms in the axiom: the more complex term becomes the left-hand side and the simpler term the right-hand side, such that rewrite rules always "simplify" a term.

The result is rewrite system 0:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)

Graphical Representation


Using the graphviz dot language, this rewrite system can be represented as:


In what follows, we will illustrate each iteration in the Knuth-Bendix procedure with such a graphical representation, illustrating new rewrite rules, reduced rewrite rules and rewrite rules that have become trivial. We will use graphical conventions as presented in this legend:


Iteration 1


The first iteration of the procedure can be represented graphically as:


Two new rewrite rules can be derived. The critical term for the derivation of each new rewrite rule is included in the box representing the new rewrite rule.

In what follows, we will document each of the critical terms found and how they can be used to derive a new rewrite rule.

Iteration 1 - Critical Pairs


Combining rewrite rules [1] and [3] the term a*1*b can be rewritten as in
a*1*b
  [1]-> a*b
or as in
a*1*b
  [3]-> a*(1*b)
Therefore, we create a new rewrite rule:
[4] a*(1*b) -> a*b
Combining rewrite rules [2] and [3] the term a*i(a)*b can be rewritten as in
a*i(a)*b
  [2]-> 1*b
or as in
a*i(a)*b
  [3]-> a*(i(a)*b)
Therefore, we create a new rewrite rule:
[5] a*(i(a)*b) -> 1*b
The result is rewrite system 1:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)
[4] a*(1*b) -> a*b
[5] a*(i(a)*b) -> 1*b

Iteration 1 - Simplification


The rewrite system cannot be simplified.

Iteration 2


The second iteration of the procedure can be represented graphically as:


Iteration 2 - Critical Pairs


Combining rewrite rules [2] and [4] the term a*(1*i(1)) can be rewritten as in
a*(1*i(1))
  [2]-> a*1
  [1]-> a
or as in
a*(1*i(1))
  [4]-> a*i(1)
Therefore, we create a new rewrite rule:
[6] a*i(1) -> a
Combining rewrite rules [2] and [5] the term a*(i(a)*i(i(a))) can be rewritten as in
a*(i(a)*i(i(a)))
  [2]-> a*1
  [1]-> a
or as in
a*(i(a)*i(i(a)))
  [5]-> 1*i(i(a))
Therefore, we create a new rewrite rule:
[7] 1*i(i(a)) -> a
Combining rewrite rules [5] and [4] the term a*(1*(i(1)*b)) can be rewritten as in
a*(1*(i(1)*b))
  [5]-> a*(1*b)
  [4]-> a*b
or as in
a*(1*(i(1)*b))
  [4]-> a*(i(1)*b)
Therefore, we create a new rewrite rule:
[8] a*(i(1)*b) -> a*b
The result is rewrite system 2:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)
[4] a*(1*b) -> a*b
[5] a*(i(a)*b) -> 1*b
[6] a*i(1) -> a
[7] 1*i(i(a)) -> a
[8] a*(i(1)*b) -> a*b

Iteration 2 - Simplification


The rewrite system cannot be simplified.

Iteration 3


The third iteration of the procedure can be represented graphically as:



During this iteration, four new rewrite rules [9], [10], [11] and [12] are added. The new rewrite rule [9] becomes trivial immediately using rewrite rule [12]. The existing rewrite rule [7] can be reduced using rewrite rule [12]. The new rewrite rule [11] can also be reduced using rewrite rule [7]. It remains marked green however, as it is a new rewrite rule.

Note that after reducing rewrite rule [7], it expresses the fact that the right identity element is also the left identity element, even in non-abelian groups, which is a nice intermediate result.

Iteration 3 - Critical Pairs


Combining rewrite rules [2] and [8] the term a*(i(1)*i(i(1))) can be rewritten as in
a*(i(1)*i(i(1)))
  [2]-> a*1
  [1]-> a
or as in
a*(i(1)*i(i(1)))
  [8]-> a*i(i(1))
Therefore, we create a new rewrite rule:
[9] a*i(i(1)) -> a
Combining rewrite rules [5] and [8] the term a*(i(1)*(i(i(1))*b)) can be rewritten as in
a*(i(1)*(i(i(1))*b))
  [5]-> a*(1*b)
  [4]-> a*b
or as in
a*(i(1)*(i(i(1))*b))
  [8]-> a*(i(i(1))*b)
Therefore, we create a new rewrite rule:
[10] a*(i(i(1))*b) -> a*b
Combining rewrite rules [7] and [3] the term 1*i(i(a))*b can be rewritten as in
1*i(i(a))*b
  [7]-> a*b
or as in
1*i(i(a))*b
  [3]-> 1*(i(i(a))*b)
Therefore, we create a new rewrite rule:
[11] 1*(i(i(a))*b) -> a*b
Combining rewrite rules [7] and [4] the term a*(1*i(i(b))) can be rewritten as in
a*(1*i(i(b)))
  [7]-> a*b
or as in
a*(1*i(i(b)))
  [4]-> a*i(i(b))
Therefore, we create a new rewrite rule:
[12] a*i(i(b)) -> a*b
The result is rewrite system 3:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)
[4] a*(1*b) -> a*b
[5] a*(i(a)*b) -> 1*b
[6] a*i(1) -> a
[7] 1*i(i(a)) -> a
[8] a*(i(1)*b) -> a*b
[9] a*i(i(1)) -> a
[10] a*(i(i(1))*b) -> a*b
[11] 1*(i(i(a))*b) -> a*b
[12] a*i(i(b)) -> a*b

Iteration 3 - Simplification


Looking at rewrite rule [7], the left-hand side can be rewritten as in
1*i(i(a))
  [12]-> 1*a
We therefore replace the rewrite rule by
[7] 1*a -> a
Looking at rewrite rule [9], the left-hand side can be rewritten as in
a*i(i(1))
  [12]-> a*1
  [1]-> a
The rewrite rule can therefore be removed.
Looking at rewrite rule [11], the left-hand side can be rewritten as in
1*(i(i(a))*b)
  [7]-> i(i(a))*b
We therefore replace the rewrite rule by
[11] i(i(a))*b -> a*b
The result is rewrite system 3bis:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)
[4] a*(1*b) -> a*b
[5] a*(i(a)*b) -> 1*b
[6] a*i(1) -> a
[7] 1*a -> a
[8] a*(i(1)*b) -> a*b
[10] a*(i(i(1))*b) -> a*b
[11] i(i(a))*b -> a*b
[12] a*i(i(b)) -> a*b

Iteration 4


The fourth iteration of the procedure can be represented graphically as:



During this iteration, three new rewrite rules [13], [14] and [15] are added. Rewrite rules [4], [10], [11], [12] and [13] become trivial. The existing rewrite rule [5] can be reduced using rewrite rule [7].

Note that rewrite rule [15] expresses the fact that the inverse of the inverse of an element is the element itself, which is a nice intermediate result.

Iteration 4 - Critical Pairs


Combining rewrite rules [2] and [5] the term a*(i(a)*i(i(a))) can be rewritten as in
a*(i(a)*i(i(a)))
  [2]-> a*1
  [1]-> a
or as in
a*(i(a)*i(i(a)))
  [5]-> 1*i(i(a))
  [7]-> i(i(a))
Therefore, we create a new rewrite rule:
[13] i(i(a)) -> a
Combining rewrite rules [11] and [5] the term i(a)*(i(i(a))*b) can be rewritten as in
i(a)*(i(i(a))*b)
  [11]-> i(a)*(a*b)
or as in
i(a)*(i(i(a))*b)
  [5]-> 1*b
  [7]-> b
Therefore, we create a new rewrite rule:
[14] i(b)*(b*a) -> a
Combining rewrite rules [12] and [5] the term a*(i(a)*i(i(b))) can be rewritten as in
a*(i(a)*i(i(b)))
  [12]-> a*(i(a)*b)
  [5]-> 1*b
  [7]-> b
or as in
a*(i(a)*i(i(b)))
  [5]-> 1*i(i(b))
  [7]-> i(i(b))
Therefore, we create a new rewrite rule:
[15] i(i(a)) -> a
The result is rewrite system 4:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)
[4] a*(1*b) -> a*b
[5] a*(i(a)*b) -> 1*b
[6] a*i(1) -> a
[7] 1*a -> a
[8] a*(i(1)*b) -> a*b
[10] a*(i(i(1))*b) -> a*b
[11] i(i(a))*b -> a*b
[12] a*i(i(b)) -> a*b
[13] i(i(a)) -> a
[14] i(b)*(b*a) -> a
[15] i(i(a)) -> a

Iteration 4 - Simplification


Looking at rewrite rule [4], the left-hand side can be rewritten as in
a*(1*b)
  [7]-> a*b
The rewrite rule can therefore be removed. Looking at rewrite rule [5], the right-hand side can be rewritten as in
1*b
  [7]-> b
We therefore replace the rewrite rule by
[5] a*(i(a)*b) -> b
Looking at rewrite rule [10], the left-hand side can be rewritten as in
a*(i(i(1))*b)
  [11]-> a*(1*b)
  [7]-> a*b
The rewrite rule can therefore be removed. Looking at rewrite rule [11], the left-hand side can be rewritten as in
i(i(a))*b
  [13]-> a*b
The rewrite rule can therefore be removed. Looking at rewrite rule [12], the left-hand side can be rewritten as in
a*i(i(b))
  [13]-> a*b
The rewrite rule can therefore be removed. Looking at rewrite rule [13], the left-hand side can be rewritten as in
i(i(a))
  [15]-> a
The rewrite rule can therefore be removed. The result is rewrite system 4bis:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)
[5] a*(i(a)*b) -> b
[6] a*i(1) -> a
[7] 1*a -> a
[8] a*(i(1)*b) -> a*b
[14] i(b)*(b*a) -> a
[15] i(i(a)) -> a


Iteration 5


The fifth iteration of the procedure can be represented graphically as:


During this iteration, eight new rewrite rules [16], [17], [18], [19], [20], [21], [22] and [23] are added. Rewrite rules [6], [8], [16], [19], [20], [21] and [22] become trivial.

Note that rewrite rule [18] states that the identity element is its own inverse. Also note that rewrite rule [23] expresses the fact that the right-side inverse element is also the left-side inverse element, even in non-abelian groups. These are nice intermediate results.

Iteration 5 - Critical Pairs


Combining rewrite rules [1] and [14] the term i(a)*(a*1) can be rewritten as in
i(a)*(a*1)
  [1]-> i(a)*a
or as in
i(a)*(a*1)
  [14]-> 1
Therefore, we create a new rewrite rule:
[16] i(a)*a -> 1
Combining rewrite rules [3] and [14] the term i(a*b)*(a*b*c) can be rewritten as in
i(a*b)*(a*b*c)
  [3]-> i(a*b)*(a*(b*c))
or as in
i(a*b)*(a*b*c)
  [14]-> c
Therefore, we create a new rewrite rule:
[17] i(b*c)*(b*(c*a)) -> a
Combining rewrite rules [6] and [5] the term a*(i(a)*i(1)) can be rewritten as in
a*(i(a)*i(1))
  [6]-> a*i(a)
  [2]-> 1
or as in
a*(i(a)*i(1))
  [5]-> i(1)
Therefore, we create a new rewrite rule:
[18] i(1) -> 1
Combining rewrite rules [6] and [14] the term i(a)*(a*i(1)) can be rewritten as in
i(a)*(a*i(1))
  [6]-> i(a)*a
or as in
i(a)*(a*i(1))
  [14]-> i(1)
Therefore, we create a new rewrite rule:
[19] i(a)*a -> i(1)
Combining rewrite rules [7] and [14] the term i(1)*(1*a) can be rewritten as in
i(1)*(1*a)
  [7]-> i(1)*a
or as in
i(1)*(1*a)
  [14]-> a
Therefore, we create a new rewrite rule:
[20] i(1)*a -> a
Combining rewrite rules [8] and [5] the term a*(i(a)*(i(1)*b)) can be rewritten as in
a*(i(a)*(i(1)*b))
  [8]-> a*(i(a)*b)
  [5]-> b
or as in
a*(i(a)*(i(1)*b))
  [5]-> i(1)*b
Therefore, we create a new rewrite rule:
[21] i(1)*a -> a
Combining rewrite rules [8] and [14] the term i(a)*(a*(i(1)*b)) can be rewritten as in
i(a)*(a*(i(1)*b))
  [8]-> i(a)*(a*b)
  [14]-> b
or as in
i(a)*(a*(i(1)*b))
  [14]-> i(1)*b
Therefore, we create a new rewrite rule:
[22] i(1)*a -> a
Combining rewrite rules [15] and [2] the term i(a)*i(i(a)) can be rewritten as in
i(a)*i(i(a))
  [15]-> i(a)*a
or as in
i(a)*i(i(a))
  [2]-> 1
Therefore, we create a new rewrite rule:
[23] i(a)*a -> 1
The result is rewrite system 5:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)
[5] a*(i(a)*b) -> b
[6] a*i(1) -> a
[7] 1*a -> a
[8] a*(i(1)*b) -> a*b
[14] i(b)*(b*a) -> a
[15] i(i(a)) -> a
[16] i(a)*a -> 1
[17] i(b*c)*(b*(c*a)) -> a
[18] i(1) -> 1
[19] i(a)*a -> i(1)
[20] i(1)*a -> a
[21] i(1)*a -> a
[22] i(1)*a -> a
[23] i(a)*a -> 1

Iteration 5 - Simplification


Looking at rewrite rule [6], the left-hand side can be rewritten as in
a*i(1)
  [18]-> a*1
  [1]-> a
The rewrite rule can therefore be removed. Looking at rewrite rule [8], the left-hand side can be rewritten as in
a*(i(1)*b)
  [20]-> a*b
The rewrite rule can therefore be removed. Looking at rewrite rule [16], the left-hand side can be rewritten as in
i(a)*a
  [19]-> i(1)
  [18]-> 1
The rewrite rule can therefore be removed. Looking at rewrite rule [19], the left-hand side can be rewritten as in
i(a)*a
  [23]-> 1
while the right-hand side can be rewritten as in
i(1)
  [18]-> 1
The rewrite rule can therefore be removed. Looking at rewrite rule [20], the left-hand side can be rewritten as in
i(1)*a
  [21]-> a
The rewrite rule can therefore be removed. Looking at rewrite rule [21], the left-hand side can be rewritten as in
i(1)*a
  [22]-> a
The rewrite rule can therefore be removed. Looking at rewrite rule [22], the left-hand side can be rewritten as in
i(1)*a
  [18]-> 1*a
  [7]-> a
The rewrite rule can therefore be removed. The result is rewrite system 5bis:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)
[5] a*(i(a)*b) -> b
[7] 1*a -> a
[14] i(b)*(b*a) -> a
[15] i(i(a)) -> a
[17] i(b*c)*(b*(c*a)) -> a
[18] i(1) -> 1
[23] i(a)*a -> 1


Iteration 6


The sixth iteration of the procedure can be represented graphically as:


During this iteration, seven new rewrite rules [24], [25], [26], [27], [28], [29] and [30] are added. Rewrite rules [17], [25], [26], [28], [29] and [30] become trivial.

Iteration 6 - Critical Pairs


Combining rewrite rules [2] and [17] the term i(a*b)*(a*(b*i(b))) can be rewritten as in
i(a*b)*(a*(b*i(b)))
  [2]-> i(a*b)*(a*1)
  [1]-> i(a*b)*a
or as in
i(a*b)*(a*(b*i(b)))
  [17]-> i(b)
Therefore, we create a new rewrite rule:
[24] i(b*a)*b -> i(a)
Combining rewrite rules [3] and [17] the term i(a*b*c)*(a*b*(c*d)) can be rewritten as in
i(a*b*c)*(a*b*(c*d))
  [3]-> i(a*b*c)*(a*(b*(c*d)))
  [3]-> i(a*(b*c))*(a*(b*(c*d)))
or as in
i(a*b*c)*(a*b*(c*d))
  [17]-> d
Therefore, we create a new rewrite rule:
[25] i(b*(c*d))*(b*(c*(d*a))) -> a
Combining rewrite rules [3] and [17] the term i(a*(b*c))*(a*(b*c*d)) can be rewritten as in
i(a*(b*c))*(a*(b*c*d))
  [3]-> i(a*(b*c))*(a*(b*(c*d)))
or as in
i(a*(b*c))*(a*(b*c*d))
  [17]-> d
Therefore, we create a new rewrite rule:
[26] i(b*(c*d))*(b*(c*(d*a))) -> a
Combining rewrite rules [5] and [17] the term i(a*b)*(a*(b*(i(b)*c))) can be rewritten as in
i(a*b)*(a*(b*(i(b)*c)))
  [5]-> i(a*b)*(a*c)
or as in
i(a*b)*(a*(b*(i(b)*c)))
  [17]-> i(b)*c
Therefore, we create a new rewrite rule:
[27] i(c*a)*(c*b) -> i(a)*b
Combining rewrite rules [14] and [17] the term i(a*i(b))*(a*(i(b)*(b*c))) can be rewritten as in
i(a*i(b))*(a*(i(b)*(b*c)))
  [14]-> i(a*i(b))*(a*c)
or as in
i(a*i(b))*(a*(i(b)*(b*c)))
  [17]-> b*c
Therefore, we create a new rewrite rule:
[28] i(c*i(a))*(c*b) -> a*b
Combining rewrite rules [23] and [17] the term i(i(a*b)*a)*(i(a*b)*(a*b)) can be rewritten as in
i(i(a*b)*a)*(i(a*b)*(a*b))
  [23]-> i(i(a*b)*a)*1
  [1]-> i(i(a*b)*a)
or as in
i(i(a*b)*a)*(i(a*b)*(a*b))
  [17]-> b
Therefore, we create a new rewrite rule:
[29] i(i(b*a)*b) -> a
Combining rewrite rules [23] and [17] the term i(a*i(b))*(a*(i(b)*b)) can be rewritten as in
i(a*i(b))*(a*(i(b)*b))
  [23]-> i(a*i(b))*(a*1)
  [1]-> i(a*i(b))*a
or as in
i(a*i(b))*(a*(i(b)*b))
  [17]-> b
Therefore, we create a new rewrite rule:
[30] i(b*i(a))*b -> a
The result is rewrite system 6:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)
[5] a*(i(a)*b) -> b
[7] 1*a -> a
[14] i(b)*(b*a) -> a
[15] i(i(a)) -> a
[17] i(b*c)*(b*(c*a)) -> a
[18] i(1) -> 1
[23] i(a)*a -> 1
[24] i(b*a)*b -> i(a)
[25] i(b*(c*d))*(b*(c*(d*a))) -> a
[26] i(b*(c*d))*(b*(c*(d*a))) -> a
[27] i(c*a)*(c*b) -> i(a)*b
[28] i(c*i(a))*(c*b) -> a*b
[29] i(i(b*a)*b) -> a
[30] i(b*i(a))*b -> a

Iteration 6 - Simplification


Looking at rewrite rule [17], the left-hand side can be rewritten as in
i(b*c)*(b*(c*a))
  [27]-> i(c)*(c*a)
  [14]-> a
The rewrite rule can therefore be removed. Looking at rewrite rule [25], the left-hand side can be rewritten as in
i(b*(c*d))*(b*(c*(d*a)))
  [26]-> a
The rewrite rule can therefore be removed. Looking at rewrite rule [26], the left-hand side can be rewritten as in
i(b*(c*d))*(b*(c*(d*a)))
  [27]-> i(c*d)*(c*(d*a))
  [27]-> i(d)*(d*a)
  [14]-> a
The rewrite rule can therefore be removed. Looking at rewrite rule [28], the left-hand side can be rewritten as in
i(c*i(a))*(c*b)
  [27]-> i(i(a))*b
  [15]-> a*b
The rewrite rule can therefore be removed. Looking at rewrite rule [29], the left-hand side can be rewritten as in
i(i(b*a)*b)
  [24]-> i(i(a))
  [15]-> a
The rewrite rule can therefore be removed. Looking at rewrite rule [30], the left-hand side can be rewritten as in
i(b*i(a))*b
  [24]-> i(i(a))
  [15]-> a
The rewrite rule can therefore be removed. The result is rewrite system 6bis:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)
[5] a*(i(a)*b) -> b
[7] 1*a -> a
[14] i(b)*(b*a) -> a
[15] i(i(a)) -> a
[18] i(1) -> 1
[23] i(a)*a -> 1
[24] i(b*a)*b -> i(a)
[27] i(c*a)*(c*b) -> i(a)*b


Iteration 7


The seventh iteration of the procedure can be represented graphically as:


During this iteration, five new rewrite rules [31], [32], [33], [34] and [35] are added. Rewrite rules [24], [27], [32], [33], [34] and [35] become trivial.

Note that rewrite rule [31] states that the inverse of a product of two elements is the product of the inverted elements in inverted order, which is a nice intermediate result.

Iteration 7 - Critical Pairs


Combining rewrite rules [2] and [27] the term i(a*b)*(a*i(a)) can be rewritten as in
i(a*b)*(a*i(a))
  [2]-> i(a*b)*1
  [1]-> i(a*b)
or as in
i(a*b)*(a*i(a))
  [27]-> i(b)*i(a)
Therefore, we create a new rewrite rule:
[31] i(b*a) -> i(a)*i(b)
Combining rewrite rules [5] and [27] the term i(a*b)*(a*(i(a)*c)) can be rewritten as in
i(a*b)*(a*(i(a)*c))
  [5]-> i(a*b)*c
or as in
i(a*b)*(a*(i(a)*c))
  [27]-> i(b)*(i(a)*c)
Therefore, we create a new rewrite rule:
[32] i(b*a)*c -> i(a)*(i(b)*c)
Combining rewrite rules [14] and [27] the term i(i(a)*b)*(i(a)*(a*c)) can be rewritten as in
i(i(a)*b)*(i(a)*(a*c))
  [14]-> i(i(a)*b)*c
or as in
i(i(a)*b)*(i(a)*(a*c))
  [27]-> i(b)*(a*c)
Therefore, we create a new rewrite rule:
[33] i(i(b)*a)*c -> i(a)*(b*c)
Combining rewrite rules [23] and [27] the term i(i(a)*b)*(i(a)*a) can be rewritten as in
i(i(a)*b)*(i(a)*a)
  [23]-> i(i(a)*b)*1
  [1]-> i(i(a)*b)
or as in
i(i(a)*b)*(i(a)*a)
  [27]-> i(b)*a
Therefore, we create a new rewrite rule:
[34] i(i(b)*a) -> i(a)*b
Combining rewrite rules [24] and [27] the term i(i(a*b)*c)*(i(a*b)*a) can be rewritten as in
i(i(a*b)*c)*(i(a*b)*a)
  [24]-> i(i(a*b)*c)*i(b)
or as in
i(i(a*b)*c)*(i(a*b)*a)
  [27]-> i(c)*a
Therefore, we create a new rewrite rule:
[35] i(i(b*c)*a)*i(c) -> i(a)*b
The result is rewrite system 7:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)
[5] a*(i(a)*b) -> b
[7] 1*a -> a
[14] i(b)*(b*a) -> a
[15] i(i(a)) -> a
[18] i(1) -> 1
[23] i(a)*a -> 1
[24] i(b*a)*b -> i(a)
[27] i(c*a)*(c*b) -> i(a)*b
[31] i(b*a) -> i(a)*i(b)
[32] i(b*a)*c -> i(a)*(i(b)*c)
[33] i(i(b)*a)*c -> i(a)*(b*c)
[34] i(i(b)*a) -> i(a)*b
[35] i(i(b*c)*a)*i(c) -> i(a)*b

Iteration 7 - Simplification


Looking at rewrite rule [24], the left-hand side can be rewritten as in
i(b*a)*b
  [32]-> i(a)*(i(b)*b)
  [23]-> i(a)*1
  [1]-> i(a)
The rewrite rule can therefore be removed. Looking at rewrite rule [27], the left-hand side can be rewritten as in
i(c*a)*(c*b)
  [32]-> i(a)*(i(c)*(c*b))
  [14]-> i(a)*b
The rewrite rule can therefore be removed. Looking at rewrite rule [32], the left-hand side can be rewritten as in
i(b*a)*c
  [31]-> (i(a)*i(b))*c
  [3]-> i(a)*(i(b)*c)
The rewrite rule can therefore be removed. Looking at rewrite rule [33], the left-hand side can be rewritten as in
i(i(b)*a)*c
  [31]-> i(a)*i(i(b))*c
  [3]-> i(a)*(i(i(b))*c)
  [15]-> i(a)*(b*c)
The rewrite rule can therefore be removed. Looking at rewrite rule [34], the left-hand side can be rewritten as in
i(i(b)*a)
  [31]-> i(a)*i(i(b))
  [15]-> i(a)*b
The rewrite rule can therefore be removed. Looking at rewrite rule [35], the left-hand side can be rewritten as in
i(i(b*c)*a)*i(c)
  [31]-> i(a)*i(i(b*c))*i(c)
  [3]-> i(a)*(i(i(b*c))*i(c))
  [15]-> i(a)*(b*c*i(c))
  [3]-> i(a)*(b*(c*i(c)))
  [2]-> i(a)*(b*1)
  [1]-> i(a)*b
The rewrite rule can therefore be removed. The result is rewrite system 7bis:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)
[5] a*(i(a)*b) -> b
[7] 1*a -> a
[14] i(b)*(b*a) -> a
[15] i(i(a)) -> a
[18] i(1) -> 1
[23] i(a)*a -> 1
[31] i(b*a) -> i(a)*i(b)

Completion


The resulting rewrite system 7bis cannot be extended any further. We have therefore reached the fixed point of the Knuth-Bendix completion procedure.

The rewrite system can be represented graphically as:


This result constitutes the deterministic rewrite system that is semantically equivalent to the theory consisting of the three non-directed axioms we started off with. It provides a decidable computational framework for non-abelian groups, in the sense that, to repeat, semantic equality of terms is equivalent to syntactic equality of their normal forms.
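As a sanity check, the completed system can be executed mechanically. The following Java sketch (all class and method names are our own, not part of the original automation) implements rewrite system 7bis with naive pattern matching and innermost rewriting, and confirms that i(a)*a normalizes to 1 without any detour through more complex terms, and that i(a*b) normalizes to i(b)*i(a):

```java
import java.util.*;

public class GroupNF {
    // Terms: pattern variables, constant symbols ("1" and free elements such as "a"),
    // and applications of the functors "*" (binary) and "i" (unary).
    interface T {}
    record Var(String n) implements T {}
    record Sym(String n) implements T {}
    record Op(String f, List<T> args) implements T {}

    static T v(String n) { return new Var(n); }
    static T s(String n) { return new Sym(n); }
    static T mul(T x, T y) { return new Op("*", List.of(x, y)); }
    static T inv(T x) { return new Op("i", List.of(x)); }

    // Rewrite system 7bis, each rule directed from left to right.
    static final T[][] RULES = {
        { mul(v("a"), s("1")), v("a") },                                        // [1]
        { mul(v("a"), inv(v("a"))), s("1") },                                   // [2]
        { mul(mul(v("a"), v("b")), v("c")), mul(v("a"), mul(v("b"), v("c"))) }, // [3]
        { mul(v("a"), mul(inv(v("a")), v("b"))), v("b") },                      // [5]
        { mul(s("1"), v("a")), v("a") },                                        // [7]
        { mul(inv(v("b")), mul(v("b"), v("a"))), v("a") },                      // [14]
        { inv(inv(v("a"))), v("a") },                                           // [15]
        { inv(s("1")), s("1") },                                                // [18]
        { mul(inv(v("a")), v("a")), s("1") },                                   // [23]
        { inv(mul(v("b"), v("a"))), mul(inv(v("a")), inv(v("b"))) },            // [31]
    };

    // Match pattern p against term t, extending substitution sub; null on failure.
    static Map<String, T> match(T p, T t, Map<String, T> sub) {
        if (p instanceof Var pv) {
            T bound = sub.putIfAbsent(pv.n(), t);
            return (bound == null || bound.equals(t)) ? sub : null;
        }
        if (p instanceof Sym ps && t instanceof Sym ts && ps.n().equals(ts.n())) return sub;
        if (p instanceof Op po && t instanceof Op to && po.f().equals(to.f())) {
            for (int k = 0; k < po.args().size(); k++)
                if (match(po.args().get(k), to.args().get(k), sub) == null) return null;
            return sub;
        }
        return null;
    }

    static T subst(T p, Map<String, T> sub) {
        if (p instanceof Var pv) return sub.get(pv.n());
        if (p instanceof Op po) {
            List<T> as = new ArrayList<>();
            for (T a : po.args()) as.add(subst(a, sub));
            return new Op(po.f(), as);
        }
        return p;
    }

    // Perform one rewrite step, innermost subterms first; null if t is a normal form.
    static T step(T t) {
        if (t instanceof Op o)
            for (int k = 0; k < o.args().size(); k++) {
                T r = step(o.args().get(k));
                if (r != null) {
                    List<T> as = new ArrayList<>(o.args());
                    as.set(k, r);
                    return new Op(o.f(), as);
                }
            }
        for (T[] rule : RULES) {
            Map<String, T> sub = match(rule[0], t, new HashMap<>());
            if (sub != null) return subst(rule[1], sub);
        }
        return null;
    }

    static T normalize(T t) {
        for (T r = step(t); r != null; r = step(t)) t = r;
        return t;
    }

    public static void main(String[] args) {
        // i(a)*a and 1 share the normal form 1: the left-inverse exercise, mechanized.
        System.out.println(normalize(mul(inv(s("a")), s("a"))).equals(s("1")));
        // i(a*b) normalizes to i(b)*i(a): rule [31] in action.
        System.out.println(normalize(inv(mul(s("a"), s("b"))))
                .equals(mul(inv(s("b")), inv(s("a")))));
    }
}
```

Since the system is convergent, the rule scanning order and the reduction strategy do not affect the resulting normal form, only the number of steps taken.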

Monday, October 20, 2014

Self-Replication in a Combinatorial Soup

About the spontaneous emergence of self-replication in combinatory logic and its implementation in Java. We explore some variations of combinatory logic, such as SKI, BCKW and the one-point basis. The use of anonymous Java classes provides a simple and elegant implementation.

Introduction


Evidence for evolution of organisms in an ecosystem is ubiquitous. The power of Darwinism in explaining the variation in life forms on earth is overwhelming. Experiments in artificial life have been able to reproduce many aspects of evolution.

The remaining mystery of life on earth is in the origins of the first living organisms. Tierra and Avida both start off with a population of "functioning" organisms and focus on the evolution of their code base, a metaphor for the genome. This post is about that first step: the transition of dead matter into the first biological agents with some properties of life.

Simulating chemistry is computationally complex. It would also be hard to create an environment with more potential than the one we have around us. Therefore, we will be using combinatory logic as a metaphor for chemistry: more expressive, computationally simpler, and still Turing-equivalent.

Combinatory Logic


Combinatory logic refers to variable-free variants of lambda calculus. Combinatory logic offers function application as its sole operation. Function application is denoted by juxtaposition and associates to the left. Brackets are used to specify a different association. As in lambda calculus, all computation is realized through term rewriting, also called reduction.

One variant of combinatory logic, SKI Combinator Calculus, uses an alphabet of three letters, I, K and S, also called combinators. The rewrite rules for this calculus are few and simple:
Ix -> x
Kxy -> x
Sfgx -> fx(gx)
Here, the combinator I represents the identity function, as, applied to x, it yields x. The combinator K is the constant function generator, as K applied to x is the constant x-function. Finally, the combinator S is called the steering function. It distributes its third argument over the first and the second argument, and yields the application of both results.

In order to get some familiarity with term rewriting, we will write out the reduction of SKK applied to x.
SKKx
-> Kx(Kx)
-> x
The reduction shows that SKK applied to x yields x, for any x. Apparently, SKK is semantically equivalent to the I combinator, and the latter is redundant in SKI combinator calculus. This means we could do with the combinators K and S only, without losing computational power.
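The abstract above mentions that anonymous Java classes give a simple and elegant implementation. As an illustrative sketch (the class and field names are our own), the combinators can be modelled semantically: a combinator is simply something that can be applied to another combinator, yielding a combinator:

```java
public class SKI {
    // A combinator, modelled semantically: something that can be applied to
    // another combinator, yielding a combinator.
    interface Expr { Expr apply(Expr x); }

    // Ix -> x
    static final Expr I = new Expr() { public Expr apply(Expr x) { return x; } };

    // Kxy -> x
    static final Expr K = new Expr() { public Expr apply(final Expr x) {
        return new Expr() { public Expr apply(Expr y) { return x; } };
    } };

    // Sfgx -> fx(gx)
    static final Expr S = new Expr() { public Expr apply(final Expr f) {
        return new Expr() { public Expr apply(final Expr g) {
            return new Expr() { public Expr apply(Expr x) {
                return f.apply(x).apply(g.apply(x));
            } };
        } };
    } };

    public static void main(String[] args) {
        // SKK applied to any expression returns that expression: SKK behaves as I.
        Expr skk = S.apply(K).apply(K);
        System.out.println(skk.apply(I) == I);  // prints true
    }
}
```

Note that this representation captures behaviour only; two expressions can be compared by probing them with arguments, but their syntactic structure is not observable, which mirrors the absence of introspection in combinatory logic.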

Combinatorial Soup


For our experiment we will simulate a combinatorial soup, a metaphor for the primordial soup, as a pool of expressions in SKI combinator calculus. The pool will be initialized with random combinators out of the alphabet, I, K and S. Chemical interaction will be replaced by function application, meaning that for randomly chosen expressions a and b, a will be replaced by ab (a applied to b) and b will be replaced by ba (b applied to a).

In our experiment, we will use no locality, in the sense that interactions occur between two randomly chosen expressions, regardless of their location. One might say the soup is served very hot.

Self-Replication


Given this setup, on most occasions the simulation results in a homogeneous soup, populated with specific combinatorial expressions. The shortest of these expressions, consisting of only 10 primitive combinators, is:
S(K(SSK))(K(K(SSK)))
This expression is self-replicating: when applied to any other expression, it yields a copy of itself, as illustrated by the following reduction sequence:
S(K(SSK))(K(K(SSK)))x
-> K(SSK)x(K(K(SSK))x)
-> SSK(K(K(SSK))x)
-> S(K(K(SSK))x)(K(K(K(SSK))x))
-> S(K(SSK))(K(K(K(SSK))x))
-> S(K(SSK))(K(K(SSK)))
The property of self-replication is not a trivial thing. The reduction is not merely making a copy of the original, as this would be kind of cheating. In fact, copying would be impossible, as there is no introspection in combinatory logic.

During the reduction the expression is thoroughly consumed. The fact that a copy of the expression is eventually reproduced implies that the structure of the expression is somehow encoded implicitly in the expression itself.
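The reduction sequence above can also be checked mechanically with a small syntactic reducer. The following Java sketch (names are our own choosing) represents SKI expressions as trees, performs leftmost-outermost reduction, and confirms that the expression, applied to a fresh symbol x, reduces back to itself:

```java
import java.util.*;

public class Replicator {
    // Syntactic SKI expressions: a leaf (a combinator name or a free symbol)
    // or the application of one expression to another.
    record E(String leaf, E fun, E arg) {
        static E leaf(String s) { return new E(s, null, null); }
        static E ap(E f, E x) { return new E(null, f, x); }
        boolean isLeaf() { return leaf != null; }
    }

    // One leftmost-outermost reduction step; null if the expression is in normal form.
    static E step(E e) {
        List<E> spine = new ArrayList<>();       // arguments along the left spine
        E h = e;
        while (!h.isLeaf()) { spine.add(0, h.arg()); h = h.fun(); }
        String c = h.leaf();
        if (c.equals("I") && spine.size() >= 1)  // Ix -> x
            return rebuild(spine.get(0), spine, 1);
        if (c.equals("K") && spine.size() >= 2)  // Kxy -> x
            return rebuild(spine.get(0), spine, 2);
        if (c.equals("S") && spine.size() >= 3) { // Sfgx -> fx(gx)
            E f = spine.get(0), g = spine.get(1), x = spine.get(2);
            return rebuild(E.ap(E.ap(f, x), E.ap(g, x)), spine, 3);
        }
        for (int k = 0; k < spine.size(); k++) { // no head redex: reduce the arguments
            E r = step(spine.get(k));
            if (r != null) { spine.set(k, r); return rebuild(E.leaf(c), spine, 0); }
        }
        return null;
    }

    // Reapply the unconsumed spine arguments to a new head.
    static E rebuild(E head, List<E> spine, int consumed) {
        for (int k = consumed; k < spine.size(); k++) head = E.ap(head, spine.get(k));
        return head;
    }

    static E normalize(E e) {
        for (E r = step(e); r != null; r = step(e)) e = r;
        return e;
    }

    public static void main(String[] args) {
        E S = E.leaf("S"), K = E.leaf("K"), x = E.leaf("x");
        E ssk = E.ap(E.ap(S, S), K);                                // SSK
        E rep = E.ap(E.ap(S, E.ap(K, ssk)), E.ap(K, E.ap(K, ssk))); // S(K(SSK))(K(K(SSK)))
        // Applied to a fresh symbol x, the expression reduces back to itself.
        System.out.println(normalize(E.ap(rep, x)).equals(rep));    // prints true
    }
}
```

A step bound would have to be added before using such a reducer inside the soup simulation itself, since arbitrary interactions can produce expressions without a normal form.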

Another expression that frequently occurs is somewhat more complex with 11 primitive combinators:
S(K(SIK))(K(S(K(SIK))))
As before, this expression is self-replicating, as illustrated by the following reduction sequence:
S(K(SIK))(K(S(K(SIK))))x
-> K(SIK)x(K(S(K(SIK)))x)
-> SIK(K(S(K(SIK)))x)
-> I(K(S(K(SIK)))x)(K(K(S(K(SIK)))x))
-> K(S(K(SIK)))x(K(K(S(K(SIK)))x))
-> S(K(SIK))(K(K(S(K(SIK)))x))
-> S(K(SIK))(K(S(K(SIK))))
Many larger variants are arising on other occasions, all sharing the same property of self-replication.

An Alternative Alphabet


Different variations of combinatory logic have been defined, each based on a specific alphabet, their own set of primitive combinators, and each with their own set of rewrite rules. One such variant is the BCKW-system, which uses the combinators B, C, K and W, indeed.

The semantics of the BCKW-system are defined by the following rewrite rules:
Bxyz -> x(yz)
Cxyz -> xzy
Kxy -> x
Wxy -> xyy
Here, B and C can be seen as restricted versions of the S combinator, simplifying reduction at the functor or argument side respectively.

Repeating the experiment with a combinatorial soup based on this alphabet yields similar results. As this variant is somewhat more expressive, the shortest expression that appears counts only 7 primitive combinators:
C(K(WC))(K(WC))
Self-replication of this expression is demonstrated by the following reduction sequence:
C(K(WC))(K(WC))x
-> K(WC)x(K(WC))
-> WC(K(WC))
-> C(K(WC))(K(WC))
A variant consisting of 9 primitive combinators is:
C(C(KW)C)(C(KW)C)
The demonstration of self-replication is left as an exercise.
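These reductions are mechanical enough to automate. The following self-contained sketch of a BCKW interpreter (the term representation and all names are my own) reduces head-position redexes only, which suffices to reproduce the trace above and to check the exercise:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A small head-rewriting interpreter for BCKW terms. A term is either
// a String (a combinator or variable) or an Object[]{f, a} application.
public class BCKW {

   // function application: (f a)
   static Object app(Object f, Object a) { return new Object[]{f, a}; }

   // one outermost reduction step; returns null when no head rule applies
   static Object step(Object t) {
      Deque<Object> args = new ArrayDeque<>();
      Object h = t;
      while (h instanceof Object[]) {                   // unwind the spine
         Object[] p = (Object[]) h;
         args.push(p[1]);
         h = p[0];
      }
      Object r;
      if (h.equals("B") && args.size() >= 3) {          // Bxyz -> x(yz)
         Object x = args.pop(), y = args.pop(), z = args.pop();
         r = app(x, app(y, z));
      } else if (h.equals("C") && args.size() >= 3) {   // Cxyz -> xzy
         Object x = args.pop(), y = args.pop(), z = args.pop();
         r = app(app(x, z), y);
      } else if (h.equals("K") && args.size() >= 2) {   // Kxy -> x
         Object x = args.pop(); args.pop();
         r = x;
      } else if (h.equals("W") && args.size() >= 2) {   // Wxy -> xyy
         Object x = args.pop(), y = args.pop();
         r = app(app(x, y), y);
      } else {
         return null;                                   // head in normal form
      }
      while (!args.isEmpty()) r = app(r, args.pop());   // re-attach the rest
      return r;
   }

   // render a term, parenthesizing arguments that are applications
   static String show(Object t) {
      if (t instanceof String) return (String) t;
      Object[] p = (Object[]) t;
      String a = show(p[1]);
      if (p[1] instanceof Object[]) a = "(" + a + ")";
      return show(p[0]) + a;
   }

   public static void main(String[] args) {
      Object kwc = app("K", app("W", "C"));
      Object t = app(app(app("C", kwc), kwc), "x");
      for (Object s = t; s != null; s = step(s)) System.out.println(show(s));
   }
}
```

Running main prints the four lines of the reduction trace above, ending in the normal form C(K(WC))(K(WC)).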

One-Point Basis


We will close the experiment using a variant of combinatory logic with a one-point basis. Let's define the combinator X as:
X := λx.xSK
We are using lambda notation for the definition of X, as explained in the Wikipedia article on combinatory logic.

Both K and S have semantic equivalents that can be expressed in terms of X, for example:
K := X(X(XX))
S := X(X(X(XX)))
This means that the variant of combinatory logic with only X as an alphabet is computationally equivalent to the other variants. However, we do not have a reduction rule for X that is expressed in terms of X only. More problematically, if we substitute K and S in the definition of X, we get an expression that has no normal form: the reduction never stops.

As an alternative, we will reduce expressions in X using the definition in terms of K and S, and if that reduction results in a normal form, substitute K and S by their equivalents expressed in terms of X, without any further reduction. This way, we avoid the problem of infinite reduction inherent to the definition of X and we get a kind of normal form for many expressions.

Repeating the experiment with a combinatorial soup based on this one-point basis and this single rewrite rule again yields similar results. As this variant of combinatory logic is significantly less expressive, the shortest self-replicating expression that appears requires 45 primitive combinators:
X(X(X(XX)))(X(X(XX))(X(X(X(XX)))(X(X(X(XX))))(X(X(XX)))))(X(X(XX))(X(X(XX))(X(X(X(XX)))(X(X(X(XX))))(X(X(XX))))))
As before, the demonstration of self-replication is left as an exercise, but make sure to have enough paper at hand.

Conclusions


In terms of artificial life and the mystery of the origin of life, the results of the experiment are not convincing. The fitness function is endogenous, but with the best of intentions, the resulting "organisms" cannot be called protocells. Also, the system is not "open-ended": after a while, no new forms arise and the soup becomes stable in its composition.

In terms of playing with combinatory logic and self-organization, however, the results obtained are interesting and may be an inspiration for future work. We could add locality, using a 3D structure for the ecosystem, and limit interactions in space. Also, we might introduce occasional errors in term rewriting, as a kind of mutation. This would change the proximity of "organisms" and allow the exploration of new tracks in "animal space", as introduced by Richard Dawkins in The Blind Watchmaker. It might be a first step towards an "open-ended" system.

Another orientation might be to make the fitness function exogenous, measuring resource consumption, such as the size of expressions or the number of reduction cycles needed to reach normal form. Alternatively, we could introduce fitness scoring as a function of the ability to perform specific tasks, in number theory for example, and evolve useful functions.

Or we could use the ability for self-interpretation as a fitness function ...

Combinator Representation and Reduction


To complete this post we will present the essentials of the implementation of combinators in Java as it was used to run these simulations. The implementation makes a clear separation between representation and reduction, such that these aspects can be used independently of one another.

Combinators are immutable. All combinators implement the Combinator interface.
public interface Combinator {
   public Combinator apply(Combinator a);
   public Combinator reduce();
}
The apply function is used to create the representation of function application. The reduce function implements a single step in term reduction: it applies a single rewrite rule and returns the resulting combinator. If no reduction is possible, reduce returns the combinator itself, unmodified. The expression is said to be in normal form.

The Apply class is a generic representation of an application. It is also the sole class used to build the structure of expressions.
public class Apply implements Combinator {

   // the field names are chosen so they cannot collide with the parameter
   // names (a, b, c) captured by the anonymous subclasses below: an
   // inherited field would otherwise shadow the captured parameter
   protected Combinator fun;
   protected Combinator arg;
 
   public Apply(Combinator fun, Combinator arg) {
      this.fun = fun;
      this.arg = arg;
   }

   ...

   public Combinator reduce() {
      Combinator g = fun.reduce();
      if (fun != g) return g.apply(arg);
      Combinator b = arg.reduce();
      if (arg != b) return fun.apply(b);
      return this;
   }

}
The class provides a default implementation of the reduce function that realizes outermost reduction, the rewriting analogue of lazy evaluation. It first tries to reduce the functor of the application; only if that is unsuccessful does it continue by trying to reduce the argument.

Given these building blocks, the I combinator can be realized in a simple way.
public class I implements Combinator {

   ...

   public Combinator apply(Combinator a) {
      return new Apply(this, a) {
         public Combinator reduce() {
            return a;
         }
      };
   }

}
The apply function returns an anonymous class extending the Apply class. The anonymous class overrides the reduce function to implement the semantics of the I combinator. Read this as: apply returns a combinator that, when reduced, returns its argument.

In a similar way, the K combinator can be realized.
public class K implements Combinator {

   ...

   public Combinator apply(Combinator a) {
      return new Apply(this, a) {
         public Combinator apply(Combinator b) {
            return new Apply(this, b) {
               public Combinator reduce() {
                  return a;
               }
            };
         }
      };
   }

}
As the K combinator requires two arguments in order to be reduced, nested anonymous classes are used. In this case, the first level overrides the apply function of Apply, while the second level overrides the reduce function.

And we need an extra level of anonymous classes for the implementation of the S combinator.
public class S implements Combinator {

   ...

   public Combinator apply(Combinator a) {
      return new Apply(this, a) {
         public Combinator apply(Combinator b) {
            return new Apply(this, b) {
               public Combinator apply(Combinator c) {
                  return new Apply(this, c) {
                     public Combinator reduce() {
                        return a.apply(c).apply(b.apply(c));
                     }
                  };
               }
            };
         }
      };
   }

}
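Putting the pieces together, reduction to normal form can be demonstrated with a condensed, self-contained version of the classes above. The Var leaf class, the toString methods and the normalize driver are additions of my own for demonstration; note that the Apply field names are chosen not to collide with the captured parameter names, as inherited fields would otherwise shadow them inside the anonymous classes. Since SKK behaves as the identity, SKKx normalizes to x:

```java
interface Combinator {
   Combinator apply(Combinator a);
   Combinator reduce();
}

class Apply implements Combinator {
   // fun/arg cannot collide with the parameters a, b, c captured below
   protected final Combinator fun, arg;
   Apply(Combinator fun, Combinator arg) { this.fun = fun; this.arg = arg; }
   public Combinator apply(Combinator b) { return new Apply(this, b); }
   public Combinator reduce() {
      Combinator g = fun.reduce();
      if (fun != g) return g.apply(arg);
      Combinator b = arg.reduce();
      if (arg != b) return fun.apply(b);
      return this;
   }
   public String toString() { return fun + "(" + arg + ")"; }
}

class Var implements Combinator {
   private final String name;
   Var(String name) { this.name = name; }
   public Combinator apply(Combinator a) { return new Apply(this, a); }
   public Combinator reduce() { return this; }
   public String toString() { return name; }
}

class K implements Combinator {
   public Combinator apply(Combinator a) {
      return new Apply(this, a) {
         public Combinator apply(Combinator b) {
            return new Apply(this, b) {
               public Combinator reduce() { return a; }   // Kab -> a
            };
         }
      };
   }
   public Combinator reduce() { return this; }
   public String toString() { return "K"; }
}

class S implements Combinator {
   public Combinator apply(Combinator a) {
      return new Apply(this, a) {
         public Combinator apply(Combinator b) {
            return new Apply(this, b) {
               public Combinator apply(Combinator c) {
                  return new Apply(this, c) {
                     public Combinator reduce() {
                        return a.apply(c).apply(b.apply(c));   // Sabc -> ac(bc)
                     }
                  };
               }
            };
         }
      };
   }
   public Combinator reduce() { return this; }
   public String toString() { return "S"; }
}

public class Demo {
   // iterate single-step reduce() until a fixpoint is reached
   static Combinator normalize(Combinator c) {
      for (Combinator n = c.reduce(); n != c; n = c.reduce()) c = n;
      return c;
   }
   public static void main(String[] args) {
      Combinator skk = new S().apply(new K()).apply(new K());
      System.out.println(normalize(skk.apply(new Var("x"))));
   }
}
```

Running main prints x, confirming that SKKx reduces to x.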

Conclusion


The combination of a single, generic representation class for application and creating nested anonymous classes in the apply function provides an elegant way to implement primitive combinators. It also significantly simplifies the implementation of currying.

Tuesday, September 16, 2014

Tuple Calculus using Java 8 Streams

About a light-weight approach to processing arbitrary tabular data sets using Java 8 streams. The encapsulation of stream functions and utilities as operations on typed tuple streams offers a coherent basis for processing tabular data sets. The approach extends naturally into a straightforward domain specific language that is easy to explain and understand.

Introduction


Recently I came across the problem of processing arbitrary tabular data sets from various sources. Data sources included query results from relational databases, search engines and a bunch of flat files. Processing had to support filtering, calculated fields, aggregation, joining, sorting ..., basically the full computational power offered by SQL, Relational Algebra or Tuple Calculus.

This post summarizes a solution to this problem using a light-weight domain specific language on top of Java 8 streams. The approach offers an encapsulation of streams where:

  • Streamed objects are tuples, supporting field access by name and position.
  • Tuple streams are typed using schemas. Schemas are used to define stream operations too.
  • Tuple streams are defined as a composition of operations. Operations encapsulate the specifics of and variations in lambda types and stream composition.
  • These tuple stream definitions are expressed using a light-weight domain specific language.

Benford's Law


In this post I will be using Benford's law to illustrate the concepts and techniques introduced.

Benford's law refers to the frequency distribution of the first digit in real-life, numerical data. One would expect a uniform distribution of digits. Counterintuitively, the distribution is not uniform: digit 1 occurs as the first digit about 30% of the time, while digit 9 occurs as the first digit less than 5% of the time, as predicted by Benford's law.
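The predicted relative frequency of first digit d is log10(1 + 1/d); a few lines of Java (class and method names are my own) compute these predictions:

```java
// Benford's law: the predicted relative frequency of first digit d
// is log10(1 + 1/d), e.g. log10(2) ~ 0.301030 for digit 1.
public class Benford {

   public static double prediction(int digit) {
      return Math.log10(1.0 + 1.0 / digit);
   }

   public static void main(String[] args) {
      for (int d = 1; d <= 9; d++) {
         System.out.printf("%d: %.6f%n", d, prediction(d));
      }
   }
}
```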

I will use the World Bank 2013 country surface areas data set and compare the frequencies predicted by Benford's law with the actual observations. The data set provides the surface area of over 200 countries. The data has been prepared in a CSV-file, say country.csv, which starts with the following content:

country         surface:Decimal
Afghanistan     652,230
Albania         28,750
Algeria         2,381,740

The country data can be processed using the following tuple stream definition:
read('country.csv')
.aggregate(
   { digit <- integer(substr(text(surface),0,1)) },
   { observation <- count()/count(*), prediction <- log10(1+1/digit) }
)
.sort({ digit })
The operations that constitute the tuple stream definition are:
  • The Read generator produces the content of a CSV-file as a stream of tuples. The name of the file is provided as an argument. The header row of the CSV-file is used as the schema definition of the content of the file.
  • The Aggregate operation groups the data, based on a group schema defining the tuple used for grouping and a value schema defining the aggregated values for each group. The result schema is the concatenation of the group and value schemas.
    • The digit field in the group schema is defined as the first character of the conversion of the surface area to text, converted to an integer again.
    • The observation field in the value schema defines the relative frequency for each digit, dividing the absolute frequency by the total number of entries.
    • The prediction field in the value schema defines the theoretical relative frequency for each digit using Benford's law.
  • The Sort operation sorts these data on the digit field.

The resulting tuple stream generates the following frequency values:

digit:Integer   observation:Decimal   prediction:Decimal
1               0.276995              0.301030
2               0.178404              0.176091
3               0.122066              0.124939
4               0.112676              0.096910
5               0.079812              0.079181
6               0.061033              0.066947
7               0.070423              0.057992
8               0.042254              0.051153
9               0.056338              0.045757

To compare the observations with the predictions we combine them into a chart:


Clearly, the observed values (blue column chart) nicely match the theoretical frequencies (red line chart). The chi-square test is left as an exercise.

The remainder of this post illustrates how the tuple calculus used in the example can be realized using Java 8 streams.

Building Blocks


Before coming to the actual Java 8 streams, I will introduce some concepts and the basics of their definition in Java.

I will be working with streams of tuples, Stream<Tuple>, where Tuple is an interface. A tuple represents an ordered collection of objects and provides access to these objects by name.
public interface Tuple extends Comparable<Tuple> {
   public Schema getSchema();
   public Object get(String code);
}
Tuples are comparable, using the ordering of the values they contain. Tuples are mutable, in the sense that the values they contain can be updated. For brevity, the setter functions have been left out.
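Ignoring schemas for a moment, the field-by-field ordering can be sketched as follows; MiniTuple is a simplified stand-in of my own, not the actual Tuple implementation:

```java
// A schema-less sketch of tuple ordering: tuples compare field by
// field, in order, using the natural ordering of the values.
public class MiniTuple implements Comparable<MiniTuple> {

   private final Comparable[] values;

   public MiniTuple(Comparable... values) { this.values = values; }

   @SuppressWarnings("unchecked")
   public int compareTo(MiniTuple other) {
      for (int i = 0; i < values.length; i++) {
         int c = values[i].compareTo(other.values[i]);
         if (c != 0) return c;
      }
      return 0;
   }

   public static void main(String[] args) {
      // ("a", 2) precedes ("a", 3): equal on the first field,
      // decided by the second
      System.out.println(new MiniTuple("a", 2).compareTo(new MiniTuple("a", 3)) < 0);
   }
}
```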

The content of a tuple is described by a Schema. Basically, a schema is an ordered collection of fields. The Schema interface provides positional access to these fields.
public interface Schema {
   public int getSize();
   public Field getField(int i);
}
Besides the access functions mentioned here, a schema provides access to fields by name and some convenience functions to add fields and get a field iterator.

A field is defined by a name and a primitive data type. As schemas are used for multiple purposes, fields are somewhat overloaded and may specify an expression for calculated fields and a sort order for schemas used for sorting and merging streams.
public interface Field {
   public String getCode();
   public AttributeType getType();
   public Expression getExpression();
   public boolean getAscending();
}
An expression is the representation of an algebraic formula that supports the calculation of field values. I will come back to this subject when talking about a domain specific language for tuple streams.

Stream Operations


Now that we have tuples, schemas and fields, we can start talking about operations. Operations are the building blocks of tuple stream definitions. A tuple stream definition is a composition of operations.

An operation typically operates on one or more source operations. An operation that has no source operations is a generator. It can be used as the starting point for a tuple stream definition. I will be talking about generators further on.

Before addressing some examples of operations, we will have a look at the Operation interface.
public interface Operation {
   public Schema getSchema();
   public Stream<Tuple> getStream();
}
An operation provides access to instances of the stream that it defines using the getStream function. It also provides access to the schema of this stream.

As a first example, consider the Filter operation, used for selecting a subset of the tuples in a stream. The selection criterion is provided as an expression that is evaluated for each tuple.

The implementation is straightforward, using the Java 8 Stream.filter function.
public class Filter extends BaseOperation {

   protected Expression condition;

   public Filter(Operation source, String condition) {
      this.condition = StreamUtils.parseExpression(condition);
      addSourceOperation(source);
   }

   public Schema getSchema() {
      return getSourceOperation(0).getSchema();
   }

   public Stream<Tuple> getStream() {
      return getSourceOperation(0).getStream()
         .filter(tuple -> condition.eval(new TupleContext(tuple)));
   }

}
The constructor of the Filter operation takes a source stream and the filter condition as arguments. As mentioned before, I will come back to expressions when talking about a domain specific language for tuple streams. For now, note that the expression is parsed once and evaluated many times, once per tuple that is processed.

The getSchema function is very simple, as the operation works on a single source stream, and the schema of the result stream is the same as the schema of the source stream. The management of source streams is implemented at the level of the BaseOperation class.

The getStream function takes the stream of the source operation and applies a filter using a Predicate that evaluates the condition expression. The TupleContext serves as the evaluation context based on the fields of the source tuple. The evaluation context merely provides a binding of field names to field values for that particular tuple.

As a second example, consider the Select operation, used for selecting fields based on the fields of the source stream. The operation also supports the use of calculated fields, once again using expressions. The fields and calculated fields selected are provided as a schema.

Again, the implementation is straightforward, using the Java 8 Stream.map function.
public class Select extends BaseOperation {

   protected Schema schema;

   public Select(Operation source, Schema schema) {
      this.schema = schema;
      addSourceOperation(source);
   }

   public Schema getSchema() {
      return schema;
   }

   public Stream<Tuple> getStream() {
      return getSourceOperation(0).getStream()
         .map(tuple -> TupleUtils.select(schema, tuple));
   }

}
In this case, the schema of the result stream is the selection schema provided as a constructor argument. The stream of the operation is created based on the stream of the source operation, where the selected tuple fields are calculated out of the tuples of the source stream. The TupleUtils.select function evaluates the expressions in the schema argument based on the field values of the tuple argument.

As a somewhat more complex example, consider the Merge operation, used for merging a number of sorted source streams with identical schemas while preserving the sort order.

The implementation is greatly simplified using the Iterators.mergeSorted function from Google's Guava libraries. As a result, the implementation primarily consists of wrapping streams as iterators and vice versa.
public class Merge extends BaseOperation {

   protected Schema sortSchema;

   public Merge(Iterable<Operation> sources, Schema sortSchema) {
      this.sortSchema = sortSchema;
      for (Operation source : sources) {
         addSourceOperation(source);
      }
   }

   public Schema getSchema() {
      return getSourceOperation(0).getSchema();
   }

   public Stream<Tuple> getStream() {
      List<Iterator<Tuple>> iterators = getSourceOperationList()
         .stream()
         .map(operation -> operation.getStream().iterator())
         .collect(Collectors.toList());
      Iterator<Tuple> iterator = Iterators.mergeSorted(
         iterators,
         (a, b) -> TupleUtils.compare(sortSchema, a, b)
      );
      return StreamSupport.stream(
         Spliterators.spliteratorUnknownSize(
            iterator,
            Spliterator.ORDERED
         ),
         false
      );
   }

}
The getStream function collects an iterator for each of the source operation streams and uses Iterators.mergeSorted to merge them. The TupleUtils.compare function compares tuples based on a sort schema. The merged iterator is then wrapped as a Spliterator and converted back to a stream.
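The iterator-to-stream wrapping at the heart of Merge can be shown in isolation; here a plain list iterator stands in for Guava's merged iterator:

```java
import java.util.*;
import java.util.stream.*;

// Wrapping an arbitrary ordered Iterator back into a Stream, as Merge
// does with the result of Iterators.mergeSorted.
public class WrapDemo {
   public static void main(String[] args) {
      Iterator<Integer> iterator = Arrays.asList(1, 3, 5).iterator();
      Stream<Integer> stream = StreamSupport.stream(
         Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED),
         false
      );
      System.out.println(stream.collect(Collectors.toList()));
   }
}
```

Running main prints [1, 3, 5], the iterator's elements in their original order.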

As a last example, consider the Aggregate operation. It is used to aggregate tuples in a stream based on aggregation key and value schemas.
public class Aggregate extends BaseOperation {

   protected Schema aggregateSchema;
   protected Schema valueSchema;
   protected Schema schema;

   public Aggregate(Operation source, Schema aggregateSchema,
      Schema valueSchema) {
      this.aggregateSchema = aggregateSchema;
      this.valueSchema = valueSchema;
      addSourceOperation(source);
   }

   public Schema getSchema() {
      if (schema == null) {
         schema = SchemaUtils.concat(aggregateSchema, valueSchema);
      }
      return schema;
   }

   public Stream<Tuple> getStream() {
      Map<Tuple, Tuple> map = getSourceOperation(0).getStream()
         .collect(supplier(), accumulator(), combiner());
      return map.keySet().stream().map(
         key -> TupleUtils.concat(getSchema(), key, map.get(key))
      );
   }

}
The getSchema function simply returns the concatenation of the key and value schemas. The aggregation is implemented using a Collector based on a Map<Tuple, Tuple>, with the appropriate supplier, accumulator and combiner functions. The result stream is then created by streaming the key tuples in the map with the corresponding value tuples concatenated to them.
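The three-argument collect follows the standard mutable-reduction pattern of Java 8. A simplified, self-contained illustration, grouping strings by first letter and counting per group (the real supplier, accumulator and combiner build a Map<Tuple, Tuple> instead):

```java
import java.util.*;
import java.util.stream.*;

// The supplier/accumulator/combiner shape of Aggregate.getStream,
// illustrated on plain strings instead of tuples.
public class CollectDemo {
   public static void main(String[] args) {
      Map<Character, Integer> counts = Stream.of("ant", "bee", "ape")
         .collect(
            TreeMap::new,                                         // supplier
            (map, w) -> map.merge(w.charAt(0), 1, Integer::sum),  // accumulator
            (m1, m2) -> m2.forEach((k, v) -> m1.merge(k, v, Integer::sum)) // combiner
         );
      System.out.println(counts);
   }
}
```

Running main prints {a=2, b=1}: two words starting with a, one with b.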

Stream Generators


As mentioned before, a generator is an operation that takes no source operations. It can be used as the starting point of a sequence of operations.

As a first example of a generator, consider the Read operation, that defines a tuple stream based on the content of a CSV-file.
public class Read extends BaseOperation {

   protected String file;
   protected Schema schema;

   public Read(String file) {
      this.file = file;
   }

   public Schema getSchema() {
      if (schema == null) {
         try {
            schema = SchemaUtils.getSchema(
               Files.lines(Paths.get(file)).findFirst().get()
            );
         } catch (IOException e) {
            throw new UncheckedIOException(e);
         }
      }
      return schema;
   }

   public Stream<Tuple> getStream() {
      try {
         return Files.lines(Paths.get(file)).skip(1).map(
            line -> TupleUtils.getTuple(getSchema(), line)
         );
      } catch (IOException e) {
         throw new UncheckedIOException(e);
      }
   }

}
The schema of the Read operation is defined by the header row of the CSV-file. This line contains the labels of the columns, separated by the CSV delimiter. Optionally, a label may be followed by a type indication, as in
count:Integer
The stream for the Read operation is created using the file stream features present in Java 8. The first line is skipped, as it is the header line. Subsequent lines are converted to tuples using a utility function.

As a second example, consider the Generate operation, which uses a start schema to generate an initial tuple and a step schema to generate subsequent tuples out of the previous tuple.
public class Generate extends BaseOperation {

   protected Schema startSchema;
   protected Schema stepSchema;

   public Generate(Schema startSchema, Schema stepSchema) {
      this.startSchema = startSchema;
      this.stepSchema = stepSchema;
   }

   public Schema getSchema() {
      return startSchema;
   }

   public Stream<Tuple> getStream() {
      return StreamSupport.stream(
         Spliterators.spliteratorUnknownSize(
            new TupleIterator(startSchema, stepSchema),
            Spliterator.ORDERED
         ),
         false
      );
   }

}
The stream for the Generate operation is created using a TupleIterator, which will evaluate expressions in the start and step schemas to generate tuples. Once again, a Java 8 Spliterator is used to wrap this iterator as a stream.
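The seed-and-step idea behind Generate can be sketched with the built-in Stream.iterate, using a plain integer in place of a tuple: the seed plays the role of the start schema and the step function that of the step schema:

```java
import java.util.stream.*;

// Stream.iterate(seed, step) produces seed, step(seed), step(step(seed)), ...
// mirroring how Generate derives each tuple from the previous one.
public class GenerateDemo {
   public static void main(String[] args) {
      String first = Stream.iterate(1, n -> n + 1)
         .limit(5)
         .map(String::valueOf)
         .collect(Collectors.joining(","));
      System.out.println(first);
   }
}
```

Running main prints 1,2,3,4,5.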

Coding Convenience


You will have noticed the presence of some convenience classes such as StreamUtils, SchemaUtils and TupleUtils. The use of such convenience classes can be extended to simplify using operations too.

In particular, the creation of schemas can be made less cumbersome by providing a varargs SchemaUtils.create function. Using this function, schemas can be created fairly easily, as in
SchemaUtils.create(
   "radius",
   "circumference <- 2*radius*pi()",
   "surface <- sqr(radius)*pi()"
)
This schema could be used to select a radius field from a source stream and complete the result stream with the circumference and the surface area of a circle with that radius. Again we are using expressions, this time not only to specify calculated values, but also to specify their binding to schema fields.

Another convenience technique is to provide constructor shortcut functions on the BaseOperation class, as in
public class BaseOperation implements Operation {

   public Filter filter(String condition) {
      return new Filter(this, condition);
   }

}
This supports chaining of operations when defining complex tuple streams, as in
BaseOperation
   .read("country.csv")
   .filter("surface > 1000")
   .sort(SchemaUtils.create("name"))
This stream definition processes countries for which at least the name and the surface area are available in a CSV-file, keeping only the ones with a surface area above 1,000 and sorting the result by name.

The Sort operation uses a sort schema to sort a tuple stream using the natural sort order of tuples defined by the sort schema. I will leave the implementation as an exercise.

Towards a Domain Specific Language


The last example illustrates that tuple stream definitions can be formulated in an expressive way given the proper convenience classes and functions. We can simplify these definitions even more by the introduction of a domain specific language for tuple stream definitions. In fact, the expression syntax used so far already constitutes part of such a domain specific language.

Parsing these expressions is done using an operator precedence parser. Such a parser supports term notation and can be configured dynamically to add support for unary and binary operators. The parser creates an abstract syntax tree for parsed expressions. I hope to address the topic of the operator precedence parser in another post.

By adding the appropriate operator definitions, the expression syntax can be extended to:
  • Support chaining for the construction of operations using the . (dot) operator.
  • Provide a syntax for schema specifications using curly braces ({}) and the <- (arrow) operator.

More Operations


On top of the operations already mentioned in this post, the entire collection of stream functions provided by the Java 8 Streams library can be lifted to make them available as operations, including concat, distinct, limit, skip ..., each time leveraging the schema and expression features addressed in this post.
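As a reminder of what gets lifted, here are those stream functions applied directly; each would become an operation following the Filter pattern, passing the source schema through unchanged:

```java
import java.util.stream.*;

// The Java 8 stream functions to be lifted, applied directly.
public class LiftDemo {
   public static void main(String[] args) {
      String result = Stream.of("b", "a", "b", "c")
         .distinct()   // b, a, c
         .sorted()     // a, b, c
         .skip(1)      // b, c
         .limit(2)     // b, c
         .collect(Collectors.joining(","));
      System.out.println(result);
   }
}
```

Running main prints b,c.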

In the context of accessing data of many different source systems, one may imagine generators similar to the Read operation, such as a Sql operation to fetch data from a relational database, a Solr operation to fetch data from a Solr server, and even a Sax operation to stream the content of an XML-file as a sequence of SAX events.

Presenting aggregated results in a tabular layout can be supported by a Matrix operation that transforms a tuple stream with row labels, column labels and values into a matrix. The inverse operation Unwind (to give it a name) may convert data in a matrix layout to a tuple stream with row labels, column labels and values.

Conclusions


The encapsulation of Java 8 stream functions and utilities as operations on typed tuple streams offers a coherent basis for processing tabular data sets. Java 8 stream features can be lifted to the level of operations, and the approach adds a descriptive level for stream definitions. The focus of operations is mainly on the composition of streams, while Java 8 takes care of the actual processing and resource management.

The approach extends naturally into a straightforward domain specific language that is easy to explain and understand. The resulting formalism is light-weight, but it equals the computational power of SQL, Relational Algebra and Tuple Calculus. It provides a rich set of operations that offer excellent expressivity for data processing in a variety of contexts.