Saturday, December 24, 2022

Tiny Prime Combinations

Writing a prime number generator in lambda calculus, starting from first principles. Getting our hands dirty on syntax, term representation, term reduction, Church numbers and lists, context-free recursion and the Y-combinator, lazy evaluation and infinite data structures, and many more shenanigans. Going from zero to hero, and fitting it on a single A4.

Introduction


We first came up with this idea about three decades ago, but never really got around to it: the idea to write Eratosthenes' prime sieve in lambda calculus, building it from first principles.

Inspired by a prime number generating expression in Miranda (Haskell's predecessor, say), using list comprehension, something like:
[ p <- [1..] | #[ d <- [1..p] | p%d = 0 ] = 2 ]
And thereby convinced that it (the idea) could be done on a single A4.

Despite lambda stuff being ubiquitous on the Internet today, we persisted in taking the test, and came up with a late arrival. And yes, adorned with generous comments, it fits a single A4, as demonstrated at the end of this post under Prime Number Generation from First Principles.

By the way, we stopped pasting code snippets in the text (apart from lambda fragments, that is), but made the entire implementation available on GitLab Suskewiet.

And oh yeah, you're right. It's not Eratosthenes', but another variation ...

Prime Number Generation using List Comprehension


List comprehension is a technique where lists are formally described, typically using a fairly mathematical notation, very expressive (as in short), and with the flavour of executable specifications.

We'll explain it by illustration, using the Miranda prime number generator:
[ p <- [1..] | #[ d <- [1..p] | p%d = 0 ] = 2 ]
First of all, don't pick on the syntax, our Miranda knowledge is a little rusty.

The list is opened using a square bracket, where one can start enumerating elements, or alternatively, describe the content of the list. Over here, we start reading, and it says:

The list of all p, elements of the set of natural numbers, ...

The notation [1..] is a list enumerating its elements, starting from 1, but without an upper bound. This is indeed an infinite list, so in order to work with it, we need a computational context that supports lazy evaluation, such as lambda calculus.

With the vertical bar |, we continue reading ... for which ...

Following is a condition that constrains the elements that are part of the list. In this case, the condition is on the number of elements in a list described by a nested expression.

We continue reading from the hash symbol #, an operator giving the length of a list, ... the cardinality of the list of all d, elements of the range from 1 to p, ...

This time a range with an upper bound. Again with a limiting condition ... for which the remainder after dividing p by d is 0, ...

We refer, incorrectly, but equally clearly, to the modulo operator using the percent sign %, and to the equality operator using the equals sign =.

Consequently, the nested expression turns out to describe the list of divisors of p. We continue reading, one level up, back to the cardinality ... is 2.

List Comprehension Primitives


List comprehension can be implemented (transformed into executable code) using a limited number of primitives. Besides the generation of finite and infinite enumerated lists, for our prime number generator, we only need the filter function to select data, denoted by the vertical bar |.

Generically, one would also need the map function, to manipulate data, and the fold function, for aggregation, to provide maximal power and expressivity in list comprehension.

Nice to see how these concepts, originating a long time ago in pure functional programming, found their way into nearly every contemporary programming language. Second best, after world dominance by Haskell.

The Grand Design


In order to implement such a prime number generator from scratch, we will need to climb up the ladder of abstraction, addressing the following topics:
  • Select a representation of natural numbers, such as the Church encoding.
  • Define a number of arithmetic and relational operations on these numbers.
  • Select a representation for boolean values and define a number of logical operations.
  • Introduce the Y combinator for the implementation of recursive functions.
  • Select a representation of lists, with basic list operations.
  • Implement the list primitives required for list comprehension.
  • Represent the infinite list of natural numbers.
  • Use all of these to implement the prime number generator.

No worries. All of these building blocks can easily be gathered from various sources on the Internet.

Getting our Hands Dirty


However, to get to the bottom of things, we will also implement lambda calculus from scratch in Java, including a tokenizer and a parser for lambda expressions and term rewriting for actual computation. This will allow us to play with the expressions, and effectively test the prime number generation.

The full source code, including unit tests and all the lambda snippets presented in this post, is publicly accessible on GitLab Suskewiet.

In what follows, we will gradually climb up the aforementioned abstraction ladder, step by step, starting from first principles, to the effective calculation of prime numbers.

First Principles


The only data type in lambda calculus is a function. The only operation is function application. Function application is denoted by juxtaposition. It associates to the left. Brackets can be used to override default association.
a b c = (a b) c ≠ a (b c)
Besides function application, abstraction can be used to lift an expression to a function. Many sources use the lambda and dot characters to denote abstraction, as in λx.e, often shortened to x.e. In this post, we will use the somewhat more programming-oriented backslash operator to denote abstraction, as in (\x e).

All of these refer to a function with formal parameter x and function body e. Application of such a function on an argument a is done by substitution of (free) occurrences of x in e by a:
(\x e) a -> e[ x/a ]
Lots of notational conventions here. Abstraction runs to the end of an expression or to the first unmatched closing bracket, hence the brackets. The arrow denotes rewriting the term, also referred to as reduction. We use square brackets to define a substitution, in this case consisting of a single replacement, of variable x by expression a, and apply that substitution to expression e by putting it after the expression.

As an example, consider the application of the identity function to an expression e:
(\x x) a -> x[ x/a ] -> a
Computation in lambda calculus is done by rewriting a term until no more rewrite rules can be applied. The term is then said to be in normal form. Some terms will never reach normal form, if reduction does not terminate, as in
(\x x x) (\x x x) -> (\x x x) (\x x x) -> ...
Many sources provide more detailed and more correct explanations of lambda terms and reduction rules. You may want to start with the Wikipedia page on Lambda Calculus.

Church Numbers


There's a lot more to it, but essentially you can capture the natural numbers by defining the number 0 and a successor function. Church encoding, named after Alonzo Church, does it like this:
zero = \f \x x
succ = \n \f \x f (n f x)
Let's see what this means for the representation of the number 1:
succ zero
-> (\n \f \x f (n f x)) zero
-> \f \x f (zero f x)
-> \f \x f ((\f \x x) f x)
-> \f \x f ((\x x) x)
-> \f \x f x
And the number 2:
succ (succ zero)
-> (\n \f \x f (n f x)) (succ zero)
-> \f \x f (succ zero f x)
-> \f \x f ((\n \f \x f (n f x)) zero f x)
-> \f \x f ((\f \x f (zero f x)) f x)
-> \f \x f ((\x f (zero f x)) x)
-> \f \x f (f (zero f x))
-> \f \x f (f ((\f \x x) f x))
-> \f \x f (f ((\x x) x))
-> \f \x f (f x)
Apparently, the number represented by the term is reflected by the number of occurrences of f in the term.

The derivation illustrates that our reduction supports making named references to expressions that have been defined previously. We will come back to this when providing some Notes on the Implementation. In what follows we will extend identifier resolution to natural numbers, merely to simplify expressions and increase readability, as in
2 -> \f \x f (f x)
We will also need the predecessor function, which is somewhat obscurely defined as:
pred = \n \f \x n (\g \h h (g f)) (\u x) (\u u)
Again, lots of material available, such as on the Wikipedia page on Church Encoding. There, you will also find the derivation of the pred function.

Arithmetic


Undoubtedly no coincidence, but the Church encoding lends itself very well to defining arithmetic in a simple and expressive way. Consider the following definition for add, a function taking two number parameters that returns the sum of those numbers:
add = \m \n n succ m
In order to understand how it works, note that the implementation starts by applying the number n to the successor function. Let's apply the number 2 to the successor function and see what happens:
2 succ
-> (\f \x f (f x)) succ
-> \x succ (succ x)
-> ...
We interrupt the reduction at the point where it becomes apparent that the number 2, applied to the successor function, reduces to a function that applies the successor function twice. This is a general property, where the representation of a number n, when used as a function, represents the n-fold application, through composition, of its argument.

Consequently, adding 1 and 2 ...
add 1 2
-> (\m \n n succ m) 1 2
-> (\n n succ 1) 2
-> 2 succ 1
-> ...
... reduces to applying the successor function to 1 twice, reducing further to the representation of 3.

In fact, add is a function of one parameter, that returns a function that in turn will consume another parameter. This is a principle referred to as currying, named after Haskell Curry, which applies across the board in the context of lambda calculus, and is very common in functional programming in general.

Completely analogously, we can define arithmetic subtraction as
sub = \m \n n pred m
Other operations can be defined with similar elegance, but with the exception of the modulo operator, we don't need them for our purpose. The modulo operator itself requires some more preparation and will be addressed after introducing the Y-combinator.
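The factorial examples used to illustrate recursion further down also make use of mul, multiplication, which never made it into the final listing. For completeness, one standard definition in the Church encoding would be:
mul = \m \n \f m (n f)
It composes the n-fold application inside the m-fold application, yielding m times n applications of f in total.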

Booleans and Logical Operations


As for numbers, working with boolean values requires a representation of the values and some basic operations on those values. We will start with the values:
true = \a \b a
false = \a \b b
Here we see that true consumes two arguments and returns the first, while false consumes two arguments and returns the second. We assume this is clear by argumentation, but try it out if not.

As for numbers, this does not mean anything much by itself, but gets meaning in combination with the operations. The boolean operations we need are if for branching and not for logical negation:
if = \p \a \b p a b
not = \p \a \b p b a
and = \p \q p q p
The if function consumes three arguments, the condition, the then-part and the else-part. It applies the condition, assumed to reduce to a boolean value, to these parts. Knowing the effect of using boolean values as a function, you can guess what comes out ...
if true a b
-> (\p \a \b p a b) true a b
-> (\a \b true a b) a b
-> (\b true a b) b
-> true a b
-> (\a \b a) a b
-> (\b a) b
-> a
The not function is the same, but swaps the last two arguments, implicitly negating the condition.

The and function consumes two arguments, and uses the first one as a boolean applied to both arguments. If the first argument reduces to true, it returns the second argument. Otherwise, it returns the first argument. Exactly as one would expect from a logical and function.
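As a quick sanity check, one possible reduction of a small example:
and true false
-> (\p \q p q p) true false
-> (\q true q true) false
-> true false true
-> (\a \b a) false true
-> (\b false) true
-> false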

Relational Operations


Next we need relational operations, on numbers, that is.
iszero = \n n (\x false) true
leq = \m \n iszero (sub m n)
eq = \m \n and (leq m n) (leq n m)
lt = \m \n not (leq n m)
To understand the iszero function, first note that (\x false) is the constant false function. Applied to any argument, this function will return false.

The iszero function applies its argument, assumed to represent a number, to the constant false function. We learnt earlier that this will repeatedly apply the constant false function, exactly n times, in this case to the value true. After a single application, the result will be false, and remain false. Only for the number 0, without any application of the constant false function, the result will be true.
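To make this concrete, the reduction for the number 1 goes as follows:
iszero 1
-> (\n n (\x false) true) 1
-> 1 (\x false) true
-> (\f \x f x) (\x false) true
-> (\x (\x false) x) true
-> (\x false) true
-> false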

The remaining functions are less-than-or-equal, equal and less-than, respectively. Their definition boils down to plain functional programming. We will assume them to be self-explaining.

With the exception of the leq function maybe, where we note that the pred function maps 0 to 0, so that repeated application bottoms out at 0:
pred 0
-> (\n \f \x ((n \g \h h (g f)) \u x) \u u) 0
-> \f \x ((0 \g \h h (g f)) \u x) \u u
-> \f \x (((\f \x x) \g \h h (g f)) \u x) \u u
-> \f \x ((\x x) \u x) \u u
-> \f \x (\u x) \u u
-> \f \x x

The Issue with Recursion


The next thing we need is recursion. For all kinds of list utilities, and for some additional arithmetic.

Well, what is the issue here ? We can perfectly define the factorial function for example, using the building blocks we have so far:
fac = \n if (leq n 1) 1 (mul n (fac (pred n)))
No, we can't. Although the definition seems to be just like anything we've done before, in fact, it isn't. The difference is in the self-reference.

Up to now, we've named expressions merely for convenience, readability. Lambda calculus in itself does not provide a naming mechanism. We've let it sneak in silently, and it went unnoticed. That is because, up to now, we could easily eliminate the names by substituting their definitions for them wherever they occur. With the definition of the factorial function, due to the self-reference, this is no longer the case.

Without self-reference, all naming could be resolved statically, using a kind of let operator:
let x = a in e
Our definition of the factorial function would require dynamic substitution, during evaluation. Binding and resolving names during evaluation would jeopardize referential transparency, one of the key concepts in functional programming. We refer to the Notes on the Implementation for some practical considerations on naming and name resolution.

We can define an alternative factorial function facr, with an extra parameter to provide the function for the recursive invocation, abstracting it for the name of the recursive function:
facr = \r \n if (leq n 1) 1 (mul n (r r (pred n)))
This equally calculates the factorial, without self-reference, invoked as in
facr facr 3
However, having to provide the function as an argument twice is inconvenient.

The Y-combinator will come to the rescue.

The Y-Combinator


The Y-combinator is a lambda expression that helps in defining context-free recursive functions. It, or at least one variant, is defined as
Y = \f (\x f (x x)) (\x f (x x))
Applying it to a function, and rewriting it, we get
Y f
-> (\f (\x f (x x)) (\x f (x x))) f
-> (\x f (x x)) (\x f (x x))
-> f ((\x f (x x)) (\x f (x x)))
-> ...
The reduction sequence will never terminate. Indeed, this expression does not have a normal form.

However, somewhat out of nowhere, let's look at the following reduction:
f (Y f)
-> f ((\f (\x f (x x)) (\x f (x x))) f)
-> f ((\x f (x x)) (\x f (x x)))
-> ...
Again, no termination. But we see the same term arising as in the previous reduction sequence.

In fact, we just demonstrated that (Y f) is a fixed-point of f, for any f:
f (Y f) ~> Y f
Note the difference in notation for this reduction with the ~> arrow. Although these terms are semantically equal, we did use some rewrite rules right-to-left to prove it.

We can use this property to eliminate self-reference for recursion. Consider the original, single-argument definition of the factorial function, abstracted over the function name
facr = \r \n if (leq n 1) 1 (mul n (r (pred n)))
and then defining fac as
fac = Y facr
Now consider the reduction sequence for the factorial of 3:
fac 3
-> Y facr 3
~> facr (Y facr) 3
-> (\r \n if (leq n 1) 1 (mul n (r (pred n)))) (Y facr) 3
-> (\n if (leq n 1) 1 (mul n (Y facr (pred n)))) 3
-> if (leq 3 1) 1 (mul 3 (Y facr (pred 3)))
-> mul 3 (Y facr 2)
~> mul 3 (fac 2)
Note again the use of the alternative arrow ~> for upstream reductions. Once to reduce the Y-combinator in an appropriate way, and once to show that the recursion effectively works as expected: the factorial of 3 is 3 times the factorial of 2. Once more, we refer to the Notes on the Implementation, this time for some practical considerations on the reduction of the Y-combinator.

In summary, we define a self-referencing version of the factorial function, abstract it for recursion and apply the Y-combinator:
fac = Y \r \n if (leq n 1) 1 (mul n (r (pred n)))
Repeat, this time for our ultimate objective of generating prime numbers, where we will need the modulo operator.

We define a self-referencing version of the mod function:
mod = \n \d if (lt n d) n (mod (sub n d) d)
We abstract that function for recursion:
modr = \r \n \d if (lt n d) n (r (sub n d) d)
And we apply the Y-combinator to eliminate the extra function argument:
mod = Y modr = Y \r \n \d if (lt n d) n (r (sub n d) d)
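As an illustration, compressing many reduction steps per line, using the ~> arrow introduced above and keeping the name mod in place of Y modr for readability, computing 7 modulo 3:
mod 7 3
~> if (lt 7 3) 7 (mod (sub 7 3) 3)
~> mod 4 3
~> mod 1 3
~> 1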

Lists and Basic List Operations


For the last time, we will be introducing some constructs for a data structure and related operations, this time for list representation and basic operations:
nil = false
cons = \x \y \z z x y
head = \p p true
tail = \p p false
isnil = \l l (\h \t \d false) true
Note that cons, head and tail are usually introduced as pair, first and second, respectively. Indeed, a list is represented as a pair, the first element being the head of the list and the second element being the tail of the list, with nil as a representation of the empty list.

We will quickly verify the cooperation of these elements, starting with a list with head h and tail t:
cons h t
-> (\x \y \z z x y) h t
-> (\y \z z h y) t
-> \z z h t
Checking the head function:
head \z z a b
-> (\p p true) \z z a b
-> (\z z a b) true
-> true a b
-> a
The trick is to bring true into position to be applied to the two arguments previously captured by cons, as it will return the first argument. The tail function works in exactly the same way, but will apply false to the arguments. We leave its verification as an exercise.
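For the impatient, it runs completely parallel to the head case:
tail \z z a b
-> (\p p false) \z z a b
-> (\z z a b) false
-> false a b
-> b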

As for isnil, looking at the definition, and knowing that the empty list nil is represented as false, it will obviously reduce to true. Let's consider the case where the argument list is not empty:
isnil \z z a b
-> (\l l (\h \t \d false) true) \z z a b
-> (\z z a b) (\h \t \d false) true
-> (\h \t \d false) a b true
-> (\t \d false) b true
-> (\d false) true
-> false

Some More List Operations


Now that the last data type has been introduced, the cryptic part is over. From here on, it becomes plain functional programming ...

We introduce the len function to obtain the number of elements in a list, the get function to get an element by position, and the take function to obtain a list with the first so many elements of a list.
len = Y \r \l if (isnil l) 0 (succ (r (tail l)))
get = Y \r \l \n if (leq n 0) (head l) (r (tail l) (pred n))
take = Y \r \l \n if (leq n 0) nil (cons (head l) (r (tail l) (pred n)))
Note the use of the Y-combinator, together with the extra abstraction for the function argument.

It should be fairly easy to understand the functions by reading the definition. Verification is more of the same. Check it out on GitLab Suskewiet if you feel like it.
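Just to set expectations, a few results these definitions should reduce to, using the ~> arrow and keeping names unresolved for readability:
len (cons a (cons b nil)) ~> 2
get (cons a (cons b nil)) 1 ~> b
take (cons a (cons b (cons c nil))) 2 ~> cons a (cons b nil)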

List Comprehension


Finally, we've come to the subject of list comprehension. Not too much remaining though. We only need the filter function:
filter = Y \r \p \l if (isnil l) nil (if (p (head l)) (cons (head l) (r p (tail l))) (r p (tail l)))
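For example, keeping only the zeroes in a small list, with the same notational liberties as before:
filter iszero (cons 0 (cons 1 (cons 0 nil))) ~> cons 0 (cons 0 nil)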

The Natural Numbers


The final building block we need to define the prime number generator is the set of natural numbers. Let's start by defining the numbers function, generating a list of consecutive natural numbers from a starting value upward:
numbers = \n cons n (numbers (succ n))
Using the Y-combinator and an extra abstraction to eliminate the self-reference has become routine by now:
numbers = Y \r \n cons n (r (succ n))
Then we define the set of natural numbers, excluding 0, as
numbers1 = numbers 1
This is an infinite set, but thanks to lazy evaluation lambda calculus can work with infinite structures. We cannot request the entire structure, but we can, for example, ask for a finite part of the structure:
get numbers1 6 ~> 7
Finally, we define the range function to take the sublist of the so many first natural numbers, again excluding 0:
range = \n take numbers1 n
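For instance:
range 3 ~> cons 1 (cons 2 (cons 3 nil))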

Prime Number Generation


We are now ready to define the set of prime numbers, according to the same principle as the Miranda snippet in the introduction, but in strict lambda notation:
divisor = \n \d iszero (mod n d)
prime = \n eq (len (filter (divisor n) (range n))) 2
primes = filter prime numbers1
The divisor function is a predicate that checks whether its second argument is a divisor of its first argument.

The prime function is a predicate that checks whether its argument is a prime number. It does so by counting the number of elements in the set of divisors of the argument, which should be 2.
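For example, again with the same notational liberties:
filter (divisor 6) (range 6) ~> cons 1 (cons 2 (cons 3 (cons 6 nil)))
prime 6 ~> false
prime 7 ~> true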

The set of primes can then simply be defined by applying prime as a filter to the set of the natural numbers. We can, for example, take the fourth (0-based) prime number:
get primes 4 ~> 11
It will return 11, though it will take over ten million reductions to do so. Real life functional programming languages come with a primitive type for numbers.

Notes on the Implementation


As stated under Getting our Hands Dirty, the full source code related to this post, in Java, is available on GitLab Suskewiet.

In what follows, we will elaborate on a number of aspects of the implementation that might justify somewhat more explanation.

Syntax

Besides the representation of a lambda term, as an abstract syntax tree (AST), and an implementation of the rewrite rules, the repository provides a tokenizer and an operator precedence parser (OPP) for lambda calculus and combinatory logic. It also provides the conversion of lambda expressions to combinatory logic and vice versa, but those are never used here.

For this post, abstraction is denoted using the backslash operator as in (\x y). The implementation equally supports using the dot operator, as in x.y.

Application associates to the left, as explained. The body of an abstraction extends as far to the right as possible, so abstraction takes precedence over application. That is, for a correct reading:
a \b c d = a (\b c d) ≠ (a \b c) d
Precedence may be different for other dialects.

Context

According to principles of functional programming, and in support of referential transparency, naming and name resolution should be static. The implementation does however support dynamic naming and resolution. Consequently, the self-referencing definition of the factorial function, for example, does work after all:
fac = \n if (leq n 1) 1 (mul n (fac (pred n)))
fac 3 ~> 6
However, the realization of the prime number generator does not need nor exploit this feature. That's where the Y-combinator kicks in, precisely to make the point.

Reduction

The implementation of term rewriting features outermost reduction, also known as lazy evaluation, as opposed to innermost reduction, also known as eager evaluation. This is required for working with infinite data structures, for example. Outermost reduction is implemented as follows:
  • For an identifier:
    • If the identifier is a number, resolve it to the Church representation of that number.
    • Try resolving the identifier in the context.
  • For an application:
    • If the function of the application is an abstraction, substitute the variable of that abstraction for the argument of the application in the expression of that abstraction.
    • Try reducing the function of the application.
    • Try reducing the argument of the application.
  • For an abstraction:
    • Try reducing the expression of the abstraction.
The first rule that succeeds wins. Trying succeeds if it makes a difference.

Now remember that the Y-combinator does not have a normal form. Reducing the Y-combinator according to the recipe for outermost reduction above will never terminate, but rather exhaust your computer's resources.

In the examples we provided, the Y-combinator was reduced in a very particular way:
Y f ~> f (Y f)
Upstream, so to speak. This then creates the opportunity for the first occurrence of f to be addressed with priority in subsequent reductions.

Implementing this behaviour generically in the reduction rules is ugly. They don't tell you this in elementary school, but the nicest way to achieve this is to create the Y-combinator as a built-in combinator with its own reduction logic.

This then boils down to adding the following reduction rule:
  • For an application:
    • If the function of the application is the Y-combinator, apply the argument of the application to the term that is being reduced.
Add it to the end, such that all other reduction rules get priority over this one. That way, the Y-combinator will reduce only after all other reduction opportunities have been exhausted, and it will reduce only once before handing control back to other reduction rules.

Obviously, you will have to remove the regular definition of the Y-combinator from the context, in order for this to work.

Testing

By now you will have understood that the naming of bound variables does not make a difference. This means for example that the identity function can be written in different ways:
\x x = \y y
Generally speaking, testing for equality of terms should make abstraction of the naming of bound variables, so long as these names are used consistently throughout the terms. This is particularly relevant for writing unit tests on the reduction of terms, as term rewriting tends to introduce new names for bound variables.

In the implementation this is done by renaming bound variables in subterms before recursively checking for equality of subterms.

As a consequence, the Java Object contract requires the hashcode of a term to make the same abstraction, as it should be consistent with the implementation of equals. This has not been done yet. We will try addressing it in due course.

Prime Number Generation from First Principles


Here, we bring it all together. Generating prime numbers with no prior instrumentation but the basic principles of lambda calculus: function application and abstraction.
# church numbers
zero = \f \x x
succ = \n \f \x f (n f x)
pred = \n \f \x n (\g \h h (g f)) (\u x) (\u u)

# arithmetic
add = \m \n n succ m
sub = \m \n n pred m

# booleans
true = \a \b a
false = \a \b b

# logical operations
if = \p \a \b p a b
not = \p \a \b p b a
and = \p \q p q p

# relational
iszero = \n n (\x false) true
leq = \m \n iszero (sub m n)
eq = \m \n and (leq m n) (leq n m)
lt = \m \n not (leq n m)

# the Y combinator
Y = \f (\x f (x x)) (\x f (x x))

# arithmetic continued
mod = Y \r \n \d if (lt n d) n (r (sub n d) d)

# lists
nil = false
cons = \x \y \z z x y
head = \p p true
tail = \p p false
isnil = \l l (\h \t \d false) true

# list utilities
len = Y \r \l if (isnil l) 0 (succ (r (tail l)))
get = Y \r \l \n if (leq n 0) (head l) (r (tail l) (pred n))
take = Y \r \l \n if (leq n 0) nil (cons (head l) (r (tail l) (pred n)))

# list comprehension
filter = Y \r \p \l if (isnil l) nil (if (p (head l)) (cons (head l) (r p (tail l))) (r p (tail l)))

# natural numbers
numbers = Y \r \n cons n (r (succ n))
numbers1 = numbers 1
range = \n take numbers1 n

# prime number generation
divisor = \n \d iszero (mod n d)
prime = \n eq (len (filter (divisor n) (range n))) 2
primes = filter prime numbers1

Sunday, August 8, 2021

The Magic of Growing Trees

Determining the estimated time of arrival (ETA) in parcel delivery using random forest decision trees. Using a do-it-yourself approach and digging into the unexpected complexity of the seemingly straightforward task of building decision trees.

Introduction


This time, the self-imposed challenge is directly related to our professional activity: the application of machine learning to predict the duration of a process. We are keeping up appearances by addressing a similar problem in another domain, parcel delivery, something we all can relate to in these days of lockdown and quarantine.

In what follows we will report on an experiment in machine learning, estimating the time of arrival of a parcel, including generating the data, training the model and evaluating the results.

The choice for decision trees was inspired by the promise of the model in classical machine learning, and reluctance to go neural once more. The results are encouraging, but also bring to light the hidden complexity in what seems to be a simple, straightforward task.

Estimated Time of Arrival


Nowadays, parcel delivery is a domain that almost everyone can relate to. The data involved is easy to understand. For our exercise at hand, we will use the following parameters:

Parameter    Type       Description
Origin       City       The city where the parcel delivery is initiated.
Destination  City       The city where the parcel is to be delivered.
Departure    Timestamp  The moment on which the delivery of the parcel starts.
Carrier      Carrier    The carrier that will take care of the delivery process.
Express      Boolean    A flag indicating whether this is an express delivery.

Most of the parameters are of a categorical nature, with a value set that is a fairly small collection of labels: cities, carriers ... Express, as a boolean, will be considered of a categorical nature as well. Departure is a numerical parameter, that will be handled in a significantly different way.

As we are not close with any of the major parcel delivery companies around, we will have to fabricate data ourselves. The data generation process is explained at the end of the post, under Data Generation. For now, we will merely list the possible values for each parameter type:
  • City: Cities used for the exercise are Brussels, Paris, Berlin, Amsterdam and London.
  • Carrier: Parcels will be handled and delivered by BPost, PostNL and DHL.
  • Timestamp: Data covers parcel deliveries initiated over the course of a single week.
  • Boolean: As said, boolean will be considered a categorical type, with values Yes and No.

For our exercise, these parameters constitute the collection of independent variables. Note that all of these parameters are known before the delivery process starts, which is important if we want our predictions to be useful at all.

What we will be predicting is the time of arrival:

Parameter  Type       Description
Arrival    Timestamp  The moment on which the parcel will be delivered at its destination.

The arrival timestamp is the dependent or target variable, the one that will be predicted in function of the independent variables. It is commonly referred to as the estimated time of arrival, ETA in short.

Data Preparation


As for all machine learning exercises, we need to consider preparing our data in function of the machine learning algorithm that will be used, with the objective to make learning easier. In the context of the parcel delivery data, and with decision trees in mind, we will transform the data in a couple of ways.

First of all, we will encode variables of type timestamp into two different variables, one specifying the day of the week (DOW), and one specifying the time of day (TOD). For the Departure variable, we will introduce the variables Departure[DOW] (of the categorical type DayOfWeek) and Departure[TOD] (of the numerical type Time).

This is because we expect the parcel delivery process to behave differently on weekdays versus weekend days, for example. Learning the concept of day of the week based on timestamps alone will add significantly to the complexity of the problem. The general idea here is to put any knowledge we have about the problem domain into the way the data is presented, in order to maximize the performance of the machine learning algorithm.

Secondly, we will decrease the granularity of variables of type Time by truncating to the hour. At first, this seemed to simplify the problem. We realized however that this obscures information and may make the learning problem more difficult. At best, it will decrease the degree of determinism in the data, which is something we are actually looking for.

Anyway, we will be working with the variable Departure[TOD][H], holding the hour of departure, as an integer value ranging from 0 to 23.

Finally, to avoid the complexity of the representation of date and time information from the dependent variable, and to simplify cluster representation, we replace the variable Arrival by Duration[H], being the difference between Arrival and Departure, holding the number of hours it takes to deliver the parcel, equally truncating to the hour.

Visual Inspection


Before starting machine learning exercises on the data, we will get to know the data better. This is another best practice in machine learning.

Let's generate a data set of 10K data elements to start with, and have a look at the distribution of the hour of departure, variable Departure[TOD][H], from Berlin and London:
The distribution histogram shows the number of parcels shipped from Berlin and London, per hour, over a 24-hour timeframe, with time expressed in UTC. Interesting observations are:
  • There is no shipping activity during the night: departures start from 06h00 and stop around 20h00 local time.
  • Shipping in London lags behind one hour compared to Berlin, due to the time zone difference.
  • The shipment volume is high in the morning and lower in the afternoon.
  • Shipment distribution has a similar pattern in both cities, with somewhat more activity in Berlin.

Next, let's look at the distribution of the hour of arrival, variable Arrival[TOD][H], in Berlin and London:
Here we can make a number of interesting observations as well:
  • Again, there is no activity during the night: deliveries start at 06h00 and stop around 18h00 local time.
  • Again, there is a one hour lag in London compared to Berlin, due to the time zone difference.
  • There is a remarkable peak volume during the first hour of activity, local time.
  • The distribution patterns for arrival time for both cities are significantly different.

Finally, let's look at the distribution of the duration of the delivery process, in hours, variable Duration[H], for shipments from Berlin to London and Paris:
The interesting observations are:
  • The duration of the delivery process spans a wide range from 1 hour to 94 hours, where the pattern of 4 24-hour cycles is clearly visible.
  • There is again a shift, this time between London and Paris, but far more than the time zone difference of 1 hour.
  • There is a significant concentration on the 1 hour duration; otherwise, the delivery process takes at least 10 hours to complete.

There's a lot more to it, in terms of patterns in the data. We refer to the end of this post, where more detail is provided in the section on Data Generation.

By the way, the charts have been created using Gnuplot, a bit old school, but still an excellent charting tool. It allowed us to fully automate the process of chart generation, as you will be able to observe in what follows.

Decision Trees


Decision trees can be used for classification or regression. We will use decision trees as a model to classify data, a clustering method, that is. There are plenty of sources on the Internet that explain decision trees in detail, including the algorithms to construct them. You may start with the Decision Tree Learning page on Wikipedia. We will not go there but merely provide a sneak preview on what is to come, using a minimalistic example of a decision tree that we will be generating.

For the first example we start from the 10K data set introduced earlier, and show a decision tree restricted to depth 2:
Non-leaf nodes represent a classification decision and are in orange. Leaf nodes represent a cluster and are in green. Note that non-leaf nodes represent a cluster as well, ignoring the more detailed classification realized by their child nodes.

Let's walk through the characteristics of the nodes featuring in this example, starting with the root node:
  • The root node represents the entire data set, 10K data elements, with durations ranging from 1 hour to 99 hours, and a duration of 28 hours as a cluster representative.
  • The decision criterion is using the day of the week of the departure of the parcel, variable Departure[DOW], with values Thursday, Friday and Saturday for selecting the left side, and all other values for selecting the right side.
  • Threshold is the dividing point in the value set of the target variable that realizes the optimal split of the data elements.
  • For the cluster representative, the accuracy realized with this node is 1.65%, with an average error of 26.25 hours.

We will elaborate on the definition and the meaning of these parameters in more detail in the sections on Growing Trees and Measuring Performance.

The objective of splitting the data set based on a criterion on one of the independent variables is to reduce the overall error of the decision tree. Looking at the left node:
  • With 4488 data elements, the left node represents somewhat below half of the data set, with durations still ranging from 1 hour to 99 hours, and a duration of 61 hours as a cluster representative.
  • The accuracy of classification is 0.96% and the average error is 29.99 hours.

Looking at the right node:
  • The right node represents the other half of the data elements, 5512 to be precise, with durations ranging from 1 hour to 57 hours, and a duration of 25 hours as a cluster representative.
  • The accuracy of classification is 5.42% and the average error is 12.96 hours.

Note that, although the performance of the left node is inferior to that of the root node, the right node largely compensates for this, and overall the classification performance is improved by the split. Let's verify that by calculating the weighted average error of the leaf nodes: $$\frac{4488*29.99+5512*12.96}{4488+5512} = 20.60 < 26.25$$

By the way, the tree representation has been created using the dot language, provided by the Graphviz project. It is an excellent tool for graph visualization that we have been using in earlier posts as well, again supporting full automation of the production of visuals.

The Benchmark


An absolute benchmark for the performance of any algorithm can be found in the classification matrix of the data.

The classification matrix is a multi-dimensional matrix with one dimension for each independent variable, considering the possible values in the most fine-grained way. The cells in the matrix hold the values for the target variable encountered for the cluster in the training set, where the median value is taken as the representation of the cluster. The prediction of the value for the target variable for a new data element is then defined as the representation of the matching cell for that data element.

In our case, the matrix dimensions, with their respective number of values, are:

Parameter          Cardinality
Origin             5
Destination        5
Departure[DOW]     7
Departure[TOD][H]  24
Carrier            3
Express            2
Product            25,200

With the cardinalities given in this table, the matrix counts 25,200 cells defining just as many clusters.

When creating such a matrix for a range of training sets, covering sizes from 2K to 100K data elements, and visualizing the average error on an equally sized test set, we get an understanding of the performance of the matrix method:
In green, you see the average error on the training set. In violet, you see the average error on the test set, averaged out over 5 different test runs, each run using a different test set.

The performance of the matrix method increases with the size of the training set, as can be expected. When the size of the training set is around four times the size of the matrix, we get an average error as low as 1.25 hours.

As this method uses all the information that is provided in the training set, it will provide the optimal performance, the best possible prediction that can be made using the training set. The remaining error is then merely a reflection of the non-determinism in the training data: identical values for all independent variables that yield different values for the target variable. No algorithm can systematically outperform the matrix method.

An average error of 1.25 hours basically characterizes the non-determinism of the data, and not so much the accuracy of the algorithm.

The attentive reader may have noticed that the average error on the training set is increasing rather than decreasing. This is because when the number of data elements in the training set is much lower than the number of elements in the matrix, duplicate values for all independent variables will be rare, and the non-determinism in the data will be minimal. However, as the size of the training set grows, the non-determinism will become apparent accordingly, and the average error on the training data will converge to reflect exactly that.

For the same reason, the average error on the training set will initially increase, come to a maximum and only then decrease and converge as the size of the data set grows.

Now, why bother at all to find alternative algorithms when the classification matrix cannot be outperformed ?

The answer is straightforward: Even for this simple problem domain, with a limited number of values for categorical variables and strong discretisation of numerical variables, the size of the matrix is substantial. One can imagine the growth of the matrix when more dimensions or more values on a number of dimensions are required, exploding quickly to a level that cannot be practically supported anymore.

Decision trees offer an alternative that is similar in the sense that it will find clusters with maximum homogeneity, but starting selectively with the independent variable with the highest return, narrowing down the cluster sizes by repeating this process, and consequently work with more sparse, coarse-grained clusters. This way, decision trees have the potential of realizing a performance that, if not equally good, comes close to the optimum.

Growing Trees


The construction of a decision tree is, in principle, fairly simple. We will keep talking about performance, and sometimes about error, in an abstract way until clarifying performance metrics in the section on Measuring Performance.

One starts with a root node that represents the entire training set. Then, a leaf-node is transformed into a non-leaf node, giving it a left- and right-hand child node, recursively, until a certain stop condition is reached. This condition typically relates to the performance on the data represented by the node: when the performance is satisfactory, no split operation is performed on the node and it remains a leaf node.

The difficult part is finding the optimal criterion to split a node: which independent variable is going to be used, and what is an appropriate value for the split condition ? A number of aspects are of particular importance here:
  • Is the split sufficiently balanced in terms of the data volume represented by each side ?
  • Does the split guarantee sufficient homogeneity on both sides in terms of the target variable ?
  • In other words, is the split optimal in terms of performance improvement ?

Bluntly stated, the solution is to consider all possible ways to split the data set represented by a node based on the target variable, with maximum homogeneity, and for each of those splits, to consider all possible independent variables, and for each variable to consider all possible split conditions, and then selecting the split condition with the best performance improvement.

As illustrated in the section on Decision Trees, the performance of a split is defined as the weighted average of the performance of both partitions of the split. Consider the example presented in that section, trying all possible splits based on the target variable:
Clearly, the minimum error is realized with a split of the target variable on 32 hours, which we will refer to as the threshold. As said, finding this point requires, for each potential threshold, a scan of all independent variables and all possible split conditions on each individual variable, in search for the lowest error.

In this case, the optimal split is on the day of the week of departure, variable Departure[DOW], which is categorical. Trying all possible split conditions effectively means trying all possible combinations of days, also known as the power set of the set of days in a week.

With 7 days in a week, the power set contains 2^7 = 128 possible combinations. Given the left-right symmetry of the decision tree, only 64 combinations have to be effectively considered. Minus the trivial combination of no and all days of the week, that leaves 63 combinations to go. Where the lowest error is to be found for the combination { Thu, Fri, Sat }.

The same logic can be applied for a numerical variable, where different split values for that variable have to be considered. Here, the number of possibilities is not combinatorial, but the scan requires discretisation of the numeric variable.

First Results


Time to apply all this logic to our data set and have a look at the resulting decision tree. Generating a decision tree on the same 10K training set, limiting the tree depth to 4 levels, we get:
After a first split on the day of the week of departure, the left node is split again on the same variable, this time with threshold 52 hours selecting on Thursday only. Again, this can be seen in the error chart for all possible thresholds:
Other nodes are split based on other variables, using the same procedure to decide on the split condition. Besides green for leaf nodes and orange for non-leaf nodes on a categorical variable, non-leaf nodes on numerical variables are shown in violet. With this decision tree, the average error is around 12.50 hours.

Removing the restriction on the depth of the tree will generate a much larger tree. We define the stop conditions for further refinement to be an error below 1 hour, or a cluster size of 10, whichever is reached first. In addition to these stop conditions, refinement of the decision tree will also stop if the remaining data does not offer room for improvement anymore. That is, if the error of the split configuration is higher than the error of the original node, or if the discriminating capacity of the independent variables would result in a trivial split.

The resulting decision tree is a beautiful monster of around 850 nodes, giving an average error rate of around 2.32 hours, one hour off of our benchmark. Look at this unreadable rendering to get an idea of the tree structure:
We provide an arbitrary fragment of the decision tree as a readable image to give some insight on the details of the decision process:
You can see all of the different stop conditions in action here.

One may wonder if such a huge decision tree isn't suffering from overfitting, capturing the specific patterns in the training data rather than the generic patterns of the problem domain. We will elaborate on overfitting in the section Return to Hyperspace.

In order to render the monstrous decision tree, we had to switch the dot conversion in Graphviz to scalable vector graphics (SVG), because of limitations on the supported size for bitmaps. We then used an online converter to render the SVG as a bitmap.

Measuring Performance


There are different ways to quantify the performance of a trained model. In the case of decision trees, already when the tree is being created, we have to compare the performance of different ways to split the data set at hand.

The accuracy reported in the previous sections was defined as the fraction of correct estimates, expressed as a percentage: $$\frac{\#\{i\in\{1,...N\}|d_e(i)=d_a(i)\}}{N}.$$ Here, \(N\) is the number of data elements in the data set, \(d_e(i)\) is the expected duration for data element \(i\), the representative of the cluster obtained using the decision tree, and \(d_a(i)\) is the actual duration for data element \(i\), the one that is part of the data set, the label, so to speak.

As we have been working with the discretisation of a numerical, continuous variable, this definition of accuracy is a very pessimistic measure of performance. This is the reason why the accuracy of even the leaf nodes in the decision trees we have encountered so far is poor.

Rather than using accuracy as a metric, we have been using the root-mean-square error (RMSE) to evaluate the performance of a decision tree. It is defined as $$\sqrt{\frac{1}{N}\sum\limits_{i=1}^{N}(d_e(i)-d_a(i))^2},$$ with the same meaning for the symbols used.

This error measurement will punish estimations that are far off, squaring the difference with the actual.

However, when reporting on the performance of a decision tree, we have been using the regular mean (absolute) error, defined as $$\frac{1}{N}\sum\limits_{i=1}^{N}|d_e(i)-d_a(i)|,$$ still with the same meaning for the symbols used.

This error measurement is easier to understand, as it provides the average difference between the estimate and the actual. It aligns better with the intuition that we have for an error.

Note that the use of the root-mean-square error as a measurement for performance is not common for decision trees. More common methods include information gain, an entropy-based metric, or Gini impurity.

We started our exercise with information gain as a metric, but due to a stubborn bug in the implementation, we switched to RMSE. As this seems to work equally well for the exercise, we never switched back to information gain.

Return to Hyperspace


As with all machine learning exercises, we may wonder what the optimal values are for those few parameters of a model, the ones that we do not train, the ones we call hyperparameters. We did that rather exhaustively in the context of artificial neural networks, in our previous posts on A Not-so-Random Walk Through Hyperspace and Secrets of the Heart.

For decision trees, the obvious candidate hyperparameters are the performance metrics and the stop conditions. As explained before, we did not take the time to try other performance metrics than the RMSE. However, we did play with the stop conditions, to some degree.

It is common sense to stop splitting nodes whenever the performance of the split configuration does not improve that of the original one, which is a stop condition that we have been applying systematically. Other stop conditions we applied are on the maximum error and the minimum cluster size.

With the maximum error, we refer to the error that is allowed at the level of a node in the decision tree. Whenever the average error on the data related to that node lies beneath that maximum, we will not split that node any further. Increasing the maximum error step by step from 1 hour to 20 hours, we get an unexpected result:
The increasing average error of the decision tree on both the training and the test data is expected. The threshold for the stop condition on the error should be very low. We do however have no explanation for the unexpected change in the speed of deterioration of the error somewhere around 13 hours.

A more interesting parameter however, is with the cluster size. How small do we need to make the clusters, while keeping overfitting in mind ? We therefore test the performance of the decision trees in function of the minimum cluster size. Whenever the number of data elements associated with a cluster drops beneath that minimum, we will not split that node any further. Increasing the minimum cluster size step by step from 1 data element to 30 data elements, we get:
The interpretation of this chart is not straightforward. The average error on the test set seems to remain constant up to a minimum cluster size of 20 data elements, with somewhat decreasing volatility, in particular since the average error is taken as the average over 5 different test sets. After a minimum cluster size of 20 data elements, the performance of the decision trees seems to start deteriorating.

Note that the optimal cluster size is not absolute, but is expected to depend strongly on the problem at hand, in particular on the number of values of the independent variable.

Finally, we wish to investigate what the impact of the size of the training set is on the performance of the decision tree. We test the performance of the decision trees on a training set with size ranging from 1K to 50K data elements, using a maximum error of 1 hour and a minimum cluster size of 10 data elements.
The average error measured on the test set systematically lies somewhat above the average error measured on the training set, as can be expected. In addition to that, the average error keeps getting smaller, up to 1.77 hours for a training set of 50K data elements, getting closer and closer to the error defined as The Benchmark. Testing on an extended range seems to be the next step here.

The systematically good performance on the test sets, again averaged out over 5 test sets, is a clear indication that there is no overfitting. At least not on the training data. With fabricated data, we cannot exclude that the model is overfitting on the data generation method.

Growing a Forest


By now, it is a well known fact that the performance of a machine learning algorithm can be improved and stabilized by training multiple models and combining the estimates these models provide. We already tried that once with artificial neural networks in our previous post Secrets of the Heart.

In the context of machine learning, the term random forest is often used to describe this technique. I guess the trees were the inspiration for the forest.

Trying it out on our decision trees, training 10 different decision trees on disjoint subsets of the training data set, we get:

Tree      Average Error
0         2.67
1         2.63
2         2.89
3         2.65
4         2.64
5         2.72
6         2.76
7         2.79
8         2.53
9         2.74
Ensemble  2.49

The estimate of the random forest is determined as the arithmetic average of the estimates of the individual decision trees in the forest. The average error for each individual decision tree is determined as an average on 5 different test sets, as is the average error of the forest.

The result is in line with expectations. The average error for all the trees together is 2.70 hours, while the ensemble achieves an average error of 2.49 hours. With that score, the forest outperforms each one of the individual decision trees.

Conclusion


All of machine learning is either magic or trivial. It is magic until you understand it, and then it becomes trivial. - after Ernest Rutherford

The decision tree seems to be a good instrument for the estimation of a categorical (or the discretisation of a continuous) dependent variable. It combines a computationally lightweight implementation with good performance in terms of the average error of the estimation. It is also fairly easy to implement.

Using a threshold on the target variable to determine the next optimal split condition for the independent variables is key to building an efficient, balanced decision tree. This technique is relevant for any performance metric used.

The creation of monstrously large decision trees may be expected. This is not necessarily a sign of overfitting. The proclaimed explainability of the estimates made by decision trees therefore seems to be overly optimistic.

In our case, the downside of using generated data is that it will not be as rich in variation as real-world data. On the other hand, an advantage is that we have all the data available that we want, without limitations on the data volume. All in all, fabricated data has allowed us to effectively develop and test the algorithm to build the decision trees, and we are now ready to take it to the real world.

Data Generation


As stated earlier, real-world data on parcel delivery is not available to us, so we have to work with fabricated data, generated, that is.

First, we have to generate randomly distributed data for the independent variables:
  • Departure timestamps are generated over the course of a week, with a higher probability on weekdays as opposed to weekend days, and a still higher probability on Fridays. The time of departure, barring exceptions, is between 6h00 and 20h00, with a higher probability before noon. All this is calculated in the time zone of the country at hand, and truncated to the hour.
  • For generating the Origin and Destination variables, we use a fixed list of five cities, capitals, and the probability to select a city is proportional to the population of the country where the city is located.
  • For generating the Carrier variable, we use a fixed list of three carriers, and the probability to select a carrier is specified in the configuration data. The exact probabilities are not relevant.
  • The ratio of express deliveries is configured as 6%.

There is no correlation between any of the independent variables, by definition, but also in practice.

Then we have to determine the values for the dependent variable, in function of these independent variables, with variation. The arrival timestamp is calculated from the departure timestamp in a number of steps:
  • A delay is added to leave the origin country, proportional to the area of the country, according to a normal distribution, where the standard deviation is 10% of the mean.
  • If there is a time zone difference between the origin and the destination, a delay of 4 hours per hour difference in time zone is added, varying according to a normal distribution, where the standard deviation is 10% of the mean.
  • A delay is added to arrive in the destination country, proportional to the area of the country, according to a normal distribution, where the standard deviation is 10% of the mean.
  • For express deliveries, these delays are divided by 24.
  • For leaving the origin country and arriving in the destination country, these delays are added taking into account whether the carrier is active during night-time (18h00 to 06h00) and during the weekend (Saturday and Sunday). These properties vary over the different carriers.
  • As parcels are not delivered during night-time nor during the weekend, an additional delay of 10 minutes is added, according to a normal distribution, where the standard deviation is 10% of the mean, now skipping night-time and weekends.
  • If the total delay thus obtained is less than 1 hour, the delay is increased by 1 hour.
  • The arrival timestamp thus obtained is truncated to the hour.

The effects of all this logic on the distribution of both the independent and the dependent variables have already been illustrated to some degree in the section Visual Inspection.

Note that, as we are using java.util.Random.nextDouble for generating random numbers, we need a trick to generate random numbers with a given distribution. This can be achieved by generating uniformly distributed random numbers, and subsequently applying the inverse cumulative distribution function (CDF) of the given distribution to the uniformly distributed random numbers.
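
For illustration, a minimal sketch of that inverse-transform idea, here shown with an exponential distribution because its inverse CDF has a simple closed form; the same principle applies to the distributions described above:
public class Sampling {

    // Draw a sample from an exponential distribution with the given rate,
    // by applying the inverse CDF to a uniformly distributed random number.
    public static double nextExponential(Random random, double rate) {
        double u = random.nextDouble();      // uniformly distributed in [0, 1)
        return -Math.log(1.0-u)/rate;        // inverse CDF of the exponential distribution
    }

}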

Java Implementation


No Java code snippets this time. There is nothing much to it. Just a few tips and tricks:
  • Make sure you have a decent data structure for the representation of the data sets, frequency distributions, performance metrics ... We used lists of tuples, with schemas, and streams to work on them, much like those introduced in our previous post on Tuple Calculus using Java 8 Streams, but without the DSL.
  • Visualize your data, visualize your results, and automate the generation of these visuals.
  • Store results as you go, both data and visualizations. And store meta data as well. Remember what you did.

Tuesday, December 29, 2020

Anatomy of the Pauli Group

Elaboration on Hadamard and Friends - Constructing the Pauli group as the central product of the cyclic group of order four and the dihedral group of a square - Finite group representation and operations in Java.

Introduction


In our previous post about quantum computing, Hadamard and Friends, we touched upon the Pauli group, a group of matrices over the complex domain that represents a collection of elementary quantum gates.

The Wikipedia page on the Pauli group states that the group is isomorphic to the central product of the cyclic group of order four and the dihedral group of a square. In this post, we will create representations of several of these groups in Java and demonstrate the isomorphism for \(G_1\), the Pauli group on a single qubit.

Generating the Pauli Group


The Pauli group can be defined using three matrices over the complex domain: $$X=\begin{bmatrix}0&1\\1&0\end{bmatrix},\,Y=\begin{bmatrix}0&-i\\i&0\end{bmatrix},\,Z=\begin{bmatrix}1&0\\0&-1\end{bmatrix}.$$ Before you continue, note that the wikipedia page on Pauli matrices describes a lot of properties and applications of these matrices.

Generating the group using these generators, as illustrated in Generation of a Group, gives us a total of 16 elements: $$\begin{align}G_1=\{\,&X,\,Y,\,Z,\,XX,\,XY,\,XZ,\,YX,\,YZ,\,ZX,\,ZY,\\&XYX,\,XYZ,\,XZX,\,XZY,\,YXY,\,XYXY\,\}.\end{align}$$ The generation of the group can be illustrated using a Cayley graph:

The Cayley Graph for the Generated Pauli Group

Here, multiplications with \(X\), \(Y\) and \(Z\) are represented by blue, red and green lines respectively.

Note that all three generators \(X\), \(Y\) and \(Z\) are their own inverse. For example $$YY=\begin{bmatrix}0&-i\\i&0\end{bmatrix}\begin{bmatrix}0&-i\\i&0\end{bmatrix}=\begin{bmatrix}1&0\\0&1\end{bmatrix}=I.$$ This implies that all operations in the Cayley graph are symmetrical. We therefore use no arrows in the Cayley graph, and any edge can be used in both directions. As another consequence, the neutral element \(I\) is represented as \(XX\).

Each element of the group can be represented in many different ways. For example, it is easy to observe the following equations in the Cayley graph: $$XYZ=YZX=ZXY.$$ The generation process implicitly checks for equal matrices. The notation of the elements above merely depends on the order in which the generators are combined. The first time a new matrix is obtained, it is added to the set, while subsequent occurrences of the same matrix will not be added to the set.

Enumerating the Pauli Group


The Pauli group can also be defined by enumeration of all elements, using the identity matrix and the generator matrices introduced earlier, multiplied by one of a number of complex factors: $$G_1=\{fA,f\in\{\pm 1,\pm i\},A\in\{I,X,Y,Z\}\}.$$ Consider the matrix multiplication $$XY=\begin{bmatrix}0&1\\1&0\end{bmatrix}\begin{bmatrix}0&-i\\i&0\end{bmatrix}=\begin{bmatrix}i&0\\0&-i\end{bmatrix}=iZ.$$ It is easy to verify that each element of the generated definition of the group can be written as an element of the enumerated definition of the group. As such, the enumerated group is identical to the generated group. We intentionally do not use the term isomorphic, as the matrices in both groups are identical and the difference is merely in naming.

We can therefore rename the nodes in the Cayley graph based on this enumeration:

The Cayley Graph for the Enumerated Pauli Group

Alternative Generators


Using the Cayley graph, one can observe the symmetry of the generators, in the sense that an even permutation of the generators will not change the structure of the Cayley graph. An even permutation of \(X\), \(Y\) and \(Z\) still yields a correct Cayley graph, where the edges duly represent matrix multiplications. Odd permutations however would require some nodes to trade places in order for the group to be correctly represented.

Looking at the way the group is generated by the three generators, each combination of two generators will generate exactly half of the group. In fact, that is the case for any pair of independent elements of the group.

Applying the generators one after the other, alternatingly, enumerates half of the group in a loop. The third generator is required to get to the other half of the group, where the same loop structure occurs. In fact, that is the case for any combination of generators of the group.

We will now move on to an alternative construction of the Pauli group, as the central product of the cyclic group of order four and the dihedral group of a square.

The Cyclic Group of Order Four


The cyclic group of order 4 is a group with one generator that yields the identity after 4 applications. It can be defined as the additive group of integers modulo 4: $$C_4=\mathbb{Z}/4\mathbb{Z}=\{0,1,2,3\}.$$ Note that a number refers to the equivalence class of that number in the quotient group here.

As we can only represent finite groups, we cannot use the definition based on a quotient group of \(\mathbb{Z}\). We will be using the finite set of integers modulo 4 and define the addition operation on that set, as illustrated in Cyclic Groups.

The Cayley graph for the group is

The Cayley Graph for the Cyclic Group of Order 4

In what follows, we will need the center of \(C_4\). The center of a group is defined as the set of elements that commute with every element in the group. The center of a group \(G\) is denoted and defined by $$Z(G)=\{z\in G|\forall g\in G, zg=gz\}.$$ As \(C_4\) is an Abelian group, its center is the group itself: $$Z(C_4)=C_4=\{0,1,2,3\}.$$ Note that we could just as well create a cyclic group of order 4 using for example the matrix representation of a 90 degrees rotation as a generator. Borrowing on the notation of the Pauli matrices, that would be using the matrix $$R=XZ=\begin{bmatrix}0&1\\1&0\end{bmatrix}\begin{bmatrix}1&0\\0&-1\end{bmatrix}=\begin{bmatrix}0&-1\\1&0\end{bmatrix}.$$ The group would then be defined as the set $$C_4=\{I,R,R^2,R^3\},$$ with composition (matrix multiplication) as an operation.

Note that we could have used any combination of two generators of the Pauli group, in any order, as they will all cycle after four applications. Defining \(R\) as \(XZ\) is just an example with an intuitive meaning.
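
For illustration, a quick check that \(R\) indeed cycles after four applications: $$R^2=\begin{bmatrix}0&-1\\1&0\end{bmatrix}\begin{bmatrix}0&-1\\1&0\end{bmatrix}=\begin{bmatrix}-1&0\\0&-1\end{bmatrix}=-I,\quad R^4=(-I)(-I)=I.$$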

We could use the matrix representation of \(C_4\) to continue. However, we opted for the integers modulo 4, just to illustrate using groups with different types of elements in our constructions.

The Dihedral Group of a Square


A dihedral group is the symmetry group of a regular polygon. In this case, we need the dihedral group of a square, denoted \(D_8\) in abstract algebra, denoted \(D_4\) in geometry. The group can be generated using two generators: a rotation over 90 degrees, denoted R, and mirroring against the vertical axis, denoted M.

We can represent those generators by matrices once more, as transformations in a plane: $$M=\begin{bmatrix}1&0\\0&-1\end{bmatrix},\,R=\begin{bmatrix}0&-1\\1&0\end{bmatrix}.$$ Generating the group using these generators, as illustrated in Generation of a Group, gives us a set of 8 elements: $$D_8=\{I,R,RR,RRR,M,RM,RRM,MR\},$$ with composition (matrix multiplication) as an operation.

The Cayley graph for this group is

The Cayley Graph for the Dihedral Group of a Square

A red arrow represents multiplication by \(R\), a blue line represents multiplication by \(M\). As before, as \(M\) represents a symmetric operation, the arrows for that generator have been left out.

In what follows, we will need \(Z(D_8)\), the center of \(D_8\). As illustrated in The Center of a Group, we get $$Z(D_8)=\{I,RR\}.$$ That is, besides the identity, \(RR\) is the only element that commutes with every other element in the group.

Note that our center is isomorphic to the cyclic group of order 2: $$Z(D_8)\cong C_2.$$

The Central Product


On the Wikipedia page on the Pauli group, we read that the Pauli group is isomorphic to the central product of the cyclic group of order four and the dihedral group of a square: $$G_1\cong C_4\circ D_8.$$ As a reminder, the Wikipedia page on the central product states that the central product is similar to the direct product, but in the central product two isomorphic central subgroups of the smaller groups are merged into a single central subgroup of the product.

The Direct Product


The set of elements of the direct product of \(C_4\) and \(D_8\) is the cartesian product of the factor groups: $$C_4\times D_8 = \{(n, A)\,|\,n\in C_4, A\in D_8\}.$$ The operation for the direct product is the pairwise application of the operations of the factor groups. Assuming that \(+\) is the group operation of \(C_4\), being the additive group of integers modulo 4, matrix multiplication in \(D_8\) is denoted by juxtaposition, and \(\times\) is the group operation of the direct product, we get: $$(n, A)\times (m, B) = (n+m, AB).$$

The Common Center


Looking at the centers of \(C_4\) and \(D_8\), it is easy to see that the maximum subgroup they have in common is isomorphic to \(C_2\), the cyclic group of order 2. Formulated over the respective alphabets, we get $$\{0, 2\}\cong\{I, RR\}\cong C_2.$$ We will therefore express the commonality using the subgroup \(Z\) of the direct product isomorphic to the common center of the factor groups as: $$Z=\{(0, I), (2, RR)\}\cong C_2.$$ Clearly, \((0, I)\) is the neutral element of the direct product, and consequently also of \(Z\). We quickly verify the cyclic nature of \(Z\), for illustration: $$(2, RR)\times (2, RR) = (2+2, RRRR) = (0, I).$$

An Alternative Definition of the Pauli Group


Now, finally, we are ready to try constructing our alternative Pauli group, as a quotient group we will refer to as \(F_1\), dividing the direct product by the subgroup \(Z\): $$G_1\cong F_1=(C_4\times D_8)/Z.$$ Although we did use the notion of a quotient group already talking about the set of integers modulo 4, we quickly repeat the definition of a quotient group as given on Wikipedia: $$G/N=\{gN\,|\,g\in G\},$$ where \(G\) is a group and \(N\) is a normal subgroup. Here, \(gN\) is referred to as a coset of \(N\) in \(G\). Typically, different elements \(g\) will result in the same coset when applied to \(N\).

However, for simplicity in notation, for the representation of the quotient group, we will use a representative of each coset rather than the coset itself. For example, we will use \((1, M)\) as the representative of its coset $$(1, M)\times Z=\{(1, M), (3, RRM)\}.$$ As such, and as illustrated in The Central Product, we get the following enumeration for the quotient group: $$\begin{align}F_1=\{&(0,I), (0,R), (0,M), (0,RR), (0,RM), (0,MR), (0,RRR), (0,RRM),\\&(1,I), (1,R), (1,M), (1,RR), (1,RM), (1,MR), (1,RRR), (1,RRM)\}.\end{align}$$ So far so good, but is this group \(F_1\) really isomorphic to the Pauli group?

Finding the Isomorphism


To verify whether \(F_1\) is isomorphic to \(G_1\), we have to find a bijection \(\phi\) between the elements of both groups that preserves the group operation. That is, for \(A\) and \(B\) elements of \(G_1\): $$\phi(AB)=\phi(A)\times\phi(B).$$ First of all, the number of elements is fine: \(F_1\) has 16 elements, just as the Pauli group. However, the remainder of the exercise is a bit harder.

We can restrict ourselves to defining the bijection for the generators only, and then verifying preservation of the operation on all combinations (including repetitions) of the generators.

Candidate mappings for the generators should respect a number of constraints. First of all, the identity element can be excluded. Secondly, all generators should be their own inverse, as is the case for \(X\), \(Y\) and \(Z\). This reduces the candidate set to seven elements: $$\{(0,M), (0,RR), (0,RM), (0,MR), (0,RRM), (1,R), (1,RRR)\}$$ Furthermore, logic dictates that in our combination of three generators, all basic building blocks should be represented, in particular the elements \(0\), \(1\), \(R\) and \(M\) of the respective factor groups. This leaves \(2\times 4\times 5=40\) possible combinations of three, ignoring order.
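
To illustrate the self-inverse constraint above, consider \((1,R)\): $$(1,R)\times(1,R)=(2,RR)\in Z,$$ so the square of \((1,R)\) lands in the identity coset, which means \((1,R)\) is indeed its own inverse in \(F_1\).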

In the java section on Finding an Isomorphism, we do an exhaustive scan, mapping \(X\), \(Y\) and \(Z\) onto different combinations of elements of \(F_1\). Here, we limit ourselves to presenting one of the resulting isomorphisms: $$\phi(X)=(0, M),\,\phi(Y)=(0, RM),\,\phi(Z)=(1, R).$$ Once we have the bijection defined for the generators, the entire group structure follows nicely. This is illustrated once again by a Cayley diagram, the equivalent of the ones presented earlier, but now over our new alphabet:

The Cayley Graph for the Central Product

Note that many more mappings are possible to define the isomorphism, given the symmetries of the Pauli group, and the one presented here represents just one possibility.

Conclusions


We verified exhaustively that the Pauli group on a single qubit, typically defined as the group generated by three 2x2 matrices over the complex domain, is isomorphic to the central product of the cyclic group of order four and the dihedral group of a square. There's no magic involved here.

Java Implementation


However, the fun in this post is found in the way finite groups can be represented in Java, using whatever type of element and operation that suits the purpose. In addition, constructs such as the direct product, the center and the quotient of groups can be implemented generically on these groups. In what follows, we will illustrate this implementation in Java.

Group Representation


We start our implementation in Java with the definition of the interface that defines a group:
public interface Group<T> {

	public Set<T> getElements();
	public T getIdentity();
	public T multiply(T a, T b);

}

It is defined as a generic interface, where the element type is a parameter. We do not impose any explicit type for the elements of a group, merely that the element type implements equals and hashCode.

A group should be able to produce its elements as a Java Set, it provides access to the neutral element, and it implements the group operation, referred to as multiply.

Note that the group operation could be implemented on the element as well, but that would require an explicit element type, which strongly reduces the genericity of the group implementation.

Also note that we did not include the concepts of symmetric elements, as we do not need them in the scope of this post. We are confident that this could be added to the implementation without breaking the approach taken.

Enumeration of a Group


As a first implementation of the group interface we will focus on an enumerated group:
public class EnumeratedGroup<T> implements Group<T> {

    private Set<T> elements;
    private T identity;
    private BiFunction<T, T, T> operation;

    public EnumeratedGroup(Set<T> elements, T identity, BiFunction<T, T, T> operation) {
        this.elements = new HashSet<>(elements);
        this.identity = identity;
        this.operation = operation;
    }

    ...

    @Override
    public T multiply(T a, T b) {
        return operation.apply(a, b);
    }

}

The constructor of the group requires the set of elements of the group, an explicit reference to the identity element, and a binary function that implements the group operation.

The implementation of this class is straightforward now. You will want to add getters and at least provide an implementation for toString. We only included the implementation of the multiply function, just in case.
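
As a hypothetical usage example, not part of the original code, the cyclic group of order two can be enumerated over Booleans, with exclusive or as the group operation:
Group<Boolean> c2 = new EnumeratedGroup<>(
        new HashSet<>(Arrays.asList(Boolean.FALSE, Boolean.TRUE)),  // elements
        Boolean.FALSE,                                              // identity
        (a, b) -> a ^ b);                                           // group operation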

Generation of a Group


Next comes the implementation of a group specified using a finite set of generators:
public class GeneratedGroup<T> implements Group<T> {

    private Set<T> generators;
    private Set<T> elements;
    private T identity;
    private BiFunction<T, T, T> operation;

    public GeneratedGroup(Set<T> generators, T identity, BiFunction<T, T, T> operation) {
        this.generators = generators;
        this.identity = identity;
        this.operation = operation;
    }

    @Override
    public Set<T> getElements() {
        if (elements==null) {
            elements = new HashSet<>(generators);
            elements.add(getIdentity());
            int size = 0;
            while (size<elements.size()) {
                size = elements.size();
                Set<T> generated = new HashSet<>();
                for (T t : elements) {
                    for (T g : generators) {
                        generated.add(multiply(t, g));
                    }
                }
                elements.addAll(generated);
            }
        }
        return elements;
    }

    ...

}

Again, simple implementation details are left out.

The thing worth mentioning here is the getElements function, implemented lazily. On first invocation, it will generate the set of elements based on the identity element and the set of generators using a fixed point approach:
  • We start with the set of elements consisting of the generators and the identity element.
  • We then keep applying all generators to all elements of this set, and adding these as new elements, until the set no longer grows in size.

The Pauli Group


We now have sufficient building blocks to implement a first version of the Pauli group, with matrices as elements, using generators. We will be using the matrices introduced in Generating the Pauli Group.

We will be borrowing the matrix implementation of our previous post about quantum computing, Hadamard and Friends. And we have an alphabet of matrices ready:
public class Alphabet {

    public static final Matrix I = MatrixUtils.identity(2);

    public static final Matrix X = new Matrix("X", new Double[][] { { 0d, 1d }, { 1d, 0d } });
    public static final Matrix Y = new Matrix("Y", new Double[][] { { 0d, -1d }, { 1d, 0d } }).cmul(Complex.I).setName("Y");
    public static final Matrix Z = new Matrix("Z", new Double[][] { { 1d, 0d }, { 0d, -1d } });

    ...

}

Multi-dimensional complex array constants are cumbersome to write, and the Matrix implementation is immutable. Therefore, we have to rename the matrix \(Y\) after multiplication with the complex number \(i\).

The Pauli group can now easily be defined as:
public class GroupUtils {

    public static Group<Matrix> getPauliGroup() {
        Set<Matrix> generators = new HashSet<>(Arrays.asList(Alphabet.X, Alphabet.Y, Alphabet.Z));
        return new GeneratedGroup<Matrix>(generators, Alphabet.I, (a, b) -> a.vmul(b));
    }

    ...

}
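
As a quick sanity check, not part of the original snippet, generating the group and counting its elements should reproduce the 16 matrices enumerated earlier:
Group<Matrix> pauli = GroupUtils.getPauliGroup();
System.out.println(pauli.getElements().size());   // expected: 16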

Cyclic Groups


For the alternative implementation of the Pauli group, we first need a cyclic group of order four. We cannot define this group as the quotient \(\mathbb{Z}/4\mathbb{Z}\), as we can only handle finite groups. We will therefore explicitly define modulo numbers and once more use a generated group:
public class Modulo {

    private int value;
    private int modulo;

    public Modulo(int value, int mod) {
        this.value = value%mod;
        this.modulo = mod;
    }

    public Modulo add(Modulo m) {
        return new Modulo(value+m.getValue(), modulo);
    }

    ...

}

Completing the implementation is straightforward. We added the implementation of the group operation add, just in case.
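
One detail worth spelling out: as Modulo values end up in hash sets during group generation, equals and hashCode must compare by value. A minimal sketch of what that could look like inside the Modulo class (using java.util.Objects):
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Modulo)) {
            return false;
        }
        Modulo m = (Modulo) o;
        return value==m.value && modulo==m.modulo;
    }

    @Override
    public int hashCode() {
        return Objects.hash(value, modulo);
    }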

The cyclic group of order four can now be easily defined as:
public class GroupUtils {

    public static Group<Modulo> getCyclicGroup(int size) {
        HashSet<Modulo> generators = new HashSet<>(Arrays.asList(new Modulo(1, size)));
        return new GeneratedGroup<Modulo>(generators, new Modulo(0, size), (a, b) -> a.add(b));
    }

    ...

}

The Dihedral Group of Order Eight


The dihedral group of order eight can equally be defined as a generated group, using the matrices introduced in The Dihedral Group of a Square. Therefore, this section is very similar to the one where we introduced the matrix representation of The Pauli Group.

First we need the matrix definitions:
public class Alphabet {

    public static final Matrix R = new Matrix("R", new Double[][] { { 0d, -1d }, { 1d, 0d } });
    public static final Matrix M = new Matrix("M", new Double[][] { { 1d, 0d }, { 0d, -1d } });

    ...

}

Then we are ready to create the group:
public class GroupUtils {

    public static Group<Matrix> getDihedral8Group() {
        Set<Matrix> generators = new HashSet<>(Arrays.asList(Alphabet.R, Alphabet.M));
        return new GeneratedGroup<Matrix>(generators, Alphabet.I, (a, b) -> a.vmul(b));
    }

    ...

}

The Center of a Group


Comes the time we need to compute the center of a group:
public class GroupUtils {

    public static <T> Group<T> getCenter(Group<T> group) {
        Set<T> center = group.getElements().stream()
            .filter(z -> group.getElements().stream()
                .filter(g -> !group.multiply(z, g).equals(group.multiply(g, z))).count()==0l)
            .collect(Collectors.toSet());
        return new EnumeratedGroup<T>(center, group.getIdentity(), group::multiply);
    }

    ...

}

We provide a stream-style implementation to determine the elements of the center: the outer filter keeps the elements of the group that commute with every element of the group, which the inner filter verifies by counting the elements that do not commute and requiring that count to be zero. We then use the resulting set to create an enumerated group with the same identity element and the same group operation.
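
As a usage sketch, not part of the original code, computing the center of the dihedral group defined earlier should yield the two elements \(I\) and \(RR\):
Group<Matrix> dihedral8 = GroupUtils.getDihedral8Group();
Group<Matrix> center = GroupUtils.getCenter(dihedral8);
System.out.println(center.getElements().size());   // expected: 2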

The Direct Product


By now, the exercise risks becoming just more of the same thing. We can implement the direct product of two groups as the Cartesian product of the elements of both groups, with the pairwise group operation:
public class ProductGroup<T, S> implements Group<Pair<T, S>> {

    private Group<T> tgroup;
    private Group<S> sgroup;
    private Set<Pair<T, S>> elements;

    public ProductGroup(Group<T> tgroup, Group<S> sgroup) {
        this.tgroup = tgroup;
        this.sgroup = sgroup;
    }

    @Override
    public Set<Pair<T, S>> getElements() {
        if (elements==null) {
            elements = new HashSet<>();
            for (T t : tgroup.getElements()) {
                for (S s : sgroup.getElements()) {
                    elements.add(new Pair<>(t, s));
                }
            }
        }
        return elements;
    }

    @Override
    public Pair<T, S> getIdentity() {
        return new Pair<>(tgroup.getIdentity(), sgroup.getIdentity());
    }

    @Override
    public Pair<T, S> multiply(Pair<T, S> a, Pair<T, S> b) {
        return new Pair<>(tgroup.multiply(a.getT(), b.getT()), sgroup.multiply(a.getS(), b.getS()));
    }

    ...

}

We use a Pair to combine two elements, one of each group. You can use any implementation you can find, or write your own. Just make sure to implement equals and hashCode in a proper way.

Again we have a lazy implementation of the getElements method that calculates the Cartesian product of the elements of the two groups as a set of pairs. The identity element is the pair of the identity elements of both groups. The operation is the pairwise multiplication of the elements of both groups.
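
A hypothetical usage example: constructing the direct product of the cyclic group of order four and the dihedral group of a square, which should contain \(4\times 8=32\) pairs:
Group<Modulo> cyclic4 = GroupUtils.getCyclicGroup(4);
Group<Matrix> dihedral8 = GroupUtils.getDihedral8Group();
Group<Pair<Modulo, Matrix>> product = new ProductGroup<>(cyclic4, dihedral8);
System.out.println(product.getElements().size());   // expected: 4 * 8 = 32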

Quotient Groups


The implementation of a quotient group however is a bit trickier:
public class QuotientGroup<T> implements Group<T> {

    private Group<T> group;
    private Set<T> subgroup;
    private Set<T> elements;
    private Map<T, Set<T>> representation;

    public QuotientGroup(Group<T> group, Set<T> subgroup) {
        this.group = group;
        this.subgroup = subgroup;
    }

    @Override
    public Set<T> getElements() {
        if (elements==null) {
            elements = new HashSet<>();
            representation = new HashMap<>();
            Set<Set<T>> cosets = new HashSet<>();
            for (T e : group.getElements()) {
                Set<T> coset = GroupUtils.multiply(group, subgroup, e);
                if (!cosets.contains(coset)) {
                    elements.add(e);
                    representation.put(e, coset);
                    cosets.add(coset);
                }
            }
        }
        return elements;
    }

    @Override
    public T getIdentity() {
        return group.getIdentity();
    }

    private T normalize(T t) {
        T n = GroupUtils.findElement(representation, t);
        if (n!=null) {
            return n;
        }
        return t;
    }

    @Override
    public T multiply(T a, T b) {
        return normalize(group.multiply(a, b));
    }

    ...

}

We start with a group and a subgroup, the latter represented as a set of elements only. In addition to that, the implementation will keep track of the set of elements, as most of the Group implementations do. However, as we will represent each coset by a unique representative of the coset, we need to keep track of which element represents which coset. This is done using the representation, a mapping of representatives to cosets.

As before, the elements are determined lazily. To determine the set of elements, we scan the elements of the larger group and determine the coset of the smaller group for each element. If this is a new coset, we use the element as a representative for that coset, and update the representation map. If we have already seen the coset, the element is skipped, as the coset already has a representative.

The getElements function uses a utility function GroupUtils.multiply for this purpose, that determines the coset of an element of the group:
public class GroupUtils {

    public static <T> Set<T> multiply(Group<T> group, Set<T> elements, T element) {
        return elements.stream().map(e -> group.multiply(e, element)).collect(Collectors.toSet());
    }

    ...

}

The QuotientGroup implementation has a private member function normalize that does exactly that, replacing an element by its unique representative; we use it after every calculation. For now, that is only in the multiply function. It uses another utility function GroupUtils.findElement, which finds the coset in the representation map that contains a newly created element, and returns the representative of that coset:
public class GroupUtils {

     public static <T> T findElement(Map<T, Set<T>> map, T t) {
        for (T r : map.keySet()) {
            if (map.get(r).contains(t)) {
                return r;
            }
        }
        return null;
    }

    ...

}

The Central Product


We now have sufficient building blocks to implement the second version of the Pauli group, as the central product of the cyclic group of order four and the dihedral group of a square.
public class GroupUtils {

    public static Group<Pair<Modulo, Matrix>> getCentralProductPauliGroup() {
        Group<Modulo> cyclic4 = getCyclicGroup(4);
        Group<Matrix> dihedral8 = getDihedral8Group();
        Group<Pair<Modulo, Matrix>> product = new ProductGroup<>(cyclic4, dihedral8);
        Set<Pair<Modulo, Matrix>> subgroup = getCentralProductPauliSubGroup();
        return new QuotientGroup<>(product, subgroup);
    }

    public static Set<Pair<Modulo, Matrix>> getCentralProductPauliSubGroup() {
        Set<Pair<Modulo, Matrix>> subgroup = new HashSet<>();
        subgroup.add(new Pair<>(new Modulo(0, 4), Alphabet.I));
        subgroup.add(new Pair<>(new Modulo(2, 4), Alphabet.R.vmul(Alphabet.R)));
        return subgroup;
    }

    ...

}

Note that we explicitly construct the subgroup of the direct product isomorphic to the maximal common subgroup of the centers of both factor groups of the direct product. Let's say we leave the generic implementation of that step as an exercise.

Finding an Isomorphism


Verifying whether a mapping between groups is an isomorphism is easy. Assume we represent the isomorphism as a mapping of elements from one group to the other:
public class GroupUtils {

    public static <T, S> boolean isIsomorphism(Group<T> tgroup, Group<S> sgroup, Map<T, S> map) {
        if (tgroup.getElements().size()!=sgroup.getElements().size()) {
            return false;
        }
        for (T a : tgroup.getElements()) {
            for (T b : tgroup.getElements()) {
                if (!map.get(tgroup.multiply(a, b)).equals(sgroup.multiply(map.get(a), map.get(b)))) {
                    return false;
                }
            }
        }
        return true;
    }

    ...

}

If the number of elements in both groups is different, the groups cannot possibly be isomorphic. Otherwise, we check every combination of two elements of the first group, and verify whether the product of the images of the elements equals the image of the product. Any mismatch refutes the isomorphic character of the mapping.

Finding an isomorphism between two groups is trickier. Suppose we have generators for both groups, then we can gradually construct the isomorphism, and check for inconsistencies along the way:
public class GroupUtils {

    public static <T, S> Map<T, S> getIsomorphism(Group<T> tgroup, Group<S> sgroup, List<T> tgenerators, List<S> sgenerators) {
        if (tgenerators.size()!=sgenerators.size()) {
            return null;
        }
        Map<T, S> isomorphism = new HashMap<>();
        isomorphism.put(tgroup.getIdentity(), sgroup.getIdentity());
        for (int i=0; i<tgenerators.size(); ++i) {
            isomorphism.put(tgenerators.get(i), sgenerators.get(i));
        }
        int size = 0;
        while (isomorphism.size()>size) {
            size = isomorphism.size();
            Map<T, S> mapping = new HashMap<>();
            for (T t1 : isomorphism.keySet()) {
                for (T t2 : isomorphism.keySet()) {
                    T tproduct = tgroup.multiply(t1, t2);
                    S sproduct = sgroup.multiply(isomorphism.get(t1), isomorphism.get(t2));
                    S expected = isomorphism.get(tproduct);
                    if (expected==null) {
                        expected = mapping.get(tproduct);
                    }
                    if (expected==null) {
                        mapping.put(tproduct, sproduct);
                    } else if (!sproduct.equals(expected)) {
                        return null;
                    }
                }
            }
            isomorphism.putAll(mapping);
        }
        return isomorphism;
    }

    ...

}

In what follows we will refer to the first group as the domain of the isomorphism and to the second group as the image of the isomorphism.

We assume that the lists of generators provided for both groups already reflect the isomorphism. Therefore, they should be of equal size.

Then we initialize the isomorphism with the generators. We then extend the isomorphism by trying all combinations of elements in the domain, and calculating the products of their images and the image of their product. If the product is not yet present in the domain of the isomorphism, we add a mapping. If the product is present already, its image in the isomorphism should be equal to what we just calculated.

Because we cannot iterate over a collection while it is being modified, we accumulate new mappings in a separate data structure. We then merge all new mappings into the isomorphism after all combinations of elements in the domain have been processed.

We repeat this process until the isomorphism no longer grows in size, effectively finding a fixed point for the process. Under the assumption that the initial elements provided are generators for the domain and the image of the isomorphism, respectively, the fixed point will represent an isomorphism between both groups.

We realize this is all a bit messy, but we basically come to the same conclusion as Naftali Harris on this subject: finding an isomorphism between two groups is not a trivial task. However, using generators will reduce the complexity in a significant way.

Selecting Generators


The only step remaining to close the loop here is finding the generators for the image. We will leave out the coding details, but give you the results.

We already have the generators for the domain, being \(X\), \(Y\) and \(Z\). Starting with all 16 elements of the image, we can limit ourselves to those that are a root of the identity element, as we know this condition holds for the generators of the domain. Excluding the identity itself, we find 7 elements that square to the identity:
[(0,M), (0,RR), (0,RM), (0,MR), (0,RRM), (1,R), (1,RRR)]

If we try each combination of three elements out of this set, no repetition, order matters, we get \(7\times 6\times 5=210\) possible combinations. Out of these, it turns out that \(48\) can be used to construct an isomorphism.
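
A minimal sketch of what this scan could look like, using the utilities defined earlier; the getCandidates helper is hypothetical and stands for the list of seven self-inverse elements shown above:
public class GroupUtils {

    public static void scanForIsomorphisms() {
        Group<Matrix> pauli = getPauliGroup();
        Group<Pair<Modulo, Matrix>> product = getCentralProductPauliGroup();
        product.getElements();   // initialize the lazy coset representation before multiplying
        List<Matrix> domain = Arrays.asList(Alphabet.X, Alphabet.Y, Alphabet.Z);
        List<Pair<Modulo, Matrix>> candidates = getCandidates();   // hypothetical helper
        for (Pair<Modulo, Matrix> a : candidates) {
            for (Pair<Modulo, Matrix> b : candidates) {
                for (Pair<Modulo, Matrix> c : candidates) {
                    if (a.equals(b) || b.equals(c) || a.equals(c)) {
                        continue;
                    }
                    Map<Matrix, Pair<Modulo, Matrix>> iso =
                        getIsomorphism(pauli, product, domain, Arrays.asList(a, b, c));
                    // an isomorphism is a consistent, bijective mapping that preserves the operation
                    if (iso!=null && new HashSet<>(iso.values()).size()==pauli.getElements().size()
                            && isIsomorphism(pauli, product, iso)) {
                        System.out.println(Arrays.asList(a, b, c));
                    }
                }
            }
        }
    }

    ...

}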

Knowing that the group structure is invariant under permutations of those three generators, and there are \(3!=6\) permutations of three elements, we get \(48/6=8\) combinations, ignoring order, that do the job. Here they are:
[
  [(0,M), (0,RM), (1,R)], [(0,M), (0,RM), (1,RRR)],
  [(0,M), (0,MR), (1,R)], [(0,M), (0,MR), (1,RRR)],
  [(0,RM), (0,RRM), (1,R)], [(0,RM), (0,RRM), (1,RRR)],
  [(0,MR), (0,RRM), (1,R)], [(0,MR), (0,RRM), (1,RRR)]
]

Conclusions


Wrapping the standard Java Set interface and its implementations with some additional characteristics provides an easy way to implement finite groups and play around with them. As can be expected however, one quickly gets into the realms of NP-complete problems. For the Pauli group on a single qubit, the limited quantitative complexity helped us to gain some insights.