Wednesday, November 1, 2017

How to Become an Astronaut

On Tim Peake's selection exam to become an astronaut, the group of rotations of a cube, solving the question the long way, using Knuth-Bendix Completion.
(updated on 15-06-2020)

Introduction


Recently, Tim Peake tweeted a Friday Night Brain Teaser about a question he was asked when taking the selection examination for admission to the astronaut program. The question was about the capacity for one to imagine 3D manipulations of a cube.

In this post, we will use the results of a previous post on Knuth Bendix Completion to find an answer to the question. No doubt this is not the shortest solution, but it surely is an interesting one.

The Question


Imagine a cube in front of you. The cube can be rolled left and right, and backward and forward. The bottom of the cube is marked with a black dot.


Imagine rolling the cube forward, left, left, forward, right, backward, right. Where is the black dot now?

The Solution


Although the solution is quite obvious for those of us with good spatial sense, it is interesting to approach the problem from an algebraic view.

In a previous post, we talked about Knuth-Bendix Completion on Non-Abelian Group Theory. As the set of transformations of a cube with composition as an operation is a non-Abelian group, we will unleash the Knuth-Bendix Completion procedure on this group.

If the procedure terminates, we will have a decidable rewrite system that implements this group. We will then apply this rewrite system to find the normal form of the sequence of rotations specified in the question and see what comes out.

Transformations of a Cube


The number of positions, including orientation, of a cube is easy to determine. There are six sides to a cube, and for each side there are four possible orientations. This brings the total number of positions to 24 (6x4).

Imagine a given starting position for a cube. Each of the 24 positions corresponds to a transformation of the cube from the starting position to that position. This gives a collection of 24 transformations that, with composition as an operator, forms a non-Abelian group. This group is referred to as the transformation group of a cube.

Permutations of Four


The Internet is full of sources stating that the group of transformations of a cube is isomorphic to the group of permutations of four elements, referred to as S4. An isomorphism is a one-to-one relationship between the elements of two groups that preserves the group operation. We will come back to this property later on.

Consider an ordered collection of 4 objects, say 1234. Then consider all the possible ways in which these 4 elements can be ordered, called permutations of these 4 objects.

To create such a configuration of 4, we have 4 possibilities for the first position, 3 for the second position (as one object is already used), 2 for the third position and then only one object is left for the fourth position. This gives a total of 24 (4x3x2x1 or 4!) permutations.

Some sources provide a proof of the isomorphism. We will take a more visual approach.

The Cayley Graph


A Cayley graph visualizes the structure of a group based on a limited number of elements, called generators. The idea is that any element of the group, without exception, can be written as a composition of generators. For the permutation group of four, this gives an interesting diagram:


In the bottom left corner is our starting configuration 1234. A red arrow, representing the rotation to the right, brings us to 4123. A blue arrow, representing the rotation to the left of the 3 leftmost objects while keeping the rightmost object in place, brings us to 2314. We will refer to the transformation represented by the red arrow as C, and the transformation represented by the blue arrow as S.

As you can see in the diagram, compositions of only these two generating elements of the group can be used to construct any of the 24 permutations, in various ways. Moreover, C is cyclic after four compositions, resulting in a number of red squares, while S is cyclic after three compositions, resulting in a number of blue triangles. Although the squares are not connected to one another, nor are the triangles, together they create a fully connected graph.

Note that the collection of generators is not unique, in the sense that different collections of elements of the group may be used as generators. What's special about this collection is that it results in a nice graphical representation without crossing arrows, which is certainly not always the case.

You will find the Cayley graph and a lot more visuals on the wikiversity page on the Symmetric Group S4.

The Isomorphism


Coming back to the group of transformations of a cube, we could try to find a collection of generators and construct a Cayley diagram. However, as the two groups are isomorphic, with the right representation of the positions of a cube and the right choice of generators, we should get the same Cayley diagram.

First, let's look at the representation of the position of a cube. We will use a dice for this purpose, to easily distinguish between different positions. We will label the corners of the dice with labels 1, 2, 3 and 4, in such a way that each side has four different labels. It turns out that, after choosing the labels for one side, only one possibility is left to complete the labeling for the whole dice:


For the top side of the dice, we choose the labels to start with 1 for the corner closest to us, and continue clockwise while counting to 4, as illustrated by the blue labels. The labels of the corners of the bottom side can then only be chosen equal to the diagonally opposite corner, as illustrated by the red labels (where label 1 is at the back and not shown). The position of the dice when labeling is important, but arbitrary.

We will then represent the position of a cube by enumerating the labels of the corners of the top side, starting with the corner closest to us, and continuing clockwise. This results in the following observations:
  • The representations of all possible positions of a cube correspond exactly with the permutations of four.
  • The transformation represented by C is clockwise rotation around the vertical axis.
  • The transformation represented by S is anti-clockwise rotation around the horizontal axis (front left to back right), followed by clockwise rotation around the vertical axis.

With this representation, the Cayley graph for both groups is identical. Grab a dice and try it out! And since any element of each group can be written as the same composition of generators, the relation preserves the group operation.
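To make the generator claim tangible, here is a small sketch (mine, not part of the original post) that encodes C and S as operations on the four-label string used above (1234 becomes 4123 under C and 2314 under S) and enumerates everything they generate; starting from 1234, it reaches all 24 positions.

import java.util.*;

// Illustrative sketch: enumerate the closure of the starting position 1234 under the generators C and S.
public class GeneratorClosure {

    // C: clockwise rotation around the vertical axis, 1234 -> 4123
    static String c(String t) { return "" + t.charAt(3) + t.charAt(0) + t.charAt(1) + t.charAt(2); }

    // S: rotate the three leftmost labels, keeping the rightmost in place, 1234 -> 2314
    static String s(String t) { return "" + t.charAt(1) + t.charAt(2) + t.charAt(0) + t.charAt(3); }

    public static void main(String[] args) {
        Set<String> reached = new LinkedHashSet<>();
        Deque<String> todo = new ArrayDeque<>(List.of("1234"));
        while (!todo.isEmpty()) {                 // breadth-first closure under C and S
            String t = todo.poll();
            if (!reached.add(t)) continue;        // already visited
            todo.add(c(t));
            todo.add(s(t));
        }
        System.out.println(reached.size() + " positions reachable: " + reached);
    }
}

Running it prints 24 reachable positions, which is exactly the set of 24 permutations of four.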

Knuth-Bendix Completion


We are now ready to start solving the question. We will start with the axioms of non-Abelian groups, define constants for the generators and add some axioms related to the generators. Then, Knuth-Bendix Completion will solve the question for us.

The Alphabet


An intuitive alphabet to represent the set of rotations on a cube, besides the identity, consists of a 90° rotation around each axis in 3D-space:
  • R - Right: Rolling the cube around the Y-axis, to the right.
  • C - Clockwise: Rolling the cube around the Z-axis, clockwise.
  • B - Backward: Rolling the cube around the X-axis, towards yourself.

Yes, we added an extra degree of freedom here, to prepare for interplanetary travel, say.

For readability, and because they are used in the question, we will add the inverse operations to the alphabet as well:
  • L - Left: Rolling the cube around the Y-axis, to the left.
  • A - Anti-Clockwise: Rolling the cube around the Z-axis, anti-clockwise.
  • F - Forward: Rolling the cube around the X-axis, away from yourself.

However, we will construct our theory using only the two generators introduced before, defined as:
  • C - Clockwise rotation around the vertical axis (hey, what a coincidence).
  • S - Rotate anti-clockwise around the horizontal axis, followed by rotate clockwise around the vertical axis.

One can easily express the transformations used in the question in terms of these generators as follows:
  • R = S*S*C*S
  • L = C*S*S
  • B = S*S*C
  • F = C*C*C*S

Remember, the composition operator reverses the order of operation: C*S reads as "C after S".

The Theory


We will start off with the three axioms of non-Abelian group theory. Closure is implicit.

[1] a*1 = a
[2] a*i(a) = 1
[3] a*b*c = a*(b*c)

Then, we will add a minimum set of axioms defining the interaction of the transformations C and S.
[4] i(C) = C*C*C
[5] i(S) = S*S

Here we express the cyclic character of these transformations, as stated before: C is cyclic after four compositions and S is cyclic after three compositions. Note that these axioms correspond to loops in the Cayley diagram.

Observing the diagram somewhat closer, we notice one other type of loop, involving a combination of red and blue arrows. We add this relationship as an axiom too.
[6] i(C) = S*C*S

This completes our theory. Note that we did not orient the axioms lexicographically, from "complex" to "simple", as mentioned in the post on the Knuth-Bendix Completion procedure. This is not necessary, as the procedure does the orientation by itself.
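Before handing the theory to the completion procedure, we can sanity-check axioms [4], [5] and [6] in the permutation representation introduced earlier. The sketch below is mine and purely illustrative; it encodes C, S and their inverses as operations on the label string and checks the three axioms on the starting position (since all transformations are permutations of positions, checking them on one string with four distinct labels is enough).

// Illustrative check that the generator axioms hold in the permutation representation.
// Composition X*Y reads as "X after Y", so the rightmost transformation is applied first.
public class AxiomCheck {

    static String c(String t)    { return "" + t.charAt(3) + t.charAt(0) + t.charAt(1) + t.charAt(2); } // C:    1234 -> 4123
    static String s(String t)    { return "" + t.charAt(1) + t.charAt(2) + t.charAt(0) + t.charAt(3); } // S:    1234 -> 2314
    static String invC(String t) { return "" + t.charAt(1) + t.charAt(2) + t.charAt(3) + t.charAt(0); } // i(C): 1234 -> 2341
    static String invS(String t) { return "" + t.charAt(2) + t.charAt(0) + t.charAt(1) + t.charAt(3); } // i(S): 1234 -> 3124

    public static void main(String[] args) {
        String t = "1234";
        System.out.println("[4] i(C) = C*C*C : " + invC(t).equals(c(c(c(t)))));
        System.out.println("[5] i(S) = S*S   : " + invS(t).equals(s(s(t))));
        System.out.println("[6] i(C) = S*C*S : " + invC(t).equals(s(c(s(t)))));
    }
}

All three checks print true.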

The Rewrite System


As explained in our previous post on Knuth-Bendix Completion, we will try generating a decidable rewrite system out of this theory. And indeed, after the ninth iteration, a rewrite system is produced consisting of 39 rewrite rules.
[1]  a*1 -> a
[2]  a*i(a) -> 1
[3]  a*b*c -> a*(b*c)
[4]  S*(C*S) -> i(C)
[5]  i(S) -> S*S
[6]  C*(C*C) -> i(C)
[7]  a*(i(a)*b) -> b
[8]  S*(S*S) -> 1
[9]  1*a -> a
[10] S*(C*(S*a)) -> i(C)*a
[11] C*(C*(C*a)) -> i(C)*a
[12] S*(S*(S*a)) -> a
[13] i(C)*i(C) -> C*C
[14] S*(S*i(C)) -> C*S
[15] i(C)*(i(C)*a) -> C*(C*a)
[16] i(C)*(S*S) -> S*C
[17] S*(S*(i(C)*a)) -> C*(S*a)
[18] i(C)*(S*(S*a)) -> S*(C*a)
[19] i(1) -> 1
[20] S*(S*(C*C)) -> C*(S*i(C))
[21] S*(C*(C*S)) -> i(C)*(S*i(C))
[22] S*(S*(C*(C*a))) -> C*(S*(i(C)*a))
[23] C*(S*C) -> S*S
[24] C*(C*(S*S)) -> i(C)*(S*C)
[25] i(b)*(b*a) -> a
[26] i(i(a)) -> a
[27] S*(C*(C*(S*a))) -> i(C)*(S*(i(C)*a))
[28] C*(S*(C*a)) -> S*(S*a)
[29] C*(C*(S*(S*a))) -> i(C)*(S*(C*a))
[30] S*(i(C)*(S*i(C))) -> C*(S*(i(C)*S))
[31] i(C)*(S*(i(C)*S)) -> S*(i(C)*(S*C))
[32] C*(S*(i(C)*(S*C))) -> S*(i(C)*S)
[33] i(a)*a -> 1
[34] S*(i(C)*(S*(i(C)*a))) -> C*(S*(i(C)*(S*a)))
[35] i(C)*(S*(i(C)*(S*a))) -> S*(i(C)*(S*(C*a)))
[36] C*(S*(i(C)*(S*(C*a)))) -> S*(i(C)*(S*a))
[37] i(b*a) -> i(a)*i(b)
[38] S*(i(C)*(S*(C*(C*a)))) -> C*(C*(S*(i(C)*(S*a))))
[39] C*(C*(S*(i(C)*S))) -> S*(i(C)*(S*(C*C)))

The rewrite system has the same expressive power as the theory it originates from, but it has the advantage of decidability: semantic equality of terms can be verified using syntactic equality of normal forms, which are unique and guaranteed to exist.

Rewriting


The time has come to take the sequence of rotations mentioned in the question and try rewriting it. Taking into account that our composition operator reverses the order of operation, the sequence forward, left, left, forward, right, backward, right should be represented as R*B*R*F*L*L*F. We then substitute the transformations in this expression by their representation in terms of generators, and rewrite this term:
S*S*C*S*(S*S*C)*(S*S*C*S)*(C*C*C*S)*(C*S*S)*(C*S*S)*(C*C*C*S)
  [3]  -> S*S*C*S*(S*S*C)*(S*S*C*S)*(C*C*C*S)*(C*S*S)*(C*S*S*(C*C*C*S))
  [3]  -> S*S*C*S*(S*S*C)*(S*S*C*S)*(C*C*C*S)*(C*S*S*(C*S*S*(C*C*C*S)))
  [3]  -> S*S*C*S*(S*S*C)*(S*S*C*S)*(C*C*C*S*(C*S*S*(C*S*S*(C*C*C*S))))
  [3]  -> S*S*C*S*(S*S*C)*(S*S*C*S*(C*C*C*S*(C*S*S*(C*S*S*(C*C*C*S)))))
  [3]  -> S*S*C*S*(S*S*C*(S*S*C*S*(C*C*C*S*(C*S*S*(C*S*S*(C*C*C*S))))))
  [3]  -> S*S*C*(S*(S*S*C*(S*S*C*S*(C*C*C*S*(C*S*S*(C*S*S*(C*C*C*S)))))))
  [3]  -> S*S*(C*(S*(S*S*C*(S*S*C*S*(C*C*C*S*(C*S*S*(C*S*S*(C*C*C*S))))))))
  [3]  -> S*(S*(C*(S*(S*S*C*(S*S*C*S*(C*C*C*S*(C*S*S*(C*S*S*(C*C*C*S)))))))))
  [10] -> S*(i(C)*(S*S*C*(S*S*C*S*(C*C*C*S*(C*S*S*(C*S*S*(C*C*C*S)))))))
  [3]  -> S*(i(C)*(S*S*(C*(S*S*C*S*(C*C*C*S*(C*S*S*(C*S*S*(C*C*C*S))))))))
  [3]  -> S*(i(C)*(S*(S*(C*(S*S*C*S*(C*C*C*S*(C*S*S*(C*S*S*(C*C*C*S)))))))))
  [18] -> S*(S*(C*(C*(S*S*C*S*(C*C*C*S*(C*S*S*(C*S*S*(C*C*C*S))))))))
  [23] -> C*(S*(i(C)*(S*S*C*S*(C*C*C*S*(C*S*S*(C*S*S*(C*C*C*S)))))))
  [3]  -> C*(S*(i(C)*(S*S*C*(S*(C*C*C*S*(C*S*S*(C*S*S*(C*C*C*S))))))))
  [3]  -> C*(S*(i(C)*(S*S*(C*(S*(C*C*C*S*(C*S*S*(C*S*S*(C*C*C*S)))))))))
  [3]  -> C*(S*(i(C)*(S*(S*(C*(S*(C*C*C*S*(C*S*S*(C*S*S*(C*C*C*S))))))))))
  [18] -> C*(S*(S*(C*(C*(S*(C*C*C*S*(C*S*S*(C*S*S*(C*C*C*S)))))))))
  [22] -> C*(C*(S*(i(C)*(S*(C*C*C*S*(C*S*S*(C*S*S*(C*C*C*S))))))))
  [3]  -> C*(C*(S*(i(C)*(S*(C*C*C*(S*(C*S*S*(C*S*S*(C*C*C*S)))))))))
  [3]  -> C*(C*(S*(i(C)*(S*(C*C*(C*(S*(C*S*S*(C*S*S*(C*C*C*S))))))))))
  [3]  -> C*(C*(S*(i(C)*(S*(C*(C*(C*(S*(C*S*S*(C*S*S*(C*C*C*S)))))))))))
  [36] -> C*(S*(i(C)*(S*(C*(C*(S*(C*S*S*(C*S*S*(C*C*C*S)))))))))
  [36] -> S*(i(C)*(S*(C*(S*(C*S*S*(C*S*S*(C*C*C*S)))))))
  [10] -> S*(i(C)*(i(C)*(C*S*S*(C*S*S*(C*C*C*S)))))
  [15] -> S*(C*(C*(C*S*S*(C*S*S*(C*C*C*S)))))
  [3]  -> S*(C*(C*(C*S*(S*(C*S*S*(C*C*C*S))))))
  [3]  -> S*(C*(C*(C*(S*(S*(C*S*S*(C*C*C*S)))))))
  [11] -> S*(i(C)*(S*(S*(C*S*S*(C*C*C*S)))))
  [18] -> S*(S*(C*(C*S*S*(C*C*C*S))))
  [3]  -> S*(S*(C*(C*S*(S*(C*C*C*S)))))
  [3]  -> S*(S*(C*(C*(S*(S*(C*C*C*S))))))
  [22] -> C*(S*(i(C)*(S*(S*(C*C*C*S)))))
  [18] -> C*(S*(S*(C*(C*C*C*S))))
  [3]  -> C*(S*(S*(C*(C*C*(C*S)))))
  [3]  -> C*(S*(S*(C*(C*(C*(C*S))))))
  [22] -> C*(C*(S*(i(C)*(C*(C*S)))))
  [25] -> C*(C*(S*(C*S)))
  [28] -> C*(S*(S*S))
  [8]  -> C*1
  [1]  -> C

So, the transformation mentioned in the question is equal to a single rotation around the Z-axis, clockwise. Consequently, the cube still rests on the face with the black dot: the dot remains at the bottom, albeit rotated by 90°, which is irrelevant for answering the question.
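The outcome can also be double-checked independently of the rewrite system by composing the rolls as permutations of the label string. The sketch below (mine, using the generator expressions given earlier) confirms that the composed transformation coincides with a single clockwise rotation C.

// Illustrative check: compose the question's rolls, expressed in the generators C and S,
// and compare the result with a single clockwise rotation C.
public class CubeCheck {

    static String c(String t) { return "" + t.charAt(3) + t.charAt(0) + t.charAt(1) + t.charAt(2); } // C: 1234 -> 4123
    static String s(String t) { return "" + t.charAt(1) + t.charAt(2) + t.charAt(0) + t.charAt(3); } // S: 1234 -> 2314

    // generator expressions from above; the rightmost factor is applied first
    static String r(String t) { return s(s(c(s(t)))); } // R = S*S*C*S
    static String l(String t) { return c(s(s(t))); }    // L = C*S*S
    static String b(String t) { return s(s(c(t))); }    // B = S*S*C
    static String f(String t) { return c(c(c(s(t)))); } // F = C*C*C*S

    public static void main(String[] args) {
        String start = "1234";
        // forward, left, left, forward, right, backward, right, i.e. R*B*R*F*L*L*F
        String result = r(b(r(f(l(l(f(start)))))));
        System.out.println(result + " equals C(" + start + "): " + result.equals(c(start)));
    }
}

The program prints 4123 equals C(1234): true, in agreement with the normal form found above.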

A Note on Expressivity


Admittedly, the reduction is quite verbose. This is obviously caused by the fact that we limit our alphabet to the two generators only. A similar exercise where all the 90° rotations and their inverses are added to the alphabet, and the theory is extended with the axioms reflecting their relationships, works equally well, with the desired brevity. However, the rewrite system generated by the Knuth-Bendix completion procedure then consists of 92 rewrite rules, as there are a lot more relationships to capture.

We will not present the completed rewrite system over the extended alphabet here, but we do provide the reduction of the term using this extended alphabet:
R*B*R*F*L*L*F
  [3]  -> R*B*R*F*L*(L*F)
  [3]  -> R*B*R*F*(L*(L*F))
  [3]  -> R*B*R*(F*(L*(L*F)))
  [3]  -> R*B*(R*(F*(L*(L*F))))
  [3]  -> R*(B*(R*(F*(L*(L*F)))))
  [33] -> B*(C*(R*(F*(L*(L*F)))))
  [32] -> B*(B*(C*(F*(L*(L*F)))))
  [58] -> A*(B*(B*(F*(L*(L*F)))))
  [55] -> A*(B*(L*(L*F)))
  [49] -> B*(L*F)
  [21] -> B*(C*L)
  [56] -> C

Conclusion


We are going to Mars!

The group of rotations of a cube nicely passes Knuth-Bendix Completion and the resulting rewrite system solves the Friday night brain teaser, mechanically, deterministically, in the blink of an eye.

We cannot be sure that our theory is true, but it is correct, and, more importantly, it solves the issue.

Friday, August 25, 2017

A Not-so-Random Walk Through Hyperspace

Reporting on a first experiment with neural network technology (DL4J), applying feed forward neural networks (FFNN) for supervised learning of file types based on content fragments, exploring the hyperparameter space of domain and tooling.

Introduction


Artificial intelligence based on neural networks has a long history. However, up to a couple of years ago, success stories were few and their impact was limited.

Over the last few years, neural networks have demonstrated some impressive results, catching up with the long standing promise of performance at the level of human experts in different domains, and even taking us beyond. Interesting presentations in this context include Jeremy Howard's talk on the potential of deep learning and Demis Hassabis' talk on DeepMind and AlphaGo.

Due to these recent successes, neural network technology has gained a renewed interest. Today, different generic neural network platforms are available in the public domain, and a large community of enthusiasts is looking into the technology and its applications.

Time to join the community ...

Finding a Problem


With the solution at hand, we still have to find a problem. The data should be real-life, easily available in large quantities, and categorised.

The problem we will be addressing uses the file system: the goal is to recognise the file type based on (a fragment of) the file content. We have a large number of files on the file system at hand, created by many different authors, with the bulk of them labelled correctly by their file extension.

The data set will consist of all files under a given root directory, using the following constraints:
  • Text files only, determined heuristically based on the content of the file. We will be using the function proposed by Ondra Žižka on stackoverflow for this purpose.
  • Excluding files with extension "txt", as these have strongly varied content.
  • Having at least 100 occurrences for a file type, to allow patterns to be detected.

This results in a collection of 50,665 files with 12 different extensions, distributed as follows:


Note that the number of xml-files is 41,103, but the bar is clipped in favour of visibility of statistics for the other file types.

Performance Evaluation


The first parameter that comes to mind for evaluating the performance of a learning system is the accuracy of its output, defined as the ratio of the number of correct classifications over the total number of classifications. As explained in the Deep Learning and Neural Network Glossary and Jason Brownlee's post on Performance Measures, we can improve on the evaluation of performance of a neural network.

The results for a specific class during the test phase can be divided into:
  • True positives (TP): sample correctly accepted.
  • False positives (FP): sample incorrectly accepted.
  • True negatives (TN): sample correctly rejected.
  • False negatives (FN): sample incorrectly rejected.

Given these definitions, we can express the accuracy and introduce additional metrics of performance:
  • Accuracy: A = (TP+TN)/(TP+FP+TN+FN), how often is a classification correct?
  • Precision: P = TP/(TP+FP), how often is the acceptance of a sample correct?
  • Recall: R = TP/(TP+FN), how often is a sample in the class correctly classified?
  • F1 Score: F1 = 2*P*R/(P+R), the harmonic mean of precision and recall.

We will track all four of these metrics, but we will use F1 as a metric for overall performance.
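As a small illustration (the counts below are hypothetical, not taken from the experiment), the four metrics for a single class are computed from these counts as follows:

// Hypothetical counts for one class, and the four performance metrics derived from them.
public class Metrics {
    public static void main(String[] args) {
        double tp = 43, fp = 7, tn = 4980, fn = 36;

        double accuracy  = (tp + tn) / (tp + fp + tn + fn);
        double precision = tp / (tp + fp);
        double recall    = tp / (tp + fn);
        double f1        = 2 * precision * recall / (precision + recall);

        System.out.printf("A=%.4f P=%.4f R=%.4f F1=%.4f%n", accuracy, precision, recall, f1);
    }
}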

Selecting a Neural Network Topology


We will be using the DL4J system, because it has a low barrier to entry and is feature-rich.

An important observation to determine the network topology is related to the type of problem we wish to solve. The Introduction to Deep Learning on DL4J briefly explains:
  • Classification as a learning process where the category is provided as part of the input, also referred to as supervised learning.
  • Clustering as a learning process where the category is not provided as part of the input, but should be discovered by the learning process, also referred to as unsupervised learning.
  • Predictive Analytics as a variant of either classification or clustering, where one of the elements in the sample data reflects the time dimension.

The problem we have chosen to solve clearly is a classification problem.

For this first attempt to get neural networks operational, we will focus on feed forward neural networks (FFNN) with multiple layers. See Feedforward Neural Network on Wikipedia for a general introduction. We borrow the illustration from the DL4J Introduction to Neural Networks:


For the FFNN, we will consider as hyperparameters the number of layers, the organisation of nodes into these layers, the learning speed and some more parameters explained below.

For the chosen problem domain, the parameters include the sample size or the number of characters taken from the file, and the minimum frequency for a file type to be taken into scope.

Jumping into the Unknown


We will take the class CSVExample in the DL4J sample code base as a starting point for training and testing an FFNN. Without having anything else to go on, the example will also help us with the choices of hyperparameter values.
new NeuralNetConfiguration.Builder()
    .seed(1L)
    .iterations(1000)
    .activation(Activation.TANH)
    .weightInit(WeightInit.XAVIER)
    .learningRate(0.1)
    .regularization(true).l2(1e-4)
    .list()
    .layer(0, new DenseLayer.Builder().nIn(64).nOut(12).build())
    .layer(1, new OutputLayer
        .Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
        .activation(Activation.SOFTMAX).nIn(12).nOut(12).build())
    .backprop(true)
    .pretrain(false)
    .build();
The above code fragment builds a MultiLayerConfiguration providing all the necessary parameters for the network to function. We refer to the full example for more information on how to provide the sample data to the network and run the training and testing processes.

Let's have a look at the parameters that can be provided, and why we've chosen particular values:
  • Seed: The seed is used to initialise weights. We are using a hard-coded seed value to be able to reproduce results.
  • Number of Iterations: An iteration is a cycle in which the weights in the network are adjusted based on the errors obtained using the training data set. We will use 1,000 iterations for our first attempt. No clue whatsoever about convergence behaviour.
  • Activation Function: The activation function determines what output a node will generate based upon its input. We use the hyperbolic tangent function because it ranges from -1 to 1, using the full spectrum (as opposed to the once popular sigmoid function). Also, the function is continuously differentiable (as opposed to the more recent ReLu function), which helps in supporting backpropagation by gradient descent, as mentioned below. See Activation Functions on Wikipedia for more information on activation functions.
  • Weight Initialisation: We are using Xavier for the initial weights in the neural network. This keeps the weights, and consequently the signals that flow through the network, at a reasonable range throughout the layers of the network. See Andy's post on Xavier initialisation for more information.
  • Learning Rate: The learning rate is a measure of how quickly a network abandons old beliefs for new ones, as explained by Conner Davis on Quora. We will start off with 0.1, as most of the examples do.
  • Regularisation: Regularisation artificially discourages complex or extreme explanations of the world even if they fit what has been observed better, as explained by Sean Owen on Quora. It is used as a measure against overfitting, where a statistical model describes random error or noise instead of the underlying relationships. See Stanford University's course on Convolutional Neural Networks for Visual Recognition for more in-depth coverage in the context of neural networks. We will activate regularisation for obvious reasons. L2 regularisation is most commonly used.
  • Layer 0: The input layer of the network. The number of inputs, and consequently the number of nodes, equals the sample size, 64 in our case. The number of outputs has to equal the number of inputs of the output layer. The input layer is never counted as a layer. Consequently, a 1-layer network has an input layer and an output layer, no hidden layers. See also Stanford University's course on Convolutional Neural Networks for Visual Recognition.
  • Layer 1: The output layer of the network. As we are doing classification, the number of nodes equals the number of categories, 12. The loss function is used for backpropagation, to quantify the error. The negative log-likelihood function is also referred to as cross-entropy cost. See the section on Loss Functions in Stanford University's course on Convolutional Neural Networks for Visual Recognition. You may also want to have a look at the post on Cost Functions used in Neural Networks on Cross Validated. We are using the softmax activation function to make sure the output values can be interpreted as probabilities, one for each category. See also the Softmax function on Wikipedia.
  • Backpropagation: Backpropagation is the process where the weights are updated based on correct and incorrect classifications during the learning process. Some parameters provided before are actually related to backpropagation. See Chapter 2 of Michael Nielsen's book Neural Networks and Deep Learning for a theoretical description and Matt Mazur's post on Step by Step Backpropagation for an exhaustive example. DL4J refers to Stanford's wiki on the Backpropagation Algorithm.
  • Pretrain: Pre-training means using the weights of a neural network that was already trained on a different (but similar) data set to initialise the neural network, as explained in the post on pretraining a neural network on Cross Validated. We will disable pretraining.

Falling on One's Feet


As stated before, our data set consists of 50,665 files. We will be using 90% (45,599 files) for training, and 10% (5,066 files) for testing. To guarantee the independence of our test set, we will never use any of the test files for training.

The first attempt, after 1,000 iterations, gives an accuracy of 99.07%, which is quite impressive. We were lucky with our initial choices for the hyperparameter values. However, other metrics are far less impressive, with a precision of 86.80%, a recall of 70.13% and an F1 score of 77.22%.

The evolution of these scores in function of the number of iterations is quite instructive:


While the accuracy improves dramatically, reaching over 90% after 15 iterations, and over 99% after 559 iterations, the other metrics evolve much more slowly. The chart however suggests increasing the number of iterations may be useful, as the F1 score still seems to be increasing at 1,000 iterations.

We may add that, on an Intel® Core™ i7-6700HQ CPU @ 2.60GHz × 8, the training process took just below 2 minutes, including data load, model initialisation, and an evaluation after each iteration. Removing the evaluation finishes the learning process in 77 seconds.

Another interesting instrument, providing an alternative view on the performance of the neural network, is the confusion matrix. The confusion matrix shows the actual categories in the rows, and the number of classifications by the network per category in the columns.

After our first run, the confusion matrix is

            csv  project  java    xml  html  groovy  xsd  sh  properties
csv           3        0     0      0     0       0    0   0           1
project       0        0     0     10     0       0    0   0           0
java          1        0    32     10     0       0    1   0           3
xml           2        0     0  4,603     0       0    0   0           1
html          0        0     0      0    43       0    0   0           0
groovy        0        0     0      0     0       1    0   0           8
xsd           0        0     0     16     0       0   32   0           0
sh            0        0     0      0     0       0    0   1           0
properties    1        0     0      3     0       0    0   0          16


We observe an important mix-up with all project files and many xsd files being classified as xml files, and essentially, they are. We also see that almost all groovy files are classified as properties files, most probably because they start with some assignment statements.

A final interesting observation is that the evolution of precision and recall, and consequently the F1 score, is not smooth but rather bumpy. We can see for example a sudden decrease in precision after iteration 161, from 56.68% to 47.38%, followed by a sudden increase after iteration 194, both in precision, from 48.92% to 65.59%, and in recall, from 31.25% to 42.36%.

Analysis of the confusion matrix before and after shows that the decrease in precision is due to a change in classification of all 43 html files in the test set, from category xml to project. Both classifications are incorrect. However, as the total number of xml files is very high compared to other categories, the relative expression of the error significantly decreases the precision. A later change in classification which causes the same 43 html files to be correctly classified gives a significant increase in both precision and recall.

This observation also relates to the impressively fast increase in accuracy demonstrated in this example. As the distribution of file types is extremely uneven, with 4,606 out of 5,066 files being xml files, learning to classify xml files correctly will get the accuracy above 90%. This also explains the relative smoothness of the accuracy curve over iterations in comparison with other examples of learning processes based on neural networks, as the accuracy of xml classification is very dominant.

Getting Oriented


With the term hyperspace, we refer to the space generated by all hyperparameters. In the previous topic, we chose a first point in hyperspace to start off with, based on a little intuition and, no doubt, a lot of ignorance.

Now that we have an anchor point, we will explore hyperspace in some more detail. Obviously, we risk getting caught in a combinatorial explosion of possibilities. In order to get oriented, we will scan hyperspace selectively, by choosing a discrete number of values around our starting point for each dimension in hyperspace, effectively scanning a multi-dimensional cross in hyperspace.

However, the first question that comes to mind is where we would end up by running more iterations in the learning process.

Longer Learning


Well, here it is, with the same seed value of 1L, after 5,000 iterations, we get accuracy 99.76%, precision 78.33%, recall 96.94% and F1 score 89.84%.


Although the accuracy still improves with the number of iterations, there is no consistent convergence for other metrics. Running the network with different initialisations (read different random seeds) confirms this: there is no consistent convergence of all performance metrics towards an optimum with this configuration.

Bigger Samples


Common sense suggests one of the more important parameters is the sample size, the number of bytes taken from the file content to be used for classification. To investigate we run the train and test cycle using sample sizes ranging from 32 to 128 bytes, in steps of 16. The difference between these variants is illustrated by their performance graph:


Clearly, the size of the sample is an important factor: the larger the sample, the better all performance metrics become, as expected. For the best run, obtained with a sample size of 128, we get accuracy 99.87%, precision 93.47%, recall 98.08% and F1 score 94.70%.

The tests have been executed with different seed values for the initialisation of weights. Because the volatility of the results of individual tests is high, depending on the seed value, 5 test runs have been scheduled for each parameter value, where the results with the lowest and highest F1 score have been discarded. The chart displays the average performance metrics of the 3 remaining results. We will use this technique for the remainder of this post.
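A minimal sketch of this averaging scheme (the F1 values below are made up for the sake of the example):

import java.util.Arrays;

// Run the experiment 5 times, drop the runs with the lowest and highest F1 score,
// and report the average of the 3 remaining results.
public class TrimmedAverage {
    public static void main(String[] args) {
        double[] f1Scores = { 0.772, 0.801, 0.765, 0.934, 0.812 }; // hypothetical F1 score per run
        double[] sorted = f1Scores.clone();
        Arrays.sort(sorted);
        double sum = 0;
        for (int i = 1; i < sorted.length - 1; i++) sum += sorted[i]; // skip the minimum and the maximum
        System.out.println("trimmed mean F1 = " + sum / (sorted.length - 2));
    }
}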

As a side experiment, and inspired by Bruno Stecanella's post on Naive Bayes Classifiers on MonkeyLearn, we did the same test using Naive Bayes classification. The test showed an accuracy of approximately 92%, with an optimum sample size around 64.

For the remainder of the post, we will continue to work with a sample size of 64, mainly because in multi-layer configurations, the number of weights quickly grows with the sample size, as illustrated below.

More Samples


In order to observe patterns, sufficient data should be available. It is therefore important to make sure that, for each file type, a significant number of files is available in the training set. Up to now, we have been working with a minimum frequency of 100, which gave fairly good results.

Running the FFNN with different data sets, where the minimum frequency is 25, 50 and 100, respectively, confirms our expectation:


For the remainder of the post, we will continue to work with a minimum frequency of 100, as before.

Multi-Layer Topology


So far, all our tests have been executed using a 1-layer FFNN. As we are working with deep learning technology, we should look at the impact of working with multiple layers of nodes as well.

In designing a multi-layer configuration, we have to decide on the number of nodes for each layer, which is again a jump into the unknown. The constraints available are:
  • The number of nodes in the input layer is equal to the sample size.
  • The number of nodes in the output layer is equal to the number of categories.
  • The total number of weights should be far below the number of samples. We cannot tune a number of weights that is larger than the number of samples. Our intuition says that we should have sufficient margin here.
  • Some experts state that the number of nodes should not exceed the number of inputs. We will keep the number of nodes between the number of inputs and the number of outputs for each individual layer.

Given these constraints, we will run a test with 3 different configurations:
  • Min: The number of nodes on hidden layers is the minimum of the number of nodes on the input and output layers, 12 in our case.
  • Linear: The number of nodes on hidden layers evolves linearly from the number of nodes on the input layer to the number of nodes on the output layer, from 64 to 12 in our case.
  • Max: The number of nodes on hidden layers is the maximum of the number of nodes on the input and output layers, 64 in our case.

Testing these distributions on a 3-layer configuration results in the following performance metrics:


Although the sample is small and has poor coverage, we will continue with the linear node distribution model. According to this model, the number of nodes N(i) for layer i is given by:
N(i) = #B - i * (#B - #C) / #L
Here, #B is the number of bytes in the sample, or the sample size, #C is the number of categories, and #L is the number of layers. The input layer gets label 0, hidden layers are labelled 1 to #L-1, and the output layer gets label #L.
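A short sketch makes the formula concrete. Rounding to the nearest integer is an assumption on my part, but it reproduces the table below.

// Compute the linear node distribution N(i) = #B - i * (#B - #C) / #L and the
// resulting number of weights (sum of products of consecutive layer sizes).
public class NodeDistribution {
    public static void main(String[] args) {
        int bytes = 64, categories = 12;                 // #B and #C
        for (int layers = 1; layers <= 10; layers++) {   // #L
            int[] nodes = new int[layers + 1];
            for (int i = 0; i <= layers; i++)
                nodes[i] = (int) Math.round(bytes - i * (bytes - categories) / (double) layers);
            long weights = 0;
            for (int i = 0; i < layers; i++) weights += (long) nodes[i] * nodes[i + 1];
            System.out.println(layers + " layers: " + java.util.Arrays.toString(nodes) + " -> " + weights + " weights");
        }
    }
}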

With a sample size of 64 and 12 categories, the following table provides an overview of the number of nodes for each layer, and consequently, the number of weights to be tuned in the network, for up to 10 layers:

# Layers    L0  L1  L2  L3  L4  L5  L6  L7  L8  L9  L10    # Weights
       1    64  12                                                768
       2    64  38  12                                          2,888
       3    64  47  29  12                                      4,719
       4    64  51  38  25  12                                  6,452
       5    64  54  43  33  22  12                              8,187
       6    64  55  47  38  29  21  12                          9,854
       7    64  57  49  42  34  27  19  12                     11,586
       8    64  58  51  45  38  32  25  19  12                 13,394
       9    64  58  52  47  41  35  29  24  18  12             14,893
      10    64  59  54  48  43  38  33  28  22  17  12         16,624

Clearly, the number of weights quickly increases with the number of layers and we need to be cautious not to configure too many nodes, in particular as we only have 45,599 samples in the training set.

Configuring the FFNN with more layers, from 1 to 12, and using linear node distribution, we get the following results:


Apparently, the configuration reaches an optimum with 6 layers, where all performance metrics are at 100%. Adding more layers to the FFNN improves the performance, at least up to 6 layers, and this configuration demonstrates consistent convergence on all performance metrics.

Conclusion


Neural network technology recently has matured to the extent that it has become widely applicable. This post reports on a first experiment with the technology (for the author) and demonstrates that the DL4J system offers a framework that is easily accessible to a mainstream Java audience.

Applying neural network technology to a non-trivial classification problem generates results almost instantly. The exercise is interesting as it demystifies the subject of neural networks and encourages further experimentation.

Working on a more complex data set, and using more powerful hardware, would allow us to examine the impact of changes to other hyperparameters, address other types of machine learning problems and experiment with other types of neural network topologies.

Saturday, July 22, 2017

Knuth-Bendix Completion on Non-Abelian Group Theory

Applying Knuth-Bendix Completion to the theory of non-abelian groups to come to a rewrite system that reflects decidability of the theory it implements, where semantic equality is equivalent to syntactic equality of normal forms.

Introduction


It turns out that the way we were taught group theory in the lower grades of high school was not optimal, at least not from a computational point of view. Simple statements in group theory could be rather tricky to prove, while some theories are decidable and consequently problems formulated in these theories can be solved mechanically.

In this post, we will use a procedure referred to as Knuth-Bendix completion to transform the traditional axioms for non-abelian group theory into a deterministic rewrite system with the same expressive power. The different steps of the procedure are enumerated exhaustively. Intermediate theories will introduce new axioms and make older ones obsolete.

Being exhaustive makes the post somewhat verbose. In fact, the text has largely been generated as a side effect of the automation of the Knuth-Bendix completion procedure. Skip the sections you feel you already get. Use them for reference if you are interested in the details.

The whole Knuth-Bendix completion procedure on non-abelian group theory is presented graphically using the graphviz dot language. Go for the graphics if you want to scan through quickly.

Non-Abelian Groups


A non-abelian group is defined as a collection of elements, a binary operation on these elements and four axioms: closure, identity, invertibility and associativity. We will consider the collection of all well-formed terms over an alphabet using a typical algebraic term syntax, meaning closure is implicit.

The theory of non-abelian groups can then be formulated using the axioms:
[1] a*1 = a
[2] a*i(a) = 1
[3] a*b*c = a*(b*c)
Here, the binary operation is represented by an asterisk (*), and any letter that is not used as a functor denotes a variable.

Briefly explained:
  • Axiom [1] states that there is an element 1 which is the right-side identity for the operation.
  • Axiom [2] states that the unary functor i denotes the right-side inverse element of its argument.
  • Axiom [3] expresses that the binary operation is associative. The operation is assumed to associate to the left, but that is mere notational convention.

A Simple Exercise


Semantic equality in non-abelian groups is decidable, meaning that there is an effective procedure to determine semantic equality within finite space and time. However, using the axioms provided above, decidability may not be apparent.

Let's illustrate the problem with a simple exercise, proving that the right-side inverse is also the left-side inverse, a property known to be valid for non-abelian groups. In other words, the exercise is to prove, for all a, the following equation:
i(a)*a = 1
Before reading on, we urge you to give it a try, to appreciate how tricky the exercise is, given the axioms provided above.

The proof is given using the following derivation:
i(a)*a
  [1]= i(a)*a*1
  [2]= i(a)*a*(i(a)*i(i(a)))
  [3]= i(a)*a*i(a)*i(i(a))
  [3]= i(a)*(a*i(a))*i(i(a))
  [2]= i(a)*1*i(i(a))
  [1]= i(a)*i(i(a))
  [2]= 1
As one can see, the proof requires making the term i(a)*a much more complex before reducing the complexity to obtain the desired result. Proving equality using the axioms above is semi-decidable. If you cannot find the proof, there is no way of knowing whether equality does not hold or whether you did not try hard enough.

Rewrite Systems


So far, we have been using the term theory as referring to a collection of axioms, and the term axiom as referring to the equality of two terms, which is symmetric. In what follows, we will be using the term rewrite rule to refer to a directed axiom, which can be used only from left to right. We will refer to a collection of rewrite rules as a rewrite system.

A rewrite system may suffer the same problem regarding decidability as a theory does. However, under the right conditions a rewrite system may reflect the decidability of the theory it implements.

Rewriting a term using a rewrite system is referred to as reduction. If a term has a unique simplest, reduced form, this is called the normal form of the term. If all terms have a normal form, the rewrite system is called confluent.

This post is about transforming the theory of non-abelian groups described above to a rewrite system that implements the semantics of that theory and that is decidable.

Knuth-Bendix Completion


The Knuth-Bendix completion procedure starts off by restricting the use of the axioms to a single direction, from complex to simple. Typically, syntactic complexity is used as a metric.

Obviously, the rewrite system consisting of rewrite rules (directed axioms) does not implement the full semantics of the theory we started off with. One can prove equality for many more terms with the theory than with the rewrite system obtained out of it by directing axioms.

The Knuth-Bendix completion procedure will therefore try adding new axioms to the rewrite system in order to restore the expressive power. It will do so by looking for occurrences of the left-hand side of a rewrite rule in the left-hand side of another rewrite rule, using unification. When applying the substitution obtained by the unification to the left-hand side of the latter rewrite rule, we get a term that can be reduced (rewritten) in two distinct ways, once using each of the two rewrite rules involved.

If these reductions result in syntactically different terms, a so-called critical pair is found. A new rewrite rule is added to the rewrite system that consists of these terms, again directed from complex to simple. While adding new rewrite rules to the rewrite system, existing rewrite rules potentially can be reduced, and even become trivial.

If, by repeating this process of adding new rewrite rules and reducing existing ones, we eventually reach a fixed-point, where no more new rewrite rules can be added, the resulting rewrite system will have the same expressive power as the original theory consisting of non-directed axioms. We may also conclude that the theory is decidable, and the rewrite system provides a mechanism that realizes that decidability. Using the rewrite system, semantic equality of terms is equivalent to syntactic equality of the normal forms of these terms.

If on the other hand no such fixed-point is reached, no conclusion can be made on the decidability of the theory. In fact, Knuth-Bendix completion itself is semi-decidable.

From Theory to Rewrite System


The axioms of the theory of non-abelian groups are directed to obtain a rewrite system with directed rewrite rules. The criterion used for directing an axiom is based on the syntactic complexity of the terms in the axiom, using the more complex term as the left-hand side and the simpler term as the right-hand side, such that rewrite rules always "simplify" a term.

The result is the initial rewrite system:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)

Graphical Representation


Using the graphviz dot language, this rewrite system can be represented as:


In what follows, we will illustrate each iteration in the Knuth-Bendix procedure with such a graphical representation, illustrating new rewrite rules, reduced rewrite rules and rewrite rules that have become trivial. We will use graphical conventions as presented in this legend:


Iteration 1


The first iteration of the procedure can be represented graphically as:


Two new rewrite rules can be derived. The critical term for the derivation of each new rewrite rule is included in the box representing the new rewrite rule.

In what follows, we will document each of the critical terms found and how they can be used to derive a new rewrite rule.

Iteration 1 - Critical Pairs


Combining rewrite rules [1] and [3] the term a*1*b can be rewritten as in
a*1*b
  [1]-> a*b
or as in
a*1*b
  [3]-> a*(1*b)
Therefore, we create a new rewrite rule:
[4] a*(1*b) -> a*b
Combining rewrite rules [2] and [3] the term a*i(a)*b can be rewritten as in
a*i(a)*b
  [2]-> 1*b
or as in
a*i(a)*b
  [3]-> a*(i(a)*b)
Therefore, we create a new rewrite rule:
[5] a*(i(a)*b) -> 1*b
The result is rewrite system 1:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)
[4] a*(1*b) -> a*b
[5] a*(i(a)*b) -> 1*b

Iteration 1 - Simplification


The rewrite system cannot be simplified.

Iteration 2


The second iteration of the procedure can be represented graphically as:


Iteration 2 - Critical Pairs


Combining rewrite rules [2] and [4] the term a*(1*i(1)) can be rewritten as in
a*(1*i(1))
  [2]-> a*1
  [1]-> a
or as in
a*(1*i(1))
  [4]-> a*i(1)
Therefore, we create a new rewrite rule:
[6] a*i(1) -> a
Combining rewrite rules [2] and [5] the term a*(i(a)*i(i(a))) can be rewritten as in
a*(i(a)*i(i(a)))
  [2]-> a*1
  [1]-> a
or as in
a*(i(a)*i(i(a)))
  [5]-> 1*i(i(a))
Therefore, we create a new rewrite rule:
[7] 1*i(i(a)) -> a
Combining rewrite rules [5] and [4] the term a*(1*(i(1)*b)) can be rewritten as in
a*(1*(i(1)*b))
  [5]-> a*(1*b)
  [4]-> a*b
or as in
a*(1*(i(1)*b))
  [4]-> a*(i(1)*b)
Therefore, we create a new rewrite rule:
[8] a*(i(1)*b) -> a*b
The result is rewrite system 2:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)
[4] a*(1*b) -> a*b
[5] a*(i(a)*b) -> 1*b
[6] a*i(1) -> a
[7] 1*i(i(a)) -> a
[8] a*(i(1)*b) -> a*b

Iteration 2 - Simplification


The rewrite system cannot be simplified.

Iteration 3


The third iteration of the procedure can be represented graphically as:



During this iteration, four new rewrite rules [9], [10], [11] and [12] are added. The new rewrite rule [9] becomes trivial immediately using rewrite rule [12]. The existing rewrite rule [7] can be reduced using rewrite rule [12]. The new rewrite rule [11] can also be reduced using rewrite rule [7]. It remains marked green however, as it is a new rewrite rule.

Note that after reducing rewrite rule [7], it expresses the fact that the left identity element is also the right identity element, even in non-abelian groups, which is a nice intermediate result.

Iteration 3 - Critical Pairs


Combining rewrite rules [2] and [8] the term a*i(1)*i(i(1)) can be rewritten as in
a*(i(1)*i(i(1)))
  [2]-> a*1
  [1]-> a
or as in
a*(i(1)*i(i(1)))
  [8]-> a*i(i(1))
Therefore, we create a new rewrite rule:
[9] a*i(i(1)) -> a
Combining rewrite rules [5] and [8] the term a*i(1)*i(i(1))*b can be rewritten as in
a*(i(1)*(i(i(1))*b))
  [5]-> a*(1*b)
  [4]-> a*b
or as in
a*(i(1)*(i(i(1))*b))
  [8]-> a*(i(i(1))*b)
Therefore, we create a new rewrite rule:
[10] a*(i(i(1))*b) -> a*b
Combining rewrite rules [7] and [3] the term (1*i(i(a)))*b can be rewritten as in
1*i(i(a))*b
  [7]-> a*b
or as in
1*i(i(a))*b
  [3]-> 1*(i(i(a))*b)
Therefore, we create a new rewrite rule:
[11] 1*(i(i(a))*b) -> a*b
Combining rewrite rules [7] and [4] the term a*(1*i(i(b))) can be rewritten as in
a*(1*i(i(b)))
  [7]-> a*b
or as in
a*(1*i(i(b)))
  [4]-> a*i(i(b))
Therefore, we create a new rewrite rule:
[12] a*i(i(b)) -> a*b
The result is rewrite system 3:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)
[4] a*(1*b) -> a*b
[5] a*(i(a)*b) -> 1*b
[6] a*i(1) -> a
[7] 1*i(i(a)) -> a
[8] a*(i(1)*b) -> a*b
[9] a*i(i(1)) -> a
[10] a*(i(i(1))*b) -> a*b
[11] 1*(i(i(a))*b) -> a*b
[12] a*i(i(b)) -> a*b

Iteration 3 - Simplification


Looking at rewrite rule [7], the left-hand side can be rewritten as in
1*i(i(a))
  [12]-> 1*a
We therefore replace the rewrite rule by
[7] 1*a -> a
Looking at rewrite rule [9], the left-hand side can be rewritten as in
a*i(i(1))
  [12]-> a*1
  [1]-> a
The rewrite rule can therefore be removed.
Looking at rewrite rule [11], the left-hand side can be rewritten as in
1*(i(i(a))*b)
  [7]-> i(i(a))*b
We therefore replace the rewrite rule by
[11] i(i(a))*b -> a*b
The result is rewrite system 3bis:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)
[4] a*(1*b) -> a*b
[5] a*(i(a)*b) -> 1*b
[6] a*i(1) -> a
[7] 1*a -> a
[8] a*(i(1)*b) -> a*b
[10] a*(i(i(1))*b) -> a*b
[11] i(i(a))*b -> a*b
[12] a*i(i(b)) -> a*b

Iteration 4


The fourth iteration of the procedure can be represented graphically as:



During this iteration, three new rewrite rules [13], [14] and [15] are added. Rewrite rules [4], [10], [11], [12] and [13] become trivial. The existing rewrite rule [5] can be reduced using rewrite rule [7].

Note that rewrite rule [15] expresses the fact that the inverse of the inverse of an element is the element itself, which is a nice intermediate result.

Iteration 4 - Critical Pairs


Combining rewrite rules [2] and [5] the term a*(i(a)*i(i(a))) can be rewritten as in
a*(i(a)*i(i(a)))
  [2]-> a*1
  [1]-> a
or as in
a*(i(a)*i(i(a)))
  [5]-> 1*i(i(a))
  [7]-> i(i(a))
Therefore, we create a new rewrite rule:
[13] i(i(a)) -> a
Combining rewrite rules [11] and [5] the term i(a)*(i(i(a))*b) can be rewritten as in
i(a)*(i(i(a))*b)
  [11]-> i(a)*(a*b)
or as in
i(a)*(i(i(a))*b)
  [5]-> 1*b
  [7]-> b
Therefore, we create a new rewrite rule:
[14] i(b)*(b*a) -> a
Combining rewrite rules [12] and [5] the term a*(i(a)*i(i(b))) can be rewritten as in
a*(i(a)*i(i(b)))
  [12]-> a*(i(a)*b)
  [5]-> 1*b
  [7]-> b
or as in
a*(i(a)*i(i(b)))
  [5]-> 1*i(i(b))
  [7]-> i(i(b))
Therefore, we create a new rewrite rule:
[15] i(i(a)) -> a
The result is rewrite system 4:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)
[4] a*(1*b) -> a*b
[5] a*(i(a)*b) -> 1*b
[6] a*i(1) -> a
[7] 1*a -> a
[8] a*(i(1)*b) -> a*b
[10] a*(i(i(1))*b) -> a*b
[11] i(i(a))*b -> a*b
[12] a*i(i(b)) -> a*b
[13] i(i(a)) -> a
[14] i(b)*(b*a) -> a
[15] i(i(a)) -> a

Iteration 4 - Simplification


Looking at rewrite rule [4], the left-hand side can be rewritten as in
a*(1*b)
  [7]-> a*b
The rewrite rule can therefore be removed. Looking at rewrite rule [5], the right-hand side can be rewritten as in
1*b
  [7]-> b
We therefore replace the rewrite rule by
[5] a*(i(a)*b) -> b
Looking at rewrite rule [10], the left-hand side can be rewritten as in
a*(i(i(1))*b)
  [11]-> a*(1*b)
  [7]-> a*b
The rewrite rule can therefore be removed. Looking at rewrite rule [11], the left-hand side can be rewritten as in
i(i(a))*b
  [13]-> a*b
The rewrite rule can therefore be removed. Looking at rewrite rule [12], the left-hand side can be rewritten as in
a*i(i(b))
  [13]-> a*b
The rewrite rule can therefore be removed. Looking at rewrite rule [13], the left-hand side can be rewritten as in
i(i(a))
  [15]-> a
The rewrite rule can therefore be removed. The result is rewrite system 4bis:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)
[5] a*(i(a)*b) -> b
[6] a*i(1) -> a
[7] 1*a -> a
[8] a*(i(1)*b) -> a*b
[14] i(b)*(b*a) -> a
[15] i(i(a)) -> a


Iteration 5


The fifth iteration of the procedure can be represented graphically as:


During this iteration, eight new rewrite rules [16], [17], [18], [19], [20], [21], [22] and [23] are added. Rewrite rules [6], [8], [16], [19], [20], [21] and [22] become trivial.

Note that rewrite rule [18] states that the identity element is its own inverse. Also note that rewrite rule [23] expresses the fact that the right symmetric element is equally the left symmetric element, even in non-abelian groups. These are nice intermediate results.

Iteration 5 - Critical Pairs


Combining rewrite rules [1] and [14] the term i(a)*(a*1) can be rewritten as in
i(a)*(a*1)
  [1]-> i(a)*a
or as in
i(a)*(a*1)
  [14]-> 1
Therefore, we create a new rewrite rule:
[16] i(a)*a -> 1
Combining rewrite rules [3] and [14] the term i(a*b)*(a*b*c) can be rewritten as in
i(a*b)*(a*b*c)
  [3]-> i(a*b)*(a*(b*c))
or as in
i(a*b)*(a*b*c)
  [14]-> c
Therefore, we create a new rewrite rule:
[17] i(b*c)*(b*(c*a)) -> a
Combining rewrite rules [6] and [5] the term a*(i(a)*i(1)) can be rewritten as in
a*(i(a)*i(1))
  [6]-> a*i(a)
  [2]-> 1
or as in
a*(i(a)*i(1))
  [5]-> i(1)
Therefore, we create a new rewrite rule:
[18] i(1) -> 1
Combining rewrite rules [6] and [14] the term i(a)*(a*i(1)) can be rewritten as in
i(a)*(a*i(1))
  [6]-> i(a)*a
or as in
i(a)*(a*i(1))
  [14]-> i(1)
Therefore, we create a new rewrite rule:
[19] i(a)*a -> i(1)
Combining rewrite rules [7] and [14] the term i(1)*(1*a) can be rewritten as in
i(1)*(1*a)
  [7]-> i(1)*a
or as in
i(1)*(1*a)
  [14]-> a
Therefore, we create a new rewrite rule:
[20] i(1)*a -> a
Combining rewrite rules [8] and [5] the term a*(i(a)*(i(1)*b)) can be rewritten as in
a*(i(a)*(i(1)*b))
  [8]-> a*(i(a)*b)
  [5]-> b
or as in
a*(i(a)*(i(1)*b))
  [5]-> i(1)*b
Therefore, we create a new rewrite rule:
[21] i(1)*a -> a
Combining rewrite rules [8] and [14] the term i(a)*(a*(i(1)*b)) can be rewritten as in
i(a)*(a*(i(1)*b))
  [8]-> i(a)*(a*b)
  [14]-> b
or as in
i(a)*(a*(i(1)*b))
  [14]-> i(1)*b
Therefore, we create a new rewrite rule:
[22] i(1)*a -> a
Combining rewrite rules [15] and [2] the term i(a)*i(i(a)) can be rewritten as in
i(a)*i(i(a))
  [15]-> i(a)*a
or as in
i(a)*i(i(a))
  [2]-> 1
Therefore, we create a new rewrite rule:
[23] i(a)*a -> 1
The result is rewrite system 5:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)
[5] a*(i(a)*b) -> b
[6] a*i(1) -> a
[7] 1*a -> a
[8] a*(i(1)*b) -> a*b
[14] i(b)*(b*a) -> a
[15] i(i(a)) -> a
[16] i(a)*a -> 1
[17] i(b*c)*(b*(c*a)) -> a
[18] i(1) -> 1
[19] i(a)*a -> i(1)
[20] i(1)*a -> a
[21] i(1)*a -> a
[22] i(1)*a -> a
[23] i(a)*a -> 1

Iteration 5 - Simplification


Looking at rewrite rule [6], the left-hand side can be rewritten as in
a*i(1)
  [18]-> a*1
  [1]-> a
The rewrite rule can therefore be removed. Looking at rewrite rule [8], the left-hand side can be rewritten as in
a*(i(1)*b)
  [20]-> a*b
The rewrite rule can therefore be removed. Looking at rewrite rule [16], the left-hand side can be rewritten as in
i(a)*a
  [19]-> i(1)
  [18]-> 1
The rewrite rule can therefore be removed. Looking at rewrite rule [19], the left-hand side can be rewritten as in
i(a)*a
  [23]-> 1
while the right-hand side can be rewritten as in
i(1)
  [18]-> 1
The rewrite rule can therefore be removed. Looking at rewrite rule [20], the left-hand side can be rewritten as in
i(1)*a
  [21]-> a
The rewrite rule can therefore be removed. Looking at rewrite rule [21], the left-hand side can be rewritten as in
i(1)*a
  [22]-> a
The rewrite rule can therefore be removed. Looking at rewrite rule [22], the left-hand side can be rewritten as in
i(1)*a
  [18]-> 1*a
  [7]-> a
The rewrite rule can therefore be removed. The result is rewrite system 5bis:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)
[5] a*(i(a)*b) -> b
[7] 1*a -> a
[14] i(b)*(b*a) -> a
[15] i(i(a)) -> a
[17] i(b*c)*(b*(c*a)) -> a
[18] i(1) -> 1
[23] i(a)*a -> 1


Iteration 6


The sixth iteration of the procedure can be represented graphically as:


During this iteration, seven new rewrite rules [24], [25], [26], [27], [28], [29] and [30] are added. Rewrite rules [17], [25], [26], [28], [29] and [30] become trivial.

Iteration 6 - Critical Pairs


Combining rewrite rules [2] and [17] the term i(a*b)*(a*(b*i(b))) can be rewritten as in
i(a*b)*(a*(b*i(b)))
  [2]-> i(a*b)*(a*1)
  [1]-> i(a*b)*a
or as in
i(a*b)*(a*(b*i(b)))
  [17]-> i(b)
Therefore, we create a new rewrite rule:
[24] i(b*a)*b -> i(a)
Combining rewrite rules [3] and [17] the term i(a*b*c)*(a*b*(c*d)) can be rewritten as in
i(a*b*c)*(a*b*(c*d))
  [3]-> i(a*b*c)*(a*(b*(c*d)))
  [3]-> i(a*(b*c))*(a*(b*(c*d)))
or as in
i(a*b*c)*(a*b*(c*d))
  [17]-> d
Therefore, we create a new rewrite rule:
[25] i(b*(c*d))*(b*(c*(d*a))) -> a
Combining rewrite rules [3] and [17] the term i(a*(b*c))*(a*(b*c*d)) can be rewritten as in
i(a*(b*c))*(a*(b*c*d))
  [3]-> i(a*(b*c))*(a*(b*(c*d)))
or as in
i(a*(b*c))*(a*(b*c*d))
  [17]-> d
Therefore, we create a new rewrite rule:
[26] i(b*(c*d))*(b*(c*(d*a))) -> a
Combining rewrite rules [5] and [17] the term i(a*b)*(a*(b*(i(b)*c))) can be rewritten as in
i(a*b)*(a*(b*(i(b)*c)))
  [5]-> i(a*b)*(a*c)
or as in
i(a*b)*(a*(b*(i(b)*c)))
  [17]-> i(b)*c
Therefore, we create a new rewrite rule:
[27] i(c*a)*(c*b) -> i(a)*b
Combining rewrite rules [14] and [17] the term i(a*i(b))*(a*(i(b)*(b*c))) can be rewritten as in
i(a*i(b))*(a*(i(b)*(b*c)))
  [14]-> i(a*i(b))*(a*c)
or as in
i(a*i(b))*(a*(i(b)*(b*c)))
  [17]-> b*c
Therefore, we create a new rewrite rule:
[28] i(c*i(a))*(c*b) -> a*b
Combining rewrite rules [23] and [17] the term i(i(a*b)*a)*(i(a*b)*(a*b)) can be rewritten as in
i(i(a*b)*a)*(i(a*b)*(a*b))
  [23]-> i(i(a*b)*a)*1
  [1]-> i(i(a*b)*a)
or as in
i(i(a*b)*a)*(i(a*b)*(a*b))
  [17]-> b
Therefore, we create a new rewrite rule:
[29] i(i(b*a)*b) -> a
Combining rewrite rules [23] and [17] the term i(a*i(b))*(a*(i(b)*b)) can be rewritten as in
i(a*i(b))*(a*(i(b)*b))
  [23]-> i(a*i(b))*(a*1)
  [1]-> i(a*i(b))*a
or as in
i(a*i(b))*(a*(i(b)*b))
  [17]-> b
Therefore, we create a new rewrite rule:
[30] i(b*i(a))*b -> a
The result is rewrite system 6:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)
[5] a*(i(a)*b) -> b
[7] 1*a -> a
[14] i(b)*(b*a) -> a
[15] i(i(a)) -> a
[17] i(b*c)*(b*(c*a)) -> a
[18] i(1) -> 1
[23] i(a)*a -> 1
[24] i(b*a)*b -> i(a)
[25] i(b*(c*d))*(b*(c*(d*a))) -> a
[26] i(b*(c*d))*(b*(c*(d*a))) -> a
[27] i(c*a)*(c*b) -> i(a)*b
[28] i(c*i(a))*(c*b) -> a*b
[29] i(i(b*a)*b) -> a
[30] i(b*i(a))*b -> a

Iteration 6 - Simplification


Looking at rewrite rule [17], the left-hand side can be rewritten as in
i(b*c)*(b*(c*a))
  [27]-> i(c)*(c*a)
  [14]-> a
The rewrite rule can therefore be removed. Looking at rewrite rule [25], the left-hand side can be rewritten as in
i(b*(c*d))*(b*(c*(d*a)))
  [26]-> a
The rewrite rule can therefore be removed. Looking at rewrite rule [26], the left-hand side can be rewritten as in
i(b*(c*d))*(b*(c*(d*a)))
  [27]-> i(c*d)*(c*(d*a))
  [27]-> i(d)*(d*a)
  [14]-> a
The rewrite rule can therefore be removed. Looking at rewrite rule [28], the left-hand side can be rewritten as in
i(c*i(a))*(c*b)
  [27]-> i(i(a))*b
  [15]-> a*b
The rewrite rule can therefore be removed. Looking at rewrite rule [29], the left-hand side can be rewritten as in
i(i(b*a)*b)
  [24]-> i(i(a))
  [15]-> a
The rewrite rule can therefore be removed. Looking at rewrite rule [30], the left-hand side can be rewritten as in
i(b*i(a))*b
  [24]-> i(i(a))
  [15]-> a
The rewrite rule can therefore be removed. The result is rewrite system 6bis:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)
[5] a*(i(a)*b) -> b
[7] 1*a -> a
[14] i(b)*(b*a) -> a
[15] i(i(a)) -> a
[18] i(1) -> 1
[23] i(a)*a -> 1
[24] i(b*a)*b -> i(a)
[27] i(c*a)*(c*b) -> i(a)*b


Iteration 7


The seventh iteration of the procedure can be represented graphically as:


During this iteration, five new rewrite rules [31], [32], [33], [34] and [35] are added. Rewrite rules [24], [27], [32], [33], [34] and [35] become trivial.

Note that rewrite rule [31] states that the inverse of a product of two elements is the product of the inverted elements in inverted order, which is a nice intermediate result.

Iteration 7 - Critical Pairs


Combining rewrite rules [2] and [27] the term i(a*b)*(a*i(a)) can be rewritten as in
i(a*b)*(a*i(a))
  [2]-> i(a*b)*1
  [1]-> i(a*b)
or as in
i(a*b)*(a*i(a))
  [27]-> i(b)*i(a)
Therefore, we create a new rewrite rule:
[31] i(b*a) -> i(a)*i(b)
Combining rewrite rules [5] and [27] the term i(a*b)*(a*(i(a)*c)) can be rewritten as in
i(a*b)*(a*(i(a)*c))
  [5]-> i(a*b)*c
or as in
i(a*b)*(a*(i(a)*c))
  [27]-> i(b)*(i(a)*c)
Therefore, we create a new rewrite rule:
[32] i(b*a)*c -> i(a)*(i(b)*c)
Combining rewrite rules [14] and [27] the term i(i(a)*b)*(i(a)*(a*c)) can be rewritten as in
i(i(a)*b)*(i(a)*(a*c))
  [14]-> i(i(a)*b)*c
or as in
i(i(a)*b)*(i(a)*(a*c))
  [27]-> i(b)*(a*c)
Therefore, we create a new rewrite rule:
[33] i(i(b)*a)*c -> i(a)*(b*c)
Combining rewrite rules [23] and [27] the term i(i(a)*b)*(i(a)*a) can be rewritten as in
i(i(a)*b)*(i(a)*a)
  [23]-> i(i(a)*b)*1
  [1]-> i(i(a)*b)
or as in
i(i(a)*b)*(i(a)*a)
  [27]-> i(b)*a
Therefore, we create a new rewrite rule:
[34] i(i(b)*a) -> i(a)*b
Combining rewrite rules [24] and [27] the term i(i(a*b)*c)*(i(a*b)*a) can be rewritten as in
i(i(a*b)*c)*(i(a*b)*a)
  [24]-> i(i(a*b)*c)*i(b)
or as in
i(i(a*b)*c)*(i(a*b)*a)
  [27]-> i(c)*a
Therefore, we create a new rewrite rule:
[35] i(i(b*c)*a)*i(c) -> i(a)*b
The result is rewrite system 7:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)
[5] a*(i(a)*b) -> b
[7] 1*a -> a
[14] i(b)*(b*a) -> a
[15] i(i(a)) -> a
[18] i(1) -> 1
[23] i(a)*a -> 1
[24] i(b*a)*b -> i(a)
[27] i(c*a)*(c*b) -> i(a)*b
[31] i(b*a) -> i(a)*i(b)
[32] i(b*a)*c -> i(a)*(i(b)*c)
[33] i(i(b)*a)*c -> i(a)*(b*c)
[34] i(i(b)*a) -> i(a)*b
[35] i(i(b*c)*a)*i(c) -> i(a)*b

Iteration 7 - Simplification


Looking at rewrite rule [24], the left-hand side can be rewritten as in
i(b*a)*b
  [32]-> i(a)*(i(b)*b)
  [23]-> i(a)*1
  [1]-> i(a)
The rewrite rule can therefore be removed. Looking at rewrite rule [27], the left-hand side can be rewritten as in
i(c*a)*(c*b)
  [32]-> i(a)*(i(c)*(c*b))
  [14]-> i(a)*b
The rewrite rule can therefore be removed. Looking at rewrite rule [32], the left-hand side can be rewritten as in
i(b*a)*c
  [31]-> (i(a)*i(b))*c
  [3]-> i(a)*(i(b)*c)
The rewrite rule can therefore be removed. Looking at rewrite rule [33], the left-hand side can be rewritten as in
i(i(b)*a)*c
  [31]-> i(a)*i(i(b))*c
  [3]-> i(a)*(i(i(b))*c)
  [15]-> i(a)*(b*c)
The rewrite rule can therefore be removed. Looking at rewrite rule [34], the left-hand side can be rewritten as in
i(i(b)*a)
  [31]-> i(a)*i(i(b))
  [15]-> i(a)*b
The rewrite rule can therefore be removed. Looking at rewrite rule [35], the left-hand side can be rewritten as in
i(i(b*c)*a)*i(c)
  [31]-> i(a)*i(i(b*c))*i(c)
  [3]-> i(a)*(i(i(b*c))*i(c))
  [15]-> i(a)*(b*c*i(c))
  [3]-> i(a)*(b*(c*i(c)))
  [2]-> i(a)*(b*1)
  [1]-> i(a)*b
The rewrite rule can therefore be removed. The result is rewrite system 7bis:
[1] a*1 -> a
[2] a*i(a) -> 1
[3] a*b*c -> a*(b*c)
[5] a*(i(a)*b) -> b
[7] 1*a -> a
[14] i(b)*(b*a) -> a
[15] i(i(a)) -> a
[18] i(1) -> 1
[23] i(a)*a -> 1
[31] i(b*a) -> i(a)*i(b)

Completion


The resulting rewrite system 7bis cannot be extended anymore. We have therefore reached the fixed-point of the Knuth-Bendix completion algorithm.

The rewrite system can be represented graphically as:


This result constitutes the deterministic rewrite system that is semantically equivalent to the theory consisting of the three non-directed axioms we started off with. It provides a decidable computational framework for non-abelian groups, in the sense that, to repeat, semantic equality of terms is equivalent to syntactic equality of their normal forms.
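To close the loop, here is a minimal term rewriting sketch in Java (an illustration of mine, not the tooling that generated this post). It applies rewrite system 7bis to the ground term i(x*y)*(x*(y*z)) and reduces it to its normal form z; with only the three original axioms oriented as rewrite rules, the same term would already be irreducible, which is precisely the expressive power restored by the completion procedure.

import java.util.*;

// Minimal term rewriting sketch: terms are variables or applications, rules are directed
// pairs of terms, and a term is reduced by repeatedly applying the first rule that matches,
// at the root first and then inside the arguments, until no rule applies.
public class GroupRewriter {

    sealed interface Term permits Var, App {}
    record Var(String name) implements Term {}
    record App(String functor, List<Term> args) implements Term {}
    record Rule(Term lhs, Term rhs) {}

    static Term mul(Term a, Term b) { return new App("*", List.of(a, b)); }
    static Term inv(Term a)         { return new App("i", List.of(a)); }
    static Term cst(String n)       { return new App(n, List.of()); }
    static Term variable(String n)  { return new Var(n); }

    // match a rule pattern against a term, collecting variable bindings
    static boolean match(Term pattern, Term term, Map<String, Term> binding) {
        if (pattern instanceof Var v) {
            Term bound = binding.putIfAbsent(v.name(), term);
            return bound == null || bound.equals(term);
        }
        if (!(pattern instanceof App p) || !(term instanceof App a)) return false;
        if (!p.functor().equals(a.functor()) || p.args().size() != a.args().size()) return false;
        for (int k = 0; k < p.args().size(); k++)
            if (!match(p.args().get(k), a.args().get(k), binding)) return false;
        return true;
    }

    // replace the variables of a rule's right-hand side by the terms they are bound to
    static Term substitute(Term t, Map<String, Term> binding) {
        if (t instanceof Var v) return binding.get(v.name());
        App a = (App) t;
        return new App(a.functor(), a.args().stream().map(x -> substitute(x, binding)).toList());
    }

    // one reduction step: try every rule at the root, then recurse into the arguments
    static Optional<Term> step(Term t, List<Rule> rules) {
        for (Rule rule : rules) {
            Map<String, Term> binding = new HashMap<>();
            if (match(rule.lhs(), t, binding)) return Optional.of(substitute(rule.rhs(), binding));
        }
        if (t instanceof App a) {
            for (int k = 0; k < a.args().size(); k++) {
                Optional<Term> reduced = step(a.args().get(k), rules);
                if (reduced.isPresent()) {
                    List<Term> args = new ArrayList<>(a.args());
                    args.set(k, reduced.get());
                    return Optional.of(new App(a.functor(), args));
                }
            }
        }
        return Optional.empty(); // normal form reached
    }

    static Term normalForm(Term t, List<Rule> rules) {
        for (Optional<Term> next = step(t, rules); next.isPresent(); next = step(t, rules)) t = next.get();
        return t;
    }

    // pretty-printer for the two functors used here: binary * and unary i
    static String show(Term t) {
        if (t instanceof Var v) return v.name();
        App a = (App) t;
        if (a.args().isEmpty()) return a.functor();
        if (a.functor().equals("*")) return "(" + show(a.args().get(0)) + "*" + show(a.args().get(1)) + ")";
        return a.functor() + "(" + show(a.args().get(0)) + ")";
    }

    public static void main(String[] args) {
        Term a = variable("a"), b = variable("b"), c = variable("c"), one = cst("1");
        List<Rule> rules = List.of(                              // rewrite system 7bis
            new Rule(mul(a, one), a),                            // [1]  a*1 -> a
            new Rule(mul(a, inv(a)), one),                       // [2]  a*i(a) -> 1
            new Rule(mul(mul(a, b), c), mul(a, mul(b, c))),      // [3]  a*b*c -> a*(b*c)
            new Rule(mul(a, mul(inv(a), b)), b),                 // [5]  a*(i(a)*b) -> b
            new Rule(mul(one, a), a),                            // [7]  1*a -> a
            new Rule(mul(inv(b), mul(b, a)), a),                 // [14] i(b)*(b*a) -> a
            new Rule(inv(inv(a)), a),                            // [15] i(i(a)) -> a
            new Rule(inv(one), one),                             // [18] i(1) -> 1
            new Rule(mul(inv(a), a), one),                       // [23] i(a)*a -> 1
            new Rule(inv(mul(b, a)), mul(inv(a), inv(b)))        // [31] i(b*a) -> i(a)*i(b)
        );
        // reduce i(x*y)*(x*(y*z)); its normal form under rewrite system 7bis is z
        Term x = cst("x"), y = cst("y"), z = cst("z");
        Term term = mul(inv(mul(x, y)), mul(x, mul(y, z)));
        System.out.println(show(term) + "  ->*  " + show(normalForm(term, rules)));
    }
}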