Friday, August 25, 2017

A Not-so-Random Walk Through Hyperspace

Reporting on a first experiment with neural network technology (DL4J), applying feed forward neural networks (FFNN) for supervised learning of file types based on content fragments, exploring the hyperparameter space of domain and tooling.

Introduction


Artificial intelligence based on neural networks has a long history. However, until a few years ago, success stories were few and of limited impact.

Over the last few years, neural networks have demonstrated some impressive results, catching up with the long standing promise of performance at the level of human experts in different domains, and even taking us beyond. Interesting presentations in this context include Jeremy Howard's talk on the potential of deep learning and Demis Hassabis' talk on DeepMind and AlphaGo.

Due to these recent successes, neural network technology has gained renewed interest. Today, different generic neural network platforms are available in the public domain, and a large community of enthusiasts is looking into the technology and its applications.

Time to join the community ...

Finding a Problem


With the solution at hand, we still have to find a problem. Data should be real life, easily available in large numbers, and categorised.

The problem we will address uses the file system: the goal is to recognise the file type based on (a fragment of) the file content. We have a large number of files on the file system at hand, created by many different authors, and the bulk of them labelled correctly by the file extension.

The data set will consist of all files under a given root directory, using the following constraints:
  • Text files only, determined heuristically based on the content of the file. We will be using the function proposed by Ondra Žižka on stackoverflow for this purpose (a sketch of such a heuristic is shown after this list).
  • Excluding files with extension "txt", as these have strongly varied content.
  • Having at least 100 occurrences for a file type, to allow patterns to be detected.
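As an illustration, here is a minimal sketch of such a content-based text detection heuristic in Java. It follows the spirit of the referenced stackoverflow answer, but the class name, the 1,000-byte window, the accepted byte ranges and the 95% threshold are our own assumptions, not the exact values used to build the data set.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public final class TextFileHeuristic {

    // Content-based "is this a text file?" check. Window size and threshold
    // are illustrative assumptions, not values taken from the experiment.
    static boolean looksLikeText(Path file) throws IOException {
        byte[] buffer = Files.readAllBytes(file);
        int length = Math.min(buffer.length, 1000);
        if (length == 0) {
            return false;
        }
        int textLike = 0;
        for (int i = 0; i < length; i++) {
            int b = buffer[i] & 0xFF;
            if (b == 0x00) {
                return false;                        // NUL byte: almost certainly binary
            }
            if (b == 0x09 || b == 0x0A || b == 0x0D  // tab, LF, CR
                    || (b >= 0x20 && b < 0x7F)       // printable ASCII
                    || b >= 0x80) {                  // allow non-ASCII (UTF-8 content)
                textLike++;
            }
        }
        return textLike / (double) length > 0.95;    // assumed threshold
    }
}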

This results in a collection of 50,665 files with 12 different extensions, distributed as follows:


Note that the number of xml-files is 41,103, but the bar is clipped in favour of visibility of statistics for the other file types.

Performance Evaluation


The first parameter that comes to mind for evaluating the performance of a learning system is the accuracy of its output, defined as the ratio of the number of correct classifications over the total number of classifications. As explained in the Deep Learning and Neural Network Glossary and Jason Brownlee's post on Performance Measures, we can improve on the evaluation of performance of a neural network.

The results for a specific class during the test phase can be divided into:
  • True positives (TP): sample correctly accepted.
  • False positives (FP): sample incorrectly accepted.
  • True negatives (TN): sample correctly rejected.
  • False negatives (FN): sample incorrectly rejected.

Given these definitions, we can express the accuracy and introduce additional metrics of performance:
  • Accuracy: A = (TP+TN)/(TP+FP+TN+FN), how often is a classification correct?
  • Precision: P = TP/(TP+FP), how often is the acceptance of a sample correct?
  • Recall: R = TP/(TP+FN), how often is a sample in the class correctly classified?
  • F1 Score: F1 = 2*P*R/(P+R), the harmonic mean of precision and recall.

We will track all four of these metrics, but we will use F1 as a metric for overall performance.
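To make the formulas concrete, the four metrics translate directly into code. The fragment below is ours and assumes the per-class counts are available as plain numbers; DL4J's Evaluation class reports the same metrics out of the box.

// Computes the four metrics defined above from the per-class counts.
static void printMetrics(long tp, long fp, long tn, long fn) {
    double accuracy  = (double) (tp + tn) / (tp + fp + tn + fn);
    double precision = (double) tp / (tp + fp);
    double recall    = (double) tp / (tp + fn);
    double f1        = 2 * precision * recall / (precision + recall);
    System.out.printf("A=%.4f P=%.4f R=%.4f F1=%.4f%n", accuracy, precision, recall, f1);
}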

Selecting a Neural Network Topology


We will be using the DL4J system, because it has a low barrier to entry and is feature rich.

An important observation to determine the network topology is related to the type of problem we wish to solve. The Introduction to Deep Learning on DL4J briefly explains:
  • Classification as a learning process where the category is provided as part of the input, also referred to as supervised learning.
  • Clustering as a learning process where the category is not provided as part of the input, but should be discovered by the learning process, also referred to as unsupervised learning.
  • Predictive Analytics as a variant of either classification or clustering, where one of the elements in the sample data reflects the time dimension.

The problem we have chosen to solve clearly is a classification problem.

For this first attempt to get neural networks operational, we will focus on feed forward neural networks (FFNN) with multiple layers. See Feedforward Neural Network on Wikipedia for a general introduction. We borrow the illustration from the DL4J Introduction to Neural Networks:


For the FFNN, we will consider as hyperparameters the number of layers, the organisation of nodes into these layers, the learning rate and some more parameters explained below.

For the chosen problem domain, the parameters include the sample size or the number of characters taken from the file, and the minimum frequency for a file type to be taken into scope.

Jumping into the Unknown


We will take the class CSVExample in the DL4J sample code base as a starting point for training and testing an FFNN. Without having anything else to go on, the example will also help us with the choices of hyperparameter values.
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.weights.WeightInit;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.lossfunctions.LossFunctions;

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .seed(1L)                        // fixed seed for reproducible weight initialisation
    .iterations(1000)
    .activation(Activation.TANH)
    .weightInit(WeightInit.XAVIER)
    .learningRate(0.1)
    .regularization(true).l2(1e-4)
    .list()
    .layer(0, new DenseLayer.Builder().nIn(64).nOut(12).build())
    .layer(1, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
        .activation(Activation.SOFTMAX).nIn(12).nOut(12).build())
    .backprop(true)
    .pretrain(false)
    .build();
The above code fragment builds a MultiLayerConfiguration providing all the necessary parameters for the network to function. We refer to the full example for more information on how to provide the sample data to the network and run the training and testing processes.
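For orientation, the surrounding training and test code follows the same pattern as the CSVExample. The fragment below is a sketch against the DL4J API of that era (0.8.x); trainingData and testData are assumed to be pre-built DataSet objects holding the encoded file fragments and their labels.

MultiLayerNetwork model = new MultiLayerNetwork(conf);     // conf is the configuration built above
model.init();
model.setListeners(new ScoreIterationListener(100));       // log the score every 100 iterations

model.fit(trainingData);                                   // runs the configured 1,000 iterations

Evaluation eval = new Evaluation(12);                      // one entry per file type
INDArray output = model.output(testData.getFeatureMatrix());
eval.eval(testData.getLabels(), output);
System.out.println(eval.stats());                          // prints accuracy, precision, recall and F1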

Let's have a look at the parameters that can be provided, and why we've chosen particular values:
  • Seed: The seed is used to initialise weights. We are using a hard-coded seed value to be able to reproduce results.
  • Number of Iterations: An iteration is a cycle in which the weights in the network are adjusted based on the errors obtained on the training data set. We will use 1,000 iterations for our first attempt, having as yet no insight whatsoever into convergence behaviour.
  • Activation Function: The activation function determines what output a node will generate based on its input. We use the hyperbolic tangent function because it ranges from -1 to 1, using the full spectrum (as opposed to the once popular sigmoid function). Also, the function is continuously differentiable (as opposed to the more recent ReLU function), which helps in supporting backpropagation by gradient descent, as mentioned below. See Activation Functions on Wikipedia for more information on activation functions.
  • Weight Initialisation: We are using Xavier for the initial weights in the neural network. This keeps the weights, and consequently the signals that flow through the network, at a reasonable range throughout the layers of the network. See Andy's post on Xavier initialisation for more information.
  • Learning Rate: The learning rate is a measure for how quickly a network abandons old beliefs for new ones, as explained by Conner Davis on Quora. We will start off with 0.1, as most of the examples do.
  • Regularisation: Regularisation artificially discourages complex or extreme explanations of the world, even if they fit what has been observed better, as explained by Sean Owen on Quora. It is used as a measure against overfitting, where a statistical model describes random error or noise instead of the underlying relationships. See Stanford University's course on Convolutional Neural Networks for Visual Recognition for a more in-depth coverage in the context of neural networks. We will activate regularisation for obvious reasons. L2 regularisation is most commonly used.
  • Layer 0: The input layer of the network. The number of inputs, and consequently the number of nodes, equals the sample size, 64 in our case. The number of outputs has to equal the number of inputs of the output layer. The input layer is never counted as a layer: a 1-layer network has an input layer and an output layer, but no hidden layers. See also Stanford University's course on Convolutional Neural Networks for Visual Recognition.
  • Layer 1: The output layer of the network. As we are doing classification, the number of nodes equals the number of categories, 12. The loss function is used for backpropagation, to quantify the error. The negative log-likelihood function is also referred to as cross-entropy cost. See the section on Loss Functions in Stanford University's course on Convolutional Neural Networks for Visual Recognition. You may also want to have a look at the post on Cost Functions used in Neural Networks on Cross Validated. We are using the softmax activation function to make sure the output values can be interpreted as probabilities, one for each category. See also the Softmax function on Wikipedia.
  • Backpropagation: Backpropagation is the process by which the weights are updated based on the classification errors made during the learning process. Some of the parameters introduced above are actually related to backpropagation. See Chapter 2 of Michael Nielsen's book Neural Networks and Deep Learning for a theoretical description and Matt Mazur's post on Step by Step Backpropagation for an exhaustive example. DL4J refers to Stanford's wiki on the Backpropagation Algorithm.
  • Pretrain: Pre-training means initialising the network with the weights of a neural network that was already trained on a different (but similar) data set, as explained in the post on pretraining a neural network on Cross Validated. We will disable pretraining.

Falling on One's Feet


As stated before, our data set consists of 50,665 files. We will be using 90% (45,599 files) for training, and 10% (5,066 files) for testing. To guarantee the independence of our test set, we will never use any of the test files for training.
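The post does not show the splitting code itself, but with ND4J it can be done along the following lines, again following the CSVExample pattern; allData is assumed to hold all 50,665 encoded samples with their labels.

// allData is assumed to be a DataSet with all encoded samples (features and labels).
allData.shuffle();                                         // randomise the order before splitting
SplitTestAndTrain split = allData.splitTestAndTrain(0.9);  // 90% train, 10% test
DataSet trainingData = split.getTrain();                   // used for fit()
DataSet testData = split.getTest();                        // held out, never used for training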

The first attempt, after 1,000 iterations, gives an accuracy of 99.07%, which is quite impressive. We were lucky with our initial choices for the hyperparameter values. However, other metrics are far less impressive, with a precision of 86.80%, a recall of 70.13% and an F1 score of 77.22%.

The evolution of these scores as a function of the number of iterations is quite instructive:


While the accuracy improves dramatically, reaching over 90% after 15 iterations, and over 99% after 559 iterations, the other metrics evolve much more slowly. The chart however suggests increasing the number of iterations may be useful, as the F1 score still seems to be increasing at 1,000 iterations.

We may add that, on an Intel® Core™ i7-6700HQ CPU @ 2.60GHz × 8, the training process took just under 2 minutes, including data load, model initialisation, and an evaluation after each iteration. Without the per-iteration evaluation, the learning process finishes in 77 seconds.

Another interesting instrument, providing an alternative view on the performance of the neural network, is the confusion matrix. The confusion matrix shows the actual categories in the rows, and the number of classifications by the network per category in the columns.

After our first run, the confusion matrix is

             csv  project  java    xml  html  groovy  xsd  sh  properties
csv           30        0     0      0     0       0    0   0           1
project        0        0     0    100     0       0    0   0           0
java           1        0    32     10     0       0    1   0           3
xml            2        0     0  4,603     0       0    0   0           1
html           0        0     0      0    43       0    0   0           0
groovy         0        0     0      0     0       1    0   0           8
xsd            0        0     0     16     0       0   32   0           0
sh             0        0     0      0     0       0    0   1           0
properties     1        0     0      3     0       0    0   0          16


We observe an important mix-up: all project files and many xsd files are classified as xml files, which, essentially, they are. We also see that almost all groovy files are classified as properties files, most probably because they start with some assignment statements.

A final interesting observation is that the evolution of precision and recall, and consequently the F1 score, is not smooth but rather bumpy. We can see for example a sudden decrease in precision after iteration 161, from 56.68% to 47.38%, followed by a sudden increase after iteration 194, both in precision, from 48.92% to 65.59%, and in recall, from 31.25% to 42.36%.

Analysis of the confusion matrix before and after shows that the decrease in precision is due to a change in classification of all 43 html files in the test set, from category xml to project. Both classifications are incorrect. However, as the total number of xml files is very high compared to other categories, the relative expression of the error significantly decreases the precision. A later change in classification which causes the same 43 html files to be correctly classified gives a significant increase in both precision and recall.

This observation also relates to the impressively fast increase in accuracy demonstrated in this example. As the distribution of file types is extremely uneven, with 4,606 out of 5,066 files being xml files, learning to classify xml files correctly already gets the accuracy above 90%: a classifier that labels every file as xml scores 4,606/5,066 ≈ 90.9%. This also explains the relative smoothness of the accuracy curve over iterations in comparison with other examples of learning processes based on neural networks, as the accuracy of xml classification is very dominant.

Getting Oriented


With the term hyperspace, we refer to the space generated by all hyperparameters. In the previous topic, we chose a first point in hyperspace to start off with, based on a little intuition and, no doubt, a lot of ignorance.

Now that we have an anchor point, we will explore hyperspace in some more detail. Obviously, we risk getting caught in a combinatorial explosion of possibilities. In order to get oriented, we will scan hyperspace selectively, by choosing a discrete number of values around our starting point for each dimension in hyperspace, effectively scanning a multi-dimensional cross in hyperspace.

However, the first question that comes to mind is where we would end up by running more iterations in the learning process.

Longer Learning


Well, here it is: with the same seed value of 1L, after 5,000 iterations, we get an accuracy of 99.76%, a precision of 78.33%, a recall of 96.94% and an F1 score of 89.84%.


Although the accuracy still improves with the number of iterations, there is no consistent convergence for other metrics. Running the network with different initialisations (read different random seeds) confirms this: there is no consistent convergence of all performance metrics towards an optimum with this configuration.

Bigger Samples


Common sense suggests that one of the more important parameters is the sample size, the number of bytes taken from the file content to be used for classification. To investigate, we run the train and test cycle using sample sizes ranging from 32 to 128 bytes, in steps of 16. The difference between these variants is illustrated by their performance graph:


Clearly, the size of the sample is an important factor: the larger the sample, the better all performance metrics become, as expected. For the best run, obtained with a sample size of 128, we get accuracy 99.87%, precision 93.47%, recall 98.08% and F1 score 94.70%.

The tests have been executed with different seed values for the initialisation of weights. Because the volatility of the results of individual tests is high, depending on the seed value, 5 test runs have been scheduled for each parameter value, and the results with the lowest and highest F1 score have been discarded. The chart displays the average performance metrics of the 3 remaining results. We will use this technique for the remainder of this post.
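Spelled out in code, the averaging scheme looks as follows; runF1 and parameterValue are assumed names, where runF1 trains and tests the network once with the given parameter value and seed and returns the resulting F1 score, and the seed values 1 to 5 are illustrative.

// Run five times with different seeds, drop the lowest and highest F1 score,
// and average the remaining three results.
List<Double> scores = new ArrayList<>();
for (long seed = 1; seed <= 5; seed++) {
    scores.add(runF1(parameterValue, seed));
}
Collections.sort(scores);
double trimmedAverage = (scores.get(1) + scores.get(2) + scores.get(3)) / 3.0;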

As a side experiment, inspired by Bruno Stecanella's post on Naive Bayes Classifiers on MonkeyLearn, we did the same test using Naive Bayes classification. The test showed an accuracy of approximately 92%, with an optimum sample size around 64.

For the remainder of the post, we will continue to work with a sample size of 64, mainly because in multi-layer configurations, the number of weights quickly grows with the sample size, as illustrated below.

More Samples


In order to observe patterns, sufficient data should be available. It is therefore important to make sure that, for each file type, a significant number of files is available in the training set. Up to now, we have been working with a minimum frequency of 100, which gave fairly good results.
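To make the constraint concrete, the selection step can be sketched as follows: count the candidate files per extension and retain only the extensions that reach the minimum frequency. The names files, extensionOf and minFrequency are assumptions for the sake of the example.

// files: list of candidate text files; extensionOf: returns the lower-cased
// file extension; minFrequency: 25, 50 or 100 in the runs below.
Map<String, List<Path>> byExtension = files.stream()
        .collect(Collectors.groupingBy(path -> extensionOf(path)));
Map<String, List<Path>> retained = byExtension.entrySet().stream()
        .filter(entry -> entry.getValue().size() >= minFrequency)
        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));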

Running the FFNN with different data sets, where the minimum frequency is 25, 50 and 100, respectively, confirms our expectation:


For the remainder of the post, we will continue to work with a minimum frequency of 100, as before.

Multi-Layer Topology


So far, all our tests have been executed using a 1-layer FFNN. As we are working with deep learning technology, we should look at the impact of working with multiple layers of nodes as well.

In designing a multi-layer configuration, we have to decide on the number of nodes for each layer, which is again a jump into the unknown. The constraints available are:
  • The number of nodes in the input layer is equal to the sample size.
  • The number of nodes in the output layer is equal to the number of categories.
  • The total number of weights should be far below the number of samples. We cannot tune a number of weights that is larger than the number of samples. Our intuition says that we should have sufficient margin here.
  • Some experts state that the number of nodes should not exceed the number of inputs. We will keep the number of nodes between the number of inputs and the number of outputs for each individual layer.

Given these constraints, we will run a test with 3 different configurations:
  • Min: The number of nodes on hidden layers is the minimum of the number of nodes on the input and output layers, 12 in our case.
  • Linear: The number of nodes on hidden layers evolves linearly from the number of nodes on the input layer to the number of nodes on the output layer, from 64 to 12 in our case.
  • Max: The number of nodes on hidden layers is the maximum of the number of nodes on the input and output layers, 64 in our case.

Testing these distributions on a 3-layer configuration results in the following performance metrics:


Although the sample is small and has poor coverage, we will continue with the linear node distribution model. According to this model, the number of nodes N(i) for layer i is given by:
N(i) = #B - i * (#B - #C) / #L
Here, #B is the number of bytes in the sample, or the sample size, #C is the number of categories, and #L is the number of layers. The input layer gets label 0, hidden layers are labelled 1 to #L-1, and the output layer gets label #L.
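The formula is easy to mechanise. The fragment below computes the number of nodes per layer and the resulting number of weights; rounding to the nearest integer is our own assumption, chosen because it matches the table below, and bias weights are not counted, consistent with that table.

// Number of nodes on layer i for a #L-layer network with sampleSize inputs (#B)
// and categories outputs (#C).
static int nodes(int i, int sampleSize, int categories, int layers) {
    return (int) Math.round(sampleSize - i * (sampleSize - categories) / (double) layers);
}

// Total number of weights: one weight per connection between consecutive layers
// (bias terms not counted).
static int totalWeights(int sampleSize, int categories, int layers) {
    int weights = 0;
    for (int i = 0; i < layers; i++) {
        weights += nodes(i, sampleSize, categories, layers)
                 * nodes(i + 1, sampleSize, categories, layers);
    }
    return weights;
}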

With a sample size of 64 and 12 categories, the following table provides an overview of the number of nodes for each layer, and consequently, the number of weights to be tuned in the network, for up to 10 layers:

# Layers   L0   L1   L2   L3   L4   L5   L6   L7   L8   L9   L10   # Weights
       1   64   12                                                       768
       2   64   38   12                                                2,888
       3   64   47   29   12                                           4,719
       4   64   51   38   25   12                                      6,452
       5   64   54   43   33   22   12                                 8,187
       6   64   55   47   38   29   21   12                            9,854
       7   64   57   49   42   34   27   19   12                      11,586
       8   64   58   51   45   38   32   25   19   12                 13,394
       9   64   58   52   47   41   35   29   24   18   12            14,893
      10   64   59   54   48   43   38   33   28   22   17    12      16,624

Clearly, the number of weights quickly increases with the number of layers, and we need to be cautious not to configure too many nodes, in particular as we only have 45,599 samples in the training set.
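For reference, here is a sketch of how such a deeper configuration could be assembled with DL4J, reusing the nodes(...) helper from the previous section, with the hidden layers as DenseLayers and the last layer as the OutputLayer. The exact layer layout used in the experiment is not shown in this post, so treat this as an illustration rather than the actual code.

int layers = 6, sampleSize = 64, categories = 12;          // illustrative values
NeuralNetConfiguration.ListBuilder builder = new NeuralNetConfiguration.Builder()
    .seed(1L)
    .iterations(1000)
    .activation(Activation.TANH)
    .weightInit(WeightInit.XAVIER)
    .learningRate(0.1)
    .regularization(true).l2(1e-4)
    .list();
// Hidden layers, sized according to the linear node distribution.
for (int i = 0; i < layers - 1; i++) {
    builder.layer(i, new DenseLayer.Builder()
        .nIn(nodes(i, sampleSize, categories, layers))
        .nOut(nodes(i + 1, sampleSize, categories, layers))
        .build());
}
// Output layer: one node per category, softmax for class probabilities.
builder.layer(layers - 1, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
    .activation(Activation.SOFTMAX)
    .nIn(nodes(layers - 1, sampleSize, categories, layers))
    .nOut(categories)
    .build());
MultiLayerConfiguration conf = builder.backprop(true).pretrain(false).build();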

Configuring the FFNN with more layers, from 1 to 12, and using linear node distribution, we get the following results:


Apparently, the configuration reaches an optimum with 6 layers, where all performance metrics are at 100%. Adding more layers to the FFNN improves the performance, at least up to 6 layers, and this configuration demonstrates consistent convergence on all performance metrics.

Conclusion


Neural network technology has recently matured to the extent that it has become widely applicable. This post reports on a first experiment with the technology (for the author) and demonstrates that the DL4J system offers a framework that is easily accessible to a mainstream Java audience.

Applying neural network technology to a non-trivial classification problem generates results almost instantly. The exercise is interesting as it demystifies the subject of neural networks and encourages further experimentation.

Working on a more complex data set, and using more powerful hardware, would allow us to examine the impact of changes to other hyperparameters, address other types of machine learning problems and experiment with other types of neural network topologies.
