
Saturday, February 17, 2018

Secrets of the Heart

Learning the beat - short-term prediction of a cardiogram using recurrent neural networks, once more exploring the hyperparameter space of domain and tooling (still DL4J).

Introduction


In a previous post on neural networks, A Not So Random Walk Through Hyperspace, we reported on an experiment with feed forward neural networks (FFNN). This time, we will explore time series prediction using recurrent neural networks (RNN).

By mere coincidence we obtained a 24-hour cardiogram from a family member. The challenge we will take on is to predict the next measurement of the cardiogram based on a series of measurements preceding it.

Accessing the Cardio Data


The magic number of the cardio file is "ISHNE1.0", which quickly leads us to The Format Description of Holter ECGs. A couple of hours of programming and we are able to correctly extract both meta data and sampling data out of the cardio file.
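For illustration, a minimal sketch of the magic number check in Java, assuming, per the format description, that the identifier occupies the first eight bytes of the file:
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public static boolean isIshne(String path) throws IOException {
  try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
    byte[] magic = new byte[8]; // "ISHNE1.0" is assumed to occupy the first 8 bytes
    in.readFully(magic);
    return "ISHNE1.0".equals(new String(magic, StandardCharsets.US_ASCII));
  }
}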

From the meta data we learn that the cardio file contains 2 leads, both generic bipolar. Each lead registers over 21 million data points. With a sampling rate of 250 Hz, this represents almost 24 hours of data. The vertical resolution is 2500. With data ranging from -2.5 mV to 2.5 mV, this would represent a precision of 2 µV, but we are guessing here.

Using gnuplot, and another couple of hours, we are able to visualize any fragment of the 24-hour cardio.

A Sample Holter Cardio Fragment

For the remainder of this post, we will use the first lead of the cardiogram.

Establishing a Baseline


Before we start our experiment, we may reflect on the performance to be expected:
  • What is the best performance we may expect?
  • What would be a good result?

First of all, there is the vertical resolution of the cardio. We already argued it would be in the order of 2 µV on a full range of 5,000 µV. Taking into account a maximum absolute error of measurement of one unit, we will assume the error of the cardio device to be 0.04%.

In addition, we may expect the electrical signal that drives the heart muscles to contain some degree of noise. Lacking any knowledge in the field, we have no way to estimate this noise. Any input on this subject would be warmly welcomed, but for now, we will have to stick with the estimated best case based on the cardio meta data.

Now that we have a lower bound for the error, we may ask what would make a good predictor. Performance of mere earthlings on this subject does not help us, as most cardiologists, although familiar with various patterns in a cardio, are not in the business of predicting individual measurements. We will therefore resort to a number of heuristic predictors, and estimate the average error for each one.

To use the cardio signal as an input to a neural network, the values should be projected on an appropriate numeric interval. From now on, we will apply a linear transformation on the cardio sampling data, projecting the range in µV [-2000,2000] (values outside of this range are rare) to the range [-1,1]. Our heuristic predictors will be introduced and evaluated on this projection too, using the average absolute error expressed as a percentage of the full range.
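A minimal sketch of such a projection, in the shape of the MathUtils.transform helper used in the code later in this post (the exact signature shown here is our own construct):
// map value from [fromLow, fromHigh] linearly onto [toLow, toHigh]
public static double transform(double value, double fromLow, double fromHigh, double toLow, double toHigh) {
  return toLow + (value - fromLow) * (toHigh - toLow) / (fromHigh - fromLow);
}
For example, transform(1000d, -2000d, 2000d, -1d, 1d) yields 0.5.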

Given the fact that the value of a cardio is mostly close to 0, with sparse peaks up and down, one may reason that the uniform random predictor on the interval [-1,1] will show an average absolute percentage error of 25%: for a true value near 0, the expected absolute deviation of a uniform draw on [-1,1] is 0.5, or 25% of the full range of 2. Trying that out on a series of 1,000 sampling points, we get an average error of 24.7695% with a standard deviation of 15.3092%.

Random Predictor Error Distribution Histogram

Here, the label "A" marks the bucket with the average and the label "+S" marks the bucket that contains the average increased by the standard deviation.

The constant 0 predictor is expected to perform much better, making only small errors most of the time. Trying that out on the same series of 1,000 sampling points we get an average error of 4.6094% with a standard deviation of 6.6011%.

Constant 0 Predictor Error Distribution Histogram

Note the difference in horizontal scale of this chart compared to the previous one.

However, there is an even better heuristic approach: the weather tomorrow will most likely be like the weather today. A predictor that repeats the last observed value is likely to do fairly well. Trying that out on the same series of 1,000 sampling points, we get an average error of 0.5972% with a standard deviation of 1.0971%. No doubt, this is the one to beat!

Previous Predictor Error Distribution Histogram

Again, note the difference in horizontal scale.

This results in the following baseline scenarios that will be the playing field for our experiment:

#  Scenario      Description               Average Error  Standard Deviation
1  Observations  Cardio Device             0.0400%        -
2  Previous      Previous Value Predictor  0.5972%        1.0971%
3  Zero          Constant 0 Predictor      4.6094%        6.6011%

Network and Scenario Configuration


When we started this experiment, we selected a point in hyperspace, representing a recurrent neural network configuration and a scenario, and evaluated the performance of a series of configurations changing only one parameter. For a categorical parameter, this gives us the optimal parameter value. For a numerical parameter, this gives us the direction of improvement and, eventually, an optimal parameter value too. We then shifted focus to the optimum found and repeated the process for other parameters.

We understand that the location of the point in hyperspace will have an impact on the conclusion for the varying parameter. Multiple sources advise a random walk, as suggested in part 3 of the excellent Introduction to Convolutional Neural Networks from Stanford. However, we took the approach of walking straight from one optimum to the next.

We tested a wide range of network and scenario parameters. Rather than starting with a lengthy report and postponing the Eureka moment to the very end, we mention the optimal values here and provide more detail at the end: a single LSTM layer followed by one dense layer and an RNN output layer, trained on fixed epoch data with the MSE loss function, an ADAM-family or RMSPROP updater, and a TANH or (Leaky) RELU activation.

Beating the Previous Predictor


Given the baseline introduced before, the key issue is whether we can beat the previous predictor. We reformulate this objective using the following questions:
  • Could an RNN with a single LSTM node, trained using samples of 1 sampling point, replicate the performance of the previous predictor?
  • Would increasing the sample size, and consequently the size of the LSTM layer, improve the performance of the network?

To investigate this, we will use the optimal network configuration described in the previous section and perform tests by varying the sample size from 1 to 50, representing 1/250 to 1/5 seconds of cardiogram data.

An RNN always has a single input node. The first layer is an LSTM layer, with the same size as the cardio sample. The second layer will be a dense layer, with half the number of nodes. The third layer will be an RNN layer, with a single output node.
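As an illustration, the layer section for sample size 35 (the configuration labelled M35-D17-R1 below), expressed in the JSON configuration format introduced later in this post; the activation and loss choices shown here are assumptions based on the optimal values discussed in later sections:
"layers" : [
  { "type" : "LSTM", "size" : 35, "activation" : "TANH" },
  { "type" : "Dense", "size" : 17, "activation" : "TANH" },
  { "type" : "RNN", "size" : 1, "activation" : "IDENTITY", "loss" : "MSE" }
]
With Graves LSTM cells (which include peephole connections), these sizes account for 4·((1+35)·35+35) + 3·35 = 5,285 parameters in the LSTM layer, 35·17+17 = 612 in the dense layer and 17+1 = 18 in the output layer, for a total of 5,915, matching the parameter count reported below.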

Each configuration is trained in 100K iterations, consisting of 400 epochs of 250 batches, each batch containing 32 samples. The performance of the network is evaluated using 1,000 samples not seen during training.
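A minimal sketch of this training schedule, assuming a configuration conf built as in the Configuring the Recurrent Neural Network section and the iterator wrapper described in Creating ND4J DataSets (the class name CardioDataSetIterator is ours):
// import java.util.Random;
// import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
MultiLayerNetwork net = new MultiLayerNetwork(conf);
net.init();
CardioDataSetIterator trainData = new CardioDataSetIterator(lead, new Random(1), 250, 32, sampleSize);
for (int epoch = 0; epoch < 400; epoch++) {
  net.fit(trainData); // one pass over the 250 batches of 32 samples in this epoch
  trainData.reset();
}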

The results are visualized in the following scatter diagram:

Error Distribution by Series Size 1-50

First of all, the performance of the RNN with only a single LSTM node indeed matches the performance of the previous predictor closely. With an average error of 0.4785% and a standard deviation of 0.9408%, it even performs slightly better. Clearly, the network's 19 parameters learn some specifics of the distribution of the next value as a function of the previous value.

In addition, it is clear that the sample size contributes to the performance significantly, with a decreasing marginal improvement. The best performer is M35-D17-R1-L05d-35-S1, using a sample size of 35, consequently having 5,915 parameters, demonstrating an average error of 0.1899% with a standard deviation of 0.2151%.

Training and testing that configuration, the error drops very quickly to a low value and then continues to converge slowly, as illustrated by the MSE (mean square error) chart.

MSE Error Evolution for M35-D17-R1-L05d-35-S1

The chart shows the evolution of the MSE and its average over 10 iterations for the first 10,000 iterations. Note that the training process was done over 100,000 iterations. Although slower, convergence continued over the entire learning process.

The error distribution now becomes very narrow.

Error Distribution for M35-D17-R1-L05d-35-S1

Again, note the difference in horizontal scale.

In order to get an in-depth view on the learning process and see the improvement of the prediction, we evaluate the network on a batch of 32 samples at several stages in the learning process. For each sample in the batch, we offset the prediction of the next observation against the actual observation:

M35-D17-R1-L05d-35-S1 Predictions after 0, 100, 200, 500, 1K and 10K Iterations

Note that the charts do not represent a cardio fragment of 32 sampling points, but rather the predictions for 32 different, randomly chosen cardio fragments of 35 sampling points.

The evaluations are made at specific points during the training phase. The respective performance observations are:

# Iterations  Average Error
0             10.3315%
100           2.0748%
200           1.4205%
500           1.4227%
1,000         1.0113%
10,000        0.3514%

And the average error goes down to 0.1899% after 100K iterations.

These results illustrate once again that learning is initially very fast, and slows down quickly, demonstrating slow convergence. This seems to be consistent with Naftali Tishby's observation on Information Theory of Deep Learning, although we did not verify this explicitly. A very interesting talk by the way.

Why is the performance after 0 iterations (before any training) so much worse than that of the constant 0 predictor? Because, in an attempt to accelerate learning, we are not using uniformly distributed samples. Instead, we have more samples ending in "interesting" regions than a uniform distribution would give. See the section on Corrected Sample Selection for more information.

Why is there still a factor 2 in performance between 10K and 100K iterations? Actually, there is no factor 2. The charts that visualize the intermediate performance are not using uniformly distributed samples and consequently show higher errors. The performance after 100K iterations is measured using uniformly distributed samples, such that it is comparable to the performance reported for the heuristic predictors. Again, we refer to the section on Corrected Sample Selection for more information.

Model Ensemble


Before we get to more technical details, we wish to try out one interesting point made in the Introduction to Convolutional Neural Networks from Stanford about model ensembles.

The term model ensemble refers to an approach to improve the performance of neural networks by a few percent by training multiple independent models and averaging their predictions.

Trying this technique on the top 5 performers we tested, we can confirm that model ensembles tend to perform better than their constituent neural networks.

Configuration          Average Error  Standard Deviation  Ensemble Average Error  Ensemble Standard Deviation
M35-D17-R1-L05d-35-S1  0.1899%        0.2151%             0.1899%                 0.2151%
M48-D24-R1-L05d-48-S1  0.1899%        0.2374%             0.1848%                 0.2192%
M33-D16-R1-L05d-33-S1  0.1912%        0.2388%             0.1839%                 0.2214%
M46-D23-R1-L05d-46-S1  0.1917%        0.2592%             0.1839%                 0.2268%
M42-D21-R1-L05d-42-S1  0.1918%        0.2182%             0.1832%                 0.2197%

In the ensemble columns, for each row, we listed the performance indicators for the model ensemble consisting of all the neural networks up to that row in the table. Indeed, the model ensemble consisting of all 5 neural networks outperforms any of the individual neural networks in the top 5, with an average error of 0.1832%, but with a slightly higher standard deviation of 0.2197%.
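A minimal sketch of the averaging step, assuming the constituent networks are available as DL4J MultiLayerNetwork instances:
// import java.util.List;
// import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
// import org.nd4j.linalg.api.ndarray.INDArray;
public static INDArray ensembleOutput(List<MultiLayerNetwork> models, INDArray input) {
  INDArray sum = null;
  for (MultiLayerNetwork model : models) {
    INDArray output = model.output(input, false); // inference mode
    sum = (sum == null) ? output : sum.add(output);
  }
  return sum.div(models.size()); // element-wise average of the predictions
}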

Intermezzo


In the next few sections, we will address some of the implementation techniques we used to get an RNN configured and running in DL4J. Feel free to skip these sections if you are more interested in the results than in the technical implementation details. You will find more information on hyperparameters in subsequent sections.

Creating ND4J DataSets


The way to provide training data to a neural network in DL4J is to use an implementation of the DataSetIterator interface. A number of implementations are provided by DL4J. For the file classification exercise with FFNNs, we used the implementation provided for CSV-files.

To avoid transforming cardio data into the required format and creating myriads of CSV-files for this purpose, we implemented the DataSetIterator interface by creating a wrapper for the Java representation of a cardio lead. Configuration parameters for the iterator class are:
  • lead: the cardio lead from which to generate samples,
  • random: a random generator to select samples, for reproducibility,
  • epochSize: the number of iterations in an epoch,
  • batchSize: the number of samples in a batch, and
  • sampleSize: the number of data points in a sample.

The key element of the iterator is the next function, where an ND4J DataSet is populated with data for the next training batch:
// import org.nd4j.linalg.api.ndarray.INDArray;
// import org.nd4j.linalg.dataset.DataSet;
// import org.nd4j.linalg.factory.Nd4j;
public DataSet next(int batchSize) {
  // 3D arrays in 'f' (Fortran) order: [batch, depth, time]
  INDArray input = Nd4j.create(new int[]{ batchSize, 1, sampleSize }, 'f');
  INDArray labels = Nd4j.create(new int[]{ batchSize, 1, sampleSize }, 'f');
  INDArray labelsMask = Nd4j.create(new int[]{ batchSize, 1, sampleSize }, 'f');
  for (int series=0; series<batchSize; ++series) {
    // select a random fragment of the cardio lead
    int offset = getRandomOffset(sampleSize);
    // the label is the observation following the fragment, projected onto [-1,1]
    double label = MathUtils.transform(lead.getValue(offset+sampleSize),-2000d,2000d,-1d,1d);
    for (int sample=0; sample<sampleSize; ++sample) {
      double point = MathUtils.transform(lead.getValue(offset+sample),-2000d,2000d,-1d,1d);
      input.putScalar(new int[]{ series, 0, sample }, point);
      labels.putScalar(new int[]{ series, 0, sample }, label);
      labelsMask.putScalar(new int[]{ series, 0, sample }, 0);
    }
    // only the last time step contributes to the error calculation
    labelsMask.putScalar(new int[]{ series, 0, sampleSize-1 }, 1);
  }
  // the third argument is the (unused) input mask
  return new DataSet(input, labels, null, labelsMask);
}

Note that this implementation is specific to an RNN, where we create 3D arrays:
  • The first dimension is for the samples in the batch.
  • The second dimension is for the depth of the sample, in our case 1, as we only use 1 lead of the cardio for this exercise.
  • The third dimension is the time series dimension.

We first allocate the arrays: one for the input, one for the labels and one for the label mask. The label mask is another element that is typical for RNNs. It tells DL4J which elements of the label array should be taken into account for the error calculation (where the mask is 1) and which elements should be ignored (where the mask is 0).

We then define our sample by selecting a random offset in the cardio lead where the sample will start, using the getRandomOffset function. The implementation of this function is not provided, as it is not essential to the focus of this post.

The next step is to populate the arrays using the sample data:
  • The input array is populated with the data points of the cardio lead, observations 0 to sampleSize, exclusive.
  • All elements of the label array are populated with the observation in position sampleSize, which is the observation to be predicted.
  • The label mask array is populated with zeroes.

We then set the last element in the label mask array (at position sampleSize-1) to 1, to specify that only this position should be used to calculate the error.

Finally, we create a new DataSet with the input, label and label mask arrays. The third parameter of the DataSet constructor is the input (feature) mask. As we want all inputs to be used, we provide a null value for this parameter.

Why do we create a label array and a label mask array, when we are using only the last element in the array? Because, for RNNs, DL4J requires the label array to be equal in size to the input array.

Why do we bother to populate other elements besides the last one in the label array? Because we might later use the number of label elements taken into account for the error calculation as a hyperparameter.

Configuring the Recurrent Neural Network


In order to industrialize our test, we defined a configuration object. This is a POJO that specifies both network configuration parameters and scenario configuration parameters. We assume the meaning of the parameters is clear by naming and context.

We provide the code snippet responsible for the creation of the network layers:
// import org.deeplearning4j.nn.conf.layers.DenseLayer;
// import org.deeplearning4j.nn.conf.layers.GravesLSTM;
// import org.deeplearning4j.nn.conf.layers.Layer;
// import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
int nIn = configuration.network.inputSize;
int nOut = 0;
int count = 0;
for (NetworkLayer networkLayer : configuration.network.layers) {
  nOut = networkLayer.size;
  Layer layer = null;
  if ("LSTM".equals(networkLayer.type.name())) {
    // recurrent layer with (Graves) LSTM cells
    layer = new GravesLSTM.Builder().nIn(nIn).nOut(nOut).activation(networkLayer.activation).build();
  } else if ("RNN".equals(networkLayer.type.name())) {
    // recurrent output layer, carrying the loss function
    layer = new RnnOutputLayer.Builder(networkLayer.loss).nIn(nIn).nOut(nOut).activation(networkLayer.activation).build();
  } else {
    // plain feed forward (dense) layer
    layer = new DenseLayer.Builder().nIn(nIn).nOut(nOut).activation(networkLayer.activation).build();
  }
  builder.layer(count, layer);
  nIn = nOut; // the outputs of this layer feed the next one
  count++;
}

This code is able to create LSTM (long short-term memory) layers, dense layers and RNN layers. For an RNN, the first (few) layers typically are LSTM layers, followed by a number of dense layers and, finally, a single RNN output layer.

A typical configuration part of our experiment, expressed in JSON, would be:
{
  "seed" : 1,
  "network" : {
    "optimization" : "STOCHASTIC_GRADIENT_DESCENT",
    "initialization" : "XAVIER",
    "learningRates" : [
      { "iteration" : 0, "rate" : 0.05 },
      { "iteration" : 2000, "rate" : 0.025 },
      { "iteration" : 4000, "rate" : 0.005 },
      { "iteration" : 6000, "rate" : 0.0025 }
    ],
    "updater" : "NESTEROVS",
    "normalization" : 0.5,
    "inputSize" : 1,
    "layers" : [
      { "type" : "LSTM", "size" : 250, "activation" : "TANH" },
      { "type" : "LSTM", "size" : 125, "activation" : "TANH" },
      { "type" : "Dense", "size" : 62, "activation" : "SIGMOID" },
      { "type" : "RNN", "size" : 1, "activation" : "IDENTITY", "loss" : "MSE" }
    ]
  },
  "scenario" : {
    "sampleSize" : 250,
    "batchSize" : 32,
    "epochSize" : 100,
    "scenarioSize" : 100
  }
}

Most configuration parameters will speak for themselves, assuming some familiarity with neural networks. In this case, we have chosen a sample size of 250 sampling points, representing 1 second of cardiogram data. The learning rate of the network decreases from 5% to 0.25% in steps at specific iterations.
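As an aside, a sketch of how such a stepped schedule could map onto the 0.x-style DL4J builder used in this post, assuming the schedule-based learning rate policy:
// import java.util.HashMap;
// import java.util.Map;
// import org.deeplearning4j.nn.conf.LearningRatePolicy;
Map<Integer, Double> schedule = new HashMap<>(); // iteration -> learning rate
schedule.put(0, 0.05);
schedule.put(2000, 0.025);
schedule.put(4000, 0.005);
schedule.put(6000, 0.0025);
builder.learningRateDecayPolicy(LearningRatePolicy.Schedule)
       .learningRateSchedule(schedule);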

The network consists of 4 layers (not counting the input layer): two LSTM layers with TANH activation, one dense layer with SIGMOID activation and an RNN output layer with loss function MSE (mean square error).

Note that, for an RNN, the number of nodes on the input layer is always 1, as time series data is offered sequentially to the network.

Intermezzo


In the remaining sections, we will discuss a number of hyperparameters to find the optimal value. Note that these observations depend on the problem at hand. As stated before, they will also differ depending on the "location" in hyperspace where the parameter is being tested.

Corrected Sample Selection


Looking at the visual representation of a cardiogram, one can see that the value of the cardiogram is mostly close to zero, with periodic peaks up and down. Although these peaks represent only a fraction of the cardiogram, about 20%, they seem to be the "interesting" regions in the cardiogram. The essence of what has to be learned is in the peak regions.

We therefore tested learning with a correction on the uniform random sample generation: forcing a fraction of the samples, typically 50%, to end in a peak region, hoping to accelerate the learning process.
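A hypothetical sketch of such a corrected selection (the real getRandomOffset implementation is not shown in this post, and the peak detection threshold used here is an assumption):
private int getCorrectedRandomOffset(int sampleSize) {
  double fraction = 0.5;   // fraction of samples forced to end in a peak region
  double threshold = 500d; // µV amplitude assumed to indicate a peak
  if (random.nextDouble() >= fraction) {
    return getRandomOffset(sampleSize); // plain uniform selection
  }
  while (true) { // redraw until the observation to predict lies in a peak region
    int offset = getRandomOffset(sampleSize);
    if (Math.abs(lead.getValue(offset + sampleSize)) > threshold) {
      return offset;
    }
  }
}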

Although we did observe accelerated learning in some cases, the results were not consistent. In fact, for the best performing configurations, corrected sample generation systematically had a negative contribution to the performance of the neural network.

MSE Error Evolution for M40-D20-R1-L05d-40-S1 with and without Correction

Does the error seem to increase after, say, 3,000 iterations? No: the error for the better iterations increases, but the error for the worse iterations decreases. Overall, the average error continues to decrease, and the standard deviation decreases too. See the section on No Overfitting for more information.

One LSTM Layer


In order to evaluate the impact of LSTM nodes, we tried several configurations with the same network topology, varying the second layer to be an LSTM layer or a regular dense layer. Introducing the type of the second layer, LSTM or dense, as a series in the scatter chart, is rather enlightening:

Error Distribution by Sample Size with Layer 2 Type

Networks with a second LSTM layer clearly underperform networks with only one LSTM layer and a second dense layer, or at least learn significantly more slowly.

Does the performance demonstrated in this chart seem a lot worse than that of the optimal configuration? Indeed it does. Remember that we are scanning hyperspace for optimal parameter values here, while the optimal configuration represents all the knowledge gathered during the experiment.

Fixed Epoch Data


According to best practices, the data offered during the training phase should be the same for each epoch. However, we initially started working with variable (new) data for every batch of every iteration. Consequently, we were training on different data for each epoch.
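A minimal sketch of the fixed variant, reusing the iterator from the Creating ND4J DataSets section: generate the batches once, before training, and replay the same list every epoch (the number of epochs is left abstract here):
// import java.util.ArrayList;
// import java.util.List;
// import org.nd4j.linalg.dataset.DataSet;
List<DataSet> epochData = new ArrayList<>();
for (int i = 0; i < epochSize; i++) {
  epochData.add(iterator.next(batchSize)); // generated once, before training
}
for (int epoch = 0; epoch < epochs; epoch++) {
  for (DataSet batch : epochData) { // identical batches in every epoch
    net.fit(batch);
  }
}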

We examined the effect of fixed versus variable epoch data on 28 scenarios with sample size 20. We visualize the results as a function of the average error and the standard deviation:

Standard Deviation Distribution by Average Error with Variable/Fixed Epoch Data

Clearly, scenarios using fixed data over different epochs suffer less variance, both in terms of the average error and in terms of the standard deviation, although the difference is not spectacular.

Another observation based on this projection is that there is a strong correlation between the average error and the standard deviation. In fact, this is a general conclusion that applies to the entire set of scenarios, without exception.

To get a better view on the impact on the learning process, we compare the MSE over the last 1,000 iterations of the learning process for two identical scenarios, one using variable and one using fixed epoch data.

MSE Evolution for Variable/Fixed Epoch Data, Final 1K Iterations

The conclusion is the same for all such tests: fixed epoch data results in less variance on the result and continued convergence over iterations. More importantly, when validating the network on unseen data, performance is not negatively affected by the fact that we are using fixed data during training.

The Best Activation Function


The next hyperparameter we investigated was the activation function. There are numerous activation functions to choose from. Initially, we had been using TANH for LSTM layers and SIGMOID for dense layers. Why? Because that's what we see most in the examples discussed in other posts on the subject.

We tested 120 scenarios, with sample sizes 15, 20 and 25, and with different random seeds. The activation functions tested are TANH, SIGMOID, RELU, LEAKY RELU and CUBE. They are applied to all layers of the network, LSTM and dense alike.

Error Distribution by Activation Function

The results are significant. The TANH, RELU and LRELU activation functions give the best results, with the lowest error and the least variance. The CUBE activation function gives the worst result, but also with little variance. The SIGMOID activation function shows significant variance, performing like the best functions for some runs and like the worst for others.

The reason for the bad performance of the SIGMOID and CUBE activation functions is that they have a small gradient, preventing updates of the network parameters, a problem known as the vanishing gradient.

Many other activation functions are available, but most of them are faster numerical approximations of the ones we tested. As this experiment features relatively small neural networks, processing time is not an issue, so there is no need to resort to more efficient approximate implementations of activation functions.

The Best Updater


After a successful investigation into activation functions, we directed our attention to the updaters for backpropagation. There are numerous updaters to choose from. Up to now, we have been using NESTEROVS. Why? Because that's what we found in the DL4J examples.

We tested 40 scenarios, all with sample size 20, and with different random seeds. The updaters tested are NESTEROVS, ADADELTA, ADAM, ADAMAX, SGD, NADAM, RMSPROP and ADAGRAD.
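For reference, a minimal sketch of how the updater is switched per scenario in the 0.x-style builder used in this post:
// import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
// import org.deeplearning4j.nn.conf.Updater;
NeuralNetConfiguration.Builder builder = new NeuralNetConfiguration.Builder()
  .seed(seed)
  .updater(Updater.RMSPROP); // or NESTEROVS, ADADELTA, ADAM, ADAMAX, SGD, NADAM, ADAGRAD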

Error Distribution by Updater

Clearly, there are significant differences between different updaters, both in terms of the average error and in terms of the variance of the results. ADAM and NADAM seem to provide small average errors with the most stability. RMSPROP provides us with the smallest average error and still has a small variance.

The Best Loss Function


We continued the series of successes and investigated the impact of using different loss (error) functions. So far, we had been using MSE (mean square error).

With this investigation, we will be using MAE (mean absolute error), MAPE (mean absolute percentage error) and MSE (mean square error). We tested 30 scenarios with the same network topology for various random seeds, each time with one of these loss functions.

Note that we never changed the loss function for the final evaluation, which is always the absolute percentage error. We varied the loss function exclusively during the learning process.

Error Distribution by Loss Function

No improvements were found here. Apparently, the MSE loss function was the optimal one to begin with.

No Overfitting


As a final illustration, we present the same 50 scenarios of the section on Beating the Previous Predictor, with identical configurations and random seeds, and consequently the same training and evaluation data, each with their respective performance on the same unseen test set after 10K iterations and 100K iterations.

Error Distribution by Series Size 1-50, 10K and 100K Iterations

The systematic overperformance of the neural networks that have had longer training is undeniable. Looking at the evolution of the scores (not shown) and given that we use unseen data for testing, there seems to be no overfitting, at least not up to 100K iterations.

We have been testing different gradient normalization strategies, but the results were inconclusive. Therefore, we have consistently used the ClipL2PerLayer strategy with a threshold of 0.7.

We have also been testing L2 regularization rates, but again, the results were inconclusive, with a slight preference for lower rates. Therefore, we have been using L2 regularization with a rate of 1e-5, consistently.

Conclusions


During our experiment with RNNs to predict a cardiogram we tested about 860 neural network and scenario configurations. The baseline for the problem can easily be set at a challenging level of an average absolute percentage error of about 0.6% using a heuristic predictor that predicts each value to be the previous value. Replication of this result using an RNN is immediate, and doing significantly better is not so hard.

The best configurations realized an error of 0.1832%, a reduction by a factor 3 compared to the baseline. Interestingly, this result is obtained using fairly small fragments of the cardiogram, in the order of 1/10 to 1/5 of a second, far below the period of the cardiogram. Selective training data with a focus on the "harder" regions to learn does not consistently result in faster learning.

A single LSTM layer, followed by a single dense layer and an output RNN layer offers the best results. Additional LSTM layers do not offer any obvious advantage. At best, they result in slower learning, and they typically result in a larger average error.

Fixed epoch data provides less variance in both the average error and its standard deviation. However, convergence is not broken using variable epoch data, but the average error typically converges to a higher value.

In this context, clear conclusions could be drawn for some of the network parameters:
  • The best activation function is Leaky RELU, followed closely by RELU and TANH. SIGMOID and CUBE are bad choices.
  • The best updaters for stochastic gradient descent are ADAM, NADAM and RMSPROP. ADADELTA, SGD and ADAGRAD are bad choices.
  • The best loss function is MSE. MAE and MAPE are suboptimal choices.

Performance of the neural networks continued to improve up to 100K iterations. The effect of even longer training has not been investigated. We did not find consistent optimum values for optimization and regularization.

Friday, August 25, 2017

A Not-so-Random Walk Through Hyperspace

Reporting on a first experiment with neural network technology (DL4J), applying feed forward neural networks (FFNN) for supervised learning of file types based on content fragments, exploring the hyperparameter space of domain and tooling.

Introduction


Artificial intelligence based on neural networks has a long history. However, up to a couple of years ago, success stories have been few and with limited impact.

Over the last few years, neural networks have demonstrated some impressive results, catching up with the long standing promise of performance at the level of human experts in different domains, and even taking us beyond. Interesting presentations in this context include Jeremy Howard's talk on the potential of deep learning and Demis Hassabis' talk on DeepMind and AlphaGo.

Due to these recent successes, neural network technology has gained a renewed interest. Today, different generic neural network platforms are available in the public domain, and a large community of enthusiasts is looking into the technology and its applications.

Time to join the community ...

Finding a Problem


With the solution at hand, we still have to find a problem. Data should be real-life, easily available in large numbers, and categorised.

The problem we will address uses the file system: the goal is to recognise the file type based on (a fragment of) the file content. We have a large number of files on the file system at hand, created by many different authors, and the bulk of them labelled correctly by the file extension.

The data set will consist of all files under a given root directory, using the following constraints:
  • Text files only, determined heuristically based on the content of the file. We will be using the function proposed by Ondra Žižka on stackoverflow for this purpose.
  • Excluding files with extension "txt", as these have strongly varied content.
  • Having at least 100 occurrences for a file type, to allow patterns to be detected.

This results in a collection of 50,665 files with 12 different extensions, distributed as shown below.


Note that the number of xml-files is 41,103, but the bar is clipped to keep the statistics for the other file types visible.

Performance Evaluation


The first parameter that comes to mind for evaluating the performance of a learning system is the accuracy of its output, defined as the ratio of the number of correct classifications over the total number of classifications. As explained in the Deep Learning and Neural Network Glossary and Jason Brownlee's post on Performance Measures, we can improve on the evaluation of performance of a neural network.

The results for a specific class during the test phase can be divided into:
  • True positives (TP): sample correctly accepted.
  • False positives (FP): sample incorrectly accepted.
  • True negatives (TN): sample correctly rejected.
  • False negatives (FN): sample incorrectly rejected.

Given these definitions, we can express the accuracy and introduce additional metrics of performance:
  • Accuracy: A = (TP+TN)/(TP+FP+TN+FN), how often is a classification correct?
  • Precision: P = TP/(TP+FP), how often is the acceptance of a sample correct?
  • Recall: R = TP/(TP+FN), how often is a sample in the class correctly classified?
  • F1 Score: F1 = 2*P*R/(P+R), the harmonic mean of precision and recall.

We will track all four of these metrics, but we will use F1 as a metric for overall performance.
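For completeness, a minimal sketch of these four metrics computed from raw counts; DL4J's Evaluation class reports the same quantities:
public static void printMetrics(long tp, long fp, long tn, long fn) {
  double accuracy  = (double) (tp + tn) / (tp + fp + tn + fn);
  double precision = (double) tp / (tp + fp);
  double recall    = (double) tp / (tp + fn);
  double f1        = 2 * precision * recall / (precision + recall);
  System.out.printf("A=%.4f P=%.4f R=%.4f F1=%.4f%n", accuracy, precision, recall, f1);
}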

Selecting a Neural Network Topology


We will be using the DL4J system, because it has a low barrier to entry and is feature-rich.

An important observation to determine the network topology is related to the type of problem we wish to solve. The Introduction to Deep Learning on DL4J briefly explains:
  • Classification as a learning process where the category is provided as part of the input, also referred to as supervised learning.
  • Clustering as a learning process where the category is not provided as part of the input, but should be discovered by the learning process, also referred to as unsupervised learning.
  • Predictive Analytics as a variant of either classification or clustering, where one of the elements in the sample data reflects the time dimension.

The problem we have chosen to solve clearly is a classification problem.

For this first attempt to get neural networks operational, we will focus on feed forward neural networks (FFNN) with multiple layers. See Feedforward Neural Network on Wikipedia for a general introduction. We borrow the illustration from the DL4J Introduction to Neural Networks:


For the FFNN, we will consider as hyperparameters the number of layers, the organisation of nodes into these layers, the learning speed and some more parameters explained below.

For the chosen problem domain, the parameters include the sample size or the number of characters taken from the file, and the minimum frequency for a file type to be taken into scope.

Jumping into the Unknown


We will take the class CSVExample in the DL4J sample code base as a starting point for training and testing an FFNN. Without having anything else to go on, the example will also help us with the choices of hyperparameter values.
// import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
// import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
// import org.deeplearning4j.nn.conf.layers.DenseLayer;
// import org.deeplearning4j.nn.conf.layers.OutputLayer;
// import org.deeplearning4j.nn.weights.WeightInit;
// import org.nd4j.linalg.activations.Activation;
// import org.nd4j.linalg.lossfunctions.LossFunctions;
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .seed(1L)                       // fixed seed for reproducible weight initialisation
    .iterations(1000)               // number of training iterations
    .activation(Activation.TANH)    // default activation for all layers
    .weightInit(WeightInit.XAVIER)  // Xavier weight initialisation
    .learningRate(0.1)
    .regularization(true).l2(1e-4)  // L2 regularisation against overfitting
    .list()
    .layer(0, new DenseLayer.Builder().nIn(64).nOut(12).build())
    .layer(1, new OutputLayer
        .Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
        .activation(Activation.SOFTMAX).nIn(12).nOut(12).build())
    .backprop(true)
    .pretrain(false)
    .build();
The above code fragment builds a MultiLayerConfiguration providing all the necessary parameters for the network to function. We refer to the full example for more information on how to provide the sample data to the network and run the training and testing processes.

Let's have a look at the parameters that can be provided, and why we've chosen particular values:
  • Seed: The seed is used to initialise weights. We are using a hard-coded seed value to be able to reproduce results.
  • Number of Iterations: An iteration is a cycle in which the weights in the network are adjusted based on the errors obtained using the training data set. We will use 1,000 iterations for our first attempt. No clue whatsoever about convergence behaviour.
  • Activation Function: The activation function determines what output a node will generate based upon its input. We use the hyperbolic tangent function because it ranges from -1 to 1, using the full spectrum (as opposed to the once popular sigmoid function). Also, the function is continuously differentiable (as opposed to the more recent ReLu function), which helps in supporting backpropagation by gradient descent, as mentioned below. See Activation Functions on Wikipedia for more information on activation functions.
  • Weight Initialisation: We are using Xavier for the initial weights in the neural network. This keeps the weights, and consequently the signals that flow through the network, at a reasonable range throughout the layers of the network. See Andy's post on Xavier initialisation for more information.
  • Learning Rate: The learning rate is a measure for how quickly a network abandons old beliefs for new ones, as explained by Conner Davis on Quora. We will start off with 0.1, as most of the examples do.
  • Regularisation: Regularisation artificially discourages complex or extreme explanations of the world, even if they fit what has been observed better, as explained by Sean Owen on Quora. It is used as a measure against overfitting, where a statistical model describes random error or noise instead of the underlying relationships. See Stanford University's Course on Convolutional Neural Networks for Visual Recognition for a more in-depth coverage in the context of neural networks. We will activate regularisation for obvious reasons. L2 regularisation is most commonly used.
  • Layer 0: The input layer of the network. The number of inputs, and consequently the number of nodes, equals the sample size, 64 in our case. The number of outputs has to equal the number of inputs of the output layer. The input layer is never counted as a layer. Consequently, a 1-layer network has an input layer and an output layer, but no hidden layers. See also Stanford University's Course on Convolutional Neural Networks for Visual Recognition.
  • Layer 1: The output layer of the network. As we are doing classification, the number of nodes equals the number of categories, 12. The loss function is used for backpropagation, to quantify the error. The negative log-likelihood function is also referred to as cross entropy cost. See the section on Loss Functions in Stanford University's Course on Convolutional Neural Networks for Visual Recognition. You may also want to have a look at the post on Cost Functions used in Neural Networks on Cross Validated. We are using the softmax activation function to make sure the output values can be interpreted as probabilities, one for each category. See also the Softmax function on Wikipedia.
  • Backpropagation: Backpropagation is the process where the weights are updated as a function of correct and incorrect classifications during the learning process. Some parameters provided before are actually related to backpropagation. See Chapter 2 of Michael Nielsen's book Neural Networks and Deep Learning for a theoretical description and Matt Mazur's post on Step by Step Backpropagation for an exhaustive example. DL4J refers to Stanford's wiki on the Backpropagation Algorithm.
  • Pretrain: Pre-training means using the weights of a neural network that was already trained on a different (but similar) data set to initialise the neural network, as explained in a post on pretraining a neural network on Cross Validated. We will disable pretraining.

Falling on One's Feet


As stated before, our data set consists of 50,665 files. We will be using 90% (45,599 files) for training, and 10% (5,066 files) for testing. To guarantee the independence of our test set, we will never use any of the test files for training.

The first attempt, after 1,000 iterations, gives an accuracy of 99.07%, which is quite impressive. We were lucky with our initial choices for the hyperparameter values. However, other metrics are far less impressive, with a precision of 86.80%, a recall of 70.13% and an F1 score of 77.22%.

The evolution of these scores as a function of the number of iterations is quite instructive:


While the accuracy improves dramatically, reaching over 90% after 15 iterations, and over 99% after 559 iterations, the other metrics evolve much more slowly. The chart however suggests increasing the number of iterations may be useful, as the F1 score still seems to be increasing at 1,000 iterations.

We may add that, on an Intel® Core™ i7-6700HQ CPU @ 2.60GHz × 8, the training process took just below 2 minutes, including data load, model initialisation, and an evaluation after each iteration. Removing the evaluation finishes the learning process in 77 seconds.

Another interesting instrument, providing an alternative view on the performance of the neural network, is the confusion matrix. The confusion matrix shows the actual categories in the rows, and the number of classifications by the network per category in the columns.

After our first run, the confusion matrix is

            csv  project  java  xml    html  groovy  xsd  sh  properties
csv           3        0     0      0     0       0    0   0           1
project       0        0     0     10     0       0    0   0           0
java          1        0    32     10     0       0    1   0           3
xml           2        0     0  4,603     0       0    0   0           1
html          0        0     0      0    43       0    0   0           0
groovy        0        0     0      0     0       1    0   0           8
xsd           0        0     0     16     0       0   32   0           0
sh            0        0     0      0     0       0    0   1           0
properties    1        0     0      3     0       0    0   0          16


We observe an important mix-up with all project files and many xsd files being classified as xml files, and essentially, they are. We also see that almost all groovy files are classified as properties files, most probably because they start with some assignment statements.

A final interesting observation is that the evolution of precision and recall, and consequently the F1 score, is not smooth but rather bumpy. We can see for example a sudden decrease in precision after iteration 161, from 56.68% to 47.38%, followed by a sudden increase after iteration 194, both in precision, from 48.92% to 65.59%, and in recall, from 31.25% to 42.36%.

Analysis of the confusion matrix before and after shows that the decrease in precision is due to a change in classification of all 43 html files in the test set, from category xml to project. Both classifications are incorrect. However, as the total number of xml files is very high compared to other categories, the relative expression of the error significantly decreases the precision. A later change in classification which causes the same 43 html files to be correctly classified gives a significant increase in both precision and recall.

This observation also relates to the impressively fast increase in accuracy demonstrated in this example. As the distribution of file types is extremely uneven, with 4,606 out of 5,066 files being xml files, learning to classify xml files correctly will get the accuracy above 90%. This also explains the relative smoothness of the accuracy curve over iterations in comparison with other examples of learning processes based on neural networks, as the accuracy of xml classification is very dominant.

Getting Oriented


With the term hyperspace, we refer to the space generated by all hyperparameters. In the previous topic, we chose a first point in hyperspace to start off with, based on a little intuition and, no doubt, a lot of ignorance.

Now that we have an anchor point, we will explore hyperspace in some more detail. Obviously, we risk getting caught in a combinatorial explosion of possibilities. In order to get oriented, we will scan hyperspace selectively, by choosing a discrete number of values around our starting point for each dimension in hyperspace, effectively scanning a multi-dimensional cross in hyperspace.

However, the first question that comes to mind is where we would end up by running more iterations in the learning process.

Longer Learning


Well, here it is: with the same seed value of 1L, after 5,000 iterations, we get an accuracy of 99.76%, a precision of 78.33%, a recall of 96.94% and an F1 score of 89.84%.


Although the accuracy still improves with the number of iterations, there is no consistent convergence for other metrics. Running the network with different initialisations (read different random seeds) confirms this: there is no consistent convergence of all performance metrics towards an optimum with this configuration.

Bigger Samples


Common sense suggests one of the more important parameters is the sample size, the number of bytes taken from the file content to be used for classification. To investigate, we run the train and test cycle using sample sizes ranging from 32 to 128 bytes, in steps of 16. The difference between these variants is illustrated by their performance graph:


Clearly, the size of the sample is an important factor: the larger the sample, the better all performance metrics become, as expected. For the best run, obtained with a sample size of 128, we get accuracy 99.87%, precision 93.47%, recall 98.08% and F1 score 94.70%.

The tests have been executed with different seed values for the initialisation of weights. Because the volatility of the results of individual tests is high, depending on the seed value, 5 test runs have been scheduled for each parameter value, where the results with the lowest and highest F1 score have been discarded. The chart displays the average performance metrics of the 3 remaining results. We will use this technique for the remainder of this post.

As a side experiment, inspired by Bruno Stecanella's post on Naive Bayes Classifiers on MonkeyLearn, we did the same test using Naive Bayes classification. The test showed an accuracy of approximately 92%, with an optimum sample size around 64.

For the remainder of the post, we will continue to work with a sample size of 64, mainly because in multi-layer configurations, the number of weights quickly grows with the sample size, as illustrated below.

More Samples


In order to observe patterns, sufficient data should be available. It is therefore important to make sure that, for each file type, a significant number of files is available in the training set. Up to now, we have been working with a minimum frequency of 100, which gave fairly good results.

Running the FFNN with different data sets, where the minimum frequency is 25, 50 and 100, respectively, confirms our expectation:


For the remainder of the post, we will continue to work with a minimum frequency of 100, as before.

Multi-Layer Topology


So far, all our tests have been executed using a 1-layer FFNN. As we are working with deep learning technology, we should look at the impact of working with multiple layers of nodes as well.

In designing a multi-layer configuration, we have to decide on the number of nodes for each layer, which is again jumping in the unknown. The constraints available are:
  • The number of nodes in the input layer is equal to the sample size.
  • The number of nodes in the output layer is equal to the number of categories.
  • The total number of weights should be far below the number of samples. We cannot tune a number of weights that is larger than the number of samples. Our intuition says that we should have sufficient margin here.
  • Some experts state that the number of nodes should not exceed the number of inputs. We will keep the number of nodes between the number of inputs and the number of outputs for each individual layer.

Given these constraints, we will run a test with 3 different configurations:
  • Min: The number of nodes on hidden layers is the minimum of the number of nodes on the input and output layers, 12 in our case.
  • Linear: The number of nodes on hidden layers evolves linearly from the number of nodes on the input layer to the number of nodes on the output layer, from 64 to 12 in our case.
  • Max: The number of nodes on hidden layers is the maximum of the number of nodes on the input and output layers, 64 in our case.

Testing these distributions on a 3-layer configuration results in the following performance metrics:


Although the sample is small and has poor coverage, we will continue with the linear node distribution model. According to this model, the number of nodes N(i) for layer i is given by:
N(i) = #B - i * (#B - #C) / #L
Here, #B is the number of bytes in the sample, or the sample size, #C is the number of categories, and #L is the number of layers. The input layer gets label 0, hidden layers are labelled 1 to #L-1, and the output layer gets label #L.
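A minimal sketch that computes the layer sizes and the weight count for a given number of layers; rounding to the nearest integer reproduces the table below:
// import java.util.Arrays;
public static void linearTopology(int sampleSize, int categories, int layers) {
  int[] nodes = new int[layers + 1];
  for (int i = 0; i <= layers; i++) {
    // N(i) = #B - i * (#B - #C) / #L, rounded to the nearest integer
    nodes[i] = (int) Math.round(sampleSize - i * (sampleSize - categories) / (double) layers);
  }
  int weights = 0;
  for (int i = 0; i < layers; i++) {
    weights += nodes[i] * nodes[i + 1]; // fully connected, biases not counted
  }
  System.out.println(Arrays.toString(nodes) + " -> " + weights + " weights");
}
For example, linearTopology(64, 12, 3) prints [64, 47, 29, 12] -> 4719 weights.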

With a sample size of 64 and 12 categories, the following table provides an overview of the number of nodes for each layer, and consequently, the number of weights to be tuned in the network, for up to 10 layers:

# Layers  L0  L1  L2  L3  L4  L5  L6  L7  L8  L9  L10  # Weights
1         64  12                                       768
2         64  38  12                                   2,888
3         64  47  29  12                               4,719
4         64  51  38  25  12                           6,452
5         64  54  43  33  22  12                       8,187
6         64  55  47  38  29  21  12                   9,854
7         64  57  49  42  34  27  19  12               11,586
8         64  58  51  45  38  32  25  19  12           13,394
9         64  58  52  47  41  35  29  24  18  12       14,893
10        64  59  54  48  43  38  33  28  22  17  12   16,624

Clearly, the number of weights quickly increases with the number of layers, and we need to be cautious not to configure too many nodes, in particular as we only have 45,599 samples in the training set.

Configuring the FFNN with more layers, from 1 to 12, and using linear node distribution, we get the following results:


Apparently, the configuration reaches an optimum with 6 layers, where all performance metrics are at 100%. Adding more layers to the FFNN improves the performance, at least up to 6 layers, and this configuration demonstrates consistent convergence on all performance metrics.

Conclusion


Neural network technology recently has matured to the extent that it has become widely applicable. This post reports on a first experiment with the technology (for the author) and demonstrates that the DL4J system offers a framework that is easily accessible to a mainstream Java audience.

Applying neural network technology to a non-trivial classification problem generates results almost instantly. The exercise is interesting as it demystifies the subject of neural networks and encourages further experimentation.

Working on a more complex data set, and using more powerful hardware, would allow us to examine the impact of changes to other hyperparameters, address other types of machine learning problems and experiment with other types of neural network topologies.