Saturday, February 17, 2018

Secrets of the Heart

Learning the beat - Short term prediction of a cardiogram using recurrent neural networks, once more exploring the hyperparameter space of domain and tooling (still DL4J).

Introduction


In a previous post on neural networks, A Not So Random Walk Through Hyperspace, we reported on an experiment with feed forward neural networks (FFNN). This time, we will explore time series prediction using recurrent neural networks (RNN).

By mere coincidence we obtained a 24-hour cardiogram from a family member. The challenge we will take on is to predict the next measurement of the cardiogram based on a series of measurements preceding it.

Accessing the Cardio Data


The magic number of the cardio file is "ISHNE1.0", which quickly leads us to The Format Description of Holter ECGs. A couple of hours of programming and we are able to correctly extract both meta data and sampling data out of the cardio file.

From the meta data we learn that the cardio file contains 2 leads, both generic bipolar. Each lead registers over 21 million data points. With a sampling rate of 250 Hz, this represents almost 24 hours of data. The vertical resolution is 2500. With data ranging from -2.5 mV to 2.5 mV, this would represent a precision of 2 µV, but we are guessing here.

Using gnuplot, and another couple of hours, we are able to visualize any fragment of the 24-hour cardio.

A Sample Holter Cardio Fragment

For the remainder of this post, we will use the first lead of the cardiogram.

Establishing a Baseline


Before we start our experiment, we may reflect on the performance to be expected:
  • What is the best performance we may expect?
  • What would be a good result?

First of all, there is the vertical resolution of the cardio. We already argued it would be in the order of 2 µV on a full range of 5,000 µV. Assuming a maximum absolute measurement error of one unit, we will take the error of the cardio device to be 2/5,000 = 0.04%.

In addition, we may expect the electrical signal that drives the heart muscles to contain some degree of noise. Lacking any knowledge in the field, we have no way to estimate this noise. Any input on this subject would be warmly welcomed, but for now, we will have to stick with the estimated best case based on the cardio meta data.

Now that we have a lower bound for the error, we might ask what would make a good predictor. Performance of mere earthlings on this subject does not help us, as most cardiologists, although familiar with the various patterns in a cardio, are not in the business of predicting individual measurements. We will therefore resort to a number of heuristic predictors, and estimate the average error for each one.

To use the cardio signal as an input to a neural network, the values should be projected on an appropriate numeric interval. From now on, we will apply a linear transformation on the cardio sampling data, projecting the range in µV [-2000,2000] (values outside of this range are rare) to the range [-1,1]. Our heuristic predictors will be introduced and evaluated on this projection too, using the average absolute error expressed as a percentage of the full range.
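For clarity, this projection is a simple linear rescaling. The sketch below is ours and only assumes the semantics of the MathUtils.transform(value, fromLow, fromHigh, toLow, toHigh) helper that appears in the iterator code later in this post:
// A minimal sketch of the linear rescaling used throughout this post; the signature
// mirrors the MathUtils.transform helper used later (an assumption on our part).
static double transform(double value, double fromLow, double fromHigh, double toLow, double toHigh) {
  return toLow + (value - fromLow) * (toHigh - toLow) / (fromHigh - fromLow);
}

// Example: a sampling point of +500 µV maps to transform(500d, -2000d, 2000d, -1d, 1d) == 0.25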

Given that the value of a cardio is mostly close to 0, with sparse peaks up and down, one may reason that the uniform random predictor on the interval [-1,1] will show an average absolute percentage error of about 25%: for a true value near 0, a uniform guess lies at an expected distance of 0.5, which is 25% of the full range of 2. Trying that out on a series of 1,000 sampling points we get an average error of 24.7695% with a standard deviation of 15.3092%.

Random Predictor Error Distribution Histogram

Here, the label "A" marks the bucket with the average and the label "+S" marks the bucket that contains the average increased by the standard deviation.

The constant 0 predictor is expected to perform much better, making only small errors most of the time. Trying that out on the same series of 1,000 sampling points we get an average error of 4.6094% with a standard deviation of 6.6011%.

Constant 0 Predictor Error Distribution Histogram

Note the difference in horizontal scale of this chart compared to the previous one.

However, there is an even better heuristic approach: the weather tomorrow most likely will be like the weather today. A predictor that repeats the last observed value is likely to do fairly well. Trying that out on the same series of 1,000 sampling points we get an average error of 0.5972% with a standard deviation of 1.0971%. No doubt, this is the one to beat!

Previous Predictor Error Distribution Histogram

Again, note the difference in horizontal scale.
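To make the scoring concrete, the sketch below shows one way such heuristics can be evaluated; it is an illustration of our approach rather than our actual code, and assumes that values holds a series of already projected sampling points in [-1,1], with the error expressed as a percentage of the full range of 2:
// Scores a heuristic predictor as the average absolute error, expressed as a
// percentage of the full range [-1, 1] (width 2). 'values' is assumed to hold
// a series of projected sampling points.
static double averageErrorPct(double[] values, java.util.function.DoubleUnaryOperator predictor) {
  double sum = 0d;
  for (int i = 1; i < values.length; ++i) {
    double predicted = predictor.applyAsDouble(values[i - 1]); // prediction based on the previous value
    sum += Math.abs(predicted - values[i]) / 2d * 100d;
  }
  return sum / (values.length - 1);
}

// java.util.Random rnd = new java.util.Random(1);
// averageErrorPct(values, previous -> rnd.nextDouble() * 2d - 1d); // uniform random predictor
// averageErrorPct(values, previous -> 0d);                         // constant 0 predictor
// averageErrorPct(values, previous -> previous);                   // previous value predictor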

This results in the following baseline scenarios that will be the playing field for our experiment:

#  Scenario      Description               Average Error  Standard Deviation
1  Observations  Cardio Device             0.0400%
2  Previous      Previous Value Predictor  0.5972%        1.0971%
4  Zero          Constant 0 Predictor      4.6094%        6.6011%

Network and Scenario Configuration


When we started this experiment, we selected a point in hyperspace, representing a recurrent neural network configuration and a scenario, and evaluated the performance of a series of configurations changing only one parameter. For a categorical parameter, this would give us the optimal parameter value. For a numerical parameter, this would give us the direction of improvement, and eventually also an optimal parameter value. We shifted focus to the optimal value and repeated this for other parameters.

We understand that the location of the point in hyperspace will have an impact on the conclusion for the varying parameter. Multiple sources advise a random walk, as suggested in part 3 of the excellent Introduction to Convolutional Neural Networks from Stanford. However, we took the approach of walking straight from one optimum to the next.

We tested a wide range of network and scenario parameters. Rather than starting with a lengthy report and postponing the Eureka moment to the very end, we apply the optimal values right away and provide more detail towards the end of this post.

Beating the Previous Predictor


Given the baseline introduced before, the key issue is whether we can beat the previous predictor. We reformulate this objective using the following questions:
  • Could an RNN with a single LSTM node that was trained using samples of 1 sampling point replicate the performance of the previous predictor?
  • Would increasing the sample size, and consequently the size of the LSTM layer, improve the performance of the network?

To investigate this, we will use the optimal network configuration described in the previous section and perform tests by varying the sample size from 1 to 50, representing 1/250 to 1/5 seconds of cardiogram data.

Our RNNs have a single input node, since one value is presented per time step. The first layer is an LSTM layer with the same size as the cardio sample. The second layer is a dense layer with half that number of nodes. The third layer is an RNN output layer with a single output node.

Each configuration is trained in 100K iterations, consisting of 400 epochs of 250 batches, each batch containing 32 samples. The performance of the network is evaluated using 1,000 samples not seen during training.

The results are visualized in the following scatter diagram:

Error Distribution by Series Size 1-50

First of all, the performance of the RNN with only a single LSTM node indeed matches the performance of the previous predictor closely. With an average error of 0.4785% and a standard deviation of 0.9408%, it even performs slightly better. Clearly the network's 19 parameters are learning something about the distribution of the next value as a function of the previous one.

In addition, it is clear that the sample size contributes to the performance significantly, with a decreasing marginal improvement. The best performer is M35-D17-R1-L05d-35-S1, using a sample size of 35, consequently having 5,915 parameters, demonstrating an average error of 0.1899% with a standard deviation of 0.2151%.
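The quoted parameter counts can be reproduced with a little arithmetic. The sketch below assumes DL4J's GravesLSTM parameter count (four gates plus three peephole weights per unit) and a dense layer of half the LSTM size, with an assumed minimum of one node so that the single-node case yields the 19 parameters mentioned above:
// Approximate parameter count for an Mn-Dm-R1 topology with a single input feature,
// assuming a GravesLSTM layer (4 gates plus 3 peephole weights per unit).
static long parameterCount(int lstmSize) {
  int denseSize = Math.max(1, lstmSize / 2);                        // e.g. 35 -> 17 (assumed minimum of 1)
  long lstm  = 4L * lstmSize * (1 + lstmSize + 1) + 3L * lstmSize;  // input, recurrent and bias weights plus peepholes
  long dense = (long) lstmSize * denseSize + denseSize;             // weights plus biases
  long out   = denseSize + 1L;                                      // RNN output layer with a single node
  return lstm + dense + out;                                        // parameterCount(35) == 5915, parameterCount(1) == 19
}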

Training and testing that configuration, the error drops very quickly to a low value and then continues to converge slowly, as illustrated by the MSE (mean square error) chart.

MSE Error Evolution for M35-D17-R1-L05d-35-S1

The chart shows the evolution of the MSE and its average over 10 iterations for the first 10,000 iterations. Note that the training process was done over 100,000 iterations. Although slower, convergence continued over the entire learning process.

The error distribution now becomes very narrow.

Error Distribution for M35-D17-R1-L05d-35-S1

Again, note the difference in horizontal scale.

In order to get an in-depth view on the learning process and see the improvement of the prediction, we evaluate the network on a batch of 32 samples at several stages in the learning process. For each sample in the batch, we offset the prediction of the next observation against the actual observation:

M35-D17-R1-L05d-35-S1 Predictions after 0, 100, 200, 500, 1K and 10K Iterations

Note that the charts do not represent a cardio fragment of 32 sampling points, but rather the predictions for 32 different, randomly chosen cardio fragments of 35 sampling points.

The evaluations are made at specific points during the training phase. The respective performance observations are:

# Iterations  Average Error
0             10.3315%
100           2.0748%
200           1.4205%
500           1.4227%
1,000         1.0113%
10,000        0.3514%

And the average error goes down to 0.1899% after 100K iterations.

These results illustrate once again that learning is initially very fast, and slows down quickly, demonstrating slow convergence. This seems to be consistent with Naftali Tishby's observation on Information Theory of Deep Learning, although we did not verify this explicitly. A very interesting talk by the way.

Why is the performance after 0 iterations (before any training) so much worse than that of the constant 0 predictor? Because, in an attempt to accelerate learning, we are not using uniformly distributed samples. Instead, we have more samples ending in "interesting" regions than a uniform distribution would give. See the section on Corrected Sample Selection for more information.

Why is there still a factor 2 in performance between 10K and 100K iterations? Actually, there is no factor 2. The charts that visualize the intermediate performance are not using uniformly distributed samples and consequently show higher errors. The performance after 100K iterations is measured using uniformly distributed samples, such that it is comparable to the performance reported for the heuristic predictors. Again, we refer to the section on Corrected Sample Selection for more information.

Model Ensemble


Before we get to more technical details, we wish to try out one interesting point made in the Introduction to Convolutional Neural Networks from Stanford about model ensembles.

The term model ensemble refers to an approach that improves the performance of neural networks by a few percent by training multiple independent models and averaging their predictions.
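In our setting this boils down to averaging the scalar predictions of several trained networks. A minimal sketch, assuming models holds trained DL4J MultiLayerNetwork instances and input is a feature array shaped as in the iterator shown later in this post:
// Minimal sketch of ensemble averaging: the ensemble prediction is the element-wise
// mean of the predictions of the constituent models for the same input.
static INDArray ensemblePrediction(List<MultiLayerNetwork> models, INDArray input) {
  INDArray sum = null;
  for (MultiLayerNetwork model : models) {
    INDArray prediction = model.output(input);                 // prediction of one constituent model
    sum = (sum == null) ? prediction.dup() : sum.add(prediction);
  }
  return sum.div(models.size());                               // average over the ensemble
}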

Trying this technique on the top 5 performers we tested, we can confirm that model ensembles tend to perform better than their constituent neural networks.

Configuration          Average Error  Standard Deviation  Ensemble Average Error  Ensemble Standard Deviation
M35-D17-R1-L05d-35-S1  0.1899%        0.2151%             0.1899%                 0.2151%
M48-D24-R1-L05d-48-S1  0.1899%        0.2374%             0.1848%                 0.2192%
M33-D16-R1-L05d-33-S1  0.1912%        0.2388%             0.1839%                 0.2214%
M46-D23-R1-L05d-46-S1  0.1917%        0.2592%             0.1839%                 0.2268%
M42-D21-R1-L05d-42-S1  0.1918%        0.2182%             0.1832%                 0.2197%

In the ensemble columns, for each row, we listed the performance indicators for the model ensemble consisting of all the neural networks up to that row in the table. Indeed, the model ensemble consisting of all 5 neural networks outperforms any of the individual neural networks in the top 5, with an average error of 0.1832%, but with a slightly higher standard deviation of 0.2197%.

Intermezzo


In the next few sections, we will address some of the implementation techniques we used to get an RNN configured and running in DL4J. Feel free to skip these sections if you are more interested in the results than in the technical implementation details. You will find more information on hyperparameters in subsequent sections.

Creating ND4J DataSets


The way to provide training data to a neural network in DL4J is to use an implementation of the DataSetIterator interface. A number of implementations are provided by DL4J. For the file classification exercise with FFNNs, we used the implementation provided for CSV files.

To avoid transforming cardio data into the required format and creating myriads of CSV files for this purpose, we implemented the DataSetIterator interface by creating a wrapper for the Java representation of a cardio lead. Configuration parameters for the iterator class are:
  • lead: the cardio lead from which to generate samples,
  • random: a random generator to select samples, for reproducibility,
  • epochSize: the number of iterations in an epoch,
  • batchSize: the number of samples in a batch, and
  • sampleSize: the number of data points in a sample.

The key element of the iterator is the next function, where an ND4J DataSet is populated with data for the next training batch:
public DataSet next(int batchSize) {
  // 3D arrays in 'f' (column-major) order: [batch, features, time steps]
  INDArray input = Nd4j.create(new int[]{ batchSize, 1, sampleSize }, 'f');
  INDArray labels = Nd4j.create(new int[]{ batchSize, 1, sampleSize }, 'f');
  INDArray labelsMask = Nd4j.create(new int[]{ batchSize, 1, sampleSize }, 'f');
  for (int series=0; series<batchSize; ++series) {
    // random starting position of this sample in the cardio lead
    int offset = getRandomOffset(sampleSize);
    // the label is the observation right after the sample, projected onto [-1, 1]
    double label = MathUtils.transform(lead.getValue(offset+sampleSize),-2000d,2000d,-1d,1d);
    for (int sample=0; sample<sampleSize; ++sample) {
      // each input point is projected onto [-1, 1] as well
      double point = MathUtils.transform(lead.getValue(offset+sample),-2000d,2000d,-1d,1d);
      input.putScalar(new int[]{ series, 0, sample }, point);
      labels.putScalar(new int[]{ series, 0, sample }, label);
      labelsMask.putScalar(new int[]{ series, 0, sample }, 0);
    }
    // only the last time step contributes to the error calculation
    labelsMask.putScalar(new int[]{ series, 0, sampleSize-1 }, 1);
  }
  // the third argument is the feature mask: null means all inputs are used
  return new DataSet(input, labels, null, labelsMask);
}

Note that this implementation is specific to an RNN, where we create 3D arrays:
  • The first dimension is for the samples in the batch.
  • The second dimension is for the depth of the sample, in our case 1, as we only use 1 lead of the cardio for this exercise.
  • The third dimension is the time series dimension.

We first allocate the arrays, one for the input, one for the labels and one for the label mask. The label mask is another element that is typical for RNNs. It instructs DL4J which elements in the label array should be taken into account for the error calculation, where the mask is 1, and which elements should be ignored, where the mask is 0.

We then define our sample by selecting a random offset in the cardio lead where the sample will start, using the getRandomOffset function. The implementation of this function is not provided, as it is not essential to the focus of this post.
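For completeness, a plain uniform version could look like the hypothetical sketch below; it assumes a lead.getSize() accessor for the number of data points in the lead and is not our actual implementation:
// Hypothetical uniform offset selection: keep the sample of sampleSize points plus the
// label at position offset + sampleSize inside the lead (lead.getSize() is assumed).
private int getRandomOffset(int sampleSize) {
  return random.nextInt(lead.getSize() - sampleSize);
}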

The next step is to populate the arrays using the sample data:
  • The input array is populated with the data points of the cardio lead, observations 0 to sampleSize, exclusive.
  • All elements of the label array are populated with the observation in position sampleSize, which is the observation to be predicted.
  • The label mask array is populated with zeroes.

We then set the last element in the label mask array (at position sampleSize-1) to 1, to specify that only this position should be used to calculate the error.

Finally, we create a new DataSet with the input, label and label mask arrays. The third parameter of the DataSet constructor is the input (feature) mask. As we want all inputs to be used, we provide a null value for this parameter.

Why do we create a label array and a label mask array, when we are using only the last element in the array? Because, for RNNs, DL4J requires the label array to have the same size as the input array.

Why do we bother to populate the other elements of the label array besides the last one? Because we might later use the number of label elements taken into account for the error calculation as a hyperparameter.

Configuring the Recurrent Neural Network


In order to industrialize our test, we defined a configuration object. This is a POJO that specifies both network configuration parameters and scenario configuration parameters. We assume the meaning of the parameters is clear by naming and context.

We provide the code snippet responsible for the creation of the network layers:
int nIn = configuration.network.inputSize; // 1 for a single-lead time series
int nOut = 0;
int count = 0;
for (NetworkLayer networkLayer : configuration.network.layers) {
  nOut = networkLayer.size;
  Layer layer = null;
  if ("LSTM".equals(networkLayer.type.name())) {
    // LSTM layer (GravesLSTM in DL4J)
    layer = new GravesLSTM.Builder().nIn(nIn).nOut(nOut).activation(networkLayer.activation).build();
  } else if ("RNN".equals(networkLayer.type.name())) {
    // recurrent output layer, carrying the loss function
    layer = new RnnOutputLayer.Builder(networkLayer.loss).nIn(nIn).nOut(nOut).activation(networkLayer.activation).build();
  } else {
    // plain feed forward (dense) layer
    layer = new DenseLayer.Builder().nIn(nIn).nOut(nOut).activation(networkLayer.activation).build();
  }
  builder.layer(count, layer);
  nIn = nOut; // the output size of this layer is the input size of the next one
  count++;
}

This code can create LSTM (long short-term memory) layers, dense layers and RNN output layers. For an RNN, the first (few) layers are typically LSTM layers, followed by a number of dense layers and finally a single RNN output layer.
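For context, the builder variable used in the snippet above is a DL4J list builder. Roughly, and assuming the DL4J 0.9.x API that was current at the time, the surrounding code might look like the sketch below; this is an approximation, not a verbatim copy of our setup:
// Sketch of the surrounding configuration code (DL4J 0.9.x style), shown here only
// to illustrate where 'builder' comes from and how the network is finally created.
NeuralNetConfiguration.ListBuilder builder = new NeuralNetConfiguration.Builder()
  .seed(configuration.seed)
  .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
  .weightInit(WeightInit.XAVIER)
  .updater(Updater.NESTEROVS)
  .learningRate(0.05)
  .list();

// ... the layer loop shown above adds the layers to 'builder' ...

MultiLayerNetwork network = new MultiLayerNetwork(builder.build());
network.init();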

A typical configuration part of our experiment, expressed in JSON, would be:
{
  "seed" : 1,
  "network" : {
    "optimization" : "STOCHASTIC_GRADIENT_DESCENT",
    "initialization" : "XAVIER",
    "learningRates" : [
      { "iteration" : 0, "rate" : 0.05 },
      { "iteration" : 2000, "rate" : 0.025 },
      { "iteration" : 4000, "rate" : 0.005 },
      { "iteration" : 6000, "rate" : 0.0025 }
    ],
    "updater" : "NESTEROVS",
    "normalization" : 0.5,
    "inputSize" : 1,
    "layers" : [
      { "type" : "LSTM", "size" : 250, "activation" : "TANH" },
      { "type" : "LSTM", "size" : 125, "activation" : "TANH" },
      { "type" : "Dense", "size" : 62, "activation" : "SIGMOID" },
      { "type" : "RNN", "size" : 1, "activation" : "IDENTITY", "loss" : "MSE" }
    ]
  },
  "scenario" : {
    "sampleSize" : 250,
    "batchSize" : 32,
    "epochSize" : 100,
    "scenarioSize" : 100
  }
}

Most configuration parameters will speak for themselves, assuming some familiarity with neural networks. In this case, we have chosen a sample size of 250 sampling points, representing 1 second of cardiogram data. The learning rate of the network decreases from 5% to 0.25% in steps at specific iterations.

The network consists of 4 layers (not counting the input layer): two LSTM layers with TANH activation, one dense layer with SIGMOID activation and an RNN output layer with loss function MSE (mean square error).

Note that, for this single-lead time series, the number of nodes on the input layer is 1, as one value is offered to the network per time step.

Intermezzo


In the remaining sections, we will discuss a number of hyperparameters and try to find their optimal values. Note that these observations depend on the problem at hand. As stated before, they will also differ depending on the "location" in hyperspace where the parameter is being tested.

Corrected Sample Selection


Looking at the visual representation of a cardiogram, one can see that the value of the cardiogram is mostly close to zero, with periodic peaks up and down. Although these peaks represent only a fraction of the cardiogram, about 20%, they seem to be the "interesting" regions in the cardiogram. The essence of what has to be learned is in the peak regions.

We therefore tested learning with a correction on the uniform random sample generation: forcing a fraction of the samples, typically 50%, to end in a peak region, hoping to accelerate the learning process.
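A hypothetical sketch of such a correction, on top of a uniform getRandomOffset, is shown below; the peak threshold and the fraction are illustrative assumptions, not the values of our actual implementation:
// Hypothetical sketch of corrected sample selection: with probability peakFraction,
// offsets are redrawn until the label (at offset + sampleSize) lies in a peak region.
private int getCorrectedOffset(int sampleSize, double peakFraction, double peakThresholdMicroVolts) {
  int offset = getRandomOffset(sampleSize);
  if (random.nextDouble() < peakFraction) {                            // e.g. 0.5: half the samples end in a peak
    while (Math.abs(lead.getValue(offset + sampleSize)) < peakThresholdMicroVolts) {
      offset = getRandomOffset(sampleSize);                            // redraw until a peak region is hit
    }
  }
  return offset;
}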

Although we did observe accelerated learning in some cases, the results were not consistent. In fact, for the best performing configurations, corrected sample generation systematically had a negative contribution to the performance of the neural network.

MSE Error Evolution for M40-D20-R1-L05d-40-S1 with and without Correction

Does the error seem to increase after, say, 3,000 iterations? No: the error for the better iterations increases, but the error for the worse iterations decreases. Overall, the average error continues to decrease, and the standard deviation decreases too. See the section on No Overfitting for more information.

One LSTM Layer


In order to evaluate the impact of LSTM nodes, we tried several configurations with the same network topology, varying the second layer to be an LSTM layer or a regular dense layer. Introducing the type of the second layer, LSTM or dense, as a series in the scatter chart, is rather enlightening:

Error Distribution by Sample Size with Layer 2 Type

Networks with a second LSTM layer clearly underperform networks with a single LSTM layer followed by a dense layer, or at the very least they learn significantly more slowly.

Doesn't the performance demonstrated in this chart seem a lot worse than that of the optimal configuration? Indeed. Remember we are scanning hyperspace for optimal parameter values here, while the optimal configuration represents all the knowledge gathered during the experiment.

Fixed Epoch Data


According to best practices, the data offered during the training phase should be the same for each epoch. However, we initially worked with variable (new) data for every batch of every iteration, and were consequently training on different data in each epoch.
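One simple way to obtain fixed epoch data with the iterator described earlier is to re-seed the offset generator on every reset, so that each epoch replays exactly the same samples. The sketch below is hypothetical; the cursor and randomSeed fields are assumptions:
// Hypothetical sketch: re-seeding the offset generator in reset() makes every epoch
// draw the same sequence of offsets, i.e. fixed epoch data.
@Override
public void reset() {
  cursor = 0;                                // assumed per-epoch iteration counter
  random = new java.util.Random(randomSeed); // replay the same offsets in every epoch
}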

We examined the effect of fixed versus variable epoch data on 28 scenarios with sample size 20. We visualize the results as a function of the average error and the standard deviation:

Standard Deviation Distribution by Average Error with Variable/Fixed Epoch Data

Clearly, scenarios using fixed data over different epochs suffer less variance, both in terms of the average error and in terms of the standard deviation, although the difference is not spectacular.

Another observation based on this projection is that there is a strong correlation between the average error and the standard deviation. In fact, this is a general conclusion that applies to the entire set of scenarios, without exception.

To get a better view on the impact on the learning process, we compare the MSE over the last 1,000 iterations of the learning process for two identical scenarios, one using variable and one using fixed epoch data.

MSE Evolution for Variable/Fixed Epoch Data, Final 1K Iterations

The conclusion is the same for all such tests: fixed epoch data results in less variance on the result and continued convergence over iterations. More importantly, when validating the network on unseen data, performance is not negatively affected by the fact that we are using fixed data during training.

The Best Activation Function


The next hyperparameter we investigated was the activation function. There are numerous activation functions to choose from. Initially, we had been using TANH for LSTM layers and SIGMOID for dense layers. Why? Because that is what we see most in the examples discussed in other posts on the subject.

We tested 120 scenarios, with sample sizes 15, 20 and 25, and with different random seeds. The activation functions tested are TANH, SIGMOID, RELU, LEAKY RELU and CUBE. They are applied to all layers of the network, LSTM and dense alike.

Error Distribution by Activation Function

The results are significant. The TANH, RELU and LRELU activation functions give the best results, with the lowest error and the least variance. The CUBE activation function gives the worst results, but also with little variance. The SIGMOID activation function shows significant variance, performing like the best functions on some runs and like the worst on others.

The poor performance of the SIGMOID and CUBE activation functions is likely due to their small gradient over much of their domain, which stalls updates of the network parameters, a problem known as the vanishing gradient.

Many other activation functions are available, but most of them are faster numerical approximations of the functions we tested. Since this experiment features relatively small neural networks and processing time is not an issue, there is no need to resort to more efficient approximate implementations of activation functions.

The Best Updater


After a successful investigation into activation functions, we directed our attention to the updaters for backpropagation. There are numerous updaters to choose from. Up to now, we have been using NESTEROVS. Why? Because that is what we found in the DL4J examples.

We tested 40 scenarios, all with sample size 20, and with different random seeds. The updaters tested are NESTEROVS, ADADELTA, ADAM, ADAMAX, SGD, NADAM, RMSPROP and ADAGRAD.

Error Distribution by Updater

Clearly, there are significant differences between different updaters, both in terms of the average error and in terms of the variance of the results. ADAM and NADAM seem to provide small average errors with the most stability. RMSPROP provides us with the smallest average error and still has a small variance.

The Best Loss Function


We continued the series of successes and investigated the impact of using different loss (error) functions. So far, we had been using MSE (mean square error).

In this investigation, we compare MAE (mean absolute error), MAPE (mean absolute percentage error) and MSE (mean square error). We tested 30 scenarios with the same network topology and various random seeds, each time with one of these loss functions.

Note that we never changed the evaluation metric, which is always the average absolute percentage error. We varied the loss function exclusively during the learning process.

Error Distribution by Loss Function

No improvements were found here. Apparently, the MSE loss function was the optimal one to begin with.

No Overfitting


As a final illustration, we present the same 50 scenarios of the section on Beating the Previous Predictor, with identical configurations and random seeds, and consequently the same training and evaluation data, each with their respective performance on the same unseen test set after 10K iterations and 100K iterations.

Error Distribution by Series Size 1-50, 10K and 100K Iterations

The systematic overperformance of the neural networks that have had longer training is undeniable. Looking at the evolution of the scores (not shown) and given that we use unseen data for testing, there seems to be no overfitting, at least not up to 100K iterations.

We also tested different gradient normalization strategies, but the results were inconclusive. Therefore, we have consistently been using ClipL2PerLayer normalization with a threshold of 0.7.

We have also been testing L2 regularization rates, but again, the results were inconclusive, with a slight preference for lower rates. Therefore, we have been using L2 regularization with a rate of 1e-5, consistently.

Conclusions


During our experiment with RNNs to predict a cardiogram we tested about 860 neural network and scenario configurations. The baseline for the problem can easily be set at a challenging level of an average absolute percentage error of about 0.6% using a heuristic predictor that predicts each value to be the previous value. Replication of this result using an RNN is immediate, and doing significantly better is not so hard.

The best individual configurations realized an average error of about 0.19%, and the top-5 model ensemble brought this down to 0.1832%, a reduction by roughly a factor of 3 compared to the baseline. Interestingly, this result is obtained using fairly small fragments of the cardiogram, in the order of 1/10 to 1/5 of a second, far below the period of the cardiogram. Selective training data with a focus on the "harder" regions to learn does not consistently result in faster learning.

A single LSTM layer, followed by a single dense layer and an output RNN layer offers the best results. Additional LSTM layers do not offer any obvious advantage. At best, they result in slower learning, and they typically result in a larger average error.

Fixed epoch data provides less variance in both the average error and its standard deviation. Convergence is not broken when using variable epoch data, but the average error typically settles at a higher value.

In this context, clear conclusions could be drawn for some of the network parameters:
  • The best activation function is Leaky RELU, followed closely by RELU and TANH. SIGMOID and CUBE are bad choices.
  • The best updaters for stochastic gradient descent are ADAM, NADAM and RMSPROP. ADADELTA, SGD and ADAGRAD are bad choices.
  • The best loss function is MSE. MAE and MAPE are suboptimal choices.

Performance of the neural networks continued to improve up to 100K iterations. The effect of even longer training has not been investigated. We did not find consistent optimum values for optimization and regularization.
