Sunday, August 8, 2021

The Magic of Growing Trees

Determining the estimated time of arrival (ETA) in parcel delivery using random forest decision trees. Using a do-it-yourself approach and digging into the unexpected complexity of the seemingly straightforward task of building decision trees.

Introduction


This time, the self-imposed challenge is directly related to our professional activity: the application of machine learning to predict the duration of a process. We are keeping up appearances by addressing a similar problem in another domain, parcel delivery, something we all can relate to in these days of lockdown and quarantine.

In what follows we will report on an experiment in machine learning, estimating the time of arrival of a parcel, including generating the data, training the model and evaluating the results.

The choice for decision trees was inspired by the promise of the model in classical machine learning, and by a reluctance to go neural once more. The results are encouraging, but they also bring to light the hidden complexity in what seems to be a simple, straightforward task.

Estimated Time of Arrival


Nowadays, parcel delivery is a domain that almost everyone can relate to. The data involved is easy to understand. For the exercise at hand, we will use the following parameters:

Parameter   | Type      | Description
Origin      | City      | The city where the parcel delivery is initiated.
Destination | City      | The city where the parcel is to be delivered.
Departure   | Timestamp | The moment on which the delivery of the parcel starts.
Carrier     | Carrier   | The carrier that will take care of the delivery process.
Express     | Boolean   | A flag indicating whether this is an express delivery.

Most of the parameters are of a categorical nature, with a value set that is a fairly small collection of labels: cities, carriers ... Express, as a boolean, will be considered of a categorical nature as well. Departure is a numerical parameter that will be handled in a significantly different way.

As we are not close with any of the major parcel delivery companies around, we will have to fabricate data ourselves. The data generation process is explained at the end of the post, under Data Generation. For now, we will merely list the possible values for each parameter type:
  • City: Cities used for the exercise are Brussels, Paris, Berlin, Amsterdam and London.
  • Carrier: Parcels will be handled and delivered by BPost, PostNL and DHL.
  • Timestamp: Data covers parcel deliveries initiated over the course of a single week.
  • Boolean: As said, boolean will be considered a categorical type, with values Yes and No.

For our exercise, these parameters constitute the collection of independent variables. Note that all of these parameters are known before the delivery process starts, which is important if we want our predictions to be useful at all.

What we will be predicting is the time of arrival:

Parameter | Type      | Description
Arrival   | Timestamp | The moment on which the parcel will be delivered at its destination.

The arrival timestamp is the dependent or target variable, the one that will be predicted as a function of the independent variables. It is commonly referred to as the estimated time of arrival, ETA for short.

Data Preparation


As for all machine learning exercises, we need to consider preparing our data for the machine learning algorithm that will be used, with the objective of making learning easier. In the context of the parcel delivery data, and with decision trees in mind, we will transform the data in a couple of ways.

First of all, we will encode variables of type timestamp into two different variables, one specifying the day of the week (DOW), and one specifying the time of day (TOD). For the Departure variable, we will introduce the variables Departure[DOW] (of the categorical type DayOfWeek) and Departure[TOD] (of the numerical type Time).

This is because we expect the parcel delivery process to behave differently on weekdays versus weekend days, for example. Learning the concept of day of the week based on timestamps alone will add significantly to the complexity of the problem. The general idea here is to put any knowledge we have about the problem domain into the way the data is presented, in order to maximize the performance of the machine learning algorithm.

Secondly, we will decrease the granularity of variables of type Time by truncating them to the hour. At first, this seemed to simplify the problem. We realized, however, that it obscures information and may make the learning problem more difficult. At best, it will decrease the degree of determinism in the data, which is something we are actually looking for.

Anyway, we will be working with the variable Departure[TOD][H], holding the hour of departure, as an integer value ranging from 0 to 23.

Finally, to avoid the complexity of representing date and time information in the dependent variable, and to simplify cluster representation, we replace the variable Arrival by Duration[H], the difference between Arrival and Departure: the number of hours it takes to deliver the parcel, equally truncated to the hour.
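To make this concrete, here is a minimal sketch of these transformations, assuming departure and arrival are available as java.time.ZonedDateTime values in the local time zone; the class and method names are illustrative, not taken from the actual implementation:

import java.time.DayOfWeek;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;

// Minimal sketch of the data preparation described above; names are illustrative.
final class DataPreparation {

    static DayOfWeek departureDow(ZonedDateTime departure) {
        return departure.getDayOfWeek();                      // Departure[DOW]
    }

    static int departureTodHour(ZonedDateTime departure) {
        return departure.getHour();                           // Departure[TOD][H], 0 to 23
    }

    static long durationHours(ZonedDateTime departure, ZonedDateTime arrival) {
        return ChronoUnit.HOURS.between(                      // Duration[H] = Arrival - Departure,
                departure.truncatedTo(ChronoUnit.HOURS),      // both truncated to the hour
                arrival.truncatedTo(ChronoUnit.HOURS));
    }
}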

Visual Inspection


Before starting machine learning exercises on the data, we will get to know the data better. This is another best practice in machine learning.

Let's generate a data set of 10K data elements to start with, and have a look at the distribution of the hour of departure, variable Departure[TOD][H], from Berlin and London:
The distribution histogram shows the number of parcels shipped from Berlin and London, per hour, over a 24-hour timeframe, with time expressed in UTC. Interesting observations are:
  • There is no shipping activity during the night: departures start from 06h00 and stop around 20h00 local time.
  • Shipping in London lags behind one hour compared to Berlin, due to the time zone difference.
  • The shipment volume is high in the morning and lower in the afternoon.
  • Shipment distribution has a similar pattern in both cities, with somewhat more activity in Berlin.

Next, let's look at the distribution of the hour of arrival, variable Arrival[TOD][H], in Berlin and London:
Here we can make a number of interesting observations as well:
  • Again, there is no activity during the night: deliveries start at 06h00 and stop around 18h00 local time.
  • Again, there is a one hour lag in London compared to Berlin, due to the time zone difference.
  • There is a remarkable peak volume during the first hour of activity, local time.
  • The distribution patterns for arrival time for both cities are significantly different.

Finally, let's look at the distribution of the duration of the delivery process, in hours, variable Duration[H], for shipments from Berlin to London and Paris:
The interesting observations are:
  • The duration of the delivery process spans a wide range from 1 hour to 94 hours, where the pattern of 4 24-hour cycles is clearly visible.
  • There is again a shift, this time between London and Paris, but far larger than the time zone difference of 1 hour.
  • There is a significant concentration at a duration of 1 hour; otherwise, the delivery process takes at least 10 hours to complete.

There's a lot more to it, in terms of patterns in the data. We refer to the end of this post, where more detail is provided in the section on Data Generation.

By the way, the charts have been created using Gnuplot, a bit old school, but still an excellent charting tool. It allowed us to fully automate the process of chart generation, as you will be able to observe in what follows.

Decision Trees


Decision trees can be used for classification or regression. We will use decision trees as a model to classify data, a clustering method, that is. There are plenty of sources on the Internet that explain decision trees in detail, including the algorithms to construct them. You may start with the Decision Tree Learning page on Wikipedia. We will not go there but merely provide a sneak preview on what is to come, using a minimalistic example of a decision tree that we will be generating.

For the first example we start from the 10K data set introduced earlier, and show a decision tree restricted to depth 2:
Non-leaf nodes represent a classification decision and are in orange. Leaf nodes represent a cluster and are in green. Note that non-leaf nodes represent a cluster as well, ignoring the more detailed classification realized by their child nodes.

Let's walk through the characteristics of the nodes featuring in this example, starting with the root node:
  • The root node represents the entire data set, 10K data elements, with durations ranging from 1 hour to 99 hours, and a duration of 28 hours as a cluster representative.
  • The decision criterion is using the day of the week of the departure of the parcel, variable Departure[DOW], with values Thursday, Friday and Saturday for selecting the left side, and all other values for selecting the right side.
  • The threshold is the value of the target variable that divides its value set so as to realize the optimal split of the data elements.
  • For the cluster representative, the accuracy realized with this node is 1.65%, with an average error of 26.25 hours.

We will elaborate on the definition and the meaning of these parameters in more detail in the sections on Growing Trees and Measuring Performance.

The objective of splitting the data set based on a criterion on one of the independent variables is to reduce the overall error of the decision tree. Looking at the left node:
  • With 4488 data elements, the left node represents somewhat below half of the data set, with durations still ranging from 1 hour to 99 hours, and a duration of 61 hours as a cluster representative.
  • The accuracy of classification is 0.96% and the average error is 29.99 hours.

Looking at the right node:
  • The right node represents the other half of the data elements, 5512 to be precise, with durations ranging from 1 hour to 57 hours, and a duration of 25 hours as a cluster representative.
  • The accuracy of classification is 5.42% and the average error is 12.96 hours.

Note that, although the performance of the left node is inferior to that of the root node, the right node largely compensates for this, and overall the classification performance is improved by the split. Let's verify that by calculating the weighted average error of the leaf nodes: $$\frac{4488 \cdot 29.99 + 5512 \cdot 12.96}{4488+5512} = 20.60 < 26.25$$

By the way, the tree representation has been created using the dot language, provided by the Graphviz project. It is an excellent tool for graph visualization that we have been using in earlier posts as well, again supporting full automation of the production of visuals.

The Benchmark


An absolute benchmark for the performance of any algorithm can be found in the classification matrix of the data.

The classification matrix is a multi-dimensional matrix with one dimension for each independent variable, considering the possible values in the most fine-grained way. The cells in the matrix hold the values for the target variable encountered for the cluster in the training set, where the median value is taken as the representative of the cluster. The prediction of the value for the target variable for a new data element is then defined as the representative of the matching cell for that data element.
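As an illustration, here is a minimal sketch of such a matrix, reduced to a map from the combination of independent variable values to the durations observed for that combination, with the median as cluster representative. The key encoding and all names are hypothetical, not the actual implementation:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the classification matrix described above; names are illustrative.
final class ClassificationMatrix {

    private final Map<String, List<Integer>> cells = new HashMap<>();

    private static String key(String origin, String destination, String departureDow,
                              int departureTodHour, String carrier, boolean express) {
        return origin + "|" + destination + "|" + departureDow + "|"
                + departureTodHour + "|" + carrier + "|" + express;
    }

    // Training: collect the observed duration in the matching cell.
    void add(String origin, String destination, String departureDow,
             int departureTodHour, String carrier, boolean express, int durationHours) {
        cells.computeIfAbsent(
                key(origin, destination, departureDow, departureTodHour, carrier, express),
                k -> new ArrayList<>()).add(durationHours);
    }

    // Prediction: the (upper) median of the observed durations is the cluster representative.
    Integer predict(String origin, String destination, String departureDow,
                    int departureTodHour, String carrier, boolean express) {
        List<Integer> durations = cells.get(
                key(origin, destination, departureDow, departureTodHour, carrier, express));
        if (durations == null || durations.isEmpty()) return null;   // unseen combination
        List<Integer> sorted = new ArrayList<>(durations);
        Collections.sort(sorted);
        return sorted.get(sorted.size() / 2);
    }
}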

In our case, the matrix dimensions, with their respective number of values, are:

Parameter         | Cardinality
Origin            | 5
Destination       | 5
Departure[DOW]    | 7
Departure[TOD][H] | 24
Carrier           | 3
Express           | 2
Product           | 25,200

With the cardinalities given in this table, the matrix counts 25,200 cells defining just as many clusters.

When creating such a matrix for a range of training sets, covering sizes from 2K to 100K data elements, and visualizing the average error on an equally sized test set, we get an understanding of the performance of the matrix method:
In green, you see the average error on the training set. In violet, you see the average error on the test set, averaged out over 5 different test runs, each run using a different test set.

The performance of the matrix method increases with the size of the training set, as can be expected. When the size of the training set is around four times the size of the matrix, we get an average error as low as 1.25 hours.

As this method uses all the information that is provided in the training set, it will provide the optimal performance, the best possible prediction that can be made using the training set. The remaining error is then merely a reflection of the non-determinism in the training data: identical values for all independent variables that yield different values for the target variable. No algorithm can systematically outperform the matrix method.

An average error of 1.25 hours basically characterizes the non-determinism of the data, and not so much the accuracy of the algorithm.

The attentive reader may have noticed that the average error on the training set is increasing rather than decreasing. This is because, when the number of data elements in the training set is much lower than the number of cells in the matrix, duplicate values for all independent variables will be rare, and the non-determinism in the data will barely show. However, as the size of the training set grows, the non-determinism will become apparent accordingly, and the average error on the training data will converge to reflect exactly that.

For the same reason, the average error on the training set will initially increase, come to a maximum and only then decrease and converge as the size of the data set grows.

Now, why bother at all to find alternative algorithms when the classification matrix cannot be outperformed?

The answer is straightforward: Even for this simple problem domain, with a limited number of values for categorical variables and strong discretisation of numerical variables, the size of the matrix is substantial. One can imagine the growth of the matrix when more dimensions or more values on a number of dimensions are required, exploding quickly to a level that cannot be practically supported anymore.

Decision trees offer an alternative that is similar in the sense that it will find clusters with maximum homogeneity, but it starts selectively with the independent variable with the highest return, narrows down the cluster sizes by repeating this process, and consequently works with sparser, coarse-grained clusters. This way, decision trees have the potential of realizing a performance that, if not equally good, comes close to the optimum.

Growing Trees


The construction of a decision tree is, in principle, fairly simple. We will keep talking about performance, and sometimes about error, in an abstract way until clarifying performance metrics in the section on Measuring Performance.

One starts with a root node that represents the entire training set. Then, a leaf-node is transformed into a non-leaf node, giving it a left- and right-hand child node, recursively, until a certain stop condition is reached. This condition typically relates to the performance on the data represented by the node: when the performance is satisfactory, no split operation is performed on the node and it remains a leaf node.
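The recursion can be sketched as follows, with the error measure and the split search kept abstract for now (they are clarified below and in the section on Measuring Performance); all names are illustrative, not the actual implementation:

import java.util.List;
import java.util.function.Function;
import java.util.function.ToDoubleFunction;

// Skeleton of the recursive tree construction described above; illustrative only.
final class GrowSketch<E> {

    static final class Node<E> {
        final List<E> cluster;          // the data elements represented by this node
        Node<E> left, right;            // children, null for a leaf node
        Node(List<E> cluster) { this.cluster = cluster; }
    }

    static final class Split<E> {
        final List<E> left, right;
        Split(List<E> left, List<E> right) { this.left = left; this.right = right; }
    }

    final ToDoubleFunction<List<E>> error;          // performance of a cluster, e.g. its RMSE
    final Function<List<E>, Split<E>> bestSplit;    // optimal split, or null if none improves

    GrowSketch(ToDoubleFunction<List<E>> error, Function<List<E>, Split<E>> bestSplit) {
        this.error = error;
        this.bestSplit = bestSplit;
    }

    Node<E> grow(List<E> cluster, double maxError, int minClusterSize) {
        Node<E> node = new Node<>(cluster);
        if (error.applyAsDouble(cluster) <= maxError) return node;   // accurate enough: stop
        if (cluster.size() <= minClusterSize) return node;           // too small to split: stop
        Split<E> split = bestSplit.apply(cluster);
        if (split == null) return node;                              // no improving split: stop
        node.left = grow(split.left, maxError, minClusterSize);      // recurse on both sides
        node.right = grow(split.right, maxError, minClusterSize);
        return node;
    }
}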

The difficult part is finding the optimal criterion to split a node: which independent variable is going to be used, and what is an appropriate value for the split condition? A number of aspects are of particular importance here:
  • Is the split sufficiently balanced in terms of the data volume represented by each side?
  • Does the split guarantee sufficient homogeneity on both sides in terms of the target variable?
  • In other words, is the split optimal in terms of performance improvement?

Bluntly stated, the solution is to consider all possible ways to split the data set represented by a node based on the target variable, with maximum homogeneity; for each of those splits, to consider all possible independent variables; for each variable, to consider all possible split conditions; and then to select the split condition with the best performance improvement.
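Abstracting away how the candidate split conditions are enumerated (one set of candidates per target threshold and per independent variable, as described above), the selection of the best condition by weighted error can be sketched as follows; a minimal, illustrative sketch, not the actual implementation:

import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Sketch of the exhaustive split search described above; illustrative only.
final class SplitSearchSketch {

    interface Condition<E> {
        boolean left(E element);    // true: element goes to the left-hand side of the split
    }

    static <E> Condition<E> bestCondition(List<E> cluster,
                                          Iterable<Condition<E>> candidates,
                                          ToDoubleFunction<List<E>> error) {
        Condition<E> best = null;
        double bestError = Double.POSITIVE_INFINITY;
        for (Condition<E> condition : candidates) {
            List<E> left = new ArrayList<>();
            List<E> right = new ArrayList<>();
            for (E element : cluster) {
                (condition.left(element) ? left : right).add(element);
            }
            if (left.isEmpty() || right.isEmpty()) continue;   // trivial split: skip
            double weighted = (left.size() * error.applyAsDouble(left)
                    + right.size() * error.applyAsDouble(right)) / cluster.size();
            if (weighted < bestError) {                        // keep the best improvement
                bestError = weighted;
                best = condition;
            }
        }
        return best;
    }
}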

As illustrated in the section on Decision Trees, the performance of a split is defined as the weighted average of the performance of both partitions of the split. Consider the example presented in that section, trying all possible splits based on the target variable:
Clearly, the minimum error is realized with a split of the target variable on 32 hours, which we will refer to as the threshold. As said, finding this point requires, for each potential threshold, a scan of all independent variables and all possible split conditions on each individual variable, in search of the lowest error.

In this case, the optimal split is on the day of the week of departure, variable Departure[DOW], which is categorical. Trying all possible split conditions effectively means trying all possible combinations of days, also known as the power set of the set of days in a week.

With 7 days in a week, the power set contains \(2^7 = 128\) possible combinations. Given the left-right symmetry of the decision tree, only 64 of them have to be effectively considered. Dropping the trivial split of no days versus all days of the week leaves 63 combinations to go. The lowest error is found for the combination { Thu, Fri, Sat }.
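As an illustration, the sketch below enumerates the 63 non-trivial day-of-week subsets with a simple bitmask; keeping Monday on the right-hand side eliminates the complement of each subset, so every split is generated exactly once. This is an illustrative sketch, not the actual implementation:

import java.time.DayOfWeek;
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;

// Enumerate the 63 non-trivial day-of-week splits (left-hand subsets only).
final class DaySubsets {

    static List<EnumSet<DayOfWeek>> leftHandSubsets() {
        DayOfWeek[] days = DayOfWeek.values();                 // MONDAY .. SUNDAY
        List<EnumSet<DayOfWeek>> subsets = new ArrayList<>();
        for (int mask = 1; mask < (1 << days.length) - 1; mask++) {
            if ((mask & 1) != 0) continue;                     // skip subsets containing Monday
            EnumSet<DayOfWeek> subset = EnumSet.noneOf(DayOfWeek.class);
            for (int d = 0; d < days.length; d++) {
                if ((mask & (1 << d)) != 0) subset.add(days[d]);
            }
            subsets.add(subset);
        }
        return subsets;                                        // 2^6 - 1 = 63 candidate splits
    }
}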

The same logic can be applied for a numerical variable, where different split values for that variable have to be considered. Here, the number of possibilities is not combinatorial, but the scan requires a discretisation of the numerical variable.

First Results


Time to apply all this logic to our data set and have a look at the resulting decision tree. Generating a decision tree on the same 10K training set, limiting the tree depth to 4 levels, we get:
After a first split on the day of the week of departure, the left node is split again on the same variable, this time with threshold 52 hours selecting on Thursday only. Again, this can be seen in the error chart for all possible thresholds:
Other nodes are split based on other variables, using the same procedure to decide on the split condition. Besides green for leaf nodes and orange for non-leaf nodes on a categorical variable, non-leaf nodes on numerical variables are shown in violet. With this decision tree, the average error is around 12.50 hours.

Removing the restriction on the depth of the tree will generate a much larger tree. We define the stop conditions for further refinement to be an error below 1 hour, or a cluster size of 10, whichever is reached first. In addition to these stop conditions, refinement of the decision tree will also stop if the remaining data does not offer room for improvement anymore. That is, if the error of the split configuration is higher than the error of the original node, or if the discriminating capacity of the independent variables would result in a trivial split.

The resulting decision tree is a beautiful monster of around 850 nodes, giving an average error of around 2.32 hours, about one hour off our benchmark. Look at this unreadable rendering to get an idea of the tree structure:
We provide an arbitrary fragment of the decision tree as a readable image to give some insight on the details of the decision process:
You can see all of the different stop conditions in action here.

One may wonder if such a huge decision tree isn't suffering from overfitting, capturing the specific patterns in the training data rather than the generic patterns of the problem domain. We will elaborate on overfitting in the section Return to Hyperspace.

In order to render the monstrous decision tree, we had to switch the dot conversion in Graphviz to scalable vector graphics (SVG), because of limitations on the supported size for bitmaps. We then used an online converter to render the SVG as a bitmap.

Measuring Performance


There are different ways to quantify the performance of a trained model. In the case of decision trees, we already have to compare the performance of different ways to split the data set at hand while the tree is being created.

The accuracy reported in the previous sections was defined as the fraction of correct estimates, expressed as a percentage: $$\frac{\#\{i\in\{1,\dots,N\}\mid d_e(i)=d_a(i)\}}{N}.$$ Here, \(N\) is the number of data elements in the data set, \(d_e(i)\) is the expected duration for data element \(i\), the representative of the cluster obtained using the decision tree, and \(d_a(i)\) is the actual duration for data element \(i\), the one that is part of the data set, the label, so to speak.

As we have been working with the discretisation of a numerical, continuous variable, this definition of accuracy is a very pessimistic measure of performance. This is the reason why the accuracy of even the leaf nodes in the decision trees we have encountered so far is poor.

Rather than using accuracy as a metric, we have been using the root-mean-square error (RMSE) to evaluate the performance of a decision tree. It is defined as $$\sqrt{\frac{1}{N}\sum\limits_{i=1}^{N}(d_e(i)-d_a(i))^2},$$ with the same meaning for the symbols used.

This error measurement punishes estimates that are far off, as it squares the difference with the actual value.

However, when reporting on the performance of a decision tree, we have been using the mean absolute error, defined as $$\frac{1}{N}\sum\limits_{i=1}^{N}\left|d_e(i)-d_a(i)\right|,$$ still with the same meaning for the symbols used.

This error measurement is easier to understand, as it provides the average difference between the estimate and the actual. It aligns better with the intuition that we have for an error.
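For reference, the three metrics can be computed as follows for parallel lists of expected and actual durations, the cluster representative versus the label; a minimal sketch, not the actual implementation:

import java.util.List;
import java.util.stream.IntStream;

// Minimal sketch of the metrics defined above; illustrative only.
final class Metrics {

    static double accuracy(List<Integer> expected, List<Integer> actual) {
        long exact = IntStream.range(0, expected.size())
                .filter(i -> expected.get(i).intValue() == actual.get(i).intValue())
                .count();
        return 100.0 * exact / expected.size();          // fraction of exact estimates, in percent
    }

    static double rmse(List<Integer> expected, List<Integer> actual) {
        double meanSquare = IntStream.range(0, expected.size())
                .mapToDouble(i -> {
                    double d = expected.get(i) - actual.get(i);
                    return d * d;                        // squaring punishes estimates that are far off
                })
                .average().orElse(0.0);
        return Math.sqrt(meanSquare);
    }

    static double meanAbsoluteError(List<Integer> expected, List<Integer> actual) {
        return IntStream.range(0, expected.size())
                .mapToDouble(i -> Math.abs(expected.get(i) - actual.get(i)))
                .average().orElse(0.0);                  // average difference, in hours
    }
}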

Note that the use of the root-mean-square error as a measurement for performance is not common for decision trees. More common methods include information gain, an entropy-based metric, or Gini impurity.

We started our exercise with information gain as a metric, but due to a stubborn bug in the implementation, we switched to RMSE. As RMSE seemed to work equally well for the exercise, we never switched back to information gain.

Return to Hyperspace


As with all machine learning exercises, we may wonder what the optimal values are for those few parameters of a model, the ones that we do not train, the ones we call hyperparameters. We did that rather exhaustively in the context of artificial neural networks, in our previous posts on A Not-so-Random Walk Through Hyperspace and Secrets of the Heart.

For decision trees, the obvious candidate hyperparameters are the performance metrics and the stop conditions. As explained before, we did not take the time to try other performance metrics than the RMSE. However, we did play with the stop conditions, to some degree.

It is common sense to stop splitting nodes whenever the performance of the split configuration does not improve on that of the original node, which is a stop condition that we have been applying systematically. Other stop conditions we applied are on the maximum error and the minimum cluster size.

With the maximum error, we refer to the error that is allowed at the level of a node in the decision tree. Whenever the average error on the data related to that node lies beneath that maximum, we will not split that node any further. Increasing the maximum error step by step from 1 hour to 20 hours, we get an unexpected result:
The increasing average error of the decision tree on both the training and the test data is expected: the threshold for the stop condition on the error should be very low. We do not, however, have an explanation for the unexpected change in the speed of deterioration of the error somewhere around 13 hours.

A more interesting parameter, however, is the minimum cluster size. How small do we need to make the clusters, while keeping overfitting in mind? We therefore test the performance of the decision trees as a function of the minimum cluster size. Whenever the number of data elements associated with a cluster drops beneath that minimum, we will not split that node any further. Increasing the minimum cluster size step by step from 1 data element to 30 data elements, we get:
The interpretation of this chart is not straightforward. The average error on the test set seems to remain constant up to a minimum cluster size of 20 data elements, with somewhat decreasing volatility, in particular since the average error is taken as the average over 5 different test sets. After a minimum cluster size of 20 data elements, the performance of the decision trees seems to start deteriorating.

Note that the optimal cluster size is not absolute, but is expected to depend strongly on the problem at hand, in particular on the number of values of the independent variable.

Finally, we wish to investigate what the impact of the size of the training set is on the performance of the decision tree. We test the performance of the decision trees on a training set with size ranging from 1K to 50K data elements, using a maximum error of 1 hour and a minimum cluster size of 10 data elements.
The average error measured on the test set systematically lies somewhat above the average error measured on the training set, as can be expected. In addition to that, the average error keeps getting smaller, up to 1.77 hours for a training set of 50K data elements, getting closer and closer to the error defined as The Benchmark. Testing on an extended range seems to be the next step here.

The systematically good performance on the test sets, again averaged out over 5 test sets, is a clear indication that there is no overfitting. At least not on the training data. With fabricated data, we cannot exclude that the model is overfitting on the data generation method.

Growing a Forest


By now, it is a well known fact that the performance of a machine learning algorithm can be improved and stabilized by training multiple models and combining the estimates these models provide. We already tried that once with artificial neural networks in our previous post Secrets of the Heart.

In the context of machine learning, the term random forest is often used to describe this technique. I guess the trees were the inspiration for the forest.

Trying it out on our decision trees, training 10 different decision trees on disjoint subsets of the training data set, we get:

Tree     | Average Error (hours)
0        | 2.67
1        | 2.63
2        | 2.89
3        | 2.65
4        | 2.64
5        | 2.72
6        | 2.76
7        | 2.79
8        | 2.53
9        | 2.74
Ensemble | 2.49

The estimate of the random forest is determined as the arithmetic average of the estimates of the individual decision trees in the forest. The average error for each individual decision tree is determined as an average on 5 different test sets, as is the average error of the forest.

The result is in line with expectations. The average error for all the trees together is 2.70 hours, while the ensemble achieves an average error of 2.49 hours. With that score, the forest outperforms each one of the individual decision trees.
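The ensemble estimate itself is trivial to compute; a minimal sketch, where each tree is reduced to a function that maps a data element to an estimated Duration[H]; illustrative only, not the actual implementation:

import java.util.List;
import java.util.function.ToIntFunction;

// Ensemble estimate: the arithmetic average of the estimates of the individual trees.
final class Forest<E> {

    private final List<ToIntFunction<E>> trees;   // each tree estimates Duration[H] for a data element

    Forest(List<ToIntFunction<E>> trees) {
        this.trees = trees;
    }

    double estimate(E element) {
        return trees.stream()
                .mapToInt(tree -> tree.applyAsInt(element))
                .average()
                .orElse(Double.NaN);              // NaN for an empty forest
    }
}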

Conclusion


All of machine learning is either magic or trivial. It is magic until you understand it, and then it becomes trivial. - after Ernest Rutherford

The decision tree seems to be a good instrument for the estimation of a categorical (or a discretised continuous) dependent variable. It combines a computationally lightweight implementation with good performance in terms of the average error of the estimation. It is also fairly easy to implement.

Using a threshold on the target variable to determine the next optimal split condition for the independent variables is key to building an efficient, balanced decision tree. This technique is relevant for any performance metric used.

The creation of monstrously large decision trees may be expected. This is not necessarily a sign of overfitting. The proclaimed explainability of the estimates made by decision trees therefore seems to be overly optimistic.

In our case, the downside of using generated data is that it will not be as rich in variation as real-world data. On the other hand, an advantage is that we have all the data available that we want, without limitations on the data volume. All in all, fabricated data has allowed us to effectively develop and test the algorithm to build the decision trees, and we are now ready to take it to the real world.

Data Generation


As stated earlier, real-world data on parcel delivery is not available to us, so we have to work with fabricated data, generated, that is.

First, we have to generate randomly distributed data for the independent variables:
  • Departure timestamps are generated over the course of a week, with a higher probability on weekdays as opposed to weekend days, and a still higher probability on Fridays. The time of departure, barring exceptions, is between 06h00 and 20h00, with a higher probability before noon. All this is calculated in the time zone of the country at hand, and truncated to the hour.
  • For generating the Origin and Destination variables, we use a fixed list of five cities, capitals, and the probability to select a city is proportional to the population of the country where the city is located.
  • For generating the Carrier variable, we use a fixed list of three carriers, and the probability to select a carrier is specified in the configuration data. The exact probabilities are not relevant.
  • The ratio of express deliveries is configured as 6%.

There is no correlation between any of the independent variables, by construction, nor in practice.

Then we have to determine the values for the dependent variable as a function of these independent variables, with some variation. The arrival timestamp is calculated from the departure timestamp in a number of steps:
  • A delay is added to leave the origin country, proportional to the area of the country, according to a normal distribution, where the standard deviation is 10% of the mean.
  • If there is a time zone difference between the origin and the destination, a delay of 4 hours per hour of time zone difference is added, varying according to a normal distribution, where the standard deviation is 10% of the mean.
  • A delay is added to arrive in the destination country, proportional to the area of the country, according to a normal distribution, where the standard deviation is 10% of the mean.
  • For express deliveries, these delays are divided by 24.
  • For leaving the origin country and arriving in the destination country, these delays are added taking into account whether the carrier is active during night-time (18h00 to 06h00) and during the weekend (Saturday and Sunday). These properties vary over the different carriers.
  • As parcels are neither delivered during night-time nor during the weekend, an additional delay of 10 minutes is added, according to a normal distribution, where the standard deviation is 10% of the mean, this time skipping night-time and weekends.
  • If the total delay thus obtained is less than 1 hour, the delay is increased by 1 hour.
  • The arrival timestamp thus obtained is truncated to the hour.

The effects of all this logic on the distribution of both the independent and the dependent variables have already been illustrated to some degree in the section Visual Inspection.

Note that, as we are using java.util.Random.nextDouble for generating random numbers, we need a trick to generate random numbers with a given distribution. This can be achieved by generating uniformly distributed random numbers, and subsequently applying the inverse cumulative distribution function (CDF) of the given distribution to those uniformly distributed random numbers.
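As an illustration of the trick, the sketch below samples from an exponential distribution, chosen here only because its inverse CDF has a simple closed form; the distributions actually used for the delays are not reproduced:

import java.util.Random;

// Illustration of the inverse-CDF technique described above.
final class InverseCdfSampling {

    private static final Random RANDOM = new Random();

    // Exponential distribution with the given mean: CDF F(x) = 1 - exp(-x / mean),
    // hence the inverse CDF is F^-1(u) = -mean * ln(1 - u).
    static double nextExponential(double mean) {
        double u = RANDOM.nextDouble();          // uniformly distributed in [0, 1)
        return -mean * Math.log(1.0 - u);        // transformed to an exponential distribution
    }
}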

Java Implementation


No code snippets from the actual implementation this time, as there is nothing much to it. Just a few tips and tricks:
  • Make sure you have a decent data structure for the representation of the data sets, frequency distributions, performance metrics ... We used lists of tuples, with schemas, and streams to work on them, much like those introduced in our previous post on Tuple Calculus using Java 8 Streams, but without the DSL.
  • Visualize your data, visualize your results, and automate the generation of these visuals.
  • Store results as you go, both data and visualizations. And store metadata as well. Remember what you did.