Machine Learning Implementation using Java and Tribuo

Going Deeper with Tribuo—Regression, Provenance and Model Serialization

This is the third post in a series of four posts.


In the previous post, First Tribuo Example, we saw how to implement a wine quality classifier with Tribuo, a Java-based Machine Learning library. This time, we will use the same dataset to demonstrate how to implement a regressor, and along the way discover a few more capabilities of Tribuo.

Regression vs. Classification

As we saw in Introduction to Supervised Learning, models used for regression tasks typically have a single output, providing a continuous value; given the input values (or ‘features’), the model is expected to predict the value of the output. This is in contrast to classification tasks, where the model maps the features into two or more categories.

Interestingly, the Wine Quality Dataset we used for classification in the previous post can also be used for regression: although the output variable divides the wine variants into different classes, 0, 1, 2, …, 10, that value goes up with the wine quality, and therefore we can also think of the quality as a continuous value between 0 and 10.

The Code

The main file used for this example is the class WineQualityRegression. It has many similarities to the class WineQualityClassification, which we described in the previous post; here, we will concentrate on the differences between the two classes.

<Regressor> Instead of <Label>

The type <Regressor> replaces <Label> throughout the code, reflecting our regression use case, as demonstrated in our class variables:
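The original code listing is not reproduced here, but a minimal sketch of such class variables could look like the following (the field names are assumptions, not taken from the actual class):

```java
import org.tribuo.Model;
import org.tribuo.MutableDataset;
import org.tribuo.Trainer;
import org.tribuo.regression.Regressor;

public class WineQualityRegression {

    // Every Tribuo type is parameterized by <Regressor> instead of <Label>
    private Trainer<Regressor> trainer;
    private Model<Regressor> model;
    private MutableDataset<Regressor> trainingDataset;
    private MutableDataset<Regressor> testingDataset;
}
```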

There are several other common types used in Tribuo, such as <Event>, used in anomaly detection, and <ClusterID>—used in clustering.

Random Forest Trainer

For this task, we are going to demonstrate the use of a Random Forest model. Just like the XGBoost model we used in the previous post, this type of model can be used for both classification and regression. During training, the Random Forest algorithm creates a collection (or ensemble) of decision trees, each created for a randomly selected subspace of the data set. The outputs of all these ‘subsampling’ trees are then averaged to create the prediction of the trained regressor.

To make that happen, the method createTrainer() starts by creating a CART (Classification And Regression Tree) trainer that will be used to create the ‘subsampling’ trees—the building blocks of the Random Forest model. The constructor of this trainer is given a list of parameter values that will be used when creating these trees:
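The full list of constructor parameters varies between Tribuo versions, so the sketch below uses the simple form that only limits the tree depth; the depth value itself is an assumption:

```java
import org.tribuo.regression.rtree.CARTRegressionTrainer;

// A CART trainer that builds regression trees limited to a depth of 6;
// these trees are the building blocks of the Random Forest ensemble.
CARTRegressionTrainer cartTrainer = new CARTRegressionTrainer(6);
```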

Next we create the actual Random Forest trainer, that receives the CART trainer we just defined as its first parameter:
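A sketch of that step, assuming the cartTrainer variable defined above; the AveragingCombiner is Tribuo's ensemble combiner for regression, and numMembers is 10, matching the training output described below:

```java
import org.tribuo.common.tree.RandomForestTrainer;
import org.tribuo.regression.Regressor;
import org.tribuo.regression.ensemble.AveragingCombiner;

// Build an ensemble of 10 trees; the AveragingCombiner averages
// the individual tree outputs into a single prediction.
RandomForestTrainer<Regressor> trainer =
        new RandomForestTrainer<>(cartTrainer,             // the CART trainer defined above
                                  new AveragingCombiner(), // averages the ensemble's outputs
                                  10);                     // numMembers
```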

When running the program, the output of the training process looks as follows:

Tribuo’s BaggingTrainer class is responsible for creating the ensemble of decision trees for the Random Forest regressor. We can see 10 rounds of training, matching the value of numMembers we initialized the trainer with.

Evaluating the Regression Results

To evaluate the results, the evaluate() method utilizes Tribuo’s RegressionEvaluator and RegressionEvaluation, in a similar manner to what we saw in the case of classification. However, in order to extract the results we need, an ‘auxiliary’ regressor needs to be created first as follows:
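Following the pattern used in Tribuo's regression tutorial, the 'auxiliary' regressor simply names the output dimension we want to query ("DIM-0" is the name Tribuo assigns to a single-dimension output); its value is never used, so NaN is passed:

```java
import org.tribuo.regression.Regressor;

// An 'auxiliary' regressor that names the dimension to query;
// the NaN value is a placeholder and is never used.
Regressor dimZero = new Regressor("DIM-0", Double.NaN);
```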

The reason for that is that Tribuo supports multidimensional regression: imagine, for example, that our dataset had two output columns, one for the wine's quality and the other for its consistency. A single, two-dimensional regressor could be used for this dataset, with the first dimension predicting the quality and the second predicting the consistency. Each of these dimensions would be evaluated separately. In our case, there is a single dimension (‘dimension 0’) and we extract its evaluation as follows:
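A sketch of that extraction, assuming the model, the testingDataset, and the dimZero auxiliary regressor from the previous steps; the local variable names are assumptions:

```java
import org.tribuo.regression.evaluation.RegressionEvaluation;
import org.tribuo.regression.evaluation.RegressionEvaluator;

RegressionEvaluator evaluator = new RegressionEvaluator();
RegressionEvaluation evaluation = evaluator.evaluate(model, testingDataset);

// Extract the metrics for dimension 0 via the auxiliary regressor
double mae  = evaluation.mae(dimZero);
double rmse = evaluation.rmse(dimZero);
double r2   = evaluation.r2(dimZero);
```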

The measures extracted from the evaluation represent three different ways to evaluate the results of a regressor:

  • MAE (Mean Absolute Error): represents the average absolute distance between the predicted results and the actual data.
  • RMSE (Root Mean Squared Error): represents the square root of the average squared distance between the predicted results and the actual data, emphasizing large errors.
  • R^2 (R-Squared): provides a measure, typically between 0 and 1, of how well the predicted results fit the actual data (it can even turn negative for a very poor fit).
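For reference, denoting the actual values by y_i, the predictions by ŷ_i, the mean of the actual values by ȳ, and the number of test examples by n, these measures are defined as:

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}
```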

When running the program, the output of the evaluation looks as follows:

As evident from these results, the performance on the test set is considerably worse compared to what we get for the train set—the two distance-related measures are larger, and the R-Squared is closer to 0 than to 1. This is consistent with the results we got for the same dataset in the case of classification.

Provenance

As we mentioned in our Introduction to Oracle Tribuo post, Tribuo’s Provenance is a unique feature integrated into all classes that represent Models, Datasets and Evaluations. Provenance provides information about the parameters, transformations, and files that were used to create them; it allows each model and experiment to be recreated from scratch.

To illustrate the wealth of information that is stored in the model’s provenance, we added, at the end of the trainAndEvaluate() method, a few lines that print out the provenance of the dataset and the trainer used for our experiment:
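A sketch of those lines, assuming the trained model variable from the earlier steps; ProvenanceUtil comes from OLCUT, the configuration library Tribuo is built on:

```java
import com.oracle.labs.mlrg.olcut.provenance.ProvenanceUtil;

// Pretty-print the provenance of the dataset and of the trainer,
// both recorded inside the trained model.
System.out.println(ProvenanceUtil.formattedProvenanceString(
        model.getProvenance().getDatasetProvenance()));
System.out.println(ProvenanceUtil.formattedProvenanceString(
        model.getProvenance().getTrainerProvenance()));
```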

The resulting output is a very detailed structure, containing all the pertinent information. Below is a small portion of it:

You can learn more about Provenance and how it can be used in conjunction with the Configuration Manager in the official Tribuo Configuration Tutorial.

Serializing the Model

We can now save the trained model into a file, so we can recreate it later and use it to make predictions. This is done by the code in the method saveModel():
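Tribuo models implement Serializable, so standard Java object serialization is enough for this step. A sketch of such a method, where the parameter names and file handling details are assumptions:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

import org.tribuo.Model;
import org.tribuo.regression.Regressor;

// Persist the trained model to disk using Java serialization,
// so it can be deserialized and reused later for predictions.
public static void saveModel(Model<Regressor> model, String fileName) throws IOException {
    try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(fileName))) {
        oos.writeObject(model);
    }
}
```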

As seen above, all we need to do is write the Model object into a file.

What’s Next?

In our next post, we will read the model back from the file, and use it to create an application that provides wine quality predictions by responding to REST queries.


About Eyal Wirsansky

Eyal Wirsansky is a senior software developer, an artificial intelligence consultant and a genetic algorithms specialist, helping developers and startup companies learn and utilize artificial intelligence techniques. Eyal is the author of the book 'Hands-On Genetic Algorithms with Python' (Packt).
