Genetic Algorithms, Machine Learning Basics

Enhancing Machine Learning Models using Genetic Algorithms with Python

In this post, you will learn about supervised machine learning, and find out how genetic algorithms can be used to improve the performance of machine learning models by selecting the best subset of features from the provided input data. Along the way, you will get acquainted with the real-life ‘Zoo’ dataset and discover how to utilize Python code to create a machine learning model, as well as a genetic algorithm-based solution for enhancing the accuracy of this model.

This post is based on Chapter 7 of my book, Hands-On Genetic Algorithms in Python. The code used in this post can be found in this Github repository.

Supervised Machine Learning

The term ‘machine learning’ typically refers to a computer program that receives inputs and produces outputs. Our goal is to ‘train’ this program, also known as the ‘model’, to produce the correct outputs for given inputs, without explicitly programming it.

During this training process, the model ‘learns’ the mapping between the inputs and the outputs by adjusting its internal parameters. One common way to ‘train’ the model is by providing it with a set of inputs, for which the correct output is known. For each of these inputs, we ‘tell’ the model what the correct output is, so it can adjust, or ‘tune’ itself, aiming to eventually produce the desired output for each of the given inputs. This ‘tuning’ is at the heart of the learning process.

Over the years, many types of machine learning models were developed. Each model has its own particular internal parameters that can affect the mapping between the input and the output, and the values of these parameters can be tuned, as illustrated in the image below.

For example, if the model was implementing a decision tree, it could contain several IF-THEN statements formulated as:

“IF <input value> IS LESS THEN <some threshold value> THEN <go to some target branch>”.

In this case, both the threshold value and the identity of the target branch are parameters that can be adjusted, or ‘tuned’, during the learning process.

To carry out the tuning of the internal parameters, each type of model has an accompanying learning algorithm that iterates over the given input and output values, and seeks to match the given output for each of the given inputs. To accomplish this goal, a typical learning algorithm will measure the difference (or ‘error’) between the actual output and the desired output; the algorithm then will attempt to minimize this error by adjusting the model’s internal parameters.

The two main types of Supervised machine learning are classification and regression, described below.

Classification

When carrying out a classification task, the model needs to decide to which category a certain input belongs. Each category is represented by a single output (called ‘label’), while the inputs are called ‘features’:

For example, in the well-known Iris Flower dataset, there are four features: Petal Length, Petal Width, Sepal Length and Sepal width, representing measurements manually taken off of actual Iris flowers.

At the output end, there are three labels: ‘Iris Setosa’, ‘Iris Virginica’ and ‘Iris Versicolor’, representing three different types of Iris.

When input values are present, representing measurements that were taken from a given Iris flower, we expect the output of the correct label to go ‘high’ and the other two to go ‘low’:

Classification tasks have a multitude of real-life applications, such as approval of bank loans and credit cards, email spam detection, handwritten digit recognition, and face recognition. The problem you will soon be solving in this post is classification of animal types using the ‘Zoo’ dataset.

Regression

In contrast to classification tasks, the model for regression tasks maps the input values into a single output, providing a continuous value, as illustrated in the image below:

Given the input values, the model is expected to predict the correct value of the output.

Real life examples of regression include predicting the value of stocks, the quality of wine, or the market price of a house, as depicted in the following image:

In this image, the inputs are features providing information that describes a given house, while the output is the predicted value of the house.

Many types of models exist for carrying out classification and regression tasks, including decision trees, support vector machines and artificial neural networks.

There are certain techniques that can be used to improve and enhance the performance of such models. One interesting technique—Feature Selection—is described below.

Feature Selection in Supervised Learning

As we just mentioned, a supervised learning model receives a set of inputs called ‘features’ and maps them to a set of outputs. The assumption is that the information described by the features is useful for determining the value of the corresponding outputs. And it may seem, at first glance, that the more information we can use as input, the better our chances to predict the output(s) correctly. However, in many cases the opposite holds true; if some of the features we use are irrelevant or redundant, the consequence could be a (sometime significant) decrease in the accuracy of the models.

Feature selection is the process of selecting the most beneficial and essential set of features out of the entire given set of features. Besides increasing the accuracy of the model, a successful feature selection can provide the following advantages:

  • Training times of the models are shorter
  • The resulting trained models are simpler and easier to interpret
  • These models are likely to provide better generalization, that is, perform better with new input data that is dissimilar to the data used for training.

The “Zoo” Dataset Classification Task

The UCI Machine Learning Repository maintains over 350 data sets as a service to the machine learning community. These datasets can be used for experimentation with various models and algorithms. A typical dataset contains a number of features (inputs) and the desired output, in a form of columns, with a description of the data they contain.

For our sample problem we chose the UCI Zoo dataset. This dataset describes 101 different animals using the following 18 features:

No. Feature Name Data Type
1  animal name  Unique for each instance
2  hair  Boolean
3  feathers  Boolean
4  eggs  Boolean
5  milk  Boolean
6  airborne  Boolean
7  aquatic  Boolean
8  predator  Boolean
9  toothed  Boolean
10  backbone  Boolean
11  breathes  Boolean
12  venomous  Boolean
13  fins  Boolean
14  legs  Numeric (set of values {0,2,4,5,6,8})
15  tail  Boolean
16  domestic  Boolean
17  catsize  Boolean
18  type  Numeric (integer values in range [1,7])

Most features are Boolean (value of 1 or 0), indicating the presence or absence of a certain attribute such hair, fins etc. The first feature, animal name, is for our information only, and does not participate in the learning process.

This dataset is used for testing of classification tasks, where the input features need to be mapped into two or more categories, or ‘labels’. In this dataset, the last column—called type—represents the category, and is therefore used as the output value. In this dataset, there are seven categories altogether. A type value of 5, for instance, represents an animal category that includes frog, newt and toad. To sum up, a classification model trained with this dataset will use the columns 2-17 (hair, feathers, fins, etc) to predict the value of column 18 (animal type).

Python Representation

To encapsulate the feature selection process for the Zoo dataset classification task, we created a Python class called Zoo. This class is contained in the file zoo.py that can be found here.

The main parts of this class are highlighted below:

  1. The __init__()method of the class loads the Zoo dataset from the web, while skipping the first feature—animal name—as follows:

  1. It then separates the data to input features (first remaining 16 columns) and resulting category (last column):

  1. Instead of just separating the data to a training set and a test set, we use here k-fold cross-validation. This means that the data is split to kequal parts, and the model is evaluated k times, each time using (k-1) parts for training and the remaining part for testing (or validation). This is easy to do in Python using the scikit-learn library method KFold():

  1. Next, we create the classification model based on a Decision Tree. This type of classifier creates a tree structure during the training phase, that splits the dataset into increasingly smaller subsets, eventually resulting in a prediction:

Note that we are passing the random seed value to the classifier. This enables us to run tests with repeatable results.

  1. The getMeanAccuracy()method of the class is used to evaluate the performance of the classifier for a set of selected features. This method accepts a list of binary values corresponding the features in the dataset—a value of ‘1’ represents using the corresponding feature, while a value of ‘0’ means that the feature is dropped. The method then drops the dataset columns that correspond to the unselected features:

  1. This modified dataset—containing only the selected features—is then used to perform the k-fold cross-validation process, and determine the classifier’s performance over the data partitions. The value of kin our class is set to 5, so five evaluations take place each time:

The metric used here to evaluate the classifier is ‘accuracy‘—the portion of the cases that were classified correctly. An accuracy of 0.85, for example, means that 85% of the cases were classified correctly. Since in our case we train and evaluate the classifier k times, we use the average (mean) accuracy value obtained over these evaluations.

  1. The main()method of the class creates an instance of the Zoo class, then evaluates the classifier with all 16 features present, using al ‘all-ones’ solution representation:

When running the main method of the class, the printout shows that when testing our classifier with 5-fold cross-validation using all sixteen features, the classification accuracy achieved is about 91%:

Next, we will attempt to improve the accuracy of the classifier by selecting a subset of features from the dataset, instead of using all the features.

Selecting the Features for the “Zoo” Dataset

Since we have sixteen different features, a subset of these features can be represented as a sequence of sixteen Boolean values, each denoting the including or exclusion of the corresponding feature. This gives us 216, or 65,536 different possible subsets of features. Unfortunately, trying out all these combinations to find the best one may prove overly time consuming. Luckily, there is an approach that is likely to produce a good solution within a reasonable amount of time. This technique is called—you guessed it—genetic algorithms!

What are Genetic Algorithms?

Genetic algorithms are a family of search algorithms inspired by the principles of evolution in nature. When solving a problem, the genetic algorithm maintains a population of candidate solutions, that are iteratively evaluated and used to create a new generation of solutions. Those who are better at solving the problem have a greater chance to pass their qualities to the next generation of candidate solutions. This way, as generations go by, candidate solutions get better at solving the problem at hand, until a certain stopping condition is met.

When using genetic algorithms, each candidate solution is represented using a ‘chromosome’, enabling us to breed and mutate solutions. The representation we will use for the feature selection solution is a list of integers with the values of ‘0’ or ‘1’, denoting whether a feature should be used or dropped.

Several Python frameworks are available for working with genetic algorithms; we chose to use the DEAP framework, thanks to its ease of use, extensibility and abundance of documentation.

Genetic Algorithms Solution

To identify the best set of features to be used for our Zoo classification task using a Genetic Algorithm, we created the Python program 02-solve-zoo.py located here. The main parts of this program are highlighted below:

  1. First, we need to create an instance of the Zooclass, passing our random seed along for the sake of producing repeatable results:

  1. DEAP’s creator module enables us to extend existing classes by augmenting them with new attributes. Since our goal here is to maximize the accuracy of the classifier model, we extend DEAP’s base Fitness class using a positive weight:

  1. DEAP’s Toolbox class serves as a container for functions (or ‘operators’), and enables us to create new operators by aliasing and customizing existing functions We now use the following toolbox definitions to create the initial population of individuals, each constructed as a list of ‘0’ or ‘1’ integer values :

  1. We now need to define and register the fitness function, the one we want our solution to maximize. Here we call it zooClassificationAccuracy(). It is a wrapper around the getMeanAccuracy()method of the Zoo instance, as we indeed aim to maximize the mean accuracy. In addition, the wrapper is enriched with two enhancements:
    • We eliminate the possibility of no features being selected (all-zeros individual), as our classifier will throw an exception in such a case.
    • We add a small penalty for each feature being selected, to encourage selection of fewer features. The penalty value is very small (0.001), so it only comes into play as a ‘tie-breaker’ between two equally performing classifiers, leading the algorithm to prefer the one that uses less features:

  1. We next need to define the genetic operators of selection (selecting solutions to create the next generation), crossover (breeding two solutions to create offspring solutions) and mutation (making random changes in some of the offspring). Here we use a simple selection operator called Tournament Selection, and crossover and mutation operators that are specialized for binary-list chromosomes:

  1. We also use an enhancement to the genetic algorithm called Elitism, where the ‘hall of fame’ members—the current best individuals—are always passed untouched to the next generation:

  1. At the end of the run, we print all members of the ‘hall of fame’, so we can see the top results found by the algorithm. We print both the fitness value—that includes the penalty for number of features—and the actual classification accuracy value:

Running the algorithm for 50 generations, with population size of 50 and ‘hall of fame’ size of 5, we get the following outcome:

The results indicate that all five top solutions achieved accuracy value of 97%, using either six or seven features out of the available sixteen. Thanks to the penalty factor on number of features, the top solution is the set of six features, which correspond to the following ones:

  • ‘feathers’
  • ‘milk’
  • ‘airborne’
  • ‘backbone’
  • ‘fins’
  • ‘tail’

In conclusion, by selecting these six particular features out of the sixteen given in the dataset, not only did we reduce the dimensionality of the problem, but we were also able to improve our model accuracy from 91% to 97%. If this does not seem like a large enhancement at first glance, think of it as reducing the error rate from 9% to 3%—a very significant improvement in terms of classification performance.

Summary

In this post, you were introduced to machine learning and to the two main types of Supervised machine learning tasks—regression and classification. You were then presented with the potential benefits of Feature Selection to the performance of models carrying out these tasks. At the heart of this post was a demonstration of how genetic algorithms can be utilized to enhance the performance such models via feature selection, using Python code and the UCI ‘Zoo’ classification dataset.

Since you made it thus far, you deserve a special reward: the entire chapter of Hands-On Genetic Algorithms in Python on which this post is based on, is yours for free download.

Tagged , , , , , , ,

About Eyal Wirsansky

Eyal Wirsansky is a senior software developer, an artificial intelligence consultant and a genetic algorithms specialist, helping developers and startup companies learn and utilize artificial intelligence techniques. Eyal is the author of the book 'Hands-On Genetic Algorithms with Python' (Packt).
View all posts by Eyal Wirsansky →

Leave a Reply

Your email address will not be published. Required fields are marked *