In this blog entry we introduce evolutionary algorithms and an integration between an evolutionary computation tool, ECJ, and Apache Hadoop. This research aims at speeding up the evaluation of solutions by distributing the workload among a cluster of machines. Finally, we make sense out of this integration showing how it has been used for improving a face recognition algorithm.
Let's get inspired by Darwin: evolutionary algorithms
When traditional methods to solve computational problems do not provide a solution, other methods such as an exhaustive search come to the rescue. However, not for all problems an exhaustive search is possible due to either the search space is too big or the evaluation of each solution is very expensive. Under these circumstances, a smarter search needs to be performed, and here is when Evolutionary Computation come to help.
The English naturalist Charles Darwin and others, in the 19th century, stated that all species of organisms arise and develop through the natural selection of small, inherited variations that increase the individual's ability to compete, survive, and reproduce. Being inspired by such theory, numerous computer scientists have tried to solve computational problems treating solutions for computational problems as if they were individuals of a population. By evolving that individuals, they get reasonable good solutions for such problems. From that idea is where a field of knowledge called Evolutionary Computation emerges.
Within evolutionary computation, there are different techniques such as genetic programming, harmony search, ant colony optimization etc, today we focus on evolutionary algorithms. In evolutionary algorithms solutions for the problem are represented as individuals, a population of these individuals is evolved through generations until an ideal (or good enough) solution is found for the problem.
The evolutionary process follows several phases in order to carry out the evolution of individuals. In the following image the different phases are shown.
- Initialise population: the population is filled up with randomly generated individuals.
- Following phases are repeated till solution is found.
- Evaluation: each individual of the population is evaluated obtaining its fitness, normally a number which indicates how well the individual solve the problem.
- Selection: individuals that will be used for breeding are selected. Randomness and previous calculated fitness take normally place for this process.
- Breeding and mutation: selected individuals are crossed and offspring is produced, generating a new population. Individuals may suffer random mutations with very small probability.
Java-based Evolutionary Computation System (ECJ)
Born at the computer science department of the George Manson University, ECJ has become a popular tool when it comes to Evolutionary Computation. The web portal of the project can be found at: https://cs.gmu.edu/~eclab/projects/ecj/
Thanks to such tools, the whole evolutionary logic does not need to be implemented every time you want to apply it. In most of the cases, configuring the tool and implementing the fitness function is enough to run the algorithm.
Different reasons make ECJ a good tool for any research related with evolutionary computation.
- Multi-platform because it is implemented in Java
- Provide a wide flexibility, it is easy to implement many kind of problems
- It is highly configurable, with hierarchical configuration files you can modify many of the aspects of the evolutionary process.
- A run can take very long, but it has a check pointing feature that allows to stop and resume the execution.
- Some of the phases can be very computationally expensive, however with multi-threading it can be mitigated.
- Random number are generated with a pseudo-random number generator which allows to reproduce the results just by providing the same seed.
Ok, so how do we speed it up?: evaluation phase on Hadoop
Even though by using these methodologies you reduce considerably the search space, the evaluation of solutions may be a very expensive task. If so, you would wish to parallelise/distribute this task, this is what we propose here: distribute the evaluation of solutions on a cluster of computers by means of Hadoop.
This problem is far from being a Big Data problem, we have thought on Hadoop because the tool already provide all the mechanisms required for the distribution of the workload. So, we "simply" need to create the input for the job, implement the map function of the MapReduce job (reduce function is not utilized) and read the output.
Every time the evaluation phase is reached, a MapReduce job is launched. Before launching the job two things need to be done. First, the evaluation state is loaded into the distributed cache of Hadoop (1). The evaluation state from the evolutionary process is required by every task in order to perform the evaluation of individuals, so we place it into the distributed cache which is accessible by all tasks. Second, we need to create the input for the MapReduce into HDFS (2), here the input is the genotype of all individuals.
Once the context has been set, we can lunch the MapReduce job (3). In that case we only need to parallelise the work, so only map phase is required. Once Map tasks finish the evaluation, they write the result of the evaluation (fitness) into HDFS. Finally, the output is read (4) and all the fitness are assigned to the corresponding individuals to continue the evolutionary process.
Theoretically the solution seems appropriate, but what if we put it in practice?: improving a face recognition problem
In order to show the value of the integration between ECJ and Hadoop, we have tried to improve a face recognition algorithm which is under the described circumstances, the evaluation of each solution takes long enough that the use of evolutionary algorithms is not feasible. However, we have made use of the integration to reduce considerably the evaluation time.
The goal of applying an evolutionary algorithm to the face recognition algorithm is to reduce the number of points of interest (POI) the algorithm needs to recognise faces. By doing that, we can get a more optimized algorithm.
We will not go in details of how the algorithm works since it is not of the interest of this post. Basically, first there is a training phase where the algorithm learns the faces from a database. Second, the algorithm tests how well it is able to classify the faces. This is done using the library OpenCV which is used to analyse a windows around each of the POIs in every image.
The database contains a long set of images, each image is described by a list of POIs. These POIs correspond to different points of the face, eyes, nose, lips, ... but not all of them are useful for recognising faces. Which ones are useful and which ones are useless? This is what the evolutionary algorithm will show us! The goal is to keep only the useful ones.
How can we adapt that problem to an evolutionary algorithm? Individuals are represented by a string of 0's and 1's that indicates which POIs are used and the fitness is the percentage of faces correctly classified.
We can see the whole evolutionary process in the following image.
This research aims at reducing the time when using an evolutionary algorithm. The next plot compares the time when the evaluation is done sequentially, without distributing the evaluation, and when the evaluation is done distributing the work with Hadoop along different number of nodes.
We can observe how the time is reduced almost proportionally to the number of nodes, which is exactly what we wished to get!
Have the results of the face recognition algorithm been improved? This research did not aim at investigating how the Evolutionary Algorithm could improve the results of the algorithm, but at speeding up its evaluation, so we do not have results related with that.
The integration can be found at GitHub: https://github.com/dlanza1/ecj
This research has been possible thanks to Francisco Chávez and Francisco Fernandez from the University of Extremadura (Spain), César Benavides-Alvarez and Juán Villegas from Universidad Autónoma Metropolitana (México), Leonardo Trujillo from Instituto Tecnolóico de Tijuana (México) and Gustavo Olague from CICESE (México).
This research was done alongside a research stay at the Michigan State University, at BEACON Centre with the professors Erik Goodman and Wolfgang Banzhaf.