Finding neural network topologies is a problem with a rich history in evolutionary computing, or neuroevolution. This post will revisit some of the key ideas and outgoing research paths. Code related to this post is found here: [code link].

## NEAT

In their 2002 paper, Kenneth Stanley & Risto Miikkulainen proposed the foundational algorithm NeuroEvolution of Augmenting Topologies (NEAT). I’ll focus on this algorithm as a starting point; for earlier developments please see Section 5.2 of this great review, Schaffer’s 1992 review, and Yao’s 1999 review. The NEAT paper introduces the ideas clearly and there are other great NEAT overviews, so to change it up I will try to present the algorithm with generic notation, which is perhaps useful for thinking about how to modify the algorithm or apply it to a new problem setting.

I’ve also made an implementation [code link] contained in a single python file; you might find it useful to see the entire algorithm in one place, or as a comparison if you also implement NEAT as an exercise. For a more robust implementation, see NEAT-Python (which the code is based on) and its extension PyTorch-NEAT.

**Problem**

NEAT addresses the problem of finding a computation graph . Each node has a bias, activation, and aggregation function, written , and each edge has a source and destination, a weight, and may be active or inactive, written .

Searching through the space of these graphs amounts to searching through a space of neural networks. NEAT conducts this search using a few generic neuroevolution concepts, which I’ll focus on below, and often implements them with design decisions that can be relaxed or modified for different problems.

**High-Level Method**

NEAT iteratively produces a set of candidates , using a candidate partitioning where and a given function which measures a candidate’s quality. The candidates, partitions, and quality function are known as ‘population’, ‘species’, and ‘fitness’, respectively.

Each NEAT iteration returns a new candidate set and new partitioning, denoted as . Intuitively E is an ‘evolution step’ that produces a new ‘generation’. NEAT’s goal is to eventually output a ‘good’ candidate set . Typically *good* means that the best candidate has quality exceeding a goal threshold, . We then use this high-performing neural network on a task.

### Evolution Steps

Each evolution step produces a new population using four ideas: mutation, crossover, fitness ranking, and partitioning.

**Mutation** randomly perturbs a candidate graph. In NEAT, mutations consist of adding or deleting a node, adding or deleting an edge, or perturbing a node or edge property (such as an edge’s weight or a node’s activation). Each mutation type occurs with a pre-specified probability, and involves a random perturbation; for instance an add-edge randomly chooses an edge location and weights are adjusted with Gaussian noise. One can design other mutations, such as resetting a weight to a new value.

**Crossover** produces a new candidate by swapping properties of two existing candidates. In NEAT, roughly speaking if and have a matching node , then receives one of them randomly (similarly for edges). simply inherits non-matching nodes or edges. The notion of ‘matching’ is tricky due to isomorphic graph structures, so NEAT assigns an ID to each new node and edge, then uses these IDs for comparison (see 2.2 and 3.2 of the NEAT paper for details). In part due to the added complexity, some papers leave out crossover completely.

**Fitness Ranking** follows its name, first ranking candidates according to fitness, where means . Only the top (e.g. 20%) candidates are used for crossover and mutation. This locally biases the search towards candidates with high relative fitness.

**Partitioning**, or speciation, groups candidates according to a distance function . One use of the partitions is to promote diversity in the solution space by modifying each candidate’s fitness. To do so, NEAT defines a distance function and adjusts each candidate’s fitness based on its partition size. Each partition is guaranteed a certain number of candidates in the next generation based on the adjusted fitnesses.

Intuitively, a small partition contains graphs with relatively unique characteristics which might ultimately be useful in a final solution, even if they do not yield immediate fitness. To avoid erasing these characteristics from the search during fitness ranking, the small partition candidates receive guaranteed spots in the next phase.

We can write this step as . We might alternatively view this step as just fitness re-ranking, , without requiring actual partitions, though without partitions it may be tricky to achieve the exact ‘guaranteed spots’ behavior.

The partitions could also be useful in problems requiring a *collection* of solutions rather than a single optimal solution. For instance, rather than just selecting the highest performing candidate, we might consider the *best candidate in each partition* as the final output of NEAT, thus producing a *collection* of networks, each maximizing fitness in a different way than the others (assuming a partitioning scheme that promotes diverse solutions).

### Example Results

Let’s use the implementation [**code link**] to solve an xor problem and the Cartpole and Lunar-Lander gym environments.

To solve **xor**, NEAT finds a network with a single hidden node:

**CartPole-v0** is easy to solve (even random search is sufficient), and NEAT finds a simple network without hidden units (for fun we’ll also construct an artificially complicated solution in the **Variations** section below):

**LunarLander-v2** is more difficult, and NEAT finds a network with non-trivial structure:

On the xor environment, NEAT creates around 10 partitions, on Cartpole just 1, and on LunarLander it tends to create 2-3 partitions. On these simple environments NEAT also performs similarly *without crossover*.

**Variations** As mentioned before, we may want NEAT to produce a *diverse set* of solutions rather than a single solution. To manually demonstrate this intuition, suppose I want NEAT to find a network that uses sigmoid activations, and one that uses tanh. To do so, I increased the activation parameter in the node distance function (the used in partitioning), then chose the highest scoring network from each partition. On Cartpole, the partitions now naturally separate into sigmoid and tanh networks:

While Cartpole is evidently simple enough for a network with no hidden layers, perhaps we want to follow a trend of using large networks even for easy problems. We can modify the fitness function to ‘reject’ networks without a certain number of connections, and NEAT will yield more complicated solutions:

In particular, I added -1000 to the fitness when the network had less than *k* connections, starting with *k=5* and incrementing *k* each time a candidate achieved max fitness at the current *k *(stopping at k=20).

## Discussion & Extensions

Vanilla NEAT attempts to find both a network structure and the corresponding weights from scratch. This approach is very flexible and involves minimal assumptions, but could limit NEAT to problems requiring small networks. However, the key idea can still be applied or modified in creative ways.

### Minimal Assumptions

NEAT represents an extreme on the spectrum of learned versus hand-crafted architectural biases, by placing few assumptions on graph structure or learning algorithm. At a very speculative level, such flexibility may be useful for networks with backward or long-range connections that may be difficult to hand design, or as part of a learning process which involves removing or adding connections rather than optimizing weights of a fixed architecture.

A more concrete example is the recent Weight Agnostic Neural Networks paper (Gaier & Ha 2019), where the authors aimed to find a model for a task by *finding good network structures*, rather than finding good weights for a fixed network structure; they use a *single shared weight value* in each network and evaluate fitness on multiple rollouts, with a randomly selected weight value for each rollout. In this case, a NEAT variant allowed finding exotic network structures from scratch, without requiring prior knowledge such as hand-designed layer types.

As a rough approximation, I modified the NEAT implementation so that each network only has a single shared weight value, and included more activation functions (sin, cos, arctan, abs, floor). Each run of evaluation sets the network’s shared weight to a randomly sampled value ( excluding ), and the network’s overall fitness is the average fitness over 10 runs. On XOR, NEAT finds a network with similar structure as before:

This was just an initial experiment to give intuition, so check out the WANN paper for a good way of doing this for non-trivial tasks.

### Scalability

One could also consider improving NEAT’s scalability. A high level strategy is to reduce the search space by restricting the search to topologies, searching at a higher abstraction level, or introducing hierarchy.

An example is DeepNEAT (Miikkulainen et al 2017), which evolves graph structures using NEAT, but with nodes representing *layers* rather than single neurons, and edges specifying layer connectivity. Weight and bias values are learned with back-propagation. The authors further extend DeepNEAT to CoDeepNEAT, which represents graphs with a two level hierarchy defined by a *blueprint* specifying connectivity of *modules*. Separate blueprint and module populations are evolved, with the full graph (module + blueprint) assembled for fitness evaluation.

This view is quite general, allowing learning the internal structure of reusable modules as well as how they are composed. In the experiments the authors begin with modules involving known components such as convolutional layers or LSTM cells and evolve only specific parts (e.g. connections between LSTM layers), but one might imagine searching for completely novel, reusable modules.

### Indirect Encodings

NEAT essentially writes down a description, or *direct encoding*, of every node and edge and their properties, then evolves these descriptions. The description size grows as the network grows, making the search space prohibitively large.

An alternative is to use a function to describe a network. For instance, we can evaluate a function at pairs of points from a grid to obtain a weighted adjacency matrix. This function is an example of an indirect encoding of the graph. Assuming the description of is small, we can describe very large networks by evaluating a suitable using a large grid or coordinate pattern. A neural network with a variety of activations that is evaluated in this manner is called a compositional pattern producing network (CPPN) [see, also].

HyperNEAT (Stanley et al. 2009) uses this idea to find network weights by evolving an indirect encoding function. HyperNEAT uses NEAT to evolve a (small) CPPN to act as , then evaluates at coordinates from a hyper-cube, resulting in weights of a (larger) network used for fitness evaluation.

Several works have adopted or extended ideas from HyperNEAT for a deep learning setting. Fernando et al. 2016 proposed the Differentiable Pattern Producing Network (DPPN) which evolves the structure of a weight-generating CPPN while using back-propagation for its weights. The authors evolve a 200 parameter that generates weights for a fully connected auto-encoder with ~150,000 weights, though it is for a small-scale MNIST image de-noising task. Interestingly the weight generating function learns to produce convolution-esque filters embedded in the fully connected network.

HyperNetworks (Ha et al 2016) further scales HyperNEAT’s notion of indirect encodings to more complex tasks by learning a weight generation function with end-to-end training, including an extension that can generate time-varying weights for recurrent networks:### Wrapping Up

In this post we revisited a core technique for generating neural network topologies, and briefly traced some of its outgoing research paths. We took a brief step back from the constraints of pre-defined-layer architectures and searched through a space of very general (albeit small-scale) topologies. It was interesting to see how this generality has been refined towards some larger scale tasks, but also revisited. We briefly saw how fitness re-ranking and partitioning can be used to yield *a set of distinct solutions, *which connects to other concepts that I may discuss further in future posts.

[Code]