Self-Organizing Maps for Machine Learning Algorithms

In this article, you’ll be introduced to the concept of self-organizing maps (SOMs) and presented with a model called a Kohonen network, which will be able to map the input patterns onto a surface, where some attractors (one per class) are placed through a competitive learning process.

Such a model will be able to recognise new patterns (belonging to the same distribution) by eliciting a strong response in the attractor that is most similar to the pattern. In this way, after a labeling process, the model can be employed as a soft classifier that can easily manage noisy or altered patterns.

Self-organizing maps (SOMs) were first proposed by Willshaw and von der Malsburg to model different neurobiological phenomena observed in animals. In particular, they observed that some areas of the brain develop structures composed of different zones, each of them with a high sensitivity to a specific input pattern.

The process behind this behavior is based on competition among neural units, following a principle called winner-takes-all. During the training period, all the units are excited by the same signal, but only one produces the highest response. This unit automatically becomes a candidate for the receptive basin of that specific pattern. In this article, you'll be introduced to the Kohonen model.


The main idea is to implement a gradual winner-takes-all paradigm, in order to avoid the premature convergence of a neuron (as a definitive winner) and to increase the plasticity of the network. This concept is shown in the following figure (where a linear sequence of neurons is considered):

A Mexican-hat dynamic implemented by a Kohonen network

In this case, the same pattern is presented to all the neurons. At the beginning of the training process (t=0), a positive response is observed from xi-2 to xi+2, with a peak in xi. The potential winner is obviously xi, but all these units are potentiated according to their distance from xi. In other words, the network (which is trained sequentially) is still receptive to change if other patterns produce a stronger activation.

If, instead, xi keeps on being the winner, the radius is slightly reduced, until the only potentiated unit is xi.

Considering the shape of this function, this dynamic is often called a Mexican Hat.

With this approach, the network remains plastic until all the patterns have been repeatedly presented.

If, for example, another pattern elicits a stronger response in xi, it's important that xi's activation is still not too high, in order to allow a fast reconfiguration of the network. At the same time, the new winner will probably be a neighbor of xi, which has received a partial potentiation and can easily take its place.

A Kohonen SOM (also known as a Kohonen network or simply a Kohonen map) is normally represented as a bi-dimensional map (for example, a square matrix m × m or any other rectangular shape), but 3D surfaces, such as spheres or toruses, are also possible (the only necessary condition is the existence of a suitable metric). In this article, the map will always be a square matrix, where each cell is a receptive neuron characterised by a synaptic weight with the same dimensionality, n, as the input patterns:

wij ∈ ℜn

During both the training and working phases, the winning unit is determined according to a similarity measure between a sample and each weight vector. The most common metric is the Euclidean one; hence, if you consider a bi-dimensional map W with a shape (k × p) so that W ∈ ℜk × p × n, the winning unit u* (in terms of its coordinates) is computed as follows:

u* = argmin(i, j) ||x − wij||

As explained before, it’s important to avoid the premature convergence because the complete final configuration could be quite different from the initial one. Therefore, the training process is normally subdivided into two different stages.

1. During the first stage (whose duration is normally about 10–20% of the total number of iterations; let's call this value tmax), the correction is applied to the winning unit and its neighbors (computed by adopting a decaying radius).

2. During the second stage, the radius is set to 1.0 and the correction is applied only to the winning unit.

In this way, it’s possible to analyse a larger number of possible configurations, automatically selecting the one associated with the least error. The neighbourhood can have different shapes; it can be square (in closed 3D maps, the boundaries don’t exist anymore), or, more easily, it’s possible to employ a radial basis function based on an exponentially decaying distance-weight:

n(i, j) = exp(−d²((i, j), u*) / (2σ²(t)))

Here, d((i, j), u*) is the distance between the generic unit (i, j) and the winning unit u*.

The relative weight of each neuron is determined by σ(t), which decays exponentially over time:

σ(t) = σ0 exp(−t / τ)

σ0 is the initial radius and τ is a time constant that must be considered as a hyperparameter; it determines the slope of the decay. Suitable values for τ are 5–10% of the total number of iterations.

When a radial basis function is adopted, it’s not necessary to compute an actual neighbourhood because the multiplication factor n(i, j) becomes close to zero outside of its boundaries. A drawback is the computational cost, which is higher than that of a square neighbourhood (as the function must be computed for the whole map).

However, it’s possible to speed up the process by precomputing all the squared distances (the numerator) and exploiting the vectorisation features offered by packages such as NumPy, so that only a single element-wise exponential has to be computed at every step.

The update rule is very simple and is based on the idea of moving the winning unit’s synaptic weights (and, to a lesser extent, those of its neighbours) closer to the pattern, xi (the process is repeated for the whole dataset, X):

Δwij(t) = η(t) · n(i, j) · (xi − wij(t))

The η(t) function is the learning rate. It can be fixed, but it’s preferable to start with a higher value, η0, and let it decay towards a target final value, η∞:

η(t) = η0 exp(−t / τ) for t < tmax, and η(t) = η∞ for t ≥ tmax

In this way, the initial changes force the weights to align with the input patterns, while all the subsequent updates allow slight modifications to improve the overall accuracy. Therefore, each update is proportional to the learning rate, the neighbourhood weighted distance, and the difference between each pattern and the synaptic vector.

Theoretically, when Δwij becomes equal to 0.0 for the winning unit, it means that the neuron has become the attractor of a specific input pattern, and its neighbours will be receptive to noisy/altered versions of it. The most interesting aspect is that the complete final map will contain the attractors for all patterns, organised so as to maximise the similarity between adjacent units.

In this way, when a new pattern is presented, the area of neurons that maps the most similar shapes will show a higher response. For example, if the patterns are handwritten digits, the attractors for digit 1 and digit 7 will be closer to each other than to the attractor for, say, digit 8. A malformed 1 (which could be interpreted as a 7) will elicit a response that lies between the first two attractors, allowing you to assign a relative probability to each class based on the distance.
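As a concrete illustration of this idea, the following minimal sketch (not part of the original algorithm) converts the distances between a new pattern x and every synaptic weight of an already trained map W into relative scores; the function name soft_response and the softmax-over-negative-distances conversion are illustrative assumptions:

import numpy as np

def soft_response(W, x, temperature=1.0):
    # Euclidean distance between the pattern and every synaptic weight
    distances = np.linalg.norm(W - x, axis=2)
    # Closer units receive higher scores; the result sums to 1.0
    scores = np.exp(-distances / temperature)
    return scores / np.sum(scores)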


Implementation

As you’ll see in the example, this feature yields a smooth transition between the different variants of the same pattern class, avoiding rigid boundaries that force a binary decision (as in K-means clustering or in a hard classifier).

The complete Kohonen SOM algorithm is as follows:

  1. Randomly initialise W(0). The shape is (k × p × n).
  2. Initialize nb_iterations, the total number of iterations, and tmax (for example, nb_iterations = 1000 and tmax = 150).
  3. Initialize τ (for example, τ = 100).
  4. Initialize η0 and η∞ (for example, η0 = 1.0 and η∞ = 0.05).
  5. For t = 0 to nb_iterations:
     6. If t < tmax:
        7. Compute η(t)
        8. Compute σ(t)
     9. Otherwise:
        10. Set η(t) = η∞
        11. Set σ(t) = σ∞ (for example, σ∞ = 1.0)
     12. For each xi in X:
        13. Compute the winning unit u* (denoting its coordinates as i and j)
        14. Compute n(i, j)
        15. Apply the weight correction Δwij(t) to all synaptic weights W(t)
        16. Renormalize W(t) = W(t) / ||W(t)|| (the norm must be computed over each weight vector, that is, column-wise)

Implementing SOM

You can implement a SOM using the Olivetti faces dataset. As the process can be very long, in this example, you can limit the number of input patterns to 100 (with a 5 × 5 matrix). You can also try with the whole dataset and a larger map.

The first step is loading the data, normalizing it so that all values are bounded between 0.0 and 1.0, and setting the constants:

import numpy as np
from sklearn.datasets import fetch_olivetti_faces

# Load the Olivetti faces dataset and rescale the values to the range [0, 1]
faces = fetch_olivetti_faces(shuffle=True)
Xcomplete = faces['data'].astype(np.float64) / np.max(faces['data'])
np.random.shuffle(Xcomplete)

# Training constants
nb_iterations = 5000
nb_startup_iterations = 500
pattern_length = 64 * 64
pattern_width = pattern_height = 64
eta0 = 1.0
sigma0 = 3.0
tau = 100.0

# Use only the first 100 patterns and a 5 x 5 map
X = Xcomplete[0:100]
matrix_side = 5

At this point, you can initialize the weight matrix using a normal distribution with a small standard deviation:

W = np.random.normal(0, 0.1, size=(matrix_side, matrix_side, pattern_length))

Now, you need to define a function that determines the winning unit based on the least distance:

def winning_unit(xt):
    # Euclidean distance between the pattern and every synaptic weight
    distances = np.linalg.norm(W - xt, ord=2, axis=2)
    # The winning unit is the one with the smallest distance
    min_distance_unit = np.argmin(distances)
    return int(np.floor(min_distance_unit / matrix_side)), min_distance_unit % matrix_side

It’s also useful to define the functions η(t) and σ(t):

def eta(t):
    # Exponentially decaying learning rate
    return eta0 * np.exp(-float(t) / tau)

def sigma(t):
    # Exponentially decaying neighbourhood radius
    return float(sigma0) * np.exp(-float(t) / tau)

As explained before, instead of computing the radial basis function for each unit, it’s preferable to use a precomputed matrix (in this case, 5 × 5 × 5 × 5) containing all the possible squared distances between pairs of units. In this way, NumPy allows a faster calculation thanks to its vectorization features:

# Precompute the squared distances between every pair of units
precomputed_distances = np.zeros((matrix_side, matrix_side, matrix_side, matrix_side))

for i in range(matrix_side):
    for j in range(matrix_side):
        for k in range(matrix_side):
            for t in range(matrix_side):
                precomputed_distances[i, j, k, t] = \
                    np.power(float(i) - float(k), 2) + np.power(float(j) - float(t), 2)

def distance_matrix(xt, yt, sigmat):
    # Radial basis function centered on the winning unit (xt, yt)
    dm = precomputed_distances[xt, yt, :, :]
    de = 2.0 * np.power(sigmat, 2)
    return np.exp(-dm / de)

The distance_matrix function returns the value of the radial basis function for the whole map, given the center point (the winning unit) xt, yt and the current value of σ, sigmat. Now, it’s possible to start the training process (in order to avoid correlations, it’s preferable to shuffle the input sequence at the beginning of each iteration):

sequence = np.arange(0, X.shape[0])
t = 0

for e in range(nb_iterations):
    np.random.shuffle(sequence)
    t += 1

    # First stage: decaying learning rate and radius
    if e < nb_startup_iterations:
        etat = eta(t)
        sigmat = sigma(t)
    # Second stage: fixed learning rate and unit radius
    else:
        etat = 0.2
        sigmat = 1.0

    for n in sequence:
        x_sample = X[n]

        # Find the winning unit and compute the neighbourhood weights
        xw, yw = winning_unit(x_sample)
        dm = distance_matrix(xw, yw, sigmat)

        # Apply the weight correction and renormalize each weight vector
        dW = etat * np.expand_dims(dm, axis=2) * (x_sample - W)
        W += dW
        W /= np.linalg.norm(W, axis=2).reshape((matrix_side, matrix_side, 1))

In this case, η∞ = 0.2, but you can try different values and evaluate the final result. After training for 5000 epochs, you will get a weight matrix in which each synaptic weight can be plotted as a bi-dimensional (64 × 64) face-like image.
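To visualise the trained map, you can plot each synaptic weight as a 64 × 64 grayscale image. The following is a minimal sketch, assuming matplotlib is available (it isn’t used in the original listings):

import matplotlib.pyplot as plt

# Plot each synaptic weight as a 64 x 64 grayscale face
fig, axes = plt.subplots(matrix_side, matrix_side, figsize=(8, 8))

for i in range(matrix_side):
    for j in range(matrix_side):
        axes[i, j].imshow(W[i, j].reshape((pattern_height, pattern_width)), cmap='gray')
        axes[i, j].axis('off')

plt.show()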

As you can see in the resulting plot, the weights have converged to faces with slightly different features. In particular, looking at the shapes of the faces and the expressions, it’s easy to notice the transition between different attractors (some faces are smiling, while others are more serious; some have glasses, moustaches, beards, and so on).

It’s also important to consider that the matrix capacity is larger than the minimum number of attractors that would strictly be needed. This extra capacity makes it possible to map patterns that cannot easily be attracted by the "right" neuron. For example, an individual can have pictures with and without a beard, and this can lead to confusion. If the matrix is too small, it’s possible to observe instabilities in the convergence process, while if it’s too large, it’s easy to see redundancies.

The right choice depends on the specific dataset and on its internal variance, and there’s no standard criterion for it. A good starting point is to pick a matrix whose capacity is between 2.0 and 3.0 times larger than the number of desired attractors, and then to increase or reduce its size until the accuracy reaches a maximum.
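Since the text doesn’t define a numerical criterion, one possible (and commonly used) heuristic for comparing different map sizes is the average quantization error, that is, the mean distance between each pattern and the weight of its winning unit. The following sketch assumes the W and X arrays defined earlier; lower values indicate a tighter fit, although very large maps can reduce the error simply through redundancy:

def quantization_error(W, X):
    errors = []
    for x in X:
        # Distance between the pattern and every synaptic weight
        distances = np.linalg.norm(W - x, axis=2)
        errors.append(np.min(distances))
    return np.mean(errors)

print('Average quantization error: {:.4f}'.format(quantization_error(W, X)))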

The last element to consider is the labeling phase. At the end of the training process, you have no knowledge about the weight distribution in terms of winning neurons, so it’s necessary to process the dataset and annotate the winning unit for each pattern. In this way, it’s possible to submit new patterns to get the most likely label.
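A minimal sketch of this labeling phase is shown below. It assumes a label array y aligned with X (for the Olivetti dataset, this would come from faces['target'], provided that the data and the labels are shuffled together instead of shuffling Xcomplete alone, as in the listing above); the classify helper is an illustrative name:

from collections import defaultdict

# Annotate the winning unit for each training pattern (y is assumed to be
# a label array aligned with X, as discussed above)
unit_labels = defaultdict(list)

for x, label in zip(X, y):
    unit_labels[winning_unit(x)].append(label)

def classify(x_new):
    # Return the most frequent label among the patterns attracted by the
    # winning unit of the new pattern (None if the unit is unlabeled)
    candidates = unit_labels.get(winning_unit(x_new), [])
    return max(set(candidates), key=candidates.count) if candidates else None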


If you found this article helpful, you can look at Mastering Machine Learning Algorithms, a complete guide to quickly getting to grips with popular machine learning algorithms. You will be introduced to the most widely used algorithms in supervised, unsupervised, and semi-supervised machine learning, and will learn how to use them in the best possible manner.

For more updates, you can follow me on Twitter at @NavRudraSambyal.

Thanks for reading, please share it if you found it useful 🙂
