*This guest post by Rajalingappaa Shanmugamani, the author of **Deep Learning for Computer Vision**, takes you through three important algorithms for semantic segmentation.*

There are several deep learning-based algorithms for image segmentation. The sliding window approach can be applied at the pixel level: an image is broken into smaller crops, and every crop is classified into a label. This approach is expensive and inefficient because it does not reuse the features shared between overlapping crops. In this article, we will discuss a few algorithms that overcome this problem.

## The Fully Convolutional Network

The **Fully Convolutional Network** (**FCN**) introduced the idea of an end-to-end convolutional network for segmentation. Any standard CNN architecture can be turned into an FCN by replacing its fully connected layers with convolution layers. In the final layers of a CNN the depth (number of channels) is high while the spatial size is small, so a 1×1 convolution is enough to map the features to the desired number of labels. For segmentation, however, the spatial dimensions have to be preserved; hence the fully convolutional network is constructed without max pooling, as shown here:
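As a minimal sketch of that idea (with made-up toy sizes), a 1×1 convolution is just a per-pixel linear map over the channel axis, so it converts a deep feature map into per-class scores without touching the spatial dimensions:

```python
import numpy as np

# Toy sizes (hypothetical): a small feature map with 512 channels,
# mapped to 21 class scores per pixel, as in PASCAL VOC.
h, w, depth, n_classes = 6, 8, 512, 21
features = np.random.randn(h, w, depth)

# The 1x1 kernel has one weight per (input channel, output class) pair.
weights = np.random.randn(depth, n_classes)

# Applying it at every pixel is a matrix product over the channel axis.
scores = features @ weights
print(scores.shape)  # (6, 8, 21): spatial size preserved, depth -> n_classes
```

This is why removing the fully connected layers costs nothing in expressive power for the final classification step: the same linear map is simply applied at every spatial location.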

The loss for this network is computed by averaging the cross-entropy loss over every pixel and over the mini-batch. The final layer has a depth equal to the number of classes. FCN is similar to an object detection network, except that here the spatial dimensions are preserved. The output produced by this architecture is coarse, as some pixels may be mis-predicted, and the computation cost is high. The **SegNet architecture** helps address these issues.
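A quick NumPy sketch of that loss (the helper name is hypothetical): compute the cross-entropy of every pixel from one-hot labels and softmax probabilities, then average over all pixels in the mini-batch:

```python
import numpy as np

def pixelwise_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy over all pixels in a mini-batch.

    y_true: one-hot labels, shape (batch, height, width, n_classes)
    y_pred: softmax probabilities, same shape
    """
    per_pixel = -np.sum(y_true * np.log(y_pred + eps), axis=-1)  # (batch, h, w)
    return per_pixel.mean()

# Toy example: 1 image, 2x2 pixels, 2 classes.
y_true = np.zeros((1, 2, 2, 2))
y_true[..., 0] = 1.0                 # every pixel belongs to class 0
y_pred = np.full((1, 2, 2, 2), 0.5)  # a uniform (maximally unsure) prediction
loss = pixelwise_cross_entropy(y_true, y_pred)
print(round(loss, 4))  # ln 2, the cross-entropy of a uniform guess over 2 classes
```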

## The SegNet architecture

**SegNet** follows an encoder-decoder approach. The encoder consists of convolution layers and the decoder of deconvolution layers, and SegNet improves on the coarse outputs produced by FCN. Rather than copying full feature maps across, the encoder passes only its max-pooling indices to the decoder, which makes the network less memory-intensive. Once the features have been reduced in spatial dimension, they are upsampled back to the image size by deconvolution, reversing the effect of the convolutions; deconvolution learns the parameters for this upsampling. Even so, the output of such an architecture is somewhat coarse due to the information lost in the pooling layers.

The following sections on upsampling, atrous convolution, and transposed convolution will help you understand this network better.

### Upsampling the layers by pooling

Max pooling is a sampling strategy that picks the maximum value from a window. The operation can be roughly reversed for upsampling: each value is surrounded with zeroes to enlarge the layer, as shown here:

The zeroes are inserted at fixed locations around the values being upsampled. Un-pooling can be improved by remembering the locations of the maxima during downsampling and reusing those indices for upsampling, as shown here:

This index-based upsampling yields better results than inserting zeroes at fixed positions. Note that upsampling by pooling is not learned; it works as a fixed operation. How do you upsample and downsample with learnable parameters? Read on.
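A small NumPy sketch of the index-based variant (hypothetical helper names, a single-channel map): max pooling records where each maximum came from, and un-pooling puts each value back exactly there, with zeroes everywhere else:

```python
import numpy as np

def max_pool_with_indices(x, size=2):
    """2x2 max pooling that also records where each maximum came from."""
    h, w = x.shape
    pooled = np.zeros((h // size, w // size))
    indices = np.zeros((h // size, w // size, 2), dtype=int)
    for i in range(0, h, size):
        for j in range(0, w, size):
            window = x[i:i + size, j:j + size]
            di, dj = np.unravel_index(np.argmax(window), window.shape)
            pooled[i // size, j // size] = window[di, dj]
            indices[i // size, j // size] = (i + di, j + dj)
    return pooled, indices

def unpool_with_indices(pooled, indices, out_shape):
    """Place each pooled value back at its recorded location; zeroes elsewhere."""
    out = np.zeros(out_shape)
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            r, c = indices[i, j]
            out[r, c] = pooled[i, j]
    return out

x = np.array([[1., 2., 3., 4.],
              [5., 6., 7., 8.],
              [9., 1., 2., 3.],
              [4., 5., 6., 7.]])
pooled, idx = max_pool_with_indices(x)
restored = unpool_with_indices(pooled, idx, x.shape)
print(pooled)    # [[6. 8.] [9. 7.]]
print(restored)  # the maxima back in their original positions, zeroes elsewhere
```

This is the trick SegNet's decoder uses: only the small index arrays have to be carried from encoder to decoder, not the full feature maps.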

### Sampling the layers by convolution

The layers can be upsampled or downsampled directly using convolution. Increasing the stride of the convolution causes downsampling, as shown here:

Downsampling by increasing the stride is called **strided convolution**. (A related technique, **atrous convolution** or **dilated convolution**, spaces out the kernel taps to widen the receptive field, but on its own it does not downsample.) The strided convolution can be reversed to upsample by learning a kernel, as shown here:
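A minimal 1-D sketch of both directions (toy signal and kernel, hypothetical helper names): increasing the stride shrinks the output, while the reversal inserts zeroes between the inputs before convolving again with a kernel that would normally be learned:

```python
import numpy as np

def conv1d(x, k, stride=1):
    """'Valid' 1-D convolution (really cross-correlation, as in deep learning)."""
    out_len = (len(x) - len(k)) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + len(k)], k)
                     for i in range(out_len)])

def transposed_conv1d(x, k, stride=2):
    """Fractionally strided view: insert stride-1 zeroes between inputs,
    then run an ordinary 'full' convolution."""
    dilated = np.zeros(len(x) * stride - (stride - 1))
    dilated[::stride] = x
    padded = np.pad(dilated, len(k) - 1)  # 'full' padding
    return conv1d(padded, k)

x = np.array([1., 2., 3., 4., 5., 6.])
k = np.array([1., 1., 1.])
down = conv1d(x, k, stride=2)    # downsampling: length 6 -> 2
up = transposed_conv1d(down, k)  # upsampling: length 2 -> 5
print(down)  # [ 6. 12.]
print(up)    # [ 6.  6. 18. 12. 12.]
```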

Upsampling directly with a convolution is termed **transposed convolution**. Other synonyms are **deconvolution**, **fractionally strided convolution**, and **up-convolution**. Here is a code snippet that sets up the parameters for such a network:

```python
import tensorflow as tf

input_height = 360
input_width = 480
kernel = 3
filter_size = 64
pad = 1
pool_size = 2
nClasses = 12  # number of segmentation classes; set this for your dataset
```

After the input is taken, it passes through the usual convolutional layers with decreasing spatial size, which can be termed an encoder. The following code defines the encoder:

```python
model = tf.keras.models.Sequential()
# channels-last input: (height, width, 3)
model.add(tf.keras.layers.InputLayer(input_shape=(input_height, input_width, 3)))

# encoder
model.add(tf.keras.layers.ZeroPadding2D(padding=(pad, pad)))
model.add(tf.keras.layers.Conv2D(filter_size, (kernel, kernel), padding='valid'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Activation('relu'))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(pool_size, pool_size)))

model.add(tf.keras.layers.ZeroPadding2D(padding=(pad, pad)))
model.add(tf.keras.layers.Conv2D(128, (kernel, kernel), padding='valid'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Activation('relu'))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(pool_size, pool_size)))

model.add(tf.keras.layers.ZeroPadding2D(padding=(pad, pad)))
model.add(tf.keras.layers.Conv2D(256, (kernel, kernel), padding='valid'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Activation('relu'))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(pool_size, pool_size)))

model.add(tf.keras.layers.ZeroPadding2D(padding=(pad, pad)))
model.add(tf.keras.layers.Conv2D(512, (kernel, kernel), padding='valid'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Activation('relu'))
```

The output of the encoder can be fed to the decoder with increasing size, using the following code:

```python
# decoder
model.add(tf.keras.layers.ZeroPadding2D(padding=(pad, pad)))
model.add(tf.keras.layers.Conv2D(512, (kernel, kernel), padding='valid'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.UpSampling2D(size=(pool_size, pool_size)))

model.add(tf.keras.layers.ZeroPadding2D(padding=(pad, pad)))
model.add(tf.keras.layers.Conv2D(256, (kernel, kernel), padding='valid'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.UpSampling2D(size=(pool_size, pool_size)))

model.add(tf.keras.layers.ZeroPadding2D(padding=(pad, pad)))
model.add(tf.keras.layers.Conv2D(128, (kernel, kernel), padding='valid'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.UpSampling2D(size=(pool_size, pool_size)))

model.add(tf.keras.layers.ZeroPadding2D(padding=(pad, pad)))
model.add(tf.keras.layers.Conv2D(filter_size, (kernel, kernel), padding='valid'))
model.add(tf.keras.layers.BatchNormalization())

# a 1x1 convolution maps the features to per-class scores
model.add(tf.keras.layers.Conv2D(nClasses, (1, 1), padding='valid'))
```

The decoded image is of the same size as the input, and the whole model can be trained, using the following code:

```python
model.outputHeight = model.output_shape[1]
model.outputWidth = model.output_shape[2]

# with channels-last tensors the class scores are already in the last axis,
# so flattening the spatial dimensions is enough before the softmax
model.add(tf.keras.layers.Reshape(
    (model.output_shape[1] * model.output_shape[2], nClasses)))
model.add(tf.keras.layers.Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy'])
```

This way of encoding and decoding an image overcomes the shortcomings of **FCN**-based models.

Dilated convolution can be used to further reduce the loss of information during sampling while giving the network a wider view.
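A tiny sketch of the dilation idea (1-D, hypothetical helper name): inserting zeroes between the kernel taps widens the window the kernel covers without adding any parameters:

```python
import numpy as np

def dilate_kernel(k, rate):
    """Spread the kernel taps apart by inserting rate-1 zeroes between them."""
    out = np.zeros((len(k) - 1) * rate + 1)
    out[::rate] = k
    return out

k = np.array([1., 2., 3.])
d = dilate_kernel(k, 2)
print(d)  # [1. 0. 2. 0. 3.] -- receptive field grows from 3 to 5, still 3 weights
```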

To know more about dilated convolution and several other important deep learning concepts, you can always fall back on Shanmugamani's book, *Deep Learning for Computer Vision*.

For other updates, you can follow me on Twitter: @NavRudraSambyal.

Happy (machine) learning!

Thanks for reading, please share it if you found it useful 🙂
