Machine Learning Algorithms for Semantic Segmentation

This guest post by Rajalingappaa Shanmugamani, the author of Deep Learning for Computer Vision, takes you through three important algorithms for semantic segmentation.

There are several deep learning-based algorithms to solve image segmentation tasks. The sliding window approach can be applied at a pixel level for segmentation. This approach takes an image and breaks it into smaller crops. Every crop of the image is classified for a label. This approach is expensive and inefficient because it doesn’t reuse the shared features between the overlapping patches. In this article, we will discuss a few algorithms that can overcome this problem.
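To see why the sliding-window approach is wasteful, it can be sketched in a few lines of NumPy; the function name and the toy 6×6 image below are illustrative only:

```python
import numpy as np

def sliding_crops(image, crop, stride):
    """Yield every crop x crop window of the image; each window would be
    classified separately in the sliding-window approach."""
    h, w = image.shape[:2]
    for i in range(0, h - crop + 1, stride):
        for j in range(0, w - crop + 1, stride):
            yield (i, j), image[i:i + crop, j:j + crop]

image = np.zeros((6, 6))
patches = list(sliding_crops(image, crop=4, stride=2))
# a 2x2 grid of overlapping 4x4 patches -> 4 patches in total
```

Every patch is classified independently, so overlapping patches recompute the same features; this redundancy is what the algorithms below avoid.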

The Fully Convolutional Network

The Fully Convolutional Network (FCN) introduced the idea of an end-to-end convolutional network for segmentation. Any standard CNN architecture can be used as an FCN by replacing its fully connected layers with convolution layers. In the final layers, the depth is higher and the spatial size is smaller, so a 1x1 convolution can be performed to reach the desired number of labels. For segmentation, however, the spatial dimensions have to be preserved, so the fully convolutional network is constructed without max pooling, as shown here:

[Figure: the Fully Convolutional Network, with fully connected layers replaced by convolutions]

The loss for this network is computed by averaging the cross-entropy loss over every pixel in the mini-batch. The final layer has a depth equal to the number of classes. FCN is similar to object detection, except that the spatial dimension is preserved. The output produced by this architecture is coarse, since some pixels may be mis-predicted, and the computation is expensive. The SegNet architecture helps address these issues.
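As a rough sketch of this per-pixel loss (the function name and toy tensors below are illustrative, not from the book):

```python
import numpy as np

def pixelwise_cross_entropy(probs, labels):
    """Average cross-entropy over every pixel in the mini-batch.

    probs:  (batch, height, width, n_classes) softmax outputs
    labels: (batch, height, width) integer class ids
    """
    b, h, w, _ = probs.shape
    # Pick the predicted probability of the true class at each pixel.
    true_probs = probs[np.arange(b)[:, None, None],
                       np.arange(h)[None, :, None],
                       np.arange(w)[None, None, :],
                       labels]
    return -np.mean(np.log(true_probs))

# Two pixels, two classes: one confident correct prediction, one 50/50.
probs = np.array([[[[0.9, 0.1], [0.5, 0.5]]]])   # shape (1, 1, 2, 2)
labels = np.array([[[0, 1]]])                     # shape (1, 1, 2)
loss = pixelwise_cross_entropy(probs, labels)     # ~0.399
```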

The SegNet architecture

SegNet takes an encoder-decoder approach: the encoder is a stack of convolution layers and the decoder a stack of deconvolution layers. Once the encoder has reduced the spatial dimensions of the features, the decoder upsamples them back to the image size by deconvolution, reversing the effect of the convolutions; deconvolution learns the parameters for this upsampling. Without extra care, the output of such an architecture would be coarse because of the information lost in the pooling layers. SegNet improves on the coarse outputs produced by FCN by reusing the encoder's pooling locations in the decoder, which also makes it less memory-intensive.

[Figure: the SegNet encoder-decoder architecture]

Understanding upsampling, atrous convolution, and transposed convolution will help you follow this network better.

Upsampling the layers by pooling

Max pooling is a sampling strategy that picks the maximum value from a window. This can be reversed for upsampling: each value is surrounded with zeros to upsample the layer, as shown here:

[Figure: upsampling by surrounding each value with zeros]

The zeros are placed around the values being upsampled. This un-pooling can be improved by remembering the locations used during downsampling and reusing them for upsampling, as shown here:

[Figure: un-pooling using remembered max-pooling locations]

Upsampling by remembered indices yields better results than surrounding with zeros. This upsampling of the layers by pooling is not learned; it works as it is. How do you upsample and downsample with learnable parameters? Read on.
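A minimal NumPy sketch of this index-preserving un-pooling, assuming single-channel 2D inputs (helper names are illustrative):

```python
import numpy as np

def max_pool_with_indices(x, size=2):
    """size x size max pooling that also records where each maximum came from."""
    h, w = x.shape
    pooled = np.zeros((h // size, w // size))
    indices = np.zeros((h // size, w // size), dtype=int)
    for i in range(0, h, size):
        for j in range(0, w, size):
            window = x[i:i + size, j:j + size]
            flat = np.argmax(window)
            row, col = flat // size, flat % size
            pooled[i // size, j // size] = window.flat[flat]
            # Store the flat index of the maximum in the original array.
            indices[i // size, j // size] = (i + row) * w + (j + col)
    return pooled, indices

def unpool_with_indices(pooled, indices, shape):
    """Place each pooled value back at its remembered location; zeros elsewhere."""
    out = np.zeros(shape)
    out.flat[indices.ravel()] = pooled.ravel()
    return out

x = np.array([[1., 3.],
              [4., 2.]])
pooled, idx = max_pool_with_indices(x)            # pooled == [[4.]]
restored = unpool_with_indices(pooled, idx, x.shape)
# restored keeps 4 at its original position, zeros elsewhere
```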

Sampling the layers by convolution

The layers can be upsampled or downsampled directly using convolution. The stride of the convolution can be increased to cause downsampling, as shown here:

[Figure: downsampling with a strided convolution]

Downsampling by convolution is called strided convolution. A closely related technique, atrous or dilated convolution, spreads out the kernel to widen its receptive field instead of downsampling. Convolution can likewise be reversed to upsample by learning a kernel, as shown here:

[Figure: upsampling with a learned (transposed) convolution]
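The output-size arithmetic behind these two operations can be sketched as follows; the helper names are illustrative, and zero padding with 'valid'-style windows is assumed:

```python
def conv_out_size(n, kernel, stride, pad=0):
    """Spatial size after a (possibly strided) convolution."""
    return (n + 2 * pad - kernel) // stride + 1

def transposed_conv_out_size(n, kernel, stride, pad=0):
    """Spatial size after a transposed convolution (the reverse mapping)."""
    return (n - 1) * stride + kernel - 2 * pad

# A stride-2 3x3 convolution shrinks a 7-wide feature map to 3:
down = conv_out_size(7, kernel=3, stride=2)           # -> 3
# The matching transposed convolution maps it back to 7:
up = transposed_conv_out_size(down, kernel=3, stride=2)  # -> 7
```

The two formulas are exact inverses of each other, which is why transposed convolution is the natural learned counterpart of strided convolution.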

Upsampling directly using a convolution is termed transposed convolution. Other synonyms are deconvolution, fractionally strided convolution, and up-convolution. Here is a code snippet that sets up the hyperparameters for a SegNet-style model:

input_height = 360
input_width = 480
kernel = 3
filter_size = 64
pad = 1
pool_size = 2
nClasses = 12  # assumed number of segmentation classes; set this for your dataset

After the input is taken, the network follows the usual convolutional neural net with decreasing spatial size, which can be termed the encoder. The following code can be used for defining the encoder:

model = tf.keras.models.Sequential()
# channels-last input: height x width x 3 (RGB)
model.add(tf.keras.layers.InputLayer(input_shape=(input_height, input_width, 3)))
# encoder
model.add(tf.keras.layers.ZeroPadding2D(padding=(pad, pad)))
model.add(tf.keras.layers.Conv2D(filter_size, (kernel, kernel), padding='valid'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Activation('relu'))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(pool_size, pool_size)))
model.add(tf.keras.layers.ZeroPadding2D(padding=(pad, pad)))
model.add(tf.keras.layers.Conv2D(128, (kernel, kernel), padding='valid'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Activation('relu'))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(pool_size, pool_size)))
model.add(tf.keras.layers.ZeroPadding2D(padding=(pad, pad)))
model.add(tf.keras.layers.Conv2D(256, (kernel, kernel), padding='valid'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Activation('relu'))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(pool_size, pool_size)))
model.add(tf.keras.layers.ZeroPadding2D(padding=(pad, pad)))
model.add(tf.keras.layers.Conv2D(512, (kernel, kernel), padding='valid'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Activation('relu'))

The output of the encoder can be fed to the decoder, which increases the size again, using the following code:
# decoder
model.add(tf.keras.layers.ZeroPadding2D(padding=(pad, pad)))
model.add(tf.keras.layers.Conv2D(512, (kernel, kernel), padding='valid'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.UpSampling2D(size=(pool_size, pool_size)))
model.add(tf.keras.layers.ZeroPadding2D(padding=(pad, pad)))
model.add(tf.keras.layers.Conv2D(256, (kernel, kernel), padding='valid'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.UpSampling2D(size=(pool_size, pool_size)))
model.add(tf.keras.layers.ZeroPadding2D(padding=(pad, pad)))
model.add(tf.keras.layers.Conv2D(128, (kernel, kernel), padding='valid'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.UpSampling2D(size=(pool_size, pool_size)))
model.add(tf.keras.layers.ZeroPadding2D(padding=(pad, pad)))
model.add(tf.keras.layers.Conv2D(filter_size, (kernel, kernel), padding='valid'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Conv2D(nClasses, (1, 1), padding='valid'))

The decoded map is of the same spatial size as the input, and the whole model can be compiled for training, using the following code:

model.outputHeight = model.output_shape[1]
model.outputWidth = model.output_shape[2]
# flatten the spatial dimensions so that softmax is applied per pixel;
# with channels-last inputs the class axis is already the last one
model.add(tf.keras.layers.Reshape((model.output_shape[1] * model.output_shape[2], nClasses)))
model.add(tf.keras.layers.Activation('softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy'])

This way of encoding and decoding an image overcomes the shortcomings of FCN-based models.
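Once trained, the model's flattened per-pixel softmax output can be mapped back to a label image; this post-processing helper is a hypothetical sketch, not part of the book's code:

```python
import numpy as np

def probs_to_label_image(probs, height, width):
    """Convert a (pixels, n_classes) softmax output into an
    (height, width) image of class ids via per-pixel argmax."""
    return np.argmax(probs, axis=-1).reshape(height, width)

# Toy output: 4 pixels, 2 classes.
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4],
                  [0.3, 0.7]])
labels = probs_to_label_image(probs, 2, 2)
# labels == [[0, 1], [0, 1]]
```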


Dilated convolution can be used to further reduce the losses in sampling while having a wider view.

To know more about dilated convolution and several other important deep learning concepts, you can always fall back to Shanmugamani’s book, Deep Learning for Computer Vision.

For other updates, you can follow me on Twitter: @NavRudraSambyal

Happy (machine) learning!

Thanks for reading, please share it if you found it useful 🙂
