Artificial Intelligence 12 min read

Problems and Solutions in Semantic Segmentation: An Overview of DeepLabV1

This article explains the two main challenges of applying deep convolutional neural networks to semantic segmentation—signal down‑sampling and loss of spatial precision—and describes how the DeepLabV1 architecture, using dilated convolutions, large‑field‑of‑view modules, fully‑connected CRF and multi‑scale fusion, addresses these issues while achieving faster, more accurate segmentation results.

Rare Earth Juejin Tech Community

Problems and Solutions in Semantic Segmentation

In the original paper, the authors identify two technical obstacles when applying deep convolutional neural networks (DCNNs) to semantic segmentation: signal down‑sampling and spatial invariance.

Signal down‑sampling refers to the repeated use of max‑pooling in DCNNs, which progressively reduces feature‑map resolution. Spatial invariance is desirable for high‑level tasks such as image classification and object detection, but it limits a network’s spatial precision, making it poorly suited to dense prediction tasks like semantic segmentation, where the output must vary with spatial transformations of the input.

Why does this happen?

- Signal down‑sampling: repeated max‑pooling lowers resolution.
- Spatial invariance: classification‑oriented networks need transformation‑invariant decisions, which reduces localization accuracy for segmentation.

How to address it?

- Signal down‑sampling: instead of further down‑sampling, use dilated (atrous) convolutions to enlarge the receptive field without reducing resolution.
- Spatial invariance: apply a fully‑connected Conditional Random Field (CRF) to refine the DCNN output. The CRF was used in DeepLabV1 and V2, but dropped in V3.
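The core property of a dilated convolution is that, with appropriate "same" padding, it enlarges the receptive field without shrinking the output. The following minimal 1‑D sketch (pure Python, not the paper's implementation) makes that concrete: the effective kernel span grows with the dilation rate, yet the output length always matches the input length.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Apply a 1-D dilated convolution with 'same' zero padding.

    The effective kernel span is k + (k - 1) * (dilation - 1), so the
    receptive field grows with the dilation rate while the output keeps
    the input's resolution -- the core idea behind atrous convolution.
    """
    k = len(kernel)
    span = k + (k - 1) * (dilation - 1)  # effective kernel size
    pad = span // 2                      # 'same' padding for odd k
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j in range(k):
            acc += kernel[j] * padded[i + j * dilation]
        out.append(acc)
    return out
```

With an identity-like kernel, `dilated_conv1d([1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0], 2)` returns the input unchanged, and the output length equals the input length for any dilation rate.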

DeepLabV1 Network Structure

The paper highlights three advantages of DeepLabV1:

Faster speed: atrous (dilated) convolution enables dense DCNN inference at about 8 frames per second.

Higher accuracy: achieves state‑of‑the‑art results on the PASCAL semantic segmentation challenge, surpassing the previous best by 7.2%.

Simple architecture: consists of two well‑defined modules, a DCNN and a CRF.

LargeFOV Module

DeepLabV1 is built on VGG16. For readers unfamiliar with VGG16, a reference blog is provided. The LargeFOV module replaces the first fully‑connected layer of VGG16 with a dilated convolution, preserving the receptive field while reducing parameters and increasing speed.

To convert a fully‑connected layer to a convolutional one, one can follow the approach used in FCN (e.g., replace the FC layer with a 7×7 convolution of 4096 filters). Then, replace that ordinary convolution with a dilated convolution (e.g., kernel size 3×3, dilation rate 12) to maintain a large receptive field without additional down‑sampling.
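The parameter savings can be checked with simple arithmetic. VGG16's first fully‑connected layer operates on a 7×7×512 feature map, so recast as a convolution (FCN‑style) it is a 7×7 kernel with 4096 filters; the LargeFOV replacement in the paper is a 3×3 dilated convolution with 1024 filters. A quick sketch of the weight counts (biases ignored):

```python
def conv_params(k, in_ch, out_ch):
    """Number of weights in a k x k convolution (biases ignored)."""
    return k * k * in_ch * out_ch

# fc6 of VGG16 recast as a 7x7 convolution with 4096 filters (FCN-style)
fc_as_conv = conv_params(7, 512, 4096)   # 102,760,448 weights

# LargeFOV: 3x3 dilated convolution; the paper's variant uses 1024 filters
largefov = conv_params(3, 512, 1024)     # 4,718,592 weights
```

The dilated replacement carries roughly 1/20 of the weights, which is why the paper reports both fewer parameters and faster inference.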

The authors conducted experiments with various kernel sizes and dilation rates; the final choice was a 3×3 kernel with a dilation rate of 12, which offered a large receptive field, low parameter count, high mean IoU, and fast inference.

Fully‑Connected CRF Module

The fully‑connected CRF (Conditional Random Field) refines the coarse segmentation map produced by the DCNN, yielding finer object boundaries.
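Concretely, the fully‑connected CRF (following Krähenbühl and Koltun, as adopted in the DeepLab paper) minimizes an energy that combines a unary term from the DCNN's class probabilities with a pairwise term over all pixel pairs:

```latex
E(\mathbf{x}) = \sum_i \theta_i(x_i) + \sum_{ij} \theta_{ij}(x_i, x_j),
\qquad \theta_i(x_i) = -\log P(x_i),
```

```latex
\theta_{ij}(x_i, x_j) = \mu(x_i, x_j)\left[
w_1 \exp\!\left(-\frac{\lVert p_i - p_j\rVert^2}{2\sigma_\alpha^2}
               -\frac{\lVert I_i - I_j\rVert^2}{2\sigma_\beta^2}\right)
+ w_2 \exp\!\left(-\frac{\lVert p_i - p_j\rVert^2}{2\sigma_\gamma^2}\right)
\right]
```

where \(p_i\) are pixel positions and \(I_i\) pixel colors. The first (bilateral) kernel encourages nearby pixels of similar color to share a label, which is what sharpens object boundaries; the second (Gaussian) kernel enforces smoothness.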

Network Architecture Details

The overall architecture, based on VGG16, introduces several modifications:

All VGG16 max‑pool layers originally use kernel = 2, stride = 2, padding = 0. In DeepLabV1, the first three max‑pools use kernel = 3, stride = 2, padding = 1, still down‑sampling by a factor of 2 each time (total 8× down‑sampling). The last two max‑pools use kernel = 3, stride = 1, padding = 1 to avoid further resolution loss, followed by an average‑pool layer with the same parameters.

The last three convolutional layers of VGG16 are replaced by dilated convolutions with kernel = 3, dilation = 2, stride = 1, padding = 2.

The first fully‑connected layer is replaced by a dilated convolution (kernel = 3, dilation = 12, stride = 1, padding = 12) – this is the LargeFOV module.

The second and third fully‑connected layers are also converted to convolutional layers, producing an output feature map of size 28×28×num_classes.
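The 28×28 figure follows from the standard conv/pool output‑size formula. Assuming a 224×224 input (the usual VGG16 resolution), a quick check:

```python
def out_size(n, k, s, p):
    """Spatial output size of a conv/pool layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

n = 224                      # assumed input resolution
for _ in range(3):           # pool1-3: kernel 3, stride 2, padding 1 -> halve
    n = out_size(n, k=3, s=2, p=1)
for _ in range(3):           # pool4, pool5, avg-pool: stride 1 -> size kept
    n = out_size(n, k=3, s=1, p=1)
# n is now 28: three halvings (8x total), then no further loss
```

The dilated convolutions also keep the resolution (stride 1 with matching padding), so the score map stays at 1/8 of the input size throughout the head.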

Multi‑Scale (MSC) Structure

DeepLabV1 also incorporates a multi‑scale (MSC) branch that fuses the original image with the outputs of the first four max‑pool layers. All fused feature maps have the same spatial size (28×28×num_classes) before being combined.
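In the paper, each MSC branch is a small two‑layer network attached to the input image or to a pool output, and its score map is added element‑wise to the main stream's 28×28×num_classes map. The fusion step itself is just an element‑wise sum; a minimal sketch with score maps flattened to 1‑D lists:

```python
def fuse(score_maps):
    """Element-wise sum of equally sized score maps.

    Stand-in for MSC fusion: each map here is a flattened list of
    per-position class scores, all resized to the same spatial size.
    """
    return [sum(vals) for vals in zip(*score_maps)]
```

For example, `fuse([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])` yields `[9.0, 12.0]`, the position‑wise total of the main stream and branch scores.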

Experimental results show that adding the MSC branch improves segmentation quality noticeably.

DeepLabV1 Experimental Comparison

Visual comparisons with other methods (e.g., FCN‑8s, TTI‑Zoomout‑16) demonstrate that DeepLabV1 produces finer boundaries and more accurate edge preservation.

Conclusion

DeepLabV1 combines dilated convolutions, a large‑field‑of‑view module, fully‑connected CRF refinement, and multi‑scale fusion to overcome the down‑sampling and spatial invariance problems of standard DCNNs, achieving faster inference, higher accuracy, and a simpler model structure for semantic segmentation.

Regarding loss computation, unlike FCN which directly computes cross‑entropy between the ground‑truth and the full‑resolution output, DeepLabV1 first downsamples the ground‑truth by a factor of 8 and then computes cross‑entropy with the 28×28×num_classes output feature map.
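A minimal sketch of that training target, assuming simple subsampling for the 8× reduction (the helper names are illustrative, not from the paper's code):

```python
import math

def downsample_labels(labels, factor=8):
    """Subsample a 2-D ground-truth label map by keeping every
    factor-th pixel in each dimension -- a simple stand-in for the
    8x ground-truth reduction described above."""
    return [row[::factor] for row in labels[::factor]]

def mean_cross_entropy(probs, targets):
    """Mean per-pixel negative log-likelihood: probs is a flat list of
    class-probability vectors, targets the matching integer labels."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)
```

After subsampling, the reduced label map matches the 28×28 score map position for position, so the cross‑entropy is computed at 1/8 resolution rather than after upsampling the prediction as in FCN.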

Tags: computer vision, deep learning, semantic segmentation, CRF, DeepLabV1, dilated convolution
Written by

Rare Earth Juejin Tech Community

Juejin, a tech community that helps developers grow.
