
U-Net Boundary Detection

Project structure

unet_cityscapes
├── checkpoints                             # saved model states
├── experiments                             # TensorBoard files
├── models
│   └── unet.py                             # the only unet implementation used in the project
├── train                                   # helper functions for training: single-epoch, multi-epoch, and validation loops
│   ├── training_blocks_multi_class.py   
│   └── training_blocks_single_class.py
├── utils
│   ├── checkpointing.py                    # helpers to save and load model state
│   ├── class_remap_constants.py            # variables used to configure class remapping
│   ├── dataset_tools.py                    # load and remap images from the cityscapes dataset
│   ├── logging.py                          # only one function that loads the TensorBoard writer for logging
│   └── visualization.py                    # helpers that provide visualisation to confirm that the data loads properly - just for sanity checks
├── env.yaml
├── main.py                                 # START HERE. spaghetti that holds the sanity checks and starts the training.
│                                           #             at the bottom of the file there are function calls (partly commented out);
│                                           #             uncomment what you want to run and execute the file (activate the virtual env!)
├── Readme.md
├── vizualizer.py                           # tool to run inference with trained models and give visual feedback; annotates the provided images
└── vizualizer.sh                           # convenience wrapper around vizualizer.py for faster, more convenient inference on random images

Virtualenv:

    conda env create -f env.yaml
    conda activate unet_segmentation

Visualizer

**You need bash or zsh and the Linux find command**, else it won't work:

    ./vizualizer.sh

Logging

The project uses TensorBoard because it is FOSS, the de facto standard, and very likely sufficient for this use case.

TensorBoard produces binary log files that can then be processed by the visualization software. To look at the output, start TensorBoard from the command line with tensorboard --logdir experiments/runs, then browse to http://localhost:6006/.

Architecture

The U-Net architecture consists of a symmetric encoder-decoder structure with skip connections, forming a U-shaped network.

Encoder (Downsampling path)

The encoder progressively reduces the spatial resolution of the input using repeated downsampling stages. Each stage consists of:

  • a spatial downsampling operation (max pooling)
  • followed by convolutional feature extraction (DoubleConv)

With each downsampling step:

  • the spatial resolution is halved
  • the number of feature channels is increased

This design allows the network to:

  • increase its effective receptive field
  • capture increasingly high-level and semantic features
  • move from local patterns (edges, textures) to global structures (object parts and shapes)

Downsampling is therefore not only a computational optimization, but a key mechanism for learning contextual and global information.
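One encoder stage as described above can be sketched in PyTorch roughly like this. This is a minimal illustration: the names `DoubleConv` and `Down` mirror the description, but the actual implementation lives in models/unet.py and may differ in details such as normalization or padding.

```python
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two 3x3 convolutions, each followed by BatchNorm and ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class Down(nn.Module):
    """One downsampling stage: halve the resolution, then extract features."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.stage = nn.Sequential(nn.MaxPool2d(2), DoubleConv(in_ch, out_ch))

    def forward(self, x):
        return self.stage(x)

x = torch.randn(1, 64, 128, 128)
y = Down(64, 128)(x)   # resolution halved, channels doubled
print(y.shape)         # torch.Size([1, 128, 64, 64])
```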

Bottleneck

At the lowest resolution, the network operates on highly compressed feature maps with a large receptive field. This stage encodes global semantic information about the input image while retaining spatial structure.

Decoder (Upsampling path)

The decoder gradually restores the spatial resolution using transposed convolutions. Each upsampling stage:

  • increases spatial resolution
  • reduces the number of feature channels
  • combines semantic information from the decoder with spatial detail from the encoder via skip connections

Skip Connections

At each resolution level, the output of a downsampling stage is directly passed to the corresponding upsampling stage and concatenated with the upsampled features.

These skip connections:

  • reintroduce high-resolution spatial information
  • compensate for information lost during downsampling
  • enable pixel-accurate segmentation

They allow the network to combine:

  • what is present (high-level features from deep layers)
  • where it is located (fine spatial detail from shallow layers)
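One upsampling stage with a skip connection can be sketched like this (PyTorch; channel numbers are illustrative, and the real implementation in models/unet.py may use a full DoubleConv after the concatenation instead of the single fusing convolution shown here):

```python
import torch
import torch.nn as nn

class Up(nn.Module):
    """One upsampling stage: double the resolution, concatenate the
    skip connection, then fuse the combined feature maps."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Transposed convolution doubles H and W and halves the channels.
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.fuse = nn.Conv2d(out_ch * 2, out_ch, kernel_size=3, padding=1)

    def forward(self, x, skip):
        x = self.up(x)                   # (1, 128, 32, 32) -> (1, 64, 64, 64)
        x = torch.cat([x, skip], dim=1)  # channels: 64 + 64 = 128
        return self.fuse(x)              # back down to out_ch channels

bottom = torch.randn(1, 128, 32, 32)  # deep, semantic features
skip = torch.randn(1, 64, 64, 64)     # shallow, high-resolution features
out = Up(128, 64)(bottom, skip)
print(out.shape)                      # torch.Size([1, 64, 64, 64])
```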

Output Layer

A final 1×1 convolution maps the decoder's feature maps to the desired number of output channels. The network outputs raw logits, which are later converted to probabilities by a suitable activation function (e.g. sigmoid) outside the model.
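The output head can be sketched as follows (PyTorch; the 64 decoder channels and spatial size are illustrative):

```python
import torch
import torch.nn as nn

n_classes = 1                                   # binary segmentation: one logit per pixel
head = nn.Conv2d(64, n_classes, kernel_size=1)  # 1x1 conv: a per-pixel linear map
feats = torch.randn(1, 64, 256, 256)            # decoder feature maps
logits = head(feats)                            # (1, 1, 256, 256), raw logits
probs = torch.sigmoid(logits)                   # activation applied outside the model
```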

Training

In the first run, train for 50 epochs to learn automobiles (cars + trucks + caravans + buses + trailers) as a single class. In the second run, try to learn pedestrians:

1.1. on the model that already understands automobiles
1.2. on a new model

2.1. on the trained model
2.2. on a new one

Loss Functions

The loss function defines how far the model's prediction deviates from the ground truth and produces a single scalar value that is minimized during training.

BCEWithLogitsLoss (Binary Cross Entropy with Logits)

Used for binary segmentation (foreground vs. background).

  • The model outputs one value per pixel (logit)
  • The target is 0 or 1 per pixel
  • The loss compares each predicted pixel independently with its target
  • Wrong confident predictions are penalized more than uncertain ones
  • The final loss is the average over all pixels in the batch

This loss directly trains the model to assign high confidence to foreground pixels and low confidence to background pixels.
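The shapes involved can be sketched like this (PyTorch; batch and image sizes are illustrative):

```python
import torch
import torch.nn as nn

loss_fn = nn.BCEWithLogitsLoss()      # sigmoid + BCE in one numerically stable op
logits = torch.randn(4, 1, 64, 64)    # one logit per pixel
target = torch.randint(0, 2, (4, 1, 64, 64)).float()  # 0 or 1 per pixel
loss = loss_fn(logits, target)        # scalar, averaged over all pixels in the batch
```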

CrossEntropyLoss

Used for multi-class segmentation with mutually exclusive classes.

  • The model outputs one value per class and pixel
  • The target is a single class index per pixel
  • The loss evaluates how strongly the model favors the correct class compared to all other classes
  • The final loss is averaged over all pixels in the batch

This loss enforces competition between classes at each pixel location.
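The corresponding multi-class shapes look like this (PyTorch; the class count and sizes are illustrative):

```python
import torch
import torch.nn as nn

n_classes = 8
loss_fn = nn.CrossEntropyLoss()
logits = torch.randn(4, n_classes, 64, 64)         # one value per class and pixel
target = torch.randint(0, n_classes, (4, 64, 64))  # a single class index per pixel
loss = loss_fn(logits, target)                     # scalar, averaged over all pixels
```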

Dice Loss

Used to measure mask overlap.

  • The loss compares predicted and target masks as whole regions
  • Penalizes mismatches in shape and area rather than individual pixels
  • Especially useful when foreground pixels are rare

Dice loss is often combined with BCE or Cross Entropy to stabilize training.
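A minimal soft Dice loss for the binary case could look like this; this is a sketch, not necessarily the exact formulation used in the training code:

```python
import torch

def dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss for binary segmentation: compares predicted and
    target masks as whole regions instead of pixel by pixel."""
    probs = torch.sigmoid(logits)
    intersection = (probs * target).sum()
    union = probs.sum() + target.sum()
    dice = (2.0 * intersection + eps) / (union + eps)
    return 1.0 - dice  # 0 = perfect overlap, 1 = no overlap

logits = torch.randn(1, 1, 64, 64)
target = torch.randint(0, 2, (1, 1, 64, 64)).float()
loss = dice_loss(logits, target)
```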

Optimizer

Background

Convolutions / Convolution Layers

Filters and convolutions

  • A filter (also called a kernel) is a small set of learned weights that operates on a local region of the input (image or feature map).

  • A convolution applies the same filter at (typically) every spatial location of the input, producing a new feature map.

  • A filter has one learned weight for each spatial position and each input channel. For example, for a 3×3 filter on an RGB image, the filter has 3×3×3 weights.

  • At a given spatial position, the filter computes a weighted sum of the corresponding local input patch across all channels, then adds a bias. This produces a single scalar output value at that position.

  • Sliding the filter over all spatial positions results in a 2D feature map. Each filter in a convolution layer produces one feature map.
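The weight counts above can be checked directly by inspecting a layer's parameter shapes (PyTorch; the filter count of 16 is arbitrary):

```python
import torch.nn as nn

# 3x3 filters on an RGB (3-channel) image, 16 filters -> 16 feature maps
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
print(conv.weight.shape)  # torch.Size([16, 3, 3, 3]): 16 filters, each 3x3x3
print(conv.bias.shape)    # torch.Size([16]): one bias per filter
```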

Transposed Convolutions (ConvTranspose2d)

  • A ConvTranspose2d filter (also called a transposed kernel) is also a set of learned weights for each spatial position and input channel. For example, a 2×2 filter on a feature map with 8 channels has 2×2×8 weights.
  • At a given input pixel, the filter spreads its value across a local patch in the output (size = kernel_size), multiplying by the corresponding weights. Multiple input pixels may contribute to the same output positions if their patches overlap.
  • The overlapping contributions from all relevant input pixels are summed, producing the final values in the upsampled output feature map.
  • Applying this filter across all input pixels results in an upsampled feature map. Each filter produces one output channel, just like in a normal convolution layer.
  • ConvTranspose2d is essentially the same mathematical operation as Conv2d, but “inverted”: instead of combining a local patch into one scalar, it spreads one input value into a local patch, increasing the spatial resolution.
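The 2×2 example above, with stride 2 for doubling the resolution, can be verified the same way (the output channel count of 4 is arbitrary):

```python
import torch
import torch.nn as nn

# 2x2 transposed filter on an 8-channel feature map, as in the example above
up = nn.ConvTranspose2d(in_channels=8, out_channels=4, kernel_size=2, stride=2)
print(up.weight.shape)  # torch.Size([8, 4, 2, 2]): 2x2 weights per in/out channel pair

x = torch.randn(1, 8, 16, 16)
y = up(x)               # each input pixel spreads into a 2x2 output patch
print(y.shape)          # torch.Size([1, 4, 32, 32]): resolution doubled
```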

Informally, what happens when a transposed convolution upsamples is this: the downsampled input contains semantic information about what is present at each location, and the transposed convolution "generates" or "draws" the extra detail needed to upsample. The kernel has learned how to do that, e.g. how to refine a flower. This is one reason why U-Nets are a great fit for generating images.

(input / output) channels

A channel in a convolution layer is one 2D feature map. The number of input channels is the number of feature maps (or image channels) the layer receives; the number of output channels equals the number of filters in the layer, since each filter produces exactly one feature map.

Dataset

Structure

dataset
├── gtFine               # labels
│   ├── test             # labels contain no information, since this is the test split
│   │   ├── berlin       # below the labels for two pictures
│   │   │   ├── berlin_000000_000019_gtFine_color.png           # human readable image, labels are represented as colors         
│   │   │   ├── berlin_000000_000019_gtFine_instanceIds.png     # objects are labeled with class and instance id
│   │   │   ├── berlin_000000_000019_gtFine_labelIds.png        # segmentation with class id's only
│   │   │   ├── berlin_000000_000019_gtFine_polygons.json       # polygons like in an SVG - probably for lossless rescaling
│   │   │   ├── berlin_000001_000019_gtFine_color.png
│   │   │   ├── berlin_000001_000019_gtFine_instanceIds.png
│   │   │   ├── berlin_000001_000019_gtFine_labelIds.png
│   │   │   ├── berlin_000001_000019_gtFine_polygons.json
│   │   ...
│   ├── train            # labeled data
│   │   ├── aachen
│   │   ...
│   └── val              # labeled data
│       ├── frankfurt
│       ... 
└── leftImg8bit          # images
    ├── test
    │   ├── berlin
    ...
    ├── train
    │   ├── aachen
    ...
    └── val
        ├── frankfurt
        ...

Labels

Here is part of the label table found in the scripts for the Cityscapes dataset. The file is located at /mnt/ssd_data/meins/Coding/machine_learning/cityscapesScripts/cityscapesscripts/helpers/labels.py

Note: a lot of columns were deleted, because at the point of extraction only name, id and category were needed.

    name                     id    category      
    'unlabeled'            ,  0 ,  'void'        
    'ego vehicle'          ,  1 ,  'void'        
    'rectification border' ,  2 ,  'void'        
    'out of roi'           ,  3 ,  'void'        
    'static'               ,  4 ,  'void'        
    'dynamic'              ,  5 ,  'void'        
    'ground'               ,  6 ,  'void'        
    'road'                 ,  7 ,  'flat'        
    'sidewalk'             ,  8 ,  'flat'        
    'parking'              ,  9 ,  'flat'        
    'rail track'           , 10 ,  'flat'        
    'building'             , 11 ,  'construction'
    'wall'                 , 12 ,  'construction'
    'fence'                , 13 ,  'construction'
    'guard rail'           , 14 ,  'construction'
    'bridge'               , 15 ,  'construction'
    'tunnel'               , 16 ,  'construction'
    'pole'                 , 17 ,  'object'      
    'polegroup'            , 18 ,  'object'      
    'traffic light'        , 19 ,  'object'      
    'traffic sign'         , 20 ,  'object'      
    'vegetation'           , 21 ,  'nature'      
    'terrain'              , 22 ,  'nature'      
    'sky'                  , 23 ,  'sky'         
    'person'               , 24 ,  'human'       
    'rider'                , 25 ,  'human'       
    'car'                  , 26 ,  'vehicle'     
    'truck'                , 27 ,  'vehicle'     
    'bus'                  , 28 ,  'vehicle'     
    'caravan'              , 29 ,  'vehicle'     
    'trailer'              , 30 ,  'vehicle'     
    'train'                , 31 ,  'vehicle'     
    'motorcycle'           , 32 ,  'vehicle'     
    'bicycle'              , 33 ,  'vehicle'     
    'license plate'        , -1 ,  'vehicle'
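As an illustration of how these ids can be used, remapping the automobile classes from the training plan above into a single binary mask could look like this (numpy sketch; the project's actual remapping lives in utils/class_remap_constants.py and utils/dataset_tools.py and may differ):

```python
import numpy as np

# Cityscapes ids for the "automobile" single class:
# car=26, truck=27, bus=28, caravan=29, trailer=30
AUTOMOBILE_IDS = [26, 27, 28, 29, 30]

def remap_to_binary(label_ids: np.ndarray) -> np.ndarray:
    """Turn a *_gtFine_labelIds.png array into a 0/1 automobile mask."""
    return np.isin(label_ids, AUTOMOBILE_IDS).astype(np.uint8)

labels = np.array([[7, 26, 24],
                   [28, 0, 30]])  # road, car, person / bus, unlabeled, trailer
mask = remap_to_binary(labels)
print(mask)                       # [[0 1 0]
                                  #  [1 0 1]]
```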