As a fundamental task in computer vision, semantic segmentation has achieved tremendous progress, driven by the rapid evolution of segmentation network architectures (e.g., FCN, Transformer).
Modern segmentation approaches focus only on mining “local” context, i.e., dependencies between pixels within individual images, via specially designed context aggregation modules (e.g., dilated convolution) or structure-aware optimization objectives (e.g., IoU-like loss). However, they largely ignore the “global” context of the training data, i.e., the rich semantic relations between pixels across different images. In this talk, we will introduce a pixel-wise metric learning paradigm for semantic segmentation that explicitly explores the structure of the whole training dataset.
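The cross-image idea can be made concrete as a supervised contrastive objective over pixel embeddings: pixels of the same class, even when drawn from different images, are pulled together, while pixels of other classes are pushed apart. A minimal NumPy sketch of one such loss (the function name, temperature value, and InfoNCE-style formulation are illustrative assumptions, not the talk's exact objective):

```python
import numpy as np

def pixel_contrastive_loss(emb, labels, temperature=0.1):
    """Supervised InfoNCE-style loss over pixel embeddings.

    emb:    (N, D) L2-normalized pixel embeddings, sampled from
            several images in a batch (the cross-image "global" part).
    labels: (N,) semantic class id per pixel.
    Same-class pixels act as positives, all other pixels as negatives.
    """
    sim = emb @ emb.T / temperature                  # pairwise similarities
    np.fill_diagonal(sim, -np.inf)                   # exclude self-pairs
    # row-wise log-softmax over similarities
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = labels[:, None] == labels[None, :]         # positive-pair mask
    np.fill_diagonal(pos, False)
    # mean log-likelihood of positives per anchor, negated
    pos_log_prob = np.where(pos, log_prob, 0.0).sum(axis=1)
    n_pos = pos.sum(axis=1)
    per_anchor = -pos_log_prob / np.maximum(n_pos, 1)
    return per_anchor[n_pos > 0].mean()
```

In practice such a loss is used alongside the standard per-pixel cross-entropy; the metric term shapes the embedding space using dataset-level relations that cross-entropy alone never sees.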
Moreover, prevalent segmentation solutions, despite their different network designs (FCN-based or Transformer-based) and mask decoding strategies (parametric-softmax-based or pixel-query-based), can be placed in a single category by viewing the softmax weights or query vectors as learnable class prototypes. In light of this prototype view, I will discuss several fundamental limitations of such a parametric segmentation regime, and introduce a nonparametric alternative based on non-learnable prototypes.
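Under the prototype view, the parametric and nonparametric regimes differ only in where the class prototypes come from: learned weight vectors in the first case, non-learnable class representatives (e.g., mean embeddings of labeled pixels) in the second. A minimal sketch, assuming cosine similarity for the nonparametric head; function names and the similarity choice are illustrative, not the talk's exact formulation:

```python
import numpy as np

def parametric_logits(pix_emb, weight):
    """Standard segmentation head: each row of the softmax weight
    matrix acts as one *learnable* prototype per class; logits are
    inner products between pixel embeddings and prototypes."""
    return pix_emb @ weight.T                        # (N, C)

def nonparametric_logits(pix_emb, prototypes):
    """Nonparametric alternative: prototypes are *non-learnable*
    class representatives (e.g., class-mean pixel embeddings); each
    pixel is scored by cosine similarity to every prototype."""
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    e = pix_emb / np.linalg.norm(pix_emb, axis=1, keepdims=True)
    return e @ p.T                                   # (N, C)
```

In both cases prediction is an argmax over the per-class scores; the nonparametric head simply replaces gradient-learned weights with prototypes computed directly from the data.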