Reconciling feature sharing and multiple predictions with MIMO Vision Transformers

Abstract

Multi-input multi-output (MIMO) training improves network performance by optimizing multiple subnetworks simultaneously. In this paper, we propose MixViT, the first MIMO framework for vision transformers, which takes advantage of ViTs' innate mechanisms to share features between subnetworks. This is in stark contrast to traditional MIMO CNNs, which are limited by their inability to share features across subnetworks. MixViT instead separates subnetworks only in the last layers, thanks to a novel source attribution that ties tokens to specific subnetworks. As such, we retain the benefits of multi-output supervision while training strong features useful to both subnetworks. We verify that MixViT leads to significant gains across multiple architectures (ConViT, CaiT) and datasets (CIFAR, TinyImageNet, ImageNet-100, and ImageNet-1k) by fitting multiple subnetworks at the end of a base model.
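The sketch below illustrates one possible reading of the idea in PyTorch: a shared ViT backbone produces tokens, each subnetwork tags those tokens with its own learned source-attribution embedding, and only a few dedicated final blocks and classifiers are subnetwork-specific. This is a minimal, assumption-laden illustration rather than the paper's implementation: the class name MixViTHead, the additive source embedding, and the number of final blocks are hypothetical choices made for the example.

```python
import torch
import torch.nn as nn


class MixViTHead(nn.Module):
    """Hypothetical MIMO head on top of a shared ViT backbone (illustrative only).

    Shared tokens are tagged with a per-subnetwork "source attribution" embedding,
    then processed by a small stack of subnetwork-specific final blocks and a
    per-subnetwork classifier, mirroring the late separation described above.
    """

    def __init__(self, embed_dim, num_classes, num_subnetworks=2, num_final_blocks=2):
        super().__init__()
        self.num_subnetworks = num_subnetworks
        # One learned source-attribution embedding per subnetwork (assumed additive).
        self.source_embed = nn.Parameter(torch.zeros(num_subnetworks, 1, embed_dim))
        # Subnetworks are separated only in these last transformer blocks.
        self.final_blocks = nn.ModuleList([
            nn.ModuleList([
                nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
                for _ in range(num_final_blocks)
            ])
            for _ in range(num_subnetworks)
        ])
        self.heads = nn.ModuleList([
            nn.Linear(embed_dim, num_classes) for _ in range(num_subnetworks)
        ])

    def forward(self, shared_tokens):
        # shared_tokens: (batch, num_tokens, embed_dim) from the shared backbone.
        outputs = []
        for m in range(self.num_subnetworks):
            # Tie tokens to subnetwork m via its source-attribution embedding.
            tokens = shared_tokens + self.source_embed[m]
            for block in self.final_blocks[m]:
                tokens = block(tokens)
            # Mean-pool tokens and classify with the subnetwork-specific head.
            outputs.append(self.heads[m](tokens.mean(dim=1)))
        return outputs  # one prediction per subnetwork
```

Under these assumptions, most of the computation stays in the shared backbone, and only the tagged final blocks and classifiers are duplicated per subnetwork, which is the trade-off the abstract contrasts with fully separated MIMO CNNs.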

Publication
OpenReview preprint
Remy Sun
Postdoctoral researcher

I am a postdoctoral researcher in the MAASAI team at Inria Sophia Antipolis, working on interactions between autonomous driving systems and maps (as knowledge bases).