MVViT

Object detection has been thoroughly investigated over the last decade using deep neural networks. However, the additional information provided by multiple concurrent views of the same scene has received little attention. In scenarios where objects may appear in obscure poses from certain viewpoints, the use of differing simultaneous views can improve object detection. We therefore propose a multi-view fusion network that enriches the backbone features of standard object detection architectures across multiple source and target viewpoints. Our method consists of a transformer decoder for the target view that combines the feature maps of the remaining source views. In this way, the feature representation of the target view aggregates feature information from the source views through attention. Our architecture is detector-agnostic, meaning it can be applied to any existing detection backbone. We evaluate performance using YOLOX, Deformable DETR and Swin Transformer baseline detectors, comparing standard single-view performance against the addition of our multi-view transformer architecture. Our method achieves a 3% increase in COCO AP on a four-view X-ray security dataset and a slight 0.7% increase on a seven-view pedestrian dataset. We demonstrate that integrating different views using attention-based networks improves detection performance on multi-view datasets.
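To illustrate the fusion mechanism described in the abstract, the following is a minimal PyTorch sketch of cross-attention fusion between a target view and its source views using a standard transformer decoder. All names, shapes, and hyperparameters here are illustrative assumptions, not the authors' implementation; positional encodings and multi-scale feature handling are omitted for brevity.

```python
import torch
import torch.nn as nn

class MultiViewFusion(nn.Module):
    """Hypothetical sketch: enrich a target view's backbone feature map
    with source views via transformer-decoder cross-attention."""

    def __init__(self, channels: int, num_heads: int = 8, num_layers: int = 1):
        super().__init__()
        layer = nn.TransformerDecoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, target_feat, source_feats):
        # target_feat: (B, C, H, W) backbone features of the target view
        # source_feats: list of (B, C, H, W) feature maps from the other views
        b, c, h, w = target_feat.shape
        # Flatten spatial dimensions into token sequences: (B, H*W, C)
        tgt = target_feat.flatten(2).transpose(1, 2)
        # Concatenate all source-view tokens into one memory sequence
        mem = torch.cat([f.flatten(2).transpose(1, 2) for f in source_feats], dim=1)
        # Target tokens attend to source tokens via cross-attention
        fused = self.decoder(tgt, mem)
        # Restore the spatial layout so a standard detection head can consume it
        return fused.transpose(1, 2).reshape(b, c, h, w)


# Usage: fuse one view with three other views before the detector head
fusion = MultiViewFusion(channels=256)
target = torch.randn(2, 256, 32, 32)
sources = [torch.randn(2, 256, 32, 32) for _ in range(3)]
out = fusion(target, sources)  # (2, 256, 32, 32)
```

Because the fused output keeps the original feature-map shape, such a module could in principle be inserted after the backbone of any detector, which matches the detector-agnostic claim above.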
