SSViT Family
| model | params (M) | pretrain | head | train | GFLOPs | mask mAP |
|---|---|---|---|---|---|---|
| SSViT-T | 15.0 | IN-1k : Sup. : 300 | Mask R-CNN | COCO (train) : 12 | 223.0 | 42.6 |
| SSViT-S | 27.0 | IN-1k : Sup. : 300 | Mask R-CNN | COCO (train) : 36 | 266.0 | 45.4 |
| SSViT-S | 27.0 | IN-1k : Sup. : 300 | Mask R-CNN | COCO (train) : 12 | 266.0 | 44.0 |
| SSViT-S | 27.0 | IN-1k : Sup. : 300 | Cascade Mask R-CNN | COCO (train) : 36 | 745.0 | 46.6 |
| SSViT-B | 57.0 | IN-1k : Sup. : 300 | Mask R-CNN | COCO (train) : 36 | 382.0 | 46.4 |
| SSViT-B | 57.0 | IN-1k : Sup. : 300 | Mask R-CNN | COCO (train) : 12 | 382.0 | 45.4 |
| SSViT-B | 57.0 | IN-1k : Sup. : 300 | Cascade Mask R-CNN | COCO (train) : 36 | 861.0 | 47.6 |
| SSViT-L | 100.0 | IN-1k : Sup. : 300 | Mask R-CNN | COCO (train) : 12 | 572.0 | 46.0 |

ImageNet classification
| model | params (M) | pretrain | finetune | GFLOPs | IN-1k | IN-V2 | IN-A | IN-R |
|---|---|---|---|---|---|---|---|---|
| SSViT-T | 15.0 | IN-1k : Sup. : 300 | — : — : — | 2.4 | 83.0/— | 72.3/— | 32.6/— | 45.6/— |
| SSViT-S | 27.0 | IN-1k : Sup. : 300 | — : — : — | 4.4 | 84.4/— | 74.1/— | 41.6/— | 51.0/— |
| SSViT-B | 57.0 | IN-1k : Sup. : 300 | — : — : — | 9.6 | 85.3/— | 75.7/— | 49.4/— | 55.6/— |
| SSViT-L | 100.0 | IN-1k : Sup. : 300 | — : — : — | 18.2 | 85.7/— | 76.1/— | 55.0/— | 59.2/— |

COCO (val), box AP
| model | pretrain | head | train | GFLOPs | mAPb | APb50 | APb75 | mAPbs | mAPbm | mAPbl |
|---|---|---|---|---|---|---|---|---|---|---|
| SSViT-T | IN-1k : Sup. : 300 | Mask R-CNN | COCO (train) : 12 | 223.0 | 47.3 | 69.1 | 51.7 | — | — | — |
| SSViT-T | IN-1k : Sup. : 300 | RetinaNet | COCO (train) : 12 | 205.0 | 45.6 | 66.5 | 49.3 | 28.6 | 50.1 | 60.5 |
| SSViT-S | IN-1k : Sup. : 300 | Mask R-CNN | COCO (train) : 36 | 266.0 | 51.2 | 72.0 | 56.0 | — | — | — |
| SSViT-S | IN-1k : Sup. : 300 | Mask R-CNN | COCO (train) : 12 | 266.0 | 49.4 | 70.8 | 54.1 | — | — | — |
| SSViT-S | IN-1k : Sup. : 300 | Cascade Mask R-CNN | COCO (train) : 36 | 745.0 | 53.8 | 72.4 | 58.1 | — | — | — |
| SSViT-S | IN-1k : Sup. : 300 | RetinaNet | COCO (train) : 12 | 248.0 | 47.5 | 68.6 | 50.8 | 30.1 | 52.2 | 63.3 |
| SSViT-B | IN-1k : Sup. : 300 | Mask R-CNN | COCO (train) : 36 | 382.0 | 52.6 | 73.2 | 57.7 | — | — | — |
| SSViT-B | IN-1k : Sup. : 300 | Mask R-CNN | COCO (train) : 12 | 382.0 | 51.0 | 72.5 | 55.8 | — | — | — |
| SSViT-B | IN-1k : Sup. : 300 | Cascade Mask R-CNN | COCO (train) : 36 | 861.0 | 54.9 | 73.7 | 59.7 | — | — | — |
| SSViT-B | IN-1k : Sup. : 300 | RetinaNet | COCO (train) : 12 | 363.0 | 49.0 | 70.2 | 52.9 | 32.4 | 53.4 | 64.8 |
| SSViT-L | IN-1k : Sup. : 300 | Mask R-CNN | COCO (train) : 12 | 572.0 | 51.6 | 72.9 | 56.6 | — | — | — |
| SSViT-L | IN-1k : Sup. : 300 | RetinaNet | COCO (train) : 12 | 553.0 | 50.0 | 71.4 | 53.8 | 33.2 | 54.6 | 65.0 |

COCO (val), mask AP
| model | pretrain | head | train | GFLOPs | mAPm | APm50 | APm75 | mAPms | mAPmm | mAPml |
|---|---|---|---|---|---|---|---|---|---|---|
| SSViT-T | IN-1k : Sup. : 300 | Mask R-CNN | COCO (train) : 12 | 223.0 | 42.6 | 66.2 | 45.8 | — | — | — |
| SSViT-S | IN-1k : Sup. : 300 | Mask R-CNN | COCO (train) : 36 | 266.0 | 45.4 | 69.7 | 49.0 | — | — | — |
| SSViT-S | IN-1k : Sup. : 300 | Mask R-CNN | COCO (train) : 12 | 266.0 | 44.0 | 67.7 | 47.3 | — | — | — |
| SSViT-S | IN-1k : Sup. : 300 | Cascade Mask R-CNN | COCO (train) : 36 | 745.0 | 46.6 | 70.1 | 50.4 | — | — | — |
| SSViT-B | IN-1k : Sup. : 300 | Mask R-CNN | COCO (train) : 36 | 382.0 | 46.4 | 70.9 | 50.3 | — | — | — |
| SSViT-B | IN-1k : Sup. : 300 | Mask R-CNN | COCO (train) : 12 | 382.0 | 45.4 | 69.7 | 48.9 | — | — | — |
| SSViT-B | IN-1k : Sup. : 300 | Cascade Mask R-CNN | COCO (train) : 36 | 861.0 | 47.6 | 71.6 | 51.5 | — | — | — |
| SSViT-L | IN-1k : Sup. : 300 | Mask R-CNN | COCO (train) : 12 | 572.0 | 46.0 | 70.1 | 49.8 | — | — | — |

ADE20K (val)
| model | pretrain | head | train | GFLOPs | mIoUms | pAccms | mAccms | mIoUss | pAccss | mAccss |
|---|---|---|---|---|---|---|---|---|---|---|
| SSViT-T | IN-1k : Sup. : 300 | Panoptic FPN | ADE20K (train) : 640 : 512 | 35.0 | — | — | — | 46.8 | — | — |
| SSViT-S | IN-1k : Sup. : 300 | UPerNet | ADE20K (train) : 160 : 512 | 941.0 | 50.1 | — | — | — | — | — |
| SSViT-S | IN-1k : Sup. : 300 | Panoptic FPN | ADE20K (train) : 640 : 512 | 184.0 | — | — | — | 49.6 | — | — |
| SSViT-B | IN-1k : Sup. : 300 | UPerNet | ADE20K (train) : 160 : 512 | 1060.0 | 52.2 | — | — | — | — | — |
| SSViT-B | IN-1k : Sup. : 300 | Panoptic FPN | ADE20K (train) : 640 : 512 | 303.0 | — | — | — | 51.0 | — | — |
| SSViT-L | IN-1k : Sup. : 300 | UPerNet | ADE20K (train) : 160 : 512 | 1256.0 | 53.3 | — | — | — | — | — |
| SSViT-L | IN-1k : Sup. : 300 | Panoptic FPN | ADE20K (train) : 640 : 512 | 497.0 | — | — | — | 51.5 | — | — |