SAMannot: A Memory-Efficient, Local, Open-Source Framework for Interactive Video Instance Segmentation Based on SAM2

Open Access | Apr 2026

Figures & Tables

Figure 1

This figure provides a high-level overview of the software architecture, organized by module functionality. Arrows denote invocation direction. Green boxes indicate core pipeline components responsible for a broad range of tasks. Blue boxes represent subsidiary subsystems with well-defined roles within the pipeline. Yellow boxes correspond to utility classes, including data-structure components and auxiliary widgets extending Tkinter functionality. The red box denotes the underlying tracking dependency (SAM2).

Figure 2

The graphical user interface comprises the control panel (left), the canvas (right), and the slider (below the canvas). The control panel contains the following modules: (A) loading and saving sessions, (B) settings and information buttons, (C) loading input media, (D) label management, (E) feature management, (F) annotation propagation, (G) visualization toggle, (H) frame slider.

Figure 3

Control flow for annotating a single block: the input video is processed in blocks to keep memory usage bounded. Within each block, the user provides prompts on selected frames; SAM2 extends these prompts into masks and propagates the masks to the remaining frames of the block. Finally, the system allows saving the annotation configuration, the log, and the annotations.
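
A minimal sketch of this block-wise loop, assuming the public SAM2 video-predictor API (build_sam2_video_predictor, init_state, add_new_points_or_box, propagate_in_video); the block layout, checkpoint paths, and prompt bookkeeping here are illustrative assumptions, not SAMannot's actual implementation:

    # Sketch: block-wise mask propagation with SAM2 (illustrative, not SAMannot's code).
    # Assumes the frames of one block have been extracted as JPEGs into block_dir,
    # the layout the SAM2 video predictor expects.
    import torch
    from sam2.build_sam import build_sam2_video_predictor

    predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

    def annotate_block(block_dir, prompts):
        """Propagate user prompts through one block of frames.

        prompts: {frame_idx: {obj_id: (points, labels)}}, where points is an
        (N, 2) float array in (x, y) pixel coordinates and labels holds
        1 (foreground) / 0 (background) clicks.
        Returns {frame_idx: {obj_id: boolean mask}} for every frame in the block.
        """
        state = predictor.init_state(video_path=block_dir)
        # Turn each user prompt into a mask on its prompted frame.
        for frame_idx, objs in prompts.items():
            for obj_id, (points, labels) in objs.items():
                predictor.add_new_points_or_box(
                    state, frame_idx=frame_idx, obj_id=obj_id,
                    points=points, labels=labels)
        # Propagate the prompted masks to the remaining frames of the block.
        masks = {}
        with torch.inference_mode():
            for frame_idx, obj_ids, logits in predictor.propagate_in_video(state):
                masks[frame_idx] = {
                    obj_id: (logits[i, 0] > 0).cpu().numpy()
                    for i, obj_id in enumerate(obj_ids)}
        # Drop per-block GPU state before the next block to keep VRAM bounded.
        predictor.reset_state(state)
        return masks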

Table 1

Quantitative evaluation of SAMannot on a subset of the DAVIS 2017 train-val dataset (480p resolution). The metrics represent the mean Intersection over Union (IoU), Dice coefficient, and Pixel Accuracy for each sequence.

SEQUENCE NAME    | FRAMES (#) | INSTANCES (#) | ALL MASKS (#) | MEAN IOU | MEAN DICE | PIXEL ACC.
Rhino            | 90    | 1 | 90     | 0.9807 | 0.9902 | 0.9968
Cows             | 104   | 1 | 104    | 0.9710 | 0.9853 | 0.9966
Bear             | 82    | 1 | 82     | 0.9703 | 0.9849 | 0.9966
Camel            | 90    | 1 | 90     | 0.9691 | 0.9843 | 0.9962
Dog              | 60    | 1 | 60     | 0.9640 | 0.9817 | 0.9960
Breakdance       | 84    | 1 | 84     | 0.9593 | 0.9792 | 0.9963
Breakdance-flare | 71    | 1 | 71     | 0.9586 | 0.9789 | 0.9974
Tuk-tuk          | 59    | 3 | 177    | 0.9515 | 0.9747 | 0.9783
Blackswan        | 50    | 1 | 50     | 0.9506 | 0.9746 | 0.9950
Cat-girl         | 89    | 2 | 178    | 0.9460 | 0.9722 | 0.9838
Night-race       | 46    | 2 | 83     | 0.9445 | 0.9656 | 0.9973
Train            | 80    | 4 | 320    | 0.9290 | 0.9631 | 0.9877
Bus              | 80    | 1 | 80     | 0.9295 | 0.9626 | 0.9883
Classic-car      | 63    | 3 | 189    | 0.9265 | 0.9579 | 0.9879
Color-run        | 84    | 3 | 217    | 0.9252 | 0.9579 | 0.9695
Boxing-fisheye   | 87    | 3 | 261    | 0.9099 | 0.9522 | 0.9948
Bike-packing     | 69    | 2 | 138    | 0.9026 | 0.9482 | 0.9825
Pigs             | 79    | 3 | 237    | 0.8914 | 0.9278 | 0.9885
Boat             | 75    | 1 | 75     | 0.8243 | 0.9036 | 0.9882
Sheep            | 68    | 5 | 340    | 0.8361 | 0.8945 | 0.9951
Drone            | 91    | 4 | 298    | 0.8188 | 0.8649 | 0.9627
Schoolgirls      | 80    | 7 | 560    | 0.7473 | 0.8214 | 0.9896
Average          | 76    | – | 172    | 0.9185 | 0.9512 | 0.9893
Std dev.         | 14.37 | – | 125.27 | 0.0606 | 0.0436 | 0.0093
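
For reference, the three metrics in Tables 1 and 2 follow their standard per-mask definitions; the NumPy sketch below reimplements them for a pair of boolean masks (an illustration, not the authors' evaluation script):

    # Standard per-mask metrics used in Tables 1 and 2 (illustrative reimplementation).
    import numpy as np

    def mask_metrics(pred: np.ndarray, gt: np.ndarray):
        """IoU, Dice, and pixel accuracy for two boolean masks of equal shape."""
        pred, gt = pred.astype(bool), gt.astype(bool)
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        iou = inter / union if union else 1.0   # empty-vs-empty counts as perfect
        total = pred.sum() + gt.sum()
        dice = 2 * inter / total if total else 1.0
        pixel_acc = (pred == gt).mean()         # agreement over all pixels
        return iou, dice, pixel_acc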
Table 2

Quantitative evaluation of SAMannot on a subset of the LVOS dataset [17]. The metrics represent the mean Intersection over Union (IoU), Dice coefficient, and Pixel Accuracy for each sequence.

SEQUENCE NAME | FRAMES (#) | INSTANCES (#) | ALL MASKS (#) | MEAN IOU | MEAN DICE | PIXEL ACC.
3bvEjhOT | 461    | 2  | 896     | 0.9585 | 0.9778 | 0.9967
7K7WVzGG | 617    | 2  | 1262    | 0.8030 | 0.8391 | 0.9989
cUD1dwuP | 793    | 4  | 2723    | 0.9282 | 0.9588 | 0.9969
EWCZAcdt | 1412   | 2  | 2056    | 0.8333 | 0.9010 | 0.9993
HYSm91eM | 500    | 10 | 4992    | 0.7997 | 0.8452 | 0.9934
Average  | 757    | –  | 2386    | 0.8645 | 0.9044 | 0.9970
Std dev. | 388.45 | –  | 1619.97 | 0.0739 | 0.0635 | 0.0023
Figure 4

Qualitative examples of segmentation results on images from the DAVIS 2017 dataset. The columns display the original video frame (left), the ground truth (middle), and the masks predicted by SAMannot (right).

Figure 5

Comparison of semantic segmentation boundaries. From left to right: original frame of the Blackswan sequence, official DAVIS 2017 ground truth, and SAMannot prediction. Note the discrepancy regarding the swan’s feet: while the ground truth excludes them, SAMannot correctly identifies these regions as part of the semantic instance. Such differences contribute to a lower measured Mean IoU and Mean Dice, despite the model providing a more anatomically complete segmentation.

Figure 6

Illustrative frames from the DAVIS sequences analyzed for the performance metrics.

Figure 7

Illustrative frames from the LVOS sequences analyzed for the performance metrics, demonstrating the visual diversity of the dataset.

Table 3

Performance metrics and resource utilization during video annotation on videos from the DAVIS 2017 dataset [16]. Duration encompasses label definition, annotation, and the final data export.

VIDEO NAME  | INST. (#) | FRAMES (#) | DURATION (mm:ss) | VRAM_min (MiB) | VRAM_max (MiB) | ΔVRAM (MiB)
night-race  | 2 | 46 | 0:47 | 1503 | 2354 | 851
schoolgirls | 7 | 80 | 5:09 | 1536 | 2898 | 1362
train       | 4 | 80 | 4:09 | 1512 | 2711 | 1199
tuk-tuk     | 3 | 59 | 1:33 | 1509 | 2669 | 1160
sheep       | 5 | 68 | 1:35 | 1519 | 2711 | 1192
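
The VRAM columns report the minimum and maximum device memory observed over a session, with ΔVRAM = VRAM_max − VRAM_min. A sampler of this kind can be built on NVML via the pynvml package; the class below is a hypothetical sketch, not the instrumentation behind Tables 3 and 4:

    # Sketch: sampling GPU memory during an annotation session via NVML (pynvml).
    # Hypothetical instrumentation; ΔVRAM = max - min over the sampled values.
    import threading
    import pynvml

    class VramSampler:
        """Samples device memory use (MiB) on a background thread."""
        def __init__(self, device_index=0, interval_s=0.5):
            pynvml.nvmlInit()
            self._handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
            self._interval = interval_s
            self._samples = []
            self._stop = threading.Event()
            self._thread = threading.Thread(target=self._run, daemon=True)

        def _run(self):
            while not self._stop.is_set():
                info = pynvml.nvmlDeviceGetMemoryInfo(self._handle)
                self._samples.append(info.used / 2**20)   # bytes -> MiB
                self._stop.wait(self._interval)

        def __enter__(self):
            self._thread.start()
            return self

        def __exit__(self, *exc):
            self._stop.set()
            self._thread.join()

        def summary(self):
            vmin, vmax = min(self._samples), max(self._samples)
            return vmin, vmax, vmax - vmin   # VRAM_min, VRAM_max, ΔVRAM

    # Usage: wrap an annotation session and report min/max/delta in MiB.
    # with VramSampler() as s:
    #     run_annotation_session()   # hypothetical session entry point
    # print("VRAM_min=%.0f VRAM_max=%.0f dVRAM=%.0f MiB" % s.summary())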
Table 4

Performance metrics and resource utilization during video annotation on videos from the LVOS [17] dataset. Duration encompasses label definition, annotation, and the final data export.

VIDEO NAME | INST. (#) | FRAMES (#) | DURATION (mm:ss) | VRAM_min (MiB) | VRAM_max (MiB) | ΔVRAM (MiB)
3bvEjhOT | 2  | 461  | 7:42  | 1521 | 2713 | 1192
7K7WVzGG | 2  | 617  | 8:30  | 1523 | 2687 | 1164
cUD1dwuP | 4  | 793  | 15:49 | 2112 | 2763 | 651
EWCZAcdt | 2  | 1412 | 19:25 | 2150 | 2843 | 693
HYSm91eM | 10 | 500  | 16:56 | 1530 | 2868 | 1338
Figure 8

Examples of ground-truth inconsistencies in LVOS (top) and the consistent annotation achieved with SAMannot (bottom): (a) merging of the masks of two distinct players; (b) inclusion of unlabeled objects, which are moreover incorrectly given the same label as another instance; (c) temporal and structural inconsistency: the ball within the player’s mask is labeled inconsistently across consecutive frames (alternating between allowing and preventing overlaps).
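
One way to obtain the consistency shown in the bottom row is a fixed per-pixel rule for contested pixels, e.g., assigning each pixel to the instance with the highest mask score; the sketch below illustrates such a policy (an assumed example, not necessarily SAMannot's exact rule):

    # Sketch: resolve overlapping instance masks with a fixed per-pixel rule.
    # Each pixel goes to the instance whose mask logit is highest; pixels where
    # no instance is positive stay background (label 0). Illustrative policy only.
    import numpy as np

    def resolve_overlaps(mask_logits: np.ndarray) -> np.ndarray:
        """mask_logits: (num_instances, H, W) float scores, >0 meaning 'inside'.
        Returns an (H, W) label map with 0 = background, i+1 = instance i."""
        best = mask_logits.argmax(axis=0)            # winning instance per pixel
        foreground = (mask_logits > 0).any(axis=0)   # some instance claims the pixel
        return np.where(foreground, best + 1, 0)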

Figure 9

A system resources monitor, accessible via a pop-up from the main control window, provides real-time tracking of RAM usage, GPU utilization, and GPU VRAM occupancy.
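
A pop-up of this kind can be assembled from Tkinter's after() scheduling, psutil for RAM, and NVML (via pynvml) for GPU load and VRAM; the widget below is a hypothetical minimal version, not the monitor shipped with SAMannot:

    # Sketch: a minimal Tkinter resource monitor (hypothetical, not SAMannot's widget).
    # RAM comes from psutil; GPU utilization and VRAM come from NVML (pynvml).
    import tkinter as tk
    import psutil
    import pynvml

    pynvml.nvmlInit()
    _gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

    def open_monitor(parent, period_ms=1000):
        win = tk.Toplevel(parent)
        win.title("System resources")
        label = tk.Label(win, justify="left", font="TkFixedFont")
        label.pack(padx=10, pady=10)

        def refresh():
            ram = psutil.virtual_memory()
            util = pynvml.nvmlDeviceGetUtilizationRates(_gpu)
            mem = pynvml.nvmlDeviceGetMemoryInfo(_gpu)
            label.config(text=(
                f"RAM:  {ram.percent:5.1f} %\n"
                f"GPU:  {util.gpu:5d} %\n"
                f"VRAM: {mem.used / 2**20:7.0f} MiB"))
            win.after(period_ms, refresh)   # reschedule on the Tk event loop

        refresh()
        return win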

Figure 10

Illustration of the user guide windows (A).

Figure 11

Illustration of the user guide windows (B).

Figure 12

Qualitative examples of segmentation results on images from the DAVIS 2017 dataset [16]. The columns display the original video frame (left), the ground truth (middle), and the masks predicted by SAMannot (right).

DOI: https://doi.org/10.5334/jors.680 | Journal eISSN: 2049-9647
Language: English
Submitted on: Jan 16, 2026
Accepted on: Mar 26, 2026
Published on: Apr 20, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Gergely Dinya, András Gelencsér, Krisztina Kupán, Clemens Küpper, Kristóf Karacs, Anna Gelencsér-Horváth, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.