RESEARCH
Comparative Analysis of Segmentation Models on the Human Visual Diet Dataset
FRANCESCO PLASTINA, Harvard College ’28
DOI:
THURJ Volume 16 | Issue 1
Abstract
Teaching an algorithm to identify and separate objects within an image (segmentation) is central to AI applications, from delineating tumors in medical scans to safe robotics, augmented reality, and assistive navigation. Yet most benchmarks use web-style photos, whereas people learn from everyday, human-centric views. In this study, I ask whether training modern segmentation models on human-like images (the Human Visual Diet dataset), scenes that resemble what people routinely encounter, changes model performance and what these models attend to. I compare two conventional convolutional networks (ResNet-18 and DenseNet-121 with custom upsampling decoders) and a transformer-based SegFormer (pre-trained ViT) on their segmentation performance. To interpret model focus, I compute Gradient-weighted Class Activation Mapping (Grad-CAM) for the convolutional neural networks (CNNs) and inspect self-attention maps for the transformer (ViT). My results show that the ViT achieved the highest evaluation accuracy of ~80%, outperforming DenseNet (~74%) and ResNet (~57%). All models struggled with fine object boundaries and smaller objects, indicating room for improvement in edge localization. Grad-CAM maps for ResNet and DenseNet often appeared diffuse or highlighted background regions, while the ViT attention maps were sometimes inconsistent, suggesting that these models do not consistently provide precise object localization. While prior studies (e.g., Madan et al., 2024) suggest superior generalization for transformers, my HVD results address segmentation performance only: the transformer leads in accuracy, but all models remain below human-level segmentation, especially at fine boundaries and small objects. These findings underscore the difficulty of achieving human-level generalization in segmentation, aligning with recent work emphasizing the importance of more human-like data 'diets' and architectures to close the gap between neural networks and human vision.
