* Equal contribution.
We propose O-MaMa, a new approach that redefines cross-image segmentation by treating it as a mask matching task.
Overview of our proposed Object Mask Matching (O-MaMa). Instead of attempting the complex cross-view segmentation task, we obtain a set of mask candidates in the destination view using FastSAM. Through contrastive learning, we select the mask candidate that best matches the source mask.
Understanding the world from multiple perspectives is essential for intelligent systems operating together, where segmenting common objects across different views remains an open problem. We introduce a new approach that redefines cross-image segmentation by treating it as a mask matching task. Our method consists of: (1) a Mask-Context Encoder that pools dense DINOv2 semantic features to obtain discriminative object-level representations from FastSAM mask candidates, (2) an Ego↔Exo Cross-Attention that fuses multi-perspective observations, (3) a Mask Matching contrastive loss that aligns cross-view features in a shared latent space, and (4) a Hard Negative Adjacent Mining strategy that encourages the model to better differentiate between nearby objects. O-MaMa achieves state-of-the-art results on the Ego-Exo4D Correspondences benchmark.
O-MaMa architecture.
In the destination view, we generate a set of mask candidates using FastSAM. We extract descriptors for both the source and destination masks by pooling dense DINOv2 features, and we aggregate global cross-view context with Ego↔Exo cross-attention. We learn view-invariant features in a shared latent space via contrastive learning, and we select the most similar mask embedding to obtain the corresponding mask.
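As a rough illustration of the descriptor step, the sketch below average-pools dense DINOv2 patch features inside a FastSAM mask to obtain an L2-normalized object descriptor. The function name and tensor shapes are our own assumptions for illustration, not the released O-MaMa code.

# Sketch: pool dense patch features inside a binary mask (shapes assumed).
import torch
import torch.nn.functional as F

def mask_pooled_descriptor(patch_feats: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average-pool dense patch features inside a binary mask.

    patch_feats: (C, H_p, W_p) dense DINOv2 features for one image.
    mask:        (H, W) binary mask (e.g. a FastSAM candidate), image resolution.
    Returns an L2-normalized (C,) object descriptor.
    """
    # Downsample the mask to the patch grid and use it as pooling weights.
    m = F.interpolate(mask[None, None].float(), size=patch_feats.shape[-2:], mode="bilinear")[0, 0]
    weights = m / m.sum().clamp(min=1e-6)
    desc = (patch_feats * weights[None]).sum(dim=(1, 2))  # (C,)
    return F.normalize(desc, dim=0)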
Hard-negative mining examples. We visualize 2nd-order adjacent neighbors in both ego (left) and exo (right) scenarios.
While the object embedding contains highly discriminative object features, the context embedding incorporates surrounding information that helps localize the object in the other view. However, this surrounding context also introduces ambiguity in cluttered environments, where nearby objects share a similar context. To address this, we introduce a hard-negative mining strategy based on adjacent neighbors, encouraging the model to disambiguate between nearby but distinct objects with similar context. In the destination view, we build a graph of mask segments from the pixel centers of each mask using Delaunay triangulation and select hard negatives among the adjacent neighbors.
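A minimal sketch of this neighbor selection, assuming mask centroids given as (x, y) pixel coordinates and using scipy's Delaunay triangulation; the function name and the breadth-first expansion to 2nd-order neighbors are illustrative choices, not the paper's exact implementation.

# Sketch: adjacency-based hard-negative selection from mask centroids.
import numpy as np
from scipy.spatial import Delaunay

def adjacent_hard_negatives(centers: np.ndarray, anchor: int, order: int = 2) -> set:
    """Return indices of masks within `order` hops of `anchor` in the Delaunay graph."""
    tri = Delaunay(centers)  # centers: (N, 2) mask centroids
    # Build an undirected adjacency list from the triangulation simplices.
    adj = {i: set() for i in range(len(centers))}
    for simplex in tri.simplices:
        for i in simplex:
            for j in simplex:
                if i != j:
                    adj[i].add(j)
    # Breadth-first expansion up to `order` hops (e.g. 2nd-order neighbors).
    frontier, visited = {anchor}, {anchor}
    for _ in range(order):
        frontier = {n for node in frontier for n in adj[node]} - visited
        visited |= frontier
    return visited - {anchor}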
Ego↔Exo Cross-Attention Maps
Although the mask context embedding incorporates surrounding contextual information, it lacks a global representation across views. Therefore, we introduce an Ego↔Exo Cross-Attention mechanism, which enhances the object embedding by extracting its corresponding semantic features in the other view.
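A minimal sketch of this idea with a single PyTorch multi-head attention layer, where the pooled object embedding queries the patch tokens of the opposite view; the dimensions, residual fusion, and class name are assumptions for illustration, not the paper's exact module.

# Sketch: object embedding attends to dense features of the other view.
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, obj_emb: torch.Tensor, other_view_feats: torch.Tensor) -> torch.Tensor:
        """obj_emb: (B, 1, C) object descriptor; other_view_feats: (B, N, C) patch tokens
        of the opposite view. Returns the object embedding enriched with cross-view context."""
        ctx, _ = self.attn(query=obj_emb, key=other_view_feats, value=other_view_feats)
        return obj_emb + ctx  # residual fusion of cross-view context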
Our contrastive loss is based on InfoNCE. We select a batch of \( |\mathcal{B}| \) elements: one positive and \(|\mathcal{B}|-1\) negatives drawn from the closest neighbors around the target object in the other view. Finally, we apply the pairwise cosine similarity:
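Written out, a standard InfoNCE objective over this batch with pairwise cosine similarity reads as follows (the temperature \(\tau\) and the exact notation below are our reconstruction for reference, not copied from the paper):
\[
\mathcal{L} = -\log \frac{\exp\left(\operatorname{sim}(\mathbf{z}^{src}, \mathbf{z}^{+}) / \tau\right)}{\sum_{\mathbf{z}^{k} \in \mathcal{B}} \exp\left(\operatorname{sim}(\mathbf{z}^{src}, \mathbf{z}^{k}) / \tau\right)}, \qquad \operatorname{sim}(\mathbf{u},\mathbf{v}) = \frac{\mathbf{u}^{\top}\mathbf{v}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert},
\]
where \(\mathbf{z}^{src}\) is the source-mask embedding, \(\mathbf{z}^{+}\) its positive match in the destination view, and the sum runs over the \(|\mathcal{B}|\) candidate embeddings (one positive and \(|\mathcal{B}|-1\) hard negatives).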
We evaluate O-MaMa on the Ego-Exo4D Correspondences v2 test set, demonstrating our approach's effectiveness.
We compare O-MaMa against other segmentation models, official baselines, and k-NN, a naïve version of our approach: we extract descriptors for the generated mask candidates in the destination view and select the one most similar to the query mask in the source view.
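For reference, the selection step of this naïve baseline reduces to an argmax over cosine similarities; this small sketch (names and shapes assumed) illustrates it:

# Sketch: pick the candidate mask whose descriptor best matches the query.
import torch
import torch.nn.functional as F

def select_best_candidate(query_desc: torch.Tensor, cand_descs: torch.Tensor) -> int:
    """query_desc: (C,) source-mask descriptor; cand_descs: (M, C) destination candidates.
    Returns the index of the highest-similarity candidate mask."""
    sims = F.cosine_similarity(query_desc[None], cand_descs, dim=-1)  # (M,)
    return int(sims.argmax())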
Results on the Ego-Exo4D Correspondences v2 test split.
Even our simplest version, the k-NN baseline, already surpasses the official XMem+XSegTx baseline, achieving 31.9 IoU in Ego2Exo and 30.9 IoU in Exo2Ego. Our full method, O-MaMa, further improves performance, reaching 42.6 Ego2Exo and 44.1 Exo2Ego IoU, representing considerable relative gains of up to +22.1% and +76.4% over XMem+XSegTx.
We perform ablation studies on the contribution of each component and on different mask descriptors.
Ablation study of the proposed O-MaMa modules on 10% of the validation set.
The joint effect of all our proposed modules especially improves all the metrics, yielding a final gain of +37.2% Ego2Exo and +42.1% Exo2Ego IoU. This demonstrates that, while the k-NN baseline is agnostic to the candidate mask location (it simply selects the most similar match), our proposed integration of local and global information makes the object mask selection more sensitive to the cross-view relationship.
Qualitative Results. We show the source mask in \(\textcolor{blue}{\text{blue}}\) and the top 3 target masks in \(\textcolor{green}{\text{green}}\), \(\textcolor{yellow}{\text{yellow}}\) and \(\textcolor{orange}{\text{orange}}\).
@inproceedings{mursantos2025mama,
title={O-MaMa: Learning Object Mask Matching between Egocentric and Exocentric Views},
author={Mur-Labadia, Lorenzo and Santos-Villafranca, Maria and Bermudez-Cameo, Jesus and Perez-Yus, Alejandro and Martinez-Cantin, Ruben and Guerrero, Jose J},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
year={2025}
}