María Santos Villafranca

PhD student in Computer Vision

University of Zaragoza - I3A

Biography

Hi! I'm a PhD student at Ropert: Robotics, Computer Vision and Artificial Intelligence group at the University of Zaragoza (Unizar), Spain, supervised by Dr. Jose J. Guerrero since 2022.

My work revolves around Egocentric Vision, focusing on how it can enhance the way humans understand and interact with their surroundings. Specially, a key aspect of my work is addressing the computational demands of Multimodal models. I also have experience working with Multi-Camera Systems, including managing their inherent difficulties such as drastic viewpoint changes and camera calibration.

Download my CV

Interests

Computer Vision
Artifical Intelligence
Deep Learning
Egocentric Vision
Assistive Devices
Multimodal

Education

PhD in Computer Vision, 2022-Present

University of Zaragoza
MSc in Industrial Engineering, specialty in Industrial Automation and Robotics, 2019-2021

University of Zaragoza
BSc in Industrial Technologies Engineering, 2015-2019

University of Zaragoza

Work Experience

Visiting Researcher

Image Processing Laboratory, University of Catania

Sep 2025 – Dec 2025 Catania, Italy

Supervised by Dr. Antonino Furnari

Predoctoral Researcher and Teaching Assistant

University of Zaragoza

Nov 2022 – Present Zaragoza, Spain

Predoctoral Researcher in Computer Vision:

Visual scene interpretation for intelligent assistive devices

Teaching Assistant:

Automatic Systems
Vision and Robotics
Computer Vision
Flexible Automation and Robotics
Computer Graphics-Immersive Environments-Multimedia

Visiting Researcher

Visual Inference Lab, Technische Universität Darmstadt

Dec 2023 – March 2024 Darmstadt, Germany

Supervised by Dr. Simone Schaub-Meyer

Salesforce Developer

Omega CRM Consulting

Oct 2021 – Nov 2022 Zaragoza, Spain

Visualforce Pages and Lightning Components

Integration with other systems

Data Management

Testing and Debugging

Release Management

Certifications: App Builder

University internships

University of Zaragoza

Jan 2021 – Jul 2021 Zaragoza, Spain

Real-time simulator of prosthetic vision (SPV) that uses communication between a Windows computer and an Ubuntu computer through a TCP/IP socket. Supervised by Dr. Jesús Bermúdez Cameo and Dr. Alejandro Pérez Yus

News

(12/Jun/2025) 🏆 1st Place Winner in the Ego-Exo4D Correspondence Challenge at the Second Joint Egocentric Vision (EgoVis) Workshop @ CVPR 2025!! 🏆 Technical Report News
(20/Dec/2024) Organizing the Iberus Connect III Doctoral Student Congress (CEDIC III)
(04/Jul/2024) Attending the ICVSS 24' in Sicily
(10/Jul/2023) Attending the BMVA Computer Vision Summer School in Norwich
(10/Jul/2023) Attending the EVIA Summer School in Sevilla
(16/Jun/2022) XI Jornada de Jóvenes Investigadores Award for Best Scientific Contribution in the Information and Communication Technologies Division

Latest Publications

Lorenzo Mur-Labadia*, Maria Santos-Villafranca*, Jesus Bermudez-Cameo, Alejandro Perez-Yus, Ruben Martinez-Cantin, Jose J. Guerrero

* Equal contribution

2025 In International Conference on Computer Vision, ICCV 2025

O-MaMa: Learning Object Mask Matching between Egocentric and Exocentric Views

Understanding the world from multiple perspectives is essential for intelligent systems operating together, where segmenting common objects across different views remains an open problem. We introduce a new approach that re-defines cross-image segmentation by treating it as a mask matching task. Our method consists of: (1) A Mask-Context Encoder that pools dense DINOv2 semantic features to obtain discriminative object-level representations from FastSAM mask candidates, (2) a Ego↔Exo Cross-Attention that fuses multi-perspective observations, (3) a Mask Matching contrastive loss that aligns cross-view features in a shared latent space and, (4) a Hard Negative Adjacent Mining strategy to encourage the model to better differentiate between nearby objects. O-MaMa achieves the state of the art in the Ego-Exo4D Correspondences benchmark.

Maria Santos-Villafranca*, Dustin Carrión-Ojeda*, Alejandro Perez-Yus, Jesus Bermudez-Cameo, Jose J. Guerrero, Simone Schaub-Meyer

* Equal contribution

April, 2025 In ArXiV

Knowledge Distillation for Multimodal Egocentric Action Recognition Robust to Missing Modalities

Action recognition is an essential task in egocentric vision due to its wide range of applications across many fields. While deep learning methods have been proposed to address this task, most rely on a single modality, typically video. However, including additional modalities may improve the robustness of the approaches to common issues in egocentric videos, such as blurriness and occlusions. Recent efforts in multimodal egocentric action recognition often assume the availability of all modalities, leading to failures or performance drops when any modality is missing. To address this, we introduce an efficient multimodal knowledge distillation approach for egocentric action recognition that is robust to missing modalities (KARMMA) while still benefiting when multiple modalities are available. Our method focuses on resource-efficient development by leveraging pre-trained models as unimodal feature extractors in our teacher model, which distills knowledge into a much smaller and faster student model. Experiments on the Epic-Kitchens and Something-Something datasets demonstrate that our student model effectively handles missing modalities while reducing its accuracy drop in this scenario.

Alejandro Perez-Yus, Maria Santos-Villafranca, Julia Tomas-Barba, Jesus Bermudez-Cameo, Lorenzo Montano-Olivan, Gonzalo Lopez-Nicolas, Jose J. Guerrero

Jan, 2024 In IEEE Access

RASPV: A Robotics Framework for Augmented Simulated Prosthetic Vision

One of the main challenges of visual prostheses is to augment the perceived information to improve the experience of its wearers. Given the limited access to implanted patients, in order to facilitate the experimentation of new techniques, this is often evaluated via Simulated Prosthetic Vision (SPV) with sighted people. In this work, we introduce a novel SPV framework and implementation that presents major advantages with respect to previous approaches. First, it is integrated into a robotics framework, which allows us to benefit from a wide range of methods and algorithms from the field (e.g. object recognition, obstacle avoidance, autonomous navigation, deep learning). Second, we go beyond traditional image processing with 3D point clouds processing using an RGB-D camera, allowing us to robustly detect the floor, obstacles and the structure of the scene. Third, it works either with a real camera or in a virtual environment, which gives us endless possibilities for immersive experimentation through a head-mounted display. Fourth, we incorporate a validated temporal phosphene model that replicates time effects into the generation of visual stimuli. Finally, we have proposed, developed and tested several applications within this framework, such as avoiding moving obstacles, providing a general understanding of the scene, staircase detection, helping the subject to navigate an unfamiliar space, and object and person detection. We provide experimental results in real and virtual environments.

Maria Santos-Villafranca*, Bruno Berenguel-Baeta*, Jesus Bermudez-Cameo, Alejandro Perez-Yus, Jose J. Guerrero

* Equal contribution

November, 2023 Oral Presentation at The 34th British Machine Vision Conference (BMVC)

Convolution kernel adaptation to calibrated fisheye

Convolution kernels are the basic structural component of convolutional neural networks (CNNs). In the last years there has been a growing interest in fisheye cameras for many applications. However, the radially symmetric projection model of these cameras produces high distortions that affect the performance of CNNs, especially when the field of view is very large. In this work, we tackle this problem by proposing a method that leverages the calibration of cameras to deform the convolution kernel accordingly and adapt to the distortion. That way, the receptive field of the convolution is similar to standard convolutions in perspective images, allowing us to take advantage of pre-trained networks in large perspective datasets. We show how, with just a brief fine-tuning stage in a small dataset, we improve the performance of the network for the calibrated fisheye with respect to standard convolutions in depth estimation and semantic segmentation.

Older Publications

Maria Santos-Villafranca, Julia Tomas-Barba, Alejandro Perez-Yus, Jesus Bermudez-Cameo, Jose J. Guerrero (2023). Simulador inmersivo de vision protésica modelando estímulos espacio-temporales. In XLIV Jornadas de Automática.

PDF Cite

Maria Santos-Villafranca, Alejandro Perez-Yus, Jesus Bermudez-Cameo, Jose J. Guerrero (2022). Sistema de realidad virtual para exploración 3D con visión protésica simulada. In XI Jornada de Jóvenes Investigadores e Investigadoras del I3A. Award for Best Scientific Contribution in the Information and Communication Technologies Division

PDF Cite

Projects

Aprendizaje profundo bayesiano para interacciones dinámicas aplicadas a dispositivos asistenciales visuales.

Bayesian deep learning for dynamic interactions applied to visual assistive devices.

Omnidirectional Vision for Man-Made Scene Understanding, OVIMSU.

Contact

m.santos@unizar.es
1 Maria de Luna, Zaragoza, Aragón 50018
Ada Byron Building, first floor, lab 1.01