2025
Conference Papers
-
ActionDiffusion: An Action-aware Diffusion Model for Procedure Planning in Instructional Videos
Lei Shi, Paul-Christian Bürkner, Andreas Bulling
Proc. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025.
We present ActionDiffusion - a novel diffusion model for procedure planning in instructional videos that is the first to take temporal inter-dependencies between actions into account. Our approach is in stark contrast to existing methods that fail to exploit the rich information content available in the particular order in which actions are performed. Our method unifies the learning of temporal dependencies between actions and denoising of the action plan in the diffusion process by projecting the action information into the noise space. This is achieved 1) by adding action embeddings to the noise masks in the noise-adding phase and 2) by introducing an attention mechanism in the noise prediction network to learn the correlations between different action steps. We report extensive experiments on three instructional video benchmark datasets (CrossTask, Coin, and NIV) and show that our method outperforms previous state-of-the-art methods on all metrics on CrossTask and NIV, and on all metrics except accuracy on the Coin dataset. We show that by adding action embeddings into the noise masks, the diffusion model can better learn action temporal dependencies and improve performance on procedure planning.
@inproceedings{shi25_wacv, author = {Shi, Lei and Bürkner, Paul-Christian and Bulling, Andreas}, title = {ActionDiffusion: An Action-aware Diffusion Model for Procedure Planning in Instructional Videos}, booktitle = {Proc. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)}, year = {2025}, pages = {} }
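To make the idea of projecting action information into the noise space more concrete, the following minimal numpy sketch shows one way an action embedding could be added to the noise mask in the noise-adding phase of a diffusion model. The function and variable names (forward_diffuse_with_actions, action_emb, alpha_bar_t) and the purely additive injection are illustrative assumptions, not the authors' implementation.

import numpy as np

def forward_diffuse_with_actions(x0, action_emb, alpha_bar_t, rng):
    """Illustrative forward-diffusion step: the Gaussian noise mask is shifted
    by an action embedding so that the denoiser is trained to remove
    action-aware noise (sketch only, not the paper's code)."""
    noise = rng.standard_normal(x0.shape)
    noise_mask = noise + action_emb          # assumed additive injection of action information
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise_mask
    return x_t, noise_mask                   # a denoiser would be trained to predict noise_mask

rng = np.random.default_rng(0)
plan = rng.standard_normal((3, 16))          # toy action plan: 3 steps x 16 dimensions
action_emb = rng.standard_normal((3, 16)) * 0.1   # toy action embeddings
x_t, target = forward_diffuse_with_actions(plan, action_emb, alpha_bar_t=0.5, rng=rng)
print(x_t.shape, target.shape)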
2024
Journal Articles
-
PrivatEyes: Appearance-based Gaze Estimation Using Federated Secure Multi-Party Computation
Mayar Elfares, Pascal Reisert, Zhiming Hu, Wenwu Tang, Ralf Küsters, Andreas Bulling
Proc. ACM on Human-Computer Interaction (PACM HCI), 8 (ETRA), pp. 1–23, 2024.
Latest gaze estimation methods require large-scale training data but their collection and exchange pose significant privacy risks. We propose PrivatEyes - the first privacy-enhancing training approach for appearance-based gaze estimation based on federated learning (FL) and secure multi-party computation (MPC). PrivatEyes enables training gaze estimators on multiple local datasets across different users and server-based secure aggregation of the individual estimators’ updates. PrivatEyes guarantees that individual gaze data remains private even if a majority of the aggregating servers is malicious. We also introduce a new data leakage attack DualView that shows that PrivatEyes limits the leakage of private training data more effectively than previous approaches. Evaluations on the MPIIGaze, MPIIFaceGaze, GazeCapture, and NVGaze datasets further show that the improved privacy does not lead to a lower gaze estimation accuracy or substantially higher computational costs - both of which are on par with its non-secure counterparts.
doi: 10.1145/3655606
Paper: elfares24_etra.pdf
@article{elfares24_etra, title = {PrivatEyes: Appearance-based Gaze Estimation Using Federated Secure Multi-Party Computation}, author = {Elfares, Mayar and Reisert, Pascal and Hu, Zhiming and Tang, Wenwu and Küsters, Ralf and Bulling, Andreas}, year = {2024}, journal = {Proc. ACM on Human-Computer Interaction (PACM HCI)}, pages = {1--23}, volume = {8}, number = {ETRA}, doi = {10.1145/3655606} }
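To give an intuition for how MPC-style secure aggregation can keep individual updates private, the toy numpy sketch below additively secret-shares each client's model update across several aggregation servers: each server only receives random-looking shares, yet combining the servers' partial sums reconstructs the correct aggregate. This is a simplified illustration with invented names and real-valued shares; it is not the PrivatEyes protocol, which uses cryptographically sound sharing and additionally tolerates a malicious server majority.

import numpy as np

rng = np.random.default_rng(1)

def secret_share(update, n_servers, rng):
    """Split one client update into additive shares: all but the last share are
    random, the last share makes all shares sum to the true update.
    (Real MPC protocols share over a finite ring with uniform randomness.)"""
    shares = [rng.standard_normal(update.shape) for _ in range(n_servers - 1)]
    shares.append(update - sum(shares))
    return shares

# Toy setup: 3 clients, each with a 4-dimensional "model update", 2 servers.
client_updates = [rng.standard_normal(4) for _ in range(3)]
n_servers = 2

# Each server only ever receives one share per client, never a raw update.
server_inbox = [[] for _ in range(n_servers)]
for update in client_updates:
    for server_id, share in enumerate(secret_share(update, n_servers, rng)):
        server_inbox[server_id].append(share)

# Servers aggregate their shares locally; only the partial sums are combined.
aggregate = sum(sum(shares) for shares in server_inbox)
assert np.allclose(aggregate, sum(client_updates))
print(aggregate)

-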
Mindful Explanations: Prevalence and Impact of Mind Attribution in XAI Research
Susanne Hindennach, Lei Shi, Filip Miletic, Andreas Bulling
Proc. ACM on Human-Computer Interaction (PACM HCI), 8 (CSCW), pp. 1–42, 2024.
Best Paper Honourable Mention Award
When users perceive AI systems as mindful, independent agents, they hold them responsible instead of the AI experts who created and designed these systems. So far, it has not been studied whether explanations support this shift in responsibility through the use of mind-attributing verbs like "to think". To better understand the prevalence of mind-attributing explanations we analyse AI explanations in 3,533 explainable AI (XAI) research articles from the Semantic Scholar Open Research Corpus (S2ORC). Using methods from semantic shift detection, we identify three dominant types of mind attribution: (1) metaphorical (e.g. "to learn" or "to predict"), (2) awareness (e.g. "to consider"), and (3) agency (e.g. "to make decisions"). We then analyse the impact of mind-attributing explanations on awareness and responsibility in a vignette-based experiment with 199 participants. We find that participants who were given a mind-attributing explanation were more likely to rate the AI system as aware of the harm it caused. Moreover, the mind-attributing explanation had a responsibility-concealing effect: Considering the AI experts’ involvement led to reduced ratings of AI responsibility for participants who were given a non-mind-attributing or no explanation. In contrast, participants who read the mind-attributing explanation still held the AI system responsible despite considering the AI experts’ involvement. Taken together, our work underlines the need to carefully phrase explanations about AI systems in scientific writing to reduce mind attribution and clearly communicate human responsibility.
@article{hindennach24_pacm, title = {Mindful Explanations: Prevalence and Impact of Mind Attribution in XAI Research}, author = {Hindennach, Susanne and Shi, Lei and Miletic, Filip and Bulling, Andreas}, year = {2024}, pages = {1--42}, volume = {8}, number = {CSCW}, doi = {10.1145/3641009}, journal = {Proc. ACM on Human-Computer Interaction (PACM HCI)}, url = {https://medium.com/acm-cscw/be-mindful-when-using-mindful-descriptions-in-explanations-about-ai-bfc7666885c6} }
-
Pose2Gaze: Eye-body Coordination during Daily Activities for Gaze Prediction from Full-body Poses
Zhiming Hu, Jiahui Xu, Syn Schmitt, Andreas Bulling
IEEE Transactions on Visualization and Computer Graphics (TVCG), pp. 1–12, 2024.
Human eye gaze plays a significant role in many virtual and augmented reality (VR/AR) applications, such as gaze-contingent rendering, gaze-based interaction, or eye-based activity recognition. However, prior works on gaze analysis and prediction have only explored eye-head coordination and were limited to human-object interactions. We first report a comprehensive analysis of eye-body coordination in various human-object and human-human interaction activities based on four public datasets collected in real-world (MoGaze), VR (ADT), as well as AR (GIMO and EgoBody) environments. We show that in human-object interactions, e.g. pick and place, eye gaze exhibits strong correlations with full-body motion while in human-human interactions, e.g. chat and teach, a person’s gaze direction is correlated with the body orientation towards the interaction partner. Informed by these analyses we then present Pose2Gaze – a novel eye-body coordination model that uses a convolutional neural network and a spatio-temporal graph convolutional neural network to extract features from head direction and full-body poses, respectively, and then uses a convolutional neural network to predict eye gaze. We compare our method with state-of-the-art methods that predict eye gaze only from head movements and show that Pose2Gaze outperforms these baselines with an average improvement of 24.0% on MoGaze, 10.1% on ADT, 21.3% on GIMO, and 28.6% on EgoBody in mean angular error, respectively. We also show that our method significantly outperforms prior methods in the sample downstream task of eye-based activity recognition. These results underline the significant information content available in eye-body coordination during daily activities and open up a new direction for gaze prediction.Paper: hu24_tvcg.pdf@article{hu24_tvcg, author = {Hu, Zhiming and Xu, Jiahui and Schmitt, Syn and Bulling, Andreas}, title = {Pose2Gaze: Eye-body Coordination during Daily Activities for Gaze Prediction from Full-body Poses}, journal = {IEEE Transactions on Visualization and Computer Graphics (TVCG)}, year = {2024}, pages = {1--12} } -
HOIMotion: Forecasting Human Motion During Human-Object Interactions Using Egocentric 3D Object Bounding Boxes
Zhiming Hu, Zheming Yin, Daniel Haeufle, Syn Schmitt, Andreas Bulling
IEEE Transactions on Visualization and Computer Graphics (TVCG), pp. 1–11, 2024.
Best Journal Paper Award
We present HOIMotion – a novel approach for human motion forecasting during human-object interactions that integrates information about past body poses and egocentric 3D object bounding boxes. Human motion forecasting is important in many augmented reality applications but most existing methods have only used past body poses to predict future motion. HOIMotion first uses an encoder-residual graph convolutional network (GCN) and multi-layer perceptrons to extract features from body poses and egocentric 3D object bounding boxes, respectively. Our method then fuses pose and object features into a novel pose-object graph and uses a residual-decoder GCN to forecast future body motion. We extensively evaluate our method on the Aria digital twin (ADT) and MoGaze datasets and show that HOIMotion consistently outperforms state-of-the-art methods by a large margin of up to 8.7% on ADT and 7.2% on MoGaze in terms of mean per joint position error. Complementing these evaluations, we report a human study (N=20) that shows that the improvements achieved by our method result in forecasted poses being perceived as both more precise and more realistic than those of existing methods. Taken together, these results reveal the significant information content available in egocentric 3D object bounding boxes for human motion forecasting and the effectiveness of our method in exploiting this information.Paper: hu24_ismar.pdf@article{hu24_ismar, author = {Hu, Zhiming and Yin, Zheming and Haeufle, Daniel and Schmitt, Syn and Bulling, Andreas}, title = {HOIMotion: Forecasting Human Motion During Human-Object Interactions Using Egocentric 3D Object Bounding Boxes}, journal = {IEEE Transactions on Visualization and Computer Graphics (TVCG)}, year = {2024}, pages = {1--11} } -
Int-HRL: Towards Intention-based Hierarchical Reinforcement Learning
Anna Penzkofer, Simon Schaefer, Florian Strohm, Mihai Bâce, Stefan Leutenegger, Andreas Bulling
Neural Computing and Applications (NCAA), pp. 1–7, 2024.
While deep reinforcement learning (RL) agents outperform humans on an increasing number of tasks, training them requires data equivalent to decades of human gameplay. Recent hierarchical RL methods have increased sample efficiency by incorporating information inherent to the structure of the decision problem but at the cost of having to discover or use human-annotated sub-goals that guide the learning process. We show that intentions of human players, i.e., the precursor of goal-oriented decisions, can be robustly predicted from eye gaze even for the long-horizon sparse rewards task of Montezuma’s Revenge – one of the most challenging RL tasks in the Atari2600 game suite. We propose Int-HRL: Hierarchical RL with intention-based sub-goals that are inferred from human eye gaze. Our novel sub-goal extraction pipeline is fully automatic and replaces the need for manual sub-goal annotation by human experts. Our evaluations show that replacing hand-crafted sub-goals with automatically extracted intentions leads to an HRL agent that is significantly more sample efficient than previous methods.@article{penzkofer24_ncaa, author = {Penzkofer, Anna and Schaefer, Simon and Strohm, Florian and Bâce, Mihai and Leutenegger, Stefan and Bulling, Andreas}, title = {Int-HRL: Towards Intention-based Hierarchical Reinforcement Learning}, journal = {Neural Computing and Applications (NCAA)}, year = {2024}, pages = {1--7}, doi = {}, volume = {}, issue = {} } -
Learning User Embeddings from Human Gaze for Personalised Saliency Prediction
Florian Strohm, Mihai Bâce, Andreas Bulling
Proc. ACM on Human-Computer Interaction (PACM HCI), 8 (ETRA), pp. 1–18, 2024.
Reusable embeddings of user behaviour have shown significant performance improvements for the personalised saliency prediction task. However, prior works require explicit user characteristics and preferences as input, which are often difficult to obtain. We present a novel method to extract user embeddings from pairs of natural images and corresponding saliency maps generated from a small amount of user-specific eye tracking data. At the core of our method is a Siamese convolutional neural encoder that learns the user embeddings by contrasting the image and personal saliency map pairs of different users. Evaluations on two saliency datasets show that the generated embeddings have high discriminative power, are effective at refining universal saliency maps to the individual users, and generalise well across users and images. Finally, based on our model’s ability to encode individual user characteristics, our work points towards other applications that can benefit from reusable embeddings of gaze behaviour.
doi: 10.1145/3655603
Paper: strohm24_etra.pdf
@article{strohm24_etra, title = {Learning User Embeddings from Human Gaze for Personalised Saliency Prediction}, author = {Strohm, Florian and Bâce, Mihai and Bulling, Andreas}, year = {2024}, journal = {Proc. ACM on Human-Computer Interaction (PACM HCI)}, pages = {1--18}, volume = {8}, number = {ETRA}, doi = {10.1145/3655603} }
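The contrastive idea described above can be illustrated with a minimal numpy sketch of a standard pairwise contrastive loss that pulls same-user embeddings together and pushes different-user embeddings at least a margin apart. The margin value, the toy embeddings, and the function name are assumptions for illustration and are not taken from the paper.

import numpy as np

def contrastive_loss(emb_a, emb_b, same_user, margin=1.0):
    """Classic pairwise contrastive loss: small distance for same-user pairs,
    at least `margin` apart for different-user pairs."""
    d = np.linalg.norm(emb_a - emb_b)
    if same_user:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

rng = np.random.default_rng(2)
user1_a, user1_b = rng.standard_normal(8), rng.standard_normal(8)   # two samples of user 1
user2 = rng.standard_normal(8)                                      # one sample of user 2
print(contrastive_loss(user1_a, user1_b, same_user=True))
print(contrastive_loss(user1_a, user2, same_user=False))

-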
VisRecall++: Analysing and Predicting Visualisation Recallability from Gaze Behaviour
Yao Wang, Yue Jiang, Zhiming Hu, Constantin Ruhdorfer, Mihai Bâce, Andreas Bulling
Proc. ACM on Human-Computer Interaction (PACM HCI), 8 (ETRA), pp. 1–18, 2024.
Question answering has recently been proposed as a promising means to assess the recallability of information visualisations. However, prior works are yet to study the link between visually encoding a visualisation in memory and recall performance. To fill this gap, we propose VisRecall++ – a novel 40-participant recallability dataset that contains gaze data on 200 visualisations and five question types, such as identifying the title and finding extreme values. We measured recallability by asking participants questions after they observed the visualisation for 10 seconds. Our analyses reveal several insights; for example, saccade amplitude, number of fixations, and fixation duration differ significantly between high and low recallability groups. Finally, we propose GazeRecallNet – a novel computational method to predict recallability from gaze behaviour that outperforms several baselines on this task. Taken together, our results shed light on assessing recallability from gaze behaviour and inform future work on recallability-based visualisation optimisation.
@article{wang24_etra, title = {VisRecall++: Analysing and Predicting Visualisation Recallability from Gaze Behaviour}, author = {Wang, Yao and Jiang, Yue and Hu, Zhiming and Ruhdorfer, Constantin and Bâce, Mihai and Bulling, Andreas}, year = {2024}, journal = {Proc. ACM on Human-Computer Interaction (PACM HCI)}, pages = {1--18}, volume = {8}, number = {ETRA}, doi = {10.1145/3655613} }
-
Individual differences in visuo-spatial working memory capacity and prior knowledge during interrupted reading
Francesca Zermiani, Prajit Dhar, Florian Strohm, Sibylle Baumbach, Andreas Bulling, Maria Wirzberger
Frontiers in Cognition, 3, pp. 1–9, 2024.
Interruptions are often pervasive and require attentional shifts from the primary task. Limited data are available on the factors influencing individuals’ efficiency in resuming from interruptions during digital reading. The reported investigation, conducted using the InteRead dataset, examined whether individual differences in visuo-spatial working memory capacity (vsWMC) and prior knowledge could influence resumption lag times during interrupted reading. Participants’ vsWMC was assessed using the symmetry span (SSPAN) task, while a pre-test questionnaire targeted their background knowledge about the text. While reading an extract from a Sherlock Holmes story, they were interrupted six times and asked to answer an opinion question. Our analyses revealed that the interaction between vsWMC and prior knowledge significantly predicted the time needed to resume reading following an interruption. The results from our analyses are discussed in relation to theoretical frameworks of task resumption and current research in the field.
Paper: zermiani24_fic.pdf
@article{zermiani24_fic, title = {Individual differences in visuo-spatial working memory capacity and prior knowledge during interrupted reading}, author = {Zermiani, Francesca and Dhar, Prajit and Strohm, Florian and Baumbach, Sibylle and Bulling, Andreas and Wirzberger, Maria}, year = {2024}, doi = {10.3389/fcogn.2024.1434642}, pages = {1--9}, volume = {3}, journal = {Frontiers in Cognition} }
Conference Papers
-
DisMouse: Disentangling Information from Mouse Movement Data
Guanhua Zhang, Zhiming Hu, Andreas Bulling
Proc. ACM Symposium on User Interface Software and Technology (UIST), pp. 1–13, 2024.
Mouse movement data contain rich information about users, performed tasks, and user interfaces, but separating the respective components remains challenging and unexplored. As a first step to address this challenge, we propose DisMouse – the first method to disentangle user-specific and user-independent information and stochastic variations from mouse movement data. At the core of our method is an autoencoder trained in a semi-supervised fashion, consisting of a self-supervised denoising diffusion process and a supervised contrastive user identification module. Through evaluations on three datasets, we show that DisMouse 1) captures complementary information of mouse input, hence providing an interpretable framework for modelling mouse movements, 2) can be used to produce refined features, thus enabling various applications such as personalised and variable mouse data generation, and 3) generalises across different datasets. Taken together, our results underline the significant potential of disentangled representation learning for explainable, controllable, and generalised mouse behaviour modelling.@inproceedings{zhang24_uist, title = {DisMouse: Disentangling Information from Mouse Movement Data}, author = {Zhang, Guanhua and Hu, Zhiming and Bulling, Andreas}, year = {2024}, pages = {1--13}, booktitle = {Proc. ACM Symposium on User Interface Software and Technology (UIST)}, doi = {https://doi.org/10.1145/3654777.3676411} } -
SalChartQA: Question-driven Saliency on Information Visualisations
Yao Wang, Weitian Wang, Abdullah Abdelhafez, Mayar Elfares, Zhiming Hu, Mihai Bâce, Andreas Bulling
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–14, 2024.
Understanding the link between visual attention and users’ needs when visually exploring information visualisations is under-explored due to a lack of large and diverse datasets to facilitate these analyses. To fill this gap, we introduce SalChartQA – a novel crowd-sourced dataset that uses the BubbleView interface as a proxy for human gaze and a question-answering (QA) paradigm to induce different information needs in users. SalChartQA contains 74,340 answers to 6,000 questions on 3,000 visualisations. Informed by our analyses demonstrating the tight correlation between the question and visual saliency, we propose the first computational method to predict question-driven saliency on information visualisations. Our method outperforms state-of-the-art saliency models, improving several metrics, such as the correlation coefficient and the Kullback-Leibler divergence. These results show the importance of information needs in shaping attention behaviour and pave the way for new applications, such as task-driven optimisation of visualisations or explainable AI in chart question-answering.
Paper: wang24_chi.pdf
Supplementary Material: wang24_chi_sup.pdf
@inproceedings{wang24_chi, title = {SalChartQA: Question-driven Saliency on Information Visualisations}, author = {Wang, Yao and Wang, Weitian and Abdelhafez, Abdullah and Elfares, Mayar and Hu, Zhiming and B{\^a}ce, Mihai and Bulling, Andreas}, year = {2024}, pages = {1--14}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, doi = {10.1145/3613904.3642942} }
-
Mouse2Vec: Learning Reusable Semantic Representations of Mouse Behaviour
Guanhua Zhang, Zhiming Hu, Mihai Bâce, Andreas Bulling
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–17, 2024.
The mouse is a pervasive input device used for a wide range of interactive applications. However, computational modelling of mouse behaviour typically requires time-consuming design and extraction of handcrafted features, or approaches that are application-specific. We instead propose Mouse2Vec – a novel self-supervised method designed to learn semantic representations of mouse behaviour that are reusable across users and applications. Mouse2Vec uses a Transformer-based encoder-decoder architecture, which is specifically geared for mouse data: During pretraining, the encoder learns an embedding of input mouse trajectories while the decoder reconstructs the input and simultaneously detects mouse click events. We show that the representations learned by our method can identify interpretable mouse behaviour clusters and retrieve similar mouse trajectories. We also demonstrate on three sample downstream tasks that the representations can be practically used to augment mouse data for training supervised methods and serve as an effective feature extractor.@inproceedings{zhang24_chi, title = {Mouse2Vec: Learning Reusable Semantic Representations of Mouse Behaviour}, author = {Zhang, Guanhua and Hu, Zhiming and B{\^a}ce, Mihai and Bulling, Andreas}, year = {2024}, pages = {1--17}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, doi = {10.1145/3613904.3642141} } -
Multi-Modal Video Dialog State Tracking in the Wild
Adnen Abdessaied, Lei Shi, Andreas Bulling
Proc. 18th European Conference on Computer Vision (ECCV), pp. 1–25, 2024.
We present MST-MIXER – a novel video dialog model operating over a generic multi-modal state tracking scheme. Current models that claim to perform multi-modal state tracking fall short in two major respects: (1) they either track only one modality (mostly the visual input) or (2) they target synthetic datasets that do not reflect the complexity of real-world, in-the-wild scenarios. Our model addresses these two limitations in an attempt to close this crucial research gap. Specifically, MST-MIXER first tracks the most important constituents of each input modality. Then, it predicts the missing underlying structure of the selected constituents of each modality by learning local latent graphs using a novel multi-modal graph structure learning method. Subsequently, the learned local graphs and features are parsed together to form a global graph operating on the mix of all modalities, which further refines its structure and node embeddings. Finally, the fine-grained graph node features are used to enhance the hidden states of the backbone Vision-Language Model (VLM). MST-MIXER achieves new state-of-the-art results on five challenging benchmarks.
@inproceedings{abdessaied24_eccv, author = {Abdessaied, Adnen and Shi, Lei and Bulling, Andreas}, title = {Multi-Modal Video Dialog State Tracking in the Wild}, booktitle = {Proc. 18th European Conference on Computer Vision (ECCV)}, year = {2024}, pages = {1--25} }
-
VD-GR: Boosting Visual Dialog with Cascaded Spatial-Temporal Multi-Modal GRaphs
Adnen Abdessaied, Lei Shi, Andreas Bulling
Proc. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 5805–5814, 2024.
We propose VD-GR – a novel visual dialog model that combines pre-trained language models (LMs) with graph neural networks (GNNs). Prior works mainly focused on one class of models at the expense of the other, thus missing out on the opportunity of combining their respective benefits. At the core of VD-GR is a novel integration mechanism that alternates between spatial-temporal multi-modal GNNs and BERT layers, and that covers three distinct contributions: First, we use multi-modal GNNs to process the features of each modality (image, question, and dialog history) and exploit their local structures before performing BERT global attention. Second, we propose hub-nodes that link to all other nodes within one modality graph, allowing the model to propagate information from one GNN (modality) to the other in a cascaded manner. Third, we augment the BERT hidden states with fine-grained multi-modal GNN features before passing them to the next VD-GR layer. Evaluations on VisDial v1.0, VisDial v0.9, VisDialConv, and VisPro show that VD-GR achieves new state-of-the-art results across all four datasets.
@inproceedings{abdessaied24_wacv, author = {Abdessaied, Adnen and Shi, Lei and Bulling, Andreas}, title = {VD-GR: Boosting Visual Dialog with Cascaded Spatial-Temporal Multi-Modal GRaphs}, booktitle = {Proc. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)}, year = {2024}, pages = {5805--5814} }
-
OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog
Adnen Abdessaied, Manuel Hochmeister, Andreas Bulling
Proc. 31st Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING), pp. 1–11, 2024.
We present the Object Language Video Transformer (OLViT) – a novel model for video dialog operating over a multi-modal attention-based dialog state tracker. Existing video dialog models struggle with questions requiring both spatial and temporal localization within videos, long-term temporal reasoning, and accurate object tracking across multiple dialog turns. OLViT addresses these challenges by maintaining a global dialog state based on the output of an Object State Tracker (OST) and a Language State Tracker (LST): while the OST attends to the most important objects within the video, the LST keeps track of the most important linguistic co-references to previous dialog turns. In stark contrast to previous works, our approach is generic by nature and is therefore capable of learning continuous multi-modal dialog state representations of the most relevant objects and rounds. As a result, they can be seamlessly integrated into Large Language Models (LLMs) and offer high flexibility in dealing with different datasets and tasks. Evaluations on the challenging DVD (response classification) and SIMMC 2.1 (response generation) datasets show that OLViT achieves new state-of-the-art performance across both datasets.@inproceedings{abdessaied24_coling, author = {Abdessaied, Adnen and von Hochmeister, Manuel and Bulling, Andreas}, title = {OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog}, booktitle = {Proc. 31st Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)}, year = {2024}, pages = {1--11} } -
EyeSeeIdentity: Exploring Natural Gaze Behaviour for Implicit User Identification during Photo Viewing
Yasmeen Abdrabou, Mariam Hassib, Shuqin Hu, Ken Pfeuffer, Mohamed Khamis, Andreas Bulling, Florian Alt
Proc. Symposium on Usable Security and Privacy (USEC), pp. 1–12, 2024.
Existing gaze-based methods for user identification either require special-purpose visual stimuli or artificial gaze behaviour. Here, we explore how users can be differentiated by analysing natural gaze behaviour while freely looking at images. Our approach is based on the observation that looking at different images, for example, a picture from your last holiday, induces stronger emotional responses that are reflected in gaze behaviour and, hence, are unique to the person having experienced that situation. We collected gaze data in a remote study (N = 39) where participants looked at three image categories: personal images, other people’s images, and random images from the Internet. We demonstrate the potential of identifying different people using machine learning with an accuracy of 85%. The results pave the way towards a new class of authentication methods solely based on natural human gaze behaviour.
Paper: abdrabou24_usec.pdf
@inproceedings{abdrabou24_usec, author = {Abdrabou, Yasmeen and Hassib, Mariam and Hu, Shuqin and Pfeuffer, Ken and Khamis, Mohamed and Bulling, Andreas and Alt, Florian}, title = {EyeSeeIdentity: Exploring Natural Gaze Behaviour for Implicit User Identification during Photo Viewing}, booktitle = {Proc. Symposium on Usable Security and Privacy (USEC)}, year = {2024}, pages = {1--12} }
-
Neural Reasoning About Agents’ Goals, Preferences, and Actions
Matteo Bortoletto, Lei Shi, Andreas Bulling
Proc. 38th AAAI Conference on Artificial Intelligence (AAAI), pp. 456–464, 2024.
We propose the Intuitive Reasoning Network (IRENE) – a novel neural model for intuitive psychological reasoning about agents’ goals, preferences, and actions that can generalise previous experiences to new situations. IRENE combines a graph neural network for learning agent and world state representations with a transformer to encode the task context. When evaluated on the challenging Baby Intuitions Benchmark, IRENE achieves new state-of-the-art performance on three out of its five tasks – with up to 48.9 % improvement. In contrast to existing methods, IRENE is able to bind preferences to specific agents, to better distinguish between rational and irrational agents, and to better understand the role of blocking obstacles. We also investigate, for the first time, the influence of the training tasks on test performance. Our analyses demonstrate the effectiveness of IRENE in combining prior knowledge gained during training for unseen evaluation tasks.@inproceedings{bortoletto24_aaai, author = {Bortoletto, Matteo and Shi, Lei and Bulling, Andreas}, title = {Neural Reasoning About Agents’ Goals, Preferences, and Actions}, booktitle = {Proc. 38th AAAI Conference on Artificial Intelligence (AAAI)}, year = {2024}, volume = {38}, number = {1}, pages = {456--464}, doi = {10.1609/aaai.v38i1.27800} } -
Limits of Theory of Mind Modelling in Dialogue-Based Collaborative Plan Acquisition
Matteo Bortoletto, Constantin Ruhdorfer, Adnen Abdessaied, Lei Shi, Andreas Bulling
Proc. 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1–16, 2024.
Recent work on dialogue-based collaborative plan acquisition (CPA) has suggested that Theory of Mind (ToM) modelling can improve missing knowledge prediction in settings with asymmetric skill-sets and knowledge. Although ToM was claimed to be important for effective collaboration, its real impact on this novel task remains under-explored. By representing plans as graphs and by exploiting task-specific constraints we show that, as performance on CPA nearly doubles when predicting one’s own missing knowledge, the improvements due to ToM modelling diminish. This phenomenon persists even when evaluating existing baseline methods. To better understand the relevance of ToM for CPA, we report a principled performance comparison of models with and without ToM features. Results across different models and ablations consistently suggest that learned ToM features are indeed more likely to reflect latent patterns in the data with no perceivable link to ToM. This finding calls for a deeper understanding of the role of ToM in CPA and beyond, as well as new methods for modelling and evaluating mental states in computational collaborative agents.@inproceedings{bortoletto24_acl, author = {Bortoletto, Matteo and Ruhdorfer, Constantin and Abdessaied, Adnen and Shi, Lei and Bulling, Andreas}, title = {Limits of Theory of Mind Modelling in Dialogue-Based Collaborative Plan Acquisition}, booktitle = {Proc. 62nd Annual Meeting of the Association for Computational Linguistics (ACL)}, year = {2024}, pages = {1--16}, doi = {} } -
Benchmarking Mental State Representations in Language Models
Matteo Bortoletto, Constantin Ruhdorfer, Lei Shi, Andreas Bulling
Proc. ICML 2024 Workshop on Mechanistic Interpretability, pp. 1–21, 2024.
While numerous works have assessed the generative performance of language models (LMs) on tasks requiring Theory of Mind reasoning, research into the models’ internal representation of mental states remains limited. Recent work has used probing to demonstrate that LMs can represent beliefs of themselves and others. However, these claims are accompanied by limited evaluation, making it difficult to assess how mental state representations are affected by model design and training choices. We report an extensive benchmark with various LM types with different model sizes, fine-tuning approaches, and prompt designs to study the robustness of mental state representations and memorisation issues within the probes. Our results show that the quality of models’ internal representations of the beliefs of others increases with model size and, more crucially, with fine-tuning. We are the first to study how prompt variations impact probing performance on theory of mind tasks. We demonstrate that models’ representations are sensitive to prompt variations, even when such variations should be beneficial. Finally, we complement previous activation editing experiments on Theory of Mind tasks and show that it is possible to improve models’ reasoning performance by steering their activations without the need to train any probe.@inproceedings{bortoletto24_icmlw, author = {Bortoletto, Matteo and Ruhdorfer, Constantin and Shi, Lei and Bulling, Andreas}, title = {Benchmarking Mental State Representations in Language Models}, booktitle = {Proc. ICML 2024 Workshop on Mechanistic Interpretability}, year = {2024}, pages = {1--21}, doi = {}, url = {https://openreview.net/forum?id=yEwEVoH9Be} } -
Explicit Modelling of Theory of Mind for Belief Prediction in Nonverbal Social Interactions
Matteo Bortoletto, Constantin Ruhdorfer, Lei Shi, Andreas Bulling
Proc. 27th European Conference on Artificial Intelligence (ECAI), pp. 866–873, 2024.
We propose MToMnet - a Theory of Mind (ToM) neural network for predicting beliefs and their dynamics during human social interactions from multimodal input. ToM is key for effective nonverbal human communication and collaboration, yet existing methods for belief modelling have not included explicit ToM modelling or have typically been limited to one or two modalities. MToMnet encodes contextual cues (scene videos and object locations) and integrates them with person-specific cues (human gaze and body language) in a separate MindNet for each person. Inspired by prior research on social cognition and computational ToM, we propose three different MToMnet variants: two involving fusion of latent representations and one involving re-ranking of classification scores. We evaluate our approach on two challenging real-world datasets, one focusing on belief prediction and the other on belief dynamics prediction. Our results demonstrate that MToMnet surpasses existing methods by a large margin while at the same time requiring a significantly smaller number of parameters. Taken together, our method opens up a highly promising direction for future work on artificially intelligent systems that can robustly predict human beliefs from their non-verbal behaviour and, as such, more effectively collaborate with humans.
doi: 10.3233/FAIA240573
Paper: bortoletto24_ecai.pdf
@inproceedings{bortoletto24_ecai, author = {Bortoletto, Matteo and Ruhdorfer, Constantin and Shi, Lei and Bulling, Andreas}, title = {Explicit Modelling of Theory of Mind for Belief Prediction in Nonverbal Social Interactions}, booktitle = {Proc. 27th European Conference on Artificial Intelligence (ECAI)}, year = {2024}, pages = {866--873}, doi = {10.3233/FAIA240573} }
-
Generating Realistic Arm Movements in Reinforcement Learning: A Quantitative Comparison of Reward Terms and Task Requirements
Jhon Paul Feliciano Charaja Casas, Isabell Wochner, Pierre Schumacher, Winfried Ilg, Martin Giese, Christophe Maufroy, Andreas Bulling, Syn Schmitt, Daniel F.B. Haeufle
Proc. 10th IEEE RAS/EMBS International Conference on Biomedical Robotics and Biomechatronics, pp. 1–6, 2024.
The mimicking of human-like arm movement characteristics involves the consideration of three factors during control policy synthesis: (a) chosen task requirements, (b) inclusion of noise during movement execution and (c) chosen optimality principles. Previous studies showed that when considering these factors (a-c) individually, it is possible to synthesize arm movements that either kinematically match the experimental data or reproduce the stereotypical triphasic muscle activation pattern. However, to date no quantitative comparison has been made on how realistic the arm movement generated by each factor is; as well as whether a partial or total combination of all factors results in arm movements with human-like kinematic characteristics and a triphasic muscle pattern. To investigate this, we used reinforcement learning to learn a control policy for a musculoskeletal arm model, aiming to discern which combination of factors (a-c) results in realistic arm movements according to four frequently reported stereotypical characteristics. Our findings indicate that incorporating velocity and acceleration requirements into the reaching task, employing reward terms that encourage minimization of mechanical work, hand jerk, and control effort, along with the inclusion of noise during movement, leads to the emergence of realistic human arm movements in reinforcement learning. We expect that the gained insights will help in the future to better predict desired arm movements and corrective forces in wearable assistive devices.Paper: casas24_biorob.pdf@inproceedings{casas24_biorob, title = {Generating Realistic Arm Movements in Reinforcement Learning: A Quantitative Comparison of Reward Terms and Task Requirements}, author = {Casas, Jhon Paul Feliciano Charaja and Wochner, Isabell and Schumacher, Pierre and Ilg, Winfried and Giese, Martin and Maufroy, Christophe and Bulling, Andreas and Schmitt, Syn and Haeufle, Daniel F.B.}, year = {2024}, booktitle = {Proc. 10th IEEE RAS/EMBS International Conference on Biomedical Robotics and Biomechatronics}, pages = {1--6} } -
Explaining Disagreement in Visual Question Answering Using Eye Tracking
Susanne Hindennach, Lei Shi, Andreas Bulling
Proc. International Workshop on Pervasive Eye Tracking and Mobile Gaze-Based Interaction (PETMEI), pp. 1–7, 2024.
When presented with the same question about an image, human annotators often give valid but disagreeing answers indicating that their reasoning was different. Such differences are lost in a single ground truth label used to train and evaluate visual question answering (VQA) methods. In this work, we explore whether visual attention maps, created using stationary eye tracking, provide insight into the reasoning underlying disagreement in VQA. We first manually inspect attention maps in the recent VQA-MHUG dataset and find cases in which attention differs consistently for disagreeing answers. We further evaluate the suitability of four different similarity metrics to detect attention differences matching the disagreement. We show that attention maps plausibly surface differences in reasoning underlying one type of disagreement, and that the metrics complementarily detect them. Taken together, our results represent an important first step to leverage eye-tracking to explain disagreement in VQA.
Paper: hindennach24_petmei.pdf
@inproceedings{hindennach24_petmei, title = {Explaining Disagreement in Visual Question Answering Using Eye Tracking}, author = {Hindennach, Susanne and Shi, Lei and Bulling, Andreas}, year = {2024}, pages = {1--7}, doi = {10.1145/3649902.3656356}, booktitle = {Proc. International Workshop on Pervasive Eye Tracking and Mobile Gaze-Based Interaction (PETMEI)} }
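As an illustration of the kind of similarity metrics used to compare attention maps, the numpy sketch below computes two widely used ones, the Pearson correlation coefficient and the Kullback-Leibler divergence, on toy maps. The paper evaluates four metrics, and its exact choices and preprocessing may differ from this sketch.

import numpy as np

def normalise(att_map, eps=1e-8):
    """Turn a non-negative attention map into a probability distribution."""
    att_map = att_map.astype(float) + eps
    return att_map / att_map.sum()

def kl_divergence(p, q):
    """KL divergence between two attention maps treated as distributions."""
    p, q = normalise(p), normalise(q)
    return float(np.sum(p * np.log(p / q)))

def pearson_cc(p, q):
    """Pearson correlation between two attention maps."""
    return float(np.corrcoef(p.ravel(), q.ravel())[0, 1])

rng = np.random.default_rng(3)
map_a = rng.random((32, 32))   # toy attention map of annotator A
map_b = rng.random((32, 32))   # toy attention map of annotator B
print(kl_divergence(map_a, map_b), pearson_cc(map_a, map_b))

-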
GazeMotion: Gaze-guided Human Motion Forecasting
Zhiming Hu, Syn Schmitt, Daniel Häufle, Andreas Bulling
Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–6, 2024.
Oral Presentation
We present GazeMotion, a novel method for human motion forecasting that combines information on past human poses with human eye gaze. Inspired by evidence from behavioural sciences showing that human eye and body movements are closely coordinated, GazeMotion first predicts future eye gaze from past gaze, then fuses predicted future gaze and past poses into a gaze-pose graph, and finally uses a residual graph convolutional network to forecast body motion. We extensively evaluate our method on the MoGaze, ADT, and GIMO benchmark datasets and show that it outperforms state-of-the-art methods by up to 7.4% improvement in mean per joint position error. Using head direction as a proxy to gaze, our method still achieves an average improvement of 5.5%. We finally report an online user study showing that our method also outperforms prior methods in terms of perceived realism. These results show the significant information content available in eye gaze for human motion forecasting as well as the effectiveness of our method in exploiting this information.@inproceedings{hu24_iros, author = {Hu, Zhiming and Schmitt, Syn and Häufle, Daniel and Bulling, Andreas}, title = {GazeMotion: Gaze-guided Human Motion Forecasting}, year = {2024}, pages = {1--6}, booktitle = {Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, video = {https://youtu.be/I-ecIvRqOCY?si=kK8SE0r-JadwOKLt} } -
MultiMediate’24: Multi-Domain Engagement Estimation
Philipp Müller, Michal Balazia, Tobias Baur, Michael Dietz, Alexander Heimerl, Anna Penzkofer, Dominik Schiller, François Brémond, Jan Alexandersson, Elisabeth André, Andreas Bulling
Proceedings of the 32nd ACM International Conference on Multimedia, pp. 11377–11382, 2024.
Estimating the momentary level of a participant’s engagement is an important prerequisite for assistive systems that support human interactions. Previous work has addressed this task in within-domain evaluation scenarios, i.e. training and testing on the same dataset. This is in contrast to real-life scenarios where domain shifts between training and testing data frequently occur. With MultiMediate’24, we present the first challenge addressing multi-domain engagement estimation. As training data, we utilise the NOXI database of dyadic novice-expert interactions. In addition to within-domain test data, we add two new test domains. First, we introduce recordings following the NOXI protocol but covering languages that are not present in the NOXI training data. Second, we collected novel engagement annotations on the MPIIGroupInteraction dataset, which consists of group discussions among three to four people. In this way, MultiMediate’24 evaluates the ability of approaches to generalise across factors such as language and cultural background, group size, task, and screen-mediated vs. face-to-face interaction. This paper describes the MultiMediate’24 challenge and presents baseline results. In addition, we discuss selected challenge solutions.
Paper: mueller24_mm.pdf
@inproceedings{mueller24_mm, author = {M{\"{u}}ller, Philipp and Balazia, Michal and Baur, Tobias and Dietz, Michael and Heimerl, Alexander and Penzkofer, Anna and Schiller, Dominik and Brémond, François and Alexandersson, Jan and André, Elisabeth and Bulling, Andreas}, title = {MultiMediate'24: Multi-Domain Engagement Estimation}, year = {2024}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, doi = {10.1145/3664647.3689004}, booktitle = {Proceedings of the 32nd ACM International Conference on Multimedia}, pages = {11377--11382} }
-
VSA4VQA: Scaling A Vector Symbolic Architecture To Visual Question Answering on Natural Images
Anna Penzkofer, Lei Shi, Andreas Bulling
Proc. 46th Annual Meeting of the Cognitive Science Society (CogSci), 2024.
Oral Presentation
While Vector Symbolic Architectures (VSAs) are promising for modelling spatial cognition, their application is currently limited to artificially generated images and simple spatial queries. We propose VSA4VQA – a novel 4D implementation of VSAs that implements a mental representation of natural images for the challenging task of Visual Question Answering (VQA). VSA4VQA is the first model to scale a VSA to complex spatial queries. Our method is based on the Semantic Pointer Architecture (SPA) to encode objects in a hyper-dimensional vector space. To encode natural images, we extend the SPA to include dimensions for object’s width and height in addition to their spatial location. To perform spatial queries we further introduce learned spatial query masks and integrate a pre-trained vision-language model for answering attribute-related questions. We evaluate our method on the GQA benchmark dataset and show that it can effectively encode natural images, achieving competitive performance to state-of-the-art deep learning methods for zero-shot VQA.@inproceedings{penzkofer24_cogsci, author = {Penzkofer, Anna and Shi, Lei and Bulling, Andreas}, title = {{VSA4VQA}: {Scaling} {A} {Vector} {Symbolic} {Architecture} {To} {Visual} {Question} {Answering} on {Natural} {Images}}, booktitle = {Proc. 46th Annual Meeting of the Cognitive Science Society (CogSci)}, year = {2024}, volume = {46}, url = {https://escholarship.org/uc/item/26j7v1nf.} } -
Quantifying Human Upper Limb Stiffness Responses Based on a Computationally Efficient Neuromusculoskeletal Arm Model
Maria Sapounaki, Pierre Schumacher, Winfried Ilg, Martin Giese, Christophe Maufroy, Andreas Bulling, Syn Schmitt, Daniel F.B. Haeufle, Isabell Wochner
Proc. 10th IEEE RAS/EMBS International Conference on Biomedical Robotics and Biomechatronics, pp. 1–6, 2024.
Oral Presentation
The mimicking of human-like arm movement characteristics involves the consideration of three factors during control policy synthesis: (a) chosen task requirements, (b) inclusion of noise during movement execution and (c) chosen optimality principles. Previous studies showed that when considering these factors (a-c) individually, it is possible to synthesize arm movements that either kinematically match the experimental data or reproduce the stereotypical triphasic muscle activation pattern. However, to date no quantitative comparison has been made on how realistic the arm movement generated by each factor is; as well as whether a partial or total combination of all factors results in arm movements with human-like kinematic characteristics and a triphasic muscle pattern. To investigate this, we used reinforcement learning to learn a control policy for a musculoskeletal arm model, aiming to discern which combination of factors (a-c) results in realistic arm movements according to four frequently reported stereotypical characteristics. Our findings indicate that incorporating velocity and acceleration requirements into the reaching task, employing reward terms that encourage minimization of mechanical work, hand jerk, and control effort, along with the inclusion of noise during movement, leads to the emergence of realistic human arm movements in reinforcement learning. We expect that the gained insights will help in the future to better predict desired arm movements and corrective forces in wearable assistive devices.Paper: sapounaki24_biorob.pdf@inproceedings{sapounaki24_biorob, title = {Quantifying Human Upper Limb Stiffness Responses Based on a Computationally Efficient Neuromusculoskeletal Arm Model}, author = {Sapounaki, Maria and Schumacher, Pierre and Ilg, Winfried and Giese, Martin and Maufroy, Christophe and Bulling, Andreas and Schmitt, Syn and Haeufle, Daniel F.B. and Wochner, Isabell}, year = {2024}, booktitle = {Proc. 10th IEEE RAS/EMBS International Conference on Biomedical Robotics and Biomechatronics}, pages = {1--6} } -
Inferring Human Intentions from Predicted Action Probabilities
Lei Shi, Paul-Christian Bürkner, Andreas Bulling
Proc. Workshop on Theory of Mind in Human-AI Interaction at CHI 2024, pp. 1–7, 2024.
Inferring human intentions is a core challenge in human-AI collaboration, but while Bayesian methods struggle with complex visual input, deep neural network (DNN) based methods do not provide uncertainty quantification. In this work we combine both approaches for the first time and show that the predicted next action probabilities contain information that can be used to infer the underlying user intention. We propose a two-step approach to human intention prediction: While a DNN predicts the probabilities of the next action, MCMC-based Bayesian inference is used to infer the underlying intention from these predictions. This approach not only allows for the independent design of the DNN architecture but also enables subsequent fast, design-independent inference of human intentions. We evaluate our method using a series of experiments on the Watch-And-Help (WAH) and a keyboard and mouse interaction dataset. Our results show that our approach can accurately predict human intentions from observed actions and the implicit information contained in next action probabilities. Furthermore, we show that our approach can predict the correct intention even if only a few actions have been observed.
Paper: shi24_chiw.pdf
@inproceedings{shi24_chiw, author = {Shi, Lei and Bürkner, Paul-Christian and Bulling, Andreas}, title = {Inferring Human Intentions from Predicted Action Probabilities}, booktitle = {Proc. Workshop on Theory of Mind in Human-AI Interaction at CHI 2024}, year = {2024}, pages = {1--7}, doi = {} }
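The two-step idea sketched in the abstract, a DNN outputs next-action probabilities and Bayesian inference then recovers the intention, can be illustrated with the toy numpy example below. It replaces MCMC with exact enumeration over a hand-made likelihood table, so all numbers and names are invented for illustration and do not come from the paper.

import numpy as np

# Toy model: 2 candidate intentions, 3 possible next actions.
# p(action | intention), one row per intention -- invented numbers.
likelihood = np.array([[0.7, 0.2, 0.1],    # intention 0, e.g. "make coffee"
                       [0.1, 0.3, 0.6]])   # intention 1, e.g. "wash dishes"
prior = np.array([0.5, 0.5])               # uniform prior over intentions

def posterior_over_intentions(predicted_action_probs, prior, likelihood):
    """Exact Bayesian update: the DNN's next-action distribution is treated as
    a soft observation and marginalised over actions (expected likelihood)."""
    evidence_per_intention = likelihood @ predicted_action_probs
    unnormalised = prior * evidence_per_intention
    return unnormalised / unnormalised.sum()

dnn_output = np.array([0.6, 0.25, 0.15])   # e.g. softmax output of an action predictor
print(posterior_over_intentions(dnn_output, prior, likelihood))

-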
Saliency3D: a 3D Saliency Dataset Collected on Screen
Yao Wang, Qi Dai, Mihai Bâce, Karsten Klein, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–9, 2024.
While visual saliency has recently been studied in 3D, the experimental setup for collecting 3D saliency data can be expensive and cumbersome. To address this challenge, we propose a novel experimental design that utilizes an eye tracker on a screen to collect 3D saliency data. Our experimental design reduces the cost and complexity of 3D saliency dataset collection. We first collect gaze data on a screen, then we map them to 3D saliency data through perspective transformation. Using this method, we collect a 3D saliency dataset (49,276 fixations) comprising 10 participants looking at sixteen objects. Moreover, we examine the viewing preferences for objects and discuss our findings in this study. Our results indicate potential preferred viewing directions and a correlation between salient features and the variation in viewing directions.
Paper: wang24_etras.pdf
@inproceedings{wang24_etras, title = {Saliency3D: a 3D Saliency Dataset Collected on Screen}, author = {Wang, Yao and Dai, Qi and B{\^a}ce, Mihai and Klein, Karsten and Bulling, Andreas}, year = {2024}, pages = {1--9}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, doi = {10.1145/3649902.3653350} }
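As a rough illustration of lifting on-screen gaze into 3D, the numpy sketch below unprojects a 2D fixation through an assumed pinhole camera into a viewing ray in camera coordinates. The dataset's actual perspective transformation onto the displayed objects is more involved, and the intrinsics used here are invented.

import numpy as np

def fixation_to_ray(fix_px, K):
    """Unproject a 2D fixation (pixel coordinates) into a unit viewing ray in
    camera coordinates, assuming a pinhole camera with intrinsics K."""
    u, v = fix_px
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return ray / np.linalg.norm(ray)

# Assumed intrinsics of the rendering camera: focal length 1000 px, principal point (960, 540).
K = np.array([[1000.0,    0.0, 960.0],
              [   0.0, 1000.0, 540.0],
              [   0.0,    0.0,   1.0]])
print(fixation_to_ray((1100.0, 600.0), K))

-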
GazeMoDiff: Gaze-guided Diffusion Model for Stochastic Human Motion Prediction
Haodong Yan, Zhiming Hu, Syn Schmitt, Andreas Bulling
Proc. 32nd Pacific Conference on Computer Graphics and Application (PG), pp. 1–10, 2024.
Human motion prediction is important for many virtual and augmented reality (VR/AR) applications such as collision avoidance and realistic avatar generation. Existing methods have synthesised body motion only from observed past motion, despite the fact that human eye gaze is known to correlate strongly with body movements and is readily available in recent VR/AR headsets. We present GazeMoDiff – a novel gaze-guided denoising diffusion model to generate stochastic human motions. Our method first uses a gaze encoder and a motion encoder to extract the gaze and motion features respectively, then employs a graph attention network to fuse these features, and finally injects the gaze-motion features into a noise prediction network via a cross-attention mechanism to progressively generate multiple reasonable human motions in the future. Extensive experiments on the MoGaze and GIMO datasets demonstrate that our method outperforms the state-of-the-art methods by a large margin in terms of multi-modal final displacement error (17.3% on MoGaze and 13.3% on GIMO). We further conducted a human study (N=21) and validated that the motions generated by our method were perceived as both more precise and more realistic than those of prior methods. Taken together, these results reveal the significant information content available in eye gaze for stochastic human motion prediction as well as the effectiveness of our method in exploiting this information.Paper: yan24_pg.pdf@inproceedings{yan24_pg, title = {GazeMoDiff: Gaze-guided Diffusion Model for Stochastic Human Motion Prediction}, author = {Yan, Haodong and Hu, Zhiming and Schmitt, Syn and Bulling, Andreas}, year = {2024}, doi = {}, pages = {1--10}, booktitle = {Proc. 32nd Pacific Conference on Computer Graphics and Application (PG)} } -
InteRead: An Eye Tracking Dataset of Interrupted Reading
Francesca Zermiani, Prajit Dhar, Ekta Sood, Fabian Kögel, Andreas Bulling, Maria Wirzberger
Proc. 31st Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING), pp. 9154–9169, 2024.
Eye movements during reading offer a window into cognitive processes and language comprehension, but the scarcity of reading data with interruptions – which learners frequently encounter in their everyday learning environments – hampers advances in the development of intelligent learning technologies. We introduce InteRead – a novel 50-participant dataset of gaze data recorded during self-paced reading of real-world text. InteRead further offers fine-grained annotations of interruptions interspersed throughout the text as well as resumption lags incurred by these interruptions. Interruptions were triggered automatically once readers reached predefined target words. We validate our dataset by reporting interdisciplinary analyses on different measures of gaze behavior. In line with prior research, our analyses show that the interruptions as well as word length and word frequency effects significantly impact eye movements during reading. We also explore individual differences within our dataset, shedding light on the potential for tailored educational solutions. InteRead is accessible from our datasets web-page: https://www.ife.uni-stuttgart.de/en/llis/research/datasets/.@inproceedings{zermiani24_coling, title = {InteRead: An Eye Tracking Dataset of Interrupted Reading}, author = {Zermiani, Francesca and Dhar, Prajit and Sood, Ekta and Kögel, Fabian and Bulling, Andreas and Wirzberger, Maria}, year = {2024}, booktitle = {Proc. 31st Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)}, pages = {9154--9169}, doi = {}, url = {https://aclanthology.org/2024.lrec-main.802/} }
Technical Reports
-
PrivatEyes: Appearance-based Gaze Estimation Using Federated Secure Multi-Party Computation
Mayar Elfares, Pascal Reisert, Zhiming Hu, Wenwu Tang, Ralf Küsters, Andreas Bulling
arXiv:2402.18970, pp. 1–22, 2024.
Latest gaze estimation methods require large-scale training data but their collection and exchange pose significant privacy risks. We propose PrivatEyes - the first privacy-enhancing training approach for appearance-based gaze estimation based on federated learning (FL) and secure multi-party computation (MPC). PrivatEyes enables training gaze estimators on multiple local datasets across different users and server-based secure aggregation of the individual estimators’ updates. PrivatEyes guarantees that individual gaze data remains private even if a majority of the aggregating servers is malicious. We also introduce a new data leakage attack DualView that shows that PrivatEyes limits the leakage of private training data more effectively than previous approaches. Evaluations on the MPIIGaze, MPIIFaceGaze, GazeCapture, and NVGaze datasets further show that the improved privacy does not lead to a lower gaze estimation accuracy or substantially higher computational costs - both of which are on par with its non-secure counterparts.@techreport{elfares24_arxiv, title = {PrivatEyes: Appearance-based Gaze Estimation Using Federated Secure Multi-Party Computation}, author = {Elfares, Mayar and Reisert, Pascal and Hu, Zhiming and Tang, Wenwu and Küsters, Ralf and Bulling, Andreas}, year = {2024}, doi = {10.48550/arXiv.2402.18970}, pages = {1--22} } -
OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog
Adnen Abdessaied, Manuel Hochmeister, Andreas Bulling
arXiv:2402.13146, pp. 1–11, 2024.
We present the Object Language Video Transformer (OLViT) – a novel model for video dialog operating over a multi-modal attention-based dialog state tracker. Existing video dialog models struggle with questions requiring both spatial and temporal localization within videos, long-term temporal reasoning, and accurate object tracking across multiple dialog turns. OLViT addresses these challenges by maintaining a global dialog state based on the output of an Object State Tracker (OST) and a Language State Tracker (LST): while the OST attends to the most important objects within the video, the LST keeps track of the most important linguistic co-references to previous dialog turns. In stark contrast to previous works, our approach is generic by nature and is therefore capable of learning continuous multi-modal dialog state representations of the most relevant objects and rounds. As a result, these representations can be seamlessly integrated into Large Language Models (LLMs) and offer high flexibility in dealing with different datasets and tasks. Evaluations on the challenging DVD (response classification) and SIMMC 2.1 (response generation) datasets show that OLViT achieves new state-of-the-art performance across both datasets.@techreport{abdessaied24_arxiv, author = {Abdessaied, Adnen and von Hochmeister, Manuel and Bulling, Andreas}, title = {OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog}, year = {2024}, pages = {1--11}, url = {https://arxiv.org/abs/2402.13146} } -
Limits of Theory of Mind Modelling in Dialogue-Based Collaborative Plan Acquisition
Matteo Bortoletto, Constantin Ruhdorfer, Adnen Abdessaied, Lei Shi, Andreas Bulling
arXiv:2405.12621, pp. 1–16, 2024.
Recent work on dialogue-based collaborative plan acquisition (CPA) has suggested that Theory of Mind (ToM) modelling can improve missing knowledge prediction in settings with asymmetric skill-sets and knowledge. Although ToM was claimed to be important for effective collaboration, its real impact on this novel task remains under-explored. By representing plans as graphs and by exploiting task-specific constraints we show that, as performance on CPA nearly doubles when predicting one’s own missing knowledge, the improvements due to ToM modelling diminish. This phenomenon persists even when evaluating existing baseline methods. To better understand the relevance of ToM for CPA, we report a principled performance comparison of models with and without ToM features. Results across different models and ablations consistently suggest that learned ToM features are indeed more likely to reflect latent patterns in the data with no perceivable link to ToM. This finding calls for a deeper understanding of the role of ToM in CPA and beyond, as well as new methods for modelling and evaluating mental states in computational collaborative agents.@techreport{bortoletto24_arxiv, author = {Bortoletto, Matteo and Ruhdorfer, Constantin and Abdessaied, Adnen and Shi, Lei and Bulling, Andreas}, title = {Limits of Theory of Mind Modelling in Dialogue-Based Collaborative Plan Acquisition}, year = {2024}, pages = {1--16}, url = {https://arxiv.org/abs/2405.12621} } -
Benchmarking Mental State Representations in Language Models
Matteo Bortoletto, Constantin Ruhdorfer, Lei Shi, Andreas Bulling
arXiv:2406.17513, pp. 1–21, 2024.
While numerous works have assessed the generative performance of language models (LMs) on tasks requiring Theory of Mind reasoning, research into the models’ internal representation of mental states remains limited. Recent work has used probing to demonstrate that LMs can represent beliefs of themselves and others. However, these claims are accompanied by limited evaluation, making it difficult to assess how mental state representations are affected by model design and training choices. We report an extensive benchmark with various LM types with different model sizes, fine-tuning approaches, and prompt designs to study the robustness of mental state representations and memorisation issues within the probes. Our results show that the quality of models’ internal representations of the beliefs of others increases with model size and, more crucially, with fine-tuning. We are the first to study how prompt variations impact probing performance on theory of mind tasks. We demonstrate that models’ representations are sensitive to prompt variations, even when such variations should be beneficial. Finally, we complement previous activation editing experiments on Theory of Mind tasks and show that it is possible to improve models’ reasoning performance by steering their activations without the need to train any probe.@techreport{bortoletto24_arxiv_2, author = {Bortoletto, Matteo and Ruhdorfer, Constantin and Shi, Lei and Bulling, Andreas}, title = {Benchmarking Mental State Representations in Language Models}, year = {2024}, pages = {1--21}, url = {https://arxiv.org/abs/2406.17513} } -
Explicit Modelling of Theory of Mind for Belief Prediction in Nonverbal Social Interactions
Matteo Bortoletto, Constantin Ruhdorfer, Lei Shi, Andreas Bulling
arXiv:2407.06762, pp. 1–11, 2024.
We propose MToMnet - a Theory of Mind (ToM) neural network for predicting beliefs and their dynamics during human social interactions from multimodal input. ToM is key for effective nonverbal human communication and collaboration, yet existing methods for belief modelling have not included explicit ToM modelling or have typically been limited to one or two modalities. MToMnet encodes contextual cues (scene videos and object locations) and integrates them with person-specific cues (human gaze and body language) in a separate MindNet for each person. Inspired by prior research on social cognition and computational ToM, we propose three different MToMnet variants: two involving fusion of latent representations and one involving re-ranking of classification scores. We evaluate our approach on two challenging real-world datasets, one focusing on belief prediction and the other on belief dynamics prediction. Our results demonstrate that MToMnet surpasses existing methods by a large margin while at the same time requiring a significantly smaller number of parameters. Taken together, our method opens up a highly promising direction for future work on artificially intelligent systems that can robustly predict human beliefs from their non-verbal behaviour and, as such, more effectively collaborate with humans.Paper Access: https://arxiv.org/abs/2407.06762@techreport{bortoletto24_arxiv_3, author = {Bortoletto, Matteo and Ruhdorfer, Constantin and Shi, Lei and Bulling, Andreas}, title = {Explicit Modelling of Theory of Mind for Belief Prediction in Nonverbal Social Interactions}, year = {2024}, pages = {1--11}, url = {https://arxiv.org/abs/2407.06762} } -
Generating Realistic Arm Movements in Reinforcement Learning: A Quantitative Comparison of Reward Terms and Task Requirements
Jhon Paul Feliciano Charaja Casas, Isabell Wochner, Pierre Schumacher, Winfried Ilg, Martin Giese, Christophe Maufroy, Andreas Bulling, Syn Schmitt, Daniel F.B. Haeufle
arXiv:2402.13949, pp. 1–6, 2024.
The mimicking of human-like arm movement characteristics involves the consideration of three factors during control policy synthesis: (a) chosen task requirements, (b) inclusion of noise during movement execution and (c) chosen optimality principles. Previous studies showed that when considering these factors (a-c) individually, it is possible to synthesize arm movements that either kinematically match the experimental data or reproduce the stereotypical triphasic muscle activation pattern. However, to date no quantitative comparison has been made of how realistic the arm movements generated by each factor are, nor of whether a partial or total combination of all factors results in arm movements with human-like kinematic characteristics and a triphasic muscle pattern. To investigate this, we used reinforcement learning to learn a control policy for a musculoskeletal arm model, aiming to discern which combination of factors (a-c) results in realistic arm movements according to four frequently reported stereotypical characteristics. Our findings indicate that incorporating velocity and acceleration requirements into the reaching task, employing reward terms that encourage minimization of mechanical work, hand jerk, and control effort, along with the inclusion of noise during movement, leads to the emergence of realistic human arm movements in reinforcement learning. We expect that the gained insights will help in the future to better predict desired arm movements and corrective forces in wearable assistive devices.Paper Access: https://arxiv.org/abs/2402.13949@techreport{casas24_arxiv, title = {Generating Realistic Arm Movements in Reinforcement Learning: A Quantitative Comparison of Reward Terms and Task Requirements}, author = {Casas, Jhon Paul Feliciano Charaja and Wochner, Isabell and Schumacher, Pierre and Ilg, Winfried and Giese, Martin and Maufroy, Christophe and Bulling, Andreas and Schmitt, Syn and Haeufle, Daniel F.B.}, year = {2024}, pages = {1--6}, url = {https://arxiv.org/abs/2402.13949} } -
GazeMotion: Gaze-guided Human Motion Forecasting
Zhiming Hu, Syn Schmitt, Daniel Häufle, Andreas Bulling
arXiv:2403.09885, pp. 1–6, 2024.
We present GazeMotion, a novel method for human motion forecasting that combines information on past human poses with human eye gaze. Inspired by evidence from behavioural sciences showing that human eye and body movements are closely coordinated, GazeMotion first predicts future eye gaze from past gaze, then fuses predicted future gaze and past poses into a gaze-pose graph, and finally uses a residual graph convolutional network to forecast body motion. We extensively evaluate our method on the MoGaze, ADT, and GIMO benchmark datasets and show that it outperforms state-of-the-art methods by up to 7.4% in mean per joint position error. Using head direction as a proxy for gaze, our method still achieves an average improvement of 5.5%. We finally report an online user study showing that our method also outperforms prior methods in terms of perceived realism. These results show the significant information content available in eye gaze for human motion forecasting as well as the effectiveness of our method in exploiting this information.Paper Access: https://arxiv.org/abs/2403.09885@techreport{hu24_arxiv, author = {Hu, Zhiming and Schmitt, Syn and Häufle, Daniel and Bulling, Andreas}, title = {GazeMotion: Gaze-guided Human Motion Forecasting}, year = {2024}, pages = {1--6}, url = {https://arxiv.org/abs/2403.09885} } -
DiffGaze: A Diffusion Model for Continuous Gaze Sequence Generation on 360° Images
Chuhan Jiao, Yao Wang, Guanhua Zhang, Mihai Bâce, Zhiming Hu, Andreas Bulling
arXiv:2403.17477, pp. 1–13, 2024.
We present DiffGaze, a novel method for generating realistic and diverse continuous human gaze sequences on 360° images based on a conditional score-based denoising diffusion model. Generating human gaze on 360° images is important for various human-computer interaction and computer graphics applications, e.g. for creating large-scale eye tracking datasets or for realistic animation of virtual humans. However, existing methods are limited to predicting discrete fixation sequences or aggregated saliency maps, thereby neglecting crucial parts of natural gaze behaviour. Our method uses features extracted from 360° images as the condition and employs two transformers to model the temporal and spatial dependencies of continuous human gaze. We evaluate DiffGaze on two 360° image benchmarks for gaze sequence generation as well as scanpath prediction and saliency prediction. Our evaluations show that DiffGaze outperforms state-of-the-art methods on all tasks on both benchmarks. We also report a 21-participant user study showing that our method generates gaze sequences that are indistinguishable from real human sequences. Taken together, our evaluations not only demonstrate the effectiveness of DiffGaze but also point towards a new generation of methods that faithfully model the rich spatial and temporal nature of natural human gaze behaviour.Paper Access: https://arxiv.org/abs/2403.17477@techreport{jiao24_arxiv, title = {DiffGaze: A Diffusion Model for Continuous Gaze Sequence Generation on 360° Images}, author = {Jiao, Chuhan and Wang, Yao and Zhang, Guanhua and B{\^a}ce, Mihai and Hu, Zhiming and Bulling, Andreas}, year = {2024}, pages = {1--13}, url = {https://arxiv.org/abs/2403.17477} } -
DiffEyeSyn: Diffusion-based User-specific Eye Movement Synthesis
Chuhan Jiao, Guanhua Zhang, Zhiming Hu, Andreas Bulling
arXiv:2409.01240, pp. 1–22, 2024.
High-frequency components in eye gaze data contain user-specific information promising for various applications, but existing gaze modelling methods focus on low frequencies of typically not more than 30 Hz. We present DiffEyeSyn – the first computational method to synthesise high-frequency gaze data, including eye movement characteristics specific to individual users. The key idea is to consider the high-frequency, user-specific information as a special type of noise in eye movement data. This perspective reshapes eye movement synthesis into the task of injecting this user-specific noise into any given eye movement sequence. We formulate this injection task as a conditional diffusion process in which the synthesis is conditioned on user-specific embeddings extracted from the gaze data using pre-trained models for user authentication. We propose user identity guidance – a novel loss function that allows our model to preserve user identity while generating human-like eye movements in the spatial domain. Experiment results on two public high-frequency eye movement biometric datasets show that our synthetic eye movements are indistinguishable from real human eye movements. Furthermore, we demonstrate that DiffEyeSyn can be used to synthesise eye gaze data at scale and for different downstream tasks, such as gaze data imputation and gaze data super-resolution. As such, our work lays the methodological foundations for personalised eye movement synthesis that has significant application potential, such as for character animation, eye movement biometrics, or gaze-based activity and context recognition.Paper Access: https://arxiv.org/abs/2409.01240@techreport{jiao24_arxiv_2, title = {DiffEyeSyn: Diffusion-based User-specific Eye Movement Synthesis}, author = {Jiao, Chuhan and Zhang, Guanhua and Hu, Zhiming and Bulling, Andreas}, year = {2024}, pages = {1--22}, url = {https://arxiv.org/abs/2409.01240} } -
MultiMediate’24: Multi-Domain Engagement Estimation
Philipp Müller, Michal Balazia, Tobias Baur, Michael Dietz, Alexander Heimerl, Anna Penzkofer, Dominik Schiller, François Brémond, Jan Alexandersson, Elisabeth André, Andreas Bulling
arXiv:2408.16625, pp. 1–6, 2024.
Estimating the momentary level of participants’ engagement is an important prerequisite for assistive systems that support human interactions. Previous work has addressed this task in within-domain evaluation scenarios, i.e. training and testing on the same dataset. This is in contrast to real-life scenarios where domain shifts between training and testing data frequently occur. With MultiMediate’24, we present the first challenge addressing multi-domain engagement estimation. As training data, we utilise the NOXI database of dyadic novice-expert interactions. In addition to within-domain test data, we add two new test domains. First, we introduce recordings following the NOXI protocol but covering languages that are not present in the NOXI training data. Second, we collected novel engagement annotations on the MPIIGroupInteraction dataset, which consists of group discussions among three to four people. In this way, MultiMediate’24 evaluates the ability of approaches to generalise across factors such as language and cultural background, group size, task, and screen-mediated vs. face-to-face interaction. This paper describes the MultiMediate’24 challenge and presents baseline results. In addition, we discuss selected challenge solutions.Paper: mueller24_arxiv.pdfPaper Access: http://arxiv.org/abs/2408.16625@techreport{mueller24_arxiv, title = {MultiMediate'24: Multi-Domain Engagement Estimation}, author = {M{\"{u}}ller, Philipp and Balazia, Michal and Baur, Tobias and Dietz, Michael and Heimerl, Alexander and Penzkofer, Anna and Schiller, Dominik and Brémond, François and Alexandersson, Jan and André, Elisabeth and Bulling, Andreas}, year = {2024}, pages = {1--6}, doi = {10.48550/arXiv.2408.16625}, url = {http://arxiv.org/abs/2408.16625} } -
The Overcooked Generalisation Challenge
Constantin Ruhdorfer, Matteo Bortoletto, Anna Penzkofer, Andreas Bulling
arXiv:2406.17949, pp. 1–25, 2024.
We introduce the Overcooked Generalisation Challenge (OGC) - the first benchmark to study agents’ zero-shot cooperation abilities when faced with novel partners and levels in the Overcooked-AI environment. This perspective starkly contrasts a large body of previous work that has trained and evaluated cooperating agents only on the same level, failing to capture generalisation abilities required for real-world human-AI cooperation. Our challenge interfaces with state-of-the-art dual curriculum design (DCD) methods to generate auto-curricula for training general agents in Overcooked. It is the first cooperative multi-agent environment specially designed for DCD methods and, consequently, the first benchmarked with state-of-the-art methods. It is fully GPU-accelerated, built on the DCD benchmark suite minimax, and freely available under an open-source license: this https URL. We show that current DCD algorithms struggle to produce useful policies in this novel challenge, even if combined with recent network architectures that were designed for scalability and generalisability. The OGC pushes the boundaries of real-world human-AI cooperation by enabling the research community to study the impact of generalisation on cooperating agents.@techreport{ruhdorfer2024_arxiv, title = {The Overcooked Generalisation Challenge}, author = {Ruhdorfer, Constantin and Bortoletto, Matteo and Penzkofer, Anna and Bulling, Andreas}, year = {2024}, pages = {1-25}, url = {https://arxiv.org/abs/2406.17949} } -
ActionDiffusion: An Action-aware Diffusion Model for Procedure Planning in Instructional Videos
Lei Shi, Paul-Christian Bürkner, Andreas Bulling
arXiv:2403.08591, pp. 1–6, 2024.
We present ActionDiffusion – a novel diffusion model for procedure planning in instructional videos and the first such model to take temporal inter-dependencies between actions into account. This approach is in stark contrast to existing methods that fail to exploit the rich information content available in the particular order in which actions are performed. Our method unifies the learning of temporal dependencies between actions and denoising of the action plan in the diffusion process by projecting the action information into the noise space. This is achieved 1) by adding action embeddings in the noise masks in the noise-adding phase and 2) by introducing an attention mechanism in the noise prediction network to learn the correlations between different action steps. We report extensive experiments on three instructional video benchmark datasets (CrossTask, Coin, and NIV) and show that our method outperforms previous state-of-the-art methods on all metrics on CrossTask and NIV and all metrics except accuracy on the Coin dataset. We show that by adding action embeddings into the noise mask the diffusion model can better learn action temporal dependencies and increase the performance on procedure planning.Paper Access: https://arxiv.org/abs/2403.08591@techreport{shi24_arxiv, title = {ActionDiffusion: An Action-aware Diffusion Model for Procedure Planning in Instructional Videos}, author = {Shi, Lei and Bürkner, Paul-Christian and Bulling, Andreas}, year = {2024}, pages = {1--6}, url = {https://arxiv.org/abs/2403.08591} } -
Learning User Embeddings from Human Gaze for Personalised Saliency Prediction
Florian Strohm, Mihai Bâce, Andreas Bulling
arXiv:2403.13653, pp. 1–15, 2024.
Reusable embeddings of user behaviour have shown significant performance improvements for the personalised saliency prediction task. However, prior works require explicit user characteristics and preferences as input, which are often difficult to obtain. We present a novel method to extract user embeddings from pairs of natural images and corresponding saliency maps generated from a small amount of user-specific eye tracking data. At the core of our method is a Siamese convolutional neural encoder that learns the user embeddings by contrasting the image and personal saliency map pairs of different users. Evaluations on two saliency datasets show that the generated embeddings have high discriminative power, are effective at refining universal saliency maps to the individual users, and generalise well across users and images. Finally, based on our model’s ability to encode individual user characteristics, our work points towards other applications that can benefit from reusable embeddings of gaze behaviour.Paper Access: https://arxiv.org/abs/2403.13653@techreport{strohm24_arxiv, title = {Learning User Embeddings from Human Gaze for Personalised Saliency Prediction}, author = {Strohm, Florian and Bâce, Mihai and Bulling, Andreas}, year = {2024}, pages = {1--15}, url = {https://arxiv.org/abs/2403.13653} } -
SeFFeC: Semantic Facial Feature Control for Fine-grained Face Editing
Florian Strohm, Mihai Bâce, Markus Kaltenecker, Andreas Bulling
arXiv:2403.13972, pp. 1–18, 2024.
We propose Semantic Facial Feature Control (SeFFeC) - a novel method for fine-grained face shape editing. Our method enables the manipulation of human-understandable, semantic face features, such as nose length or mouth width, which are defined by different groups of facial landmarks. In contrast to existing methods, the use of facial landmarks enables precise measurement of the facial features, which then enables training SeFFeC without any manually annotated labels. SeFFeC consists of a transformer-based encoder network that takes a latent vector of a pre-trained generative model and a facial feature embedding as input, and learns to modify the latent vector to perform the desired face edit operation. To ensure that the desired feature measurement is changed towards the target value without altering uncorrelated features, we introduced a novel semantic face feature loss. Qualitative and quantitative results show that SeFFeC enables precise and fine-grained control of 23 facial features, some of which could not previously be controlled by other methods, without requiring manual annotations. Unlike existing methods, SeFFeC also provides deterministic control over the exact values of the facial features and more localised and disentangled face edits.Paper Access: https://arxiv.org/abs/2403.13972@techreport{strohm24_arxiv_2, title = {SeFFeC: Semantic Facial Feature Control for Fine-grained Face Editing}, author = {Strohm, Florian and Bâce, Mihai and Kaltenecker, Markus and Bulling, Andreas}, year = {2024}, pages = {1--18}, url = {https://arxiv.org/abs/2403.13972} }
2023
Journal Articles
-
Privacy-Aware Eye Tracking: Challenges and Future Directions
Céline Gressel, Rebekah Overdorf, Inken Hagenstedt, Murat Karaboga, Helmut Lurtz, Michael Raschke, Andreas Bulling
IEEE Pervasive Computing, 22 (1), pp. 95-102, 2023.
What do you have to keep in mind when developing or using eye-tracking technologies regarding privacy? In this article we discuss the main ethical, technical, and legal categories of privacy, which is much more than just data protection. We additionally provide recommendations about how such technologies might mitigate privacy risks and in which cases the risks are higher than the benefits of the technology.Paper: gressel23_pcm.pdf@article{gressel23_pcm, title = {Privacy-Aware Eye Tracking: Challenges and Future Directions}, author = {Gressel, Céline and Overdorf, Rebekah and Hagenstedt, Inken and Karaboga, Murat and Lurtz, Helmut and Raschke, Michael and Bulling, Andreas}, journal = {IEEE Pervasive Computing}, year = {2023}, volume = {22}, number = {1}, doi = {10.1109/MPRV.2022.3228660}, pages = {95-102} } -
Scanpath Prediction on Information Visualisations
Yao Wang, Mihai Bâce, Andreas Bulling
IEEE Transactions on Visualization and Computer Graphics (TVCG), 30 (7), pp. 3902–3914, 2023.
We propose the Unified Model of Saliency and Scanpaths (UMSS) – a model that learns to predict multi-duration saliency and scanpaths (i.e. sequences of eye fixations) on information visualisations. Although scanpaths provide rich information about the importance of different visualisation elements during the visual exploration process, prior work has been limited to predicting aggregated attention statistics, such as visual saliency. We present in-depth analyses of gaze behaviour for different information visualisation elements (e.g. Title, Label, Data) on the popular MASSVIS dataset. We show that while, overall, gaze patterns are surprisingly consistent across visualisations and viewers, there are also structural differences in gaze dynamics for different elements. Informed by our analyses, UMSS first predicts multi-duration element-level saliency maps, then probabilistically samples scanpaths from them. Extensive experiments on MASSVIS show that our method consistently outperforms state-of-the-art methods with respect to several widely used scanpath and saliency evaluation metrics. Our method achieves a relative improvement in sequence score of 11.5 % for scanpath prediction, and a relative improvement in Pearson correlation coefficient of up to 23.6 % for saliency prediction. These results are auspicious and point towards richer user models and simulations of visual attention on visualisations without the need for any eye tracking equipment.Paper: wang23_tvcg.pdfSupplementary Material: wang23_tvcg_sup.pdf@article{wang23_tvcg, title = {Scanpath Prediction on Information Visualisations}, author = {Wang, Yao and Bâce, Mihai and Bulling, Andreas}, year = {2023}, journal = {IEEE Transactions on Visualization and Computer Graphics (TVCG)}, volume = {30}, number = {7}, pages = {3902--3914}, doi = {10.1109/TVCG.2023.3242293} }
Conference Papers
-
Exploring Natural Language Processing Methods for Interactive Behaviour Modelling
Guanhua Zhang, Matteo Bortoletto, Zhiming Hu, Lei Shi, Mihai Bâce, Andreas Bulling
Proc. IFIP TC13 Conference on Human-Computer Interaction (INTERACT), pp. 1–22, 2023.
Best Student Paper Nomination
Analysing and modelling interactive behaviour is an important topic in human-computer interaction (HCI) and a key requirement for the development of intelligent interactive systems. Interactive behaviour has a sequential (actions happen one after another) and hierarchical (a sequence of actions forms an activity driven by interaction goals) structure, which may be similar to the structure of natural language. Designed based on such a structure, natural language processing (NLP) methods have achieved groundbreaking success in various downstream tasks. However, few works linked interactive behaviour with natural language. In this paper, we explore the similarity between interactive behaviour and natural language by applying an NLP method, byte pair encoding (BPE), to encode mouse and keyboard behaviour. We then analyse the vocabulary, i.e., the set of action sequences, learnt by BPE, as well as use the vocabulary to encode the input behaviour for interactive task recognition. An existing dataset collected in constrained lab settings and our novel out-of-the-lab dataset were used for evaluation. Results show that this natural language-inspired approach not only learns action sequences that reflect specific interaction goals, but also achieves higher F1 scores on task recognition than other methods. Our work reveals the similarity between interactive behaviour and natural language, and presents the potential of applying the new pack of methods that leverage insights from NLP to model interactive behaviour in HCI.Paper: zhang23_interact.pdf@inproceedings{zhang23_interact, title = {Exploring Natural Language Processing Methods for Interactive Behaviour Modelling}, author = {Zhang, Guanhua and Bortoletto, Matteo and Hu, Zhiming and Shi, Lei and B{\^a}ce, Mihai and Bulling, Andreas}, booktitle = {Proc. IFIP TC13 Conference on Human-Computer Interaction (INTERACT)}, pages = {1--22}, year = {2023}, publisher = {Springer} } -
Improving Neural Saliency Prediction with a Cognitive Model of Human Visual Attention
Ekta Sood, Lei Shi, Matteo Bortoletto, Yao Wang, Philipp Müller, Andreas Bulling
Proc. 45th Annual Meeting of the Cognitive Science Society (CogSci), pp. 3639–3646, 2023.
We present a novel method for saliency prediction that leverages a cognitive model of visual attention as an inductive bias. This approach is in stark contrast to recent purely data-driven saliency models that achieve performance improvements mainly by increased capacity, resulting in high computational costs and the need for large-scale training datasets. We demonstrate that by using a cognitive model, our method achieves competitive performance to the state of the art across several natural image datasets while only requiring a fraction of the parameters. Furthermore, we set the new state of the art for saliency prediction on information visualizations, demonstrating the effectiveness of our approach for cross-domain generalization. We further provide augmented versions of the full MSCOCO dataset with synthetic gaze data using the cognitive model, which we used to pre-train our method. Our results are highly promising and underline the significant potential of bridging between cognitive and data-driven models, potentially also beyond attention.@inproceedings{sood23_cogsci, author = {Sood, Ekta and Shi, Lei and Bortoletto, Matteo and Wang, Yao and Müller, Philipp and Bulling, Andreas}, title = {Improving Neural Saliency Prediction with a Cognitive Model of Human Visual Attention}, booktitle = {Proc. the 45th Annual Meeting of the Cognitive Science Society (CogSci)}, year = {2023}, pages = {3639--3646} } -
GazeCast: Using Mobile Devices to Allow Gaze-based Interaction on Public Displays
Omar Namnakani, Penpicha Sinrattanavong, Yasmeen Abdrabou, Andreas Bulling, Florian Alt, Mohamed Khamis
Proc. Communication by Gaze Interaction Symposium (COGAIN), pp. 1–8, 2023.
COGAIN Best Paper Award
Gaze is promising for natural and spontaneous interaction with public displays, but current gaze-enabled displays either require movement-hindering stationary eye trackers or cumbersome head-mounted eye trackers. We propose and evaluate GazeCast - a novel system that leverages users’ personal handheld mobile devices to allow gaze-based interaction with surrounding displays. GazeCast improves gaze interaction on such displays by neither setting limitations on where users have to position themselves nor on the number of concurrent users. In a user study (N = 20), we compared GazeCast to using a standard webcam for gaze-based interaction using Pursuits. We find that while selection using GazeCast requires more time and physical demand, participants value GazeCast’s high accuracy and the flexible positioning. We conclude by discussing how mobile computing can facilitate the adoption of gaze interaction with pervasive displays.Paper: namnakani23_cogain.pdf@inproceedings{namnakani23_cogain, title = {GazeCast: Using Mobile Devices to Allow Gaze-based Interaction on Public Displays}, author = {Namnakani, Omar and Sinrattanavong, Penpicha and Abdrabou, Yasmeen and Bulling, Andreas and Alt, Florian and Khamis, Mohamed}, year = {2023}, pages = {1--8}, booktitle = {Proc. Communication by Gaze Interaction Symposium (COGAIN)}, doi = {10.1145/3588015.3589663} } -
Int-HRL: Towards Intention-based Hierarchical Reinforcement Learning
Anna Penzkofer, Simon Schaefer, Florian Strohm, Mihai Bâce, Stefan Leutenegger, Andreas Bulling
Proc. Adaptive and Learning Agents Workshop (ALA), pp. 1–7, 2023.
While deep reinforcement learning (RL) agents outperform humans on an increasing number of tasks, training them requires data equivalent to decades of human gameplay. Recent hierarchical RL methods have increased sample efficiency by incorporating information inherent to the structure of the decision problem but at the cost of having to discover or use human-annotated sub-goals that guide the learning process. We show that intentions of human players, i.e. the precursor of goal-oriented decisions, can be robustly predicted from eye gaze even for the long-horizon sparse rewards task of Montezuma’s Revenge – one of the most challenging RL tasks in the Atari2600 game suite. We propose Int-HRL: Hierarchical RL with intention-based sub-goals that are inferred from human eye gaze. Our novel sub-goal extraction pipeline is fully automatic and replaces the need for manual sub-goal annotation by human experts. Our evaluations show that replacing hand-crafted sub-goals with automatically extracted intentions leads to a HRL agent that is significantly more sample efficient than previous methods.Paper: penzkofer23_ala.pdf@inproceedings{penzkofer23_ala, author = {Penzkofer, Anna and Schaefer, Simon and Strohm, Florian and Bâce, Mihai and Leutenegger, Stefan and Bulling, Andreas}, title = {Int-HRL: Towards Intention-based Hierarchical Reinforcement Learning}, booktitle = {Proc. Adaptive and Learning Agents Workshop (ALA)}, year = {2023}, pages = {1--7} } -
Impact of Privacy Protection Methods of Lifelogs on Remembered Memories
Passant Elagroudy, Mohamed Khamis, Florian Mathis, Diana Irmscher, Ekta Sood, Andreas Bulling, Albrecht Schmidt
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–10, 2023.
Lifelogging is traditionally used for memory augmentation. However, recent research shows that users’ trust in the completeness and accuracy of lifelogs might skew their memories. Privacy-protection alterations such as body blurring and content deletion are commonly applied to photos to avoid capturing sensitive information. However, their impact on how users remember memories remains unclear. To this end, we conduct a white-hat memory attack and report on an iterative experiment (N=21) to compare the impact of viewing 1) unaltered lifelogs, 2) blurred lifelogs, and 3) a subset of the lifelogs after deleting private ones, on confidently remembering memories. Findings indicate that all the privacy methods impact memories’ quality similarly and that users tend to change their answers in recognition more than recall scenarios. Results also show that users have high confidence in their remembered content across all privacy methods. Our work raises awareness about the mindful design of technological interventions.Paper: elagroudy23_chi.pdf@inproceedings{elagroudy23_chi, author = {Elagroudy, Passant and Khamis, Mohamed and Mathis, Florian and Irmscher, Diana and Sood, Ekta and Bulling, Andreas and Schmidt, Albrecht}, title = {Impact of Privacy Protection Methods of Lifelogs on Remembered Memories}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2023}, doi = {10.1145/3544548.3581565}, pages = {1--10} } -
Federated Learning for Appearance-based Gaze Estimation in the Wild
Mayar Elfares, Zhiming Hu, Pascal Reisert, Andreas Bulling, Ralf Küsters
Proceedings of The 1st Gaze Meets ML workshop, PMLR, pp. 20–36, 2023.
Gaze estimation methods have significantly matured in recent years, but the large number of eye images required to train deep learning models poses significant privacy risks. In addition, the heterogeneous data distribution across different users can significantly hinder the training process. In this work, we propose the first federated learning approach for gaze estimation to preserve the privacy of gaze data. We further employ pseudo-gradient optimisation to adapt our federated learning approach to the divergent model updates to address the heterogeneous nature of in-the-wild gaze data in collaborative setups. We evaluate our approach on a real-world dataset (MPIIGaze) and show that our work enhances the privacy guarantees of conventional appearance-based gaze estimation methods, handles the convergence issues of gaze estimators, and significantly outperforms vanilla federated learning by 15.8% (from a mean error of 10.63 degrees to 8.95 degrees). As such, our work paves the way to develop privacy-aware collaborative learning setups for gaze estimation while maintaining the model’s performance.Paper: elfares23_gmml.pdfPaper Access: https://proceedings.mlr.press/v210/elfares23a.html@inproceedings{elfares23_gmml, title = {Federated Learning for Appearance-based Gaze Estimation in the Wild}, author = {Elfares, Mayar and Hu, Zhiming and Reisert, Pascal and Bulling, Andreas and K{\"u}sters, Ralf}, booktitle = {Proceedings of The 1st Gaze Meets ML workshop, PMLR}, pages = {20--36}, year = {2023}, editor = {Lourentzou, Ismini and Wu, Joy and Kashyap, Satyananda and Karargyris, Alexandros and Celi, Leo Anthony and Kawas, Ban and Talathi, Sachin}, volume = {210}, series = {Proceedings of Machine Learning Research}, publisher = {PMLR}, url = {https://proceedings.mlr.press/v210/elfares23a.html} } -
SUPREYES: SUPer Resolution for EYES Using Implicit Neural Representation Learning
Chuhan Jiao, Zhiming Hu, Mihai Bâce, Andreas Bulling
Proc. ACM Symposium on User Interface Software and Technology (UIST), pp. 1–13, 2023.
We introduce SUPREYES – a novel self-supervised method to increase the spatio-temporal resolution of gaze data recorded using low(er)-resolution eye trackers. Despite continuing advances in eye tracking technology, the vast majority of current eye trackers – particularly mobile ones and those integrated into mobile devices – suffer from low-resolution gaze data, thus fundamentally limiting their practical usefulness. SUPREYES learns a continuous implicit neural representation from low-resolution gaze data to up-sample the gaze data to arbitrary resolutions. We compare our method with commonly used interpolation methods on arbitrary scale super-resolution and demonstrate that SUPREYES outperforms these baselines by a significant margin. We also test on the sample downstream task of gaze-based user identification and show that our method improves the performance of original low-resolution gaze data and outperforms other baselines. These results are promising as they open up a new direction for increasing eye tracking fidelity as well as enabling new gaze-based applications without the need for new eye tracking equipment.Paper: jiao23_uist.pdf@inproceedings{jiao23_uist, author = {Jiao, Chuhan and Hu, Zhiming and B{\^a}ce, Mihai and Bulling, Andreas}, title = {SUPREYES: SUPer Resolution for EYES Using Implicit Neural Representation Learning}, booktitle = {Proc. ACM Symposium on User Interface Software and Technology (UIST)}, year = {2023}, pages = {1--13}, doi = {10.1145/3586183.3606780} } -
MultiMediate ’23: Engagement Estimation and Bodily Behaviour Recognition in Social Interactions
Philipp Müller, Michal Balazia, Tobias Baur, Michael Dietz, Alexander Heimerl, Dominik Schiller, Mohammed Guermal, Dominike Thomas, François Brémond, Jan Alexandersson, Elisabeth André, Andreas Bulling
Proceedings of the 31st ACM International Conference on Multimedia, pp. 9640–9645, 2023.
Automatic analysis of human behaviour is a fundamental prerequisite for the creation of machines that can effectively interact with and support humans in social interactions. In MultiMediate’23, we address two key human social behaviour analysis tasks for the first time in a controlled challenge: engagement estimation and bodily behaviour recognition in social interactions. This paper describes the MultiMediate’23 challenge and presents novel sets of annotations for both tasks. For engagement estimation, we collected novel annotations on the NOvice eXpert Interaction (NOXI) database. For bodily behaviour recognition, we annotated test recordings of the MPIIGroupInteraction corpus with the BBSI annotation scheme. In addition, we present baseline results for both challenge tasks.Paper: mueller23_mm.pdfPaper Access: https://doi.org/10.1145/3581783.3613851@inproceedings{mueller23_mm, author = {M\"{u}ller, Philipp and Balazia, Michal and Baur, Tobias and Dietz, Michael and Heimerl, Alexander and Schiller, Dominik and Guermal, Mohammed and Thomas, Dominike and Br\'{e}mond, Fran\c{c}ois and Alexandersson, Jan and Andr\'{e}, Elisabeth and Bulling, Andreas}, title = {MultiMediate '23: Engagement Estimation and Bodily Behaviour Recognition in Social Interactions}, year = {2023}, isbn = {9798400701085}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3581783.3613851}, doi = {10.1145/3581783.3613851}, booktitle = {Proceedings of the 31st ACM International Conference on Multimedia}, pages = {9640–9645}, numpages = {6}, keywords = {dataset, engagement, nonverbal behaviour, challenge}, location = {Ottawa ON, Canada}, series = {MM '23} } -
Multimodal Integration of Human-Like Attention in Visual Question Answering
Ekta Sood, Fabian Kögel, Philipp Müller, Dominike Thomas, Mihai Bâce, Andreas Bulling
Proc. Workshop on Gaze Estimation and Prediction in the Wild (GAZE), CVPRW, pp. 2647–2657, 2023.
Tobii Sponsor Award, Oral Presentation
Human-like attention as a supervisory signal to guide neural attention has shown significant promise but is currently limited to uni-modal integration – even for inherently multi-modal tasks such as visual question answering (VQA). We present the Multimodal Human-like Attention Network (MULAN) – the first method for multimodal integration of human-like attention on image and text during training of VQA models. MULAN integrates attention predictions from two state-of-the-art text and image saliency models into neural self-attention layers of a recent transformer-based VQA model. Through evaluations on the challenging VQAv2 dataset, we show that MULAN achieves a new state-of-the-art performance of 73.98% accuracy on test-std and 73.72% on test-dev and, at the same time, has approximately 80% fewer trainable parameters than prior work. Overall, our work underlines the potential of integrating multimodal human-like and neural attention for VQA.@inproceedings{sood23_gaze, author = {Sood, Ekta and Kögel, Fabian and Müller, Philipp and Thomas, Dominike and Bâce, Mihai and Bulling, Andreas}, title = {Multimodal Integration of Human-Like Attention in Visual Question Answering}, booktitle = {Proc. Workshop on Gaze Estimation and Prediction in the Wild (GAZE), CVPRW}, year = {2023}, pages = {2647--2657}, url = {https://openaccess.thecvf.com/content/CVPR2023W/GAZE/papers/Sood_Multimodal_Integration_of_Human-Like_Attention_in_Visual_Question_Answering_CVPRW_2023_paper.pdf} } -
Facial Composite Generation with Iterative Human Feedback
Florian Strohm, Ekta Sood, Dominike Thomas, Mihai Bâce, Andreas Bulling
Proc. The 1st Gaze Meets ML workshop, PMLR, pp. 165–183, 2023.
We propose the first method in which human and AI collaborate to iteratively reconstruct the human’s mental image of another person’s face only from their eye gaze. Current tools for generating digital human faces involve a tedious and time-consuming manual design process. While gaze-based mental image reconstruction represents a promising alternative, previous methods still assumed prior knowledge about the target face, thereby severely limiting their practical usefulness. The key novelty of our method is a collaborative, iterative query engine: Based on the user’s gaze behaviour in each iteration, our method predicts which images to show to the user in the next iteration. Results from two human studies (N=12 and N=22) show that our method can visually reconstruct digital faces that are more similar to the mental image, and is more usable compared to other methods. As such, our findings point at the significant potential of human-AI collaboration for reconstructing mental images, potentially also beyond faces, and of human gaze as a rich source of information and a powerful mediator in said collaboration.Paper: strohm23_gmml.pdfPaper Access: https://proceedings.mlr.press/v210/strohm23a.html@inproceedings{strohm23_gmml, title = {Facial Composite Generation with Iterative Human Feedback}, author = {Strohm, Florian and Sood, Ekta and Thomas, Dominike and B{\^a}ce, Mihai and Bulling, Andreas}, booktitle = {Proc. The 1st Gaze Meets ML workshop, PMLR}, pages = {165--183}, year = {2023}, editor = {Lourentzou, Ismini and Wu, Joy and Kashyap, Satyananda and Karargyris, Alexandros and Celi, Leo Anthony and Kawas, Ban and Talathi, Sachin}, volume = {210}, series = {Proceedings of Machine Learning Research}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v210/strohm23a/strohm23a.pdf}, url = {https://proceedings.mlr.press/v210/strohm23a.html} } -
Usable and Fast Interactive Mental Face Reconstruction
Florian Strohm, Mihai Bâce, Andreas Bulling
Proc. ACM Symposium on User Interface Software and Technology (UIST), pp. 1–15, 2023.
We introduce an end-to-end interactive system for mental face reconstruction – the challenging task of visually reconstructing a face image a person only has in their mind. In contrast to existing methods that suffer from low usability and high mental load, our approach only requires the user to rank images over multiple iterations according to the perceived similarity with their mental image. Based on these rankings, our mental face reconstruction system extracts image features in each iteration, combines them into a joint feature vector, and then uses a generative model to visually reconstruct the mental image. To avoid the need for collecting large amounts of human training data, we further propose a computational user model that can simulate human ranking behaviour using data from an online crowd-sourcing study (N=215). Results from a 12-participant user study show that our method can reconstruct mental images that are visually similar to those of existing approaches but has significantly higher usability, lower perceived workload, and is 40% faster. In addition, results from a third 22-participant lineup study in which we validated our reconstructions on a face ranking task show an identification rate of 55.3%, which is in line with prior work. These results represent an important step towards new interactive intelligent systems that can robustly and effortlessly reconstruct a user’s mental image.@inproceedings{strohm23_uist, author = {Strohm, Florian and B{\^a}ce, Mihai and Bulling, Andreas}, title = {Usable and Fast Interactive Mental Face Reconstruction}, booktitle = {Proc. ACM Symposium on User Interface Software and Technology (UIST)}, year = {2023}, pages = {1--15}, doi = {10.1145/3586183.3606795} } -
Gaze Behaviour in Adolescents with Obsessive-compulsive Disorder During Exposure Within Cognitive-behavioural Therapy
Annika Thierfelder, Björn Severitt, Carolin Sarah Klein, Annika Kristin Alt, Karsten Hollmann, Andreas Bulling, Winfried Ilg
Proc. 17th EAI International Conference on Pervasive Computing Technologies for Healthcare (Pervasive Health), 2023.
Digital health interventions that involve monitoring patient behaviour increasingly benefit from improvements in sensor technology. Eye tracking in particular can provide useful information for psychotherapy, but an effective method to extract this information is currently missing. We propose a method to analyse natural gaze behaviour during exposure exercises for obsessive-compulsive disorder (OCD). At the core of our method is a neural network to detect fixations based on gaze patch similarities. Detected fixations are clustered into exposure-relevant, therapist, and other locations and corresponding eye movement metrics are correlated with subjective stress reported during exposure. We evaluate our method on gaze and stress data recorded during video-based psychotherapy of four adolescents with OCD. We found that fixation duration onto exposure-relevant locations consistently increases with the perceived stress level as opposed to fixations onto other locations. Fixation behaviour towards the therapist varied considerably between patients. Taken together, our results demonstrate the effectiveness of our method for analysing natural gaze behaviour during exposure sessions. The fixation analysis further shows that patients allocate more attention towards exposure-related objects under higher stress levels, suggesting higher mental load. As such, providing feedback on fixation behaviour holds significant promise to support therapists in monitoring the intensity of exposure exercises.@inproceedings{thierfelder23_pervasiveh, title = {Gaze Behaviour in Adolescents with Obsessive-compulsive Disorder During Exposure Within Cognitive-behavioural Therapy}, author = {Thierfelder, Annika and Severitt, Björn and Klein, Carolin Sarah and Alt, Annika Kristin and Hollmann, Karsten and Bulling, Andreas and Ilg, Winfried}, year = {2023}, booktitle = {Proc. 17th EAI International Conference on Pervasive Computing Technologies for Healthcare (Pervasive Health)}, doi = {10.13140/RG.2.2.30047.02721} }
Technical Reports
-
Neural Reasoning About Agents’ Goals, Preferences, and Actions
Matteo Bortoletto, Lei Shi, Andreas Bulling
arXiv:2312.07122, pp. 1–13, 2023.
We propose the Intuitive Reasoning Network (IRENE) – a novel neural model for intuitive psychological reasoning about agents’ goals, preferences, and actions that can generalise previous experiences to new situations. IRENE combines a graph neural network for learning agent and world state representations with a transformer to encode the task context. When evaluated on the challenging Baby Intuitions Benchmark, IRENE achieves new state-of-the-art performance on three out of its five tasks – with up to 48.9 % improvement. In contrast to existing methods, IRENE is able to bind preferences to specific agents, to better distinguish between rational and irrational agents, and to better understand the role of blocking obstacles. We also investigate, for the first time, the influence of the training tasks on test performance. Our analyses demonstrate the effectiveness of IRENE in combining prior knowledge gained during training for unseen evaluation tasks.@techreport{bortoletto23_arxiv, author = {Bortoletto, Matteo and Shi, Lei and Bulling, Andreas}, title = {Neural Reasoning About Agents’ Goals, Preferences, and Actions}, year = {2023}, pages = {1--13}, doi = {10.48550/arXiv.2312.07122}, url = {https://arxiv.org/abs/2312.07122} } -
Pose2Gaze: Generating Realistic Human Gaze Behaviour from Full-body Poses using an Eye-body Coordination Model
Zhiming Hu, Jiahui Xu, Syn Schmitt, Andreas Bulling
arXiv:2312.12042, pp. 1–10, 2023.
While generating realistic body movements, e.g., for avatars in virtual reality, is widely studied in computer vision and graphics, the generation of eye movements that exhibit realistic coordination with the body remains under-explored. We first report a comprehensive analysis of the coordination of human eye and full-body movements during everyday activities based on data from the MoGaze and GIMO datasets. We show that eye gaze is strongly correlated with both head direction and full-body motion, and that there is a noticeable time delay between body and eye movements. Inspired by these analyses, we then present Pose2Gaze – a novel eye-body coordination model that first uses a convolutional neural network and a spatio-temporal graph convolutional neural network to extract features from head directions and full-body poses, respectively, and then applies a convolutional neural network to generate realistic eye movements. We compare our method with state-of-the-art methods that predict eye gaze only from head movements for three different generation tasks and demonstrate that Pose2Gaze significantly outperforms these baselines on both datasets with an average improvement of 26.4% and 21.6% in mean angular error, respectively. Our findings underline the significant potential of cross-modal human gaze behaviour analysis and modelling.Paper: hu23_arxiv.pdfPaper Access: https://arxiv.org/abs/2312.12042@techreport{hu23_arxiv, author = {Hu, Zhiming and Xu, Jiahui and Schmitt, Syn and Bulling, Andreas}, title = {Pose2Gaze: Generating Realistic Human Gaze Behaviour from Full-body Poses using an Eye-body Coordination Model}, year = {2023}, pages = {1--10}, url = {https://arxiv.org/abs/2312.12042} } -
Int-HRL: Towards Intention-based Hierarchical Reinforcement Learning
Anna Penzkofer, Simon Schaefer, Florian Strohm, Mihai Bâce, Stefan Leutenegger, Andreas Bulling
arXiv:2306.11483, pp. 1–7, 2023.
While deep reinforcement learning (RL) agents outperform humans on an increasing number of tasks, training them requires data equivalent to decades of human gameplay. Recent hierarchical RL methods have increased sample efficiency by incorporating information inherent to the structure of the decision problem but at the cost of having to discover or use human-annotated sub-goals that guide the learning process. We show that intentions of human players, i.e. the precursor of goal-oriented decisions, can be robustly predicted from eye gaze even for the long-horizon sparse rewards task of Montezuma’s Revenge – one of the most challenging RL tasks in the Atari2600 game suite. We propose Int-HRL: Hierarchical RL with intention-based sub-goals that are inferred from human eye gaze. Our novel sub-goal extraction pipeline is fully automatic and replaces the need for manual sub-goal annotation by human experts. Our evaluations show that replacing hand-crafted sub-goals with automatically extracted intentions leads to a HRL agent that is significantly more sample efficient than previous methods.Paper Access: https://arxiv.org/abs/2306.11483@techreport{penzkofer23_arxiv, author = {Penzkofer, Anna and Schaefer, Simon and Strohm, Florian and Bâce, Mihai and Leutenegger, Stefan and Bulling, Andreas}, title = {Int-HRL: Towards Intention-based Hierarchical Reinforcement Learning}, year = {2023}, pages = {1--7}, url = {https://arxiv.org/abs/2306.11483} } -
Inferring Human Intentions from Predicted Action Probabilities
Lei Shi, Paul-Christian Bürkner, Andreas Bulling
arXiv:2308.12194, pp. 1–7, 2023.
Predicting the next action that a human is most likely to perform is key to human-AI collaboration and has consequently attracted increasing research interest in recent years. An important factor for next action prediction is human intention: if the AI agent knows the intention, it can predict future actions and plan collaboration more effectively. Existing Bayesian methods for this task struggle with complex visual input while deep neural network (DNN) based methods do not provide uncertainty quantifications. In this work we combine both approaches for the first time and show that the predicted next action probabilities contain information that can be used to infer the underlying intention. We propose a two-step approach to human intention prediction: While a DNN predicts the probabilities of the next action, MCMC-based Bayesian inference is used to infer the underlying intention from these predictions. This approach not only allows for independent design of the DNN architecture but also enables fast, design-independent inference of human intentions. We evaluate our method using a series of experiments on the Watch-And-Help (WAH) dataset and a keyboard and mouse interaction dataset. Our results show that our approach can accurately predict human intentions from observed actions and the implicit information contained in next action probabilities. Furthermore, we show that our approach can predict the correct intention even if only a few actions have been observed.Paper Access: https://arxiv.org/abs/2308.12194@techreport{shi23_arxiv, title = {Inferring Human Intentions from Predicted Action Probabilities}, author = {Shi, Lei and Bürkner, Paul-Christian and Bulling, Andreas}, year = {2023}, pages = {1--7}, url = {https://arxiv.org/abs/2308.12194} } -
GazeMoDiff: Gaze-guided Diffusion Model for Stochastic Human Motion Prediction
Haodong Yan, Zhiming Hu, Syn Schmitt, Andreas Bulling
arXiv:2312.12090, pp. 1–10, 2023.
Human motion prediction is important for virtual reality (VR) applications, e.g., for realistic avatar animation. Existing methods have synthesised body motion only from observed past motion, despite the fact that human gaze is known to correlate strongly with body movements and is readily available in recent VR headsets. We present GazeMoDiff – a novel gaze-guided denoising diffusion model to generate stochastic human motions. Our method first uses a graph attention network to learn the spatio-temporal correlations between eye gaze and human movements and to fuse them into cross-modal gaze-motion features. These cross-modal features are injected into a noise prediction network via a cross-attention mechanism and progressively denoised to generate realistic human full-body motions. Experimental results on the MoGaze and GIMO datasets demonstrate that our method outperforms the state-of-the-art methods by a large margin in terms of average displacement error (15.03% on MoGaze and 9.20% on GIMO). We further conducted an online user study to compare our method with state-of-the-art methods and the responses from 23 participants validate that the motions generated by our method are more realistic than those from other methods. Taken together, our work makes a first important step towards gaze-guided stochastic human motion prediction and guides future work on this important topic in VR research.Paper: yan23_arxiv.pdfPaper Access: https://arxiv.org/abs/2312.12090@techreport{yan23_arxiv, author = {Yan, Haodong and Hu, Zhiming and Schmitt, Syn and Bulling, Andreas}, title = {GazeMoDiff: Gaze-guided Diffusion Model for Stochastic Human Motion Prediction}, year = {2023}, pages = {1--10}, url = {https://arxiv.org/abs/2312.12090} }
2022
Journal Articles
-
Understanding, Addressing, and Analysing Digital Eye Strain in Virtual Reality Head-Mounted Displays
Teresa Hirzle, Fabian Fischbach, Julian Karlbauer, Pascal Jansen, Jan Gugenheimer, Enrico Rukzio, Andreas Bulling
ACM Transactions on Computer-Human Interaction (TOCHI), 29 (4), pp. 1-80, 2022.
Digital eye strain (DES), caused by prolonged exposure to digital screens, stresses the visual system and negatively affects users’ well-being and productivity. While DES is well-studied for computer displays, its impact on users of virtual reality (VR) head-mounted displays (HMDs) is largely unexplored—despite the fact that some of their key properties (e.g., the vergence-accommodation conflict) make VR HMDs particularly prone to it. This work provides the first comprehensive investigation into DES in VR HMDs. We present results from a survey with 68 experienced users to understand DES symptoms in VR HMDs. To help address DES, we investigate eye exercises resulting from survey answers and blue light filtering in three user studies (N = 71). Results demonstrate that eye exercises, but not blue light filtering, can effectively reduce DES. We conclude with an extensive analysis of the user studies and condense our findings into 10 key challenges that guide future work in this emerging research area.doi: 10.1145/3492802Paper: hirzle22_tochi.pdf@article{hirzle22_tochi, title = {Understanding, Addressing, and Analysing Digital Eye Strain in Virtual Reality Head-Mounted Displays}, author = {Hirzle, Teresa and Fischbach, Fabian and Karlbauer, Julian and Jansen, Pascal and Gugenheimer, Jan and Rukzio, Enrico and Bulling, Andreas}, year = {2022}, pages = {1-80}, doi = {10.1145/3492802}, journal = {ACM Transactions on Computer-Human Interaction (TOCHI)}, volume = {29}, number = {4} } -
Anticipatory Human-Machine Interaction (Dagstuhl Seminar 22202)
Jelmer Borst, Andreas Bulling, Cleotilde Gonzalez, Nele Russwinkel
Dagstuhl Reports, 12 (5), pp. 131–169, 2022.
Paper: borst22_dagstuhl.pdfPaper Access: https://drops.dagstuhl.de/opus/volltexte/2022/17446@article{borst22_dagstuhl, author = {Borst, Jelmer and Bulling, Andreas and Gonzalez, Cleotilde and Russwinkel, Nele}, title = {{Anticipatory Human-Machine Interaction (Dagstuhl Seminar 22202)}}, pages = {131--169}, journal = {Dagstuhl Reports}, year = {2022}, volume = {12}, number = {5}, editor = {Borst, Jelmer and Bulling, Andreas and Gonzalez, Cleotilde and Russwinkel, Nele}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, url = {https://drops.dagstuhl.de/opus/volltexte/2022/17446}, doi = {10.4230/DagRep.12.5.131} } -
Adapting visualizations and interfaces to the user
Francesco Chiossi, Johannes Zagermann, Jakob Karolus, Nils Rodrigues, Priscilla Balestrucci, Daniel Weiskopf, Benedikt Ehinger, Tiare Feuchtner, Harald Reiterer, Lewis L. Chuang, Marc Ernst, Andreas Bulling, Sven Mayer, Albrecht Schmidt
it - Information Technology, 64 (4-5), pp. 133–143, 2022.
Adaptive visualization and interfaces pervade our everyday tasks to improve interaction from the point of view of user performance and experience. This approach allows using several user inputs, whether physiological, behavioral, qualitative, or multimodal combinations, to enhance the interaction. Due to the multitude of approaches, we outline the current research trends of inputs used to adapt visualizations and user interfaces. Moreover, we discuss methodological approaches used in mixed reality, physiological computing, visual analytics, and proficiency-aware systems. With this work, we provide an overview of the current research in adaptive systems.Paper: chiossi22_it.pdf@article{chiossi22_it, title = {Adapting visualizations and interfaces to the user}, author = {Chiossi, Francesco and Zagermann, Johannes and Karolus, Jakob and Rodrigues, Nils and Balestrucci, Priscilla and Weiskopf, Daniel and Ehinger, Benedikt and Feuchtner, Tiare and Reiterer, Harald and Chuang, Lewis L. and Ernst, Marc and Bulling, Andreas and Mayer, Sven and Schmidt, Albrecht}, pages = {133--143}, volume = {64}, number = {4-5}, journal = {it - Information Technology}, doi = {10.1515/itit-2022-0035}, year = {2022} } -
User-centred multimodal authentication: securing handheld mobile devices using gaze and touch input
Mohamed Khamis, Karola Marky, Andreas Bulling, Florian Alt
Behaviour & Information Technology, 41 (10), pp. 2061-2083, 2022.
Handheld mobile devices store a plethora of sensitive data, such as private emails, personal messages, photos, and location data. Authentication is essential to protect access to sensitive data. However, the majority of mobile devices are currently secured by single-modal authentication schemes, which are vulnerable to shoulder surfing, smudge attacks, and thermal attacks. While some authentication schemes protect against one of these attacks, only a few schemes address all three of them. We propose multimodal authentication where touch and gaze input are combined to resist shoulder surfing, as well as smudge and thermal attacks. Based on a series of previously published works where we studied the usability of several user-centred multimodal authentication designs and their security against multiple threat models, we provide a comprehensive overview of multimodal authentication on handheld mobile devices. We further present guidelines on how to leverage multiple input modalities for enhancing the usability and security of user authentication on mobile devices.Paper: khamis22_bit.pdf@article{khamis22_bit, author = {Khamis, Mohamed and Marky, Karola and Bulling, Andreas and Alt, Florian}, title = {User-centred multimodal authentication: securing handheld mobile devices using gaze and touch input}, journal = {Behaviour \& Information Technology}, volume = {41}, number = {10}, pages = {2061-2083}, year = {2022}, publisher = {Taylor & Francis}, doi = {10.1080/0144929X.2022.2069597} } -
VisRecall: Quantifying Information Visualisation Recallability via Question Answering
Yao Wang, Chuhan Jiao, Mihai Bâce, Andreas Bulling
IEEE Transactions on Visualization and Computer Graphics (TVCG), 28 (12), pp. 4995-5005, 2022.
Despite its importance for assessing the effectiveness of communicating information visually, fine-grained recallability of information visualisations has not been studied quantitatively so far. In this work, we propose a question-answering paradigm to study visualisation recallability and present VisRecall - a novel dataset consisting of 200 visualisations that are annotated with crowd-sourced human (N = 305) recallability scores obtained from 1,000 questions of five question types. Furthermore, we present the first computational method to predict recallability of different visualisation elements, such as the title or specific data values. We report detailed analyses of our method on VisRecall and demonstrate that it outperforms several baselines in overall recallability and FE-, F-, RV-, and U-question recallability. Our work makes fundamental contributions towards a new generation of methods to assist designers in optimising visualisations.Paper: wang22_tvcg.pdfSupplementary Material: wang22_tvcg_sup.pdf@article{wang22_tvcg, title = {VisRecall: Quantifying Information Visualisation Recallability via Question Answering}, author = {Wang, Yao and Jiao, Chuhan and Bâce, Mihai and Bulling, Andreas}, year = {2022}, journal = {IEEE Transactions on Visualization and Computer Graphics (TVCG)}, volume = {28}, number = {12}, pages = {4995-5005}, doi = {10.1109/TVCG.2022.3198163} }
Conference Papers
-
ThumbPitch: Enriching Thumb Interaction on Mobile Touchscreens using Deep Learning
Jamie Ullerich, Maximiliane Windl, Andreas Bulling, Sven Mayer
ACM Proceedings of the 34th Australian Conference on Human-Computer Interaction (OzCHI), pp. 1–9, 2022.
Today touchscreens are one of the most common input devices for everyday ubiquitous interaction. Yet, capacitive touchscreens are limited in expressiveness; thus, a large body of work has focused on extending the input capabilities of touchscreens. One promising approach is to use index finger orientation; however, this requires a two-handed interaction and poses ergonomic constraints. We propose using the thumb’s pitch as an additional input dimension to counteract these limitations, enabling one-handed interaction scenarios. Our deep convolutional neural network detecting the thumb’s pitch is trained on more than 230,000 ground truth images recorded using a motion tracking system. We highlight the potential of ThumbPitch by proposing several use cases that exploit the higher expressiveness, especially for one-handed scenarios. We tested three use cases in a validation study and validated our model. Our model achieved a mean error of only 11.9°.Paper: ullerich22_ozchi.pdf@inproceedings{ullerich22_ozchi, author = {Ullerich, Jamie and Windl, Maximiliane and Bulling, Andreas and Mayer, Sven}, title = {ThumbPitch: Enriching Thumb Interaction on Mobile Touchscreens using Deep Learning}, booktitle = {ACM Proceedings of the 34st Australian Conference on Human-Computer-Interaction (OzCHI)}, year = {2022}, pages = {1--9}, doi = {10.1145/3572921.3572925} } -
Impact of Gaze Uncertainty on AOIs in Information Visualisations
Yao Wang, Maurice Koch, Mihai Bâce, Daniel Weiskopf, Andreas Bulling
ETRA Workshop on Eye Tracking and Visualization (ETVIS), pp. 1–6, 2022.
Gaze-based analysis of areas of interest (AOIs) is widely used in information visualisation research to understand how people explore visualisations or assess the quality of visualisations concerning key characteristics such as memorability. However, nearby AOIs in visualisations amplify the uncertainty caused by the gaze estimation error, which strongly influences the mapping between gaze samples or fixations and different AOIs. We contribute a novel investigation into gaze uncertainty and quantify its impact on AOI-based analysis on visualisations using two novel metrics: the Flipping Candidate Rate (FCR) and Hit Any AOI Rate (HAAR). Our analysis of 40 real-world visualisations, including human gaze and AOI annotations, shows that gaze uncertainty frequently and significantly impacts the analysis conducted in AOI-based studies. Moreover, we analysed four visualisation types and found that bar and scatter plots are usually designed in a way that causes more uncertainty than line and pie plots in gaze-based analysis.@inproceedings{wang22_etvis, title = {Impact of Gaze Uncertainty on AOIs in Information Visualisations}, author = {Wang, Yao and Koch, Maurice and B{\^a}ce, Mihai and Weiskopf, Daniel and Bulling, Andreas}, year = {2022}, pages = {1--6}, booktitle = {ETRA Workshop on Eye Tracking and Visualization (ETVIS)}, doi = {10.1145/3517031.3531166} } -
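The exact definitions of FCR and HAAR are given in the paper; the sketch below only illustrates the underlying Monte Carlo idea, assuming isotropic Gaussian gaze error and hypothetical rectangular AOIs, and is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rectangular AOIs: (x_min, y_min, x_max, y_max) in pixels.
AOIS = {"title": (100, 20, 700, 80), "legend": (620, 100, 780, 220)}

def aoi_of(x, y):
    for name, (x0, y0, x1, y1) in AOIS.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return name
    return None

def flip_and_hit_rates(fixations, sigma_px=35.0, n_samples=1000):
    """Monte Carlo illustration of how Gaussian gaze error can flip the
    fixation-to-AOI mapping (cf. FCR) or make fixations miss all AOIs
    (cf. HAAR); simplified relative to the metric definitions in the paper."""
    flipped, hit_any = 0, 0
    for fx, fy in fixations:
        nominal = aoi_of(fx, fy)
        noisy = rng.normal([fx, fy], sigma_px, size=(n_samples, 2))
        labels = [aoi_of(x, y) for x, y in noisy]
        if any(l is not None and l != nominal for l in labels):
            flipped += 1
        if any(l is not None for l in labels):
            hit_any += 1
    n = len(fixations)
    return flipped / n, hit_any / n

print(flip_and_hit_rates([(650, 90), (400, 50), (200, 400)]))
```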
Mind Wandering Trait-level Tendencies During Lecture Viewing: A Pilot Study
Francesca Zermiani, Andreas Bulling, Maria Wirzberger
Proc. EduEye Workshop on Eye Tracking in Learning and Education (EduEye), pp. 1–7, 2022.
Mind wandering (MW) is defined as a shift of attention to task-unrelated internal thoughts that is pervasive and disruptive for learning performance. Current state-of-the-art gaze-based attention-aware intelligent systems are capable of detecting MW from eye movements and delivering interventions to mitigate its negative effects. However, the beneficial functions of MW and its trait-level tendency, defined as the content of MW experience, are still largely neglected by these systems. In this pilot study, we address the questions of whether different MW trait-level tendencies can be detected through off-screen fixations’ frequency and duration and blink rate during a lecture viewing task. We focus on prospective planning and creative problem-solving as two of the main MW trait-level tendencies. Despite the non-significance, the descriptive values show a higher frequency and duration of off-screen fixations, but lower blink rate, in the creative problem-solving MW condition. Interestingly, we do find a highly significant correlation between MW level and engagement scores in the prospective planning MW group. Potential explanations for the observed results are discussed. Overall, these findings represent a preliminary step towards the development of more accurate and adaptive learning technologies, and call for further studies on MW trait-level tendency detection.Paper: zermiani22_edueye.pdf@inproceedings{zermiani22_edueye, title = {Mind Wandering Trait-level Tendencies During Lecture Viewing: A Pilot Study}, author = {Zermiani, Francesca and Bulling, Andreas and Wirzberger, Maria}, year = {2022}, booktitle = {Proc. EduEye Workshop on Eye Tracking in Learning and Education (EduEye)}, doi = {10.1145/3517031.3529241}, pages = {1--7} } -
PrivacyScout: Assessing Vulnerability to Shoulder Surfing on Mobile Devices
Mihai Bâce, Alia Saad, Mohamed Khamis, Stefan Schneegass, Andreas Bulling
Proc. on Privacy Enhancing Technologies (PETs), pp. 650–669, 2022.
One approach to mitigate shoulder surfing attacks on mobile devices is to detect the presence of a bystander using the phone’s front-facing camera. However, a person’s face in the camera’s field of view does not always indicate an attack. To overcome this limitation, in a novel data collection study (N=16), we analysed the influence of three viewing angles and four distances on the success of shoulder surfing attacks. In contrast to prior works that mainly focused on user authentication, we investigated three common types of content susceptible to shoulder surfing: text, photos, and PIN authentications. We show that the vulnerability of text and photos depends on the observer’s location relative to the device, while PIN authentications are vulnerable independent of the observation location. We then present PrivacyScout - a novel method that predicts the shoulder-surfing risk based on visual features extracted from the observer’s face as captured by the front-facing camera. Finally, evaluations from our data collection study demonstrate our method’s feasibility to assess the risk of a shoulder surfing attack more accurately.Paper: bace22_pets.pdf@inproceedings{bace22_pets, title = {PrivacyScout: Assessing Vulnerability to Shoulder Surfing on Mobile Devices}, author = {B{\^a}ce, Mihai and Saad, Alia and Khamis, Mohamed and Schneegass, Stefan and Bulling, Andreas}, year = {2022}, booktitle = {Proc. on Privacy Enhancing Technologies (PETs)}, doi = {10.56553/popets-2022-0090}, pages = {650--669}, issue = {3} } -
Designing for Noticeability: The Impact of Visual Importance on Desktop Notifications
Philipp Müller, Sander Staal, Mihai Bâce, Andreas Bulling
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–13, 2022.
Desktop notifications should be noticeable but are also subject to a number of design choices, e.g. concerning their size, placement, or opacity. It is currently unknown, however, how these choices interact with the desktop background and how they influence noticeability. To address this limitation, we introduce a software tool to automatically synthesize realistic-looking desktop images for major operating systems and applications. Using these images, we present a user study (N=34) to investigate the noticeability of notifications during a primary task. We are the first to show that visual importance of the background at the notification location significantly impacts whether users detect notifications. We analyse the utility of visual importance to compensate for suboptimal design choices with respect to noticeability, e.g. small notification size. Finally, we introduce noticeability maps - 2D maps encoding the predicted noticeability across the desktop - and inform designers how to trade off notification design and noticeability.Paper: mueller22_chi.pdf@inproceedings{mueller22_chi, title = {Designing for Noticeability: The Impact of Visual Importance on Desktop Notifications}, author = {Müller, Philipp and Staal, Sander and B{\^a}ce, Mihai and Bulling, Andreas}, year = {2022}, pages = {1--13}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, doi = {10.1145/3491102.3501954} } -
Video Language Co-Attention with Multimodal Fast-Learning Feature Fusion for VideoQA
Adnen Abdessaied*, Ekta Sood*, Andreas Bulling
Proc. of the 7th Workshop on Representation Learning for NLP (Repl4NLP), pp. 1–12, 2022.
[Equal contribution by the first two authors.] We propose the Video Language Co-Attention Network (VLCN) – a novel memory-enhanced model for Video Question Answering (VideoQA). Our model combines two original contributions: A multimodal fast-learning feature fusion (FLF) block and a mechanism that uses self-attended language features to separately guide neural attention on both static and dynamic visual features extracted from individual video frames and short video clips. When trained from scratch, VLCN achieves competitive results with the state of the art on both MSVD-QA and MSRVTT-QA with 38.06% and 36.01% test accuracies, respectively. Through an ablation study, we further show that FLF improves generalization across different VideoQA datasets and performance for question types that are notoriously challenging in current datasets, such as long questions that require deeper reasoning as well as questions with rare answers.@inproceedings{abdessaied22_repl4NLP, author = {Abdessaied*, Adnen and Sood*, Ekta and Bulling, Andreas}, title = {Video Language Co-Attention with Multimodal Fast-Learning Feature Fusion for VideoQA}, booktitle = {Proc. of the 7th Workshop on Representation Learning for NLP (Repl4NLP)}, year = {2022}, pages = {1--12} } -
Neuro-Symbolic Visual Dialog
Adnen Abdessaied, Mihai Bâce, Andreas Bulling
Proc. 29th International Conference on Computational Linguistics (COLING), pp. 1–11, 2022.
We propose Neuro-Symbolic Visual Dialog (NSVD) —the first method to combine deep learning and symbolic program execution for multi-round visually-grounded reasoning. NSVD significantly outperforms existing purely-connectionist methods on two key challenges inherent to visual dialog: long-distance co-reference resolution as well as vanishing question-answering performance. We demonstrate the latter by proposing a more realistic and stricter evaluation scheme in which we use predicted answers for the full dialog history when calculating accuracy. We describe two variants of our model and show that using this new scheme, our best model achieves an accuracy of 99.72% on CLEVR-Dialog —a relative improvement of more than 10% over the state of the art —while only requiring a fraction of training data. Moreover, we demonstrate that our neuro-symbolic models have a higher mean first failure round, are more robust against incomplete dialog histories, and generalise better not only to dialogs that are up to three times longer than those seen during training but also to unseen question types and scenes.@inproceedings{abdessaied22_coling, author = {Abdessaied, Adnen and Bâce, Mihai and Bulling, Andreas}, title = {Neuro-Symbolic Visual Dialog}, booktitle = {Proc. 29th International Conference on Computational Linguistics (COLING)}, year = {2022}, pages = {1--11} } -
Gaze-enhanced Crossmodal Embeddings for Emotion Recognition
Ahmed Abdou, Ekta Sood, Philipp Müller, Andreas Bulling
Proc. International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–18, 2022.
Emotional expressions are inherently multimodal – integrating facial behavior, speech, and gaze – but their automatic recognition is often limited to a single modality, e.g. speech during a phone call. While previous work proposed crossmodal emotion embeddings to improve monomodal recognition performance, despite its importance, a representation of gaze was not included. We propose a new approach to emotion recognition that incorporates an explicit representation of gaze in a crossmodal emotion embedding framework. We show that our method outperforms the previous state of the art for both audio-only and video-only emotion classification on the popular One-Minute Gradual Emotion Recognition dataset. Furthermore, we report extensive ablation experiments and provide insights into the performance of different state-of-the-art gaze representations and integration strategies. Our results not only underline the importance of gaze for emotion recognition but also demonstrate a practical and highly effective approach to leveraging gaze information for this task.@inproceedings{abdou22_etra, title = {Gaze-enhanced Crossmodal Embeddings for Emotion Recognition}, author = {Abdou, Ahmed and Sood, Ekta and Müller, Philipp and Bulling, Andreas}, year = {2022}, booktitle = {Proc. International Symposium on Eye Tracking Research and Applications (ETRA)}, doi = {10.1145/3530879}, volume = {6}, pages = {1--18} } -
MultiMediate’22: Backchannel Detection and Agreement Estimation in Group Interactions
Philipp Müller, Dominik Schiller, Dominike Thomas, Michael Dietz, Hali Lindsay, Patrick Gebhard, Elisabeth André, Andreas Bulling
Proc. ACM Multimedia (MM), pp. 7109-7114, 2022.
Backchannels, i.e. short interjections of the listener, serve important meta-conversational purposes like signifying attention or indicating agreement. Despite their key role, automatic analysis of backchannels in group interactions has been largely neglected so far. The MultiMediate challenge addresses, for the first time, the tasks of backchannel detection and agreement estimation from backchannels in group conversations. This paper describes the MultiMediate challenge and presents a novel set of annotations consisting of 7234 backchannel instances for the MPIIGroupInteraction dataset. Each backchannel was additionally annotated with the extent to which it expresses agreement towards the current speaker. In addition to an analysis of the collected annotations, we present baseline results for both challenge tasks.Paper: mueller22_mm.pdf@inproceedings{mueller22_mm, title = {MultiMediate'22: Backchannel Detection and Agreement Estimation in Group Interactions}, author = {M{\"{u}}ller, Philipp and Schiller, Dominik and Thomas, Dominike and Dietz, Michael and Lindsay, Hali and Gebhard, Patrick and André, Elisabeth and Bulling, Andreas}, year = {2022}, pages = {7109-7114}, doi = {10.1145/3503161.3551589}, booktitle = {Proc. ACM Multimedia (MM)} } -
Multimodal Sensor-based Identification of Stress and Compulsive Actions in Children with Obsessive-compulsive Disorder for Telemedical Treatment
Annika Thierfelder, Jonas Primbs, Björn Severitt, Carolin Sarah Hohnecker, Jan Kühnhausen, Annika Kristin Alt, Anja Pascher, Ursula Wörz, Helene Passon, Jens Seemann, Christian Ernst, Heinrich Lautenbacher, Martin Holderried, Enkelejda Kasneci, Martin Giese, Andreas Bulling, Michael Menth, Gottfried Maria Barth, Winfried Ilg, Karsten Hollmann, Tobias Johann Renner
Proc. the 44th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 1–7, 2022.
In modern psychotherapy, digital health technology offers advanced and personalized therapy options, increasing availability as well as ecological validity. These aspects have proven to be highly relevant for children and adolescents with obsessive-compulsive disorder (OCD). Exposure and Response Prevention therapy, which is the state-of-the-art treatment for OCD, builds on the reconstruction of everyday life exposure to anxiety-inducing situations. However, while compulsive behavior predominantly occurs in home environments, exposure situations during therapy are limited to clinical settings. Telemedical treatment makes it possible to shift from this limited exposure reconstruction to exposure situations in real life. In the SSTeP KiZ study (smart sensor technology in telepsychotherapy for children and adolescents with OCD), we combine video therapy with wearable sensors delivering physiological and behavioral measures to objectively determine the stress level of patients. The setup makes it possible to gain information from exposure to stress in a realistic environment both during and outside of therapy sessions. In a first pilot study, we explored the sensitivity of individual sensor modalities to different levels of stress and anxiety. For this, we captured the obsessive-compulsive behavior of five adolescents with an ECG chest belt, inertial sensors capturing hand movements, and an eye tracker. Despite their prototypical nature, our results deliver strong evidence that the examined sensor modalities yield biomarkers allowing for personalized detection and quantification of stress and anxiety. This opens up future possibilities to evaluate the severity of individual compulsive behavior based on multivariate state classification in real-life situations.Paper: thierfelder22_embc.pdf@inproceedings{thierfelder22_embc, title = {Multimodal Sensor-based Identification of Stress and Compulsive Actions in Children with Obsessive-compulsive Disorder for Telemedical Treatment}, author = {Thierfelder, Annika and Primbs, Jonas and Severitt, Björn and Hohnecker, Carolin Sarah and Kühnhausen, Jan and Alt, Annika Kristin and Pascher, Anja and Wörz, Ursula and Passon, Helene and Seemann, Jens and Ernst, Christian and Lautenbacher, Heinrich and Holderried, Martin and Kasneci, Enkelejda and Giese, Martin and Bulling, Andreas and Menth, Michael and Barth, Gottfried Maria and Ilg, Winfried and Hollmann, Karsten and Renner, Tobias Johann}, year = {2022}, pages = {1--7}, booktitle = {Proc. the 44th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)}, doi = {10.1109/EMBC48229.2022.9871899} } -
Predicting Next Actions and Latent Intents during Text Formatting
Guanhua Zhang, Susanne Hindennach, Jan Leusmann, Felix Bühler, Benedict Steuerlein, Sven Mayer, Mihai Bâce, Andreas Bulling
Proc. the CHI Workshop Computational Approaches for Understanding, Generating, and Adapting User Interfaces, pp. 1–6, 2022.
In this work we investigate the challenging task of predicting user intents from mouse and keyboard input as well as gaze behaviour. In contrast to prior work we study intent prediction at two different resolutions on the behavioural timeline: predicting future input actions as well as latent intents to achieve a high-level interaction goal. Results from a user study (N=15) on a sample text formatting task show that the sequence of prior actions is more informative for intent prediction than gaze. Only using the action sequence, we can predict the next action and the high-level intent with an accuracy of 66% and 96%, respectively. In contrast, accuracy when using features extracted from gaze behaviour was significantly lower, at 41% and 46%. This finding is important for the development of future anticipatory user interfaces that aim to proactively adapt to user intents and interaction goals.Paper: zhang22_caugaui.pdf@inproceedings{zhang22_caugaui, author = {Zhang, Guanhua and Hindennach, Susanne and Leusmann, Jan and Bühler, Felix and Steuerlein, Benedict and Mayer, Sven and Bâce, Mihai and Bulling, Andreas}, title = {Predicting Next Actions and Latent Intents during Text Formatting}, booktitle = {Proc. the CHI Workshop Computational Approaches for Understanding, Generating, and Adapting User Interfaces}, year = {2022}, pages = {1--6} }
Technical Reports
-
Federated Learning for Appearance-based Gaze Estimation in the Wild
Mayar Elfares, Zhiming Hu, Pascal Reisert, Andreas Bulling, Ralf Küsters
arXiv:2211.07330, pp. 1–17, 2022.
Gaze estimation methods have significantly matured in recent years but the large number of eye images required to train deep learning models poses significant privacy risks. In addition, the heterogeneous data distribution across different users can significantly hinder the training process. In this work, we propose the first federated learning approach for gaze estimation to preserve the privacy of gaze data. We further employ pseudo-gradient optimisation to adapt our federated learning approach to the divergent model updates to address the heterogeneous nature of in-the-wild gaze data in collaborative setups. We evaluate our approach on a real-world dataset (MPIIGaze dataset) and show that our work enhances the privacy guarantees of conventional appearance-based gaze estimation methods, handles the convergence issues of gaze estimators, and significantly outperforms vanilla federated learning by 15.8% (from a mean error of 10.63 degrees to 8.95 degrees). As such, our work paves the way to develop privacy-aware collaborative learning setups for gaze estimation while maintaining the model’s performance.Paper: elfares22_arxiv.pdf@techreport{elfares22_arxiv, title = {Federated Learning for Appearance-based Gaze Estimation in the Wild}, author = {Elfares, Mayar and Hu, Zhiming and Reisert, Pascal and Bulling, Andreas and Küsters, Ralf}, year = {2022}, doi = {10.48550/arXiv.2211.07330}, pages = {1--17} } -
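The aggregation idea, treating the averaged client update as a pseudo-gradient for a server-side optimiser, can be sketched as follows. The momentum optimiser, learning rates, and toy model below are assumptions for illustration and not the authors' implementation.

```python
import numpy as np

def server_round(global_w, client_ws, server_lr=1.0, momentum=0.9, state=None):
    """One aggregation round: the mean client update is treated as a
    pseudo-gradient and fed to a server-side momentum optimiser
    (a sketch of the general idea; the paper's exact optimiser may differ)."""
    pseudo_grad = np.mean([global_w - w for w in client_ws], axis=0)
    state = np.zeros_like(global_w) if state is None else state
    state = momentum * state + pseudo_grad
    return global_w - server_lr * state, state

# Toy example: a 3-parameter "gaze model" and three clients with divergent updates.
w = np.zeros(3)
m = None
for _ in range(5):
    client_ws = [w - 0.1 * np.array([1.0, 0.5, 0.0]) + 0.01 * np.random.randn(3)
                 for _ in range(3)]
    w, m = server_round(w, client_ws, server_lr=0.5, momentum=0.9, state=m)
print(w)
```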
MultiMediate’22: Backchannel Detection and Agreement Estimation in Group Interactions
Philipp Müller, Dominik Schiller, Dominike Thomas, Michael Dietz, Hali Lindsay, Patrick Gebhard, Elisabeth André, Andreas Bulling
arXiv:2209.09578, pp. 1–6, 2022.
Backchannels, i.e. short interjections of the listener, serve important meta-conversational purposes like signifying attention or indicating agreement. Despite their key role, automatic analysis of backchannels in group interactions has been largely neglected so far. The MultiMediate challenge addresses, for the first time, the tasks of backchannel detection and agreement estimation from backchannels in group conversations. This paper describes the MultiMediate challenge and presents a novel set of annotations consisting of 7234 backchannel instances for the MPIIGroupInteraction dataset. Each backchannel was additionally annotated with the extent to which it expresses agreement towards the current speaker. In addition to an analysis of the collected annotations, we present baseline results for both challenge tasks.Paper: mueller22_arxiv.pdfPaper Access: http://arxiv.org/abs/2209.09578@techreport{mueller22_arxiv, title = {MultiMediate'22: Backchannel Detection and Agreement Estimation in Group Interactions}, author = {M{\"{u}}ller, Philipp and Schiller, Dominik and Thomas, Dominike and Dietz, Michael and Lindsay, Hali and Gebhard, Patrick and André, Elisabeth and Bulling, Andreas}, year = {2022}, pages = {1--6}, doi = {10.48550/arXiv.2209.09578}, url = {http://arxiv.org/abs/2209.09578} }
2021
Journal Articles
-
EHTask: Recognizing User Tasks from Eye and Head Movements in Immersive Virtual Reality
Zhiming Hu, Andreas Bulling, Sheng Li, Guoping Wang
IEEE Transactions on Visualization and Computer Graphics (TVCG), 29 (4), pp. 1992–2004, 2021.
Understanding human visual attention in immersive virtual reality (VR) is crucial for many important applications, including gaze prediction, gaze guidance, and gaze-contingent rendering. However, previous works on visual attention analysis typically only explored one specific VR task and paid less attention to the differences between different tasks. Moreover, existing task recognition methods typically focused on 2D viewing conditions and only explored the effectiveness of human eye movements. We first collect eye and head movements of 30 participants performing four tasks, i.e. Free viewing, Visual search, Saliency, and Track, in 15 360-degree VR videos. Using this dataset, we analyze the patterns of human eye and head movements and reveal significant differences across different tasks in terms of fixation duration, saccade amplitude, head rotation velocity, and eye-head coordination. We then propose EHTask – a novel learning-based method that employs eye and head movements to recognize user tasks in VR. We show that our method significantly outperforms the state-of-the-art methods derived from 2D viewing conditions both on our dataset (accuracy of 84.4% vs. 62.8%) and on a real-world dataset (61.9% vs. 44.1%). As such, our work provides meaningful insights into human visual attention under different VR tasks and guides future work on recognizing user tasks in VR.Paper: hu21_tvcg_2.pdf@article{hu21_tvcg_2, author = {Hu, Zhiming and Bulling, Andreas and Li, Sheng and Wang, Guoping}, title = {EHTask: Recognizing User Tasks from Eye and Head Movements in Immersive Virtual Reality}, journal = {IEEE Transactions on Visualization and Computer Graphics (TVCG)}, year = {2021}, doi = {10.1109/TVCG.2021.3138902}, pages = {1992--2004}, volume = {29}, number = {4} } -
FixationNet: Forecasting Eye Fixations in Task-Oriented Virtual Environments
Zhiming Hu, Andreas Bulling, Sheng Li, Guoping Wang
IEEE Transactions on Visualization and Computer Graphics (TVCG), 27 (5), pp. 2681–2690, 2021.
TVCG Best Journal Nominees Award
Human visual attention in immersive virtual reality (VR) is key for many important applications, such as content design, gaze-contingent rendering, or gaze-based interaction. However, prior works typically focused on free-viewing conditions that have limited relevance for practical applications. We first collect eye tracking data of 27 participants performing a visual search task in four immersive VR environments. Based on this dataset, we provide a comprehensive analysis of the collected data and reveal correlations between users’ eye fixations and other factors, i.e. users’ historical gaze positions, task-related objects, saliency information of the VR content, and users’ head rotation velocities. Based on this analysis, we propose FixationNet – a novel learning-based model to forecast users’ eye fixations in the near future in VR. We evaluate the performance of our model for free-viewing and task-oriented settings and show that it outperforms the state of the art by a large margin of 19.8% (from a mean error of 2.93° to 2.35°) in free-viewing and of 15.1% (from 2.05° to 1.74°) in task-oriented situations. As such, our work provides new insights into task-oriented attention in virtual environments and guides future work on this important topic in VR research.@article{hu21_tvcg, author = {Hu, Zhiming and Bulling, Andreas and Li, Sheng and Wang, Guoping}, title = {FixationNet: Forecasting Eye Fixations in Task-Oriented Virtual Environments}, journal = {IEEE Transactions on Visualization and Computer Graphics (TVCG)}, year = {2021}, doi = {10.1109/TVCG.2021.3067779}, pages = {2681--2690}, volume = {27}, number = {5}, url = {https://cranehzm.github.io/FixationNet.html} } -
Decoding binary decisions under differential target probabilities from pupil dilation: A random forest approach
Christoph Strauch, Teresa Hirzle, Stefan Van der Stigchel, Andreas Bulling
Journal of Vision (JOV), 21 (7), pp. 1-13, 2021.
While our pupils slightly dilate when we look at an intended target, they do not when we look at irrelevant distractors. This suggests that it may be possible to decode the intention of an observer, understood as the outcome of implicit covert binary decisions, from the pupillary dynamics over time. However, only a few previous works have investigated the feasibility of this approach, and those that did did not control for possible confounds such as motor execution, changes in brightness, or target and distractor probability. We report on our efforts to decode intentions from pupil dilation obtained under strict experimental control on a single-trial basis using a machine learning approach. The basis for our analyses are data from n = 69 participants who looked at letters that needed to be selected with stimulus probabilities that varied systematically in a blockwise manner (n = 19,417 trials). We confirm earlier findings that pupil dilation is indicative of intentions and show that these can be decoded with a classification performance of up to 76% ROCAUC if targets are rarer than distractors. To better understand which characteristics of the pupillary signal are most informative, we finally compare relative feature importances. The first derivative of pupil size changes was found to be most relevant, allowing us to decode intention within only about 800 ms of trial onset. Taken together, our results provide credible insights into the potential of decoding intentions from pupil dilation and may soon form the basis for new applications in visual search, gaze-based interaction, or human-robot interaction.doi: 10.1167/jov.21.7.6Paper: strauch21_jov.pdf@article{strauch21_jov, author = {Strauch, Christoph and Hirzle, Teresa and der Stigchel, Stefan Van and Bulling, Andreas}, title = {Decoding binary decisions under differential target probabilities from pupil dilation: A random forest approach}, journal = {Journal of Vision (JOV)}, year = {2021}, volume = {21}, number = {7}, pages = {1-13}, doi = {10.1167/jov.21.7.6} }
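A minimal sketch of this kind of analysis pipeline, a random forest classifying trials from simple pupil features that include the first derivative, is shown below; the feature set, synthetic data, and hyperparameters are illustrative assumptions rather than the study's actual setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

def pupil_features(trace):
    """Simple per-trial features; the first derivative of pupil size is included
    because the paper reports it as the most informative signal."""
    d1 = np.diff(trace)
    return [trace.mean(), trace.max() - trace.min(), d1.mean(), d1.max(), d1.std()]

# Synthetic stand-in data: target trials get a slightly stronger dilation ramp.
def make_trial(is_target):
    t = np.linspace(0, 1, 80)                      # ~800 ms at 100 Hz
    ramp = (0.08 if is_target else 0.02) * t
    return ramp + 0.01 * rng.standard_normal(t.size)

y = rng.integers(0, 2, 600)
X = np.array([pupil_features(make_trial(label)) for label in y])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("ROC AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```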
Conference Papers
-
Altering Non-verbal Cues to Implicitly Direct Attention in Social VR
Radiah Rivu, Ken Pfeuffer, Philipp Müller, Yomna Abdelrahman, Andreas Bulling, Florian Alt
ACM Symposium on Spatial User Interaction (SSUI), pp. 1–2, 2021.
In this work we explore a concept system that alters the virtual eye movements without the user’s awareness, and whether this can affect social attention among others. Our concept augments the real movements with subtle redirected gazes to people, that occur in intervals to remain unnoticed. We present a user study with groups of people conversing on a topic, and measure the level of visual attention among users. Compared to a baseline of natural eye movements, we find that the method has indeed affected the overall attention in the group, but in unexpected ways. Our work points to a new way to exploit the inherent role of eyes in social virtual reality.Paper: rivu21_ssui.pdf@inproceedings{rivu21_ssui, title = {Altering Non-verbal Cues to Implicitly Direct Attention in Social VR}, author = {Rivu, Radiah and Pfeuffer, Ken and Müller, Philipp and Abdelrahman, Yomna and Bulling, Andreas and Alt, Florian}, year = {2021}, booktitle = {ACM Symposium on Spatial User Interaction (SSUI)}, pages = {1--2}, doi = {10.1145/3485279.3485309} } -
VQA-MHUG: A gaze dataset to study multimodal neural attention in VQA
Ekta Sood, Fabian Kögel, Florian Strohm, Prajit Dhar, Andreas Bulling
Proc. ACL SIGNLL Conference on Computational Natural Language Learning (CoNLL), pp. 27–43, 2021.
Oral Presentation
We present VQA-MHUG - a novel 49-participant dataset of multimodal human gaze on both images and questions during visual question answering (VQA) collected using a high-speed eye tracker. We use our dataset to analyze the similarity between human and neural attentive strategies learned by five state-of-the-art VQA models: Modulated Co-Attention Network (MCAN) with either grid or region features, Pythia, Bilinear Attention Network (BAN), and the Multimodal Factorized Bilinear Pooling Network (MFB). While prior work has focused on studying the image modality, our analyses show - for the first time - that for all models, higher correlation with human attention on text is a significant predictor of VQA performance. This finding points at a potential for improving VQA performance and, at the same time, calls for further research on neural text attention mechanisms and their integration into architectures for vision and language tasks, including but potentially also beyond VQA.@inproceedings{sood21_conll, title = {VQA-MHUG: A gaze dataset to study multimodal neural attention in VQA}, author = {Sood, Ekta and Kögel, Fabian and Strohm, Florian and Dhar, Prajit and Bulling, Andreas}, booktitle = {Proc. ACL SIGNLL Conference on Computational Natural Language Learning (CoNLL)}, year = {2021}, pages = {27--43}, doi = {10.18653/v1/2021.conll-1.3}, publisher = {Association for Computational Linguistics} } -
ConAn: A Usable Tool for Multimodal Conversation Analysis
Anna Penzkofer, Philipp Müller, Felix Bühler, Sven Mayer, Andreas Bulling
Proc. ACM International Conference on Multimodal Interaction (ICMI), pp. 341-351, 2021.
Multimodal analysis of group behavior is a key task in human-computer interaction, as well as the social and behavioral sciences, but is often limited to more easily controllable laboratory settings or requires elaborate multi-sensor setups and time-consuming manual data annotation. We present ConAn – a usable tool to explore and automatically analyze non-verbal behavior of multiple persons during natural group conversations. In contrast to traditional multi-sensor setups, our tool only requires a single 360° camera and uses state-of-the-art computer vision methods to automatically extract behavioral indicators, such as gaze direction, facial expressions, and speaking activity. Thus, our tool allows for easy and fast deployment supporting researchers in understanding both individual behavior and group interaction dynamics, but also in quantifying user-object interactions. We illustrate the benefits of our tool on three sample use cases: general conversation analysis, assessment of collaboration quality, and impact of technology on audience behavior. Taken together, ConAn represents an important step towards democratizing automatic conversation analysis in HCI and beyond.@inproceedings{penzkofer21_icmi, author = {Penzkofer, Anna and Müller, Philipp and Bühler, Felix and Mayer, Sven and Bulling, Andreas}, title = {ConAn: A Usable Tool for Multimodal Conversation Analysis}, booktitle = {Proc. ACM International Conference on Multimodal Interaction (ICMI)}, year = {2021}, doi = {10.1145/3462244.3479886}, pages = {341-351}, video = {https://www.youtube.com/watch?v=H2KfZNgx6CQ} } -
A Critical Assessment of the Use of SSQ as a Measure of General Discomfort in VR Head-Mounted Displays
Teresa Hirzle, Maurice Cordts, Enrico Rukzio, Jan Gugenheimer, Andreas Bulling
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–14, 2021.
Based on a systematic literature review of more than 300 papers published over the last 10 years, we show that the simulator sickness questionnaire (SSQ) is extensively used and widely accepted as general discomfort measure in virtual reality (VR) research - although it only accounts for one category of symptoms. This results in important other categories (digital eye strain (DES) and ergonomics) being largely neglected. To contribute to a more comprehensive picture of discomfort in VR head-mounted displays, we further conducted an online study (N=352) on the severity and relevance of all three symptom categories. Most importantly, our results reveal that symptoms of simulator sickness are significantly less severe and of lower prevalence than those of DES and ergonomics. In light of these findings, we critically discuss the current use of SSQ as the only discomfort measure and propose a more comprehensive factor model that also includes DES and ergonomics.Paper: hirzle21_chi.pdf@inproceedings{hirzle21_chi, title = {A Critical Assessment of the Use of {SSQ} as a Measure of General Discomfort in VR Head-Mounted Displays}, author = {Hirzle, Teresa and Cordts, Maurice and Rukzio, Enrico and Gugenheimer, Jan and Bulling, Andreas}, year = {2021}, pages = {1--14}, doi = {10.1145/3411764.3445361}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)} } -
MultiMediate: Multi-modal Group Behaviour Analysis for Artificial Mediation
Philipp Müller, Dominik Schiller, Dominike Thomas, Guanhua Zhang, Michael Dietz, Patrick Gebhard, Elisabeth André, Andreas Bulling
Proc. ACM Multimedia (MM), pp. 4878–4882, 2021.
Artificial mediators are a promising means of supporting human group conversations, but at present their abilities are limited by insufficient progress in group behaviour analysis. The MultiMediate challenge addresses, for the first time, two fundamental group behaviour analysis tasks in well-defined conditions: eye contact detection and next speaker prediction. For training and evaluation, MultiMediate makes use of the MPIIGroupInteraction dataset consisting of 22 three- to four-person discussions as well as of an unpublished test set of six additional discussions. This paper describes the MultiMediate challenge and presents the challenge dataset including novel fine-grained speaking annotations that were collected for the purpose of MultiMediate. Furthermore, we present baseline approaches and ablation studies for both challenge tasks.Paper: mueller21_mm.pdf@inproceedings{mueller21_mm, title = {MultiMediate: Multi-modal Group Behaviour Analysis for Artificial Mediation}, author = {M{\"{u}}ller, Philipp and Schiller, Dominik and Thomas, Dominike and Zhang, Guanhua and Dietz, Michael and Gebhard, Patrick and André, Elisabeth and Bulling, Andreas}, year = {2021}, pages = {4878--4882}, doi = {10.1145/3474085.3479219}, booktitle = {Proc. ACM Multimedia (MM)} } -
Neural Photofit: Gaze-based Mental Image Reconstruction
Florian Strohm, Ekta Sood, Sven Mayer, Philipp Müller, Mihai Bâce, Andreas Bulling
Proc. IEEE International Conference on Computer Vision (ICCV), pp. 245-254, 2021.
We propose a novel method that leverages human fixations to visually decode the image a person has in mind into a photofit (facial composite). Our method combines three neural networks: An encoder, a scoring network, and a decoder. The encoder extracts image features and predicts a neural activation map for each face looked at by a human observer. A neural scoring network compares the human and neural attention and predicts a relevance score for each extracted image feature. Finally, image features are aggregated into a single feature vector as a linear combination of all features weighted by relevance which a decoder decodes into the final photofit. We train the neural scoring network on a novel dataset containing gaze data of 19 participants looking at collages of synthetic faces. We show that our method significantly outperforms a mean baseline predictor and report on a human study that shows that we can decode photofits that are visually plausible and close to the observer’s mental image. Code and dataset available upon request.@inproceedings{strohm21_iccv, title = {Neural Photofit: Gaze-based Mental Image Reconstruction}, author = {Strohm, Florian and Sood, Ekta and Mayer, Sven and Müller, Philipp and Bâce, Mihai and Bulling, Andreas}, year = {2021}, booktitle = {Proc. IEEE International Conference on Computer Vision (ICCV)}, doi = {10.1109/ICCV48922.2021.00031}, pages = {245-254} }
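The feature-aggregation step described in the abstract, a linear combination of per-face image features weighted by predicted relevance, can be sketched as follows; the softmax normalisation and the toy inputs are assumptions, and the encoder, scoring network, and decoder are omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def aggregate_photofit_features(face_features, relevance_scores):
    """Aggregate per-face image features into a single vector as a linear
    combination weighted by relevance, as described in the abstract.
    The scoring network itself is omitted; scores are given here."""
    weights = softmax(np.asarray(relevance_scores))   # normalisation is an assumption
    return weights @ np.asarray(face_features)        # (n_faces, d) -> (d,)

# Toy example: 5 viewed faces with 128-d features and hypothetical relevance scores.
rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 128))
scores = [0.2, 1.5, -0.3, 0.9, 0.1]
print(aggregate_photofit_features(feats, scores).shape)   # (128,), input to the decoder
```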
Technical Reports
-
Scanpath Prediction on Information Visualisations
Yao Wang, Mihai Bâce, Andreas Bulling
arXiv:2112.02340, pp. 1–14, 2021.
We propose Unified Model of Saliency and Scanpaths (UMSS) – a model that learns to predict visual saliency and scanpaths (i.e. sequences of eye fixations) on information visualisations. Although scanpaths provide rich information about the importance of different visualisation elements during the visual exploration process, prior work has been limited to predicting aggregated attention statistics, such as visual saliency. We present in-depth analyses of gaze behaviour for different information visualisation elements (e.g. Title, Label, Data) on the popular MASSVIS dataset. We show that while, overall, gaze patterns are surprisingly consistent across visualisations and viewers, there are also structural differences in gaze dynamics for different elements. Informed by our analyses, UMSS first predicts multi-duration element-level saliency maps, then probabilistically samples scanpaths from them. Extensive experiments on MASSVIS show that our method consistently outperforms state-of-the-art methods with respect to several, widely used scanpath and saliency evaluation metrics. Our method achieves a relative improvement in sequence score of 11.5 % for scanpath prediction, and a relative improvement in Pearson correlation coefficient of up to 23.6 % for saliency prediction. These results are auspicious and point towards richer user models and simulations of visual attention on visualisations without the need for any eye tracking equipment.@techreport{wang21_arxiv, title = {Scanpath Prediction on Information Visualisations}, author = {Wang, Yao and B{\^a}ce, Mihai and Bulling, Andreas}, year = {2021}, pages = {1--14}, url = {https://arxiv.org/abs/2112.02340} } -
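The second stage, probabilistically sampling scanpaths from a predicted saliency map, can be illustrated with the minimal sketch below; the single-map sampling and the simple inhibition-of-return term are simplifying assumptions and not the UMSS sampling procedure itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scanpath(saliency, n_fixations=6, ior_radius=2, ior_factor=0.2):
    """Sample a fixation sequence from a saliency map by treating it as a
    probability distribution; after each fixation the local neighbourhood is
    down-weighted (a simple inhibition-of-return assumption, not UMSS itself)."""
    sal = saliency.astype(float).copy()
    h, w = sal.shape
    ys, xs = np.mgrid[0:h, 0:w]
    path = []
    for _ in range(n_fixations):
        p = sal.ravel() / sal.sum()
        idx = rng.choice(sal.size, p=p)
        y, x = divmod(idx, w)
        path.append((x, y))
        sal[(ys - y) ** 2 + (xs - x) ** 2 <= ior_radius ** 2] *= ior_factor
    return path

# Toy saliency map with two "element" hotspots (e.g. a title band and a data region).
sal = np.full((20, 30), 0.01)
sal[2:4, 5:25] = 1.0       # title-like band
sal[10:16, 10:20] = 0.6    # data-like region
print(sample_scanpath(sal))
```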
VisRecall: Quantifying Information Visualisation Recallability via Question Answering
Yao Wang, Chuhan Jiao, Mihai Bâce, Andreas Bulling
arXiv:2112.15217, pp. 1–10, 2021.
Despite its importance for assessing the effectiveness of communicating information visually, fine-grained recallability of information visualisations has not been studied quantitatively so far. In this work we propose a visual question answering (VQA) paradigm to study visualisation recallability and present VisRecall — a novel dataset consisting of 200 visualisations that are annotated with crowd-sourced human (N = 305) recallability scores obtained from 1,000 questions from five question types. Furthermore, we present the first computational method to predict recallability of different visualisation elements, such as the title or specific data values. We report detailed analyses of our method on VisRecall and demonstrate that it outperforms several baselines in overall recallability and FE-, F-, RV-, and U-question recallability. We further demonstrate one possible application of our method: recommending the visualisation type that maximises user recallability for a given data source. Taken together, our work makes fundamental contributions towards a new generation of methods to assist designers in optimising visualisations.@techreport{wang21_arxiv_2, title = {VisRecall: Quantifying Information Visualisation Recallability via Question Answering}, author = {Wang, Yao and Jiao, Chuhan and B{\^a}ce, Mihai and Bulling, Andreas}, year = {2021}, pages = {1--10}, url = {https://arxiv.org/abs/2112.15217} } -
Multimodal Integration of Human-Like Attention in Visual Question Answering
Ekta Sood, Fabian Kögel, Philipp Müller, Dominike Thomas, Mihai Bâce, Andreas Bulling
arxiv:2109.13139, pp. 1–11, 2021.
Human-like attention as a supervisory signal to guide neural attention has shown significant promise but is currently limited to uni-modal integration – even for inherently multi-modal tasks such as visual question answering (VQA). We present the Multimodal Human-like Attention Network (MULAN) – the first method for multimodal integration of human-like attention on image and text during training of VQA models. MULAN integrates attention predictions from two state-of-the-art text and image saliency models into neural self-attention layers of a recent transformer-based VQA model. Through evaluations on the challenging VQAv2 dataset, we show that MULAN achieves a new state-of-the-art performance of 73.98% accuracy on test-std and 73.72% on test-dev and, at the same time, has approximately 80% fewer trainable parameters than prior work. Overall, our work underlines the potential of integrating multimodal human-like and neural attention for VQA.Paper: sood21_arxiv.pdfPaper Access: https://arxiv.org/pdf/2109.13139.pdf@techreport{sood21_arxiv, author = {Sood, Ekta and Kögel, Fabian and Müller, Philipp and Thomas, Dominike and Bâce, Mihai and Bulling, Andreas}, title = {Multimodal Integration of Human-Like Attention in Visual Question Answering}, year = {2021}, url = {https://arxiv.org/pdf/2109.13139.pdf}, pages = {1--11} } -
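One generic way to integrate a human attention prior into a self-attention layer is to add it as a bias on the attention logits, as sketched below; this illustrates the general idea only and is not necessarily MULAN's integration strategy.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_human_prior(Q, K, V, human_prior, lam=1.0, eps=1e-8):
    """Scaled dot-product attention with an additive log-prior bias derived from
    a human attention distribution over the keys (one generic integration
    strategy; not necessarily the one used by MULAN)."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)                      # (n_queries, n_keys)
    logits = logits + lam * np.log(human_prior + eps)  # prior broadcast over queries
    weights = softmax(logits, axis=-1)
    return weights @ V, weights

# Toy example: 4 queries, 6 keys/values, and a human prior peaked on key 2.
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((4, 32)), rng.standard_normal((6, 32)), rng.standard_normal((6, 32))
prior = np.array([0.05, 0.05, 0.6, 0.1, 0.1, 0.1])
out, w = attention_with_human_prior(Q, K, V, prior, lam=2.0)
print(w.round(2))
```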
Neural Photofit: Gaze-based Mental Image Reconstruction
Florian Strohm, Ekta Sood, Sven Mayer, Philipp Müller, Mihai Bâce, Andreas Bulling
arXiv:2108.07524, pp. 1–10, 2021.
We propose a novel method that leverages human fixations to visually decode the image a person has in mind into a photofit (facial composite). Our method combines three neural networks: An encoder, a scoring network, and a decoder. The encoder extracts image features and predicts a neural activation map for each face looked at by a human observer. A neural scoring network compares the human and neural attention and predicts a relevance score for each extracted image feature. Finally, image features are aggregated into a single feature vector as a linear combination of all features weighted by relevance which a decoder decodes into the final photofit. We train the neural scoring network on a novel dataset containing gaze data of 19 participants looking at collages of synthetic faces. We show that our method significantly outperforms a mean baseline predictor and report on a human study that shows that we can decode photofits that are visually plausible and close to the observer’s mental image. Code and dataset available upon request.Paper: strohm21_arxiv.pdfCode: Available upon request.Paper Access: https://arxiv.org/abs/2108.07524Dataset: Available upon request.@techreport{strohm21_arxiv, title = {Neural Photofit: Gaze-based Mental Image Reconstruction}, author = {Strohm, Florian and Sood, Ekta and Mayer, Sven and Müller, Philipp and Bâce, Mihai and Bulling, Andreas}, year = {2021}, pages = {1--10}, url = {https://arxiv.org/abs/2108.07524} }
2020
Journal Articles
-
Deep Gaze Pooling: Inferring and Visually Decoding Search Intents From Human Gaze Fixations
Hosnieh Sattar, Mario Fritz, Andreas Bulling
Neurocomputing, 387, pp. 369–382, 2020.
Predicting the target of visual search from human eye fixations (gaze) is a difficult problem with many applications, e.g. in human-computer interaction. While previous work has focused on predicting specific search target instances, we propose the first approach to predict categories and attributes of search intents from gaze data and to visually reconstruct plausible targets. However, state-of-the-art models for categorical recognition, in general, require large amounts of training data, which is prohibitive for gaze data. To address this challenge, we further propose a novel Gaze Pooling Layer that combines gaze information with visual representations from Deep Learning approaches. Our scheme incorporates both spatial and temporal aspects of human gaze behavior as well as the appearance of the fixated locations. We propose an experimental setup and novel dataset and demonstrate the effectiveness of our method for gaze-based search target prediction and reconstruction. We highlight several practical advantages of our approach, such as compatibility with existing architectures, no need for gaze training data, and robustness to noise from common gaze sources.Paper: sattar20_neurocomp.pdf@article{sattar20_neurocomp, title = {Deep Gaze Pooling: Inferring and Visually Decoding Search Intents From Human Gaze Fixations}, author = {Sattar, Hosnieh and Fritz, Mario and Bulling, Andreas}, journal = {Neurocomputing}, year = {2020}, pages = {369–382}, volume = {387}, doi = {10.1016/j.neucom.2020.01.028} } -
How far are we from quantifying visual attention in mobile HCI?
Mihai Bâce, Sander Staal, Andreas Bulling
IEEE Pervasive Computing, 19 (2), pp. 46-55, 2020.
With an ever-increasing number of mobile devices competing for attention, quantifying when, how often, or for how long users look at their devices has emerged as a key challenge in mobile human-computer interaction. Encouraged by recent advances in automatic eye contact detection using machine learning and device-integrated cameras, we provide a fundamental investigation into the feasibility of quantifying overt visual attention during everyday mobile interactions. We discuss the main challenges and sources of error associated with sensing visual attention on mobile devices in the wild, including the impact of face and eye visibility, the importance of robust head pose estimation, and the need for accurate gaze estimation. Our analysis informs future research on this emerging topic and underlines the potential of eye contact detection for exciting new applications towards next-generation pervasive attentive user interfaces.Paper: bace20_pcm.pdf@article{bace20_pcm, title = {How far are we from quantifying visual attention in mobile HCI?}, author = {B{\^a}ce, Mihai and Staal, Sander and Bulling, Andreas}, journal = {IEEE Pervasive Computing}, year = {2020}, volume = {19}, number = {2}, doi = {10.1109/MPRV.2020.2967736}, pages = {46-55} }
Conference Papers
-
Improving Natural Language Processing Tasks with Human Gaze-Guided Neural Attention
Ekta Sood, Simon Tannert, Philipp Müller, Andreas Bulling
Advances in Neural Information Processing Systems (NeurIPS), pp. 1–15, 2020.
A lack of corpora has so far limited advances in integrating human gaze data as a supervisory signal in neural attention mechanisms for natural language processing (NLP). We propose a novel hybrid text saliency model (TSM) that, for the first time, combines a cognitive model of reading with explicit human gaze supervision in a single machine learning framework. We show on four different corpora that our hybrid TSM duration predictions are highly correlated with human gaze ground truth. We further propose a novel joint modelling approach to integrate the predictions of the TSM into the attention layer of a network designed for a specific upstream task without the need for task-specific human gaze data. We demonstrate that our joint model outperforms the state of the art in paraphrase generation on the Quora Question Pairs corpus by more than 10% in BLEU-4 and achieves state-of-the-art performance for sentence compression on the challenging Google Sentence Compression corpus. As such, our work introduces a practical approach for bridging between data-driven and cognitive models and demonstrates a new way to integrate human gaze-guided neural attention into NLP tasks.@inproceedings{sood20_neurips, author = {Sood, Ekta and Tannert, Simon and Müller, Philipp and Bulling, Andreas}, title = {Improving Natural Language Processing Tasks with Human Gaze-Guided Neural Attention}, year = {2020}, pages = {1--15}, booktitle = {Advances in Neural Information Processing Systems (NeurIPS)}, url = {https://proceedings.neurips.cc/paper/2020/hash/460191c72f67e90150a093b4585e7eb4-Abstract.html} } -
Interpreting Attention Models with Human Visual Attention in Machine Reading Comprehension
Ekta Sood, Simon Tannert, Diego Frassinelli, Andreas Bulling, Ngoc Thang Vu
Proc. ACL SIGNLL Conference on Computational Natural Language Learning (CoNLL), pp. 12-25, 2020.
While neural networks with attention mechanisms have achieved superior performance on many natural language processing tasks, it remains unclear to which extent learned attention resembles human visual attention. In this paper, we propose a new method that leverages eye-tracking data to investigate the relationship between human visual attention and neural attention in machine reading comprehension. To this end, we introduce a novel 23-participant eye tracking dataset - MQA-RC, in which participants read movie plots and answered pre-defined questions. We compare state-of-the-art networks based on long short-term memory (LSTM), convolutional neural models (CNN) and XLNet Transformer architectures. We find that, for the LSTM and CNN models, higher similarity to human attention correlates significantly with performance. However, we show this relationship does not hold true for the XLNet models – despite the fact that the XLNet performs best on this challenging task. Our results suggest that different architectures seem to learn rather different neural attention strategies and similarity of neural to human attention does not guarantee best performance.@inproceedings{sood20_conll, title = {Interpreting Attention Models with Human Visual Attention in Machine Reading Comprehension}, author = {Sood, Ekta and Tannert, Simon and Frassinelli, Diego and Bulling, Andreas and Vu, Ngoc Thang}, booktitle = {Proc. ACL SIGNLL Conference on Computational Natural Language Learning (CoNLL)}, year = {2020}, pages = {12-25}, doi = {10.18653/v1/P17}, publisher = {Association for Computational Linguistics} } -
Combining Gaze Estimation and Optical Flow for Pursuits Interaction
Mihai Bâce, Vincent Becker, Chenyang Wang, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1-10, 2020.
Best Paper Award
Pursuit eye movements have become widely popular because they enable spontaneous eye-based interaction. However, existing methods to detect smooth pursuits require special-purpose eye trackers. We propose the first method to detect pursuits using a single off-the-shelf RGB camera in unconstrained remote settings. The key novelty of our method is that it combines appearance-based gaze estimation with optical flow in the eye region to jointly analyse eye movement dynamics in a single pipeline. We evaluate the performance and robustness of our method for different numbers of targets and trajectories in a 13-participant user study. We show that our method not only outperforms the current state of the art but also achieves competitive performance to a consumer eye tracker for a small number of targets. As such, our work points towards a new family of methods for pursuit interaction directly applicable to an ever-increasing number of devices readily equipped with cameras.Paper: bace20_etra.pdf@inproceedings{bace20_etra, title = {Combining Gaze Estimation and Optical Flow for Pursuits Interaction}, author = {B{\^a}ce, Mihai and Becker, Vincent and Wang, Chenyang and Bulling, Andreas}, year = {2020}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, doi = {10.1145/3379155.3391315}, pages = {1-10} } -
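For context, this is a minimal sketch of the correlation-based target matching commonly used in pursuits interaction, into which an eye-movement signal derived from gaze estimation and optical flow could be fed; the threshold, windowing, and names are assumptions, not the authors' pipeline.

import numpy as np

def select_pursuit_target(eye_xy, target_trajectories, threshold=0.8):
    """eye_xy: (T, 2) eye-movement signal; target_trajectories: list of (T, 2) target positions."""
    best_idx, best_corr = None, -1.0
    for idx, traj in enumerate(target_trajectories):
        corr_x = np.corrcoef(eye_xy[:, 0], traj[:, 0])[0, 1]
        corr_y = np.corrcoef(eye_xy[:, 1], traj[:, 1])[0, 1]
        corr = min(corr_x, corr_y)        # require the eye to follow the target on both axes
        if corr > best_corr:
            best_idx, best_corr = idx, corr
    return best_idx if best_corr >= threshold else None   # None: no pursuit detected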
Adversarial Attacks on Classifiers for Eye-based User Modelling
Inken Hagestedt, Michael Backes, Andreas Bulling
Adj. Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1-3, 2020.
An ever-growing body of work has demonstrated the rich information content available in eye movements for user modelling, e.g. for predicting users’ activities, cognitive processes, or even personality traits. We show that state-of-the-art classifiers for eye-based user modelling are highly vulnerable to adversarial examples: small artificial perturbations in gaze input that can dramatically change a classifier’s predictions. On the sample task of eye-based document type recognition we study the success of adversarial attacks with and without targeting the attack to a specific class.Paper: hagestedt20_etra.pdf@inproceedings{hagestedt20_etra, title = {Adversarial Attacks on Classifiers for Eye-based User Modelling}, author = {Hagestedt, Inken and Backes, Michael and Bulling, Andreas}, year = {2020}, pages = {1-3}, doi = {10.1145/3379157.3390511}, booktitle = {Adj. Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)} } -
A Survey of Digital Eye Strain in Gaze-Based Interactive Systems
Teresa Hirzle, Maurice Cordts, Enrico Rukzio, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1-12, 2020.
Display-based interfaces pose high demands on users’ eyes that can cause severe vision and eye problems, also known as digital eye strain (DES). Although these problems can become even more severe if the eyes are actively used for interaction, prior work on gaze-based interfaces has largely neglected these risks. We offer the first comprehensive account of DES in gaze-based interactive systems that is specifically geared to gaze interaction designers. Through an extensive survey of more than 400 papers published over the last 46 years, we first discuss the current role of DES in interactive systems. One key finding is that DES is only rarely considered when evaluating novel gaze interfaces and neglected in discussions of usability. We identify the main causes of and solutions to DES and derive recommendations for interaction designers on how to guide future research on evaluating and alleviating DES.Paper: hirzle20_etra.pdf@inproceedings{hirzle20_etra, title = {A Survey of Digital Eye Strain in Gaze-Based Interactive Systems}, author = {Hirzle, Teresa and Cordts, Maurice and Rukzio, Enrico and Bulling, Andreas}, year = {2020}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, doi = {10.1145/3379155.3391313}, pages = {1-12} } -
Visual Analytics and Annotation of Pervasive Eye Tracking Video
Kuno Kurzhals, Nils Rodrigues, Maurice Koch, Michael Stoll, Andrés Bruhn, Andreas Bulling, Daniel Weiskopf
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1-9, 2020.
We propose a new technique for visual analytics and annotation of long-term pervasive eye tracking data for which a combined analysis of gaze and egocentric video is necessary. Our approach enables two important tasks for such data for hour-long videos from individual participants: (1) efficient annotation and (2) direct interpretation of the results. Exemplary time spans can be selected by the user and are then used as a query that initiates a fuzzy search of similar time spans based on gaze and video features. In an iterative refinement loop, the query interface then provides suggestions for the importance of individual features to improve the search results. A multi-layered timeline visualization shows an overview of annotated time spans. We demonstrate the efficiency of our approach for analyzing activities in about seven hours of video in a case study and discuss feedback on our approach from novices and experts performing the annotation task.Paper: kurzhals20_etra.pdf@inproceedings{kurzhals20_etra, title = {Visual Analytics and Annotation of Pervasive Eye Tracking Video}, author = {Kurzhals, Kuno and Rodrigues, Nils and Koch, Maurice and Stoll, Michael and Bruhn, Andrés and Bulling, Andreas and Weiskopf, Daniel}, year = {2020}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, doi = {10.1145/3379155.3391326}, pages = {1-9} } -
Anticipating Averted Gaze in Dyadic Interactions
Philipp Müller, Ekta Sood, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1-10, 2020.
We present the first method to anticipate averted gaze in natural dyadic interactions. The task of anticipating averted gaze, i.e. that a person will not make eye contact in the near future, remains unsolved despite its importance for human social encounters as well as a number of applications, including human-robot interaction or conversational agents. Our multimodal method is based on a long short-term memory (LSTM) network that analyses non-verbal facial cues and speaking behaviour. We empirically evaluate our method for different future time horizons on a novel dataset of 121 YouTube videos of dyadic video conferences (74 hours in total). We investigate person-specific and person-independent performance and demonstrate that our method clearly outperforms baselines in both settings. As such, our work sheds light on the tight interplay between eye contact and other non-verbal signals and underlines the potential of computational modelling and anticipation of averted gaze for interactive applications.Paper: mueller20_etra.pdf@inproceedings{mueller20_etra, title = {Anticipating Averted Gaze in Dyadic Interactions}, author = {Müller, Philipp and Sood, Ekta and Bulling, Andreas}, year = {2020}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, doi = {10.1145/3379155.3391332}, pages = {1-10} } -
Quantification of Users’ Visual Attention During Everyday Mobile Device Interactions
Mihai Bâce, Sander Staal, Andreas Bulling
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–14, 2020.
We present the first real-world dataset and quantitative evaluation of visual attention of mobile device users in-situ, i.e. while using their devices during everyday routine. Understanding user attention is a core research challenge in mobile HCI but previous approaches relied on usage logs or self-reports that are only proxies and consequently reflect attention neither completely nor accurately. Our evaluations are based on Everyday Mobile Visual Attention (EMVA) – a new 32-participant dataset containing around 472 hours of video snippets recorded over more than two weeks in real life using the front-facing camera as well as associated usage logs, interaction events, and sensor data. Using an eye contact detection method, we are the first to quantify the highly dynamic nature of everyday visual attention across users, mobile applications, and usage contexts. We discuss key insights from our analyses that highlight the potential and inform the design of future mobile attentive user interfaces.@inproceedings{bace20_chi, title = {Quantification of Users' Visual Attention During Everyday Mobile Device Interactions}, author = {B{\^a}ce, Mihai and Staal, Sander and Bulling, Andreas}, year = {2020}, pages = {1--14}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, doi = {10.1145/3313831.3376449}, news = {https://ethz.ch/en/news-and-events/eth-news/news/2020/09/our-actual-attention-is-now-measurable.html}, video = {https://www.youtube.com/watch?v=SzLn3LujIqw} } -
Learning-based Region Selection for End-to-End Gaze Estimation
Xucong Zhang, Yusuke Sugano, Andreas Bulling, Otmar Hilliges
Proc. British Machine Vision Conference (BMVC), pp. 1-13, 2020.
Traditionally, appearance-based gaze estimation methods use statically defined face regions as input to the gaze estimator, such as eye patches, and therefore suffer from difficult lighting conditions and extreme head poses for which these regions are often not the most informative with respect to the gaze estimation task. We posit that facial regions should be selected dynamically based on the image content and propose a novel gaze estimation method that combines the task of region proposal and gaze estimation into a single end-to-end trainable framework. We introduce a novel loss that allows for unsupervised training of a region proposal network alongside the (supervised) training of the final gaze estimator. We show that our method can learn meaningful region selection strategies and outperforms fixed region approaches. We further show that our method performs particularly well for challenging cases, i.e., those with difficult lighting conditions such as directional lights, extreme head angles, or self-occlusion. Finally, we show that the proposed method achieves better results than the current state-of-the-art method in within and cross-dataset evaluations.Paper: zhang20_bmvc.pdfSupplementary Material: zhang20_bmvc_sup.pdf@inproceedings{zhang20_bmvc, author = {Zhang, Xucong and Sugano, Yusuke and Bulling, Andreas and Hilliges, Otmar}, title = {Learning-based Region Selection for End-to-End Gaze Estimation}, booktitle = {Proc. British Machine Vision Conference (BMVC)}, year = {2020}, pages = {1-13} }
Technical Reports
-
Improving Natural Language Processing Tasks with Human Gaze-Guided Neural Attention
Ekta Sood, Simon Tannert, Philipp Müller, Andreas Bulling
arXiv:2010.07891, pp. 1–18, 2020.
A lack of corpora has so far limited advances in integrating human gaze data as a supervisory signal in neural attention mechanisms for natural language processing (NLP). We propose a novel hybrid text saliency model (TSM) that, for the first time, combines a cognitive model of reading with explicit human gaze supervision in a single machine learning framework. We show on four different corpora that our hybrid TSM duration predictions are highly correlated with human gaze ground truth. We further propose a novel joint modelling approach to integrate the predictions of the TSM into the attention layer of a network designed for a specific upstream task without the need for task-specific human gaze data. We demonstrate that our joint model outperforms the state of the art in paraphrase generation on the Quora Question Pairs corpus by more than 10% in BLEU-4 and achieves state-of-the-art performance for sentence compression on the challenging Google Sentence Compression corpus. As such, our work introduces a practical approach for bridging between data-driven and cognitive models and demonstrates a new way to integrate human gaze-guided neural attention into NLP tasks.@techreport{sood20_arxiv, author = {Sood, Ekta and Tannert, Simon and Müller, Philipp and Bulling, Andreas}, title = {Improving Natural Language Processing Tasks with Human Gaze-Guided Neural Attention}, year = {2020}, url = {https://arxiv.org/abs/2010.07891}, pages = {1--18} } -
Interpreting Attention Models with Human Visual Attention in Machine Reading Comprehension
Ekta Sood, Simon Tannert, Diego Frassinelli, Andreas Bulling, Ngoc Thang Vu
arXiv:2010.06396, pp. 1–14, 2020.
While neural networks with attention mechanisms have achieved superior performance on many natural language processing tasks, it remains unclear to which extent learned attention resembles human visual attention. In this paper, we propose a new method that leverages eye-tracking data to investigate the relationship between human visual attention and neural attention in machine reading comprehension. To this end, we introduce a novel 23-participant eye tracking dataset - MQA-RC, in which participants read movie plots and answered pre-defined questions. We compare state-of-the-art networks based on long short-term memory (LSTM), convolutional neural models (CNN) and XLNet Transformer architectures. We find that, for the LSTM and CNN models, higher similarity to human attention correlates significantly with performance. However, we show this relationship does not hold true for the XLNet models – despite the fact that the XLNet performs best on this challenging task. Our results suggest that different architectures seem to learn rather different neural attention strategies and similarity of neural to human attention does not guarantee best performance.@techreport{sood20_arxiv_2, title = {Interpreting Attention Models with Human Visual Attention in Machine Reading Comprehension}, author = {Sood, Ekta and Tannert, Simon and Frassinelli, Diego and Bulling, Andreas and Vu, Ngoc Thang}, year = {2020}, url = {https://arxiv.org/abs/2010.06396}, pages = {1--14} } -
Adversarial Attacks on Classifiers for Eye-based User Modelling
Inken Hagestedt, Michael Backes, Andreas Bulling
arXiv:2006.00860, pp. 1–9, 2020.
An ever-growing body of work has demonstrated the rich information content available in eye movements for user modelling, e.g. for predicting users’ activities, cognitive processes, or even personality traits. We show that state-of-the-art classifiers for eye-based user modelling are highly vulnerable to adversarial examples: small artificial perturbations in gaze input that can dramatically change a classifier’s predictions. We generate these adversarial examples using the Fast Gradient Sign Method (FGSM) that linearises the gradient to find suitable perturbations. On the sample task of eye-based document type recognition we study the success of different adversarial attack scenarios: with and without knowledge about classifier gradients (white-box vs. black-box) as well as with and without targeting the attack to a specific class. In addition, we demonstrate the feasibility of defending against adversarial attacks by adding adversarial examples to a classifier’s training data.Paper: hagestedt20_arxiv.pdfPaper Access: https://arxiv.org/abs/2006.00860@techreport{hagestedt20_arxiv, title = {Adversarial Attacks on Classifiers for Eye-based User Modelling}, author = {Hagestedt, Inken and Backes, Michael and Bulling, Andreas}, year = {2020}, pages = {1--9}, url = {https://arxiv.org/abs/2006.00860} }
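Since the abstract names the Fast Gradient Sign Method, a generic FGSM sketch in PyTorch over gaze-feature vectors follows; the model, feature format, and epsilon are placeholder assumptions, not the authors' code.

import torch

def fgsm_perturb(model, features, labels, epsilon=0.05):
    """Return adversarially perturbed inputs: x' = x + epsilon * sign(grad_x loss)."""
    features = features.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(features), labels)
    loss.backward()
    return (features + epsilon * features.grad.sign()).detach()

# Hypothetical usage: x_adv = fgsm_perturb(trained_gaze_classifier, x_batch, y_batch)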
2019
Journal Articles
-
Classifying Attention Types with Thermal Imaging and Eye Tracking
Yomna Abdelrahman, Anam Ahmad Khan, Joshua Newn, Eduardo Velloso, Sherine Ashraf Safwat, James Bailey, Andreas Bulling, Frank Vetere, Albrecht Schmidt
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), 3 (3), pp. 1–27, 2019.
Despite the importance of attention in user performance, current methods for attention classification do not allow discriminating between different attention types. We propose a novel method that combines thermal imaging and eye tracking to unobtrusively classify four types of attention: sustained, alternating, selective, and divided. We collected a data set in which we stimulated these four attention types in a user study (N=22) using combinations of audio and visual stimuli while measuring users’ facial temperature and eye movement. Using a Logistic Regression on features extracted from both sensing technologies, we can classify the four attention types with AUC scores of up to 75.7% for the user-independent condition-independent, 87% for the user-independent condition-dependent, and 77.4% for the user-dependent prediction. Our findings not only demonstrate the potential of thermal imaging and eye tracking for unobtrusive classification of different attention types but also pave the way for novel applications for attentive user interfaces and attention-aware computing.doi: 10.1145/3351227Paper: abdelrahman19_imwut.pdf@article{abdelrahman19_imwut, author = {Abdelrahman, Yomna and Khan, Anam Ahmad and Newn, Joshua and Velloso, Eduardo and Safwat, Sherine Ashraf and Bailey, James and Bulling, Andreas and Vetere, Frank and Schmidt, Albrecht}, title = {Classifying Attention Types with Thermal Imaging and Eye Tracking}, journal = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)}, year = {2019}, volume = {3}, number = {3}, pages = {1--27}, doi = {10.1145/3351227} } -
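A hedged sketch of the kind of classification setup the abstract describes, a logistic regression over combined thermal and gaze features evaluated with AUC; the synthetic data, feature dimensionality, and split are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for per-trial features from thermal imaging and eye tracking.
rng = np.random.default_rng(0)
X = rng.random((200, 12))        # e.g. facial-temperature and fixation/saccade statistics
y = rng.integers(0, 2, 200)      # one-vs-rest label for a single attention type
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))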
InvisibleEye: Fully Embedded Mobile Eye Tracking Using Appearance-Based Gaze Estimation
Julian Steil, Marc Tonsen, Yusuke Sugano, Andreas Bulling
ACM SIGMOBILE Mobile Computing and Communications Review, 23 (2), pp. 30-34, 2019.
Despite their potential for a range of exciting new applications, mobile eye trackers suffer from several fundamental usability problems. InvisibleEye is an innovative approach for mobile eye tracking that uses millimetre-size RGB cameras that can be fully embedded into normal glasses frames, as well as appearance-based gaze estimation to directly estimate gaze from the eye images. Through evaluation on three large-scale, increasingly realistic datasets, we show that InvisibleEye can achieve a person-specific gaze estimation accuracy of up to 2.04° using three camera pairs with a resolution of only 3x3 pixels.Paper: steil19_sigmobile.pdf@article{steil19_sigmobile, author = {Steil, Julian and Tonsen, Marc and Sugano, Yusuke and Bulling, Andreas}, title = {InvisibleEye: Fully Embedded Mobile Eye Tracking Using Appearance-Based Gaze Estimation}, journal = {ACM SIGMOBILE Mobile Computing and Communications Review}, year = {2019}, volume = {23}, number = {2}, pages = {30-34} } -
MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation
Xucong Zhang, Yusuke Sugano, Mario Fritz, Andreas Bulling
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 41 (1), pp. 162-175, 2019.
Learning-based methods are believed to work well for unconstrained gaze estimation, i.e. gaze estimation from a monocular RGB camera without assumptions regarding user, environment, or camera. However, current gaze datasets were collected under laboratory conditions and methods were not evaluated across multiple datasets. Our work makes three contributions towards addressing these limitations. First, we present the MPIIGaze dataset, which contains 213,659 full face images and corresponding ground-truth gaze positions collected from 15 users during everyday laptop use over several months. An experience sampling approach ensured continuous gaze and head poses and realistic variation in eye appearance and illumination. To facilitate cross-dataset evaluations, 37,667 images were manually annotated with eye corners, mouth corners, and pupil centres. Second, we present an extensive evaluation of state-of-the-art gaze estimation methods on three current datasets, including MPIIGaze. We study key challenges including target gaze range, illumination conditions, and facial appearance variation. We show that image resolution and the use of both eyes affect gaze estimation performance, while head pose and pupil centre information are less informative. Finally, we propose GazeNet, the first deep appearance-based gaze estimation method. GazeNet improves on the state of the art by 22% (from a mean error of 13.9 degrees to 10.8 degrees) for the most challenging cross-dataset evaluation.Paper: zhang19_pami.pdf@article{zhang19_pami, title = {MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation}, author = {Zhang, Xucong and Sugano, Yusuke and Fritz, Mario and Bulling, Andreas}, year = {2019}, journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)}, doi = {10.1109/TPAMI.2017.2778103}, pages = {162-175}, volume = {41}, number = {1} }
Conference Papers
-
Towards High-Frequency SSVEP-Based Target Discrimination with an Extended Alphanumeric Keyboard
Sahar Abdelnabi, Michael Xuelin Huang, Andreas Bulling
Proc. IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 1-6, 2019.
Despite significant advances in using Steady-State Visually Evoked Potentials (SSVEP) for on-screen target discrimination, existing methods either require intrusive, low- frequency visual stimulation or only support a small number of targets. We propose SSVEPNet: a convolutional long short-term memory (LSTM) recurrent neural network for high-frequency stimulation (≥30Hz) using a large number of visual targets. We evaluate our method for discriminating between 43 targets on an extended alphanumeric virtual keyboard and compare three different frequency assignment strategies. Our experimental results show that SSVEPNet significantly outperforms state-of-the-art correlation-based methods and convolutional neural networks. As such, our work opens up an exciting new direction of research towards a new class of unobtrusive and highly expressive SSVEP-based interfaces for text entry and beyond.Paper: abdelnabi19_smc.pdf@inproceedings{abdelnabi19_smc, author = {Abdelnabi, Sahar and Huang, Michael Xuelin and Bulling, Andreas}, title = {Towards High-Frequency SSVEP-Based Target Discrimination with an Extended Alphanumeric Keyboard}, booktitle = {Proc. IEEE International Conference on Systems, Man, and Cybernetics (SMC)}, year = {2019}, pages = {1-6} } -
A fast approach to refraction-aware 3D eye-model fitting and gaze prediction
Kai Dierkes, Moritz Kassner, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–9, 2019.
By temporally integrating information about pupil contours extracted from eye images, model-based methods for glint-free gaze estimation can mitigate pupil detection noise. However, current approaches require time-consuming iterative solving of a nonlinear minimization problem to estimate key parameters, such as eyeball position. Based on the method presented by [Swirski and Dodgson 2013], we propose a novel approach to glint-free 3D eye-model fitting and gaze prediction using a single near-eye camera. By recasting model optimization as a least-squares intersection of lines, we make it amenable to a fast non-iterative solution. We further present a method for estimating deterministic refraction-correction functions from synthetic eye images and validate them on both synthetic and real eye images. We demonstrate the robustness of our method in the presence of pupil detection noise and show the benefit of temporal integration of pupil contour information on eyeball position and gaze estimation accuracy.Paper: dierkes19_etra.pdf@inproceedings{dierkes19_etra, title = {A fast approach to refraction-aware 3D eye-model fitting and gaze prediction}, author = {Dierkes, Kai and Kassner, Moritz and Bulling, Andreas}, year = {2019}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, doi = {10.1145/3314111.3319819}, pages = {1--9} } -
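The "least-squares intersection of lines" step admits a closed-form solution, which is what makes a non-iterative approach possible. The following is the generic textbook construction under the assumption of known line origins and unit directions, not the paper's full refraction-aware eye-model fitting pipeline.

import numpy as np

def least_squares_line_intersection(origins, directions):
    """Point minimising the summed squared distances to N 3D lines.
    origins, directions: (N, 3) arrays; directions are assumed unit-length."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for p, d in zip(origins, directions):
        M = np.eye(3) - np.outer(d, d)  # projects onto the plane orthogonal to the line
        A += M
        b += M @ p
    return np.linalg.solve(A, b)

# Hypothetical usage: estimate an eyeball-centre-like point from noisy candidate gaze lines.
origins = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
directions = np.array([[0.0, 0.0, 1.0], [-0.5, 0.0, 0.9], [0.0, -0.5, 0.9]])
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
print(least_squares_line_intersection(origins, directions))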
Can Privacy-Aware Lifelogs Alter Our Memories?
Passant Elagroudy, Florian Mathis, Andreas Bulling, Mohamed Khamis, Diana Irmscher, Albrecht Schmidt
Ext. Abstr. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–6, 2019.
The abundance of automatically-triggered lifelogging cameras is a privacy threat to bystanders. Countering this by deleting photos limits relevant memory cues and the informative content of lifelogs. An alternative is to obfuscate bystanders, but it is not clear how this impacts the lifelogger’s recall of memories. We report on a study in which we compare viewing 1) unaltered photos, 2) photos with blurred people, and 3) a subset of the photos after deleting private ones, on memory recall. Findings show that obfuscated content helps users recall a lot of content, but it also results in recalling less accurate details, which can sometimes mislead the user. Our work informs the design of privacy-aware lifelogging systems that maximize recall and steers discussion about ubiquitous technologies that could alter human memories.Paper: elagroudy19_chi.pdf@inproceedings{elagroudy19_chi, author = {Elagroudy, Passant and Mathis, Florian and Bulling, Andreas and Khamis, Mohamed and Irmscher, Diana and Schmidt, Albrecht}, title = {Can Privacy-Aware Lifelogs Alter Our Memories?}, booktitle = {Ext. Abstr. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2019}, doi = {10.1145/3290607.3313052}, pages = {1--6} } -
A Design Space for Gaze Interaction on Head-mounted Displays
Teresa Hirzle, Jan Gugenheimer, Florian Geiselhart, Andreas Bulling, Enrico Rukzio
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–12, 2019.
Augmented and virtual reality (AR/VR) has entered the mass market and, with it, eye tracking will soon become a core technology for next-generation head-mounted displays (HMDs). In contrast to existing gaze interfaces, the 3D nature of AR and VR requires estimating a user’s gaze in 3D. While first applications, such as foveated rendering, hint at the compelling potential of combining HMDs and gaze, a systematic analysis is missing. To fill this gap, we present the first design space for gaze interaction on HMDs. Our design space covers human depth perception and technical requirements in two dimensions aiming to identify challenges and opportunities for interaction design. As such, our design space provides a comprehensive overview and serves as an important guideline for researchers and practitioners working on gaze interaction on HMDs. We further demonstrate how our design space is used in practice by presenting two interactive applications: EyeHealth and XRay-Vision.Paper: hirzle19_chi.pdf@inproceedings{hirzle19_chi, author = {Hirzle, Teresa and Gugenheimer, Jan and Geiselhart, Florian and Bulling, Andreas and Rukzio, Enrico}, title = {A Design Space for Gaze Interaction on Head-mounted Displays}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2019}, doi = {10.1145/3290605.3300855}, pages = {1--12} } -
SacCalib: Reducing Calibration Distortion for Stationary Eye Trackers Using Saccadic Eye Movements
Michael Xuelin Huang, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–10, 2019.
Recent methods to automatically calibrate stationary eye trackers were shown to effectively reduce inherent calibration distortion. However, these methods require additional information, such as mouse clicks or on-screen content. We propose the first method that only requires users’ eye movements to reduce calibration distortion in the background while users naturally look at an interface. Our method exploits that calibration distortion makes straight saccade trajectories appear curved between the saccadic start and end points. We show that this curving effect is systematic and the result of a distorted gaze projection plane. To mitigate calibration distortion, our method undistorts this plane by straightening saccade trajectories using image warping. We show that this approach improves over the common six-point calibration and is promising for reducing distortion. As such, it provides a non-intrusive solution to alleviating accuracy decrease of eye tracker during long-term use.Paper: huang19_etra.pdf@inproceedings{huang19_etra, title = {SacCalib: Reducing Calibration Distortion for Stationary Eye Trackers Using Saccadic Eye Movements}, author = {Huang, Michael Xuelin and Bulling, Andreas}, year = {2019}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, pages = {1--10}, doi = {10.1145/3317956.3321553} } -
Moment-to-Moment Detection of Internal Thought during Video Viewing from Eye Vergence Behavior
Michael Xuelin Huang, Jiajia Li, Grace Ngai, Hong Va Leong, Andreas Bulling
Proc. ACM Multimedia (MM), pp. 1–9, 2019.
Internal thought refers to the process of directing attention away from a primary visual task to internal cognitive processing. Internal thought is a pervasive mental activity and closely related to primary task performance. As such, automatic detection of internal thought has significant potential for user modelling in intelligent interfaces, particularly for e-learning applications. Despite the close link between the eyes and the human mind, only a few studies have investigated vergence behaviour during internal thought and none has studied moment-to-moment detection of internal thought from gaze. While prior studies relied on long-term data analysis and required a large number of gaze characteristics, we describe a novel method that is computationally light-weight and that only requires eye vergence information that is readily available from binocular eye trackers. We further propose a novel paradigm to obtain ground truth internal thought annotations that exploits human blur perception. We evaluate our method for three increasingly challenging detection tasks: (1) during a controlled math-solving task, (2) during natural viewing of lecture videos, and (3) during daily activities, such as coding, browsing, and reading. Results from these evaluations demonstrate the performance and robustness of vergence-based detection of internal thought and, as such, open up new directions for research on interfaces that adapt to shifts of mental attention.Paper: huang19_mm.pdf@inproceedings{huang19_mm, title = {Moment-to-Moment Detection of Internal Thought during Video Viewing from Eye Vergence Behavior}, author = {Huang, Michael Xuelin and Li, Jiajia and Ngai, Grace and Leong, Hong Va and Bulling, Andreas}, booktitle = {Proc. ACM Multimedia (MM)}, year = {2019}, doi = {10.1145/3343031.3350573}, pages = {1--9} } -
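A minimal sketch of a vergence signal as it could be computed from a binocular eye tracker follows; the windowing and decision rule are illustrative assumptions, not the paper's detector.

import numpy as np

def vergence_angle(left_dir, right_dir):
    """Angle (degrees) between unit gaze direction vectors of the left and right eye."""
    cos_ang = np.clip(np.dot(left_dir, right_dir), -1.0, 1.0)
    return np.degrees(np.arccos(cos_ang))

def internal_thought_score(left_dirs, right_dirs, window=60):
    """Mean absolute vergence drift in the most recent window relative to a baseline;
    larger drifts may indicate attention directed away from the visual task."""
    angles = np.array([vergence_angle(l, r) for l, r in zip(left_dirs, right_dirs)])
    baseline = np.median(angles[:window])
    return np.abs(angles[-window:] - baseline).mean()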
Reducing Calibration Drift in Mobile Eye Trackers by Exploiting Mobile Phone Usage
Philipp Müller, Daniel Buschek, Michael Xuelin Huang, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–9, 2019.
Automatic saliency-based recalibration is promising for addressing calibration drift in mobile eye trackers but existing bottom-up saliency methods neglect users’ goal-directed visual attention in natural behaviour. By inspecting real-life recordings of egocentric eye tracker cameras, we reveal that users are likely to look at their phones once these appear in view. We propose two novel automatic recalibration methods that exploit mobile phone usage: The first builds saliency maps using the phone location in the egocentric view to identify likely gaze locations. The second uses the occurrence of touch events to recalibrate the eye tracker, thereby enabling privacy-preserving recalibration. Through in-depth evaluations on a recent mobile eye tracking dataset (N=17, 65 hours) we show that our approaches outperform a state-of-the-art saliency approach for the automatic recalibration task. As such, our approach improves mobile eye tracking and gaze-based interaction, particularly for long-term use.Paper: mueller19_etra.pdf@inproceedings{mueller19_etra, title = {Reducing Calibration Drift in Mobile Eye Trackers by Exploiting Mobile Phone Usage}, author = {M{\"{u}}ller, Philipp and Buschek, Daniel and Huang, Michael Xuelin and Bulling, Andreas}, year = {2019}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, doi = {10.1145/3314111.3319918}, pages = {1--9} } -
Emergent Leadership Detection Across Datasets
Philipp Müller, Andreas Bulling
Proc. ACM International Conference on Multimodal Interaction (ICMI), pp. 274-278, 2019.
Automatic detection of emergent leaders in small groups from nonverbal behaviour is a growing research topic in social signal processing but existing methods were evaluated on single datasets – an unrealistic assumption for real-world applications in which systems are required to also work in settings unseen at training time. It therefore remains unclear whether current methods for emergent leadership detection generalise to similar but new settings and to which extent. To overcome this limitation, we are the first to study a cross-dataset evaluation setting for the emergent leadership detection task. We provide evaluations for within- and cross-dataset prediction using two current datasets (PAVIS and MPIIGroupInteraction), as well as an investigation on the robustness of commonly used feature channels and online prediction in the cross-dataset setting. Our evaluations show that using pose and eye contact based features, cross-dataset prediction is possible with an accuracy of 0.68, as such providing another important piece of the puzzle towards real-world emergent leadership detection.Paper: mueller19_icmi.pdf@inproceedings{mueller19_icmi, title = {Emergent Leadership Detection Across Datasets}, author = {M{\"{u}}ller, Philipp and Bulling, Andreas}, year = {2019}, pages = {274-278}, booktitle = {Proc. ACM International Conference on Multimodal Interaction (ICMI)}, doi = {10.1145/3340555.3353721} } -
KnuckleTouch: Enabling Knuckle Gestures on Capacitive Touchscreens using Deep Learning
Robin Schweigert, Jan Leusmann, Simon Hagenmayer, Maximilian Weiß, Huy Viet Le, Sven Mayer, Andreas Bulling
Proc. Mensch und Computer, pp. 387-397, 2019.
While mobile devices have become essential for social communication and have paved the way for work on the go, their interactive capabilities are still limited to simple touch input. A promising enhancement for touch interaction is knuckle input, but recognizing knuckle gestures robustly and accurately remains challenging. We present a method to differentiate between 17 finger and knuckle gestures based on a long short-term memory (LSTM) machine learning model. Furthermore, we introduce an open source approach that is ready-to-deploy on commodity touch-based devices. The model was trained on a new dataset that we collected in a mobile interaction study with 18 participants. We show that our method can achieve an accuracy of 86.8% in recognizing one of the 17 gestures and an accuracy of 94.6% in differentiating between finger and knuckle. In our evaluation study, we validated our models and found that the LSTM gesture recognition achieved an accuracy of 88.6%. We show that KnuckleTouch can be used to improve input expressiveness and to provide shortcuts to frequently used functions.@inproceedings{schweigert19_muc, title = {KnuckleTouch: Enabling Knuckle Gestures on Capacitive Touchscreens using Deep Learning}, author = {Schweigert, Robin and Leusmann, Jan and Hagenmayer, Simon and Weiß, Maximilian and Le, Huy Viet and Mayer, Sven and Bulling, Andreas}, year = {2019}, booktitle = {Proc. Mensch und Computer}, doi = {10.1145/3340764.3340767}, pages = {387-397}, video = {https://www.youtube.com/watch?v=akL3Ejx3bv8} } -
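A hedged sketch of an LSTM sequence classifier over touch events in the spirit of the model described above; layer sizes, input features, and the absence of a training loop are simplifying assumptions, not the released implementation.

import torch
import torch.nn as nn

class TouchGestureLSTM(nn.Module):
    def __init__(self, n_features=4, hidden=64, n_classes=17):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):            # x: (batch, time, features), e.g. x, y, size, pressure
        _, (h, _) = self.lstm(x)     # final hidden state summarises the touch sequence
        return self.head(h[-1])      # logits over the 17 gestures

model = TouchGestureLSTM()
logits = model(torch.randn(8, 50, 4))   # 8 hypothetical sequences of 50 touch samples
predictions = logits.argmax(dim=1)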
Predicting Gaze Patterns: Text Saliency for Integration into Machine Learning Tasks
Ekta Sood
Proc. International Workshop on Computational Cognition (ComCo), pp. 1–2, 2019.
Best Poster Award
Paper: sood19_comco.pdf@inproceedings{sood19_comco, author = {Sood, Ekta}, title = {Predicting Gaze Patterns: Text Saliency for Integration into Machine Learning Tasks}, year = {2019}, pages = {1--2}, booktitle = {Proc. International Workshop on Computational Cognition (ComCo)}, url = {https://perceptualui.org/publications/sood19_comco_poster.pdf} } -
PrivacEye: Privacy-Preserving Head-Mounted Eye Tracking Using Egocentric Scene Image and Eye Movement Features
Julian Steil, Marion Koelle, Wilko Heuten, Susanne Boll, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–10, 2019.
Best Video Award
Eyewear devices, such as augmented reality displays, increasingly integrate eye tracking but the first-person camera required to map a user’s gaze to the visual scene can pose a significant threat to user and bystander privacy. We present PrivacEye, a method to detect privacy-sensitive everyday situations and automatically enable and disable the eye tracker’s first-person camera using a mechanical shutter. To close the shutter in privacy-sensitive situations, the method uses a deep representation of the first-person video combined with rich features that encode users’ eye movements. To open the shutter without visual input, PrivacEye detects changes in users’ eye movements alone to gauge changes in the "privacy level" of the current situation. We evaluate our method on a first-person video dataset recorded in daily life situations of 17 participants, annotated by themselves for privacy sensitivity, and show that our method is effective in preserving privacy in this challenging setting.Paper: steil19_etra.pdfSupplementary Material: steil19_etra_sup.pdf@inproceedings{steil19_etra, title = {PrivacEye: Privacy-Preserving Head-Mounted Eye Tracking Using Egocentric Scene Image and Eye Movement Features}, author = {Steil, Julian and Koelle, Marion and Heuten, Wilko and Boll, Susanne and Bulling, Andreas}, year = {2019}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, pages = {1--10}, doi = {10.1145/3314111.3319913}, video = {https://www.youtube.com/watch?v=Gy61255F8T8} } -
Privacy-Aware Eye Tracking Using Differential Privacy
Julian Steil, Inken Hagestedt, Michael Xuelin Huang, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–9, 2019.
Best Paper Award
With eye tracking being increasingly integrated into virtual and augmented reality (VR/AR) head-mounted displays, preserving users’ privacy is an ever more important, yet under-explored, topic in the eye tracking community. We report a large-scale online survey (N=124) on privacy aspects of eye tracking that provides the first comprehensive account of with whom, for which services, and to which extent users are willing to share their gaze data. Using these insights, we design a privacy-aware VR interface that uses differential privacy, which we evaluate on a new 20-participant dataset for two privacy sensitive tasks: We show that our method can prevent user re-identification and protect gender information while maintaining high performance for gaze-based document type classification. Our results highlight the privacy challenges particular to gaze data and demonstrate that differential privacy is a potential means to address them. Thus, this paper lays important foundations for future research on privacy-aware gaze interfaces.Paper: steil19_etra_2.pdf@inproceedings{steil19_etra_2, title = {Privacy-Aware Eye Tracking Using Differential Privacy}, author = {Steil, Julian and Hagestedt, Inken and Huang, Michael Xuelin and Bulling, Andreas}, year = {2019}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, doi = {10.1145/3314111.3319915}, pages = {1--9} } -
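To illustrate the differential-privacy building block, here is a minimal Laplace-mechanism sketch over an aggregated gaze feature; the feature, sensitivity, and epsilon are illustrative assumptions, not the paper's exact mechanism.

import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng=None):
    """Release value plus Laplace(sensitivity / epsilon) noise (epsilon-differential privacy)."""
    rng = np.random.default_rng() if rng is None else rng
    return value + rng.laplace(0.0, sensitivity / epsilon)

# Hypothetical feature: mean fixation duration (ms) in one user's recording window.
fixation_durations = np.array([210.0, 340.0, 180.0, 400.0])
private_mean = laplace_mechanism(fixation_durations.mean(), sensitivity=50.0, epsilon=1.0)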
Evaluation of Appearance-Based Methods and Implications for Gaze-Based Applications
Xucong Zhang, Yusuke Sugano, Andreas Bulling
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–13, 2019.
Appearance-based gaze estimation methods that only require an off-the-shelf camera have significantly improved but they are still not yet widely used in the human-computer interaction (HCI) community. This is partly because it remains unclear how they perform compared to model-based approaches as well as dominant, special-purpose eye tracking equipment. To address this limitation, we evaluate the performance of state-of-the-art appearance-based gaze estimation for interaction scenarios with and without personal calibration, indoors and outdoors, for different sensing distances, as well as for users with and without glasses. We discuss the obtained findings and their implications for the most important gaze-based applications, namely explicit eye input, attentive user interfaces, gaze-based user modelling, and passive eye monitoring. To democratise the use of appearance-based gaze estimation and interaction in HCI, we finally present OpenGaze (www.opengaze.org), the first software toolkit for appearance-based gaze estimation and interaction.@inproceedings{zhang19_chi, author = {Zhang, Xucong and Sugano, Yusuke and Bulling, Andreas}, title = {Evaluation of Appearance-Based Methods and Implications for Gaze-Based Applications}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2019}, doi = {10.1145/3290605.3300646}, pages = {1--13} }
Book Chapters
-
Pervasive Eye Tracking for Real-World Consumer Behavior Analysis
Andreas Bulling, Michel Wedel
Michael Schulte-Mecklenbeck, Anton Kühberger (Eds.): A Handbook of Process Tracing Methods for Decision Research: A Critical Review and User’s Guide, Taylor & Francis, pp. 27-44, 2019.
Eye tracking is the computational process of measuring the absolute point of gaze and/or the relative movement of the eyes over time using sensing systems placed in the environment (remote eye tracking) or worn on the head (mobile eye tracking). Eye tracking has a long history as a tool in psychology, human behaviour, and human-computer interaction research, and has also found its way into many commercial applications, such as marketing, web usability, virtual reality, or automotive engineering. Recent advances in mobile eye tracking as well as remote eye tracking using RGB cameras readily integrated into handheld devices and ambient displays pave the way for a whole new class of everyday eye tracking systems that allow researchers and practitioners to better understand and analyse gaze information in real-world settings. In this chapter, we first provide a history of eye tracking as both a measurement tool and research topic. Afterwards, we discuss the considerable potential but also remaining technical challenges for leveraging everyday eye tracking for real-world consumer behaviour analysis and decision making in retail.Paper: bulling19_tf.pdf@inbook{bulling19_tf, author = {Bulling, Andreas and Wedel, Michel}, title = {Pervasive Eye Tracking for Real-World Consumer Behavior Analysis}, booktitle = {A Handbook of Process Tracing Methods for Decision Research: A Critical Review and User's Guide}, year = {2019}, editor = {Schulte-Mecklenbeck, Michael and K{\"{u}}hberger, Anton}, publisher = {Taylor \& Francis}, pages = {27-44} }
Technical Reports
-
Accurate and Robust Eye Contact Detection During Everyday Mobile Device Interactions
Mihai Bâce, Sander Staal, Andreas Bulling
arXiv:1907.11115, pp. 1–12, 2019.
Quantification of human attention is key to several tasks in mobile human-computer interaction (HCI), such as predicting user interruptibility, estimating noticeability of user interface content, or measuring user engagement. Previous works to study mobile attentive behaviour required special-purpose eye tracking equipment or constrained users’ mobility. We propose a novel method to sense and analyse visual attention on mobile devices during everyday interactions. We demonstrate the capabilities of our method on the sample task of eye contact detection that has recently attracted increasing research interest in mobile HCI. Our method builds on a state-of-the-art method for unsupervised eye contact detection and extends it to address challenges specific to mobile interactive scenarios. Through evaluation on two current datasets, we demonstrate significant performance improvements for eye contact detection across mobile devices, users, or environmental conditions. Moreover, we discuss how our method enables the calculation of additional attention metrics that, for the first time, enable researchers from different domains to study and quantify attention allocation during mobile interactions in the wild.Paper: bace19_arxiv.pdfPaper Access: https://arxiv.org/abs/1907.11115@techreport{bace19_arxiv, title = {Accurate and Robust Eye Contact Detection During Everyday Mobile Device Interactions}, author = {B{\^a}ce, Mihai and Staal, Sander and Bulling, Andreas}, year = {2019}, pages = {1--12}, url = {https://arxiv.org/abs/1907.11115} } -
How far are we from quantifying visual attention in mobile HCI?
Mihai Bâce, Sander Staal, Andreas Bulling
arXiv:1907.11106, pp. 1–7, 2019.
With an ever-increasing number of mobile devices competing for our attention, quantifying when, how often, or for how long users visually attend to their devices has emerged as a core challenge in mobile human-computer interaction. Encouraged by recent advances in automatic eye contact detection using machine learning and device-integrated cameras, we provide a fundamental investigation into the feasibility of quantifying visual attention during everyday mobile interactions. We identify core challenges and sources of errors associated with sensing attention on mobile devices in the wild, including the impact of face and eye visibility, the importance of robust head pose estimation, and the need for accurate gaze estimation. Based on this analysis, we propose future research directions and discuss how eye contact detection represents the foundation for exciting new applications towards next-generation pervasive attentive user interfaces.Paper: bace19_arxiv_2.pdfPaper Access: https://arxiv.org/abs/1907.11106@techreport{bace19_arxiv_2, title = {How far are we from quantifying visual attention in mobile HCI?}, author = {B{\^a}ce, Mihai and Staal, Sander and Bulling, Andreas}, year = {2019}, pages = {1--7}, url = {https://arxiv.org/abs/1907.11106} } -
Moment-to-Moment Detection of Internal Thought from Eye Vergence Behaviour
Michael Xuelin Huang, Jiajia Li, Grace Ngai, Hong Va Leong, Andreas Bulling
arXiv:1901.06572, pp. 1–22, 2019.
Internal thought refers to the process of directing attention away from a primary visual task to internal cognitive processing. Internal thought is a pervasive mental activity and closely related to primary task performance. As such, automatic detection of internal thought has significant potential for user modelling in intelligent interfaces, particularly for e-learning applications. Despite the close link between the eyes and the human mind, only a few studies have investigated vergence behaviour during internal thought and none has studied moment-to-moment detection of internal thought from gaze. While prior studies relied on long-term data analysis and required a large number of gaze characteristics, we describe a novel method that is computationally light-weight and that only requires eye vergence information that is readily available from binocular eye trackers. We further propose a novel paradigm to obtain ground truth internal thought annotations that exploits human blur perception. We evaluate our method for three increasingly challenging detection tasks: (1) during a controlled math-solving task, (2) during natural viewing of lecture videos, and (3) during daily activities, such as coding, browsing, and reading. Results from these evaluations demonstrate the performance and robustness of vergence-based detection of internal thought and, as such, open up new directions for research on interfaces that adapt to shifts of mental attention.Paper: huang19_arxiv.pdfPaper Access: https://arxiv.org/abs/1901.06572@techreport{huang19_arxiv, title = {Moment-to-Moment Detection of Internal Thought from Eye Vergence Behaviour}, author = {Huang, Michael Xuelin and Li, Jiajia and Ngai, Grace and Leong, Hong Va and Bulling, Andreas}, year = {2019}, pages = {1--22}, url = {https://arxiv.org/abs/1901.06572} } -
SacCalib: Reducing Calibration Distortion for Stationary Eye Trackers Using Saccadic Eye Movements
Michael Xuelin Huang, Andreas Bulling
arXiv:1903.04047, pp. 1–10, 2019.
Recent methods to automatically calibrate stationary eye trackers were shown to effectively reduce inherent calibration distortion. However, these methods require additional information, such as mouse clicks or on-screen content. We propose the first method that only requires users’ eye movements to reduce calibration distortion in the background while users naturally look at an interface. Our method exploits that calibration distortion makes straight saccade trajectories appear curved between the saccadic start and end points. We show that this curving effect is systematic and the result of a distorted gaze projection plane. To mitigate calibration distortion, our method undistorts this plane by straightening saccade trajectories using image warping. We show that this approach improves over the common six-point calibration and is promising for reducing distortion. As such, it provides a non-intrusive solution to alleviating accuracy decrease of eye tracker during long-term use.Paper: huang19_arxiv_2.pdfPaper Access: https://arxiv.org/abs/1903.04047@techreport{huang19_arxiv_2, title = {SacCalib: Reducing Calibration Distortion for Stationary Eye Trackers Using Saccadic Eye Movements}, author = {Huang, Michael Xuelin and Bulling, Andreas}, year = {2019}, pages = {1--10}, url = {https://arxiv.org/abs/1903.04047} } -
Emergent Leadership Detection Across Datasets
Philipp Müller, Andreas Bulling
arXiv:1905.02058, pp. 1–5, 2019.
Automatic detection of emergent leaders in small groups from nonverbal behaviour is a growing research topic in social signal processing but existing methods were evaluated on single datasets – an unrealistic assumption for real-world applications in which systems are required to also work in settings unseen at training time. It therefore remains unclear whether current methods for emergent leadership detection generalise to similar but new settings and to which extent. To overcome this limitation, we are the first to study a cross-dataset evaluation setting for the emergent leadership detection task. We provide evaluations for within- and cross-dataset prediction using two current datasets (PAVIS and MPIIGroupInteraction), as well as an investigation on the robustness of commonly used feature channels (visual focus of attention, body pose, facial action units, speaking activity) and online prediction in the cross-dataset setting. Our evaluations show that using pose and eye contact based features, cross-dataset prediction is possible with an accuracy of 0.68, as such providing another important piece of the puzzle towards emergent leadership detection in the real world.Paper: mueller19_arxiv.pdfPaper Access: https://arxiv.org/abs/1905.02058@techreport{mueller19_arxiv, title = {Emergent Leadership Detection Across Datasets}, author = {M{\"{u}}ller, Philipp and Bulling, Andreas}, year = {2019}, pages = {1--5}, url = {https://arxiv.org/abs/1905.02058} } -
Evaluation of Appearance-Based Methods and Implications for Gaze-Based Applications
Xucong Zhang, Yusuke Sugano, Andreas Bulling
arXiv:1901.10906, pp. 1–13, 2019.
Appearance-based gaze estimation methods that only require an off-the-shelf camera have significantly improved but they are still not yet widely used in the human-computer interaction (HCI) community. This is partly because it remains unclear how they perform compared to model-based approaches as well as dominant, special-purpose eye tracking equipment. To address this limitation, we evaluate the performance of state-of-the-art appearance-based gaze estimation for interaction scenarios with and without personal calibration, indoors and outdoors, for different sensing distances, as well as for users with and without glasses. We discuss the obtained findings and their implications for the most important gaze-based applications, namely explicit eye input, attentive user interfaces, gaze-based user modelling, and passive eye monitoring. To democratise the use of appearance-based gaze estimation and interaction in HCI, we finally present OpenGaze (www.opengaze.org), the first software toolkit for appearance-based gaze estimation and interaction.Paper: zhang19_arxiv.pdfPaper Access: https://arxiv.org/abs/1901.10906@techreport{zhang19_arxiv, title = {Evaluation of Appearance-Based Methods and Implications for Gaze-Based Applications}, author = {Zhang, Xucong and Sugano, Yusuke and Bulling, Andreas}, year = {2019}, pages = {1--13}, url = {https://arxiv.org/abs/1901.10906} }
2018
Journal Articles
-
Eye movements during everyday behavior predict personality traits
Sabrina Hoppe, Tobias Loetscher, Stephanie Morey, Andreas Bulling
Frontiers in Human Neuroscience, 12, pp. 1–8, 2018.
Besides allowing us to perceive our surroundings, eye movements are also a window into our mind and a rich source of information on who we are, how we feel, and what we do. Here we show that eye movements during an everyday task predict aspects of our personality. We tracked eye movements of 42 participants while they ran an errand on a university campus and subsequently assessed their personality traits using well-established questionnaires. Using a state-of-the-art machine learning method and a rich set of features encoding different eye movement characteristics, we were able to reliably predict four of the Big Five personality traits (neuroticism, extraversion, agreeableness, conscientiousness) as well as perceptual curiosity only from eye movements. Further analysis revealed new relations between previously neglected eye movement characteristics and personality. Our findings demonstrate a considerable influence of personality on everyday eye movement control, thereby complementing earlier studies in laboratory settings. Improving automatic recognition and interpretation of human social signals is an important endeavor, enabling innovative design of human–computer systems capable of sensing spontaneous natural user behavior to facilitate efficient interaction and personalization.Paper: hoppe18_fhns.pdf@article{hoppe18_fhns, title = {Eye movements during everyday behavior predict personality traits}, author = {Hoppe, Sabrina and Loetscher, Tobias and Morey, Stephanie and Bulling, Andreas}, doi = {10.3389/fnhum.2018.00105}, volume = {12}, pages = {1--8}, year = {2018}, journal = {Frontiers in Human Neuroscience} } -
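A hedged sketch of the general pipeline the abstract describes, in which summary statistics of fixations and saccades feed a standard classifier per trait; the concrete features, synthetic data, and the random forest are illustrative choices, not the paper's feature set or model.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def eye_movement_features(fix_durations_ms, sacc_amplitudes_deg):
    """A handful of simple summary statistics per recording; the paper uses a much richer set."""
    return np.array([
        fix_durations_ms.mean(), fix_durations_ms.std(),
        sacc_amplitudes_deg.mean(), sacc_amplitudes_deg.std(),
        len(fix_durations_ms) / (fix_durations_ms.sum() / 1000.0),  # approx. fixation rate (1/s)
    ])

rng = np.random.default_rng(0)
X = np.stack([eye_movement_features(rng.gamma(2.0, 120.0, 300), rng.gamma(2.0, 2.5, 299))
              for _ in range(42)])
y = rng.integers(0, 3, 42)  # e.g. low/medium/high score on one personality trait
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)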
CueAuth: Comparing Touch, Mid-Air Gestures, and Gaze for Cue-based Authentication on Situated Displays
Mohamed Khamis, Ludwig Trotter, Ville Mäkelä, Emanuel Zezschwitz, Jens Le, Andreas Bulling, Florian Alt
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), 2 (7), pp. 1–22, 2018.
Secure authentication on situated displays (e.g., to access sensitive information or to make purchases) is becoming increasingly important. A promising approach is authentication schemes that employ cues that users respond to while authenticating; these schemes overwhelm observers by requiring them to observe the cue itself as well as users’ response to the cue. Although previous work proposed a variety of modalities, such as gaze and mid-air gestures, to further improve security, an understanding of how they compare with regard to usability and security is still missing as of today. In this paper, we compare modalities for cue-based authentication on situated displays. We provide the first comparison between touch, mid-air gestures, and calibration-free gaze using a state-of-the-art authentication concept. In two user studies (N=37) we found that the choice of touch or gaze presents a clear trade-off between usability and security. For example, while gaze input is more secure, it is also more demanding and requires longer authentication times. Mid-air gestures are slightly slower and more secure than touch but users hesitate to use them in public. We conclude with design implications for authentication using touch, mid-air gestures, and gaze and discuss how the choice of modality creates opportunities and challenges for improved authentication in public.doi: 10.1145/3287052Paper: khamis18_imwut.pdf@article{khamis18_imwut, title = {CueAuth: Comparing Touch, Mid-Air Gestures, and Gaze for Cue-based Authentication on Situated Displays}, author = {Khamis, Mohamed and Trotter, Ludwig and Mäkelä, Ville and von Zezschwitz, Emanuel and Le, Jens and Bulling, Andreas and Alt, Florian}, year = {2018}, journal = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)}, volume = {2}, number = {7}, pages = {1--22}, doi = {10.1145/3287052} } -
GazeDirector: Fully Articulated Eye Gaze Redirection in Video
Erroll Wood, Tadas Baltrušaitis, Louis-Philippe Morency, Peter Robinson, Andreas Bulling
Computer Graphics Forum (CGF), 37 (2), pp. 217-225, 2018.
Best Paper Honourable Mention Award
We present GazeDirector, a new approach for eye gaze redirection that uses model-fitting. Our method first tracks the eyes by fitting a multi-part eye region model to video frames using analysis-by-synthesis, thereby recovering eye region shape, texture, pose, and gaze simultaneously. It then redirects gaze by 1) warping the eyelids from the original image using a model-derived flow field, and 2) rendering and compositing synthesized 3D eyeballs onto the output image in a photorealistic manner. GazeDirector allows us to change where people are looking without person-specific training data, and with full articulation, i.e. we can precisely specify new gaze directions in 3D. Quantitatively, we evaluate both model-fitting and gaze synthesis, with experiments for gaze estimation and redirection on the Columbia gaze dataset. Qualitatively, we compare GazeDirector against recent work on gaze redirection, showing better results especially for large redirection angles. Finally, we demonstrate gaze redirection on YouTube videos by introducing new 3D gaze targets and by manipulating visual behavior.@article{wood18_cgf, title = {GazeDirector: Fully Articulated Eye Gaze Redirection in Video}, author = {Wood, Erroll and Baltru{\v{s}}aitis, Tadas and Morency, Louis-Philippe and Robinson, Peter and Bulling, Andreas}, year = {2018}, journal = {Computer Graphics Forum (CGF)}, volume = {37}, number = {2}, pages = {217-225}, doi = {10.1111/cgf.13355}, video = {https://www.youtube.com/watch?v=rSNUGciJH6A} }
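As a rough illustration of the eyelid-warping step in the GazeDirector entry above, the snippet below applies a dense flow field to an eye-region crop with OpenCV's remap. The flow field here is synthetic (a smooth vertical displacement); in the actual method it is derived from the fitted multi-part eye region model, which is not reproduced here.

```python
# Illustrative only: warping an eye-region crop with a dense flow field,
# in the spirit of GazeDirector's eyelid-warping step. The flow used here
# is a synthetic displacement, not a model-derived field.
import cv2
import numpy as np

eye = (np.random.rand(64, 96, 3) * 255).astype(np.uint8)  # stand-in eye crop

h, w = eye.shape[:2]
grid_x, grid_y = np.meshgrid(np.arange(w, dtype=np.float32),
                             np.arange(h, dtype=np.float32))
flow_y = 3.0 * np.sin(np.pi * grid_x / w)  # fake "eyelid" displacement in pixels

# remap samples the source image at (map_x, map_y) for every output pixel
warped = cv2.remap(eye, grid_x, grid_y + flow_y,
                   interpolation=cv2.INTER_LINEAR,
                   borderMode=cv2.BORDER_REFLECT)
```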
Conference Papers
-
Error-Aware Gaze-Based Interfaces for Robust Mobile Gaze Interaction
Michael Barz, Florian Daiber, Daniel Sonntag, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–10, 2018.
Best Paper Award
Gaze estimation error is unavoidable in head-mounted eye trackers and can severely hamper usability and performance of mobile gaze-based interfaces given that the error varies constantly for different interaction positions. In this work, we explore error-aware gaze-based interfaces that estimate and adapt to gaze estimation error on-the-fly. We implement a sample error-aware user interface for gaze-based selection and different error compensation methods: a naïve approach that increases component size directly proportional to the absolute error, a recent model by Feit et al. (CHI’17) that is based on the 2-dimensional error distribution, and a novel predictive model that shifts gaze by a directional error estimate. We evaluate these models in a 12-participant user study and show that our predictive model outperforms the others significantly in terms of selection rate, particularly for small gaze targets. These results underline both the feasibility and potential of next generation error-aware gaze-based user interfaces.Paper: barz18_etra.pdf@inproceedings{barz18_etra, author = {Barz, Michael and Daiber, Florian and Sonntag, Daniel and Bulling, Andreas}, title = {Error-Aware Gaze-Based Interfaces for Robust Mobile Gaze Interaction}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2018}, pages = {1--10}, doi = {10.1145/3204493.3204536} } -
A novel approach to single camera, glint-free 3D eye model fitting including corneal refraction
Kai Dierkes, Moritz Kassner, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–9, 2018.
Model-based methods for glint-free gaze estimation typically infer eye pose using pupil contours extracted from eye images. Existing methods, however, either ignore or require complex hardware setups to deal with refraction effects occurring at the corneal interfaces. In this work we provide a detailed analysis of the effects of refraction in glint-free gaze estimation using a single near-eye camera, based on the method presented by [Swirski et al. 2013]. We demonstrate systematic deviations in inferred eyeball positions and gaze directions with respect to synthetic ground-truth data and show that ignoring corneal refraction can result in angular errors of several degrees. Furthermore, we quantify gaze direction dependent errors in pupil radius estimates. We propose a novel approach to account for corneal refraction in 3D eye model fitting and by analyzing synthetic and real images show that our new method successfully captures refraction effects and helps to overcome the shortcomings of the state of the art approach.Paper: dierkes18_etra.pdf@inproceedings{dierkes18_etra, author = {Dierkes, Kai and Kassner, Moritz and Bulling, Andreas}, title = {A novel approach to single camera, glint-free 3D eye model fitting including corneal refraction}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2018}, pages = {1--9}, doi = {10.1145/3204493.3204525} } -
Towards a Symbiotic Human-Machine Depth Sensor: Exploring 3D Gaze for Object Reconstruction
Teresa Hirzle, Jan Gugenheimer, Florian Geiselhart, Andreas Bulling, Enrico Rukzio
Adj. Proc. ACM Symposium on User Interface Software and Technology (UIST), pp. 114-116, 2018.
Eye tracking is expected to become an integral part of future augmented reality (AR) head-mounted displays (HMDs) given that it can easily be integrated into existing hardware and provides a versatile interaction modality. To augment objects in the real world, AR HMDs require a three-dimensional understanding of the scene, which is currently solved using depth cameras. In this work we aim to explore how 3D gaze data can be used to enhance scene understanding for AR HMDs by envisioning a symbiotic human-machine depth camera, fusing depth data with 3D gaze information. We present a first proof of concept, exploring to what extent we are able to recognise what a user is looking at by plotting 3D gaze data. To measure 3D gaze, we implemented a vergence-based algorithm and built an eye tracking setup consisting of a Pupil Labs headset and an OptiTrack motion capture system, allowing us to measure 3D gaze inside a 50x50x50 cm volume. We show first 3D gaze plots of "gazed-at" objects and describe our vision of a symbiotic human-machine depth camera that combines a depth camera and human 3D gaze information.Paper: hirzle18_uist.pdf@inproceedings{hirzle18_uist, title = {Towards a Symbiotic Human-Machine Depth Sensor: Exploring 3D Gaze for Object Reconstruction}, author = {Hirzle, Teresa and Gugenheimer, Jan and Geiselhart, Florian and Bulling, Andreas and Rukzio, Enrico}, year = {2018}, pages = {114-116}, doi = {10.1145/3266037.3266119}, booktitle = {Adj. Proc. ACM Symposium on User Interface Software and Technology (UIST)} } -
Understanding Face and Eye Visibility in Front-Facing Cameras of Smartphones used in the Wild
Mohamed Khamis, Anita Baier, Niels Henze, Florian Alt, Andreas Bulling
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–12, 2018.
Commodity mobile devices are now equipped with high-resolution front-facing cameras, paving the way for applications in biometrics, facial expression analysis, or gaze interaction. However, it is unknown how often users hold devices in a way that allows capturing their face or eyes, and how this impacts detection accuracy. We collected 25,726 in-the-wild photos taken from the front-facing camera of smartphones and associated application usage logs. We found that the full face is visible about 29% of the time, and that in most cases the face is only partially visible. We further identified an influence of users’ current activity; for example, when watching videos, the eyes but not the entire face are visible 75% of the time in our dataset. We found that state-of-the-art face detection algorithms perform poorly against photos taken from front-facing cameras. We discuss how these findings impact mobile applications that leverage face and eye detection, and derive practical implications to address the state-of-the-art’s limitations.Paper: khamis18_chi.pdf@inproceedings{khamis18_chi, title = {Understanding Face and Eye Visibility in Front-Facing Cameras of Smartphones used in the Wild}, author = {Khamis, Mohamed and Baier, Anita and Henze, Niels and Alt, Florian and Bulling, Andreas}, year = {2018}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, doi = {10.1145/3173574.3173854}, pages = {1--12}, video = {https://www.youtube.com/watch?v=_L6FyzTjFG0} } -
Which one is me? Identifying Oneself on Public Displays
Mohamed Khamis, Christian Becker, Andreas Bulling, Florian Alt
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–12, 2018.
Best Paper Honourable Mention Award
While user representations are extensively used on public displays, it remains unclear how well users can recognize their own representation among those of surrounding users. We study the most widely used representations: abstract objects, skeletons, silhouettes and mirrors. In a prestudy (N=12), we identify five strategies that users follow to recognize themselves on public displays. In a second study (N=19), we quantify the users’ recognition time and accuracy with respect to each representation type. Our findings suggest that there is a significant effect of (1) the representation type, (2) the strategies performed by users, and (3) the combination of both on recognition time and accuracy. We discuss the suitability of each representation for different settings and provide specific recommendations as to how user representations should be applied in multi-user scenarios. These recommendations guide practitioners and researchers in selecting the representation that best fits the deployment’s requirements and the user strategies that are feasible in that environment.Paper: khamis18_chi_2.pdf@inproceedings{khamis18_chi_2, title = {Which one is me? Identifying Oneself on Public Displays}, author = {Khamis, Mohamed and Becker, Christian and Bulling, Andreas and Alt, Florian}, year = {2018}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, doi = {10.1145/3173574.3173861}, pages = {1--12}, video = {https://www.youtube.com/watch?v=yG5_RBrnRx0} } -
VRPursuits: Interaction in Virtual Reality using Smooth Pursuit Eye Movements
Mohamed Khamis, Carl Oechsner, Florian Alt, Andreas Bulling
Proc. International Conference on Advanced Visual Interfaces (AVI), pp. 1–8, 2018.
Gaze-based interaction using smooth pursuit eye movements (Pursuits) is attractive given that it is intuitive and overcomes the Midas touch problem. At the same time, eye tracking is becoming increasingly popular for VR applications. While Pursuits was shown to be effective in several interaction contexts, it was never explored in-depth for VR before. In a user study (N=26), we investigated how parameters that are specific to VR settings influence the performance of Pursuits. We found that Pursuits is robust against different sizes of virtual 3D targets and distances to them. However, Pursuits’ performance improves when the trajectory size is larger, particularly if the user is walking while interacting. While walking, selecting moving targets via Pursuits is generally feasible albeit less accurate than when stationary. Finally, we discuss the implications of these findings and the potential of smooth pursuits for interaction in VR by demonstrating two sample use cases: 1) gaze-based authentication in VR, and 2) a space meteors shooting game.Paper: khamis18_avi.pdf@inproceedings{khamis18_avi, title = {VRPursuits: Interaction in Virtual Reality using Smooth Pursuit Eye Movements}, author = {Khamis, Mohamed and Oechsner, Carl and Alt, Florian and Bulling, Andreas}, year = {2018}, pages = {1--8}, booktitle = {Proc. International Conference on Advanced Visual Interfaces (AVI)}, doi = {10.1145/3206505.3206522} } -
The Past, Present, and Future of Gaze-enabled Handheld Mobile Devices: Survey and Lessons Learned
Mohamed Khamis, Florian Alt, Andreas Bulling
Proc. ACM International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI), pp. 1–17, 2018.
Best Paper Honourable Mention Award
While first-generation mobile gaze interfaces required special-purpose hardware, recent advances in computational gaze estimation and the availability of sensor-rich and powerful devices are finally fulfilling the promise of pervasive eye tracking and eye-based interaction on off-the-shelf mobile devices. This work provides the first holistic view on the past, present, and future of eye tracking on handheld mobile devices. To this end, we discuss how research developed from building hardware prototypes, to accurate gaze estimation on unmodified smartphones and tablets. We then discuss implications by laying out 1) novel opportunities, which include pervasive advertising and conducting in-the-wild eye tracking studies on handhelds, as well as 2) new challenges that require further research, such as the visibility of the user’s eyes, lighting conditions, and privacy implications. We discuss how these developments shape MobileHCI research in the future, possibly the “next 20 years”, as the overarching theme of MobileHCI 2018 suggests.Paper: khamis18_mobilehci.pdf@inproceedings{khamis18_mobilehci, author = {Khamis, Mohamed and Alt, Florian and Bulling, Andreas}, title = {The Past, Present, and Future of Gaze-enabled Handheld Mobile Devices: Survey and Lessons Learned}, booktitle = {Proc. ACM International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI)}, year = {2018}, doi = {10.1145/3229434.3229452}, pages = {1--17} } -
GazeDrone: Mobile Eye-Based Interaction in Public Space Without Augmenting the User
Mohamed Khamis, Anna Kienle, Florian Alt, Andreas Bulling
Proc. ACM Workshop on Micro Aerial Vehicle Networks, Systems, and Applications (DroNet), pp. 66-71, 2018.
Gaze interaction holds a lot of promise for seamless human-computer interaction. At the same time, current wearable mobile eye trackers require user augmentation that negatively impacts natural user behavior while remote trackers require users to position themselves within a confined tracking range. We present GazeDrone, the first system that combines a camera-equipped aerial drone with a computational method to detect sidelong glances for spontaneous (calibration-free) gaze-based interaction with surrounding pervasive systems (e.g., public displays). GazeDrone does not require augmenting each user with on-body sensors and allows interaction from arbitrary positions, even while moving. We demonstrate that drone-supported gaze interaction is feasible and accurate for certain movement types. It is well-perceived by users, in particular while interacting from a fixed position as well as while moving orthogonally or diagonally to a display. We present design implications and discuss opportunities and challenges for drone-supported gaze interaction in public.Paper: khamis18_dronet.pdf@inproceedings{khamis18_dronet, title = {GazeDrone: Mobile Eye-Based Interaction in Public Space Without Augmenting the User}, author = {Khamis, Mohamed and Kienle, Anna and Alt, Florian and Bulling, Andreas}, doi = {10.1145/3213526.3213539}, year = {2018}, booktitle = {Proc. ACM Workshop on Micro Aerial Vehicle Networks, Systems, and Applications (DroNet)}, pages = {66-71} } -
A Multimodal Corpus of Expert Gaze and Behavior during Phonetic Segmentation Tasks
Arif Khan, Ingmar Steiner, Yusuke Sugano, Andreas Bulling, Ross Macdonald
Proc. Language Resources and Evaluation Conference (LREC), pp. 4277–4281, 2018.
Phonetic segmentation is the process of splitting speech into distinct phonetic units. Human experts routinely perform this task manually by analyzing auditory and visual cues using analysis software, which is an extremely time-consuming process. Methods exist for automatic segmentation, but these are not always accurate enough. In order to improve automatic segmentation, we need to model it as close to the manual segmentation as possible. This corpus is an effort to capture the human segmentation behavior by recording experts performing a segmentation task. We believe that this data will enable us to highlight the important aspects of manual segmentation, which can be used in automatic segmentation to improve its accuracy.Paper: khan18_lrec.pdf@inproceedings{khan18_lrec, title = {A Multimodal Corpus of Expert Gaze and Behavior during Phonetic Segmentation Tasks}, author = {Khan, Arif and Steiner, Ingmar and Sugano, Yusuke and Bulling, Andreas and Macdonald, Ross}, year = {2018}, pages = {4277--4281}, booktitle = {Proc. Language Resources and Evaluation Conference (LREC)} } -
Hidden Pursuits: Evaluating Gaze-selection via Pursuits when the Stimulus Trajectory is Partially Hidden
Thomas Mattusch, Mahsa Mirzamohammad, Mohamed Khamis, Andreas Bulling, Florian Alt
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–5, 2018.
The idea behind gaze interaction using Pursuits is to leverage the human’s smooth pursuit eye movements performed when following moving targets. However, humans can also anticipate where a moving target would reappear if it temporarily hides from their view. In this work, we investigate how well users can select targets using Pursuits in cases where the target’s trajectory is partially invisible (HiddenPursuits): e.g., can users select a moving target that temporarily hides behind another object? Although HiddenPursuits was not studied in the context of interaction before, understanding how well users can perform HiddenPursuits presents numerous opportunities, particularly for small interfaces where a target’s trajectory can cover areas outside the screen. We found that users can still select targets quickly via Pursuits even if their trajectory is up to 50% hidden, albeit at the expense of longer selection times when the hidden portion is larger. We discuss how gaze-based interfaces can leverage HiddenPursuits for an improved user experience.Paper: mattusch18_etra.pdf@inproceedings{mattusch18_etra, author = {Mattusch, Thomas and Mirzamohammad, Mahsa and Khamis, Mohamed and Bulling, Andreas and Alt, Florian}, title = {Hidden Pursuits: Evaluating Gaze-selection via Pursuits when the Stimulus Trajectory is Partially Hidden}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2018}, pages = {1--5}, doi = {10.1145/3204493.3204569} } -
Detecting Low Rapport During Natural Interactions in Small Groups from Non-Verbal Behavior
Philipp Müller, Michael Xuelin Huang, Andreas Bulling
Proc. ACM International Conference on Intelligent User Interfaces (IUI), pp. 153-164, 2018.
Rapport, the close and harmonious relationship in which interaction partners are "in sync" with each other, was shown to result in smoother social interactions, improved collaboration, and improved interpersonal outcomes. In this work, we are the first to investigate automatic prediction of low rapport during natural interactions within small groups. This task is challenging given that rapport only manifests in subtle non-verbal signals that are, in addition, subject to influences of group dynamics as well as inter-personal idiosyncrasies. We record videos of unscripted discussions of three to four people using a multi-view camera system and microphones. We analyse a rich set of non-verbal signals for rapport detection, namely facial expressions, hand motion, gaze, speaker turns, and speech prosody. Using facial features, we can detect low rapport with an average precision of 0.7 (chance level at 0.25), while incorporating prior knowledge of participants’ personalities can even achieve early prediction without a drop in performance. We further provide a detailed analysis of different feature sets and the amount of information contained in different temporal segments of the interactions.Paper: mueller18_iui.pdf@inproceedings{mueller18_iui, title = {Detecting Low Rapport During Natural Interactions in Small Groups from Non-Verbal Behavior}, author = {M{\"{u}}ller, Philipp and Huang, Michael Xuelin and Bulling, Andreas}, year = {2018}, pages = {153-164}, booktitle = {Proc. ACM International Conference on Intelligent User Interfaces (IUI)}, doi = {10.1145/3172944.3172969} } -
Robust Eye Contact Detection in Natural Multi-Person Interactions Using Gaze and Speaking Behaviour
Philipp Müller, Michael Xuelin Huang, Xucong Zhang, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–10, 2018.
Eye contact is one of the most important non-verbal social cues and fundamental to human interactions. However, detecting eye contact without specialized eye tracking equipment poses significant challenges, particularly for multiple people in real-world settings. We present a novel method to robustly detect eye contact in natural three- and four-person interactions using off-the-shelf ambient cameras. Our method exploits that, during conversations, people tend to look at the person who is currently speaking. Harnessing the correlation between people’s gaze and speaking behaviour therefore allows our method to automatically acquire training data during deployment and adaptively train eye contact detectors for each target user. We empirically evaluate the performance of our method on a recent dataset of natural group interactions and demonstrate that it achieves a relative improvement over the state-of-the-art method of more than 60%, and also improves over a head pose based baseline.Paper: mueller18_etra.pdf@inproceedings{mueller18_etra, author = {M{\"{u}}ller, Philipp and Huang, Michael Xuelin and Zhang, Xucong and Bulling, Andreas}, title = {Robust Eye Contact Detection in Natural Multi-Person Interactions Using Gaze and Speaking Behaviour}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2018}, pages = {1--10}, doi = {10.1145/3204493.3204549} } -
Learning to Find Eye Region Landmarks for Remote Gaze Estimation in Unconstrained Settings
Seonwook Park, Xucong Zhang, Andreas Bulling, Otmar Hilliges
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–10, 2018.
Best Presentation Award
Conventional feature-based and model-based gaze estimation methods have proven to perform well in settings with controlled illumination and specialized cameras. In unconstrained real-world settings, however, such methods are surpassed by recent appearance-based methods due to difficulties in modeling factors such as illumination changes and other visual artifacts. We present a novel learning-based method for eye region landmark localization that enables conventional methods to be competitive with the latest appearance-based methods. Despite having been trained exclusively on synthetic data, our method exceeds the state of the art for iris localization and eye shape registration on real-world imagery. We then use the detected landmarks as input to iterative model-fitting and lightweight learning-based gaze estimation methods. Our approach outperforms existing model-fitting and appearance-based methods in the context of person-independent and personalized gaze estimation.Paper: park18_etra.pdf@inproceedings{park18_etra, author = {Park, Seonwook and Zhang, Xucong and Bulling, Andreas and Hilliges, Otmar}, title = {Learning to Find Eye Region Landmarks for Remote Gaze Estimation in Unconstrained Settings}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2018}, pages = {1--10}, doi = {10.1145/3204493.3204545}, video = {https://www.youtube.com/watch?v=I8WlEHgDBV4} } -
Fixation Detection for Head-Mounted Eye Tracking Based on Visual Similarity of Gaze Targets
Julian Steil, Michael Xuelin Huang, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–9, 2018.
Fixations are widely analysed in human vision, gaze-based interaction, and experimental psychology research. However, robust fixation detection in mobile settings is profoundly challenging given the prevalence of user and gaze target motion. These movements feign a shift in gaze estimates in the frame of reference defined by the eye tracker’s scene camera. To address this challenge, we present a novel fixation detection method for head-mounted eye trackers. Our method exploits that, independent of user or gaze target motion, target appearance remains about the same during a fixation. It extracts image information from small regions around the current gaze position and analyses the appearance similarity of these gaze patches across video frames to detect fixations. We evaluate our method using fine-grained fixation annotations on a five-participant indoor dataset (MPIIEgoFixation) with more than 2,300 fixations in total. Our method outperforms commonly used velocity- and dispersion-based algorithms, which highlights its significant potential to analyse scene image information for eye movement detection.Paper: steil18_etra.pdf@inproceedings{steil18_etra, author = {Steil, Julian and Huang, Michael Xuelin and Bulling, Andreas}, title = {Fixation Detection for Head-Mounted Eye Tracking Based on Visual Similarity of Gaze Targets}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2018}, pages = {1--9}, doi = {10.1145/3204493.3204538} } -
Forecasting User Attention During Everyday Mobile Interactions Using Device-Integrated and Wearable Sensors
Julian Steil, Philipp Müller, Yusuke Sugano, Andreas Bulling
Proc. ACM International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI), pp. 1–13, 2018.
Best Paper Award
Visual attention is highly fragmented during mobile interactions but the erratic nature of attention shifts currently limits attentive user interfaces to adapt after the fact, i.e. after shifts have already happened. We instead study attention forecasting – the challenging task of predicting users’ gaze behavior (overt visual attention) in the near future. We present a novel long-term dataset of everyday mobile phone interactions, continuously recorded from 20 participants engaged in common activities on a university campus over 4.5 hours each (more than 90 hours in total). We propose a proof-of-concept method that uses device-integrated sensors and body-worn cameras to encode rich information on device usage and users’ visual scene. We demonstrate that our method can forecast bidirectional attention shifts and whether the primary attentional focus is on the handheld mobile device. We study the impact of different feature sets on performance and discuss the significant potential but also remaining challenges of forecasting user attention during mobile interactions.Paper: steil18_mobilehci.pdf@inproceedings{steil18_mobilehci, author = {Steil, Julian and M{\"{u}}ller, Philipp and Sugano, Yusuke and Bulling, Andreas}, title = {Forecasting User Attention During Everyday Mobile Interactions Using Device-Integrated and Wearable Sensors}, booktitle = {Proc. ACM International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI)}, year = {2018}, doi = {10.1145/3229434.3229439}, pages = {1--13} } -
Training Person-Specific Gaze Estimators from Interactions with Multiple Devices
Xucong Zhang, Michael Xuelin Huang, Yusuke Sugano, Andreas Bulling
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–12, 2018.
Learning-based gaze estimation has significant potential to enable attentive user interfaces and gaze-based interaction on the billions of camera-equipped handheld devices and ambient displays. While training accurate person- and device-independent gaze estimators remains challenging, person-specific training is feasible but requires tedious data collection for each target device. To address these limitations, we present the first method to train person-specific gaze estimators across multiple devices. At the core of our method is a single convolutional neural network with shared feature extraction layers and device-specific branches that we train from face images and corresponding on-screen gaze locations. Detailed evaluations on a new dataset of interactions with five common devices (mobile phone, tablet, laptop, desktop computer, smart TV) and three common applications (mobile game, text editing, media center) demonstrate the significant potential of cross-device training. We further explore training with gaze locations derived from natural interactions, such as mouse or touch input.Paper: zhang18_chi.pdf@inproceedings{zhang18_chi, title = {Training Person-Specific Gaze Estimators from Interactions with Multiple Devices}, author = {Zhang, Xucong and Huang, Michael Xuelin and Sugano, Yusuke and Bulling, Andreas}, year = {2018}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, doi = {10.1145/3173574.3174198}, pages = {1--12} } -
Revisiting Data Normalization for Appearance-Based Gaze Estimation
Xucong Zhang, Yusuke Sugano, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–9, 2018.
Appearance-based gaze estimation is promising for unconstrained real-world settings, but the significant variability in head pose and user-camera distance poses significant challenges for training generic gaze estimators. Data normalization was proposed to cancel out this geometric variability by mapping input images and gaze labels to a normalized space. Although used successfully in prior works, the role and importance of data normalization remains unclear. To fill this gap, we study data normalization for the first time using principled evaluations on both simulated and real data. We propose a modification to the current data normalization formulation by removing the scaling factor and show that our new formulation performs significantly better (between 9.5% and 32.7%) in the different evaluation settings. Using images synthesized from a 3D face model, we demonstrate the benefit of data normalization for the efficiency of the model training. Experiments on real-world images confirm the advantages of data normalization in terms of gaze estimation performance.Paper: zhang18_etra.pdf@inproceedings{zhang18_etra, author = {Zhang, Xucong and Sugano, Yusuke and Bulling, Andreas}, title = {Revisiting Data Normalization for Appearance-Based Gaze Estimation}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2018}, pages = {1--9}, doi = {10.1145/3204493.3204548} }
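For readers who want to see what "data normalization" refers to in the entry above, the sketch below builds the commonly used normalization transform: a rotation that points a virtual camera at a reference point on the face and a scaling that fixes the camera-to-face distance. It is a simplified illustration under assumed conventions (millimetre units, a 600 mm normalised distance), not the authors' released code; the paper's proposed modification, as described in the abstract, removes the scaling factor from the part of the formulation applied to gaze labels.

```python
# Simplified sketch of gaze data normalization: a rotation R that points the
# camera z-axis at a face reference point and a scaling S that fixes the
# camera-to-face distance. Follows the commonly used formulation; not the
# authors' exact implementation.
import numpy as np

def normalization_matrix(face_center, d_norm=600.0):
    """face_center: 3D reference point in camera coordinates (millimetres)."""
    distance = np.linalg.norm(face_center)
    z = face_center / distance                      # new z-axis: towards the face
    x = np.cross(np.array([0.0, 1.0, 0.0]), z)      # keep x roughly horizontal
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.stack([x, y, z])                         # rows are the new camera axes
    S = np.diag([1.0, 1.0, d_norm / distance])      # scale only along z
    return S @ R, R

face_center = np.array([50.0, -30.0, 580.0])
M, R = normalization_matrix(face_center)
# M = S @ R is the conversion matrix used (with the camera intrinsics) to warp
# the input image. Roughly, under the modified formulation described above,
# only the rotation R (not the scaling S) is applied to gaze direction labels.
```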
Technical Reports
-
Privacy-Aware Eye Tracking Using Differential Privacy
Julian Steil, Inken Hagestedt, Michael Xuelin Huang, Andreas Bulling
arXiv:1812.08000, pp. 1–22, 2018.
With eye tracking being increasingly integrated into virtual and augmented reality (VR/AR) head-mounted displays, preserving users’ privacy is an ever more important, yet under-explored, topic in the eye tracking community. We report a large-scale online survey (N=124) on privacy aspects of eye tracking that provides the first comprehensive account of with whom, for which services, and to which extent users are willing to share their gaze data. Using these insights, we design a privacy-aware VR interface that uses differential privacy, which we evaluate on a new 20-participant dataset for two privacy sensitive tasks: We show that our method can prevent user re-identification and protect gender information while maintaining high performance for gaze-based document type classification. Our results highlight the privacy challenges particular to gaze data and demonstrate that differential privacy is a potential means to address them. Thus, this paper lays important foundations for future research on privacy-aware gaze interfaces.Paper: steil18_arxiv.pdfPaper Access: https://arxiv.org/abs/1812.08000@techreport{steil18_arxiv, author = {Steil, Julian and Hagestedt, Inken and Huang, Michael Xuelin and Bulling, Andreas}, title = {Privacy-Aware Eye Tracking Using Differential Privacy}, year = {2018}, pages = {1--22}, url = {https://arxiv.org/abs/1812.08000} } -
PrivacEye: Privacy-Preserving First-Person Vision Using Image Features and Eye Movement Analysis
Julian Steil, Marion Koelle, Wilko Heuten, Susanne Boll, Andreas Bulling
arXiv:1801.04457, pp. 1–14, 2018.
As first-person cameras in head-mounted displays become increasingly prevalent, so does the problem of infringing user and bystander privacy. To address this challenge, we present PrivacEye, a proof-of-concept system that detects privacy-sensitive everyday situations and automatically enables and disables the first-person camera using a mechanical shutter. To close the shutter, PrivacEye detects sensitive situations from first-person camera videos using an end-to-end deep-learning model. To open the shutter without visual input, PrivacEye uses a separate, smaller eye camera to detect changes in users’ eye movements to gauge changes in the "privacy level" of the current situation. We evaluate PrivacEye on a dataset of first-person videos recorded in the daily life of 17 participants that they annotated with privacy sensitivity levels. We discuss the strengths and weaknesses of our proof-of-concept system based on a quantitative technical evaluation as well as qualitative insights from semi-structured interviews.Paper: steil18_arxiv_2.pdfPaper Access: https://arxiv.org/abs/1801.04457@techreport{steil18_arxiv_2, title = {PrivacEye: Privacy-Preserving First-Person Vision Using Image Features and Eye Movement Analysis}, author = {Steil, Julian and Koelle, Marion and Heuten, Wilko and Boll, Susanne and Bulling, Andreas}, year = {2018}, pages = {1--14}, url = {https://arxiv.org/abs/1801.04457} }
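The control logic described in the PrivacEye abstract (close a mechanical shutter when the scene looks privacy-sensitive, reopen it based on eye-movement cues alone) can be summarised in a few lines. The two classifier calls below are placeholders; the paper uses a CNN on scene frames and a classifier on eye-movement features, neither of which is reproduced here.

```python
# Hypothetical sketch of a PrivacEye-style shutter control loop. The two
# classifiers are random stand-ins; the actual system uses a CNN on scene
# frames and an eye-movement feature classifier, which are not reproduced.
import random

def scene_is_sensitive(frame) -> bool:                # placeholder for the scene CNN
    return random.random() > 0.8

def eye_movements_suggest_change(features) -> bool:   # placeholder eye-feature classifier
    return random.random() > 0.9

shutter_open = True
for t in range(100):                                  # one decision per time step
    if shutter_open:
        frame = object()                              # stand-in for a scene-camera frame
        if scene_is_sensitive(frame):
            shutter_open = False                      # close: stop recording first-person video
    else:
        features = object()                           # stand-in for eye-camera features
        if eye_movements_suggest_change(features):
            shutter_open = True                       # reopen without any scene-camera input
```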
2017
Journal Articles
-
EyePACT: Eye-Based Parallax Correction on Touch-Enabled Interactive Displays
Mohamed Khamis, Daniel Buschek, Tobias Thieron, Florian Alt, Andreas Bulling
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), 1 (4), pp. 1–18, 2017.
The parallax effect describes the displacement between the perceived and detected touch locations on a touch-enabled surface. Parallax is a key usability challenge for interactive displays, particularly for those that require thick layers of glass between the screen and the touch surface to protect them from vandalism. To address this challenge, we present EyePACT, a method that compensates for input error caused by parallax on public displays. Our method uses a display-mounted depth camera to detect the user’s 3D eye position in front of the display and the detected touch location to predict the perceived touch location on the surface. We evaluate our method in two user studies in terms of parallax correction performance as well as multi-user support. Our evaluations demonstrate that EyePACT (1) significantly improves accuracy even with varying gap distances between the touch surface and the display, (2) adapts to different levels of parallax by resulting in significantly larger corrections with larger gap distances, and (3) maintains a significantly large distance between two users’ fingers when interacting with the same object. Our results provide implications for the development of future touch-enabled public displays.doi: 10.1145/3161168Paper: khamis17_imwut.pdf@article{khamis17_imwut, author = {Khamis, Mohamed and Buschek, Daniel and Thieron, Tobias and Alt, Florian and Bulling, Andreas}, title = {EyePACT: Eye-Based Parallax Correction on Touch-Enabled Interactive Displays}, journal = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)}, year = {2017}, volume = {1}, number = {4}, pages = {1--18}, doi = {10.1145/3161168} } -
InvisibleEye: Mobile Eye Tracking Using Multiple Low-Resolution Cameras and Learning-Based Gaze Estimation
Marc Tonsen, Julian Steil, Yusuke Sugano, Andreas Bulling
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), 1 (3), pp. 1–21, 2017.
Distinguished Paper Award
Analysis of everyday human gaze behaviour has significant potential for ubiquitous computing, as evidenced by a large body of work in gaze-based human-computer interaction, attentive user interfaces, and eye-based user modelling. However, current mobile eye trackers are still obtrusive, which not only makes them uncomfortable to wear and socially unacceptable in daily life, but also prevents them from being widely adopted in the social and behavioural sciences. To address these challenges we present InvisibleEye, a novel approach for mobile eye tracking that uses millimetre-size RGB cameras that can be fully embedded into normal glasses frames. To compensate for the cameras’ low image resolution of only a few pixels, our approach uses multiple cameras to capture different views of the eye, as well as learning-based gaze estimation to directly regress from eye images to gaze directions. We prototypically implement our system and characterise its performance on three large-scale, increasingly realistic, and thus challenging datasets: 1) eye images synthesised using a recent computer graphics eye region model, 2) real eye images recorded of 17 participants under controlled lighting, and 3) eye images recorded of four participants over the course of four recording sessions in a mobile setting. We show that InvisibleEye achieves a top person-specific gaze estimation accuracy of 1.79° using four cameras with a resolution of only 5×5 pixels. Our evaluations not only demonstrate the feasibility of this novel approach but, more importantly, underline its significant potential for finally realising the vision of invisible mobile eye tracking and pervasive attentive user interfaces.doi: 10.1145/3130971Paper: tonsen17_imwut.pdf@article{tonsen17_imwut, author = {Tonsen, Marc and Steil, Julian and Sugano, Yusuke and Bulling, Andreas}, title = {InvisibleEye: Mobile Eye Tracking Using Multiple Low-Resolution Cameras and Learning-Based Gaze Estimation}, journal = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)}, year = {2017}, doi = {10.1145/3130971}, volume = {1}, number = {3}, pages = {1--21} } -
Look together: using gaze for assisting co-located collaborative search
Yanxia Zhang, Ken Pfeuffer, Ming Ki Chong, Jason Alexander, Andreas Bulling, Hans Gellersen
Personal and Ubiquitous Computing, 21 (1), pp. 173-186, 2017.
Gaze information provides an indication of users’ focus, which complements remote collaboration tasks, as distant users can see their partner’s focus. In this paper, we apply gaze for co-located collaboration, where users’ gaze locations are presented on the same display, to help collaboration between partners. We integrated various types of gaze indicators on the user interface of a collaborative search system, and we conducted two user studies to understand how gaze enhances coordination and communication between co-located users. Our results show that gaze indeed enhances co-located collaboration, but with a trade-off between visibility of gaze indicators and user distraction. Users acknowledged that seeing gaze indicators eases communication, because it lets them be aware of their partner’s interests and attention. However, users can be reluctant to share their gaze information due to trust and privacy concerns, as gaze potentially divulges their interests.Paper: zhang17_puc.pdf@article{zhang17_puc, title = {Look together: using gaze for assisting co-located collaborative search}, author = {Zhang, Yanxia and Pfeuffer, Ken and Chong, Ming Ki and Alexander, Jason and Bulling, Andreas and Gellersen, Hans}, year = {2017}, journal = {Personal and Ubiquitous Computing}, publisher = {Springer}, volume = {21}, number = {1}, pages = {173-186}, doi = {10.1007/s00779-016-0969-x} }
Conference Papers
-
Gaze Embeddings for Zero-Shot Image Classification
Nour Karessli, Zeynep Akata, Bernt Schiele, Andreas Bulling
Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6412-6421, 2017.
Spotlight Presentation
Zero-shot image classification using auxiliary information, such as attributes describing discriminative object properties, requires time-consuming annotation by domain experts. We instead propose a method that relies on human gaze as auxiliary information, exploiting that even non-expert users have a natural ability to judge class membership. We present a data collection paradigm that involves a discrimination task to increase the information content obtained from gaze data. Our method extracts discriminative descriptors from the data and learns a compatibility function between image and gaze using three novel gaze embeddings: Gaze Histograms (GH), Gaze Features with Grid (GFG) and Gaze Features with Sequence (GFS). We introduce two new gaze-annotated datasets for fine-grained image classification and show that human gaze data is indeed class discriminative, provides a competitive alternative to expert-annotated attributes, and outperforms other baselines for zero-shot image classification.Paper: karessli17_cvpr.pdf@inproceedings{karessli17_cvpr, title = {Gaze Embeddings for Zero-Shot Image Classification}, author = {Karessli, Nour and Akata, Zeynep and Schiele, Bernt and Bulling, Andreas}, year = {2017}, booktitle = {Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR)}, pages = {6412-6421}, doi = {10.1109/CVPR.2017.679} } -
GTmoPass: Two-factor Authentication on Public Displays Using GazeTouch passwords and Personal Mobile Devices
Mohamed Khamis, Regina Hasholzner, Andreas Bulling, Florian Alt
Proc. ACM International Symposium on Pervasive Displays (PerDis), pp. 1–9, 2017.
As public displays continue to deliver increasingly private and personalized content, there is a need to ensure that only the legitimate users can access private information in sensitive contexts. While public displays can adopt authentication concepts similar to those used on public terminals (e.g., ATMs), authentication in public is subject to a number of risks. Namely, adversaries can uncover a user’s password through (1) shoulder surfing, (2) thermal attacks, or (3) smudge attacks. To address this problem we propose GTmoPass, an authentication architecture that enables multi-factor user authentication on public displays. The first factor is a knowledge-factor: we employ a shoulder-surfing resilient multimodal scheme that combines gaze and touch input for password entry. The second factor is a possession-factor: users utilize their personal mobile devices, on which they enter the password. Credentials are securely transmitted to a server via Bluetooth beacons. We describe the implementation of GTmoPass and report on an evaluation of its usability and security, which shows that although authentication using GTmoPass is slightly slower than traditional methods, it protects against the three aforementioned threats.Paper: khamis17_perdis.pdf@inproceedings{khamis17_perdis, title = {GTmoPass: Two-factor Authentication on Public Displays Using GazeTouch passwords and Personal Mobile Devices}, author = {Khamis, Mohamed and Hasholzner, Regina and Bulling, Andreas and Alt, Florian}, doi = {10.1145/3078810.3078815}, year = {2017}, pages = {1--9}, booktitle = {Proc. ACM International Symposium on Pervasive Displays (PerDis)} } -
EyeScout: Active Eye Tracking for Position and Movement Independent Gaze Interaction with Large Public Displays
Mohamed Khamis, Axel Hoesl, Alexander Klimczak, Martin Reiss, Florian Alt, Andreas Bulling
Proc. ACM Symposium on User Interface Software and Technology (UIST), pp. 155-166, 2017.
While gaze holds a lot of promise for hands-free interaction with public displays, remote eye trackers with their confined tracking box restrict users to a single stationary position in front of the display. We present EyeScout, an active eye tracking system that combines an eye tracker mounted on a rail system with a computational method to automatically detect and align the tracker with the user’s lateral movement. EyeScout addresses key limitations of current gaze-enabled large public displays by offering two novel gaze-interaction modes for a single user: In "Walk then Interact" the user can walk up to an arbitrary position in front of the display and interact, while in "Walk and Interact" the user can interact even while on the move. We report on a user study that shows that EyeScout is well perceived by users, extends a public display’s sweet spot into a sweet line, and reduces gaze interaction kick-off time to 3.5 seconds - a 62% improvement over state-of-the-art solutions. We discuss sample applications that demonstrate how EyeScout can enable position and movement-independent gaze interaction with large public displays.Paper: khamis17_uist.pdf@inproceedings{khamis17_uist, title = {EyeScout: Active Eye Tracking for Position and Movement Independent Gaze Interaction with Large Public Displays}, author = {Khamis, Mohamed and Hoesl, Axel and Klimczak, Alexander and Reiss, Martin and Alt, Florian and Bulling, Andreas}, year = {2017}, pages = {155-166}, doi = {10.1145/3126594.3126630}, booktitle = {Proc. ACM Symposium on User Interface Software and Technology (UIST)}, video = {https://www.youtube.com/watch?v=J7_OiRqsmdM} } -
GazeTouchPIN: Protecting Sensitive Data on Mobile Devices using Secure Multimodal Authentication
Mohamed Khamis, Mariam Hassib, Emanuel Zezschwitz, Andreas Bulling, Florian Alt
Proc. ACM International Conference on Multimodal Interaction (ICMI), pp. 446-450, 2017.
Although mobile devices provide access to a plethora of sensitive data, most users still only protect them with PINs or patterns, which are vulnerable to side-channel attacks (e.g., shoulder surfing). However, prior research has shown that privacy-aware users are willing to take further steps to protect their private data. We propose GazeTouchPIN, a novel secure authentication scheme for mobile devices that combines gaze and touch input. Our multimodal approach complicates shoulder-surfing attacks by requiring attackers to observe the screen as well as the user’s eyes to find the password. We evaluate the security and usability of GazeTouchPIN in two user studies (N=30). We found that while GazeTouchPIN requires longer entry times, privacy aware users would use it on-demand when feeling observed or when accessing sensitive data. The results show that successful shoulder surfing attack rate drops from 68% to 10.4% when using GazeTouchPIN.Paper: khamis17_icmi.pdf@inproceedings{khamis17_icmi, title = {GazeTouchPIN: Protecting Sensitive Data on Mobile Devices using Secure Multimodal Authentication}, author = {Khamis, Mohamed and Hassib, Mariam and von Zezschwitz, Emanuel and Bulling, Andreas and Alt, Florian}, year = {2017}, pages = {446-450}, doi = {10.1145/3136755.3136809}, booktitle = {Proc. ACM International Conference on Multimodal Interaction (ICMI)}, video = {https://www.youtube.com/watch?v=gs2YO0gP4kI} } -
They are all after you: Investigating the Viability of a Threat Model that involves Multiple Shoulder Surfers
Mohamed Khamis, Linda Bandelow, Stina Schick, Dario Casadevall, Andreas Bulling, Florian Alt
Proc. International Conference on Mobile and Ubiquitous Multimedia (MUM), pp. 1–5, 2017.
Best Paper Honourable Mention Award
Many recently proposed authentication schemes for mobile devices complicate shoulder surfing by splitting the attacker’s attention across two or more entities. For example, multimodal authentication schemes such as GazeTouchPIN and GazeTouchPass require attackers to observe the user’s gaze input and the touch input performed on the phone’s screen. These schemes have always been evaluated against single observers, while multiple observers could potentially attack these schemes with greater ease, since each of them can focus exclusively on one part of the password. In this work, we study the effectiveness of a novel threat model against authentication schemes that split the attacker’s attention. As a case study, we report on a security evaluation of two state-of-the-art authentication schemes in the case of a team of two observers. Our results show that although multiple observers perform better against these schemes than single observers, multimodal schemes are significantly more secure against multiple observers compared to schemes that employ a single modality. We discuss how this threat model impacts the design of authentication schemes.Paper: khamis17_mum.pdf@inproceedings{khamis17_mum, title = {They are all after you: Investigating the Viability of a Threat Model that involves Multiple Shoulder Surfers}, author = {Khamis, Mohamed and Bandelow, Linda and Schick, Stina and Casadevall, Dario and Bulling, Andreas and Alt, Florian}, year = {2017}, doi = {10.1145/3152832.3152851}, pages = {1--5}, booktitle = {Proc. International Conference on Mobile and Ubiquitous Multimedia (MUM)} } -
Noticeable or Distractive? A Design Space for Gaze-Contingent User Interface Notifications
Michaela Klauck, Yusuke Sugano, Andreas Bulling
Ext. Abstr. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1779-1786, 2017.
Users are interrupted by an ever-increasing number of notifications, ranging from error messages, to new email or chat alerts, to advertisement pop-ups. We explore gaze-contingent user interface notifications that are shown depending on users’ current gaze location. Specifically, we evaluate how different design properties influence notification noticeability and distractiveness. We measure noticeability quantitatively by analyzing participants’ performance in confirming notifications and distractiveness using a questionnaire. Based on a 12-participant user study on a public display, we show that each of these properties affects noticeability and distractiveness differently and that the properties, in turn, allow for fine-grained optimization of notification display. These findings inform the design of future attentive user interfaces that could optimize the trade-off between, for example, the notification importance and the cost of interruption.Paper: klauck17_chi.pdf@inproceedings{klauck17_chi, author = {Klauck, Michaela and Sugano, Yusuke and Bulling, Andreas}, title = {Noticeable or Distractive? A Design Space for Gaze-Contingent User Interface Notifications}, booktitle = {Ext. Abstr. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2017}, pages = {1779-1786}, doi = {10.1145/3027063.3053085} } -
EyeMirror: Mobile Calibration-Free Gaze Approximation using Corneal Imaging
Christian Lander, Sven Gehring, Markus Löchtefeld, Andreas Bulling, Antonio Krüger
Proc. International Conference on Mobile and Ubiquitous Multimedia (MUM), pp. 1–13, 2017.
Gaze is a powerful measure of people’s attention and reveals where we are looking within our current field of view. Hence, gaze-based interfaces are gaining in importance. However, gaze estimation usually requires extensive hardware and depends on a calibration that has to be renewed regularly. We present EyeMirror, a mobile device for calibration-free gaze approximation on surfaces (e.g. displays). It consists of a head-mounted camera, connected to a wearable mini-computer, capturing the environment reflected on the human cornea. The corneal images are analyzed using natural feature tracking for gaze estimation on surfaces. In two lab studies we compared variations of EyeMirror against established methods for gaze estimation in a display scenario, and investigated the effect of display content (i.e. number of features). EyeMirror achieved 4.03° gaze estimation error, while we found no significant effect of display content.Paper: lander17_mum.pdf@inproceedings{lander17_mum, title = {EyeMirror: Mobile Calibration-Free Gaze Approximation using Corneal Imaging}, author = {Lander, Christian and Gehring, Sven and L{\"{o}}chtefeld, Markus and Bulling, Andreas and Kr{\"{u}}ger, Antonio}, year = {2017}, pages = {1--13}, doi = {10.1145/3152832.3152839}, booktitle = {Proc. International Conference on Mobile and Ubiquitous Multimedia (MUM)} } -
Predicting the Category and Attributes of Visual Search Targets Using Deep Gaze Pooling
Hosnieh Sattar, Andreas Bulling, Mario Fritz
Proc. IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 2740-2748, 2017.
Predicting the target of visual search from eye fixation (gaze) data is a challenging problem with many applications in human-computer interaction. In contrast to previous work that has focused on individual instances as search target, we propose the first approach to predict categories and attributes of search targets based on gaze data. However, state of the art models for categorical recognition in general require large amounts of training data, which is prohibitive for gaze data. To address this challenge, we propose a novel Gaze Pooling Layer that integrates gaze information into CNN-based architectures as an attention mechanism - incorporating both spatial and temporal aspects of human gaze behavior. We show that our approach is effective even when the gaze pooling layer is added to an already trained CNN, thus eliminating the need for expensive joint data collection of visual and gaze data. We propose an experimental setup and data set and demonstrate the effectiveness of our method for search target prediction based on gaze behavior. We further study how to integrate temporal and spatial gaze information most effectively, and indicate directions for future research in gaze-based prediction of mental states.Paper: sattar17_iccvw.pdf@inproceedings{sattar17_iccvw, title = {Predicting the Category and Attributes of Visual Search Targets Using Deep Gaze Pooling}, author = {Sattar, Hosnieh and Bulling, Andreas and Fritz, Mario}, year = {2017}, pages = {2740-2748}, doi = {10.1109/ICCVW.2017.322}, booktitle = {Proc. IEEE International Conference on Computer Vision Workshops (ICCVW)} } -
It’s Written All Over Your Face: Full-Face Appearance-Based Gaze Estimation
Xucong Zhang, Yusuke Sugano, Mario Fritz, Andreas Bulling
Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 2299-2308, 2017.
Eye gaze is an important non-verbal cue for human affect analysis. Recent gaze estimation work indicated that information from the full face region can benefit performance. Pushing this idea further, we propose an appearance-based method that, in contrast to a long-standing line of work in computer vision, only takes the full face image as input. Our method encodes the face image using a convolutional neural network with spatial weights applied on the feature maps to flexibly suppress or enhance information in different facial regions. Through extensive evaluation, we show that our full-face method significantly outperforms the state of the art for both 2D and 3D gaze estimation, achieving improvements of up to 14.3% on MPIIGaze and 27.7% on EYEDIAP for person-independent 3D gaze estimation. We further show that this improvement is consistent across different illumination conditions and gaze directions and particularly pronounced for the most challenging extreme head poses.Paper: zhang17_cvprw.pdf@inproceedings{zhang17_cvprw, title = {It's Written All Over Your Face: Full-Face Appearance-Based Gaze Estimation}, author = {Zhang, Xucong and Sugano, Yusuke and Fritz, Mario and Bulling, Andreas}, booktitle = {Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)}, year = {2017}, doi = {10.1109/CVPRW.2017.284}, pages = {2299-2308} } -
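The spatial-weights mechanism can be sketched as follows; this is my own simplified reading rather than the published architecture, and the layer sizes and names are assumptions: a small 1x1-convolution branch predicts one weight per feature-map location, which then scales every channel before the gaze regression head.

```python
# Sketch of the spatial-weights idea (a simplified reading, not the authors'
# code): a 1x1-conv branch predicts an HxW weight map that scales every
# channel of the face feature maps before regression to gaze.
import torch
import torch.nn as nn

class SpatialWeights(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=1), nn.ReLU(),
            nn.Conv2d(64, 1, kernel_size=1), nn.ReLU(),   # one weight per location
        )

    def forward(self, feats):                 # feats: (N, C, H, W)
        w = self.weight_net(feats)            # (N, 1, H, W)
        return feats * w                      # broadcast over channels

feats = torch.randn(2, 256, 13, 13)          # e.g. conv features of the face image
weighted = SpatialWeights(256)(feats)
gaze = nn.Linear(256 * 13 * 13, 2)(weighted.flatten(1))   # yaw/pitch regression head
print(gaze.shape)                             # torch.Size([2, 2])
```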
Everyday Eye Contact Detection Using Unsupervised Gaze Target Discovery
Xucong Zhang, Yusuke Sugano, Andreas Bulling
Proc. ACM Symposium on User Interface Software and Technology (UIST), pp. 193-203, 2017.
Best Paper Honourable Mention Award
Eye contact is an important non-verbal cue in social signal processing and promising as a measure of overt attention in human-object interactions and attentive user interfaces. However, robust detection of eye contact across different users, gaze targets, camera positions, and illumination conditions is notoriously challenging. We present a novel method for eye contact detection that combines a state-of-the-art appearance-based gaze estimator with a novel approach for unsupervised gaze target discovery, i.e. without the need for tedious and time-consuming manual data annotation. We evaluate our method in two real-world scenarios: detecting eye contact at the workplace, including on the main work display, from cameras mounted to target objects, as well as during everyday social interactions with the wearer of a head-mounted egocentric camera. We empirically evaluate the performance of our method in both scenarios and demonstrate its effectiveness for detecting eye contact independent of target object type and size, camera position, and user and recording environment.Paper: zhang17_uist.pdf@inproceedings{zhang17_uist, title = {Everyday Eye Contact Detection Using Unsupervised Gaze Target Discovery}, author = {Zhang, Xucong and Sugano, Yusuke and Bulling, Andreas}, year = {2017}, pages = {193-203}, doi = {10.1145/3126594.3126614}, booktitle = {Proc. ACM Symposium on User Interface Software and Technology (UIST)}, video = {https://www.youtube.com/watch?v=ccrS5XuhQpk} }
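The abstract does not spell out the discovery step, so the following is only a rough, hypothetical sketch of the general idea (clustering per-frame gaze estimates and treating the cluster around the camera-mounted target as eye contact); it is not the paper's algorithm.

```python
# Rough, hypothetical sketch of unsupervised target discovery for eye contact
# detection: cluster per-frame gaze estimates and treat samples in the cluster
# closest to the camera as eye contact. Not the paper's exact method.
import numpy as np
from sklearn.cluster import KMeans

def detect_eye_contact(gaze_xy, n_clusters=3):
    """gaze_xy: (N, 2) estimated on-plane gaze points; returns a boolean mask."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(gaze_xy)
    # Heuristic: the cluster whose centre is closest to the camera origin (0, 0)
    # is assumed to be the gaze target mounted next to the camera.
    target = np.argmin(np.linalg.norm(km.cluster_centers_, axis=1))
    return km.labels_ == target

mask = detect_eye_contact(np.random.randn(500, 2))
print(mask.mean())   # fraction of frames labelled as eye contact
```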
Technical Reports
-
A Multimodal Corpus of Expert Gaze and Behavior during Phonetic Segmentation Tasks
Arif Khan, Ingmar Steiner, Yusuke Sugano, Andreas Bulling, Ross Macdonald
arXiv:1712.04798, pp. 1–4, 2017.
Phonetic segmentation is the process of splitting speech into distinct phonetic units. Human experts routinely perform this task manually by analyzing auditory and visual cues using analysis software, which is an extremely time-consuming process. Methods exist for automatic segmentation, but these are not always accurate enough. In order to improve automatic segmentation, we need to model it as closely as possible to the manual segmentation. This corpus is an effort to capture the human segmentation behavior by recording experts performing a segmentation task. We believe that this data will enable us to highlight the important aspects of manual segmentation, which can be used in automatic segmentation to improve its accuracy.Paper: khan17_arxiv.pdfPaper Access: https://arxiv.org/abs/1712.04798@techreport{khan17_arxiv, title = {A Multimodal Corpus of Expert Gaze and Behavior during Phonetic Segmentation Tasks}, author = {Khan, Arif and Steiner, Ingmar and Sugano, Yusuke and Bulling, Andreas and Macdonald, Ross}, year = {2017}, pages = {1--4}, url = {https://arxiv.org/abs/1712.04798} } -
Visual Decoding of Targets During Visual Search From Human Eye Fixations
Hosnieh Sattar, Mario Fritz, Andreas Bulling
arXiv:1706.05993, pp. 1–9, 2017.
What does human gaze reveal about a user’s intents and to what extent can these intents be inferred or even visualized? Gaze was proposed as an implicit source of information to predict the target of visual search and, more recently, to predict the object class and attributes of the search target. In this work, we go one step further and investigate the feasibility of combining recent advances in encoding human gaze information using deep convolutional neural networks with the power of generative image models to visually decode, i.e. create a visual representation of, the search target. Such visual decoding is challenging for two reasons: 1) the search target only resides in the user’s mind as a subjective visual pattern, and can most often not even be described verbally by the person, and 2) it is, as of yet, unclear if gaze fixations contain sufficient information for this task at all. We show, for the first time, that visual representations of search targets can indeed be decoded only from human gaze fixations. We propose to first encode fixations into a semantic representation and then decode this representation into an image. We evaluate our method on a recent gaze dataset of 14 participants searching for clothing in image collages and validate the model’s predictions using two human studies. Our results show that users were able to correctly select the categories of the decoded image 62% of the time (chance level = 10%). In our second study we show the importance of a local gaze encoding for decoding visual search targets of users.Paper: sattar17_arxiv.pdfPaper Access: https://arxiv.org/abs/1706.05993@techreport{sattar17_arxiv, title = {Visual Decoding of Targets During Visual Search From Human Eye Fixations}, author = {Sattar, Hosnieh and Fritz, Mario and Bulling, Andreas}, year = {2017}, pages = {1--9}, url = {https://arxiv.org/abs/1706.05993} } -
GazeDirector: Fully Articulated Eye Gaze Redirection in Video
Erroll Wood, Tadas Baltrušaitis, Louis-Philippe Morency, Peter Robinson, Andreas Bulling
arXiv:1704.08763, pp. 1–10, 2017.
We present GazeDirector, a new approach for eye gaze redirection that uses model-fitting. Our method first tracks the eyes by fitting a multi-part eye region model to video frames using analysis-by-synthesis, thereby recovering eye region shape, texture, pose, and gaze simultaneously. It then redirects gaze by 1) warping the eyelids from the original image using a model-derived flow field, and 2) rendering and compositing synthesized 3D eyeballs onto the output image in a photorealistic manner. GazeDirector allows us to change where people are looking without person-specific training data, and with full articulation, i.e. we can precisely specify new gaze directions in 3D. Quantitatively, we evaluate both model-fitting and gaze synthesis, with experiments for gaze estimation and redirection on the Columbia gaze dataset. Qualitatively, we compare GazeDirector against recent work on gaze redirection, showing better results especially for large redirection angles. Finally, we demonstrate gaze redirection on YouTube videos by introducing new 3D gaze targets and by manipulating visual behavior.Paper: wood17_arxiv.pdfPaper Access: https://arxiv.org/abs/1704.08763@techreport{wood17_arxiv, title = {GazeDirector: Fully Articulated Eye Gaze Redirection in Video}, author = {Wood, Erroll and Baltru{\v{s}}aitis, Tadas and Morency, Louis-Philippe and Robinson, Peter and Bulling, Andreas}, year = {2017}, pages = {1--10}, url = {https://arxiv.org/abs/1704.08763} } -
MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation
Xucong Zhang, Yusuke Sugano, Mario Fritz, Andreas Bulling
arXiv:1711.09017, pp. 1–14, 2017.
Learning-based methods are believed to work well for unconstrained gaze estimation, i.e. gaze estimation from a monocular RGB camera without assumptions regarding user, environment, or camera. However, current gaze datasets were collected under laboratory conditions and methods were not evaluated across multiple datasets. Our work makes three contributions towards addressing these limitations. First, we present the MPIIGaze dataset, which contains 213,659 full face images and corresponding ground-truth gaze positions collected from 15 users during everyday laptop use over several months. An experience sampling approach ensured continuous gaze and head poses and realistic variation in eye appearance and illumination. To facilitate cross-dataset evaluations, 37,667 images were manually annotated with eye corners, mouth corners, and pupil centres. Second, we present an extensive evaluation of state-of-the-art gaze estimation methods on three current datasets, including MPIIGaze. We study key challenges including target gaze range, illumination conditions, and facial appearance variation. We show that image resolution and the use of both eyes affect gaze estimation performance, while head pose and pupil centre information are less informative. Finally, we propose GazeNet, the first deep appearance-based gaze estimation method. GazeNet improves on the state of the art by 22% (from a mean error of 13.9 degrees to 10.8 degrees) for the most challenging cross-dataset evaluation.Paper: zhang17_arxiv.pdfPaper Access: https://arxiv.org/abs/1711.09017@techreport{zhang17_arxiv, title = {MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation}, author = {Zhang, Xucong and Sugano, Yusuke and Fritz, Mario and Bulling, Andreas}, year = {2017}, pages = {1--14}, url = {https://arxiv.org/abs/1711.09017} }
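As a rough illustration of appearance-based gaze regression in the spirit of GazeNet (the layer sizes, input resolution, and names below are illustrative assumptions, not the published architecture), a normalised grey-scale eye image and 2D head pose angles can be mapped to yaw/pitch gaze angles like this:

```python
# Minimal appearance-based gaze regressor; layer sizes and names are
# illustrative, not the published GazeNet architecture.
import torch
import torch.nn as nn

class TinyGazeNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 20, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(20, 50, 5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(nn.Linear(50 * 6 * 12 + 2, 500), nn.ReLU(),
                                nn.Linear(500, 2))      # (yaw, pitch) in radians

    def forward(self, eye, head_pose):                  # eye: (N, 1, 36, 60)
        x = self.conv(eye).flatten(1)
        return self.fc(torch.cat([x, head_pose], dim=1))

net = TinyGazeNet()
gaze = net(torch.randn(4, 1, 36, 60), torch.randn(4, 2))
print(gaze.shape)   # torch.Size([4, 2])
```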
2016
Journal Articles
-
Pervasive Attentive User Interfaces
Andreas Bulling
IEEE Computer, 49 (1), pp. 94-98, 2016.
As the number of displays we interact with rapidly increases, managing user attention has emerged as a critical challenge for next-generation human−computer interfaces.doi: 10.1109/MC.2016.32Paper: bulling16_computer.pdf@article{bulling16_computer, title = {Pervasive Attentive User Interfaces}, author = {Bulling, Andreas}, doi = {10.1109/MC.2016.32}, year = {2016}, journal = {IEEE Computer}, volume = {49}, number = {1}, pages = {94-98} } -
EyeWear Computers for Human-Computer Interaction
Andreas Bulling, Kai Kunze
ACM Interactions, 23 (3), pp. 70-73, 2016.
Head-worn displays and eye trackers, augmented and virtual reality glasses, egocentric cameras, and other "smart eyewear" have recently emerged as a research platform in fields such as ubiquitous computing, computer vision, and cognitive and social science. While earlier generations of devices were too bulky to be worn regularly, recent technological advances have made eyewear unobtrusive and lightweight, and therefore more suitable for daily use. Given that many human senses are located on the head, smart eyewear provides opportunities for types of interaction that were impossible before now. In this article, we highlight the potential of eyewear computing for HCI, discuss available input and output modalities, and suggest the most promising future directions for eyewear computing research, namely multimodal user modeling, lifelong learning, and large-scale (collective) human-behavior sensing and analysis.doi: 10.1145/2912886Paper: bulling16_interactions.pdf@article{bulling16_interactions, title = {EyeWear Computers for Human-Computer Interaction}, author = {Bulling, Andreas and Kunze, Kai}, year = {2016}, journal = {ACM Interactions}, volume = {23}, number = {3}, doi = {10.1145/2912886}, pages = {70-73} } -
Eyewear Computing – Augmenting the Human with Head-mounted Wearable Assistants (Dagstuhl Seminar 16042)
Andreas Bulling, Ozan Cakmakci, Kai Kunze, James M. Rehg
Dagstuhl Reports, 6 (1), pp. 160–206, 2016.
Paper: bulling16_dagstuhl.pdfPaper Access: http://drops.dagstuhl.de/opus/volltexte/2016/5820@article{bulling16_dagstuhl, author = {Bulling, Andreas and Cakmakci, Ozan and Kunze, Kai and Rehg, James M.}, title = {{Eyewear Computing – Augmenting the Human with Head-mounted Wearable Assistants (Dagstuhl Seminar 16042)}}, pages = {160--206}, journal = {Dagstuhl Reports}, year = {2016}, volume = {6}, number = {1}, editor = {Bulling, Andreas and Cakmakci, Ozan and Kunze, Kai and Rehg, James M.}, publisher = {Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik}, address = {Dagstuhl, Germany}, url = {http://drops.dagstuhl.de/opus/volltexte/2016/5820}, doi = {10.4230/DagRep.6.1.160} } -
Pupil detection for head-mounted eye tracking in the wild: an evaluation of the state of the art
Wolfgang Fuhl, Marc Tonsen, Andreas Bulling, Enkelejda Kasneci
Springer Machine Vision and Applications, 27, pp. 1275-1288, 2016.
Robust and accurate detection of the pupil position is a key building block for head-mounted eye tracking and a prerequisite for applications on top, such as gaze-based human-computer interaction or attention analysis. Despite a large body of work, detecting the pupil in images recorded under real-world conditions is challenging given significant variability in eye appearance (e.g., illumination, reflections, occlusions, etc.), individual differences in eye physiology, as well as other sources of noise, such as contact lenses or make-up. In this paper we review six state-of-the-art pupil detection methods, namely ElSe, ExCuSe, Pupil Labs, SET, Starburst, and Swirski. We compare their performance on a large-scale dataset consisting of 225,569 annotated eye images taken from four publicly available datasets. Our experimental results show that the algorithm ElSe outperforms other pupil detection methods by a large margin, thus offering robust and accurate pupil positions on challenging everyday eye images.Paper: fuhl16_mvap.pdf@article{fuhl16_mvap, title = {Pupil detection for head-mounted eye tracking in the wild: an evaluation of the state of the art}, author = {Fuhl, Wolfgang and Tonsen, Marc and Bulling, Andreas and Kasneci, Enkelejda}, year = {2016}, pages = {1275-1288}, doi = {10.1007/s00138-016-0776-4}, journal = {Springer Machine Vision and Applications}, volume = {27} }
Conference Papers
-
Attention, please! Comparing Features for Measuring Audience Attention Towards Pervasive Displays
Florian Alt, Andreas Bulling, Lukas Mecke, Daniel Buschek
Proc. ACM SIGCHI Conference on Designing Interactive Systems (DIS), pp. 823-828, 2016.
Measuring audience attention towards pervasive displays is important but accurate measurement in real time remains a significant sensing challenge. Consequently, researchers and practitioners typically use other features, such as face presence, as a proxy. We provide a principled comparison of the performance of six features and their combinations for measuring attention: face presence, movement trajectory, walking speed, shoulder orientation, head pose, and gaze direction. We implemented a prototype that is capable of capturing this rich set of features from video and depth camera data. Using a controlled lab experiment (N=18) we show that as a single feature, face presence is indeed among the most accurate. We further show that accuracy can be increased through a combination of features (+10.3%), knowledge about the audience (+63.8%), as well as user identities (+69.0%). Our findings are valuable for display providers who want to collect data on display effectiveness or build interactive, responsive apps.Paper: alt16_dis.pdf@inproceedings{alt16_dis, author = {Alt, Florian and Bulling, Andreas and Mecke, Lukas and Buschek, Daniel}, title = {Attention, please! Comparing Features for Measuring Audience Attention Towards Pervasive Displays}, booktitle = {Proc. ACM SIGCHI Conference on Designing Interactive Systems (DIS)}, year = {2016}, doi = {10.1145/2901790.2901897}, pages = {823-828} } -
Memorability of Cued-Recall Graphical Passwords with Saliency Masks
Florian Alt, Mateusz Mikusz, Stefan Schneegass, Andreas Bulling
Proc. International Conference on Mobile and Ubiquitous Multimedia (MUM), pp. 191-200, 2016.
Cued-recall graphical passwords have a lot of potential for secure user authentication, particularly if combined with saliency masks to prevent users from selecting weak passwords. Saliency masks exclude those areas of the image that are most likely to lead to hotspots and were shown to significantly improve password security. In this paper we investigate the impact of such saliency masks on the memorability of cued-recall graphical passwords. We first conduct two pre-studies with 52 participants to obtain a set of images with three different image complexities as well as real passwords. Based on a month-long user study with 26 participants we then show that cued-recall graphical passwords defined on a single image with a saliency mask are not more difficult to remember than those without saliency mask, and that the complexity of the password images does not have any influence on password memorability. These results complement prior work on the security of such passwords and underline the potential of saliency masks as both a secure and usable improvement to cued-recall gaze-based graphical passwords.Paper: alt16_mum.pdf@inproceedings{alt16_mum, title = {Memorability of Cued-Recall Graphical Passwords with Saliency Masks}, author = {Alt, Florian and Mikusz, Mateusz and Schneegass, Stefan and Bulling, Andreas}, year = {2016}, doi = {10.1145/3012709.3012730}, pages = {191-200}, booktitle = {Proc. International Conference on Mobile and Ubiquitous Multimedia (MUM)} } -
Prediction of Gaze Estimation Error for Error-Aware Gaze-Based Interfaces
Michael Barz, Florian Daiber, Andreas Bulling
Proc. International ACM Symposium on Eye Tracking Research and Applications (ETRA), pp. 275-278, 2016.
Gaze estimation error is inherent in head-mounted eye trackers and seriously impacts performance, usability, and user experience of gaze-based interfaces. Particularly in mobile settings, this error varies constantly as users move in front of and look at different parts of a display. We envision a new class of gaze-based interfaces that are aware of the gaze estimation error and adapt to it in real time. As a first step towards this vision we introduce an error model that is able to predict the gaze estimation error. Our method covers major building blocks of mobile gaze estimation, specifically mapping of pupil positions to scene camera coordinates, marker-based display detection, and mapping of gaze from scene camera to on-screen coordinates. We develop our model through a series of principled measurements of a state-of-the-art head-mounted eye tracker.Paper: barz16_etra.pdf@inproceedings{barz16_etra, author = {Barz, Michael and Daiber, Florian and Bulling, Andreas}, title = {Prediction of Gaze Estimation Error for Error-Aware Gaze-Based Interfaces}, booktitle = {Proc. International ACM Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2016}, pages = {275-278}, doi = {10.1145/2857491.2857493} } -
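For readers unfamiliar with the mobile gaze estimation pipeline whose errors are modelled here, the following toy sketch shows the two mapping stages named in the abstract (a polynomial pupil-to-scene mapping and a homography onto the display); the feature choice, synthetic data, and identity homography are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the two mapping stages of a head-mounted gaze estimation pipeline
# (simplified, illustrative): polynomial regression maps pupil positions to
# scene-camera coordinates, and a (here assumed known) homography maps scene
# coordinates onto the display.
import numpy as np

def fit_polynomial_mapping(pupil, scene):
    """Fit a quadratic mapping; pupil, scene: (N, 2) calibration pairs."""
    x, y = pupil[:, 0], pupil[:, 1]
    A = np.column_stack([np.ones_like(x), x, y, x * y, x**2, y**2])
    coeffs, *_ = np.linalg.lstsq(A, scene, rcond=None)   # (6, 2)
    return lambda p: np.column_stack(
        [np.ones(len(p)), p[:, 0], p[:, 1], p[:, 0] * p[:, 1], p[:, 0]**2, p[:, 1]**2]
    ) @ coeffs

def to_display(scene_pts, H):
    """Apply a 3x3 homography (stand-in for marker-based display detection)."""
    homog = np.column_stack([scene_pts, np.ones(len(scene_pts))]) @ H.T
    return homog[:, :2] / homog[:, 2:]

# Toy usage with synthetic calibration data and an identity homography.
pupil = np.random.rand(25, 2)
scene = pupil * 640 + np.random.randn(25, 2)             # pretend ground truth
gaze_in_scene = fit_polynomial_mapping(pupil, scene)(pupil)
on_screen = to_display(gaze_in_scene, np.eye(3))
print(np.sqrt(((on_screen - scene) ** 2).sum(axis=1)).mean())  # mean error in px
```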
Xplore-M-Ego: Contextual Media Retrieval Using Natural Language Queries
Sreyasi Nag Chowdhury, Mateusz Malinowski, Andreas Bulling, Mario Fritz
Proc. ACM International Conference on Multimedia Retrieval (ICMR), pp. 243-247, 2016.
The widespread integration of cameras in hand-held and head-worn devices and the ability to share content online enables a large and diverse visual capture of the world that millions of users build up collectively every day. We envision these images as well as associated meta information, such as GPS coordinates and timestamps, to form a collective visual memory that can be queried while automatically taking the ever-changing context of mobile users into account. As a first step towards this vision, in this work we present Xplore-M-Ego: a novel media retrieval system that allows users to query a dynamic database of images using spatio-temporal natural language queries. We evaluate our system using a new dataset of real image queries as well as through a usability study. One key finding is that there is a considerable amount of inter-user variability in the resolution of spatial relations in natural language utterances. We show that our system can cope with this variability using personalisation through an online learning-based retrieval formulation.Paper: chowdhury16_icmr.pdf@inproceedings{chowdhury16_icmr, title = {Xplore-M-Ego: Contextual Media Retrieval Using Natural Language Queries}, author = {Chowdhury, Sreyasi Nag and Malinowski, Mateusz and Bulling, Andreas and Fritz, Mario}, year = {2016}, booktitle = {Proc. ACM International Conference on Multimedia Retrieval (ICMR)}, doi = {10.1145/2911996.2912044}, pages = {243-247} } -
Smooth Eye Movement Interaction Using EOG Glasses
Murtaza Dhuliawala, Juyoung Lee, Junichi Shimizu, Andreas Bulling, Kai Kunze, Thad Starner, Woontack Woo
Proc. ACM International Conference on Multimodal Interaction (ICMI), pp. 307-311, 2016.
Orbits combines a visual display and an eye motion sensor to allow a user to select between options by tracking a cursor with the eyes as the cursor travels in a circular path around each option. Using an off-the-shelf J!NS MEME pair of eyeglasses, we present a pilot study that suggests that the eye movement required for Orbits can be sensed using three electrodes: one in the nose bridge and one in each nose pad. For forced choice binary selection, we achieve a 2.6 bits per second (bps) input rate at 250ms per input. We also introduce Head Orbits, where the user fixates the eyes on a target and moves the head in synchrony with the orbiting target. Measuring only the relative movement of the eyes in relation to the head, this method achieves a maximum rate of 2.0 bps at 500ms per input. Finally, we combine the two techniques together with a gyro to create an interface with a maximum input rate of 5.0 bps.Paper: dhuliawala16_icmi.pdf@inproceedings{dhuliawala16_icmi, title = {Smooth Eye Movement Interaction Using EOG Glasses}, author = {Dhuliawala, Murtaza and Lee, Juyoung and Shimizu, Junichi and Bulling, Andreas and Kunze, Kai and Starner, Thad and Woo, Woontack}, year = {2016}, doi = {10.1145/2993148.2993181}, pages = {307-311}, booktitle = {Proc. ACM International Conference on Multimodal Interaction (ICMI)} } -
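The selection principle behind Orbits can be sketched as a simple trajectory correlation, shown below under the assumption of a clean 2D eye signal (the real system works on raw EOG data, which this toy version does not model); target names and the threshold are illustrative.

```python
# Illustrative sketch of pursuit-based selection (the general Pursuits/Orbits
# principle, not the J!NS MEME signal processing): the target whose circular
# trajectory correlates best with the measured eye signal is selected.
import numpy as np

def pursuit_select(eye_xy, target_trajs, threshold=0.8):
    """eye_xy: (T, 2) eye signal; target_trajs: dict name -> (T, 2) trajectory."""
    scores = {}
    for name, traj in target_trajs.items():
        cx = np.corrcoef(eye_xy[:, 0], traj[:, 0])[0, 1]
        cy = np.corrcoef(eye_xy[:, 1], traj[:, 1])[0, 1]
        scores[name] = min(cx, cy)            # both axes must follow the orbit
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None

t = np.linspace(0, 2 * np.pi, 120)
targets = {"accept": np.column_stack([np.cos(t), np.sin(t)]),
           "reject": np.column_stack([np.cos(-t), np.sin(-t)])}  # opposite direction
eye = targets["accept"] + 0.1 * np.random.randn(120, 2)          # noisy following
print(pursuit_select(eye, targets))   # -> "accept"
```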
GazeTouchPass: Multimodal Authentication Using Gaze and Touch on Mobile Devices
Mohamed Khamis, Florian Alt, Mariam Hassib, Emanuel Zezschwitz, Regina Hasholzner, Andreas Bulling
Ext. Abstr. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 2156-2164, 2016.
We propose a multimodal scheme, GazeTouchPass, that combines gaze and touch for shoulder-surfing resistant user authentication on mobile devices. GazeTouchPass allows passwords with multiple switches between input modalities during authentication. This requires attackers to simultaneously observe the device screen and the user’s eyes to find the password. We evaluate the security and usability of GazeTouchPass in two user studies. Our findings show that GazeTouchPass is usable and significantly more secure than single-modal authentication against basic and even advanced shoulder-surfing attacks.Paper: khamis16_chi.pdf@inproceedings{khamis16_chi, author = {Khamis, Mohamed and Alt, Florian and Hassib, Mariam and von Zezschwitz, Emanuel and Hasholzner, Regina and Bulling, Andreas}, title = {GazeTouchPass: Multimodal Authentication Using Gaze and Touch on Mobile Devices}, booktitle = {Ext. Abstr. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2016}, pages = {2156-2164}, doi = {10.1145/2851581.2892314} } -
TextPursuits: Using Text for Pursuits-Based Interaction and Calibration on Public Displays
Mohamed Khamis, Ozan Saltuk, Alina Hang, Katharina Stolz, Andreas Bulling, Florian Alt
Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 274-285, 2016.
Pursuits, a technique that correlates users’ eye movements with moving on-screen targets, was recently introduced for calibration-free interaction with public displays. While prior work used abstract objects or dots as targets, we explore the use of Pursuits with text (read-and-pursue). Given that much of the content on public displays includes text, designers could greatly benefit from users being able to spontaneously interact and implicitly calibrate an eye tracker while simply reading text on a display. At the same time, using Pursuits with textual content is challenging given that the eye movements performed while reading interfere with the pursuit movements. We present two systems, EyeVote and Read2Calibrate, that enable spontaneous gaze interaction and implicit calibration by reading text. Results from two user studies (N=37) show that Pursuits with text is feasible and can achieve similar accuracy to non-text-based pursuit approaches. While calibration is less accurate, it integrates smoothly with reading and makes it possible to identify areas of the display the user is looking at.Paper: khamis16_ubicomp.pdf@inproceedings{khamis16_ubicomp, author = {Khamis, Mohamed and Saltuk, Ozan and Hang, Alina and Stolz, Katharina and Bulling, Andreas and Alt, Florian}, booktitle = {Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, title = {TextPursuits: Using Text for Pursuits-Based Interaction and Calibration on Public Displays}, year = {2016}, doi = {10.1145/2971648.2971679}, pages = {274-285} } -
Challenges and Design Space of Gaze-enabled Public Displays
Mohamed Khamis, Florian Alt, Andreas Bulling
Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 1736-1745, 2016.
Gaze is an attractive modality for public displays, and recent years have consequently seen an increase in deployments of gaze-enabled public displays. Although gaze has been thoroughly investigated for desktop scenarios, gaze-enabled public displays present new challenges that are unique to this setup. In contrast to desktop settings, public displays (1) cannot afford requiring eye tracker calibration, (2) expect users to interact from different positions, and (3) expect multiple users to interact simultaneously. In this work we discuss these challenges, and explore the design space of gaze-enabled public displays. We conclude by discussing how the current state of research stands with respect to the identified challenges, and highlight directions for future work.Paper: khamis16_petmei.pdf@inproceedings{khamis16_petmei, title = {Challenges and Design Space of Gaze-enabled Public Displays}, author = {Khamis, Mohamed and Alt, Florian and Bulling, Andreas}, year = {2016}, pages = {1736-1745}, doi = {10.1145/2968219.2968342}, booktitle = {Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)} } -
EyeVote in the Wild: Do Users bother Correcting System Errors on Public Displays?
Mohamed Khamis, Ludwig Trotter, Markus Tessman, Christina Dannhart, Andreas Bulling, Florian Alt
Proc. International Conference on Mobile and Ubiquitous Multimedia (MUM), pp. 57-62, 2016.
Although recovering from errors is straightforward on most interfaces, public display systems pose unique design challenges. Namely, public display users interact for very short amounts of time and are believed to abandon the display when interrupted or forced to deviate from the main task. To date, it is not well understood whether public display designers should enable users to correct errors (e.g. by asking users to confirm or giving them a chance to correct their input), or aim for faster interaction and rely on other types of feedback to estimate errors. To close this gap, we conducted a field study where we investigated users’ willingness to correct their input on public displays. We report on our findings from an in-the-wild deployment of a public gaze-based voting system where we intentionally evoked system errors to see whether users would correct them. We found that public display users are willing to correct system errors provided that the correction is fast and straightforward. We discuss how our findings influence the choice of interaction methods for public displays; interaction methods that are highly usable but suffer from low accuracy can still be effective if users can "undo" their interactions.Paper: khamis16_mum.pdf@inproceedings{khamis16_mum, title = {EyeVote in the Wild: Do Users bother Correcting System Errors on Public Displays?}, author = {Khamis, Mohamed and Trotter, Ludwig and Tessman, Markus and Dannhart, Christina and Bulling, Andreas and Alt, Florian}, year = {2016}, doi = {10.1145/3012709.3012743}, pages = {57-62}, booktitle = {Proc. International Conference on Mobile and Ubiquitous Multimedia (MUM)} } -
On the Verge: Voluntary Convergences for Accurate and Precise Timing of Gaze Input
Dominik Kirst, Andreas Bulling
Ext. Abstr. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1519-1525, 2016.
The problem of triggering input accurately (with a small temporal offset) and precisely (with high repeatability) at a specific point in time has so far been largely ignored in gaze interaction research. We explore voluntary eye convergences as a novel interaction technique for precise and accurate timing of gaze input and a solution to the "Midas touch" problem, i.e. the accidental triggering of input when looking at an interface. We introduce a novel clock paradigm to study input timing and demonstrate that voluntary convergences are significantly more accurate and precise than common gaze dwelling. Our findings suggest that voluntary convergences are well-suited for applications in which timing of user input is important, thereby complementing existing gaze techniques that focus on speed and spatial precision.Paper: kirst16_chi.pdf@inproceedings{kirst16_chi, author = {Kirst, Dominik and Bulling, Andreas}, title = {On the Verge: Voluntary Convergences for Accurate and Precise Timing of Gaze Input}, booktitle = {Ext. Abstr. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2016}, pages = {1519-1525}, doi = {10.1145/2851581.2892307} } -
3D Gaze Estimation from 2D Pupil Positions on Monocular Head-Mounted Eye Trackers
Mohsen Mansouryar, Julian Steil, Yusuke Sugano, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 197-200, 2016.
3D gaze information is important for scene-centric attention analysis, but accurate estimation and analysis of 3D gaze in real-world environments remains challenging. We present a novel 3D gaze estimation method for monocular head-mounted eye trackers. In contrast to previous work, our method does not aim to infer 3D eyeball poses, but directly maps 2D pupil positions to 3D gaze directions in scene camera coordinate space. We first provide a detailed discussion of the 3D gaze estimation task and summarize different methods, including our own. We then evaluate the performance of different 3D gaze estimation approaches using both simulated and real data. Through experimental validation, we demonstrate the effectiveness of our method in reducing parallax error, and we identify research challenges for the design of 3D calibration procedures.Paper: mansouryar16_etra.pdf@inproceedings{mansouryar16_etra, author = {Mansouryar, Mohsen and Steil, Julian and Sugano, Yusuke and Bulling, Andreas}, title = {3D Gaze Estimation from 2D Pupil Positions on Monocular Head-Mounted Eye Trackers}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2016}, pages = {197-200}, doi = {10.1145/2857491.2857530} } -
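A toy version of the direct 2D-to-3D mapping idea might look as follows; the quadratic feature set and the synthetic calibration data are my own assumptions and this is not the paper's implementation:

```python
# Simplified sketch of a direct 2D-to-3D mapping: regress 3D gaze directions
# in scene-camera coordinates from 2D pupil positions and renormalise to unit
# vectors. Illustrative only, not the paper's method.
import numpy as np

def fit_pupil_to_gaze(pupil_2d, gaze_3d):
    """pupil_2d: (N, 2) pupil positions; gaze_3d: (N, 3) unit gaze directions."""
    x, y = pupil_2d[:, 0], pupil_2d[:, 1]
    A = np.column_stack([np.ones_like(x), x, y, x * y, x**2, y**2])
    W, *_ = np.linalg.lstsq(A, gaze_3d, rcond=None)       # (6, 3)
    def predict(p):
        feats = np.column_stack([np.ones(len(p)), p[:, 0], p[:, 1],
                                 p[:, 0] * p[:, 1], p[:, 0]**2, p[:, 1]**2])
        g = feats @ W
        return g / np.linalg.norm(g, axis=1, keepdims=True)  # unit directions
    return predict

# Toy calibration: pupil positions paired with known 3D directions.
pupil = np.random.rand(30, 2)
gaze = np.column_stack([pupil - 0.5, np.ones(30)])
gaze /= np.linalg.norm(gaze, axis=1, keepdims=True)
predict = fit_pupil_to_gaze(pupil, gaze)
print(predict(pupil[:3]))
```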
Concept for Using Eye Tracking in a Head-mounted Display to Adapt Rendering to the User’s Current Visual Field
Daniel Pohl, Xucong Zhang, Andreas Bulling, Oliver Grau
Proc. of the 22nd ACM Conference on Virtual Reality Software and Technology (VRST), pp. 323-324, 2016.
With increasing spatial and temporal resolution in head-mounted displays (HMDs), using eye trackers to adapt rendering to the user is becoming important for handling the rendering workload. Besides using methods like foveated rendering, we propose to use the current visual field for rendering, depending on the eye gaze. We use two effects for performance optimizations. First, we noticed a lens defect in HMDs, where depending on the distance of the eye gaze to the center, certain parts of the screen towards the edges are not visible anymore. Second, if the user looks up, they cannot see the lower parts of the screen anymore. For the invisible areas, we propose to skip rendering and to reuse the pixel colors from the previous frame. We provide a calibration routine to measure these two effects. We apply the current visual field to a renderer and get up to 2x speed-ups.Paper: pohl16_vrst.pdf@inproceedings{pohl16_vrst, title = {Concept for Using Eye Tracking in a Head-mounted Display to Adapt Rendering to the User's Current Visual Field}, author = {Pohl, Daniel and Zhang, Xucong and Bulling, Andreas and Grau, Oliver}, doi = {10.1145/2993369.2996300}, year = {2016}, booktitle = {Proc. of the 22nd ACM Conference on Virtual Reality Software and Technology (VRST)}, pages = {323-324} } -
Combining Eye Tracking with Optimizations for Lens Astigmatism in Modern Wide-Angle HMDs
Daniel Pohl, Xucong Zhang, Andreas Bulling
Proc. IEEE Virtual Reality (VR), pp. 269-270, 2016.
Virtual Reality has hit the consumer market with affordable head-mounted displays. When using these, it quickly becomes apparent that the resolution of the built-in display panels still needs to be highly increased. To overcome the resulting higher performance demands, eye tracking can be used for foveated rendering. However, as there are lens distortions in HMDs, there are more possibilities to increase the performance with smarter rendering approaches. We present a new system using optimizations for rendering considering lens astigmatism and combining this with foveated rendering through eye tracking. Depending on the current eye gaze, this delivers a rendering speed-up of up to 20%.Paper: pohl16_vr.pdf@inproceedings{pohl16_vr, title = {Combining Eye Tracking with Optimizations for Lens Astigmatism in Modern Wide-Angle HMDs}, author = {Pohl, Daniel and Zhang, Xucong and Bulling, Andreas}, year = {2016}, pages = {269-270}, doi = {10.1109/VR.2016.7504757}, booktitle = {Proc. IEEE Virtual Reality (VR)} } -
SkullConduct: Biometric User Identification on Eyewear Computers Using Bone Conduction Through the Skull
Stefan Schneegass, Youssef Oualil, Andreas Bulling
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1379-1384, 2016.
Secure user identification is important for the increasing number of eyewear computers but limited input capabilities pose significant usability challenges for established knowledge-based schemes, such as passwords or PINs. We present SkullConduct, a biometric system that uses bone conduction of sound through the user’s skull as well as a microphone readily integrated into many of these devices, such as Google Glass. At the core of SkullConduct is a method to analyze the characteristic frequency response created by the user’s skull using a combination of Mel Frequency Cepstral Coefficient (MFCC) features as well as a computationally light-weight 1NN classifier. We report on a controlled experiment with 10 participants that shows that this frequency response is person-specific and stable - even when taking off and putting on the device multiple times - and thus serves as a robust biometric. We show that our method can identify users with 97.0% accuracy and authenticate them with an equal error rate of 6.9%, thereby bringing biometric user identification to eyewear computers equipped with bone conduction technology.@inproceedings{schneegass16_chi, title = {SkullConduct: Biometric User Identification on Eyewear Computers Using Bone Conduction Through the Skull}, author = {Schneegass, Stefan and Oualil, Youssef and Bulling, Andreas}, year = {2016}, pages = {1379-1384}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, doi = {10.1145/2858036.2858152}, video = {https://www.youtube.com/watch?v=A4BCnsQmo6c} } -
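The recognition pipeline named in the abstract (MFCC features plus a 1NN classifier) can be sketched with off-the-shelf libraries as below; the sampling rate, the mean-pooling of MFCC frames, and the random stand-in recordings are assumptions, not the authors' code.

```python
# Sketch of an MFCC + 1-NN identification pipeline using off-the-shelf
# libraries; recordings would be the white-noise signal after bone conduction
# through the skull, picked up by the device microphone (random data here).
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

def mfcc_descriptor(signal, sr=16000, n_mfcc=13):
    """Mean MFCC vector of one recording; fixed length regardless of duration."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# Toy data: pretend we have two enrolment recordings per user.
rng = np.random.default_rng(0)
X = np.stack([mfcc_descriptor(rng.standard_normal(16000)) for _ in range(6)])
y = ["alice", "alice", "bob", "bob", "carol", "carol"]

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)       # the light-weight 1NN
probe = mfcc_descriptor(rng.standard_normal(16000))
print(clf.predict([probe])[0])                            # identified user
```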
Gaussian Processes as an Alternative to Polynomial Gaze Estimation Functions
Laura Sesma-Sanchez, Yanxia Zhang, Andreas Bulling, Hans Gellersen
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 229-232, 2016.
Interpolation-based methods are widely used for gaze estimation due to their simplicity. In particular, feature-based methods that map image eye features to gaze are very popular. The most widespread regression function used in this kind of method is polynomial regression. In this paper, we present an alternative regression function to estimate gaze: the Gaussian regression. We show how Gaussian processes can better adapt to the non-linear behavior of eye movements, providing higher gaze estimation accuracies. The Gaussian regression is compared, in a simulated environment, to the polynomial regression, when using the same mapping features, the normalized pupil center-corneal reflection and pupil center-eye corners vectors. This comparison is done for three different screen sizes. The results show that for larger screens, where wider gaze angles are required and the non-linear behavior of the eye is thus more pronounced, the advantage of the Gaussian regression is more evident. Furthermore, we can conclude that, for both types of regressions, the gaze estimation accuracy increases for smaller screens, where the eye movements are more linear.Paper: sesma16_etra.pdf@inproceedings{sesma16_etra, author = {Sesma-Sanchez, Laura and Zhang, Yanxia and Bulling, Andreas and Gellersen, Hans}, title = {Gaussian Processes as an Alternative to Polynomial Gaze Estimation Functions}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2016}, pages = {229-232}, doi = {10.1145/2857491.2857509} } -
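A minimal, self-contained comparison in the spirit of this paper could look as follows; the synthetic non-linearity, the specific RBF kernel, and the polynomial degree are illustrative assumptions rather than the simulation used in the paper.

```python
# Minimal comparison of polynomial vs. Gaussian process regression for mapping
# a normalised pupil-centre-corneal-reflection feature to on-screen gaze
# (synthetic data, illustrative choices).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
pccr = rng.uniform(-1, 1, size=(200, 2))                            # feature vectors
gaze = np.tanh(1.5 * pccr) + 0.01 * rng.standard_normal((200, 2))   # non-linear mapping

train, test = slice(0, 150), slice(150, None)
poly = make_pipeline(PolynomialFeatures(2), LinearRegression()).fit(pccr[train], gaze[train])
gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True).fit(pccr[train], gaze[train])

for name, model in [("polynomial", poly), ("gaussian process", gp)]:
    err = np.linalg.norm(model.predict(pccr[test]) - gaze[test], axis=1).mean()
    print(f"{name}: mean error {err:.4f}")
```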
Solar System: Smooth Pursuit Interactions Using EOG Glasses
Junichi Shimizu, Juyoung Lee, Murtaza Dhuliawala, Andreas Bulling, Thad Starner, Woontack Woo, Kai Kunze
Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 369-372, 2016.
Solar System implements smooth pursuit eye movement interactions on commercial smart glasses using electrooculography. The system requires no calibration and little to no training. We present a prototype implementation, describe initial user tests and show several application scenarios for hands-free eye gaze interactions.Paper: shimizu16_ubicomp.pdf@inproceedings{shimizu16_ubicomp, author = {Shimizu, Junichi and Lee, Juyoung and Dhuliawala, Murtaza and Bulling, Andreas and Starner, Thad and Woo, Woontack and Kunze, Kai}, title = {Solar System: Smooth Pursuit Interactions Using EOG Glasses}, booktitle = {Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, year = {2016}, pages = {369-372}, doi = {10.1145/2968219.2971376}, keywords = {eye tracking, gaze interaction, wearable computing} } -
Three-Point Interaction: Combining Bi-manual Direct Touch with Gaze
Adalberto Simeone, Andreas Bulling, Jason Alexander, Hans Gellersen
Proc. International Conference on Advanced Visual Interfaces (AVI), pp. 168-175, 2016.
The benefits of two-point interaction for tasks that require users to simultaneously manipulate multiple entities or dimensions are widely known. Two-point interaction has become common, e.g., when zooming or pinching using two fingers on a smartphone. We propose a novel interaction technique that implements three-point interaction by augmenting two-finger direct touch with gaze as a third input channel. We evaluate two key characteristics of our technique in two multi-participant user studies. In the first, we used the technique for object selection. In the second, we evaluate it in a 3D matching task that requires simultaneous continuous input from fingers and the eyes. Our results show that in both cases participants learned to interact with three input channels without cognitive or mental overload. Participants’ performance tended towards fast selection times in the first study and exhibited parallel interaction in the second. These results are promising and show that there is scope for additional input channels beyond two-point interaction.Paper: simeone16_avi.pdf@inproceedings{simeone16_avi, author = {Simeone, Adalberto and Bulling, Andreas and Alexander, Jason and Gellersen, Hans}, title = {Three-Point Interaction: Combining Bi-manual Direct Touch with Gaze}, booktitle = {Proc. International Conference on Advanced Visual Interfaces (AVI)}, year = {2016}, pages = {168-175}, doi = {10.1145/2909132.2909251} } -
AggreGaze: Collective Estimation of Audience Attention on Public Displays
Yusuke Sugano, Xucong Zhang, Andreas Bulling
Proc. ACM Symposium on User Interface Software and Technology (UIST), pp. 821-831, 2016.
Best Paper Honourable Mention Award
Gaze is frequently explored in public display research given its importance for monitoring and analysing audience attention. However, current gaze-enabled public display interfaces require either special-purpose eye tracking equipment or explicit personal calibration for each individual user. We present AggreGaze, a novel method for estimating spatio-temporal audience attention on public displays. Our method requires only a single off-the-shelf camera attached to the display, does not require any personal calibration, and provides visual attention estimates across the full display. We achieve this by 1) compensating for errors of state-of-the-art appearance-based gaze estimation methods through on-site training data collection, and by 2) aggregating uncalibrated and thus inaccurate gaze estimates of multiple users into joint attention estimates. We propose different visual stimuli for this compensation: a standard 9-point calibration, moving targets, text and visual stimuli embedded into the display content, as well as normal video content. Based on a two-week deployment in a public space, we demonstrate the effectiveness of our method for estimating attention maps that closely resemble ground-truth audience gaze distributions.Paper: sugano16_uist.pdf@inproceedings{sugano16_uist, title = {AggreGaze: Collective Estimation of Audience Attention on Public Displays}, author = {Sugano, Yusuke and Zhang, Xucong and Bulling, Andreas}, year = {2016}, booktitle = {Proc. ACM Symposium on User Interface Software and Technology (UIST)}, doi = {10.1145/2984511.2984536}, pages = {821-831}, video = {https://www.youtube.com/watch?v=eFK39S_lgdg} } -
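The aggregation step (though not the on-site error compensation) can be sketched as accumulating many noisy, uncalibrated gaze estimates into a smoothed display-wide attention map; the grid size and smoothing bandwidth below are assumptions, not the paper's parameters.

```python
# Sketch of aggregating noisy per-user gaze estimates into an attention map
# over the display (illustrative grid size and smoothing).
import numpy as np
from scipy.ndimage import gaussian_filter

def attention_map(gaze_points, display_wh=(1920, 1080), grid=(96, 54), sigma=2.0):
    """gaze_points: (N, 2) on-display estimates in pixels from many users/frames."""
    w, h = display_wh
    gx, gy = grid
    hist, _, _ = np.histogram2d(gaze_points[:, 1], gaze_points[:, 0],
                                bins=(gy, gx), range=[[0, h], [0, w]])
    smoothed = gaussian_filter(hist, sigma=sigma)          # tolerate per-user offsets
    return smoothed / (smoothed.sum() + 1e-8)              # spatial attention density

points = np.random.normal(loc=[960, 540], scale=[300, 150], size=(5000, 2))
amap = attention_map(points)
print(amap.shape, amap.max())    # (54, 96) grid of attention probabilities
```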
Labeled pupils in the wild: A dataset for studying pupil detection in unconstrained environments
Marc Tonsen, Xucong Zhang, Yusuke Sugano, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 139-142, 2016.
We present labelled pupils in the wild (LPW), a novel dataset of 66 high-quality, high-speed eye region videos for the development and evaluation of pupil detection algorithms. The videos in our dataset were recorded from 22 participants in everyday locations at about 95 FPS using a state-of-the-art dark-pupil head-mounted eye tracker. They cover people of different ethnicities and a diverse set of everyday indoor and outdoor illumination environments, as well as natural gaze direction distributions. The dataset also includes participants wearing glasses, contact lenses, and make-up. We benchmark five state-of-the-art pupil detection algorithms on our dataset with respect to robustness and accuracy. We further study the influence of image resolution and vision aids as well as recording location (indoor, outdoor) on pupil detection performance. Our evaluations provide valuable insights into the general pupil detection problem and allow us to identify key challenges for robust pupil detection on head-mounted eye trackers.Paper: tonsen16_etra.pdf@inproceedings{tonsen16_etra, author = {Tonsen, Marc and Zhang, Xucong and Sugano, Yusuke and Bulling, Andreas}, title = {Labeled pupils in the wild: A dataset for studying pupil detection in unconstrained environments}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2016}, pages = {139-142}, doi = {10.1145/2857491.2857520} } -
Learning an appearance-based gaze estimator from one million synthesised images
Erroll Wood, Tadas Baltrušaitis, Louis-Philippe Morency, Peter Robinson, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 131–138, 2016.
Emerging Investigator Award
Learning-based methods for appearance-based gaze estimation achieve state-of-the-art performance in challenging real-world settings but require large amounts of labelled training data. Learning-by-synthesis was proposed as a promising solution to this problem but current methods are limited with respect to speed, the appearance variability as well as the head pose and gaze angle distribution they can synthesize. We present UnityEyes, a novel method to rapidly synthesize large amounts of variable eye region images as training data. Our method combines a novel generative 3D model of the human eye region with a real-time rendering framework. The model is based on high-resolution 3D face scans and uses real-time approximations for complex eyeball materials and structures as well as novel anatomically inspired procedural geometry methods for eyelid animation. We show that these synthesized images can be used to estimate gaze in difficult in-the-wild scenarios, even for extreme gaze angles or in cases in which the pupil is fully occluded. We also demonstrate competitive gaze estimation results on a benchmark in-the-wild dataset, despite only using a light-weight nearest-neighbor algorithm. We are making our UnityEyes synthesis framework freely available online for the benefit of the research community.Paper: wood16_etra.pdf@inproceedings{wood16_etra, author = {Wood, Erroll and Baltru{\v{s}}aitis, Tadas and Morency, Louis-Philippe and Robinson, Peter and Bulling, Andreas}, title = {Learning an appearance-based gaze estimator from one million synthesised images}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2016}, pages = {131--138}, doi = {10.1145/2857491.2857492} } -
A 3D Morphable Eye Region Model for Gaze Estimation
Erroll Wood, Tadas Baltrušaitis, Louis-Philippe Morency, Peter Robinson, Andreas Bulling
Proc. European Conference on Computer Vision (ECCV), pp. 297-313, 2016.
Morphable face models are a powerful tool, but have previously failed to model the eye accurately due to complexities in its material and motion. We present a new multi-part model of the eye that includes a morphable model of the facial eye region, as well as an anatomy-based eyeball model. It is the first morphable model that accurately captures eye region shape, since it was built from high-quality head scans. It is also the first to allow independent eyeball movement, since we treat it as a separate part. To showcase our model we present a new method for illumination- and head-pose–invariant gaze estimation from a single RGB image. We fit our model to an image through analysis-by-synthesis, solving for eye region shape, texture, eyeball pose, and illumination simultaneously. The fitted eyeball pose parameters are then used to estimate gaze direction. Through evaluation on two standard datasets we show that our method generalizes to both webcam and high-quality camera images, and outperforms a state-of-the-art CNN method, achieving a gaze estimation accuracy of 9.44° in a challenging user-independent scenario.Paper: wood16_eccv.pdf@inproceedings{wood16_eccv, author = {Wood, Erroll and Baltru{\v{s}}aitis, Tadas and Morency, Louis-Philippe and Robinson, Peter and Bulling, Andreas}, title = {A 3D Morphable Eye Region Model for Gaze Estimation}, booktitle = {Proc. European Conference on Computer Vision (ECCV)}, year = {2016}, pages = {297-313}, doi = {10.1007/978-3-319-46448-0_18} } -
Spatio-Temporal Modeling and Prediction of Visual Attention in Graphical User Interfaces
Pingmei Xu, Yusuke Sugano, Andreas Bulling
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 3299-3310, 2016.
Best Paper Honourable Mention Award
We present a computational model to predict users’ spatio-temporal visual attention for WIMP-style (windows, icons, mouse, pointer) graphical user interfaces. Like existing models of bottom-up visual attention in computer vision, our model does not require any eye tracking equipment. Instead, it predicts attention solely using information available to the interface, specifically users’ mouse and keyboard input as well as the UI components they interact with. To study our model in a principled way we further introduce a method to synthesize user interface layouts that are functionally equivalent to real-world interfaces, such as from Gmail, Facebook, or GitHub. We first quantitatively analyze attention allocation and its correlation with user input and UI components using ground-truth gaze, mouse, and keyboard data of 18 participants performing a text editing task. We then show that our model predicts attention maps more accurately than state-of-the-art methods. Our results underline the significant potential of spatio-temporal attention modeling for user interface evaluation, optimization, or even simulation.Paper: xu16_chi.pdf@inproceedings{xu16_chi, title = {Spatio-Temporal Modeling and Prediction of Visual Attention in Graphical User Interfaces}, author = {Xu, Pingmei and Sugano, Yusuke and Bulling, Andreas}, year = {2016}, pages = {3299-3310}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, doi = {10.1145/2858036.2858479} }
Book Chapters
-
Proc. 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)
Paul Lukowicz, Antonio Krüger, Andreas Bulling, Youn-Kyung Lim, Shwetak N. Patel
Heidelberg, Germany, ACM, 2016.
Paper Access: https://dl.acm.org/citation.cfm?id=2971648@inbook{lukowicz16_ubicomp, title = {Proc. 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, year = {2016}, author = {Lukowicz, Paul and Kr{\"{u}}ger, Antonio and Bulling, Andreas and Lim, Youn-Kyung and Patel, Shwetak N.}, isbn = {978-1-4503-4461-6}, location = {Heidelberg, Germany}, publisher = {ACM}, address = {New York, NY, USA}, url = {https://dl.acm.org/citation.cfm?id=2971648} }
Technical Reports
-
Computational Modelling and Prediction of Gaze Estimation Error for Head-mounted Eye Trackers
Michael Barz, Andreas Bulling, Florian Daiber
DFKI Research Reports, pp. 1–6, 2016.
Head-mounted eye tracking has significant potential for mobile gaze-based interaction with ambient displays but current interfaces lack information about the tracker’s gaze estimation error. Consequently, current interfaces do not exploit the full potential of gaze input as the inherent estimation error cannot be dealt with. The error depends on the physical properties of the display and constantly varies with changes in position and distance of the user to the display. In this work we present a computational model of gaze estimation error for head-mounted eye trackers. Our model covers the full processing pipeline for mobile gaze estimation, namely mapping of pupil positions to scene camera coordinates, marker-based display detection, and display mapping. We build the model based on a series of controlled measurements of a sample state-of-the-art monocular head-mounted eye tracker. Results show that our model can predict gaze estimation error with a root mean squared error of 17.99 px (1.96°).Paper: barz16_techrep.pdf@techreport{barz16_techrep, author = {Barz, Michael and Bulling, Andreas and Daiber, Florian}, title = {Computational Modelling and Prediction of Gaze Estimation Error for Head-mounted Eye Trackers}, volume = {1}, year = {2016}, pages = {1--6}, institution = {German Research Center for Artificial Intelligence (DFKI)} } -
Contextual Media Retrieval Using Natural Language Queries
Sreyasi Nag Chowdhury, Mateusz Malinowski, Andreas Bulling, Mario Fritz
arXiv:1602.04983, pp. 1–8, 2016.
The widespread integration of cameras in hand-held and head-worn devices as well as the ability to share content online enables a large and diverse visual capture of the world that millions of users build up collectively every day. We envision these images as well as associated meta information, such as GPS coordinates and timestamps, to form a collective visual memory that can be queried while automatically taking the ever-changing context of mobile users into account. As a first step towards this vision, in this work we present Xplore-M-Ego: a novel media retrieval system that allows users to query a dynamic database of images and videos using spatio-temporal natural language queries. We evaluate our system using a new dataset of real user queries as well as through a usability study. One key finding is that there is a considerable amount of inter-user variability, for example in the resolution of spatial relations in natural language utterances. We show that our retrieval system can cope with this variability using personalisation through an online learning-based retrieval formulation.Paper: chowdhury16_arxiv.pdfPaper Access: https://arxiv.org/abs/1602.04983@techreport{chowdhury16_arxiv, title = {Contextual Media Retrieval Using Natural Language Queries}, author = {Chowdhury, Sreyasi Nag and Malinowski, Mateusz and Bulling, Andreas and Fritz, Mario}, year = {2016}, pages = {1--8}, url = {https://arxiv.org/abs/1602.04983} } -
End-to-End Eye Movement Detection Using Convolutional Neural Networks
Sabrina Hoppe, Andreas Bulling
arXiv:1609.02452, pp. 1–15, 2016.
Common computational methods for automated eye movement detection - i.e. the task of detecting different types of eye movement in a continuous stream of gaze data - are limited in that they either involve thresholding on hand-crafted signal features, require individual detectors each only detecting a single movement, or require pre-segmented data. We propose a novel approach for eye movement detection that only involves learning a single detector end-to-end, i.e. directly from the continuous gaze data stream and simultaneously for different eye movements without any manual feature crafting or segmentation. Our method is based on convolutional neural networks (CNN) that recently demonstrated superior performance in a variety of tasks in computer vision, signal processing, and machine learning. We further introduce a novel multi-participant dataset that contains scripted and free-viewing sequences of ground-truth annotated saccades, fixations, and smooth pursuits. We show that our CNN-based method outperforms state-of-the-art baselines by a large margin on this challenging dataset, thereby underlining the significant potential of this approach for holistic, robust, and accurate eye movement protocol analysis.Paper: hoppe16_arxiv.pdfPaper Access: https://arxiv.org/abs/1609.02452@techreport{hoppe16_arxiv, title = {End-to-End Eye Movement Detection Using Convolutional Neural Networks}, author = {Hoppe, Sabrina and Bulling, Andreas}, year = {2016}, pages = {1--15}, url = {https://arxiv.org/abs/1609.02452} } -
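A toy end-to-end detector in the spirit of this approach is sketched below; the 1D-CNN architecture and sizes are illustrative assumptions, not the network proposed in the report.

```python
# Toy end-to-end eye movement detector: a 1D CNN maps a continuous (x, y) gaze
# stream to per-sample labels for fixation, saccade, and smooth pursuit.
# Architecture and sizes are illustrative only.
import torch
import torch.nn as nn

class EyeMovementCNN(nn.Module):
    def __init__(self, n_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(32, n_classes, kernel_size=1),    # per-sample class scores
        )

    def forward(self, gaze):          # gaze: (N, 2, T) continuous stream
        return self.net(gaze)         # (N, n_classes, T)

model = EyeMovementCNN()
stream = torch.randn(1, 2, 500)       # 500 gaze samples (x, y)
labels = model(stream).argmax(dim=1)  # predicted event type per sample
print(labels.shape)                   # torch.Size([1, 500])
```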
Gaze Embeddings for Zero-Shot Image Classification
Nour Karessli, Zeynep Akata, Bernt Schiele, Andreas Bulling
arXiv:1611.09309, pp. 1–10, 2016.
Zero-shot image classification using auxiliary information, such as attributes describing discriminative object properties, requires time-consuming annotation by domain experts. We instead propose a method that relies on human gaze as auxiliary information, exploiting that even non-expert users have a natural ability to judge class membership. We present a data collection paradigm that involves a discrimination task to increase the information content obtained from gaze data. Our method extracts discriminative descriptors from the data and learns a compatibility function between image and gaze using three novel gaze embeddings: Gaze Histograms (GH), Gaze Features with Grid (GFG) and Gaze Features with Sequence (GFS). We introduce two new gaze-annotated datasets for fine-grained image classification and show that human gaze data is indeed class discriminative, provides a competitive alternative to expert-annotated attributes, and outperforms other baselines for zero-shot image classification.Paper: karessli16_arxiv.pdfPaper Access: https://arxiv.org/abs/1611.09309@techreport{karessli16_arxiv, title = {Gaze Embeddings for Zero-Shot Image Classification}, author = {Karessli, Nour and Akata, Zeynep and Schiele, Bernt and Bulling, Andreas}, year = {2016}, pages = {1--10}, url = {https://arxiv.org/abs/1611.09309} } -
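Of the three proposed embeddings, the Gaze Histogram (GH) is the simplest to illustrate; the sketch below, including the grid size and the bilinear compatibility scoring, reflects my own reading of the abstract rather than the released method.

```python
# Sketch of a Gaze Histogram (GH) embedding and a bilinear compatibility score
# between an image feature and the gaze embedding (illustrative assumptions).
import numpy as np

def gaze_histogram(fixations, grid=8):
    """fixations: (N, 2) normalised (x, y) points in [0, 1]^2 -> (grid*grid,) descriptor."""
    hist, _, _ = np.histogram2d(fixations[:, 1], fixations[:, 0],
                                bins=grid, range=[[0, 1], [0, 1]])
    hist = hist.flatten()
    return hist / (hist.sum() + 1e-8)

def compatibility(gaze_embedding, image_feature, W):
    """Bilinear compatibility F(x, y) = x^T W y between image and gaze embeddings."""
    return image_feature @ W @ gaze_embedding

rng = np.random.default_rng(2)
g = gaze_histogram(rng.random((40, 2)))          # 64-dim gaze embedding
x = rng.random(2048)                             # e.g. CNN image feature
W = rng.standard_normal((2048, 64)) * 0.01       # would be learned in the real method
print(compatibility(g, x, W))
```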
3D Gaze Estimation from 2D Pupil Positions on Monocular Head-Mounted Eye Trackers
Mohsen Mansouryar, Julian Steil, Yusuke Sugano, Andreas Bulling
arXiv:1601.02644, pp. 1–6, 2016.
3D gaze information is important for scene-centric attention analysis but accurate estimation and analysis of 3D gaze in real-world environments remains challenging. We present a novel 3D gaze estimation method for monocular head-mounted eye trackers. In contrast to previous work, our method does not aim to infer 3D eyeball poses but directly maps 2D pupil positions to 3D gaze directions in scene camera coordinate space. We first provide a detailed discussion of the 3D gaze estimation task and summarize different methods, including our own. We then evaluate the performance of different 3D gaze estimation approaches using both simulated and real data. Through experimental validation, we demonstrate the effectiveness of our method in reducing parallax error, and we identify research challenges for the design of 3D calibration procedures.Paper: mansouryar16_arxiv.pdfPaper Access: https://arxiv.org/abs/1601.02644@techreport{mansouryar16_arxiv, title = {3D Gaze Estimation from 2D Pupil Positions on Monocular Head-Mounted Eye Trackers}, author = {Mansouryar, Mohsen and Steil, Julian and Sugano, Yusuke and Bulling, Andreas}, year = {2016}, pages = {1--6}, url = {https://arxiv.org/abs/1601.02644} } -
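A minimal sketch of the direct-mapping idea, assuming a small set of calibration pairs of 2D pupil positions and 3D gaze directions is available; the second-degree polynomial regression and the synthetic data are illustrative choices, not the exact mapping function evaluated in the paper.

```python
# Sketch: regress from 2D pupil positions to unit 3D gaze directions in scene camera coordinates.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
pupil_2d = rng.uniform(0, 1, size=(9, 2))             # calibration pupil positions (normalised)
gaze_3d = rng.normal(size=(9, 3))                     # corresponding 3D gaze directions
gaze_3d /= np.linalg.norm(gaze_3d, axis=1, keepdims=True)

mapper = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
mapper.fit(pupil_2d, gaze_3d)

pred = mapper.predict(rng.uniform(0, 1, size=(5, 2)))
pred /= np.linalg.norm(pred, axis=1, keepdims=True)   # re-normalise to unit directions
```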
Predicting the Category and Attributes of Mental Pictures Using Deep Gaze Pooling
Hosnieh Sattar, Andreas Bulling, Mario Fritz
arXiv:1611.10162, pp. 1–14, 2016.
Predicting the target of visual search from eye fixation (gaze) data is a challenging problem with many applications in human-computer interaction. In contrast to previous work that has focused on individual instances as a search target, we propose the first approach to predict categories and attributes of search targets based on gaze data. However, state of the art models for categorical recognition, in general, require large amounts of training data, which is prohibitive for gaze data. To address this challenge, we propose a novel Gaze Pooling Layer that integrates gaze information into CNN-based architectures as an attention mechanism - incorporating both spatial and temporal aspects of human gaze behavior. We show that our approach is effective even when the gaze pooling layer is added to an already trained CNN, thus eliminating the need for expensive joint data collection of visual and gaze data. We propose an experimental setup and data set and demonstrate the effectiveness of our method for search target prediction based on gaze behavior. We further study how to integrate temporal and spatial gaze information most effectively, and indicate directions for future research in the gaze-based prediction of mental states.Paper: sattar16_arxiv.pdfPaper Access: https://arxiv.org/abs/1611.10162@techreport{sattar16_arxiv, title = {Predicting the Category and Attributes of Mental Pictures Using Deep Gaze Pooling}, author = {Sattar, Hosnieh and Bulling, Andreas and Fritz, Mario}, year = {2016}, pages = {1--14}, url = {https://arxiv.org/abs/1611.10162} } -
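A rough sketch of what a gaze-weighted pooling step can look like: CNN feature maps are re-weighted by a spatially aligned fixation density map before being pooled into an image descriptor. The function name and tensor shapes are hypothetical.

```python
# Sketch of gaze-weighted pooling of CNN feature maps (shapes illustrative).
import torch

def gaze_pool(features, gaze_map, eps=1e-8):
    """features: (B, C, H, W) CNN activations; gaze_map: (B, 1, H, W) fixation density."""
    weights = gaze_map / (gaze_map.sum(dim=(2, 3), keepdim=True) + eps)
    weighted = features * weights                 # emphasise fixated regions
    return weighted.sum(dim=(2, 3))               # (B, C) descriptor for classification

feats = torch.randn(2, 512, 14, 14)
gaze = torch.rand(2, 1, 14, 14)
descriptor = gaze_pool(feats, gaze)
```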
Seeing with Humans: Gaze-Assisted Neural Image Captioning
Yusuke Sugano, Andreas Bulling
arXiv:1608.05203, pp. 1–8, 2016.
Gaze reflects how humans process visual scenes and is therefore increasingly used in computer vision systems. Previous works demonstrated the potential of gaze for object-centric tasks, such as object localization and recognition, but it remains unclear if gaze can also be beneficial for scene-centric tasks, such as image captioning. We present a new perspective on gaze-assisted image captioning by studying the interplay between human gaze and the attention mechanism of deep neural networks. Using a public large-scale gaze dataset, we first assess the relationship between state-of-the-art object and scene recognition models, bottom-up visual saliency, and human gaze. We then propose a novel split attention model for image captioning. Our model integrates human gaze information into an attention-based long short-term memory architecture, and allows the algorithm to allocate attention selectively to both fixated and non-fixated image regions. Through evaluation on the COCO/SALICON datasets we show that our method improves image captioning performance and that gaze can complement machine attention for semantic scene understanding tasks.Paper: sugano16_tr.pdfPaper Access: https://arxiv.org/abs/1608.05203@techreport{sugano16_tr, title = {Seeing with Humans: Gaze-Assisted Neural Image Captioning}, author = {Sugano, Yusuke and Bulling, Andreas}, year = {2016}, pages = {1--8}, url = {https://arxiv.org/abs/1608.05203} } -
It’s Written All Over Your Face: Full-Face Appearance-Based Gaze Estimation
Xucong Zhang, Yusuke Sugano, Mario Fritz, Andreas Bulling
arXiv:1611.08860, pp. 1–10, 2016.
Eye gaze is an important non-verbal cue for human affect analysis. Recent gaze estimation work indicated that information from the full face region can benefit performance. Pushing this idea further, we propose an appearance-based method that, in contrast to a long-standing line of work in computer vision, only takes the full face image as input. Our method encodes the face image using a convolutional neural network with spatial weights applied on the feature maps to flexibly suppress or enhance information in different facial regions. Through extensive evaluation, we show that our full-face method significantly outperforms the state of the art for both 2D and 3D gaze estimation, achieving improvements of up to 14.3% on MPIIGaze and 27.7% on EYEDIAP for person-independent 3D gaze estimation. We further show that this improvement is consistent across different illumination conditions and gaze directions and particularly pronounced for the most challenging extreme head poses.Paper: zhang16_arxiv.pdfPaper Access: https://arxiv.org/abs/1611.08860@techreport{zhang16_arxiv, title = {It's Written All Over Your Face: Full-Face Appearance-Based Gaze Estimation}, author = {Zhang, Xucong and Sugano, Yusuke and Fritz, Mario and Bulling, Andreas}, year = {2016}, pages = {1--10}, url = {https://arxiv.org/abs/1611.08860} }
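A small sketch in the spirit of the spatial weights mechanism described above: a few 1x1 convolutions predict one weight per spatial location of the face feature maps, which is then multiplied onto the features. The layer widths are assumptions, not the published architecture.

```python
# Sketch of a learned spatial-weighting module over full-face feature maps (sizes assumed).
import torch
import torch.nn as nn

class SpatialWeights(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 256, kernel_size=1), nn.ReLU(),
            nn.Conv2d(256, 1, kernel_size=1), nn.ReLU(),   # one weight per spatial location
        )

    def forward(self, feats):                # feats: (B, C, H, W)
        w = self.net(feats)                  # (B, 1, H, W)
        return feats * w                     # suppress or enhance facial regions

weighted = SpatialWeights(64)(torch.randn(1, 64, 28, 28))
```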
2015
Journal Articles
-
Introduction to the Special Issue on Activity Recognition for Interaction
Andreas Bulling, Ulf Blanke, Desney Tan, Jun Rekimoto, Gregory Abowd
ACM Transactions on Interactive Intelligent Systems (TiiS), 4 (16e), pp. 1–3, 2015.
This editorial introduction describes the aims and scope of the ACM Transactions on Interactive Intelligent Systems special issue on Activity Recognition for Interaction. It explains why activity recognition is becoming crucial as part of the cycle of interaction between users and computing systems, and it shows how the five articles selected for this special issue reflect this theme.doi: 10.1145/2694858Paper: bulling15_tiis.pdf@article{bulling15_tiis, author = {Bulling, Andreas and Blanke, Ulf and Tan, Desney and Rekimoto, Jun and Abowd, Gregory}, title = {Introduction to the Special Issue on Activity Recognition for Interaction}, journal = {ACM Transactions on Interactive Intelligent Systems (TiiS)}, volume = {4}, number = {16e}, year = {2015}, pages = {1--3}, doi = {10.1145/2694858} } -
A study on the natural history of scanning behaviour in patients with visual field defects after stroke
Tobias Loetscher, Celia Chen, Sophie Wignall, Andreas Bulling, Sabrina Hoppe, Owen Churches, Nicole Thomas
BMC Neurology, 15 (64), pp. 1–4, 2015.
A visual field defect (VFD) is a common consequence of stroke with a detrimental effect upon the survivors’ functional ability and quality of life. The identification of effective treatments for VFD is a key priority relating to life post-stroke. Understanding the natural evolution of scanning compensation over time may have important ramifications for the development of efficacious therapies. The study aims to unravel the natural history of visual scanning behaviour in patients with VFD. The assessment of scanning patterns in the acute to chronic stages of stroke will reveal who does and does not learn to compensate for vision loss. Methods/Design Eye-tracking glasses are used to delineate eye movements in a cohort of 100 stroke patients immediately after stroke, and additionally at 6 and 12 months post-stroke. The longitudinal study will assess eye movements in static (sitting) and dynamic (walking) conditions. The primary outcome constitutes the change of lateral eye movements from the acute to chronic stages of stroke. Secondary outcomes include changes of lateral eye movements over time as a function of subgroup characteristics, such as side of VFD, stroke location, stroke severity and cognitive functioning. Discussion The longitudinal comparison of patients who do and do not learn compensatory scanning techniques may reveal important prognostic markers of natural recovery. Importantly, it may also help to determine the most effective treatment window for visual rehabilitation.Paper: loetscher15_neurology.pdf@article{loetscher15_neurology, title = {A study on the natural history of scanning behaviour in patients with visual field defects after stroke}, author = {Loetscher, Tobias and Chen, Celia and Wignall, Sophie and Bulling, Andreas and Hoppe, Sabrina and Churches, Owen and Thomas, Nicole}, doi = {10.1186/s12883-015-0321-5}, year = {2015}, journal = {BMC Neurology}, volume = {15}, number = {64}, pages = {1--4} } -
Walking reduces spatial neglect
Tobias Loetscher, Celia Chen, Sabrina Hoppe, Andreas Bulling, Sophie Wignall, Owen Churches, Nicole Thomas, Andrew Lee
Journal of the International Neuropsychological Society, 21 (S2), pp. 120-121, 2015.
Spatial neglect is a common consequence of stroke. Neglect behaviour is typically exacerbated by increased task demands. It was thus anticipated that the addition of a secondary task requiring general attention (walking) would worsen performance on tests of spatial neglect. Here, however, we report a patient in whom neglect was considerably reduced when performing a visual search task while walking.Paper: loetscher15_ins.pdf@article{loetscher15_ins, title = {Walking reduces spatial neglect}, author = {Loetscher, Tobias and Chen, Celia and Hoppe, Sabrina and Bulling, Andreas and Wignall, Sophie and Churches, Owen and Thomas, Nicole and Lee, Andrew}, year = {2015}, volume = {21}, number = {S2}, pages = {120-121}, journal = {Journal of the International Neuropsychological Society} } -
The Feet in HCI: A Survey of Foot-Based Interaction
Eduardo Velloso, Dominik Schmidt, Jason Alexander, Hans Gellersen, Andreas Bulling
ACM Computing Surveys, 48 (2), pp. 1–36, 2015.
Foot-operated computer interfaces have been studied since the inception of Human-Computer Interaction. Thanks to the miniaturisation and decreasing cost of sensing technology, there is increasing interest in exploring this alternative input modality, but no comprehensive overview of its research landscape exists. In this survey, we review the literature on interfaces operated by the lower limbs. We investigate the characteristics of users and how they affect the design of such interfaces. Next, we describe and analyse foot-based research prototypes and commercial systems in how they capture input and provide feedback. We then analyse the interactions between users and systems from the perspective of the actions performed in these interactions. Finally, we discuss our findings and use them to identify open questions and directions for future research.doi: 10.1145/2816455Paper: velloso15_csur.pdf@article{velloso15_csur, author = {Velloso, Eduardo and Schmidt, Dominik and Alexander, Jason and Gellersen, Hans and Bulling, Andreas}, title = {{The Feet in HCI: A Survey of Foot-Based Interaction}}, journal = {ACM Computing Surveys}, year = {2015}, volume = {48}, number = {2}, pages = {1--36}, doi = {10.1145/2816455} } -
Pursuits: Spontaneous Eye-Based Interaction for Dynamic Interfaces
Mélodie Vidal, Andreas Bulling, Hans Gellersen
ACM SIGMOBILE Mobile Computing and Communications Review, 18 (4), pp. 8-10, 2015.
Although gaze is an attractive modality for pervasive interaction, real-world implementation of eye-based interfaces poses significant challenges. In particular, user calibration is tedious and time consuming. Pursuits is an innovative interaction technique that enables truly spontaneous interaction with eye-based interfaces. A user can simply walk up to the screen and readily interact with moving targets. Instead of being based on gaze location, Pursuits correlates eye pursuit movements with objects dynamically moving on the interface.Paper: vidal15_sigmobile.pdf@article{vidal15_sigmobile, author = {Vidal, M{\'{e}}lodie and Bulling, Andreas and Gellersen, Hans}, title = {Pursuits: Spontaneous Eye-Based Interaction for Dynamic Interfaces}, journal = {ACM SIGMOBILE Mobile Computing and Communications Review}, volume = {18}, number = {4}, year = {2015}, pages = {8-10}, doi = {10.1145/2721914.2721917} } -
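A minimal sketch of the correlation principle behind Pursuits, assuming gaze samples and target positions are available over the same sliding window; the window contents, correlation threshold, and the min-over-axes aggregation are illustrative choices rather than the published parameters.

```python
# Sketch: select the on-screen target whose trajectory best correlates with the eyes.
import numpy as np

def pursuits_select(gaze_xy, target_trajs, threshold=0.8):
    """gaze_xy: (T, 2) gaze samples; target_trajs: dict name -> (T, 2) target positions."""
    scores = {}
    for name, traj in target_trajs.items():
        cx = np.corrcoef(gaze_xy[:, 0], traj[:, 0])[0, 1]
        cy = np.corrcoef(gaze_xy[:, 1], traj[:, 1])[0, 1]
        scores[name] = min(cx, cy)            # require both axes to follow the target
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None

t = np.linspace(0, 2 * np.pi, 60)             # one sliding window of samples
targets = {"circle": np.c_[np.cos(t), np.sin(t)], "diagonal": np.c_[t, t]}
gaze = targets["circle"] + np.random.normal(scale=0.05, size=(60, 2))
print(pursuits_select(gaze, targets))         # -> "circle"
```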
Eye tracking for public displays in the wild
Yanxia Zhang, Ming Ki Chong, Jörg Müller, Andreas Bulling, Hans Gellersen
Springer Personal and Ubiquitous Computing, 19 (5), pp. 967-981, 2015.
In public display contexts, interactions are spontaneous and have to work without preparation. We propose gaze as a modality for such contexts, as gaze is always at the ready, and a natural indicator of the user’s interest. We present GazeHorizon, a system that demonstrates spontaneous gaze interaction, enabling users to walk up to a display and navigate content using their eyes only. GazeHorizon is extemporaneous and optimised for instantaneous usability by any user without prior configuration, calibration or training. The system provides interactive assistance to bootstrap gaze interaction with unaware users, employs a single off-the-shelf web camera and computer vision for person-independent tracking of the horizontal gaze direction, and maps this input to rate-controlled navigation of horizontally arranged content. We have evaluated GazeHorizon through a series of field studies, culminating in a four-day deployment in a public environment during which over a hundred passers-by interacted with it, unprompted and unassisted. We realised that since eye movements are subtle, users cannot learn gaze interaction from only observing others, and as a result guidance is required.Paper: zhang15_puc.pdf@article{zhang15_puc, title = {Eye tracking for public displays in the wild}, author = {Zhang, Yanxia and Chong, Ming Ki and M\"uller, J\"org and Bulling, Andreas and Gellersen, Hans}, year = {2015}, doi = {10.1007/s00779-015-0866-8}, pages = {967-981}, volume = {19}, number = {5}, journal = {Springer Personal and Ubiquitous Computing}, keywords = {Eye tracking; Gaze interaction; Public displays; Scrolling; Calibration-free; In-the-wild study; Deployment} }
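As a toy illustration of the rate-controlled navigation described above (not the deployed system's actual parameters), the estimated horizontal gaze direction relative to the screen centre can be mapped to a scrolling velocity with a dead zone:

```python
# Hypothetical rate-control mapping: horizontal gaze drives the scrolling speed.
def scroll_velocity(gaze_x_norm, dead_zone=0.15, gain=600.0):
    """gaze_x_norm: horizontal gaze in [-1, 1] relative to the screen centre.
    Returns a scroll velocity in pixels per second (0 inside the dead zone)."""
    if abs(gaze_x_norm) < dead_zone:
        return 0.0
    sign = 1.0 if gaze_x_norm > 0 else -1.0
    return gain * (gaze_x_norm - sign * dead_zone)

print(scroll_velocity(0.6))   # looking right of centre -> content scrolls right
```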
Conference Papers
-
GravitySpot: Guiding Users in Front of Public Displays Using On-Screen Visual Cues
Florian Alt, Andreas Bulling, Gino Gravanis, Daniel Buschek
Proc. ACM Symposium on User Interface Software and Technology (UIST), pp. 47-56, 2015.
Users tend to position themselves in front of interactive public displays in such a way as to best perceive its content. Currently, this sweet spot is implicitly defined by display properties, content, the input modality, as well as space constraints in front of the display. We present GravitySpot – an approach that makes sweet spots flexible by actively guiding users to arbitrary target positions in front of displays using visual cues. Such guidance is beneficial, for example, if a particular input technology only works at a specific distance or if users should be guided towards a non-crowded area of a large display. In two controlled lab studies (n=29) we evaluate different visual cues based on color, shape, and motion, as well as position-to-cue mapping functions. We show that both the visual cues and mapping functions allow for fine-grained control over positioning speed and accuracy. Findings are complemented by observations from a 3-month real-world deployment.Paper: alt15_uist.pdf@inproceedings{alt15_uist, title = {GravitySpot: Guiding Users in Front of Public Displays Using On-Screen Visual Cues}, author = {Alt, Florian and Bulling, Andreas and Gravanis, Gino and Buschek, Daniel}, year = {2015}, pages = {47-56}, booktitle = {Proc. ACM Symposium on User Interface Software and Technology (UIST)}, doi = {10.1145/2807442.2807490}, video = {https://www.youtube.com/watch?v=laWfbOpQQ8A} } -
Graphical Passwords in the Wild – Understanding How Users Choose Pictures and Passwords in Image-based Authentication Schemes
Florian Alt, Stefan Schneegass, Alireza Sahami, Mariam Hassib, Andreas Bulling
Proc. ACM International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI), pp. 316-322, 2015.
Common user authentication methods on smartphones, such as lock patterns, PINs, or passwords, impose a trade-off between security and password memorability. Image-based passwords were proposed as a secure and usable alternative. As of today, however, it remains unclear how such schemes are used in the wild. We present the first study to investigate how image-based passwords are used over long periods of time in the real world. Our analyses are based on data from 2318 unique devices collected over more than one year using a custom application released in the Android Play store. We present an in-depth analysis of what kind of images users select, how they define their passwords, and how secure these passwords are. Our findings provide valuable insights into real-world use of image-based passwords and inform the design of future graphical authentication schemes.Paper: alt15_mobilehci.pdf@inproceedings{alt15_mobilehci, title = {Graphical Passwords in the Wild -- Understanding How Users Choose Pictures and Passwords in Image-based Authentication Schemes}, author = {Alt, Florian and Schneegass, Stefan and Sahami, Alireza and Hassib, Mariam and Bulling, Andreas}, year = {2015}, pages = {316-322}, booktitle = {Proc. ACM International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI)}, doi = {10.1145/2785830.2785882} } -
Human Visual Behaviour for Collaborative Human-Machine Interaction
Andreas Bulling
Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 903-907, 2015.
Non-verbal behavioural cues are fundamental to human communication and interaction. Despite significant advances in recent years, state-of-the-art human-machine systems still fall short in sensing, analysing, and fully "understanding" cues naturally expressed in everyday settings. Two of the most important non-verbal cues, as evidenced by a large body of work in experimental psychology and behavioural sciences, are visual (gaze) behaviour and body language. We envision a new class of collaborative human-machine systems that fully exploit the information content available in non-verbal human behaviour in everyday settings through joint analysis of human gaze and physical behaviour.Paper: bulling15_ubicomp.pdf@inproceedings{bulling15_ubicomp, title = {Human Visual Behaviour for Collaborative Human-Machine Interaction}, author = {Bulling, Andreas}, doi = {10.1145/2800835.2815378}, pages = {903-907}, year = {2015}, booktitle = {Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)} } -
Orbits: Enabling Gaze Interaction in Smart Watches using Moving Targets
Augusto Esteves, Eduardo Velloso, Andreas Bulling, Hans Gellersen
Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 419-422, 2015.
In this paper we demonstrate Orbits, a novel gaze interaction technique that accounts for both the reduced size of smart watch displays and the hands-free nature of conventional watches. Orbits combines graphical controls that display one or multiple targets moving on a circular path, with input that is provided by users as they follow any of the targets briefly with their eyes. This gaze input triggers the functionality associated with the followed target – be it answering a call, playing a song or managing multiple notifications.@inproceedings{esteves15_ubicomp, title = {Orbits: Enabling Gaze Interaction in Smart Watches using Moving Targets}, author = {Esteves, Augusto and Velloso, Eduardo and Bulling, Andreas and Gellersen, Hans}, year = {2015}, booktitle = {Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, doi = {10.1145/2800835.2800942}, pages = {419-422}, video = {https://www.youtube.com/watch?v=KEIgw5A0yfI} } -
Orbits: Enabling Gaze Interaction in Smart Watches using Moving Targets
Augusto Esteves, Eduardo Velloso, Andreas Bulling, Hans Gellersen
Proc. ACM Symposium on User Interface Software and Technology (UIST), pp. 457-466, 2015.
Best Paper Award
We introduce Orbits, a novel gaze interaction technique that enables hands-free input on smart watches. The technique relies on moving controls to leverage the smooth pursuit movements of the eyes and detect whether and at which control the user is looking. In Orbits, controls include targets that move in a circular trajectory on the face of the watch, and can be selected by following the desired one for a small amount of time. We conducted two user studies to assess the technique’s recognition and robustness, which demonstrated how Orbits is robust against false positives triggered by natural eye movements and how it presents a hands-free, high-accuracy way of interacting with smart watches using off-the-shelf devices. Finally, we developed three example interfaces built with Orbits: a music player, a notifications face plate and a missed call menu. Despite relying on moving controls – very unusual in current HCI interfaces – these were generally well received by participants in a third and final study.Paper: esteves15_uist.pdf@inproceedings{esteves15_uist, title = {Orbits: Enabling Gaze Interaction in Smart Watches using Moving Targets}, author = {Esteves, Augusto and Velloso, Eduardo and Bulling, Andreas and Gellersen, Hans}, year = {2015}, booktitle = {Proc. ACM Symposium on User Interface Software and Technology (UIST)}, doi = {10.1145/2807442.2807499}, pages = {457-466} } -
Recognition of Curiosity Using Eye Movement Analysis
Sabrina Hoppe, Tobias Loetscher, Stephanie Morey, Andreas Bulling
Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 185-188, 2015.
Among the different personality traits that guide our behaviour, curiosity is particularly interesting for context-aware assistive systems as it is closely linked to our well-being and the way we learn. This work proposes eye movement analysis for automatic recognition of different levels of curiosity. We present a 26-participant gaze dataset recorded during a real-world shopping task with empirically validated curiosity questionnaires as ground truth. Using a support vector machine classifier and a leave-one-person-out evaluation scheme we can discriminate between two to four classes of standard curiosity scales well above chance. These results are promising and point towards a new class of context-aware systems that take the user’s curiosity into account, thereby enabling new types of interaction and user adaptation.Paper: hoppe15_ubicomp.pdf@inproceedings{hoppe15_ubicomp, title = {Recognition of Curiosity Using Eye Movement Analysis}, author = {Hoppe, Sabrina and Loetscher, Tobias and Morey, Stephanie and Bulling, Andreas}, year = {2015}, booktitle = {Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, doi = {10.1145/2800835.2800910}, pages = {185-188} } -
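A compact sketch of the evaluation scheme mentioned above: an RBF-kernel SVM scored with leave-one-person-out cross-validation, where each participant forms one group. The features and labels below are random stand-ins for the gaze features and curiosity classes.

```python
# Sketch of person-independent classification with leave-one-person-out evaluation.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

X = np.random.randn(26 * 10, 12)                  # eye movement features per window (synthetic)
y = np.random.randint(0, 2, size=26 * 10)         # e.g. low vs. high curiosity
groups = np.repeat(np.arange(26), 10)             # one group id per participant

scores = cross_val_score(SVC(kernel="rbf"), X, y, groups=groups, cv=LeaveOneGroupOut())
print(scores.mean())                              # chance level is 0.5 for two classes
```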
Tackling Challenges of Interactive Public Displays using Gaze
Mohamed Khamis, Andreas Bulling, Florian Alt
Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 763-766, 2015.
Falling hardware prices led to a widespread use of public displays. Common interaction techniques for such displays currently include touch, mid-air, or smartphone-based interaction. While these techniques are well understood from a technical perspective, several remaining challenges hinder the uptake of interactive displays among passersby. In this paper we propose addressing major public display challenges through gaze as a novel interaction modality. We discuss why gaze-based interaction can tackle these challenges effectively and discuss how solutions can be technically realized. Furthermore, we summarize state-of-the-art eye tracking techniques that show particular promise in the area of public displays.Paper: khamis15_ubicomp.pdf@inproceedings{khamis15_ubicomp, title = {Tackling Challenges of Interactive Public Displays using Gaze}, author = {Khamis, Mohamed and Bulling, Andreas and Alt, Florian}, year = {2015}, booktitle = {Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, doi = {10.1145/2800835.2807951}, pages = {763-766} } -
A Field Study on Spontaneous Gaze-based Interaction with a Public Display using Pursuits
Mohamed Khamis, Florian Alt, Andreas Bulling
Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 865-874, 2015.
Smooth pursuit eye movements were recently introduced as a promising technique for calibration-free and thus spontaneous and natural gaze interaction. While pursuits have been evaluated in controlled laboratory studies, the technique has not yet been evaluated with respect to usability in the wild. We report on a field study in which we deployed a game on a public display where participants used pursuits to select fish moving in linear and circular trajectories at different speeds. The study ran for two days in a busy computer lab resulting in a total of 56 interactions. Results from our study show that linear trajectories are statistically faster to select via pursuits than circular trajectories. We also found that pursuits is well perceived by users who find it fast and responsive.Paper: khamis15_ubicomp_2.pdf@inproceedings{khamis15_ubicomp_2, title = {A Field Study on Spontaneous Gaze-based Interaction with a Public Display using Pursuits}, author = {Khamis, Mohamed and Alt, Florian and Bulling, Andreas}, year = {2015}, booktitle = {Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, doi = {10.1145/2800835.2804335}, pages = {865-874} } -
Scene viewing and gaze analysis during phonetic segmentation tasks
Arif Khan, Ingmar Steiner, Ross Macdonald, Yusuke Sugano, Andreas Bulling
Proc. European Conference on Eye Movements (ECEM), pp. 1–2, 2015.
Paper: khan15_ecem.pdf@inproceedings{khan15_ecem, title = {Scene viewing and gaze analysis during phonetic segmentation tasks}, author = {Khan, Arif and Steiner, Ingmar and Macdonald, Ross and Sugano, Yusuke and Bulling, Andreas}, year = {2015}, pages = {1--2}, booktitle = {Proc. European Conference on Eye Movements (ECEM)} } -
On the interplay between spontaneous spoken instructions and human visual behaviour in an indoor guidance task
Nikolina Koleva, Sabrina Hoppe, Mohammed Mehdi Moniri, Maria Staudte, Andreas Bulling
Proc. Annual Meeting of the Cognitive Science Society (CogSci), pp. 1–6, 2015.
We report on an indoor guidance study to explore the interplay between spontaneous spoken instructions and listeners’ eye movement behaviour. The study involves a remote speaker (the instructor) verbally guiding a listener (the walker) to complete nine everyday tasks in different locations inside a room. We collect a multi-modal dataset of 12 pairs of users consisting of egocentric videos from the listener’s perspective, their gaze data, and instructors’ verbal instructions. We analyse the impact on instructions and listener gaze when the speaker can see 1) only the egocentric video, 2) the video and the point of gaze, or 3) the video and gaze with artificial noise. Our results show that gaze behaviour varies significantly after (but hardly before) instructions and that speakers give more negative feedback when listener gaze is available. These findings suggest that although speakers use gaze information as an indication of what referent the listener is effectively considering, this does not lead listeners to deliberately use their gaze as a pointer even when this is potentially beneficial for the task.Paper: koleva15_cogsci.pdf@inproceedings{koleva15_cogsci, title = {On the interplay between spontaneous spoken instructions and human visual behaviour in an indoor guidance task}, author = {Koleva, Nikolina and Hoppe, Sabrina and Moniri, Mohammed Mehdi and Staudte, Maria and Bulling, Andreas}, year = {2015}, pages = {1--6}, booktitle = {Proc. Annual Meeting of the Cognitive Science Society (CogSci)} } -
GazeProjector: Accurate Gaze Estimation and Seamless Gaze Interaction Across Multiple Displays
Christian Lander, Sven Gehring, Antonio Krüger, Sebastian Boring, Andreas Bulling
Proc. ACM Symposium on User Interface Software and Technology (UIST), pp. 395-404, 2015.
Mobile gaze-based interaction with multiple displays may occur from arbitrary positions and orientations. However, maintaining high gaze estimation accuracy in such situations remains a significant challenge. In this paper, we present GazeProjector, a system that combines (1) natural feature tracking on displays to determine the mobile eye tracker’s position relative to a display with (2) accurate point-of-gaze estimation. GazeProjector allows for seamless gaze estimation and interaction on multiple displays of arbitrary sizes independently of the user’s position and orientation to the display. In a user study with 12 participants we compare GazeProjector to established methods (here: visual on-screen markers and a state-of-the-art video-based motion capture system). We show that our approach is robust to varying head poses, orientations, and distances to the display, while still providing high gaze estimation accuracy across multiple displays without re-calibration for each variation. Our system represents an important step towards the vision of pervasive gaze-based interfaces.Paper: lander15_uist.pdf@inproceedings{lander15_uist, title = {GazeProjector: Accurate Gaze Estimation and Seamless Gaze Interaction Across Multiple Displays}, author = {Lander, Christian and Gehring, Sven and Kr{\"{u}}ger, Antonio and Boring, Sebastian and Bulling, Andreas}, year = {2015}, pages = {395-404}, booktitle = {Proc. ACM Symposium on User Interface Software and Technology (UIST)}, doi = {10.1145/2807442.2807479} } -
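To make the coordinate transfer concrete, here is a small sketch (with made-up points) of mapping a point of gaze from the scene camera into display coordinates once the display has been localised, using a homography as one possible realisation of the tracking step; it is not the system's actual implementation.

```python
# Sketch: transfer the scene-camera point of gaze into display coordinates via a homography.
import numpy as np
import cv2

scene_pts = np.float32([[100, 80], [540, 90], [530, 400], [110, 390]])   # display corners in the scene image
disp_pts = np.float32([[0, 0], [1920, 0], [1920, 1080], [0, 1080]])      # display corners in display pixels
H, _ = cv2.findHomography(scene_pts, disp_pts)

gaze_scene = np.float32([[[320, 240]]])                 # point of gaze in the scene camera image
gaze_display = cv2.perspectiveTransform(gaze_scene, H)  # gaze point in display coordinates
print(gaze_display.ravel())
```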
Emotion recognition from embedded bodily expressions and speech during dyadic interactions
Philipp Müller, Sikandar Amin, Prateek Verma, Mykhaylo Andriluka, Andreas Bulling
Proc. International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 663-669, 2015.
Previous work on emotion recognition from bodily expressions focused on analysing such expressions in isolation, of individuals or in controlled settings, from a single camera view, or required intrusive motion tracking equipment. We study the problem of emotion recognition from bodily expressions and speech during dyadic (person-person) interactions in a real kitchen instrumented with ambient cameras and microphones. We specifically focus on bodily expressions that are embedded in regular interactions and background activities and recorded without human augmentation to increase naturalness of the expressions. We present a human-validated dataset that contains 224 high-resolution, multi-view video clips and audio recordings of emotionally charged interactions between eight couples of actors. The dataset is fully annotated with categorical labels for four basic emotions (anger, happiness, sadness, and surprise) and continuous labels for valence, activation, power, and anticipation provided by five annotators for each actor. We evaluate vision and audio-based emotion recognition using dense trajectories and a standard audio pipeline and provide insights into the importance of different body parts and audio features for emotion recognition.Paper: mueller15_acii.pdf@inproceedings{mueller15_acii, title = {Emotion recognition from embedded bodily expressions and speech during dyadic interactions}, author = {M{\"{u}}ller, Philipp and Amin, Sikandar and Verma, Prateek and Andriluka, Mykhaylo and Bulling, Andreas}, year = {2015}, pages = {663-669}, doi = {10.1109/ACII.2015.7344640}, booktitle = {Proc. International Conference on Affective Computing and Intelligent Interaction (ACII)} } -
Prediction of Search Targets From Fixations in Open-world Settings
Hosnieh Sattar, Sabine Müller, Mario Fritz, Andreas Bulling
Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 981-990, 2015.
Previous work on predicting the target of visual search from human fixations only considered closed-world settings in which training labels are available and predictions are performed for a known set of potential targets. In this work we go beyond the state of the art by studying search target prediction in an open-world setting in which we no longer assume that we have fixation data to train for the search targets. We present a dataset containing fixation data of 18 users searching for natural images from three image categories within synthesised image collages of about 80 images. In a closed-world baseline experiment we show that we can predict the correct target image out of a candidate set of five images. We then present a new problem formulation for search target prediction in the open-world setting that is based on learning compatibilities between fixations and potential targets.Paper: sattar15_cvpr.pdf@inproceedings{sattar15_cvpr, author = {Sattar, Hosnieh and M{\"{u}}ller, Sabine and Fritz, Mario and Bulling, Andreas}, title = {Prediction of Search Targets From Fixations in Open-world Settings}, booktitle = {Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR)}, year = {2015}, pages = {981-990}, doi = {10.1109/CVPR.2015.7298700} } -
Discovery of Everyday Human Activities From Long-term Visual Behaviour Using Topic Models
Julian Steil, Andreas Bulling
Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 75-85, 2015.
Human visual behaviour has significant potential for activity recognition and computational behaviour analysis, but previous works focused on supervised methods and recognition of predefined activity classes based on short-term eye movement recordings. We propose a fully unsupervised method to discover users’ everyday activities from their long-term visual behaviour. Our method combines a bag-of-words representation of visual behaviour that encodes saccades, fixations, and blinks with a latent Dirichlet allocation (LDA) topic model. We further propose different methods to encode saccades for their use in the topic model. We evaluate our method on a novel long-term gaze dataset that contains full-day recordings of natural visual behaviour of 10 participants (more than 80 hours in total). We also provide annotations for eight sample activity classes (outdoor, social interaction, focused work, travel, reading, computer work, watching media, eating) and periods with no specific activity. We show the ability of our method to discover these activities with performance competitive with that of previously published supervised methods.Paper: steil15_ubicomp.pdf@inproceedings{steil15_ubicomp, author = {Steil, Julian and Bulling, Andreas}, title = {Discovery of Everyday Human Activities From Long-term Visual Behaviour Using Topic Models}, booktitle = {Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, year = {2015}, doi = {10.1145/2750858.2807520}, pages = {75-85} } -
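A minimal sketch of the unsupervised pipeline described above: discrete eye movement "words" are counted per time window and an LDA topic model is fitted to discover latent activities. The vocabulary size, window counts, and number of topics are assumptions for illustration.

```python
# Sketch: bag-of-words over encoded eye movements + LDA topic model for activity discovery.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

vocab_size = 50                                             # e.g. quantised saccade/fixation/blink patterns
counts = np.random.randint(0, 5, size=(200, vocab_size))    # 200 time windows of visual behaviour

lda = LatentDirichletAllocation(n_components=8, random_state=0)
topic_mix = lda.fit_transform(counts)             # per-window activity (topic) proportions
discovered = topic_mix.argmax(axis=1)             # dominant "activity" per window
```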
Self-Calibrating Head-Mounted Eye Trackers Using Egocentric Visual Saliency
Yusuke Sugano, Andreas Bulling
Proc. ACM Symposium on User Interface Software and Technology (UIST), pp. 363-372, 2015.
Head-mounted eye tracking has significant potential for gaze-based applications such as life logging, mental health monitoring, or quantified self. However, a neglected challenge for such applications is that drift in the initial person-specific eye tracker calibration, for example caused by physical activity, can severely impact gaze estimation accuracy and, thus, system performance and user experience. We first analyse calibration drift on a new dataset of natural gaze data recorded using synchronised video-based and Electrooculography-based eye trackers of 20 users performing everyday activities in a mobile setting. Based on this analysis we present a method to automatically self-calibrate head-mounted eye trackers based on a computational model of bottom-up visual saliency. Through evaluations on the dataset we show that our method is 1) effective in reducing calibration drift in calibrated eye trackers and 2) given sufficient data, can achieve competitive gaze estimation accuracy to a calibrated eye tracker without any manual calibration.Paper: sugano15_uist.pdf@inproceedings{sugano15_uist, title = {Self-Calibrating Head-Mounted Eye Trackers Using Egocentric Visual Saliency}, author = {Sugano, Yusuke and Bulling, Andreas}, year = {2015}, booktitle = {Proc. ACM Symposium on User Interface Software and Technology (UIST)}, doi = {10.1145/2807442.2807445}, pages = {363-372}, video = {https://www.youtube.com/watch?v=CvsZ3YCWFPk} } -
Gaze+RST: Integrating Gaze and Multitouch for Remote Rotate-Scale-Translate Tasks
Jayson Turner, Jason Alexander, Andreas Bulling, Hans Gellersen
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 4179-4188, 2015.
Our work investigates the use of gaze and multitouch to fluidly perform rotate-scale-translate (RST) tasks on large displays. The work specifically aims to understand if gaze can provide benefit in such a task, how task complexity affects performance, and how gaze and multitouch can be combined to create an integral input structure suited to the task of RST. We present four techniques that individually strike a different balance between gaze-based and touch-based translation while maintaining concurrent rotation and scaling operations. A 16-participant empirical evaluation revealed that three of our four techniques present viable options for this scenario, and that larger distances and rotation/scaling operations can significantly affect a gaze-based translation configuration. Furthermore, we uncover new insights regarding multimodal integrality, finding that gaze and touch can be combined into configurations that pertain to integral or separable input structures.Paper: turner15_chi.pdf@inproceedings{turner15_chi, author = {Turner, Jayson and Alexander, Jason and Bulling, Andreas and Gellersen, Hans}, title = {Gaze+RST: Integrating Gaze and Multitouch for Remote Rotate-Scale-Translate Tasks}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2015}, pages = {4179-4188}, doi = {10.1145/2702123.2702355} } -
An Empirical Investigation of Gaze Selection in Mid-Air Gestural 3D Manipulation
Eduardo Velloso, Jayson Turner, Jason Alexander, Andreas Bulling, Hans Gellersen
Proc. IFIP TC13 Conference on Human-Computer Interaction (INTERACT), pp. 315-330, 2015.
In this work, we investigate gaze selection in the context of mid-air hand gestural manipulation of 3D rigid bodies in monoscopic displays. We present the results of a user study with 12 participants in which we compared the performance of Gaze, a Raycasting technique (2D Cursor) and a Virtual Hand technique (3D Cursor) to select objects in two 3D mid-air interaction tasks. Also, we compared selection confirmation times for Gaze selection when selection is followed by manipulation to when it is not. Our results show that gaze selection is faster and more preferred than 2D and 3D mid-air-controlled cursors, and is particularly well suited for tasks in which users constantly switch between several objects during the manipulation. Further, selection confirmation times are longer when selection is followed by manipulation than when it is not.Paper: velloso15_interact.pdf@inproceedings{velloso15_interact, title = {{An Empirical Investigation of Gaze Selection in Mid-Air Gestural 3D Manipulation}}, author = {Velloso, Eduardo and Turner, Jayson and Alexander, Jason and Bulling, Andreas and Gellersen, Hans}, year = {2015}, pages = {315-330}, doi = {10.1007/978-3-319-22668-2_25}, booktitle = {Proc. IFIP TC13 Conference on Human-Computer Interaction (INTERACT)} } -
Interactions Under the Desk: A Characterisation of Foot Movements for Input in a Seated Position
Eduardo Velloso, Jason Alexander, Andreas Bulling, Hans Gellersen
Proc. IFIP TC13 Conference on Human-Computer Interaction (INTERACT), pp. 384-401, 2015.
This paper takes a bottom-up approach to characterising foot movements as input for users seated at computing systems. We conducted four user studies to characterise various aspects of foot-based interaction. First, we built unconstrained foot pointing performance models for 16 participants in a seated desktop setting using 1D and 2D ISO 9241-9-compliant Fitts’s Law tasks. Second, we evaluated the effect of the foot and direction in one-direction tasks, finding no effect of the foot used, but a significant effect of the direction in which targets are distributed. Third, we compared the use of one foot against two feet to control two independent variables, finding that while one foot is better suited for tasks with a spatial representation that matches its movement, there is little difference between the two feet techniques when it does not. Fourth, we analysed the overhead caused by introducing a feet-controlled variable in a mouse-based task, finding the feet to be comparable to the scroll wheel. The results of our studies show the feet are an effective method of enhancing our interaction with desktop systems; we use our findings to inform a series of design guidelines for such systems.Paper: velloso15_interact_2.pdf@inproceedings{velloso15_interact_2, title = {Interactions Under the Desk: A Characterisation of Foot Movements for Input in a Seated Position}, author = {Velloso, Eduardo and Alexander, Jason and Bulling, Andreas and Gellersen, Hans}, year = {2015}, pages = {384-401}, doi = {10.1007/978-3-319-22701-6_29}, booktitle = {Proc. IFIP TC13 Conference on Human-Computer Interaction (INTERACT)} } -
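For readers unfamiliar with the modelling used here, the following sketch (with invented numbers) shows the standard Fitts's law fit of movement time against the Shannon index of difficulty, as one would do for foot pointing data of this kind.

```python
# Sketch: fit MT = a + b * ID from target distance, width, and movement time (synthetic data).
import numpy as np

distance = np.array([100, 200, 400, 800], dtype=float)   # target distance (px)
width = np.array([40, 40, 80, 80], dtype=float)          # target width (px)
movement_time = np.array([0.62, 0.78, 0.85, 1.10])       # seconds (illustrative)

ID = np.log2(distance / width + 1)                       # Shannon formulation of the index of difficulty
b, a = np.polyfit(ID, movement_time, 1)                  # slope, intercept
throughput = (ID / movement_time).mean()                 # bits per second
print(f"MT = {a:.2f} + {b:.2f} * ID, throughput ~ {throughput:.2f} bit/s")
```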
The Royal Corgi: Exploring Social Gaze Interaction for Immersive Gameplay
Mélodie Vidal, Remi Bismuth, Andreas Bulling, Hans Gellersen
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 115-124, 2015.
The eyes are a rich channel for non-verbal communication in our daily interactions. We propose social gaze interaction as a game mechanic to enhance user interactions with virtual characters. We develop a game from the ground up in which characters are designed to be reactive to the player’s gaze in social ways, such as getting annoyed when the player seems distracted or changing their dialogue depending on the player’s apparent focus of attention. Results from a qualitative user study provide insights about how social gaze interaction is intuitive for users, elicits deep feelings of immersion, and highlight the players’ self-consciousness of their own eye movements through their strong reactions to the characters.Paper: vidal15_chi.pdf@inproceedings{vidal15_chi, author = {Vidal, M{\'{e}}lodie and Bismuth, Remi and Bulling, Andreas and Gellersen, Hans}, title = {{The Royal Corgi: Exploring Social Gaze Interaction for Immersive Gameplay}}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2015}, pages = {115-124}, doi = {10.1145/2702123.2702163} } -
Analyzing Visual Attention During Whole Body Interaction with Public Displays
Robert Walter, Andreas Bulling, David Lindlbauer, Martin Schüssler, Hans Jörg Müller
Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 1263-1267, 2015.
While whole body interaction can enrich user experience on public displays, it remains unclear how common visualizations of user representations impact users’ ability to perceive content on the display. In this work we use a head-mounted eye tracker to record visual behavior of 25 users interacting with a public display game that uses a silhouette user representation, mirroring the users’ movements. Results from visual attention analysis as well as post-hoc recall and recognition tasks on display contents reveal that visual attention is mostly on users’ silhouette while peripheral screen elements remain largely unattended. In our experiment, content attached to the user representation attracted significantly more attention than other screen contents, while content placed at the top and bottom of the screen attracted significantly less. Screen contents attached to the user representation were also significantly better remembered than those at the top and bottom of the screen.@inproceedings{walter15_ubicomp, author = {Walter, Robert and Bulling, Andreas and Lindlbauer, David and Sch{\"{u}}ssler, Martin and M{\"{u}}ller, Hans J{\"{o}}rg}, title = {Analyzing Visual Attention During Whole Body Interaction with Public Displays}, booktitle = {Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, year = {2015}, doi = {10.1145/2750858.280425}, pages = {1263-1267}, video = {https://www.youtube.com/watch?v=JlEnUyhQ1cY} } -
Rendering of Eyes for Eye-Shape Registration and Gaze Estimation
Erroll Wood, Tadas Baltrušaitis, Xucong Zhang, Yusuke Sugano, Peter Robinson, Andreas Bulling
Proc. IEEE International Conference on Computer Vision (ICCV), pp. 3756-3764, 2015.
Images of the eye are key in several computer vision problems, such as shape registration and gaze estimation. Recent large-scale supervised methods for these problems require time-consuming data collection and manual annotation, which can be unreliable. We propose synthesizing perfectly labelled photo-realistic training data in a fraction of the time. We used computer graphics techniques to build a collection of dynamic eye-region models from head scan geometry. These were randomly posed to synthesize close-up eye images for a wide range of head poses, gaze directions, and illumination conditions. We used our model’s controllability to verify the importance of realistic illumination and shape variations in eye-region training data. Finally, we demonstrate the benefits of our synthesized training data (SynthesEyes) by out-performing state-of-the-art methods for eye-shape registration as well as cross-dataset appearance-based gaze estimation in the wild.Paper: wood15_iccv.pdf@inproceedings{wood15_iccv, title = {Rendering of Eyes for Eye-Shape Registration and Gaze Estimation}, author = {Wood, Erroll and Baltru{\v{s}}aitis, Tadas and Zhang, Xucong and Sugano, Yusuke and Robinson, Peter and Bulling, Andreas}, doi = {10.1109/ICCV.2015.428}, year = {2015}, pages = {3756-3764}, booktitle = {Proc. IEEE International Conference on Computer Vision (ICCV)} } -
Appearance-based Gaze Estimation in the Wild
Xucong Zhang, Yusuke Sugano, Mario Fritz, Andreas Bulling
Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4511-4520, 2015.
Appearance-based gaze estimation is believed to work well in real-world settings, but existing datasets were collected under controlled laboratory conditions and methods were not evaluated across multiple datasets. In this work we study appearance-based gaze estimation in the wild. We present the MPIIGaze dataset that contains 213,659 images we collected from 15 participants during natural everyday laptop use over more than three months. Our dataset is significantly more variable than existing datasets with respect to appearance and illumination. We also present a method for in-the-wild appearance-based gaze estimation using multimodal convolutional neural networks, which significantly outperforms state-of-the-art methods in the most challenging cross-dataset evaluation setting. We present an extensive evaluation of several state-of-the-art image-based gaze estimation algorithms on three current datasets, including our own. This evaluation provides clear insights and allows us to identify key research challenges of gaze estimation in the wild.Paper: zhang15_cvpr.pdf@inproceedings{zhang15_cvpr, author = {Zhang, Xucong and Sugano, Yusuke and Fritz, Mario and Bulling, Andreas}, title = {Appearance-based Gaze Estimation in the Wild}, booktitle = {Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR)}, year = {2015}, pages = {4511-4520}, doi = {10.1109/CVPR.2015.7299081}, video = {https://www.youtube.com/watch?v=rw6LZA1USG8} }
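A toy sketch of a multimodal gaze regressor in the spirit of the method above, taking a grey-scale eye image and a 2D head pose and predicting 2D gaze angles; the layer sizes and the 36x60 input resolution are illustrative assumptions rather than the published model.

```python
# Sketch of a multimodal (eye image + head pose) CNN gaze regressor (architecture assumed).
import torch
import torch.nn as nn

class MultimodalGazeNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 20, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(20, 50, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )
        self.fc = nn.Sequential(nn.LazyLinear(500), nn.ReLU())
        self.head = nn.Linear(500 + 2, 2)      # concatenate 2D head pose -> 2D gaze angles

    def forward(self, eye_image, head_pose):
        x = self.fc(self.conv(eye_image))
        return self.head(torch.cat([x, head_pose], dim=1))

gaze = MultimodalGazeNet()(torch.randn(8, 1, 36, 60), torch.randn(8, 2))
```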
Book Chapters
-
Proc. International Workshop on Pervasive Eye Tracking and Mobile Eye-Based Interaction (PETMEI)
Peter Kiefer, Yanxia Zhang, Andreas Bulling
Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 825–828, 2015.
Previous work on eye tracking and eye-based human-computer interfaces mainly concentrated on making use of the eyes in traditional desktop settings. With the recent growth of interest in smart eyewear and low-cost mobile eye trackers, gaze-based techniques for mobile computing are becoming increasingly important. PETMEI 2015 focuses on the pervasive eye tracking paradigm as a trailblazer for mobile eye-based interaction and eye-based context-awareness. We want to stimulate and explore the creativity of these communities with respect to the implications, key research challenges, and new applications for pervasive eye tracking in ubiquitous computing. The long-term goal is to create a strong interdisciplinary research community linking these fields together and to establish the workshop as the premier forum for research on pervasive eye tracking.Paper: kiefer15_petmei.pdf@inbook{kiefer15_petmei, author = {Kiefer, Peter and Zhang, Yanxia and Bulling, Andreas}, title = {Proc. International Workshop on Pervasive Eye Tracking and Mobile Eye-Based Interaction (PETMEI)}, year = {2015}, doi = {10.1145/2800835.2807960}, pages = {825--828}, booktitle = {Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)} }
Technical Reports
-
GazeProjector: Location-independent gaze interaction on and across multiple displays
Christian Lander, Sven Gehring, Antonio Krüger, Sebastian Boring, Andreas Bulling
DFKI Research Reports, pp. 1–10, 2015.
Mobile gaze-based interaction with multiple displays may occur from arbitrary positions and orientations. However, maintaining high gaze estimation accuracy still represents a significant challenge. To address this, we present GazeProjector, a system that combines accurate point-of-gaze estimation with natural feature tracking on displays to determine the mobile eye tracker’s position relative to a display. The detected eye positions are transformed onto that display, allowing for gaze-based interaction. This allows for seamless gaze estimation and interaction on (1) multiple displays of arbitrary sizes, (2) independently of the user’s position and orientation to the display. In a user study with 12 participants we compared GazeProjector to existing well-established methods such as visual on-screen markers and a state-of-the-art motion capture system. Our results show that our approach is robust to varying head poses, orientations, and distances to the display, while still providing high gaze estimation accuracy across multiple displays without re-calibration. The system represents an important step towards the vision of pervasive gaze-based interfaces.@techreport{lander15_techrep, author = {Lander, Christian and Gehring, Sven and Kr{\"{u}}ger, Antonio and Boring, Sebastian and Bulling, Andreas}, title = {GazeProjector: Location-independent gaze interaction on and across multiple displays}, volume = {1}, year = {2015}, pages = {1--10}, institution = {German Research Center for Artificial Intelligence (DFKI)}, video = {https://www.youtube.com/watch?v=peuL4WRfrRM} } -
Prediction of Search Targets From Fixations in Open-world Settings
Hosnieh Sattar, Sabine Müller, Mario Fritz, Andreas Bulling
arXiv:1502.05137, pp. 1–10, 2015.
Previous work on predicting the target of visual search from human fixations only considered closed-world settings in which training labels are available and predictions are performed for a known set of potential targets. In this work we go beyond the state of the art by studying search target prediction in an open-world setting in which we no longer assume that we have fixation data to train for the search targets. We present a dataset containing fixation data of 18 users searching for natural images from three image categories within synthesised image collages of about 80 images. In a closed-world baseline experiment we show that we can predict the correct target image out of a candidate set of five images. We then present a new problem formulation for search target prediction in the open-world setting that is based on learning compatibilities between fixations and potential targets.Paper: sattar15_arxiv.pdfPaper Access: https://arxiv.org/abs/1502.05137@techreport{sattar15_arxiv, author = {Sattar, Hosnieh and M{\"{u}}ller, Sabine and Fritz, Mario and Bulling, Andreas}, title = {Prediction of Search Targets From Fixations in Open-world Settings}, year = {2015}, pages = {1--10}, url = {https://arxiv.org/abs/1502.05137} } -
GazeDPM: Early Integration of Gaze Information in Deformable Part Models
Iaroslav Shcherbatyi, Andreas Bulling, Mario Fritz
arXiv:1505.05753, pp. 1–14, 2015.
An increasing number of works explore collaborative human-computer systems in which human gaze is used to enhance computer vision systems. For object detection these efforts were so far restricted to late integration approaches that have inherent limitations, such as increased precision without increase in recall. We propose an early integration approach in a deformable part model, which constitutes a joint formulation over gaze and visual data. We show that our GazeDPM method improves over the state-of-the-art DPM baseline by 4% and a recent method for gaze-supported object detection by 3% on the public POET dataset. Our approach additionally provides introspection of the learnt models, can reveal salient image structures, and allows us to investigate the interplay between gaze attracting and repelling areas, the importance of view-specific models, as well as viewers’ personal biases in gaze patterns. We finally study important practical aspects of our approach, such as the impact of using saliency maps instead of real fixations, the impact of the number of fixations, as well as robustness to gaze estimation error.Paper: shcherbatyi15_arxiv.pdfPaper Access: https://arxiv.org/abs/1505.05753@techreport{shcherbatyi15_arxiv, title = {GazeDPM: Early Integration of Gaze Information in Deformable Part Models}, author = {Shcherbatyi, Iaroslav and Bulling, Andreas and Fritz, Mario}, year = {2015}, pages = {1--14}, url = {https://arxiv.org/abs/1505.05753} } -
Labeled pupils in the wild: A dataset for studying pupil detection in unconstrained environments
Marc Tonsen, Xucong Zhang, Yusuke Sugano, Andreas Bulling
arXiv:1511.05768, pp. 1–4, 2015.
We present labelled pupils in the wild (LPW), a novel dataset of 66 high-quality, high-speed eye region videos for the development and evaluation of pupil detection algorithms. The videos in our dataset were recorded from 22 participants in everyday locations at about 95 FPS using a state-of-the-art dark-pupil head-mounted eye tracker. They cover people with different ethnicities, a diverse set of everyday indoor and outdoor illumination environments, as well as natural gaze direction distributions. The dataset also includes participants wearing glasses, contact lenses, as well as make-up. We benchmark five state-of-the-art pupil detection algorithms on our dataset with respect to robustness and accuracy. We further study the influence of image resolution, vision aids, as well as recording location (indoor, outdoor) on pupil detection performance. Our evaluations provide valuable insights into the general pupil detection problem and allow us to identify key challenges for robust pupil detection on head-mounted eye trackers.Paper: tonsen15_arxiv.pdfPaper Access: https://arxiv.org/abs/1511.05768@techreport{tonsen15_arxiv, title = {Labeled pupils in the wild: A dataset for studying pupil detection in unconstrained environments}, author = {Tonsen, Marc and Zhang, Xucong and Sugano, Yusuke and Bulling, Andreas}, year = {2015}, pages = {1--4}, url = {https://arxiv.org/abs/1511.05768} } -
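Benchmarks of this kind typically report a detection rate as a function of the allowed pixel distance between detected and ground-truth pupil centres; a hypothetical helper for such an evaluation could look as follows (with synthetic data).

```python
# Sketch: detection rate within a pixel-error threshold (data and thresholds are illustrative).
import numpy as np

def detection_rate(pred, gt, max_error_px):
    errors = np.linalg.norm(pred - gt, axis=1)
    return (errors <= max_error_px).mean()

pred = np.random.rand(1000, 2) * 5 + 100          # detected pupil centres (synthetic)
gt = np.full((1000, 2), 100.0)                    # ground-truth centres (synthetic)
for thr in (1, 2, 5, 10):
    print(thr, detection_rate(pred, gt, thr))
```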
Rendering of Eyes for Eye-Shape Registration and Gaze Estimation
Erroll Wood, Tadas Baltrušaitis, Xucong Zhang, Yusuke Sugano, Peter Robinson, Andreas Bulling
arXiv:1505.05916, pp. 1–9, 2015.
Images of the eye are key in several computer vision problems, such as shape registration and gaze estimation. Recent large-scale supervised methods for these problems require time-consuming data collection and manual annotation, which can be unreliable. We propose synthesizing perfectly labelled photo-realistic training data in a fraction of the time. We used computer graphics techniques to build a collection of dynamic eye-region models from head scan geometry. These were randomly posed to synthesize close-up eye images for a wide range of head poses, gaze directions, and illumination conditions. We used our model’s controllability to verify the importance of realistic illumination and shape variations in eye-region training data. Finally, we demonstrate the benefits of our synthesized training data (SynthesEyes) by out-performing state-of-the-art methods for eye-shape registration as well as cross-dataset appearance-based gaze estimation in the wild.Paper: wood15_arxiv.pdfPaper Access: https://arxiv.org/abs/1505.05916@techreport{wood15_arxiv, title = {Rendering of Eyes for Eye-Shape Registration and Gaze Estimation}, author = {Wood, Erroll and Baltru{\v{s}}aitis, Tadas and Zhang, Xucong and Sugano, Yusuke and Robinson, Peter and Bulling, Andreas}, year = {2015}, pages = {1--9}, url = {https://arxiv.org/abs/1505.05916} } -
Appearance-Based Gaze Estimation in the Wild
Xucong Zhang, Yusuke Sugano, Mario Fritz, Andreas Bulling
arXiv:1504.02863, pp. 1–10, 2015.
Appearance-based gaze estimation is believed to work well in real-world settings, but existing datasets have been collected under controlled laboratory conditions and methods have not been evaluated across multiple datasets. In this work we study appearance-based gaze estimation in the wild. We present the MPIIGaze dataset that contains 213,659 images we collected from 15 participants during natural everyday laptop use over more than three months. Our dataset is significantly more variable than existing ones with respect to appearance and illumination. We also present a method for in-the-wild appearance-based gaze estimation using multimodal convolutional neural networks that significantly outperforms state-of-the-art methods in the most challenging cross-dataset evaluation. We present an extensive evaluation of several state-of-the-art image-based gaze estimation algorithms on three current datasets, including our own. This evaluation provides clear insights and allows us to identify key research challenges of gaze estimation in the wild.Paper: zhang15_arxiv.pdfPaper Access: https://arxiv.org/abs/1504.02863@techreport{zhang15_arxiv, title = {Appearance-Based Gaze Estimation in the Wild}, author = {Zhang, Xucong and Sugano, Yusuke and Fritz, Mario and Bulling, Andreas}, year = {2015}, pages = {1--10}, url = {https://arxiv.org/abs/1504.02863} }
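As a rough illustration of the multimodal network idea described in the abstract above, the sketch below combines an eye-image branch with a head-pose vector to regress a 2D gaze angle. The layer sizes, the fusion scheme, and the use of PyTorch are illustrative assumptions, not the architecture reported in the paper.

```python
# Minimal sketch of a multimodal CNN for appearance-based gaze estimation:
# an eye-image branch is fused with a 2D head-pose vector to regress a 2D
# gaze angle. Layer sizes and fusion are illustrative assumptions only.
import torch
import torch.nn as nn

class GazeNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Convolutional branch over a grey-scale eye patch (1 x 36 x 60).
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(20, 50, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc_img = nn.Sequential(nn.Flatten(), nn.Linear(50 * 6 * 12, 500), nn.ReLU())
        # Fuse image features with the head-pose angles, regress (yaw, pitch).
        self.head = nn.Linear(500 + 2, 2)

    def forward(self, eye_image, head_pose):
        x = self.fc_img(self.features(eye_image))
        return self.head(torch.cat([x, head_pose], dim=1))

# Toy forward pass on random data.
model = GazeNet()
eye = torch.randn(8, 1, 36, 60)   # batch of normalised eye patches
pose = torch.randn(8, 2)          # head-pose (yaw, pitch) per sample
print(model(eye, pose).shape)     # -> torch.Size([8, 2])
```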
2014
Journal Articles
-
A Tutorial on Human Activity Recognition Using Body-worn Inertial Sensors
Andreas Bulling, Ulf Blanke, Bernt Schiele
ACM Computing Surveys, 46 (3), pp. 1–33, 2014.
The last 20 years have seen an ever-increasing research activity in the field of human activity recognition. As activity recognition has considerably matured, so has the number of challenges in designing, implementing, and evaluating activity recognition systems. This tutorial aims to provide a comprehensive hands-on introduction for newcomers to the field of human activity recognition. It specifically focuses on activity recognition using on-body inertial sensors. We first discuss the key research challenges that human activity recognition shares with general pattern recognition and identify those challenges that are specific to human activity recognition. We then describe the concept of an activity recognition chain (ARC) as a general-purpose framework for designing and evaluating activity recognition systems. We detail each component of the framework, provide references to related research and introduce the best practice methods developed by the activity recognition research community. We conclude with the educational example problem of recognising different hand gestures from inertial sensors attached to the upper and lower arm. We illustrate how each component of this framework can be implemented for this specific activity recognition problem and demonstrate how different implementations compare and how they impact overall recognition performance.doi: 10.1145/2499621Paper: bulling14_csur.pdf@article{bulling14_csur, author = {Bulling, Andreas and Blanke, Ulf and Schiele, Bernt}, title = {A Tutorial on Human Activity Recognition Using Body-worn Inertial Sensors}, journal = {ACM Computing Surveys}, volume = {46}, number = {3}, year = {2014}, pages = {1--33}, doi = {10.1145/2499621} } -
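A minimal sketch of the activity recognition chain (ARC) described above — sliding-window segmentation, feature extraction, classification — is given below. The window length, the two statistical features, and the nearest-centroid classifier are illustrative choices, not the tutorial's reference implementation.

```python
# Minimal sketch of an activity recognition chain (ARC):
# sliding-window segmentation -> feature extraction -> classification.
# Window size, features, and classifier are illustrative choices only.
import numpy as np

def sliding_windows(signal, size, step):
    """Segment a 1D signal into overlapping windows."""
    return np.array([signal[i:i + size]
                     for i in range(0, len(signal) - size + 1, step)])

def extract_features(windows):
    """Per-window mean and standard deviation as a toy feature vector."""
    return np.column_stack([windows.mean(axis=1), windows.std(axis=1)])

class NearestCentroid:
    """Simple classifier: assign each sample to the closest class centroid."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[d.argmin(axis=1)]

# Synthetic inertial data: a 'still' segment followed by a 'gesture' segment.
rng = np.random.default_rng(0)
still = rng.normal(0.0, 0.05, 500)
gesture = np.sin(np.linspace(0, 40 * np.pi, 500)) + rng.normal(0.0, 0.2, 500)

X = extract_features(sliding_windows(np.concatenate([still, gesture]), 50, 25))
y = np.array([0] * (len(X) // 2) + [1] * (len(X) - len(X) // 2))

clf = NearestCentroid().fit(X, y)
print("window predictions:", clf.predict(X))
```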
Cognition-Aware Computing
Andreas Bulling, Thorsten O. Zander
IEEE Pervasive Computing, 13 (3), pp. 80-83, 2014.
Despite significant advances in context sensing and inference since its inception in the late 1990s, context-aware computing still doesn’t implement a holistic view of all covert aspects of the user state. Here, the authors introduce the concept of cognitive context as an extension to the current notion of context with a cognitive dimension. They argue that visual behavior and brain activity are two promising sensing modalities for assessing the cognitive context and thus the development of cognition-aware computing systems.doi: 10.1109/mprv.2014.42Paper: bulling14_pcm.pdf@article{bulling14_pcm, author = {Bulling, Andreas and Zander, Thorsten O.}, keywords = {bioinformatics, cognition, cognition-aware computing, Context modeling, Context-aware computing, electroencephalography, intelligent systems, Pervasive computing, Sensors, tracking, Visualization}, title = {Cognition-Aware Computing}, journal = {IEEE Pervasive Computing}, volume = {13}, number = {3}, year = {2014}, pages = {80-83}, doi = {10.1109/mprv.2014.42} } -
On the potential of human visual behaviour for memory augmentation and life logging
Andreas Bulling
Dagstuhl Reports/14362, 2014.
@article{bulling14_dagstuhl, title = {On the potential of human visual behaviour for memory augmentation and life logging}, author = {Bulling, Andreas}, journal = {Dagstuhl Reports/14362}, year = {2014} }
Conference Papers
-
Test-time Adaptation for 3D Human Pose Estimation
Sikandar Amin, Mykhaylo Andriluka, Philipp Müller, Andreas Bulling
Proc. of the DAGM German Conference on Pattern Recognition (GCPR), pp. 253-264, 2014.
In this paper we consider the task of articulated 3D human pose estimation in challenging scenes with dynamic background and multiple people. Initial progress on this task has been achieved building on discriminatively trained part-based models that deliver a set of 2D body pose candidates that are then subsequently refined by reasoning in 3D [1, 4, 5]. The performance of such methods is limited by the performance of the underlying 2D pose estimation approaches. In this paper we explore a way to boost the performance of 2D pose estimation based on the output of the 3D pose reconstruction process, thus closing the loop in the pose estimation pipeline. We build our approach around a component that is able to identify true positive pose estimation hypotheses with high confidence. We then either retrain 2D pose estimation models using such highly confident hypotheses as additional training examples, or we use similarity to these hypotheses as a cue for 2D pose estimation. We consider a number of features that can be used for assessing the confidence of the pose estimation results. The strongest feature in our comparison corresponds to the ensemble agreement on the 3D pose output. We evaluate our approach on two publicly available datasets, improving over the state of the art in each case.Paper: amin14_gcpr.pdf@inproceedings{amin14_gcpr, author = {Amin, Sikandar and Andriluka, Mykhaylo and M{\"{u}}ller, Philipp and Bulling, Andreas}, title = {Test-time Adaptation for 3D Human Pose Estimation}, booktitle = {Proc. of the DAGM German Conference on Pattern Recognition (GCPR)}, year = {2014}, pages = {253-264}, doi = {10.1007/978-3-319-11752-2_20} } -
In the Blink of an Eye: Combining Head Motion and Eye Blink Frequency for Activity Recognition with Google Glass
Shoya Ishimaru, Jens Weppner, Kai Kunze, Koichi Kise, Andreas Dengel, Paul Lukowicz, Andreas Bulling
Proc. ACM Augmented Human International Conference (AH), pp. 1–4, 2014.
We demonstrate how information about eye blink frequency and head motion patterns derived from Google Glass sensors can be used to distinguish different types of high level activities. While it is well known that eye blink frequency is correlated with user activity, our aim is to show that (1) eye blink frequency data from an unobtrusive, commercial platform which is not a dedicated eye tracker is good enough to be useful and (2) that adding head motion patterns information significantly improves the recognition rates. The method is evaluated on a data set from an experiment containing five activity classes (reading, talking, watching TV, mathematical problem solving, and sawing) of eight participants showing 67% recognition accuracy for eye blinking only and 82% when extended with head motion patterns.Paper: ishimaru14_ah.pdf@inproceedings{ishimaru14_ah, author = {Ishimaru, Shoya and Weppner, Jens and Kunze, Kai and Kise, Koichi and Dengel, Andreas and Lukowicz, Paul and Bulling, Andreas}, title = {In the Blink of an Eye: Combining Head Motion and Eye Blink Frequency for Activity Recognition with Google Glass}, booktitle = {Proc. ACM Augmented Human International Conference (AH)}, year = {2014}, pages = {1--4}, doi = {10.1145/2582051.2582066} } -
Pupil: an open source platform for pervasive eye tracking and mobile gaze-based interaction
Moritz Kassner, William Patera, Andreas Bulling
Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 1151-1160, 2014.
In this paper we present Pupil – an accessible, affordable, and extensible open source platform for pervasive eye tracking and gaze-based interaction. Pupil comprises 1) a light-weight eye tracking headset, 2) an open source software framework for mobile eye tracking, as well as 3) a graphical user interface to playback and visualize video and gaze data. Pupil features high-resolution scene and eye cameras for monocular and binocular gaze estimation. The software and GUI are platform-independent and include state-of-the-art algorithms for real-time pupil detection and tracking, calibration, and accurate gaze estimation. Results of a performance evaluation show that Pupil can provide an average gaze estimation accuracy of 0.6 degree of visual angle (0.08 degree precision) with a processing pipeline latency of only 0.045 seconds.Paper: kassner14_ubicomp.pdf@inproceedings{kassner14_ubicomp, title = {Pupil: an open source platform for pervasive eye tracking and mobile gaze-based interaction}, author = {Kassner, Moritz and Patera, William and Bulling, Andreas}, year = {2014}, doi = {10.1145/2638728.2641695}, booktitle = {Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, pages = {1151-1160} } -
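The abstract above reports gaze estimation accuracy and precision in degrees of visual angle. The sketch below shows how such numbers can be computed from gaze direction vectors, using the common conventions that accuracy is the mean angular offset to the target and precision is the RMS of angular distances between successive samples; these definitions are assumptions here, not necessarily the exact evaluation procedure used for Pupil.

```python
# Sketch: accuracy and precision of gaze estimates in degrees of visual angle.
# Accuracy = mean angular error to the target direction; precision = RMS of
# the angular distance between successive samples (common conventions).
import numpy as np

def angle_deg(a, b):
    """Angle between two direction vectors (or arrays of vectors) in degrees."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    cos = np.clip(np.sum(a * b, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def accuracy(gaze, target):
    """Mean angular offset between estimated gaze and true target directions."""
    return angle_deg(gaze, target).mean()

def precision_rms(gaze):
    """Root-mean-square of angular distances between successive samples."""
    d = angle_deg(gaze[:-1], gaze[1:])
    return np.sqrt(np.mean(d ** 2))

# Toy data: gaze samples scattered around a fixation target straight ahead.
rng = np.random.default_rng(1)
target = np.tile([0.0, 0.0, 1.0], (200, 1))
gaze = target + rng.normal(0.0, 0.01, target.shape)

print(f"accuracy:  {accuracy(gaze, target):.2f} deg")
print(f"precision: {precision_rms(gaze):.2f} deg")
```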
SmudgeSafe: Geometric Image Transformations for Smudge-resistant User Authentication
Stefan Schneegass, Frank Steimle, Andreas Bulling, Florian Alt, Albrecht Schmidt
Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 775-786, 2014.
Touch-enabled user interfaces have become ubiquitous, such as on ATMs or portable devices. At the same time, authentication using touch input is problematic, since finger smudge traces may allow attackers to reconstruct passwords. We present SmudgeSafe, an authentication system that uses random geometric image transformations, such as translation, rotation, scaling, shearing, and flipping, to increase the security of cued-recall graphical passwords. We describe the design space of these transformations and report on two user studies: A lab-based security study involving 20 participants in attacking user-defined passwords, using high quality pictures of real smudge traces captured on a mobile phone display; and an in-the-field usability study with 374 participants who generated more than 130,000 logins on a mobile phone implementation of SmudgeSafe. Results show that SmudgeSafe significantly increases security compared to authentication schemes based on PINs and lock patterns, and exhibits very high learnability, efficiency, and memorability.Paper: schneegass14_ubicomp.pdf@inproceedings{schneegass14_ubicomp, author = {Schneegass, Stefan and Steimle, Frank and Bulling, Andreas and Alt, Florian and Schmidt, Albrecht}, title = {SmudgeSafe: Geometric Image Transformations for Smudge-resistant User Authentication}, booktitle = {Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, year = {2014}, pages = {775-786}, doi = {10.1145/2632048.2636090} } -
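To illustrate the core idea above, the sketch below composes a random 2D affine map (rotation, scaling, shearing, flipping, translation) and applies it to the image coordinates of the password points, so that smudge traces from a previous login no longer line up with the current presentation. The parameter ranges are illustrative assumptions, not the design space evaluated in the paper.

```python
# Sketch of randomised geometric image transformations for a cued-recall
# graphical password: compose a random affine map and apply it to the
# password point coordinates. Parameter ranges are illustrative only.
import numpy as np

def random_affine(rng):
    theta = rng.uniform(-np.pi / 6, np.pi / 6)   # rotation
    s = rng.uniform(0.8, 1.2)                    # isotropic scale
    shear = rng.uniform(-0.2, 0.2)               # horizontal shear
    flip = rng.choice([-1.0, 1.0])               # optional mirroring
    t = rng.uniform(-30, 30, size=2)             # translation (pixels)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    S = np.array([[flip * s, shear * s],
                  [0.0,      s]])
    return R @ S, t

def transform_points(points, A, t):
    """Apply x' = A x + t to an (N, 2) array of image coordinates."""
    return points @ A.T + t

rng = np.random.default_rng(42)
password_points = np.array([[120.0, 80.0], [200.0, 150.0], [60.0, 210.0]])
A, t = random_affine(rng)
print(transform_points(password_points, A, t))
```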
Ubic: Bridging the gap between digital cryptography and the physical world
Mark Simkin, Dominique Schröder, Andreas Bulling, Mario Fritz
Proc. European Symposium on Research in Computer Security (ESORICS), pp. 56-75, 2014.
Advances in computing technology increasingly blur the boundary between the digital domain and the physical world. Although the research community has developed a large number of cryptographic primitives and has demonstrated their usability in all-digital communication, many of them have not yet made their way into the real world due to usability aspects. We aim to make another step towards a tighter integration of digital cryptography into real world interactions. We describe Ubic, a framework that allows users to bridge the gap between digital cryptography and the physical world. Ubic relies on head-mounted displays, like Google Glass, resource-friendly computer vision techniques as well as mathematically sound cryptographic primitives to provide users with better security and privacy guarantees. The framework covers key cryptographic primitives, such as secure identification, document verification using a novel secure physical document format, as well as content hiding. To make a contribution of practical value, we focused on making Ubic as simple, easily deployable, and user friendly as possible.Paper: simkin14_esorics.pdf@inproceedings{simkin14_esorics, author = {Simkin, Mark and Schr{\"{o}}der, Dominique and Bulling, Andreas and Fritz, Mario}, keywords = {authentication, content hiding, content verification, cryptography, head-mounted displays, ubiquitous, usable security}, title = {Ubic: Bridging the gap between digital cryptography and the physical world}, booktitle = {Proc. European Symposium on Research in Computer Security (ESORICS)}, year = {2014}, pages = {56-75}, doi = {10.1007/978-3-319-11203-9_4} } -
Cross-Device Gaze-Supported Point-to-Point Content Transfer
Jayson Turner, Andreas Bulling, Jason Alexander, Hans Gellersen
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 19-26, 2014.
Within a pervasive computing environment, we see content on shared displays that we wish to acquire and use in a specific way i.e., with an application on a personal device, transferring from point-to-point. The eyes as input can indicate intention to interact with a service, providing implicit pointing as a result. In this paper we investigate the use of gaze and manual input for the positioning of gaze-acquired content on personal devices. We evaluate two main techniques, (1) Gaze Positioning, transfer of content using gaze with manual input to confirm actions, (2) Manual Positioning, content is selected with gaze but final positioning is performed by manual input, involving a switch of modalities from gaze to manual input. A first user study compares these techniques applied to direct and indirect manual input configurations, a tablet with touch input and a laptop with mouse input. A second study evaluated our techniques in an application scenario involving distractor targets. Our overall results showed general acceptance and understanding of all conditions, although there were clear individual user preferences dependent on familiarity and preference toward gaze, touch, or mouse input.Paper: turner14_etra.pdf@inproceedings{turner14_etra, author = {Turner, Jayson and Bulling, Andreas and Alexander, Jason and Gellersen, Hans}, title = {Cross-Device Gaze-Supported Point-to-Point Content Transfer}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2014}, pages = {19-26}, doi = {10.1145/2578153.2578155} } -
EyeTab: Model-based gaze estimation on unmodified tablet computers
Erroll Wood, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 207-210, 2014.
Despite the widespread use of mobile phones and tablets, hand-held portable devices have only recently been identified as a promising platform for gaze-aware applications. Estimating gaze on portable devices is challenging given their limited computational resources, low quality integrated front-facing RGB cameras, and small screens to which gaze is mapped. In this paper we present EyeTab, a model-based approach for binocular gaze estimation that runs entirely on an unmodified tablet. EyeTab builds on set of established image processing and computer vision algorithms and adapts them for robust and near-realtime gaze estimation. A technical prototype evaluation with eight participants in a normal indoors office setting shows that EyeTab achieves an average gaze estimation accuracy of 6.88° of visual angle at 12 frames per second.Paper: wood14_etra.pdf@inproceedings{wood14_etra, author = {Wood, Erroll and Bulling, Andreas}, title = {EyeTab: Model-based gaze estimation on unmodified tablet computers}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2014}, pages = {207-210}, doi = {10.1145/2578153.2578185} } -
Pupil-Canthi-Ratio: A Calibration-Free Method for Tracking Horizontal Gaze Direction
Yanxia Zhang, Andreas Bulling, Hans Gellersen
Proc. International Conference on Advanced Visual Interfaces (AVI), pp. 129-132, 2014.
Eye tracking is compelling for hands-free interaction with pervasive displays. However, most existing eye tracking systems require specialised hardware and explicit calibrations of equipment and individual users, which inhibit their widespread adoption. In this work, we present a light-weight and calibration-free gaze estimation method that leverages only an off-the-shelf camera to track users’ gaze horizontally. We introduce pupil-canthi-ratio (PCR), a novel measure for estimating gaze directions. By using the displacement vector between the inner eye corner and the pupil centre of an eye, PCR is calculated as the ratio of the displacement vectors from both eyes. We establish a mapping between PCR to gaze direction by Gaussian process regression, which inherently infers averted horizontal gaze directions of users. We present a study to identify the characteristics of PCR. The results show that PCR achieved an average accuracy of 3.9 degrees across different people. Finally, we show examples of real-time applications of PCR that allow users to interact with a display by moving only their eyes.Paper: zhang14_avi.pdf@inproceedings{zhang14_avi, author = {Zhang, Yanxia and Bulling, Andreas and Gellersen, Hans}, title = {Pupil-Canthi-Ratio: A Calibration-Free Method for Tracking Horizontal Gaze Direction}, booktitle = {Proc. International Conference on Advanced Visual Interfaces (AVI)}, year = {2014}, pages = {129-132}, doi = {10.1145/2598153.2598186} } -
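The sketch below illustrates a pupil-canthi-ratio style measure and a Gaussian process mapping from it to horizontal gaze direction, as outlined in the abstract above. The exact feature definition and regression setup in the paper may differ; the landmark coordinates are synthetic and the helper names are hypothetical.

```python
# Sketch of a pupil-canthi-ratio (PCR) style measure and a Gaussian process
# regression from PCR to horizontal gaze angle. Feature definition and
# regression setup follow the paper only loosely; data are synthetic.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def pcr(pupil_l, canthus_l, pupil_r, canthus_r):
    """Ratio of the horizontal pupil-to-inner-eye-corner displacements."""
    dx_l = pupil_l[0] - canthus_l[0]
    dx_r = pupil_r[0] - canthus_r[0]
    return dx_l / dx_r

# Synthetic calibration data: PCR values observed while looking at targets
# at known horizontal angles (degrees).
angles = np.linspace(-20, 20, 9)
pcr_values = 1.0 + 0.03 * angles + np.random.default_rng(0).normal(0, 0.01, angles.size)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=10.0), alpha=1e-4)
gp.fit(pcr_values.reshape(-1, 1), angles)

# Estimate the horizontal gaze direction for a new PCR measurement.
sample = pcr(pupil_l=(312.0, 140.0), canthus_l=(300.0, 142.0),
             pupil_r=(362.0, 141.0), canthus_r=(352.0, 142.0))
print(f"PCR={sample:.2f} -> {gp.predict([[sample]])[0]:.1f} deg")
```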
GazeHorizon: Enabling Passers-by to Interact with Public Displays by Gaze
Yanxia Zhang, Hans Jörg Müller, Ming Ki Chong, Andreas Bulling, Hans Gellersen
Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 559-563, 2014.
Public displays can be made interactive by adding gaze control. However, gaze interfaces do not offer any physical affordance, and require users to move into a tracking range. We present GazeHorizon, a system that provides interactive assistance to enable passers-by to walk up to a display and to navigate content using their eyes only. The system was developed through field studies culminating in a four-day deployment in a public environment. Our results show that novice users can be facilitated to successfully use gaze control by making them aware of the interface at first glance and guiding them interactively into the tracking range.@inproceedings{zhang14_ubicomp, author = {Zhang, Yanxia and M{\"{u}}ller, Hans J{\"{o}}rg and Chong, Ming Ki and Bulling, Andreas and Gellersen, Hans}, title = {GazeHorizon: Enabling Passers-by to Interact with Public Displays by Gaze}, booktitle = {Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, year = {2014}, pages = {559-563}, doi = {10.1145/2632048.2636071}, video = {https://www.youtube.com/watch?v=zKsSeLvvsXU} }
Book Chapters
-
Eye Tracking and Eye-Based Human-Computer Interaction
Päivi Majaranta, Andreas Bulling
Stephen H. Fairclough, Kiel Gilleade (Eds.): Advances in Physiological Computing, Springer Publishing London, pp. 39-65, 2014.
Eye tracking has a long history in medical and psychological research as a tool for recording and studying human visual behavior. Real-time gaze-based text entry can also be a powerful means of communication and control for people with physical disabilities. Following recent technological advances and the advent of affordable eye trackers, there is a growing interest in pervasive attention-aware systems and interfaces that have the potential to revolutionize mainstream human-technology interaction. In this chapter, we provide an introduction to the state-of-the-art in eye tracking technology and gaze estimation. We discuss challenges involved in using a perceptual organ, the eye, as an input modality. Examples of real life applications are reviewed, together with design solutions derived from research results. We also discuss how to match the user requirements and key features of different eye tracking systems to find the best system for each task and application.Paper: majaranta14_apc.pdf@inbook{majaranta14_apc, author = {Majaranta, P{\"{a}}ivi and Bulling, Andreas}, title = {Eye Tracking and Eye-Based Human-Computer Interaction}, booktitle = {Advances in Physiological Computing}, editor = {Fairclough, Stephen H. and Gilleade, Kiel}, year = {2014}, pages = {39-65}, publisher = {Springer Publishing London}, doi = {10.1007/978-1-4471-6392-3_3} }
Technical Reports
-
Pupil: An Open Source Platform for Pervasive Eye Tracking and Mobile Gaze-based Interaction
Moritz Kassner, William Patera, Andreas Bulling
arXiv:1405.0006, pp. 1–10, 2014.
Commercial head-mounted eye trackers provide useful features to customers in industry and research but are expensive and rely on closed source hardware and software. This limits the application areas and use of mobile eye tracking to expert users and inhibits user-driven development, customisation, and extension. In this paper we present Pupil – an accessible, affordable, and extensible open source platform for mobile eye tracking and gaze-based interaction. Pupil comprises 1) a light-weight headset with high-resolution cameras, 2) an open source software framework for mobile eye tracking, as well as 3) a graphical user interface (GUI) to playback and visualize video and gaze data. Pupil features high-resolution scene and eye cameras for monocular and binocular gaze estimation. The software and GUI are platform-independent and include state-of-the-art algorithms for real-time pupil detection and tracking, calibration, and accurate gaze estimation. Results of a performance evaluation show that Pupil can provide an average gaze estimation accuracy of 0.6 degree of visual angle (0.08 degree precision) with a latency of the processing pipeline of only 0.045 seconds.Paper: kassner14_arxiv.pdfPaper Access: https://arxiv.org/abs/1405.0006@techreport{kassner14_arxiv, author = {Kassner, Moritz and Patera, William and Bulling, Andreas}, keywords = {eye movement, Gaze-based Interaction, Mobile Eye Tracking, Wearable Computing}, title = {Pupil: An Open Source Platform for Pervasive Eye Tracking and Mobile Gaze-based Interaction}, year = {2014}, pages = {1--10}, institution = {Pupil Labs UG, Berlin, Germany and Max Planck Institute for Informatics, Saarbr{\"{u}}cken, Germany}, url = {https://arxiv.org/abs/1405.0006} } -
Ubic: Bridging the gap between digital cryptography and the physical world
Mark Simkin, Andreas Bulling, Mario Fritz, Dominique Schröder
arXiv:1403.1343, pp. 1–20, 2014.
Advances in computing technology increasingly blur the boundary between the digital domain and the physical world. Although the research community has developed a large number of cryptographic primitives and has demonstrated their usability in all-digital communication, many of them have not yet made their way into the real world due to usability aspects. We aim to make another step towards a tighter integration of digital cryptography into real world interactions. We describe Ubic, a framework that allows users to bridge the gap between digital cryptography and the physical world. Ubic relies on head-mounted displays, like Google Glass, resource-friendly computer vision techniques as well as mathematically sound cryptographic primitives to provide users with better security and privacy guarantees. The framework covers key cryptographic primitives, such as secure identification, document verification using a novel secure physical document format, as well as content hiding. To make a contribution of practical value, we focused on making Ubic as simple, easily deployable, and user friendly as possible.Paper: simkin14_arxiv.pdfPaper Access: https://arxiv.org/abs/1403.1343@techreport{simkin14_arxiv, author = {Simkin, Mark and Bulling, Andreas and Fritz, Mario and Schr{\"{o}}der, Dominique}, title = {Ubic: Bridging the gap between digital cryptography and the physical world}, year = {2014}, pages = {1--20}, url = {https://arxiv.org/abs/1403.1343} }
2013
Journal Articles
-
Using eye-tracking glasses to evaluate the effect of visual scanning training on everyday activities
Tobias Loetscher, Michael Nicholls, Nicole Thomas, Andreas Bulling, Gayle Clarke, Allison Hayes, Celia Chen
Brain Impairment, 14 (2), pp. 354-355, 2013.
Screening for cognitive impairment may help predict neurorehabilitation outcomes. We investigated (1) the use of the ACE-R in predicting functional gain during in-patient rehabilitation, and (2) whether ACE-R scores identified patients requiring additional therapy support during their admission.Paper: loetscher13_bi.pdf@article{loetscher13_bi, title = {Using eye-tracking glasses to evaluate the effect of visual scanning training on everyday activities}, author = {Loetscher, Tobias and Nicholls, Michael and Thomas, Nicole and Bulling, Andreas and Clarke, Gayle and Hayes, Allison and Chen, Celia}, year = {2013}, journal = {Brain Impairment}, volume = {14}, number = {2}, pages = {354-355} }
Conference Papers
-
EyeContext: Recognition of High-level Contextual Cues from Human Visual Behaviour
Andreas Bulling, Christian Weichel, Hans Gellersen
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 305-308, 2013.
In this work we present EyeContext, a system to infer high-level contextual cues from human visual behaviour. We conducted a user study to record eye movements of four participants over a full day of their daily life, totalling 42.5 hours of eye movement data. Participants were asked to self-annotate four non-mutually exclusive cues: social (interacting with somebody vs. no interaction), cognitive (concentrated work vs. leisure), physical (physically active vs. not active), and spatial (inside vs. outside a building). We evaluate a proof-of-concept EyeContext system that combines encoding of eye movements into strings and a spectrum string kernel support vector machine (SVM) classifier. Our results demonstrate the large information content available in long-term human visual behaviour and open up new avenues for research on eye-based behavioural monitoring and life logging.Paper: bulling13_chi.pdf@inproceedings{bulling13_chi, author = {Bulling, Andreas and Weichel, Christian and Gellersen, Hans}, title = {EyeContext: Recognition of High-level Contextual Cues from Human Visual Behaviour}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2013}, pages = {305-308}, doi = {10.1145/2470654.2470697}, video = {https://www.youtube.com/watch?v=bhdVmWnnnIM} } -
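A minimal sketch of a p-spectrum string kernel SVM on eye movement strings, as named in the abstract above: each string is mapped to counts of all length-p substrings, and the kernel is the inner product of these count vectors (equivalently, a linear SVM on spectrum features). The encoding alphabet, p, and the toy labels are illustrative assumptions, not the EyeContext pipeline.

```python
# Sketch of a p-spectrum string kernel over encoded eye movement strings.
# A linear SVM on substring-count features is equivalent to a spectrum-kernel
# SVM. Alphabet, p, and labels are illustrative assumptions only.
from collections import Counter
from itertools import product
import numpy as np
from sklearn.svm import SVC

ALPHABET = "LRUD"   # e.g. left/right/up/down saccade symbols
P = 2               # substring length of the spectrum
VOCAB = ["".join(t) for t in product(ALPHABET, repeat=P)]

def spectrum_features(s):
    """Count vector over all length-P substrings of s."""
    counts = Counter(s[i:i + P] for i in range(len(s) - P + 1))
    return np.array([counts[w] for w in VOCAB], dtype=float)

# Toy labelled eye movement strings ('reading-like' vs 'scanning-like').
strings = ["RRRRLRRRRL", "RRRLRRRRLR", "ULDRULDRUD", "DLURDLURDL"]
labels = [0, 0, 1, 1]

X = np.vstack([spectrum_features(s) for s in strings])
clf = SVC(kernel="linear").fit(X, labels)
print(clf.predict([spectrum_features("RRRRLRRRLR")]))  # -> [0]
```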
I know what you are reading – Recognition of document types using mobile eye tracking
Kai Kunze, Andreas Bulling, Yuzuko Utsumi, Shiga Yuki, Koichi Kise
Proc. IEEE International Symposium on Wearable Computers (ISWC), pp. 113-116, 2013.
Reading is a ubiquitous activity that many people even perform in transit, such as while on the bus or while walking. Tracking reading enables us to gain more insights about expertise level and potential knowledge of users – towards a reading log that tracks and improves knowledge acquisition. As a first step towards this vision, in this work we investigate whether different document types can be automatically detected from visual behaviour recorded using a mobile eye tracker. We present an initial recognition approach that combines special-purpose eye movement features as well as machine learning for document type detection. We evaluate our approach in a user study with eight participants and five Japanese document types and achieve a recognition performance of 74% using user-independent training.Paper: kunze13_iswc.pdf@inproceedings{kunze13_iswc, author = {Kunze, Kai and Bulling, Andreas and Utsumi, Yuzuko and Yuki, Shiga and Kise, Koichi}, title = {I know what you are reading -- Recognition of document types using mobile eye tracking}, booktitle = {Proc. IEEE International Symposium on Wearable Computers (ISWC)}, year = {2013}, pages = {113-116}, doi = {10.1145/2493988.2494354} } -
Pursuit Calibration: Making Gaze Calibration Less Tedious and More Flexible
Ken Pfeuffer, Mélodie Vidal, Jayson Turner, Andreas Bulling, Hans Gellersen
Proc. ACM Symposium on User Interface Software and Technology (UIST), pp. 261-270, 2013.
Eye gaze is a compelling interaction modality but requires a user calibration before interaction can commence. State of the art procedures require the user to fixate on a succession of calibration markers, a task that is often experienced as difficult and tedious. We present a novel approach, pursuit calibration, that instead uses moving targets for calibration. Users naturally perform smooth pursuit eye movements when they follow a moving target, and we use correlation of eye and target movement to detect the user’s attention and to sample data for calibration. Because the method knows when the user is attending to a target, the calibration can be performed implicitly, which enables more flexible design of the calibration task. We demonstrate this in application examples and user studies, and show that pursuit calibration is tolerant to interruption, can blend naturally with applications, and is able to calibrate users without their awareness.@inproceedings{pfeuffer13_uist, author = {Pfeuffer, Ken and Vidal, M{\'{e}}lodie and Turner, Jayson and Bulling, Andreas and Gellersen, Hans}, title = {Pursuit Calibration: Making Gaze Calibration Less Tedious and More Flexible}, booktitle = {Proc. ACM Symposium on User Interface Software and Technology (UIST)}, year = {2013}, pages = {261-270}, doi = {10.1145/2501988.2501998}, video = {https://www.youtube.com/watch?v=T7S76L1Rkow} } -
Eye Pull, Eye Push: Moving Objects between Large Screens and Personal Devices with Gaze & Touch
Jayson Turner, Jason Alexander, Andreas Bulling, Dominik Schmidt, Hans Gellersen
Proc. IFIP TC13 Conference on Human-Computer Interaction (INTERACT), pp. 170-186, 2013.
Previous work has validated the eyes and mobile input as a viable approach for pointing at, and selecting out of reach objects. This work presents Eye Pull, Eye Push, a novel interaction concept for content transfer between public and personal devices using gaze and touch. We present three techniques that enable this interaction: Eye Cut & Paste, Eye Drag & Drop, and Eye Summon & Cast. We outline and discuss several scenarios in which these techniques can be used. In a user study we found that participants responded well to the visual feedback provided by Eye Drag & Drop during object movement. In contrast, we found that although Eye Summon & Cast significantly improved performance, participants had difficulty coordinating their hands and eyes during interaction.Paper: turner13_interact.pdf@inproceedings{turner13_interact, author = {Turner, Jayson and Alexander, Jason and Bulling, Andreas and Schmidt, Dominik and Gellersen, Hans}, title = {Eye Pull, Eye Push: Moving Objects between Large Screens and Personal Devices with Gaze \& Touch}, booktitle = {Proc. IFIP TC13 Conference on Human-Computer Interaction (INTERACT)}, year = {2013}, pages = {170-186}, doi = {10.1007/978-3-642-40480-1_11} } -
Eye Drop: An Interaction Concept for Gaze-Supported Point-to-Point Content Transfer
Jayson Turner, Andreas Bulling, Jason Alexander, Hans Gellersen
Proc. International Conference on Mobile and Ubiquitous Multimedia (MUM), pp. 1–4, 2013.
The shared displays in our environment contain content that we desire. Furthermore, we often acquire content for a specific purpose, i.e., the acquisition of a phone number to place a call. We have developed a content transfer concept, Eye Drop. Eye Drop provides techniques that allow fluid content acquisition, transfer from shared displays, and local positioning on personal devices using gaze combined with manual input. The eyes naturally focus on content we desire. Our techniques use gaze to point remotely, removing the need for explicit pointing on the user’s part. A manual trigger from a personal device confirms selection. Transfer is performed using gaze or manual input to smoothly transition content to a specific location on a personal device. This work demonstrates how techniques can be applied to acquire and apply actions to content through a natural sequence of interaction. We demonstrate a proof of concept prototype through five implemented application scenarios.Paper: turner13_mum.pdf@inproceedings{turner13_mum, author = {Turner, Jayson and Bulling, Andreas and Alexander, Jason and Gellersen, Hans}, title = {Eye Drop: An Interaction Concept for Gaze-Supported Point-to-Point Content Transfer}, booktitle = {Proc. International Conference on Mobile and Ubiquitous Multimedia (MUM)}, year = {2013}, pages = {1--4}, doi = {10.1145/2541831.2541868} } -
AutoBAP: Automatic Coding of Body Action and Posture Units from Wearable Sensors
Eduardo Velloso, Andreas Bulling, Hans Gellersen
Proc. Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII), pp. 135-140, 2013.
Manual annotation of human body movement is an integral part of research on non-verbal communication and computational behaviour analysis but also a very time-consuming and tedious task. In this paper we present AutoBAP, a system that automates the coding of bodily expressions according to the body action and posture (BAP) coding scheme. Our system takes continuous body motion and gaze behaviour data as its input. The data is recorded using a full body motion tracking suit and a wearable eye tracker. From the data our system automatically generates a labelled XML file that can be visualised and edited with off-the-shelf video annotation tools. We evaluate our system in a laboratory-based user study with six participants performing scripted sequences of 184 actions. Results from the user study show that our prototype system is able to annotate 172 out of the 274 labels of the full BAP coding scheme with good agreement with a manual annotator (Cohen’s kappa > 0.6).doi: 10.1109/ACII.2013.29Paper: velloso13_acii.pdf@inproceedings{velloso13_acii, author = {Velloso, Eduardo and Bulling, Andreas and Gellersen, Hans}, title = {AutoBAP: Automatic Coding of Body Action and Posture Units from Wearable Sensors}, booktitle = {Proc. Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII)}, year = {2013}, pages = {135-140}, doi = {10.1109/ACII.2013.29} } -
Qualitative Activity Recognition of Weight Lifting Exercises
Eduardo Velloso, Andreas Bulling, Hans Gellersen, Wallace Ugulino, Hugo Fuks
Proc. Augmented Human International Conference (AH), pp. 116-123, 2013.
Research on human activity recognition has traditionally focused on discriminating between different activities, i.e. to predict “which” activity was performed at a specific point in time. The quality of executing an activity, the “how (well)”, has only received little attention so far, even though it potentially provides useful information for a large variety of applications, such as sports training. In this work we first define quality of execution and investigate three aspects that pertain to qualitative activity recognition: the problem of specifying correct execution, the automatic and robust detection of execution mistakes, and how to provide feedback on the quality of execution to the user. We illustrate our approach on the example problem of qualitatively assessing and providing feedback on weight lifting exercises. In two user studies we try out a sensor- and a model-based approach to qualitative activity recognition. Our results underline the potential of model-based assessment and the positive impact of real-time user feedback on the quality of execution.Paper: velloso13_ah.pdf@inproceedings{velloso13_ah, author = {Velloso, Eduardo and Bulling, Andreas and Gellersen, Hans and Ugulino, Wallace and Fuks, Hugo}, title = {Qualitative Activity Recognition of Weight Lifting Exercises}, booktitle = {Proc. Augmented Human International Conference (AH)}, year = {2013}, pages = {116-123}, doi = {10.1145/2459236.2459256} } -
MotionMA: Motion Modelling and Analysis by Demonstration
Eduardo Velloso, Andreas Bulling, Hans Gellersen
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1309-1318, 2013.
Particularly in sports or physical rehabilitation, users have to perform body movements in a specific manner for the exercises to be most effective. It remains a challenge for experts to specify how to perform such movements so that an automated system can analyse further performances of it. In a user study with 10 participants we show that experts’ explicit estimates do not correspond to their performances. To address this issue we present MotionMA, a system that: (1) automatically extracts a model of movements demonstrated by one user, e.g. a trainer, (2) assesses the performance of other users repeating this movement in real time, and (3) provides real-time feedback on how to improve their performance. We evaluated the system in a second study in which 10 other participants used the system to demonstrate arbitrary movements. Our results demonstrate that MotionMA is able to extract an accurate movement model to spot mistakes and variations in movement execution.Paper: velloso13_chi.pdf@inproceedings{velloso13_chi, author = {Velloso, Eduardo and Bulling, Andreas and Gellersen, Hans}, title = {MotionMA: Motion Modelling and Analysis by Demonstration}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2013}, pages = {1309-1318}, doi = {10.1145/2470654.2466171}, video = {https://www.youtube.com/watch?v=fFFWyt9LOhg} } -
Pursuits: eye-based interaction with moving targets
Mélodie Vidal, Ken Pfeuffer, Andreas Bulling, Hans Gellersen
Ext. Abstr. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 3147–3150, 2013.
Eye-based interaction has commonly been based on estimation of eye gaze direction, to locate objects for interaction. We introduce Pursuits, a novel and very different eye tracking method that instead is based on following the trajectory of eye movement and comparing this with trajectories of objects in the field of view. Because the eyes naturally follow the trajectory of moving objects of interest, our method is able to detect what the user is looking at, by matching eye movement and object movement. We illustrate Pursuits with three applications that demonstrate how the method facilitates natural interaction with moving targets.Paper: vidal13_chi.pdf@inproceedings{vidal13_chi, author = {Vidal, M{\'{e}}lodie and Pfeuffer, Ken and Bulling, Andreas and Gellersen, Hans}, keywords = {eye gaze, natural user interface, smooth pursuit eye movement}, title = {Pursuits: eye-based interaction with moving targets}, booktitle = {Ext. Abstr. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2013}, pages = {3147--3150}, doi = {10.1145/2468356.2479632} } -
Pursuits: Spontaneous Interaction with Displays based on Smooth Pursuit Eye Movement and Moving Targets
Mélodie Vidal, Andreas Bulling, Hans Gellersen
Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 439-448, 2013.
Although gaze is an attractive modality for pervasive interactions, the real-world implementation of eye-based interfaces poses significant challenges, such as calibration. We present Pursuits, an innovative interaction technique that enables truly spontaneous interaction with eye-based interfaces. A user can simply walk up to the screen and readily interact with moving targets. Instead of being based on gaze location, Pursuits correlates eye pursuit movements with objects dynamically moving on the interface. We evaluate the influence of target speed, number and trajectory and develop guidelines for designing Pursuits-based interfaces. We then describe six realistic usage scenarios and implement three of them to evaluate the method in a usability study and a field study. Our results show that Pursuits is a versatile and robust technique and that users can interact with Pursuits-based interfaces without prior knowledge or preparation phase.@inproceedings{vidal13_ubicomp, author = {Vidal, M{\'{e}}lodie and Bulling, Andreas and Gellersen, Hans}, title = {Pursuits: Spontaneous Interaction with Displays based on Smooth Pursuit Eye Movement and Moving Targets}, booktitle = {Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, year = {2013}, pages = {439-448}, doi = {10.1145/2468356.2479632}, video = {https://www.youtube.com/watch?v=fpVPD_wQAWo} } -
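The core of the Pursuits idea described above can be sketched as correlating the recorded gaze trajectory with the trajectory of every on-screen moving target and selecting the target whose motion the eyes follow most closely. The window length, the per-axis correlation, and the threshold below are illustrative assumptions, not the exact parameters from the paper.

```python
# Sketch of Pursuits-style target matching: correlate the gaze trajectory
# with each target trajectory and pick the best match above a threshold.
# Window length and threshold are illustrative assumptions only.
import numpy as np

def pursuit_match(gaze_xy, targets_xy, threshold=0.8):
    """gaze_xy: (T, 2); targets_xy: dict name -> (T, 2). Returns (name, score)."""
    scores = {}
    for name, traj in targets_xy.items():
        rx = np.corrcoef(gaze_xy[:, 0], traj[:, 0])[0, 1]
        ry = np.corrcoef(gaze_xy[:, 1], traj[:, 1])[0, 1]
        scores[name] = min(rx, ry)   # require both axes to correlate
    best = max(scores, key=scores.get)
    return (best, scores[best]) if scores[best] >= threshold else (None, scores[best])

# Toy example: two targets moving on circles in anti-phase; the gaze noisily
# follows target 'A'.
t = np.linspace(0, 2 * np.pi, 120)
target_a = np.column_stack([np.cos(t), np.sin(t)])
target_b = np.column_stack([np.cos(t + np.pi), np.sin(t + np.pi)])
gaze = target_a + np.random.default_rng(3).normal(0, 0.05, target_a.shape)

print(pursuit_match(gaze, {"A": target_a, "B": target_b}))
```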
SideWays: A Gaze Interface for Spontaneous Interaction with Situated Displays
Yanxia Zhang, Andreas Bulling, Hans Gellersen
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 851-860, 2013.
Eye gaze is compelling for interaction with situated displays as we naturally use our eyes to engage with them. In this work we present SideWays, a novel person-independent eye gaze interface that supports spontaneous interaction with displays: users can just walk up to a display and immediately interact using their eyes, without any prior user calibration or training. Requiring only a single off-the-shelf camera and lightweight image processing, SideWays robustly detects whether users attend to the centre of the display or cast glances to the left or right. The system supports an interaction model in which attention to the central display is the default state, while "sidelong glances" trigger input or actions. The robustness of the system and usability of the interaction model are validated in a study with 14 participants. Analysis of the participants’ strategies in performing different tasks provides insights on gaze control strategies for design of SideWays applications.Paper: zhang13_chi.pdf@inproceedings{zhang13_chi, author = {Zhang, Yanxia and Bulling, Andreas and Gellersen, Hans}, title = {SideWays: A Gaze Interface for Spontaneous Interaction with Situated Displays}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2013}, pages = {851-860}, doi = {10.1145/2470654.2470775}, video = {https://www.youtube.com/watch?v=cucOArVoyV0} }
Book Chapters
-
Proc. 3rd International Workshop on Pervasive Eye Tracking and Mobile Eye-Based Interaction (PETMEI)
Andreas Bulling, Roman Bednarik
Proc. European Conference on Eye Movements (ECEM), 2013.
@inbook{bulling13_petmei, author = {Bulling, Andreas and Bednarik, Roman}, title = {Proc. 3rd International Workshop on Pervasive Eye Tracking and Mobile Eye-Based Interaction (PETMEI)}, year = {2013}, booktitle = {Proc. European Conference on Eye Movements (ECEM)} } -
Signal processing technologies for activity-aware smart textiles
Daniel Roggen, Andreas Bulling, Gerhard Tröster
Tünde Kirstein (Ed.): Woodhead Publishing Series in Textiles, Woodhead Publishing Limited, pp. 329-366, 2013.
Garments made of smart textiles have an enormous potential for embedding sensors in close proximity to the body in an unobtrusive and comfortable manner. Combined with signal processing and pattern recognition technologies, complex high-level information about human behaviors or situations can be inferred from the sensor data. The goal of this chapter is to introduce the reader to the design of activity-aware systems that use body-worn sensors, such as those that can be made available through smart textiles. We start this chapter by emphasizing recent trends towards ‘wearable’ sensing and computing and we present several examples of activity-aware applications. Then we outline the role that smart textiles can play in activity-aware applications, but also the challenges that they pose. We conclude by discussing the design process followed to devise activity-aware systems: the choice of sensors, the available data processing methods, and the evaluation techniques. We discuss recent data processing methods that address the challenges resulting from the use of smart textiles.@inbook{roggen13_wpt, author = {Roggen, Daniel and Bulling, Andreas and Tr{\"{o}}ster, Gerhard}, keywords = {Activity Recognition, context recognition, signal processing, Wearable Sensing}, title = {Signal processing technologies for activity-aware smart textiles}, booktitle = {Woodhead Publishing Series in Textiles}, editor = {Kirstein, Tünde}, number = {139}, chapter = {12}, year = {2013}, pages = {329-366}, publisher = {Woodhead Publishing Limited}, doi = {10.1533/9780857093530.2.329} } -
Proc. 4th ACM Augmented Human International Conference (AH)
Albrecht Schmidt, Andreas Bulling, Christian Holz
Stuttgart, Germany, ACM, 2013.
We are very happy to present the proceedings of the 4th Augmented Human International Conference (Augmented Human 2013). Augmented Human 2013 focuses on augmenting human capabilities through technology for increased well-being and enjoyable human experience. The conference is in cooperation with ACM SIGCHI, with its proceedings to be archived in ACM’s Digital Library. With technological advances, computing has progressively moved beyond the desktop into new physical and social contexts. As physical artifacts gain new computational behaviors, they become reprogrammable, customizable, repurposable, and interoperable in rich ecologies and diverse contexts. They also become more complex, and require intense design effort in order to be functional, usable, and enjoyable. Designing such systems requires interdisciplinary thinking. Their creation must not only encompass software, electronics, and mechanics, but also the system’s physical form and behavior, its social and physical milieu, and beyond.Paper Access: http://dl.acm.org/citation.cfm?id=2459236@inbook{schmidt13_ah, author = {Schmidt, Albrecht and Bulling, Andreas and Holz, Christian}, title = {Proc. 4th ACM Augmented Human International Conference (AH)}, year = {2013}, publisher = {ACM}, location = {Stuttgart, Germany}, address = {New York, NY, USA}, url = {http://dl.acm.org/citation.cfm?id=2459236} }
2012
Journal Articles
-
Multimodal Recognition of Reading Activity in Transit Using Body-Worn Sensors
Andreas Bulling, Jamie A. Ward, Hans Gellersen
ACM Transactions on Applied Perception (TAP), 9 (1), pp. 1–21, 2012.
Reading is one of the most well studied visual activities. Vision research traditionally focuses on understanding the perceptual and cognitive processes involved in reading. In this work we recognise reading activity by jointly analysing eye and head movements of people in an everyday environment. Eye movements are recorded using an electrooculography (EOG) system; body movements using body-worn inertial measurement units. We compare two approaches for continuous recognition of reading: String matching (STR) that explicitly models the characteristic horizontal saccades during reading, and a support vector machine (SVM) that relies on 90 eye movement features extracted from the eye movement data. We evaluate both methods in a study performed with eight participants reading while sitting at a desk, standing, walking indoors and outdoors, and riding a tram. We introduce a method to segment reading activity by exploiting the sensorimotor coordination of eye and head movements during reading. Using person-independent training, we obtain an average precision for recognising reading of 88.9% (recall 72.3%) using STR and of 87.7% (recall 87.9%) using SVM over all participants. We show that the proposed segmentation scheme improves the performance of recognising reading events by more than 24%. Our work demonstrates that the joint analysis of multiple modalities is beneficial for reading recognition and opens up discussion on the wider applicability of this recognition approach to other visual and physical activities.Paper: bulling12_tap.pdf@article{bulling12_tap, author = {Bulling, Andreas and Ward, Jamie A. and Gellersen, Hans}, title = {Multimodal {R}ecognition of {R}eading {A}ctivity in {T}ransit {U}sing {B}ody-{W}orn {S}ensors}, journal = {ACM Transactions on Applied Perception (TAP)}, volume = {9}, number = {1}, year = {2012}, pages = {1--21}, doi = {10.1145/2134203.2134205} } -
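The string matching (STR) idea named in the abstract above can be illustrated by encoding horizontal saccades as characters and searching for the characteristic pattern of reading a line of text: several small rightward saccades followed by a large leftward return sweep. The amplitude thresholds and the regular expression below are illustrative assumptions, not the STR implementation from the paper.

```python
# Sketch of string matching for reading detection: encode horizontal saccades
# as characters and flag segments containing the "small forward saccades then
# return sweep" pattern of reading. Thresholds are illustrative only.
import re
import numpy as np

def encode_saccades(dx, small=(1.0, 4.0), sweep=-8.0):
    """Map horizontal saccade amplitudes (deg) to a character string."""
    out = []
    for a in dx:
        if small[0] <= a <= small[1]:
            out.append("r")   # small forward (rightward) saccade
        elif a <= sweep:
            out.append("L")   # large leftward return sweep
        else:
            out.append(".")   # anything else
    return "".join(out)

def looks_like_reading(code, min_lines=2):
    """Reading if at least `min_lines` line patterns (>=3 'r' then one 'L') occur."""
    return len(re.findall(r"r{3,}L", code)) >= min_lines

# Toy saccade amplitude sequence: two read lines, then unrelated eye movements.
dx = np.array([2, 2, 3, 2, -10, 2, 3, 2, 2, -9, 6, -2, 5, 0.5])
code = encode_saccades(dx)
print(code, looks_like_reading(code))   # -> 'rrrrLrrrrL....' True
```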
Wearable Eye Tracking for Mental Health Monitoring
Mélodie Vidal, Jayson Turner, Andreas Bulling, Hans Gellersen
Computer Communications, 35 (11), pp. 1306-1311, 2012.
Pervasive healthcare is a promising field of research as small and unobtrusive on-body sensors become available. However, despite considerable advances in the field, current systems are limited in terms of the pathologies they can detect, particularly regarding mental disorders. In this work we propose wearable eye tracking as a new method for mental health monitoring. We provide two reviews: one of the state-of-the-art in wearable eye tracking equipment and a second one of the work in experimental psychology and clinical research on the link between eye movements and cognition. Both reviews show a significant potential of wearable eye tracking for mental health monitoring in daily life settings. This finding calls for further research on unobtrusive sensing equipment and novel algorithms for automated analysis of long-term eye movement data.Paper: vidal12_comcom.pdf@article{vidal12_comcom, author = {Vidal, M{\'{e}}lodie and Turner, Jayson and Bulling, Andreas and Gellersen, Hans}, title = {Wearable Eye Tracking for Mental Health Monitoring}, journal = {Computer Communications}, volume = {35}, number = {11}, year = {2012}, pages = {1306-1311}, doi = {10.1016/j.comcom.2011.11.002} }
Conference Papers
-
Increasing the Security of Gaze-Based Cued-Recall Graphical Passwords Using Saliency Masks
Andreas Bulling, Florian Alt, Albrecht Schmidt
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 3011-3020, 2012.
With computers being used ever more ubiquitously in situations where privacy is important, secure user authentication is a central requirement. Gaze-based graphical passwords are a particularly promising means for shoulder-surfing-resistant authentication, but selecting secure passwords remains challenging. In this paper, we present a novel gaze-based authentication scheme that makes use of cued-recall graphical passwords on a single image. In order to increase password security, our approach uses a computational model of visual attention to mask those areas of the image that are most likely to attract visual attention. We create a realistic threat model for attacks that may occur in public settings, such as filming the user’s interaction while drawing money from an ATM. Based on a 12-participant user study, we show that our approach is significantly more secure than a standard image-based authentication and gaze-based 4-digit PIN entry.Paper: bulling12_chi.pdf@inproceedings{bulling12_chi, author = {Bulling, Andreas and Alt, Florian and Schmidt, Albrecht}, keywords = {Cued-recall graphical passwords, Eye Tracking, Gaze-based, Saliency masks, user authentication}, title = {Increasing the Security of Gaze-Based Cued-Recall Graphical Passwords Using Saliency Masks}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2012}, pages = {3011-3020}, doi = {10.1145/2207676.2208712} } -
Gaze interaction in the post-WIMP world
Andreas Bulling, Raimund Dachselt, Andrew Duchowski, Robert Jacob, Sophie Stellmach, Veronica Sundstedt
Ext. Abstracts of the 2012 ACM International Conference on Human Factors in Computing Systems (CHI), pp. 1221-1224, 2012.
With continuous progression away from desktop to post-WIMP applications, including multi-touch, gestural, or tangible interaction, there is high potential for eye gaze as a more natural human-computer interface in numerous contexts. Examples include attention-aware adaptations or the combination of gaze and hand gestures for interaction with distant displays. This SIG meeting provides a discussion venue for researchers and practitioners interested in gaze interaction in the post-WIMP era. We wish to draw attention to this emerging field and eventually formulate fundamental research questions. We will discuss the potential of gaze interaction for diverse application areas, interaction tasks, and multimodal user interface combinations. Our aims are to promote this research field, foster a larger research community, and establish the basis for a workshop at CHI 2013.Paper: bulling12_chi_sig.pdf@inproceedings{bulling12_chi_sig, title = {Gaze interaction in the post-WIMP world}, author = {Bulling, Andreas and Dachselt, Raimund and Duchowski, Andrew and Jacob, Robert and Stellmach, Sophie and Sundstedt, Veronica}, isbn = {978-1-4503-1016-1}, doi = {10.1145/2212776.2212428}, year = {2012}, booktitle = {Ext. Abstracts of the 2012 ACM International Conference on Human Factors in Computing Systems (CHI)}, pages = {1221-1224} } -
Analysing the Potential of Adapting Head-Mounted Eye Tracker Calibration to a New User
Benedict Fehringer, Andreas Bulling, Antonio Krüger
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 245-248, 2012.
A key issue with state-of-the-art mobile eye trackers, particularly during long-term recordings in daily life, is the need for cumbersome and time consuming (re)calibration. To reduce this burden, in this paper we investigate the feasibility of adapting the calibration obtained for one user to another. Calibration adaptation is automatically performed using a light-weight linear translation. We compare three different methods to compute the translation: "multi-point", where all calibration-points are used, "1-point", and "0-point" that uses only an external parameter. We evaluate these methods in a 6-participant user study in a controlled laboratory setting by measuring the error in visual angle between the predicted gaze point and the true gaze point. Our results show that, averaged across all participants, the best adapted calibration is only 0.8° (mean) off the calibration obtained for that specific user. We also show the potential of the 1-point and 0-point methods compared to the time-consuming multi-point computation.Paper: fehringer12_etra.pdf@inproceedings{fehringer12_etra, author = {Fehringer, Benedict and Bulling, Andreas and Kr{\"{u}}ger, Antonio}, title = {Analysing the Potential of Adapting Head-Mounted Eye Tracker Calibration to a New User}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2012}, pages = {245-248}, doi = {10.1145/2168556.2168607} } -
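The light-weight linear translation described above can be sketched as computing the 2D offset between predicted and true gaze points, either from all calibration points ("multi-point") or from a single point ("1-point"). The synthetic data below only illustrates the translation idea; it does not reproduce the study's data or the "0-point" variant.

```python
# Sketch of adapting an existing calibration to a new user with a linear
# translation: compare a multi-point offset (mean over all calibration
# points) with a 1-point offset. Data are synthetic.
import numpy as np

def translation_offset(predicted, true):
    """Mean 2D offset that maps predicted gaze points onto the true points."""
    return (true - predicted).mean(axis=0)

def mean_error(predicted, true):
    return np.linalg.norm(predicted - true, axis=1).mean()

rng = np.random.default_rng(7)
true = rng.uniform(0, 1, (9, 2))                 # 9 calibration targets
predicted = true + np.array([0.04, -0.03]) + rng.normal(0, 0.005, true.shape)

t_multi = translation_offset(predicted, true)    # uses all points
t_one = true[0] - predicted[0]                   # uses a single point

print("error before adaptation:", mean_error(predicted, true))
print("multi-point adaptation: ", mean_error(predicted + t_multi, true))
print("1-point adaptation:     ", mean_error(predicted + t_one, true))
```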
Robust, real-time pupil tracking in highly off-axis images
Lech Świrski, Andreas Bulling, Neil Dodgson
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 173-176, 2012.
Robust, accurate, real-time pupil tracking is a key component for online gaze estimation. On head-mounted eye trackers, existing algorithms that rely on circular pupils or contiguous pupil regions fail to detect or accurately track the pupil. This is because the pupil ellipse is often highly eccentric and partially occluded by eyelashes. We present a novel, real-time dark-pupil tracking algorithm that is robust under such conditions. Our approach uses a Haar-like feature detector to roughly estimate the pupil location, performs a k-means segmentation on the surrounding region to refine the pupil centre, and fits an ellipse to the pupil using a novel image-aware Random Sample Consensus (RANSAC) ellipse fitting. We compare our approach against existing real-time pupil tracking implementations, using a set of manually labelled infra-red dark-pupil eye images. We show that our technique has a higher pupil detection rate and greater pupil tracking accuracy.Paper: swirski12_etra.pdf@inproceedings{swirski12_etra, author = {{\'{S}}wirski, Lech and Bulling, Andreas and Dodgson, Neil}, title = {Robust, real-time pupil tracking in highly off-axis images}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2012}, pages = {173-176}, doi = {10.1145/2168556.2168585} } -
Extending the Visual Field of a Head-Mounted Eye Tracker for Pervasive Eye-Based Interaction
Jayson Turner, Andreas Bulling, Hans Gellersen
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 269-272, 2012.
Pervasive eye-based interaction refers to the vision of eye-based interaction becoming ubiquitously usable in everyday life, e.g. across multiple displays in the environment. While current head-mounted eye trackers work well for interaction with displays at similar distances, the scene camera often fails to cover both remote and close proximity displays, e.g. a public display on a wall and a handheld portable device. In this paper we describe an approach that allows for robust detection and gaze mapping across multiple such displays. Our approach uses an additional scene camera to extend the viewing and gaze mapping area of the eye tracker and automatically switches between both cameras depending on the display in view. Results from a pilot study show that our system achieves a similar gaze estimation accuracy to a single-camera system while at the same time increasing usability.Paper: turner12_etra.pdf@inproceedings{turner12_etra, author = {Turner, Jayson and Bulling, Andreas and Gellersen, Hans}, title = {Extending the Visual Field of a Head-Mounted Eye Tracker for Pervasive Eye-Based Interaction}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2012}, pages = {269-272}, doi = {10.1145/2168556.2168613} } -
Eye Gesture Recognition on Portable Devices
Vytautas Vaitukaitis, Andreas Bulling
Proc. International Workshop on Pervasive Eye Tracking and Mobile Gaze-Based Interaction (PETMEI), pp. 711-714, 2012.
Eye tracking using special-purpose devices has a large number of applications, such as usability testing or marketing research. In contrast, portable devices, such as mobile phones and tablets, have received only little attention so far. This is mainly due to their - until recently - limited sensing capabilities and processing power. In this work-in-progress paper we present the first prototype eye gesture recognition system for portable devices that does not require any additional equipment. The system combines techniques from image processing, computer vision and pattern recognition to detect eye gestures in the video recorded using the built-in front-facing camera. In a five-participant pilot study we show that our prototype can recognise four different continuous eye gestures in near real-time with an average accuracy of 60% on an Android-based smartphone (17.6% false positives) and 67.3% on a laptop (5.9% false positives). This initial result is promising and underlines the potential of eye tracking and eye-based interaction on portable devices.Paper: vaitukaitis12_petmei.pdf@inproceedings{vaitukaitis12_petmei, author = {Vaitukaitis, Vytautas and Bulling, Andreas}, keywords = {Eye Gesture, Eye Tracking, gaze estimation, Laptop, Mobile Phone}, title = {Eye Gesture Recognition on Portable Devices}, booktitle = {Proc. International Workshop on Pervasive Eye Tracking and Mobile Gaze-Based Interaction (PETMEI)}, year = {2012}, pages = {711-714}, doi = {10.1145/2370216.2370370} } -
Detection of smooth pursuits using eye movement shape features
Mélodie Vidal, Andreas Bulling, Hans Gellersen
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 177-180, 2012.
Smooth pursuit eye movements hold information about the health, activity and situation of people, but to date there has been no efficient method for their automated detection. In this work we present a method to tackle the problem, based on machine learning. At the core of our method is a novel set of shape features that capture the characteristic shape of smooth pursuit movements over time. The features individually represent incomplete information about smooth pursuits but are combined in a machine learning approach. In an evaluation with eye movements collected from 18 participants, we show that our method can detect smooth pursuit movements with an accuracy of up to 92%, depending on the size of the feature set used for their prediction. Our results have twofold significance. First, they demonstrate a method for smooth pursuit detection in mainstream eye tracking, and secondly they highlight the utility of machine learning for eye movement analysis.Paper: vidal12_etra.pdf@inproceedings{vidal12_etra, author = {Vidal, M{\'{e}}lodie and Bulling, Andreas and Gellersen, Hans}, title = {Detection of smooth pursuits using eye movement shape features}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2012}, pages = {177-180}, doi = {10.1145/2168556.2168586} } -
Towards pervasive gaze tracking with low-level image features
Yanxia Zhang, Andreas Bulling, Hans Gellersen
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 261-264, 2012.
We contribute a novel gaze estimation technique, which is adaptable for person-independent applications. In a study with 17 participants, using a standard webcam, we recorded the subjects’ left eye images for different gaze locations. From these images, we extracted five types of basic visual features. We then sub-selected a set of features with minimum Redundancy Maximum Relevance (mRMR) for the input of a 2-layer regression neural network for estimating the subjects’ gaze. We investigated the effect of different visual features on the accuracy of gaze estimation. Using machine learning techniques, by combining different features, we achieved an average gaze estimation error of 3.44° horizontally and 1.37° vertically in the person-dependent setting.Paper: zhang12_etra.pdf@inproceedings{zhang12_etra, author = {Zhang, Yanxia and Bulling, Andreas and Gellersen, Hans}, title = {Towards pervasive gaze tracking with low-level image features}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2012}, pages = {261-264}, doi = {10.1145/2168556.2168611} }
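A minimal sketch of the kind of pipeline described above (low-level features, feature sub-selection, and a 2-layer regression network) is shown below on synthetic data. It is an illustration rather than the authors' implementation: mutual-information ranking stands in for mRMR, and all array sizes and network parameters are placeholders.

```python
# Illustrative sketch only (synthetic data): feature selection followed by a
# two-layer regression network, loosely mirroring the pipeline described in
# the abstract above. Mutual information stands in for mRMR, which would
# additionally penalise redundancy between the selected features.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))            # hypothetical low-level eye-image features
y_horizontal = rng.uniform(-10, 10, 500)  # hypothetical horizontal gaze angle (degrees)

model = make_pipeline(
    StandardScaler(),
    SelectKBest(mutual_info_regression, k=15),
    MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0),
)
model.fit(X[:400], y_horizontal[:400])
print(model.score(X[400:], y_horizontal[400:]))  # R^2 on held-out samples
```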
Book Chapters
-
Proc. 2nd International Workshop on Pervasive Eye Tracking and Mobile Eye-Based Interaction (PETMEI)
Andreas Bulling, Geert Brône, Shiwei Cheng, Päivi Majaranta
Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 673-676, 2012.
Early work on applied eye tracking investigated gaze as an input modality to interact with a desktop computer and discussed some of the human factors and technical aspects involved in performing common computer tasks with the eyes such as pointing and menu selection. Since then, eye tracking technology has considerably matured. Research on eye-based interaction is starting to gain interest in various specialized areas that are no longer restricted to the desktop environment, such as virtual reality, human-human and human-robot interaction. There is also growing interest in taking eye tracking out into the wild, to mobile and pervasive settings.@inbook{bulling12_petmei, author = {Bulling, Andreas and Br{\^{o}}ne, Geert and Cheng, Shiwei and Majaranta, P{\"{a}}ivi}, title = {Proc. 2nd International Workshop on Pervasive Eye Tracking and Mobile Eye-Based Interaction (PETMEI)}, booktitle = {Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, year = {2012}, pages = {673-676}, doi = {10.1145/2370216.2370362} }
2011
Journal Articles
-
Eye Movement Analysis for Activity Recognition Using Electrooculography
Andreas Bulling, Jamie A. Ward, Hans Gellersen, Gerhard Tröster
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33 (4), pp. 741-753, 2011.
In this work we investigate eye movement analysis as a new sensing modality for activity recognition. Eye movement data was recorded using an electrooculography (EOG) system. We first describe and evaluate algorithms for detecting three eye movement characteristics from EOG signals - saccades, fixations, and blinks - and propose a method for assessing repetitive patterns of eye movements. We then devise 90 different features based on these characteristics and select a subset of them using minimum redundancy maximum relevance feature selection (mRMR). We validate the method using an eight participant study in an office environment using an example set of five activity classes: copying a text, reading a printed paper, taking hand-written notes, watching a video, and browsing the web. We also include periods with no specific activity (the NULL class). Using a support vector machine (SVM) classifier and a person-independent (leave-one-out) training scheme, we obtain an average precision of 76.1% and recall of 70.5% over all classes and participants. The work demonstrates the promise of eye-based activity recognition (EAR) and opens up discussion on the wider applicability of EAR to other activities that are difficult, or even impossible, to detect using common sensing modalities.Paper: bulling11_pami.pdf@article{bulling11_pami, author = {Bulling, Andreas and Ward, Jamie A. and Gellersen, Hans and Tr{\"{o}}ster, Gerhard}, keywords = {Feature evaluation and selection, signal processing, Ubiquitous computing}, title = {Eye {M}ovement {A}nalysis for {A}ctivity {R}ecognition {U}sing {E}lectrooculography}, journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)}, volume = {33}, number = {4}, year = {2011}, pages = {741-753}, doi = {10.1109/TPAMI.2010.86} } -
What’s in the Eyes for Context-Awareness?
Andreas Bulling, Daniel Roggen, Gerhard Tröster
IEEE Pervasive Computing, 10 (2), pp. 48-57, 2011.
Eye movements are a rich source of information about a person’s context. Analyzing the link between eye movements and cognition might even allow us to develop cognition-aware pervasive computing systems that assess a person’s cognitive context.doi: 10.1109/MPRV.2010.49Paper: bulling11_pcm.pdf@article{bulling11_pcm, author = {Bulling, Andreas and Roggen, Daniel and Tr{\"{o}}ster, Gerhard}, keywords = {Machine learning, Pervasive computing, signal processing, Wearable Computing}, title = {What's in the Eyes for Context-Awareness?}, journal = {IEEE Pervasive Computing}, volume = {10}, number = {2}, year = {2011}, pages = {48-57}, doi = {10.1109/MPRV.2010.49} }
Conference Papers
-
Recognition of Visual Memory Recall Processes Using Eye Movement Analysis
Andreas Bulling, Daniel Roggen
Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 455-464, 2011.
Physical activity, location, as well as a person’s psychophysiological and affective state are common dimensions for developing context-aware systems in ubiquitous computing. An important yet missing contextual dimension is the cognitive context that comprises all aspects related to mental information processing, such as perception, memory, knowledge, or learning. In this work we investigate the feasibility of recognising visual memory recall. We use a recognition methodology that combines minimum redundancy maximum relevance feature selection (mRMR) with a support vector machine (SVM) classifier. We validate the methodology in a dual user study with a total of fourteen participants looking at familiar and unfamiliar pictures from four picture categories: abstract, landscapes, faces, and buildings. Using person-independent training, we are able to discriminate between familiar and unfamiliar abstract pictures with a top recognition rate of 84.3% (89.3% recall, 21.0% false positive rate) over all participants. We show that eye movement analysis is a promising approach to infer the cognitive context of a person and discuss the key challenges for the real-world implementation of eye-based cognition-aware systems.Paper: bulling11_ubicomp.pdf@inproceedings{bulling11_ubicomp, author = {Bulling, Andreas and Roggen, Daniel}, keywords = {Cognition- Awareness, Cognitive Context, Electrooculography (EOG), Eye Movement Analysis, Visual Memory Recall}, title = {Recognition of Visual Memory Recall Processes Using Eye Movement Analysis}, booktitle = {Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, year = {2011}, pages = {455-464}, doi = {10.1145/2030112.2030172} } -
Kontexterkennung für Hörgeräte mittels zusätzlicher Sensormodalitäten
Bernd Tessendorf, Andreas Bulling, Daniel Roggen, Thomas Stiefmeier, Manuela Feilner, Peter Derleth, Gerhard Tröster
Proc. Annual Convention for Acoustics (DAGA), 2011.
@inproceedings{tessendorf11_daga, author = {Tessendorf, Bernd and Bulling, Andreas and Roggen, Daniel and Stiefmeier, Thomas and Feilner, Manuela and Derleth, Peter and Tr{\"{o}}ster, Gerhard}, title = {Kontexterkennung f{\"{u}}r H{\"{o}}rger{\"{a}}te mittels zus{\"{a}}tzlicher Sensormodalit{\"{a}}ten}, booktitle = {Proc. Annual Convention for Acoustics (DAGA)}, year = {2011} } -
Recognition of Hearing Needs From Body and Eye Movements to Improve Hearing Instruments
Bernd Tessendorf, Andreas Bulling, Daniel Roggen, Thomas Stiefmeier, Manuela Feilner, Peter Derleth, Gerhard Tröster
Proc. International Conference on Pervasive Computing (Pervasive), pp. 314-331, 2011.
Hearing instruments (HIs) have emerged as true pervasive computers as they continuously adapt the hearing program to the user’s context. However, current HIs are not able to distinguish different hearing needs in the same acoustic environment. In this work, we explore how information derived from body and eye movements can be used to improve the recognition of such hearing needs. We conduct an experiment to provoke an acoustic environment in which different hearing needs arise: active conversation and working while colleagues are having a conversation in a noisy office environment. We record body movements on nine body locations, eye movements using electrooculography (EOG), and sound using commercial HIs for eleven participants. Using a support vector machine (SVM) classifier and person-independent training we improve the accuracy from 77% based on sound to 92% using body movements. With a view to a future implementation in an HI we then perform a detailed analysis of the sensors attached to the head. We achieve the best accuracy of 86% using eye movements compared to 84% for head movements. Our work demonstrates the potential of additional sensor modalities for future HIs and motivates investigating the wider applicability of this approach to further hearing situations and needs.@inproceedings{tessendorf11_pervasive, author = {Tessendorf, Bernd and Bulling, Andreas and Roggen, Daniel and Stiefmeier, Thomas and Feilner, Manuela and Derleth, Peter and Tr{\"{o}}ster, Gerhard}, title = {Recognition of Hearing Needs From Body and Eye Movements to Improve Hearing Instruments}, booktitle = {Proc. International Conference on Pervasive Computing (Pervasive)}, year = {2011}, pages = {314-331}, doi = {10.1007/978-3-642-21726-5_20} } -
Combining Gaze with Manual Interaction to Extend Physical Reach
Jayson Turner, Andreas Bulling, Hans Gellersen
Proc. International Workshop on Pervasive Eye Tracking and Mobile Eye-Based Interaction (PETMEI), pp. 33-36, 2011.
Situated public displays and interactive surfaces are becoming ubiquitous in our daily lives. Issues arise with these devices when attempting to interact over a distance or with content that is physically out of reach. In this paper we outline three techniques that combine gaze with manual hand-controlled input to move objects. We demonstrate and discuss how these techniques could be applied to two scenarios involving (1) a multi-touch surface and (2) a public display and a mobile device.Paper: turner11_petmei.pdf@inproceedings{turner11_petmei, author = {Turner, Jayson and Bulling, Andreas and Gellersen, Hans}, title = {Combining Gaze with Manual Interaction to Extend Physical Reach}, booktitle = {Proc. International Workshop on Pervasive Eye Tracking and Mobile Eye-Based Interaction (PETMEI)}, year = {2011}, pages = {33-36}, doi = {10.1145/2029956.2029966} } -
Towards Qualitative Assessment of Weight Lifting Exercises Using Body-Worn Sensors
Eduardo Velloso, Andreas Bulling, Hans Gellersen
Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 587-588, 2011.
Sports exercises are beneficial for general health and fitness. Some exercises such as weight lifting are particularly error-prone and using incorrect techniques can result in serious injuries. The current work aims to develop a weight lifting assistant that relies on motion sensors mounted on the body and integrated into gym equipment that provides qualitative feedback on the user’s performance. We believe that by comparing motion data recorded from different parts of the body with a mathematical model of the correct technique, we will be able to qualitatively assess the user’s performance, and provide a score and suggestions for improvement.Paper: velloso11_ubicomp.pdf@inproceedings{velloso11_ubicomp, author = {Velloso, Eduardo and Bulling, Andreas and Gellersen, Hans}, title = {Towards Qualitative Assessment of Weight Lifting Exercises Using Body-Worn Sensors}, booktitle = {Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, year = {2011}, pages = {587-588}, doi = {10.1145/2030112.2030226} } -
The Web of Things as an Infrastructure for Improving Users’ Health and Wellbeing
Eduardo Velloso, Débora Cardador, Katia Vega, Wallace Ugulino, Andreas Bulling, Hans Gellersen, Hugo Fuks
Proc. Workshop of the Brazilian Institute for Web Science Research, pp. 1–7, 2011.
This position paper outlines the authors’ vision of how the Web of Things, using interconnected devices including sensor nodes, mobile phones, and conventional computers, can help improve the overall health and wellbeing of its users. We describe ongoing work being carried out by our research group both at PUC-Rio and at Lancaster University as well as the motivating background.Paper: velloso11_wsr.pdf@inproceedings{velloso11_wsr, author = {Velloso, Eduardo and Cardador, D{\'{e}}bora and Vega, Katia and Ugulino, Wallace and Bulling, Andreas and Gellersen, Hans and Fuks, Hugo}, title = {The Web of Things as an Infrastructure for Improving Users' Health and Wellbeing}, booktitle = {Proc. Workshop of the Brazilian Institute for Web Science Research}, year = {2011}, pages = {1--7} } -
Analysing EOG Signal Features for the Discrimination of Eye Movements with Wearable Devices
Mélodie Vidal, Andreas Bulling, Hans Gellersen
Proc. International Workshop on Pervasive Eye Tracking and Mobile Eye-Based Interaction (PETMEI), pp. 15-20, 2011.
Eye tracking research in human-computer interaction and experimental psychology traditionally focuses on stationary devices and a small number of common eye movements. The advent of pervasive eye tracking promises new applications, such as eye-based mental health monitoring or eye-based activity and context recognition. These applications might require further research on additional eye movement types such as smooth pursuits and the vestibulo-ocular reflex as these movements have not been studied as extensively as saccades, fixations and blinks. In this paper we report our first step towards an effective discrimination of these movements. In a user study we collect naturalistic eye movements from 19 people using the two most common measurement techniques (EOG and IR-based). We develop a set of basic signal features that we extract from the collected eye movement data and show that a feature-based approach has the potential to discriminate between saccades, smooth pursuits, and vestibulo-ocular reflex movements.Paper: vidal11_petmei.pdf@inproceedings{vidal11_petmei, author = {Vidal, M{\'{e}}lodie and Bulling, Andreas and Gellersen, Hans}, title = {Analysing EOG Signal Features for the Discrimination of Eye Movements with Wearable Devices}, booktitle = {Proc. International Workshop on Pervasive Eye Tracking and Mobile Eye-Based Interaction (PETMEI)}, year = {2011}, pages = {15-20}, doi = {10.1145/2029956.2029962} } -
Discrimination of Gaze Directions Using Low-Level Eye Image Features
Yanxia Zhang, Andreas Bulling, Hans Gellersen
Proc. International Workshop on Pervasive Eye Tracking and Mobile Eye-Based Interaction (PETMEI), pp. 9-13, 2011.
In mobile daily life settings, video-based gaze tracking faces challenges associated with changes in lighting conditions and artefacts in the video images caused by head and body movements. These challenges call for the development of new methods that are robust to such influences. In this paper we investigate the problem of gaze estimation, more specifically how to discriminate different gaze directions from eye images. In a 17-participant user study we record eye images for 13 different gaze directions from a standard webcam. We extract a total of 50 features from these images that encode information on color, intensity and orientations. Using mRMR feature selection and a k-nearest neighbor (kNN) classifier we show that we can estimate these gaze directions with a mean recognition performance of 86%.Paper: zhang11_petmei.pdf@inproceedings{zhang11_petmei, author = {Zhang, Yanxia and Bulling, Andreas and Gellersen, Hans}, title = {Discrimination of Gaze Directions Using Low-Level Eye Image Features}, booktitle = {Proc. International Workshop on Pervasive Eye Tracking and Mobile Eye-Based Interaction (PETMEI)}, year = {2011}, pages = {9-13}, doi = {10.1145/2029956.2029961} }
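For illustration only, the following sketch mirrors the feature-selection plus kNN classification step described in the abstract above on synthetic data; mutual information approximates mRMR, and all feature counts and labels are invented.

```python
# Illustrative sketch (synthetic data): feature selection plus a kNN
# classifier for discrete gaze directions, loosely following the pipeline
# described in the abstract above. Mutual information approximates mRMR.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(650, 50))     # hypothetical colour/intensity/orientation features
y = rng.integers(0, 13, size=650)  # 13 discrete gaze directions

clf = make_pipeline(
    StandardScaler(),
    SelectKBest(mutual_info_classif, k=20),
    KNeighborsClassifier(n_neighbors=5),
)
print(cross_val_score(clf, X, y, cv=5).mean())  # roughly chance (1/13) on random data
```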
Book Chapters
-
Proc. 1st International Workshop on Pervasive Eye Tracking and Mobile Eye-Based Interaction (PETMEI)
Andreas Bulling, Andrew T. Duchowski, Päivi Majaranta
Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 627-628, 2011.
Recent developments in mobile eye tracking equipment and automated eye movement analysis point the way toward unobtrusive eye-based human-computer interfaces that are pervasively usable in everyday life. We call this new paradigm pervasive eye tracking - continuous eye monitoring and analysis 24/7. PETMEI 2011 provides a forum for researchers from human-computer interaction, context-aware computing, and eye tracking to discuss techniques and applications that go beyond classical eye tracking and stationary eye-based interaction. We aim to discuss the implications of pervasive eye tracking for context-aware computing and to identify the key research challenges of mobile eye-based interaction. The long-term goal is to create a strong interdisciplinary research community linking these research fields together and to establish the workshop as the premier forum for research on pervasive eye tracking and mobile eye-based interaction.@inbook{bulling11_petmei, author = {Bulling, Andreas and Duchowski, Andrew T. and Majaranta, P{\"{a}}ivi}, keywords = {activity and context recognition, Cognition-Awareness, Eye Movement Analysis, Eye Tracking, eye-based interaction}, title = {Proc. 1st International Workshop on Pervasive Eye Tracking and Mobile Eye-Based Interaction (PETMEI)}, booktitle = {Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, year = {2011}, pages = {627-628}, doi = {10.1145/2030112.2030248} }
2010
Journal Articles
-
Toward Mobile Eye-Based Human-Computer Interaction
Andreas Bulling, Hans Gellersen
IEEE Pervasive Computing, 9 (4), pp. 8-12, 2010.
Research in eye-based human-computer interaction (HCI) has matured over the past 20 years with current HCI research mostly focusing on stationary eye trackers in laboratory settings. This survey of latest advances in eye-tracking equipment and automated eye movement analysis suggests a new generation of mobile eye-based interfaces that will become pervasive and seamlessly integrated into people’s everyday lives.doi: 10.1109/MPRV.2010.86Paper: bulling10_pcm.pdf@article{bulling10_pcm, author = {Bulling, Andreas and Gellersen, Hans}, title = {Toward {M}obile {E}ye-{B}ased {H}uman-{C}omputer {I}nteraction}, journal = {IEEE Pervasive Computing}, volume = {9}, number = {4}, year = {2010}, pages = {8-12}, doi = {10.1109/MPRV.2010.86} }
Conference Papers
-
On the issue of variability in labels and sensor configurations in activity recognition systems
Daniel Roggen, Kilian Förster, Alberto Calatroni, Andreas Bulling, Gerhard Tröster
Proc. "How to do good activity recognition research? Experimental methodologies, evaluation metrics, and reproducibility issues" (Pervasive), pp. 1–4, 2010.
Two aspects of the design and characterization of activity recognition systems are rarely elaborated in the literature. First, the influence of variability in sensor placement and orientation on system performance is often overlooked. This is important for the deployment of robust activity recognition systems. Second, the influence of labeling variability is also overlooked, especially w.r.t. label boundary jitter and labeling errors. This is important during the development of an activity recognition system as acquiring labels is costly. We argue that there is a need to explicitly address the consequences of such variability in publications, together with the mitigation strategies that are used. Elaborating on this is required to move the state of the art towards real-world applications, such as in industrial wearable assistance applications or pervasive healthcare.Paper: roggen10_pervasive.pdf@inproceedings{roggen10_pervasive, author = {Roggen, Daniel and F{\"{o}}rster, Kilian and Calatroni, Alberto and Bulling, Andreas and Tr{\"{o}}ster, Gerhard}, title = {On the issue of variability in labels and sensor configurations in activity recognition systems}, booktitle = {Proc. "How to do good activity recognition research? Experimental methodologies, evaluation metrics, and reproducibility issues" (Pervasive)}, year = {2010}, pages = {1--4} } -
Towards Multi-Modal Context Recognition for Hearing Instruments by Analysing Eye and Head Movements
Bernd Tessendorf, Andreas Bulling, Daniel Roggen, Thomas Stiefmeier, Gerhard Tröster, Manuela Feilner, Peter Derleth
Proc. IEEE International Symposium on Wearable Computers (ISWC), pp. 1-2, 2010.
Current hearing instruments (HIs) rely only on auditory scene analysis to adapt to the situation of the user. It is for this reason that these systems are limited in the number and type of situations they can detect. We investigate how context information derived from eye and head movements can be used to resolve such situations. We focus on two example problems that are challenging for current HIs: to distinguish concentrated work from interaction, and to detect whether a person is walking alone or walking while having a conversation. We collect an eleven-participant (6 male, 5 female, aged 24-59) dataset that covers different typical office activities. Using person-independent training and isolated recognition we achieve an average precision of 71.7% (recall: 70.1%) for recognising concentrated work and 57.2% precision (recall: 81.3%) for detecting walking while conversing.Paper: tessendorf10_iswc.pdf@inproceedings{tessendorf10_iswc, author = {Tessendorf, Bernd and Bulling, Andreas and Roggen, Daniel and Stiefmeier, Thomas and Tr{\"{o}}ster, Gerhard and Feilner, Manuela and Derleth, Peter}, title = {Towards Multi-Modal Context Recognition for Hearing Instruments by Analysing Eye and Head Movements}, booktitle = {Proc. IEEE International Symposium on Wearable Computers (ISWC)}, year = {2010}, pages = {1-2}, doi = {10.1109/ISWC.2010.5665855} }
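Several of the studies listed on this page report results under person-independent (leave-one-participant-out) training. The snippet below sketches that evaluation protocol with scikit-learn on synthetic data; the feature dimensions, labels, and participant counts are invented for illustration.

```python
# Sketch of a leave-one-participant-out ("person-independent") evaluation,
# as used in several studies on this page; features and labels are synthetic.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n_participants, n_per_participant = 11, 60
X = rng.normal(size=(n_participants * n_per_participant, 24))    # invented features
y = rng.integers(0, 2, size=n_participants * n_per_participant)  # binary context label
groups = np.repeat(np.arange(n_participants), n_per_participant) # participant ID per sample

# Each fold trains on ten participants and tests on the held-out one,
# so the reported score reflects generalisation to unseen users.
scores = cross_val_score(SVC(), X, y, groups=groups, cv=LeaveOneGroupOut())
print(scores.mean())
```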
2009
Journal Articles
-
Wearable EOG goggles: Seamless sensing and context-awareness in everyday environments
Andreas Bulling, Daniel Roggen, Gerhard Tröster
Journal of Ambient Intelligence and Smart Environments, 1 (2), pp. 157-171, 2009.
In this article we introduce the analysis of eye motion as a new input modality for activity recognition, context-awareness and mobile HCI applications. We describe a novel embedded eye tracker that, in contrast to common systems using video cameras, relies on Electrooculography (EOG). This self-contained wearable device consists of goggles with dry electrodes integrated into the frame and a small pocket-worn component with a DSP for real-time EOG signal processing. It can store data locally for long-term recordings or stream processed EOG signals to a remote device over Bluetooth. We show how challenges associated with wearability, eye motion analysis and signal artefacts caused by physical activity can be addressed with a combination of a special mechanical design, optimised algorithms for eye movement detection and adaptive signal processing. In two case studies, we demonstrate that EOG is a suitable measurement technique for the recognition of reading activity and eye-based human-computer interaction. Eventually, wearable EOG goggles may pave the way for seamless eye movement analysis and new forms of context-awareness not possible today.Paper: bulling09_jaise.pdf@article{bulling09_jaise, author = {Bulling, Andreas and Roggen, Daniel and Tr{\"{o}}ster, Gerhard}, keywords = {Activity Recognition, Context-awareness, Electrooculography (EOG), Human-Computer Interaction (HCI), Wearable Eye Tracking}, title = {Wearable {EOG} goggles: {S}eamless sensing and context-awareness in everyday environments}, journal = {Journal of Ambient Intelligence and Smart Environments}, volume = {1}, number = {2}, year = {2009}, pages = {157-171}, doi = {10.3233/AIS-2009-0020} }
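As a rough, self-contained illustration of EOG-based eye movement detection (not the optimised algorithms referenced in the article above), the following toy detector flags saccades wherever the signal velocity exceeds a threshold; the sampling rate, signal units, and threshold are placeholder values.

```python
# Toy velocity-threshold saccade detector for a single EOG channel.
# Illustration only, not the detection algorithms used in the work above;
# sampling rate, units, and threshold are placeholder values.
import numpy as np

def detect_saccades(eog, fs=128.0, vel_thresh=1000.0):
    """Return (start, end) sample indices where |d(eog)/dt| exceeds vel_thresh."""
    velocity = np.abs(np.gradient(eog) * fs)  # signal units per second
    above = velocity > vel_thresh
    edges = np.flatnonzero(np.diff(above.astype(int)))
    if above[0]:
        edges = np.r_[0, edges]               # movement already in progress at start
    if above[-1]:
        edges = np.r_[edges, len(eog) - 1]    # movement still in progress at end
    return list(zip(edges[0::2], edges[1::2]))

# Synthetic signal: two step-like gaze shifts on a slightly noisy baseline.
fs = 128.0
t = np.arange(0, 4, 1 / fs)
eog = np.where(t > 1.0, 100.0, 0.0) + np.where(t > 2.5, -80.0, 0.0)
eog += np.random.default_rng(3).normal(scale=1.0, size=t.size)
print(detect_saccades(eog, fs))  # expect one interval near t=1.0 s and one near t=2.5 s
```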
Conference Papers
-
Wearable EOG Goggles: Eye-Based Interaction in Everyday Environments
Andreas Bulling, Daniel Roggen, Gerhard Tröster
Ext. Abstr. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 3259-3264, 2009.
In this paper, we present an embedded eye tracker for context-awareness and eye-based human-computer interaction – the wearable EOG goggles. In contrast to common systems using video, this unobtrusive device relies on Electrooculography (EOG). It consists of goggles with dry electrodes integrated into the frame and a small pocket-worn component with a powerful microcontroller for EOG signal processing. Using this lightweight system, sequences of eye movements, so-called eye gestures, can be efficiently recognised from EOG signals in real-time for HCI purposes. The device is a self-contained solution and allows for seamless eye motion sensing, context-recognition and eye-based interaction in everyday environments.Paper: bulling09_chi.pdf@inproceedings{bulling09_chi, author = {Bulling, Andreas and Roggen, Daniel and Tr{\"{o}}ster, Gerhard}, keywords = {Context-awareness, Electrooculography (EOG), Eye Gestures, Eye Tracking, Human-Computer Interaction (HCI), Wearable Computing}, title = {Wearable EOG Goggles: Eye-Based Interaction in Everyday Environments}, booktitle = {Ext. Abstr. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2009}, pages = {3259-3264}, doi = {10.1145/1520340.1520468} } -
Eye Movement Analysis for Activity Recognition
Andreas Bulling, Jamie A. Ward, Hans Gellersen, Gerhard Tröster
Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 41-50, 2009.
In this work we investigate eye movement analysis as a new modality for recognising human activity. We devise 90 different features based on the main eye movement characteristics: saccades, fixations and blinks. The features are derived from eye movement data recorded using a wearable electrooculographic (EOG) system. We describe a recognition methodology that combines minimum redundancy maximum relevance feature selection (mRMR) with a support vector machine (SVM) classifier. We validate the method in an eight participant study in an office environment using five activity classes: copying a text, reading a printed paper, taking hand-written notes, watching a video and browsing the web. In addition, we include periods with no specific activity. Using a person-independent (leave-one-out) training scheme, we obtain an average precision of 76.1% and recall of 70.5% over all classes and participants. We discuss the most relevant features and show that eye movement analysis is a rich and thus promising modality for activity recognition.Paper: bulling09_ubicomp.pdf@inproceedings{bulling09_ubicomp, author = {Bulling, Andreas and Ward, Jamie A. and Gellersen, Hans and Tr{\"{o}}ster, Gerhard}, keywords = {Activity Recognition, Electrooculography (EOG), Eye Movement Analysis, Recognition of Office Activities}, title = {Eye Movement Analysis for Activity Recognition}, booktitle = {Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, year = {2009}, pages = {41-50}, doi = {10.1145/1620545.1620552} } -
Speech as a Feedback Modality for Smart Objects
Clemens Lombriser, Andreas Bulling, Andreas Breitenmoser, Gerhard Tröster
Proc. IEEE International Workshop on Intelligent Pervasive Devices (PerDev), pp. 1-5, 2009.
One part of the vision of ubiquitous computing is the integration of sensing and actuation nodes into everyday objects, clothes worn on the body, and in large numbers in the environment. These augmented environments require novel types of interfaces that provide for naturalistic and adaptive interaction depending on the user context. In this paper, we evaluate the use of speech synthesis on small, low-power sensor nodes that may be integrated into smart objects. We evaluate the so-called Wireless Voice Node, a wireless sensor node with the ability to produce speech as a novel feedback modality for ambient intelligence applications. As an example, we present a doll that aims at using speech synthesis to improve the playing experience of children.Paper: lombriser09_perdev.pdf@inproceedings{lombriser09_perdev, author = {Lombriser, Clemens and Bulling, Andreas and Breitenmoser, Andreas and Tr{\"{o}}ster, Gerhard}, title = {Speech as a Feedback Modality for Smart Objects}, booktitle = {Proc. IEEE International Workshop on Intelligent Pervasive Devices (PerDev)}, year = {2009}, pages = {1-5}, doi = {10.1109/PERCOM.2009.4912831} } -
OPPORTUNITY: Activity and context awareness in opportunistic open-ended sensor environments
Daniel Roggen, Kilian Förster, Alberto Calatroni, Andreas Bulling, Thomas Holleczek, Gerhard Tröster, Paul Lukowicz, Gerald Pirkl, David Bannach, Alois Ferscha, Andreas Riener, Clemens Holzmann, Ricardo Chavarriaga, José R. Millán
Proc. European Future and Emerging Technologies Conference (FET), 2009.
@inproceedings{roggen09_fet, author = {Roggen, Daniel and F{\"{o}}rster, Kilian and Calatroni, Alberto and Bulling, Andreas and Holleczek, Thomas and Tr{\"{o}}ster, Gerhard and Lukowicz, Paul and Pirkl, Gerald and Bannach, David and Ferscha, Alois and Riener, Andreas and Holzmann, Clemens and Chavarriaga, Ricardo and del R. Mill{\'{a}}n, Jos{\'{e}}}, keywords = {OPPORTUNITY}, title = {OPPORTUNITY: Activity and context awareness in opportunistic open-ended sensor environments}, booktitle = {Proc. European Future and Emerging Technologies Conference (FET)}, year = {2009}, publisher = {European Commission}, location = {Prague, Czech Republic} }
2008
Conference Papers
-
EyeMote - Towards Context-Aware Gaming Using Eye Movements Recorded From Wearable Electrooculography
Andreas Bulling, Daniel Roggen, Gerhard Tröster
Proc. ACM International Conference on Fun and Games (FnG), pp. 33-45, 2008.
Physical activity has emerged as a novel input modality for so-called active video games. Input devices such as music instruments, dance mats or the Wii accessories allow for novel ways of interaction and a more immersive gaming experience. In this work we describe how eye movements recognised from electrooculographic (EOG) signals can be used for gaming purposes in three different scenarios. In contrast to common video-based systems, EOG can be implemented as a wearable and light-weight system which allows for long-term use with unconstrained simultaneous physical activity. In a stationary computer game we show that eye gestures of varying complexity can be recognised online with equal performance to a state-of-the-art video-based system. For pervasive gaming scenarios, we show how eye movements can be recognised in the presence of signal artefacts caused by physical activity such as walking. Finally, we describe possible future context-aware games which exploit unconscious eye movements and show which possibilities this new input modality may open up.Paper: bulling08_fng.pdf@inproceedings{bulling08_fng, author = {Bulling, Andreas and Roggen, Daniel and Tr{\"{o}}ster, Gerhard}, keywords = {Active Video Games, Context-awareness, Electrooculography (EOG), Eye Tracking, Human-Computer Interaction (HCI), Location-Based Gaming, Pervasive Gaming}, title = {EyeMote - Towards Context-Aware Gaming Using Eye Movements Recorded From Wearable Electrooculography}, booktitle = {Proc. ACM International Conference on Fun and Games (FnG)}, year = {2008}, pages = {33-45}, doi = {10.1007/978-3-540-88322-7_4} } -
Robust Recognition of Reading Activity in Transit Using Wearable Electrooculography
Andreas Bulling, Jamie A. Ward, Hans Gellersen, Gerhard Tröster
Proc. International Conference on Pervasive Computing (Pervasive), pp. 19-37, 2008.
In this work we analyse the eye movements of people in transit in an everyday environment using a wearable electrooculographic (EOG) system. We compare three approaches for continuous recognition of reading activities: a string matching algorithm which exploits typical characteristics of reading signals, such as saccades and fixations; and two variants of Hidden Markov Models (HMMs) - mixed Gaussian and discrete. The recognition algorithms are evaluated in an experiment performed with eight subjects reading freely chosen text without pictures while sitting at a desk, standing, walking indoors and outdoors, and riding a tram. A total dataset of roughly 6 hours was collected with reading activity accounting for about half of the time. We were able to detect reading activities over all subjects with a top recognition rate of 80.2% (71.0% recall, 11.6% false positives) using string matching. We show that EOG is a potentially robust technique for reading recognition across a number of typical daily situations.Paper: bulling08_pervasive.pdf@inproceedings{bulling08_pervasive, author = {Bulling, Andreas and Ward, Jamie A. and Gellersen, Hans and Tr{\"{o}}ster, Gerhard}, keywords = {Activity Recognition, Electrooculography (EOG), Reading Activity, Recognition of Reading, Transit, wearable}, title = {Robust {R}ecognition of {R}eading {A}ctivity in {T}ransit {U}sing {W}earable {E}lectrooculography}, booktitle = {Proc. International Conference on Pervasive Computing (Pervasive)}, year = {2008}, pages = {19-37}, doi = {10.1007/978-3-540-79576-6_2} } -
It’s in Your Eyes - Towards Context-Awareness and Mobile HCI Using Wearable EOG Goggles
Andreas Bulling, Daniel Roggen, Gerhard Tröster
Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 84-93, 2008.
In this work we describe the design, implementation and evaluation of a novel eye tracker for context-awareness and mobile HCI applications. In contrast to common systems using video cameras, this compact device relies on Electrooculography (EOG). It consists of goggles with dry electrodes integrated into the frame and a small pocket-worn component with a DSP for real-time EOG signal processing. The device is intended for wearable and standalone use: it can store data locally for long-term recordings or stream processed EOG signals to a remote device over Bluetooth. We describe how eye gestures can be efficiently recognised from EOG signals for HCI purposes. In an experiment conducted with 11 subjects playing a computer game we show that 8 eye gestures of varying complexity can be continuously recognised with equal performance to a state-of-the-art video-based system. Physical activity leads to artefacts in the EOG signal. We describe how these artefacts can be removed using an adaptive filtering scheme and characterise this approach on a 5-subject dataset. In addition to explicit eye movements for HCI, we discuss how the analysis of unconscious eye movements may eventually allow us to deduce information on user activity and context not available with current sensing modalities.Paper: bulling08_ubicomp.pdf@inproceedings{bulling08_ubicomp, author = {Bulling, Andreas and Roggen, Daniel and Tr{\"{o}}ster, Gerhard}, keywords = {Context-awareness, Electrooculography (EOG), Eye Gestures, Eye Tracking, Human-Computer Interaction (HCI), Wearable Computing}, title = {It's in {Y}our {E}yes - {T}owards {C}ontext-{A}wareness and {M}obile {HCI} {U}sing {W}earable {EOG} {G}oggles}, booktitle = {Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, year = {2008}, pages = {84-93}, doi = {10.1145/1409635.1409647} }
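To make the idea of removing slow, activity-induced EOG artefacts concrete, here is a deliberately simple baseline-subtraction sketch on synthetic data. The paper above uses an adaptive filtering scheme, which this running-median example only loosely approximates; all signal parameters are invented.

```python
# Deliberately simple stand-in for EOG artefact removal: subtract a running-
# median estimate of slow baseline drift. The adaptive filtering scheme used
# in the paper above is not reproduced here; all values below are synthetic.
import numpy as np
from scipy.signal import medfilt

def remove_drift(eog, kernel_size=257):
    """Subtract a sliding-median baseline estimate (kernel_size must be odd)."""
    return eog - medfilt(eog, kernel_size=kernel_size)

fs = 128.0
t = np.arange(0, 10, 1 / fs)
drift = 40.0 * np.sin(2 * np.pi * 0.1 * t)  # slow, walking-induced baseline drift
noise = np.random.default_rng(4).normal(scale=2.0, size=t.size)
raw = drift + noise
cleaned = remove_drift(raw)
print(round(float(np.std(raw)), 1), round(float(np.std(cleaned)), 1))  # drift largely removed
```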