IMAGINE/LIGM
École des Ponts ParisTech (ENPC)
6-8, Av Blaise Pascal – Cité Descartes
Champs-sur-Marne
77455 Marne-la-Vallée cedex 2 – France
✉ mathis (dot) petrovich (at) enpc.fr
Perceiving Systems
Max Planck Institute for Intelligent Systems (MPI-IS)
Max-Planck-Ring 4
72076 Tübingen – Germany
✉ mathis (dot) petrovich (at) tuebingen.mpg.de
ELLIS PhD program member
ELLIS personal webpage
I am an ELLIS PhD student in the IMAGINE computer vision team of École des Ponts ParisTech (ENPC) and in the Perceiving Systems department of the Max Planck Institute for Intelligent Systems (MPI-IS), co-advised by Gül Varol (ENPC) and Michael J. Black (MPI-IS). My PhD focuses on generating realistic and diverse human body motion in a controllable way (from action labels or text instructions) and on building joint text-motion latent spaces. Before that, I studied at the École normale supérieure Paris-Saclay, where I obtained a BS in Computer Science and the MVA MS degree. I am currently doing an internship in Sanja Fidler's team at NVIDIA.
@inproceedings{petrovich23tmr,
  title     = {{TMR}: Text-to-Motion Retrieval Using Contrastive {3D} Human Motion Synthesis},
  author    = {Petrovich, Mathis and Black, Michael J. and Varol, G{\"u}l},
  booktitle = {International Conference on Computer Vision ({ICCV})},
  year      = {2023}
}
In this paper, we present TMR, a simple yet effective approach for text-to-3D human motion retrieval. While previous work has only treated retrieval as a proxy evaluation metric, we tackle it as a standalone task. Our method extends the state-of-the-art text-to-motion synthesis model TEMOS, and incorporates a contrastive loss to better structure the cross-modal latent space. We show that maintaining the motion generation loss, along with the contrastive training, is crucial to obtain good performance. We introduce a benchmark for evaluation and provide an in-depth analysis by reporting results on several protocols. Our extensive experiments on the KIT-ML and HumanML3D datasets show that TMR outperforms the prior work by a significant margin, for example reducing the median rank from 54 to 19. Finally, we showcase the potential of our approach on moment retrieval. Our code and models are publicly available.
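The contrastive loss used to structure a cross-modal latent space is typically of the symmetric InfoNCE (CLIP-style) family. A minimal NumPy sketch of such a loss over paired text and motion embeddings (function name and temperature value are illustrative, not taken from the paper):

```python
import numpy as np

def infonce_loss(text_emb, motion_emb, temperature=0.1):
    """Symmetric InfoNCE loss between L2-normalized text and motion embeddings.

    text_emb, motion_emb: (N, D) arrays; row i of each forms a matching pair.
    Matching pairs are pulled together, all other pairs in the batch pushed apart.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    logits = t @ m.T / temperature  # (N, N) cosine-similarity matrix

    def log_softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    idx = np.arange(len(logits))
    loss_t2m = -log_softmax(logits, axis=1)[idx, idx].mean()  # text -> motion
    loss_m2t = -log_softmax(logits, axis=0)[idx, idx].mean()  # motion -> text
    return 0.5 * (loss_t2m + loss_m2t)
```

Aligned pairs (matching rows similar) yield a lower loss than mismatched ones, which is what drives the cross-modal structure.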
@inproceedings{SINC:ICCV:2023,
  title     = {{SINC}: Spatial Composition of {3D} Human Motions for Simultaneous Action Generation},
  author    = {Athanasiou, Nikos and Petrovich, Mathis and Black, Michael J. and Varol, G{\"u}l},
  booktitle = {International Conference on Computer Vision ({ICCV})},
  year      = {2023}
}
Our goal is to synthesize 3D human motions given textual inputs describing simultaneous actions, for example 'waving hand' while 'walking' at the same time. We refer to generating such simultaneous movements as performing 'spatial compositions'. In contrast to temporal compositions that seek to transition from one action to another, spatial composition requires understanding which body parts are involved in which action, to be able to move them simultaneously. Motivated by the observation that the correspondence between actions and body parts is encoded in powerful language models, we extract this knowledge by prompting GPT-3 with text such as "what are the body parts involved in the action …"
@inproceedings{TEACH:3DV:2022,
  title     = {{TEACH}: {T}emporal {A}ction {C}ompositions for {3D} {H}umans},
  author    = {Athanasiou, Nikos and Petrovich, Mathis and Black, Michael J. and Varol, G{\"u}l},
  booktitle = {International Conference on 3D Vision ({3DV})},
  year      = {2022}
}
Given a series of natural language descriptions, our task is to generate 3D human motions that correspond semantically to the text, and follow the temporal order of the instructions. In particular, our goal is to enable the synthesis of a series of actions, which we refer to as temporal action composition. The current state of the art in text-conditioned motion synthesis only takes a single action or a single sentence as input. This is partially due to a lack of suitable training data containing action sequences, but also due to the computational complexity of their non-autoregressive model formulation, which does not scale well to long sequences. In this work, we address both issues. First, we exploit the recent BABEL motion-text collection, which has a wide range of labeled actions, many of which occur in a sequence with transitions between them. Next, we design a Transformer-based approach that operates non-autoregressively within an action, but autoregressively within the sequence of actions. This hierarchical formulation proves effective in our experiments when compared with multiple baselines. Our approach, called TEACH for "TEmporal Action Compositions for Human motions", produces realistic human motions for a wide variety of actions and temporal compositions from language descriptions. To encourage work on this new task, we make our code available for research purposes at our website.
@inproceedings{petrovich22temos,
  title     = {{TEMOS}: Generating diverse human motions from textual descriptions},
  author    = {Petrovich, Mathis and Black, Michael J. and Varol, G{\"u}l},
  booktitle = {European Conference on Computer Vision ({ECCV})},
  year      = {2022}
}
We address the problem of generating diverse 3D human motions from textual descriptions. This challenging task requires joint modeling of both modalities: understanding and extracting useful human-centric information from the text, and then generating plausible and realistic sequences of human poses. In contrast to most previous work, which focuses on generating a single, deterministic motion from a textual description, we design a variational approach that can produce multiple diverse human motions. We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data, in combination with a text encoder that produces distribution parameters compatible with the VAE latent space. We show that the TEMOS framework can produce both skeleton-based animations as in prior work, as well as more expressive SMPL body motions. We evaluate our approach on the KIT Motion-Language benchmark and, despite being relatively straightforward, demonstrate significant improvements over the state of the art. Code and models are available on our webpage.
@inproceedings{petrovich21actor,
  title     = {Action-Conditioned {3D} Human Motion Synthesis with Transformer {VAE}},
  author    = {Petrovich, Mathis and Black, Michael J. and Varol, G{\"u}l},
  booktitle = {International Conference on Computer Vision ({ICCV})},
  year      = {2021}
}
We tackle the problem of action-conditioned generation of realistic and diverse human motion sequences. In contrast to methods that complete, or extend, motion sequences, this task does not require an initial pose or sequence. Here we learn an action-aware latent representation for human motions by training a generative variational autoencoder (VAE). By sampling from this latent space and querying a certain duration through a series of positional encodings, we synthesize variable-length motion sequences conditioned on a categorical action. Specifically, we design a Transformer-based architecture, ACTOR, for encoding and decoding a sequence of parametric SMPL human body models estimated from action recognition datasets. We evaluate our approach on the NTU RGB+D, HumanAct12 and UESTC datasets and show improvements over the state of the art. Furthermore, we present two use cases: improving action recognition through adding our synthesized data to training, and motion denoising. Code and models are available on our project page.
@inproceedings{petrovich2022FROT,
  title     = {Feature Robust Optimal Transport for High-dimensional Data},
  author    = {Petrovich, Mathis and Liang, Chao and Sato, Ryoma and Liu, Yanbin and Tsai, Yao-Hung Hubert and Zhu, Linchao and Yang, Yi and Salakhutdinov, Ruslan and Yamada, Makoto},
  booktitle = {European Conference on Machine Learning ({ECML})},
  year      = {2022}
}
Optimal transport is a machine learning problem with applications including distribution comparison, feature selection, and generative adversarial networks. In this paper, we propose feature-robust optimal transport (FROT) for high-dimensional data, which solves high-dimensional OT problems using feature selection to avoid the curse of dimensionality. Specifically, we find a transport plan with discriminative features. To this end, we formulate the FROT problem as a min-max optimization problem. We then propose a convex formulation of the FROT problem and solve it using a Frank-Wolfe-based optimization algorithm, whereby the subproblem can be efficiently solved using the Sinkhorn algorithm. Since FROT finds the transport plan from selected features, it is robust to noise features. To show the effectiveness of FROT, we propose using the FROT algorithm for the layer selection problem in deep neural networks for semantic correspondence. By conducting synthetic and benchmark experiments, we demonstrate that the proposed method can find a strong correspondence by determining important layers. We show that the FROT algorithm achieves state-of-the-art performance in real-world semantic correspondence datasets.
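The inner subproblem is solved with the Sinkhorn algorithm, the standard routine for entropic-regularized OT. A generic NumPy sketch (not the authors' code; `eps` and the iteration count are illustrative):

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iters=200):
    """Entropic-regularized optimal transport via Sinkhorn iterations.

    a: (n,) source histogram, b: (m,) target histogram, C: (n, m) cost matrix.
    Returns a transport plan P >= 0 whose marginals approximate a and b.
    """
    K = np.exp(-C / eps)  # Gibbs kernel of the cost
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)  # match column marginals
        u = a / (K @ v)    # match row marginals
    return u[:, None] * K * v[None, :]
```

Each iteration alternately rescales rows and columns, so the plan's marginals converge to `a` and `b` while the entropic term keeps the subproblem smooth and fast.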
@inproceedings{dinesh2020fsnet,
  title     = {{FsNet}: Feature Selection Network on High-dimensional Biological Data},
  author    = {Singh, Dinesh and Climente-Gonz{\'a}lez, H{\'e}ctor and Petrovich, Mathis and Kawakami, Eiryo and Yamada, Makoto},
  booktitle = {International Joint Conference on Neural Networks ({IJCNN})},
  year      = {2023}
}
Biological data, including gene expression data, are generally high-dimensional and require efficient, generalizable, and scalable machine-learning methods to discover their complex nonlinear patterns. The recent advances in machine learning can be attributed to deep neural networks (DNNs), which excel in various computer vision and natural language processing tasks. However, standard DNNs are not appropriate for the high-dimensional datasets generated in biology because they have many parameters, which in turn require many samples. In this paper, we propose a DNN-based, nonlinear feature selection method, called the feature selection network (FsNet), for high-dimensional data with a small number of samples. Specifically, FsNet comprises a selection layer that selects features and a reconstruction layer that stabilizes the training. Because a large number of parameters in the selection and reconstruction layers can easily result in overfitting under a limited number of samples, we use two tiny networks to predict the large, virtual weight matrices of the selection and reconstruction layers. Experimental results on several real-world, high-dimensional biological datasets demonstrate the efficacy of the proposed method.
@article{petrovich2020fall,
  title   = {Fast local linear regression with anchor regularization},
  author  = {Petrovich, Mathis and Yamada, Makoto},
  journal = {arXiv preprint},
  year    = {2020}
}
Regression is an important task in machine learning and data mining. It has several applications in various domains, including finance, biomedicine, and computer vision. Recently, network Lasso, which estimates local models by making clusters using the network information, was proposed and its superior performance was demonstrated. In this study, we propose a simple yet effective local model training algorithm called the fast anchor regularized local linear method (FALL). More specifically, we train a local model for each sample by regularizing it with precomputed anchor models. The key advantage of the proposed algorithm is that we can obtain a closed-form solution with only matrix multiplication; additionally, the proposed algorithm is easily interpretable, fast to compute, and parallelizable. Through experiments on synthetic and real-world datasets, we demonstrate that FALL compares favorably in terms of accuracy with the state-of-the-art network Lasso algorithm, while requiring significantly less training time (two orders of magnitude).
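The closed-form flavor of anchor regularization can be illustrated with a ridge-style objective that pulls the weights toward a precomputed anchor model. This is a generic sketch under assumed notation, not the exact FALL objective (which trains a local model per sample):

```python
import numpy as np

def anchor_regularized_lsq(X, y, w_anchor, lam=1.0):
    """Closed form for: min_w ||X w - y||^2 + lam * ||w - w_anchor||^2.

    Setting the gradient to zero gives the linear system
    (X^T X + lam I) w = X^T y + lam * w_anchor, solvable in closed form.
    """
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    b = X.T @ y + lam * w_anchor
    return np.linalg.solve(A, b)
```

With `lam = 0` this reduces to ordinary least squares; as `lam` grows, the solution is pulled toward the anchor model, which is what makes per-sample local models cheap to fit and parallelize.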
@inproceedings{abhishek2020tonemapping,
  title     = {Tone Mapping Operators: Progressing Towards Semantic-Awareness},
  author    = {Goswami, Abhishek and Petrovich, Mathis and Hauser, Wolf and Dufaux, Frederic},
  booktitle = {International Conference on Multimedia \& Expo Workshops ({ICMEW})},
  year      = {2020}
}
A Tone Mapping Operator (TMO) aims at reproducing the visual perception of a scene with a high dynamic range (HDR) on low dynamic range (LDR) media. TMOs have primarily aimed to preserve global perception by employing a model of the human visual system (HVS), analysing perceptual attributes of each pixel and adjusting exposure at the pixel level. Preserving semantic perception, also an essential step for HDR rendering, has never been in explicit focus. We argue that explicitly introducing semantic information to create a 'content and semantic'-aware TMO has the potential to further improve existing approaches. In this paper, we therefore propose a new local tone mapping approach by introducing semantic information using off-the-shelf semantic segmentation tools into a novel tone mapping pipeline. More specifically, we adjust pixel values to a semantic-specific target to reproduce the real-world semantic perception.