Recent studies indicate that physical characteristics are reflected in the voice, suggesting that facial images can be recreated from it. A recent paper on arXiv.org examines the ability to predict one's facial geometry, or skull structure, from voice.
Instead of producing images of faces that include attributes unrelated to the speaker's voice, such as hairstyles, backgrounds, and facial textures, the researchers propose to work on 3D meshes. They present the Cross-Modal Perceptionist framework, which examines the feasibility of predicting face meshes from voices using 3D morphable models.
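A 3D morphable model (3DMM) represents a face mesh as a mean shape plus a linear combination of basis shapes, so predicting a face from voice reduces to regressing a small coefficient vector. The sketch below illustrates this pipeline with numpy; the sizes, the random basis, and the linear voice encoder are all hypothetical stand-ins, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3DMM: mean shape plus a linear identity basis.
# Sizes are illustrative, not taken from the paper.
N_VERTS, N_COEFFS = 5023, 80
mean_shape = rng.normal(size=(N_VERTS * 3,))
id_basis = rng.normal(size=(N_VERTS * 3, N_COEFFS))

def voice_to_coeffs(voice_embedding, W, b):
    """Toy stand-in for a learned voice encoder: maps a fixed-size
    speaker embedding to 3DMM identity coefficients."""
    return W @ voice_embedding + b

def coeffs_to_mesh(coeffs):
    """Reconstruct vertex positions from 3DMM coefficients."""
    return (mean_shape + id_basis @ coeffs).reshape(N_VERTS, 3)

# Example: a 192-dim speaker embedding regressed to 80 coefficients.
D_VOICE = 192
W = rng.normal(scale=0.01, size=(N_COEFFS, D_VOICE))
b = np.zeros(N_COEFFS)

voice_emb = rng.normal(size=(D_VOICE,))
mesh = coeffs_to_mesh(voice_to_coeffs(voice_emb, W, b))
print(mesh.shape)  # (5023, 3): one 3D position per vertex
```

Because the 3DMM output is constrained to the span of the shape basis, the prediction can only vary in geometry, never in texture, hairstyle, or background, which is precisely why the authors prefer meshes over images.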
First, the neural network is trained with direct supervision using paired voices and 3D face meshes. In addition, a more realistic unsupervised scenario is investigated to test whether facial geometry can still be predicted without paired voices and 3D faces. The results show that 3D faces can be roughly recreated from voices.
This work digs into a fundamental question in human perception: can face geometry be gleaned from one's voice? Previous work that studied this question only adopted developments in image synthesis, converting voices into facial images to show correlations; but working in the image domain inevitably involves predicting attributes that voices cannot indicate, such as facial textures, hairstyles, and backgrounds. Instead, we examine the ability to reconstruct 3D faces, focusing only on the geometry, which is much more anatomically grounded. We propose our analysis framework, the Cross-Modal Perceptionist, under both supervised and unsupervised learning. First, we build a dataset, Voxceleb-3D, that extends Voxceleb and includes paired voices and face meshes, making supervised learning possible. Second, we use a knowledge distillation mechanism to study whether face geometry can still be gleaned from voices without paired voices and 3D face data, given the limited availability of 3D face scans. We break the original question into four parts and perform visual and numerical analyses as answers to it. Our findings echo those in physiology and neuroscience regarding the relationship between voices and facial structures. This work provides an explainable foundation for future human-centered cross-modal learning. See our project page: https URL
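The knowledge distillation idea above can be sketched in a few lines: a frozen teacher that regresses 3DMM coefficients from face images produces pseudo-targets, and a student voice model is trained to match them, so no ground-truth 3D scans are required. Everything here is a hypothetical illustration (linear models, toy dimensions, a single voice/image pair), not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
N_COEFFS, D_VOICE, D_IMAGE = 80, 192, 512  # illustrative sizes

# Hypothetical frozen teacher: an image-to-3DMM-coefficient regressor.
W_teacher = rng.normal(scale=0.01, size=(N_COEFFS, D_IMAGE))

# Student: a voice-to-coefficient regressor trained to match the teacher.
W_student = np.zeros((N_COEFFS, D_VOICE))

def distillation_step(W_student, voice_emb, image_emb, lr=0.1):
    """One gradient step on the MSE between student and teacher
    coefficient predictions (the distillation loss). The teacher's
    output on the paired face image is the pseudo-target, so no
    3D face scans are needed."""
    target = W_teacher @ image_emb        # teacher pseudo-label
    pred = W_student @ voice_emb          # student prediction
    grad = np.outer(pred - target, voice_emb) / N_COEFFS
    return W_student - lr * grad

voice_emb = rng.normal(size=(D_VOICE,))
image_emb = rng.normal(size=(D_IMAGE,))

before = np.mean((W_student @ voice_emb - W_teacher @ image_emb) ** 2)
for _ in range(200):
    W_student = distillation_step(W_student, voice_emb, image_emb)
after = np.mean((W_student @ voice_emb - W_teacher @ image_emb) ** 2)
print(after < before)  # distillation loss decreases on this pair
```

The key design point is that supervision flows from a modality where labels are cheap (face images) to one where they are scarce (3D geometry paired with voice), which is what makes the unsupervised setting practical.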
Research Paper: Wu, C.-Y., Hsu, C.-C., and Neumann, U., "Cross-Modal Perceptionists: Can Face Geometry be Gleaned from Voices?", 2022. Link: https://arxiv.org/abs/2203.09824