When pianists play a musical piece on a piano, their body reacts to the music. Their fingers strike piano keys to create music. They move their arms to play on different octaves. Violin players draw the bow with one hand across the strings and touch lightly or pluck the strings with the other hand’s fingers. Faster bowing produces a faster music pace.
As part of the long-term goal of using augmented reality and artificial intelligence to help teach people how to play musical instruments, this research investigated whether the correlation between music signals and finger movement can be predicted computationally. We show that it can. To our knowledge, this is the first time such an idea has been tested.
Our goal was to create an animation of an avatar that moves its hands the way a pianist or violinist would, just by hearing the audio. Our method takes violin or piano music as input and outputs a video of skeleton predictions, which are then used to animate an avatar. We successfully demonstrate that natural body dynamics can be predicted. This research was presented in our paper Audio to Body Dynamics at the 2018 Conference on Computer Vision and Pattern Recognition (CVPR).
Predicting body movement from a music signal is a highly challenging computational problem. To tackle it, we needed a good training set of videos, a way to accurately estimate body poses in those videos, and an algorithm able to find the correlation between music and body movement.
There is no available training data for such a purpose. Traditionally, state-of-the-art prediction of natural body movement from video sequences (not audio) used motion capture sequences created in a lab. To replicate a traditional approach, we would need to bring a pianist to a laboratory and have them play several hours with sensors attached to their fingers and body joints. This is hard to execute and not easily generalizable.
Instead, we leveraged publicly available online videos of highly skilled musicians playing, which could also allow a higher degree of diversity in the data. We collected 3.6 hours of violin and 4.4 hours of piano recital videos “in the wild” from the Internet and processed them by detecting the upper body and fingers in each frame of each video.
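To learn from such videos, each video frame's detected landmarks have to be paired with the stretch of audio that plays during that frame. The following is a minimal sketch of that bookkeeping, not the code used in the research. The function name frame_audio_windows and the sample rate and frame rate in the usage lines are assumptions made for illustration.

```python
def frame_audio_windows(num_samples, sample_rate, fps):
    """Return one (start, end) audio sample range per video frame.

    Hypothetical helper: it assumes audio and video start at the same
    instant and that the video has a constant frame rate.
    """
    # Number of whole video frames covered by the audio track.
    num_frames = int(num_samples * fps / sample_rate)
    windows = []
    for i in range(num_frames):
        start = round(i * sample_rate / fps)
        end = min(round((i + 1) * sample_rate / fps), num_samples)
        windows.append((start, end))
    return windows


# Illustrative values only: one second of 44.1 kHz audio and 30 fps video.
windows = frame_audio_windows(num_samples=44100, sample_rate=44100, fps=30)
print(len(windows), windows[0])  # 30 (0, 1470)
```

Each window can then be turned into an audio feature vector and stored next to the landmarks detected in the matching frame, giving one training pair per frame.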
We then built a long short-term memory (LSTM) neural network that learns the correlation between audio features and body skeleton landmarks. The predicted points were applied to a rigged avatar to create the animation, so the final output is an avatar that moves according to the audio input.
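To make the shape of this mapping concrete, here is a minimal NumPy sketch of a single-layer LSTM that reads one audio feature vector per frame and emits 2D landmark coordinates for that frame. It shows the forward pass only, with random untrained weights. The class name TinyLSTMRegressor, the layer sizes, and the feature and landmark counts are assumptions chosen for illustration, not the architecture or dimensions from the paper.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class TinyLSTMRegressor:
    """Single-layer LSTM plus a linear readout (hypothetical, forward pass only)."""

    def __init__(self, n_audio, n_hidden, n_landmarks, seed=0):
        rng = np.random.default_rng(seed)
        # One stacked weight matrix for the input, forget, cell and output gates.
        self.W = rng.normal(0.0, 0.1, (4 * n_hidden, n_audio + n_hidden))
        self.b = np.zeros(4 * n_hidden)
        # Linear readout to an (x, y) pair for every landmark.
        self.W_out = rng.normal(0.0, 0.1, (2 * n_landmarks, n_hidden))
        self.b_out = np.zeros(2 * n_landmarks)
        self.n_hidden = n_hidden

    def forward(self, audio_features):
        """Map (T, n_audio) audio features to (T, n_landmarks, 2) points."""
        h = np.zeros(self.n_hidden)  # hidden state
        c = np.zeros(self.n_hidden)  # cell state
        outputs = []
        for x_t in audio_features:
            z = self.W @ np.concatenate([x_t, h]) + self.b
            i, f, g, o = np.split(z, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
            outputs.append(self.W_out @ h + self.b_out)
        return np.stack(outputs).reshape(len(outputs), -1, 2)


# Illustrative sizes: 100 frames, 13 audio features per frame, 50 landmarks.
model = TinyLSTMRegressor(n_audio=13, n_hidden=32, n_landmarks=50)
audio = np.random.default_rng(1).normal(size=(100, 13))
skeleton = model.forward(audio)
print(skeleton.shape)  # (100, 50, 2)
```

Because the hidden and cell states carry over from frame to frame, each predicted pose depends on the audio heard so far rather than on a single frame in isolation. In practice the weights would be trained on the paired audio and landmark data, typically with a deep learning framework rather than hand-written NumPy.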
Method overview: (a) Our method takes an audio signal as input, e.g., piano music, (b) which is fed into our LSTM network to predict body movement points, (c) which in turn are used to animate an avatar and show it playing the input music on a piano (the avatar and piano are models, while the rest is a real apartment background).
The output skeletons are promising, and produce interesting body dynamics. To best experience our results, watch the videos with audio turned on.
The research was inspired by a system we had created at the University of Washington that finds the correlation between a person’s speech and how their lips move. Our hypothesis that body gestures can be predicted from audio signals shows promising initial results. We believe the correlation between audio and human body movement has the potential for a variety of applications in VR/AR and recognition.
One potential application is to use AR to teach people how to play musical instruments. People could potentially learn from the best pianists in the world because we’re using videos of professional pianists for training. When the experience is shown in AR, a person can walk around the avatar in 3D and zoom in on the fingers to see what movements are being made. It is exciting to show how AI can help people create music by learning from real-world examples which movements make great performances.
This work shows the potential of AR to change the way we learn new skills. We are excited to show an early step toward that potential for music.