(1) Stanford University (2) NVIDIA (3) University of Toronto (4) Vector Institute (5) Simon Fraser University
Abstract
We present a system that learns diverse, physically simulated
tennis skills from large-scale demonstrations of tennis play
harvested from broadcast videos. Our approach is built upon
hierarchical models, combining a low-level imitation policy
and a high-level motion planning policy to steer the character
in a motion embedding learned from broadcast videos. When
deployed at scale on large video collections that encompass a
vast set of examples of real-world tennis play, our approach
can learn complex tennis shotmaking skills and realistically
chain together multiple shots into extended rallies, using
only simple rewards and without explicit annotations of
stroke types. To address the low quality of motions extracted
from broadcast videos, we correct estimated motion with
physics-based imitation, and use a hybrid control policy that
overrides erroneous aspects of the learned motion embedding
with corrections predicted by the high-level policy. We
demonstrate that our system produces controllers for
physically simulated tennis players that can hit the incoming
ball to target positions accurately using a diverse array of
strokes (serves, forehands, and backhands), spins (topspins
and slices), and playing styles (one/two-handed backhands,
left/right-handed play). Overall, our system can synthesize
two physically simulated characters playing extended tennis
rallies with simulated racket and ball dynamics.
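To make the hierarchical design concrete, below is a minimal sketch of the control loop the abstract describes: a high-level policy selects a point in a learned motion embedding and predicts a correction that overrides erroneous aspects of that embedding, and a low-level imitation policy decodes the result into joint targets for the physics simulator. All names, dimensions, and the toy linear "policies" are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the hierarchical control loop (assumed structure,
# not the authors' code): high-level policy -> corrected latent in the
# motion embedding -> low-level imitation policy -> joint actuation targets.
import numpy as np

LATENT_DIM = 64   # size of the learned motion embedding (assumed)
OBS_DIM = 32      # task observation: ball + character state (assumed)
ACT_DIM = 28      # joint actuation targets for the simulated player (assumed)

rng = np.random.default_rng(0)

# Stand-ins for trained networks: random linear maps keep the sketch runnable.
W_high = rng.normal(scale=0.1, size=(LATENT_DIM, OBS_DIM))
W_corr = rng.normal(scale=0.01, size=(LATENT_DIM, OBS_DIM))
W_low = rng.normal(scale=0.1, size=(ACT_DIM, LATENT_DIM + OBS_DIM))


def high_level_policy(task_obs):
    """Pick a motion-embedding code and a correction term (hybrid control)."""
    z = np.tanh(W_high @ task_obs)   # point in the learned motion embedding
    delta = W_corr @ task_obs        # correction overriding embedding errors
    return z + delta


def low_level_policy(z, char_obs):
    """Imitation policy: decode the latent into joint actuation targets."""
    return np.tanh(W_low @ np.concatenate([z, char_obs]))


def control_step(task_obs, char_obs):
    """One step of the hierarchy: high-level steers, low-level executes."""
    z = high_level_policy(task_obs)
    return low_level_policy(z, char_obs)


if __name__ == "__main__":
    action = control_step(rng.normal(size=OBS_DIM), rng.normal(size=OBS_DIM))
    print(action.shape)  # (ACT_DIM,) targets handed to the physics simulator
```

In a trained system the random matrices would be replaced by the learned networks, and the simulator's state feedback would close the loop at each control step; the sketch only illustrates how the correction term lets the high-level policy compensate for errors in a motion embedding learned from noisy broadcast-video motion.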