Visual system identification: learning physical parameters and latent spaces from pixels
dc.contributor.advisor
Williams, Christopher
dc.contributor.advisor
Hospedales, Timothy
dc.contributor.advisor
Burke, M.
dc.contributor.author
Jaques, Miguel
dc.date.accessioned
2023-01-20T15:11:55Z
dc.date.available
2023-01-20T15:11:55Z
dc.date.issued
2023-01-20
dc.description.abstract
In this thesis, we develop machine learning systems that are able to leverage the knowledge
of equations of motion (scene-specific or scene-agnostic) to perform object discovery,
physical parameter estimation, position and velocity estimation, camera pose
estimation, and learn structured latent spaces that satisfy physical dynamics rules.
These systems are unsupervised, learning from unlabelled videos, and use as inductive
biases the general equations of motion followed by objects of interest in the scene.
This is an important task as in many complex real-world environments ground-truth
states are not available, although there is physical knowledge of the underlying system.
Our goals with this approach, i.e. integration of physics knowledge with unsupervised
learning models, are to improve vision-based prediction, enable new forms of control,
increase data-efficiency and provide model interpretability, all of which are key areas
of interest in machine learning. With the above goals in mind, we start by asking the
following question: given a scene in which the objects’ motions are known up to some
physical parameters (e.g. a ball bouncing off the floor with unknown restitution coefficient),
how do we build a model that uses such knowledge to discover the objects in the
scene and estimate these physical parameters?
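As a toy illustration of what "known up to some physical parameters" means in the bouncing-ball example, the sketch below recovers a restitution coefficient from successive apex heights. It is not the thesis's method, which works from pixels; it assumes an ideal ball whose apex heights are already known, and the function names are hypothetical.

```python
import math


def simulate_apex_heights(h0, restitution, n_bounces):
    """Apex heights of an ideal bouncing ball.

    Each bounce scales the rebound speed by the restitution coefficient e,
    so the apex height (proportional to speed squared) scales by e**2.
    """
    heights = [h0]
    for _ in range(n_bounces):
        heights.append(heights[-1] * restitution ** 2)
    return heights


def estimate_restitution(heights):
    """Estimate e as the mean of sqrt(h_next / h_prev) over consecutive apexes."""
    ratios = [math.sqrt(nxt / prev) for prev, nxt in zip(heights, heights[1:])]
    return sum(ratios) / len(ratios)


# Assumed ground truth for the illustration: e = 0.8, initial drop height 2.0.
heights = simulate_apex_heights(h0=2.0, restitution=0.8, n_bounces=5)
e_hat = estimate_restitution(heights)
```

Here the equation of motion is fixed and only the single parameter e is unknown, which is the setting the question above describes.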
Our first model, PAIG (Physics-as-Inverse-Graphics), approaches this problem from a
vision-as-inverse-graphics perspective, describing the visual scene as a composition of
objects defined by their location and appearance, which are rendered onto the frame in
a graphics manner. This is a known approach in the unsupervised learning literature,
where the fundamental problem then becomes that of derendering, that is, inferring and
discovering these locations and appearances for each object. In PAIG we introduce a
key rendering component, the Coordinate-Consistent Decoder, which enables the integration
of the known equations of motion with an inverse-graphics autoencoder architecture
(trainable end-to-end), to perform simultaneous object discovery and physical
parameter estimation. Although trained on simple simulated 2D scenes, we show that
knowledge of the physical equations of motion of the objects in the scene can be used
to greatly improve future prediction and provide physical scene interpretability.
Our second model, V-SysId, tackles the limitations shown by the PAIG architecture,
namely the training difficulty, the restriction to simulated 2D scenes, and the need for
noiseless scenes without distractors. Here, we approach the problem from first principles
by asking the question: are neural networks a necessary component to solve this
problem? Can we use simpler ideas from classical computer vision instead? With
V-SysId, we approach the problem of object discovery and physical parameter estimation
from a keypoint extraction, tracking and selection perspective, composed of 3 separate
stages: proposal keypoint extraction and tracking, 3D equation fitting and camera pose
estimation from 2D trajectories, and entropy-based trajectory selection. Since all the
stages use lightweight algorithms and optimisers, V-SysId is able to perform joint object
discovery, physical parameter and camera pose estimation from even a single video,
drastically improving data-efficiency. Additionally, because it does not use a
rendering/derendering approach, it can be used in real 3D scenes with many distractor
objects. We show that this approach enables a number of interesting applications, such as
vision-based robot end-effector localisation and remote breath rate measurement.
Finally, we move into the area of structured recurrent variational models from vision,
where we are motivated by the following observation: in existing models, applying a
force in the direction from a start point to an end point (in latent space) does not
result in a movement from the start point towards the end point, even in the simplest
unconstrained environments. This means that the latent space learned by these models
does not follow Newton’s second law, where the acceleration vector has the same direction
as the force vector (in point-mass systems), and prevents the use of PID controllers,
which are the simplest and most well understood type of controller. We solve this problem
by building inductive biases from Newtonian physics into the latent variable model,
which we call NewtonianVAE. Crucially, Newtonian correctness in the latent space brings
about the ability to perform proportional (or PID) control, as opposed to the more computationally
expensive model predictive control (MPC). PID controllers are ubiquitous
in industrial applications, but had thus far lacked integration with unsupervised vision
models. We show that the NewtonianVAE learns physically correct latent spaces in simulated
2D and 3D control systems, which can be used to perform goal-based discovery
and control in imitation learning, and path following via Dynamic Motion Primitives.
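The link between Newtonian dynamics and simple feedback control can be sketched with a unit point mass. This is only an illustration under stated assumptions, not the NewtonianVAE itself: the state is a known 2D position rather than a learned latent, the gains and step size are arbitrary choices, and `run_pd_control` is a hypothetical name. A proportional-derivative law (a special case of PID) is used so that the motion settles at the goal.

```python
def run_pd_control(start, goal, kp=4.0, kd=4.0, dt=0.01, steps=2000):
    """Drive a unit point mass from start to goal with PD feedback.

    The force points from the current position towards the goal, minus a
    velocity damping term. Because acceleration equals force for a unit
    mass, the state moves towards the goal.
    """
    x = list(start)
    v = [0.0] * len(start)
    for _ in range(steps):
        force = [kp * (g - xi) - kd * vi for g, xi, vi in zip(goal, x, v)]
        # Semi-implicit Euler step: update velocity first, then position.
        v = [vi + dt * f for vi, f in zip(v, force)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Assumed start and goal positions for the illustration.
goal = [1.0, -0.5]
final = run_pd_control(start=[0.0, 0.0], goal=goal)
```

If the state space did not obey this force-acceleration relationship, the same feedback law would give no guarantee of reaching the goal, which is the failure described above for existing latent models.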
en
dc.identifier.uri
https://hdl.handle.net/1842/39745
dc.identifier.uri
http://dx.doi.org/10.7488/era/2993
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.subject
Visual System Identification
en
dc.subject
Learning Physical Parameters
en
dc.subject
Learning Latent Spaces
en
dc.subject
machine learning systems
en
dc.subject
equations of motion
en
dc.subject
unsupervised learning models
en
dc.subject
vision-based prediction
en
dc.subject
Physics-as-Inverse-Graphics
en
dc.subject
PAIG
en
dc.subject
Coordinate-Consistent Decoder
en
dc.subject
inverse-graphics autoencoder architecture
en
dc.subject
V-SysId
en
dc.subject
3D equation fitting
en
dc.subject
simulated 2D
en
dc.subject
latent variable model
en
dc.subject
model predictive control
en
dc.subject
MPC
en
dc.title
Visual system identification: learning physical parameters and latent spaces from pixels
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en
Files
Original bundle
1 - 1 of 1
- Name:
- JaquesM_2022.pdf
- Size:
- 17.43 MB
- Format:
- Adobe Portable Document Format

