-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathindex.html
363 lines (299 loc) · 17.4 KB
/
index.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description"
content="Compositional Visual Generation with Energy Based Models">
<meta name="author"
content="Yilun Du, Shuang Li, Igor Mordatch">
<title>Compositional Visual Generation with Energy Based Models</title>
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css"
integrity="sha384-Gn5384xqQ1aoWXA+058RXPxPg6fy4IWvTNh0E263XmFcJlSAwiGgFAW/dAiS6JXm" crossorigin="anonymous">
<!-- Custom styles for this template -->
<link href="offcanvas.css" rel="stylesheet">
<!-- <link rel="icon" href="img/favicon.gif" type="image/gif">-->
</head>
<body>
<div class="jumbotron jumbotron-fluid">
<div class="container"></div>
<h2>Compositional Visual Generation with Energy Based Models</h2>
<h3>NeurIPS 2020 (Spotlight)</h3>
<hr>
<p class="authors">
<a href="https://yilundu.github.io">Yilun Du<sup>1</sup></a>,
<a href="https://people.csail.mit.edu/lishuang/"> Shuang Li<sup>1</sup></a>,
<a href="https://scholar.google.com/citations?user=Vzr1RukAAAAJ&hl=en">Igor Mordatch<sup>2</sup></a>
</p>
<p class="institution">
<sup>1</sup> MIT CSAIL
<sup>2</sup> Google Brain
</p>
<div class="btn-group" role="group" aria-label="Top menu">
<a class="btn btn-primary" href="https://arxiv.org/abs/2004.06030">Paper</a>
<a class="btn btn-primary" href="https://github.com/yilundu/ebm_compositionality">Code</a>
</div>
</div>
<div class="container">
<div class="section">
<p>
A vital aspect of human intelligence is the ability to compose increasingly complex concepts out of simpler ideas, enabling both rapid learning and adaptation of knowledge. In this paper we show that energy-based models can exhibit this ability by directly combining probability distributions. Samples from the combined distribution correspond to compositions of concepts. For example, given a distribution for smiling faces, and another for male faces, we can combine them to generate smiling male faces. This allows us to generate natural images that simultaneously satisfy conjunctions, disjunctions, and negations of concepts. We evaluate compositional generation abilities of our model on the CelebA dataset of natural faces and synthetic 3D scene images. We also demonstrate other unique advantages of our model, such as the ability to continually learn and incorporate new concepts, or infer compositions of concept properties underlying an image.
</p>
</div>
<div class="section">
<h2>The Problem with Existing Concept Representations</h2>
<hr>
<p>
Existing works on representing visual concepts can be divided into two separate categories. One category
of approaches represent global factors of variation, such as facial expression, by situating them
on a fixed underlying latent space, such as
the β-VAE model. A separate category of approaches represent local factors of variation by a set of
as segmentation masks, as exemplified by the MONET model. While the former can enable representing global
concepts, and the latter local concepts, there does exist a concept representation that can represent
both in an unified manner. In this paper, we propose instead to
represent concepts as <b>EBMs</b>. Such an approach enables us to represent both global factors of variation
as well as local object factors of variation. Furthermore, we show that our concept representation enables
us to combine <b>disjointly</b> learned factors together in a <b>zero shot </b> manner.
</p>
</div>
<div class="section">
<h2>Compositional Visual Generation with Energy Based Models</h2>
<hr>
<p>
Energy based models <a href="https://openai.com/blog/energy-based-models/">(EBMs) </a> represent
a distribution over data by defining an energy \(E_\theta(x) \) so that the likelihood of the data
is proportional to \( \propto e^{-E_\theta(x)}\). Sampling in EBMs is done through MCMC sampling,
using <a href="https://www.ics.uci.edu/~welling/publications/papers/stoclangevin_v6.pdf">Langevin dynamics</a>.
Our key insight in our work is that underlying probability distributions can be <b>composed</b> together to
represent different composition of concepts. Sampling under the modified probability distributions may then be
done using Langevin dynamics. We define three different composition operators over probability distributions,
for the standard logical operators of conjunction, disjunction, and negation which we illustrate below.
</p>
<div class="row justify-content-left">
<div class="col-sm-1">
</div>
<div class="col-sm-10">
<img src="fig/comp_cartoon.png" style="width:100%">
</div>
<div class="col-sm-1">
</div>
</div>
<br>
<h4>Conjunction</h4>
<hr>
<p>
In concept conjunction, given separate independent concepts (such as a particular gender, hair style, or
facial expression), we wish to construct an generation with the specified gender, hair style, and facial
expression -- the combination of each concept. The likelihood of such an generation given a set of specific
concepts is equal to the product of the likelihood of each individual concept
\[ p(x|c_1 \; \text{and} \; c_2, \ldots, \; \text{and} \; c_i) = \prod_i p(x|c_i) \propto e^{-\sum_i E(x|c_i)}. \]
We utilize Langevin sampling to sample from this modified EBM distribution.
</p>
<h4>Disjunction</h4>
<hr>
<p>
In concept disjunction, given separate concepts such as the colors red and blue, we wish to construct an output that is either red or blue. We wish to construct a new distribution that has probability mass when any chosen concept is true. A natural choice of such a distribution is the sum of the likelihood of each concept:
\[ p(x|c_1 \; \text{or} \; c_2, \ldots \; \text{or} \; c_i) \propto \sum_i p(x|c_i) / Z(c_i). \]
where \( Z(c_i) \) denotes partition function for the chosen concept. We utilize Langevin sampling
to sample from this modified EBM distribution.
</p>
<h4>Negation</h4>
<hr>
<p>
In concept negation, we wish to generate an output that does not contain the concept. Given a color red, we want an output that is of a different color, such as blue. Thus, we want to construct a distribution that places high likelihood to data that is outside a given concept. One choice is a distribution inversely proportional to the concept. Importantly, negation must be defined with respect to another concept to be useful. The opposite of alive may be dead, but not inanimate. Negation without a data distribution is not integrable and leads to a generation of chaotic textures which, while satisfying absence of a concept, is not desirable. Thus in our experiments with negation we combine it with another concept to ground the negation and obtain an integrable distribution:
\[ p(x| \text{not}(c_1), c_2) \propto \frac{p(x|c_2)}{p(x|c_1)^\alpha} \propto e^{ \alpha E(x|c_1) - E(x|c_2) } \]
We utilize Langevin sampling to sample from this modified EBM distribution.
</p>
<h4>Concept Composition</h4>
<hr>
By combining the above operators, we can controllably generate images with complex relationships. For example, given EBMs trained on male, smiling, and black haired faces, through combinations of negation, disjunction and conjunction, we can selectively generate images in a Venn diagram as shown below:
<div class="row justify-content-left">
<div class="col-sm-2">
</div>
<div class="col-sm-8">
<img src="fig/venn.jpg" style="width:100%">
</div>
<div class="col-sm-2">
</div>
</div>
</div>
<div class="section">
<h2>Global Factor Compositional Generation</h2>
<hr>
<p>
We first explore the ability of our models to compose different global factors of variation. We train seperate
EBMs on the attributes of shape, position, size and color. Through conjunction on each model sequentially,
we are able to generate successively more refined versions of an object scene.
</p>
<div class="row align-items-center">
<div class="col justify-content-center text-center">
<img src="fig/com_cube.jpg" style="float:none;width:500px">
</div>
</div>
<p>
We can similarily train seperate EBMs on the attributes of young, female, smiling, and wavy hair. Through
conjunction on each model sequentially, we are able to generate successively more refined versions of human faces.
Surprisingly, we find that generations of our model are able to become increasingly more refined by adding
more models.
</p>
<div class="row align-items-center">
<div class="col justify-content-center text-center">
<img src="fig/com_human.jpg" style="float:none;width:400px">
</div>
</div>
</div>
<div class="section">
<h2>Higher Level Composition</h2>
<hr>
<p>
We can further compose seperately trained EBMs in additional ways by nesting the compositional operators of conjunction, disjunction and negation. In the figure below, we showcase face generations of nested compositions.
</p>
<div class="row align-items-center">
<div class="col justify-content-center text-center">
<img src="fig/symbolic.jpg" style="float:none;width:400px">
</div>
</div>
</div>
<div class="section">
<h2>Object Level Compositional Generation</h2>
<hr>
<p>
We can also learn EBMs on the object attributes. We train a single EBM model to represent the position attribute. By summing EBMs conditioned on two different positions (conjunction), we can compositionally generate different number of cubes at the object level.
</p>
<div class="row align-items-center">
<div class="col justify-content-center text-center">
<img src="fig/multiobject.jpg" style="float:none;width:350px">
</div>
</div>
</div>
<div class="section">
<h2>Continual Learning of Visual Concepts</h2>
<hr>
<p>
By having the ability to compose in a zero-shot manner, EBMs enable us to learn, combine, and compose
visual concepts acquired in a <b>continual</b> manner. To test this we consider:
<ul id='continual_dataset'>
<li font-size: 6px>
A dataset consisting of position annotations of purple cubes at different positions.
</li>
<li font-size: 6px>
A dataset consisting of shape annotations of different purple shapes at different positions.
</li>
<li font-size: 6px>
A dataset consisting of color annotations of different color shapes at different positions.
</li>
</ul>
To mimick the concept learning process found in humans, we train a position EBM model using only the first
dataset, a shape EBM model using only the second dataset, and color EBM model using the third dataset.
Interestingly, <b>despite dataset mismatch</b>, we find that such learned visual concepts can <b>robustly
</b> and <b>continually</b> combine which other models. We illustrate different combinations of position, shape
and color below:
</p>
<div class="row align-items-center">
<div class="col justify-content-center text-center">
<img src='fig/continuous.jpg' style="float:none;width:400px" class='img-fluid'>
</div>
</div>
</div>
<div class="section">
<h2>Related Projects</h2>
<hr>
<p>
Check out our related projects on utilizing energy based models! <br>
</p>
<div class='row vspace-top'>
<div class="col-sm-3">
<img src='img/landmap_short.png' class='img-fluid'>
</div>
<div class="col">
<div class='paper-title'>
<a href="https://energy-based-model.github.io/Energy-Based-Models-for-Continual-Learning/">Energy-Based Models for Continual Learning</a>
</div>
<div>
We show how EBMs help provide an orthogonal approach towards tackling continual learning problems by changing the underlying training objective to causes less interference with previously learned information. Our proposed EBM formulation is simple, efficient, and outperforms baseline methods by a large margin on several benchmarks.
We further show that EBMs are adaptable to a harder continual learning setting where the data distribution changes without explicitly delineated task boundaries.
</div>
</div>
</div>
<div class='row vspace-top'>
<div class="col-sm-3">
<img src='img/protein.png' class='img-fluid'>
</div>
<div class="col">
<div class='paper-title'>
<a href="https://arxiv.org/abs/2004.13167">Energy Based Models for Atomic Level Protein Conformations</a>
</div>
<div>
We introduce EBMs for modeling the underlying energy landscape of atomic level protein conformations. We train
EBMs to predict the energy of different protein rotamer configurations, and find that our trained EBM models
can nearly match the performance of classical energy function Rosetta on the task of protein sidechain prediction.
</div>
</div>
</div>
<div class='row vspace-top'>
<div class="col-sm-3">
<img src='img/ebm_plan.png' class='img-fluid'>
</div>
<div class="col">
<div class='paper-title'>
<a href="https://arxiv.org/abs/1909.06878">Model Based Planning with Energy Based Models</a>
</div>
<div>
We present a framework towards utilizing EBMs to learn, in an online fashion, trajectory level plans for
different start and goal configurations. This allows us to flexibly change and adapt to different
sets of goals by changing the underlying trajectory inference objective.
</div>
</div>
</div>
<div class='row vspace-top'>
<div class="col-sm-3">
<video width="100%" playsinline="" autoplay="" loop="" preload="" muted="">
<source src="img/half.mp4" type="video/mp4">
</video>
</div>
<div class="col">
<div class='paper-title'>
<a href="https://openai.com/blog/energy-based-models/">Implicit Generation and Generalization with Energy Based Models</a>
</div>
<div>
We introduce a method to scale EBM training to modern neural network architectures.
We show that such trained EBMs have a set of unique properties, enabling model robustness,
image and trajectory modeling, continual learning and compositional visual generation.
</div>
</div>
</div>
<div class="section">
<h2>Paper</h2>
<hr>
<div>
<div class="list-group">
<a href="https://arxiv.org/abs/2004.06030"
class="list-group-item">
<img src="img/paper_thumbnail.png"
style="width:100%; margin-right:-20px; margin-top:-10px;">
</a>
</div>
</div>
</div>
<div class="section">
<h2>Bibtex</h2>
<hr>
<div class="bibtexsection">
@inproceedings{du2020compositional,
title={Compositional Visual Generation with Energy Based Models},
author={Du, Yilun and Li, Shuang and Mordatch, Igor},
booktitle={Advances in Neural Information Processing Systems},
year={2020}
}
</div>
</div>
<hr>
<footer>
<p>Send feedback and questions to <a href="https://yilundu.github.io">Yilun Du</a></p>
</footer>
</div>
</body>
<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
</html>