Recently this kind of DEEP METHOD is quite popular, below are some paper reading summary about GANs, which include some extensions for solving (1) vanishing gradients (W-GAN), (2) mode collapse (W-GAN, unrolled GAN),  (3) diverging or oscillatory behavior (f-GAN),  (4) existence of equilibrium and generalization and (5) evaluation of GAN performance. They result from (1) improper loss function and (2) optimization of the non-convex error function.

Although I am a fan of VAEs, I do think both Techniques are worth digging into. I will keep updated of this post. For GANs study, I may keep updated three aspects: (1) theory; (2) algorithm; and (3) application.

### Theory

#### Original GANs:

Learn how to generate new samples from trained samples.  Game theory leads to Nash Equilibrium,  but in GANs do not know whether achieve it. The final results mean the generator produces the original samples. $p_{data}(x) == p_{g}(x)$. Note discriminator as $D(x)$, Generator as $G(z)$. Simply two networks are competing with Minimax as follows: $\min_{G}\max_{D} V(D, G) = E_{x\sim p_{data}(x)}[\log D(x)] + E_{z\sim p_{z}(z)}[\log (1-D(G(z))]$.

For my understanding, by maximizing data log-likelihood which helps to formulate a good discriminator that distinguish $X$ from data and generator, while the second step minimizing the second term (only $G$ involved) equals to generate a good example from prior $p(z)$.

Some conclusions:

1.  For $G$ fixed, the optimal $D$ is $D^{\star} (x) = \frac{p_{data}(x)}{p_{data}(x) + p_{g}(x)}$;
2.  The global minimum of $max_{D}V(D, G)$ is achieved if and only if $p_{g} == p_{data}$ and it is $-\log (4)$.
##### Algorithms:

During each training iterations

• sample true data from $\{x^{1}, \cdots, x^{m}\} \sim p_{data}(x)$
• sample noise from prior $\{z^{i},\cdots,z^{m}\} \sim p_{prior}(z)$
• Get generated data from generator network $x_{s}^{i} = G(z^{i})$
• update discriminator $\theta_{D}$ using gradient descent as
• $V = 1/m \sum_{i=1}^{m} \log D(x^{i}) +1/m \sum_{i=1}^{m} \log(1-D(x_{s}^{i}))$
• $\theta_{D} \leftarrow \theta_{D} + \eta \nabla V(\theta_{D}))$
• Above learning discriminator network needs to be performed $K$ times
• sample noise from prior $\{z^{i},\cdots,z^{m}\} \sim p_{prior}(z)$
• update generator now
• $V = 1/m \sum_{1}^{m} \log(1- D(G(z^{i})))$
• $\theta_{G} \leftarrow \theta_{G} - \eta \nabla V(\theta_{G})$
• Above learning generator network needs only once.

#### f-GAN

1. f-divergence uses the idea from Fenchel conjugate. It is a unified framework for GANs. That means it basically extends discriminator to any divergence belong to $f$. $D_{f} (P||Q) = \int_{x} q(x) f(\frac{p(x)}{q(x)}) dx$

Two requirements are for $f$: (1) $f$ should be convex function; (2) $f(1) = 0$. Many examples of commonly-used divergence:

• $f = x\log x$, KL-divergence
• $f = -\log x$, Reverse-KL-divergence
• $f = (x-1)^{2}$, Chi-square

2. Fenchel conjugate

Every convex function has a conjugate function. $f^{\star}(t) = sup_{x\sim dom(f)} \{xt - f(x)\} \to f(x) = max_{t \sim dom(f^{\star})}\{xt - f^{\star}(t)\}$, which is basical idea of fenchel conjugate

3. Connection with original GANs $D_{f}(P||Q) = \int_{x} q(x) f(\frac{p(x)}{q(x)}) dx = \int_{x} q(x) (\max_{t \sim dom(f^{\star})}\{\frac{p(x)}{q(x)}t - f^{\star}(t) \} ) dx \\ \approx \max_{D} \int_{x} p(x)D(x) dx - \int_{x} q(x) f^{\star}(D(x)) dx$

Where $D$ is a function whose input is $X$ and output is $t$.

Thus finally we get $D_{f}(P||Q) \approx \max_{D} \int_{x} p(x)D(x) dx - \int_{x} q(x) f^{\star}(D(x)) dx = \max_{D} E_{x \sim P}[D(x)] - E_{x \sim Q}[f^{\star}(D(x))]$

It is something similar and related to GANs $D_{f}(p_{data}||p_{G}) = \max_{D} E_{x \sim p_{data}}[D(x)] - E_{x \sim p_{G}}[f^{\star}(D(x))]$

We need to figure out the transiformation function from $t$ to $x$, which is the function of $G$, thus \begin{aligned} G^{\star} &= arg \min_{G} D_{f}(p_{data}||p_{G}) &= arg\min_{G} \max_{D} \{E_{x \sim p_{data}}[D(x)] - E_{x \sim p_{G}}[f^{\star}(D(x))]\} \end{aligned}

Borrowing from the idea of french conjugate, minmax is a saddle point optimization problem. Change the optimization in vanilla algorithm with single step update.

#### Wasserstein GAN:

This part has information learned from https://zhuanlan.zhihu.com/p/25071913, which is insightful and has more details, also from GAN to WGAN, good blog exists

For me, to some extend KO “DCGAN” paper discussed below

1. Solve unstable training problem
2. Solve collapse mode problem, making the samples generated to be more general
3. Give indicator like cross-entropy and accuracy to show the training process
4. No need careful deigning network structure, simple fully connected layer is ok

wasserstein reveals the distance that can truly tell us the distance between two distributions even they are not overlapping, while JS and KL cannot do it.

Basically, in WGAN,

(1) The Discriminator does not try to distinguish generated sample and the real sample, and they try to fit the Wasserstein distance among all the real samples, which is transferring from classification task to regression task;

(2) The Generator does try to generator sample like the real one instead they minimize the Wasserstein distance.

#### Improved Training of Wasserstein GANs:

WGAN still has non-convergence and low-quality generation problem, they think it results from the weight clipping method used in WGAN with Lipschitz constraint, thus they have penalized the norm of gradients of the critic with respect to its input. Which achieves stable training (no hyperparameter tuning) and high-quality generations.

My understanding of it: the weights clipping lead to weights to have very limited choices under training while gradient penalty (GP) improves it to enforce more diverse choices. When the weight changes sharply, it may lead to gradient explosion or vanishing, thus the key point is to find another way to meet Lipschitz constraint, which they call  gradient penalty (GP).

#### Info-GAN:

It is unsupervised learning method for disentangled representation. Basically, it is a framework by adding GANs with Mutual information method.

Intuition:  Interpretable representation. By disentangling the features, it can be more clearly use them in the future task. In other words, we need to force a disentangled representation. $\min_{G}\max_{D} V_{I}(D, G) = V(D, G) - \lambda I(c; G(z,c))$

My understanding

• Adding mutual information penalty into generator network, it means we intend to maximize the mutual information between latent code $c$ and $G(z,c)$ to tell us that the latent codes modulate the generator results.
• In the paper, that’s why the latent code has so much influence on the final generated images.
• Forcing high information content, we encode the most interesting aspects of the representation into latent code $c$.
• latent code $c$: meaningful information; noise input $z$ unrelated or unimportant information for images.

Instead of maximizing $I(c; G(z,c))$, can maximize the variational lower bound with approximated posterior $Q(c|x)$

#### Dynamic GAN:

The first paper introduces some theory on the convergence dynamics on GANs. The structure of discriminator network is analyzed through a Gaussian mixture model, which is a simple but intuitive way to have first-order dynamics analysis. The deep neural network dynamics are much more complex, and this step is really giving me insight into how the convergence dynamics influence to the final static state (people always focus on this aspect).

It also explains the problem of gradients vanishing and mode collapse problems in GANs. Also compared multiple step update on discriminator and the first-order update gives a theory of convergence dynamics on GMM structure.

#### The Numerics of GANs:

The paper proposes Numerics analysis of GANs, which suggests the convergence of GANs results from two principal features:

1.  if the Jacobian has only eigenvalues with negative real-part at the equilibrium point, GAN training converges locally for small enough learning rates.
2. On the other hand, if the Jacobian has eigenvalues on the imaginary axis, it is generally not locally convergent

It is a more fundamental suggestion for designing a new algorithm. Absolutely continuous data and generator distributions have all eigenvalues of the
Jacobian have negative real-part, but it is not the commonly-seen case in the real world (where both distributions lie on lower dimensional manifolds).

#### Which Training Methods for GANs do actually Converge?; (2018ICML) ***

I strongly recommend for this paper if you want to know more recent updates (until May 2018) on the convergence analysis of GAN gradient descent dynamics analysis.

### Classification of Problems and Related Paper:

First of all, read GANs tutorial from NIPS2016

#### 1. Mode Collapse

Unrolled GANs; (2017ICLR) **

Improves Techniques for Training GANs; (2016) ***

On Distinguishability Criteria for Estimating Generative Models; (2014) **
Towards Principled Methods for training GANs; (2016) ***

Mode Regularized GAN; (2016) **

Improved Training of Generative Adversarial Networks using Representative Features; (2018ICML) **


#### 2. Non-convergence

On Distinguishability Criteria for Estimating Generative Models; (2014) **

Improves Techniques for Training GANs; (2016) ***

f-GAN: Training Generative Sampler using Variational Divergence Minimization; (2016)***

Stabilizing Training of Generative Adversarial Networks through Regularization; (2017) **

The Numerics of GANS; (2017NIPS) **

Gradient descent GAN optimization is locally stable; (2017NIPS) **

Which Training Methods for GANs do actually Converge?; (2018ICML,very new) ***

Improved Training of Generative Adversarial Networks using Representative Features; (2018ICML) **


Towards Principled Methods for training GANs; (2016) ***

Wasserstein GAN; (2017) ****

Improved Training of Wasserstein GANs; (2017NIPS) ***

#### 4. Generalizability and Existence of Equilibrium

Generalization and Equilibrium in Generative Adversarial Nets (GANs); (2017) **

#### 5. Evaluation methods for GANs

Improves Techniques for Training GANs; (2016) ***

Pros and Cons of GAN Evaluation Measures; (2018) *

Geometry Score: A Method For Comparing Generative Adversarial Networks