Understanding the mathematical background of Generative Adversarial Networks (GANs)

Generative Adversarial Networks (GANs) have gained widespread attention since their introduction, leading to numerous extensions and applications of the original GAN idea. A thorough understanding of GANs' mathematical foundations is necessary to use and build upon these techniques. However, most studies on GANs are presented from a computer science or engineering perspective, which can be challenging for beginners to understand fully. Therefore, this paper aims to provide an overview of the mathematical background of GANs, including detailed proofs of optimal solutions for vanilla GANs and bounds for f-GANs that minimize a variational approximation of the f-divergence between two distributions. These contributions will enhance the understanding of GANs for those with a mathematical background and pave the way for future research.


Introduction
Generative Adversarial Networks (GANs), introduced by [1], consist of generative and discriminative neural network models that are usually denoted by the letters G and D, respectively. To visualize the GAN environment better, the generative model may be regarded as a counterfeiter attempting to produce a fraudulent copy of Van Gogh's Starry Night and sell it without being noticed, whereas the discriminative model is equivalent to an expert who specializes in Van Gogh and tries to detect the fraudulent painting. However, the counterfeiter does not care about producing images that are a variation of the original Starry Night painting. In the applications of GANs, the aim is not to present a new image identical to the original painting. Instead, the aim is to create a unique illustration of Starry Night that the Van Gogh expert recognizes as an unknown Van Gogh painting, unprecedented anywhere before. As a result, a competition starts between the generator and discriminator over the detection of the fraudulent painting. The competition continues until the counterfeiter becomes skilled enough to deceive the expert successfully. More precisely, the discriminator's role is to distinguish the real and fraudulent paintings, while the generator's role is to generate fraudulent paintings in such a way that they mislead the discriminator, until the discriminator can no longer reliably reject the fraudulent paintings (see Figure 1).

The Generator takes random noise from the latent space as input and generates fake data, attempting to mimic the real data distribution. Initially, the generated data is random and typically of low quality. A batch of real data is sampled from the training dataset and serves as the ground truth for the Discriminator during training. The Discriminator is trained on both real and fake data. It is presented with the real data and the corresponding labels (1 for real) to learn to distinguish real data from fake data. It is then presented with the fake data generated by the Generator and the corresponding labels (0 for fake) to learn to identify the fake data. The Discriminator's performance is evaluated using a loss function, such as binary cross-entropy, which measures how well the Discriminator differentiates the real data from the fake data. The Generator is trained to deceive the Discriminator by generating fake data that appears as realistic as possible. The Generator takes random noise as input and aims to generate data that the Discriminator labels as real. The Generator's performance is evaluated using the Discriminator's response to the fake data it generates. The Generator's loss function encourages it to generate data that deceives the Discriminator (i.e., pushes the Discriminator's prediction closer to 1). The model parameters of both the Generator and Discriminator are updated using gradient descent or some variant, optimizing their respective loss functions. The process continues iteratively, with the Generator getting better at generating realistic data and the Discriminator becoming more skilled at differentiating real from fake data. The ideal state is reached when the Generator can create data that is indistinguishable from real data, and the Discriminator cannot confidently classify between the real and generated data. It is important to note that the Generator never uses the real data as input and trains solely on random noise.
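The label bookkeeping described above can be sketched numerically. This is a minimal NumPy illustration, not part of the original paper: the discriminator outputs below are hypothetical stand-in values rather than the output of a trained network.

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy, the discriminator loss described above."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

rng = np.random.default_rng(0)

# Hypothetical discriminator outputs on a batch of real and fake samples.
d_real = rng.uniform(0.6, 0.9, size=64)   # D(x): should be near 1 for real data
d_fake = rng.uniform(0.1, 0.4, size=64)   # D(G(z)): should be near 0 for fakes

# Discriminator: real samples carry label 1, generated samples label 0.
d_loss = bce(np.ones(64), d_real) + bce(np.zeros(64), d_fake)

# Generator: wants D(G(z)) pushed toward 1, i.e. label 1 on its own fakes.
g_loss = bce(np.ones(64), d_fake)
```

With these stand-in values the generator loss is large, reflecting a discriminator that currently rejects the fakes with ease.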
To put this in a positive framework, we can say that the discriminator serves as a kind of quality control for the generated data. The better the discriminator performs, the better the benchmark for the generator. The generator can finally beat the benchmark when the optimal strategy of the discriminator is essentially only guessing whether the generated data are fraudulent or real. At that point, the generator is ready to be used in synthetic data generation. Some of the key issues that are critical in the applications of GANs are as follows:

• Quantifying "similar objects" is trickier than it sounds, and it lies at the core of GANs. In mathematics, we have many alternative methods to quantify the similarity between any two objects, which can also lead us to different objectives when setting up GANs.

• In the application of GANs, we aim to generate original objects, which can be distant, in whichever distance measure we consider, from the objects at hand in the training dataset χ (i.e., we do not want to copy χ, but we feel the generated objects and χ belong to the same class).

• We do not care about generating a perturbation of the original painting. Instead, we want to produce a fake painting that the expert is going to consider a unique painting that belongs to Van Gogh, which she has seen for the first time in her life.

• In this setting, the appropriate concept of similarity is distributional similarity. We call two objects similar if both are samples from the same (or roughly the same) probability distribution. This means that the two objects share similar characteristics and features that are determined by the underlying probability distribution. Therefore, we maintain a training dataset, denoted by χ ⊂ R^n, that consists of samples gathered from µ. In this context, µ is a probability distribution, and its density is represented by p(x). We want to arrive at a probability distribution ν, having a density q(x), that reasonably approximates µ. Then, we can obtain artificial or synthetic objects that resemble the objects in the training (real) dataset χ by sampling from ν.

• You may question why we do not just take ν = µ and obtain samples from the real data distribution µ. Unfortunately, such sampling is exactly the main problem of GANs since µ is not known explicitly. The only thing that we know is that we have a finite set of samples χ drawn from µ. Consequently, the actual issue is identifying the properties of µ by only using χ. In this sense, we must focus on specifying an appropriate probability distribution ν as an approximation to µ.

• In addition to considering a distribution similar to µ in the sense of probability distances, one can also try to characterize µ by the empirical behavior of the data, their so-called stylized facts.

• Generally, the success of GANs depends on the sophistication of µ and the size of the training dataset χ.

The basic approach of GANs
The purpose of this study is to clarify the mathematical background of GANs. Therefore, it focuses only on theoretical aspects of GANs and does not contain any applications.
To approximate a given probability distribution µ, GANs require an initially defined probability distribution to start the training. Generally, the initial distribution, which we denote by γ, is introduced on the space R^d. Here, the dimension d is not necessarily identical to the dimension n (of R^n). Now, suppose we have chosen the initial distribution γ to be the standard normal distribution, denoted N(0, I_d). However, we are free to choose γ from other well-known probability distribution families (e.g., uniform). GANs utilize a technique to discover a mapping G : R^d −→ R^n. At this stage, consider a random variable z ∈ R^d sampled from the initial distribution γ. Then, G(z) is a random variable in R^n whose probability distribution is the pushforward γ ∘ G^{-1}. Here, G^{-1} denotes the preimage map, which maps subsets of the space R^n to subsets of the space R^d. Therefore, in the GAN modeling method, we desire to find a mapping G such that γ ∘ G^{-1} = µ, or at least such that γ ∘ G^{-1} is a reasonable approximation of the real data distribution µ.
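The pushforward construction γ ∘ G^{-1} can be illustrated with a toy linear generator. This is a sketch only: a deep network would replace the matrix below in practice, and the specific A and b are arbitrary choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 2, 3  # latent dimension d need not equal data dimension n

# gamma = N(0, I_d): the initial (latent) distribution.
z = rng.standard_normal((10_000, d))

# A hypothetical linear generator G : R^d -> R^n, G(z) = A z + b.
A = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]])  # shape (n, d)
b = np.array([1.0, -1.0, 0.0])
x = z @ A.T + b  # samples from the pushforward gamma o G^{-1}

# For this linear G the pushforward is N(b, A A^T), so the empirical
# moments of the generated samples should match the theoretical ones.
emp_mean = x.mean(axis=0)
emp_cov = np.cov(x.T)
```

For a nonlinear neural-network G the pushforward has no such closed form, which is precisely why a discriminator is needed to compare it with µ.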
The vanilla GAN approach forms an adversarial system in which the generator receives updates on a continuous basis to increase output accuracy. More rigorously, the vanilla GAN introduces a neural network called a discriminator, which attempts to label the observed samples as real and the generated samples as fake. From this perspective, the discriminator behaves like a classifier that attempts to distinguish real samples from fake samples. To this end, the discriminator assigns a probability D(x) ∈ [0, 1] to each sample x for its probability of being a real sample. If samples G(z_j) are outputs of the generator, the discriminator attempts to reject them since they are fake samples.
In the early stage of training a GAN, rejecting generated samples as fake should not be challenging since the generator is not yet adept at generating realistic samples. However, after each attempt in which G fails to produce realistic samples that trick D, G learns and adjusts itself with a refinement update. Thus, the improved G performs more reasonably compared to the one used at the early stages, and then it is the discriminator D's turn to revise itself for refinement. In an ideal case, through such an adversarial iterative process, we eventually arrive at an equilibrium point at which even the best D cannot perform more satisfactory labeling than a random guess. At this point, the samples generated by G become nearly identical in distribution to the training samples χ. Consequently, the discriminator's decision becomes completely random, and the probability of being real approaches 50%.
In the GAN modeling approach, we have to define both the discriminator and the generator by utilizing neural networks to understand the distributional properties of the given data. Each neural network has its corresponding parameters, ω and θ. These parameters are used in the training of the discriminator and generator and include the weights (also known as synaptic weights) of the neural network layers, as well as the biases of these layers. They are learned during training to optimize the performance of the GAN in generating realistic samples. Hence, we write D_ω(x) for the discriminator and G_θ(z) for the generator, and we denote ν_θ := γ ∘ G_θ^{-1}. Thus, it is clear that our task is to identify the desired generator G_θ(z) by adequately adjusting its parameter θ.

Building a GAN framework
As we mentioned above, there are two parties in the GAN modeling method: a generator G_θ(z) and a discriminator D_ω(x), who are in competition, and both parties have their own roles during the modeling process. More specifically:

The generator:
• The generator operates on a random vector of fixed length and produces a fake sample in the corresponding domain.
• The vector is (generally) sampled from the Gaussian distribution and utilized to seed the generator. After the training, points in the multidimensional vector space correspond to points in the real data domain, forming a compact representation of the training data distribution.
• The vector space is called the latent space. It consists of latent variables, or hidden variables, which are critical for the domain but cannot be observed directly.

The discriminator:
• The discriminator uses a sample from the domain as input (it may be either real or fake) and assigns a real or fake (generated) binary class label.
• A real sample comes directly from the original data, while fake samples are outputs of the generator.
• The discriminator is a classifier model. When the training is finished, the discriminator model is discarded, as we are interested in the generator. Occasionally, the discriminator can be reused, since it has learned to extract characteristic features from examples sampled from the problem domain. Some or all of its feature-extraction layers can be utilized in transfer learning applications that use the same or similar input data.
Both players in the min-max game are expressed by a corresponding function. Each function is differentiable with respect to its inputs and parameters. As introduced above, the discriminator is a differentiable function denoted by D that takes x as input and is allowed to use only the discriminator network weights ω as parameters. On the other hand, the generator is specified by G, takes the random vector z as its input, and is only allowed to use the weights of the generator network θ as parameters [2]. In this setting, both players have their own loss functions, described with regard to the parameters specific to each player. The discriminator desires to minimize L^(D)(ω, θ) and must accomplish the minimization by controlling only its parameters ω. On the other hand, the generator desires to minimize L^(G)(ω, θ) and must accomplish the minimization by controlling only its parameters θ. Here, the discriminator and generator losses each rely on the other player's parameters; however, both players are limited to controlling only their own parameters.
Since each player's loss relies on the opposite player's parameters, while each player is allowed to adjust only its own parameters and cannot control the opposite player's, such a scenario is generally expressed as a game rather than a classical optimization problem [2].
As we mentioned already, the generator G is a differentiable function. After we draw the random vector z from a well-known initial distribution γ, G generates a fake sample x, which is implicitly sampled from the model distribution (P_model = ν). Commonly, a deep neural network is utilized to characterize the generator. However, we have some constraints on the configuration of the corresponding neural network. If we want P_model to have full support on X, the dimension of the generator's input should be at least as large as the dimension of X [2]. In a similar fashion, the discriminator D is also a differentiable function, whose objective is to categorize samples accurately as real or fake. The discriminator is likewise naturally characterized by a deep neural network, with some restrictions on the configuration of its corresponding network. It uses only real and fake samples as inputs and assigns a probability score D(x) ∈ [0, 1] to each x [2]. Here, notice that the generator never sees the real data and only uses the random vector z as input, while the discriminator uses both the real data and the generator's output.

A simple derivation of the loss functions
Before starting the definition of the loss functions, note that in classical GAN architectures, the design of the discriminator loss function L^(D) always remains the same. Architectures differ only by the cost function for the generator, L^(G) [2]. The loss function introduced in the original study [1] is obtained from the binary cross-entropy formula as follows:

L(ŷ, y) = y log(ŷ) + (1 − y) log(1 − ŷ).    (1)

Here, y denotes the true label of a sample and ŷ the discriminator's predicted probability that the sample is real.

In the training of the discriminator, the data coming from the real distribution µ(x) carries the label y = 1 (real/observed data) and ŷ = D(x). By substituting this into Eq. (1), we have

log(D(x)),    (2)

and for the data sampled from the generator, the label is y = 0 (fake data) and ŷ = D(G(z)). Similarly, by substituting these into Eq. (1), we end up with

log(1 − D(G(z))).

In this setting, the goal of the discriminator is to accurately classify its input as fake or real. Therefore, the discriminator's objective has to be maximized, and the final loss function of D is

L^(D) = log(D(x)) + log(1 − D(G(z))).    (3)

At this stage, it is important to remember that the generator is competing against the discriminator. Hence, the generator aims to minimize the objective given in Eq. (3), and consequently, its loss function evolves to

L^(G) = log(1 − D(G(z))).    (4)

Now, let us combine the loss functions (3) and (4). By combining these two equations, we obtain a min-max problem as

min_G max_D { log(D(x)) + log(1 − D(G(z))) }.    (5)

Here, it is worth emphasizing that the loss function in Eq. (5) is valid only for a single data point. Therefore, to consider the entire dataset, we need to take the expectation of the combined loss function:

min_G max_D { E_{x∼µ}[log(D(x))] + E_{z∼γ}[log(1 − D(G(z)))] }.    (6)

The min-max formulation introduced in Eq. (6) is a concise one-liner that intuitively captures the adversarial nature of the competition between the players G and D. However, in practice, individual loss functions are defined for both players since the gradient of y = log(x) is steeper around x = 0 than that of y = log(1 − x). This means that trying to maximize log(D(G(z))), or equivalently minimizing − log(D(G(z))), leads to quicker and more significant improvements in the generator's performance than attempting to minimize log(1 − D(G(z))).
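The steepness argument above can be checked directly. The small numeric sketch below (an illustration, not from the paper) compares the derivative magnitudes of the two generator objectives at an early-training value of D(G(z)):

```python
import numpy as np

# Early in training D(G(z)) is close to 0.  Compare the gradient magnitude
# of the two generator objectives with respect to a = D(G(z)):
#   saturating:      log(1 - a)  ->  d/da = -1 / (1 - a)
#   non-saturating: -log(a)      ->  d/da = -1 / a
a = 1e-3  # D(G(z)) for a poor, early-training generator

grad_saturating = -1.0 / (1.0 - a)       # close to -1: tiny learning signal
grad_non_saturating = -1.0 / a           # close to -1000: strong signal
```

Near a = 1 the situation reverses, but by then the generator is already fooling the discriminator, so the non-saturating loss is the practical choice.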

Mathematical description of vanilla GANs
The adversarial game introduced in the previous section can be expressed mathematically as a min-max task for a target function defined by the discriminator D : R^n −→ [0, 1] and the generator G : R^d −→ R^n. Here, it is clear that G transforms the random vector z ∈ R^d sampled from γ into generated (fake) samples G(z). Then, D attempts to distinguish the generated samples from the training samples, which are supposed to be sampled from µ, while G attempts to generate new samples that are identical in distribution to the data that we use in the training of GANs [3].
In the original study [1], a target loss function is introduced as

V(D, G) = E_{x∼µ}[log(D(x))] + E_{z∼γ}[log(1 − D(G(z)))],    (7)

where E represents the expectation with respect to the distribution indicated in the subscript. We omit the subscript when there is no confusion.
The vanilla GAN solves the min-max problem given in Eq. (6). Heuristically, for a given G, the optimization problem max_D V(D, G) reveals the optimal D, which rejects the outputs G(z) by assigning high probabilities to samples from µ and low probabilities to the outputs G(z). In contrast, for a given D, the optimization problem min_G V(D, G) reveals the optimal G, whose outputs G(z) attempt to deceive D into assigning high probabilities to G(z) [3].
Then, let us define y = G(z) ∈ R^n with distribution ν := γ ∘ G^{-1}, where the random vector z ∈ R^d is drawn from γ. Thus, we may rewrite V(D, G) in terms of D and ν as

Ṽ(D, ν) = E_{x∼µ}[log(D(x))] + E_{y∼ν}[log(1 − D(y))].    (8)

Then, the min-max problem defined in Eq. (6) evolves to min_ν max_D Ṽ(D, ν). Now, suppose that the distributions µ and ν have densities p(x) and q(x), respectively. Note that this can only happen under the condition d ≥ n. This condition is needed for the pushforward ν = γ ∘ G^{-1} to admit a density on R^n: if d < n, the generated samples are confined to a lower-dimensional subset of R^n, ν is singular with respect to the Lebesgue measure, and the discriminator can separate real from generated samples trivially, resulting in poor-quality generated samples.
By using the densities, we obtain

Ṽ(D, q) = ∫_{R^n} log(D(x)) p(x) dx + ∫_{R^n} log(1 − D(x)) q(x) dx.

With this form, the min-max problem given in Eq. (6) evolves to

min_q max_D Ṽ(D, q).    (9)

From the evolved problem, notice that Eq. (9) is equal to min_ν max_D Ṽ(D, ν) under the condition ν = γ ∘ G^{-1} for some generator G.

Proposition 1 ([1]). For distributions µ and ν on R^n having densities p(x) and q(x), respectively,

max_D Ṽ(D, ν) = Ṽ(D_{p,q}, ν),  where  D_{p,q}(x) = p(x) / (p(x) + q(x)).

Proof Let us define the integrand as

h(D(x)) = log(D(x)) p(x) + log(1 − D(x)) q(x).

To find the optimal solution, we look at the first-order condition dh/dD(x) = 0 and the second-order condition d²h/dD(x)² < 0. The first-order condition reads

p(x)/D(x) − q(x)/(1 − D(x)) = 0.

By solving this equality for D(x), we find the critical point

D(x) = p(x) / (p(x) + q(x)).

Now, let us compute the second derivative:

d²h/dD(x)² = −p(x)/D(x)² − q(x)/(1 − D(x))².

It is then obvious that the second derivative is strictly negative whenever at least one of p(x) or q(x) is positive, so the critical point is a maximum. Therefore, we find the optimal solution D_{p,q}(x) = p(x)/(p(x) + q(x)).
■ As a result of Proposition 1, we can give the following remark immediately.
Remark 1. The optimal discriminator of the min-max problem satisfies D_{p,q}(x) = p(x)/(p(x) + q(x)) ∈ [0, 1], and this is the requirement for the optimal discriminator. Note that the optimal solution makes sense intuitively:
• If some sample x is clearly real, we may anticipate p(x) to be close to one and q(x) to be close to zero. Hence, the optimal D assigns a value near one to such samples.
• For a generated sample x = G(z) early in training, we anticipate the optimal D to assign a value near zero, since p(G(z)) is then close to zero. When we train G to its optimal value, the density q(x) gets very close to the density p(x), i.e., we obtain D_{p,q}(G(z)) ≈ 0.5.
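Proposition 1 and Remark 1 can be verified numerically. The sketch below (an illustration, not from the paper) uses two hypothetical one-dimensional Gaussian densities for p and q and checks that the closed-form D_{p,q} beats a perturbed discriminator:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-6.0, 6.0, 2001)
dx = x[1] - x[0]
p = gauss_pdf(x, 0.0, 1.0)   # density p of the real distribution mu
q = gauss_pdf(x, 1.5, 1.0)   # density q of the generator distribution nu

# Optimal discriminator from Proposition 1.
d_star = p / (p + q)

def v_tilde(d):
    """Riemann-sum approximation of V~(D, nu) for a discriminator d on the grid."""
    return np.sum(np.log(d) * p + np.log(1.0 - d) * q) * dx

# The optimal D attains a larger value than any perturbed discriminator.
v_opt = v_tilde(d_star)
v_perturbed = v_tilde(np.clip(d_star + 0.1, 1e-9, 1.0 - 1e-9))
```

When q equals p, the same formula collapses to D_{p,q} = 1/2 everywhere, matching Remark 1.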
As a consequence of Proposition 1, we can introduce the following theorem immediately.
Theorem 1. Suppose p(x) is a probability density function defined on the space R^n. Additionally, consider a probability distribution ν having a density function q(x) and a discriminator function D : R^n −→ [0, 1] as usual. Then, we have the min-max problem [3]

min_q max_D Ṽ(D, q) = − log(4),

and we reach the solution with the special choice q(x) = p(x) and D(x) = 1/2, ∀x ∈ supp(p).

Proof By Proposition 1, for a fixed q, the inner maximum is attained at D_{p,q}(x) = p(x)/(p(x) + q(x)), and substituting this choice gives

max_D Ṽ(D, q) = ∫_{R^n} log( p(x)/(p(x)+q(x)) ) p(x) dx + ∫_{R^n} log( q(x)/(p(x)+q(x)) ) q(x) dx = − log(4) + D_KL(p ∥ (p+q)/2) + D_KL(q ∥ (p+q)/2),

a sum of − log(4) and two Kullback-Leibler divergences, each nonnegative and vanishing exactly when q = p. Therefore, Ṽ(D, ν) cannot be smaller than − log(4) for the given choice of D(x) = p(x)/(p(x) + q(x)), and q(x) = p(x), with the induced D(x) = 1/2, yields the minimum possible value. Consequently, we end up with the desired result. ■

Theorem 1 reveals that the solution to the min-max problem given by Eq. (9) is the result we seek under the hypothesis that the distributions admit densities. Theorem 2 below extends the statement to general distributions.
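Theorem 1 can be checked on a grid. Assuming a standard normal p (a choice made only for this sketch), q = p with D = 1/2 should give approximately − log(4), while a mismatched q with its optimal discriminator gives a strictly larger value:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]
p = gauss_pdf(x, 0.0, 1.0)

# q = p and D = 1/2: Theorem 1 says the value is -log 4.
q = p.copy()
d = np.full_like(x, 0.5)
v = np.sum(np.log(d) * p + np.log(1.0 - d) * q) * dx

# A mismatched q, paired with its own optimal discriminator, does worse.
q2 = gauss_pdf(x, 2.0, 1.0)
d2 = p / (p + q2)
v2 = np.sum(np.log(d2) * p + np.log(1.0 - d2) * q2) * dx
```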
Theorem 2. Suppose that µ is again a probability distribution on the space R^n as in Theorem 1. Then, for a probability distribution ν and a discriminator D : R^n −→ [0, 1], we can introduce the min-max problem [3]

min_ν max_D { ∫_{R^n} log(D(x)) dµ(x) + ∫_{R^n} log(1 − D(x)) dν(x) } = − log(4),    (10)

whose solution is achieved with the special choice ν = µ and D(x) = 1/2 µ-a.e.

Proof We first show that with the special choice of ν = µ and D(x) = 1/2 µ-almost everywhere, the objective in Equation (10) attains the value − log(4). Consider the objective function

Ṽ(D, ν) = ∫_{R^n} log(D(x)) dµ(x) + ∫_{R^n} log(1 − D(x)) dν(x).

Substituting ν = µ and D(x) = 1/2, we have

Ṽ(D, µ) = log(1/2) ∫_{R^n} dµ(x) + log(1/2) ∫_{R^n} dµ(x).

Since µ is a probability distribution, the integral ∫_{R^n} dµ(x) is equal to 1. Therefore, the objective function simplifies to Ṽ(D, µ) = 2 log(1/2) = − log(4). To complete the proof, we need to show that for any other choice of ν, the value max_D Ṽ(D, ν) is not smaller than − log(4). Let τ = (µ + ν)/2; both µ and ν are absolutely continuous with respect to τ and hence admit densities with respect to τ, so the arguments of Proposition 1 and Theorem 1 apply with the integrals taken against dτ. The optimal discriminator is D(x) = dµ/d(µ + ν)(x), and substituting it yields

max_D Ṽ(D, ν) = − log(4) + 2 D_JS(µ∥ν) ≥ − log(4),

with equality if and only if ν = µ, in which case the optimal discriminator is D(x) = 1/2 µ-almost everywhere. Hence, the solution to the min-max problem in Equation (10) is achieved with the special choice ν = µ and D(x) = 1/2 µ-almost everywhere. This completes the proof. ■

Like many min-max problems, we may utilize an alternating optimization algorithm to find an optimal solution to the problem introduced by Eq. (9), alternating between updating the discriminator and the density q. Here, the updating process consists of first updating the discriminator for the current density q and, second, updating the density q with the recently updated D. Notice that updating the density q means updating the generator. This process is repeated until we find an equilibrium point of the optimization.
Proposition 2. If in each step of the training process D is allowed to reach its optimum given q(x), and q(x) is then updated so as to improve the minimization criterion

C(q) = max_D Ṽ(D, q) = ∫_{R^n} log(D_{p,q}(x)) p(x) dx + ∫_{R^n} log(1 − D_{p,q}(x)) q(x) dx,

then the approximating density q converges to the target density p.
Proof By Proposition 1, the optimal discriminator for a fixed q is D_{p,q}(x) = p(x)/(p(x) + q(x)), and substituting it into the criterion gives

C(q) = ∫_{R^n} log( p(x)/(p(x)+q(x)) ) p(x) dx + ∫_{R^n} log( q(x)/(p(x)+q(x)) ) q(x) dx = − log(4) + 2 D_JS(p∥q).

For each fixed D, Ṽ(D, q) is linear (hence convex) in q; therefore C(q), as a pointwise supremum of convex functions of q, is convex in q, and the derivative of Ṽ(D_{p,q}, q) with respect to q at the optimal discriminator is a subderivative of C at q. By Theorem 1, C attains its unique global minimum − log(4) exactly at q = p. Consequently, if in each step the discriminator is trained to its optimum D_{p,q} for the current q, and q is then updated by a sufficiently small descent step on C, the convexity of C guarantees that the updates decrease C toward its minimum, and the approximating density q converges to the target density p. This completes the proof. ■

In each step of the process, we first find the optimal discriminator D*(x) for the current density q(x) and then update the density q(x) given the currently updated discriminator D(x) to improve the criterion. Repeating this process finally leads us to the desired solution. In practice, nevertheless, we rarely optimize the discriminator D fully for a given generator G. Instead, we generally update D for a short while before swapping to update the generator G.
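The alternating scheme above can be illustrated numerically. This sketch assumes a one-parameter family q = N(m, 1) standing in for the generator distribution, plugs in the optimal discriminator at every step, and takes finite-difference gradient steps on the criterion C(q); these are choices made for the illustration, not part of the paper.

```python
import numpy as np

def gauss_pdf(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
p = gauss_pdf(x, 0.0)          # target density
m = 1.5                        # generator parameter: mean of q = N(m, 1)

def criterion(m):
    """C(q) = max_D V~(D, q), with the optimal discriminator plugged in."""
    q = gauss_pdf(x, m)
    d = p / (p + q)
    return np.sum(np.log(d) * p + np.log(1.0 - d) * q) * dx

# Alternate: recompute the optimal D (inside `criterion`) and take a
# finite-difference gradient step on the generator parameter m.
lr, h = 1.0, 1e-4
for _ in range(100):
    grad = (criterion(m + h) - criterion(m - h)) / (2 * h)
    m -= lr * grad
```

The parameter m is driven toward 0, where q = p and the criterion reaches its minimum value − log(4).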
It is worth emphasizing here that the unconstrained min-max problems given by Eqs. (9) and (10) are not the same as the original min-max problem introduced in Eq. (6), or the equivalent Eq. (7), where the probability distribution ν is constrained to ν = γ ∘ G^{-1}. However, it is useful in applications to suppose that Eqs. (6) and (7) exhibit the properties stated in Theorem 2 and Proposition 2. We can suppose the same even after further restricting the discriminator and generator to be neural networks D = D_ω and G = G_θ. Then, set ν_θ := γ ∘ G_θ^{-1}. Under this setting, the min-max problem becomes min_θ max_ω V(D_ω, G_θ), where

V(D_ω, G_θ) = E_{x∼µ}[log(D_ω(x))] + E_{z∼γ}[log(1 − D_ω(G_θ(z)))]    (11)

is the key to executing the fundamental optimization problem. Here, since we do not know the explicit form of µ (the target distribution), we should approximate the expectations through sample averages. Thus, Eq. (11) helps us to find an approximation to V(D_ω, G_θ). More precisely, suppose A is a minibatch of samples drawn from the training/original dataset χ defined above, and suppose B is a minibatch of samples in the space R^d sampled from γ.
Under these assumptions, we can approximate [3]

V(D_ω, G_θ) ≈ (1/|A|) Σ_{x∈A} log(D_ω(x)) + (1/|B|) Σ_{z∈B} log(1 − D_ω(G_θ(z))).

Note that a minibatch in the GAN framework refers to a small subset of training examples fed to the network in each training iteration. The minibatch size is typically chosen to balance the computational efficiency of training and the quality of the GAN. Smaller minibatches can lead to faster training; however, they may result in a noisier gradient estimate and slower convergence. On the other hand, larger minibatches can provide a more accurate gradient estimate; however, they may require more memory and computational resources to process. During training, the generator and discriminator networks are trained simultaneously by optimizing their respective loss functions using backpropagation. The minibatches of real data samples and generated samples are used to compute the discriminator's loss, while the generator's loss is computed using the generated samples only. By using minibatches, the networks can efficiently learn the complex distribution of the data and generate high-quality synthetic samples.
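The minibatch estimator can be written out directly. In the sketch below, G_θ and D_ω are hypothetical closed-form stand-ins for neural networks, chosen only to make the computation concrete; µ and γ are both taken to be N(0, 1) on R.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setting: mu = N(0,1) on R (n = 1) and gamma = N(0,1) on R (d = 1).
# Hypothetical stand-ins for the networks G_theta and D_omega:
G = lambda z: 0.5 * z + 1.0                              # generator
D = lambda x: 1.0 / (1.0 + np.exp(-(-2.0 * x + 1.0)))    # logistic discriminator

A = rng.standard_normal(256)   # minibatch of "real" samples from mu
B = rng.standard_normal(256)   # minibatch of latent samples from gamma

# Minibatch approximation of V(D_omega, G_theta):
v_hat = np.mean(np.log(D(A))) + np.mean(np.log(1.0 - D(G(B))))
```

In an actual training loop this estimate (and its gradients) would be recomputed on fresh minibatches at every iteration.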

f-divergence and f-GAN concepts
Recall our motivating problem for GANs: we have a probability distribution µ, known only through the training samples at hand. We want to find a distribution ν through an iterative process: by beginning with a probability distribution ν and iteratively updating it, we approximate the target distribution µ with ν. To approximate µ, we first need to measure the distance between the distributions µ and ν. The vanilla GAN uses the discriminator to approximate the target distribution µ. However, we can use other measures to quantify the distance between distributions.

f-divergence
We can measure the dissimilarity between any two distributions, in our case the target distribution µ and the approximating distribution ν, with the Kullback-Leibler (KL) divergence. Let p(x) and q(x) be the corresponding probability density functions of µ and ν defined on R^n. Then, the distance between the densities p and q is defined as

D_KL(p∥q) = ∫_{R^n} p(x) log( p(x)/q(x) ) dx.

Here, notice that D_KL(p∥q) is finite only if q(x) ≠ 0 almost everywhere on supp(p). At this stage, we can state the following observations for the KL divergence [4]:
• If p(x) > q(x), then x is a point in the real data with a high probability. This case is the heart of the 'mode dropping' phenomenon. It occurs when we have large regions with high values of p but small values of q. Here, it is important to remark that if p(x) > 0 and q(x) → 0, the integrand of D_KL rises to infinity very quickly. This means that such a cost function assigns an exceptionally high cost to a generator distribution that does not cover some parts of the data.
• If p(x) < q(x), then x has a low chance of being a data point and instead a high chance of being a generated point. This is faced when we observe the generator producing an unrealistic image. If p(x) → 0 and q(x) > 0, the integrand of D_KL tends to 0. This means that such a cost function pays an exceptionally low cost for generating unrealistic samples.
Remark 2. Regarding GANs, D_KL(p∥q) has a unique minimum at p(x) = q(x). Furthermore, it can be estimated without knowing the unknown density p(x) explicitly. However, it is important to notice that D_KL(p∥q) is not symmetric in p(x) and q(x) [3, 4].
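A discrete-case sketch of Remark 2 (illustrative, not from the paper): the KL divergence vanishes at q = p and is asymmetric in its arguments.

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D_KL(p||q); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Two arbitrary discrete densities for the illustration.
p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])

# kl(p, p) is exactly 0, while kl(p, q) and kl(q, p) are positive and differ.
```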
Even though the KL divergence is widely used in applications of GANs, there are other measures of the dissimilarity between distributions. For instance, the Jensen-Shannon (JS) divergence, a divergence measure derived from the KL divergence, is given as

D_JS(p∥q) = (1/2) D_KL(p∥M) + (1/2) D_KL(q∥M),  where  M = (p(x) + q(x))/2.

The most significant benefit of the JS divergence is that it is well-defined for any densities p(x) and q(x) and symmetric in the densities (D_JS(p∥q) = D_JS(q∥p)), while the KL divergence is not symmetric. Following Proposition 1, the minimization part of the min-max problem in the context of the vanilla GAN is exactly the minimization over the density q of D_JS(p∥q) for a given p. As things stand, the D_KL and D_JS divergences are both particular cases of the f-divergence, a more general form of such divergence measures introduced in [5]. Consider a strictly convex function f(x) with a domain I ⊆ R that satisfies f(1) = 0. Additionally, for computational purposes, we adopt the convention f(x) = +∞, ∀x ∉ I. Then, we can introduce the f-divergence concept as in [3].

Definition 1. Consider two probability density functions p(x) and q(x) defined on the space R^n. Then, the f-divergence between these two densities is

D_f(p∥q) = ∫_{R^n} f( p(x)/q(x) ) q(x) dx,

where we adopt the convention f( p(x)/q(x) ) q(x) = 0 if q(x) = 0.
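Definition 1 can be exercised numerically on discrete densities. The sketch below (an illustration with arbitrary p and q) checks the symmetry of the JS divergence and that the choice f(t) = t log t recovers the KL divergence as a particular f-divergence:

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def f_divergence(p, q, f):
    """Definition 1: D_f(p||q) = sum_x f(p(x)/q(x)) q(x), for q > 0."""
    return np.sum(f(p / q) * q)

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])

# f(t) = t log t is strictly convex with f(1) = 0 and recovers D_KL(p||q).
kl_via_f = f_divergence(p, q, lambda t: t * np.log(t))
```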
Remark 3. Since D_f(p∥q) ≠ D_f(q∥p) in general, one can confuse which density appears in the numerator and which in the denominator of the ratio. If we follow the original setting introduced in [5], then the definition of D_f(p∥q) becomes our D_f(q∥p). In this study, we adopt the definition introduced in [7], where the f-GAN concept was first introduced.
Proof. Using the convexity of the function f and Jensen's inequality, we have

D_f(p∥q) = ∫ f( p(x)/q(x) ) q(x) dx ≥ f( ∫ (p(x)/q(x)) q(x) dx ) = f( ∫_{supp(q)} p(x) dx ) = f(r),

where the equality holds if and only if the ratio p(x)/q(x) is constant (q-almost everywhere) or the function f is linear on the range of the ratio p(x)/q(x). The range of p(x)/q(x) depends on the probability distributions p(x) and q(x) being considered. In general, the ratio p(x)/q(x) can take any positive value, zero, or infinity, depending on the values of p(x) and q(x) for a given x. (In the context of importance sampling, it is common to consider the ratio p(x)/q(x) as a weighting function for sampling from the target distribution p(x); in that case, the range of p(x)/q(x) is typically restricted to a finite interval so that the importance weights remain bounded and usable for sampling.) Since f is strictly convex, it cannot be linear on any interval; therefore, we must have p(x) = r q(x) on supp(q) for the equality to hold, where r is the constant value of the ratio. Integrating p(x) = r q(x) over supp(q) gives r = ∫_{supp(q)} p(x) dx ≤ 1. If supp(p) ⊆ supp(q), then r = 1, and hence D_f(p∥q) ≥ f(1) = 0, with equality if and only if p = q. If moreover f(t) > 0 for all t ∈ [0, 1), then for r < 1 we have D_f(p∥q) ≥ f(r) > 0. Therefore, if D_f(p∥q) = 0, the conditions r = 1 and p = q must hold. ■

At this stage, we note that the f-divergence can be defined for arbitrary probability distributions µ and ν on a probability space Ω. Let τ be a third probability distribution satisfying µ, ν ≪ τ; more specifically, both µ and ν are absolutely continuous with respect to τ. For instance, one may take τ = (1/2)(µ + ν). Let p = dµ/dτ and q = dν/dτ be the Radon-Nikodym derivatives of µ and ν, respectively. We characterize the f-divergence of the probability distributions µ and ν as [3]

D_f(µ∥ν) = ∫_Ω f( p(x)/q(x) ) q(x) dτ(x).

Here, once more we adopt the convention f( p(x)/q(x) ) q(x) = 0 if q(x) = 0.
Here, it is clear that this definition is independent of the choice of the probability measure τ. In applications of the f-divergence, the greatest difficulty is that the target distribution µ has no known explicit expression. Hence, in the vanilla GAN setting, to calculate the f-divergence D_f(p∥q), we should express the divergence in terms of averages over samples. In [6], this problem is solved with the help of the convex conjugate of the convex function at hand.

Definition 2. Suppose f(•) is a convex function on an interval I ⊆ R. The convex conjugate of f, a generalization of the celebrated Legendre transform, is the function f* : R → R ∪ {±∞} given as [3]

f*(y) = sup_{t∈I} { ty − f(t) }.
We can introduce the following remark as an immediate result of the definition.
Proof. Define g(t) = ty − f(t). Then g′(t) = y − f′(t) on I ⊆ R, which is strictly decreasing since f(t) is strictly convex; hence g(t) is strictly concave on I. Note that if y = f′(t*) for some t* ∈ I°, then t* is a critical point of g, and by strict concavity t* must be the global maximum of g. Therefore, g(t) attains its maximum at t = t* = f′⁻¹(y). Now suppose y is not in the range of f′; in that case, g′(t) > 0 or g′(t) < 0 on I°. Suppose g′(t) > 0 for all t ∈ I°. Then g(t) is monotonically increasing, so the supremum of g(t) is approached as t → b⁻. The second case, g′(t) < 0 for all t ∈ I°, is handled analogously with t → a⁺. ■

Based on Lemma 1, we can give the following remark:

Remark 5. Note that +∞ is a potential value of f*. Hence, the domain of f* (Dom(f*)) is characterized as the set where f* is finite.
As a result of Lemma 1, under the assumption that f is continuously differentiable, sup_{t∈I} { ty − f(t) } is attained at some t ∈ I if and only if y is in the range of f′. This is clear when y ∈ f′(I°), and it can be argued relatively easily for finite boundary points of the domain I. More generally, without the differentiability assumption, sup_{t∈I} { ty − f(t) } is attained if and only if y ∈ ∂f(t) for some t ∈ I, where ∂f(t) is the set of subderivatives. We summarize some important properties of the convex conjugate in the following proposition [3]:

Proposition 4. Let f(x) be a convex function defined on R with range R ∪ {±∞}. Then its convex conjugate f* is a convex and lower semicontinuous function. Moreover, if f is lower semicontinuous, then f satisfies the Fenchel duality f = (f*)*.
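The conjugate can be checked numerically for a concrete f. The sketch below, assuming NumPy and with helper names of our choosing, takes f(t) = t log t on (0, ∞), whose conjugate has the closed form f*(y) = e^{y−1} (attained at t* = f′⁻¹(y) = e^{y−1}, consistent with Lemma 1), and compares it against a brute-force supremum over a grid of t values.

```python
import numpy as np

def f(t):
    # f(t) = t log t, strictly convex on (0, inf), f(1) = 0.
    return t * np.log(t)

def conjugate_numeric(y, ts):
    # Brute-force evaluation of sup_t { t*y - f(t) } over the grid ts.
    return np.max(ts * y - f(ts))

# Dense grid of t values approximating the domain (0, inf).
ts = np.linspace(1e-6, 50.0, 2_000_000)

for y in [-1.0, 0.0, 1.0, 2.0]:
    approx = conjugate_numeric(y, ts)
    exact = np.exp(y - 1.0)   # closed-form conjugate f*(y) = e^{y-1}
    print(y, approx, exact)   # grid supremum matches the closed form
```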

Calculation of f-divergence using the convex dual
To calculate the f-divergence from samples, [6] proposes using the convex dual of the function f. Let µ and ν be two probability measures satisfying µ, ν ≪ τ for some probability measure τ, with p = dµ/dτ and q = dν/dτ. In the best scenario of µ ≪ ν, using f(x) = (f*)*(x) we obtain

D_f(µ∥ν) = ∫ sup_t { t (p(x)/q(x)) − f*(t) } q(x) dτ(x) ≥ E_{x∼µ}[T(x)] − E_{x∼ν}[f*(T(x))],   (13)

where T(•) denotes any Borel function. Therefore, taking the supremum over all Borel functions T, one obtains

D_f(µ∥ν) ≥ sup_T { E_{x∼µ}[T(x)] − E_{x∼ν}[f*(T(x))] }.   (14)

In addition, for each x, sup_t { t p(x)/q(x) − f*(t) } is attained at some t = T*(x) if p(x)/q(x) is in the range of the subderivatives of f* [6]. Hence, if this holds for all x, we obtain equality in Eq. (14). In general, such equality holds under some mild conditions.

Theorem 3. Let f(•) be a strictly convex and continuously differentiable function on the domain I ⊆ R, and let µ and ν be two Borel probability distributions on R^n satisfying µ ≪ ν. Then we have [6]

D_f(µ∥ν) = sup_T { E_{x∼µ}[T(x)] − E_{x∼ν}[f*(T(x))] },   (15)

where sup_T is taken over all Borel functions T : R^n → Dom(f*). In addition, if the density ratio p = dµ/dν satisfies p(x) ∈ I for all x, then T*(x) := f′(p(x)) is an optimizer of Eq. (15).
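Theorem 3 can be illustrated with a Monte Carlo estimate. The sketch below, assuming NumPy and with helper names of our choosing, takes f(t) = t log t (so that D_f is the KL-divergence and f*(y) = e^{y−1}), µ = N(0, 1), and ν = N(1, 1). Plugging the optimal critic T*(x) = f′(p(x)/q(x)) = log(p(x)/q(x)) + 1 into the variational objective and replacing the expectations by sample averages recovers the closed-form value D_KL(µ∥ν) = 1/2.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_ratio(x):
    # log p(x) - log q(x) for p = N(0,1), q = N(1,1); closed form: 1/2 - x.
    return 0.5 - x

def T_star(x):
    # Optimal critic T*(x) = f'(p(x)/q(x)) = log(p(x)/q(x)) + 1 for f(t) = t log t.
    return log_ratio(x) + 1.0

def f_conj(y):
    # Convex conjugate of f(t) = t log t: f*(y) = e^{y-1}.
    return np.exp(y - 1.0)

x_mu = rng.normal(0.0, 1.0, 200_000)  # samples from mu = N(0,1)
x_nu = rng.normal(1.0, 1.0, 200_000)  # samples from nu = N(1,1)

# Sample-based variational objective E_mu[T*] - E_nu[f*(T*)].
estimate = T_star(x_mu).mean() - f_conj(T_star(x_nu)).mean()
print(estimate)  # close to D_KL(N(0,1) || N(1,1)) = 0.5
```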

Proof
We have obtained the upper bound for the problem in Eq. (14); showing the lower bound will finish the proof. Since µ ≪ ν, we may take τ = ν, so that q(x) = 1 and p(x) = dµ(x)/dν(x). Let us analyze Eq. (13) in detail by considering sup_t { tp(x) − f*(t) } for each x. Write g_x(t) = tp(x) − f*(t), S = Dom(f*), and suppose S° = (a, b) with a, b ∈ R ∪ {±∞}. Then we can introduce a sequence T_k(x) as follows. If the density ratio p(x) is in the range of f*′, say p(x) = f*′(t_x), we set T_k(x) = t_x ∈ S. If p(x) − f*′(t) > 0 for all t, then g_x(t) is strictly increasing, so the supremum of g_x(t) is approached at the upper boundary point b, and we set T_k(x) = b_k ∈ S with b_k → b⁻. Similarly, if p(x) − f*′(t) < 0 for all t, then g_x(t) is strictly decreasing, the supremum is approached at the lower boundary point a, and we set T_k(x) = a_k ∈ S with a_k → a⁺. By Lemma 1 and its proof, we know that

lim_{k→∞} g_x(T_k(x)) = sup_{t∈S} g_x(t) for every x.

Thus,

sup_T { E_{x∼µ}[T(x)] − E_{x∼ν}[f*(T(x))] } ≥ lim_{k→∞} { E_{x∼µ}[T_k(x)] − E_{x∼ν}[f*(T_k(x))] } = D_f(µ∥ν).

To prove the last claim, suppose p(x) ∈ I. Then, again by Lemma 1, defining s(t) = f′⁻¹(t) for t in the range of f′, we can write f*′(t) = f′⁻¹(t). Hence g′_x(t) = p(x) − f*′(t) = p(x) − f′⁻¹(t), and g_x(t) attains its maximum at t = f′(p(x)). This proves that T* = f′(p(x)) is an optimizer of Eq. (15). ■

Note that Theorem 3 holds only for µ ≪ ν. However, one may give the following theorem for the other cases.

Theorem 4. Let f(t) be a convex function such that the domain of f* includes (a, ∞) for some a ∈ R. Let µ and ν be two Borel probability measures on R^n satisfying µ ̸≪ ν. Then,

sup_T { E_{x∼µ}[T(x)] − E_{x∼ν}[f*(T(x))] } = +∞,

where sup_T is taken over all Borel functions T : R^n → Dom(f*).
Proof. Consider a new distribution τ = (1/2)(µ + ν). Then both measures satisfy µ, ν ≪ τ. Moreover, let p = dµ/dτ and q = dν/dτ be the Radon-Nikodym derivatives of µ and ν. Since µ ̸≪ ν, we can find a set S_0 with µ(S_0) > 0 on which q(x) = 0. Now fix a point t_0 in the domain of f*, and define T_k(x) = k for x ∈ S_0 and T_k(x) = t_0 otherwise. Then, using ν(S_0) = 0,

E_{x∼µ}[T_k(x)] − E_{x∼ν}[f*(T_k(x))] = k µ(S_0) + t_0 µ(S_0^c) − f*(t_0) ν(S_0^c) → +∞ as k → ∞.

This leads us to the desired result. ■

At this stage, notice that the domain of f* in Theorem 4 is unbounded from above, and Eq. (15) is not satisfied unless we have µ ≪ ν. In many studies, we face a singular target distribution µ, as the training data at hand may lie on a lower-dimensional manifold. Hence, we can introduce the following theorem.
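The mechanism of the proof of Theorem 4 can be reproduced on a small discrete example. The sketch below, assuming NumPy and with names of our choosing, takes f(t) = t log t (so f*(y) = e^{y−1}, with domain all of R), a µ that puts mass on a point where ν puts none, and the critic sequence T_k from the proof; the variational objective then grows without bound in k.

```python
import numpy as np

# mu puts mass 0.5 on the second point, where nu puts none (mu is not
# absolutely continuous with respect to nu), so the objective is unbounded.
p = np.array([0.5, 0.5, 0.0])  # mu
q = np.array([0.5, 0.0, 0.5])  # nu: zero exactly where the set S_0 lives
t0 = 0.0                        # fixed critic value outside S_0

def objective(k):
    # Critic T_k from the proof: equal to k on S_0, t0 elsewhere.
    T = np.array([t0, float(k), t0])
    f_conj = np.exp(T - 1.0)    # f*(y) = e^{y-1} for f(t) = t log t
    # E_mu[T_k] - E_nu[f*(T_k)]; the e^{k-1} term gets zero weight under nu.
    return float(np.sum(p * T) - np.sum(q * f_conj))

for k in [1, 10, 100]:
    print(k, objective(k))      # grows without bound as k increases
```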
Theorem 5. Consider a lower semicontinuous convex function f(•) such that the domain I* of f* satisfies sup I* = b* < +∞. Let µ and ν be two Borel probability measures on R^n, and write µ = µ_s + µ_ab, where µ_s ⊥ ν and µ_ab ≪ ν. Then [3],

sup_T { E_{x∼µ}[T(x)] − E_{x∼ν}[f*(T(x))] } = D_f(µ_ab∥ν) + b* µ_s(R^n),

where sup_T is carried over all Borel functions T : R^n → Dom(f*).

Variational divergence minimization (VDM) with f-GANs
It is possible to generalize the standard vanilla GAN with the help of f-divergence measures. For a given probability distribution µ, the f-GAN aims to minimize the distance D_f(µ∥ν) with respect to the probability distribution ν. Formulated in the sample space, the f-GAN solves the min-max problem

min_ν sup_T { E_{x∼µ}[T(x)] − E_{x∼ν}[f*(T(x))] }.   (16)

The f-GAN framework first came on stage in [7], and the optimization problem in Eq. (16) leads us to variational divergence minimization (VDM). Note that VDM looks identical to the min-max problem given for the vanilla GAN. Here, the Borel function T is called a critic function, or shortly a critic. Under the assumption µ ≪ ν, by Theorem 3 the problem in Eq. (16) is equal to min_ν D_f(µ∥ν). As mentioned earlier, one possible problem for the f-GAN is facing µ ̸≪ ν as in Theorem 4; then Eq. (16) is generally not equal to min_ν D_f(µ∥ν).
Fortunately, for some particular choices of f, such a case is no longer a problem.
Theorem 6. Suppose f(t) is lower semicontinuous and strictly convex, and the domain I* of the convex conjugate f* satisfies sup I* = b* ∈ [0, ∞). Additionally, suppose that f is continuously differentiable on its domain and satisfies f(t) > 0 for all t ∈ (0, 1), and let µ be a Borel probability measure on R^n. Under these assumptions, ν = µ is the unique optimizer of [7]

inf_ν sup_T { E_{x∼µ}[T(x)] − E_{x∼ν}[f*(T(x))] }.

By Proposition 3, the divergence D_f(µ∥ν) vanishes if and only if ν = µ. Consequently, ν = µ becomes our unique optimizer for GANs. ■

Some remarks on special solutions

Remark 6. Suppose the density functions satisfy p(x) = q(x). Then the optimal discriminator is D*(x) = 1/2, and for this special case the loss function becomes

L(G, D*) = ∫_{R^n} p(x) log(1/2) dx + ∫_{R^n} q(x) log(1/2) dx = −2 log 2.

Furthermore, if we expand the JS divergence, we have

D_JS(p∥q) = 1/2 ∫_{R^n} p(x) log( p(x)/(p(x) + q(x)) ) dx + 1/2 ∫_{R^n} q(x) log( q(x)/(p(x) + q(x)) ) dx + log 2,

so that L(G, D*) = 2 D_JS(p∥q) − 2 log 2. As an immediate result, we can also give the following remark.
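The standard vanilla-GAN identity L(G, D*) = 2 D_JS(p∥q) − 2 log 2 can be verified numerically in the discrete case. Below is a minimal sketch, assuming NumPy; the helper names loss_at_optimal_D and js are ours. It also confirms that at p = q the loss attains its minimum value of −2 log 2.

```python
import numpy as np

def loss_at_optimal_D(p, q):
    # L(G, D*) = E_p[log D*] + E_q[log(1 - D*)] with D*(x) = p(x)/(p(x)+q(x)).
    d = p / (p + q)
    return float(np.sum(p * np.log(d)) + np.sum(q * np.log(1.0 - d)))

def js(p, q):
    # D_JS(p || q) with midpoint M = (p + q)/2.
    m = 0.5 * (p + q)
    return float(0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m)))

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.2, 0.3, 0.5])
print(loss_at_optimal_D(p, q))
print(2.0 * js(p, q) - 2.0 * np.log(2.0))  # same value as the loss above

# At p = q: D* = 1/2 everywhere and the loss attains its minimum -2 log 2.
print(loss_at_optimal_D(p, p), -2.0 * np.log(2.0))
```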
Remark 7. Under the assumptions given in the preceding remark, the following hold [3]:
• Fundamentally, the objective of the GAN loss function is to quantify the similarity between the generated data distribution ν and the real sample distribution µ via D_JS under the optimal discriminator D*. The best generator G* imitates the distribution of the real data, which leads to the minimum L(G*, D*) = −2 log 2.
• If we train the discriminator D until convergence, its error approaches 0. This indicates that the D_JS between the distributions has reached its maximum (it is easy to see that 0 ≤ D_JS(µ∥ν) ≤ log 2).
This maximum can be reached only if the distributions are not mutually absolutely continuous or have disjoint supports. One potential reason for this lack of absolute continuity is that the supports lie on low-dimensional manifolds. For many datasets, there is substantial empirical and theoretical evidence that the generated data distribution ν is concentrated on a low-dimensional manifold.
• If both µ and ν rest on low-dimensional manifolds, they are almost surely disjoint. If the distributions have disjoint supports, we can always find a perfect discriminator that separates real and fake samples with 100% accuracy.

Concluding remarks
In this study, we explored the mathematical background of GANs to provide a deep understanding of them for further extensions; in other words, we took a detailed tour of the mathematics behind GANs. After the celebrated work of Goodfellow et al. [1], new adversarial training objectives and techniques for generative modeling have been developed, such as Wasserstein GANs [8,9]. Furthermore, GANs have been widely applied to new fields of research, including mathematical finance [10-12], time series generation [13,14], audio synthesis [15], and fraud detection in financial datasets [16]. The underlying mathematics of these models differs from what we have discussed above, but this study is a good starting point nonetheless.

Figure 1.A visualization of the discriminator and generator networks as a counterfeiter and Van Gogh's painting expert

Remark 4. The convex conjugate of a convex function is also called the Fenchel transform or Fenchel-Legendre transform. As mentioned above, we may extend the convex conjugate f* to R by defining f(x) = +∞ for all x ∉ I. A more precise description of f* is given in the following lemma.

Lemma 1. Let f(x) be a strictly convex and continuously differentiable function on I ⊆ R, where I° = (a, b) with a, b ∈ [−∞, +∞]. Then [3],

f*(y) = y f′⁻¹(y) − f( f′⁻¹(y) ) if y ∈ f′(I°), f*(y) = lim_{t→b⁻} ( ty − f(t) ) if y ≥ lim_{t→b⁻} f′(t), and f*(y) = lim_{t→a⁺} ( ty − f(t) ) if y ≤ lim_{t→a⁺} f′(t).