Image-to-Image Translation with CNN Based Perceptual Similarity Metrics

Abstract — Image-to-image translation is the process of transforming images between different domains. Generative Adversarial Networks (GANs) and Convolutional Neural Networks (CNNs) are widely used in image translation. This study aims to find the most effective loss function for GAN architectures and thereby synthesize better images. To this end, experimental results were obtained by changing the loss functions of the Pix2Pix method, one of the basic GAN architectures. The existing loss function used in the Pix2Pix method is the Mean Absolute Error (MAE), also called the ℒ1 metric. In this study, the effect of the convolution-based perceptual similarity metrics CONTENT, LPIPS, and DISTS on image-to-image translation was examined by applying them in the loss function of the Pix2Pix architecture. In addition, the effects on image-to-image translation were analyzed when these perceptual similarity metrics were blended with the original ℒ1 loss at a rate of 50% (ℒ1_CONTENT, ℒ1_LPIPS, and ℒ1_DISTS). Performance analyses of the methods were carried out on the Cityscapes, Denim2Mustache, Maps, and Papsmear datasets. Visual results were analyzed with conventional (FSIM, HaarPSI, MS-SSIM, PSNR, SSIM, VIFp, and VSI) and up-to-date (FID and KID) image comparison metrics. As a result, it has been observed that better results are obtained when convolution-based methods are used instead of conventional methods in the loss function of GAN architectures, and that the LPIPS and DISTS methods can be used in the loss functions of GAN architectures in the future.


Introduction
Deep learning-based studies have been advancing rapidly in recent years. One of the evolving methods in this field is image synthesis. Image synthesis involves the process of editing, manipulating, or translating an image, or generating an image from a signal. Convolutional Neural Networks (CNNs) (Zhu et al., 2017) and Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are extensively utilized in research within this domain. CNN-based studies surpass conventional methods in feature extraction processes such as image synthesis and pattern recognition (Karpathy et al., 2014; Koushik et al., 2016; Guo et al., 2016). One of the approaches to image synthesis using CNNs is PixelRNN, proposed by Oord et al. (2016). PixelRNN is a deep neural network designed to predict output images from input images in the spatial domain, encoding the full set of dependencies in the pattern by modeling the discrete probability of raw pixel values (Oord et al., 2016). The dataset used for experimentation is ImageNet, and the results demonstrate the consistency of the method. Oord et al. (2016) also introduced another generative model, PixelCNN. PixelCNN employs autoregressive links during image synthesis and is faster than PixelRNN in the training phase. The proposed method proves advantageous as dataset patterns increase in complexity during image synthesis. Salimans et al. (2017) proposed the PixelCNN++ method. PixelCNN++ is a generative model similar to PixelCNN but simplifies its structure by conditioning on whole pixels instead of R/G/B sub-pixels. It also incorporates Dropout regularization to regularize the model. Chen et al. (2017) presented an image synthesis approach called SCA-CNN. SCA-CNN combines spatial information and image color-channel information, outperforming current attention-based image synthesis methods.
Along with the high performance of CNN-based studies, Generative Adversarial Networks (GANs) also exhibit superior performance in image synthesis (Liu et al., 2017; Liu et al., 2016; Kingma et al., 2013; Wang et al., 2018). GANs, a method developed based on deep learning, were proposed by Goodfellow et al. (2014). The method comprises two neural networks operating in contention. Numerous studies have been conducted on image synthesis with GAN architectures. For instance, Liu et al. (2017) introduced the UNIT method, which performs unsupervised image-to-image translation based on GANs. UNIT combines the CoGAN (Liu et al., 2016) and VAE (Kingma et al., 2013) methods to achieve unsupervised image-to-image translation. The UNIT method involves six networks: two encoders, two generators, and two discriminators. Input and output images must have similar areas for optimal performance in this model. Wang et al. (2018) proposed the Pix2PixHD architecture to address the high-resolution synthesis problem. This architecture is based on one generator network and three scaled discriminators, and its output images have dimensions of 2048×1024. Liu et al. (2020) proposed a model that synthesizes representation content by separating it from domain attributes. Named GMM-UNIT, this model uses a Gaussian mixture model (GMM) for the latent attribute space. GMM-UNIT has two main advantages: it allows translation across multiple domains and enables interpolation between domains and extrapolation to unseen ones. Royer et al. (2020) introduced the XGAN model based on a semantic consistency loss. This model is a dual adversarial autoencoder that captures the shared feature representation of both domains to learn feature-level rather than pixel-level information. It utilizes the semantic consistency loss in both domains to preserve the image's semantic content across domains. In 2018, Frid-Adar et al. created synthetic medical images using GANs. It was observed that the produced images can be used in data augmentation and medical image classification, thereby improving CNN performance. Today, various GAN architectures are employed in multiple areas, as highlighted by Isola et al. (2017) and Zhu et al. (2017).
One of the major challenges in image-to-image translation lies in the insufficient evaluation of the similarity between the original and the generated image. As a result, multiple studies have been conducted to address the issue of assessing image quality, which falls under the domain of Image Quality Assessment (IQA). IQA aims to measure various aspects of image quality, including structure, texture, diversity, and signal strength. The LPIPS (Zhang et al., 2018), DISTS (Ding et al., 2020), and CONTENT (Gatys et al., 2015) methods can be employed both within IQA frameworks and as loss functions during image synthesis. Given that these three methods are based on convolutional neural networks, they are examined within a CNN context in this study. For the comparison of image synthesis outputs, traditional methods such as FSIM (Zhang et al., 2011), HaarPSI (Reisenhofer et al., 2018), PSNR (Fardo et al., 2016), MS-SSIM (Wang et al., 2003), SSIM (Wang et al., 2004), VIFp (Sheikh and Bovik, 2006), and VSI (Zhang et al., 2014), as well as the modern FID (Heusel et al., 2017) and KID (Bińkowski et al., 2018) IQA methods, were utilized. Numerous studies in the literature explore IQA methods and leverage them in image analysis (Heusel et al., 2017; Bińkowski et al., 2018; Choi et al., 2020; Ding et al., 2021; Sim et al., 2020; Borasinski et al., 2022; Peng et al., 2022).
This study examined the impact of altering the loss function in a fundamental GAN architecture on image synthesis. The architecture is based on the Pix2Pix method (Isola et al., 2017), which utilizes supervised learning techniques introduced by Isola et al. in 2017. In supervised synthesis, the loss is calculated as the distance between the estimated output (y) generated from the input and the real image (x). This loss value is then used to update both the generator (G) and discriminator (D) networks. The original Pix2Pix method employs the Mean Absolute Error (MAE) loss function. In this study, the influence of using the LPIPS, DISTS, and CONTENT methods as loss functions in GANs was investigated. These methods are CNN-based and can serve as measures of similarity between images; additionally, they are founded on the VGG network (Simonyan and Zisserman, 2014). The recently proposed LPIPS method encourages the reverse mapping to be learned while emphasizing perceptual similarity between the fake images reconstructed by the generator network and the real images. It also gauges average feature distances between synthesized samples: a higher LPIPS score indicates greater diversity among rendered images. This method assesses structure and texture similarity akin to SSIM. Suzuki et al. (2021) observed in their study that utilizing this method yielded results consistent with visual outputs. Chuan et al. (2018) investigated a hybrid content similarity metric, using the CONTENT method as an example; the study analyzed four different datasets with various visual characteristics.
The goal is to identify a consistent and highly accurate loss function. In the analysis of results, both traditional and contemporary image comparison metrics were employed. The image synthesis outcomes of the methods are presented in both visual and tabular formats based on these metrics. The primary contributions of favoring modern CNN-based methods over conventional loss functions can be summarized as follows:
• CNN-based perceptual metrics can be utilized as a general loss function in GAN methods.
• They have been observed to positively influence the results of image synthesis.
• They lead to better synthesis of textural structures in the image.
The study's most significant contribution to the literature is the determination of the loss function and similarity metric that maximize the accuracy of the image synthesis process performed with the GAN approach. It is anticipated that their ease of use in place of common loss functions will guide future research. The remainder of the manuscript is organized as follows: materials and methods are detailed in Section 2, experimental results are presented in Section 3, and the conclusion is given in Section 4.

Pix2Pix
One of the extensively employed GAN architectures is Pix2Pix, which is based on DCGAN (Radford et al., 2015). This method comprises a generator and a discriminator network. The generator utilizes the U-Net architecture (Ronneberger et al., 2015), while the discriminator employs PatchGAN (Li and Wand, 2016). The adversarial loss coupling the generator (G) and discriminator (D) in the Pix2Pix method is given in Eq. (1), and the resulting objective in Eq. (2):

ℒ_cGAN(G, D) = 𝔼_{x,y}[log D(x, y)] + 𝔼_{x,z}[log(1 − D(x, G(x, z)))]   (1)

G* = arg min_G max_D ℒ_cGAN(G, D)   (2)

It has been observed that the blur level is high in the images synthesized with Pix2Pix. Isola et al. (2017) therefore added the ℒ1 regularization term in Eq. (3) to the loss of the generator architecture to remove some of this fuzziness:

ℒ_ℒ1(G) = 𝔼_{x,y,z}[‖y − G(x, z)‖₁]   (3)

The loss function of the updated Pix2Pix is then as in Eq. (4):

G* = arg min_G max_D ℒ_cGAN(G, D) + λ ℒ_ℒ1(G)   (4)
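As a rough illustration of the generator-side arithmetic in Eq. (4), the following NumPy sketch combines a non-saturating adversarial term with the λ-weighted ℒ1 term. The tensor shapes, the discriminator output `d_fake`, and λ = 100 are placeholder assumptions for demonstration, not the original implementation.

```python
import numpy as np

def pix2pix_generator_loss(d_fake, y_true, y_fake, lam=100.0):
    """Generator loss: adversarial term + lambda * L1 reconstruction term.

    d_fake : discriminator probabilities for generated patches (PatchGAN-style grid)
    y_true : ground-truth image, y_fake : generated image
    lam    : weight of the L1 term (100 in the original Pix2Pix paper)
    """
    eps = 1e-12
    adv = -np.mean(np.log(d_fake + eps))   # non-saturating adversarial term
    l1 = np.mean(np.abs(y_true - y_fake))  # L1 (MAE) regularization term
    return adv + lam * l1

# Toy usage on random placeholder tensors
rng = np.random.default_rng(0)
y_true = rng.random((3, 64, 64))
y_fake = rng.random((3, 64, 64))
d_fake = rng.uniform(0.1, 0.9, size=(1, 8, 8))
loss = pix2pix_generator_loss(d_fake, y_true, y_fake)
```

When the generator reproduces the target exactly and fools the discriminator, the loss collapses toward zero, which is the behavior the ℒ1 term is meant to enforce.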

CNN-Based Loss Function
There are three commonly used measures in the literature that employ trained CNN architectures to generate similarity metrics between pairs of images: CONTENT, LPIPS, and DISTS. While statistical methods focus on pixel values, convolutional methods concentrate on image content (Zhang et al., 2018; Ding et al., 2020; Gatys et al., 2015; Zhang et al., 2011). Therefore, in this study, the performance of CNN-based architectures in image synthesis was analyzed using these loss functions. These three CNN-based methods share similarities and utilize trained VGG networks. Although they have demonstrated significant success as image benchmarks, their high computational cost and lack of interpretability may hinder their practical applicability (Ding et al., 2021). In this study, the VGG19 network was employed to calculate the loss for CONTENT, LPIPS, and DISTS. The VGG19 network consists of 16 convolutional layers and 5 pooling layers (Simonyan and Zisserman, 2014), organized into five blocks, each concluding with a pooling layer. Figure 1 illustrates which blocks of the VGG19 network provide the outputs used by the CNN-based loss functions. Additionally, for each method, the specific VGG19 weights trained on the datasets from its original study were employed.

CONTENT
The concept of CONTENT loss originated from the Neural Style Transfer study of Gatys et al. (2015). The value of this cost function is computed between the block outputs obtained by feeding both the content and target (synthesized) images into the VGG network. The CONTENT cost aims to preserve the essential characteristics of the content image, as proposed by Gatys et al. (2015). This loss function is illustrated on the VGG19 network in Figure 1. As shown in the figure, the similarity value is calculated between the outputs of the fifth block of the network, denoted F5. After the block outputs pass through the Rectified Linear Unit (ReLU) activation function, their difference is measured with the Mean Squared Error (MSE) method (Mihelich et al., 2020). The main formula of this metric is shown in Eq. (5), where x is the content image, y is the target image, and n is the list of block outputs:

ℒ_CONTENT(x, y) = Σ_{i∈n} MSE(ReLU(Fᵢ(x)), ReLU(Fᵢ(y)))   (5)
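The distance described above can be sketched as follows; this is a minimal NumPy mock-up in which random arrays stand in for the real VGG19 block activations, so only the ReLU-then-MSE arithmetic is illustrated.

```python
import numpy as np

def relu(a):
    """Rectified Linear Unit applied elementwise."""
    return np.maximum(a, 0.0)

def content_loss(feats_x, feats_y):
    """CONTENT loss: MSE between ReLU-activated block outputs.

    feats_x / feats_y : lists of feature maps for the content and target
    images (here random arrays replace the real VGG19 activations).
    """
    return sum(np.mean((relu(fx) - relu(fy)) ** 2)
               for fx, fy in zip(feats_x, feats_y)) / len(feats_x)

# Toy usage with a single stand-in block output (e.g. "F5")
rng = np.random.default_rng(1)
feats_x = [rng.standard_normal((256, 16, 16))]
feats_y = [rng.standard_normal((256, 16, 16))]
loss = content_loss(feats_x, feats_y)
```

Identical feature maps yield a loss of exactly zero, which is the fixed point the generator is pushed toward during training.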

LPIPS
LPIPS (Learned Perceptual Image Patch Similarity) is one of the most recent metrics used to measure perceptual similarity between pairs of images (Zhang et al., 2018). This metric employs deep attributes that mimic human perception (Ding et al., 2021). Figure 1 illustrates the activation outputs on which the LPIPS function is computed using the VGG19 network. Accordingly, the outputs of same-level layers (F1-4, F6) for the x and y images in the VGG19 network are normalized and subtracted from each other. The resulting data is scaled, and the outputs are transformed into a single vector form (Zhang et al., 2018). The loss is obtained by calculating the norm of this vector using the Mean Squared Error (MSE) method.
The numerical representation of this method is given in Eq. (6), where the coefficient wᵢ expresses the perceptual importance of each layer and F̂ᵢ denotes the channel-normalized output of layer i with spatial size Hᵢ×Wᵢ:

ℒ_LPIPS(x, y) = Σᵢ (1 / (HᵢWᵢ)) Σ_{h,w} ‖wᵢ ⊙ (F̂ᵢ(x)_{h,w} − F̂ᵢ(y)_{h,w})‖₂²   (6)
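The per-layer computation behind Eq. (6) can be sketched in NumPy as below. Random arrays stand in for the VGG19 activations, and scalar layer weights replace the learned per-channel weight vectors, so this is an illustrative simplification rather than the reference LPIPS implementation.

```python
import numpy as np

def unit_normalize(f, eps=1e-10):
    """Scale each spatial feature vector to unit length along the channel axis."""
    return f / (np.sqrt(np.sum(f ** 2, axis=0, keepdims=True)) + eps)

def lpips_distance(feats_x, feats_y, weights):
    """LPIPS-style distance: channel-normalize same-level layer outputs,
    subtract them, scale by a layer weight w_i, and average the squared
    norm of the difference over spatial positions."""
    total = 0.0
    for fx, fy, w in zip(feats_x, feats_y, weights):
        diff = unit_normalize(fx) - unit_normalize(fy)   # shape (C, H, W)
        total += w * np.mean(np.sum(diff ** 2, axis=0))  # mean over H, W
    return total

# Toy usage with two stand-in layers of different sizes
rng = np.random.default_rng(2)
feats_x = [rng.standard_normal((64, 8, 8)), rng.standard_normal((128, 4, 4))]
feats_y = [rng.standard_normal((64, 8, 8)), rng.standard_normal((128, 4, 4))]
d = lpips_distance(feats_x, feats_y, weights=[1.0, 1.0])
```

Because the features are normalized before subtraction, the distance reacts to the direction of the feature vectors (perceptual content) rather than their raw magnitude.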

DISTS
The DISTS (Deep Image Structure and Texture Similarity) function uses the VGG network (Ding et al., 2020). This method uses l2 pooling instead of the usual max pooling in the VGG network. The DISTS function consists of a combination of structure and texture similarity. It resists slight geometric distortions and performs well on textural images (Ding et al., 2020; Ding et al., 2021). The DISTS distance is computed between the x and y input images over the block outputs (F0-4, F6) shown in Figure 1. The main formula of this metric is shown in Eq. (7), where the structural s(·) and textural t(·) similarity functions are combined with the coefficients αᵢ and βᵢ, which are pretrained values from the original DISTS study:

ℒ_DISTS(x, y) = 1 − Σᵢ (αᵢ t(Fᵢ(x), Fᵢ(y)) + βᵢ s(Fᵢ(x), Fᵢ(y)))   (7)
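The structure and texture terms combined in Eq. (7) can be sketched as follows. This NumPy mock-up uses global statistics of stand-in feature maps and placeholder coefficients, whereas the original DISTS uses pretrained per-channel coefficients on real VGG activations.

```python
import numpy as np

def dists_terms(fx, fy, c1=1e-6, c2=1e-6):
    """Texture (t) and structure (s) similarity between two feature maps.
    fx, fy stand in for VGG block outputs; c1, c2 stabilize the ratios."""
    mu_x, mu_y = fx.mean(), fy.mean()
    var_x, var_y = fx.var(), fy.var()
    cov_xy = ((fx - mu_x) * (fy - mu_y)).mean()
    t = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)  # texture term
    s = (2 * cov_xy + c2) / (var_x + var_y + c2)               # structure term
    return t, s

def dists_distance(feats_x, feats_y, alphas, betas):
    """DISTS-style distance: 1 minus the weighted sum of texture and
    structure similarities over all blocks. alphas/betas play the role
    of the pretrained coefficients from the original DISTS study."""
    sim = 0.0
    for fx, fy, a, b in zip(feats_x, feats_y, alphas, betas):
        t, s = dists_terms(fx, fy)
        sim += a * t + b * s
    return 1.0 - sim

# Toy usage with one stand-in block and coefficients summing to 1
rng = np.random.default_rng(3)
fx = rng.random((32, 8, 8))
fy = rng.random((32, 8, 8))
dist = dists_distance([fx], [fy], alphas=[0.5], betas=[0.5])
```

For identical inputs both similarity terms equal 1, so the distance is 0; mismatched means penalize the texture term and decorrelated features penalize the structure term.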

Use of Metrics with ℒ1
While calculating the smoothing term, the effects on image-to-image translation were analyzed by blending the perceptual similarity metrics with the original ℒ1 loss at a rate of 50% (ℒ1_CONTENT, ℒ1_LPIPS, and ℒ1_DISTS). The generator loss of Pix2Pix is updated accordingly, with the regularization term becoming 0.5·ℒ1(G) + 0.5·ℒ_perc(G), where ℒ_perc is one of the CONTENT, LPIPS, or DISTS losses.
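The 50% blending described above amounts to a simple convex combination of the two terms; the sketch below shows this with NumPy, where a generic callable stands in for whichever perceptual metric (CONTENT, LPIPS, or DISTS) is being mixed in.

```python
import numpy as np

def l1_loss(y_true, y_fake):
    """Original Pix2Pix smoothing term: Mean Absolute Error (MAE)."""
    return np.mean(np.abs(y_true - y_fake))

def blended_loss(y_true, y_fake, perceptual_fn, alpha=0.5):
    """50/50 blend of the L1 term with a perceptual metric, mirroring the
    L1_CONTENT / L1_LPIPS / L1_DISTS variants. perceptual_fn is any
    callable returning a scalar distance (a stand-in here)."""
    return (alpha * l1_loss(y_true, y_fake)
            + (1.0 - alpha) * perceptual_fn(y_true, y_fake))

# Toy usage with MSE standing in for the perceptual metric
rng = np.random.default_rng(4)
a = rng.random((3, 32, 32))
b = rng.random((3, 32, 32))
mix = blended_loss(a, b, lambda u, v: np.mean((u - v) ** 2))
```

Keeping `alpha` as a parameter also makes it easy to experiment with mixing rates other than the 50% used in this study.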

Conventional methods
In this section, brief definitions of conventional image comparison metrics are provided.More detailed information can be obtained from the cited resources if desired.

Feature-Based Similarity Index Measurement (FSIM):
Compares the phase coherence and gradient magnitude properties of image pairs (Zhang et al., 2011).

Structural Similarity Index Metric (SSIM):
Utilizes simple statistical moments, such as mean (µ) and standard deviation (σ), to determine the similarity score of image pairs (Wang et al., 2004).
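A simplified single-window version of this score can be computed directly from the statistical moments mentioned above; the reference SSIM of Wang et al. (2004) averages this quantity over local sliding windows, so the sketch below is an illustrative global approximation only.

```python
import numpy as np

def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM from global mean, variance, and covariance.
    Inputs are arrays with values in [0, 1]; c1, c2 are the usual
    stabilizing constants for that range."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

# Toy usage: a clean image versus a noisy copy
rng = np.random.default_rng(5)
img = rng.random((32, 32))
noisy = np.clip(img + rng.normal(0, 0.1, img.shape), 0.0, 1.0)
score = global_ssim(img, noisy)
```

Identical images score exactly 1, and added noise pulls the score below 1 through the covariance term.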

Peak Signal Noise Ratio (PSNR):
A widely used objective image signal quality metric. However, PSNR values may not correlate well with perceived image quality due to the complex, highly nonlinear nature of the human visual system (Fardo et al., 2016).
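PSNR is defined from the mean squared error against the signal's peak value; the following short NumPy sketch shows the standard computation for 8-bit images (peak = 255).

```python
import numpy as np

def psnr(x, y, peak=255.0):
    """Peak Signal-to-Noise Ratio in decibels for images in [0, peak]."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    mse = np.mean((x - y) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

# Toy usage: a constant error of 16 gray levels gives MSE = 256
clean = np.zeros((8, 8))
noisy = clean + 16.0
value = psnr(clean, noisy)  # ≈ 24.05 dB
```

Note the infinity case for identical images: the ratio is undefined there, which is one practical reason PSNR is awkward as a training loss.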

Visual Information Fidelity (VIF):
Uses natural scene statistics (NSS) models together with a distortion model to measure the information shared between the fake and original images (Sheikh and Bovik, 2006).

Visual Saliency-Induced Index (VSI):
Assumes that a distortion in a region that attracts the observer's attention is more disturbing than one elsewhere. It aims to weight local distortions with a local saliency map (Zhang et al., 2014).
Haar Wavelet-Based Perceptual Similarity Index (HaarPSI):
Based on comparing local wavelet coefficients extracted from image patches (Reisenhofer et al., 2018).

Multi-Scale Structural Similarity Index Metric (MS-SSIM):
Extends SSIM by computing the structural similarity at multiple image scales (Wang et al., 2003).

Experimental results
On the MS-SSIM, VIFp, and VSI metrics for the Maps dataset, the ℒ1_LPIPS function gives better results than the other methods. When the up-to-date FID and KID similarity metrics in Table 1 are examined, the DISTS function was successful. Evaluating the translation results across the datasets in general, the DISTS and LPIPS functions provide adequate accuracy compared to the other metrics. Conventional similarity measures could not provide sufficient accuracy. In contrast, the up-to-date similarity measures give better results with the DISTS function for the Cityscapes and Maps datasets and with the LPIPS function for the Denim2Mustache and Papsmear datasets. The consistent similarity results from the DISTS and LPIPS functions are in line with earlier findings (Ding et al., 2021). Thus, the DISTS and LPIPS functions can be used for loss measurement in GAN architectures. The up-to-date similarity metrics were studied in detail, as they give more accurate results. In the Cityscapes dataset, the DISTS function was successful with an FID of 67.77 and a KID of 0.036. In the Maps dataset, the DISTS function achieved an FID of 144.4 and a KID of 0.067. For the Denim2Mustache dataset, the LPIPS function demonstrated success with an FID of 130.7 and a KID of 0.042. Notably, in the Papsmear dataset, the FID and KID results were 39.47 and 0.002, respectively, and the LPIPS function outperformed the other functions.
In this study, the original ℒ1 loss function was combined with CNN-based methods within a standard Generative Adversarial Network (GAN) architecture, specifically the Pix2Pix method. The purpose of keeping the GAN method constant is to isolate the impact of the ℒ1 loss function on the CNN-based metrics. When the ℒ1 term is added, the model is tested on the fixed Pix2Pix architecture. When the proposed methods, ℒ1_DISTS and ℒ1_LPIPS, are examined metrically, they achieve better performance; these methods are advantageous as they reach the result faster and more accurately. In summary, adding the ℒ1 loss term leads to higher performance, and it may be used alongside other methods in the future. In some cases, such as variations in the dataset, visual observations indicate variability in the results, and in certain situations the ℒ1 loss term is visually less successful.

Conclusion
The aim of this study is to evaluate the performance of the loss function in the Pix2Pix architecture for GANs. CNN-based loss functions (CONTENT, DISTS, and LPIPS) were used instead of Pix2Pix's original ℒ1 loss. Four different datasets were used to examine the effect of the loss function. The effects of adding CNN-based structures to the adversarial loss term and the regularization terms in the loss function were analyzed. Following the training and testing process, translation accuracies were tabulated with conventional and up-to-date metrics. The experimental results showed that the LPIPS and DISTS methods had the best synthesis performance according to the up-to-date similarity metrics (FID and KID), whereas conventional similarity measures did not give consistent results for translation accuracy. The DISTS function gives better results in datasets with high complexity (Cityscapes and Maps), and the LPIPS function in those with less complexity (Denim2Mustache and Papsmear), compared to the other methods. Consequently, using the DISTS and LPIPS functions in image-to-image translation architectures positively affects translation accuracy.

Table 1.
Performance comparison of architectures in image synthesis. (Rows: CNN-based loss functions and datasets; Columns: conventional and up-to-date similarity metrics.)

The fourth dataset analyzed in the article is Papsmear, a medical dataset. It differs from the Cityscapes and Maps datasets: the objects (nucleus, cytoplasm, etc.) in the Papsmear images are partitioned interconnectedly. Figure 5 presents the visual results of the Papsmear dataset to illustrate the outcomes of the other Pix2Pix losses. Looking at the up-to-date similarity metrics for the Papsmear dataset in Table 1, the LPIPS function gives better results than the other methods. Among the conventional similarity metrics in Table 1, the ℒ1_LPIPS function shows good results in FSIM, MS-SSIM, SSIM, and VIFp; the ℒ1_DISTS function in HaarPSI; and the CONTENT function in PSNR and VSI.