AnimeGAN → [Chen et al. 2019]
Figure 1. Example of photo cartoonization with our method: left is the real-world input photo, right is the anime-style image produced by our proposed method.
Abstract

Transforming photos of real-world scenes into anime-style images is a meaningful and challenging task in both computer vision and artistic style transfer. Our previously proposed AnimeGAN combines neural style transfer and generative adversarial networks (GANs) to accomplish this task. However, AnimeGAN still has some obvious problems, such as high-frequency artifacts in the generated images. Therefore, in this work we propose an improved version, AnimeGANv2. It prevents the generation of high-frequency artifacts simply by changing how features are normalized in the network. In addition, we further reduce the size of the generator network to achieve more efficient animation style transfer. Trained on a newly established high-quality dataset, AnimeGANv2 can generate animation images with better visual quality than AnimeGAN.
Method

AnimeGAN is prone to producing high-frequency artifacts because it uses instance normalization, for the same reason that StyleGAN produces such artifacts; in practice, the total variation loss cannot completely suppress this high-frequency noise. Instance normalization is generally regarded as the best normalization method for style transfer: it allows different channels in the feature map to carry different feature statistics, which promotes the diversity of styles in the generated images. Layer normalization, in contrast, gives all channels in a feature map the same distribution, which effectively prevents the generation of local noise. AnimeGANv2 therefore uses layer normalization of features to prevent the network from producing high-frequency artifacts in the generated images.
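The difference between the two normalizations can be illustrated with a minimal NumPy sketch (an illustration only, not the actual network code): instance normalization standardizes each channel of each sample separately, so channels keep distinct statistics, while layer normalization standardizes all channels of a sample together, so channels share one distribution.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    # x: (N, C, H, W); normalize each channel of each sample independently,
    # so every channel ends up with its own zero-mean, unit-variance statistics
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    # x: (N, C, H, W); normalize over all channels and spatial positions of
    # each sample, so per-channel differences in scale are preserved
    mean = x.mean(axis=(1, 2, 3), keepdims=True)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# give each of the 4 channels a different scale to make the contrast visible
x = np.random.randn(2, 4, 8, 8).astype(np.float64)
x *= np.arange(1.0, 5.0).reshape(1, 4, 1, 1)

inorm = instance_norm(x)  # every channel is forced to unit variance
lnorm = layer_norm(x)     # only the whole layer is normalized; channels still differ
```

After instance normalization every per-channel standard deviation is close to 1, whereas after layer normalization the channels retain different spreads, which is exactly the property that lets local, per-channel noise survive instance normalization but not layer normalization.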

The network structure of the generator in AnimeGANv2 is shown in Figure 2. K denotes the convolution kernel size, S the stride, C the number of convolution kernels, IRB the inverted residual block, resize the interpolation up-sampling method, and SUM element-wise addition. The generator of AnimeGANv2 has a parameter size of 8.6 MB, compared with 15.8 MB for the AnimeGAN generator. AnimeGANv2 uses the same discriminator as AnimeGAN, except that the discriminator uses layer normalization instead of instance normalization.
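As an illustration of the resize-based up-sampling named in the figure legend, the following NumPy sketch (hypothetical, not the actual AnimeGANv2 code) shows nearest-neighbour interpolation followed by a 3×3 convolution, the common alternative to transposed convolution that avoids checkerboard artifacts:

```python
import numpy as np

def resize_nearest(x, scale=2):
    # x: (C, H, W) feature map; nearest-neighbour interpolation up-sampling,
    # corresponding to the "resize" blocks in the generator
    return x.repeat(scale, axis=1).repeat(scale, axis=2)

def conv3x3(x, w):
    # naive 3x3 'same' convolution: x is (C_in, H, W), w is (C_out, C_in, 3, 3);
    # written for clarity, not speed
    c_out = w.shape[0]
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c_out, h, wd), dtype=x.dtype)
    for o in range(c_out):
        for i in range(x.shape[0]):
            for di in range(3):
                for dj in range(3):
                    out[o] += w[o, i, di, dj] * xp[i, di:di + h, dj:dj + wd]
    return out

feat = np.random.randn(4, 8, 8).astype(np.float32)      # small feature map
up = resize_nearest(feat)                               # (4, 16, 16)
w = (np.random.randn(8, 4, 3, 3) * 0.1).astype(np.float32)
out = conv3x3(up, w)                                    # (8, 16, 16)
```

Resize-then-convolve doubles the spatial resolution first and then mixes channels with a stride-1 convolution, whereas a transposed convolution interleaves zeros and convolves in one step, which is what tends to produce checkerboard patterns.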

The three animation style datasets used by AnimeGANv2 are shown in Table 1. The images used in training are 256×256, and the style images are all taken from video frames of the corresponding high-definition movies. Figure 3 shows qualitative results of AnimeGANv2 on the test dataset for the three animation styles.
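A minimal sketch of how 256×256 training images might be prepared from movie frames. This is an assumption for illustration; the text does not specify the exact extraction pipeline, and a real pipeline would use bilinear or bicubic resampling (e.g. via Pillow or OpenCV) rather than the nearest-neighbour shortcut below:

```python
import numpy as np

def center_crop_square(img):
    # img: (H, W, 3) video frame; keep the largest centred square region
    h, w, _ = img.shape
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    return img[top:top + s, left:left + s]

def nearest_resize(img, size=256):
    # crude nearest-neighbour resize of a square (S, S, 3) image to size x size;
    # placeholder for proper bilinear/bicubic interpolation
    s = img.shape[0]
    idx = np.arange(size) * s // size
    return img[idx][:, idx]

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)   # e.g. one 1080p movie frame
patch = nearest_resize(center_crop_square(frame))   # (256, 256, 3) training image
```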

Figure 2. Architecture of the generator network.

Table 1. Information of three anime style datasets.
Anime style      Film                               Picture Number   Quality
Miyazaki Hayao   The Wind Rises                     1752             1080p
Makoto Shinkai   Your Name & Weathering with You    1445             BD
Kon Satoshi      Paprika                            1284             BDRip

Figure 3. The qualitative results of AnimeGANv2 on the three animation styles on the test dataset.
Citation

Jie Chen, Gang Liu, Xin Chen.
"AnimeGAN: A Novel Lightweight GAN for Photo Animation."
ISICA 2019: Artificial Intelligence Algorithms and Applications, pp. 242–256, 2019.

「AnimeGANv2」 Xin Chen, August, 2020