ETNet: Error Transition Network for Arbitrary Style Transfer

Conference on Neural Information Processing Systems (Proceedings of NeurIPS 2019)

Chunjin Song1    Zhijie Wu1    Yang Zhou1*    Minglun Gong2    Hui Huang1*

1Shenzhen University    2University of Guelph

Figure 1: State-of-the-art methods all achieve good stylization with a simple target style (top row in (a)). For a complex style (bottom row in (a)), however, both WCT and Avatar-Net distort the spatial structures and fail to preserve texture consistency, whereas our method still performs well. Unlike existing methods, our model achieves style transfer via coarse-to-fine refinement: from left to right in (b), finer details emerge with each refinement. See the close-up views in our supplementary material for a better visualization.


Numerous valuable efforts have been devoted to achieving arbitrary style transfer since the seminal work of Gatys et al. However, existing state-of-the-art approaches often generate insufficiently stylized results under challenging cases. We believe a fundamental reason is that these approaches try to generate the stylized result in a single shot and hence fail to fully satisfy the constraints on semantic structures in the content images and style patterns in the style images. Inspired by works on error correction, we instead propose a self-correcting model that predicts what is wrong with the current stylized result and refines it iteratively. For each refinement, we transition the error features across both the spatial and scale domains and invert the processed features into a residual image, with a network we call Error Transition Network (ETNet). The proposed model improves over state-of-the-art methods with better semantic structures and more adaptive style pattern details. Various qualitative and quantitative experiments show that the key concepts of both the progressive strategy and error correction yield better results.

Figure 2: Framework of our proposed stylization procedure. We start with a zero image as the initial stylization, i.e. Î^3_cs = 0. Together with the downsampled content-style image pair (I^3_c, I^3_s), it is fed into the residual image generator ETNet_3 to produce a residual image I^3_h. The sum of I^3_h and Î^3_cs gives the updated stylized image I^3_cs, which is then upsampled into Î^2_cs. This process is repeated across two subsequent levels to yield the final stylized image I^1_cs at full resolution.
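The coarse-to-fine procedure of Figure 2 can be sketched as a simple loop. The sketch below is illustrative only: `upsample`, `coarse_to_fine_stylize`, and the `etnets` callables are hypothetical stand-ins for the paper's trained modules, not its actual implementation.

```python
import numpy as np

def upsample(img):
    """Nearest-neighbour 2x upsampling (a stand-in for the paper's upsampler)."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def coarse_to_fine_stylize(content_pyramid, style_pyramid, etnets):
    """Run the three-level refinement of Figure 2.

    content_pyramid / style_pyramid: image pyramids listed coarsest first,
    e.g. [I^3, I^2, I^1]. etnets: callables (c, s, stylized) -> residual,
    standing in for ETNet_3, ETNet_2, ETNet_1.
    """
    stylized = np.zeros_like(content_pyramid[0])      # Î^3_cs = 0
    for level, (c, s, etnet) in enumerate(zip(content_pyramid,
                                              style_pyramid, etnets)):
        residual = etnet(c, s, stylized)              # I^k_h
        stylized = stylized + residual                # I^k_cs = Î^k_cs + I^k_h
        if level < len(etnets) - 1:
            stylized = upsample(stylized)             # Î^(k-1)_cs for next level
    return stylized
```

With a dummy residual generator that simply outputs the gap to the content image, the loop reproduces the content at full resolution, which confirms the additive-refinement bookkeeping.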

Figure 3: Error Transition Network (a) and the detailed architecture of the error propagation block (b). The inputs to ETNet are a content-style image pair (I_c, I_s) and an intermediate stylization Î_cs. The two encoders extract deep features {f^i_in}, and error features ΔE^4_c and {ΔE^i_s}, respectively. After fusing ΔE^4_c and ΔE^4_s, the fused error feature ΔE^4 is fed, together with f^4_in, into a non-local block to compute a global residual feature ΔD^4. Both ΔD^4 and ΔE^4 are then passed through a series of error propagation blocks that incorporate lower-level information until the residual image I_h is obtained. Finally, I_h is added to Î_cs to output the refined image I_cs.
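Intuitively, the error features measure how far the current stylization is from the content and style targets. A loose numpy sketch of that idea follows; `gram` and `error_features` are illustrative stand-ins (feature differences and Gram-statistic differences are common proxies for content and style errors), whereas the actual ETNet learns its error features with convolutional encoders.

```python
import numpy as np

def gram(feat):
    """Gram matrix of a (C, H*W) feature map, a common style statistic."""
    c, n = feat.shape
    return feat @ feat.T / n

def error_features(f_content, f_style, f_current):
    """Illustrative stand-ins for the content/style error features in Figure 3.

    delta_e_c: gap between current and content features (content error);
    delta_e_s: gap between current and style Gram statistics (style error).
    """
    delta_e_c = f_content - f_current
    delta_e_s = gram(f_style) - gram(f_current)
    return delta_e_c, delta_e_s
```

When the current stylization already matches the targets, both error features vanish, so the predicted residual should shrink to zero as refinement converges.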

Figure 4: Ablation study on explicit error computation. Our full model reduces artifacts and synthesizes texture patterns more faithful to the style image. Zoom in for more details.

Figure 5: Ablation study on the effect of current stylized results in computing residual images.

Table 1: Ablation study on multiple refinements, explicit error computation, and the joint analysis of intermediate stylized results when computing the residual images. All results are averaged over 100 synthesized images using perceptual metrics. Both K and K' denote the number of refinements: K refers to the full model, while K' refers to a simplified model that removes the upper encoder and thus does not consider the intermediate stylization when computing residual images.

Figure 6: Comparison with results from different methods.

Figure 7: Detail cut-outs. The top row shows close-ups of the highlighted areas for a better visualization. Only our result successfully captures the paintbrush patterns of the style image.

Figure 8: At the deployment stage, the degree of stylization can be adjusted with the parameter α.
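One plausible way to realize such an α control in a residual-based refiner is to scale the predicted residual before adding it back. This is a hypothetical sketch of that reading of Figure 8, not the paper's confirmed mechanism:

```python
import numpy as np

def apply_refinement(stylized, residual, alpha=1.0):
    """Blend in a predicted residual with strength alpha in [0, 1].

    alpha = 0 keeps the current stylization unchanged; alpha = 1 applies
    the full residual. (An assumed interpretation of the α control; the
    paper's exact mechanism may differ.)
    """
    alpha = float(np.clip(alpha, 0.0, 1.0))
    return stylized + alpha * residual
```

Intermediate values of α then interpolate smoothly between the unrefined and fully refined results.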

Figure 9: A refinement for the outputs of AdaIN and WCT.

Table 2: Quantitative comparison of different models on perceptual (content & style) loss, user-study preference score, and stylization speed. All results are averaged over 100 test images, except the preference score.

Data & Code

Note that the DATA and CODE are free for Research and Education Use ONLY. 

Please cite our paper (add the bibtex below) if you use any part of our ALGORITHM, CODE, DATA or RESULTS in any publication.



We thank the anonymous reviewers for their constructive comments. This work was supported in parts by NSFC (61761146002, 61861130365, 61602461), GD Higher Education Innovation Key Program (2018KZDXM058), GD Science and Technology Program (2015A030312015), Shenzhen Innovation Program (KQJSCX20170727101233642), LHTD (20170003), and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ).

@inproceedings{Song2019ETNet,
  title     = {{ETNet}: Error Transition Network for Arbitrary Style Transfer},
  author    = {Chunjin Song and Zhijie Wu and Yang Zhou and Minglun Gong and Hui Huang},
  booktitle = {Conference on Neural Information Processing Systems (Proceedings of NeurIPS 2019)},
  pages     = {668--677},
  year      = {2019},
}

Downloads (faster for people in China)

Downloads (faster for people in other places)