7 Appendix
Through an empirical exploration of the training behavior of DNNs in the information plane, Shwartz-Ziv and Tishby [18] claimed that (i) an information compression phase is a general feature of DNN training and (ii) this compression is induced by SGD. In this section, we demonstrate how the F-Principle can be used to understand the compression phase.
7.1 Computation of Information
For any random variables U and V with a joint distribution P(u, v): the entropy of U is defined as \(I(U)=-\sum _{u}P(u)\log P(u)\); their mutual information is defined as \(I(U,V)=\sum _{u,v}P(u,v)\log \frac{P(u,v)}{P(u)P(v)}\); the conditional entropy of U given V is defined as
$$ I(U|V)=\sum _{u,v}P(u,v)\log \frac{P(v)}{P(u,v)}=I(U)-I(U,V). $$
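For concreteness, the following sketch (an illustration added here, not part of the original computation) evaluates these three quantities for a small, arbitrary discrete joint distribution and checks the identity \(I(U|V)=I(U)-I(U,V)\):

```python
import numpy as np

# Example joint distribution P(u, v); rows index u, columns index v (arbitrary values).
P_uv = np.array([[0.2, 0.1],
                 [0.1, 0.6]])

P_u = P_uv.sum(axis=1)  # marginal P(u)
P_v = P_uv.sum(axis=0)  # marginal P(v)

# Entropy I(U) = -sum_u P(u) log P(u)
I_U = -np.sum(P_u * np.log(P_u))

# Mutual information I(U, V) = sum_{u,v} P(u,v) log[ P(u,v) / (P(u) P(v)) ]
I_UV = np.sum(P_uv * np.log(P_uv / np.outer(P_u, P_v)))

# Conditional entropy I(U|V) = sum_{u,v} P(u,v) log[ P(v) / P(u,v) ]
I_U_given_V = np.sum(P_uv * np.log(P_v[None, :] / P_uv))

assert np.isclose(I_U_given_V, I_U - I_UV)  # the identity above
```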
By the construction of the DNN, its output T is a deterministic function of its input X; thus, \(I(T|X)=0\) and \(I(X,T)=I(T)\). To compute entropy numerically, we evenly bin X, Y, T to \(X_{b}\), \(Y_{b}\), \(T_{b}\) with bin size b as follows. For any value v, its binned value is defined as \(v_{b}=\mathrm{Round}(v/b)\times b\). In our work, I(T) and I(Y, T) are approximated by \(I(T_{b})\) and \(I(Y_{b},T_{b})\), respectively, with \(b=0.05\). Note that, after binning, one value of \(X_{b}\) may map to multiple values of \(T_{b}\). Thus, \(I(T_{b}|X_{b})\ne 0\) and \(I(X_{b},T_{b})\ne I(T_{b})\). The difference vanishes as the bin size shrinks. Therefore, with a small bin size, \(I(T_{b})\) is a good approximation of I(X, T). In experiments, we also find that \(I(X_{b},T_{b})\) and \(I(T_{b})\) behave almost the same in the information plane for the default bin size \(b=0.05\).
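A minimal sketch of this binning procedure follows, where the entropies are estimated from the empirical frequencies of the binned values; the names `x`, `y`, `t`, and `dnn` in the usage comment are hypothetical placeholders for the input samples, target values, DNN outputs, and trained network:

```python
import numpy as np
from collections import Counter

def binned(v, b=0.05):
    """Bin values as v_b = Round(v / b) * b."""
    return np.round(np.asarray(v) / b) * b

def entropy(samples):
    """Empirical entropy; each row of `samples` is treated as one value."""
    a = np.asarray(samples).reshape(len(samples), -1)
    counts = Counter(map(tuple, a))
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log(p))

def mutual_information(u, v):
    """I(U, V) = I(U) + I(V) - I((U, V)), with the joint built by concatenation."""
    u = np.asarray(u).reshape(len(u), -1)
    v = np.asarray(v).reshape(len(v), -1)
    return entropy(u) + entropy(v) - entropy(np.hstack([u, v]))

# Hypothetical usage with samples x, y and DNN outputs t = dnn(x):
# t_b, y_b = binned(t), binned(y)
# I_T  = entropy(t_b)                  # approximates I(X, T)
# I_YT = mutual_information(y_b, t_b)  # approximates I(Y, T)
```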
7.2 Compression vs. No Compression in the Information Plane
We demonstrate how compression can appear or disappear by tuning the parameter \(\alpha \) in Eq. (2) with \(f_{0}(x)=x\) for \(x\in [-1,1]\), using full-batch gradient descent (GD) without stochasticity. In our simulations, the DNN fits f(x) well for both \(\alpha =0\) and \(\alpha =0.5\) after training (see Fig. 5a and c). In the information plane, there is no compression phase for I(T) when \(\alpha =0\) (see Fig. 5b). By increasing \(\alpha \) in Eq. (2), we observe that: (i) the fitted function is discretized with only a few possible outputs (see Fig. 5c); (ii) the compression of I(T) appears (see Fig. 5d). For \(\alpha >0\), the behavior in the information plane is similar to previously reported results [18]. To understand why compression happens for \(\alpha >0\), we next focus on the training courses for different \(\alpha \) in the frequency domain.
A key feature of the class of functions described by Eq. (2) is that the dominant low-frequency components of f(x) are the same for different \(\alpha \). By the F-Principle, the DNN first captures those dominant low-frequency components; thus, the training courses for different \(\alpha \) are similar at the beginning, i.e., (i) the DNN output is close to \(f_{0}(x)\) at certain training epochs (blue lines in Fig. 5a and c); (ii) I(T) in the information plane increases rapidly until it reaches a value close to the entropy of \(f_{0}(x)\), i.e., \(I(f_{0}(x))\) (see Fig. 5b and d). For \(\alpha =0\), the target function is \(f_{0}(x)\); therefore, I(T) moves closer and closer to \(I(f_{0}(x))\) during training. For \(\alpha >0\), the entropy of the target function, I(f(x)), is much less than \(I(f_{0}(x))\). In the later stage, when the DNN captures the high-frequency components, its output T converges to the discretized function f(x). Therefore, I(T) decreases from \(I(f_{0}(x))\) to I(f(x)).
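To make this entropy gap concrete, the following sketch compares the binned entropy of the smooth target \(f_{0}(x)=x\) with that of a coarsely discretized version. The rounding-based discretization here is a hypothetical stand-in, since Eq. (2) is not reproduced in this appendix; the point is only that a target with few distinct output values has much lower entropy:

```python
import numpy as np

x = np.linspace(-1, 1, 10001)
f0 = x                             # smooth target f_0(x) = x

# Hypothetical discretization (a stand-in for Eq. (2), not the paper's exact form):
f_disc = np.round(f0 / 0.5) * 0.5  # only a few possible output values

def binned_entropy(v, b=0.05):
    vb = np.round(v / b) * b       # bin as in Sec. 7.1
    _, counts = np.unique(vb, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

print(binned_entropy(f0))      # roughly log(41), the number of occupied bins
print(binned_entropy(f_disc))  # much smaller: only 5 distinct output values
```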
This analysis is also applicable to other functions. As discretization is in general inevitable for classification problems with discrete labels, information compression can often be observed in practice, as described in the previous study [18].