
Bias–variance decomposition

{{short description|Property of a model}} [[File:Bias and variance contributing to total error.svg|thumb|Bias and variance as function of model complexity]] {{Machine learning|Theory}}

In [[statistics]] and [[machine learning]], the '''bias–variance tradeoff''' describes the relationship between a model's complexity, the accuracy of its predictions, and how well it can make predictions on previously unseen data that were not used to train the model. In general, as the number of tunable parameters in a model increases, it becomes more flexible, and can better fit a training data set. That is, the model has lower error or lower [[Bias of an estimator|bias]]. However, for more flexible models, there will tend to be greater '''variance''' to the model fit each time we take a set of [[sample (statistics)|samples]] to create a new training data set. It is said that there is greater [[variance]] in the model's [[estimation theory|estimated]] [[statistical parameter|parameters]].

The '''bias–variance dilemma''' or '''bias–variance problem''' is the conflict in trying to simultaneously minimize these two sources of [[Errors and residuals in statistics|error]] that prevent [[supervised learning]] algorithms from generalizing beyond their [[training set]]:{{cite journal |last1=Kohavi |first1=Ron |last2=Wolpert |first2=David H. |title=Bias Plus Variance Decomposition for Zero-One Loss Functions |journal=ICML |date=1996 |volume=96}}{{cite journal |last1=Luxburg |first1=Ulrike V. |last2=Schölkopf |first2=B. |title=Statistical learning theory: Models, concepts, and results |journal=Handbook of the History of Logic |date=2011 |volume=10| page=Section 2.4}}

* The [[Bias of an estimator|''bias'']] error is an error from erroneous assumptions in the learning [[algorithm]]. High bias can cause an algorithm to miss the relevant relations between features and target outputs ([[Overfitting#Underfitting|underfitting]]).
* The ''[[variance]]'' is an error from sensitivity to small fluctuations in the training set. High variance may result from an algorithm modeling the random [[Noise (signal processing)|noise]] in the training data ([[overfitting]]).

The '''bias–variance decomposition''' is a way of analyzing a learning algorithm's [[expected value|expected]] [[generalization error]] with respect to a particular problem as a sum of three terms, the bias, variance, and a quantity called the ''irreducible error'', resulting from noise in the problem itself.

{{multiple image | align = right | direction = vertical | width = 200 | image1 = Test function and noisy data.png | caption1 = Function and noisy data | image2 = Radial basis function fit, spread=5.png | caption2 = Spread=5 | image3 = Radial basis function fit, spread=1.png | caption3 = Spread=1 | image4 = Radial basis function fit, spread=0.1.png | caption4 = Spread=0.1 | footer = A function (red) is approximated using [[radial basis functions]] (blue). Several trials are shown in each graph. For each trial, a few noisy data points are provided as a training set (top). For a wide spread (image 2) the bias is high: the RBFs cannot fully approximate the function (especially the central dip), but the variance between different trials is low. As the spread decreases (images 3 and 4) the bias decreases: the blue curves more closely approximate the red. However, depending on the noise in different trials, the variance between trials increases. In the lowermost image the approximated values for x=0 vary wildly depending on where the data points were located. }}

==Motivation==
{{See also|Accuracy and precision}}
File:Truen bad prec ok.png|High bias, low variance
File:Truen bad prec bad.png|High bias, high variance
File:En low bias low variance.png|Low bias, low variance
File:Truen ok prec bad.png|Low bias, high variance

The bias–variance tradeoff is a central problem in supervised learning. Ideally, one wants to [[Model selection|choose a model]] that both accurately captures the regularities in its training data and [[Generalization|generalizes]] well to unseen data. Unfortunately, it is typically impossible to do both simultaneously. High-variance learning methods may be able to represent their training set well but are at risk of overfitting to noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models that may fail to capture important regularities in the data (i.e. underfit).

It is an often made [[Affirming the consequent|fallacy]]{{cite arXiv |last=Neal |first=Brady |eprint=1912.08286 |title=On the Bias–Variance Tradeoff: Textbooks Need an Update |class=cs.LG |date=2019}}{{cite arXiv |first1=Brady |last1=Neal |first2=Sarthak |last2=Mittal |first3=Aristide |last3=Baratin |first4=Vinayak |last4=Tantia |first5=Matthew |last5=Scicluna |first6=Simon |last6=Lacoste-Julien |first7=Ioannis |last7=Mitliagkas |eprint=1810.08591 |title=A Modern Take on the Bias–Variance Tradeoff in Neural Networks |class=cs.LG |date=2018}} to assume that complex models must have high variance. High variance models are "complex" in some sense, but the reverse need not be true.{{Cite conference |last1=Neal |first1=Brady |last2=Mittal |first2=Sarthak |last3=Baratin |first3=Aristide |last4=Tantia |first4=Vinayak |last5=Scicluna |first5=Matthew |last6=Lacoste-Julien |first6=Simon |last7=Mitliagkas |first7=Ioannis |date=2019 |title=A Modern Take on the Bias–Variance Tradeoff in Neural Networks |url=https://openreview.net/forum?id=HkgmzhC5F7 |conference=International Conference on Learning Representations (ICLR) 2019}} In addition, one has to be careful how to define complexity. In particular, the number of parameters used to describe the model is a poor measure of complexity. This is illustrated by an example adapted from:{{cite book |last1=Vapnik |first1=Vladimir |title=The nature of statistical learning theory |date=2000 |publisher=Springer-Verlag |location=New York |doi=10.1007/978-1-4757-3264-1 |isbn=978-1-4757-3264-1 |s2cid=7138354 |url=https://dx.doi.org/10.1007/978-1-4757-3264-1}} The model f_{a,b}(x)=a\sin(bx) has only two parameters (a,b) but it can interpolate any number of points by oscillating with a high enough frequency, resulting in both a high bias and high variance.

An analogy can be made to the relationship between [[accuracy and precision]]. Accuracy is one way of quantifying bias and can intuitively be improved by selecting from only [[Sample space|local]] information. Consequently, a sample will appear accurate (i.e. have low bias) under the aforementioned selection conditions, but may result in underfitting. In other words, [[Training, validation, and test data sets|test data]] may not agree as closely with training data, which would indicate imprecision and therefore inflated variance. A graphical example would be a straight line fit to data exhibiting quadratic behavior overall. Precision is a description of variance and generally can only be improved by selecting information from a comparatively larger space. The option to select many data points over a broad sample space is the ideal condition for any analysis. However, intrinsic constraints (whether physical, theoretical, computational, etc.) will always play a limiting role. The limiting case where only a finite number of data points are selected over a broad sample space may result in improved precision and lower variance overall, but may also result in an overreliance on the training data (overfitting). This means that test data would also not agree as closely with the training data, but in this case the reason is inaccuracy or high bias. To borrow from the previous example, the graphical representation would appear as a high-order polynomial fit to the same data exhibiting quadratic behavior. Note that error in each case is measured the same way, but the reason ascribed to the error is different depending on the balance between bias and variance. To mitigate how much information is used from neighboring observations, a model can be [[smoothing|smoothed]] via explicit [[Regularization (mathematics)|regularization]], such as [[shrinkage (statistics)|shrinkage]].

==Bias–variance decomposition of mean squared error==
{{main|Mean squared error}}
[[File:Bias-variance decomposition.png|thumb|Bias–variance decomposition in the case of mean squared loss. The green dots are samples of test label y at a fixed test feature x. Their variance around the mean \mathbb E_{y \sim p(\cdot | x)}[y] is the irreducible error \sigma^2. The red dots are test label predictions f(x | D) as the training set D is randomly sampled. Their variance around the mean \mathbb E_D[f(x | D)] is the variance \operatorname{Var}_D\big[f(x | D)\big]. The difference between the red dash and the green dash is the bias \operatorname{Bias}_D\big[f(x | D)\big]. The bias–variance decomposition is then visually clear: the mean squared error between the red dots and the green dots is the sum of the three components.]]

Suppose that we have a training set consisting of a set of points x_1, \dots, x_n and real-valued labels y_i associated with the points x_i. We assume that the data is generated by a function f(x) such that y = f(x) + \varepsilon, where the noise, \varepsilon, has zero mean and variance \sigma^2. That is, y_i = f(x_i) + \varepsilon_i, where \varepsilon_i is a noise sample.

We want to find a function \hat{f}(x;D) that approximates the true function f(x) as well as possible, by means of some learning algorithm based on a training dataset (sample) D=\{(x_1,y_1), \dots, (x_n, y_n)\}. We make "as well as possible" precise by measuring the [[mean squared error]] between y and \hat{f}(x;D): we want (y - \hat{f}(x;D))^2 to be minimal, both for x_1, \dots, x_n ''and for points outside of our sample''. Of course, we cannot hope to do so perfectly, since the y_i contain noise \varepsilon; this means we must be prepared to accept an ''irreducible error'' in any function we come up with.

Finding an \hat{f} that generalizes to points outside of the training set can be done with any of the countless algorithms used for supervised learning. It turns out that whichever function \hat{f} we select, we can decompose its [[expected value|expected]] error on an unseen sample x (i.e. conditional on x) as follows:{{cite book |first1=Gareth |last1=James |first2=Daniela |last2=Witten |author-link2=Daniela Witten |first3=Trevor |last3=Hastie |author-link3=Trevor Hastie |first4=Robert |last4=Tibshirani |author-link4=Robert Tibshirani |title=An Introduction to Statistical Learning |publisher=Springer |year=2013 |url=http://www-bcf.usc.edu/~gareth/ISL/ }}{{rp|34}}{{cite book |first1=Trevor |last1=Hastie |first2=Robert |last2=Tibshirani |first3=Jerome H. |last3=Friedman |author-link3=Jerome H. Friedman |year=2009 |title=The Elements of Statistical Learning |url=http://statweb.stanford.edu/~tibs/ElemStatLearn/ |access-date=2014-08-20 |archive-url=https://web.archive.org/web/20150126123924/http://statweb.stanford.edu/~tibs/ElemStatLearn/ |archive-date=2015-01-26 |url-status=dead }}{{rp|223}}

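\mathbb{E}_{D,\varepsilon}\Big[\big(y - \hat{f}(x;D)\big)^2\Big] = \Big(\operatorname{Bias}_D\big[\hat{f}(x;D)\big]\Big)^2 + \operatorname{Var}_D\big[\hat{f}(x;D)\big] + \sigma^2

where

\begin{align}
\operatorname{Bias}_D\big[\hat{f}(x;D)\big] &\triangleq \mathbb{E}_D\big[\hat{f}(x;D) - f(x)\big] \\
&= \mathbb{E}_D\big[\hat{f}(x;D)\big] \, - \, f(x) \\
&= \mathbb{E}_D\big[\hat{f}(x;D)\big] \, - \, \mathbb{E}_{y|x}\big[y(x)\big]
\end{align}

and

\operatorname{Var}_D\big[\hat{f}(x;D)\big] \triangleq \mathbb{E}_D\Big[\big(\mathbb{E}_D[\hat{f}(x;D)] - \hat{f}(x;D)\big)^2\Big].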

The expectation ranges over different choices of the training set D=\{(x_1,y_1), \dots, (x_n, y_n)\}, all sampled from the same joint distribution P(x,y), which can for example be done via [[Bootstrapping (statistics)|bootstrapping]]. The three terms represent:

* the square of the ''bias'' of the learning method, which can be thought of as the error caused by the simplifying assumptions built into the method. E.g., when approximating a non-linear function f(x) using a learning method for [[linear model]]s, there will be error in the estimates \hat{f}(x) due to this assumption;
* the ''variance'' of the learning method, or, intuitively, how much the learning method \hat{f}(x) will move around its mean;
* the irreducible error \sigma^2.

Since all three terms are non-negative, the irreducible error forms a lower bound on the expected error on unseen samples.{{rp|34}}

The more complex the model \hat{f}(x) is, the more data points it will capture, and the lower the bias will be. However, complexity will make the model "move" more to capture the data points, and hence its variance will be larger.
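The trend between complexity, bias and variance can be checked numerically. The following is only a minimal sketch under stated assumptions, not a reference implementation: the target function sin(2πx), the noise level σ = 0.3, the uniform inputs on [0, 1] and the piecewise-constant learner are all illustrative choices that do not come from the sources cited here, and the names <code>true_f</code>, <code>fit_regressogram</code> and <code>decompose</code> are hypothetical. The number of bins plays the role of model complexity. The squared bias, the variance and the noise are estimated by Monte Carlo over many independently drawn training sets, and their sum is compared with the mean squared error measured against fresh noisy labels.

```python
import math
import random
import statistics


def true_f(x):
    """Assumed ground-truth function (illustrative choice)."""
    return math.sin(2 * math.pi * x)


def fit_regressogram(xs, ys, n_bins):
    """Piecewise-constant learner: predict the mean label of the bin containing x."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    fallback = sum(ys) / len(ys)  # used for bins that received no training point
    means = [s / c if c else fallback for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * n_bins), n_bins - 1)]


def decompose(n_bins, n_train=32, n_trials=400, sigma=0.3, seed=0):
    """Monte Carlo estimate of (bias^2, variance, noise, mse), averaged over test points."""
    rng = random.Random(seed)
    x_test = [(j + 0.5) / 50 for j in range(50)]
    preds, sq_err = [], 0.0
    for _ in range(n_trials):
        # Draw a fresh training set D from the same joint distribution.
        xs = [rng.random() for _ in range(n_train)]
        ys = [true_f(x) + rng.gauss(0.0, sigma) for x in xs]
        model = fit_regressogram(xs, ys, n_bins)
        row = [model(x) for x in x_test]
        preds.append(row)
        # Squared error against fresh noisy labels y = f(x) + eps at the test points.
        sq_err += sum((true_f(x) + rng.gauss(0.0, sigma) - p) ** 2
                      for x, p in zip(x_test, row))
    bias2 = variance = 0.0
    for j, x in enumerate(x_test):
        column = [row[j] for row in preds]  # predictions at one test point
        bias2 += (statistics.fmean(column) - true_f(x)) ** 2
        variance += statistics.pvariance(column)
    n = len(x_test)
    return bias2 / n, variance / n, sigma ** 2, sq_err / (n_trials * n)


if __name__ == "__main__":
    for n_bins in (1, 4, 16):
        b2, var, noise, mse = decompose(n_bins)
        print(f"bins={n_bins:2d}  bias^2={b2:.3f}  var={var:.3f}  "
              f"sum={b2 + var + noise:.3f}  mse={mse:.3f}")
```

Under these assumptions, a single bin (a constant model) should show a large squared bias and a small variance, while sixteen bins should show the opposite. In every case the sum of the three terms should agree with the measured error up to Monte Carlo noise.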

===Derivation===
The derivation of the bias–variance decomposition for squared error proceeds as follows.{{cite web |first1=Sethu |last1=Vijayakumar |author-link=Sethu Vijayakumar |title=The Bias–Variance Tradeoff |publisher=[[University of Edinburgh]] |year=2007 |access-date=19 August 2014 |url=http://www.inf.ed.ac.uk/teaching/courses/mlsc/Notes/Lecture4/BiasVariance.pdf }}{{cite web |title=Notes on derivation of bias–variance decomposition in linear regression |first=Greg |last=Shakhnarovich |year=2011 |access-date=20 August 2014 |url=http://ttic.uchicago.edu/~gregory/courses/wis-ml2012/lectures/biasVarDecom.pdf |archive-url=https://web.archive.org/web/20140821063842/http://ttic.uchicago.edu/~gregory/courses/wis-ml2012/lectures/biasVarDecom.pdf |archive-date=21 August 2014 }} For convenience, we drop the D subscript in the following lines, such that \hat{f}(x;D) = \hat{f}(x).

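Let us write the mean-squared error of our model:

\begin{align}
\text{MSE} &\triangleq \mathbb{E}\Big[\big(y - \hat{f}(x)\big)^2\Big] \\
&= \mathbb{E}\Big[\big(f(x) + \varepsilon - \hat{f}(x)\big)^2\Big] \\
&= \mathbb{E}\Big[\big(f(x) - \hat{f}(x)\big)^2\Big] + 2\,\mathbb{E}\Big[\big(f(x) - \hat{f}(x)\big)\varepsilon\Big] + \mathbb{E}\big[\varepsilon^2\big]
\end{align}

We can show that the second term of this equation is null, since \varepsilon has zero mean and is independent of \hat{f}(x):

\mathbb{E}\Big[\big(f(x) - \hat{f}(x)\big)\varepsilon\Big] = \mathbb{E}\big[f(x) - \hat{f}(x)\big] \, \mathbb{E}\big[\varepsilon\big] = 0

Moreover, the third term of this equation is nothing but \sigma^2, the variance of \varepsilon.

Let us now expand the remaining term:

\begin{align}
\mathbb{E}\Big[\big(f(x) - \hat{f}(x)\big)^2\Big] &= \mathbb{E}\Big[\big(f(x) - \mathbb{E}[\hat{f}(x)] + \mathbb{E}[\hat{f}(x)] - \hat{f}(x)\big)^2\Big] \\
&= \mathbb{E}\Big[\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big] + 2\,\mathbb{E}\Big[\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)\big(\mathbb{E}[\hat{f}(x)] - \hat{f}(x)\big)\Big] + \mathbb{E}\Big[\big(\mathbb{E}[\hat{f}(x)] - \hat{f}(x)\big)^2\Big]
\end{align}

We show that:

\begin{align}
\mathbb{E}\Big[\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big] &= \mathbb{E}\big[f(x)^2\big] - 2\,\mathbb{E}\Big[f(x) \, \mathbb{E}[\hat{f}(x)]\Big] + \mathbb{E}\Big[\mathbb{E}[\hat{f}(x)]^2\Big] \\
&= f(x)^2 - 2\,f(x) \, \mathbb{E}[\hat{f}(x)] + \mathbb{E}[\hat{f}(x)]^2 \\
&= \Big(f(x) - \mathbb{E}[\hat{f}(x)]\Big)^2
\end{align}

This last series of equalities comes from the fact that f(x) is not a random variable, but a fixed, deterministic function of x. Therefore, \mathbb{E}\left[f(x)\right] = f(x). Similarly \mathbb{E}\left[ f(x)^2 \right] = f(x)^2, and \mathbb{E}\left[ f(x) \, \mathbb{E}[\hat{f}(x)] \right] = f(x) \, \mathbb{E}\left[ \mathbb{E}[\hat{f}(x)] \right] = f(x) \, \mathbb{E}[\hat{f}(x)]. Using the same reasoning, we can expand the second term and show that it is null:

\begin{align}
\mathbb{E}\Big[\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)\big(\mathbb{E}[\hat{f}(x)] - \hat{f}(x)\big)\Big] &= \mathbb{E}\Big[f(x) \, \mathbb{E}[\hat{f}(x)] - f(x) \, \hat{f}(x) - \mathbb{E}[\hat{f}(x)]^2 + \mathbb{E}[\hat{f}(x)] \, \hat{f}(x)\Big] \\
&= f(x) \, \mathbb{E}[\hat{f}(x)] - f(x) \, \mathbb{E}[\hat{f}(x)] - \mathbb{E}[\hat{f}(x)]^2 + \mathbb{E}[\hat{f}(x)]^2 \\
&= 0
\end{align}

Eventually, we plug our derivations back into the original equation, and identify each term:

\begin{align}
\text{MSE} &= \Big(f(x) - \mathbb{E}[\hat{f}(x)]\Big)^2 + \mathbb{E}\Big[\big(\mathbb{E}[\hat{f}(x)] - \hat{f}(x)\big)^2\Big] + \sigma^2 \\
&= \operatorname{Bias}\big[\hat{f}(x)\big]^2 + \operatorname{Var}\big[\hat{f}(x)\big] + \sigma^2
\end{align}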

Finally, the MSE loss function (or negative log-likelihood) is obtained by taking the expectation value over x\sim P:

\text{MSE} = \mathbb{E}_x\big[\text{MSE}(x)\big] = \mathbb{E}_x\Big\{\operatorname{Bias}_D\big[\hat{f}(x;D)\big]^2 + \operatorname{Var}_D\big[\hat{f}(x;D)\big]\Big\} + \sigma^2.

==Approaches==
[[Dimensionality reduction]] and [[feature selection]] can decrease variance by simplifying models. Similarly, a larger training set tends to decrease variance. Adding features (predictors) tends to decrease bias, at the expense of introducing additional variance. Learning algorithms typically have some tunable parameters that control bias and variance; for example:

* [[Linear model|Linear]] and [[Generalized linear model|generalized linear]] models can be [[Regularization (mathematics)|regularized]] to decrease their variance at the cost of increasing their bias.{{cite book |last=Belsley |first=David |title=Conditioning diagnostics : collinearity and weak data in regression |publisher=Wiley |location=New York (NY) |year=1991 |isbn=978-0471528890 }}
* In [[artificial neural network]]s, the variance increases and the bias decreases as the number of hidden units increases,{{cite journal |last1=Geman |first1=Stuart |author-link1=Stuart Geman |first2=Élie |last2=Bienenstock |first3=René |last3=Doursat |year=1992 |title=Neural networks and the bias/variance dilemma |journal=Neural Computation |volume=4 |pages=1–58 |doi=10.1162/neco.1992.4.1.1 |s2cid=14215320 |url=http://web.mit.edu/6.435/www/Geman92.pdf }} although this classical assumption has been the subject of recent debate. As in GLMs, regularization is typically applied.
* In [[k-nearest neighbor|''k''-nearest neighbor]] models, a high value of {{mvar|k}} leads to high bias and low variance (see below).
* In [[instance-based learning]], regularization can be achieved by varying the mixture of [[prototype]]s and exemplars.{{cite journal |last1=Gagliardi |first1=Francesco |date=May 2011 |title=Instance-based classifiers applied to medical databases: diagnosis and knowledge extraction |journal=Artificial Intelligence in Medicine |volume=52 |issue=3 |pages=123–139 |doi=10.1016/j.artmed.2011.04.002 |pmid=21621400 |url=https://www.researchgate.net/publication/51173579 }}
* In [[decision tree]]s, the depth of the tree determines the variance. Decision trees are commonly pruned to control variance.{{rp|307}}

One way of resolving the trade-off is to use [[mixture models]] and [[ensemble learning]].{{cite book |first1=Jo-Anne |last1=Ting |first2=Sethu |last2=Vijaykumar |first3=Stefan |last3=Schaal |url=http://homepages.inf.ed.ac.uk/svijayak/publications/ting-EMLDM2016.pdf |chapter=Locally Weighted Regression for Control |title=Encyclopedia of Machine Learning |editor-first1=Claude |editor-last1=Sammut |editor-first2=Geoffrey I. |editor-last2=Webb |publisher=Springer |year=2011 |page=615 |bibcode=2010eoml.book.....S }}{{cite web |first=Scott |last=Fortmann-Roe |title=Understanding the Bias–Variance Tradeoff |year=2012 |url=http://scott.fortmann-roe.com/docs/BiasVariance.html }} For example, [[Boosting (machine learning)|boosting]] combines many "weak" (high bias) models in an ensemble that has lower bias than the individual models, while [[Bootstrap aggregating|bagging]] combines "strong" learners in a way that reduces their variance.
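The variance-reducing effect of bagging can be illustrated with a small simulation. This is a hedged sketch rather than a description of any cited method: it assumes a one-dimensional regression problem with target sin(2πx) and Gaussian noise, uses 1-nearest-neighbour regression as the high-variance base learner, and introduces the hypothetical helper names <code>one_nn</code>, <code>bagged_one_nn</code> and <code>prediction_variances</code>. Boosting is not covered by this example.

```python
import math
import random
import statistics


def true_f(x):
    """Assumed ground-truth function (illustrative choice)."""
    return math.sin(2 * math.pi * x)


def one_nn(train, x0):
    """1-nearest-neighbour regression: return the label of the closest training input."""
    return min(train, key=lambda pair: abs(pair[0] - x0))[1]


def bagged_one_nn(train, x0, n_models, rng):
    """Average the 1-NN predictions obtained on bootstrap resamples of the training set."""
    n = len(train)
    preds = [one_nn(rng.choices(train, k=n), x0) for _ in range(n_models)]
    return statistics.fmean(preds)


def prediction_variances(x0=0.25, n_train=30, n_models=25, n_trials=500,
                         sigma=0.3, seed=0):
    """Variance of the prediction at x0 over training sets: single 1-NN vs bagged 1-NN."""
    rng = random.Random(seed)
    single, bagged = [], []
    for _ in range(n_trials):
        xs = [rng.random() for _ in range(n_train)]
        train = [(x, true_f(x) + rng.gauss(0.0, sigma)) for x in xs]
        single.append(one_nn(train, x0))
        bagged.append(bagged_one_nn(train, x0, n_models, rng))
    return statistics.pvariance(single), statistics.pvariance(bagged)


if __name__ == "__main__":
    var_single, var_bagged = prediction_variances()
    print(f"single 1-NN variance: {var_single:.4f}")
    print(f"bagged 1-NN variance: {var_bagged:.4f}")
```

In this setting the bagged predictor effectively averages over several nearby labels instead of relying on one, so its prediction should fluctuate less from one training set to the next.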

[[Model validation]] methods such as [[cross-validation (statistics)|cross-validation]] can be used to tune models so as to optimize the trade-off.
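As an illustration of such tuning, the sketch below selects the number of neighbours for ''k''-nearest neighbour regression by 5-fold cross-validation. It is a minimal example under assumptions that are not taken from the sources: synthetic data from sin(2πx) with Gaussian noise, folds formed by index, and the hypothetical function names <code>knn_predict</code>, <code>cv_error</code>, <code>select_k</code> and <code>make_data</code>.

```python
import math
import random


def true_f(x):
    """Assumed ground-truth function (illustrative choice)."""
    return math.sin(2 * math.pi * x)


def make_data(n=300, sigma=0.3, seed=0):
    """Synthetic sample; inputs are drawn in random order, so index-based folds are random."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0.0, sigma)) for x in xs]


def knn_predict(train, x0, k):
    """Average the labels of the k training points closest to x0."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def cv_error(data, k, n_folds=5):
    """Mean squared validation error of k-NN regression under n_folds-fold cross-validation."""
    total = 0.0
    for fold in range(n_folds):
        valid = data[fold::n_folds]
        train = [p for i, p in enumerate(data) if i % n_folds != fold]
        total += sum((y - knn_predict(train, x, k)) ** 2 for x, y in valid)
    return total / len(data)


def select_k(data, candidates):
    """Return the candidate k with the lowest cross-validated error, plus all errors."""
    errors = {k: cv_error(data, k) for k in candidates}
    return min(errors, key=errors.get), errors


if __name__ == "__main__":
    best_k, errors = select_k(make_data(), [1, 9, 200])
    for k, err in errors.items():
        print(f"k={k:3d}  cv_mse={err:.3f}")
    print("selected k:", best_k)
```

With these settings the smallest candidate is expected to suffer from high variance and the largest from high bias, so the intermediate value should obtain the lowest validation error.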

===''k''-nearest neighbors===
In the case of [[k-nearest neighbors algorithm|{{mvar|k}}-nearest neighbors regression]], when the expectation is taken over the possible labeling of a fixed training set, a [[closed-form expression]] exists that relates the bias–variance decomposition to the parameter {{mvar|k}}:{{rp|37, 223}}

\mathbb{E}\big[(y - \hat{f}(x))^2 \mid X=x\big] = \left( f(x) - \frac{1}{k}\sum_{i=1}^k f(N_i(x)) \right)^2 + \frac{\sigma^2}{k} + \sigma^2

where N_1(x), \dots, N_k(x) are the {{mvar|k}} nearest neighbors of {{mvar|x}} in the training set. The bias (first term) is a monotone rising function of {{mvar|k}}, while the variance (second term) drops off as {{mvar|k}} is increased. In fact, under "reasonable assumptions" the bias of the first-nearest neighbor (1-NN) estimator vanishes entirely as the size of the training set approaches infinity.
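The two terms can be evaluated directly for a concrete case. The sketch below uses the standard closed form for ''k''-nearest neighbour regression with fixed training inputs, in which the squared bias is the squared difference between f(x) and the average of f over the ''k'' nearest neighbours and the variance is σ²/''k''. The target sin(2πx), the evenly spaced inputs, the query point 0.25 and the function names <code>knn_terms</code> and <code>simulated_variance</code> are assumptions made for illustration only.

```python
import math
import random
import statistics


def true_f(x):
    """Assumed ground-truth function (illustrative choice)."""
    return math.sin(2 * math.pi * x)


def knn_terms(x0, xs, k, sigma):
    """Closed-form squared bias and variance of k-NN regression at x0 for fixed inputs xs."""
    neighbours = sorted(xs, key=lambda x: abs(x - x0))[:k]
    bias2 = (true_f(x0) - sum(true_f(x) for x in neighbours) / k) ** 2
    variance = sigma ** 2 / k
    return bias2, variance


def simulated_variance(x0, xs, k, sigma, n_trials=2000, seed=0):
    """Empirical variance of the k-NN prediction over random relabellings of xs."""
    rng = random.Random(seed)
    neighbours = sorted(xs, key=lambda x: abs(x - x0))[:k]
    preds = [sum(true_f(x) + rng.gauss(0.0, sigma) for x in neighbours) / k
             for _ in range(n_trials)]
    return statistics.pvariance(preds)


if __name__ == "__main__":
    xs = [(i + 0.5) / 40 for i in range(40)]  # fixed, evenly spaced training inputs
    for k in (1, 10, 40):
        bias2, variance = knn_terms(0.25, xs, k, sigma=0.3)
        print(f"k={k:2d}  bias^2={bias2:.4f}  variance={variance:.4f}")
    print("simulated variance, k=10:",
          round(simulated_variance(0.25, xs, 10, 0.3), 4))
```

At the chosen query point the squared bias grows with ''k'' because increasingly distant neighbours are averaged in, while the variance shrinks as σ²/''k''. The simulated variance serves as a sanity check on the closed form.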

==Applications==

===In regression===
The bias–variance decomposition forms the conceptual basis for regression [[Regularization (mathematics)|regularization]] methods such as [[Lasso (statistics)|LASSO]] and [[ridge regression]]. Regularization methods introduce bias into the regression solution that can reduce variance considerably relative to the [[ordinary least squares|ordinary least squares (OLS)]] solution. Although the OLS solution provides unbiased regression estimates, the lower-variance solutions produced by regularization techniques can provide superior MSE performance when the regularization strength is chosen appropriately.
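A one-dimensional ridge regression makes this trade explicit. The sketch below is a simplified illustration under assumed conditions, not a general treatment: the model is y_i = βx_i + ε_i with fixed inputs and no intercept, the estimator is β̂ = Σx_i y_i / (Σx_i² + λ), and the function name <code>ridge_terms</code> and the numerical values are hypothetical. Under these assumptions the expectation of β̂ is βS/(S + λ) and its variance is σ²S/(S + λ)², where S = Σx_i², so the squared bias and the variance of a prediction can be computed exactly.

```python
def ridge_terms(xs, beta, sigma, lam, x0=1.0):
    """Exact squared bias, variance and their sum for the ridge prediction x0 * beta_hat.

    Assumed model: y_i = beta * x_i + eps_i with fixed inputs xs, no intercept,
    and beta_hat = sum(x_i * y_i) / (sum(x_i^2) + lam).
    """
    s = sum(x * x for x in xs)
    shrink = s / (s + lam)  # E[beta_hat] = shrink * beta
    bias2 = (x0 * beta * (shrink - 1.0)) ** 2
    variance = x0 ** 2 * sigma ** 2 * s / (s + lam) ** 2
    return bias2, variance, bias2 + variance


if __name__ == "__main__":
    xs = [0.5, 1.0, 1.5, 2.0]
    # lam = 0 is ordinary least squares; lam = sigma^2 / beta^2 = 0.25 minimises
    # bias^2 + variance in this simplified model; lam = 10 over-regularises.
    for lam in (0.0, 0.25, 10.0):
        bias2, variance, total = ridge_terms(xs, beta=2.0, sigma=1.0, lam=lam)
        print(f"lambda={lam:5.2f}  bias^2={bias2:.4f}  "
              f"variance={variance:.4f}  sum={total:.4f}")
```

In this example a small penalty lowers the sum of squared bias and variance below the OLS value, whereas a large penalty raises it well above. This is why the regularization strength is usually tuned rather than fixed in advance.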

===In classification===
The bias–variance decomposition was originally formulated for least-squares regression. For the case of [[statistical classification|classification]] under the [[0-1 loss]] (misclassification rate), it is possible to find a similar decomposition, with the caveat that the variance term becomes dependent on the target label.{{cite conference |last=Domingos |first=Pedro |author-link=Pedro Domingos |title=A unified bias–variance decomposition |conference=ICML |year=2000 |url=http://homes.cs.washington.edu/~pedrod/bvd.pdf }}{{cite journal |first1=Giorgio |last1=Valentini |first2=Thomas G. |last2=Dietterich |author-link2=Thomas G. Dietterich |title=Bias–variance analysis of support vector machines for the development of SVM-based ensemble methods |journal=[[Journal of Machine Learning Research]] |volume=5 |year=2004 |pages=725–775 |url=http://www.jmlr.org/papers/volume5/valentini04a/valentini04a.pdf }} Alternatively, if the classification problem can be phrased as [[probabilistic classification]], then the expected cross-entropy can instead be decomposed to give bias and variance terms with the same semantics but taking a different form.

It has been argued that as training data increases, the variance of learned models will tend to decrease, and hence that as training data quantity increases, error is minimised by methods that learn models with lesser bias, and that conversely, for smaller training data quantities it is ever more important to minimise variance.{{cite conference |last1=Brain|first1=Damian|last2=Webb|first2=Geoffrey|author-link2=Geoff Webb|title=The Need for Low Bias Algorithms in Classification Learning From Large Data Sets|conference=Proceedings of the Sixth European Conference on Principles of Data Mining and Knowledge Discovery (PKDD 2002)|year=2002 |url=http://i.giwebb.com/wp-content/papercite-data/pdf/brainwebb02.pdf}}

===In reinforcement learning===
Even though the bias–variance decomposition does not directly apply in [[reinforcement learning]], a similar tradeoff can also characterize generalization. When an agent has limited information on its environment, the suboptimality of an RL algorithm can be decomposed into the sum of two terms: a term related to an asymptotic bias and a term due to overfitting. The asymptotic bias is directly related to the learning algorithm (independently of the quantity of data) while the overfitting term comes from the fact that the amount of data is limited.{{cite journal |first1=Vincent |last1=Francois-Lavet |first2=Guillaume |last2=Rabusseau |first3=Joelle |last3=Pineau |first4=Damien |last4=Ernst |first5=Raphael |last5=Fonteneau |title=On Overfitting and Asymptotic Bias in Batch Reinforcement Learning with Partial Observability |journal= Journal of Artificial Intelligence Research|volume=65 |year=2019 |pages=1–30 |url=https://jair.org/index.php/jair/article/view/11478 |doi=10.1613/jair.1.11478 |doi-access=free |arxiv=1709.07796 }}

===In Monte Carlo methods===
While in traditional Monte Carlo methods the bias is typically zero, modern approaches such as [[Markov chain Monte Carlo]] are only asymptotically unbiased, at best.{{Cite conference | last1 = Zlochin | first1 = M. | last2 = Baram | first2 = Y. | title = The Bias–Variance Dilemma of the Monte Carlo Method | editor1-last = Dorffner | editor1-first = G. | editor2-last = Bischof | editor2-first = H. | editor3-last = Hornik | editor3-first = K. | book-title = Artificial Neural Networks — ICANN 2001 | series = Lecture Notes in Computer Science | volume = 2130 | publisher = Springer | year = 2001 | pages = 257–264 | doi = 10.1007/3-540-44668-0_20 | url = https://doi.org/10.1007/3-540-44668-0_20 | access-date = 17 November 2024 | url-access = subscription }} Convergence diagnostics can be used to control bias via [[burn-in]] removal, but because the computational budget is limited, a bias–variance trade-off arises,{{Cite journal | last1 = South | first1 = Leah F. | last2 = Riabiz | first2 = Marina | last3 = Teymur | first3 = Onur | last4 = Oates | first4 = Chris J. | title = Postprocessing of MCMC | journal = Annual Review of Statistics and Its Application | volume = 9 | issue = 1 | pages = 529–555 | date = March 1, 2022 | doi = 10.1146/annurev-statistics-040220-091727 | pmid = 39006247 | pmc = 7616193 | url = https://ssrn.com/abstract=4065369 | access-date = 17 November 2024 | arxiv = 2103.16048 | bibcode = 2022AnRSA...9..529S }} leading to a wide range of approaches in which a controlled bias is accepted if this allows the variance, and hence the overall estimation error, to be reduced dramatically.{{Cite journal | last1 = Nemeth | first1 = C. | last2 = Fearnhead | first2 = P.
| title = Stochastic Gradient Markov Chain Monte Carlo | journal = Journal of the American Statistical Association | volume = 116 | issue = 533 | pages = 433–450 | year = 2021 | doi = 10.1080/01621459.2020.1847120 | url = https://doi.org/10.1080/01621459.2020.1847120 | access-date = 17 November 2024 | arxiv = 1907.06986 }}{{Cite journal | last1 = Vazquez | first1 = M.A. | last2 = Míguez | first2 = J. | title = Importance sampling with transformed weights | journal = Electronics Letters | date = 2017 | volume = 53 | issue = 12 | pages = 783–785 | doi = 10.1049/el.2016.3462 | url = https://doi.org/10.1049/el.2016.3462 | access-date = 17 November 2024 | arxiv = 1702.01987 | bibcode = 2017ElL....53..783V }}{{Cite conference | last1 = Korba | first1 = A. | last2 = Portier | first2 = F. | title = Adaptive Importance Sampling meets Mirror Descent: A Bias–Variance Tradeoff | book-title = Proceedings of The 25th International Conference on Artificial Intelligence and Statistics | series = Proceedings of Machine Learning Research | volume = 151 | pages = 11503–11527 | year = 2022 | url = https://proceedings.mlr.press/v151/korba22a.html | access-date = 17 November 2024 }}

===In human learning===
While widely discussed in the context of machine learning, the bias–variance dilemma has been examined in the context of [[Cognitive science|human cognition]], most notably by [[Gerd Gigerenzer]] and co-workers in the context of learned heuristics. They have argued (see references below) that the human brain resolves the dilemma in the case of the typically sparse, poorly characterized training sets provided by experience by adopting high-bias/low-variance heuristics. This reflects the fact that a zero-bias approach has poor generalizability to new situations, and also unreasonably presumes precise knowledge of the true state of the world. The resulting heuristics are relatively simple, but produce better inferences in a wider variety of situations.{{Cite journal |last1=Gigerenzer |first1=Gerd |author-link1=Gerd Gigerenzer |last2=Brighton |first2=Henry |doi=10.1111/j.1756-8765.2008.01006.x |title=Homo Heuristicus: Why Biased Minds Make Better Inferences |journal=Topics in Cognitive Science |volume=1 |issue=1 |pages=107–143 |year=2009 |pmid=25164802 |hdl=11858/00-001M-0000-0024-F678-0 |hdl-access=free }}

[[Stuart Geman|Geman]] et al. argue that the bias–variance dilemma implies that abilities such as generic [[object recognition]] cannot be learned from scratch, but require a certain degree of "hard wiring" that is later tuned by experience. This is because model-free approaches to inference require impractically large training sets if they are to avoid high variance.

==See also==
{{Div col|colwidth=25em}}
* [[Accuracy and precision]]
* [[Bias of an estimator]]
* [[Double descent]]
* [[Gauss–Markov theorem]]
* [[Hyperparameter optimization]]
* [[Law of total variance]]
* [[Minimum-variance unbiased estimator]]
* [[Model selection]]
* [[Regression model validation]]
* [[Supervised learning]]
* [[Cramér–Rao bound]]
* [[Prediction interval]]
{{Div col end}}

==References==
{{Reflist}}

== External links ==

* [https://mlu-explain.github.io/bias-variance/ MLU-Explain: The Bias Variance Tradeoff]: an interactive visualization of the bias–variance tradeoff in LOESS Regression and K-Nearest Neighbors.

{{DEFAULTSORT:Bias-variance dilemma}} [[Category:Dilemmas]] [[Category:Model selection]] [[Category:Machine learning]] [[Category:Statistical classification]]
