# Structural risk minimization

In practical terms, Structural Risk Minimization is implemented by minimizing $E_{train}+\beta H(W)$, where $E_{train}$ is the training error, the function $H(W)$ is called a regularization function, and $\beta$ is a constant. $H(W)$ is chosen such that it takes large values on parameters $W$ that belong to high-capacity subsets of the parameter space. Minimizing $H(W)$ in effect limits the capacity of the accessible subsets of the parameter space, thereby controlling the trade-off between minimizing the training error and minimizing the expected gap between the training error and the test error.
The SRM problem can be stated concretely in terms of data. Given $n$ data points consisting of inputs $x$ and labels $y$, the objective $J(\theta )$ is often expressed in the following manner:
$J(\theta )={\frac {1}{2n}}\sum _{i=1}^{n}(h_{\theta }(x^{i})-y^{i})^{2}+{\frac {\lambda }{2}}\sum _{j=1}^{d}\theta _{j}^{2}$

The first term is the mean squared error (MSE) between the predictions of the learned model, $h_{\theta }$, and the given labels $y$. This is the training error, $E_{train}$, that was discussed earlier. The second term places a prior over the weights that penalizes large weights, shrinking them toward zero. The trade-off coefficient, $\lambda$, is a hyperparameter that places more or less importance on the regularization term. Larger $\lambda$ shrinks the weights more strongly at the expense of a higher MSE, while smaller $\lambda$ relaxes the regularization, allowing the model to fit the data more closely. In the limit $\lambda \to \infty$ the weights are driven to zero, and as $\lambda \to 0$ the model typically suffers from overfitting.
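The objective above can be minimized by gradient descent. The following is a minimal NumPy sketch, assuming a linear model $h_{\theta }(x)=\theta ^{T}x$ and a small synthetic dataset; the function names and the choice of learning rate are illustrative, not part of any particular library.

```python
import numpy as np

def ridge_objective(theta, X, y, lam):
    """J(theta) = (1/2n) * sum of squared residuals + (lam/2) * ||theta||^2."""
    n = X.shape[0]
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * n) + (lam / 2) * (theta @ theta)

def ridge_gradient(theta, X, y, lam):
    """Gradient of J: (1/n) * X^T (X theta - y) + lam * theta."""
    n = X.shape[0]
    return X.T @ (X @ theta - y) / n + lam * theta

# Toy data (illustrative): y is roughly 3*x plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
y = 3 * X[:, 0] + 0.1 * rng.normal(size=50)

# Plain gradient descent on the regularized objective.
theta = np.zeros(1)
for _ in range(500):
    theta -= 0.1 * ridge_gradient(theta, X, y, lam=0.1)
```

With $\lambda =0.1$ the learned weight lands somewhat below the noise-free value of 3, illustrating the shrinkage effect of the regularizer; increasing $\lambda$ pulls it further toward zero.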