Understanding the basics of Generalization, Overfitting & Underfitting in Machine Learning

himani.t
3 min read · Nov 4, 2022


The main goal of machine learning is to learn from data without being explicitly programmed. This is usually done by training a model on sample data and then predicting values for unseen data. The predicted values can be categorical, such as whether an input image shows a dog or a cat, or continuous, as in a regression problem such as forecasting sales from historical data.

The goal is to build a model that generalizes well. Generalization is the ability to make accurate predictions on unseen data that has similar characteristics to the training data.

When we build models, we usually expect the test set to give results similar to the training set, but that is often not the case.

Several factors can reduce accuracy on the test set; two of the most common issues are overfitting and underfitting.

Overfitting typically happens when the model becomes too complicated, for example when it uses a large number of features. A large gap between training accuracy (say, 99%) and testing accuracy (say, 56%) usually means the model is overfitting: it fits the training data so closely that it ends up modeling even the noise and random fluctuations. This can happen for the following reasons (a short sketch follows the list):

1. The data is not cleaned properly and contains a lot of noise and garbage values.

2. The model is trained for too many epochs

3. The model has too many features or an overly complex architecture
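
One quick way to see this train/test gap is to fit an overly flexible model to noisy data and compare its scores on the two sets. The sketch below uses scikit-learn on a made-up synthetic dataset (all sizes and parameters are assumptions for illustration); the exact numbers will vary, but an unconstrained decision tree typically scores near 100% on the training set and noticeably lower on the test set.

```python
# A minimal sketch of how an overfit model looks in practice.
# The dataset and hyperparameters are invented purely for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: many features, only a few of them informative
X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# An unconstrained tree memorizes the training set, noise included
tree = DecisionTreeClassifier(random_state=42)  # no depth limit
tree.fit(X_train, y_train)

print("train accuracy:", tree.score(X_train, y_train))  # close to 1.0
print("test accuracy: ", tree.score(X_test, y_test))    # noticeably lower
```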

Techniques such as feature selection, data augmentation, and regularization can reduce overfitting to a certain extent.

Note — Overfitted models usually have a high variance and a low bias
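
As a concrete illustration of the regularization point above, the sketch below compares a high-degree polynomial fit with and without an L2 (Ridge) penalty. The data, polynomial degree, and alpha value are all assumptions made up for this example, but the regularized model usually holds up better on the held-out data.

```python
# A rough sketch of how L2 regularization can rein in an overly flexible fit.
# The data, polynomial degree, and alpha below are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=40)   # noisy target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Many polynomial terms and few data points: the unpenalized fit chases noise
plain = make_pipeline(PolynomialFeatures(degree=15), StandardScaler(),
                      LinearRegression())
# The same features with an L2 penalty (Ridge) tend to stay much smoother
ridged = make_pipeline(PolynomialFeatures(degree=15), StandardScaler(),
                       Ridge(alpha=1.0))

for name, model in [("no regularization", plain), ("ridge (L2)", ridged)]:
    model.fit(X_train, y_train)
    print(name, "test R^2:", round(model.score(X_test, y_test), 3))
```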

Underfitting, on the other hand, occurs when the model is too simple. Consider trying to predict the price of a house from its size alone: one feature is not enough to make an accurate prediction, because the price also depends on factors such as the city and neighborhood, the number of bedrooms, and the quality of the school district. Underfitted models perform poorly even on the training data.

The most common reasons for underfitting are –

1. An overly simple model / too few features

2. Too little training data

3. Noisy and uncleaned data

Underfit models have a low variance and a high bias

Increasing the size of the training data and increasing the number of features can help resolve the problem.
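
A minimal sketch of that point: a model that predicts a synthetic house price from size alone scores poorly even on its own training data, while the same linear model given a few more features recovers most of the signal. Every number and feature here is an assumption invented for illustration.

```python
# A small sketch of underfitting: predicting a house price from size alone
# misses most of the signal, so the model scores poorly even on the data it
# was trained on. All the synthetic data here is invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500
size = rng.uniform(50, 300, n)            # house size in m^2
bedrooms = rng.integers(1, 6, n)          # number of bedrooms
neighborhood = rng.uniform(0, 10, n)      # made-up "neighborhood score"
price = 300 * size + 20000 * bedrooms + 15000 * neighborhood \
        + rng.normal(scale=10000, size=n)

# One feature only: the model is too simple for how the price is generated
one_feature = LinearRegression().fit(size.reshape(-1, 1), price)
print("size only,    train R^2:",
      round(one_feature.score(size.reshape(-1, 1), price), 3))

# The same model with the extra features recovers most of the signal
X_full = np.column_stack([size, bedrooms, neighborhood])
all_features = LinearRegression().fit(X_full, price)
print("all features, train R^2:", round(all_features.score(X_full, price), 3))
```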

Model complexity should be matched to the data available. If there is enough variation in the training set, a more complex model can be built without overfitting. For example, when classifying dogs, a training set of 10,000 images covering 10 different breeds will generalize better than one covering only 2 breeds, and a model trained on 10,000 images will usually outperform one trained on only 2,000.
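
One way to see the effect of training-set size is scikit-learn's learning_curve, which reports cross-validated scores at increasing amounts of training data. The dataset and model below are assumptions chosen only to illustrate the trend.

```python
# A sketch of how more training data helps: learning_curve evaluates the same
# model at several training-set sizes. Dataset and model are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=3000, n_features=20, n_informative=8,
                           flip_y=0.05, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # The gap between training and validation score usually narrows as n grows
    print(f"n={n:4d}  train={tr:.3f}  validation={va:.3f}")
```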

Striking a balance between model complexity and the amount and variety of training data is what ultimately leads to the best accuracy.
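
A common way to find that balance in practice is to sweep a single complexity knob and watch where validation performance peaks, for example with scikit-learn's validation_curve. The dataset and depth range below are illustrative assumptions.

```python
# A sketch of the complexity sweet spot: sweep tree depth and compare
# cross-validated training vs validation accuracy. Data and range are made up.
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=6,
                           flip_y=0.1, random_state=0)

depths = [1, 2, 4, 8, 16, 32]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5)

for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # Too shallow underfits (both scores low); too deep overfits (train high,
    # validation dropping). Somewhere in between is the sweet spot.
    print(f"max_depth={d:2d}  train={tr:.3f}  validation={va:.3f}")
```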
