
The Vanishing Gradient Problem appears when you train a neural network with gradient descent: the gradients tend to get smaller and smaller as they are propagated backward through the network.

In other words, once forward propagation ends and the error is propagated back from the output, the gradient no longer carries meaningful information by the time it reaches the first layers of the network.

The first layers of a neural network matter because they learn the simple patterns (features) that serve as building blocks for everything that follows; as we move deeper into the hidden layers, the layers learn increasingly complex features. When the gradients reaching those initial layers are small, their weights and biases are barely updated in each training epoch, as the small sketch below shows.
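
To make "poorly updated" concrete, here is a tiny sketch of the standard gradient descent update with a gradient that has already vanished (the learning rate and gradient values are illustrative, not from the original post):

```python
learning_rate = 0.01
w = 0.5            # a weight in one of the first layers
gradient = 1e-7    # gradient that has all but vanished on its way back

w_updated = w - learning_rate * gradient
print(w_updated)   # 0.499999999 -> the weight effectively never changes
```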

In the backpropagation step, the gradients are obtained via the chain rule: the derivatives of each layer are multiplied together, layer by layer, from the final layer back to the initial one. Because each factor is small (the derivative of the sigmoid, for instance, is at most 0.25), the product shrinks at every layer and eventually all but disappears.
This leaves the model unable to learn, and it happens most often in networks with many hidden layers.
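
To see the effect numerically, the sketch below (the layer count, width, and initialization are illustrative choices, not from the original post) backpropagates a gradient through a stack of sigmoid layers and prints how quickly its norm collapses:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_layers, width = 20, 16
weights = [rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
           for _ in range(n_layers)]

# Forward pass through a stack of sigmoid layers, keeping each activation.
a = rng.normal(size=width)
activations = []
for W in weights:
    a = sigmoid(W @ a)
    activations.append(a)

# Backward pass: the chain rule multiplies each layer's local derivative.
# sigmoid'(z) = a * (1 - a) is at most 0.25, so the product keeps shrinking.
grad = np.ones(width)  # stand-in for the gradient arriving at the output layer
for depth, (W, a) in enumerate(zip(reversed(weights), reversed(activations)), 1):
    grad = W.T @ (grad * a * (1 - a))
    print(f"{depth} layers back: gradient norm = {np.linalg.norm(grad):.2e}")
```

Running this, the printed norm drops by roughly an order of magnitude every couple of layers, which is exactly the signal the first layers would have to learn from.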

One fix for the Vanishing Gradient Problem is batch normalization, which reduces the problem by normalizing each layer's input so that |x| does not drift into the saturated tails of the sigmoid, where the derivative is close to zero.
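
As a sketch of how this can look in code (the tf.keras model below, including the layer sizes and depth, is my own illustration rather than anything from the original post), a BatchNormalization layer is inserted before each sigmoid activation so the pre-activations stay close to zero mean and unit variance:

```python
import tensorflow as tf
from tensorflow.keras import layers

# A deep sigmoid MLP where BatchNormalization keeps each layer's input
# normalized, so |x| stays away from the flat tails of the sigmoid.
model = tf.keras.Sequential([tf.keras.Input(shape=(784,))])
for _ in range(10):
    model.add(layers.Dense(64, use_bias=False))  # BN's beta makes the bias redundant
    model.add(layers.BatchNormalization())
    model.add(layers.Activation("sigmoid"))
model.add(layers.Dense(10, activation="softmax"))

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```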