It was one of the papers that was discussed in my interview at Goldman.
I came to know about this research paper a few years back after consulting a friend doing an ML PhD at the University of Maryland, College Park.
The explanation of the paper:
-
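Initialize the neural network's weights with small random values, typically in the range (-0.1, 0.1), to break symmetry: if all the weights start out equal, every neuron in a layer computes the same output and receives the same gradient, so they never learn different features.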
-
Now get ready for forward propagation: pass the training data through the multilayer perceptron (MLP) and compute the output. For each neuron in the MLP, calculate the weighted sum of its inputs and apply the activation function (my favourite is tanh for LSTM applications).
-
Now compute the loss between the computed output and the actual value, using a loss function such as mean squared error.
-
Now get ready to do backpropagation, where you need to calculate the gradient of the loss function with respect to each weight by propagating the error backward through the network.
-
So, using the chain rule, compute the partial derivatives of the loss with respect to each weight, starting from the output layer and moving back to the input layer.
-
Here is the fun part: update the weights using the gradients obtained from the backward pass. Here people usually use the Adam optimizer, a variant of stochastic gradient descent that adapts the step size for each weight using running averages of the gradient and of its square. Fun trivia: Adam stands for "Adaptive Moment Estimation".
-
Now repeat the forward and backward propagation process for many iterations until the performance of the model stabilizes.
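The steps above can be sketched in plain NumPy. Treat this as a rough illustration under my own assumptions, not the paper's code: the function name `train_mlp`, the single hidden layer of 8 tanh units, the learning rate, the stopping tolerance `tol`, and the XOR toy data are all made up for the example. The Adam constants (0.9, 0.999, 1e-8) are just the commonly used defaults.

```python
import numpy as np

def train_mlp(X, y, hidden=8, lr=0.01, epochs=2000, tol=1e-10, seed=0):
    """Train a one-hidden-layer MLP (hypothetical helper for illustration)."""
    rng = np.random.default_rng(seed)

    # Initialization: small random weights in (-0.1, 0.1) to break symmetry.
    params = {
        "W1": rng.uniform(-0.1, 0.1, (X.shape[1], hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.uniform(-0.1, 0.1, (hidden, 1)),
        "b2": np.zeros(1),
    }

    # Adam state: running averages of the gradient (m) and its square (v).
    m = {k: np.zeros_like(p) for k, p in params.items()}
    v = {k: np.zeros_like(p) for k, p in params.items()}
    beta1, beta2, eps = 0.9, 0.999, 1e-8

    losses = []
    for t in range(1, epochs + 1):
        # Forward propagation: weighted sum, then tanh activation.
        h = np.tanh(X @ params["W1"] + params["b1"])
        out = h @ params["W2"] + params["b2"]

        # Loss: mean squared error between output and target.
        err = out - y
        losses.append(float(np.mean(err ** 2)))

        # Stop once the loss has stabilized (assumed criterion).
        if len(losses) > 1 and abs(losses[-2] - losses[-1]) < tol:
            break

        # Backpropagation: chain rule from the output layer back to the input.
        d_out = 2.0 * err / err.size              # dLoss/d(out)
        grads = {"W2": h.T @ d_out, "b2": d_out.sum(axis=0)}
        d_h = (d_out @ params["W2"].T) * (1.0 - h ** 2)  # tanh'(z) = 1 - tanh(z)^2
        grads["W1"] = X.T @ d_h
        grads["b1"] = d_h.sum(axis=0)

        # Adam update with bias-corrected moment estimates.
        for k in params:
            m[k] = beta1 * m[k] + (1 - beta1) * grads[k]
            v[k] = beta2 * v[k] + (1 - beta2) * grads[k] ** 2
            m_hat = m[k] / (1 - beta1 ** t)
            v_hat = v[k] / (1 - beta2 ** t)
            params[k] -= lr * m_hat / (np.sqrt(v_hat) + eps)

    return params, losses

# Toy usage on XOR (assumed data, only to show the call).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
params, losses = train_mlp(X, y)
print("first loss:", losses[0], "last loss:", losses[-1])
```

In practice you would let a framework compute the gradients and run Adam for you; writing it out by hand like this is mainly useful for seeing how each step in the list maps to a couple of lines of code.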