Developing any machine learning model involves a rigorous experimental process that follows the idea-experiment-evaluation cycle.
The above cycle is repeated multiple times until satisfactory performance levels are achieved. The “experiment” phase involves both the coding and the training steps of the machine learning model. As models become more complex and are trained over much larger datasets, training time inevitably expands. As a consequence, training a large deep neural network can be painfully slow.
Fortunately for data science practitioners, there exist several techniques to accelerate the training process, including:
- Transfer Learning.
- Weight initialization schemes, such as Glorot or He initialization.
- Batch Normalization of layer inputs.
- Picking a reliable activation function.
- Using a faster optimizer.
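To make the second point concrete, here is a minimal sketch of He initialization in NumPy. The layer sizes and random seed are illustrative assumptions, not values from the article:

```python
import numpy as np

def he_init(fan_in, fan_out, rng=None):
    """He initialization: weights drawn from N(0, 2 / fan_in).

    Suited to ReLU-family activations; Glorot initialization would
    scale the variance by 2 / (fan_in + fan_out) instead.
    """
    rng = rng or np.random.default_rng(0)  # illustrative seed
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

# Example: weight matrix for a 512 -> 256 dense layer.
W = he_init(512, 256)
print(W.shape, W.std())
```

The empirical standard deviation of `W` should land close to `sqrt(2 / 512) ≈ 0.0625`, which keeps activation variances stable as signals pass through deep ReLU networks.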
While all the techniques listed above are important, in this post I will focus on the last point. I will describe multiple algorithms for optimizing neural network parameters, highlighting both their advantages and limitations.
In the last section of this post, I will present a visual comparison of the discussed optimization algorithms.
For practical implementation, all the code used in this article can be accessed in this GitHub repository:
Traditionally, Batch Gradient Descent is considered the default choice of optimizer for neural networks.
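Its update rule can be sketched in a few lines of NumPy. The synthetic linear-regression data, learning rate, and iteration count below are illustrative assumptions, not part of the article's repository:

```python
import numpy as np

# Synthetic linear-regression problem (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w = np.zeros(3)       # parameters to learn
lr = 0.1              # learning rate (assumed value)

for _ in range(200):
    # The gradient of the mean squared error is computed over the
    # ENTIRE dataset at every step -- this full pass is what makes
    # it "batch" gradient descent.
    grad = (2.0 / len(X)) * X.T @ (X @ w - y)
    w -= lr * grad

print(w)  # should approach true_w
```

Because each step touches every training example, the cost per update grows linearly with dataset size, which is exactly why faster optimizers become attractive on large datasets.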