A Comparative Analysis of Optimization Methods for Classification on Various Datasets
Abstract
Optimization, the study of the conditions under which a variety of mathematical structures can be analyzed through the minimization or maximization of a function, is often seen as the heart of mathematics. In deep learning (DL), optimization broadly encompasses hyperparameter tuning and the adjustment of weights and biases until the loss or cost function (J) converges, with the aim of improving a model's performance, prediction accuracy, and reliability in tasks such as classification and regression. In recent years, the stochastic gradient descent algorithm and its variants, collectively known as adaptive gradient methods, have become widely used, each offering varying degrees of success. This study provides a thorough comparison of these methods with respect to convergence speed and Cross-Entropy Loss (CEL) on classification tasks, evaluating SGD, Momentum SGD, RMSProp, Adam, Adagrad, Adadelta, Adamax, Nadam, and AMSGrad across three CNN architectures on the MNIST, Fashion-MNIST, and CIFAR-10 datasets for 30 epochs. On MNIST (CNN-2), Momentum SGD achieved an accuracy of 1.0000 with a loss of 0.0000, and reached 0.9672 on CIFAR-10 (CNN-2); RMSProp achieved 0.9714 on Fashion-MNIST (CNN-1) and 0.9582 on CIFAR-10 (CNN-1); Adam reached 0.9898 on Fashion-MNIST (CNN-2) and 0.9733 on CIFAR-10 (CNN-2). Nadam also performed relatively well on all three datasets. In contrast, Adagrad and Adadelta showed poor results; for example, Adadelta reached only 0.2576 accuracy with a loss of 2.0727 on CIFAR-10 (CNN-1) and 0.6890 accuracy on Fashion-MNIST (CNN-1). Overall, the best-performing optimizers were SGD, RMSProp, Adam, and Nadam, while Adagrad and Adadelta consistently underperformed.
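To make the experimental protocol concrete, the sketch below illustrates one way such a comparison could be set up in Keras: the same small CNN is retrained from scratch with each optimizer on MNIST for 30 epochs, and the final cross-entropy loss and accuracy are recorded. This is a minimal sketch under assumed settings, not the paper's actual code; the architecture, learning rates, and batch size shown are illustrative stand-ins for the CNN-1/CNN-2 configurations described in the study.

```python
# Minimal sketch (assumptions, not the authors' implementation): compare
# optimizers on MNIST by retraining the same small CNN with each one.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0
x_test = x_test[..., None].astype("float32") / 255.0

def build_cnn():
    # Illustrative stand-in for one of the paper's CNN architectures.
    return keras.Sequential([
        keras.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ])

# Optimizers compared in the study; hyperparameters here are assumptions.
optimizers = {
    "SGD": keras.optimizers.SGD(learning_rate=0.01),
    "Momentum SGD": keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    "RMSProp": keras.optimizers.RMSprop(),
    "Adam": keras.optimizers.Adam(),
    "Adagrad": keras.optimizers.Adagrad(),
    "Adadelta": keras.optimizers.Adadelta(),
    "Adamax": keras.optimizers.Adamax(),
    "Nadam": keras.optimizers.Nadam(),
    "AMSGrad": keras.optimizers.Adam(amsgrad=True),
}

results = {}
for name, opt in optimizers.items():
    model = build_cnn()
    model.compile(optimizer=opt,
                  loss="sparse_categorical_crossentropy",  # Cross-Entropy Loss (CEL)
                  metrics=["accuracy"])
    history = model.fit(x_train, y_train, epochs=30, batch_size=128,
                        validation_data=(x_test, y_test), verbose=0)
    results[name] = (history.history["val_loss"][-1],
                     history.history["val_accuracy"][-1])

for name, (loss, acc) in results.items():
    print(f"{name}: val loss {loss:.4f}, val accuracy {acc:.4f}")
```

The same loop could be repeated for Fashion-MNIST and CIFAR-10 (with the input shape adjusted to 32x32x3) and for each CNN architecture to reproduce the kind of optimizer-by-dataset comparison summarized above.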