Momentum, RMSprop and Adam Optimizers
So, I’m having difficulty getting RMSprop and Adam to work.
I’ve correctly implemented Momentum as an optimization algorithm: compared with plain Gradient Descent, the cost goes down much faster with Momentum, and for the same number of epochs the test-set accuracy is also higher.
Here is the code:
```python
# only momentum
elif name == 'momentum':
    # calculate momentum for every layer
    for i in range(self.number_of_layers - 1):
        self.v[f'dW{i}'] = beta1 * self.v[f'dW{i}'] + (1 - beta1) * self.gradients[f'dW{i}']
        self.v[f'db{i}'] = beta1 * self.v[f'db{i}'] + (1 - beta1) * self.gradients[f'db{i}']
    # update parameters
    for i in range(self.number_of_layers - 1):
        self.weights[i] = self.weights[i] - self.learning_rate * self.v[f'dW{i}']
        self.biases[i] = self.biases[i] - self.learning_rate * self.v[f'db{i}']
```
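For reference, the same momentum update can be sketched in isolation on plain NumPy arrays (the array sizes and gradient values here are made up for illustration; only the update rule matches the code above):

```python
import numpy as np

def momentum_step(w, dw, v, learning_rate=0.1, beta1=0.9):
    """One momentum update: v is an exponential moving average of gradients."""
    v = beta1 * v + (1 - beta1) * dw
    w = w - learning_rate * v
    return w, v

# with a constant gradient, the velocity v ramps up toward the gradient value
w = np.zeros(3)
v = np.zeros(3)
dw = np.ones(3)  # constant gradient for illustration
for _ in range(5):
    w, v = momentum_step(w, dw, v)
```

With a constant gradient of 1, the velocity after t steps is 1 - beta1**t, which is why momentum accelerates along directions where gradients consistently agree.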
I’ve tried everything I could come up with to implement both RMSprop and Adam, with no success. The code is below. Any help on why it isn’t working would be much appreciated!
```python
# only rms
elif name == 'rms':
    # calculate rmsprop for every layer
    for i in range(self.number_of_layers - 1):
        self.s[f'dW{i}'] = beta2 * self.s[f'dW{i}'] + (1 - beta2) * self.gradients[f'dW{i}']**2
        self.s[f'db{i}'] = beta2 * self.s[f'db{i}'] + (1 - beta2) * self.gradients[f'db{i}']**2
    # update parameters
    for i in range(self.number_of_layers - 1):
        self.weights[i] = self.weights[i] - self.learning_rate * self.gradients[f'dW{i}'] / (np.sqrt(self.s[f'dW{i}']) + epsilon)
        self.biases[i] = self.biases[i] - self.learning_rate * self.gradients[f'db{i}'] / (np.sqrt(self.s[f'db{i}']) + epsilon)
```
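As a sanity check of the update rule itself, here is a minimal standalone RMSprop step on NumPy arrays (values are illustrative, not from the model above). The point of dividing by the running root-mean-square is that parameters with very different gradient magnitudes end up with similar effective step sizes:

```python
import numpy as np

def rmsprop_step(w, dw, s, lr=0.01, beta2=0.999, eps=1e-8):
    """One RMSprop update: s is a moving average of squared gradients."""
    s = beta2 * s + (1 - beta2) * dw ** 2
    w = w - lr * dw / (np.sqrt(s) + eps)
    return w, s

# two gradients four orders of magnitude apart take nearly identical steps
w = np.zeros(2)
s = np.zeros(2)
dw = np.array([100.0, 0.01])
w, s = rmsprop_step(w, dw, s)
```

On the first step (starting from s = 0), the update is approximately lr * sign(dw) / sqrt(1 - beta2) for both components, regardless of the raw gradient scale.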
```python
# adam optimizer
elif name == 'adam':
    # counter
    # this resets every time an epoch finishes
    self.t += 1
    # loop through layers
    for i in range(self.number_of_layers - 1):
        # calculate v and s
        self.v[f'dW{i}'] = beta1 * self.v[f'dW{i}'] + (1 - beta1) * self.gradients[f'dW{i}']
        self.v[f'db{i}'] = beta1 * self.v[f'db{i}'] + (1 - beta1) * self.gradients[f'db{i}']
        self.s[f'dW{i}'] = beta2 * self.s[f'dW{i}'] + (1 - beta2) * np.square(self.gradients[f'dW{i}'])
        self.s[f'db{i}'] = beta2 * self.s[f'db{i}'] + (1 - beta2) * np.square(self.gradients[f'db{i}'])
        # bias correction
        self.v1[f'dW{i}'] = self.v[f'dW{i}'] / (1 - beta1**self.t)
        self.v1[f'db{i}'] = self.v[f'db{i}'] / (1 - beta1**self.t)
        self.s1[f'dW{i}'] = self.s[f'dW{i}'] / (1 - beta2**self.t)
        self.s1[f'db{i}'] = self.s[f'db{i}'] / (1 - beta2**self.t)
    # update parameters
    for i in range(self.number_of_layers - 1):
        self.weights[i] = self.weights[i] - self.learning_rate * np.divide(self.v1[f'dW{i}'], (np.sqrt(self.s1[f'dW{i}']) + epsilon))
        self.biases[i] = self.biases[i] - self.learning_rate * np.divide(self.v1[f'db{i}'], (np.sqrt(self.s1[f'db{i}']) + epsilon))
```
```python
# additional information
# epsilon = 1e-8
# beta1 = 0.9
# beta2 = 0.999
```
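One detail worth double-checking is the comment that `self.t` resets every epoch: Adam's bias correction assumes t is the total number of updates taken so far, so a counter that resets would repeatedly re-amplify the early updates. Below is a minimal standalone Adam step with an always-increasing counter (all names are illustrative and not taken from the class above):

```python
import numpy as np

def adam_step(w, dw, v, s, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t must be the total step count, never reset per epoch."""
    v = beta1 * v + (1 - beta1) * dw             # first moment (EMA of gradients)
    s = beta2 * s + (1 - beta2) * np.square(dw)  # second moment (EMA of squared gradients)
    v_hat = v / (1 - beta1 ** t)                 # bias correction for zero-initialized v
    s_hat = s / (1 - beta2 ** t)                 # bias correction for zero-initialized s
    w = w - lr * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

w = np.zeros(2)
v = np.zeros(2)
s = np.zeros(2)
dw = np.array([1.0, -1.0])
for t in range(1, 6):  # t grows monotonically across all epochs
    w, v, s = adam_step(w, dw, v, s, t)
```

With a constant gradient, the bias-corrected moments satisfy v_hat = dw and s_hat = dw**2 exactly, so each parameter moves by roughly lr per step in the direction opposing its gradient.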