14. What is the difference between batch and stochastic gradient descent?

Batch vs Stochastic Gradient Descent

Gradient Descent is an optimization algorithm often used in machine learning to minimize a cost function by iteratively moving towards the minimum value. There are different variants of gradient descent, the most common being Batch Gradient Descent and Stochastic Gradient Descent (SGD).
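
Both variants apply the same parameter update rule, moving the parameters a small step against the gradient of the cost function; they differ only in how much data is used to estimate that gradient before each step. As a rough sketch in Python (the grad_fn callable here is a placeholder assumption, standing in for whatever computes the gradient):

def gradient_descent_step(theta, grad_fn, learning_rate=0.01):
    # Generic update shared by both variants: step against the gradient.
    # grad_fn is assumed to return the gradient of the cost function at theta;
    # batch and stochastic GD differ only in how that gradient is estimated.
    return theta - learning_rate * grad_fn(theta)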

Batch Gradient Descent

  • Definition: In Batch Gradient Descent, the entire dataset is used to compute the gradient of the cost function, so the parameters are updated once per full pass over the data (a brief sketch of a single batch update follows this list).
  • Advantages:
    • The gradient is computed exactly over the full dataset, so updates are smooth and convergence is stable, reaching the global minimum when the cost function is convex.
    • Efficient use of vectorized operations in computational libraries.
  • Disadvantages:
    • Requires processing (and often holding) the entire dataset for every update, which is computationally expensive for very large datasets.
    • Updates are infrequent, since the gradient must be calculated over the whole dataset before each step, so convergence can be slow.
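
For mean squared error linear regression, as used in the code example further below, the batch gradient is proportional to X.T.dot(X.dot(theta) - y) / m, averaged over all m examples. A minimal, self-contained sketch of a single batch update (the toy data here is an assumption of this sketch, mirroring the full example below):

import numpy as np

# Assumed toy data, mirroring the full example in the Code Examples section
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([3.0, 7.0, 11.0])
theta = np.zeros(2)
learning_rate = 0.01

# One batch update: the gradient is averaged over every example before stepping
gradient = X.T.dot(X.dot(theta) - y) / len(y)
theta = theta - learning_rate * gradient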

Stochastic Gradient Descent

  • Definition: In Stochastic Gradient Descent, the parameters are updated for each training example, one at a time, rather than after a full pass over the dataset (see the sketch of a single stochastic pass after this list).
  • Advantages:
    • Makes faster progress on large datasets, since the parameters are updated far more frequently.
    • Can potentially escape local minima due to its noisy updates.
  • Disadvantages:
    • The loss fluctuates from update to update, which can make convergence less stable.
    • Each step is based on a single example, so individual updates are noisy estimates of the true gradient.
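
In practice the training examples are usually shuffled before each pass so that the noisy updates do not follow a fixed order. A minimal sketch of one such stochastic pass, using the same assumed toy data as above (the shuffling is an addition of this sketch, not part of the original example):

import numpy as np

# Assumed toy data, mirroring the full example in the Code Examples section
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([3.0, 7.0, 11.0])
theta = np.zeros(2)
learning_rate = 0.01

# One stochastic pass: visit the examples in random order and update after each one
for i in np.random.permutation(len(y)):
    xi, yi = X[i:i+1], y[i:i+1]
    gradient = xi.T.dot(xi.dot(theta) - yi)
    theta = theta - learning_rate * gradient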

Code Examples

Here is a simple example of both methods using Python and NumPy:

import numpy as np

# Dummy data
X = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array([3, 7, 11])

# Parameters
theta = np.zeros(2)
learning_rate = 0.01

# Batch Gradient Descent: one update per full pass over the dataset
for _ in range(100):
    gradient = X.T.dot(X.dot(theta) - y) / len(y)
    theta -= learning_rate * gradient
print("Batch GD Theta:", theta)

# Stochastic Gradient Descent: one update per individual training example
theta = np.zeros(2)
for _ in range(100):
    for i in range(len(y)):
        xi = X[i:i+1]
        yi = y[i:i+1]
        gradient = xi.T.dot(xi.dot(theta) - yi)
        theta -= learning_rate * gradient
print("SGD Theta:", theta)

In this example, you can see how Batch Gradient Descent uses the entire dataset for each update, while Stochastic Gradient Descent updates the parameters with each individual data point. The choice between these methods often depends on the dataset size and computational resources.
