14. What is the difference between batch and stochastic gradient descent?

Batch vs Stochastic Gradient Descent

Gradient Descent is an optimization algorithm often used in machine learning to minimize a cost function by iteratively moving towards the minimum value. There are different variants of gradient descent, the most common being Batch Gradient Descent and Stochastic Gradient Descent (SGD).
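
Both variants apply the same parameter update rule, moving the parameters a small step against the gradient of the cost function; they differ only in how much data is used to estimate that gradient before each step. As a rough sketch in Python (the grad_fn callable here is a placeholder assumption, standing in for whatever computes the gradient):

def gradient_descent_step(theta, grad_fn, learning_rate=0.01):
    # Generic update shared by both variants: step against the gradient.
    # grad_fn is assumed to return the gradient of the cost function at theta;
    # batch and stochastic GD differ only in how that gradient is estimated.
    return theta - learning_rate * grad_fn(theta)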

Batch Gradient Descent

  • Definition: In Batch Gradient Descent, the entire dataset is used to compute the gradient of the cost function, so the parameters are updated once per full pass over the data (a brief sketch of a single batch update follows this list).
  • Advantages:
    • The gradient is computed exactly over the full dataset, so updates are smooth and convergence is stable, reaching the global minimum when the cost function is convex.
    • Efficient use of vectorized operations in computational libraries.
  • Disadvantages:
    • Requires processing (and often holding) the entire dataset for every update, which is computationally expensive for very large datasets.
    • Updates are infrequent, since the gradient must be calculated over the whole dataset before each step, so convergence can be slow.
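
For mean squared error linear regression, as used in the code example further below, the batch gradient is proportional to X.T.dot(X.dot(theta) - y) / m, averaged over all m examples. A minimal, self-contained sketch of a single batch update (the toy data here is an assumption of this sketch, mirroring the full example below):

import numpy as np

# Assumed toy data, mirroring the full example in the Code Examples section
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([3.0, 7.0, 11.0])
theta = np.zeros(2)
learning_rate = 0.01

# One batch update: the gradient is averaged over every example before stepping
gradient = X.T.dot(X.dot(theta) - y) / len(y)
theta = theta - learning_rate * gradient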

Stochastic Gradient Descent

  • Definition: In Stochastic Gradient Descent, the parameters are updated for each training example, one at a time, rather than after a full pass over the dataset (see the sketch of a single stochastic pass after this list).
  • Advantages:
    • Makes faster progress on large datasets, since the parameters are updated far more frequently.
    • Can potentially escape local minima due to its noisy updates.
  • Disadvantages:
    • The loss fluctuates from update to update, which can make convergence less stable.
    • Each step is based on a single example, so individual updates are noisy estimates of the true gradient.
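
In practice the training examples are usually shuffled before each pass so that the noisy updates do not follow a fixed order. A minimal sketch of one such stochastic pass, using the same assumed toy data as above (the shuffling is an addition of this sketch, not part of the original example):

import numpy as np

# Assumed toy data, mirroring the full example in the Code Examples section
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([3.0, 7.0, 11.0])
theta = np.zeros(2)
learning_rate = 0.01

# One stochastic pass: visit the examples in random order and update after each one
for i in np.random.permutation(len(y)):
    xi, yi = X[i:i+1], y[i:i+1]
    gradient = xi.T.dot(xi.dot(theta) - yi)
    theta = theta - learning_rate * gradient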

Code Examples

Here is a simple example of both methods using Python and NumPy:

import numpy as np

# Dummy data
X = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array([3, 7, 11])

# Parameters
theta = np.zeros(2)
learning_rate = 0.01

# Batch Gradient Descent: one update per full pass over the dataset
for _ in range(100):
    gradient = X.T.dot(X.dot(theta) - y) / len(y)
    theta -= learning_rate * gradient
print("Batch GD Theta:", theta)

# Stochastic Gradient Descent: one update per individual training example
theta = np.zeros(2)
for _ in range(100):
    for i in range(len(y)):
        xi = X[i:i+1]
        yi = y[i:i+1]
        gradient = xi.T.dot(xi.dot(theta) - yi)
        theta -= learning_rate * gradient
print("SGD Theta:", theta)

In this example, you can see how Batch Gradient Descent uses the entire dataset for each update, while Stochastic Gradient Descent updates the parameters with each individual data point. The choice between these methods often depends on the dataset size and computational resources.
