# 17 Large Scale Machine Learning

## Learning With Large Datasets

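Before collecting a very large dataset, draw a learning curve on a small subset to determine whether more data is needed: a high-variance model benefits from more data, while a high-bias model does not.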

Stochastic gradient descent (SGD) uses the cost of a single training example: $$cost(\theta, (x^{(i)}, y^{(i)})) = \frac {1}{2}(h_{\theta}(x^{(i)})-y^{(i)})^2$$

1. Randomly shuffle the data set.
2. Take a small gradient descent step using just a single training example, then move on to the next example.
3. An individual step may head in a bad direction; the parameters generally move toward the global minimum, but not always.
4. The algorithm ends up wandering continuously in a region close to the global minimum rather than settling exactly on it.
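The steps above can be sketched for linear regression as follows. This is a minimal illustration, not a reference implementation: the function names, the toy data generated from $y = 1 + 2x$, and the values of `alpha` and `epochs` are all assumptions chosen for the example.

```python
import random

def predict(theta, x):
    """Linear hypothesis h_theta(x); x carries a leading 1.0 for the bias term."""
    return sum(t * xi for t, xi in zip(theta, x))

def cost(theta, x, y):
    """Per-example cost: 0.5 * (h_theta(x) - y)^2."""
    return 0.5 * (predict(theta, x) - y) ** 2

def sgd(examples, alpha=0.05, epochs=100, seed=0):
    """Stochastic gradient descent: one parameter update per training example."""
    rng = random.Random(seed)
    data = list(examples)
    theta = [0.0] * len(data[0][0])
    for _ in range(epochs):
        rng.shuffle(data)                      # step 1: randomly shuffle
        for x, y in data:                      # step 2: use a single example
            error = predict(theta, x) - y
            theta = [t - alpha * error * xi for t, xi in zip(theta, x)]
    return theta

# Hypothetical noise-free toy data drawn from y = 1 + 2x.
examples = [([1.0, float(v)], 1.0 + 2.0 * v) for v in range(-3, 4)]
theta = sgd(examples)
```

Because the toy data has no noise, the parameters settle very close to the generating values; with noisy data they would keep wandering near the minimum, as step 4 describes.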

In batch gradient descent we use all m examples in each iteration.
In stochastic gradient descent we use a single example in each iteration.
Mini-batch gradient descent sits somewhere in between: it uses b examples in each iteration, where b is a small batch size.
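One way to see the relationship between the three variants is a single routine whose batch size is a parameter. The sketch below assumes linear regression with squared error; `minibatch_gd`, `batch_size`, and the toy data from $y = 3 - x$ are hypothetical names and values picked for illustration.

```python
import random

def minibatch_gd(examples, batch_size=2, alpha=0.1, epochs=200, seed=0):
    """Gradient descent that averages the gradient over batch_size examples.

    batch_size=1 behaves like stochastic gradient descent and
    batch_size=len(examples) behaves like batch gradient descent.
    """
    rng = random.Random(seed)
    data = list(examples)
    n = len(data[0][0])
    theta = [0.0] * n
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad = [0.0] * n
            for x, y in batch:
                error = sum(t * xi for t, xi in zip(theta, x)) - y
                for j, xj in enumerate(x):
                    grad[j] += error * xj
            # Average over the batch (the last batch may be smaller).
            theta = [t - alpha * g / len(batch) for t, g in zip(theta, grad)]
    return theta

# Hypothetical noise-free toy data drawn from y = 3 - x.
examples = [([1.0, float(v)], 3.0 - 1.0 * v) for v in range(-3, 4)]
theta_sgd = minibatch_gd(examples, batch_size=1)
theta_mini = minibatch_gd(examples, batch_size=2)
theta_batch = minibatch_gd(examples, batch_size=len(examples))
```

On this small problem all three settings reach roughly the same parameters; the practical difference lies in how much data each update touches.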

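To make stochastic gradient descent converge instead of wandering around the minimum, the learning rate can be decreased slowly over time:

$$\alpha = \frac {const1}{iterationNumber + const2}$$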
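The formula for $\alpha$ translates directly into a small function. The default values of `const1` and `const2` below are arbitrary assumptions; in practice both constants would have to be tuned.

```python
def learning_rate(iteration, const1=1.0, const2=10.0):
    """Decaying learning rate: alpha = const1 / (iteration + const2)."""
    return const1 / (iteration + const2)

# The rate shrinks as the iteration number grows.
rates = [learning_rate(t) for t in (0, 10, 90, 990)]
```

The trade-off is two extra constants to tune in exchange for steps that shrink as training proceeds.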

To check convergence, we can average the cost over the last 1000 examples or so, computing each example's cost just before updating $\theta$ with it, and plot these averages. The plot shows whether stochastic gradient descent is converging and also helps tune the learning rate $\alpha$.
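A minimal sketch of this monitoring idea, assuming linear regression and a synthetic stream: the cost of each example is recorded before the update, and one average is stored per window. The names `sgd_with_monitoring` and `window`, and the data generated from $y = 2 + 3x$ plus noise, are assumptions for illustration.

```python
import random

def sgd_with_monitoring(stream, n_features, alpha=0.01, window=1000):
    """Run SGD and record the average cost over each block of `window` examples."""
    theta = [0.0] * n_features
    averages = []
    running = 0.0
    for count, (x, y) in enumerate(stream, start=1):
        error = sum(t * xi for t, xi in zip(theta, x)) - y
        running += 0.5 * error ** 2            # cost measured before the update
        theta = [t - alpha * error * xi for t, xi in zip(theta, x)]
        if count % window == 0:
            averages.append(running / window)
            running = 0.0
    return theta, averages

rng = random.Random(0)

def make_example():
    """Hypothetical noisy example drawn from y = 2 + 3x."""
    v = rng.uniform(-1, 1)
    return [1.0, v], 2.0 + 3.0 * v + rng.gauss(0, 0.1)

stream = [make_example() for _ in range(5000)]
theta, averages = sgd_with_monitoring(stream, n_features=2)
```

Plotting `averages` against the window index gives the convergence curve; a curve that fails to go down suggests trying a smaller `alpha` or a larger window.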

## Online Learning

The online learning setting lets us model problems where data arrives as a continuous stream and we would like an algorithm to learn from it.

We update the parameters using each example as it arrives and then discard that example.

If you really have a continuous stream of data, then an online learning algorithm can be very effective.

If you have a changing pool of users, or if the thing you are trying to predict changes slowly (for example, user tastes drifting over time), an online learning algorithm can gradually adapt the learned hypothesis to the latest user behavior.
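The sketch below illustrates the learn-then-discard loop with logistic regression, which is one possible choice of model rather than the only one. The class name, the synthetic stream, and the way user preference flips between the two phases are all hypothetical.

```python
import math
import random

class OnlineLogisticRegression:
    """Logistic regression updated one example at a time."""

    def __init__(self, n_features, alpha=0.1):
        self.theta = [0.0] * n_features
        self.alpha = alpha

    def predict_proba(self, x):
        z = sum(t * xi for t, xi in zip(self.theta, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """Take one gradient step on (x, y); the caller then discards the example."""
        error = self.predict_proba(x) - y
        self.theta = [t - self.alpha * error * xi
                      for t, xi in zip(self.theta, x)]

def run_phase(model, rng, steps, likes_positive):
    """Simulated stream: users respond (y = 1) depending on the sign of feature v."""
    for _ in range(steps):
        v = rng.uniform(-1, 1)
        x = [1.0, v]
        y = 1 if (v > 0) == likes_positive else 0
        model.update(x, y)                     # learn, then throw the example away

rng = random.Random(0)
model = OnlineLogisticRegression(n_features=2)
run_phase(model, rng, steps=5000, likes_positive=True)
before = model.predict_proba([1.0, 0.8])
run_phase(model, rng, steps=5000, likes_positive=False)   # tastes change
after = model.predict_proba([1.0, 0.8])
```

No example is stored at any point, and the prediction for the same input moves to follow the change in simulated behavior.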

## Map Reduce and Data Parallelism

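In MapReduce, we split the training set into subsets and send each subset to a different machine. Each machine computes the sum of gradient terms over its own subset, and a central server adds the partial sums together and performs the parameter update.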

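The parallelism can come from several places:

• multiple cores on a single machine
• multiple machines
• numerical linear algebra libraries that parallelize vectorized operations automatically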
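The splitting idea in this section can be sketched as a map step that computes a partial gradient sum per subset and a reduce step that adds the partial sums. In this illustration the "machines" are simulated sequentially in one process; `partial_gradient`, `split`, `mapreduce_gradient`, and the toy data are hypothetical, and a real deployment would dispatch the map calls to separate cores or hosts.

```python
def partial_gradient(theta, chunk):
    """Map step: sum of gradient terms over one subset of the training set."""
    grad = [0.0] * len(theta)
    for x, y in chunk:
        error = sum(t * xi for t, xi in zip(theta, x)) - y
        for j, xj in enumerate(x):
            grad[j] += error * xj
    return grad

def split(examples, n_parts):
    """Divide the training set into n_parts roughly equal subsets."""
    return [examples[i::n_parts] for i in range(n_parts)]

def mapreduce_gradient(theta, examples, n_machines=4):
    chunks = split(examples, n_machines)
    # Map: each simulated machine handles its own subset.
    partials = [partial_gradient(theta, chunk) for chunk in chunks]
    # Reduce: add the partial sums, then average over all examples.
    total = [sum(column) for column in zip(*partials)]
    return [g / len(examples) for g in total]

# Hypothetical toy data drawn from y = 1 + 2x.
examples = [([1.0, float(v)], 1.0 + 2.0 * v) for v in range(8)]
theta = [0.0, 0.0]
combined = mapreduce_gradient(theta, examples)
single = [g / len(examples) for g in partial_gradient(theta, examples)]
```

The combined gradient matches the one computed on a single machine, which is what makes the split safe: the batch gradient is a sum over examples, so it can be computed in pieces.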