1 ML strategy (1)

Why ML Strategy?

How to improve your system :

• collect more data
• collect a more diverse training set (for example, more varied poses or more diverse negative examples)
• train the algorithm longer
• try a different optimization algorithm
• try a bigger or a smaller network
• try dropout or L2 regularization
• change the network architecture
• change the activation functions
• change the number of hidden units, and so on

The problem is that if you choose poorly, you can end up spending six months charging in some direction only to realize that it did no good. So we need a number of strategies, that is, ways of analyzing a machine learning problem that point you toward the most promising things to try.

Orthogonalization

You must be very clear-eyed about what to tune in order to try to achieve one effect. This is a process we call orthogonalization.

To see what orthogonalization means, think of driving a car: one dimension of what you want to do is controlling the steering angle, and another is controlling your speed. You want one control that affects only the steering angle and another that affects only the speed.

Orthogonal means at 90 degrees to each other. Having orthogonal controls that are ideally aligned with the things you actually want to control makes it much easier to tune the knobs you have to tune.

Chain of assumptions in ML

• Fit training set well on cost function
• Fit dev set well on cost function
• Fit test set well on cost function
• Performs well in real world

Single number evaluation metric

Whether you’re tuning hyperparameters, trying out different ideas for learning algorithms, or just trying out different options for building your machine learning system, you’ll find that your progress is much faster if you have a single real number evaluation metric that lets you quickly tell whether the new thing you just tried is working better or worse than your last idea.

One reasonable way to evaluate the performance of a classifier is to look at its precision and recall.

It turns out that there’s often a tradeoff between precision and recall, and you care about both.

$$F_1 = \frac{2}{\frac{1}{P} + \frac{1}{R}}$$

And in mathematics, this function is called the harmonic mean of precision P and recall R.

A well-defined dev set, which is how you measure precision and recall, plus a single number evaluation metric allows you to quickly tell whether classifier A or classifier B is better.
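As a minimal sketch of the formula above, the following computes precision, recall and F1 for binary labels and compares two classifiers on the same dev set. The function name, the labels and the predictions are all made up for illustration; returning 0.0 when a denominator is zero is a convention assumed here, not something the notes specify.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: an undefined ratio is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 or recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 / (1 / precision + 1 / recall)  # harmonic mean of P and R
    return precision, recall, f1


# Hypothetical dev-set labels and predictions from two classifiers.
y_dev = [1, 1, 1, 1, 0, 0, 0, 0]
pred_a = [1, 1, 1, 0, 1, 0, 0, 0]
pred_b = [1, 1, 0, 0, 0, 0, 0, 0]

for name, pred in [("A", pred_a), ("B", pred_b)]:
    p, r, f1 = precision_recall_f1(y_dev, pred)
    print(f"classifier {name}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

In this toy data classifier B has the higher precision and classifier A the higher recall, so neither number alone settles the comparison; F1 gives a single value to rank them by.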

Satisficing and optimizing metrics

It’s not always easy to combine all the things you care about into a single real number evaluation metric. In those cases it is sometimes useful to set up satisficing as well as optimizing metrics.

If you have N metrics that you care about, it’s sometimes reasonable to pick one of them to be the optimizing metric, so you want to do as well as possible on that one. The other N minus 1 are satisficing metrics: each only has to reach some threshold, such as a running time under 100 milliseconds. Beyond that threshold you don’t care how much better it is, but it has to reach the threshold.

If there are multiple things you care about, choose one as the optimizing metric that you want to do as well as possible on, and one or more as satisficing metrics, where you’ll be satisfied as long as the system does better than some threshold. You then have an almost automatic way of quickly looking at multiple candidates and picking the “best” one.
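The selection rule can be sketched in a few lines, assuming accuracy is the optimizing metric and running time is the only satisficing metric. The candidate names, their numbers and the `pick_best` helper are hypothetical; only the 100 millisecond threshold comes from the text.

```python
def pick_best(candidates, max_runtime_ms=100.0):
    """Pick the most accurate candidate among those that meet the runtime threshold."""
    # Satisficing metric: runtime only has to be within the threshold.
    feasible = [c for c in candidates if c["runtime_ms"] <= max_runtime_ms]
    if not feasible:
        return None  # nothing satisfies the constraint
    # Optimizing metric: among feasible candidates, maximize accuracy.
    return max(feasible, key=lambda c: c["accuracy"])


# Hypothetical classifiers with made-up metrics.
candidates = [
    {"name": "A", "accuracy": 0.90, "runtime_ms": 80.0},
    {"name": "B", "accuracy": 0.92, "runtime_ms": 95.0},
    {"name": "C", "accuracy": 0.95, "runtime_ms": 1500.0},
]

best = pick_best(candidates)
print(best["name"])  # B
```

Candidate C is the most accurate but misses the threshold, so it is never considered; how far A and B sit under the threshold plays no role in the ranking.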

Train/dev/test distributions

The dev set is also called the development set, or sometimes called the hold out cross validation set.

Make your dev and test sets come from the same distribution. Take all of the data and randomly shuffle it into the dev and test sets, so that both sets have data from all eight regions and really come from the same distribution.
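A minimal sketch of that pooling and shuffling step, assuming the data arrives grouped by region in a dictionary. The `make_dev_test` helper, the region names, the even 50/50 split and the fixed seed are illustrative assumptions, not part of the notes.

```python
import random


def make_dev_test(examples_by_region, seed=0):
    """Pool examples from every region, shuffle them, and split evenly into dev and test."""
    pooled = [ex for examples in examples_by_region.values() for ex in examples]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(pooled)
    half = len(pooled) // 2
    return pooled[:half], pooled[half:]


# Hypothetical data: 8 regions with 100 examples each, tagged with their region.
data = {f"region_{k}": [(f"region_{k}", i) for i in range(100)] for k in range(8)}
dev, test = make_dev_test(data)

print(len(dev), len(test))
print(sorted({region for region, _ in dev}))
```

Splitting by region instead (say, four regions for dev and four for test) would give the two sets different distributions, which is the situation the text warns against.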

Machine learning teams are often very good at shooting different arrows at a target and iterating to get closer and closer to hitting the bullseye. If the dev set puts the target in one place and the test set puts it in a totally different location, that is a very frustrating experience for the team. So choose data that reflects what you expect to get in the future and consider important to do well on, and, whatever that data is, put it into both your dev set and your test set.

Setting up the dev set, as well as the evaluation metric, is really defining what target you want to aim at.

Size of dev and test sets

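• train and test sets : 70/30 (older rule of thumb for small data sets)
• train, dev and test sets : 60/20/20 (older rule of thumb for small data sets)
• train, dev and test sets : 98/1/1 (reasonable when the data set is very large, for example around a million examples)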
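To make the ratios concrete, here is a small sketch that turns whole-number percentages into set sizes. The `split_sizes` helper and the data set sizes of 10,000 and 1,000,000 are assumptions chosen for illustration.

```python
def split_sizes(m, percents):
    """Return integer set sizes for m examples given whole-number percentages."""
    if sum(percents) != 100:
        raise ValueError("percentages must sum to 100")
    sizes = [m * p // 100 for p in percents]
    sizes[0] += m - sum(sizes)  # give any rounding remainder to the first (training) set
    return sizes


print(split_sizes(10_000, (60, 20, 20)))   # [6000, 2000, 2000]
print(split_sizes(1_000_000, (98, 1, 1)))  # [980000, 10000, 10000]
```

With a million examples, 1% still leaves 10,000 examples each for dev and test, which is why such a small fraction can be enough.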

Size of test set

Set your test set to be big enough to give high confidence in the overall performance of your system.

When to change dev/test sets and metrics

Sometimes partway through a project you might realize you put your target in the wrong place. In that case you should move your target.

Misclassification error metric : $$Error = \frac{1}{m_{dev}} \sum _{i=1}^{m_{dev}}I\{y_{pred}^{(i)} \neq y^{(i)}\}$$

One way to change this evaluation metric is to add a weight $w^{(i)}$ that is larger for examples whose misclassification is more costly : $$Error = \frac{1}{m_{dev}} \sum _{i=1}^{m_{dev}}w^{(i)}I\{y_{pred}^{(i)} \neq y^{(i)}\}$$

To keep the error between zero and one, the normalization constant becomes the sum over i of $w^{(i)}$ : $$Error = \frac{1}{\sum _{i=1}^{m_{dev}} w^{(i)}} \sum _{i=1}^{m_{dev}}w^{(i)}I\{y_{pred}^{(i)} \neq y^{(i)}\}$$
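The normalized weighted error can be sketched as follows. The function name, the labels and the weight of 10 on one example are illustrative assumptions; with all weights equal to 1 the function reduces to the plain misclassification error.

```python
def weighted_error(y_true, y_pred, weights=None):
    """Weighted misclassification error, normalized by the sum of the weights."""
    if weights is None:
        weights = [1.0] * len(y_true)  # unweighted case: plain misclassification error
    total = sum(weights)
    wrong = sum(w for t, p, w in zip(y_true, y_pred, weights) if t != p)
    return wrong / total


# Hypothetical dev set with one mistake (the second example).
y_dev = [1, 0, 1, 0]
y_pred = [1, 1, 1, 0]

print(weighted_error(y_dev, y_pred))                 # 0.25
print(weighted_error(y_dev, y_pred, [1, 10, 1, 1]))  # 10/13, about 0.77
```

Giving the misclassified example a weight of 10 makes the same single mistake count for far more of the total error, while the result still stays between zero and one.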

The goal of the evaluation metric is to accurately tell you, given two classifiers, which one is better for your application.

If you’re not satisfied with your old error metric then don’t keep coasting with an error metric you’re unsatisfied with, instead try to define a new one that you think better captures your preferences in terms of what’s actually a better algorithm.

Take a machine learning problem and break it into distinct steps.

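• placing the target (defining the metric and the dev/test sets)
• shooting at the target (doing well on that metric)

This follows the philosophy of orthogonalization: treat the two as separate steps.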

If doing well on your metric and your current dev set (or dev and test sets’ distribution) does not correspond to doing well on the application you actually care about, then change your metric and your dev/test set.

The overall guideline is: if your current metric and the data you are evaluating on don’t correspond to doing well on what you actually care about, then change your metric and/or your dev/test set to better capture what you need your algorithm to actually do well on.

Even if you can’t define the perfect evaluation metric and dev set, just set something up quickly and use that to drive the speed of your team iterating.

And if later down the line you find out that it wasn’t a good one, or you have a better idea, change it at that time. That’s perfectly okay.

Why human-level performance?

1. With advances in deep learning, machine learning algorithms are suddenly working much better, so it has become much more feasible in many application areas for them to become competitive with human-level performance.
2. The workflow of designing and building a machine learning system is much more efficient when you’re trying to do something that humans can also do.

Over time, as you keep training the algorithm, maybe with bigger and bigger models on more and more data, the performance approaches but never surpasses some theoretical limit, which is called the Bayes optimal error. Think of Bayes optimal error as the best possible error. Bayes optimal error, or Bayesian optimal error, or sometimes Bayes error for short, is the error of the very best theoretical function for mapping from x to y, and it can never be surpassed.

It turns out that progress is often quite fast until you surpass human level performance. And it sometimes slows down after you surpass human level performance.

1. One reason is that human-level performance is, for many tasks, not that far from Bayes optimal error.
2. So long as your performance is worse than human-level performance, there are tools you can use to improve performance that are harder to use once you’ve surpassed it.
• For tasks that humans are good at, so long as your machine learning algorithm is still worse than the human, you can get labeled data from humans. That is, you can ask or hire people to label examples for you so that you have more data to feed your learning algorithm.

Knowing how well humans can do on a task helps you understand how much you should try to reduce bias and how much you should try to reduce variance.

Avoidable bias

If there’s a huge gap between how well your algorithm does on your training set and how well humans do, it shows that your algorithm isn’t even fitting the training set well. So in terms of tools to reduce bias or variance, in this case I would say focus on reducing bias.

In another case, even though your training error and dev error are the same as in the other example, you may actually be doing just fine on the training set, only a little worse than human-level performance. You would then want to focus on reducing the other component, the variance in your learning algorithm.

Think of human-level error as a proxy or an estimate for Bayes error (Bayes optimal error). For computer vision tasks this is a pretty reasonable proxy, because humans are very good at computer vision, so whatever a human can do is probably not too far from Bayes error.

The difference between Bayes error or approximation of Bayes error and the training error is the avoidable bias.

The difference between your training error and the dev error is a measure of the variance problem of your algorithm.
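As a worked illustration with made-up numbers (assumptions, not figures from these notes), suppose training error is 8% and dev error is 10%. If human-level error, used as the estimate of Bayes error, is 1%:

$$\text{avoidable bias} = 8\% - 1\% = 7\%, \qquad \text{variance} = 10\% - 8\% = 2\%$$

Here the avoidable bias is the larger gap, so bias reduction looks more promising. If human-level error were instead 7.5%, with the same training and dev errors:

$$\text{avoidable bias} = 8\% - 7.5\% = 0.5\%, \qquad \text{variance} = 10\% - 8\% = 2\%$$

Now the variance is the larger gap, so variance reduction is the better bet.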

Understanding human-level performance

The value of human-level error is that it gives us a way of estimating Bayes error: the best possible error any function could achieve, either now or in the future.

How should you define human-level error?

Be clear about what your purpose is in defining the term human-level error. If the goal is a proxy for Bayes error, the lowest error that humans can achieve (for example, a team of humans rather than one human) is the appropriate definition.

The gap between Bayes error (or its estimate) and training error is a measure of the avoidable bias. The gap between training error and dev error is a measure, or an estimate, of how much of a variance problem you have in your learning algorithm.

• The difference between your estimate of Bayes error and your training error tells you how much avoidable bias there is.
• And the difference between training error and dev error, that tells you how much variance is a problem, whether your algorithm’s able to generalize from the training set to the dev set.

A better estimate for Bayes error can help you better estimate avoidable bias and variance. And therefore make better decisions on whether to focus on bias reduction tactics, or on variance reduction tactics.
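The effect of the choice of proxy can be sketched with a small comparison. All error rates below are hypothetical percentages, and the `bias_variance_gaps` and `focus` helpers are names invented for this example; the "larger gap wins" rule is a simplification of the judgment described above.

```python
def bias_variance_gaps(bayes_estimate, train_error, dev_error):
    """Return (avoidable_bias, variance) as differences of error rates in percent."""
    return train_error - bayes_estimate, dev_error - train_error


def focus(bayes_estimate, train_error, dev_error):
    """Name the larger of the two gaps; a tie is reported as 'either'."""
    bias, variance = bias_variance_gaps(bayes_estimate, train_error, dev_error)
    if bias > variance:
        return "bias"
    if variance > bias:
        return "variance"
    return "either"


# Hypothetical error rates in percent; none of these numbers come from the notes.
train_error, dev_error = 0.7, 0.8
for label, proxy in [("one human", 0.7), ("team of humans", 0.5)]:
    print(f"Bayes proxy = {label} ({proxy}%): focus on {focus(proxy, train_error, dev_error)}")
```

With the same training and dev errors, switching the proxy from one human to a team of humans moves the suggested focus from variance to bias, which is why the definition of human-level error matters once the algorithm is close to human performance.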

Surpassing human-level performance

Surpassing human-level performance :

• Team of humans
• One human
• Training error
• Dev error

If your error is already better than even a team of humans looking at, discussing and debating the right label, then it’s also harder to rely on human intuition to tell you in what ways your algorithm could still improve.

Humans tend to be very good at natural perception tasks. So it is possible, but just a bit harder, for computers to surpass human-level performance on natural perception tasks.

Problems where ML significantly surpasses human-level performance :

• Product recommendations
• Logistics (predicting transit time)
• Loan approvals

All of these are also problems where teams have access to huge amounts of data. The best systems for these applications have probably looked at far more data of that application than any human could possibly look at, and that has made it relatively easy for a computer to surpass human-level performance.

Improving your model performance

The two fundamental assumptions of supervised learning

• You can fit the training set pretty well
• The training set performance generalizes pretty well to the dev/test set

Reducing (avoidable) bias and variance

To reduce avoidable bias (the gap between human-level error and training error) :

• Train bigger model
• Train longer/better optimization algorithms
• NN architecture/hyperparameters search

To reduce variance (the gap between training error and dev error) :

• More data
• Regularization
• NN architecture/hyperparameters search
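Reading the list above as two groups, tactics for the gap between human-level and training error and tactics for the gap between training and dev error, the choice can be sketched as a simple lookup. The `TACTICS` table and `suggest_tactics` function are hypothetical names, the example error rates are made up, and breaking ties in favor of avoidable bias is an arbitrary assumption.

```python
# Tactics taken from the list above, grouped by the gap they address.
TACTICS = {
    "avoidable bias": [
        "train bigger model",
        "train longer / better optimization algorithms",
        "NN architecture / hyperparameters search",
    ],
    "variance": [
        "more data",
        "regularization",
        "NN architecture / hyperparameters search",
    ],
}


def suggest_tactics(human_level, train_error, dev_error):
    """Return (gap_name, tactics) for the larger gap; ties go to avoidable bias."""
    avoidable_bias = train_error - human_level
    variance = dev_error - train_error
    gap = "avoidable bias" if avoidable_bias >= variance else "variance"
    return gap, TACTICS[gap]


# Hypothetical error rates in percent.
gap, tactics = suggest_tactics(human_level=1.0, train_error=8.0, dev_error=10.0)
print(gap, tactics)
```

In practice both gaps may deserve attention; the lookup only makes explicit which knobs belong to which problem.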

This notion of bias, or avoidable bias, and variance is one of those things that is easily learned but tough to master.