2 ML strategy (2)

Carrying out error analysis

If you’re trying to get a learning algorithm to do a task that humans can do, and your algorithm is not yet at human-level performance, then manually examining the mistakes it makes can give you insight into what to do next. This process is called error analysis. It is a procedure that lets you tell very quickly whether a given direction could be worth your effort. The best case for a direction, meaning how much your error could drop if you solved that problem completely, is sometimes called the ceiling on performance. In machine learning, we sometimes speak disparagingly of hand engineering things or of using too much manual insight. But if you’re building applied systems, this simple counting procedure can save you a lot of time when deciding which direction is the most important or most promising to focus on.

Maybe this is a 5 to 10 minute effort. It gives you an estimate of how worthwhile a direction is and can help you make a much better decision about whether it is worth working on. Sometimes you can also evaluate multiple ideas in parallel while doing error analysis.

During error analysis, you’re just looking at dev set examples that your algorithm has misrecognized. This quick counting procedure, which you can often do in at most a small number of hours, can really help you make much better prioritization decisions and understand how promising different approaches are.

To carry out error analysis, find a set of examples that your algorithm has mislabeled in your dev (development) set.
Look at those examples for false positives and false negatives, and count up the number of errors that fall into each category.

During this process, you might be inspired to create new categories of errors.
By counting up the fraction of examples that are mislabeled in different ways, often this will help you prioritize or give you inspiration for new directions to go in.
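
The counting procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the category names, the ten tagged examples, and the 10% dev error are all hypothetical. For each category it reports the fraction of errors carrying that tag and the ceiling on performance, that is, the dev error you would be left with if that category were fixed completely.

```python
from collections import Counter

def summarize_error_categories(tagged_errors, overall_error_rate):
    """Count how often each category appears among misclassified dev examples.

    tagged_errors: one set of category tags per misclassified example
                   (an example may carry several tags, or none).
    overall_error_rate: current dev set error, e.g. 0.10 for 10%.
    Returns {category: (fraction_of_errors, best_case_error_after_fix)}.
    """
    total = len(tagged_errors)
    counts = Counter(tag for tags in tagged_errors for tag in tags)
    summary = {}
    for category, count in counts.most_common():
        fraction = count / total
        # Ceiling on performance: even a perfect fix for this category
        # removes at most this fraction of the current errors.
        best_case_error = overall_error_rate * (1 - fraction)
        summary[category] = (fraction, best_case_error)
    return summary

# Hypothetical tags for 10 misclassified dev examples.
errors = [
    {"blurry"}, {"blurry"}, {"dog"}, {"blurry", "filter"}, {"big_cat"},
    {"big_cat"}, {"blurry"}, set(), {"big_cat"}, {"blurry"},
]
summary = summarize_error_categories(errors, overall_error_rate=0.10)
for category, (fraction, best_case) in summary.items():
    print(f"{category:8s} {fraction:4.0%} of errors, best-case dev error {best_case:.1%}")
```

Because one example can carry several tags, the fractions need not sum to 100%. New categories can be added simply by using a new tag while reviewing.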

Cleaning up Incorrectly labeled data

If you go through your data and find that some of the output labels Y are incorrect, is it worth your while to go in and fix them?

It turns out that deep learning algorithms are quite robust to random errors in the training set.
If the errors are reasonably random, then it’s probably okay to leave them as they are and not spend too much time fixing them, so long as the total data set size is big enough and the actual percentage of errors is not too high.
There is one caveat: deep learning algorithms are robust to random errors, but they are less robust to systematic errors.

If incorrect labels make a significant difference to your ability to evaluate algorithms on your dev set, then go ahead and spend the time to fix them. But if they don’t make a significant difference to your ability to use the dev set to evaluate classifiers, then it might not be the best use of your time. Whatever process you apply, apply it to both your dev and test sets at the same time, so that they continue to come from the same distribution. It’s actually less important to correct the labels in your training set.
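
One way to judge whether incorrect labels make a significant difference is to split the overall dev error into the part caused by incorrect labels and the part caused by everything else. The sketch below assumes you have already counted, during error analysis, how many misclassified dev examples were in fact labeled incorrectly; the function name and all counts are hypothetical.

```python
def label_error_breakdown(num_dev_examples, num_errors, num_errors_from_bad_labels):
    """Split overall dev error into the part due to incorrect labels and the rest."""
    overall = num_errors / num_dev_examples
    from_labels = num_errors_from_bad_labels / num_dev_examples
    from_other = overall - from_labels
    # Share of all counted errors that are really labeling mistakes.
    label_share = num_errors_from_bad_labels / num_errors if num_errors else 0.0
    return {
        "overall": overall,
        "from_labels": from_labels,
        "from_other": from_other,
        "label_share": label_share,
    }

# Hypothetical counts: 5000 dev examples, 500 errors, 30 of them due to bad labels.
report = label_error_breakdown(5000, 500, 30)
for name, value in report.items():
    print(f"{name:12s} {value:.1%}")
```

With these made-up counts, incorrect labels account for a small share of the errors, so fixing them is unlikely to change which classifier looks better. If the share were large, cleaning the dev and test labels would matter more.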

In building practical systems, often there’s also more manual error analysis and more human insight that goes into the systems than sometimes deep learning researchers like to acknowledge.
I actually go in and look at the data myself and try to count the fraction of errors, because these minutes, or maybe a small number of hours, of counting can really help you prioritize where to go next.

Build your first system quickly, then iterate

More generally, for almost any machine learning application, there could be 50 different directions you could go in, and each of them is reasonable and would make your system better. The challenge is: how do you pick which of these to focus on?

If you’re starting to build a brand new machine learning application, the advice is to build your first system quickly and then iterate.
- First quickly set up a dev/test set and metric. So this is really deciding where to place your target.

The value of the initial system is that having some trained system lets you localize bias/variance and prioritize what to do next. It lets you do error analysis: look at some mistakes and figure out which of the many directions you could go in are actually the most worthwhile.

If there’s a significant body of academic literature that you can draw on for pretty much the exact same problem you’re building, it might be okay to build a more complex system from the get-go by building on that literature.
- But if you are tackling a new problem for the first time, then I would encourage you not to overthink it or make your first system too complicated. Many teams overthink and build something too complicated.

If you are applying machine learning algorithms to a new application, and your main goal is to build something that works (as opposed to inventing a new machine learning algorithm, which is a different goal), then build your first system quickly and use it to decide what to do next.

Training and testing on different distributions

This section covers how to deal with train and test distributions that differ from each other.

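One option is to put both of these data sets together and randomly shuffle them into train, dev, and test sets.

The disadvantage is that a lot of your dev set will then come from the larger data source rather than from the distribution you actually care about.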

In the alternative, the training set still contains the images from the larger source, and the dev and test sets are all app images. Now you’re aiming the target where you want it to be.
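
A minimal sketch of such a split, under stated assumptions: `other_images` stands for the larger data source and `app_images` for the data you care about, the sizes are made up, and any app images not needed for dev/test are added to the training set (a choice made for this illustration). The function name is hypothetical.

```python
import random

def split_with_target_dev_test(source_examples, target_examples,
                               dev_size, test_size, seed=0):
    """Build dev and test sets from the target distribution only."""
    if dev_size + test_size > len(target_examples):
        raise ValueError("not enough target examples for the dev and test sets")
    rng = random.Random(seed)
    target = list(target_examples)
    rng.shuffle(target)
    dev = target[:dev_size]
    test = target[dev_size:dev_size + test_size]
    # Training set: all source examples plus the leftover target examples.
    train = list(source_examples) + target[dev_size + test_size:]
    rng.shuffle(train)
    return train, dev, test

# Hypothetical data: 200 images from the larger source, 20 app images.
other_images = [("other", i) for i in range(200)]
app_images = [("app", i) for i in range(20)]
train, dev, test = split_with_target_dev_test(
    other_images, app_images, dev_size=5, test_size=5)
print(len(train), len(dev), len(test))  # 210 5 5
```

The training distribution now differs from the dev/test distribution, but the dev set measures performance on the data you actually care about.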

Bias and Variance with mismatched data distributions

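When the training data and the dev data come from different distributions, a gap between training error and dev error can reflect two effects: variance, and the difference between the distributions. In order to tease out these two effects, it is useful to define a new piece of data, which we’ll call the training-dev set.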

What we’re going to do is randomly shuffle the training set and then carve out just a piece of it to be the training-dev set. So just as the dev and test sets have the same distribution, the training set and the training-dev set also have the same distribution.
The difference is that you now train your neural network only on the training set proper. You won’t run backpropagation on the training-dev portion of the data.
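
Carving out the training-dev set can be sketched as follows. The function name, the 10% fraction, and the fixed seed are assumptions for illustration; the notes do not prescribe a particular size.

```python
import random

def carve_training_dev(training_examples, training_dev_fraction=0.1, seed=0):
    """Shuffle the training set and hold out a slice as the training-dev set."""
    shuffled = list(training_examples)
    random.Random(seed).shuffle(shuffled)
    n_training_dev = int(len(shuffled) * training_dev_fraction)
    training_dev = shuffled[:n_training_dev]      # never used for backpropagation
    training_proper = shuffled[n_training_dev:]   # the only data the network trains on
    return training_proper, training_dev

training_proper, training_dev = carve_training_dev(range(100))
print(len(training_proper), len(training_dev))  # 90 10
```

Because both pieces come from one shuffled pool, they share a distribution, and the only difference between them is whether the network was trained on them.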

Bias / variance on mismatched training and dev / test sets

Suppose the error goes up a lot when you go from the training data to the training-dev data. The only difference between the two is that your neural network got to see the training data: it was trained explicitly on it, but not on the training-dev data. So this tells you that you have a variance problem.
Suppose instead that the error is similar on the training and training-dev data but really jumps when you go to the dev set. Your algorithm has learned to do well on a different distribution than the one you really care about, so we call this a data mismatch problem.
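
The two gaps can be read off with simple arithmetic. The sketch below is only an illustration: the function name and the error values are hypothetical, and it compares just the two gaps discussed here.

```python
def diagnose_gaps(train_error, training_dev_error, dev_error):
    """Compare the variance gap with the data mismatch gap."""
    gaps = {
        # Same distribution, but the network never trained on training-dev data.
        "variance": training_dev_error - train_error,
        # Both are unseen data, but the dev set comes from a different distribution.
        "data_mismatch": dev_error - training_dev_error,
    }
    largest = max(gaps, key=gaps.get)
    return gaps, largest

# Hypothetical error rates: (training, training-dev, dev).
scenarios = {
    "A": (0.01, 0.09, 0.10),
    "B": (0.01, 0.015, 0.10),
}
for name, errors in scenarios.items():
    gaps, largest = diagnose_gaps(*errors)
    print(name, {k: round(v, 3) for k, v in gaps.items()}, "->", largest)
```

In scenario A the jump happens between training and training-dev data, which points to variance. In scenario B it happens between training-dev and dev data, which points to data mismatch.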

Addressing data mismatch

Carry out manual error analysis to try to understand the differences between the training set and the dev/test sets.
Make the training data more similar to the dev/test sets, or collect more data similar to the dev/test sets.

One of the ways we talked about is artificial data synthesis. And artificial data synthesis does work. But, if you’re using artificial data synthesis, just be cautious and bear in mind whether or not you might be accidentally simulating data only from a tiny subset of the space of all possible examples.

Transfer learning

One of the most powerful ideas in deep learning is that sometimes you can take knowledge a neural network has learned from one task and apply it to a separate task. To do so, take the last output layer of the neural network and delete it, along with the weights feeding into it. Then create a new set of randomly initialized weights just for the last layer and have that layer output the prediction for the new task (radiology diagnosis in this example).

There are a couple of options for how to retrain the neural network with radiology data :

If you have a small radiology dataset, you might want to just retrain the weights of the last layer.
But if you have a lot of data, then maybe you can retrain all the parameters in the network.
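
The two retraining options can be sketched without any deep learning framework. Everything here is a simplified stand-in: layers are plain dictionaries, `trainable` is a hypothetical flag that a training loop would consult, and the small Gaussian initialization is one arbitrary choice. A real project would use its framework's own mechanism for freezing layers.

```python
import random

def prepare_for_transfer(pretrained_layers, num_new_outputs, retrain_all, seed=0):
    """Swap in a fresh output layer and mark which layers will be retrained."""
    rng = random.Random(seed)
    # Drop the old output layer together with the weights feeding into it.
    # dict(layer, ...) makes a copy, so the pretrained layers are left untouched.
    kept = [dict(layer, trainable=retrain_all) for layer in pretrained_layers[:-1]]
    # Each weight row holds one unit's incoming weights.
    num_inputs = len(pretrained_layers[-1]["weights"][0])
    new_output = {
        "name": "new_output",
        "weights": [[rng.gauss(0.0, 0.01) for _ in range(num_inputs)]
                    for _ in range(num_new_outputs)],
        "trainable": True,  # the new layer is always trained
    }
    return kept + [new_output]

# Hypothetical pretrained network: 3 inputs -> 2 units -> 2 units -> 1 output.
pretrained = [
    {"name": "hidden1", "weights": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]},
    {"name": "hidden2", "weights": [[0.7, 0.8], [0.9, 1.0]]},
    {"name": "old_output", "weights": [[0.3, 0.4]]},
]
small_data_model = prepare_for_transfer(pretrained, 1, retrain_all=False)
large_data_model = prepare_for_transfer(pretrained, 1, retrain_all=True)
print([layer["trainable"] for layer in small_data_model])  # [False, False, True]
print([layer["trainable"] for layer in large_data_model])  # [True, True, True]
```

With a small dataset only the new last layer is trained; with a lot of data every layer is marked for retraining, starting from the pretrained weights.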

When transfer learning makes sense :

Task A and B have the same input X
You have a lot more data for Task A than Task B
Low level features from A could be helpful for learning B

Multi-task learning

Whereas transfer learning is a sequential process in which you learn from task A and then transfer that to task B, in multi-task learning you start off simultaneously, trying to have one neural network do several things at the same time. Each of these tasks hopefully helps all of the other tasks.

When multi-task learning makes sense :

Training on a set of tasks that could benefit from having shared lower-level features.
Usually: Amount of data you have for each task is quite similar.
Can train a big enough neural network to do well on all the tasks.
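
One common way to have a single network do several things at once is to give it one output per task and sum a per-task loss. The sketch below assumes each task is a yes/no prediction scored with logistic loss, which these notes do not specify; the function name and the clipping constant are also assumptions.

```python
import math

def multi_task_loss(predictions, labels, eps=1e-12):
    """Average over examples of the summed per-task logistic losses.

    predictions[i][j]: predicted probability that task j is positive for example i.
    labels[i][j]: 0 or 1, the true answer for task j on example i.
    """
    total = 0.0
    for y_hat, y in zip(predictions, labels):
        for p, t in zip(y_hat, y):
            p = min(max(p, eps), 1 - eps)  # keep log() away from 0
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(predictions)

# Hypothetical batch: 2 examples, 3 tasks each.
predictions = [[0.9, 0.2, 0.5], [0.1, 0.8, 0.5]]
labels = [[1, 0, 1], [0, 1, 0]]
print(round(multi_task_loss(predictions, labels), 4))
```

Because every task contributes to the same loss, the shared lower layers are pushed toward features that are useful for all of the tasks.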

What is end-to-end deep learning?

Briefly, there have been some data processing systems, or learning systems, that require multiple stages of processing. What end-to-end deep learning does is take all those multiple stages and replace them, usually with just a single neural network.

It turns out that one of the challenges of end-to-end deep learning is that you might need a lot of data before it works well.
If you’re training on a smaller data set to build a speech recognition system, then the full traditional pipeline works really well.

So why is it that the two step approach works better?

One is that each of the two problems you’re solving is actually much simpler.
The second is that you have a lot of data for each of the two sub-tasks.

Although if you had enough data for the end-to-end approach, maybe the end-to-end approach would work better.

Whether to use end-to-end learning?

The benefits of applying end-to-end learning :

End-to-end learning really just lets the data speak.
There’s less hand designing of components needed.

The disadvantages :

It may need a large amount of data.
It excludes potentially useful hand-designed components, which could be very helpful if well designed.