Problem Motivation
Anomaly detection is mainly used for unsupervised problems, although some aspects of it are very similar to supervised learning.
Some examples:
- detect strange behavior or fraudulent behavior
- manufacturing
- monitoring computers in a data center
Gaussian Distribution
Gaussian distribution
\(x\sim N(\mu , \sigma ^2)\)
Gaussian probability density
\(p(x; \mu , \sigma ^2) = \frac {1}{\sqrt{2\pi }\sigma } \exp\left(-\frac {(x - \mu)^2}{2 \sigma ^2}\right)\)
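The mean \(\mu\) gives the location of the center of the bell-shaped curve:
\(\mu = \frac {1}{m} \sum _{i=1}^{m}x^{(i)}\)
The variance \(\sigma ^2\) controls the width of the bell-shaped curve:
\(\sigma ^ 2 = \frac {1}{m} \sum _{i=1}^{m} (x^{(i)} - \mu) ^2\)
Note: this formula uses \(\frac {1}{m}\) rather than the \(\frac {1}{m - 1}\) used for the unbiased sample variance in statistics. For large \(m\) the difference is negligible.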
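As a minimal sketch of the two estimates and the density, assuming NumPy is available (the function names `fit_gaussian` and `gaussian_pdf` and the sample values are made up for illustration):

```python
import numpy as np

def fit_gaussian(x):
    """Return (mu, sigma2) using the 1/m variance estimate."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()  # divides by m, same as np.var(x, ddof=0)
    return mu, sigma2

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density p(x; mu, sigma2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Illustrative samples only, not data from the notes.
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma2 = fit_gaussian(samples)
peak = gaussian_pdf(mu, mu, sigma2)  # density at the center of the curve
```

Passing `ddof=1` to `np.var` would give the \(\frac {1}{m - 1}\) version instead.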
Algorithm
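To address anomaly detection, fit the parameters of each feature \(j\) on the training set:
\(\mu _j = \frac {1}{m} \sum _{i=1}^{m}x^{(i)} _j\)
\(\sigma ^ 2 _j = \frac {1}{m} \sum _{i=1}^{m} (x^{(i)}_j - \mu _j) ^2\)
Then, treating the features as independent, compute the density of a new example \(x\):
\(p(x) = \prod _{j=1}^{n}p(x_j; \mu _j, \sigma ^2_j) = \prod _{j=1}^{n}\frac {1}{\sqrt{2\pi } \sigma _j} \exp\left(-\frac {(x_j - \mu_j)^2}{2 \sigma ^2_j}\right)\)
If \(p(x) < \varepsilon \), flag \(x\) as an anomaly.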
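The algorithm can be sketched as follows, assuming NumPy. The helper names, the synthetic training data, and the threshold value `1e-3` are all illustrative assumptions, not values from the notes:

```python
import numpy as np

def fit_params(X):
    """Per-feature mean and 1/m variance of an (m, n) data matrix."""
    X = np.asarray(X, dtype=float)
    return X.mean(axis=0), X.var(axis=0)  # np.var defaults to ddof=0

def density(X, mu, sigma2):
    """p(x) as the product of per-feature Gaussian densities, one value per row."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    per_feature = np.exp(-(X - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return per_feature.prod(axis=1)

def is_anomaly(X, mu, sigma2, epsilon):
    """True where p(x) < epsilon."""
    return density(X, mu, sigma2) < epsilon

# Synthetic "normal" training data: two features centered at (10, 5).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[10.0, 5.0], scale=[1.0, 0.5], size=(500, 2))
mu, sigma2 = fit_params(X_train)

X_new = np.array([[10.0, 5.0],   # close to the training data
                  [15.0, 8.0]])  # far from it in both features
flags = is_anomaly(X_new, mu, sigma2, epsilon=1e-3)  # epsilon is a hypothetical choice
```

With many features the product can underflow, so summing log-densities and comparing against \(\log \varepsilon\) is a common alternative.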
Developing and Evaluating an Anomaly Detection System
How to develop and evaluate an algorithm?
- Fit the model \(p(x)\) on the training set
- On the cross-validation set, try different values of \(\varepsilon\) and compute the F1 score for each
- After choosing \(\varepsilon\), evaluate the algorithm on the test set
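The threshold search in the second step might look like the sketch below, assuming NumPy and that `p_cv` holds the model's densities on the cross-validation set with labels `y_cv` (1 = anomaly). The names, the grid of 1000 candidates, and the toy values are assumptions for illustration:

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 from true positives, false positives and false negatives."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_epsilon(p_cv, y_cv, num_steps=1000):
    """Scan candidate thresholds and keep the one with the best F1 score."""
    p_cv = np.asarray(p_cv, dtype=float)
    y_cv = np.asarray(y_cv, dtype=int)
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), num_steps):
        y_pred = (p_cv < eps).astype(int)  # predict anomaly when p(x) < eps
        f1 = f1_score(y_cv, y_pred)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

# Toy cross-validation densities and labels.
p_cv = np.array([0.30, 0.25, 0.28, 0.001, 0.27, 0.002, 0.22, 0.31])
y_cv = np.array([0, 0, 0, 1, 0, 1, 0, 0])
best_eps, best_f1 = select_epsilon(p_cv, y_cv)
```

F1 is used rather than accuracy because anomalies are rare, so a model that always predicts "normal" would still score a high accuracy.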
Anomaly Detection vs. Supervised Learning
Anomaly Detection | Supervised Learning |
---|---|
very small number of positive, and a relatively large number of negative examples | a reasonably large number of both positive and negative examples |
many different types of anomalies | have enough positive examples for an algorithm to get a sense of what the positive examples are like |
future anomalies may look nothing like the ones you've seen so far | |
fraud detection, manufacturing, data center monitoring | spam email classification, weather prediction, cancer classification |
Choosing What Features to Use
- model each feature with a Gaussian distribution (if a feature does not look Gaussian, try different transformations of the data to make it look more Gaussian)
- run an error analysis on the anomalies the algorithm misses to come up with new features
- create new features by combining existing features
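A small sketch of the first and last points, assuming NumPy. The log transform is only one possible choice, the data is synthetic, and the feature names `cpu_load` and `network_traffic` are hypothetical:

```python
import numpy as np

def skewness(x):
    """Third standardized moment; close to 0 for a symmetric distribution."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic, heavily right-skewed feature (strictly positive values).
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=5000)
transformed = np.log(raw)  # one candidate transformation among several

skew_raw = skewness(raw)          # large and positive
skew_log = skewness(transformed)  # close to zero, i.e. more Gaussian-looking

# Hypothetical combined feature: a ratio stands out when one input is
# unusual relative to the other, even if each looks normal on its own.
cpu_load = np.array([0.50, 0.60, 0.55, 0.90])
network_traffic = np.array([100.0, 120.0, 110.0, 5.0])
load_per_traffic = cpu_load / network_traffic
```

Plotting a histogram before and after the transformation is a quick way to judge whether it helped.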
Multivariate Gaussian Distribution
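The original model multiplies independent per-feature densities:
\(p(x) = \prod _{j=1}^{n}p(x_j; \mu _j, \sigma ^2_j) = \prod _{j=1}^{n}\frac {1}{\sqrt{2\pi } \sigma _j} \exp\left(-\frac {(x_j - \mu_j)^2}{2 \sigma ^2_j}\right)\)
The multivariate model instead fits a mean vector \(\mu\) and a covariance matrix \(\Sigma\):
\(\mu = \frac {1}{m} \sum _{i=1}^{m}x^{(i)}\)
\(\Sigma = \frac {1}{m} \sum_{i=1}^{m} (x^{(i)} - \mu )(x^{(i)} - \mu )^T = \frac {1}{m} (X - \mu)^T(X - \mu)\)
where \(X\) is the \(m \times n\) data matrix and \(\mu\) is subtracted from each row.
\(p(x) = \frac {1}{(2 \pi)^{\frac {n}{2}}\left | \Sigma \right | ^{\frac {1}{2}}} \exp\left(-\frac {1}{2} (x-\mu)^T\Sigma ^{-1}(x-\mu)\right)\)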
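The multivariate formulas can be sketched like this, assuming NumPy. The function names and the synthetic correlated data are illustrative, and `np.linalg.solve` is used in place of an explicit inverse as a numerical convenience:

```python
import numpy as np

def fit_multivariate(X):
    """Mean vector and 1/m covariance matrix of an (m, n) data matrix."""
    X = np.asarray(X, dtype=float)
    m = X.shape[0]
    mu = X.mean(axis=0)
    centered = X - mu
    Sigma = centered.T @ centered / m
    return mu, Sigma

def multivariate_pdf(X, mu, Sigma):
    """Multivariate Gaussian density for each row of X."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    n = mu.shape[0]
    diff = X - mu
    # Solve Sigma z = diff^T rather than forming Sigma^{-1} explicitly.
    solved = np.linalg.solve(Sigma, diff.T).T
    quad = np.sum(diff * solved, axis=1)  # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# Synthetic training data with strongly positively correlated features.
rng = np.random.default_rng(0)
X_train = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=1000)
mu, Sigma = fit_multivariate(X_train)

# Both points are equally far from the mean, but only the first one
# follows the correlation, so it gets the higher density.
p = multivariate_pdf([[1.0, 1.0], [1.0, -1.0]], mu, Sigma)
```

The per-feature model would assign these two points nearly the same density, which is the kind of anomaly the multivariate model is meant to catch.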
Gaussian Distribution | Multivariate Gaussian Distribution |
---|---|
Manually create features to capture anomalies | Automatically captures correlations between features |
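Computationally cheaper | Computationally more expensive (requires \(\Sigma ^{-1}\)) |
Works even when \(m\) is small | Must have \(m > n\) (a common rule of thumb is \(m \ge 10n\)), or else \(\Sigma\) is non-invertible |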