15 Anomaly Detection

Problem Motivation

Anomaly detection is mainly used for unsupervised problems, although some aspects of it closely resemble supervised learning.

Some examples:

  • detect strange behavior or fraudulent behavior
  • manufacturing
  • monitoring computers in a data center

Gaussian Distribution

Gaussian distribution

\(x\sim N(\mu , \sigma ^2)\)


Gaussian probability density

\(p(x; \mu , \sigma ^2) = \frac {1}{\sqrt{2\pi }\sigma } \exp\left(-\frac {(x - \mu)^2}{2 \sigma ^2}\right)\)


The location of the center of this bell-shaped curve

\(\mu = \frac {1}{m} \sum _{i=1}^{m}x^{(i)}\)


The width of this bell-shaped curve

\(\sigma ^2 = \frac {1}{m} \sum _{i=1}^{m} (x^{(i)} - \mu) ^2\)


Note: these formulas use \(m\) in the denominator rather than the \(m - 1\) commonly used in statistics; for large \(m\) the difference is negligible.
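The estimates above can be sketched in a few lines of NumPy. This is only an illustration: the function names `fit_gaussian` and `gaussian_pdf` and the sample values are made up for the example, not taken from the course.

```python
import numpy as np

def fit_gaussian(x):
    """Estimate mu and sigma^2 using the 1/m formulas from the notes."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()  # equivalent to np.var(x, ddof=0)
    return mu, sigma2

def gaussian_pdf(x, mu, sigma2):
    """Density p(x; mu, sigma^2) of a univariate Gaussian."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

# Made-up sample, chosen so the arithmetic is easy to check by hand
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mu, sigma2 = fit_gaussian(sample)

# The statistics convention divides by m - 1 instead of m
sigma2_unbiased = np.var(sample, ddof=1)

print(mu, sigma2, sigma2_unbiased)
```

With only eight samples the two variance estimates differ noticeably; as \(m\) grows they converge.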


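Applying this to anomaly detection: estimate the parameters of each feature \(x_j\), then model \(p(x)\) as the product of the per-feature densities.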

\(\mu _j = \frac {1}{m} \sum _{i=1}^{m}x^{(i)} _j\)


\(\sigma ^2_j = \frac {1}{m} \sum _{i=1}^{m} (x^{(i)}_j - \mu _j) ^2\)


\(p(x) = \prod _{j=1}^{n}p(x_j; \mu _j, \sigma ^2_j) = \prod _{j=1}^{n}\frac {1}{\sqrt{2\pi } \sigma _j} \exp\left(-\frac {(x_j - \mu_j)^2}{2 \sigma ^2_j}\right)\)


If \(p(x) < \varepsilon \), flag \(x\) as an anomaly.
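A minimal sketch of this procedure, assuming synthetic two-feature training data and a hand-picked threshold. The names `fit_params` and `p`, the data, and the value of `epsilon` are all illustrative assumptions.

```python
import numpy as np

def fit_params(X):
    """Per-feature mean and variance (1/m convention) of an (m, n) matrix."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # ddof=0 by default, i.e. divide by m
    return mu, sigma2

def p(X, mu, sigma2):
    """Product of the per-feature Gaussian densities for each row of X."""
    dens = np.exp(-(X - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)
    return dens.prod(axis=1)

# Synthetic "normal" training data (assumption: two independent features)
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[10.0, 5.0], scale=[1.0, 0.5], size=(500, 2))
mu, sigma2 = fit_params(X_train)

# One typical point and one point far from the training data
X_new = np.array([[10.1, 5.1], [15.0, 1.0]])
epsilon = 1e-4  # hand-picked here; normally tuned on a cross-validation set
is_anomaly = p(X_new, mu, sigma2) < epsilon
print(is_anomaly)
```

For many features the product can underflow, so a practical variant would sum log-densities and compare against \(\log \varepsilon\) instead.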

Developing and Evaluating an Anomaly Detection System

How to develop and evaluate an algorithm?

  1. Fit the model \(p(x)\) on the training set
  2. On the cross-validation set, try different values of \(\varepsilon\) and compute the F1 score for each
  3. After choosing \(\varepsilon\), evaluate the algorithm on the test set
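Step 2 can be sketched as a simple scan over candidate thresholds. The helper names `f1_score` and `select_epsilon`, the evenly spaced grid, and the tiny cross-validation arrays below are assumptions made for the example; they are not prescribed by the notes.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 score where label 1 means anomaly."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_epsilon(p_cv, y_cv, num_steps=1000):
    """Return the threshold with the best F1 on the cross-validation set."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), num_steps):
        y_pred = (p_cv < eps).astype(int)
        f1 = f1_score(y_cv, y_pred)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

# Hypothetical p(x) values and labels for eight cross-validation examples
p_cv = np.array([0.30, 0.25, 0.28, 0.001, 0.22, 0.002, 0.27, 0.31])
y_cv = np.array([0, 0, 0, 1, 0, 1, 0, 0])
eps, f1 = select_epsilon(p_cv, y_cv)
print(eps, f1)
```

F1 is used rather than accuracy because the classes are heavily skewed: predicting "normal" for everything would score high accuracy while catching no anomalies.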

Anomaly Detection vs. Supervised Learning

| Anomaly Detection | Supervised Learning |
| --- | --- |
| Very small number of positive examples and a relatively large number of negative examples | A reasonably large number of both positive and negative examples |
| Many different types of anomalies | Enough positive examples for an algorithm to get a sense of what the positive examples are like |
| Future anomalies may look nothing like the ones you've seen so far | |
| Fraud detection, manufacturing, data center monitoring | Spam email, weather prediction, classifying cancers |

Choosing What Features to Use

  1. Model each feature with a Gaussian distribution; if a feature does not look Gaussian, try different transformations of the data to make it look more Gaussian
  2. Use error analysis to come up with new features for the anomaly detection algorithm
  3. Create new features by combining existing features
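As a rough illustration of steps 1 and 3, the sketch below applies a log transform to a synthetic right-skewed feature and builds a ratio feature from two others. The `skewness` helper, the synthetic data, and the `cpu_load` / `network_traffic` columns are hypothetical and chosen only for the example.

```python
import numpy as np

def skewness(x):
    """Sample skewness: roughly 0 for a symmetric, Gaussian-like feature."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

# Synthetic right-skewed feature (assumption: log-normally distributed)
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

# Log transform; requires x > 0. Other candidates: np.log(x + c), x ** 0.5
x_log = np.log(x)
print(skewness(x), skewness(x_log))

# Combining features: hypothetical CPU-load and network-traffic columns
cpu_load = np.array([0.5, 0.6, 0.55, 0.9])
network_traffic = np.array([100.0, 120.0, 110.0, 5.0])
ratio = cpu_load / network_traffic  # large when CPU is busy but traffic is low
print(ratio)
```

In practice a histogram of each feature before and after the transform is usually the quickest check of whether it looks Gaussian.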

Multivariate Gaussian Distribution

Original model

\(p(x) = \prod _{j=1}^{n}p(x_j; \mu _j, \sigma ^2_j) = \prod _{j=1}^{n}\frac {1}{\sqrt{2\pi } \sigma _j} \exp\left(-\frac {(x_j - \mu_j)^2}{2 \sigma ^2_j}\right)\)


Multivariate Gaussian model

\(\mu = \frac {1}{m} \sum _{i=1}^{m}x^{(i)}\)


\(\Sigma = \frac {1}{m} \sum_{i=1}^{m} (x^{(i)} - \mu )(x^{(i)} - \mu )^T = \frac {1}{m} (X - \mu)^T(X - \mu)\)


\(p(x) = \frac {1}{(2 \pi)^{\frac {n}{2}}\left | \Sigma \right | ^{\frac {1}{2}}} \exp\left(-\frac {1}{2} (x-\mu)^T\Sigma ^{-1}(x-\mu)\right)\)
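The multivariate density can be sketched as follows. The function names and the synthetic correlated data are assumptions for illustration; the code solves a linear system instead of forming \(\Sigma^{-1}\) explicitly, which is a common numerical choice rather than something the formula requires.

```python
import numpy as np

def fit_multivariate(X):
    """Mean vector and covariance matrix (1/m convention) of an (m, n) matrix."""
    m = X.shape[0]
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = Xc.T @ Xc / m
    return mu, Sigma

def multivariate_pdf(X, mu, Sigma):
    """Multivariate Gaussian density evaluated at each row of X."""
    n = mu.shape[0]
    Xc = X - mu
    Z = np.linalg.solve(Sigma, Xc.T)   # Sigma^{-1} (x - mu) for every row
    quad = np.sum(Xc * Z.T, axis=1)    # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2.0 * np.pi) ** (n / 2.0) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# Synthetic, strongly correlated 2-D training data (illustrative only)
rng = np.random.default_rng(0)
true_cov = np.array([[1.0, 0.9], [0.9, 1.0]])
X_train = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=1000)
mu, Sigma = fit_multivariate(X_train)

# Both points are equally far from the mean; the second breaks the correlation
points = np.array([[1.5, 1.5], [1.5, -1.5]])
p_vals = multivariate_pdf(points, mu, Sigma)
print(p_vals)
```

The second point gets a far lower density even though each of its coordinates looks unremarkable on its own, which is the kind of anomaly the per-feature product model tends to miss.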


| Gaussian Distribution | Multivariate Gaussian Distribution |
| --- | --- |
| Manually create features to capture anomalies | Automatically captures correlations between features |
| Computationally cheaper | |
| | Must have \(m > n\), or else \(\Sigma\) is non-invertible; in practice \(m \geq 10n\) is a common rule of thumb |
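The invertibility condition in the comparison above can be checked numerically. In this hedged sketch (random synthetic data, hypothetical helper `covariance`), the covariance matrix built from fewer examples than features is rank-deficient, while one built from many examples has full rank.

```python
import numpy as np

def covariance(X):
    """Covariance matrix with the 1/m convention used in the notes."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / X.shape[0]

rng = np.random.default_rng(0)
n = 5  # number of features

X_small = rng.normal(size=(3, n))    # m = 3 examples, fewer than n
X_large = rng.normal(size=(50, n))   # m = 50 examples, i.e. m = 10n

# After centering, the rank of Sigma is at most min(n, m - 1)
rank_small = np.linalg.matrix_rank(covariance(X_small))
rank_large = np.linalg.matrix_rank(covariance(X_large))
print(rank_small, rank_large)
```

A rank below \(n\) means \(\left|\Sigma\right| = 0\), so the density formula cannot be evaluated. Redundant features (duplicates or exact linear combinations) cause the same problem regardless of \(m\).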