Calling a C Function From Lua

Prepare the Shared Library (.so)

#include <math.h>
#include <lua.h>
#include <lauxlib.h>
#include <lualib.h>

/* Adds its two numeric arguments and returns the sum to Lua. */
static int my_sum(lua_State *L){
    lua_Number d1 = luaL_checknumber(L, 1);
    lua_Number d2 = luaL_checknumber(L, 2);
    lua_pushnumber(L, d1 + d2);
    return 1;
}

/* Returns a fixed string to Lua. */
static int my_info(lua_State *L){
    lua_pushstring(L, "qinuu");
    return 1;
}

static const struct luaL_Reg test_lib[] = {
    {"my_sum",  my_sum},
    {"my_info", my_info},
    {NULL, NULL}
};

int luaopen_test_lib(lua_State *L){
    //luaL_newlib(L, test_lib);              // Lua 5.2+
    luaL_register(L, "test_lib", test_lib);  // Lua 5.1
    return 1;
}

Test

local my_lib = require "test_lib"

print(type(my_lib))

print(my_lib.my_sum(23, 17))

print(my_lib.my_info())

How to Connect Internal and External Networks at the same time : Windows

Delete Default Route

route delete 0.0.0.0

Add Gateway : Internal

route -p add 192.168.10.0 mask 255.255.255.0 192.168.10.1

Add Gateway : External

route -p add 0.0.0.0 mask 0.0.0.0 192.168.20.1

Check It!

route print
IPv4 Route Table
===========================================================================
Active Routes:
Network Destination Netmask Gateway Interface Metric
0.0.0.0 0.0.0.0 192.168.20.254 192.168.20.151 55
0.0.0.0 0.0.0.0 192.168.1.1 192.168.1.103 45

===========================================================================
Persistent Routes:
Network Address Netmask Gateway Address Metric
192.168.1.0 255.255.255.0 192.168.1.1 1
192.168.3.0 255.255.255.0 192.168.3.1 1

Speed up by ignoring std::cout

Test Code

#include <iostream>

int main(int argc, char *argv[])
{
    // t1_2: uncomment the next line to discard all std::cout output.
    //std::cout.setstate(std::ios_base::badbit);
    for (int i = 0; i < 100; i++) {
        for (int j = 0; j < 100; j++) {
            // t1_0: uncomment the next line to actually print.
            //std::cout << "" << std::endl;
        }
    }
    return 0;
}

Test it without std::cout

root@imx6ul7d:~/tmp# time ./t1_1

real    0m0.041s
user    0m0.020s
sys     0m0.000s

Test it with std::cout, but redirect to /dev/null

root@imx6ul7d:~/tmp# time ./t1_0 > /dev/null

real    0m0.096s
user    0m0.030s
sys     0m0.030s

Test it with std::cout, but set io state

root@imx6ul7d:~/tmp# time ./t1_2

real    0m0.061s
user    0m0.040s
sys     0m0.000s

Profile : Linux

Example 1

Add -pg

arm-linux-gnueabihf-g++ -Wall -g -pg hello.cpp -o hello -std=c++17

Test

Run the instrumented binary once so it writes gmon.out, then read the profile with gprof:

root@imx6ul7d:~# gprof -b hello
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
100.00      3.45     3.45        1     3.45     3.45  hehe()
  0.00      3.45     0.00        4     0.00     0.00  std::_Optional_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >::_M_is_engaged() const
......

3 Sequence models & Attention mechanism

Various sequence to sequence architectures

Sequence to sequence models are useful for everything from machine translation to speech recognition.

  • machine translation
  • image captioning

Picking the most likely sentence

 

Beam Search

You don’t want to output a random English translation, you want to output the best and the most likely English translation. Beam search is the most widely used algorithm to do this.

So, whereas greedy search picks only the single most likely word and moves on, beam search can consider multiple alternatives. The beam search algorithm has a parameter called B, the beam width.

Notice that what we ultimately care about in this second step is the pair of first and second words that is most likely; it is not just the most likely second word, but the most likely pair of first and second words.

Evaluate all of these 30,000 options according to the probability of the first and second words and then pick the top three. Because the beam width is equal to three, at every step you instantiate three copies of the network to evaluate these partial sentence fragments and their outputs.

And it is because the beam width is equal to three that you have three copies of the network with different choices for the first word.

Beam search will usually find a much better output sentence than greedy search.
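A minimal sketch of the idea in Python (not code from the course; log_prob(prefix, word) and vocab are hypothetical stand-ins for the decoder's conditional probabilities and the target vocabulary):

def beam_search(log_prob, vocab, beam_width=3, max_len=10, eos="<eos>"):
    # Each beam is (partial sentence, sum of log probabilities so far).
    beams = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:
                candidates.append((prefix, score))  # already finished
                continue
            for word in vocab:
                candidates.append((prefix + [word], score + log_prob(prefix, word)))
        # Keep only the B highest-scoring partial sentences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]

With beam_width=1 this reduces to greedy search; with a larger beam width it keeps several alternatives alive at every step, which is why it usually finds a better sentence.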

Refinements to Beam Search

Length normalization is a small change to the beam search algorithm that can help you get much better results.

Numerical underflow means the product of many small probabilities is too small for the floating-point representation in your computer to store accurately.

So in most implementations, you keep track of the sum of the logs of the probabilities rather than the product of the probabilities.

Instead of using this as the objective you’re trying to maximize, one thing you could do is normalize this by the number of words in your translation. And so this takes the average of the log of the probability of each word.

And this significantly reduces the penalty for outputting longer translations.

And in practice, as a heuristic, instead of dividing by Ty, the number of words in the output sentence, sometimes you use a softer approach: divide by Ty to the power of alpha, where maybe alpha is equal to 0.7. If alpha is equal to 1, you are completely normalizing by length; if alpha is equal to 0, then Ty to the 0 is 1 and you are not normalizing at all. This is somewhere in between full normalization and no normalization, and alpha is another hyperparameter of the algorithm that you can tune to try to get the best results.
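In symbols, the length-normalized objective described above is:

\(\frac{1}{T_y^{\alpha}} \sum _{t=1}^{T_y} \log P(y^{<t>} | x, y^{<1>}, \ldots , y^{<t-1>})\)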

Pick the one that achieves the highest value on this normalized log probability objective. Sometimes it’s called a normalized log likelihood objective.

In production systems, it’s not uncommon to see a beam width maybe around 10.

Exact search algorithms : 

  • BFS, Breadth First Search
  • DFS, Depth First Search

Beam search runs much faster but does not guarantee to find the exact maximum for this arg max that you would like to find.

Error analysis in beam search

Beam search is an approximate search algorithm, also called a heuristic search algorithm.

This is about how error analysis interacts with beam search: how you can figure out whether it is the beam search algorithm or the RNN model that is causing problems and is therefore worth spending time on.

Model : 

  • RNN model (neural network model or sequence to sequence model)
    • It’s really an encoder and a decoder.
    • P(y|x)
  • Beam search algorithm
Let \(y^{*}\) be the human reference translation and \(\hat y\) the output of beam search; use the RNN to compute both \(P(y^{*}|x)\) and \(P(\hat y|x)\) and compare them:
\(\left\{\begin{matrix}
P(y^{*}|x) > P(\hat y|x) & \Rightarrow \text{beam search is at fault}\\
P(y^{*}|x) \le P(\hat y|x) & \Rightarrow \text{the RNN model is at fault}
\end{matrix}\right.\)
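As a rough sketch in Python (log_p_model(x, y) is a hypothetical helper that returns \(\log P(y|x)\) under the RNN):

def attribute_error(log_p_model, x, y_star, y_hat):
    # Compare the model's scores for the human translation and the beam search output.
    if log_p_model(x, y_star) > log_p_model(x, y_hat):
        return "beam search is at fault"    # the model prefers y*, but the search missed it
    else:
        return "the RNN model is at fault"  # the model itself scores y_hat at least as high

Counting how often each case occurs over the dev set tells you whether beam search or the model deserves your attention.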

 

 

BLEU Score

How to evaluate a machine translation system

The way this is done conventionally is through something called the BLEU score.

What the BLEU score does is given a machine generated translation, it allows you to automatically compute a score that measures how good is that machine translation.

BLEU, by the way, stands for bilingual evaluation understudy.

The intuition behind the BLEU score is that we're going to look at the machine generated output and see if the types of words it generates appear in at least one of the human generated references.
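A minimal sketch of the clipped (modified) n-gram precision that this intuition leads to; the full BLEU score combines several n-gram orders with a brevity penalty, so this is only one ingredient:

from collections import Counter

def modified_ngram_precision(candidate, references, n):
    # Each candidate n-gram is credited at most as many times as it appears
    # in any single reference (the "clipping").
    def ngrams(words):
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    cand = ngrams(candidate)
    max_ref = Counter()
    for ref in references:
        for g, c in ngrams(ref).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)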

 
The reason the BLEU score was revolutionary for machine translation was because this gave a pretty good, by no means perfect, but pretty good single real number evaluation metric. And so that accelerated the progress of the entire field of machine translation.

Today, BLEU score is used to evaluate many systems that generate text, such as machine translation systems, as well as the example I showed briefly earlier of image captioning systems.

Attention Model Intuition

The Attention Model makes RNNs work much better.

  • It's just difficult to get a neural network to memorize a super long sentence.
  • But with an Attention Model, a machine translation system's performance holds up even on long sentences, because it works on one part of the sentence at a time.
    • What the Attention Model would be computing is a set of attention weights.

Attention Model

This algorithm runs in quadratic cost, although in machine translation applications, where neither the input nor the output sentence is usually that long, quadratic cost may actually be acceptable.
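A minimal sketch of the attention computation for one output step, assuming the scores \(e^{<t,t'>}\) have already been produced by a small network: the weights \(\alpha^{<t,t'>}\) are a softmax over the input positions, and the context vector is the weighted sum of the encoder activations.

import numpy as np

def attention_context(scores, encoder_states):
    # scores: e^<t,t'> for one output step t over all Tx input steps t'
    # encoder_states: encoder activations a^<t'>, shape (Tx, hidden)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()      # alpha^<t,t'>, non-negative and summing to 1
    return weights @ encoder_states        # context vector c^<t>

Because these weights are computed for every (output step, input step) pair, the cost grows with Tx * Ty, which is the quadratic cost mentioned above.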

Speech recognition

One of the most exciting developments from sequence-to-sequence models has been the rise of very accurate speech recognition.

A common pre-processing step for audio data is to take your raw audio clip and generate a spectrogram: a plot where the horizontal axis is time, the vertical axis is frequency, and the intensity of the colors shows the amount of energy.
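As a sketch, one way to compute such a spectrogram (assuming SciPy is available; the sample rate and audio clip below are toy stand-ins):

import numpy as np
from scipy.signal import spectrogram

fs = 16000                              # sample rate in Hz (assumed)
audio = np.random.randn(fs * 2)         # stand-in for a 2-second raw clip
freqs, times, Sxx = spectrogram(audio, fs=fs)
print(Sxx.shape)                        # energy grid: (frequency bins, time steps)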

Once upon a time, speech recognition systems used to be built using phonemes, which were hand-engineered basic units of sound. But with end-to-end deep learning, we're finding that phoneme representations are no longer necessary.

 

Trigger Word Detection

With the rise of speech recognition there have been more and more devices you can wake up with your voice; those are sometimes called trigger word detection systems.

 
The literature on trigger word detection algorithms is still evolving, so there isn't wide consensus yet on what the best algorithm for trigger word detection is.

One example of an algorithm: you can use an RNN like this. What you really do is take an audio clip, maybe compute spectrogram features, and that generates audio features \(x^{<1>},x^{<2>},x^{<3>}\). You pass those to an RNN, and all that remains to be done is to define the target labels \(Y\). Suppose some point in the audio clip is when someone has just finished saying the trigger word, such as Alexa, xiaodunihao, hey Siri, or okay Google.

 
Then in the training set you can set the target label to be zero for everything before that point, and right after it set the target label to one. If a little bit later on the trigger word is said again, you again set the target label to one right after that point.

One slight disadvantage of this is that it creates a very imbalanced training set: a lot more zeros than ones.

Instead of setting only a single time step to output one, you can output one for several time steps, or for a fixed period of time, before reverting back to zero. That slightly evens out the ratio of ones to zeros, but it is a little bit of a hack.
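A small sketch of that labeling scheme (the step counts and function name are made up for illustration):

import numpy as np

def make_labels(num_steps, trigger_end_steps, ones_per_trigger=50):
    # Label 0 everywhere, then 1 for a short stretch right after each point
    # where the trigger word ends, to even out the 0/1 ratio a little.
    y = np.zeros(num_steps, dtype=int)
    for t in trigger_end_steps:
        y[t + 1 : t + 1 + ones_per_trigger] = 1
    return y

print(make_labels(1000, [300, 700]).sum())   # 100 ones among 1000 labels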

2 Natural Language Processing and Word Embeddings

Word Representation

NLP, Natural Language Processing

Word embeddings are a way of representing words that lets your algorithms automatically understand analogies like: man is to woman as king is to queen, and many other examples.

The basic representation uses a vocabulary of words and a one-hot vector for each word.

One of the weaknesses of this representation is that it treats each word as a thing unto itself, and it doesn't allow an algorithm to easily generalize across words.


You sometimes see plots like these on the internet to visualize some of these 300-dimensional or higher-dimensional embeddings.

To visualize them, algorithms like t-SNE map the embeddings down to a much lower-dimensional space.
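A small sketch of that visualization step, assuming scikit-learn's t-SNE implementation and random stand-in vectors in place of real word embeddings:

import numpy as np
from sklearn.manifold import TSNE

embeddings = np.random.randn(50, 300)     # stand-in for 50 word vectors of dimension 300
coords_2d = TSNE(n_components=2, perplexity=10).fit_transform(embeddings)
print(coords_2d.shape)                    # (50, 2): one 2-D point per word, ready to plot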

Using Word Embeddings

Transfer learning and word embeddings

  1. Learn word embeddings from large text corpus. (1-100B words or download pre-trained embedding online.)
  2. Transfer embedding to new task with smaller training set. (say, 100k words)
  3. Optional: Continue to finetune the word embeddings with new data.

Properties of Word Embeddings

One of the most fascinating properties of word embeddings is that they can also help with analogy reasoning.

The most commonly used similarity function is cosine similarity: \(CosineSimilarity(u,v) = \frac{u \cdot v}{\left \| u \right \|_2 \left \| v \right \|_2} = \cos(\theta)\)
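A minimal sketch in Python; for analogy reasoning you would look for the word w that maximizes the similarity between \(e_w\) and, say, \(e_{king} - e_{man} + e_{woman}\):

import numpy as np

def cosine_similarity(u, v):
    # cos(theta) between two embedding vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def most_similar(target, embeddings):
    # embeddings: dict mapping word -> vector; returns the closest word to target
    return max(embeddings, key=lambda w: cosine_similarity(embeddings[w], target))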

Embedding Matrix

When you implement an algorithm to learn a word embedding, what you end up learning is an embedding matrix.

And the columns of this matrix would be the different embeddings for the 10,000 different words you have in your vocabulary.
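A small sketch of the lookup: multiplying the embedding matrix by a one-hot vector selects one column, although in practice you just index the column directly (the matrix below is random, purely for illustration):

import numpy as np

vocab_size, emb_dim = 10000, 300
E = np.random.randn(emb_dim, vocab_size)      # embedding matrix, one column per word

j = 4257                                      # index of some word in the vocabulary
o_j = np.zeros(vocab_size)
o_j[j] = 1.0                                  # one-hot vector o_j
e_j = E @ o_j                                 # E * o_j picks out column j ...
assert np.allclose(e_j, E[:, j])              # ... which is the same as slicing the column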

 

Learning Word Embeddings

It turns out that building a neural language model is a reasonable way to learn a set of embeddings.

Well, what’s actually more commonly done is to have a fixed historical window.

And using a fixed history, just means that you can deal with even arbitrarily long sentences because the input sizes are always fixed.

If your goal is to learn an embedding, researchers have experimented with many different types of context:

  • If your goal is to build a language model then it is natural for the context to be a few words right before the target word.
  • But if your goal isn’t to learn the language model per se, then you can choose other contexts.

Word2Vec

The Word2Vec algorithm is a simpler and computationally more efficient way to learn these types of embeddings.

Skip-Gram model

\(\begin{matrix}
\text{Softmax}: & p(t|c) = \frac{e^{\theta ^T_t e_c}}{\sum _{j=1}^{10,000} e^{\theta ^T_j e_c}} \\
\text{Loss}: & L(\hat y, y) = - \sum _{i=1}^{10,000} y_i \log \hat y _i
\end{matrix}\)

The primary problem is computational speed: the softmax step is very expensive to calculate because you need to sum over your entire vocabulary in the denominator of the softmax.
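A small sketch of that softmax step with toy shapes (\(\theta\) holds one parameter vector per vocabulary word), showing why the full-vocabulary sum in the denominator is the expensive part:

import numpy as np

def skipgram_softmax(theta, e_c):
    # p(t | c) for every target word t: softmax over theta_t^T e_c.
    # theta: (vocab_size, emb_dim), e_c: (emb_dim,) embedding of the context word.
    logits = theta @ e_c
    logits = logits - logits.max()   # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()               # the sum over the whole vocabulary is the costly part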

A few solutions:

  • hierarchical softmax classifier
  • negative sampling

CBOW

The Continuous Bag-Of-Words model takes the surrounding context around a middle word and uses those surrounding words to try to predict the middle word.

Negative Sampling

What you do in this algorithm is create a new supervised learning problem: given a pair of words like orange and juice, predict whether this is a context-target pair. It is really trying to distinguish between the two types of distributions from which you might sample a pair of words.

How do you choose the negative examples?

  • Sample the candidate target words according to their empirical frequency in the corpus, but that gives a very high representation of common words like the, of, and, a.
  • Or use 1 over the vocab size and sample the negative examples uniformly at random, but that is also very non-representative of the distribution of English words.
  • The authors, Mikolov et al., report that what works best empirically is something in between (see the sketch after this list): \(P(w_i) = \frac{f(w_i)^{\frac{3}{4}}}{\sum _{j=1}^{10,000}f(w_j)^{\frac{3}{4}}}\), where \(f(w_i)\) is the observed frequency of word \(w_i\).
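A small sketch of sampling negative examples from that heuristic distribution (the frequency counts are whatever you measured on your corpus):

import numpy as np

def negative_sampling_probs(word_freqs):
    # P(w_i) proportional to f(w_i)^(3/4), as reported by Mikolov et al.
    f = np.asarray(word_freqs, dtype=float)
    p = f ** 0.75
    return p / p.sum()

def sample_negatives(word_freqs, k, rng=None):
    # Draw k negative-example word indices from that distribution.
    rng = rng or np.random.default_rng()
    return rng.choice(len(word_freqs), size=k, p=negative_sampling_probs(word_freqs))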

GloVe Word Vectors

GloVe stands for global vectors for word representation.

Previously we sampled pairs of words, context and target, by picking two words that appear in close proximity to each other in our text corpus. What the GloVe algorithm does is start off by making that relationship explicit: it counts how often each pair of words appears close together in the corpus.

 

Sentiment Classification

Sentiment classification is the task of looking at a piece of text and telling if someone likes or dislikes the thing they’re talking about.

One of the challenges of sentiment classification is that you might not have a huge labeled training set for it. But with word embeddings, you're able to build good sentiment classifiers even with only modest-sized labeled training sets.

One of the problems with the simple approach of averaging the word embeddings of the review is that it ignores word order.
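As a sketch, that simple model builds one feature vector by averaging the embeddings of the words in the review and then feeds it to a classifier (the helper names here are made up):

import numpy as np

def average_embedding_features(words, word_to_index, E):
    # E has shape (emb_dim, vocab_size); words missing from the vocabulary are skipped.
    vecs = [E[:, word_to_index[w]] for w in words if w in word_to_index]
    return np.mean(vecs, axis=0) if vecs else np.zeros(E.shape[0])

The resulting vector would then go into a softmax or logistic regression classifier; because it is just an average, a review that repeats the word good but has a negative meaning can still look positive, which is exactly the word-order problem mentioned above.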

More Sophisticated Model : 

 

Debiasing Word Embeddings

Machine learning and AI algorithms are increasingly trusted to help with, or to make, extremely important decisions. And so we like to make sure that as much as possible that they’re free of undesirable forms of bias, such as gender bias, ethnicity bias and so on.

  • So the first thing we’re going to do is identify the direction corresponding to a particular bias we want to reduce or eliminate.
  • The next step is a neutralization step: for every word that is not definitional, project out the bias component (see the sketch after this list).
  • And then the final step is called equalization in which you might have pairs of words such as grandmother and grandfather, or girl and boy, where you want the only difference in their embedding to be the gender.
  • And then, finally, the number of pairs you want to equalize is actually relatively small, and, at least for the gender example, it is quite feasible to hand-pick them.
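A small sketch of the neutralization step with toy vectors (the bias direction and embedding values are made up; a full implementation would also carry out the equalization of word pairs described above):

import numpy as np

def neutralize(e, g):
    # Remove the component of embedding e that lies along the bias direction g
    # (for gender, g could be estimated as e_woman - e_man).
    e_bias = (np.dot(e, g) / np.dot(g, g)) * g
    return e - e_bias

g = np.array([1.0, 0.0, 0.0])             # toy bias direction
e_doctor = np.array([0.3, 0.5, -0.2])     # toy embedding of a non-definitional word
print(neutralize(e_doctor, g))            # -> [ 0.   0.5 -0.2], no component along g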