Below is the original English text of Section 5, "Negative Sampling", from the paper "Efficient Estimation of Word Representations in Vector Space":
In addition to hierarchical softmax, negative sampling is another technique that can be used to reduce the computational complexity of training neural network language models and the new log-linear models.
The basic idea of negative sampling is to replace the original multi-class classification problem of predicting the correct word from the vocabulary with a binary classification problem: deciding whether or not a given word is a context word of the input word.
We sample a small number of "negative" examples (words that are not in the context of the input word) for each "positive" example (the actual context word of the input word). The training objective is then to maximize the probability of correctly classifying the positive examples as 1 and the negative examples as 0.
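The pairing scheme above can be sketched as follows. This is a minimal illustration, not the paper's implementation: negatives are drawn uniformly from the vocabulary for simplicity, and the function and parameter names (`make_training_pairs`, `window`, `k`) are our own.

```python
import random

def make_training_pairs(tokens, vocab, window=2, k=5, rng=None):
    """For each (input word, context word) pair inside the window, emit one
    positive example (label 1) and k sampled negatives (label 0).

    Negatives are drawn uniformly here for simplicity; frequency-based
    sampling is discussed later in the section."""
    rng = rng or random.Random(0)
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j == i:
                continue
            pairs.append((center, tokens[j], 1))      # positive example
            for _ in range(k):                        # k negative examples
                pairs.append((center, rng.choice(vocab), 0))
    return pairs
```

Each positive pair thus contributes k + 1 binary-classification examples, rather than one V-way classification.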
The number of negative samples k is typically much smaller than the size of the vocabulary V. For example, in our experiments, we found that k=5 to 20 works well for small datasets, and k=2 to 5 is sufficient for large datasets.
The computational complexity of negative sampling is much lower than that of the standard softmax: instead of computing probabilities for all V words in the vocabulary, we only need to score the positive example and the k negative examples, reducing the per-example cost from O(V) to O(k + 1).
In the case of the skip-gram model, the training objective with negative sampling can be written as:
J(\theta) = -\sum_{i=1}^{n} \sum_{-c \le j \le c,\, j \neq 0} \log \sigma\left(u_{o_{i+j}}^{\top} v_{w_i}\right) - \sum_{i=1}^{n} \sum_{j=1}^{k} \log \sigma\left(-u_{n_{i,j}}^{\top} v_{w_i}\right)
where n is the number of training examples, c is the size of the context window, u_{o_{i+j}} is the vector representation of the j-th context word of the i-th training example, v_{w_i} is the vector representation of the input word w_i, u_{n_{i,j}} is the vector representation of the j-th negative sample for the i-th training example, and \sigma(x) = 1 / (1 + e^{-x}) is the sigmoid function.
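The inner term of this objective, the loss for one positive pair together with its k negatives, can be computed directly from these definitions. A plain-Python sketch (the helper names `neg_sampling_loss`, `sigmoid`, and `dot` are illustrative, not from the paper):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neg_sampling_loss(v_w, u_pos, u_negs):
    """Loss for one (input, context) pair with k negative samples:
    -log sigma(u_o . v_w) - sum_j log sigma(-u_{n_j} . v_w)."""
    loss = -math.log(sigmoid(dot(u_pos, v_w)))
    for u_n in u_negs:
        loss -= math.log(sigmoid(-dot(u_n, v_w)))
    return loss
```

Note that the loss touches only k + 1 output vectors, which is where the speedup over the full softmax comes from.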
Negative sampling has several advantages. It is computationally efficient, especially for large vocabularies. It also allows for more flexible training, as we can choose different sampling strategies for the negative examples. For example, we can sample the negative examples according to the frequency of the words in the corpus, so that more frequent words are more likely to be sampled as negative examples.
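A frequency-based sampler of the kind described here might look as follows. This is a sketch: the 0.75 smoothing exponent is a common choice reported in follow-up work, not something stated in this section, and `sample_negatives` is an illustrative name.

```python
import random

def sample_negatives(freqs, k, power=0.75, rng=None):
    """Sample k negative words with probability proportional to
    count ** power. A power below 1 flattens the unigram distribution
    so that rare words are not ignored entirely (0.75 is an assumed
    smoothing choice, not specified in this section)."""
    rng = rng or random.Random(0)
    words = list(freqs)
    weights = [freqs[w] ** power for w in words]
    return rng.choices(words, weights=weights, k=k)
```

With this sampler, frequent words are drawn as negatives more often, but less overwhelmingly than under the raw unigram distribution.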
In our experiments, we found that negative sampling can significantly improve the training speed of the models, and the quality of the learned word representations is also comparable to that of the models trained with the standard softmax or hierarchical softmax.
In addition to negative sampling and hierarchical softmax, there are other techniques such as noise-contrastive estimation (NCE) that can also be used to reduce the computational complexity of training language models. However, in our experience, negative sampling and hierarchical softmax are relatively simple and effective, and work well for a wide range of applications.