[Python] Gensim Negative Sampling

In natural language processing (NLP), word embeddings are essential for tasks such as text classification, sentiment analysis, and machine translation. Gensim is a popular Python library for topic modeling and for training word embeddings. Among the training techniques Gensim supports, one common approach is negative sampling. In this blog post, we will explore how to use negative sampling in Gensim to train word embeddings.

What is Negative Sampling?

Negative sampling is a technique used to train word embeddings. The main idea is to reformulate the original task of predicting the probability of every word in the vocabulary, given a context word, as a binary classification problem: instead of scoring the whole vocabulary, the model learns to distinguish actual neighboring words from a small number of randomly drawn "negative" words that do not appear in the context.
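
To make this concrete, here is a minimal NumPy sketch of the per-pair loss used in skip-gram with negative sampling, following Mikolov et al. (2013). The function and variable names are illustrative, not Gensim internals:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(center_vec, context_vec, negative_vecs):
    # Positive pair: push sigmoid(center . context) toward 1
    pos = np.log(sigmoid(center_vec @ context_vec))
    # Negative pairs: push sigmoid(center . negative) toward 0
    neg = sum(np.log(sigmoid(-center_vec @ nv)) for nv in negative_vecs)
    return -(pos + neg)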

This reformulation makes training far more efficient. Instead of updating the output weights for the entire vocabulary at every step, we update only the weights for the positive word and a handful of negative samples, which dramatically speeds up training while still producing high-quality word embeddings.
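
One detail worth knowing: the negatives are not drawn uniformly. The original word2vec implementation samples them from the unigram distribution raised to the 3/4 power, which boosts rare words relative to plain frequency sampling; Gensim exposes this exponent as the ns_exponent parameter (default 0.75). A small illustrative sketch with a toy vocabulary:

import numpy as np

def noise_distribution(word_counts, power=0.75):
    # Raise raw unigram counts to the 3/4 power, then renormalize
    counts = np.array(list(word_counts.values()), dtype=np.float64)
    probs = counts ** power
    return probs / probs.sum()

# Draw 5 negatives from a toy vocabulary
word_counts = {'apple': 50, 'banana': 10, 'cherry': 2}
probs = noise_distribution(word_counts)
rng = np.random.default_rng(0)
print(rng.choice(list(word_counts), size=5, p=probs))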

Using Gensim Negative Sampling

Gensim provides a convenient way to train word embeddings using the negative sampling technique. Let’s take a look at an example that demonstrates how to use Gensim for negative sampling.

First, we need to import the necessary libraries and load the corpus of text data:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream sentences from a plain-text file, one sentence per line
sentences = LineSentence('corpus.txt')
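
LineSentence expects one sentence per line, with tokens separated by whitespace. If your corpus is already in memory, Word2Vec also accepts any iterable of token lists instead; a minimal sketch with placeholder sentences:

# Equivalent in-memory corpus: an iterable of tokenized sentences
sentences = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'dog', 'chased', 'the', 'cat'],
]

Note that with the min_count of 5 used below, a corpus this small would have nearly every word pruned; it only illustrates the expected input format.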

Next, we define the parameters for the model, including the dimensionality of the embedding vectors (vector_size; this parameter was called size before Gensim 4.0), the number of negative samples to draw, and the window size that defines the context:

params = {
    'vector_size': 200,   # Dimensionality of the embedding vectors ('size' before Gensim 4.0)
    'window': 5,          # Window size for context
    'min_count': 5,       # Ignore words appearing fewer than 5 times
    'workers': 4,         # Number of parallel worker threads
    'sg': 1,              # Use the skip-gram model (0 would select CBOW)
    'hs': 0,              # Keep hierarchical softmax off (the default), so that
    'negative': 5,        # training draws 5 negative samples per positive pair
}

Now, we can train the Word2Vec model using negative sampling:

model = Word2Vec(sentences, **params)
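
Training happens inside the constructor call. Once it finishes, the model can be saved and reloaded with Gensim's standard save/load API (the file name word2vec.model here is just an example):

# Persist the trained model and load it back later
model.save('word2vec.model')
model = Word2Vec.load('word2vec.model')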

After training, we can use the model to look up the embedding vector for any word in the vocabulary (querying a word that was pruned or never seen raises a KeyError). For example, to get the embedding vector for the word "apple":

embedding = model.wv['apple']
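
The trained vectors also support similarity queries. For instance, assuming 'apple' appears at least min_count times in the corpus, we can look up its nearest neighbors by cosine similarity:

# Five words whose vectors are closest to 'apple'
for word, score in model.wv.most_similar('apple', topn=5):
    print(word, score)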

Conclusion

Negative sampling is a powerful technique for training word embeddings efficiently. By reducing the number of output weights updated at each training step, it speeds up training while maintaining the quality of the resulting embeddings. Gensim provides an easy-to-use interface for this technique, allowing developers to apply it to a wide range of NLP tasks.

In this blog post, we explored the concept of negative sampling and demonstrated how to use Gensim for training word embeddings using this technique. Hopefully, you now have a better understanding of negative sampling and how to apply it in the context of Gensim.