[Python] Handling Out-of-Vocabulary (OOV) Words in Gensim

Introduction

In natural language processing (NLP), it is common to encounter Out-of-Vocabulary (OOV) words: words that appear at inference time but were not present in the training data. Handling OOV words effectively is crucial for building robust NLP models. Gensim, a popular Python library for topic modeling and document similarity, provides several ways to handle OOV words. In this blog post, we will explore these methods and demonstrate how to use them in Python.

Method 1: Skip OOV Words during Training

The simplest approach is to let the model skip OOV words entirely. A trained Word2Vec (or FastText) model only stores vectors for words that appeared in the training corpus at least min_count times; every other word is simply absent from the vocabulary. So at lookup time, you check membership in model.wv and skip any word that is missing. Here’s an example of how to implement this:

from gensim.models import Word2Vec

sentences = [['hello', 'world'], ['example', 'sentence']]
# In Gensim 4.x the dimension parameter is `vector_size` (formerly `size`)
model = Word2Vec(sentences, vector_size=100, min_count=1)

# OOV words are simply absent from the vocabulary; skip them at lookup time
if 'unknown' in model.wv:
    vector = model.wv['unknown']

In this example, any word that never reaches min_count occurrences in the training corpus is dropped from the vocabulary, and the membership check lets you skip missing words safely instead of raising a KeyError.
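
In practice, this skip-if-missing check is most useful when aggregating vectors over a whole sentence. Below is a minimal sketch of such a helper (sentence_vector is an illustrative name, not part of Gensim’s API), assuming the model trained above:

import numpy as np

def sentence_vector(model, sentence):
    # Average the vectors of in-vocabulary words; OOV words are skipped
    vectors = [model.wv[word] for word in sentence if word in model.wv]
    if not vectors:
        # Every word was OOV; fall back to a zero vector
        return np.zeros(model.wv.vector_size)
    return np.mean(vectors, axis=0)

vec = sentence_vector(model, ['hello', 'unseen', 'world'])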

Method 2: Handling OOV with Pretrained Embeddings

If you have a pretrained word embedding model with a much larger vocabulary, you can use it to warm-start your own model. Gensim lets you build the vocabulary from your corpus and then merge in the pretrained weights with intersect_word2vec_format: words shared with the pretrained file start from the pretrained vectors, while words that are OOV for the pretrained model keep their random initialization and learn vectors during training on your data. Here’s an example of how to implement this:

from gensim.models import Word2Vec

sentences = [['hello', 'world'], ['example', 'sentence']]

# vector_size must match the dimensionality of the pretrained vectors
model = Word2Vec(vector_size=100, min_count=1)
model.build_vocab(sentences)

# Seed words shared with the pretrained file; lockf=1.0 keeps them trainable
# (in Gensim 4.x this method lives on model.wv rather than the model itself)
model.wv.intersect_word2vec_format('pretrained_vectors.bin', binary=True, lockf=1.0)

# Now train on your own data; words missing from the pretrained file get vectors here
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)

In this example, we build the model’s vocabulary from our own sentences, then call intersect_word2vec_format to copy in the pretrained vectors for every word the two vocabularies share (lockf=1.0 leaves those vectors unlocked so they keep updating during training; lockf=0.0 would freeze them). Words from our corpus that the pretrained file does not cover are trained from scratch on our data.
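
Before committing to this approach, it can be useful to check how much of your corpus vocabulary the pretrained file actually covers. A minimal sketch, assuming the same hypothetical pretrained_vectors.bin file and the model built above:

from gensim.models import KeyedVectors

pretrained = KeyedVectors.load_word2vec_format('pretrained_vectors.bin', binary=True)

# Fraction of our model's vocabulary covered by the pretrained file
covered = sum(1 for word in model.wv.key_to_index if word in pretrained)
coverage = covered / len(model.wv.key_to_index)
print(f'pretrained coverage: {coverage:.1%}')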

Method 3: Replacing OOV Words with a Special Token

Another common approach is to replace all OOV words with a special token, such as <UNK>, so that every unknown word shares a single embedding during training and inference. Since no trained model exists yet at preprocessing time, the vocabulary to check against is typically built from corpus frequencies (e.g., words occurring at least min_count times). Here’s an example of how to implement this:

from collections import Counter
from gensim.models import Word2Vec

sentences = [['hello', 'world'], ['example', 'sentence'], ['hello', 'example']]

# Build a frequency-based vocabulary; rarer words are treated as OOV
min_count = 2
counts = Counter(word for sentence in sentences for word in sentence)
vocab = {word for word, count in counts.items() if count >= min_count}

# Replace OOV words with the '<UNK>' token before training
filtered_sentences = [[word if word in vocab else '<UNK>' for word in sentence]
                      for sentence in sentences]

model = Word2Vec(filtered_sentences, vector_size=100, min_count=1)

In this example, we count word frequencies in the corpus, keep words that occur at least min_count times, and replace everything else with the <UNK> token before training. Because the replacement happens during preprocessing, the same mapping must be applied at inference time. You can customize the <UNK> token and the frequency threshold based on your preference.
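
At inference time, the same mapping has to be applied so that unseen words resolve to the shared <UNK> embedding. A minimal sketch (get_vector is an illustrative helper, not a Gensim API):

def get_vector(model, word):
    # Map words outside the trained vocabulary to the shared '<UNK>' embedding
    key = word if word in model.wv else '<UNK>'
    return model.wv[key]

vector = get_vector(model, 'never-seen-before')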

Conclusion

Handling Out-of-Vocabulary (OOV) words is an essential task in NLP. In this blog post, we explored three different methods to handle OOV words using Gensim in Python. By skipping OOV words at lookup time, leveraging pretrained embeddings, or replacing OOV words with a special token, we can improve the performance and robustness of our NLP models.