[Python] Word2Vec with nltk

The Word2Vec algorithm is a popular technique for generating word embeddings: dense numerical vectors that capture the semantic meaning of words. The Natural Language Toolkit (NLTK) library in Python does not implement Word2Vec itself, but it pairs naturally with the gensim library, which does: NLTK supplies tokenization, part-of-speech tagging, and ready-made corpora, while gensim handles training the embeddings.

In this blog post, we will explore how to combine NLTK and gensim to train a Word2Vec model and obtain word embeddings for various natural language processing tasks.

Installing NLTK and gensim

To get started, make sure you have both NLTK and gensim installed on your machine. You can install them using pip:

pip install nltk gensim

After installing NLTK, we also need to download the necessary resources. Open a Python shell and execute the following commands:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

The first command downloads the Punkt tokenizer models, which are used for tokenization. (On recent NLTK releases you may also need nltk.download('punkt_tab'), which packages the same models in a newer format.) The second command downloads the Averaged Perceptron Tagger, which is used for part-of-speech tagging.
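If you want your scripts to be robust against a missing download, a minimal sketch is to check for each resource at startup with nltk.data.find, which raises LookupError when a resource is not installed:

import nltk

# Download each resource only if it is not already installed
for path, name in [("tokenizers/punkt", "punkt"),
                   ("taggers/averaged_perceptron_tagger", "averaged_perceptron_tagger")]:
    try:
        nltk.data.find(path)
    except LookupError:
        nltk.download(name)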

Preparing the Data

Before training the Word2Vec model, we need to preprocess our text data. Word2Vec itself only needs the text tokenized into lists of words; part-of-speech tagging is not required by the algorithm, but we demonstrate it here as a common NLTK preprocessing step.

from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

text = "This is an example sentence."

# Tokenize the text
tokens = word_tokenize(text)

# Perform part-of-speech tagging
tagged_tokens = pos_tag(tokens)

print(tagged_tokens)

The output of the above code is a list of tuples, where each tuple pairs a word with its part-of-speech tag: [('This', 'DT'), ('is', 'VBZ'), ('an', 'DT'), ('example', 'NN'), ('sentence', 'NN'), ('.', '.')].
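Word2Vec expects its training data as an iterable of sentences, where each sentence is a list of token strings. For raw text of your own (rather than a prebuilt corpus), the conversion might look like the following minimal sketch; the sample text is invented for illustration:

from nltk.tokenize import sent_tokenize, word_tokenize

raw_text = ("Word embeddings map words to vectors. "
            "Similar words end up with similar vectors.")

# Split into sentences, then lowercase and tokenize each one
sentences = [[token.lower() for token in word_tokenize(sentence)]
             for sentence in sent_tokenize(raw_text)]

print(sentences)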

Training the Word2Vec Model

Once we have preprocessed our text data, we can train a Word2Vec model with gensim, using a corpus that ships with NLTK.

import nltk
from nltk.corpus import brown
from gensim.models import Word2Vec

# The Brown corpus must be downloaded once before use
nltk.download('brown')

# Train the Word2Vec model on the sentences of the Brown corpus
model = Word2Vec(brown.sents())

# Print the embedding vector of a specific word
word = "example"
print(model.wv[word])

In the above code, we use the Brown corpus from NLTK as training data. The brown.sents() function returns the corpus as a sequence of tokenized sentences, which is exactly the input format gensim's Word2Vec expects.
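The defaults are reasonable, but in practice you will usually set the hyperparameters explicitly. The sketch below shows the most commonly tuned gensim 4.x parameters; the specific values are illustrative, not recommendations:

from nltk.corpus import brown
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=brown.sents(),
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # max distance between target and context words
    min_count=5,      # ignore words that occur fewer times than this
    sg=0,             # 0 = CBOW, 1 = skip-gram
    workers=4,        # parallel training threads
    epochs=5,         # passes over the corpus
)

print(model.wv["example"].shape)  # (100,)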

Using Word Embeddings

Now that we have trained our Word2Vec model, we can use the word embeddings to perform various NLP tasks, such as finding similar words or calculating word similarity. Note that looking up a word that never made it into the training vocabulary raises a KeyError, so you may want to check word in model.wv first.

# Find similar words
similar_words = model.wv.most_similar(positive=["example"])
print(similar_words)

# Calculate word similarity
word1 = "example"
word2 = "sentence"
similarity = model.wv.similarity(word1, word2)
print(similarity)

In the above code, we use the most_similar() method to find the words whose vectors are closest (by cosine similarity) to a given word. We also use the similarity() method to compute the cosine similarity between two words, a value between -1 and 1 where higher means more similar.
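Once trained, you will usually want to persist the embeddings so they do not have to be retrained. A minimal sketch using gensim's save/load API (the file names here are arbitrary):

from gensim.models import Word2Vec, KeyedVectors

# Save the full model (training can be resumed later)
model.save("brown_word2vec.model")

# Or save just the word vectors (smaller, read-only)
model.wv.save("brown_vectors.kv")

# Reload later
model = Word2Vec.load("brown_word2vec.model")
vectors = KeyedVectors.load("brown_vectors.kv")
print(vectors.most_similar("example", topn=3))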

Conclusion

In this blog post, we have explored how to combine NLTK and gensim to work with Word2Vec in Python. We learned how to preprocess text, train a Word2Vec model, obtain word embeddings, and use them for various natural language processing tasks.

NLTK provides a powerful set of tools for natural language processing, and together with gensim it makes training word embeddings such as Word2Vec straightforward.