[파이썬] Gensim 유사한 단어 찾기

Gensim is a popular Python library used for natural language processing tasks, including finding similar words in a given corpus of text. In this blog post, we will explore how to use Gensim to find similar words in Python.

Installation

Before we can start using Gensim, we need to install it. You can install Gensim using pip by running the following command:

pip install gensim

Once Gensim is installed, we can proceed with finding similar words.

Loading the Corpus

The first step is to load the corpus of text that we want to work with. A corpus is a collection of documents or text data. In this example, let’s say we have a corpus of news articles.

from gensim.models import Word2Vec

# Assuming 'corpus.txt' is the file containing our corpus of text
with open('corpus.txt', 'r') as f:
    corpus = f.read().splitlines()

Training the Word2Vec Model

Next, we need to train a Word2Vec model using the corpus that we loaded. Word2Vec is an algorithm provided by Gensim that converts words into numerical vectors, allowing us to perform mathematical operations on words.

# Train the Word2Vec model
model = Word2Vec(corpus, min_count=1)

Finding Similar Words

Once the model is trained, we can use it to find similar words to a given word. The most_similar() method of the Word2Vec model returns the most similar words to a target word based on their vector representations.

# Find similar words to a target word
similar_words = model.wv.most_similar('apple')

The most_similar() method returns a list of tuples, where each tuple contains a similar word and its similarity score. We can specify the number of similar words to return by using the topn parameter. By default, it returns the top 10 similar words.

# Find top 5 similar words to 'apple'
similar_words = model.wv.most_similar('apple', topn=5)

Example Output

Here’s an example of what the output of finding similar words may look like:

[('fruit', 0.85), ('banana', 0.81), ('pear', 0.78), ('orange', 0.76), ('mango', 0.74)]

In this example, the word ‘apple’ is found to be similar to words like ‘fruit’, ‘banana’, ‘pear’, ‘orange’, and ‘mango’, based on their similarity scores.

Conclusion

Finding similar words in a corpus of text is a common task in natural language processing. Gensim provides a powerful and efficient way to achieve this using the Word2Vec algorithm. In this blog post, we have explored how to load a corpus, train a Word2Vec model, and find similar words using Gensim in Python.

To learn more about Gensim and its capabilities, refer to the official documentation.