[파이썬] Gensim 코사인 유사도 계산

06 Sep 2023

gensim

In natural language processing (NLP) tasks, calculating similarity between documents or words is often a common requirement. One popular method for measuring similarity is using the cosine similarity metric.

In this blog post, we will explore how to calculate cosine similarity using the Gensim library in Python. Gensim is a powerful NLP library that provides efficient implementations of various similarity measures, including cosine similarity.

Step 1: Installing Gensim

To get started, we need to install Gensim. Open your terminal and run the following command:

pip install gensim

Step 2: Importing necessary libraries

Once Gensim is installed, we can import the required libraries in our Python script:

import gensim
from gensim.models import Word2Vec
from gensim import similarities

Step 3: Preprocessing the data

Before we can calculate cosine similarity, we need to preprocess the data. This may involve tokenizing documents, removing stop words, converting text to lowercase, and other text cleaning operations. For example, let’s preprocess a list of documents:

documents = ['I love to code in Python.', 'Python is a popular programming language.', 'Gensim provides useful NLP tools.']
processed_documents = [gensim.utils.simple_preprocess(doc) for doc in documents]

Step 4: Training the Word2Vec model

To calculate cosine similarity, we can use pretrained word embeddings or train our own model using Word2Vec. Word2Vec is an algorithm that learns word embeddings from large datasets. We can train a Word2Vec model on our processed documents as follows:

model = Word2Vec(processed_documents, min_count=1)

Step 5: Calculating cosine similarity

Finally, we can calculate cosine similarity between two documents or words using the trained model.

For example, let’s calculate the cosine similarity between the first and second documents in our preprocessed list:

doc1 = processed_documents[0]
doc2 = processed_documents[1]

vec1 = model.infer_vector(doc1)
vec2 = model.infer_vector(doc2)

similarity = similarities.cosine_similarity([vec1], [vec2])[0][0]

The similarity variable will now contain the cosine similarity between the two documents.

Conclusion

In this blog post, we have explored how to calculate cosine similarity using Gensim in Python. By leveraging the power of Gensim and its Word2Vec model, you can easily measure the similarity between documents or words in your NLP projects.

Remember to preprocess your data, train the Word2Vec model, and use the similarities.cosine_similarity function to calculate the cosine similarity. Gensim also provides other useful features for NLP tasks, so be sure to check out its official documentation for more information.