[파이썬] Gensim Similarity Queries

Gensim is a popular Python library for natural language processing, specifically designed to handle large corpora of text documents. One of the key features of Gensim is its ability to perform similarity queries, allowing us to find documents that are most similar to a given query document. In this blog post, we will explore how to use Gensim for similarity queries in Python.

Installing Gensim

Before we can start using Gensim, we need to install it. You can install Gensim using pip, the Python package installer, by running the following command:

pip install gensim

Preparing the Corpus

To perform similarity queries, we first need to prepare a corpus of documents. In Gensim, a corpus is represented as a list of documents, where each document is a list of words. Let’s assume we have a list of N documents, where each document is represented as a string.

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

To convert this list of documents into a Gensim-compatible format, we need to tokenize each document into a list of words. We can use the simple_preprocess function from Gensim for this purpose.

from gensim.utils import simple_preprocess

corpus = [simple_preprocess(document) for document in documents]

Building the Document Representation

After preparing the corpus, we need to build a numerical representation of each document in the corpus. Gensim provides several techniques for this, such as bag-of-words, TF-IDF, and word embeddings. For simplicity, let’s use the bag-of-words representation.

To build the bag-of-words representation, we first need to create a dictionary of unique words in the corpus.

from gensim.corpora import Dictionary

dictionary = Dictionary(corpus)

Next, we need to convert each document in the corpus into its bag-of-words representation.

bow_corpus = [dictionary.doc2bow(document) for document in corpus]

Performing Similarity Queries

Now that we have prepared the corpus and built the document representation, we can perform similarity queries using Gensim.

First, we need to build a similarity index based on our document representation.

from gensim.similarities import MatrixSimilarity

similarity_index = MatrixSimilarity(bow_corpus)

To perform a similarity query, we need to convert our query document into its bag-of-words representation.

query_document = "This is a new document."
query_bow = dictionary.doc2bow(simple_preprocess(query_document))

Finally, we can use the similarity index to find the most similar documents to our query document.

similar_docs = similarity_index[query_bow]

The similar_docs variable will contain a list of (document_id, similarity_score) tuples, where the document ID represents the index of the similar document in the original corpus, and the similarity score represents the similarity between the query document and the similar document.

Conclusion

In this blog post, we explored how to use Gensim for similarity queries in Python. We learned how to prepare a corpus, build the document representation, and perform similarity queries using Gensim’s similarity index. Gensim provides a powerful and efficient way to find similar documents in large text corpora, making it a valuable tool for natural language processing tasks.