[Python] Gensim Latent Semantic Analysis (LSA)

06 Sep 2023

gensim

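Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI), is a technique used in natural language processing and information retrieval to analyze relationships between a set of documents and the terms they contain. It works by factorizing a term-document matrix with a truncated singular value decomposition, which maps documents and terms into a small number of latent dimensions. Gensim is a popular Python library for topic modeling and document similarity, and it includes an LSA implementation in its LsiModel class.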
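For readers who want the formula behind the name, LSA is usually written as a rank-k truncated SVD. The symbols below are notation chosen for this post, not names that Gensim exposes: X is the term-document matrix with m terms and n documents, and k is the number of topics.

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top},
\qquad U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\quad
V_k \in \mathbb{R}^{n \times k}

% One common way to project a document vector d (length m) into topic space:
\hat{d} = U_k^{\top} d \in \mathbb{R}^{k}
```

Each column of U_k can be read as one topic, expressed as a set of weights over terms.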

In this blog post, we will explore how to perform LSA using Gensim in Python. Let’s get started!

Installation

Before we begin, make sure you have Gensim installed. You can install it using pip:

pip install gensim

Data Preprocessing

To perform LSA, we need a corpus of documents. In this example, let’s assume we have a collection of text documents stored in a list of strings called documents. Before applying LSA, we need to preprocess the data by tokenizing the text, removing stop words, and applying other text cleaning steps. Gensim’s preprocess_string function does all of this in one call: with its default filters it lowercases the text, strips tags, punctuation, and digits, removes stop words, drops very short tokens, and stems what is left, returning a list of tokens. A separate remove_stopwords pass is therefore unnecessary (and remove_stopwords expects a string, not a token list). Here’s an example of how we can preprocess our documents:

from gensim.parsing.preprocessing import preprocess_string

# Each document becomes a list of cleaned, stemmed tokens
filtered_documents = [preprocess_string(doc) for doc in documents]

Building the LSA Model

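To build an LSA model, we need to represent our preprocessed documents as numerical vectors. In Gensim this takes two steps: a Dictionary maps each token to an integer ID and converts every document to a bag-of-words vector, and a TfidfModel then reweights those counts. (Gensim has no TfidfVectorizer class; that name belongs to scikit-learn, and the gensim.sklearn_api wrappers were removed in Gensim 4.0.) Here’s an example of how to create the TF-IDF corpus:

from gensim.corpora import Dictionary
from gensim.models import TfidfModel

# Map tokens to integer IDs and build sparse bag-of-words vectors
dictionary = Dictionary(filtered_documents)
bow_corpus = [dictionary.doc2bow(doc) for doc in filtered_documents]

# Reweight raw counts with TF-IDF; each document is a list of (term_id, weight) pairs
tfidf = TfidfModel(bow_corpus)
document_matrix = [tfidf[bow] for bow in bow_corpus]

Once we have the TF-IDF corpus, we can pass it to the LsiModel class in Gensim to create the LSA model. The id2word argument takes the dictionary so that topics can be printed as words rather than integer IDs. Here’s how we can create the model and specify the number of dimensions (or topics) to consider:

from gensim.models import LsiModel

num_topics = 10  # Number of dimensions/topics
lsa_model = LsiModel(document_matrix, id2word=dictionary, num_topics=num_topics)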

Extracting Topics and Document Similarities

After training the LSA model, we can extract the topics and compute document similarities. Gensim provides functions to do both of these tasks. Here’s an example of how we can extract the topics:

topics = lsa_model.print_topics(num_topics=num_topics)
for topic in topics:
    print(topic)

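To compute document similarities, we project the documents into LSA space and build a similarity index over them; LsiModel itself has no similarity method. Gensim’s MatrixSimilarity class computes the cosine similarity between a query vector and every indexed document. Here’s an example:

from gensim.similarities import MatrixSimilarity

doc_id = 0  # ID of the document for which we want to find similar documents
query_document = document_matrix[doc_id]

# Index all documents in LSA space, then query with the LSA vector of our document
index = MatrixSimilarity(lsa_model[document_matrix], num_features=num_topics)
similar_documents = index[lsa_model[query_document]]  # one cosine score per document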
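Ranking documents in LSA space comes down to cosine similarity between dense topic vectors. As a library-free sketch of that idea, the example below uses made-up 2-dimensional vectors standing in for LSA projections; the function name and all numbers are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


# Made-up LSA vectors: documents 0 and 1 lean on topic 0, documents 2 and 3 on topic 1
lsa_vectors = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.7],
]

doc_id = 0
scores = [cosine_similarity(lsa_vectors[doc_id], vec) for vec in lsa_vectors]

# Rank documents from most to least similar to the query document
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [0, 1, 3, 2]
```

The query document always ranks first against itself, so in practice you would skip the first entry. With Gensim, the same kind of ranking can be obtained by sorting the scores in similar_documents, for example with sorted(enumerate(similar_documents), key=lambda item: -item[1]).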

Conclusion

In this blog post, we explored how to perform Latent Semantic Analysis (LSA) using Gensim in Python. We learned how to preprocess the data, build the LSA model, and extract topics and document similarities. LSA is a powerful technique for understanding the semantic relationships between documents, and Gensim simplifies the implementation process.

Gensim also provides many other text analysis capabilities, so be sure to check out the official documentation for more details and examples.

I hope you found this tutorial helpful! Let me know if you have any questions or suggestions in the comments below. Happy coding!