[파이썬] Gensim MmCorpus 사용

Gensim is a widely used library in Natural Language Processing (NLP) for topic modeling, document similarity, and other text processing tasks. One of the key components in Gensim is the MmCorpus module, which is used for storing and efficiently accessing large corpora of text documents.

What is MmCorpus?

MmCorpus stands for Matrix Market Corpus. It is a file format used to store large document collections in a compressed manner, making it memory-friendly for processing. In Gensim, MmCorpus is represented using the MmCorpus class. It allows you to load and save corpus data, perform efficient streaming of documents, and interact with the corpus in a memory-efficient way.

Usage of MmCorpus

To use MmCorpus in your Python code, first, you need to install Gensim using pip:

pip install gensim

Once installed, you can import the necessary modules and start using MmCorpus:

from gensim import corpora, models

# Load the MmCorpus
corpus = corpora.MmCorpus('path/to/your/corpus.mm')

# Iterate over the documents in the corpus
for doc in corpus:
    # Process the document
    # ...
    pass

# Save the MmCorpus
corpora.MmCorpus.serialize('path/to/save/corpus.mm', corpus)

In the code snippet above, we first import the necessary modules from Gensim. Then, we load the MmCorpus using the corpora.MmCorpus class by providing the path to the corpus file. We can then iterate over the documents in the corpus using a simple for loop.

After processing the documents as needed, we can save the MmCorpus using the corpora.MmCorpus.serialize method by providing the path to save the corpus and the corpus object.

Advanced Usage

MmCorpus provides advanced features such as streaming, memory mapping, and tokenization. Here’s an example of using these features:

from gensim import corpora, models
from gensim.utils import tokenize

# Create a MmCorpus from tokenized documents
corpus = corpora.MmCorpus('path/to/your/corpus.mm', id2word=dictionary, tokenizer=tokenize)

# Use MmCorpus in a memory-mapped manner
corpus_mm = corpora.MmCorpus('path/to/your/corpus.mm', mmap='r')

# Stream documents from the corpus
streaming_corpus = corpora.MmCorpus('path/to/your/corpus.mm', streaming=True)

# Process documents from the streaming corpus
for doc in streaming_corpus:
    # Process the document
    # ...
    pass

In the above example, we create a MmCorpus from tokenized documents by passing the id2word parameter, which is a mapping of word IDs to words in the dictionary. We also specify a custom tokenizer using tokenize from gensim.utils module.

The corpus_mm object in the code snippet is a memory-mapped version of the MmCorpus. It allows efficient memory usage by only loading the necessary parts of the corpus into memory.

The streaming_corpus is created with the streaming=True parameter, which streams documents lazily instead of loading all of them into memory at once. This is useful when dealing with large corpora that cannot fit into memory.

Conclusion

In this blog post, we have explored the usage of Gensim’s MmCorpus module in Python. We learned about the benefits of MmCorpus for storing and accessing large text corpora efficiently. We also covered basic and advanced usage of MmCorpus, such as loading, saving, streaming, and memory mapping.

Using MmCorpus in your NLP projects can greatly improve the memory efficiency and performance when working with large document collections.