[파이썬] Gensim Phrases 모델을 이용한 Bigram 및 Trigram 학습

Gensim is a popular Python library for natural language processing tasks such as topic modeling, document similarity analysis, and text generation. One of the powerful features of Gensim is its ability to learn bigram and trigram models using the Phrases module.

In this blog post, we will explore how to use the Gensim Phrases module to train bigram and trigram models for improving text processing and analysis tasks.

Installing Gensim

Before we start, make sure to install Gensim by running the following command:

pip install gensim

Loading the Data

First, let’s start by loading some example text data. For simplicity, we will use a list of sentences.

data = [
    "I love to read books",
    "Reading is a great way to learn",
    "Books are a window to the world",
    "I love to travel and explore different cultures"
]

Preprocessing the Text

Before training the bigram and trigram models, we need to preprocess the text by tokenizing it into words and converting all words to lowercase.

import gensim
from gensim.utils import simple_preprocess

# Tokenize and preprocess the text
data_preprocessed = [simple_preprocess(sentence, deacc=True) for sentence in data]

Training the Bigram Model

To train the bigram model, we create an instance of the Phrases class and pass the preprocessed data as input. We set the min_count parameter to 2 to ensure that only bigrams that occur at least twice are considered.

bigram = gensim.models.Phrases(data_preprocessed, min_count=2)

Applying the Bigram Model

After training the bigram model, we can apply it to the preprocessed data to generate a list of sentences with bigrams.

data_bigram = [bigram[sentence] for sentence in data_preprocessed]

Training the Trigram Model

Similarly, we can train the trigram model by creating another instance of the Phrases class and passing the data with bigrams as input.

trigram = gensim.models.Phrases(data_bigram, min_count=2)

Applying the Trigram Model

Once the trigram model is trained, we can apply it to the data with bigrams to generate a list of sentences with trigrams.

data_trigram = [trigram[sentence] for sentence in data_bigram]

Conclusion

In this blog post, we have learned how to use the Gensim Phrases module to train bigram and trigram models. These models can improve text processing and analysis tasks by capturing meaningful word combinations and phrases.

By incorporating bigrams and trigrams in your models, you can enhance tasks such as topic modeling, document similarity analysis, and text generation.

Gensim provides a simple and efficient way to train and apply these models, making it a valuable tool for natural language processing tasks in Python.

Stay tuned for more exciting tutorials on Gensim and other Python libraries for text analysis!