[Python] Embeddings with a Gensim FastText Model

FastText is a library for learning word embeddings, developed by Facebook AI Research. It extends the Word2Vec model by considering not only whole words but also subword information in the form of character n-grams.
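To make "subword information" concrete, here is a conceptual sketch of how a word is decomposed into character n-grams. The char_ngrams helper is hypothetical and for illustration only; FastText's defaults are min_n=3 and max_n=6, and Gensim's real implementation additionally hashes the n-grams into a fixed number of buckets.

def char_ngrams(word, min_n=3, max_n=6):
    # FastText wraps the word in boundary markers before extracting n-grams
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(padded) - n + 1)]

# With min_n = max_n = 3, "where" yields five trigrams
print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']

The vector for a word is then built from the vectors of these n-grams, which is what lets FastText handle rare and unseen words.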

In this blog post, we will walk through how to use the Gensim library in Python to train a FastText model and generate word embeddings.

Preparing the Data

Before we can train the FastText model, we need to ensure that our data is properly preprocessed and tokenized. Gensim expects the training corpus as a list of sentences, where each sentence is a list of individual word tokens.

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Download the tokenizer models used by sent_tokenize and word_tokenize
nltk.download('punkt')

data = "This is an example sentence. Another sentence follows."

# Split the text into sentences, then each sentence into word tokens
sentences = sent_tokenize(data)
tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]

print(tokenized_sentences)
# [['This', 'is', 'an', 'example', 'sentence', '.'], ['Another', 'sentence', 'follows', '.']]

Training the FastText Model

Now that we have our tokenized sentences, we can train the FastText model using the Gensim library.

from gensim.models.fasttext import FastText

# Train the FastText model (parameter names follow Gensim 4.x)
model = FastText(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, sg=1, epochs=10)

# Save the model in Gensim's native format for later use
model.save("fasttext_model.bin")

In the example above, we pass the tokenized sentences to the FastText constructor to create and train the model. Here, vector_size is the dimensionality of the word vectors, window controls the size of the context window, min_count is the minimum frequency a word needs to be included in the vocabulary, sg=1 selects the skip-gram architecture (sg=0 would use CBOW), and epochs is the number of training passes over the corpus. (In Gensim versions before 4.0, these parameters were called size and iter.)
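If you want more control over the process, vocabulary building and training can also be run as separate steps. The following is an equivalent sketch using Gensim 4.x's build_vocab() and train() methods:

from gensim.models.fasttext import FastText

# Create an untrained model with the same hyperparameters
model = FastText(vector_size=100, window=5, min_count=1, sg=1)

# Scan the corpus and build the vocabulary
model.build_vocab(corpus_iterable=tokenized_sentences)

# Train the model; total_examples is required when corpus_iterable is used
model.train(
    corpus_iterable=tokenized_sentences,
    total_examples=model.corpus_count,
    epochs=10,
)

This two-step form is convenient when you later want to continue training on new data via build_vocab(..., update=True).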

Using the FastText Model for Embeddings

Once the model is trained, we can utilize it to generate word embeddings for new words or sentences.

from gensim.models import FastText

# Load the saved FastText model
model = FastText.load("fasttext_model.bin")

# Get the word embeddings
word_embeddings = model.wv
embedding = word_embeddings['example']

print(embedding)

In the code snippet above, we load the saved FastText model using the FastText.load() method. We then access the word vectors through the model's wv attribute, retrieve the embedding for the word 'example', and print it out.
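A key strength of FastText is that it can also produce a vector for a word that never appeared in the training data, by composing it from the word's subword n-grams. The snippet below illustrates this with a deliberately misspelled query; the key_to_index lookup assumes Gensim 4.x, and with the tiny toy corpus above the similarity results are not meaningful, but the calls show the API.

# 'exampel' (a misspelling) is not in the training corpus,
# but FastText still builds a vector from its character n-grams
oov_vector = model.wv['exampel']
print(oov_vector.shape)  # (100,)

# Check what is actually in the trained vocabulary
print('example' in model.wv.key_to_index)  # True
print('exampel' in model.wv.key_to_index)  # False

# Query the most similar in-vocabulary words
print(model.wv.most_similar('sentence', topn=3))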

Conclusion

In this blog post, we have seen how to train a FastText model using the Gensim library in Python. We learned how to prepare the data, train the model, and generate word embeddings. FastText is a popular method for creating word representations, especially when dealing with out-of-vocabulary words, because vectors for unseen terms can be assembled from their subword n-grams.

By running this code on your own dataset, you can leverage the power of FastText embeddings to enhance various natural language processing tasks, such as text classification, sentiment analysis, or information retrieval.
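For example, one simple baseline for such tasks (only a sketch, not the only option) is to represent a sentence as the average of its token vectors and feed the result to a downstream model:

import numpy as np

def sentence_embedding(tokens, model):
    # Average the FastText vectors of all tokens; thanks to subword
    # information, even unseen tokens receive a vector
    return np.mean([model.wv[token] for token in tokens], axis=0)

features = sentence_embedding(['Another', 'example', 'sentence'], model)
print(features.shape)  # (100,): ready to feed into a classifier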

Feel free to experiment with different parameters, such as the dimensionality of the word vectors or the size of the context window, to optimize the performance of your FastText model.