[Python] Embedding Aggregation in Gensim

Gensim is a popular Python library for natural language processing and topic modeling. It provides various functionalities to work with word embeddings, which are vector representations of words in a high-dimensional space. One common task in NLP is to aggregate word embeddings to obtain sentence or document embeddings. In this blog post, we will explore how to perform embedding aggregation using Gensim.

Word Embeddings in Gensim

Gensim can train Word2Vec and FastText models and can also load pre-trained vectors, including GloVe vectors converted to the word2vec format. These embeddings represent words as dense vectors that capture semantic and syntactic relationships between them. For the rest of this post, assume we have already trained or loaded such a model with Gensim.
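As a quick refresher, here is a minimal sketch of two common ways to get word vectors in Gensim: training a small Word2Vec model on your own corpus, or downloading pre-trained vectors with gensim.downloader. The toy corpus and the "glove-wiki-gigaword-50" name are only illustrative choices.

from gensim.models import Word2Vec
import gensim.downloader as api

# Option 1: train a small Word2Vec model on a toy corpus of tokenized sentences.
corpus = [["i", "love", "apples"], ["i", "like", "bananas"]]
w2v = Word2Vec(corpus, vector_size=50, window=2, min_count=1)
model = w2v.wv  # KeyedVectors: supports model[word] lookups

# Option 2: download pre-trained vectors instead (one of gensim-data's published models).
# model = api.load("glove-wiki-gigaword-50")

Either way, `model` in the snippets below is a KeyedVectors object, so `model[word]` returns that word's vector.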

Aggregating Word Embeddings

To obtain a sentence or document embedding, we need to aggregate the word embeddings of the constituent words. There are various methods for aggregation, such as averaging, summing, or weighted aggregation based on word importance. Here, we will focus on the averaging method.

import numpy as np

def aggregate_embeddings(embeddings, vector_size=100):
    # Guard against an empty list: return a zero vector of the expected
    # dimensionality (pass model.vector_size to match your embeddings).
    if len(embeddings) == 0:
        return np.zeros(vector_size)
    # Element-wise average across the words (axis 0 runs over the words).
    return np.mean(embeddings, axis=0)

# Example usage
words = ['apple', 'banana', 'orange']
word_embeddings = [model[word] for word in words]  # `model` is a KeyedVectors instance, e.g. Word2Vec(...).wv
sentence_embedding = aggregate_embeddings(word_embeddings)

In the code above, aggregate_embeddings takes a list of word vectors and returns their element-wise average: NumPy's mean with axis=0 averages across the words, producing a single vector with the same dimensionality as the inputs. The vector_size argument is only used as a fallback when the list is empty, so it should match the model's dimensionality (model.vector_size).
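Plain averaging treats every word equally. As a rough sketch of the weighted aggregation mentioned earlier, you can down-weight very frequent words, for example with a smooth inverse-frequency weight a / (a + p(w)). The helper below and its word_freq argument are illustrative assumptions, not part of Gensim's API.

def weighted_aggregate(words, model, word_freq, a=1e-3):
    # word_freq maps a word to its relative frequency in some reference corpus.
    vectors, weights = [], []
    for word in words:
        if word in model.key_to_index:  # skip out-of-vocabulary tokens
            vectors.append(model[word])
            weights.append(a / (a + word_freq.get(word, 0.0)))  # rarer words weigh more
    if not vectors:
        return np.zeros(model.vector_size)
    return np.average(vectors, axis=0, weights=weights)

With this weighting, rarer and usually more informative words contribute more to the sentence vector than common function words.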

Example: Sentence Similarity using Embedding Aggregation

Now that we know how to aggregate word embeddings, let’s see how we can utilize this to measure the similarity between sentences using Gensim.

from scipy.spatial.distance import cosine

def sentence_similarity(sentence1, sentence2):
    # Simple whitespace tokenization; lowercase to match typical embedding vocabularies.
    tokens1 = sentence1.lower().split()
    tokens2 = sentence2.lower().split()

    # Look up only the words that exist in the embedding vocabulary.
    embeddings1 = [model[word] for word in tokens1 if word in model.key_to_index]
    embeddings2 = [model[word] for word in tokens2 if word in model.key_to_index]

    sentence_embedding1 = aggregate_embeddings(embeddings1, model.vector_size)
    sentence_embedding2 = aggregate_embeddings(embeddings2, model.vector_size)

    # Cosine similarity is 1 minus the cosine distance.
    return 1 - cosine(sentence_embedding1, sentence_embedding2)

# Example usage
sentence1 = "I love apples"
sentence2 = "I like bananas"
similarity = sentence_similarity(sentence1, sentence2)

In the code above, sentence_similarity takes two sentences and returns a similarity score. We tokenize and lowercase each sentence, look up the vector of every token that exists in the vocabulary, aggregate those vectors into sentence embeddings, and finally compute the cosine similarity between the two sentence embeddings (1 minus the cosine distance).
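As a sanity check, Gensim's KeyedVectors also exposes n_similarity, which scores two lists of words by comparing the means of their unit-normalized word vectors, so it should produce results close to the hand-rolled function above. The snippet below simply reuses the example sentences.

# Compare against Gensim's built-in set-to-set similarity.
tokens1 = [w for w in sentence1.lower().split() if w in model.key_to_index]
tokens2 = [w for w in sentence2.lower().split() if w in model.key_to_index]
print(model.n_similarity(tokens1, tokens2))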

Conclusion

Aggregating word embeddings is a useful technique in NLP for obtaining sentence or document-level representations. In this blog post, we discussed how to perform embedding aggregation using Gensim, a powerful library for natural language processing. We saw an example of calculating sentence similarity based on aggregated embeddings. By experimenting with different word embeddings and aggregation methods, you can further enhance your NLP applications.