[Python] NLTK Semantic Similarity

Semantic similarity is a measure of how closely related two pieces of text are in terms of meaning. It aims to capture the underlying semantic content rather than just comparing words or phrases. This concept is widely used in various natural language processing tasks such as information retrieval, text classification, and question answering.

In this blog post, we will explore how to calculate semantic similarity using the Natural Language Toolkit (NLTK) library in Python. NLTK is a powerful open-source library that provides various tools and resources for natural language processing tasks.

Installing NLTK

Before we begin, make sure you have NLTK installed. If not, you can install it using pip:

pip install nltk
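
The tokenizer and WordNet-based similarity functions used later in this post also rely on NLTK data packages that are downloaded separately. A one-time download from a Python shell should cover them (exact resource names can vary slightly between NLTK versions):

import nltk

nltk.download('punkt')    # tokenizer models used by nltk.word_tokenize
nltk.download('wordnet')  # WordNet corpus used for the similarity measures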

Preparing the Text Data

To calculate semantic similarity, we need to tokenize and preprocess the text data. Tokenization is the process of breaking text into individual words or sentences, and preprocessing involves cleaning the text by removing stop words and punctuation and converting all words to lowercase.

Let’s take an example where we have two sentences:

sentence1 = "The quick brown fox jumps over the lazy dog."
sentence2 = "The lazy dog is jumped over by the quick brown fox."
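
As a quick sketch of the preprocessing step described above, here is one way the first sentence could be tokenized, lowercased, and stripped of stop words and punctuation (this assumes the stopwords corpus has been downloaded; the variable names are just for illustration):

import string
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stop word list

tokens = nltk.word_tokenize(sentence1.lower())
stop_words = set(stopwords.words('english'))
filtered_tokens = [t for t in tokens
                   if t not in stop_words and t not in string.punctuation]
print(filtered_tokens)  # e.g. ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']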

Calculating Semantic Similarity

NLTK provides various algorithms for calculating semantic similarity, including the following (a short comparison sketch follows the list):

  1. Path similarity: Measures the similarity based on the shortest path between two concepts in a WordNet hierarchy.
  2. Leacock-Chodorow similarity: Uses the shortest path length between the two concepts and the maximum depth of the WordNet taxonomy to calculate similarity.
  3. Wu-Palmer similarity: Computes similarity based on the depth of the two concepts and their Least Common Subsumer (LCS) in WordNet.
  4. Resnik similarity: Measures similarity by computing the information content of the Least Common Subsumer (LCS) in WordNet.
  5. Lin similarity: Calculates similarity by combining the information content of the LCS and the information content of the concepts themselves.

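To get a feel for how these measures differ, here is a minimal sketch that scores a single synset pair ("dog" vs. "cat") with each of them. The Resnik and Lin measures additionally require an information-content file; the Brown-corpus one bundled with NLTK is used here:

import nltk
from nltk.corpus import wordnet, wordnet_ic

nltk.download('wordnet_ic')  # information-content files for Resnik and Lin

dog = wordnet.synset('dog.n.01')
cat = wordnet.synset('cat.n.01')
brown_ic = wordnet_ic.ic('ic-brown.dat')  # IC computed from the Brown corpus

print(dog.path_similarity(cat))           # path similarity, e.g. 0.2
print(dog.lch_similarity(cat))            # Leacock-Chodorow
print(dog.wup_similarity(cat))            # Wu-Palmer
print(dog.res_similarity(cat, brown_ic))  # Resnik
print(dog.lin_similarity(cat, brown_ic))  # Lin
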
Let’s use path similarity to calculate the semantic similarity between our example sentences by averaging the scores over all cross-sentence word pairs:

import nltk
from nltk.corpus import wordnet

def calculate_semantic_similarity(sentence1, sentence2):
    # Tokenize both sentences into lowercase words
    s1_tokens = nltk.word_tokenize(sentence1.lower())
    s2_tokens = nltk.word_tokenize(sentence2.lower())

    # Take the first synset for each token that has one; tokens like
    # "the" or "." have no synsets and would otherwise raise an IndexError
    s1_synsets = [wordnet.synsets(t)[0] for t in s1_tokens if wordnet.synsets(t)]
    s2_synsets = [wordnet.synsets(t)[0] for t in s2_tokens if wordnet.synsets(t)]

    similarity_scores = []

    # Score every cross-sentence synset pair; path_similarity returns
    # None when there is no connecting path, so skip those pairs
    for s1_synset in s1_synsets:
        for s2_synset in s2_synsets:
            similarity_score = s1_synset.path_similarity(s2_synset)
            if similarity_score is not None:
                similarity_scores.append(similarity_score)

    # Avoid division by zero when no pair could be scored
    if not similarity_scores:
        return 0.0
    return sum(similarity_scores) / len(similarity_scores)

similarity_score = calculate_semantic_similarity(sentence1, sentence2)
print(f"Semantic similarity score: {similarity_score}")
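
One design note on this sketch: it picks only the first synset for each word rather than disambiguating word senses, and every cross-sentence pair contributes to the average, including weakly related word pairs. The absolute score is therefore hard to interpret on its own; it is most useful for comparing different sentence pairs against each other.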

Conclusion

In this blog post, we explored how to calculate semantic similarity using NLTK in Python. We saw how to tokenize and preprocess text data, and then used the path similarity algorithm from WordNet to calculate the semantic similarity between two sentences. NLTK provides a range of other algorithms that can be explored according to the specific use case.

Semantic similarity is a powerful tool in natural language processing and can be utilized in various applications such as chatbots, information retrieval, and language translation. It helps us better understand the meaning and context of text, enabling more accurate analysis and decision-making.

NLTK makes it easy to implement semantic similarity in Python and provides a range of resources and functionalities to enhance natural language processing tasks. It’s definitely worth exploring for anyone working with text data and looking to extract its semantic meaning.