[python] NLTK를 사용해 텍스트의 조사를 추출하는 방법은 무엇인가요?

30 Nov 2023

python

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

def extract_particles(text):
    # 문장 토큰화
    sentences = sent_tokenize(text)
    
    # 조사 리스트
    particles = []
    
    # 각 문장에서 단어 토큰화
    for sentence in sentences:
        words = word_tokenize(sentence)
        
        # 각 단어의 품사 태깅
        tagged_words = nltk.pos_tag(words)
        
        # 조사 추출
        for word, tag in tagged_words:
            if tag == 'IN':
                particles.append(word)
    
    # 불용어 제거
    stop_words = set(stopwords.words('english'))
    particles = [p for p in particles if p not in stop_words]
    
    return particles

# 텍스트 입력
text = "나는 밥을 먹고 집에 갔다."

# 조사 추출 함수 호출
particles = extract_particles(text)

# 결과 출력
print(particles)

위의 코드에서는 NLTK의 word_tokenize와 sent_tokenize를 사용해 텍스트를 단어와 문장으로 토큰화합니다. 그리고 pos_tag를 사용해 단어에 품사를 부착한 후, 품사가 ‘IN’인 단어를 조사로 추출합니다.

마지막으로, stopwords를 사용해 추출된 조사 중에서 불용어를 제거합니다.

이 코드를 실행하면, 주어진 텍스트에서 추출된 조사가 출력될 것입니다.

참고 링크: