[java] 아파치 루신(Apache Lucene)을 이용한 텍스트 데이터의 전처리 기법

28 Nov 2023

java

아파치 루신(Apache Lucene)은 자바를 기반으로 한 오픈소스 검색 엔진 라이브러리로, 텍스트 데이터의 색인화(indexing)와 검색(searching)을 위한 기능을 제공합니다. 이를 통해 우리는 텍스트 데이터를 빠르고 효율적으로 처리할 수 있습니다.

텍스트 데이터의 전처리는 검색 엔진의 성능을 향상시키기 위해 필요한 과정입니다. 아파치 루신을 사용하여 텍스트 데이터를 전처리할 때 유용한 몇 가지 기법을 살펴보겠습니다.

1. 토큰화(Tokenization)

토큰화는 텍스트 데이터를 작은 단위로 분리하는 과정입니다. 일반적으로 단어 단위로 텍스트를 분리합니다. 아파치 루신은 WhitespaceTokenizer나 StandardTokenizer와 같은 토크나이저를 제공하여 텍스트 데이터를 토큰화할 수 있습니다.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizationExample {
    public static void main(String[] args) {
        String text = "Apache Lucene을 이용한 텍스트 데이터의 전처리 기법";
        
        try (TokenStream tokenStream = new StandardAnalyzer().tokenStream(null, text)) {
            tokenStream.reset();
            
            CharTermAttribute term = tokenStream.getAttribute(CharTermAttribute.class);
            
            while (tokenStream.incrementToken()) {
                System.out.println(term.toString());
            }
            
            tokenStream.end();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

2. 정규화(Normalization)

정규화는 텍스트 데이터를 일관된 형태로 변환하는 과정입니다. 예를 들어, 소문자로 변환하거나 특수문자를 제거할 수 있습니다. 아파치 루신은 LowerCaseFilter나 ASCIIFoldingFilter와 같은 필터를 제공하여 텍스트 데이터를 정규화할 수 있습니다.

import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class NormalizationExample {
    public static void main(String[] args) {
        String text = "Lucene은 Apache 재단에서 개발된 오픈소스 검색 엔진 라이브러리입니다.";
        
        try (StandardAnalyzer analyzer = new StandardAnalyzer()) {
            TokenStream tokenStream = analyzer.tokenStream(null, text);
            tokenStream = new LowerCaseFilter(Version.LUCENE_8_8_2, tokenStream);
            tokenStream.reset();
            
            CharTermAttribute term = tokenStream.getAttribute(CharTermAttribute.class);
            
            while (tokenStream.incrementToken()) {
                System.out.println(term.toString());
            }
            
            tokenStream.end();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

3. 불용어 제거(Stop Word Removal)

불용어란 검색 시 무시해야 할 흔한 단어들을 의미합니다. 예를 들어 ‘a’, ‘the’, ‘is’ 등이 불용어에 해당합니다. 아파치 루신은 StopFilter와 같은 필터를 제공하여 텍스트 데이터에서 불용어를 제거할 수 있습니다.

import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class StopWordRemovalExample {
    public static void main(String[] args) {
        String text = "Apache Lucene은 검색 엔진 개발에 많이 사용되는 도구입니다.";
        
        try (StandardAnalyzer analyzer = new StandardAnalyzer()) {
            TokenStream tokenStream = analyzer.tokenStream(null, text);
            tokenStream = new StopFilter(Version.LUCENE_8_8_2, tokenStream, StandardAnalyzer.ENGLISH_STOP_WORDS_SET);
            tokenStream.reset();
            
            CharTermAttribute term = tokenStream.getAttribute(CharTermAttribute.class);
            
            while (tokenStream.incrementToken()) {
                System.out.println(term.toString());
            }
            
            tokenStream.end();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

위에서는 아파치 루신을 사용하여 텍스트 데이터의 전처리 기법을 설명하였습니다. 이 외에도 다양한 전처리 기법들을 적용하여 효율적인 검색을 위한 텍스트 데이터를 구축할 수 있습니다.

1. 토큰화(Tokenization)

2. 정규화(Normalization)

3. 불용어 제거(Stop Word Removal)

참고 자료