파이썬으로 PyLucene을 활용한 컨텍스트 기반 검색 기능 구현하기

18 Oct 2023

pylucene

이번 글에서는 파이썬 언어를 사용하여 PyLucene 라이브러리를 활용하여 컨텍스트 기반 검색 기능을 구현하는 방법에 대해 알아보겠습니다. PyLucene은 Apache Lucene 검색 엔진의 파이썬 바인딩으로, 풍부한 검색 기능을 제공합니다.

1. PyLucene 설치하기

먼저, PyLucene을 설치해야 합니다. PyLucene은 Java와 파이썬 간의 상호 작용을 제공하기 때문에, Java 개발 환경이 설치되어 있어야 합니다. 아래의 명령어를 사용하여 PyLucene을 설치할 수 있습니다.

pip install PyLucene

2. 인덱싱하기

PyLucene을 사용하여 검색을 수행하기 전에, 먼저 인덱싱을 해야 합니다. 인덱싱은 검색 대상이 될 문서들을 색인화하여 검색에 용이하게 만드는 과정입니다. 아래는 간단한 예제 코드입니다.

import lucene

from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, FieldType
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.store import SimpleFSDirectory
from org.apache.lucene.util import Version
from java.nio.file import Paths

# initialize the Java VM
lucene.initVM()

index_dir = "/path/to/index"  # 인덱스 파일 저장 경로
doc_dir = "/path/to/docs"  # 검색 대상 문서 디렉토리

analyzer = StandardAnalyzer()

config = IndexWriterConfig(analyzer)
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE)

index_writer = IndexWriter(SimpleFSDirectory(Paths.get(index_dir)), config)

# 문서 디렉토리에서 문서들을 가져와서 인덱싱
for filename in os.listdir(doc_dir):
    with open(os.path.join(doc_dir, filename), "r") as file:
        contents = file.read()
        
        doc = Document()
        doc.add(Field("contents", contents, fieldType))
        doc.add(Field("filename", filename, fieldType))
        
        index_writer.addDocument(doc)

index_writer.close()

위의 코드는 주어진 문서 디렉토리의 내용을 읽어들여 contents 필드에 저장하고, 각 문서의 파일 이름을 filename 필드에 저장하여 인덱스를 생성합니다.

3. 검색하기

인덱싱이 완료되었다면, 이제 검색을 수행할 수 있습니다. 아래는 간단한 예제 코드입니다.

import lucene

from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.index import IndexReader
from org.apache.lucene.queryparser.classic import QueryParser
from org.apache.lucene.search import IndexSearcher
from org.apache.lucene.store import SimpleFSDirectory
from org.apache.lucene.util import Version
from java.nio.file import Paths

# initialize the Java VM
lucene.initVM()

query_str = "[your_query_string]"  # 검색 질의
index_dir = "/path/to/index"  # 인덱스 파일 경로

analyzer = StandardAnalyzer()

query_parser = QueryParser("contents", analyzer)
query = query_parser.parse(query_str)

index_reader = IndexReader.open(SimpleFSDirectory(Paths.get(index_dir)))
index_searcher = IndexSearcher(index_reader)

top_docs = index_searcher.search(query, 10)  # 상위 10개 결과 반환

for score_doc in top_docs.scoreDocs:
    doc = index_searcher.doc(score_doc.doc)
    print(f"Filename: {doc.get('filename')}, Score: {score_doc.score}")

index_reader.close()

위의 코드는 주어진 검색 질의에 대해 인덱스에서 검색을 수행하여 상위 10개의 결과를 출력하는 예제입니다. filename 필드에서 파일명을 가져오고, score를 출력합니다.

결론

이번 글에서는 파이썬에서 PyLucene을 사용하여 컨텍스트 기반 검색 기능을 구현하는 방법에 대해 알아보았습니다. PyLucene은 파이썬으로 Lucene 검색 엔진을 활용할 수 있도록 도와주는 강력한 도구입니다. 다양한 검색 기능을 구현할 수 있으며, 효율적인 검색을 제공합니다.

더 자세한 내용은 PyLucene 공식 문서를 참조하시기 바랍니다.

#PyLucene #파이썬 #검색 #인덱싱