[python] 파이썬을 활용한 스팸 메일 필터링 알고리즘

11 Dec 2023

python

소개
데이터 전처리
특성 추출
모델 구축
성능 평가
마무리

1. 소개

스팸 메일은 많은 이용자들에게 불편함을 초래하고, 비정상적인 링크나 파일을 통해 보안에 문제를 일으키기도 합니다. 이러한 문제를 해결하기 위해 파이썬을 활용한 스팸 메일 필터링 알고리즘을 개발할 수 있습니다.

2. 데이터 전처리

데이터 전처리는 메일의 내용을 크게 영향을 미치는 부분입니다. 특수문자, 불용어, HTML 태그 등을 제거하고, 텍스트를 소문자로 변환하여 일관된 데이터를 확보합니다.

import re
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords

def clean_text(text):
    text = re.sub(r'<.*?>', '', text)  # Remove HTML tags
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # Remove special characters
    text = text.lower()  # Convert to lowercase
    text = ' '.join([word for word in text.split() if word not in set(stopwords.words('english'))])  # Remove stopwords
    return text

3. 특성 추출

이메일 텍스트로부터 유의미한 정보를 추출하여 모델이 학습할 수 있도록 변환해야 합니다. 주로 TF-IDF 변환을 통해 특성을 추출할 수 있습니다.

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf_vect.fit_transform(X_train)
X_test_tfidf = tfidf_vect.transform(X_test)

4. 모델 구축

로지스틱 회귀, 나이브 베이즈, SVM 등 여러 머신러닝 모델을 사용하여 스팸 메일을 분류하는 모델을 구축할 수 있습니다.

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

5. 성능 평가

모델의 성능을 측정하여 최적의 모델을 선택할 수 있습니다. 정확도, 정밀도, 재현율 등을 고려하여 모델을 선택합니다.

from sklearn.metrics import accuracy_score, classification_report
y_pred = model.predict(X_test_tfidf)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

6. 마무리

파이썬을 활용한 스팸 메일 필터링 알고리즘을 개발하여 효과적으로 스팸 메일을 걸러낼 수 있습니다. 추가적인 기법들을 활용하여 모델의 성능을 향상시킬 수 있습니다.

참고문헌:

https://archive.ics.uci.edu/ml/datasets/Spambase
https://towardsdatascience.com/machine-learning-for-email-spam-filtering-and-priority-inbox-3eb4e6adbab6