[파이썬] Keras 텍스트 분류 작업

Introduction

Text classification is a common task in natural language processing (NLP) that involves categorizing a piece of text into predefined classes or categories. Keras is a popular deep learning library in Python that provides a high-level interface for building and training deep neural networks. In this blog post, we will walk through the steps of performing text classification using Keras in Python.

Preprocessing the text data

Before we can train a text classification model, we need to preprocess the text data. This involves steps such as tokenization, converting words to numerical representation, and creating training and testing datasets. Let’s see an example:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

text_data = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third document.",
    "Is this the first document?"
]

# Tokenize the text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text_data)
word_index = tokenizer.word_index

# Create sequences of tokens
sequences = tokenizer.texts_to_sequences(text_data)

# Pad sequences to ensure equal length
max_sequence_length = max(len(seq) for seq in sequences)
padded_sequences = pad_sequences(sequences, maxlen=max_sequence_length)

In the above code, we first initialize a Tokenizer object to convert the text into tokens. We then fit the tokenizer on the text data and get the word index mapping. Next, we convert the text data into sequences of tokens using the texts_to_sequences method. Finally, we pad the sequences to ensure equal length using the pad_sequences method.

Building the text classification model

Once the text data is preprocessed, we can build the text classification model using Keras. In this example, we will use a simple feedforward neural network with an embedding layer. Let’s take a look at the code:

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

vocab_size = len(word_index) + 1  # Add 1 for padding token

model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=max_sequence_length))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

model.summary()

In the above code, we create a Sequential model and add layers to it. The first layer is an embedding layer that takes the vocabulary size and the sequence length as input. This layer converts the integer sequences into dense vectors of fixed size. We then add a flatten layer to convert the 2D tensor into a 1D tensor. Next, we add a dense layer with 64 units and a ReLU activation function. Finally, we add a dense layer with a single unit and a sigmoid activation function, which is suitable for binary classification tasks. We compile the model with the Adam optimizer and binary cross-entropy loss.

Training and evaluating the model

Once the model is built, we can train it on our text data and evaluate its performance. Here’s an example:

labels = [1, 0, 0, 1]  # Binary class labels

model.fit(padded_sequences, labels, epochs=10, batch_size=1, verbose=1)

# Evaluate the model
test_data = [
    "This is a new document for testing.",
    "Is this the second document?"
]
test_sequences = tokenizer.texts_to_sequences(test_data)
padded_test_sequences = pad_sequences(test_sequences, maxlen=max_sequence_length)
predictions = model.predict(padded_test_sequences)

for i, sentence in enumerate(test_data):
    label = "Positive" if predictions[i] > 0.5 else "Negative"
    print(f"Sentence: {sentence} - Label: {label}")

In the above code, we define the binary class labels for the training data. We then use the fit method to train the model on the padded sequences and labels. After training, we can use the model to make predictions on new data by converting it into sequences and padding them. Finally, we print the predicted label for each sentence in the test data.

Conclusion

In this blog post, we have learned how to perform text classification using Keras in Python. We covered the preprocessing steps for text data, building a text classification model, and training and evaluating the model. Keras provides a convenient and efficient way to implement deep learning models for text classification tasks.