[Python] Generating Image Captions with Keras

In this blog post, we will explore how to use Keras, a popular deep learning library, to generate image captions with Python. Generating captions for images is an important task in computer vision and natural language processing, as it combines the understanding of both visual and textual information.

Installing Dependencies

First, let’s ensure we have all the necessary dependencies installed. We need Keras, TensorFlow, Pillow for image loading, and scikit-learn for splitting the dataset.

!pip install keras tensorflow pillow scikit-learn

Dataset Preparation

To train our image captioning model, we need a dataset of images paired with their corresponding captions. There are various datasets available for this task, such as Microsoft COCO, Flickr8k, and Flickr30k. For this example, let’s use the Flickr8k dataset.

  1. Download the Flickr8k dataset (the image archive and the caption text file are available from several public mirrors).
  2. Extract the downloaded file and place it in the desired location.
  3. In the extracted folder, you will find the image files and a text file containing the image captions; the captions can be loaded as sketched below.
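
Assuming the standard Flickr8k.token.txt layout, where each line contains an image name, a caption index, a tab, and the caption text, the captions can be read into a dictionary like this (the function name is illustrative):

def load_captions(captions_path):
    # Map each image filename to its list of captions
    # (assumes one "image.jpg#index<TAB>caption" entry per line)
    captions = {}
    with open(captions_path, 'r') as f:
        for line in f:
            line = line.strip()
            if '\t' not in line:
                continue
            image_id, caption = line.split('\t', 1)
            image_name = image_id.split('#')[0]
            captions.setdefault(image_name, []).append(caption)
    return captions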

Preprocessing the Data

Before feeding the images and captions into our model, we need to preprocess the data.

1. Extracting Image Features

To extract features from the images, we will use a pre-trained image classification model, such as VGG16 or ResNet50, that has been trained on a large dataset like ImageNet. This lets us leverage learned image representations instead of training a feature extractor from scratch. Here we take the 4096-dimensional output of VGG16’s second fully connected layer (fc2) as the image representation, which matches the input size of the captioning model built later.

from keras.preprocessing.image import load_img, img_to_array
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.models import Model

# Load the pre-trained VGG16 model once and keep the output of its second
# fully connected layer (fc2, 4096 dimensions) as the feature extractor
vgg = VGG16(weights='imagenet')
feature_extractor = Model(inputs=vgg.inputs, outputs=vgg.layers[-2].output)

def extract_image_features(image_path):
    # Load and preprocess the image
    image = load_img(image_path, target_size=(224, 224))
    image = img_to_array(image)
    image = preprocess_input(image)

    # Extract a 4096-dimensional feature vector from the image
    features = feature_extractor.predict(image.reshape(1, 224, 224, 3), verbose=0)

    return features.reshape(-1)
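
For instance, the features for every image in the dataset folder can be collected into a dictionary keyed by filename (the directory path below is just a placeholder for wherever you extracted the images):

import os

# Collect a feature vector for each image in the dataset folder
image_dir = 'Flickr8k/images'  # placeholder path
image_features = {}
for filename in os.listdir(image_dir):
    if filename.endswith('.jpg'):
        image_features[filename] = extract_image_features(os.path.join(image_dir, filename))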

2. Preparing Caption Sequences

We need to tokenize the captions and map each token to a unique integer value. Additionally, we will add special tokens like <start> and <end> to denote the beginning and end of each caption sequence.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

def prepare_caption_sequences(captions):
    # Wrap each caption with <start> and <end> markers so the tokenizer
    # learns indices for them as well
    captions = ['<start> ' + caption + ' <end>' for caption in captions]

    # Remove '<' and '>' from the default filters so the markers survive tokenization
    tokenizer = Tokenizer(oov_token='<unk>',
                          filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n')
    tokenizer.fit_on_texts(captions)

    # Convert text to integer sequences (each now starts with <start> and ends with <end>)
    sequences = tokenizer.texts_to_sequences(captions)

    # Pad sequences to a fixed length
    sequences = pad_sequences(sequences, padding='post')

    return sequences, tokenizer
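
The model built in the next section needs the vocabulary size and the maximum caption length. Assuming captions is the list of raw caption strings, both can be derived from the values returned above:

sequences, tokenizer = prepare_caption_sequences(captions)

# Vocabulary size (+1 because Keras reserves index 0 for padding)
vocab_size = len(tokenizer.word_index) + 1

# Length of the longest (padded) caption sequence
max_length = sequences.shape[1]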

Building the Model

Now, let’s build the model for generating image captions. It follows a simple merge architecture: a dense embedding of the image features is concatenated with an LSTM encoding of the caption seen so far, and the combined vector is used to predict the next word.

from keras.models import Model
from keras.layers import Input, LSTM, Embedding, Dense, concatenate

def build_model(vocab_size, max_length):
    # Image feature input (4096-dimensional VGG16 fc2 features)
    image_input = Input(shape=(4096,))
    
    # Image feature embedding
    image_embedding = Dense(256, activation='relu')(image_input)
    
    # Caption sequence input
    caption_input = Input(shape=(max_length,))
    
    # Caption word embedding (mask_zero ignores the padding index 0)
    caption_embedding = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
    
    # Caption LSTM
    caption_lstm = LSTM(256)(caption_embedding)
    
    # Concatenate image and caption features
    merged = concatenate([image_embedding, caption_lstm])
    
    # Predict the next word of the caption
    output = Dense(vocab_size, activation='softmax')(merged)
    
    # Model
    model = Model(inputs=[image_input, caption_input], outputs=output)
    
    return model
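
Assuming vocab_size and max_length were computed during preprocessing, the model can be created and inspected:

model = build_model(vocab_size, max_length)
model.summary()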

Training the Model

To train the model, we need to compile it with appropriate loss and optimization functions, as well as provide the training images and captions.

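Because the model predicts a single next word, each padded caption is first expanded into (image features, partial caption) -> next-word pairs, where image_features[i] is the feature vector of the image that sequences[i] describes. The helper below is one possible sketch; its name and exact layout are illustrative:

import numpy as np
from keras.preprocessing.sequence import pad_sequences

def create_training_pairs(image_features, sequences, max_length):
    # Expand every caption into (image features, partial caption) -> next-word pairs
    X_img, X_seq, y = [], [], []
    for features, sequence in zip(image_features, sequences):
        for i in range(1, len(sequence)):
            if sequence[i] == 0:  # stop once we reach the padding
                break
            partial = pad_sequences([sequence[:i]], maxlen=max_length, padding='post')[0]
            X_img.append(features)
            X_seq.append(partial)
            y.append(sequence[i])
    return np.array(X_img), np.array(X_seq), np.array(y)

In practice this expansion is usually done inside a data generator to keep memory usage manageable. The training function then builds the pairs, splits them, and fits the model:
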
from sklearn.model_selection import train_test_split

def train_model(model, image_features, sequences, max_length, epochs=10, batch_size=64):
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

    # Expand the captions into next-word pairs and split them into
    # training and validation sets
    X_img, X_seq, y = create_training_pairs(image_features, sequences, max_length)
    (img_train, img_val, seq_train, seq_val,
     y_train, y_val) = train_test_split(X_img, X_seq, y, test_size=0.2)

    # Train the model to predict the next word of each partial caption
    model.fit([img_train, seq_train], y_train,
              validation_data=([img_val, seq_val], y_val),
              epochs=epochs,
              batch_size=batch_size)

    return model

Generating Captions

Once the model is trained, we can generate captions for new images.

import numpy as np
from keras.preprocessing.sequence import pad_sequences

def generate_caption(model, image_path, tokenizer, max_length):
    # Extract features from the image
    image_features = extract_image_features(image_path)

    # Start with the <start> token
    caption = [tokenizer.word_index['<start>']]

    for _ in range(max_length - 1):
        # Predict the next word from the image and the caption so far
        caption_input = pad_sequences([caption], maxlen=max_length, padding='post')
        preds = model.predict([image_features.reshape(1, -1), caption_input], verbose=0)[0]
        predicted_id = np.argmax(preds)
        
        # Stop if the end token is predicted
        if predicted_id == tokenizer.word_index['<end>']:
            break

        # Append the predicted word to the caption
        caption.append(predicted_id)

    # Convert the predicted token ids (excluding <start>) back to text
    predicted_caption = ' '.join(tokenizer.index_word[i] for i in caption[1:])

    return predicted_caption
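
For example, with the trained model and the tokenizer and max_length from the preprocessing step (the image path below is just a placeholder):

caption = generate_caption(model, 'example.jpg', tokenizer, max_length)
print(caption)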

Conclusion

In this blog post, we have covered the process of generating image captions using Keras in Python. We started with installing the necessary dependencies, preparing the dataset, preprocessing the images and captions, building the model, training it, and finally generating captions for new images. Image captioning is a fascinating field with practical applications in image understanding, text generation, and multimedia retrieval. If you want to dive deeper, consider exploring more advanced architectures and datasets to further improve the quality of the generated captions.

Happy coding!