[Python][scikit-learn] k-NN (k-Nearest Neighbors) in scikit-learn

Introduction

In machine learning, k-Nearest Neighbors (k-NN) is a popular algorithm for classification and regression. It belongs to the category of instance-based, or lazy, learning algorithms: no explicit model is built during training. Instead, the training data is simply stored, and predictions are made at query time by consulting the neighboring data points in the feature space.

In this blog post, we will explore how to implement the k-NN algorithm using the scikit-learn library in Python.

What is k-NN?

k-NN is a non-parametric algorithm. For classification, a new data point is assigned the majority class among its k nearest neighbors; for regression, the prediction is typically the average of the neighbors' values. The underlying intuition is that points that are close together in feature space tend to belong to the same class or share similar properties.
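To make the voting idea concrete, here is a minimal from-scratch sketch using NumPy (this is for illustration only, not the scikit-learn implementation; knn_predict and its argument names are made up for this example):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]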

The key steps involved in the k-NN algorithm are:

  1. Load the dataset
  2. Preprocess the dataset
  3. Split the dataset into training and testing sets
  4. Train the k-NN model
  5. Evaluate the model
  6. Predict using the trained model

Let’s dive into the implementation!

Implementation Steps

1. Load the dataset

First, we need to load a dataset on which we will train and test our k-NN model. Scikit-learn provides various datasets that can be easily loaded for testing purposes. For example, we can use the iris dataset, which contains information about different types of iris flowers.

from sklearn import datasets

# Load the iris dataset
iris = datasets.load_iris()

# Prepare the feature matrix X and the target array y
X = iris.data
y = iris.target

2. Preprocess the dataset

Before training the k-NN model, it is essential to preprocess the dataset. Because k-NN relies on distances between points, features with larger numeric ranges can dominate the distance calculation, so the features should be scaled to similar magnitudes. Missing values, if any, should also be handled (a sketch of that follows the scaling code below).

from sklearn.preprocessing import StandardScaler

# Scale the feature matrix X
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
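The iris dataset has no missing values, but if a dataset did, scikit-learn's SimpleImputer could fill them in before scaling. Here X_with_missing is a hypothetical array containing NaN entries:

from sklearn.impute import SimpleImputer

# Hypothetical: replace NaN entries with the mean of their column
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X_with_missing)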

3. Split the dataset into training and testing sets

To evaluate the performance of the trained model, we need to split the dataset into training and testing sets. The training set is used to train the model, while the testing set is used to evaluate its performance on unseen data.

from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
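For classification, it can also help to preserve the class proportions in both splits. One way, as a small variation on the call above, is the stratify argument:

# Optional: keep the class distribution identical in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)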

4. Train the k-NN model

Now, we can train the k-NN model using the training data. We must choose the hyperparameter k, the number of neighbors to consider: small values make the model sensitive to noise, while large values smooth the decision boundary. The optimal value of k is typically found through cross-validation; a sketch of this follows the training code below.

from sklearn.neighbors import KNeighborsClassifier

# Create a k-NN classifier with k=3
knn = KNeighborsClassifier(n_neighbors=3)

# Train the model on the training data
knn.fit(X_train, y_train)
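As a sketch of how k might be chosen via cross-validation (the candidate range 1 to 15 and the 5-fold setup here are illustrative, not prescriptive):

from sklearn.model_selection import cross_val_score

# Evaluate each candidate k by 5-fold cross-validation on the training set
best_k, best_score = 1, 0.0
for k in range(1, 16):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    if scores.mean() > best_score:
        best_k, best_score = k, scores.mean()

print(f"Best k: {best_k} (mean CV accuracy: {best_score:.2f})")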

5. Evaluate the model

After training the model, we need to evaluate its performance on the testing set. Common evaluation metrics for classification models include accuracy, precision, recall, and F1-score.

from sklearn.metrics import accuracy_score

# Predict the target values for the test set
y_pred = knn.predict(X_test)

# Compute the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

6. Predict using the trained model

Once the model is trained and evaluated, we can use it to make predictions on new, unseen data points. Because the model was trained on scaled features, the same fitted scaler must be applied to any new data before prediction.

# A new measurement: sepal length, sepal width, petal length, petal width
new_data = [[5.1, 3.5, 1.4, 0.2]]

# Apply the scaling that was fitted on the training features
new_data_scaled = scaler.transform(new_data)
predicted_class = knn.predict(new_data_scaled)

print(f"Predicted class: {predicted_class}")

Conclusion

k-Nearest Neighbors is a straightforward yet effective algorithm for classification tasks. Implementing it with the scikit-learn library lets us rely on a well-tested, optimized implementation rather than writing the algorithm from scratch. In this blog post, we covered the steps required to preprocess data, train, evaluate, and predict with k-NN. Experiment with different datasets, distance metrics, and values of k to explore the algorithm further and improve its predictions.