[파이썬][scikit-learn] scikit-learn에서 CountVectorizer

05 Sep 2023

scikit learn

The scikit-learn library is a powerful tool for machine learning in Python. One of the useful features it offers is the CountVectorizer class, which is used for text analysis and feature extraction.

What is CountVectorizer?

CountVectorizer is a feature extraction method that converts a collection of text documents into a matrix of word counts. It helps us represent text data in a format that can be used by machine learning algorithms. Each row in the matrix represents a document, and each column represents a unique word in the corpus.

How to Use CountVectorizer

Here is an example of how to use CountVectorizer in scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer

# Create an instance of CountVectorizer
vectorizer = CountVectorizer()

# Input data - a list of sentences
data = [
    "This is the first sentence.",
    "This sentence is the second one.",
    "And this is the third sentence."
]

# Convert the data into a matrix of word counts
X = vectorizer.fit_transform(data)

# Print the feature names
print("Feature names:", vectorizer.get_feature_names())

# Print the matrix of word counts
print("Matrix of word counts:")
print(X.toarray())

In this example, we import the CountVectorizer class from the sklearn.feature_extraction.text module. We then create an instance of the CountVectorizer class.

Next, we define our input data, which is a list of sentences. We use the fit_transform method of the CountVectorizer instance to convert the input data into a matrix of word counts.

We can access the feature names, which are the unique words in the corpus, using the get_feature_names method of the CountVectorizer instance. We can also access the matrix of word counts using the toarray method of the X variable.

Conclusion

The CountVectorizer class in scikit-learn is a powerful tool for feature extraction from text data. It allows us to convert a collection of text documents into a matrix of word counts, which can be used by machine learning algorithms. This makes it easier to work with text data and create models that can understand and process textual information efficiently.