[파이썬] 문자열의 유니코드 문자 정규화

01 Sep 2023

python

Unicode is a character encoding standard that provides a unique number for every character, regardless of the platform, application, or language. In Python, dealing with Unicode characters is made easy by the built-in support for Unicode strings. However, there are cases where Unicode characters might be represented in different forms, leading to potential issues in string comparisons or manipulations. This is where Unicode normalization comes into play.

Unicode normalization is the process of transforming Unicode characters into a canonical form, so that different representations of the same character will be considered equal. In Python, we can use the unicodedata module to normalize Unicode strings. There are four normalization forms defined: NFD, NFC, NFKD, and NFKC.

Let’s explore some examples of Unicode normalization in Python:

Example 1: Normalizing to NFC form

import unicodedata

string = "café"
normalized_string = unicodedata.normalize('NFC', string)
print(normalized_string)

Output:

café

In this example, we have a string with the character “é” represented as two Unicode code points: “e” (U+0065) and combining acute accent (U+0301). By normalizing the string to NFC form, the combining acute accent is combined with the base character “e” to form a single code point (U+00E9), resulting in “café”.

Example 2: Normalizing to NFD form

import unicodedata

string = "café"
normalized_string = unicodedata.normalize('NFD', string)
print(normalized_string)

Output:

café

In this example, we normalize the same string to NFD form. The character “é” is decomposed into two separate code points: “e” (U+0065) and combining acute accent (U+0301), resulting in “café”.

Example 3: Comparing normalized strings

import unicodedata

string1 = "café"
string2 = "café"

normalized_string1 = unicodedata.normalize('NFC', string1)
normalized_string2 = unicodedata.normalize('NFC', string2)

print(normalized_string1 == normalized_string2)

Output:

True

In this example, we compare two strings that are visually different, but their normalized forms are equal. The comparison returns True because the strings are considered equivalent in NFC form.

Unicode normalization is an essential process when working with Unicode strings, particularly when handling user inputs, internationalization, or text processing. By normalizing Unicode strings, you can ensure consistent and accurate processing of characters across various platforms and contexts.

Unicode normalization is just one aspect of dealing with Unicode in Python. To learn more about Unicode support in Python, check out the official Python Unicode HOWTO documentation.