How To Convert Python Unicode Characters To String

This tutorial shows how to convert Python Unicode characters to a string, and in particular how to translate a Unicode string into an ASCII string. The goal is either to remove non-ASCII characters or to replace Unicode characters with their closest ASCII equivalents.

Unicode Characters in Python

Unicode is the universal character standard that assigns a unique number, called a code point, to characters from virtually every writing system. ASCII defines only 128 characters, each of which fits in a single byte, whereas Unicode encodings such as UTF-8 use up to four bytes per character, which leaves room for the characters of any language.
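As a rough illustration of the byte counts described above, the following sketch prints the code point of a few sample characters and the number of bytes each one needs in UTF-8. The sample characters are arbitrary choices for this example and are written as escape sequences so the snippet does not depend on the editor's encoding.

```python
# Each character has a code point (an integer); UTF-8 stores it in 1 to 4 bytes.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]  # a, e-acute, euro sign, an emoji

# Number of UTF-8 bytes needed for each sample character.
byte_counts = [len(ch.encode("utf-8")) for ch in samples]

for ch, count in zip(samples, byte_counts):
    print(f"U+{ord(ch):04X} takes {count} byte(s) in UTF-8")
```

Only the first sample is an ASCII character, and it is the only one that fits in a single byte.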

A Unicode string can be encoded to bytes in any encoding you like. Each character in a Python string is an abstract code point, an integer large enough to identify any Unicode character. In Python 3, str is already a Unicode type and the u prefix is optional, so if the string contains only ASCII characters, str() simply returns the same text.

Code Example :

string = u"python"

# str() method
data = str(string)

print(data)

Output :

python
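To check whether a string really contains only ASCII characters before relying on that, a minimal sketch is shown here. It assumes Python 3.7 or later, where str.isascii() is available; the variable names are illustrative only.

```python
plain = u"python"        # the u prefix is optional in Python 3
accented = "caf\u00e9"   # "cafe" with an acute accent on the last letter

# Both values are ordinary str objects.
print(type(plain) is str, type(accented) is str)

# isascii() reports whether every character is in the ASCII range.
print(plain.isascii())     # True
print(accented.isascii())  # False
```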

encode() and decode() Methods in Python

If you have a Unicode string and need to write it to a file or another serialised form, you must first encode it into bytes. There are several popular Unicode encodings, such as UTF-16 (which uses two bytes for most characters) and UTF-8 (which uses one to four bytes per character).

You can use the following code to convert a string to a specific encoding.

Code Example :

string = u"python"

# encode() method
data = string.encode('UTF-16')

print(data)

Output :

b'\xff\xfep\x00y\x00t\x00h\x00o\x00n\x00'

The output above is a bytes object, not a string.
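To make the difference between encodings more concrete, this sketch compares the encoded length of the same text in UTF-8 and UTF-16. It relies on the fact that Python's "utf-16" codec prepends a two-byte byte order mark (BOM) in the machine's native byte order, while "utf-16-le" does not.

```python
import codecs

text = "python"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16")        # BOM + 2 bytes per character
utf16_le = text.encode("utf-16-le")  # explicit byte order, no BOM

print(len(utf8))      # 6
print(len(utf16))     # 14
print(len(utf16_le))  # 12

# The first two bytes of the "utf-16" result are the native-order BOM.
print(utf16[:2] == codecs.BOM_UTF16)  # True
```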

Use the decode() method to convert bytes back to a string in Python.

Code Example :

string = b'\xff\xfep\x00y\x00t\x00h\x00o\x00n\x00'

# decode() method
data = string.decode('UTF-16')

print(data)

Output :

python
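For the file-writing case mentioned earlier, Python's built-in open() can do the encoding and decoding for you through its encoding parameter. The sketch below writes to a temporary directory; the file name example.txt is a placeholder chosen for this illustration.

```python
import os
import tempfile

text = "Kl\u00fcft"  # contains one non-ASCII character (u with diaeresis)
path = os.path.join(tempfile.mkdtemp(), "example.txt")  # hypothetical file name

# Text mode with an explicit encoding: open() encodes on write.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Binary mode shows the raw bytes that were stored.
with open(path, "rb") as f:
    raw = f.read()

# Text mode with the same encoding decodes on read.
with open(path, encoding="utf-8") as f:
    restored = f.read()

print(raw)               # b'Kl\xc3\xbcft'
print(restored == text)  # True
```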

Convert Python Unicode Characters To String

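Use the unicodedata.normalize() function to prepare a Python Unicode string for conversion to ASCII. The Unicode standard defines several normalisation forms of a Unicode string, based on canonical equivalence and compatibility equivalence.

Canonical equivalence gives two normal forms, and each has a compatibility variant (NFKC and NFKD):

normal form C (NFC), which combines a base character and its marks into a single composed character where possible
normal form D (NFD), which splits a composed character into a base character followed by combining marks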
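The two canonical forms can be seen with a single accented letter. In this sketch the letter is written once as one composed code point and once as a base letter followed by a combining accent; the variable names are illustrative.

```python
import unicodedata

composed = "\u00e9"     # e with acute accent as one code point
decomposed = "e\u0301"  # plain e followed by a combining acute accent

# The two strings look the same when displayed but are not equal.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

# NFC composes, NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed)    # True
print(nfd == decomposed)  # True
```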

unicodedata.normalize() to Convert Unicode to ASCII String in Python

The Python module unicodedata provides access to the Unicode character database, along with utility functions that make accessing, filtering, and looking up these characters easier.

normalize() is a function in unicodedata that accepts two parameters: the name of the normalization form and the string to normalize. There are four normalization forms: NFC, NFKC, NFD, and NFKD. This article uses NFKD, which splits a character such as ü into a base letter and a combining mark, so that the base letter survives when the result is encoded as ASCII.

Syntax:

unicodedata.normalize(form, unicode_string)
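The difference between the canonical forms and the compatibility (K) forms shows up with characters such as ligatures and superscripts. As a small sketch, using two sample characters picked for this illustration:

```python
import unicodedata

ligature = "\ufb01"     # the single-character "fi" ligature
superscript = "\u00b2"  # superscript two

# NFD only applies canonical decompositions, so both stay unchanged.
print(unicodedata.normalize("NFD", ligature) == ligature)        # True
print(unicodedata.normalize("NFD", superscript) == superscript)  # True

# NFKD also applies compatibility decompositions.
print(unicodedata.normalize("NFKD", ligature))     # fi
print(unicodedata.normalize("NFKD", superscript))  # 2
```

This is one reason NFKD is a common choice when the target is plain ASCII: it maps more characters onto ASCII letters and digits than NFD does.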

Code Example :

import unicodedata

unicode_char = u"Klüft inför på fédéral électoral große"

print(unicodedata.normalize('NFKD', unicode_char).encode('ascii', 'ignore'))

Output :

b'Kluft infor pa federal electoral groe'

Because encode() returns a bytes object, the result is printed with a leading b and enclosing single quotes. To get an ordinary Python string without them, call decode() on the result of encode(). Note that ß has no decomposition into an ASCII letter, so the ignore option drops it and große becomes groe.

Code Example :

import unicodedata

unicode_char = u"Klüft inför på fédéral électoral große"

data = unicodedata.normalize('NFKD', unicode_char).encode('ascii', 'ignore')
string = data.decode()
print(string)

Output :

Kluft infor pa federal electoral groe
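The normalize, encode, and decode steps can be wrapped in a small helper. The function name to_ascii and its errors parameter are hypothetical names introduced for this sketch; errors is passed straight through to encode().

```python
import unicodedata

def to_ascii(text, errors="ignore"):
    """Return an ASCII-only version of text (hypothetical helper)."""
    # Split accented characters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encode to ASCII with the chosen error handling, then back to str.
    return decomposed.encode("ascii", errors).decode("ascii")

print(to_ascii("Kl\u00fcft inf\u00f6r p\u00e5"))  # Kluft infor pa
```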

Let's try another example in which 'replace' is passed as the second argument to encode().

Code Example :

import unicodedata

unicode_char = u"áæãåāœčćęßßoße"

data = unicodedata.normalize('NFKD', unicode_char).encode('ascii', 'replace')
string = data.decode()
print(string)

Output :

a??a?a?a??c?c?e???o?e

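The replace argument substitutes every character that cannot be encoded as ASCII with a question mark (?). This includes the combining marks produced by NFKD, which is why each accented letter above is followed by a question mark. If we use ignore on the same string, the output will be: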

import unicodedata

unicode_char = u"áæãåāœčćęßßoße"

data = unicodedata.normalize('NFKD', unicode_char).encode('ascii', 'ignore')
string = data.decode()
print(string)

Output:

aaaacceoe
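Characters such as ß, æ, and œ have no Unicode decomposition, so normalize() cannot turn them into ASCII letters and they are dropped or replaced. One possible workaround, sketched here, is to substitute them before normalizing. The mapping and the function name are assumptions made for this example and are not part of unicodedata; the mapping is deliberately incomplete.

```python
import unicodedata

# Hypothetical, deliberately incomplete mapping chosen for this sketch.
FALLBACKS = str.maketrans({
    "\u00df": "ss",  # sharp s
    "\u00e6": "ae",  # ae ligature
    "\u0153": "oe",  # oe ligature
})

def to_ascii_with_fallbacks(text):
    # Replace characters that NFKD cannot decompose, then proceed as before.
    substituted = text.translate(FALLBACKS)
    decomposed = unicodedata.normalize("NFKD", substituted)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii_with_fallbacks("gro\u00dfe"))  # grosse
print(to_ascii_with_fallbacks("\u0153uvre"))  # oeuvre
```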

Conclusion

To convert Unicode characters to ASCII characters, use the unicodedata module’s normalize() function and the string’s built-in encode() function. Unicode characters that do not have ASCII counterparts can be ignored or replaced. The ignore option removes the character, while the replace option replaces it with question marks.
