Causes and remedies for UnicodeDecodeError [Python]
When I tried to read a CSV dataset with pandas, I immediately ran into the error "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x91 in position 1: invalid start byte". This is a note on the cause and the fix.
Error details
I read the CSV file with pd.read_csv, and then ...
import pandas as pd
df = pd.read_csv('data.csv')
>> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x91 in position 1: invalid start byte
I got a Unicode decoding error. Decoding means converting character codes (bytes) back into characters.
Conversely, converting characters into character codes is called encoding.
Cause
Inside a computer, characters are not stored as they appear. They are handled as numbers (character codes) assigned to each character in advance.
There are many character codes, differing in which languages they cover and how the numbers are assigned. UTF-8 is one of the encoding schemes for Unicode, which was established as a unified worldwide standard so that text in any language can be represented.
However, UTF-8 doesn't seem to be perfect either (Wikipedia: Unicode problems in Japanese environment).
I don't know which character caused the problem, but the CSV file contained Japanese characters. It seems the file was saved in a Japanese character code rather than UTF-8, and since read_csv decodes as UTF-8 by default, the mismatch caused the Unicode decoding error. For example, the byte 0x91 can begin a two-byte character in Shift_JIS, but UTF-8 never allows it at the start of a character, hence "invalid start byte".
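The mismatch can be reproduced without a file. The sketch below assumes the data was written as Shift_JIS, which the original error message does not confirm; the sample text and the helper name try_decode are made up for illustration.

```python
# Assumption: the bytes were written in Shift_JIS; the sample text is hypothetical.
raw = "日本語".encode("shift_jis")


def try_decode(data, encoding):
    """Return the decoded text, or a short failure message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"failed: {exc.reason}"


print(try_decode(raw, "utf-8"))      # wrong codec for these bytes
print(try_decode(raw, "shift_jis"))  # codec that matches how the bytes were written
```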
Solution
When reading with read_csv(), use the encoding argument to specify the Japanese character code the file was saved in (shift_jis, etc.).
df = pd.read_csv('data.csv', encoding='shift_jis')
I was able to read it successfully.
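When the encoding of a file is unknown, one option is to try a few candidates in order. The following is a rough sketch using only the standard library; the function name guess_encoding and the candidate list are assumptions for illustration, not part of pandas. Note that a successful decode does not prove the guess is right, since some byte sequences are valid in more than one Japanese encoding.

```python
from pathlib import Path

# Candidate encodings tried in order; this list is an assumption, adjust as needed.
CANDIDATE_ENCODINGS = ("utf-8", "cp932", "euc_jp")


def guess_encoding(path, candidates=CANDIDATE_ENCODINGS):
    """Return the first candidate encoding that decodes the whole file without error."""
    raw = Path(path).read_bytes()  # reads the entire file into memory
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this codec rejects the bytes, try the next one
        return name
    raise ValueError(f"none of {candidates} could decode {path}")
```

The returned name can then be passed on, for example pd.read_csv('data.csv', encoding=guess_encoding('data.csv')).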
Other Japanese-compatible encodings
By the way, there are several character codes that can represent Japanese, and Python ships with the following as standard. (Reference: List of Python standard encodings)
- cp932
- euc_jp, euc_jis_2004, euc_jisx0213
- iso2022_jp, iso2022_jp_1, iso2022_jp_2, iso2022_jp_3
- iso2022_jp_2004, iso2022_jp_2ext
- shift_jis, shift_jis_2004, shift_jisx0213
That said, my impression is that most files can be read with shift_jis or cp932, which are the most common choices.
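One concrete reason to prefer cp932 when in doubt: it is Microsoft's extension of Shift_JIS and covers some characters that Python's stricter shift_jis codec rejects. The sketch below uses the circled digit ① as a sample; the helper name can_decode is made up for illustration.

```python
def can_decode(data, encoding):
    """Return True if the bytes decode under the given codec without error."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# The circled digit is a vendor extension character often found in Windows-made files.
raw = "①".encode("cp932")

for name in ("shift_jis", "cp932"):
    print(name, can_decode(raw, name))
```

If shift_jis fails on a file that was produced on Windows, retrying with cp932 is a reasonable next step.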
Comment list
cp949 is Korean, isn't it?
Please use cp932.
Quote
cp932 | 932, ms932, mskanji, ms-kanji | Japanese
cp949 | 949, ms949, uhc | Korean
Thank you for your advice. It seems I had mistakenly written cp949 in the list. I have corrected it.
Thanks! Solved!
It was helpful