Causes and remedies for UnicodeDecodeError [Python]

2019/6/5

When I tried to read a CSV dataset with pandas, I immediately hit the error "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x91 in position 1: invalid start byte". This post is a note on the cause and the fix.

Error details

I read the CSV file with pd.read_csv, and then...

import pandas as pd
df = pd.read_csv('data.csv')

>> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x91 in position 1: invalid start byte

I got a Unicode decoding error. Decoding means converting character codes (the raw bytes) into characters.
By the way, the reverse, converting characters into character codes, is called encoding.

Cause

A computer does not store characters as such; it handles them as numbers (character codes) assigned to each character in advance.

There are many character encodings, which differ by language and by how the numbers are assigned. UTF-8 is one of the encoding schemes for Unicode, a worldwide standard designed so that it can be used for any language.
However, UTF-8 doesn't seem to be perfect either (Wikipedia: Unicode problems in Japanese environment).

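I don't know which character triggered the problem, but the csv file contained Japanese characters and was most likely saved in a Japanese character code such as Shift_JIS rather than UTF-8. pd.read_csv decodes files as UTF-8 by default, and in UTF-8 the byte 0x91 can only be a continuation byte, never the first byte of a character. That mismatch is what the "invalid start byte" message is reporting.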
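The mismatch can be reproduced without any file. This is a minimal sketch that assumes the data was stored as Shift_JIS; the sample string is my own, so the failing byte and position differ from the ones in the error above.

```python
text = "日本語"  # sample text, not from the original csv
raw = text.encode("shift_jis")  # bytes as a Shift_JIS file would store them

# Decoding those bytes as UTF-8 fails, just like read_csv did.
try:
    raw.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)

# Decoding with the encoding the bytes were written in works.
print(raw.decode("shift_jis"))
```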

Solution

When reading with read_csv(), pass the file's actual Japanese character code (shift_jis, etc.) through the encoding argument.

df = pd.read_csv('data.csv', encoding='shift_jis')

I was able to read it successfully.
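If the encoding of a file is unknown, one option is to try a few candidates in turn. The helper below is a hypothetical sketch (guess_encoding and its default candidate list are my own names and choices, not part of pandas). A successful decode only means the bytes are valid in that codec, not that the result is the intended text, so treat the answer as a hint.

```python
from typing import Iterable, Optional


def guess_encoding(
    raw: bytes,
    candidates: Iterable[str] = ("utf-8", "cp932", "euc_jp"),
) -> Optional[str]:
    """Return the first candidate codec that decodes raw without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # not valid in this codec, try the next one
        return name
    return None  # none of the candidates worked


# Sample CSV content, encoded the way a Japanese Windows tool might save it.
sample = "名前,年齢\n太郎,30\n".encode("cp932")
print(guess_encoding(sample))
```

For a real file, the bytes could be read with open('data.csv', 'rb').read() and the returned name passed as the encoding argument of pd.read_csv.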

Other Japanese-compatible encodings

By the way, several character codes can represent Japanese, and Python ships with the following as standard codecs. (Reference: List of Python standard encodings)

  • cp932
  • euc_jp, euc_jis_2004, euc_jisx0213
  • iso2022_jp, iso2022_jp_1, iso2022_jp_2, iso2022_jp_3
  • iso2022_jp_2004, iso2022_jp_2ext
  • shift_jis, shift_jis_2004, shift_jisx0213

However, my impression is that most files can be read with shift_jis or cp932, which are the most widely used.
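One reason to try cp932 when shift_jis fails: cp932 is Microsoft's extension of Shift_JIS and includes extra characters, such as circled digits, that Python's stricter shift_jis codec rejects. A minimal sketch using a two-byte sequence I picked as an example:

```python
raw = b"\x87\x40"  # cp932 bytes for the circled digit one (example input)

# Python's shift_jis codec does not map this vendor extension character.
try:
    raw.decode("shift_jis")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# cp932 covers the extension, so the same bytes decode fine.
as_cp932 = raw.decode("cp932")

print(strict_ok)
print(as_cp932)
```

So if a file produced on Japanese Windows still raises UnicodeDecodeError with encoding='shift_jis', switching to encoding='cp932' is worth a try.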