Tsung

3 years ago

Categories: Programming

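Detecting a file's encoding in Python: UTF-8 or Big5

When Python 3 opens a text file that is not UTF-8, you need to pass the file's encoding. Python then decodes the contents into its Unicode str type, so the rest of the code can handle the text the same way regardless of the original encoding.

For example (these days, most Big5 files you run into are CSVs produced on Windows):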

import csv

with open(filename, encoding='Big5', newline='') as csvfile:
    rows = csv.reader(csvfile, delimiter=',')
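To see why the encoding argument matters, here is a minimal sketch using only the standard library. The sample string is made up for illustration; it shows that the same text produces different bytes under Big5 and UTF-8, and that decoding Big5 bytes as UTF-8 fails rather than silently working.

```python
# Hypothetical sample text; any common Traditional Chinese string would do.
text = '中文,測試'

big5_bytes = text.encode('big5')   # what a Big5-encoded CSV holds on disk
utf8_bytes = text.encode('utf-8')  # the same text stored as UTF-8

# Decoding with the matching codec recovers the original string.
assert big5_bytes.decode('big5') == text

# Decoding Big5 bytes as UTF-8 raises instead of returning text.
try:
    big5_bytes.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(decoded_ok)
```

This is the same error open() would surface (when the file is read) if a Big5 file were opened with encoding='utf-8'.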

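But when some sources are Big5 and others are UTF-8, the encoding has to be detected first. How?

Detecting a file's encoding in Python: UTF-8 or Big5

Python can use chardet to detect a text encoding. To determine a file's encoding, feed it a small chunk of the file's bytes.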

Installing chardet for Python 3: pip3 install chardet
- CLI: $ chardet filename # or $ chardetect filename (the two are equivalent)
  - filename: UTF-8-SIG with confidence 1.0
- Quick example (in Python 3, detect() takes bytes, not str)
  - import chardet
  - chardet.detect(b'string...') # e.g. {'encoding': 'ascii', 'confidence': 1.0, ...}
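For the narrow case of telling UTF-8 from Big5, a standard-library alternative can be sketched as trial decoding. This is not what chardet does internally, and `guess_encoding` and its candidate list are hypothetical names chosen for illustration. It assumes the whole file content is passed in, because a sample cut in the middle of a multi-byte character would fail strict decoding.

```python
import codecs


def guess_encoding(data, candidates=('utf-8', 'big5')):
    """Return the first codec that decodes `data` strictly, or None.

    Hypothetical helper, not part of chardet. `data` must be bytes.
    """
    # A leading UTF-8 BOM is unambiguous, so check it before anything else.
    if data.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # not valid in this codec, try the next candidate
        return name
    return None


print(guess_encoding('中文'.encode('big5')))
print(guess_encoding('中文'.encode('utf-8')))
```

UTF-8 is tried first because its byte patterns are strict, so Big5 text rarely passes as valid UTF-8. Big5 is the more permissive codec, so treat a 'big5' result as a best guess rather than proof.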

Python 3 chardet example program

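#!/usr/bin/python3
import sys

import chardet

# Detect a file's encoding, e.g. Big5 or UTF-8
def detect_file_encoding(filename):
    with open(filename, 'rb') as rawdata:
        # Only the first 1000 bytes are sampled to keep detection fast
        t = chardet.detect(rawdata.read(1000))
    return t['encoding']  # e.g. 'Big5', 'UTF-8-SIG', 'utf-8'

if __name__ == '__main__':
    # The file to inspect is taken from the first command-line argument
    print(detect_file_encoding(sys.argv[1]))  # Big5, UTF-8-SIG, utf-8 ...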
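One way to put the detected encoding to use is a small CSV helper like the sketch below. `read_csv_rows`, its `detect` parameter, and the `default` fallback are hypothetical names, not part of chardet or the csv module. The detector is passed in as a callable so that the detect_file_encoding function above (or any other detector) can be plugged in.

```python
import csv


def read_csv_rows(path, detect, default='utf-8'):
    """Read all rows of a CSV file whose encoding is chosen by `detect`.

    `detect` is any callable mapping a path to an encoding name or None,
    for example the detect_file_encoding function from the example above.
    """
    # Detectors may return None when they cannot decide, so fall back
    # to an assumed default instead of passing None to open().
    encoding = detect(path) or default
    with open(path, encoding=encoding, newline='') as f:
        return list(csv.reader(f, delimiter=','))
```

A call would then look like read_csv_rows('data.csv', detect_file_encoding), where 'data.csv' stands in for your own file.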

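In this example, Big5 and utf-8 are self-explanatory, but what is UTF-8-SIG?
- UTF-8-SIG: reported when the file starts with a BOM (byte order mark)
- Any of these encoding names can be passed straight to open(), e.g. open(filename, encoding='UTF-8-SIG'), and the file can then be read normally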
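The practical difference between 'utf-8' and 'utf-8-sig' can be seen with a minimal standard-library sketch. The temporary file name and its contents are made up for illustration.

```python
import codecs
import os
import tempfile

# Write a small file that starts with the UTF-8 BOM (bytes EF BB BF).
path = os.path.join(tempfile.mkdtemp(), 'bom_sample.csv')
with open(path, 'wb') as f:
    f.write(codecs.BOM_UTF8 + 'id,name\n'.encode('utf-8'))

# Plain 'utf-8' keeps the BOM as a U+FEFF character in the first field.
with open(path, encoding='utf-8') as f:
    with_bom = f.readline()

# 'utf-8-sig' strips the BOM while decoding.
with open(path, encoding='utf-8-sig') as f:
    without_bom = f.readline()

print(repr(with_bom))     # '\ufeffid,name\n'
print(repr(without_bom))  # 'id,name\n'
```

With plain 'utf-8', the stray U+FEFF ends up glued to the first CSV header, which is why the detected UTF-8-SIG name is worth passing through to open() unchanged.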
