Chardet: The Universal Character Encoding Detector


Detects


ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP (Japanese)
EUC-KR, ISO-2022-KR (Korean)
KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
ISO-8859-5, windows-1251 (Bulgarian)
ISO-8859-1, windows-1252 (Western European languages)
ISO-8859-7, windows-1253 (Greek)
ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
TIS-620 (Thai)


Note
Our ISO-8859-2 and windows-1250 (Hungarian) probers have been temporarily
disabled until we can retrain the models.
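For the Unicode entries in the list above, a byte order mark at the start of the data is often enough to identify the encoding. The following is a minimal standard-library sketch of that one idea and is not chardet's implementation. The function name sniff_bom and the labels it returns are hypothetical, and only the two common UTF-32 byte orders are covered.

```python
import codecs

# Longer marks come first: the UTF-32 LE mark begins with the UTF-16 LE mark,
# so checking UTF-16 first would misreport UTF-32 LE data.
_BOMS = (
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8-SIG"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
)


def sniff_bom(data):
    """Return an illustrative label if data starts with a known BOM, else None."""
    for mark, name in _BOMS:
        if data.startswith(mark):
            return name
    return None
```

Data without a byte order mark yields None here, which is where statistical detection is still needed.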

Requires Python 2.6, 2.7, or 3.3+.

Installation
Install from PyPI:
pip install chardet


Documentation
User documentation is available at https://chardet.readthedocs.io/.

Command-line Tool
chardet comes with a command-line script which reports on the encodings of one
or more files:
% chardetect somefile someotherfile
somefile: windows-1252 with confidence 0.5
someotherfile: ascii with confidence 1.0
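A similar report can be sketched from Python with chardet.detect, which takes a bytes object and returns a dictionary that includes 'encoding' and 'confidence' keys. The helper names report_encoding and report_files and the detect parameter are hypothetical, added so the formatting can be tried with a stand-in detector. This is a rough sketch, not the code behind chardetect.

```python
def report_encoding(data, detect=None):
    """Format one detection result in the style of the chardetect output."""
    if detect is None:
        # Imported lazily so a stand-in detector can be passed instead.
        import chardet
        detect = chardet.detect
    result = detect(data)
    return "{0} with confidence {1}".format(
        result["encoding"], result["confidence"])


def report_files(paths, detect=None):
    """Print one report line per file, reading each file as raw bytes."""
    for path in paths:
        with open(path, "rb") as handle:
            print("{0}: {1}".format(
                path, report_encoding(handle.read(), detect)))
```

With chardet installed, report_files(["somefile", "someotherfile"]) prints one line per file. Files are opened in binary mode because detection works on bytes rather than on decoded text.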


About
This is a continuation of Mark Pilgrim's excellent chardet. Previously, two
versions needed to be maintained: one that supported Python 2.x and one that
supported Python 3.x. We've recently merged with Ian Cordasco's charade fork,
so now we have one coherent version that works for Python 2.6+.


Maintainer: Dan Blanchard