The cpdetector project is a small yet clever framework for codepage detection.
It integrates different detection strategies and may be used as a library by third-party software that accesses textual data over a network.
It also includes a best-practice implementation in the form of a command-line tool that sorts and transforms large collections of documents based on their codepage.
Available strategies include: jchardet (exclusion, frequency analysis, and guessing), detection of the HTML charset property, and detection of the XML encoding declaration.
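To illustrate the two declaration-based strategies, here is a minimal sketch that looks for an XML encoding declaration or an HTML meta charset in the first bytes of a document. It uses only the Java standard library and is not cpdetector's own implementation; the class name DeclarationSniffer, the method sniff, and the 1024-byte prefix limit are assumptions made for this example. It also assumes the declaration itself is written in an ASCII-compatible encoding, so UTF-16 documents would not be recognized.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of cpdetector.
final class DeclarationSniffer {
    // Assumption: declarations appear within the first 1024 bytes.
    private static final int PREFIX_LIMIT = 1024;

    // Matches e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
    private static final Pattern XML_DECL = Pattern.compile(
        "<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // Matches e.g. <meta charset="utf-8"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    private static final Pattern HTML_META = Pattern.compile(
        "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z][A-Za-z0-9._:-]*)",
        Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, if a declaration is found.
    static Optional<String> sniff(byte[] data) {
        int len = Math.min(data.length, PREFIX_LIMIT);
        // ISO-8859-1 maps every byte to one char, so decoding never fails.
        String head = new String(data, 0, len, StandardCharsets.ISO_8859_1);

        Matcher xml = XML_DECL.matcher(head);
        if (xml.find()) {
            return Optional.of(xml.group(1));
        }
        Matcher html = HTML_META.matcher(head);
        if (html.find()) {
            return Optional.of(html.group(1));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a/>"
            .getBytes(StandardCharsets.US_ASCII);
        byte[] html = "<html><head><meta charset=\"utf-8\"></head></html>"
            .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(xml).orElse("undetected"));
        System.out.println(sniff(html).orElse("undetected"));
    }
}
```

A declared name is only a claim made by the document, which is one reason to combine this kind of check with a statistical strategy such as jchardet.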
What is a code page?
At first, a textual document is nothing more than a sequence of bits. A computer has to decide how to display this data as characters (which the computer itself identifies by numbers).
A code page, also known as a charset encoding, maps the raw data of a textual document to characters. The original ASCII code page, for example, uses only 7 bits of each octet (byte) to select a character, so it can map only 128 different characters. In the past, memory was expensive, and most computers had only 8-bit registers and buses.
When a mainframe was designed, its makers had to decide which characters it should support. Physicists and mathematicians, for example, needed special characters for equations. As a result, a computer often shipped with its own special code page.
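The effect of choosing a code page can be sketched in a few lines of Java: the same two bytes yield different characters depending on the charset used to decode them. The class name CodePageDemo and the method describe are assumptions made for this example, and the output is given as Unicode code points so that it does not depend on the console's own encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

// Hypothetical demo class for illustration.
final class CodePageDemo {
    // Decodes raw bytes with the given charset and lists the resulting
    // characters as code points, e.g. "U+00C3 U+00A4".
    static String describe(byte[] raw, Charset charset) {
        String text = new String(raw, charset);
        return text.codePoints()
            .mapToObj(cp -> String.format("U+%04X", cp))
            .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Two bytes whose meaning depends on the code page.
        byte[] raw = {(byte) 0xC3, (byte) 0xA4};

        // UTF-8 reads both bytes as one character.
        System.out.println("UTF-8:      " + describe(raw, StandardCharsets.UTF_8));
        // ISO-8859-1 reads each byte as its own character.
        System.out.println("ISO-8859-1: " + describe(raw, StandardCharsets.ISO_8859_1));
        // 7-bit ASCII has no mapping for bytes above 0x7F; Java substitutes
        // the replacement character for each of them.
        System.out.println("US-ASCII:   " + describe(raw, StandardCharsets.US_ASCII));
    }
}
```

Bytes in the 7-bit range, such as 0x41, decode to the same character under all three charsets, which is why documents containing only ASCII characters give a detector nothing to distinguish them by.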
What is new in this release:
- This major bugfix version fixes two issues in command-line batch mode.
- The switch to skip moving undetected documents works again.
- No attempt is made to transcode undetected documents (attempting this previously caused exceptional program flow).
What is new in version 1.0.8:
- This is a stability release that fixes byte order mark detection and an incompatibility with OpenJDK. It now requires Java 1.5.