IRIZ Character Conversion Tools

by Christian Wittern

Overview

Kanji conversion between JIS and Big5 is a rather complex process, and a plain one-to-one conversion gives poor results. Instead, one needs to determine the aim of the conversion and the needs of the user, and perform the conversion accordingly.
In order to solve this problem we came up with the following strategy:

First, you should prepare your text file for the conversion process.
- For JIS to Big5 conversion, we recommend using our New2Old tool which converts post-war simplified forms into pre-war traditional ones.
- For Big5 to JIS conversion you should first use our NORMALIZ tool which reduces the number of variant forms and thus will increase the JIS matches.
Second, you should decide which level of precision or strictness you need in the resulting file (see below). This depends on the purpose of your conversion.
Third, you run the conversion program on a Macintosh or DOS machine.
Fourth, you look at the resulting file in an editor and get familiar with the replacement codes that we use in our conversion.
Finally, if you converted from Big5 to JIS and would like post-war JIS characters, you can use our Old2New tool.
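The preparation and clean-up steps above amount to a form-for-form substitution before and after the actual code conversion. The following is a minimal sketch of that idea, not the tools' real implementation: the three-entry table and the function names `new2old` and `old2new` are assumptions made for illustration, and the sketch works on decoded Unicode strings rather than on Shift-JIS or Big5 files.

```python
# Toy table (an assumption for illustration only):
# post-war simplified form -> pre-war traditional form.
NEW_TO_OLD = {"国": "國", "学": "學", "剣": "劍"}

# The reverse direction, for restoring post-war forms after a Big5 -> JIS run.
OLD_TO_NEW = {old: new for new, old in NEW_TO_OLD.items()}


def new2old(text: str) -> str:
    """Replace post-war forms with pre-war forms; other characters pass through."""
    return text.translate(str.maketrans(NEW_TO_OLD))


def old2new(text: str) -> str:
    """Replace pre-war forms with post-war forms; other characters pass through."""
    return text.translate(str.maketrans(OLD_TO_NEW))


if __name__ == "__main__":
    prepared = new2old("国学")
    print(prepared)
    print(old2new(prepared))
```

A table like this is only reversible when each post-war form maps to exactly one pre-war form, which is one reason the preparation step and the clean-up step are kept as separate tools.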

Conversion between the most frequently encountered encodings of Chinese characters is no easy task. What is attempted here is only the conversion between the Shift-JIS encoding (frequently encountered in Japan) and the Big5 encoding (originating from Taiwan).

The main problems in this conversion process and my attempts to solve them are outlined here. A proper understanding of these issues is necessary for using the conversion tools successfully.

Here are some of the problems:

The number of characters in Big5 is more than 13000; in JIS, it is only around 6500.
JIS contains pre-war (traditional) and post-war (simplified) forms of characters as well as some variant forms (for example, as many as five different forms for the character meaning 'sword').
Big5 contains only traditional forms and in general no variant forms.
JIS is a national standard issued by an ISO member organization, whereas Big5 originated from an agreement among different vendors in the computer industry, a consortium that has long since dissolved.
The character forms for JIS are very strictly prescribed by the issuing body, whereas today it is impossible to establish 'correct' forms for Big5.
JIS has been modified several times; characters have been added or rearranged. For a detailed account of this, see
Ken Lunde, Understanding Japanese Information Processing, O'Reilly & Associates, Inc., 1993.
The Taiwanese Standards Bureau has made an attempt to correct the mistakes in Big5 by creating its own version, which is known as CNS. This is not just a different encoding, as in the case of the various JIS encodings, but a different codeset.

It should be clear from this that the process of converting between these codes is far from straightforward. To be practicable, the conversion must often perform some kind of remapping. As the degree of the desirable tolerance for this remapping depends on the text and the purpose of the user, we attempt no simple, fixed translation. Rather, we use several different mapping tables.

Grades of Precision

We distinguish three grades of precision for the conversion:

Strict conversion. A one-to-one conversion is performed where possible. Where no counterpart exists (ca. 1000 characters from JIS -> Big5, more than 7000 from Big5 -> JIS), a character reference to CEF or a Unicode placeholder is inserted in the text. This is appropriate where round-trip conversion is required.
Relaxed conversion. Slight differences in the character forms are allowed, as long as the meaning and readings are not changed. This greatly reduces the number of external character references and produces more readable texts. As this will be the most frequently needed version, and there are still a number of questionable decisions involved, not every possible replacement is attempted in the main conversion. You can customize the conversion by defining your own replacement table, or use another set of scripts that is provided for fine-tuning this conversion.
Black hole conversion. Characters are remapped wherever possible; where no conversion can be made, a black circle is inserted in the text, thus creating black holes in the text. You can use this to easily track down non-convertible characters.
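The difference between the three grades can be sketched in a few lines. This is only an illustration under stated assumptions: the sets `EXACT` and `RELAXED` are tiny made-up stand-ins for the real mapping tables, the names `Grade` and `convert` are hypothetical, the text is handled as decoded Unicode rather than as Shift-JIS or Big5 bytes, and the Unicode reference form is used for every unconvertible character.

```python
from enum import Enum


class Grade(Enum):
    STRICT = "strict"
    RELAXED = "relaxed"
    BLACK_HOLE = "black hole"


# Toy data (assumptions): characters taken to have an exact counterpart.
EXACT = {"山", "川"}
# Toy relaxed table (assumption): variant form -> acceptable similar form.
RELAXED = {"剣": "劍"}
BLACK_CIRCLE = "\u25cf"


def convert(text: str, grade: Grade) -> str:
    """Convert text at the given grade using the toy tables above."""
    out = []
    for ch in text:
        if ord(ch) < 0x80 or ch in EXACT:
            # ASCII and exact matches are kept at every grade.
            out.append(ch)
        elif grade is not Grade.STRICT and ch in RELAXED:
            # Relaxed and black hole conversion accept a similar form.
            out.append(RELAXED[ch])
        elif grade is Grade.BLACK_HOLE:
            # Nothing suitable: leave a visible hole.
            out.append(BLACK_CIRCLE)
        else:
            # Nothing suitable: keep the information as a reference.
            out.append(f"&U-{ord(ch):04X};")
    return "".join(out)


if __name__ == "__main__":
    for grade in Grade:
        print(grade.value, convert("山剣鬱", grade))
```

Only the strict and relaxed grades preserve enough information to recover the original character later; the black circle discards it, which is why that grade is best suited to locating problem characters rather than producing a final text.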

Replacement Conventions

The following conventions apply for the replacement characters inserted in the text:

Replacement for Kanji:
Wherever possible, KanjiBase codes are inserted into the document. They refer to the public standard CNS 11643-1992, issued by the National Standards Institute, Taipei, Taiwan.
Where no KanjiBase codes exist, a Unicode reference is inserted. Its form is similar: &U-4E01; refers to the Unicode character at the code point U+4E01.
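As a small illustration of the Unicode reference form, the sketch below builds a reference from a character and expands references back into characters. The function names are hypothetical, and the pattern assumes exactly four uppercase hexadecimal digits, as in the example above.

```python
import re

# Matches references such as &U-4E01; (four uppercase hex digits assumed).
U_REF = re.compile(r"&U-([0-9A-F]{4});")


def to_unicode_ref(ch: str) -> str:
    """Return the &U-XXXX; reference for a single character."""
    return f"&U-{ord(ch):04X};"


def expand_unicode_refs(text: str) -> str:
    """Replace every &U-XXXX; reference with the character it names."""
    return U_REF.sub(lambda m: chr(int(m.group(1), 16)), text)


if __name__ == "__main__":
    print(to_unicode_ref("\u4e01"))
    print(expand_unicode_refs("before &U-4E01; after"))
```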

Replacement for non-Kanji:
Some punctuation marks, parentheses, Japanese kana etc. do not map exactly either. They are mapped to similar forms where possible; otherwise they are replaced by a reference that marks their position in the original code.
For conversion from Big5 to Shift-JIS, the reference has the following pattern: &BJ-A1A4;. Here, A1A4 is the Big5 code of the character in question.
For conversion from Shift-JIS to Big5, the corresponding reference will be &SJ-8441;, where 8441 is the Shift-JIS code of the character that could not be converted.
User-defined characters that have positions outside the public code set are remapped along the same lines.
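The two reference patterns for non-Kanji characters can be produced directly from the two bytes of the original code. The following sketch assumes a hypothetical helper `code_ref` and the source names "big5" and "shift_jis"; it is not taken from the conversion programs themselves.

```python
# Prefix used in the reference, keyed by the encoding of the source text.
PREFIXES = {"big5": "BJ", "shift_jis": "SJ"}


def code_ref(raw: bytes, source: str) -> str:
    """Build a reference such as &BJ-A1A4; from a two-byte source code."""
    if len(raw) != 2:
        raise ValueError("expected a two-byte character code")
    if source not in PREFIXES:
        raise ValueError(f"unknown source encoding: {source}")
    return f"&{PREFIXES[source]}-{raw.hex().upper()};"


if __name__ == "__main__":
    print(code_ref(b"\xa1\xa4", "big5"))
    print(code_ref(b"\x84\x41", "shift_jis"))
```

Because the reference carries the original code unchanged, the source character can always be looked up again in a code table.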

When encountering the above codes in a converted text, you can use a code table (or the original file, where available) to determine which character could not be converted. Then you can use global replace in an editor to convert it to the character you want to have in that position.
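The same clean-up can be scripted instead of done by hand in an editor. The sketch below first counts the references left in a converted text and then applies a table of chosen replacements. The function names and the sample text are assumptions for illustration. KanjiBase codes are not matched, because their exact form is not given here.

```python
import re
from collections import Counter

# Matches the three reference forms described above: &U-...; &BJ-...; &SJ-...;
REF = re.compile(r"&(?:U|BJ|SJ)-[0-9A-F]{4};")


def list_references(text: str) -> Counter:
    """Count each distinct reference so the most frequent can be handled first."""
    return Counter(REF.findall(text))


def replace_references(text: str, choices: dict) -> str:
    """Replace references listed in `choices`; leave all others untouched."""
    return REF.sub(lambda m: choices.get(m.group(0), m.group(0)), text)


if __name__ == "__main__":
    sample = "x &BJ-A1A4; y &U-4E01; z &BJ-A1A4;"
    for ref, count in list_references(sample).most_common():
        print(ref, count)
    print(replace_references(sample, {"&U-4E01;": "\u4e01"}))
```

Leaving unlisted references in place means the replacement can be run repeatedly as more decisions are made.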

Author: Christian Wittern
Last updated: 95/04/27
