IRIZ Character Conversion Tools
by Christian Wittern
Overview
Kanji conversion between JIS and Big5 is a rather complex process, and a simple one-to-one conversion gives poor results. Instead, one needs to establish the aim of the conversion and the needs of the user, and perform the conversion accordingly.
In order to solve this problem we came up with the following strategy:
- First, you should prepare your text file for the conversion process.
- For JIS to Big5 conversion, we recommend using our New2Old tool which converts post-war simplified forms into pre-war traditional ones.
- For Big5 to JIS conversion you should first use our NORMALIZ tool which reduces the number of variant forms and thus will increase the JIS matches.
- Second, you should decide which level of precision or strictness you need in the resulting file (see below). This depends on the purpose of your conversion.
- Third, you run the conversion program on a Macintosh or DOS machine.
- Fourth, you look at the resulting file in an editor and get familiar with the replacement codes that we use in our conversion.
- Finally, if you converted from Big5 to JIS and would like post-war JIS characters, you can use our Old2New tool.
Conversion between the most frequently encountered encodings of
Chinese Characters is no easy task. What is attempted here is only the
conversion between the Shift-JIS encoding (frequently encountered in
Japan) and the Big5 encoding (originating from Taiwan).
The main problems in this conversion process and my attempts to
solve them are outlined here. A proper understanding of these issues is
necessary for using the conversion tools successfully.
Here are some of the problems:
- The number of characters in Big5 is more than 13000; in JIS, it is only around 6500.
- JIS contains pre-war (traditional) and post-war (simplified) forms of
characters as well as some variant forms (for example, as many as
five different forms for the character meaning 'sword').
- Big5 contains only traditional forms and in general no variant
forms.
- JIS is a national standard issued by an ISO member organization,
whereas Big5 originated from an agreement among different vendors in the
computer industry, a consortium that has long since dissolved.
- The character forms for JIS are very strictly prescribed by the
issuing body, whereas it is today impossible to establish 'correct'
forms for Big5.
- JIS has been modified several times; characters have been added or
rearranged. For a detailed account of this, see Ken Lunde,
Understanding Japanese Information Processing, O'Reilly & Associates,
Inc., 1993.
- The Taiwanese Standards Bureau has made an attempt to correct the
mistakes in Big5 by creating its own version, which is known as CNS.
This is not just a different encoding, as in the case of the various
JIS encodings, but a different code set.
It should be clear from this that the process of converting
between these codes is far from straightforward. To be practicable, the conversion must often perform some kind of remapping. As the degree
of the desirable tolerance for this remapping depends on the text and the purpose of the user, we attempt no simple, fixed translation. Rather, we use several different mapping tables.
Grades of Precision
We distinguish three grades of precision for the conversion:
- Strict conversion. A one-to-one conversion is performed where
possible. Where no counterpart exists (about 1000 characters from JIS
to Big5, more than 7000 from Big5 to JIS), a character reference to
CEF or a Unicode placeholder is inserted in the text. This is appropriate where round-trip conversion is required.
- Relaxed conversion. Slight differences in the character forms are
allowed, as long as the meaning and readings are not changed. This
greatly reduces the number of external character references and produces
texts of improved readability. As this will be the most frequently
needed version, and there are still a number of questionable decisions
involved, not every possible replacement is attempted in the main
conversion. You can customize the conversion by defining your own
replacement table, or use another set of scripts that is provided for
fine-tuning this conversion.
- Black hole conversion. Characters are remapped wherever possible.
Where no conversion can be made, a black circle is inserted in the text,
thus creating black holes in the text. You can use this to easily track
down non-convertible characters.
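The three grades can be pictured as a single lookup routine with different fallbacks. The following Python sketch is only an illustration, not the code of the conversion program: the names `Grade`, `convert`, `EXACT`, `VARIANTS` and `unicode_reference` are hypothetical, the two tiny tables stand in for the real mapping tables, and a Unicode-style placeholder is used for every unmapped character for simplicity. It also assumes that black hole conversion draws on the same variant table as relaxed conversion.

```python
from enum import Enum


class Grade(Enum):
    STRICT = "strict"
    RELAXED = "relaxed"
    BLACK_HOLE = "black hole"


BLACK_CIRCLE = "\u25cf"  # the black circle that marks a "black hole"

# Toy tables for illustration only; the real tables are far larger.
EXACT = {"一": "一"}      # characters with a one-to-one counterpart
VARIANTS = {"剣": "劍"}   # post-war form mapped to a traditional form


def unicode_reference(ch):
    """Placeholder for an unmapped character, e.g. &U-4E01;."""
    return "&U-%04X;" % ord(ch)


def convert(text, grade, exact=EXACT, variants=VARIANTS):
    """Convert text character by character under the given grade."""
    out = []
    for ch in text:
        if ch in exact:
            out.append(exact[ch])            # exact match: used by every grade
        elif grade is not Grade.STRICT and ch in variants:
            out.append(variants[ch])         # near variant: relaxed grades only
        elif grade is Grade.BLACK_HOLE:
            out.append(BLACK_CIRCLE)         # no match: leave a black hole
        else:
            out.append(unicode_reference(ch))  # no match: keep a reference
    return "".join(out)


# "丁" is treated as unmapped here only because the toy tables omit it.
sample = "一剣丁"
print(convert(sample, Grade.STRICT))      # 一&U-5263;&U-4E01;
print(convert(sample, Grade.RELAXED))     # 一劍&U-4E01;
print(convert(sample, Grade.BLACK_HOLE))  # 一劍●
```

In this sketch only the strict grade keeps enough information to recover the original character from the output, which is why it suits round-trip conversion.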
Replacement Conventions
The following conventions apply for the replacement characters inserted in the text:
Replacement for Kanji:
- Wherever possible, KanjiBase
codes are inserted into the document. They refer to the
public standard CNS 11643-1992, issued by the National Standards
Institute, Taipei, Taiwan.
- Where no KanjiBase codes exist, a Unicode reference is inserted. Its form is similar: &U-4E01; refers to the Unicode
character at the code point U+4E01.
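As a small illustration of the Unicode reference form, the following Python sketch builds and reads such references. It assumes, based only on the single example &U-4E01;, that a reference always carries exactly four uppercase hexadecimal digits; the function names are hypothetical and not part of the conversion tools.

```python
import re

# Assumed shape of a Unicode reference: "&U-" + four hex digits + ";".
UNICODE_REF = re.compile(r"&U-([0-9A-F]{4});")


def to_unicode_ref(ch):
    """Build the reference for a single character."""
    return "&U-%04X;" % ord(ch)


def from_unicode_ref(ref):
    """Return the character a reference points to."""
    match = UNICODE_REF.fullmatch(ref)
    if match is None:
        raise ValueError("not a Unicode reference: %r" % ref)
    return chr(int(match.group(1), 16))


print(to_unicode_ref("\u4e01"))                   # &U-4E01;
print(from_unicode_ref("&U-4E01;") == "\u4e01")   # True
```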
Replacement for non-Kanji:
- Some punctuation marks, parentheses, Japanese kana etc. do not
map exactly either. They are mapped to similar forms where possible; otherwise they are replaced by a reference that marks their position in the
original code.
- For conversion from Big5 to Shift-JIS, the reference has
the following pattern: &BJ-A1A4;. Here, A1A4 is the Big5 code of the
character in question.
- For conversion from Shift-JIS to Big5, the corresponding reference is &SJ-8441;,
where 8441 is the Shift-JIS code of the character that could
not be converted.
- User-defined characters that have positions outside the public
code set are remapped along the same lines.
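The two reference patterns above differ only in their prefix, so both can be produced from the raw bytes of the unconverted character. The following Python sketch shows one way to do this; the function name and the `source` labels "big5" and "sjis" are assumptions made for the example, not names used by the conversion program.

```python
# Prefix used in the reference, keyed by the encoding of the source text.
PREFIXES = {"big5": "BJ", "sjis": "SJ"}


def code_reference(raw, source):
    """Return a reference such as &BJ-A1A4; for a two-byte character.

    raw    -- the two bytes of the character in the source encoding
    source -- "big5" or "sjis" (hypothetical labels for this sketch)
    """
    if len(raw) != 2:
        raise ValueError("expected a two-byte character code")
    return "&%s-%s;" % (PREFIXES[source], raw.hex().upper())


print(code_reference(b"\xa1\xa4", "big5"))  # &BJ-A1A4;
print(code_reference(b"\x84\x41", "sjis"))  # &SJ-8441;
```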
When encountering the above codes in a converted text, you can use a code table (or the original file, where available) to determine which
character could not be converted. Then you can use global replace in an editor to convert it to the character you want to have in that position.
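Where an editor's global replace is inconvenient, the same clean-up can be scripted. The Python sketch below first counts the references left in a converted text and then replaces those for which you supply a character. It covers only the &U-, &BJ- and &SJ- forms described above and assumes four uppercase hexadecimal digits in each; KanjiBase codes are not handled, and all names are hypothetical.

```python
import re
from collections import Counter

# Matches &U-4E01;, &BJ-A1A4; and &SJ-8441; style references.
REFERENCE = re.compile(r"&(?:U|BJ|SJ)-[0-9A-F]{4};")


def list_references(text):
    """Count how often each reference occurs in the converted text."""
    return Counter(REFERENCE.findall(text))


def replace_references(text, replacements):
    """Replace every reference found in `replacements`; keep the others."""
    def substitute(match):
        ref = match.group(0)
        return replacements.get(ref, ref)
    return REFERENCE.sub(substitute, text)


converted = "A&BJ-A1A4;B&U-4E01;C&BJ-A1A4;"
print(list_references(converted)["&BJ-A1A4;"])               # 2
print(replace_references(converted, {"&BJ-A1A4;": "."}))     # A.B&U-4E01;C.
```

Listing the references first shows which unconverted characters occur most often, so the most frequent ones can be dealt with first.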
Author: Christian Wittern
Last updated: 95/04/27