The first character code designed to make the processing of ideographic characters on computers possible was JIS C 6226-1978. It was developed according to the guidelines laid down in ISO 2022-1973 and became the model for most other code standards used in East Asia today (the most notable exception being Big5). Covering approximately 6500 characters, this standard has been revised twice, in 1983 and 1990; in these revisions the assignments of some characters were changed and a few characters were added. Revising a standard is about the worst thing a standards body can do, and it has caused much grief and headache among manufacturers and users alike. Today we finally have fonts that bear the year of the standard they cover in their name, so that users can know which version is encoded in a font and select it accordingly. Our texts and tools are based on the latest version.
The 1990 version has become known under the name JIS X 0208-1990; together with an additional set of 5800 characters (JIS X 0212), it formed the base of the Japanese contribution to Unicode.
The JIS code is almost never used in computers exactly as it was defined; rather, some changes are made in the way the code numbers are represented. This is necessary to allow JIS to be mixed with ASCII characters and, as in the case of Shift-JIS (or MS-Kanji, the most popular encoding on personal computers), with earlier Japanese encodings of half-width kana. East Asian text is thus most frequently based on a multibyte encoding: a character stream that contains a mixture of characters represented by a single byte and characters represented by two bytes.
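How such a mixed stream is taken apart can be sketched in a few lines. The sketch below walks a Shift-JIS byte stream and splits it into per-character chunks, using the commonly documented byte ranges: lead bytes of two-byte characters fall in 0x81-0x9F and 0xE0-0xFC, while ASCII (0x00-0x7F) and half-width katakana (0xA1-0xDF) occupy one byte each. This is an illustration of the principle, not a full validating decoder.

```python
def split_sjis(data: bytes):
    """Split a Shift-JIS byte stream into per-character byte chunks."""
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        # Lead byte of a two-byte character: 0x81-0x9F or 0xE0-0xFC
        if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
            chars.append(data[i:i + 2])
            i += 2
        else:
            # Single byte: ASCII (0x00-0x7F) or half-width kana (0xA1-0xDF)
            chars.append(data[i:i + 1])
            i += 1
    return chars

# "A" (ASCII), one kanji (two bytes), one half-width katakana (one byte)
sample = "A漢ｱ".encode("shift_jis")
print([len(c) for c in split_sjis(sample)])   # [1, 2, 1]
```

Note that a decoder must always inspect the lead byte first: a trail byte taken on its own can fall into the ASCII or half-width kana range, which is exactly what makes naive byte-wise searching in such streams unreliable.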
In addition to the characters in the national standard, many Japanese vendors have added their own private characters to JIS, making conversion between the resulting encodings difficult beyond belief.
There are different legends about the beginnings of Big5: some say the code was developed for an integrated application with five parts, others that it was an agreement among five big vendors in the computer industry. Whichever is true (and it might well be something else), the Taiwanese government did not recognize the need for a practical encoding of Chinese characters early enough. Government agencies had apparently also been involved in the development of Big5, but an official code was announced only in 1986, by which time Big5 was already a de facto standard with numerous applications in daily use.
Big5 defines 13051 Chinese characters, arranged in two parts according to their frequency of usage. Within these parts, characters are arranged by number of strokes, then by Kangxi radical. As Big5 was apparently developed in great hurry, mistakes were made in the stroke count (and thus the placement) of some characters, and two characters are encoded twice. On the other hand, some frequently used characters were left out and were later added by individual companies.
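The two-part arrangement is visible directly in the byte values: in the commonly documented layout, lead bytes 0xA4-0xC6 hold the frequent characters, 0xC9-0xF9 the less frequent ones, and 0xA1-0xA3 symbols, while the trail byte must fall in 0x40-0x7E or 0xA1-0xFE. The sketch below classifies a Big5 code by these core ranges; vendor extensions outside them are not covered.

```python
def big5_block(lead: int, trail: int) -> str:
    """Classify a Big5 two-byte code by its lead byte (core ranges only)."""
    if not (0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE):
        return "invalid trail byte"
    if 0xA1 <= lead <= 0xA3:
        return "symbols"
    if 0xA4 <= lead <= 0xC6:
        return "part 1 (frequent characters)"
    if 0xC9 <= lead <= 0xF9:
        return "part 2 (less frequent characters)"
    return "outside core Big5"

b = "一".encode("big5")          # a very common character
print(big5_block(b[0], b[1]))
```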
All implementations agree on the core part of Big5, but different extensions by individual vendors acquired much weight, most notably in the case of the ETEN Chinese system, which was very popular in the late eighties and early nineties. As there is no document that defines Big5 apart from the documentation provided by vendors with their products, it is impossible to single out one standard Big5. This was actually a big problem in the process of designing Unicode -- and it remains one even today.
This is the Chinese National Code for Taiwan. In the form published in 1992, it defines the glyph shape, stroke count, and radical heading for 48027 characters. For all these characters a reference font in a 40 by 40 grid (and for most of them also in a 24 by 24 grid) is available from the issuing body. The characters are assigned to 7 levels, with the more frequent ones at the lower levels and the variant forms at the two top levels. The overall architecture reserves space for five more standard levels, and four levels are reserved for non-standard, private encoding, bringing the total to 16 levels, with hypothetical space for roughly 120,000 ideographs. On top of the currently defined ones, one more level with about 7000 characters is currently under revision and is expected to be published in the course of 1995. This will bring the total number of assigned characters to roughly 55000.
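The capacity figures follow from the ISO 2022 architecture mentioned at the outset: assuming each CNS level is a standard 94-by-94 plane, one level holds 8836 code points, so 16 levels give a raw ceiling of 141,376. The roughly 120,000 quoted above is lower, presumably because some rows within each level are set aside rather than usable for ideographs.

```python
PER_LEVEL = 94 * 94         # an ISO 2022-style 94x94 plane
print(PER_LEVEL)            # 8836 code points per level
print(7 * PER_LEVEL)        # raw capacity of the 7 defined levels: 61852
print(16 * PER_LEVEL)       # raw ceiling across all 16 levels: 141376
```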
The overall structure has already been outlined; but how does the CNS code relate to other code sets in use in East Asia, e.g. the Korean KSC, the Japanese JIS, and the mainland Chinese GB? And what about Unicode?
The answer is somewhat disappointing: although CNS defines roughly eight times the number of characters, more than three hundred characters present in the Japanese JIS are still missing from CNS. In relation to GB, CNS lacks roughly 1800 simplified characters. It is thus also clear that the CNS code will miss quite a number of Unicode Han characters. Upon closer examination, the reason soon becomes obvious: CNS occasionally defines some abbreviated forms in its higher levels, but in general it does not include characters created as a result of the modern character reforms. I consider this a serious drawback and an obstacle to a true universal character set, but it seems to have been a design principle of CNS. It is of course also understandable, as CNS is designed not for the use of researchers but for Taiwanese government agencies and their census registers. However, since the additional CNS level under revision will include all Unicode Han characters still missing from CNS, the problem may turn out to be less important after all.
CCCII is a very large code set developed and maintained by the Chinese Character Research Group in Taipei. It is the earliest attempt to create a unified set of Chinese characters containing all characters from all East Asian countries, and it also tried to provide a solution to the problem of how to deal with variant characters. CCCII currently defines ca. 75,000 code points (not 33,000 as I mistakenly stated in Electronic Bodhidharma 3, p. 46), of which ca. 18,000 still have draft status and are currently under revision. As some characters are encoded two, three, or even more times in CCCII, it is impossible to say how many characters it really contains; I expect the number to be around 60,000. CCCII is used today only in some libraries in Taiwan, the US, and Hong Kong -- in many cases in the guise of EACC (also known as ANSI Z39.64-1989), a modified subset of CCCII that contains about 16,000 characters.
When I first heard about CCCII I was very enthusiastic and thought it might provide a solution for the problems we were dealing with. This turned out to be wrong. Here are my main problems with CCCII:
Unicode is the upcoming new standard not only for East Asian scripts but for a large number of other scripts of the world. Although there is much hype around Unicode and ISO 10646-1:1993 (which are identical), it has so far been vaporware without real-world platforms and applications. If applications for Unicode become available, it is sure to become the preferred base code set for any serious work. However, many characters will still be missing from Unicode, which again shows the need for an approach such as that of the IRIZ KanjiBase encoding strategy, namely: to let users work on whatever platform they like, and to supplement characters missing from that platform by inserting standardized, portable code placeholders that can be transformed into bitmaps when the need arises.
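The placeholder strategy can be sketched as a simple substitution pass. The placeholder syntax below ("&C<plane>-<code>;", naming a character by its position in a reference code such as CNS) is a hypothetical illustration -- the actual KanjiBase format may differ -- but the mechanism is the point: placeholders for which the local platform has a glyph or bitmap are replaced, and the rest are left intact, so the text stays portable.

```python
import re

# Hypothetical placeholder syntax "&C<plane>-<code>;", e.g. "&C3-2121;".
PLACEHOLDER = re.compile(r"&C(\d)-([0-9A-F]{4});")

def render(text: str, glyph_lookup):
    """Replace placeholders the local platform can display.

    glyph_lookup(plane, code) returns a displayable glyph (or bitmap
    reference) or None; unresolved placeholders are kept as-is.
    """
    def sub(match):
        glyph = glyph_lookup(int(match.group(1)), match.group(2))
        return glyph if glyph is not None else match.group(0)
    return PLACEHOLDER.sub(sub, text)

print(render("missing char: &C3-2121;", lambda plane, code: "[bitmap]"))
```

The design choice worth noting is that an unresolved placeholder is not an error: it survives round trips through platforms that cannot display it and can still be turned into a bitmap later.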