The first character code designed to make the processing of ideographic characters on computers possible was JIS C 6226-1978. It was developed according to the guidelines laid down in ISO 2022-1973 and became the model for most other code standards used in East Asia today (the most notable exception being Big5). Covering approximately 6500 characters, this standard has been revised twice, in 1983 and 1990; in these revisions the assignments of some characters were changed and a few characters were added. Revising a standard is about the worst thing a standards body can do, and it has caused much grief and headache among manufacturers and users alike. Today we finally have fonts that bear the year of the standard they cover in their name, so that users can know which version is encoded in a font and select it accordingly. Our texts and tools are based on the latest version.
The 1990 version has become known under the name JIS X 0208-1990; together with an additional set of 5800 characters (JIS X 0212), it formed the base of the Japanese contribution to Unicode.
The JIS code is almost never used in computers exactly as it was defined; rather, some changes are made in the way the code numbers are represented. This is necessary to allow JIS to be mixed with ASCII characters and, as in the case of Shift-JIS (or MS-Kanji, the most popular encoding on personal computers), with earlier Japanese encodings of half-width kana. East Asian text is thus most frequently based on a multibyte encoding: a character stream that contains a mixture of characters represented by a single byte and characters represented by two bytes.
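How such a mixed stream is taken apart can be sketched in a few lines. The sketch below walks a Shift-JIS byte stream and splits it into per-character chunks, using the commonly documented byte ranges: lead bytes of two-byte characters fall in 0x81-0x9F and 0xE0-0xFC, while ASCII (0x00-0x7F) and half-width katakana (0xA1-0xDF) occupy one byte each. This is an illustration of the principle, not a full validating decoder.

```python
def split_sjis(data: bytes):
    """Split a Shift-JIS byte stream into per-character byte chunks."""
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        # Lead byte of a two-byte character: 0x81-0x9F or 0xE0-0xFC
        if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
            chars.append(data[i:i + 2])
            i += 2
        else:
            # Single byte: ASCII (0x00-0x7F) or half-width kana (0xA1-0xDF)
            chars.append(data[i:i + 1])
            i += 1
    return chars

# "A" (ASCII), one kanji (two bytes), one half-width katakana (one byte)
sample = "A漢ｱ".encode("shift_jis")
print([len(c) for c in split_sjis(sample)])   # [1, 2, 1]
```

Note that a decoder must always inspect the lead byte first: a trail byte taken on its own can fall into the ASCII or half-width kana range, which is exactly what makes naive byte-wise searching in such streams unreliable.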
In addition to the characters in the national standard, many Japanese vendors have added their own private characters to JIS, making conversion between the resulting encodings difficult beyond belief.
There are different legends about the beginnings of Big5: some say the code was developed for an integrated application with five parts, others that it was an agreement among five big vendors in the computer industry. Whichever is true (and it might well be something else), the Taiwanese government did not recognize the need for a practical encoding of Chinese characters early enough. Government agencies had apparently also been involved in the development of Big5, but an official code was announced only in 1986, by which time Big5 was already a de facto standard with numerous applications in daily use.
Big5 defines 13051 Chinese characters, arranged in two parts according to their frequency of usage. Within these parts, characters are arranged by number of strokes, then by Kangxi radical. As Big5 was apparently developed in great hurry, mistakes were made in the stroke count (and thus the placement) of some characters, and two characters are encoded twice. On the other hand, some frequently used characters were left out and were later added by individual companies.
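The two-part arrangement is visible directly in the byte values: in the commonly documented layout, lead bytes 0xA4-0xC6 hold the frequent characters, 0xC9-0xF9 the less frequent ones, and 0xA1-0xA3 symbols, while the trail byte must fall in 0x40-0x7E or 0xA1-0xFE. The sketch below classifies a Big5 code by these core ranges; vendor extensions outside them are not covered.

```python
def big5_block(lead: int, trail: int) -> str:
    """Classify a Big5 two-byte code by its lead byte (core ranges only)."""
    if not (0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE):
        return "invalid trail byte"
    if 0xA1 <= lead <= 0xA3:
        return "symbols"
    if 0xA4 <= lead <= 0xC6:
        return "part 1 (frequent characters)"
    if 0xC9 <= lead <= 0xF9:
        return "part 2 (less frequent characters)"
    return "outside core Big5"

b = "一".encode("big5")          # a very common character
print(big5_block(b[0], b[1]))
```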
All implementations agree on the core part of Big5, but different extensions by individual vendors acquired much weight, most notably in the case of the ETEN Chinese system, which was very popular in the late eighties and early nineties. As there is no document that defines Big5 apart from the documentation provided by vendors with their products, it is impossible to single out one standard Big5. This was actually a big problem in the process of designing Unicode -- and it remains one even today.
This is the Chinese National Code for Taiwan. In the form published in 1992, it defines the glyph shape, stroke count, and radical heading for 48027 characters. For all these characters a reference font in a 40 by 40 grid (and for most of them also in a 24 by 24 grid) is available from the issuing body. The characters are assigned to 7 levels, with the more frequent ones at the lower levels and the variant forms at the two top levels. The overall architecture reserves space for five more standard levels, and four levels are reserved for non-standard, private encoding, bringing the total to 16 levels, with hypothetical space for roughly 120,000 ideographs. On top of the currently defined ones, one more level with about 7000 characters is currently under revision and is expected to be published in the course of 1995. This will bring the total number of assigned characters to roughly 55000.
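The capacity figures follow from the ISO 2022 architecture mentioned at the outset: assuming each CNS level is a standard 94-by-94 plane, one level holds 8836 code points, so 16 levels give a raw ceiling of 141,376. The roughly 120,000 quoted above is lower, presumably because some rows within each level are set aside rather than usable for ideographs.

```python
PER_LEVEL = 94 * 94         # an ISO 2022-style 94x94 plane
print(PER_LEVEL)            # 8836 code points per level
print(7 * PER_LEVEL)        # raw capacity of the 7 defined levels: 61852
print(16 * PER_LEVEL)       # raw ceiling across all 16 levels: 141376
```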
The overall structure has already been outlined; but how does the CNS code relate to other code sets in use in East Asia, e.g. the Korean KSC, the Japanese JIS, and the mainland Chinese GB? And what about Unicode?
The answer is somewhat disappointing: although CNS defines roughly eight times the number of characters, more than three hundred characters present in the Japanese JIS are still missing from CNS. In relation to GB, CNS lacks roughly 1800 simplified characters. It is thus also clear that the CNS code will miss quite a number of Unicode Han characters. Upon closer examination, the reason soon becomes obvious: CNS occasionally defines some abbreviated forms in its higher levels, but in general it does not include characters created as a result of the modern character reforms. I consider this a serious drawback and an obstacle to a true universal character set, but it seems to have been a design principle of CNS. It is of course also understandable, as CNS is designed not for the use of researchers but for Taiwanese government agencies and their census registers. However, since the additional CNS level under revision will include all Unicode Han characters still missing from CNS, the problem may turn out to be less important after all.
CCCII is a very large code set developed and maintained by the Chinese Character Research Group in Taipei. It is the earliest attempt to create a unified set of Chinese characters containing all characters from all East Asian countries, and it also tried to provide a solution to the problem of how to deal with variant characters. CCCII currently defines ca. 75,000 code points (not 33,000 as I mistakenly stated in Electronic Bodhidharma 3, p. 46), of which ca. 18,000 still have draft status and are currently under revision. As some characters are encoded two, three, or even more times in CCCII, it is impossible to say how many characters it really contains; I expect the number to be around 60,000. CCCII is used today only in some libraries in Taiwan, the US, and Hong Kong -- in many cases in the guise of EACC (also known as ANSI Z39.64-1989), a modified subset of CCCII that contains about 16,000 characters.
When I first heard about CCCII I was very enthusiastic and thought it might provide a solution for the problems we were dealing with. This turned out to be wrong. Here are my main problems with CCCII:
Unicode is the upcoming new standard not only for East Asian scripts but for a large number of other scripts of the world. Although there is much hype around Unicode and ISO 10646-1:1993 (which are identical), it has so far been vaporware without real-world platforms and applications. If applications for Unicode become available, it is sure to become the preferred base code set for any serious work. However, many characters will still be missing from Unicode, which again shows the need for an approach such as that of the IRIZ KanjiBase encoding strategy, namely: to let users work on whatever platform they like, and to supplement characters missing from that platform by inserting standardized, portable code placeholders that can be transformed into bitmaps when the need arises.
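The placeholder strategy can be sketched as a simple substitution pass. The placeholder syntax below ("&C<plane>-<code>;", naming a character by its position in a reference code such as CNS) is a hypothetical illustration -- the actual KanjiBase format may differ -- but the mechanism is the point: placeholders for which the local platform has a glyph or bitmap are replaced, and the rest are left intact, so the text stays portable.

```python
import re

# Hypothetical placeholder syntax "&C<plane>-<code>;", e.g. "&C3-2121;".
PLACEHOLDER = re.compile(r"&C(\d)-([0-9A-F]{4});")

def render(text: str, glyph_lookup):
    """Replace placeholders the local platform can display.

    glyph_lookup(plane, code) returns a displayable glyph (or bitmap
    reference) or None; unresolved placeholders are kept as-is.
    """
    def sub(match):
        glyph = glyph_lookup(int(match.group(1)), match.group(2))
        return glyph if glyph is not None else match.group(0)
    return PLACEHOLDER.sub(sub, text)

print(render("missing char: &C3-2121;", lambda plane, code: "[bitmap]"))
```

The design choice worth noting is that an unresolved placeholder is not an error: it survives round trips through platforms that cannot display it and can still be turned into a bitmap later.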