
The IRIZ KanjiBase

by Christian Wittern and Urs App


  1. Why KanjiBase?
  2. What is KanjiBase?
  3. How to use KanjiBase
  4. Technical information about our coding approach

Summary

KanjiBase was developed by Christian Wittern in the framework of the Zen KnowledgeBase project. It is a new method of representing missing Chinese characters with placeholders that are both standardized and system-independent. It uses the Taiwanese government's CNS code (so far 48,000 characters) to supplement existing codes such as JIS or Big5 as well as future ones such as Unicode. Thus one can continue using one's habitual word processing and database programs while assigning stable, portable codes to characters that are not present in those codes.
Thanks to this approach, you can keep working in your habitual environment, whether it is Japanese, Taiwanese, mainland Chinese, or Korean Windows. The Macintosh (and later the Unix) environment will be supported, and KanjiBase will soon be accessible on the Internet. In the present implementations, characters not available in your system can be looked up in KanjiBase and pasted into your document as a printable graphic image linked to an unambiguous, portable, SGML-conformant code. Documents containing KanjiBase characters can be printed on your ordinary printer using, for example, MS Word for Windows, Word for Macintosh, or Werner Lemberg's CJK TeX, which works on several platforms.

Why KanjiBase?

Due to the structure of the Chinese script and the tools available today for processing it on computers, there are always Chinese characters that cannot be input. Although such characters may make up no more than 1 to 5% of a classical text, they pose a serious problem. Until now, each individual and institution has created its own codes or placeholders for such characters, resulting in data that cannot be exchanged and that conform to no commonly accepted standard.

Rather than defining an ad hoc private encoding for every character missing from the code set in use, it is advisable to use standard references wherever possible; in this way, data become exchangeable and databases remain maintainable. We carefully evaluated all available character codes for Chinese characters and concluded that the Taiwanese CNS code provides the best starting point: it is large, well defined, and builds on the widely used Big-5 code.

However, what was needed was not just a large character set but a method for using those characters in combination with whatever system and kanji code is installed on your machine. In other words, a good method must be system-independent without preventing the use of those systems. As with accented characters on the World Wide Web, we sought an entirely ASCII-based method of encoding characters -- but in our case, thousands upon thousands of such references were needed.

What is KanjiBase?

The foundation of KanjiBase, the method invented by Christian Wittern to encode such an extended character set, works by inserting ASCII placeholders wherever a character is missing from your system or from the national code that you are using. This is useful for text databases as well as for ordinary word processing. Moreover, through these references one can more easily convert texts between different encodings (such as JIS, GB, or Big5) or achieve varied levels of unification for specific needs.
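
As a concrete illustration (a hypothetical sentence in a short Python sketch; the reference format is explained under "Technical information" below), a text in which one character cannot be encoded locally simply carries the placeholder inline as ordinary ASCII:

    # Hypothetical example: the character at this position is not in the local
    # code set, so the KanjiBase reference &C3-213A; stands in for it.
    line = "The rare character &C3-213A; occurs twice in this manuscript."
    print("&C3-213A;" in line)   # True: the placeholder is plain ASCII text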

Unlike other large code sets, the Chinese National Standard (CNS), from which KanjiBase takes its codepoints, has a very close relationship to the Big-5 code that is widely used today. Although other East Asian code sets do not mesh with KanjiBase as smoothly as Big5 does, the same references can also be used to represent characters missing from those code sets (for example JIS in Japan or GB in mainland China). KanjiBase is thus a way to extend any of these code sets, not just Big5, and lets you continue working in your habitual OS and application environment while having many more Chinese characters at your disposal.

The KanjiBase encoding not only facilitates and standardizes the use of missing characters but can also serve as the foundation for character code conversions of various kinds. For example, in a Big-5 to JIS conversion, many characters will be missing in JIS. The KanjiBase encoding strategy allows these characters to be represented by placeholders which, if needed, can be transformed into printable bitmaps (for example for proofreading). Another example: in the same conversion, the KanjiBase encoding can be used to achieve different degrees of strictness of code conversion, depending on one's needs. For a scholarly article, one may want the strictest conversion, which reflects even slight differences between glyphs. When preparing a concordance, on the other hand, a higher degree of unification may be preferable to make characters easier to look up in the printed product. The code conversion tool suite currently being developed at the IRIZ includes a tool that demonstrates such different degrees of conversion strictness from JIS to Big5 and vice versa. Other codes, such as the mainland Chinese GB code or the Korean KSC, can also be accommodated on this basis.
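
To make the fallback strategy concrete, here is a minimal sketch in Python. It is illustrative only and not the IRIZ tool suite: the mapping tables, the function name, and the lookup order are assumptions for the example.

    # Illustrative sketch only: the tables below are hypothetical stand-ins
    # for real Big-5-to-JIS and Big-5-to-CNS mapping tables.
    BIG5_TO_JIS = {}   # Big-5 characters that have a JIS counterpart
    BIG5_TO_CNS = {}   # Big-5 characters mapped to a CNS plane/codepoint, e.g. "3-213A"

    def big5_text_to_jis(text):
        """Convert text character by character; anything without a JIS
        counterpart becomes a KanjiBase placeholder such as &C3-213A;."""
        out = []
        for ch in text:
            if ch in BIG5_TO_JIS:
                out.append(BIG5_TO_JIS[ch])
            elif ch in BIG5_TO_CNS:
                out.append("&C%s;" % BIG5_TO_CNS[ch])   # portable placeholder
            else:
                out.append(ch)                          # ASCII etc. pass through unchanged
        return "".join(out)

Different degrees of strictness could then be realized simply by exchanging the mapping tables: a strict table keeps glyph variants apart, while a unifying table maps them to a single character.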

How to use KanjiBase

Due to the whole logic of KanjiBase, no specialized tools or expensive equipment are needed to use our codes in your Chinese texts. Using our Windows implementation or the Electronic Bodhidharma home page on the Internet (WWW.iijnet.or.jp/IRIZ/irizhome.htm), you can look up a character and copy its code into your texts. While the Internet and Macintosh implementations of KanjiBase are still in preparation, our ZenBase CD1 contains KanjiBase for Windows, a tool for selecting characters and inserting them into a word processing document or pasting them to the clipboard, where they are available to any Windows application. For the Macintosh, support is more limited at this time; we only include a set of macros for use with Word6 that converts codes into bitmaps for reading and printing purposes.

Implementation for Windows

Since a standalone implementation for Windows would defeat the purpose of supplementing current user environments, we have started by building one that interfaces with the most commonly used word processing program today, MS Word for Windows version 6 (English, Japanese, and Chinese versions were tested). After installing KanjiBase for Windows on your system, you can turn on the option "Paste to Word", and the program will automatically paste the KanjiBase code of the needed character into your document. The CEF2BMP macro in our Kanjitools for Winword transforms the code into a displayable and printable bitmap. The code itself is embedded as a hidden comment, so that even saving the document as a text file will not obliterate the inserted KanjiBase code.
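
To show the idea in more familiar terms, here is a rough Python sketch (an illustration only, not the CEF2BMP macro itself, which is a Word macro; the bitmap file names and the image/comment notation are invented for the example): each placeholder is rendered as a reference to a bitmap image, while the original code is retained as a comment so it is never lost.

    import re

    # Illustration of the CEF2BMP idea, not the actual macro.
    PLACEHOLDER = re.compile(r"&C(\d)-([0-9A-Fa-f]{4});")

    def make_printable(text):
        def repl(match):
            plane, code = match.group(1), match.group(2).upper()
            bitmap = "C%s-%s.bmp" % (plane, code)    # hypothetical bitmap file name
            return "[image: %s]<!-- &C%s-%s; -->" % (bitmap, plane, code)
        return PLACEHOLDER.sub(repl, text)

    print(make_printable("The character &C3-213A; appears here."))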

Interfaces for other word processing applications on Windows are of course possible, but for the time being we prefer to focus on the Macintosh and Internet implementations. Please contact us if you are willing to build such an interface yourself.

Implementation for Macintosh

At present, we have only a set of macros that work with Word6 on the Mac. They allow converting KanjiBase codes into bitmaps for reading and printing. The KanjiBase code is embedded as a hidden comment; thus you can save the file as a text file and will not lose this information. A fuller implementation on the Macintosh that also allows searching the KanjiBase and pasting codes into documents is in preparation.

Implementations for other platforms

Currently no other platforms are supported, but we are working on an Internet implementation that will serve the needs of students, teachers, and researchers.

Werner Lemberg's CJK TeX

Werner Lemberg has developed CJK TeX, a platform-independent implementation with great potential, since the TeX typesetting system is available on most platforms. The CJK TeX package allows you to use Chinese, Korean, and Japanese text in your LaTeX documents; if needed, these languages can even be used at the same time. Mr. Lemberg has also added support for CNS via the KanjiBase code references. The most recent version, version 2.5, is included on the ZenBase CD1; please refer to the included documentation for details.

Technical information

The KanjiBase placeholders are constructed as in the following example: &C3-213A;. A detailed description is given here for better understanding; less technically minded readers can skip it without fear of missing important information. Several elements can be distinguished in the above example:
The first and the last characters, & and ;, are the opening and closing delimiters; they signal to the processing software and to the human reader that the characters in between are to be treated differently from the rest of the data stream. The following C signals to KanjiBase-aware software that what follows is a CNS code (for a code reference table of this code see ***).

What follows, up to the ;, is the code designating the character itself. This code again consists of two parts: a classifier that specifies what kind of code, from which of the areas covered by KanjiBase, follows; and (after the dash) a four-digit hexadecimal code. The following is a list of the allowed classifiers and their semantics:

Applications that support KanjiBase tags should at least be able to process the first two types; but support for X and Y codes is strongly recommended.
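
To make the structure concrete, here is a minimal parsing sketch in Python for CNS references of the kind shown above. It is illustrative only: reading the digit after C as the CNS plane is our interpretation of the example, and the X and Y classifiers are not handled.

    import re

    # Based on the example &C3-213A;: the delimiters & and ;, the letter C
    # marking a CNS code, the plane number, a dash, and a four-digit hex code.
    REFERENCE = re.compile(r"&C(?P<plane>\d)-(?P<code>[0-9A-Fa-f]{4});")

    def parse_reference(ref):
        """Split a KanjiBase CNS reference into its plane and hexadecimal codepoint."""
        match = REFERENCE.fullmatch(ref)
        if match is None:
            raise ValueError("not a KanjiBase CNS reference: %r" % ref)
        return int(match.group("plane")), match.group("code").upper()

    print(parse_reference("&C3-213A;"))   # (3, '213A')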

Another type of reference will also be encountered in documents. For characters from other East Asian code sets that are not available in KanjiBase (typically modern simplified characters), no private encoding should be used; instead, a reference to the corresponding Unicode codepoint should be given. Such references follow the recommendations developed by Rick Jelliffe for SGML Open and look like &U-4E00; for the Unicode character U+4E00.
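
A correspondingly small sketch for such Unicode references (again Python, illustrative only; it simply follows the &U-4E00; pattern of the example above):

    def unicode_reference(ch):
        """Write a character as a Unicode reference of the form shown above,
        e.g. &U-4E00; for the character U+4E00."""
        return "&U-%04X;" % ord(ch)

    print(unicode_reference("\u4e00"))   # &U-4E00;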

Authors: Christian Wittern and Urs App
Last updated: 95/04/23