Due to the structure of the Chinese script and the tools available today for processing it on computers, there are always Chinese characters that cannot be input. Although such characters may make up no more than 1 to 5% of a classical text, they pose a serious problem. So far, each individual and institution has created its own code or placeholder for such characters, resulting in data that cannot be exchanged and that conform to no commonly accepted standard.
Rather than defining an ad hoc private encoding for every character missing from the code set in use, it is advisable to use standard references wherever possible; in this way, data become exchangeable and database maintenance becomes feasible. We carefully evaluated the available character codes for Chinese characters and concluded that the Taiwanese CNS code furnishes the best starting point: it is large, well defined, and builds on the widely used Big-5 code.
However, what was needed was not just a large character set, but a method for using those characters in combination with whatever system and kanji code is installed on a given machine. In other words, a good method must be system-independent without interfering with those systems. As with accented characters on the World Wide Web, we sought an entirely ASCII-based method of encoding characters -- but in our case, thousands upon thousands of such references were needed.
In contrast to other large code sets, the Chinese National Standard (CNS), from which KanjiBase takes its codepoints, has a very close relationship to the Big-5 code that is widely used today. Although other East Asian code sets do not merge as smoothly with KanjiBase as Big-5 does, the same references can also be used to represent characters missing from those other code sets (for example JIS in Japan or GB in mainland China). KanjiBase is thus a way to extend any of these code sets, not just Big-5, letting you continue working in your habitual OS and application environment while having many more Chinese characters at your disposal.
The KanjiBase encoding not only facilitates and standardizes the use of missing characters but can also serve as the foundation for character code conversions of various kinds. For example, in a Big-5 to JIS conversion, many characters will be missing from JIS. The KanjiBase strategy allows these characters to be represented by placeholders, which can be transformed into printable bitmaps when needed (for example, for proofreading). In the same conversion, the KanjiBase encoding can also be used to achieve different degrees of strictness depending on one's needs. For a scholarly article, one may want the strictest conversion, which reflects even slight differences between glyphs. When compiling a concordance, on the other hand, a higher degree of unification may be preferable, to facilitate looking up characters in the printed product. The code conversion tool suite currently being developed at the IRIZ includes a tool that demonstrates such different degrees of strictness in converting between JIS and Big-5. Other codes, such as the mainland Chinese GB code or the Korean KSC, can also be accommodated on this basis.
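To make the placeholder strategy concrete, the following sketch shows how a conversion routine might emit KanjiBase-style references for characters missing from the target code set, with a switch between a strict and a unified conversion. The mapping tables and the CNS reference shown are hypothetical illustrations, not actual Big-5/JIS conversion data.

```python
# Sketch of the placeholder strategy in a Big-5 to JIS style conversion.
# The tables below are hypothetical; real mapping tables are far larger.

# Hypothetical mapping: source character -> (target character, KanjiBase ref).
# A target of None means the character is assumed missing from the target set.
BIG5_TO_JIS = {
    "字": ("字", None),          # same character available in both sets
    "區": (None, "&C1-4F5A;"),   # assumed missing from JIS (hypothetical CNS ref)
}

# Looser, concordance-style unification to a similar target glyph.
UNIFIED_FALLBACK = {
    "區": "区",
}

def convert(text, strict=True):
    out = []
    for ch in text:
        target, ref = BIG5_TO_JIS.get(ch, (ch, None))
        if target is not None:
            out.append(target)                 # directly convertible
        elif not strict and ch in UNIFIED_FALLBACK:
            out.append(UNIFIED_FALLBACK[ch])   # unify for easy lookup
        else:
            out.append(ref)                    # strict: keep the exact glyph as a reference
    return "".join(out)

print(convert("字區", strict=True))   # 字&C1-4F5A;
print(convert("字區", strict=False))  # 字区
```

The same skeleton accommodates other code pairs (GB, KSC) simply by swapping the tables; the placeholder references remain identical across all of them.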
Thanks to the overall design of KanjiBase, no specialized tools or expensive equipment are needed to use our codes in your Chinese texts. Using our Windows implementation, or the Electronic Bodhidharma home page on the Internet (WWW.iijnet.or.jp/IRIZ/irizhome.htm), you can look up a character and copy its code into your texts. While the Internet and Macintosh implementations of KanjiBase are still in preparation, our ZenBase CD1 contains KanjiBase for Windows, a tool for selecting characters and inserting them into a word processing document, or pasting them to the clipboard, where they are available to any Windows application. For the Macintosh, support is more limited at this time: we include only a set of macros for use with Word 6 that converts codes into bitmaps for reading and printing purposes.
Since a standalone implementation for Windows would defeat the purpose of supplementing current user environments, we began by building one that interfaces with the most commonly used word processing program today, MS Word for Windows version 6 (the English, Japanese, and Chinese versions were tested). After installing KanjiBase for Windows on your system, you can turn on the option "Paste to Word", and the program will automatically paste the KanjiBase code of the needed character into your document. The CEF2BMP macro in our Kanjitools for Winword transforms the code into a displayable and printable bitmap. The code itself is embedded as a hidden comment, so that even saving the document as a text file will not obliterate the inserted KanjiBase code.
Interfaces for other word processing applications on Windows are of course possible, but for the time being we prefer to focus on the Macintosh and Internet implementations. Please contact us if you are willing to construct such an interface yourself.
Currently no other platforms are supported, but we are working on an Internet implementation that will serve the needs of students, teachers, and researchers.
Werner Lemberg has developed CJK TeX, a platform-independent implementation with great potential, since the TeX typesetting system is available on most platforms. The CJK TeX package allows you to use Chinese, Korean, and Japanese text in your LaTeX documents; if needed, these languages can even be used at the same time. Mr. Lemberg has also added support for CNS via the KanjiBase code references. The most recent version, 2.5, is included on the ZenBase CD1; please refer to the included documentation for details.
The codes used to construct the KanjiBase placeholders follow the pattern of this example: &C3-213A;. A detailed description is given here for better understanding; less technically minded readers can skip it without fear of missing important information. Several elements can be distinguished in the above example:
The first and the last characters, & and ;, are the opening and closing delimiters; they signal to the processing software and to the human reader that the characters in between are to be treated differently from the rest of the data stream. The following C signals to KanjiBase-aware software that what follows is a CNS code (for a code reference table of this code, see ***).
What follows, up to the ;, is the code designating the character itself. This code in turn consists of two parts: a classifier that specifies what kind of code, from which of the areas covered by KanjiBase, follows; and (after the dash) a four-digit hexadecimal code. The following is a list of the allowed classifiers and their semantics:
Applications that support KanjiBase tags should at least be able to process the first two types; but support for X and Y codes is strongly recommended.
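As an illustration, a KanjiBase-aware application could recognize such references with a simple pattern match. The sketch below accepts the C, X, and Y classifiers mentioned above; the exact set of allowed classifiers and their formats is defined in the KanjiBase documentation, so this pattern is an assumption for illustration only.

```python
import re

# Minimal recognizer for KanjiBase-style references such as &C3-213A;:
# classifier letter, optional area digit, dash, four hex digits, all
# between the & and ; delimiters described above.
REF = re.compile(r"&([CXY])(\d?)-([0-9A-F]{4});")

def parse_refs(text):
    """Return (classifier, area, hexcode) for each reference found in text."""
    return [(m.group(1), m.group(2), m.group(3)) for m in REF.finditer(text)]

print(parse_refs("missing character: &C3-213A;"))  # [('C', '3', '213A')]
```

Because the references are pure ASCII, this kind of scan works on any platform and in any encoding environment, which is precisely the point of the design.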
One further type of reference will be encountered in documents. For characters from other East Asian code sets that are not available in KanjiBase (typically modern simplified characters), no private encoding should be used; instead, a reference should be made to the corresponding codepoint in Unicode. Such references follow the recommendations developed by Rick Jelliffe for SGML Open and look like &U-4E00; for the Unicode character U+4E00.
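Resolving such Unicode references is straightforward in any environment with Unicode support. The following sketch replaces each reference of the form shown above with the corresponding character; the pattern allowing up to six hex digits is our assumption, since the example in the text shows only four.

```python
import re

# Replace Unicode references like &U-4E00; with the actual character.
# Four to six hex digits are accepted here (an assumption; the example
# in the text uses four).
U_REF = re.compile(r"&U-([0-9A-Fa-f]{4,6});")

def resolve_unicode_refs(text):
    return U_REF.sub(lambda m: chr(int(m.group(1), 16)), text)

print(resolve_unicode_refs("&U-4E00; means 'one'"))  # 一 means 'one'
```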
Authors: Christian Wittern and Urs App