Guidelines for the Creation of Large Chinese Text Databases

by Urs App

Abstract

This article (which was also published in the Electronic Bodhidharma No. 3) establishes some guidelines for the creation of large Chinese databases. Practical experience at our institute in trying to create master data in CCCII code has shown that CCCII is not a practical option for this purpose. Our IRIZ KanjiBase encoding, on the other hand, has served this purpose admirably. Actual work shows how important it is to maintain the habitual working environment (with front end processors etc.). Therefore I would now advocate using one of the national codes in combination with KanjiBase rather than CCCII.

Before launching large database projects, one ought to find out what has already been done in the area and study its strengths and defects. Often one learns much by asking programmers and database designers what they would do differently if they could start all over again. In the field of Buddhist studies, the Electronic Buddhist Text Initiative tries to help in this coordination and learning process.
This may sound trite, but it is a fact that even major projects in the field are unaware of what is happening elsewhere, and sometimes even in their own institution. On the recent field trip organized by the Electronic Buddhist Text Initiative, we found for example that the people managing the Chinese University of Hong Kong concordance project were not aware of the very similar effort in Oslo; and a long-time resident scholar at the Academia Sinica found out through us that important materials for a Chinese text he has been translating are on his institute's computer. That electronic versions of a text exist does not mean much in itself; one must evaluate data quality, accessibility, and suitability for one's project.
One must classify data input projects by the amount of data involved and their destination. Thus one must distinguish between small amounts of data and large amounts of data, data destined for individual users or small groups and data destined for large user groups and institutions, etc. The present guidelines apply to large input projects that contain many full-form Chinese characters and are aimed at a large and diverse group of users.
Failure to make such distinctions may lead to inappropriate demands for data quality, search strategies, etc. For example, certain automatic or half-automatic methods of scanner input can be quite useful and efficient for an individual user prepared to spend a substantial amount of time on data correction; but the very same method may prove totally inadequate for large-scale institutional data input because of the high cost of error correction. Similarly, a relatively high number of mistakes may not bother some users but is unacceptable for data that are to be distributed to other users. Again, the use of many self-defined characters can be acceptable for individuals but not for institutions.
It is of the greatest importance to make basic decisions at the beginning of a project and to discuss them with specialists. In making these decisions, both present and future possibilities of use must be kept in mind. This applies particularly to the choice of source text, text editing, annotation, basic data characteristics (character encoding, data format, non-standard character handling, etc.), and hardware and software environments. Such questions must be discussed by a team of specialists at the outset of a large project, i.e. before the main input activity starts, and an action plan should be approved by the whole team.
Failure to do this can result in a gigantic waste of money. Several Chinese text databases I know of started out with little planning; mostly they were designed to fit the hardware and software environment of some years ago at a specific location. Later, when trying to convert the data to present requirements and for use by other institutions, their makers found that automatic conversion was not possible or corrupted the data set. Prior planning and consultation with specialists could have prevented this. Another example: tagging data during the input or correction / editing process can improve the value of a database enormously, for example in making it possible to look for all plant names or place names in the whole Pali canon. Doing something like this at a later point would be another major enterprise that could have been avoided through careful planning.
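To make the tagging example concrete, here is a minimal sketch in Python. The inline markup (`<place>`, `<plant>`), the function names, and the sample sentence are all invented for illustration; they are assumptions, not the tagging scheme of any existing project.

```python
import re

# Hypothetical inline markup: <category>term</category>, added during input.
TAG = re.compile(r"<(\w+)>(.*?)</\1>")

def terms_by_category(tagged_text, category):
    """Collect every tagged term of one category, in text order."""
    return [m.group(2) for m in TAG.finditer(tagged_text) if m.group(1) == category]

def plain_text(tagged_text):
    """Strip the tags again to recover the untagged reading text."""
    return TAG.sub(r"\2", tagged_text)

# Invented sample sentence, not a quotation from the canon.
sample = ("The Buddha stayed at <place>Savatthi</place>, in the grove of "
          "<place>Jetavana</place>, under a <plant>sala</plant> tree.")

print(terms_by_category(sample, "place"))
print(terms_by_category(sample, "plant"))
```

Because the tags are entered while the text is being input or corrected, a query such as "all place names" becomes a simple lookup; without them, each term would have to be identified by hand later.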
If the electronic text is (or may at a later point in time be) destined for international users and a variety of hardware and software environments, it is necessary to make a basic data set (master data set) that can later be automatically converted into any necessary code or format. It is important to treat this master data set as a separate entity whose input conditions, character code, hardware environment, etc. can be very different from those of the eventual user, just as studio quality music recording and editing equipment is different from the reproduction equipment of the consumer.
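The separation of master data from user data can be sketched as follows. This is only an illustration under stated assumptions: the master text is held as a Python (Unicode) string, which is not the code discussed in this article; the function name `to_user_code` and the `&U+XXXX;` placeholder format are hypothetical; the codec names are those of Python's standard library.

```python
def to_user_code(master_text, codec):
    """Derive user data for one target code from the master text.

    Characters the target code lacks are replaced by a reference that
    points back to the master data, and are also reported separately.
    """
    converted = []
    missing = []
    for ch in master_text:
        try:
            ch.encode(codec)          # can the user code represent it?
            converted.append(ch)
        except UnicodeEncodeError:
            converted.append("&U+%04X;" % ord(ch))  # hypothetical placeholder
            missing.append(ch)
    return "".join(converted), missing

# The same master text can feed several user environments.
for codec in ("big5", "gb2312", "shift_jis"):
    user_text, missing = to_user_code("發願", codec)
    print(codec, user_text, missing)
```

The master text is never altered; each user version is regenerated from it, so a richer code that becomes available later only requires another conversion run.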
With Chinese text, the difference shows particularly in the way rare characters and different national standards are handled. Institutions that do not separate master data and user data invariably produce data that follow the low standards of character codes now used on PCs (JIS, GB, BIG-5, etc.; see the article in this number by C. Wittern). Of the institutions visited on the recent field trip, those that did not distinguish between master and user data all suffer from data quality problems which will become even more serious as larger codes become available. Those who were wise enough to make this distinction are: the libraries of Taiwan National University and Hong Kong University of Science and Technology (both use master data in CCCII code and user data in BIG-5) and the Chinese Academy of Social Sciences (master data in their own 45,000 character code, user data in various formats). Just like master tapes in the music business, master data must be of such quality that it can be used in many different environments, present and future. Most of the Chinese text data so far input in Japan, Korea, and mainland China will have about as much future as the recording of a concert made on a Walkman.
In order to assure such convertibility and adaptability, the master data must contain the greatest possible amount of information. This is an important factor of data quality. In the case of Chinese, Korean, or Japanese data (or any other text set that may include characters that are not standardized and where several competing standards exist), one must utilize the character standard with the best structure, greatest number of characters, and best convertibility. At present, the best standardized Chinese character code for master data is the Taiwanese CCCII code (see the article by C. Wittern below). In spite of its clumsy three-byte format, its elevated price (around US $4000 for a PC card, conversion routines into other codes, and exhaustive documentation), and its very small number of users, adoption of this code seems at present the most sensible approach for creating a master data set of large bodies of full-form Chinese character text.
(Note May 1995: Though the principle of clearly distinguishing master data from user data stands, for various reasons I do not see the CCCII code as the best possible code any more. A national code in combination with the IRIZ KanjiBase is more practical.)
Data that is input in character codes with a mixture of simplified and non-simplified characters and a small total number of characters (such as JIS or GB codes) cannot be automatically converted into more elaborate codes; for example, ï cannot be converted automatically because no machine will know whether it stands for ôû or ôü or “ê or some other long form. The reverse conversion, however, is easy. The same is true for variant forms of characters: the objective is to preserve as much of this information in the electronic text as is possible. In Japan, characters not existing in the JIS code are often input as an empty box and other characters in simplified forms (for example in the monogatari CD-ROM of the Kokubungaku shiryôkan in Tokyo). Such data has bad convertibility and is thus of deficient quality even if the input text is quite accurate.
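The asymmetry between the two conversion directions can be shown with a small Python sketch. The table below is an illustrative assumption containing only four well-known pairs; a real conversion table would have to cover the whole code, and the function names are hypothetical.

```python
# Illustrative table only: full-form characters and the simplified form
# that several of them share.
FULL_TO_SIMPLIFIED = {
    "發": "发",  # to emit, to issue
    "髮": "发",  # hair
    "乾": "干",  # dry
    "幹": "干",  # trunk, to do
}

def simplify(text):
    """Full form to simplified: each character has exactly one target."""
    return "".join(FULL_TO_SIMPLIFIED.get(ch, ch) for ch in text)

def candidates(simplified_char):
    """Simplified to full form: list every full form that may be meant."""
    found = sorted(full for full, simp in FULL_TO_SIMPLIFIED.items()
                   if simp == simplified_char)
    return found or [simplified_char]

print(simplify("頭髮"))    # conversion in this direction is mechanical
print(candidates("发"))    # more than one answer: a human must decide
```

Going from full forms to simplified forms is a plain table lookup. Going back yields a list of candidates, and choosing among them requires reading the context, which is why text input in the smaller codes cannot be upgraded automatically.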
If variant forms of characters exist in the printed form of a source text, one should strive to reproduce them as they are in the electronic text. This is not always possible or even wise; some variations (for example, different print shapes of radicals, as in ’W and Â˜) are commonly accepted. However, all such decisions must be documented and strictly adhered to. Obviously, a master data code such as CCCII makes the management of such variations easier since it links variant characters to basic forms, making conversion from one into the other possible if the need arises. If the printed text contains several printed forms of a character, one must reproduce these features in the electronic text. If a good record is kept of such variations during the input and correction process, one will later be able to create search modules that automatically search all variants of a character or term concurrently if the user wishes this. The producers of electronic text should keep in mind that present and future users may have interests that cannot be imagined or foreseen and that it is not the job of the data producer to limit such interests. Rather, the basic data set should be as faithful as possible, just like the master recording in music.
For the Chinese University of Hong Kong's concordance series (stored in Big-5 format without distinction of master and user data), the variant forms were reduced to standard forms and listed in printed form. The electronic text only features the standard forms. If in the future a larger code comes into common use that contains many variant forms, there will be no master data that can be converted into such a code, and much of the work that was done in reducing the information will have to be repeated in the other direction. In contrast, the Hong Kong University of Science and Technology inputs mainland book information in simplified forms and Taiwanese information in long forms, as they appear in the book. The search module then treats the simplified characters as variant forms which are also searched, allowing the user to find information regardless of the specific form of the printed character.
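A search module of the kind just described might work roughly as in the following Python sketch. It is not the software of either institution; the variant table, the function names, and the sample lines are assumptions made for illustration.

```python
# Hypothetical variant table: each variant form points to its basic form.
VARIANT_TO_BASE = {
    "国": "國",
    "学": "學",
}

def fold(text):
    """Map every variant to its basic form, for comparison purposes only."""
    return "".join(VARIANT_TO_BASE.get(ch, ch) for ch in text)

def search(lines, query, match_variants=True):
    """Return the indices of the lines that contain the query."""
    if match_variants:
        key = fold(query)
        return [i for i, line in enumerate(lines) if key in fold(line)]
    return [i for i, line in enumerate(lines) if query in line]

# The stored text keeps each form exactly as it was printed.
lines = ["中國哲學", "中国哲学", "日本"]
print(search(lines, "國"))
print(search(lines, "国", match_variants=False))
```

The point of the design is that folding happens only at search time. The data itself remains faithful to the printed source, and the user decides whether variants should be matched or kept apart.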
In electronic text, data accuracy is exceedingly important because machines search data with much more accuracy than humans. They find only what the data set contains, and browsing is usually not possible or feasible. Mistakes are in general only found by chance and not as a result of a search; one thus cannot expect users to correct data. With large data sets, users are often blinded by the amount of information that can be found. However, one must also be able to rely on accurate information on what is not in a text. Input mistakes prevent gaining such information. Data mistakes can be eliminated by adequate input methods and data correction procedures. Data accuracy depends on a variety of factors which are usually interdependent: quality and readability of the source material, choice of input method, education of personnel, quality of input guidelines (definition of identity / difference of characters), size of the character code, quality of reference materials, data correction procedures and personnel, consistency of the application of the guidelines, quality of input and correction documentation, honesty of personnel in admitting problems, etc.
Master data of good quality must not only be in an adequate code which contains much information (and has thus good conversion characteristics) and does not distort the printed original: it must also be error-free. With alphabetical text, input of the same text by two typists and machine-comparison of the typed text yield quite good results. However, this method is not totally adequate because fast typists sometimes mistakenly hit the same wrong adjacent key. For Chinese data, this method has not proven successful because typists often make the same mistakes. Thus a good error-correction procedure must be applied and strict guidelines must be given to the input and correction personnel. They must be trained in strict quality control procedures; all individual decisions must be documented, approved, and consistently applied.
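The machine comparison of two independently typed versions can be sketched in Python with the standard `difflib` module. The function name `compare_inputs` and the sample strings are assumptions for illustration only.

```python
from difflib import SequenceMatcher

def compare_inputs(first, second):
    """List the places where two independent typings of a text disagree.

    Each entry gives the position in the first typing and the two
    conflicting readings, so that a corrector can check the source.
    """
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    return [
        (i1, first[i1:i2], second[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# The two typists disagree on one character: the difference is reported.
print(compare_inputs("菩提達磨", "菩提達摩"))

# Both typists make the same mistake: nothing is reported.
print(compare_inputs("菩提達摩", "菩提達摩"))
```

The second call shows the weakness mentioned above. A comparison can only report disagreements, so a mistake that both typists share passes unnoticed and must be caught by a separate correction procedure.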
The overall value of a database can often be substantially improved by teamwork and by team discussions of a variety of basic issues. For example: the choice of the printed text that serves as data source; the presence or absence of scholarly commentary or annotation; the references to printed sources; the user profile; the required search tools; the cost and quality of necessary hard- and software; future hardware and software environment prospects; the ease of use of hardware and software; the variety and quality of character conversion utilities; the cost of the data; the accuracy standard of data; the convertibility standard of the data; the structure of the data; the flexibility of the data structure (adaptability of format, etc.); the standardization level; etc.
Having heard too many "if only we had thought about this before input started ...", I believe that in database planning and management, group decisions based on discussion are often better than individual decisions. Scholars must be careful not to leave such decisions to technicians and programmers. On the field trip, we met programmers who admitted that they have never actually used the database they have been working on for years...
Databases are made for users; therefore the wishes, working environment, and likely working habits of users must be carefully studied and respected. For example, most users search while writing a paper or book; therefore it must be possible to use the database concurrently with a word processing program. Any large text database should also let the user attach notes and tags to the main text. Such notes should also be searchable, printable (together with the text or separately), savable as separate files with location tags, and portable to updated versions of the electronic text. Search engines must also be adapted to many users' needs. Therefore they must be flexible and adaptable to a variety of users' preferences (just like word processing programs) rather than hard-coded. Search results should be viewable, printable, and savable to file in a variety of formats according to the user's wishes. Since the main aim of databases is the retrieval of information, such retrieval should be carefully planned with many options for the user.
In projects whose input takes many years of work, one must make programmers produce multiple test versions of search software and have scholars and other prospective users evaluate them even while input is going on. If necessary, data structure decisions have to be reevaluated. Users should have a say in all important software decisions, and programmers should help users evaluate test versions and formulate their wishes by telling them about alternative possibilities.

Author: Urs App
Last updated: 95/04/23
