This may sound trite, but it is a fact that even major projects in the field are unaware of what is happening elsewhere, and sometimes even in their own institution. On the recent field trip organized by the Electronic Buddhist Text Initiative we found, for example, that the people managing the Chinese University of Hong Kong concordance project were not aware of the very similar effort in Oslo; and a long-time resident scholar at the Academia Sinica found out through us that important materials for a Chinese text he has been translating are on his institute's computer. That electronic versions of a text exist does not mean much in itself; one must evaluate data quality, accessibility, and suitability for one's project.
Failure to make such distinctions may lead to inappropriate demands for data quality, search strategies, etc. For example, certain automatic or half-automatic methods of scanner input can be quite useful and efficient for an individual user prepared to spend a substantial amount of time on data correction; but the very same method may prove totally inadequate for large-scale institutional data input because of the high cost of error correction. Similarly, a relatively high number of mistakes may not bother some users but is unacceptable for data that are to be distributed to other users. Again, the use of many self-defined characters can be acceptable for individuals but not for institutions.
Failure to do this can result in a gigantic waste of money. Several Chinese text databases I know of started out with little planning; mostly they were designed to fit the hardware and software environment of some years ago at a specific location. Later, when trying to convert the data to present requirements and for use by other institutions, they found that automatic conversion was not possible or corrupted the data set. Prior planning and consultation with specialists could have prevented this. Another example: tagging data during the input or correction / editing process can improve the value of a database enormously, for example by making it possible to look for all plant names or place names in the whole Pali canon. Doing something like this at a later point would be another major enterprise that could have been avoided through careful planning.
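To illustrate why tagging at input time pays off, here is a minimal sketch. It assumes a hypothetical inline markup (`<place>`, `<plant>`) and an invented sample sentence; neither the tag names nor the sentence come from any actual project, and a real database would need a properly designed tag set.

```python
import re
from collections import defaultdict

# Hypothetical sample text, tagged during input. The tag names and the
# sentence are invented for this illustration only.
TAGGED_TEXT = (
    "The teacher stayed at <place>Savatthi</place> near a grove of "
    "<plant>sal</plant> trees, then walked to <place>Rajagaha</place>."
)

# Matches <tag>content</tag> for any simple, non-nested tag.
TAG_PATTERN = re.compile(r"<(\w+)>(.*?)</\1>")


def collect_tagged(text):
    """Return a dict mapping each tag name to the list of tagged strings."""
    found = defaultdict(list)
    for tag, content in TAG_PATTERN.findall(text):
        found[tag].append(content)
    return dict(found)


def strip_tags(text):
    """Derive plain reading text from the tagged text."""
    return TAG_PATTERN.sub(r"\2", text)


index = collect_tagged(TAGGED_TEXT)
print(index["place"])   # every place name, found without rereading the text
print(index["plant"])
print(strip_tags(TAGGED_TEXT))
```

The plain text can always be derived from the tagged text, but not the other way round, which is why adding the tags later amounts to a second input project.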
With Chinese text, the difference shows particularly in the way rare characters and different national standards are handled. Institutions that do not separate master data and user data invariably produce data that follow the low standards of character codes now used on PCs (JIS, GB, BIG-5, etc.; see the article in this number by C. Wittern). Of the institutions visited on the recent field trip, those that did not distinguish between master and user data all suffer from data quality problems which will become even more serious as larger codes become available. Those who were wise enough to make this distinction are: the libraries of Taiwan National University and Hong Kong University of Science and Technology (both use master data in CCCII code and user data in BIG-5) and the Chinese Academy of Social Sciences (master data in their own 45,000 character code, user data in various formats). Just like master tapes in the music business, master data must be of such quality that it can be used in many different environments, present and future. Most of the Chinese text data so far input in Japan, Korea, and mainland China will have about as much future as the recording of a concert made on a Walkman.
(Note May 1995: Though the principle of clearly distinguishing master data from user data stands, for various reasons I no longer see the CCCII code as the best possible code. A national code in combination with the IRIZ KanjiBase is more practical.)
Data that is input in character codes with a mixture of simplified and non-simplified characters and a small total number of characters (such as JIS or GB codes) cannot be automatically converted into more elaborate codes; for example, a simplified character that stands for several different long forms cannot be converted automatically, because no machine will know which of those long forms is meant. The reverse conversion, however, is easy. The same is true for variant forms of characters: the objective is to preserve as much of this information in the electronic text as is possible. In Japan, characters not existing in the JIS code are often input as an empty box, and other characters in simplified forms (for example in the monogatari CD-ROM of the Kokubungaku shiryōkan in Tokyo). Such data has bad convertibility and is thus of deficient quality even if the input text is quite accurate.
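The asymmetry between the two directions of conversion can be sketched in a few lines. The character table below is a tiny illustrative selection chosen for this sketch (it is not the example from the original text, and it is not a real conversion table); the function names are likewise hypothetical.

```python
# Illustrative table only: each long form and the simplified form it
# collapses to. A real table would have thousands of entries.
LONG_TO_SIMPLIFIED = {
    "發": "发",  # to send out, to emit
    "髮": "发",  # hair
    "後": "后",  # after, behind
    "后": "后",  # empress (identical in both systems)
    "國": "国",  # country (unambiguous)
}


def simplify(text):
    """Long -> simplified: always possible, but information is discarded."""
    return "".join(LONG_TO_SIMPLIFIED.get(ch, ch) for ch in text)


def candidates(simplified_char):
    """All long forms that could lie behind one simplified character."""
    return sorted(
        long_form
        for long_form, short_form in LONG_TO_SIMPLIFIED.items()
        if short_form == simplified_char
    )


def expand(text):
    """Simplified -> long: convert what is unambiguous, report the rest.

    Returns (converted_text, ambiguous), where ambiguous lists
    (position, character, possible_long_forms) for a human to resolve.
    """
    out, ambiguous = [], []
    for pos, ch in enumerate(text):
        options = candidates(ch)
        if len(options) == 1:
            out.append(options[0])
        else:
            out.append(ch)  # leave untouched rather than guess
            if len(options) > 1:
                ambiguous.append((pos, ch, options))
    return "".join(out), ambiguous


converted, todo = expand("国发")
print(converted)  # first character converted, second left as typed
print(todo)       # the second character needs a human decision
```

Every entry in `todo` is a decision that only a reader of the original can make, which is the practical meaning of "bad convertibility".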
For the Chinese University of Hong Kong's concordance series (stored in Big-5 format without distinction of master and user data), the variant forms were reduced to standard forms and listed in printed form. The electronic text only features the standard forms. If in the future a larger code comes into common use that contains many variant forms, there will be no master data that can be converted into such a code, and much of the work that was done in reducing the information will have to be repeated in the other direction. In contrast, the Hong Kong University of Science and Technology inputs mainland book information in simplified forms and Taiwanese information in long forms, as they appear in the book. The search module then treats the simplified characters as variant forms which are also searched, allowing the user to find information regardless of the specific form of the printed character.
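The second approach, keeping the data as printed and handling variants only at search time, might look roughly like this. This is a sketch under assumptions: the folding table, the sample records, and the function names are invented here and say nothing about how the library's actual search module is built.

```python
# Hypothetical folding table: maps each long form to the simplified form
# used as a common search key. Three well-known pairs for illustration.
FOLD = {"國": "国", "書": "书", "學": "学"}


def fold(text):
    """Map a string to its search key; the stored data is never altered."""
    return "".join(FOLD.get(ch, ch) for ch in text)


# Records are stored exactly as they appear in the book.
RECORDS = [
    "中国文学",    # mainland title, simplified forms
    "中國文學史",  # Taiwanese title, long forms
]


def search(query, records):
    """Return every record matching the query in any variant form."""
    key = fold(query)
    return [record for record in records if key in fold(record)]


print(search("文學", RECORDS))  # long-form query finds both records
print(search("文学", RECORDS))  # simplified query finds both records
```

Because folding happens only in the search key, nothing in the stored records is lost, and a richer variant table can be swapped in later without re-entering any data.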
Master data of good quality must not only be in an adequate code which contains much information (and has thus good conversion characteristics) and does not distort the printed original: it must also be error-free. With alphabetical text, input of the same text by two typists and machine-comparison of the typed text yield quite good results. However, this method is not totally adequate because fast typists sometimes mistakenly hit the same wrong adjacent key. For Chinese data, this method has not proven successful because typists often make the same mistakes. Thus a good error-correction procedure must be applied and strict guidelines must be given to the input and correction personnel. They must be trained in strict quality control procedures; all individual decisions must be documented, approved, and consistently applied.
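The double-input method, and its blind spot, can be shown with a short sketch using Python's standard `difflib` module. The function name and the sample strings are made up for illustration.

```python
import difflib


def compare_keyings(first, second):
    """Compare two independent inputs of the same text.

    Returns a list of (position_in_first, text_in_first, text_in_second)
    for every stretch where the two inputs disagree.
    """
    matcher = difflib.SequenceMatcher(None, first, second, autojunk=False)
    return [
        (i1, first[i1:i2], second[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


typist_a = "master data must be error-free"
typist_b = "master dara must be error-free"   # one slip by the second typist

# A disagreement is flagged for a proofreader to resolve.
print(compare_keyings(typist_a, typist_b))

# The blind spot: if both typists make the same mistake, nothing is flagged.
both_wrong = "master dara must be error-free"
print(compare_keyings(both_wrong, both_wrong))   # empty list
```

The comparison can only flag disagreements, so shared mistakes pass silently; this is why the method has to be backed by the documented correction procedure described above.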
Having heard too many laments of the kind "if only we had thought about this before input started ...", I believe that in database planning and management, group decisions based on discussion are often better than individual decisions. Scholars must be careful not to leave such decisions to technicians and programmers. On the field trip, we met programmers who admitted that they have never actually used the database they have been working on for years...
In projects whose input takes many years of work, one must make programmers produce multiple test versions of search software and have scholars and other prospective users evaluate it even while input is going on. If necessary, data structure decisions have to be reevaluated. Users should have a say in all important software decisions, and programmers should assist users to evaluate test versions and to formulate their wishes by telling them about alternative possibilities.