
A Look at the
Korean Tripitaka Input Project

by Urs App


Preface

Input of the Korean canon has been completed, and in January of 1996 a ceremony was held in Seoul to commemorate the occasion.
One thousand copies of a CD were pressed and distributed to interested or involved parties and sponsors.
The CDs can at present only be used on a Korean Windows system with a specific word-processing program. Each file represents a page of the printed version. See more on this in our report on the EBTI meeting in Taipei (April 1996).

On September 24, 1995, I visited the two offices where work on one of the world's largest input projects is in progress: the Tripitaka Koreana project. In the following I reflect on some impressions from my visit -- impressions which might, I hope, benefit not only the Korean input project but also similar projects envisaged in East Asia and elsewhere.

The Korean Tripitaka (Korean Buddhist canon) is well known not only for its exceptional textual quality but also because its wooden printing plates still exist. Carved in the 13th century, the collection of more than 80,000 large double-sided printing blocks is housed in a building at Haeinsa, one of Korea's most famous Zen (Son) monasteries. The collection, a Korean national treasure, is the oldest extant set of wood blocks containing the entire Chinese Buddhist canon.

Prof. S. Yanagida holding one of Haeinsa's wooden printing blocks

The history of the project is not very long. It essentially began with input tests initiated by Prof. Lewis Lancaster around 1991. Since 1993, Ven. Chongnim of Haeinsa, who had been studying at my research institute in Kyoto, has been the director of the whole project. After the initial tests, during which the first two volumes of the Tripitaka were input and various problems became clearer, a number of monks and academics began studying issues connected with the variant forms of Chinese characters that abound in this edition of the Buddhist canon. Funds were collected and various forms of support gained; but during the Electronic Buddhist Text Initiative (EBTI) meeting at Haeinsa in September of 1994, I did not think that large-scale input would start any time soon. Soon afterwards, however, the president of Samsung Corporation decided to support the input project with both funds and personnel.


At present, there are two main groups of workers involved in the project:

1) Around fifty people at the input center in downtown Seoul, employed and paid by Samsung. Their main job is the input of the whole canon. About forty are typists; the rest are editors, students in charge of character and variant lists, and engineers.

2) About half a dozen people employed by Haeinsa who work in another office in Seoul. They mainly study issues connected with the delivery of the data (data correction, search mechanisms, Internet, etc.).

Since the end of 1994, input has been progressing at a fast pace. It is estimated that the basic input will be finished around November of 1995. The head of the Samsung team, a computer engineer, said that they would try to produce a CD-ROM by the first quarter of 1996. The Haeinsa team realizes that the correction of this enormous mass of electronic text might well take more than a decade; but following the lead of our ZenBase CD-ROM in releasing insufficiently corrected electronic texts as Alpha versions, we may soon have the whole Tripitaka Koreana as Alpha text.

The input facility in Seoul is impeccably clean, spacious, and well equipped. The input personnel use good computers, manuscript holders, hand rests, screen filters, etc. All computers are linked in a network to the engineer's office. Having seen various other input centers, I was quite impressed with the whole setup.

The first input tests in Shanghai four years ago had shown that it was not so much the sheer number of different Chinese characters but rather the Korean Tripitaka's abundant variant character forms that would pose the greatest problems for input. Should one create the slightly different characters anew or simply "normalize" them even at the input level? After taking charge of the input project, the Samsung engineers decided to use a screen font that resembles the woodblock characters, thus eliminating from the outset many of the problems associated with Big-5 fonts. The principle is easily stated: the typists should see on screen what they see on paper, which eliminates most decision making at the input level. Such decisions, experience had shown, corrupt the data because too many people with too many different opinions are involved.


As character code, the Samsung engineers decided to use a 16,000-character code from a Korean word processor. When it became clear that this was not sufficient, the code was augmented by close to 2,000 characters. Now that the basic input is nearing completion, a second batch of similar size is being added. The use of an uncommon character code has some advantages: one is free to define characters and shapes according to one's needs and to augment the number of characters at will. At a later stage, conversion tables linking the Samsung code to the Taiwanese Big-5, the Japanese JIS, Unicode, etc. must be created. I was told that a KS code conversion table already exists; however, it was not clear how the many characters not present in KS would be handled. I would suggest that the KanjiBase approach would be most useful because it is designed to be both hardware- and software-independent.
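To make the idea concrete, here is a minimal Python sketch of such a conversion table, assuming invented code points and a KanjiBase-style plaintext reference for characters absent from the target code; nothing here reflects the project's actual data or formats.

    # Hypothetical mapping from in-house code points to Unicode characters.
    SAMSUNG_TO_UNICODE = {
        0x8001: "佛",   # present in the target code
        0x8002: None,   # variant form with no target-code equivalent
    }

    def convert(codepoints):
        """Convert in-house code points to a target-code string.
        Characters missing from the target code are kept as plaintext
        references (here "&C8002;", loosely modeled on KanjiBase
        practice) so that no information is silently lost."""
        out = []
        for cp in codepoints:
            ch = SAMSUNG_TO_UNICODE.get(cp)
            out.append(ch if ch else f"&C{cp:04X};")
        return "".join(out)

    print(convert([0x8001, 0x8002]))  # -> 佛&C8002;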

For the input activity, Samsung hired and trained about forty people (mostly women). The attractive screen font, combined with a custom-made Windows-based word-processing program, fast hardware, and an input method of Chinese origin, provides an optimal environment. As on paper, characters appear on screen in vertical lines, and editing text in the graphical user interface is easy.

The input system is based on shapes rather than pronunciation; the typist thus has to learn to associate certain shapes with certain keys on the keyboard. On average, around three keystrokes on an ordinary keyboard produce a Chinese character. For the most common characters or phrases, abbreviations can be created. The speed of the input is quite baffling.
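A minimal sketch of how such a shape-based method might be modeled; the key sequences here are invented for illustration and are not the project's actual codes.

    # Key sequences map to characters; frequent characters or phrases
    # get short abbreviations. About three keystrokes per character.
    SHAPE_TABLE = {
        "akd": "佛",   # hypothetical shape code for a single character
        "qwe": "法",
    }
    ABBREVIATIONS = {
        "b;": "菩薩",  # hypothetical shortcut for a frequent phrase
    }

    def decode(keystrokes):
        """Turn a list of key sequences into text, asterisking unknowns."""
        table = {**SHAPE_TABLE, **ABBREVIATIONS}
        return "".join(table.get(k, "*") for k in keystrokes)

    print(decode(["akd", "qwe", "b;"]))  # -> 佛法菩薩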

For characters not available in the system, the input personnel mark the character on paper with a red circle and type an asterisk. On a single screen that was only two-thirds full, I counted as many as nineteen asterisks -- and this points to one of the major problems I found with the adopted strategy.

This photograph of a page with circles shows that even very common characters are marked as missing -- and this at a stage where more than two-thirds of the whole canon has already been input! Of course, periodic addition of missing characters to the available set would have eliminated most of these asterisks. Such missed opportunities tend to follow a general rule: what one fails to do in the first instance will take much longer afterwards. As the great number of asterisks in the data files indicates, the elimination of asterisks will become a major headache. Quite unnecessarily so: instead of asterisks, one could simply leave the unsuccessful input code in the file, delimited by markers. At a later stage, characters with a single input code could be replaced automatically.
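A sketch of this alternative, assuming a hypothetical "{?code}" marker syntax: once the missing character has been added to the font, every marked occurrence can be replaced in one automatic pass.

    import re

    def replace_marked(text, code_to_char):
        """Replace {?code} markers whose code now has a character
        assigned; markers with still-unassigned codes stay untouched."""
        def sub(match):
            code = match.group(1)
            return code_to_char.get(code, match.group(0))
        return re.sub(r"\{\?([a-z]+)\}", sub, text)

    page = "如是我聞一時{?akd}在..."
    print(replace_marked(page, {"akd": "佛"}))  # -> 如是我聞一時佛在...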

Furthermore, even a cursory glance showed that some of the marked characters were wrongly circled because they actually are present in the character set. The typist probably did not find the character right away and typed an asterisk instead. If speed is valued highly and supervision is not tight, such habits can become the rule rather than the exception. Looking at some of the input sheets, I found ample evidence of such habits; for example, the first character on the bottom right of the above photograph (Chinese "dun", sudden) is certainly in the font set and could have been input.



This led me to question the degree of quality control. In the input project I am directing, I have noticed that good data quality requires frequent and exhaustive data correction and very strict guidelines for the input personnel. Input time is thus relative and should always be calculated in combination with data correction and editing time. But to do this, one must actually correct and edit the data. When I asked about the percentage of input errors, I was told it was about one percent -- far too much for such a huge project, considering that correction is usually more labor-intensive than the initial input. However, on repeated questioning I learned that very little checking of input data had so far taken place; it had been postponed until a later stage. Had I been in charge, I would have instructed the typists in the early stages of input to proofread their own work; after a second pass by an educated proofreader familiar with the problems of Chinese characters and variant forms, each typist would have had to correct his or her own mistakes, thus educating both the eye and the mind and teaching the optimal balance of input speed and data quality. Of course, such basic quality control also brings to light all sorts of problems which otherwise stay in the dark. It is on these problems that I will now focus.


The handling of variant forms is of paramount importance; and since woodblock prints are based on handwritten models, there is a never-ending supply of variations. The photograph below shows a sheet on a typist's desk which lists some of the most common variants together with their input codes and a shortcut letter.

Even if one added screen characters for variants every day and handed them out to the typists on notes (as in the above photograph) or -- as I would do it -- in online documentation triggered by the input code, one would never come to an end. Indeed, the Tripitaka Koreana is so rich in variant forms that some people make them their field of study. In the present input project, Samsung employed a handful of students to note down "encircled" and variant characters on cards, which are then catalogued.

Since the drawers of this sizable catalogue feature Hangul characters, I assume that the whole catalogue is sorted by reading. However, the reading of variant forms and rare characters is often unknown or difficult to figure out. Why not arrange them according to the shape-based input codes, thus also allowing the typists to consult the catalogue? One reason may be that the supervisor for variant characters is a Korean university professor who comes to Seoul once in a while and brings lists of "equivalent" characters. He probably does not know how to type characters himself and thus cannot use input codes. It was not clear to me how strictly this professor's directives are followed at the input level; but judging from the small number of variant characters noted down on the typists' desks, I had to assume that much of the "normalization" was also postponed to a later stage.
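A sketch of what a code-sorted catalogue might look like; the field names, codes, and card numbers are invented for illustration.

    # Each card records a variant's shape-based input code, the proposed
    # "normalized" character, and the card's number in the catalogue.
    catalogue = [
        {"code": "qwe", "normalized": "法", "card": "A-113"},
        {"code": "akd", "normalized": "佛", "card": "A-014"},
    ]

    # Sorting by input code yields an order every typist can reproduce,
    # whereas the reading of a rare variant may simply be unknown.
    for entry in sorted(catalogue, key=lambda e: e["code"]):
        print(entry["code"], entry["normalized"], entry["card"])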

The question of the necessary degree of normalization is usually addressed from the wrong end. Overall, the printer's mindset is still at work: a printer cannot print two characters on top of each other and has to choose a single one. He thus tends to "normalize," i.e., to use the standard form of the character in his possession rather than the variant form that he would first have to create. Electronic text, however, easily allows any number of characters at the same position, provided that one adds specific markers which allow filtering. Depending on the need at hand, one can then display a common character form for common users and the original form for specialists. In short: electronic text, when correctly marked up, is flexible and can easily be tailored to specific needs.

Normalization of electronic text is thus essentially a software task; but one must make some preparations for it. If it were up to me, I would make online lists of scanned variant characters available to the typists so that they could simply click on the character that they actually see. This would input the professor's proposed "normalized" character surrounded by a markup sequence indicating an identification number of the variant character. This would hardly cause more trouble for the typists (just one more look at the screen and one more keystroke) but would provide invaluable information for subsequent processing. First, it would liberate the overall quality of the data from the judgment of a single specialist. Second, it would allow subsequent batch correction. Third, it would facilitate producing various editions, among them one for researchers interested in variant characters. Fourth, and perhaps most importantly, it would immensely facilitate conversion into various present and future codes. The Japanese JIS code, for example, contains a surprising number of variant characters; the Big-5 code contains few of them; Unicode will contain a fair amount; and some future code (or one for specialists) may even feature them all. With appropriate software methodology, all of these interests can be served -- including some interests of which we are not yet aware. The normalization effort is certainly necessary, but it should not affect the basic data; rather, it should be part of "processing" the basic data for particular users and purposes. To use the image I employed in vol. 3 of the Electronic Bodhidharma: the input data should be like a master tape in music, superbly rich in information. This richness can later be reduced to serve particular needs and formats (such as audio cassettes). The path from more to less is easy; but, as any lover of Caruso recordings will confirm, the path from less to more is hardly smooth.
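A minimal sketch of the variant markup proposed above, with a hypothetical tag syntax (the project had defined no such format at the time): the same marked-up data yields both a common-user and a specialist edition.

    import re

    # Imagine the first character stood on the woodblock in a variant
    # form and was normalized at input time, with the variant's number.
    TEXT = '<v n="0042">頓</v>悟入道要門論'

    def edition(text, mode):
        """Render a common-user or a specialist edition from one source."""
        if mode == "common":
            # keep the normalized character, drop the markup
            return re.sub(r'<v n="\d+">(.)</v>', r"\1", text)
        if mode == "specialist":
            # keep the variant's identification number visible
            return re.sub(r'<v n="(\d+)">(.)</v>', r"\2[variant \1]", text)

    print(edition(TEXT, "common"))      # -> 頓悟入道要門論
    print(edition(TEXT, "specialist")) # -> 頓[variant 0042]悟入道要門論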


Some of the problems pointed out above are due to the newness of the electronic medium and our inexperience with electronic text. It is hard to overcome the print-on-paper mindset. This was also in evidence at the editing desks of the Seoul input center. There, about half a dozen people correct the format of the input data, deleting or adding carriage returns in order to model the electronic text exactly on the printed version. As far as I could see, all of this was purely procedural work (procedural markup), not taking any content aspect into account. For example, the title of a text's chapter is usually set on a single line, followed by a carriage return. The print layout usually makes it immediately clear that this is a title. For electronic text, one needs to make this implicit information explicit by inputting markers. For example, one could select the title and choose "Chapter heading" from a menu; this would add an opening tag at the beginning of the selected line and the matching closing tag at its end. The title of the text as a whole could be marked up with a different pair of tags. This simple measure would, for example, make it possible to set thousands of different titles in specific point sizes in one go -- or to create a comprehensive index of all titles and subtitles of the canon. In contrast to procedural markup (selecting each title by hand and increasing its point size), this is called descriptive markup. Descriptive markup could, in the case of the Korean Tripitaka input project, be used for a variety of purposes, most of them easy to implement and well suited to be made part of the editing process.
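A minimal sketch, with hypothetical tag names, of such descriptive markup and one of its payoffs:

    import re

    # A few lines of marked-up canon text (tag names are assumptions):
    canon = ("<title>頓悟要門</title>\n"
             "<chapter>第一</chapter>\n"
             "如是我聞...\n"
             "<chapter>第二</chapter>\n"
             "...")

    # One command builds a complete index of all chapter headings; the
    # same markers could drive typesetting, automatic file splitting,
    # or a canon-wide table of contents.
    print(re.findall(r"<chapter>(.*?)</chapter>", canon))  # ['第一', '第二']
    print(re.findall(r"<title>(.*?)</title>", canon))      # ['頓悟要門']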

Such descriptive markup would greatly facilitate subsequent work; for instance, files corresponding to individual texts could be generated completely automatically, together with file names in Chinese characters, or all variant character forms could be listed with a single command.


It is not yet clear how much the Samsung corporation wants to be involved in correcting and editing the data; however, I think that at least the first stage of data correction, the character-by-character comparison of the original text with a printout of the input data, should be considered part of the input task. A second or third stage, where people also read the text for content, could then be used to prepare for additional content-based markup using various color pencils: punctuation, names of persons, names of buddhas and bodhisattvas, place names, names of texts, citations, comments, and so on.

Computer experts ought to play a prominent part in the first stages of markup and can build the foundation for good data quality. The more they know about electronic text and its particular elements, formats, and possibilities, the better their job will be done. This means that from the very first stage, scholars with experience in electronic text should be involved. As my evaluation of some aspects of this project shows, such consultation can potentially save thousands and thousands of man-hours and significantly improve basic data quality. The more one moves towards content markup, however, the larger the role of traditional scholars of the field becomes. No computer expert can punctuate a classical Chinese Buddhist text, let alone perform more important markup. Much of this kind of markup can be achieved even by scholars unfamiliar with computers; equipped with a set of color pencils, one can do much of this kind of work on paper.

Of course, collaboration between computer experts (who may know what is possible) and scholars (who have to grasp what is necessary) is essential all along.

Descriptive markup, if well planned and managed, can finally bring out the inherent advantages of electronic text, some of which we already know and many of which we are still unaware of. One of the most apparent, the ability to link chunks of data, could play a crucial role in linking canon information to dictionary information; one could, for example, search only texts translated by a specific person or from a specific time period, or click on a person's name to immediately get the relevant dictionary entry. At any rate, one thing is certain: the devoted monks and laymen who originally "input" the Korean Tripitaka on wood and made a prostration before carving each character would never have dreamed of this new form that the fruit of their labor is now taking. In the light of their efforts, a dozen years of data correction and markup may appear a less arduous task. It is to them, too, that one owes an effort that produces an electronic edition whose quality assures survival for at least another millennium.


Author: Urs APP
Last updated: 95.10.23