The Code of the Codex

An illustrated article by Urs APP


Abstract

This illustrated article (which is also published in the Electronic Bodhidharma No. 4) deals with the transition from printed text to electronic text. It examines how the basic ingredients and "code" of the book (paper, page numbers, line breaks, index, etc.) developed and draws parallels to the codification of electronic text that has barely begun. The first newspaper was printed in 1597, 150 years after Gutenberg, and the idea of publishing a scholarly periodical journal was born more than 200 years after Gutenberg. There is no reason to doubt that in the coming centuries the digital medium will open up far more numerous and significant possibilities.


PAPER

It took quite a while after the invention of paper until some Chinese about two thousand years ago realized that paper could be used for writing. Little did they know that a few centuries later paper would already be the main carrier of text in the empire. Paper reached India in the seventh century, but it took the Indians about five centuries to realize the potential of the new medium. In the twelfth century, when the first Europeans began manufacturing paper, they regarded it as a cheap substitute for parchment, but Gutenberg's invention took the paper medium in a direction nobody had anticipated. In 1800, the total paper production in Germany was 15,000 tons per year, in 1900 already 800,000 tons, and after another quarter century two million tons per year. Today, I am told, the University of California alone adds nine kilometers of shelf space for printed materials every year.

If the history of paper was full of surprises, we can look forward to far greater surprises with the media that are just being born at this moment. Just a few decades ago, digital storage and processing were thought to be only of interest to some mathematicians and other number crunchers. Today, they are already used in an astonishing variety of ways such as: processing signals from the heart and sending back electric currents when needed (heart pacemaker); recording and reproducing sound (compact disk); creating and showing movie dinosaurs (digital image processing); and creating an inventory of all words attributed to some ancient Zen master (digital text processing).

It is a mind-boggling enterprise to try to judge the influence of the use of paper on, say, Buddhism or Christianity. With regard to the digital medium, it is hard to judge where we now stand, let alone where we are going. As my overview in No. 3 of the Electronic Bodhidharma showed, there are already many projects under way that aim at digitizing Buddhist texts or images. But even the most advanced ones are little more than imitations or extensions of printed text with improved searching capabilities and faster access and, in the best of all cases, some hypertext or video links. Is that all, you might ask? It is at the moment -- but the moment is comparable to the instant when the ancient Chinese first scribbled some letters on a piece of paper.


CODES

In order to be transmitted, information needs to be codified in some way. Codification necessitates fixed conventions, such as the alphabet or the standard character forms introduced by the Chinese emperor Qin Shi Huangdi or the computer ASCII code. One can say, as a general rule, that the better defined and simpler a code is, the more useful it may prove to be in communication. The very simplicity of digital coding, which uses only "on" and "off" or 1 and 0, accounts for its enormous potential for communication. Much like money, which is a fabulous exchange medium for things and services, digitally coded information is an extraordinarily flexible medium for information exchange. Sound, static and moving pictures, text, etc. can all be stored and communicated in digital form; the term "multimedia" is in this sense a misnomer since its most salient feature is exactly that there is only one medium, namely, the digital one.
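
The reduction of text to "on" and "off" can be illustrated with a minimal Python sketch. The function names are chosen for this example only, and writing each character as a group of eight bits is an assumption of the sketch (ASCII itself defines 7-bit codes).

```python
def to_bits(text: str) -> str:
    """Return the ASCII codes of `text` as space-separated groups of 0s and 1s."""
    # encode() raises UnicodeEncodeError for characters outside ASCII
    return " ".join(format(byte, "08b") for byte in text.encode("ascii"))


def from_bits(bits: str) -> str:
    """Invert to_bits: turn space-separated 8-bit groups back into text."""
    return bytes(int(group, 2) for group in bits.split()).decode("ascii")


encoded = to_bits("Zen")
print(encoded)             # 01011010 01100101 01101110
print(from_bits(encoded))  # Zen
```

The same two-symbol code could carry sound samples or image pixels just as well; only the convention for interpreting the bits would differ.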

New code schemes tend to develop in similar ways. While examining the development of a code all of us are familiar with, the code of printed matter, I will in the following draw some parallels to a code that is just being born: the code of digital text.


THE CODE OF THE BOOK

The transition from handwritten to printed, and now from printed to digital, media brings with it a variety of significant changes. The famous Book of Kells, handwritten around 800, is a good example of a codex, i.e., a bound book. However, it lacks many features that we expect in books. For example, we neither know who the calligraphers and illustrators were nor where the book was produced (Kells in Ireland is only a guess).

The concordance table (four gospels) of a handwritten bible

The book opens with a "codex," a concordance table; but since these numbers have no correspondence in the body of the book, they are not very helpful. That is why, almost 800 years later in 1568, Gerald Plunket inserted these numbers in the body of the book. Had pagination by numbers been invented, he would probably have paginated the whole book and added these numbers to the concordance table. In spite of the beautiful handwriting of the Book of Kells, the text is rather hard to decipher because it is coded in unfamiliar ways: abbreviations are indicated by various symbols, spaces between words are often lacking, letters such as u and v are not distinguished, lower and upper case letters are seemingly used at random, the punctuation is unfamiliar, line sequences are sometimes reversed, etc. Once one learns the code, the text reads relatively well.

Examples of some common manuscript abbreviation codes

In the above example, IHV with a dash on top stands for "Iesum", and XPI with a dash for "Christi." Animal codes indicate changes of line sequence.

From the Book of Kells

Here, the line with the animal starts with "filium Iesum." Then the animal indicates that the rest of the line is to be jumped; it looks ahead rather than back, so one must read on below: "Nativitatem Iesu adnuntiat angelus pas." The last word is only part of the word "pastoribus", but hyphenation was not yet used. To complete this word, we now must jump back to the right of the animal. The sentence then continues: "Et accit..." Note the abbreviation codes such as ihm with a dash on top; this stands for "Iesum."

Typeset books show great similarities to handwritten ones. In fact, Gutenberg and his contemporaries wanted to produce books that looked like handwritten ones; they thus simply adapted the handwriting codes, just as most computer users today adopt various typesetting and printing codes when processing digital text.

Printed text and handwritten correction

In the above example from the so-called Frankfurt copy of the Gutenberg bible (around 1455), the lines at the top are printed while the two lines at the bottom are a handwritten correction of a typesetter's omission.


CHARACTERS

It took well over a century to create the basis of today's printing code. First, basic code elements (letters) had to be identified and then produced by the core invention of Gutenberg, the hand type foundry. Since Gutenberg essentially imitated handwriting, he made use of many special elements such as ligatures, abbreviation codes, etc. The complete set of Gutenberg's initial printing code comprised almost 300 different types. The following illustration shows a small part of Gutenberg's large character set.


Part of Gutenberg's character set (Top left some ordinary characters; top right some ligatures; second line left some special characters; second line right some abbreviation codes; bottom some punctuation marks)

The rapid adoption of printing technology all over Europe soon built up pressure for a simplification and unification of basic elements, first within individual companies, then nationally and internationally. Soon, printers managed well with one-fourth of Gutenberg's letter set.

Some characters from the world's oldest moveable-type book (1234/41)

In the Far East, printing had started much earlier; miniature dharani scrolls were printed from wood blocks in Korea and Japan around the middle of the 8th century. Around 1239, more than 200 years before Gutenberg, a primitive moveable-type printing process was first used to print a book -- a Zen text, by the way. But standardization of basic elements (Chinese characters) has a much longer history; since ancient times, each concerned government periodically redefined its set of standardized characters. In modern times, in addition to governments, individual companies such as IBM or Fujitsu defined their own character sets for computer use. A character set defined by five big companies even became the de facto Taiwanese standard: the so-called Big-Five code. Just as Gutenberg's successors gradually reduced their letter set, and standardization went beyond company and national boundaries, we today witness an effort to "unify" the Japanese, Chinese, and Korean character sets. This effort forms part of the ISO 10646 standard (and of "Unicode," its subset), an ongoing attempt at codification of most modern characters used around the globe. The unification of Chinese characters is particularly urgent because the mess of national character codes puts the whole region at a severe disadvantage vis-à-vis societies that use the alphabet (see Electronic Bodhidharma No. 3 on various codes used for Chinese characters).


Some pairs of Chinese characters that are unified in Unicode

But while the step towards less adherence to handwriting conventions (such as the abandonment of Gutenberg's ligatures) was a relatively simple task, the unification of Chinese characters is an extremely complex one. For example, the most difficult initial problem besetting the ongoing computer input of the Haeinsa blocks of the Korean Buddhist canon is its hand-carved Chinese character types. How much do they need to be unified and simplified for the transfer to the electronic medium? How are variant forms to be handled? Which particularities are essential, which ones may be of interest to some researchers, and what can be neglected without regret?


WORDS

Codes for larger units of text such as words, sentences, paragraphs etc. also first had to be invented. To mention just the spaces: ancient Greek texts usually show no spaces between words or sentences, while Roman inscriptions sometimes separated words by dots; but in ordinary texts, word limits were not marked.


A Greek inscription of 145 BC; no spaces, but a kind of paragraph tab

In the 7th and 8th centuries, some scribes whose Latin was shaky began to separate words, much like foreigners put spaces into romanized Japanese text to make it easier to read. In Europe, the introduction of spaces was crucial for the growth of silent reading from around the tenth century, but the blank space between words has been fully in place only since the end of the 17th century. (In electronic text, the space plays an extremely important role as a separator tag recognized by spell checkers, indexing tools, search engines, etc.) Indeed, much of the code we now use in writing and reading was created quite recently: the exclamation mark, the quotation mark, and the dash for example are all creations of the 17th century.
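
The role of the space as a separator in electronic text can be shown with a few lines of Python. This is only a sketch: the Latin sample phrase and the variable names are chosen for this example, and real indexing tools apply far more elaborate rules.

```python
from collections import Counter

spaced = "in principio erat verbum et verbum erat apud deum"
unspaced = spaced.replace(" ", "")  # continuous script, as in early manuscripts

# str.split() with no argument cuts the text at runs of whitespace
words = spaced.split()
print(len(words))                 # 9 separate words
print(Counter(words)["verbum"])   # 2 occurrences can be counted

# Without spaces, the same tool sees one undivided "word"
print(len(unspaced.split()))      # 1
```

Once the word boundaries are gone, a program has nothing simple to hold on to, which is roughly the position of a reader facing an unspaced manuscript.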


DOCUMENT PARTS

Moving from the encoding of basic text units to the encoding of parts of documents, we note that a similar development took place. When the text in the following illustration was written, even the most basic sequencing devices such as the line had not yet been invented.

Before the line: Fragment from Uruk (4th millennium BC, Mesopotamia)
Top left: two / top right: temple, house / center: sheep / asterisk: god / bottom: Inanna

Early printed books usually put commentary all around the text rather than at the bottom of the page, and they did not yet feature many elements that are now standard (such as a table of contents). The creation of standard parts of printed documents took time, and new conventions emerged slowly: dust cover (earliest 1833, but first used for advertisement in 1906), headers and footers, notes at the bottom of the page or at the end of the chapter, illustrations, tables, bibliography, appendices, etc. As is to be expected, such elements now also form part of the Standard Generalized Markup Language (SGML; see my Dog-Ears and SGML). The following example stems from the Guidelines for Electronic Text Encoding and Interchange (TEI P3):

<text> contains a single text of any kind, whether unitary or composite, for example a poem or drama, a collection of essays, a novel, a dictionary, or a corpus sample.
<front> contains any prefatory matter (headers, title page, prefaces, dedications, etc.) found before the start of the text proper
<body> contains the whole body of a single unitary text, excluding any front or back matter.
<back> contains any appendixes, etc. following the main part of a text
<div> contains a subdivision of the front, body, or back of a text
<div0> contains the largest possible subdivision of the body of a text
<div1> contains a first-level subdivision of the front, body, or back of a text (the largest, if <div0> is not used, the second largest if it is)
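
How such element names let a program find its way through a document can be sketched in Python. The sample document below is invented for this illustration, and it is written as well-formed XML so that Python's standard parser can read it; this is an assumption of the sketch, since SGML proper allows constructs that this parser does not accept.

```python
import xml.etree.ElementTree as ET

# Invented sample using the element names listed above
SAMPLE = """
<text>
  <front><div1>Preface</div1></front>
  <body>
    <div1>Chapter one</div1>
    <div1>Chapter two</div1>
  </body>
  <back><div1>Appendix</div1></back>
</text>
"""


def outline(text_element):
    """Map each part (front, body, back) to the contents of its <div1> children."""
    return {part.tag: [div.text for div in part.findall("div1")]
            for part in text_element}


root = ET.fromstring(SAMPLE.strip())
structure = outline(root)
for part, divisions in structure.items():
    print(part, divisions)
```

Because the parts are named rather than merely laid out on a page, the same markup can serve a table of contents, a search restricted to the body, or a printout without the back matter.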

In order to find one's way through a document and to access different parts of it, the overall sequence of elements had to be fixed. Numbered pagination was developed in the course of the 16th century but was still not universally applied; then followed conventions such as tables of contents, tables of illustrations, references and cross references, annotation, and citations. In some cases, particularly in texts with many different editions, other kinds of mapping were developed that allow finding passages even in multiple editions or in editions written in other languages: chapter and verse numbers for the Bible, sequential numbers as in Greek classics, Taisho canon numbers in my institute's concordances, etc. Once the virtues of such sequencing had sunk in, people could hardly do without them and, like the Gerald Plunket mentioned above, added sequencing codes even to older materials (recto and verso folio numbers on manuscripts, document numbers). Indices and concordances of texts push such internal mapping to its limit.
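
The idea of a concordance built on edition-independent references can be sketched in Python. The three short passages and the function name are hypothetical and serve only this illustration; the point is that the keys are (chapter, verse) pairs rather than page numbers.

```python
from collections import defaultdict

# Hypothetical mini-text: keys are (chapter, verse), not page numbers,
# so a reference holds regardless of how an edition is paginated.
PASSAGES = {
    (1, 1): "the master asked the monk",
    (1, 2): "the monk bowed",
    (2, 1): "the master laughed",
}


def build_concordance(passages):
    """Map every word to the ordered list of references where it occurs."""
    concordance = defaultdict(list)
    for ref in sorted(passages):
        # set() ensures a passage is listed once even if a word repeats in it
        for word in sorted(set(passages[ref].split())):
            concordance[word].append(ref)
    return dict(concordance)


concordance = build_concordance(PASSAGES)
print(concordance["monk"])    # [(1, 1), (1, 2)]
print(concordance["master"])  # [(1, 1), (2, 1)]
```

A page-based index would have to be rebuilt for every new edition, whereas references of this kind remain valid as long as the division into chapters and verses is kept.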

Such "maps" also contain a variety of paths or links. When researching some topic, we usually follow links embedded in printed documents: primitive links like figure references and cross-references that point to locations within the same document, or pointers to other documents of primary or secondary nature. Sometimes, such links are even reflected in the layout of the document (parallel arrangement of alternative text versions, notes at the bottom of the same page). Other links may relate to different temporal stages of a document (versions of documents, historical maps) or other information related to the document. Encyclopaedias are good examples of primitive linkage strategies; so are note numbers that link an item to further information.


CLOSING IN ON BOOKS

Traditionally, books were mostly known by their first words or, as we have seen with the Book of Kells, by some kind of nickname. Authors, translators, calligraphers, or illustrators were rarely mentioned by name; each book was unique. Only with printing did the need arise to give a product its place within a flood of publications. The colophon had served this purpose in a limited way in Europe; the Europeans were far less conscientious in noting down essential information about the creation of documents than their Chinese counterparts. The first colophon that was printed in moveable type appeared in 1457; it mentions the book title and the names of the printers. The mapping or encoding of the book gradually developed; but only after 1470 did the title page become popular. During well over one hundred years of experiments, the title page as we know it gradually evolved and became current around the year 1600. Before that, there were sometimes just a title and year; or a title, name, and portrait of the author; or the name of the printer with place and date of publication, etc. The following example shows the title page of a 1527 edition of Boccaccio's Decamerone which just features title, author, and year -- and a bold advertisement that would today grace the dust cover or a special paper band. The title reads: "IL DECAMERONE DI M.GIOVANNI BOCCACCIO NVOVAMENTE CORRETTO ET CON DILIGENTIA STAMPATO" (note that u and v were not yet distinguished).

Title page of a 1527 Decamerone edition

Gradually, efficient mapping strategies for storing basic information about books were developed, particularly at libraries: author or editor, title, publisher, content keywords, owner of intellectual rights, ISBN number, and whatever else we now find in the impressum of a book. But outside of the body of the book, too, various materials were created to assist in learning about books: dust covers, posters, book catalogues, book announcement brochures, bibliographies, book abstracts, keyword databases, etc.

Parallel to the development of such mapping strategies, storage and distribution facilities such as libraries, bookstores, book fairs, second hand book stores, etc. developed. Access to documents is of two basic types: either you go to the document, or the document comes to you. Media tend to move from the former to the latter; this trend is particularly striking with the current electronic document revolution.

Zen texts discovered in 1900 inside the Dunhuang caves in Central Asia are a good illustration of this: at the beginning of this century, the original manuscripts could only be examined in their original desert cave by a handful of explorers. Then a few privileged researchers were allowed to go to Paris, London, or Beijing, etc. to see them. After World War II, microfilms were prepared and gradually made available to a few choice institutions; thus the documents came closer and access was considerably broadened. Today, an increasing number of such documents are not only being published in the form of photographic reproductions but also as typeset texts that can easily be copied on a copy machine, thus putting them in the hands of students who study at privileged institutions. In the next phase, such texts will be made available as digital documents, and access will be gained through a telephone line. Thus such documents may almost instantly appear on some African or Polynesian desktop, and the user will not even know where the original document database is located.


OUTLOOK

The trend of putting information into the hands of an increasing number of people will continue at an accelerated pace with digital documents. The digital medium has some disadvantages; for example, there is no digital equivalent to quickly leafing through a pile of paper books. But the book's physical quality, its major advantage, is also its limit: it needs expensive shelf space, can be consulted only by one reader at a time at the location where the book is, etc. The digital document knows no such constraints and opens up many new possibilities (most of which are yet to be discovered). A new code is about to be created, and some of its first forms (SGML, HTML) are discussed in the Electronic Bodhidharma No. 4.

Revolutionary advances in codification like the Greek alphabet or the digital code tend to frighten people. Plato, for example, had Socrates object to writing because, among other things, it "weakens the memory" since it "relies on extraneous crutches in form of strange signs," roams about among "those who cannot understand it," and prevents the author from explaining and defending his views (Phaidros 274c-278b).


Earliest European depiction of a print shop, complete with demons and devils
(Woodcut print from "Danse macabre," 1499-1500)

Countless objections are recorded from the 15th century when scribes mistakenly thought that handwriting was threatened by the new printing technology. Today, there are again many people who see terrible dangers in digital text -- for example, the demise of the book. Of course the book will survive, as did handwriting. But it is now rapidly gaining a powerful relative, a relative whose physical form allows radically new kinds of codification -- such as this HTML-encoded page on the World Wide Web -- that go far beyond the possibilities of imprinted paper. More intricate codification opens up unforeseen possibilities, as the history of printed documents shows: the first newspaper was printed in 1597, 150 years after Gutenberg, and the idea of publishing a scholarly periodical journal was born more than 200 years after Gutenberg. There is no reason to doubt that in the coming centuries the digital medium will open up far more numerous and significant possibilities.


Author: Urs APP
Last updated: 95/05/03