Author: Christian Wittern, Taipei -- Created: 99-07-20, Last change: 01-03-31
URL of this document: http://www.chibs.edu.tw/~chris/smart/

Christian Wittern: The SMART project

This is the home site for the System for Markup and Retrieval of Texts (SMART), a project that aims to develop a suite of research tools for working with Chinese text using XML markup as recommended by the TEI Consortium.

This project has been supported by a grant from the German Research Council (DFG) from 1999 to 2001.

The following material is currently available here:

Information about the SMART project

A very brief introduction to the SMART project.
Toward a Web-based Scholar's Workbench
In this paper, I discuss some technical issues behind a common source of frustration for scholars. Now that the data are available, there is much more that could be done with them than simple (or even more complex) searches. Texts could be analyzed, compared, annotated, corrected, interlinked, translated, and connected with secondary sources, bibliographies, maps, dictionaries, still or moving images, audio content and any other imaginable digital resource. But alas, there is so far no way to do this with most of the data on the web. Although much more interactive than broadcast media, the web still mainly delivers content to the user without involving the user in developing that content. Such involvement, however, is at the heart of scholarly work: researching, publishing and digesting the research of others, and in many areas also collaborating on smaller or larger projects. The Internet and the World Wide Web have the potential to enhance these activities, but so far have largely failed to do so. The main reasons are that there are no applications to support this kind of work and that there is a lack of standards that would allow such applications to talk to each other.

Technical notes

Layered Markup (00-07-05, updated 00-07-20)
The SMART project tries to build an environment where collaborators can remotely add markup, comments, references etc. to a common base text. The markup is to be added independently, with no necessary implied reference between one contributor's markup and another's.
Layered markup as discussed here rests on the presumption that the markup of a text can be meaningfully divided into structural markup (markup of the textual structure and common references) and content markup (markup of the content of the text as relevant to different domains). This closely follows the model of a map, where different special-purpose maps (e.g. topographical, political, climatic, distribution of population, language, religion etc.) are layered on the same basic outline of a territory.
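The map analogy can be illustrated with a minimal sketch in which each layer is stored separately and refers only to character offsets in the shared base text. This is an assumption made for illustration, not the representation used by SMART: the `Annotation` class, the layer names and the sample sentence are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int  # offset of the first character in the base text
    end: int    # offset one past the last character
    label: str  # what this layer says about the span


# Hypothetical base text shared by all collaborators.
BASE_TEXT = "The master crossed the river at dawn."

# Each layer points only at the base text, never at another layer,
# so layers can be added or removed independently.
LAYERS = {
    "structure": [Annotation(0, 37, "sentence")],
    "persons": [Annotation(4, 10, "person")],
    "places": [Annotation(23, 28, "place")],
}


def spans(layer_name):
    """Return (label, covered text) pairs for one layer."""
    return [(a.label, BASE_TEXT[a.start:a.end]) for a in LAYERS[layer_name]]


def labels_at(offset, layer_names):
    """Overlay the chosen layers and list the labels covering one offset."""
    found = []
    for name in layer_names:
        for a in LAYERS[name]:
            if a.start <= offset < a.end:
                found.append(a.label)
    return found
```

Calling `labels_at(24, ["structure", "places"])` overlays two layers at one position, much as two special-purpose maps are laid over the same outline; leaving a layer out of the list simply hides it.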
Format of the SMART Index Files (99-07-20, updated 00-01-10)
As the name suggests, retrieval plays a very important part in SMART. Retrieval is based on a new type of index that places a layer of abstraction between the actual locations in electronic documents and the positions in the texts referenced. Besides that function, the index also maps the multidimensional structure of a text and its variant readings onto flat, sequential chunks of text. This technical note reports the details of the index format.
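The idea of an index that stands between logical text positions and physical file locations can be sketched as follows. This is not the SMART index format, which is described in the technical note itself; the `Entry` fields, the reference notation (`p1.l1`), the witness labels and the file names are all hypothetical and serve only to show the indirection.

```python
from typing import NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    ref: str      # logical position in the text (hypothetical notation)
    witness: str  # "base" or the label of a variant reading
    file: str     # file that currently holds the chunk
    offset: int   # character offset of the chunk within that file
    length: int   # length of the chunk in characters


# A flat, sequential list of chunks: variant readings of the same
# logical position sit next to each other instead of in a nested structure.
INDEX = [
    Entry("p1.l1", "base", "text-a.xml", 120, 14),
    Entry("p1.l1", "variantA", "text-a.xml", 150, 15),
    Entry("p1.l2", "base", "text-b.xml", 0, 12),
]


def locate(ref: str, witness: str = "base") -> Optional[Tuple[str, int, int]]:
    """Resolve a logical reference to (file, offset, length), or None."""
    for entry in INDEX:
        if entry.ref == ref and entry.witness == witness:
            return (entry.file, entry.offset, entry.length)
    return None
```

Because callers only ever use logical references such as `locate("p1.l1")`, a file can be split, renamed or re-encoded by regenerating the index alone, without touching anything that cites the text.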
Remapping Unicode text for Japanese or Taiwanese computing environments (99-11-17, updated 00-01-09)
The Unicode character set provides, for the first time, a convenient way to deal with text files from different parts of East Asia, since widely used national codes are mapped to one common codespace. Unfortunately, the source separation rule, which ensures that characters distinguished in any of the source character sets are also distinguished in Unicode, means that some commonly used characters end up at different Unicode codepoints depending on whether the source is a Big5 encoded document from Taiwan or a JIS encoded document from Japan.
A conversion utility is introduced here and the underlying codetable is made available for free download.
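A remapping of this kind can be sketched with a small translation table. The three pairs below are illustrative examples chosen for this sketch and are not taken from the downloadable codetable; the function name `remap` and the target labels `"jis"` and `"big5"` are likewise assumptions.

```python
# Illustrative pairs only:
# (codepoint usual in text converted from Big5,
#  codepoint usual in text converted from JIS)
PAIRS = [
    ("\u6236", "\u6238"),  # "door"
    ("\u8aaa", "\u8aac"),  # "explain"
    ("\u5167", "\u5185"),  # "inside"
]

# One translation table per direction, built from the same pairs.
TO_JIS = str.maketrans({big5: jis for big5, jis in PAIRS})
TO_BIG5 = str.maketrans({jis: big5 for big5, jis in PAIRS})


def remap(text: str, target: str) -> str:
    """Replace listed characters with the form expected by the target environment."""
    tables = {"jis": TO_JIS, "big5": TO_BIG5}
    return text.translate(tables[target])
```

Characters that do not appear in the table pass through unchanged, so the function can be applied to a whole file; with a one-to-one table such as this one, remapping in one direction and then the other restores the original text.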
SGML/XML editing with plain text editors (00-01-07, updated 00-01-10)
In the absence of sophisticated and affordable SGML/XML-capable editors suitable for editing East Asian texts, it is feasible to set up an editing environment with popular and powerful shareware editors. These editors can be downloaded from the Internet and used free of charge for a limited period; a licence fee is due only for continued use. The two editors discussed here, UltraEdit Professional Text/HEX Editor Version 7.00a (available from www.ultraedit.com) and TextPad 4.1 (see www.textpad.com), are rich in features, with no clear front-runner. UltraEdit's most recent version is capable of reading Unicode text files and seems to use Unicode as its internal encoding. There are still problems with codepoints outside the current codepage, but this gives UltraEdit a slight advantage over TextPad, whose most recent version introduces support for double-byte encodings and is also available in a version with a Japanese user interface. In this technical note, I discuss how to set up these editors to edit and validate SGML/XML text files.
