As the name suggests, retrieval plays a very important part in SMART. Retrieval is based on a new type of index, that places a layer of abstraction between the actual locations in electronic documents and the positions in the texts referenced.
The SMART project tries to develop a suite of general tools for work on East Asian texts. Some assumptions are made about the texts:
While the first point might be self-explanatory, I will give some explanation of the second point. Here is a short extract of a text in XML format. Variants are surrounded by the <app> element, with the text version of the base text (Taisho) marked with the <lem> element, other versions are marked with the <rdg> element, the attribute wit indicates the name of the witness.
As can be seen from this example, some characters (marked with the <rdg> element) fall out of the sequential structure of the text, they can be seen as stacked on top of the primary version of the text (marked with the <rdg> element). For the building of the index, additional lines will be inserted into the index, to account for these additional characters. An example is shown here:
In this example, an index with two additional characters has been buildt, the meaning of the numbers to the left of the characters is explained below.
The index consists of three or four different data files:
The dictionary file simply contains the entries to be indexed one per line. The encoding is in UTF-16 (which is identical to UCS-2, the most popular Unicode encoding for most practical purposes). Mojikyo characters (that is, characters defined by the Mojikyo Font Institute) are using the Shift-Mojikyo encoding developed by Mr. Masahiko Maedera. The dictionary entries are separated by a tab character from the character addresses (locations), which are then following up to the end of line.
The file begins with the byte order mark FFFE followed by DIC and a digit indicating the format type (eg. 1) and a line-end (0D000A00) sequence.
The index length of the index term is variable, it is terminated by a tab character (0900). Following is a sequence of 6 byte location addresses, up to the end of the line.
The last adress is followed by a newline character. The adress is always given for the first Character; if preceding characters are given, for e.g. KWIC they should be appended at the end,separated from the other characters by a comma (2C00) character.
Example:
[U4e00][U4e00][U4e01] 00b4e8c6e63300b4c6e933e7 2 instances
...
The logical position of each character in the text, along with some other information, is expressed with the character address (I will call this location from now on). This location consists of a 15 digit decimal number, which encodes this information in its positions. Lets look at an example first:
The first character in the line T10n0279_p0001a01 is expressed as 00010027900000110101, which breaks down to:
T10n0279_p0001a01 = 010 0001 10101 000 010 Volume = 010 (maximum: 280 volumes) 0001 Page = p0001 1 Cataster = a 01 Line = 01 01 Character = n/a
0 RDG = 0=T, 1=S, 2=Y, 3=M 4=3, 5=G,9=others 0 BaseTag = 0=P, 1=L, 2=Z, 3=J, 4=Byline, 5=Comm, 6=Note 0 ExtTag = user defined... Name, Yin, etc.
As can be seen, the digit breaks down to different units, which have different semantics assigned to it. As can be seen, the first 12 digits are used to encode the character address in the text, expressed in the units of volume (3 digits), page (4 digits), cataster (1 digit), line (2 digits) and character number on the line (2 digits). These units are of course completely arbitrary and they can be redefined as needed.
The last three digits, are defined below the horizontal bar, they provided some information about the context of the character in question, derived from the XML markup. Provided is information about whether the character is
31.12.99: Alternatively, the cataster number can be used as a flag. I am using it now to indicate whether the character is the last in the line: For this purpose, the value of the cataster indicator is increasing in increments of 3. Adding 1 to the original value will indicate the last character of a line. This is important to be able to calculate the distance in the text of two characters for word-clustering and approximity searches.
The maximum is now 085146411005000 which can be expressed in 6 bytes. This will need an extra table for lookup of the text number.
010000110101000 085146411005000 281474976710656 65536 by 3
Alternatively, there can also be
T10n0279_p0001a01 = 00010027900000110101 0 Var.flag = 0:regular, 1: variant 0 reserved 0 TaishoEd = T (others not yet defined) 10 Volume = 10 0279 Number = n0279 00 add. Letter = _ 0001 Page = p0001 1 Cataster = a 01 Line = 01 01 Character = n/a
or T85n2920_p1464a10 = 00085292000146411005
The maximum is 085292000146411005, which can be expressed with 8 bytes. Every character in the Taisho can be addressed with this.
This is a binary (non-text) file consisting of long (i.e. four-byte) integers. These are offsets to the starting bytes of the lines or words or tokens within the dictionary file which can be looked up. The offsets are sorted according to the lexical value (i.e. collating sequence) of the tokens to which they point.
This index format is modelled after a format used by Jim Breen for his EDICT dictionary.
Gives first (and last?) lines of the texts, followed by text title and any other information. A hierarchy of lower level textual elements can be constructed by indenting the file with one space/tab for every hierarchical level.
This gives the position of the beginning of each line and the filename, where this is found. The receiving application might have to strip some markup, the beginning is the beginning of the physical line in the file, not the logical line of the text.
The size of the index depends largely on the number of characters included besides the lookup character. To estimate the size, a set of dictionary files of 1 to 7 characters has been calculated for one volume. Additionaly, complete indices have been created for a dictionary of 3 and 7 characters length. A larger size means faster response time and an immediate KWIC-like view. If this is not done and a KWIC is needed, it would have to be constructed on the fly, which is more timeconsumin.
One volume has approx. 1 500 000 characters.
In strings of different length, they are approximately distributed as follows:Length of entry | Lines in index | Size of dictionary (1 Vol) | Size of index (28 Vols) |
---|---|---|---|
1 | 3814 | 16 487 | |
2 | 174208 | 1 049 942 | |
3 | 575164 | 4 610 411 | 313 499378 |
4 | 838242 | 8 396 379 | |
5 | 970633 | 11 666 525 | |
6 | 1034526 | 14 507 096 | |
7 | 1074590 | 17 221 803 | 916 231744 |
Total | 1504517 | -- |