by Christian Wittern
This tool, created in the context of the Zen KnowledgeBase project, automatically generates a
concordance from a Chinese text file. The source file can be encoded either in JIS or in Big5; IRIZ KanjiBase character references are also recognized.
There are both a DOS and a Mac version. Both produce a tagged text file that, with the help of our Word 6 macros, can be transformed into a beautifully formatted concordance, complete with three tables of contents (stroke count, radical, and four-corner) and lookup character headers.
To run the program you need to
Needed DOS files from the ZenBase CD 1:
CONCORD reads a text file, character by character. For every character it produces a line of output that contains the character, the surrounding characters, the location information, and some sorting keys. This output is then sorted and written to a file in a format that can be read by word processing programs. The word processing program then interprets this format and produces a file that can be printed directly. I wrote such a routine for MS Word for Windows, which forms part of the Kanjitools for Winword collection.
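The processing described above can be illustrated with a minimal Python sketch. It is not the CONCORD source: the function name `build_entries` is hypothetical, the location is a plain character offset rather than a page and line reference, and the Unicode code point stands in for the radical, stroke count, and four-corner sorting keys that the real program takes from KanjiBase.

```python
def build_entries(text, context=5):
    """Return one (char, left, right, offset) entry per character, sorted.

    Assumptions for illustration only: whitespace is dropped before
    processing, the location is the offset in the stripped text, and
    the sort key is the code point of the character.
    """
    chars = [c for c in text if not c.isspace()]
    entries = []
    for i, ch in enumerate(chars):
        # Characters preceding and following the target character.
        left = "".join(chars[max(0, i - context):i])
        right = "".join(chars[i + 1:i + 1 + context])
        entries.append((ch, left, right, i))
    # Group identical characters together, in order of occurrence.
    entries.sort(key=lambda e: (ord(e[0]), e[3]))
    return entries


if __name__ == "__main__":
    for ch, left, right, offset in build_entries("abca", context=1):
        print(f"{left:>1} [{ch}] {right:<1} @{offset}")
```

Each entry corresponds to one line of the concordance: the target character, its context on both sides, and where it was found.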
CONCORD makes some assumptions about the file it is working on, so you should format the source file accordingly for successful operation.
The basic layout of the concordance is fixed. You will get some characters to the left and right of each entry line, and a location reference to the text. Information about which characters are to be found on the page is given in the header. CONCORD creates three tables of contents (radical number, number of strokes, and four-corner code).
Some other parameters of the layout can be changed where needed, for example the number of lines on a page, the number of characters in each example etc. These settings are recorded in the CONCORD.INI file which the program reads when starting up.
If you're in a hurry: simply run the program in its default mode without writing anything into the CONCORD.INI file. If you're not satisfied with the outcome, you can then change the CONCORD.INI file following the guidelines below to suit your needs.
Three different initialization files can be used to configure CONCORD.
The format and the possible settings are the same in all these files. If none of these files is found, the internal default settings are used.
The following is an example of a CONCORD.INI file with some indented comments:
directory=d:\cefneu
    This indicates the directory where the KANJIBASE files are stored. It will usually be the same as in KANJIBAS.INI.
lines=37
    This defines the number of lines per page. Header and footer lines are not counted.
chars=5
    This defines how many characters precede and follow the target character on each line.
plaintext=NO
    This inserts tags for easier word processor formatting of the outcome (e.g. for import in WORD). Set this to YES if you want a concordance in plain text format with fewer tags.
sep= -
    This defines the hyphen as symbol for the location reference. A separator is only needed if there are no letters between the page and line number. With Taisho references such as 568a27, for example, you do not need to define a separator because the letter a is used. Using the dash as defined here would produce 568-27; this can be useful for manuscripts (10-1 for the first line on page 10). If you do not need such a special separator, you can insert a # at the beginning of the line before sep=
div=<DIV n=$n>
    The division separator is a template for numeric divisions ($n stands for any number) in the text. These divisions usually reflect a structure in the text on the paragraph level or above. Koans in a koan collection, for example, could show up with their number before the location reference. The division number will precede the location reference and will be connected with a colon, e.g. 10:435a22 for koan number 10 on p. 435a22. To bring this about, you have to mark this koan in your text file by inserting <DIV n=10> at the appropriate place. The first koan would accordingly be marked <DIV n=1>.
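How these settings combine into a location reference can be sketched in Python. This is an illustration under assumptions, not the program's own code: the names `parse_settings`, `format_location`, and `EXAMPLE_DEFAULTS` are hypothetical, the default values are simply copied from the example above (the real internal defaults are not documented here), and the parser assumes a file containing only key=value lines and lines that start with #.

```python
# Hypothetical defaults, taken from the example file for illustration only.
EXAMPLE_DEFAULTS = {"lines": "37", "chars": "5", "plaintext": "NO", "sep": ""}


def parse_settings(ini_text, defaults=EXAMPLE_DEFAULTS):
    """Read key=value lines; lines starting with # are treated as disabled."""
    settings = dict(defaults)
    for raw in ini_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip().lower()] = value.strip()
    return settings


def format_location(page, line, column="", sep="", division=None):
    """Build a reference such as 568a27, 568-27, or 10:435a22."""
    ref = f"{page}{column}{sep}{line}"
    # A division number, if present, precedes the reference after a colon.
    return f"{division}:{ref}" if division is not None else ref


if __name__ == "__main__":
    cfg = parse_settings("lines=40\nsep= -\n")
    print(format_location(10, 1, sep=cfg["sep"]))
    print(format_location(435, 22, column="a", division=10))
```

With a Taisho-style reference the column letter already separates page and line, so `sep` stays empty; for a manuscript without column letters the dash keeps the two numbers apart.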
The output file
The output file should not be printed directly, as it contains markup that first needs to be interpreted. It has the following general format:
- The whole file is divided into four divisions: three tables of contents and the body of the concordance. These divisions are marked in the following way:
  - <DIV n=1 type=body>
  - <DIV n=2 type=radindex>
  - <DIV n=3 type=strindex>
  - <DIV n=4 type=fcindex>
  The end of each division has the simple end tag </DIV>.
- The beginning of each page is marked with <PAGE n=$n> ($n stands for the page number); this is followed immediately by a sequence of <HEAD>(some Kanji)</HEAD> (the kanji are those that have an entry that begins on this page).
- The last tag to mention is <COL> which indicates where the column break is placed.
A word processing program that formats the output file will have to pick up these tags and convert them into whatever that program can understand. For this you can usually use the search-and-replace function or tailor a macro. Macros for Word for Windows and Word 6 for Macintosh are included in the TOOLS directory, and a conversion script to CJK LaTeX is planned. Additional scripts to format the concordance for various other applications are welcome.
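As a rough illustration of such a search-and-replace conversion, the following Python sketch turns the tags described above into plain text markers. It is not one of the macros from the TOOLS directory: the function name `convert_tags`, the titles in `DIV_TITLES`, and the chosen output markers are all hypothetical, and a real target format would need its own replacements.

```python
import re

# Hypothetical display titles for the four division types.
DIV_TITLES = {
    "body": "Concordance",
    "radindex": "Radical index",
    "strindex": "Stroke count index",
    "fcindex": "Four-corner index",
}


def convert_tags(tagged):
    """Replace CONCORD output tags with simple plain-text markers."""
    out = re.sub(
        r"<DIV n=\d+ type=(\w+)>",
        lambda m: f"== {DIV_TITLES.get(m.group(1), m.group(1))} ==\n",
        tagged,
    )
    out = out.replace("</DIV>", "")
    # Page start and the header characters that follow it.
    out = re.sub(r"<PAGE n=(\d+)>", lambda m: f"--- page {m.group(1)} ---\n", out)
    out = re.sub(r"<HEAD>(.*?)</HEAD>", lambda m: f"[{m.group(1)}]\n", out)
    # A column break is rendered here as a plain line break.
    return out.replace("<COL>", "\n")


if __name__ == "__main__":
    sample = "<DIV n=1 type=body><PAGE n=1><HEAD>XY</HEAD>left<COL>right</DIV>"
    print(convert_tags(sample))
```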
Author: Christian Wittern
Last updated: 95/05/01