Concordance Tool

CONCORD, a concordance creation tool

by Christian Wittern

What CONCORD does
The format of the concordance
The initialization file
The output file

Overview of CONCORD

This tool, created in the context of the Zen KnowledgeBase project, automatically generates a concordance from a Chinese text file. The source file can be encoded either in JIS or in Big5; IRIZ KanjiBase character references are also recognized.
There are both a DOS and a Macintosh version. Both produce a tagged text file that our Word 6 macros can transform into a beautifully formatted concordance, complete with three tables of contents (stroke count, radical, and four-corner) and lookup character headers.
To run the program, you need to:

install the files for your system from the list below
install KanjiBase in your system
record its directory in the concord.ini file
install the KanjiTools for Windows or Mac Word

You can then tailor the outcome to your needs by recording some of your specifications in the concord.ini file.

Needed DOS files from the ZenBase CD 1:

DOSWIN\TOOLS\CONCORD\CONCORD.EXE
DOSWIN\TOOLS\CONCORD\CONCORD.PL
DOSWIN\TOOLS\CONCORD\CONCORD.INI

Needed MAC files from the ZenBase CD 1:

MAC:TOOLS:CONCORD:CONCORD
MAC:TOOLS:CONCORD:CONCORD.INI
MAC:TOOLS:CONCORD:CONCORD.DOC (these are the Word6 macros)

What CONCORD does

CONCORD reads a text file character by character. For every character it produces a line of output that contains the character, the surrounding characters, the location information, and some sorting keys. This output is then sorted and written to a file in a format that word processing programs can read. The word processing program then interprets this format and produces a file that can be printed directly. I wrote such a routine for MS Word for Windows; it is part of the KanjiTools for Winword collection.
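The character-by-character pass described above can be sketched as follows. This is only an illustration in Python, not CONCORD's actual code: the function name kwic_entries, the tuple layout, and the sorting by code point (standing in for CONCORD's real sorting keys) are all assumptions, and location information is reduced to a character offset.

```python
def kwic_entries(text, width=5):
    """Return one (target, left, right, offset) tuple per non-blank character.

    `width` is the number of context characters kept on each side,
    comparable to the chars= setting described for CONCORD.INI.
    """
    entries = []
    for offset, target in enumerate(text):
        if target.isspace():
            continue  # blanks and line breaks get no entry of their own
        left = text[max(0, offset - width):offset]
        right = text[offset + 1:offset + 1 + width]
        entries.append((target, left, right, offset))
    # Assumption: sort by code point, then by position in the text.
    # CONCORD itself uses dedicated sorting keys instead.
    entries.sort(key=lambda entry: (entry[0], entry[3]))
    return entries


if __name__ == "__main__":
    for target, left, right, offset in kwic_entries("abcab", width=1):
        print(f"{left:>1} [{target}] {right:<1} @{offset}")
```

Each tuple corresponds to one line of the concordance body: the target character in the middle, its context on both sides, and a reference back into the text.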

Preparing a text for CONCORD

CONCORD makes some assumptions about the file it is working on, so you should format the source file accordingly for successful operation.

The page and line number information can be in any of the three formats recognized by our FMTCNV tool (RAW, APP, or TAB).
Some additional information about the divisions of the text can be given to CONCORD in the CONCORD.INI file.
The text can contain a header that gives some information about the text. The first character of each header line must be '#' (see for example our texts on the ZenBase CD 1). Any other character might produce unpredictable results.
The text can contain SGML style tags; they and what is between them will be ignored, except when defined as a division marker (explained below). Later versions of CONCORD might try to make use of some such tags.

The format of the concordance

The basic layout of the concordance is fixed. Each entry line shows some characters to the left and right of the target character, followed by a location reference to the text. The header of each page lists the characters to be found on that page. CONCORD creates three tables of contents (radical number, number of strokes, and four-corner code).

Some other parameters of the layout can be changed where needed, for example the number of lines on a page, the number of characters in each example etc. These settings are recorded in the CONCORD.INI file which the program reads when starting up.

The initialization file (CONCORD.INI)

If you're in a hurry: simply run the program in its default mode without writing anything into the CONCORD.INI file. If you're not satisfied with the outcome, you can then change the CONCORD.INI file following the guidelines below to suit your needs.

Three different initialization files can be used to configure CONCORD.

A global initialization file in the same directory where the program resides. This file has the name CONCORD.INI and is read first.
A local initialization file in the current directory. This file is also called CONCORD.INI and will be read next.
A text-specific file in the current directory. Its name is formed by taking the name of the text file up to the dot and adding INI as the extension after the dot. For this reason, do not use INI as the extension of your source file. This file is read last and thus overrides all previous settings.

The format and the possible settings are the same in all these files. If none of these files is found, the internal default settings are used.
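The lookup order of the three files can be sketched like this. It is a minimal illustration, not CONCORD's own code: the function names, the latin-1 encoding, and the way defaults are passed in are assumptions, and CONCORD's real internal defaults are not reproduced here. The sketch treats lines starting with # as disabled, as described for the sep= setting.

```python
from pathlib import Path


def ini_candidates(program_dir, current_dir, text_filename):
    """List the three INI files in the order in which they are read."""
    # Text-specific name: the part of the file name up to the dot, plus .INI
    stem = Path(text_filename).name.partition(".")[0]
    return [
        Path(program_dir) / "CONCORD.INI",   # global, read first
        Path(current_dir) / "CONCORD.INI",   # local, read next
        Path(current_dir) / (stem + ".INI"),  # text-specific, read last
    ]


def parse_ini(content):
    """Parse key=value lines; lines starting with '#' are skipped."""
    settings = {}
    for line in content.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        key, found, value = line.partition("=")
        if found:
            # The value is kept unchanged (assumption: blanks may matter).
            settings[key.strip().lower()] = value
    return settings


def load_settings(paths, defaults):
    """Start from the defaults; each later file overrides earlier values."""
    settings = dict(defaults)
    for path in paths:
        if path.is_file():
            settings.update(parse_ini(path.read_text(encoding="latin-1")))
    return settings
```

Because the files are applied in reading order, a value set in the text-specific file wins over the same key in either CONCORD.INI, and keys found in no file keep their default.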

The following is an example of a CONCORD.INI file, with a comment after each setting:

directory=d:\cefneu

This indicates the directory where the KANJIBASE files are stored. It will usually be the same as in KANJIBAS.INI.

lines=37

This defines the number of lines per page. Header and footer lines are not counted.

chars=5

This defines how many characters precede and follow the target character on each line.

plaintext=NO

With NO, CONCORD inserts tags that make the output easier to format in a word processor (e.g. for import into Word). Set this to YES if you want a concordance in plain text format with fewer tags.

sep= -

This defines the hyphen as the separator in the location reference. A separator is only needed if there are no letters between the page and line number. With Taisho references such as 568a27, for example, you do not need to define a separator because the letter a is used. Using the dash as defined here would produce 568-27; this can be useful for manuscripts (10-1 for the first line on page 10). If you do not need such a special separator, you can insert a # at the beginning of the line before sep=.

div=<DIV n=$n>

The division separator is a template for numeric divisions in the text ($n stands for any number). These divisions usually reflect a structure in the text at the paragraph level or above. Koans in a koan collection, for example, could show up with their number before the location reference. The division number precedes the location reference and is joined to it with a colon, e.g. 10:435a22 for koan number 10 on p. 435a22. To bring this about, you have to mark this koan in your text file by inserting <DIV n=10> at the appropriate place. The first koan would accordingly be marked <DIV n=1>.
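How the sep= and div= settings combine into a location reference can be sketched as follows. This is a hedged illustration only: the function names and parameters are hypothetical, and the way CONCORD actually matches the template is not documented here.

```python
import re


def division_pattern(template):
    """Turn a div= template such as '<DIV n=$n>' into a regular expression.

    Everything is matched literally except $n, which matches a number.
    """
    escaped = re.escape(template)
    return re.compile(escaped.replace(re.escape("$n"), r"(\d+)"))


def format_reference(page, line, column="", sep="", division=None):
    """Build a location reference such as 568a27, 10-1 or 10:435a22."""
    reference = f"{page}{column}{sep}{line}"
    if division is not None:
        # The division number precedes the reference, joined by a colon.
        reference = f"{division}:{reference}"
    return reference


if __name__ == "__main__":
    pattern = division_pattern("<DIV n=$n>")
    match = pattern.search("... <DIV n=10> ...")
    koan = int(match.group(1))
    print(format_reference(435, 22, column="a", division=koan))
    print(format_reference(10, 1, sep="-"))
```

A Taisho-style reference needs no separator because the column letter already divides page and line; the manuscript-style reference uses the hyphen from sep= instead.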

The output file

The output file should not be printed directly, because it contains markup that first needs to be interpreted. It has the following general format:

The whole file is divided into four divisions: three tables of contents and the body of the concordance. These divisions are marked in the following way:

<DIV n=1 type=body>

<DIV n=2 type=radindex>

<DIV n=3 type=strindex>

<DIV n=4 type=fcindex>

The end of each division is marked with the simple end tag </DIV>.

The beginning of each page is marked with <PAGE n=$n> ($n stands for the page number); this is followed immediately by a sequence of <HEAD>(some Kanji)</HEAD> (the kanji are those that have an entry that begins on this page).

The last tag to mention is <COL> which indicates where the column break is placed.

A word processing program that formats the output file has to pick up these tags and convert them into whatever that program can understand. For this you can usually use the search-and-replace function or tailor a macro. Macros for Word for Windows and Word6 for Macintosh are included in the TOOLS directory, and a conversion script to CJK LaTeX is planned. Additional scripts to format the concordance for various other applications are welcome.
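As an illustration of the search-and-replace approach, the following Python sketch rewrites the tags into a simple plain-text layout. It is not one of the macros shipped with CONCORD: the replacement strings are arbitrary choices, and the exact spelling and spacing of the tags in a real output file is assumed from the description above.

```python
import re

# (pattern, replacement) pairs, applied in order. The right-hand sides
# are placeholders; substitute whatever your target program understands.
TAG_RULES = [
    (re.compile(r"<DIV n=\d+ type=(\w+)>"), "== \\1 ==\n"),
    (re.compile(r"</DIV>"), ""),
    (re.compile(r"<PAGE n=(\d+)>"), "--- page \\1 ---\n"),
    (re.compile(r"<HEAD>(.*?)</HEAD>"), "[\\1]\n"),
    (re.compile(r"<COL>"), "\n\n"),
]


def convert_tags(text):
    """Apply every search-and-replace rule to the CONCORD output text."""
    for pattern, replacement in TAG_RULES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    sample = "<DIV n=1 type=body><PAGE n=3><HEAD>AB</HEAD>x<COL>y</DIV>"
    print(convert_tags(sample))
```

Keeping the rules in one table makes it easy to swap in the codes of another target format without touching the conversion function.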

Author: Christian Wittern
Last updated: 95/05/01