Schema XML files in the KanripoX project

2021-01-11

1 Overview

The following sections describe the file formats that have been defined for the extension of the Kanseki Repository. Although they hold very different information for different purposes, they have been combined into one single schema, addressed under one single namespace, for the convenience of describing the files and processing the information. The schema allows four entry points: manifest, manifests, nexusList and tList.

A manifest can stand alone or be grouped with others into a list of manifests, which gives two entry points for manifests. The other two file types consist of lists of nexus and t (token) elements respectively; in these cases the list itself (nexusList or tList) is the entry point.

2 Grouping and description of texts: the manifest

The Manifest.xml described here contains information about a set of editions that are grouped here together, usually for the purpose of further description and processing.

There are two main elements under the root element manifest (see note 1): editions and divisions.

The editions element holds information about the editions that are collected here. It contains edition elements, which give the details for each edition. These include the type, which can be either "documentary" or "interpretative". Documentary editions strive to reproduce an existing print edition, while interpretative editions reflect the views of the editor and do not follow one single edition.

Each edition also has an id, a unique label (or identifier) used to refer to this specific edition within the manifest and by the processing systems.

The edition element may have, among others, the following children: description and divisions.

description contains a description of the edition, which could be the title, but also other information deemed relevant; the schema declaration in section 5 makes it mandatory. divisions is optional and allows reference to divisions within the edition. It is repeatable; when it occurs more than once, the edition is considered to be made up of the sequence of these divisions.

The divisions element can also occur as a child element of manifest, optionally following the editions element. If used here, there will be only one divisions element, which holds all subdivisions as div elements that may nest. The purpose of this element is to provide an entry point to the editions that is tied neither to one specific edition nor to a hyperlink or similar technical mechanism. The label on a div element is a human-readable name that can be used to point to that specific division, much the same as "Chapter 2" will (usually) refer to the same section of a work, no matter which edition is used. To link this nesting structure of chapters, sections and so forth to the editions, each div can have one or more edRef elements, each of which points to the text span in one edition that covers this specific section.
<div label="第一章">
 <edRef end="61" key="KR5c0057_tls" start="0"/>
 <edRef end="58" key="CH1a0918a_chant" start="0"/>
 <edRef end="402" key="CH1a0918b_chant" start="2"/>
</div>
In this example, the start and end attributes give the numbers of the first and last token that are part of this section of a text, thus identifying the text span independently of the file format of the edition. Other possibilities for addressing a text span are available if the edition is in TEI/XML.
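
As a rough illustration of how a processing tool might read such a division, the following Python sketch parses the example above with the standard library and turns each edRef into an inclusive token span. The function name ed_spans and the omission of the schema namespace are assumptions made for this example; they are not part of the KanripoX tools.

```python
import xml.etree.ElementTree as ET

# The division from the example above. The schema namespace is left out
# (an assumption made to keep the sketch short).
DIV_XML = """<div label="第一章">
 <edRef end="61" key="KR5c0057_tls" start="0"/>
 <edRef end="58" key="CH1a0918a_chant" start="0"/>
 <edRef end="402" key="CH1a0918b_chant" start="2"/>
</div>"""


def ed_spans(div):
    """Map each edition key to (start, end, token count).

    start and end are treated as inclusive token numbers, following the
    description of these attributes in the text.
    """
    spans = {}
    for ref in div.findall("edRef"):
        start = int(ref.get("start"))
        end = int(ref.get("end"))
        spans[ref.get("key")] = (start, end, end - start + 1)
    return spans


div = ET.fromstring(DIV_XML)
spans = ed_spans(div)
for key, (start, end, count) in spans.items():
    print(f"{div.get('label')} in {key}: tokens {start}-{end} ({count} tokens)")
```

The same label thus resolves to a different span in each edition, which is what allows a division to be addressed without committing to one edition.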

3 Links between text passages: the nexus file

Nexus files describe links between locations in texts. Each link refers to a span of one or more consecutive characters in a text. Related links can be grouped together to form a nexus, for example to describe corresponding passages in different versions of a text.

The main elements under the root element nexusList are nexus and locationRef.

The nexus element holds the locationRef elements, which contain the reference information to locate the passage of the text. The reference is expressed by pointing to a sequence of one or more tokens in a token file for the edition.
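
To make the token-based references concrete, here is a minimal Python sketch that reads a small nexus file and resolves every locationRef to an inclusive token range. The sample document, its edition identifiers (ed_A, ed_B), the target values and the function name location_spans are all invented for illustration, and the namespace is again omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical nexus file: identifiers, token numbers and targets are
# made up for this example.
NEXUS_XML = """<nexusList ed="ed_A">
 <nexus tp="10" tcount="3">
  <locationRef ed="ed_A" tp="10" tcount="3" target="ed_A_t10">道可道</locationRef>
  <locationRef ed="ed_B" tp="12" tcount="3" target="ed_B_t12">道可道</locationRef>
 </nexus>
 <nexus tp="20">
  <locationRef ed="ed_B" tp="25" target="ed_B_t25"/>
 </nexus>
</nexusList>"""


def location_spans(nexus):
    """Return (edition, first token, last token) for each locationRef."""
    spans = []
    for ref in nexus.findall("locationRef"):
        tp = int(ref.get("tp"))
        # tcount is optional; the schema documents 1 as its default.
        tcount = int(ref.get("tcount", "1"))
        spans.append((ref.get("ed"), tp, tp + tcount - 1))
    return spans


root = ET.fromstring(NEXUS_XML)
all_spans = [location_spans(nexus) for nexus in root.findall("nexus")]
for number, spans in enumerate(all_spans):
    print(number, spans)
```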

4 A shadow of the text: the token file

The token files described here serve as a shadow of other digital files that describe the texts more thoroughly. This relieves the token files of the burden of describing the physical appearance, structure and transmission of the text; this information is available at any time by following the links back to these other files. The purpose of the token files is to provide a minimal description, containing only the characters of the text in a form that allows easy comparison and alignment of multiple versions. The function is similar to that of a concordance in that it provides access to the whole text, but without much of what a reader would expect to make reading (or editing) convenient, or even feasible. On the other hand, enough information should be retained to reconstruct a very basic version of the text.

The main elements under the root element tList are tg and t.

The tg element holds the t elements, which have the character content of the text, one token per t. The purpose of the tg element is to group related t elements. tg elements can nest and thus provide a rudimentary structure in the token files.
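
The claim that a very basic version of the text can be reconstructed from a token file can be sketched in a few lines of Python. The sample tokens, the edition identifier ed_A and the function name plain_text are assumptions for illustration; the sketch relies only on the tp, p and f attributes defined in section 5 and omits the namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical token file for a short passage.
TOKENS_XML = """<tList ed="ed_A">
 <tg role="p" n="p1">
  <t role="p" tp="0" n="p1">子</t>
  <t role="p" tp="1" n="p1" f="：">曰</t>
  <t role="p" tp="2" n="p1" p="「">學</t>
  <t role="p" tp="3" n="p1">而</t>
  <t role="p" tp="4" n="p1">時</t>
  <t role="p" tp="5" n="p1">習</t>
  <t role="p" tp="6" n="p1" f="，">之</t>
 </tg>
</tList>"""


def plain_text(tlist):
    """Rebuild a basic reading text from the tokens.

    Tokens are ordered by tp; p (preceding) and f (following) carry
    punctuation that is not itself a token.
    """
    tokens = sorted(tlist.iter("t"), key=lambda t: int(t.get("tp")))
    return "".join(
        t.get("p", "") + (t.text or "") + t.get("f", "") for t in tokens
    )


text = plain_text(ET.fromstring(TOKENS_XML))
print(text)
```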

5 Schema for Manifest, Nexus and Token

Schema KRX: Elements

<creation>

<creation> Information about the creation
Module KRXManifest
Contained by
KRXManifest: description edition editionGroup
May contain
KRXManifest: date resp title
Content model
<content>
 <alternate maxOccurs="unbounded"
  minOccurs="0">

  <elementRef key="date"/>
  <elementRef key="resp"/>
  <elementRef key="title"/>
 </alternate>
</content>
Schema Declaration
element creation { ( krx_date | krx_resp | krx_title )* }

<date>

<date> Date of the work
Module KRXManifest
Attributes
notbefore: Earliest possible date
Status Optional
Datatype string
notafter: Latest possible date
Status Optional
Datatype string
cert: Degree of certainty of this assertion
Status Optional
Legal values are:
high
High degree of certainty
middle
Middle degree of certainty
low
Low degree of certainty
Contained by
KRXManifest: creation
May contain Character data only
Content model
<content>
 <textNode/>
</content>
Schema Declaration
element date
{
   attribute notbefore { string }?,
   attribute notafter { string }?,
   attribute cert { "high" | "middle" | "low" }?,
   text
}

<description>

<description> Description of the edition or item this element is attached to.
Module KRXManifest
Contained by
KRXManifest: div edition manifest
May contain
KRXManifest: creation note title
character data
Content model
<content>
 <alternate maxOccurs="unbounded"
  minOccurs="0">

  <textNode/>
  <elementRef key="note" minOccurs="0"/>
  <elementRef key="title" minOccurs="0"/>
  <elementRef key="creation" minOccurs="0"/>
 </alternate>
</content>
Schema Declaration
element description { ( text | krx_note? | krx_title? | krx_creation? )* }

<div>

<div> One specific subdivision on any level.
Module KRXManifest
Attributes
label: A label to identify the subdivision. It can be any string, but should be unique in the manifest. This can be used to access this textual division.
Status Optional
Datatype token
edition: A reference to the edition, as defined elsewhere in this manifest.
Status Optional
Datatype IDREF
sequence: Sequential number of this division, given in such a way that ordering by this number will produce the text in the same sequence as the base edition.
Status Optional
Datatype nonNegativeInteger
start: The sequential number of the first token of this division in the token list.
Status Optional
Datatype nonNegativeInteger
end: The sequential number of the last token of this division in the token list.
Status Optional
Datatype nonNegativeInteger
divid: If the source file of this edition has an identifier (usually an xml:id) for this subdivision, it can be recorded here.
Status Optional
Datatype token
Contained by
KRXManifest: div divisions
May contain
KRXManifest: description div edRef label
Content model
<content>
 <sequence maxOccurs="1" minOccurs="1">
  <elementRef key="label"
   maxOccurs="unbounded" minOccurs="0"/>

  <elementRef key="description"
   minOccurs="0"/>

  <elementRef key="edRef"
   maxOccurs="unbounded" minOccurs="0"/>

  <elementRef key="div"
   maxOccurs="unbounded" minOccurs="0"/>

 </sequence>
</content>
Schema Declaration
element div
{
   attribute label { token }?,
   attribute edition { xsd:IDREF }?,
   attribute sequence { xsd:nonNegativeInteger }?,
   attribute start { xsd:nonNegativeInteger }?,
   attribute end { xsd:nonNegativeInteger }?,
   attribute divid { token }?,
   ( krx_label*, krx_description?, krx_edRef*, krx_div* )
}
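
The sequence attribute is easiest to understand with a small case in which an edition arranges its divisions differently from the base edition. The Python sketch below sorts the div elements of a hypothetical divisions element by sequence; the labels, token numbers, the edition identifier ed_B and the function name in_base_order are invented, and the namespace is omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical divisions of a reference edition that prints the preface
# last; sequence records the position each division has in the base edition.
DIVISIONS_XML = """<divisions edition="ed_B">
 <div label="卷一" sequence="1" start="0" end="149"/>
 <div label="卷二" sequence="2" start="150" end="299"/>
 <div label="序" sequence="0" start="300" end="349"/>
</divisions>"""


def in_base_order(divisions):
    """Return the direct div children ordered by their sequence number."""
    return sorted(divisions.findall("div"), key=lambda d: int(d.get("sequence")))


ordered = [d.get("label") for d in in_base_order(ET.fromstring(DIVISIONS_XML))]
print(ordered)
```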

<divisions>

<divisions> The internal subdivisions of the work under consideration.
Module KRXManifest
Attributes
edition: If necessary, the edition for which these textual divisions are valid can be given here
Status Optional
Datatype token
Contained by
KRXManifest: edition manifest
May contain
KRXManifest: div
Content model
<content>
 <elementRef key="div"
  maxOccurs="unbounded" minOccurs="1"/>

</content>
Schema Declaration
element divisions { attribute edition { token }?, krx_div+ }

<edition>

<edition> One edition of the work. If there are multiple divisions, this indicates that the sequence of these divisions makes up the work.
Module KRXManifest
Attributes
xml:id: The identifier of the work. This will be used to refer to this manifest from the display of this text.
Status Optional
Datatype ID
id: The identifier of the edition. This is required and has to be unique within this manifest. It will be used by the processing tools to refer to this edition.
Status Required
Datatype ID
format: The parsing tool is selected based on the format given here. Two formats are defined at the moment. Additional formats can be added, but require a plugin to parse them.
Status Required
Legal values are:
xml/TEI
TEI file encoded in XML
txt/mandoku
Mandoku format
location: This gives either the relative path to the local folder containing the edition or a resolvable remote reference to the edition, for example on github.
Status Required
Datatype string
Note

TODO: format for remote reference.

TODO: Format for identifying portion of text in file.

base: The edition marked as 'base' is the reference edition for sequential reordering.
Status Optional
Legal values are:
true
This edition is the reference edition
false
Not the reference edition (default)
type: The edition has to be declared as either ‘documentary’ or ‘interpretative’.
Status Required
Legal values are:
documentary
An edition that documents an existing print source as faithfully as possible, without editorial changes.
interpretative
An edition that might be based on a print source, but possibly makes editorial changes.
role: One of the editions has to be declared as the base edition; the others are reference editions.
Status Recommended
Legal values are:
base
This edition is the base edition.
reference
All editions except the base edition are considered reference editions. [Default]
language: The language of the document, identified with an identifier according to RFC 1766.
Status Optional
Datatype language
sigle: A short identifier used to identify this edition.
Status Optional
Datatype string
Contained by
KRXManifest: editionGroup editions
May contain
KRXManifest: creation description divisions title tokenmap
Content model
<content>
 <sequence maxOccurs="1" minOccurs="1">
  <elementRef key="title" maxOccurs="1"
   minOccurs="0"/>

  <elementRef key="creation" maxOccurs="1"
   minOccurs="0"/>

  <elementRef key="description"/>
  <elementRef key="tokenmap" maxOccurs="1"
   minOccurs="0"/>

  <elementRef key="divisions"
   maxOccurs="unbounded" minOccurs="0"/>

 </sequence>
</content>
Schema Declaration
element edition
{
   attribute xml:id { xsd:ID }?,
   attribute id { xsd:ID },
   attribute format { "xml/TEI" | "txt/mandoku" },
   attribute location { string },
   attribute base { "true" | "false" }?,
   attribute type { "documentary" | "interpretative" },
   attribute role { "base" | "reference" }?,
   attribute language { xsd:language }?,
   attribute sigle { string }?,
   (
      krx_title?,
      krx_creation?,
      krx_description,
      krx_tokenmap?,
      krx_divisions*
   )
}
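
Where a full RELAX NG validator is not at hand, the constraints of this declaration can be approximated with a few explicit checks. The Python sketch below is such an approximation, not a replacement for schema validation: it tests only the required attributes, the enumerated values and the mandatory description child. The function name edition_problems and the sample editions are hypothetical, and the namespace is omitted.

```python
import xml.etree.ElementTree as ET

# Constraints copied from the schema declaration for edition.
REQUIRED = ("id", "format", "location", "type")
LEGAL = {
    "format": {"xml/TEI", "txt/mandoku"},
    "type": {"documentary", "interpretative"},
    "base": {"true", "false"},
    "role": {"base", "reference"},
}


def edition_problems(edition):
    """List violations of the edition declaration (a partial check only)."""
    problems = []
    for name in REQUIRED:
        if edition.get(name) is None:
            problems.append(f"missing required attribute: {name}")
    for name, legal in LEGAL.items():
        value = edition.get(name)
        if value is not None and value not in legal:
            problems.append(f"illegal value for {name}: {value}")
    if edition.find("description") is None:
        problems.append("missing required child: description")
    return problems


good = ET.fromstring(
    '<edition id="ed_A" format="xml/TEI" location="editions/ed_A" '
    'type="documentary"><description>Hypothetical edition</description></edition>'
)
bad = ET.fromstring('<edition id="ed_B" format="pdf" type="documentary"/>')

print(edition_problems(good))
print(edition_problems(bad))
```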

<editionGroup>

<editionGroup> A group of the editions representing the work under consideration.
Module KRXManifest
Attributes
type: The treatment of the editions within this group is based on the value of this attribute.
Status Required
Legal values are:
root
The root text of this work.
root+annotation
The root text, interspersed with commentary.
annotation
Commentary to the root text, without repeating the text.
translation
Translations of the text and / or commentary.
other
Texts that are grouped with this text for some reason other than being textually related.
sigle: A short identifier used to identify this group of editions.
Status Optional
Datatype string
Contained by
KRXManifest: editions
May contain
KRXManifest: creation edition title
Content model
<content>
 <sequence maxOccurs="1" minOccurs="1">
  <elementRef key="title" maxOccurs="1"
   minOccurs="0"/>

  <elementRef key="creation" maxOccurs="1"
   minOccurs="0"/>

  <elementRef key="edition"
   maxOccurs="unbounded" minOccurs="1"/>

 </sequence>
</content>
Schema Declaration
element editionGroup
{
   attribute type
   {
      "root" | "root+annotation" | "annotation" | "translation" | "other"
   },
   attribute sigle { string }?,
   ( krx_title?, krx_creation?, krx_edition+ )
}

<editions>

<editions> The editions representing the work under consideration. Work is taken in a very broad sense here.
Module KRXManifest
Contained by
KRXManifest: manifest
May contain
KRXManifest: edition editionGroup
Content model
<content>
 <alternate maxOccurs="1" minOccurs="1">
  <elementRef key="editionGroup"
   maxOccurs="unbounded" minOccurs="1"/>

  <elementRef key="edition"
   maxOccurs="unbounded" minOccurs="1"/>

 </alternate>
</content>
Schema Declaration
element editions { krx_editionGroup+ | krx_edition+ }
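
The declaration is a choice: an editions element holds either one or more editionGroup elements or one or more edition elements, never a mixture. A small Python sketch of that rule follows; the function name editions_ok is an assumption, and the sample fragments leave out the attributes and children of edition for brevity, so they are not complete manifests.

```python
import xml.etree.ElementTree as ET


def editions_ok(editions):
    """True if the children are all editionGroup or all edition (at least one)."""
    tags = {child.tag for child in editions}
    return tags == {"editionGroup"} or tags == {"edition"}


# Abbreviated, hypothetical fragments.
grouped = ET.fromstring(
    '<editions><editionGroup type="root"><edition/></editionGroup></editions>'
)
flat = ET.fromstring("<editions><edition/><edition/></editions>")
mixed = ET.fromstring(
    '<editions><editionGroup type="root"><edition/></editionGroup><edition/></editions>'
)

for name, sample in (("grouped", grouped), ("flat", flat), ("mixed", mixed)):
    print(name, editions_ok(sample))
```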

<edRef>

<edRef> Reference to this subdivision in one specific edition, identified by the key.
Module KRXManifest
Attributes
start: The sequential number of the first token of this division in the token list.
Status Optional
Datatype nonNegativeInteger
end: The sequential number of the last token of this division in the token list.
Status Optional
Datatype nonNegativeInteger
key: A reference to the edition, as defined elsewhere in this manifest.
Status Optional
Datatype IDREF
timestamp: The timestamp in ISO format, e.g. 2020-10-09T14:23:52+09:00.
Status Optional
Datatype dateTime
label: A label to identify the subdivision as used in this edition. It can be any string, but should be unique in the manifest. This can be used to access this textual division.
Status Optional
Datatype token
Contained by
KRXManifest: div
May contain Empty element
Content model
<content>
 <empty/>
</content>
Schema Declaration
element edRef
{
   attribute start { xsd:nonNegativeInteger }?,
   attribute end { xsd:nonNegativeInteger }?,
   attribute key { xsd:IDREF }?,
   attribute timestamp { xsd:dateTime }?,
   attribute label { token }?,
   empty
}

<label>

<label> Additional label
Module KRXManifest
Attributes
language: The language of the label, identified with an identifier according to RFC 1766.
Status Optional
Datatype language
Contained by
KRXManifest: div
May contain Character data only
Content model
<content>
 <textNode/>
</content>
Schema Declaration
element label { attribute language { xsd:language }?, text }

<lb>

<lb> This element marks the beginning of a new line or line-like section on the text-bearing surface.
Module derived-module-KRX
Attributes
ed: Identifier of the edition to which this line belongs
Status Optional
Datatype string
n: Number or other label used to refer to this line
Status Optional
Datatype string
xml:id
Status Recommended
Datatype ID
Contained by
KRXToken: tg
May contain Empty element
Content model
<content>
 <empty/>
</content>
Schema Declaration
element lb
{
   attribute ed { string }?,
   attribute n { string }?,
   attribute xml:id { ID }?,
   empty
}

<locationRef>

<locationRef> Reference to a location in the token file. Optionally might hold a copy of the referenced text as a string of characters.
Module KRXNexus
Attributes
ed: Identifier of the edition (as used in the token file)
Status Required
Datatype string
tp: The sequential number of the first token in the token file.
Status Required
Datatype nonNegativeInteger
tcount: The number of tokens that make up this text span.
Status Optional
Datatype nonNegativeInteger
Default 1
target: Identifier of the first token in the text span
Status Required
Datatype string
n: Label or identifier for this reference.
Status Optional
Datatype string
Contained by
KRXNexus: nexus
May contain Character data only
Content model
<content>
 <textNode/>
</content>
Schema Declaration
element locationRef
{
   attribute ed { string },
   attribute tp { xsd:nonNegativeInteger },
   attribute tcount { xsd:nonNegativeInteger }?,
   attribute target { string },
   attribute n { string }?,
   text
}

<manifest>

<manifest> The root of the manifest. One manifest describes one work.
Module KRXManifest
Attributes
xml:id: The identifier of the work. This will be used to refer to this manifest from the display of this text.
Status Optional
Datatype ID
Contained by
KRXManifest: manifests
May contain
KRXManifest: description divisions editions title
Note

Currently, only one work can be described per manifest file. Use cases that need multiple works still have to be thought through. Use several manifest elements in one file?

Content model
<content>
 <sequence maxOccurs="1" minOccurs="1">
  <elementRef key="title" minOccurs="0"/>
  <elementRef key="description"/>
  <elementRef key="editions"/>
  <elementRef key="divisions" minOccurs="0"/>
 </sequence>
</content>
Schema Declaration
element manifest
{
   attribute xml:id { xsd:ID }?,
   ( krx_title?, krx_description, krx_editions, krx_divisions? )
}

<manifests>

<manifests> Root for manifests that contain multiple manifest elements.
Module KRXManifest
Contained by
May contain
KRXManifest: manifest
Content model
<content>
 <elementRef key="manifest"
  maxOccurs="unbounded"/>

</content>
Schema Declaration
element manifests { krx_manifest+ }

<map>

<map> Map of one textual feature to a specific token type
Module derived-module-KRX
Attributes
src: Element or simple matching expression (for XML texts) or regular expression (for plain text) that identifies the textual feature
Status Optional
Datatype string
tok: Token type
Status Optional
Legal values are:
h
Token is part of a heading
p
Token is part of a paragraph
n
Token is part of a note or annotation of any kind
q
Token is part of a quotation
v
Token is part of a verse line
Contained by
KRXManifest: tokenmap
May contain Empty element
Content model
<content>
 <empty/>
</content>
Schema Declaration
element map
{
   attribute src { string }?,
   attribute tok { "h" | "p" | "n" | "q" | "v" }?,
   empty
}
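
How a tokenizer consults these mappings is not specified here, so the following Python sketch only shows one plausible reading for XML sources: src is compared literally with an element name and tok supplies the token type. The literal comparison, the fallback to "o" (the catch-all value of the role attribute on t), the function name token_type and the sample tokenmap are all assumptions; the namespace is omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical tokenmap for a TEI source.
TOKENMAP_XML = """<tokenmap>
 <map src="head" tok="h"/>
 <map src="p" tok="p"/>
 <map src="note" tok="n"/>
</tokenmap>"""


def token_type(tokenmap, element_name, default="o"):
    """Return the token type mapped to element_name, or the default.

    src is compared as a plain string here; matching expressions and
    regular expressions are outside the scope of this sketch.
    """
    for entry in tokenmap.findall("map"):
        if entry.get("src") == element_name:
            return entry.get("tok")
    return default


tokenmap = ET.fromstring(TOKENMAP_XML)
print(token_type(tokenmap, "head"), token_type(tokenmap, "l"))
```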

<nexus>

<nexus> A group of locationRef elements.
Module KRXNexus
Attributes
xml:id: The identifier of this nexus.
Status Optional
Datatype ID
tp: The sequential number of the first token of this text span.
Status Required
Datatype nonNegativeInteger
tcount: The number of tokens that make up this text span.
Status Optional
Datatype nonNegativeInteger
Default 1
Contained by
KRXNexus: nexusList
May contain
KRXManifest: note
KRXNexus: locationRef
Content model
<content>
 <sequence maxOccurs="1" minOccurs="1">
  <elementRef key="note"
   maxOccurs="unbounded" minOccurs="0"/>

  <elementRef key="locationRef"
   maxOccurs="unbounded" minOccurs="0"/>

 </sequence>
</content>
Schema Declaration
element nexus
{
   attribute xml:id { xsd:ID }?,
   attribute tp { xsd:nonNegativeInteger },
   attribute tcount { xsd:nonNegativeInteger }?,
   ( krx_note*, krx_locationRef* )
}

<nexusList>

<nexusList> Root of a nexus file, containing one or more nexus elements.
Module KRXNexus
Attributes
xml:id
Status Recommended
Datatype ID
ed: Reference to the edition defined in the manifest.
Status Required
Datatype string
n: A label
Status Optional
Datatype string
Contained by
May contain
KRXManifest: note
KRXNexus: nexus
Content model
<content>
 <sequence maxOccurs="1" minOccurs="1">
  <elementRef key="note" maxOccurs="1"
   minOccurs="0"/>

  <elementRef key="nexus"
   maxOccurs="unbounded"/>

 </sequence>
</content>
Schema Declaration
element nexusList
{
   attribute xml:id { ID }?,
   attribute ed { string },
   attribute n { string }?,
   ( krx_note?, krx_nexus+ )
}

<note>

<note> An additional note
Module KRXManifest
Contained by
KRXManifest: description
KRXNexus: nexus nexusList
May contain Character data only
Content model
<content>
 <textNode/>
</content>
Schema Declaration
element note { text }

<pb>

<pb> This element marks the beginning of a new page or page-like section on the text-bearing surface.
Module derived-module-KRX
Attributes
ed: Identifier of the edition to which this page belongs
Status Optional
Datatype string
n: Number or other label used to refer to this page
Status Optional
Datatype string
xml:id
Status Recommended
Datatype ID
Contained by
KRXToken: tg
May contain Empty element
Content model
<content>
 <empty/>
</content>
Schema Declaration
element pb
{
   attribute ed { string }?,
   attribute n { string }?,
   attribute xml:id { ID }?,
   empty
}

<resp>

<resp> Person responsible for some aspect of the work
Module KRXManifest
Attributes
role
Status Optional
Datatype string
Sample values include:
author
Author
compiler
Compiler
translator
Translator
key: A key identifying this person in some reference system.
Status Optional
Datatype string
Contained by
KRXManifest: creation
May contain Character data only
Content model
<content>
 <textNode/>
</content>
Schema Declaration
element resp { attribute role { string }?, attribute key { string }?, text }

<t>

<t> A token.
Module KRXToken
Attributes
role: Token type
Status Required
Legal values are:
h
Token is part of a heading
p
Token is part of a paragraph
s
Token is part of a seg element
n
Token is part of a note or annotation of any kind
q
Token is part of a quotation
v
Token is part of a verse line
o
Token is part of a textual feature not in this list.
pos: The sequential number of this token within this element (or token type).
Status Optional
Datatype nonNegativeInteger
tp: The sequential number of this token within the whole text.
Status Required
Datatype nonNegativeInteger
f: Punctuation or other non-token text items, immediately following the token.
Status Optional
Datatype string
p: Punctuation or other non-token text items, immediately preceding the token.
Status Optional
Datatype string
n: Label or identifier of the element in the text of which this token is part. If none is available, the code generating the token file should make one up on the fly.
Status Required
Datatype string
cp: Codepoint of the token character.
Status Optional
Datatype nonNegativeInteger
position: Position and content of marks out of line, but related to this token. The description is similar to a CSS declaration in an HTML @style attribute: 'left:か;' would indicate a か syllable to the left of this token.
Status Optional
Datatype string
kundokuten: Kundoku marks related to this token.
Status Optional
Datatype string
ruby: Pronunciation marks related to this token.
Status Optional
Datatype string
Contained by
KRXToken: tg
May contain Character data only
Content model
<content>
 <textNode/>
</content>
Schema Declaration
element t
{
   attribute role { "h" | "p" | "s" | "n" | "q" | "v" | "o" },
   attribute pos { xsd:nonNegativeInteger }?,
   attribute tp { xsd:nonNegativeInteger },
   attribute f { string }?,
   attribute p { string }?,
   attribute n { string },
   attribute cp { xsd:nonNegativeInteger }?,
   attribute position { string }?,
   attribute kundokuten { string }?,
   attribute ruby { string }?,
   text
}
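
Two of these attributes lend themselves to a short illustration: cp, which records the codepoint as a non-negative integer, and position, whose value follows a CSS-like "side:mark;" pattern. The Python sketch below builds a t element for one character and splits a position value into its parts. The helper names make_token and parse_position are hypothetical, the decimal rendering of cp is an assumption based on its datatype, and the parsing rule is inferred from the single example 'left:か;' given above.

```python
import xml.etree.ElementTree as ET


def make_token(char, tp, role, n):
    """Create a t element for one character, recording its codepoint in cp."""
    token = ET.Element(
        "t", {"role": role, "tp": str(tp), "n": n, "cp": str(ord(char))}
    )
    token.text = char
    return token


def parse_position(value):
    """Split a value such as 'left:か;' into a {side: mark} dictionary."""
    marks = {}
    for part in value.split(";"):
        if ":" in part:
            side, mark = part.split(":", 1)
            marks[side.strip()] = mark.strip()
    return marks


token = make_token("道", tp=0, role="p", n="p1")
print(ET.tostring(token, encoding="unicode"))
print(parse_position("left:か;"))
```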

<tg>

<tg> A group of tokens.
Module KRXToken
Attributes
xml:id: The identifier of this token group.
Status Optional
Datatype ID
n: A label
Status Optional
Datatype string
role: Token group type
Status Optional
Legal values are:
h
Token group is a heading
p
Token group is (part of) a paragraph
s
Token group is a seg element
n
Token group is (part of) a note or annotation of any kind
q
Token group is (part of) a quotation
v
Token group is (part of) a verse line
o
Token group is (part of) a textual feature not in this list.
position: Position and content of marks out of line, but related to this token. The description is similar to a CSS declaration in an HTML @style attribute: 'left:か;' would indicate a か syllable to the left of this token.
Status Optional
Datatype string
kundokuten: Kundoku marks related to this token.
Status Optional
Datatype string
ruby: Pronunciation marks related to this token.
Status Optional
Datatype string
Contained by
KRXToken: tList tg
May contain
KRXToken: t tg
derived-module-KRX: lb pb
Content model
<content>
 <alternate maxOccurs="unbounded"
  minOccurs="0">

  <elementRef key="tg"
   maxOccurs="unbounded" minOccurs="0"/>

  <elementRef key="t" maxOccurs="unbounded"
   minOccurs="0"/>

  <elementRef key="pb"
   maxOccurs="unbounded" minOccurs="0"/>

  <elementRef key="lb"
   maxOccurs="unbounded" minOccurs="0"/>

 </alternate>
</content>
Schema Declaration
element tg
{
   attribute xml:id { xsd:ID }?,
   attribute n { string }?,
   attribute role { "h" | "p" | "s" | "n" | "q" | "v" | "o" }?,
   attribute position { string }?,
   attribute kundokuten { string }?,
   attribute ruby { string }?,
   ( krx_tg* | krx_t* | krx_pb* | krx_lb* )*
}
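
Because tg elements nest and lb and pb milestones may appear at any level, code that needs line information has to walk the group in document order. The Python sketch below assigns each token to the most recent lb. The sample group, its labels and the function name tokens_by_line are invented for illustration, and the namespace is omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical token group with a nested note and two line breaks.
TG_XML = """<tg role="p" n="p1">
 <pb ed="ed_A" n="1a"/>
 <lb ed="ed_A" n="1a.1"/>
 <t role="p" tp="0" n="p1">天</t>
 <tg role="n" n="p1.n1">
  <t role="n" tp="1" n="p1.n1">注</t>
  <lb ed="ed_A" n="1a.2"/>
  <t role="n" tp="2" n="p1.n1">文</t>
 </tg>
 <t role="p" tp="3" n="p1">地</t>
</tg>"""


def tokens_by_line(group):
    """Group token text by the n label of the preceding lb element."""
    lines = {}
    current = None  # tokens before the first lb are filed under None
    for element in group.iter():  # document order, descends into nested tg
        if element.tag == "lb":
            current = element.get("n")
        elif element.tag == "t":
            lines.setdefault(current, []).append(element.text)
    return lines


lines = tokens_by_line(ET.fromstring(TG_XML))
print(lines)
```

Note that the line break inside the nested note also governs the tokens that follow the note, since milestones mark positions on the text-bearing surface rather than units of the group hierarchy.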

<title>

<title> Title of the work.
Module KRXManifest
Contained by
KRXManifest: creation description edition editionGroup manifest
May contain Character data only
Content model
<content>
 <textNode/>
</content>
Schema Declaration
element title { text }

<tList>

<tList> Root of a token file, containing one or more tg elements.
Module KRXToken
Attributes
xml:id
Status Recommended
Datatype ID
ed: Reference to the edition defined in the manifest.
Status Required
Datatype string
n: A label
Status Optional
Datatype string
fileseq: If the tokens are spread over several files, this gives the sequential number of the file.
Status Optional
Datatype nonNegativeInteger
Contained by
May contain
KRXToken: tg
Content model
<content>
 <elementRef key="tg" maxOccurs="unbounded"/>
</content>
Schema Declaration
element tList
{
   attribute xml:id { ID }?,
   attribute ed { string },
   attribute n { string }?,
   attribute fileseq { xsd:nonNegativeInteger }?,
   krx_tg+
}
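
When the tokens of one edition are split over several files, fileseq gives the order in which the files belong together. The Python sketch below merges two such files; the documents, the edition identifier ed_A and the function name merged_tokens are hypothetical, files without fileseq are sorted first as an arbitrary choice of this sketch, and the namespace is omitted.

```python
import xml.etree.ElementTree as ET

# Two hypothetical token files for the same edition, listed out of order.
FILES = [
    '<tList ed="ed_A" fileseq="2"><tg>'
    '<t role="p" tp="2" n="p2">三</t></tg></tList>',
    '<tList ed="ed_A" fileseq="1"><tg>'
    '<t role="p" tp="0" n="p1">一</t>'
    '<t role="p" tp="1" n="p1">二</t></tg></tList>',
]


def merged_tokens(documents):
    """Return the token text of all files, ordered by fileseq."""
    roots = sorted(
        (ET.fromstring(doc) for doc in documents),
        key=lambda root: int(root.get("fileseq", "0")),
    )
    return [t.text for root in roots for t in root.iter("t")]


merged = merged_tokens(FILES)
print(merged)
```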

<tokenmap>

<tokenmap> Mappings from textual features to token types
Module KRXManifest
Contained by
KRXManifest: edition
May contain
derived-module-KRX: map
Content model
<content>
 <elementRef key="map"
  maxOccurs="unbounded" minOccurs="1"/>

</content>
Schema Declaration
element tokenmap { krx_map+ }
Notes
1
There are in fact two possible root elements, the other being manifests for a grouping of manifest elements.