Christian Wittern: Digital Archiving, Saitama University Sep. 2011

Tentative outline of the course.

Introduction
Digital Text
Basic Technologies
- More than text: Markup
- Getting it together: XML
Living in a connected world

Introduction

What is digital?

it reduces everything to 1 and 0

it requires interpretation

it is easily, identically copied

it can be distributed instantly worldwide

it has no intrinsic architecture, but can give rise to quite a few
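
The points about 1 and 0, interpretation and identical copies can be made concrete with a small sketch. Python is used here only for illustration and the two byte values are chosen arbitrarily: the same bytes are just a bit pattern until a reader decides whether to interpret them as a number or as text.

```python
# Two bytes, chosen arbitrarily for this illustration.
data = bytes([0x41, 0x42])

# The raw bit pattern: everything digital is ultimately 1s and 0s.
bits = "".join(f"{byte:08b}" for byte in data)

# Interpretation 1: one unsigned big-endian integer.
as_number = int.from_bytes(data, byteorder="big")

# Interpretation 2: two characters in the ASCII encoding.
as_text = data.decode("ascii")

# A copy is bit-for-bit identical to the original.
copy = bytes(data)

print(bits)          # 0100000101000010
print(as_number)     # 16706
print(as_text)       # AB
print(copy == data)  # True
```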

Modelling

See Willard McCarty, 'Depth, markup and modelling', and Humanities Computing (Palgrave, 2005), Chapter 2.

Analogy
analogy (アナロジー): reasoning by analogy; comparison of similar things. -K

Representation
representation (noun, sense 1): depiction (描写), display (表示), expression (表現)

Diagram
図表 (chart, diagram)

Map
map (マップ): a map (地図), usually one made for a specific purpose, as in 'road map' or 'tourist map'. -K

Simulation
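simulation (シミュレーション): imitating the behaviour of a physical, ecological, social or other system by means of the behaviour of another system, or of a computer, that is governed by nearly the same laws. Cf. simulation game. -K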

Experiment
experiment (エクスペリメント): an experiment (実験). -K

Towards a philosophy of modelling

Text, only text: Characters

Character encoding history
Standards…
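- IEC International Standards 100 years (国際標準化100年記念事業について): "Standards such as JIS, ISO and IEC play a major role as infrastructure that is indispensable to our economy, society and daily life. This year, 2006 (Heisei 18), marks the 100th anniversary of the founding in 1906 of the IEC (International Electrotechnical Commission (*1)), the international standardization body for the electrical and electronic fields. For Japan, which took part in the IEC founding meeting held in London in 1906, this year also marks 100 years of participation in international standardization activities." (http://www.standard100.jp/international/index.html)
- A brief history of kanji character codes (漢字コードの簡史): http://www.kanji.zinbun.kyoto-u.ac.jp/~wittern/koushuukai/2005/index3.html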

Japanese character encodings

JIS, Shift JIS, ISO-2022-JP, EUC-JP, MS Kanji
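
As a rough illustration of why these encodings matter, the sketch below encodes the same two characters with several codecs from Python's standard library and shows that the resulting bytes differ. The sample word is arbitrary, and treating the `cp932` codec as a stand-in for MS Kanji is an assumption made for this example.

```python
# The word "kanji" written in kanji; any Japanese text would do.
text = "漢字"

# Codec names as registered in Python's standard library.
# "cp932" is Microsoft's variant of Shift JIS (assumed here to stand for MS Kanji).
codecs = ["shift_jis", "cp932", "euc_jp", "iso2022_jp", "utf-8"]

encoded = {name: text.encode(name) for name in codecs}

for name, raw in encoded.items():
    print(f"{name:12} {len(raw):2} bytes  {raw.hex(' ')}")

# Reading bytes with the wrong encoding does not give back the text.
sjis_bytes = encoded["shift_jis"]
misread = sjis_bytes.decode("euc_jp", errors="replace")
print(misread == text)  # False
```

The same characters therefore need a label saying which encoding was used, which is one reason a single universal character set is attractive.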

Unicode
- What is Unicode? cf Unicode Home Page
- Unicode 5.0 was released in 2006
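
A minimal sketch of what Unicode provides: every character has one code point and one name, independent of how it is later stored as bytes. The helper `describe` and the three sample characters are illustrative choices, not part of any standard.

```python
import unicodedata

def describe(char):
    """Return the code point, the Unicode name and the UTF-8 bytes of one character."""
    code_point = f"U+{ord(char):04X}"
    name = unicodedata.name(char)
    utf8 = char.encode("utf-8").hex(" ")
    return code_point, name, utf8

for char in "Aあ漢":
    print(char, *describe(char))
```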

Digital Text

How to represent a text in digital form?

As facsimile / graphic?
As transcribed text?
As read text?
As performed text?

Text as script act

Peter Shillingsburg has put forward the idea of 'text as script act'
'I mean every sort of act conducted in relation to written and printed texts, including every act of reproduction and every act of reading' (p. 40)

Multi-layered model of textual dimensions

A presentation on this topic: Digital Text, Meaning and the World

Physical dimension
Visual dimension
Semiotic dimension
Phonetic dimension
Structural dimension
Semantic dimension
Temporal and spatial dimension
Intertextual dimension

Basic Technologies

More than text: Markup

The Concept of Text

Depending on the context of its usage, text can be:
- In everyday usage, a broad term for something written to express something.
- In linguistics, a communicative act, fulfilling the principles of textuality.
- In literary theory, text is the object studied, be it a novel, poem, film, advertisement or anything else with a linguistic component. This broad use is inspired by semiotics and cultural studies of the 1980s.
- In information processing, text refers to character data.
Text can be very simple or have a complex structure. Structure usually makes it easier to understand. Text depends on some notation, usually a script made up of characters.

Text Encoding

Presentational Markup

Presentational markup arranges the electronic text in the same way as it is intended to appear on the page, using space characters, line-feed characters etc. The result can be sent to an output device (a printer or screen) without further processing.

Procedural Markup

Procedural markup inserts special codes into the electronic text to produce the desired effect in the output. A special software program, called a ‘formatter’, interprets these codes and produces an intermediate version of the electronic text that is then sent to the output device.
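
How a formatter treats procedural markup can be sketched as follows. The dot-commands `.center`, `.left` and `.indent` are invented for this illustration (they only resemble the style of older text formatters), and the function name is hypothetical.

```python
def format_procedural(source, width=40):
    """Interpret invented dot-commands and return the laid-out lines."""
    output = []
    mode = "left"   # current formatting state, changed by commands
    indent = 0
    for line in source.splitlines():
        if line.startswith(".center"):
            mode = "center"
        elif line.startswith(".left"):
            mode = "left"
        elif line.startswith(".indent"):
            # ".indent 4" sets the left margin to four spaces
            indent = int(line.split()[1])
        elif mode == "center":
            output.append(line.center(width).rstrip())
        else:
            output.append(" " * indent + line)
    return output

source = """.center
Digital Archiving
.left
.indent 4
What is digital?"""

for line in format_procedural(source, width=30):
    print(line)
```

Note that the codes only say what the formatter should do; nothing in the source records that the first line is a title.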

Descriptive Markup

Instead of inserting the desired formatting codes directly, descriptive markup inserts information into the text that describes or identifies features of a text. This introduces an additional layer of abstraction, and information about how to render these features can be held separately. A formatter then uses both the descriptive markup and the formatting information together to produce the desired result.

Descriptive markup offers many advantages, both for the authoring, composition or transcription of texts and for their publication. Some of the advantages are:
- Composition is simplified
- Structure-oriented editing is supported
- Natural editing tools are supported
- Alternative views of a text are possible
- Formatting can be generically specified and modified
- Indexes, appendices, etc. can be automated
- Many output devices can be supported
- Portability and durability are maximized
- Information retrieval is supported
- Analytical procedures are supported
Descriptive markup is thus the most versatile and flexible method of markup, and it is the form of markup that we will apply here.
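
Two of the advantages listed above, alternative views of a text and formatting that is specified separately, can be sketched in a few lines. The element names and the two style tables are made up for this example, and escaping of special characters for HTML is deliberately left out.

```python
import xml.etree.ElementTree as ET

# Descriptive markup: the tags say what each part is, not how it looks.
document = (
    "<text><title>What is digital?</title>"
    "<para>It reduces everything to <term>1 and 0</term>.</para></text>"
)

# Formatting rules live outside the text: (prefix, suffix) per element.
plain_style = {"title": ("== ", " ==\n"), "para": ("", "\n"), "term": ("*", "*")}
html_style = {"title": ("<h1>", "</h1>\n"), "para": ("<p>", "</p>\n"), "term": ("<em>", "</em>")}

def render(element, style):
    """Recursively render an element using the given style table."""
    prefix, suffix = style.get(element.tag, ("", ""))
    parts = [prefix, element.text or ""]
    for child in element:
        parts.append(render(child, style))
        parts.append(child.tail or "")  # text that follows the child element
    parts.append(suffix)
    return "".join(parts)

root = ET.fromstring(document)
print(render(root, plain_style))
print(render(root, html_style))
```

The marked-up text is written once; adding a third output format would only require another style table.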

Text as an Ordered Hierarchy of Content Objects (OHCO)

The many advantages of descriptive markup, and the methodologies used to implement it, were so successful that to some people it seemed to be not only a handy way of working with text, but deeply, profoundly ‘correct’: ‘descriptive markup is not just the best approach … it is the best imaginable approach’ (Coombs et al., 1987). This view assumes that only the model underlying descriptive markup reflects a correct view of ‘what text really is’ (DeRose et al., 1990). In this model, a text is determined by its logical structure as a nested hierarchy of chapters, sections, paragraphs, sentences and so on, and not by features of its physical representation, such as pages, columns, lines, font shifts and spacing. According to this view, a text is simply an ‘Ordered Hierarchy of Content Objects’ (OHCO), and descriptive markup works well because it identifies that hierarchy and makes it explicit.

The OHCO view provides a powerful model for text encoding and allows elegant and convenient handling of many features and constraints found in real-life texts (for example, a section header is always at the start of a section, not somewhere in the middle; the lines of a poem are within a stanza). There are also limits, however, and cases where it cannot be applied: sentences do not nest within lines of poetry, for example, and a quotation may be interrupted by an authorial voice. Nevertheless, it proved so successful and so clearly superior to other views of a text (for example, as a simple sequence of characters) that it became the dominant view underlying markup languages like SGML and XML, and it is widely applied in text encoding.
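
Both the strength and the limit of the OHCO view can be shown with an XML parser. In this sketch the element names and the verse are invented; the first string is a clean hierarchy, while the second tries to mark a sentence that crosses a line boundary and is rejected as not well-formed.

```python
import xml.etree.ElementTree as ET

# A well-formed hierarchy: lines nest inside a stanza, the stanza inside a poem.
poem = "<poem><stanza><line>First line,</line><line>second line.</line></stanza></poem>"

def outline(element, depth=0):
    """Return the element hierarchy as indented tag names."""
    rows = ["  " * depth + element.tag]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows

print("\n".join(outline(ET.fromstring(poem))))

# A sentence that starts in one line and ends in the next overlaps the
# line boundaries, so it cannot be expressed as a single hierarchy.
overlapping = "<poem><line><s>A sentence that runs</line><line>over the line break.</s></line></poem>"

def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed(poem))         # True
print(is_well_formed(overlapping))  # False
```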

Getting it together: XML

See: The TEI Guidelines: 2 A Gentle Introduction to XML, and XML 概論 (Introduction to XML, PDF)

Living in a connected world

Self describing data

Markup allows the description of texts on four different levels:
- description of the physical status of the text: the medium, the textual transmission etc.
- description of the structure of the text
- content-related description of the text
- linguistic or metric annotation of the text
The most widely used schema for academic purposes is that by the Text Encoding Initiative (TEI), but there is also DocBook (DocBook.org) and of course HTML: HTML 4.01 Specification.

Metadata for Digital Objects (Resource Description)
- Dublin Core Metadata Initiative (DCMI)
  
  The Dublin Core Metadata Initiative is an open forum engaged in the development of interoperable online metadata standards that support a broad range of purposes and business models. DCMI's activities include consensus-driven working groups, global conferences and workshops, standards liaison, and educational efforts to promote widespread acceptance of metadata standards and practices.
  
  Manages DCMI Metadata Terms, a widely used list of identifiers, last updated 2006-08-28. Japanese site here: Dublin Core Metadata Initiative Japanese ver.
  
  DCMI terms are frequently embedded in other documents, such as HTML files.
  
  All the following are available from the Library of Congress, U.S.A.
- MARC 21 formats - Representation and communication of descriptive metadata about information items
  
  The MARC formats are standards for the representation and communication of bibliographic and related information in machine-readable form.
- MARCXML - MARC 21 data in an XML structure
  
  The Library of Congress' Network Development and MARC Standards Office is developing a framework for working with MARC data in an XML environment. This framework is intended to be flexible and extensible to allow users to work with MARC data in ways specific to their needs. The framework itself includes many components such as schemas, stylesheets, and software tools.
  
  Lossless expression of MARC in XML.
- MODS (Metadata Object Description Schema) - XML markup for selected metadata from existing MARC 21 records as well as original resource description
  
  The Library of Congress' Network Development and MARC Standards Office, with interested experts, has developed a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications. As an XML schema, the "Metadata Object Description Schema" (MODS) is intended to be able to carry selected data from existing MARC 21 records as well as to enable the creation of original resource description records. It includes a subset of MARC fields and uses language-based tags rather than numeric ones, in some cases regrouping elements from the MARC 21 bibliographic format. MODS is expressed using the XML schema language of the World Wide Web Consortium. The standard is maintained by the Network Development and MARC Standards Office of the Library of Congress with input from users.
  
  Version 3.2, 2006-06-01.
  
  Japanese version: メタデータオブジェクトディスクリプションスキーマ (Metadata Object Description Schema). Sample: 愛淑大図サンプル：MODS
- MADS (Metadata Authority Description Schema) - XML markup for selected authority data from MARC 21 records as well as original authority data
  
  Authority files for use with MODS.
- EAD (Encoded Archival Description) - XML markup designed for encoding finding aids
  
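  See: EADの概要と日本における動向 (An overview of EAD and current trends in Japan)
  
  EAD is also used by the NIJL joint projects: 収蔵アーカイブズ検索手段EAD/XML化 (converting finding aids for archival holdings to EAD/XML), EAD/XML-FA project at NIJL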
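
The remark in the list above that DCMI terms are frequently embedded in HTML files can be illustrated with a short sketch. It follows the common convention of `<meta>` elements whose names carry a `DC.` prefix; the function name is hypothetical, and the sample values are taken from this document's own title and footer.

```python
from html import escape

# Values taken from this document's own title and footer.
record = {
    "title": "Digital Archiving",
    "creator": "Christian Wittern",
    "date": "2011-09-06",
}

def dc_meta_tags(fields):
    """Return HTML <head> lines that carry the given Dublin Core elements."""
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for element, value in fields.items():
        content = escape(value, quote=True)  # protect quotes and angle brackets
        lines.append(f'<meta name="DC.{element}" content="{content}">')
    return lines

print("\n".join(dc_meta_tags(record)))
```

The richer schemas listed above (MARCXML, MODS, EAD) serve the same purpose of self-description, but as separate XML records rather than as a few fields in a page header.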

Digital Libraries - under the hood

Digital Library Standards
- METS
  Metadata Encoding and Transmission Standard (METS) Official Web Site
  The METS schema is a standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library, expressed using the XML schema language of the World Wide Web Consortium. The standard is maintained in the Network Development and MARC Standards Office of the Library of Congress, and is being developed as an initiative of the Digital Library Federation.
  
  Usage Example: METS Navigator from the Indiana University Digital Library Program

Software for DL: DSpace
- What is it?
  DSpace Federation
  The DSpace digital repository system (developed by MIT and Hewlett-Packard Labs) captures, stores, indexes, preserves, and distributes digital research material.
  
  Research institutions worldwide use DSpace as an institutional repository, a learning object repository, for records management, and more. The DSpace open source platform is freely available so you can customize and extend it to suit your needs.
- Users
  DSpace at Waseda University: Home
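  DSpace@Waseda University is an institutional repository that preserves and makes publicly available electronic scholarly information created by the university's researchers and others, such as journal articles, theses and dissertations, departmental bulletin papers, working papers and conference proceedings.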
  
  There are also Fedora (Cornell University and University of Virginia Library) and Greenstone Digital Library Software (New Zealand Digital Library Project at the University of Waikato), all of them open source software.

Open Archives Initiative (OAI)

The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. The Open Archives Initiative has its roots in an effort to enhance access to e-print archives as a means of increasing the availability of scholarly communication. Continued support of this work remains a cornerstone of the Open Archives program.

Open Archives Initiative - Protocol for Metadata Harvesting - v.2.0

Japanese translation of OAI-PMH 2.0 (OAI-PMH2.0日本語訳)

Open Archives Initiative - Repository Explorer
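
A minimal sketch of what metadata harvesting looks like in practice: a harvester sends an HTTP request with a `verb` parameter and receives XML containing Dublin Core records. The base URL below is hypothetical, and the response is a shortened, hand-written sample (a real response carries further elements such as the response date), so no network access is needed.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical repository address; substitute a real OAI-PMH base URL.
BASE_URL = "http://repository.example.org/oai"

def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL in the form used by OAI-PMH 2.0."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query

# A shortened, hand-written response used instead of a live request.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Digital Archiving</dc:title>
          <dc:creator>Christian Wittern</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def harvest_titles(response_xml):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(response_xml)
    pairs = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        pairs.append((identifier, title))
    return pairs

print(list_records_url(BASE_URL))
print(harvest_titles(SAMPLE_RESPONSE))
```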

Closing: what to take away from here

Small test project: learning by doing

To learn more about the topics presented here, it is best to start with a small project and proceed from there. Digital archiving, like anything that has to do with digital technology, can only be learned by doing it, not by listening to an instructor!

Date: 2011-09-06 21:12:08 JST

Author: Christian Wittern
