Christian Wittern: Digital Archiving, Saitama University Sep. 2011

Tentative outline of the course.


Introduction

What is digital?

  • it reduces everything to 1 and 0
  • it requires interpretation
  • it is easily, identically copied
  • it can be distributed instantly worldwide
  • it has no intrinsic architecture, but can give rise to quite a few
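
The first three points can be tried out directly. The following is a minimal sketch, not part of the original outline; the three byte values are chosen only for illustration, and the variable names are hypothetical.

```python
# The same three bytes, read under four different interpretations.
data = bytes([0xE6, 0x96, 0x87])

as_bits = " ".join(f"{b:08b}" for b in data)   # the raw 1s and 0s
as_number = int.from_bytes(data, "big")        # one unsigned integer
as_utf8 = data.decode("utf-8")                 # one character when read as UTF-8
as_cp1252 = data.decode("cp1252")              # three characters when read as Windows-1252

print(as_bits)
print(as_number)
print(as_utf8)
print(as_cp1252)

# Copying is exact: the copy cannot be told apart from the original.
copy = bytes(data)
print(copy == data)
```

None of the readings is more "real" than the others; the bytes carry no indication of which interpretation was intended.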

Modelling

See Willard McCarty, "Depth, markup and modelling", and Humanities Computing (Palgrave, 2005), Chapter 2.

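  • Analogy
    analogy: inference by analogy; comparison. -K
  • Representation
    representation (noun, sense 1): depiction, display, expression
  • Diagram
    chart, figure
  • Map
    map: a map, usually one made for a particular purpose, as in "road map" or "tourist map". -K
  • Simulation
    simulation: imitating the behaviour of a physical, ecological, social or other system by means of the behaviour of another system, or of a computer, that is governed by nearly the same laws. Compare: simulation game. -K
  • Experiment
    experiment: experiment. -K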
  • Towards a philosophy of modelling

Text, only text: Characters

  • Character encoding history
    Standards…
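    • IEC International Standards 100 years: About the commemorative project for 100 years of international standardization (国際標準化100年記念事業について). "Standards such as JIS, ISO and IEC play a major role as infrastructure that is indispensable to our economy, society and daily life. This year, 2006 (Heisei 18), marks the 100th anniversary of the founding in 1906 of the IEC (International Electrotechnical Commission (*1)), the international standardization body for the electrical and electronic fields. For Japan, which took part in the IEC founding meeting held in London in 1906, this year also marks 100 years of participation in international standardization activities." (http://www.standard100.jp/international/index.html)
    • A brief history of kanji codes (漢字コードの簡史): http://www.kanji.zinbun.kyoto-u.ac.jp/~wittern/koushuukai/2005/index3.html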
  • Japanese character encodings

    JIS, ShiftJIS, ISO 2022-JP, EUC-JP, MS Kanji
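
To see how these encodings differ in practice, the sketch below encodes one short string with the codecs that ship with Python. The sample string and the dictionary labels are assumptions made for illustration. JIS itself (JIS X 0208) is the character set on which the other encodings are based, so it does not appear as a separate codec here; "cp932" is Python's name for Microsoft's variant of Shift JIS.

```python
text = "日本語"

# Label -> Python codec name (all part of the standard library).
codecs_to_try = {
    "Shift JIS": "shift_jis",
    "ISO-2022-JP": "iso2022_jp",
    "EUC-JP": "euc_jp",
    "MS Kanji (code page 932)": "cp932",
    "UTF-8 (for comparison)": "utf-8",
}

encoded = {label: text.encode(codec) for label, codec in codecs_to_try.items()}

for label, raw in encoded.items():
    print(f"{label:26} {len(raw):2} bytes  {raw.hex(' ')}")

# Reading EUC-JP bytes as if they were Shift JIS produces garbled text
# (mojibake); errors="replace" marks bytes that cannot be decoded at all.
wrong = encoded["EUC-JP"].decode("shift_jis", errors="replace")
print(wrong)
```

The same three characters come out as different byte sequences under each encoding, which is why a file's encoding has to be known, or recorded, before the text can be read reliably.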

Digital Text

How to represent a text in digital form?

  • As facsimile / graphic?
  • As transcribed text?
  • As read text?
  • As performed text?

Text as script act

  • Peter Shillingsburg has put forward the idea of 'text as script act'
  • 'I mean every sort of act conducted in relation to written and printed texts, including every act of reproduction and every act of reading' (p. 40)

Multi-layered model of textual dimensions

A presentation on this topic: Digital Text, Meaning and the World

  • Physical dimension
  • Visual dimension
  • Semiotic dimension
  • Phonetic dimension
  • Structural dimension
  • Semantic dimension
  • Temporal and spatial dimension
  • Intertextual dimension

Basic Technologies

More than text: Markup

  • The Concept of Text

    Depending on the context of its usage, text can be:

    • In everyday usage, a broad term for something written to express something.
    • In linguistics, a communicative act, fulfilling the principles of textuality.
    • In literary theory, text is the object studied, be it a novel, poem, film, advertisement or anything else with a linguistic component. This broad use is inspired by semiotics and cultural studies of the 1980s.
    • In information processing, text refers to character data.

    Text can be very simple or have a complex structure. Structure usually makes it easier to understand. Text depends on some notation, usually a script made up of characters. Text Encoding

  • Presentational Markup

    Presentational markup lays out the electronic text in the same way as it is intended to appear on the page, using space characters, line-feed characters etc. The result can be sent to an output device (a printer or screen) without further processing.

  • Procedural Markup

    Procedural markup inserts special codes into the electronic text to produce the desired effect in the output. A special software program, called a ‘formatter’, is used to interpret these codes and produce an intermediate version of the electronic text that is then sent to the output device.

  • Descriptive Markup

    Instead of inserting the desired formatting codes directly, descriptive markup inserts information into the text that describes or identifies features of a text. This introduces an additional layer of abstraction, and information about how to render these features can be held separately. A formatter then uses both the descriptive markup and the formatting information together to produce the desired result.

    Descriptive markup offers many advantages, both for authoring (the composition or transcription of texts) and for publication. Some of the advantages are:

    • Composition is simplified
    • Structure-oriented editing is supported
    • Natural editing tools are supported
    • Alternative views of a text are possible
    • Formatting can be generically specified and modified
    • Indexes, appendices, etc. can be automated
    • Many output devices can be supported
    • Portability and durability are maximized
    • Information retrieval is supported
    • Analytical procedures are supported

    Descriptive markup is thus the most versatile and flexible method of markup and this will be the form of markup that we will apply here.
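
The separation between descriptive markup and formatting information can be sketched in a few lines. This is an illustration under stated assumptions, not a real formatter: the element names (article, title, para), the two style tables and the function names are all hypothetical.

```python
import xml.etree.ElementTree as ET

# The text says what each part IS, not how it should look.
SOURCE = """<article>
<title>Digital Archiving</title>
<para>Text depends on some notation.</para>
<para>Structure makes it easier to understand.</para>
</article>"""

# Formatting information is held separately from the text:
# each style maps an element name to a (prefix, suffix) pair.
PLAIN_STYLE = {"title": ("== ", " =="), "para": ("  ", "")}
HTML_STYLE = {"title": ("<h1>", "</h1>"), "para": ("<p>", "</p>")}


def render(xml_source, style):
    """Combine the descriptive markup with one set of formatting rules."""
    root = ET.fromstring(xml_source)
    lines = []
    for element in root:
        prefix, suffix = style[element.tag]
        lines.append(f"{prefix}{element.text}{suffix}")
    return "\n".join(lines)


def titles(xml_source):
    """Retrieval by structure: list every title, whatever its formatting."""
    return [t.text for t in ET.fromstring(xml_source).iter("title")]


print(render(SOURCE, PLAIN_STYLE))
print(render(SOURCE, HTML_STYLE))
print(titles(SOURCE))
```

The same source yields two different views without being edited, and the titles can be retrieved without knowing how they are displayed. These are small instances of the "alternative views" and "information retrieval" advantages listed above.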

  • Text as an Ordered Hierarchy of Content Objects (OHCO)

    The many advantages of descriptive markup, and the success of the methodologies used to implement it, suggested to some people that it was not only a handy way of working with text but deeply, profoundly the ‘correct’ one: ‘descriptive markup is not just the best approach … it is the best imaginable approach’ (Coombs et al., 1987). This view assumes that the model used by the methodologies employed for descriptive markup is the only one that reflects a correct view of ‘what text really is’ (DeRose et al., 1990). In this model, a text is determined by its logical structure as a nested hierarchy of chapters, sections, paragraphs, sentences and so on, not by features of its physical representation such as pages, columns, lines, font shifts and spacing. According to this view, a text is simply an ‘Ordered Hierarchy of Content Objects’ (OHCO), and descriptive markup works well because it identifies that hierarchy and makes it explicit.

    The OHCO view provides a powerful model for text encoding and allows elegant and convenient handling of many features and constraints found in real-life texts (for example, a section header always comes at the start of a section, not somewhere in the middle, and the lines of a poem lie within a stanza). It has limits, however, and there are cases where it cannot be applied: sentences do not nest within lines of poetry, and a quotation may be interrupted by an authorial voice. Nevertheless, it proved so successful, and so clearly superior to other views of a text (for example, as a simple sequence of characters), that it became the dominant view underlying markup languages such as SGML and XML, and it is widely applied in text encoding.
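
Both the strength and the limit of the OHCO view can be shown with an XML parser. The sketch below uses hypothetical element names (poem, stanza, line, s) and made-up verse: the first sample nests cleanly, while the second contains a sentence that runs across a line break and therefore overlaps the line elements.

```python
import xml.etree.ElementTree as ET

# A clean hierarchy: lines inside a stanza inside a poem.
NESTED = (
    "<poem><stanza>"
    "<line>First line,</line><line>second line.</line>"
    "</stanza></poem>"
)

# A sentence <s> that starts in one line and ends in the next
# overlaps the <line> elements instead of nesting within them.
OVERLAPPING = (
    "<poem>"
    "<line><s>A sentence that runs</line>"
    "<line>over the line break.</s></line>"
    "</poem>"
)


def outline(element, depth=0):
    """Return (depth, tag) pairs for the element and all its descendants."""
    rows = [(depth, element.tag)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


def is_hierarchy(xml_source):
    """True if the markup parses as a single well-formed hierarchy."""
    try:
        ET.fromstring(xml_source)
        return True
    except ET.ParseError:
        return False


for depth, tag in outline(ET.fromstring(NESTED)):
    print("  " * depth + tag)

print(is_hierarchy(NESTED))
print(is_hierarchy(OVERLAPPING))
```

The parser rejects the second sample outright. Encoding such overlapping structures in XML requires a workaround, for example empty marker elements at the points where one of the structures begins and ends.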

Living in a connected world

Self describing data

  • Markup: allows description of texts on four different levels:
    • description of the physical status of the text: the medium, the textual transmission etc.
    • description of the structure of the text
    • content-related description of the text
    • linguistic or metric annotation of the text

    The most widely used schema for academic purposes is that by the Text Encoding Initiative (TEI), but there is also DocBook (DocBook.org) and of course HTML: HTML 4.01 Specification.

  • Metadata for Digital Objects (Resource Description)
    • Dublin Core Metadata Initiative (DCMI)

      The Dublin Core Metadata Initiative is an open forum engaged in the development of interoperable online metadata standards that support a broad range of purposes and business models. DCMI's activities include consensus-driven working groups, global conferences and workshops, standards liaison, and educational efforts to promote widespread acceptance of metadata standards and practices.

      DCMI manages the DCMI Metadata Terms, a widely used list of identifiers, last updated 2006-08-28. Japanese site here: Dublin Core Metadata Initiative Japanese ver.

      DCMI terms are frequently embedded in other documents, such as HTML files.
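
One common way of embedding is to put the terms into HTML meta elements whose names carry a "DC." prefix. The sketch below generates such elements from a small record; the function name is hypothetical, the title, creator and date are taken from this outline, and the language value is an assumption.

```python
from html import escape


def dc_meta_tags(record):
    """Render a dict of Dublin Core element names and values as HTML meta tags."""
    # The link element declares which vocabulary the "DC." prefix refers to.
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for term, value in record.items():
        content = escape(value, quote=True)  # keep quotes and & safe in attributes
        lines.append(f'<meta name="DC.{term}" content="{content}">')
    return "\n".join(lines)


record = {
    "title": "Digital Archiving",
    "creator": "Christian Wittern",
    "date": "2011-09-06",
    "language": "en",  # assumed for illustration
}

print(dc_meta_tags(record))
```

The resulting lines would go into the head of an HTML page, where harvesters and search tools can read them without parsing the visible content.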

      All the following are available from the Library of Congress, U.S.A.

    • MARC 21 formats - Representation and communication of descriptive metadata about information items

      The MARC formats are standards for the representation and communication of bibliographic and related information in machine-readable form.

    • MARCXML - MARC 21 data in an XML structure

      The Library of Congress' Network Development and MARC Standards Office is developing a framework for working with MARC data in an XML environment. This framework is intended to be flexible and extensible to allow users to work with MARC data in ways specific to their needs. The framework itself includes many components such as schemas, stylesheets, and software tools.

      Lossless expression of MARC in XML.

    • MODS (Metadata Object Description Schema) - XML markup for selected metadata from existing MARC 21 records as well as original resource description

      The Library of Congress' Network Development and MARC Standards Office, with interested experts, has developed a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications. As an XML schema, the "Metadata Object Description Schema" (MODS) is intended to be able to carry selected data from existing MARC 21 records as well as to enable the creation of original resource description records. It includes a subset of MARC fields and uses language-based tags rather than numeric ones, in some cases regrouping elements from the MARC 21 bibliographic format. MODS is expressed using the XML schema language of the World Wide Web Consortium. The standard is maintained by the Network Development and MARC Standards Office of the Library of Congress with input from users.

      Version 3.2, 2006-06-01.

      Japanese version: Metadata Object Description Schema (メタデータ オブジェクト ディスクリプション スキーマ). Sample: Aichi Shukutoku University Library (愛淑大図), "Sample: MODS" (サンプル:MODS)
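
The difference between numeric MARC tags and the language-based tags of MODS is easiest to see side by side. The sketch below builds the title of one record in both forms. It is a fragment for comparison only, not a complete or validated record, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("marc", MARC_NS)
ET.register_namespace("mods", MODS_NS)


def marcxml_title(title):
    """MARCXML: the title lives in field 245, subfield a."""
    record = ET.Element(f"{{{MARC_NS}}}record")
    field = ET.SubElement(
        record, f"{{{MARC_NS}}}datafield", {"tag": "245", "ind1": "0", "ind2": "0"}
    )
    ET.SubElement(field, f"{{{MARC_NS}}}subfield", {"code": "a"}).text = title
    return record


def mods_title(title):
    """MODS: the same information under the readable names titleInfo/title."""
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(info, f"{{{MODS_NS}}}title").text = title
    return mods


print(ET.tostring(marcxml_title("Digital Archiving"), encoding="unicode"))
print(ET.tostring(mods_title("Digital Archiving"), encoding="unicode"))
```

MARCXML keeps the numeric tag as an attribute value, so nothing is lost in the conversion from MARC, while MODS trades some of that detail for element names a human reader can follow.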

Digital Libraries - under the hood

  • Digital Library Standards
    • METS
      Metadata Encoding and Transmission Standard (METS) Official Web Site

      The METS schema is a standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library, expressed using the XML schema language of the World Wide Web Consortium. The standard is maintained in the Network Development and MARC Standards Office of the Library of Congress, and is being developed as an initiative of the Digital Library Federation.

      Usage Example: METS Navigator from the Indiana University Digital Library Program
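
How METS ties descriptive metadata, files and structure together can be sketched with a skeleton document for a scanned book. This is a simplified illustration that has not been validated against the METS schema: administrative metadata is left out, and the function name, identifiers and file names are hypothetical.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
DC_NS = "http://purl.org/dc/elements/1.1/"
for prefix, uri in (("mets", METS_NS), ("xlink", XLINK_NS), ("dc", DC_NS)):
    ET.register_namespace(prefix, uri)


def q(namespace, name):
    """Build a namespace-qualified name in ElementTree's {uri}name form."""
    return f"{{{namespace}}}{name}"


def build_mets(title, page_files):
    mets = ET.Element(q(METS_NS, "mets"))

    # 1. Descriptive metadata: here a single Dublin Core title.
    dmd = ET.SubElement(mets, q(METS_NS, "dmdSec"), {"ID": "DMD1"})
    wrap = ET.SubElement(dmd, q(METS_NS, "mdWrap"), {"MDTYPE": "DC"})
    xml_data = ET.SubElement(wrap, q(METS_NS, "xmlData"))
    ET.SubElement(xml_data, q(DC_NS, "title")).text = title

    # 2. File section: an inventory of the files that make up the object.
    file_sec = ET.SubElement(mets, q(METS_NS, "fileSec"))
    group = ET.SubElement(file_sec, q(METS_NS, "fileGrp"), {"USE": "master"})

    # 3. Structural map: how the files form a book made of pages.
    struct = ET.SubElement(mets, q(METS_NS, "structMap"), {"TYPE": "physical"})
    book = ET.SubElement(struct, q(METS_NS, "div"), {"TYPE": "book", "DMDID": "DMD1"})

    for number, href in enumerate(page_files, start=1):
        file_id = f"FILE{number}"
        file_el = ET.SubElement(group, q(METS_NS, "file"), {"ID": file_id})
        ET.SubElement(
            file_el, q(METS_NS, "FLocat"),
            {"LOCTYPE": "URL", q(XLINK_NS, "href"): href},
        )
        page = ET.SubElement(
            book, q(METS_NS, "div"), {"TYPE": "page", "ORDER": str(number)}
        )
        ET.SubElement(page, q(METS_NS, "fptr"), {"FILEID": file_id})
    return mets


mets = build_mets("Digital Archiving", ["page001.tif", "page002.tif"])
print(ET.tostring(mets, encoding="unicode"))
```

The structural map does not contain the files themselves; each page points to an entry in the file section through its FILEID, and the book points to its description through DMDID. A viewer such as a page-turning application can work from these pointers alone.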

  • Software for DL: DSpace
    • What is it?
      DSpace Federation

      The DSpace digital repository system (developed by MIT and Hewlett-Packard Labs) captures, stores, indexes, preserves, and distributes digital research material.

      Research institutions worldwide use DSpace as an institutional repository, a learning object repository, for records management, and more. The DSpace open source platform is freely available so you can customize and extend it to suit your needs.

    • Users
      DSpace at Waseda University: Home

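      DSpace@Waseda University is an institutional repository that preserves and makes publicly available scholarly information in electronic form, such as journal articles, theses and dissertations, departmental bulletin papers, working papers and conference proceedings, produced by the university's researchers and others.

      There are also Fedora (Cornell University and University of Virginia Library) and Greenstone Digital Library Software (New Zealand Digital Library Project at the University of Waikato); like DSpace, both are open source software.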

Closing: what to take away from here

  • Small test project: learning by doing

    To learn more about the things presented here, it is best to start with a small project and proceed from there. Digital Archiving, like anything that has to do with digital technology, can only be learned by doing it, not by listening to an instructor!

Date: 2011-09-06 21:12:08 JST

Author: Christian Wittern
