Summary of SGML/XML


Markup and SGML/XML Basics

  • What is markup and why use it?
  • SGML and its advantages

Basic Problems of Markup

  • character representation
  • representation of structure and typography
  • reduction to linear form
  • representation of analysis, interpretation, controversial information
  • capture of ancillary information

Text Encoding

We distinguish:

  • character encoding: representation of the characters of an alphabet or writing system
  • page encoding: representation of a pattern of ink on a page
  • text encoding: representation of a text with its structure and other information of interest

Text, not Pages

Some examples:

  • character encoding (ASCII, ISO 646, ISO 10646, Unicode)
  • page encoding (PostScript, SVG, TeX DVI files, PDF, RTF, etc.)
  • text encoding (SGML, XML)

Pages are physical objects.

Texts are abstract objects.
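
The gap between an abstract character and its concrete representation can be seen in a short sketch. It assumes Python 3; the sample word and the choice of ISO 8859-1 as the second encoding are illustrative assumptions and do not come from the slides.

```python
# Illustrative sample text (an assumption, chosen for its accented letter).
text = "état"

# One abstract character, identified by its Unicode code point.
code_point = f"U+{ord(text[0]):04X}"

# The same character string serialized under two different encodings.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")

print(code_point)
print(utf8_bytes)
print(latin1_bytes)

# Decoding each byte sequence with its own encoding recovers the same text.
round_trip_ok = (
    utf8_bytes.decode("utf-8") == text
    and latin1_bytes.decode("iso-8859-1") == text
)
print(round_trip_ok)
```

The byte sequences differ while the text stays the same, which is why a character encoding has to be declared or agreed on before any markup can be interpreted.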

Markup

Markup is meta-information used to record the structure, status, properties, or other characteristics of a text.

A markup language is a set of conventions governing the use of markup, especially

  • what kinds of markup are allowed, where they are allowed, and what they mean
  • what kinds of markup are required and where
  • how to tell markup from content (the text itself)

Markup Languages

A markup language must specify methods for

  • representing all the characters of the text
  • marking the structure of the text
  • reducing the text to a linear order
  • representing `extra-textual' or contextual information, e.g. analysis or documentation
  • determining whether a given document is valid or not

SGML

SGML is the Standard Generalized Markup Language (ISO 8879:1986). It

  • defines a standard abstract syntax for markup languages, with several types of markup:
    • tags, which mark occurrences of element types
    • entities
    • declarations
    • processing instructions
  • defines methods for formal definition of markup languages, which can
    • define entities and element types
    • change the standard markup delimiters

Despite its name, SGML is not a markup language, but a metalanguage.

Goals of SGML

  • separation of data from processing specifications
  • re-use of text in multiple forms by multiple processes
  • system independence and portability of data
  • addition of intelligence to data
  • well-defined structure
  • clean interfaces for specialized notations (e.g. for graphics)

XML

XML is the Extensible Markup Language, published by the World Wide Web Consortium (W3C) as a Recommendation on Feb. 10, 1998.

SGML itself was also adapted to accommodate XML: Annex K, known as WebSGML, was added to the SGML standard in 1997.

  • XML is a simplified subset of SGML
  • It leaves out many complex, but rarely used features of SGML
  • Conforming XML documents can be well-formed or valid; SGML documents must be valid
  • XML is based on Unicode; the SGML reference concrete syntax is based on ISO 646 (ASCII)
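
The difference between well-formed and valid can be illustrated with a minimal sketch. It assumes Python's standard `xml.etree.ElementTree` module, which checks well-formedness only. The helper name `is_well_formed` and the sample element names are hypothetical. Checking validity against a DTD would need a validating parser, which is not shown here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the string parses as well-formed XML.

    This does NOT check validity: no DTD or schema is consulted.
    """
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# Properly nested tags: well-formed.
good = "<text><p>Some <emph>marked-up</emph> text.</p></text>"
# Overlapping tags: not well-formed, so the parser rejects it.
bad = "<text><p>Some <emph>marked-up</p></emph></text>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

A well-formed document obeys the syntax rules of XML. A valid document is well-formed and also conforms to the declarations of its document type.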

Goals of XML

  • Remove obstacles for implementations of SGML
  • Establish a solid base for the Semantic Web
  • Portability and exchangeability of data
  • Replace HTML with a more expressive markup language

Recent developments of XML

XML was first released on Feb. 10, 1998 -- it just celebrated its fifth birthday.

Since its release XML has seen a lot of exciting developments. Some of them go well beyond what used to be possible with SGML:

  • XML Namespaces allow simultaneous use of more than one XML vocabulary
  • XML Schemas allow the description of XML documents using XML syntax
  • XSL Transformations and XSL Formatting Objects allow powerful transformations and rendering of XML documents
  • The Scalable Vector Graphics (SVG) language is an XML-based vocabulary for vector graphics
  • XML Topic Maps (XTM) has been adopted as an annex to the SGML-based Topic Maps standard (ISO/IEC 13250)
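
As an illustration of the first point, the following sketch mixes two vocabularies in one document and shows how a namespace-aware parser keeps them apart. It assumes Python's standard `xml.etree.ElementTree` module. The `doc` vocabulary and its URI are invented for the example; the SVG namespace URI is the real one.

```python
import xml.etree.ElementTree as ET

# Hypothetical document mixing two vocabularies: an invented "doc"
# vocabulary (the URI is made up) and SVG (the real SVG namespace URI).
DOCUMENT = """\
<doc:figure xmlns:doc="http://example.org/ns/doc"
            xmlns:svg="http://www.w3.org/2000/svg">
  <doc:title>A circle</doc:title>
  <svg:svg width="100" height="100">
    <svg:circle cx="50" cy="50" r="40"/>
  </svg:svg>
</doc:figure>
"""

root = ET.fromstring(DOCUMENT)

# ElementTree expands each prefixed name to "{namespace-uri}local-name",
# so elements from the two vocabularies cannot be confused.
names = [element.tag for element in root.iter()]
for name in names:
    print(name)

# Look up an element by its namespace-qualified name.
circle = root.find(".//{http://www.w3.org/2000/svg}circle")
radius = circle.get("r")
print(radius)
```

The prefixes `doc` and `svg` are only local shorthand. What identifies a vocabulary is the namespace URI they are bound to.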

Adding Intelligence to Data

  • management, tracking, version control information
  • hypertext links, cross-references, indexing
  • annotation
  • references to external figures and graphics

Adding Intelligence on top of Data

  • navigation of information spaces
  • looking for relations, abstractions, concepts
  • inference, calculus, information discovery

Major Features of XML

  • text divided into elements, which can nest
  • element boundaries marked by tags
  • elements carry generic type and other attributes
  • entity references allow string substitution for character set problems, standard boilerplate text, and document management
  • consistent use of delimiters, few special characters
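
The features listed above can be seen together in a small sketch. It assumes Python's standard `xml.etree.ElementTree` module. The document, its element names, and the entity `org` are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element and entity names are invented.
# The internal DTD subset declares one entity used as boilerplate text.
DOCUMENT = """\
<!DOCTYPE letter [
  <!ENTITY org "Text Encoding Initiative">
]>
<letter type="formal">
  <salutation>Dear members of the &org;,</salutation>
  <body><p>Elements <emph>nest</emph> inside each other.</p></body>
</letter>
"""

root = ET.fromstring(DOCUMENT)

# Attribute carried by the root element.
letter_type = root.get("type")
# The entity reference has been replaced by its declared string.
salutation = root.find("salutation").text
# An element nested three levels below the root.
emph = root.find("body/p/emph").text

print(letter_type)
print(salutation)
print(emph)
```

The parser resolves the entity reference before the application sees the text, so the boilerplate string is maintained in one place only.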

Markup within the XML Document

Everything is delimited:

  • Elements by start-tags and end-tags
  • Tags by < ... > and </ ... >
  • Entities by & ... ;

For example:

<quote lang="fra">L'état, c'est moi!</quote>

Beware: XML is case-sensitive!
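
A minimal sketch of what case sensitivity means in practice, assuming Python's standard `xml.etree.ElementTree` module. The variable names are hypothetical. The first document is the example from the text; the second differs only in the case of its start-tag.

```python
import xml.etree.ElementTree as ET

# The example from the text, parsed as-is.
quote = ET.fromstring("""<quote lang="fra">L'état, c'est moi!</quote>""")
print(quote.tag, quote.get("lang"), quote.text)

# Same content, but the start-tag and end-tag differ in case.
# To an XML parser, "Quote" and "quote" are two different element types,
# so the end-tag does not match and the document is not well-formed.
try:
    ET.fromstring("""<Quote lang="fra">L'état, c'est moi!</quote>""")
    mismatch_accepted = True
except ET.ParseError:
    mismatch_accepted = False

print(mismatch_accepted)
```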

XML is not:

  • a standard set of required tags or data elements
  • a standard set of recommended tags or data elements
  • a user interface
  • a presentation format
  • a programming language
  • a piece of software

XML Processing Model

During their lifetime, electronic texts are:

  • created by scanning or keyboarding
  • stored locally
  • exchanged over network or by disk
  • processed interactively (enriched)
  • formatted for output in a variety of different media and data formats.
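
The last step, formatting one stored text for several outputs, might look like the following sketch. It assumes Python's standard `xml.etree.ElementTree` and `html` modules. The `poem` vocabulary and the function names `to_plain_text` and `to_html` are invented for illustration, and a real project would more likely use a stylesheet language such as XSLT.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical source text in an invented vocabulary.
SOURCE = (
    "<poem><title>Example</title>"
    "<line>First line</line><line>Second line</line></poem>"
)


def to_plain_text(root):
    """Render the poem as plain text with an upper-case title."""
    lines = [root.findtext("title").upper(), ""]
    lines += [line.text for line in root.findall("line")]
    return "\n".join(lines)


def to_html(root):
    """Render the same poem as an HTML fragment."""
    parts = [f"<h1>{escape(root.findtext('title'))}</h1>"]
    parts += [f"<p>{escape(line.text)}</p>" for line in root.findall("line")]
    return "\n".join(parts)


root = ET.fromstring(SOURCE)
print(to_plain_text(root))
print(to_html(root))
```

Both outputs are derived from the same encoded text, because the markup records what each part is and leaves the appearance to the formatting step.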

Introduction to XML, Markup and the TEI Guidelines