|
Markup and SGML/XML Basics
- What is markup and why use it?
- SGML and its advantages
Basic Problems of Markup
- character representation
- representation of structure and typography
- reduction to linear form
- representation of analysis, interpretation, and controversial
information
- capture of ancillary information
Text Encoding
We distinguish:
- character encoding: representation of the
characters of an alphabet or writing system
- page encoding: representation of a pattern
of ink on a page
- text encoding: representation of a text
with its structure and other information of
interest
Text, not Pages
Some examples:
- character encoding (ASCII, ISO 646, ISO 10646,
Unicode)
- page encoding (PostScript, SVG, TeX DVI files, PDF, RTF,
etc.)
- text encoding (SGML, XML)
Pages are physical objects.
Texts are abstract objects.
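The difference between an abstract character and its encoded form can be illustrated with a minimal Python sketch. The character chosen is arbitrary and the variable names are only for illustration:

```python
# One abstract character, several byte-level encodings.
char = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

code_point = ord(char)                 # its number in Unicode / ISO 10646
utf8_bytes = char.encode("utf-8")      # two bytes in UTF-8
latin1_bytes = char.encode("latin-1")  # one byte in ISO 8859-1

# ASCII (ISO 646) has no code for this character at all.
try:
    char.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(hex(code_point), utf8_bytes, latin1_bytes, ascii_ok)
```

The character stays the same throughout; only its byte representation changes with the encoding.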
Markup
Markup is meta-information used to record the
structure, status, properties, or other characteristics of a
text.
A markup language is a set of conventions
governing the use of markup, especially
- what kinds of markup are allowed, where they are allowed,
and what they mean
- what kinds of markup are required and
where
- how to tell markup from content (the text
itself)
Markup Languages
A markup language must specify methods for
- representing all the characters of the
text
- marking the structure of the text
- reducing the text to a linear order
- representing 'extra-textual' or contextual
information, e.g. analysis or documentation
- determining whether a given document is valid or
not
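As a rough illustration of "reducing the text to a linear order", the following Python sketch builds a small tree, serializes it to a single string, and parses that string back. It uses XML syntax and the standard library module xml.etree.ElementTree. The element names (poem, stanza, line) are invented for this example.

```python
import xml.etree.ElementTree as ET

# Build a small tree; the element names are hypothetical.
poem = ET.Element("poem")
stanza = ET.SubElement(poem, "stanza")
verses = ["Tyger Tyger, burning bright,", "In the forests of the night;"]
for n, words in enumerate(verses, start=1):
    line = ET.SubElement(stanza, "line", n=str(n))
    line.text = words

# Linearize: the tree becomes one string of characters and tags.
linear = ET.tostring(poem, encoding="unicode")

# Parsing the linear form recovers the same structure.
restored = ET.fromstring(linear)
restored_lines = [line.text for line in restored.iter("line")]

print(linear)
```

The nesting of the tags in the linear string is what lets a parser rebuild the hierarchy.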
SGML
SGML is the Standard Generalized Markup
Language (ISO 8879:1986). It
- defines a standard abstract syntax for markup languages,
with several types of markup:
- tags, which mark occurrences of element
types
- entities
- declarations
- processing instructions
- defines methods for formal definition of markup languages,
which can
- define entities and element types
- change the standard markup
delimiters
Despite its name,
SGML is not a markup language, but a metalanguage.
Goals of SGML
- separation of data from processing specifications
- re-use of text in multiple forms by multiple
processes
- system independence and portability of data
- addition of intelligence to data
- well-defined structure
- clean interfaces for specialized notations (e.g. for
graphics)
XML
XML is the Extensible Markup Language, defined by
the World Wide Web Consortium (W3C) as a Recommendation on Feb.
10, 1998.
SGML was in turn amended in 1997 with Annex K (the WebSGML
Adaptations), which allows XML documents to be treated as conforming SGML.
- XML is a simplified subset of SGML
- It leaves out many complex but rarely used features of
SGML
- Conforming XML documents can be merely well-formed or also valid; SGML
documents must be valid
- XML is based on Unicode; SGML's reference concrete syntax on ISO 646 (ASCII)
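The distinction between well-formed and valid can be sketched in Python. The standard library parser checks well-formedness only and does not validate against a DTD, so validity is imitated here by a toy rule table (ALLOWED_CHILDREN). That table is an assumption made for this example and does not stand in for a real DTD or schema.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """True if the text parses as XML (tags balance and nest properly)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical toy rules: which child element types each element may contain.
ALLOWED_CHILDREN = {"letter": {"from", "body"}, "from": set(), "body": set()}

def is_valid(text: str) -> bool:
    """Well-formed AND every element obeys the toy rules above."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    for element in root.iter():
        allowed = ALLOWED_CHILDREN.get(element.tag)
        if allowed is None:
            return False  # undeclared element type
        if any(child.tag not in allowed for child in element):
            return False  # child not permitted here
    return True

good = "<letter><from>A</from><body>Hi</body></letter>"
odd = "<letter><signature>A</signature></letter>"  # well-formed, not valid here
broken = "<letter><from>A</letter>"                # tags do not nest properly

for sample in (good, odd, broken):
    print(is_well_formed(sample), is_valid(sample))
```

Every valid document is well-formed, but not the other way round: the second sample parses without error yet breaks the declared rules.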
Goals of XML
- Remove obstacles to implementing SGML
- Establish a solid base for the Semantic Web
- Make data portable and exchangeable
- Replace HTML with a more expressive markup language
Recent developments of XML
XML was first released on Feb. 10, 1998; it has just celebrated
its fifth birthday.
Since its release, XML has seen a lot of exciting developments.
Some of them go well beyond what used to be possible with
SGML:
- XML Namespaces allow simultaneous use of more
than one XML vocabulary
- XML Schemas allow the description of XML
documents using XML syntax
- XSL Transformations and XSL Formatting
Objects allow powerful transformations and rendering of
XML documents
- Scalable Vector Graphics (SVG) is an
XML-based vocabulary for vector graphics
- XML Topic Maps (XTM) has been adopted as an
annex to the SGML-based Topic Maps standard (ISO/IEC 13250)
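To give an idea of how namespaces let two vocabularies share one document, here is a small Python sketch using xml.etree.ElementTree. The report vocabulary and its namespace URI (http://example.org/ns/report) are made up for this example; the SVG namespace URI is the real one.

```python
import xml.etree.ElementTree as ET

# A document mixing a hypothetical "report" vocabulary with SVG.
doc = """<report xmlns="http://example.org/ns/report"
        xmlns:svg="http://www.w3.org/2000/svg">
  <title>Sales</title>
  <svg:svg width="10" height="10">
    <svg:title>Chart</svg:title>
  </svg:svg>
</report>"""

root = ET.fromstring(doc)

# Prefixes used for searching are local to this script; only the URIs matter.
ns = {"r": "http://example.org/ns/report", "svg": "http://www.w3.org/2000/svg"}

report_title = root.find("r:title", ns).text
chart_title = root.find("svg:svg/svg:title", ns).text

print(root.tag, report_title, chart_title)
```

Both vocabularies define an element called title, and the namespace is what keeps them apart.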
Adding Intelligence to Data
- management, tracking, version control information
- hypertext links, cross-references, indexing
- annotation
- references to external figures and graphics
Adding Intelligence on top of Data
- navigation of information spaces
- looking for relations, abstractions, concepts
- inference, calculus, information discovery
Major Features of XML
- text divided into elements, which can
nest
- element boundaries marked by tags
- elements carry a generic type and other
attributes
- entity references allow string substitution
for character set problems, standard boilerplate text, and
document management
- consistent use of delimiters, few special
characters
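These features can be seen together in one short Python sketch: nested elements, an attribute, and three kinds of reference (a declared entity, a predefined entity, and a character reference). The letter vocabulary and the entity name press are invented for illustration.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE letter [
  <!ENTITY press "Example Press">
]>
<letter type="memo">
  <from>&press;</from>
  <body>Fish &amp; chips, caf&#233;</body>
</letter>"""

root = ET.fromstring(doc)

children = [child.tag for child in root]  # elements nested inside <letter>
letter_type = root.get("type")            # attribute on the element
sender = root.find("from").text           # &press; replaced by its declared text
body = root.find("body").text             # &amp; and &#233; replaced by characters

print(children, letter_type, sender, body)
```

The parser performs the substitutions, so the application only ever sees the expanded text.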
Markup within the XML Document
Everything is delimited:
- Elements by start-tags and
end-tags
- Tags by < ... > and </ ... >
- Entity references by & ... ;
For example:
<quote lang="fra">L'état, c'est moi!</quote>
Beware: XML is case-sensitive!
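A quick way to see the delimiters and the case sensitivity at work is to feed the quote example to a parser. The following is a minimal Python sketch using xml.etree.ElementTree; the mismatched second input is constructed for this illustration.

```python
import xml.etree.ElementTree as ET

source = """<quote lang="fra">L'\u00e9tat, c'est moi!</quote>"""
quote = ET.fromstring(source)

print(quote.tag)          # element type taken from the start-tag
print(quote.get("lang"))  # attribute value
print(quote.text)         # content between start-tag and end-tag

# Case matters: <Quote> and </quote> are different names,
# so the parser rejects this input as not well-formed.
try:
    ET.fromstring('<Quote lang="fra">...</quote>')
    mismatch_accepted = True
except ET.ParseError:
    mismatch_accepted = False
```

Attribute names are case-sensitive in the same way: asking for "Lang" instead of "lang" finds nothing.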
XML is not:
- a standard set of required tags or data elements
- a standard set of recommended tags or data elements
- a user interface
- a presentation format
- a programming language
- a piece of software
XML Processing Model
During their lifetime, electronic texts are:
- created by scanning or keyboarding
- stored locally
- exchanged over a network or on disk
- processed interactively (enriched)
- formatted for output in a variety of different media and
data formats.
|