|
Markup and XML Basics
- What is markup and why use it?
- XML and its advantages
Basic Problems of Markup
- character representation
- representation of structure and typography
- reduction to linear form
- representation of analysis, interpretation,
controversial information
- capture of ancillary information
Text Encoding
We distinguish:
- character encoding: representation of the
characters of an alphabet or writing system
- page encoding: representation of a pattern
of ink on a page
- text encoding: representation of a text
with its structure and other information of
interest
Text, not Pages
Some examples:
- character encoding (ASCII, ISO 646, ISO 10646,
Unicode)
- page encoding (PostScript, TeX DVI files, PDF, RTF,
etc.)
- text encoding (SGML, XML)
Pages are physical objects.
Texts are abstract objects.
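The first of these levels, character encoding, can be made concrete with a small sketch. It assumes nothing beyond the Python standard library and shows one abstract character (a Unicode code point) turning into different byte sequences under different encodings. The variable names are illustrative only.

```python
# One abstract character, several concrete byte representations.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

code_point = ord(ch)                 # the abstract Unicode code point
utf8_bytes = ch.encode("utf-8")      # two bytes in UTF-8
latin1_bytes = ch.encode("latin-1")  # one byte in ISO 8859-1

print(f"U+{code_point:04X}", utf8_bytes, latin1_bytes)

# Plain ASCII (ISO 646) has no code for this character at all.
try:
    ch.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print("representable in ASCII:", ascii_ok)
```

The character stays the same throughout; only its byte-level representation changes, which is why a text encoding has to state which character encoding it relies on.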
Markup
Markup is meta-information used to record the
structure, status, properties, or other characteristics of
a text.
A markup language is a set of conventions
governing the use of markup, especially
- what kinds of markup are allowed, where they are
allowed, and what they mean
- what kinds of markup are required and
where
- how to tell markup from content (the text
itself)
Markup Languages
A markup language must specify methods for
- representing all the characters of the
text
- marking the structure of the
text
- reducing the text to a linear order
- representing "extra-textual"
or contextual information, e.g. analysis or
documentation
- determining whether a given document is valid or
not
SGML
SGML is the Standard Generalized Markup
Language (ISO 8879:1986). It
- defines a standard abstract syntax for markup
languages, with several types of markup:
- tags, which mark occurrences of element
types
- entities
- declarations
- processing instructions
- defines methods for formal definition of markup
languages, which can
- define entities and element types
- change the standard markup delimiters
Despite its name, SGML is not a markup language, but a
metalanguage.
Goals of SGML
- separation of data from processing
specifications
- re-use of text in multiple forms by multiple
processes
- system independence and portability of data
- addition of intelligence to data
- well-defined structure
- clean interfaces for specialized notations (e.g. for
graphics)
XML
XML is the Extensible Markup Language,
defined by the World Wide Web Consortium (W3C) as a
Recommendation on Feb. 10, 1998.
Support for XML was also added to SGML in 1997 as Annex K,
under the name WebSGML.
- XML is a simplified subset of SGML
- It leaves out many complex, but rarely used features
of SGML
- Conforming XML documents can be well-formed or
valid; SGML documents must be valid
- XML is based on Unicode; SGML's reference concrete syntax is based on ISO 646 (ASCII)
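The difference between well-formed and valid can be sketched with Python's standard `xml.etree.ElementTree` module. It is a non-validating parser, so it checks well-formedness only. Checking validity against a DTD or schema would need a validating parser, which is not shown here. The helper name `is_well_formed` and the `poem`/`line` elements are made up for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if text parses as well-formed XML (no validity check)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = "<poem><line>Roses are red</line></poem>"
bad = "<poem><line>Roses are red</poem></line>"  # tags overlap instead of nesting

print(is_well_formed(good))
print(is_well_formed(bad))
```

A document that passes this check may still be invalid with respect to a particular DTD, for example if that DTD does not allow `line` inside `poem`.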
Goals of XML
- Remove obstacles for implementations of SGML
- Establish a solid base for the Semantic Web
- Portability and exchangeability of data
- Replace HTML with a more expressive markup
language
Recent developments of XML
Since its release 7 years ago, XML has seen a lot of
exciting developments. Some of them go well beyond what
used to be possible with SGML:
- XML Namespaces allow simultaneous use of
more than one XML vocabulary
- XML Schemas allow the description of XML
documents using XML syntax
- XSL Transformations and XSL Formatting
Objects allow powerful transformations and
rendering of XML documents
- Scalable Vector Graphics (SVG) is an
XML-based vocabulary for vector graphics
- XML Topic Maps is an evolving standard
based on SGML Topic Maps (ISO 13250)
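As a rough illustration of the first item, the sketch below mixes two vocabularies that both define a `title` element and keeps them apart by namespace. The namespace URIs and element names are hypothetical, and the parser is Python's standard `xml.etree.ElementTree`, which reports each element name as `{namespace-uri}local-name`.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, chosen only for this illustration.
NS_BOOK = "http://example.org/ns/book"
NS_PERSON = "http://example.org/ns/person"

doc = f"""<entry xmlns:bk="{NS_BOOK}" xmlns:p="{NS_PERSON}">
  <bk:title>A Book Title</bk:title>
  <p:title>Dr.</p:title>
</entry>"""

root = ET.fromstring(doc)

# Both children have the local name "title" but different qualified names.
tags = [child.tag for child in root]
print(tags)

# Lookups name the vocabulary explicitly through a prefix map.
prefixes = {"bk": NS_BOOK, "p": NS_PERSON}
book_title = root.find("bk:title", prefixes).text
person_title = root.find("p:title", prefixes).text
print(book_title, "/", person_title)
```

The prefixes themselves carry no meaning; only the URIs they are bound to identify the vocabulary.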
Adding Intelligence to Data
- management, tracking, version control
information
- hypertext links, cross-references, indexing
- annotation
- references to external figures and graphics
Major Features of XML
- text divided into elements, which can
nest
- element boundaries marked by tags
- elements carry generic type and other
attributes
- entity references allow string substitution
for character set problems, standard boilerplate text,
and document management
- consistent use of delimiters, few special
characters
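These features can be seen together in a small sketch. The document below is invented for illustration: the `letter`, `body`, and `p` elements, the `type` attribute, and the `org` entity are all assumptions. It is parsed with Python's standard `xml.etree.ElementTree`, which expands entities declared in the internal DTD subset as well as the predefined and numeric character references.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE letter [
  <!ENTITY org "Example Press">
]>
<letter type="business">
  <body><p>From &org; &amp; partners, caf&#233;</p></body>
</letter>"""

root = ET.fromstring(doc)

# Elements nest: letter contains body, which contains p.
body = root[0]
p = root.find("body/p")

# Elements carry a generic type (the tag name) and other attributes.
print(root.tag, root.get("type"))

# Entity and character references are replaced by the parser.
print(p.text)
```

Here `&org;` stands for boilerplate text, `&amp;` escapes a delimiter character, and `&#233;` works around a character set limitation.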
Markup within the XML Document
Everything is delimited:
- Elements by start-tags and
end-tags
- Tags by < ... > and </ ... >
- Entities by & ... ;
For example: <quote lang="fra">L'état, c'est moi!</quote>
Beware: XML is case-sensitive; SGML is
not!
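A short sketch, using Python's standard `xml.etree.ElementTree`, shows how a parser separates markup from content in the example above, and what case sensitivity means in practice. The mismatched `<Quote>` document is made up for the illustration.

```python
import xml.etree.ElementTree as ET

quote = ET.fromstring("""<quote lang="fra">L'\u00e9tat, c'est moi!</quote>""")

print(quote.tag)          # element type, taken from the tags
print(quote.get("lang"))  # attribute value from the start-tag
print(quote.text)         # the content between the tags

# In XML, <Quote> and </quote> are different names, so this is not well-formed.
try:
    ET.fromstring("<Quote>text</quote>")
    mismatch_accepted = True
except ET.ParseError:
    mismatch_accepted = False
print("mismatched case accepted:", mismatch_accepted)
```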
XML is not:
- a standard set of required tags or data
elements
- a standard set of recommended tags or data
elements
- a user interface
- a presentation format
- a programming language
- a piece of software
XML Processing Model
During their lifetime, electronic texts are:
- created by scanning or keyboarding
- stored locally
- exchanged over a network or on disk
- processed interactively (enriched)
- formatted for output in a variety of different media
and data formats.
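The last step, formatting one encoded text for several output forms, can be sketched as follows. The `poem` document and the two formatter functions are hypothetical. The point is only that the same structured source feeds different renderings without being changed itself.

```python
import xml.etree.ElementTree as ET
from html import escape

source = (
    "<poem><title>Song</title>"
    "<line>First line</line><line>Second line</line></poem>"
)
poem = ET.fromstring(source)

def to_plain_text(poem: ET.Element) -> str:
    """Render for a text-only medium: upper-case title, one line per verse."""
    title = poem.findtext("title")
    lines = [line.text for line in poem.findall("line")]
    return "\n".join([title.upper(), ""] + lines)

def to_html(poem: ET.Element) -> str:
    """Render the same text for a browser."""
    title = escape(poem.findtext("title"))
    lines = "<br/>".join(escape(line.text) for line in poem.findall("line"))
    return f"<h1>{title}</h1><p>{lines}</p>"

print(to_plain_text(poem))
print(to_html(poem))
```

In practice this role is usually played by stylesheet languages such as XSLT. The hand-written functions here only stand in for that step.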
|