What is Document
Analysis?
Document analysis includes:
- gathering information used in a formal description
of the electronic text
- studying the content and structure of the
documents:
  - identifying and naming the components of some
  class of documents
  - specifying their interrelationships
  - naming their properties
No serious project to produce electronic
texts should try to skip the document analysis phase.
Why is Document Analysis Essential?
Because you must know:
- if you use the TEI encoding scheme, which parts of
the TEI do you need?
- if you extend the TEI scheme, what must you
add?
- if you design your own XML markup, what needs to be
in the document type definition?
- if you don't use XML, what aspects of your
documents will you mark, and how?
Steps in Document Analysis
- Define the environment:
  - your requirements
  - external requirements
  - the document universe
  - the set of document types
- Define the textual features you care about.
- Identify the relationships among the
features.
- Enrich the collection of text features.
Step 1a: Define Your Goals
- goals and objectives
- scope
- internal and external constraints
- intended and foreseeable uses of information
- functions of the XML application:
  - publishing (paper, CD-ROM, network)
  - database retrieval
  - hypertext navigation
  - electronic review and comment
  - document interchange
  - etc., etc.
Step 1b: Identify Relevant Standards
Step 1c: Document Universe
What documents are you talking about?
- what is, or what should
be?
- many similar items (Taisho Tripitaka)?
- or one unique item (the Oxford English
Dictionary)?
- what information in which documents?
- how many different kinds of documents?
Step 1d: Gathering Samples
Construct a set of samples to analyze, including:
- typical samples as well as special cases:
  - typical
  - unusual but within bounds
  - off the wall
- short examples and long ones
- current documents and old ones
- not just printed samples
- all the parts, all associated documents
Step 2: Define Features
What is a feature?
- large structural units (table of contents, body,
front matter, chapter, ...)
- smaller structural units within the larger
(headings, figures, lists, ...)
- non-structural units conveying specialized
information: (italics, hypertext links, people's names,
dates, topical keywords, technical terms, ...)
How Big is a Feature?
As large or small as it needs to be. A feature might
be:
- the entire dictionary
- a section of the dictionary
- an entry in the dictionary
- the head-word in the dictionary
- a syllable within the head-word
Not all features are visible in the output: examples
include status information and internal editorial notes
(one editor to another: ‘How can you SAY that?’).
Principles of Feature Definition
Something is a good candidate for definition as a feature
if:
- it looks different from the rest of the text
- it requires different processing
- you may want to find it easily later
- you may want to point at it from elsewhere
- it is a nameable part of the structure of the text
(chapter, note, quotation, ...)
- it fills a clear function in the hierarchy of the
text
- it is information of a specialized and interesting
type
How Many Features?
Academia Sinica, 128, Sec. 2,
Yanjiuyuan lu, Nangang, Taipei
- One feature:
name-and-address.
- Two features:
organization-name and
organization-address.
- Three features:
name-and-address, which contains
organization-name and
organization-address.
How Many Features?
Academia Sinica, 128, Sec. 2,
Yanjiuyuan lu, Nangang, Taipei
- Eight features:
name-and-address,
organization-name,
street-address,
house-number,
street-section,
street-name,
district,
city.
- Fifteen features: nine word features
and six punctuation features.
- Five features: address,
name (type=organization), three
address-line.
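The competing analyses above can be made concrete with a small sketch. It builds the three-feature and the five-feature encodings of the same address using Python's standard xml.etree.ElementTree module. The element names are taken from the lists above; the split of the address into three address-line units and the helper function names are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET

ADDRESS = "Academia Sinica, 128, Sec. 2, Yanjiuyuan lu, Nangang, Taipei"


def three_features():
    """name-and-address containing organization-name and organization-address."""
    root = ET.Element("name-and-address")
    name = ET.SubElement(root, "organization-name")
    name.text = "Academia Sinica"
    name.tail = ", "  # the separating comma stays as plain text
    address = ET.SubElement(root, "organization-address")
    address.text = "128, Sec. 2, Yanjiuyuan lu, Nangang, Taipei"
    return root


def five_features():
    """address containing one typed name and three address-line elements."""
    root = ET.Element("address")
    name = ET.SubElement(root, "name", {"type": "organization"})
    name.text = "Academia Sinica"
    # Assumption: this particular division into lines is illustrative only.
    for line in ("128, Sec. 2, Yanjiuyuan lu", "Nangang", "Taipei"):
        ET.SubElement(root, "address-line").text = line
    return root


def count_features(root):
    """Count every element in the tree, including the root itself."""
    return sum(1 for _ in root.iter())


def plain_text(root):
    """Recover the character data with all markup removed."""
    return "".join(root.itertext())


if __name__ == "__main__":
    for build in (three_features, five_features):
        tree = build()
        print(count_features(tree), ET.tostring(tree, encoding="unicode"))
```

Both trees describe the same string; they differ only in which distinctions the encoder has chosen to record. Note that the five-feature version drops the commas between lines, which is itself a decision the analysis has to make.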
Choosing a Feature Analysis
The analysis of a sample text should ideally be:
- true
- useful
- simple enough to use
These qualities do not always coexist.
In case of doubt, choose truth over an apparently useful
lie.
Why Identify Features at All?
So you can:
- put all the technical terms into the draft
glossary
- print all personal names in blue, all names of
places in green
- find all occurrences of the word
hypertext, but only in section
headings, not in footnotes
- ensure that all announcements of public events
specify the date, time, location, and sponsor of the
event, as well as giving a description
- automatically maintain cross references and
indices
- create links in hypertext applications
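As an illustration of the search case in the list above, the following sketch finds the word hypertext in section headings only, ignoring body text and footnotes. The sample document and its element names (doc, section, head, p, note) are hypothetical; any markup that distinguishes headings from notes would serve equally well.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the word occurs in a heading, a paragraph, and a note.
SAMPLE = """<doc>
  <section>
    <head>Hypertext and markup</head>
    <p>Plain prose about hypertext.<note>A footnote on hypertext.</note></p>
  </section>
  <section>
    <head>Printing</head>
    <p>No links here.</p>
  </section>
</doc>"""


def headings_mentioning(root, word):
    """Return the text of every head element that contains the word."""
    word = word.lower()
    hits = []
    for head in root.iter("head"):
        text = "".join(head.itertext())
        if word in text.lower():
            hits.append(text)
    return hits


def occurrences_anywhere(root, word):
    """Count occurrences in the whole text, as an unmarked search would."""
    return "".join(root.itertext()).lower().count(word.lower())


root = ET.fromstring(SAMPLE)
print(headings_mentioning(root, "hypertext"))
print(occurrences_anywhere(root, "hypertext"))
```

Without the head and note features, a search can only report every occurrence; with them, the restriction to headings is a one-line filter.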
Step 3: Identify Relationships
- hierarchy of containers (a part contains chapters,
which contain sections)
- sequence (front matter precedes body, which precedes
back matter)
- alternation: either A or B, but not both
- occurrence (occurs once? many times?
optional?)
- semantic groups: collections of similar
things
- syntactic groups: items which can appear in similar
places
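The relationships listed above (hierarchy, sequence, alternation, occurrence) are what a document type definition eventually records. As a rough sketch only, the same constraints can be written as regular expressions over the names of each element's children and checked in Python. The element names and the content models below are hypothetical, and this is not a substitute for a real schema validator.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content models: each child name is followed by one space.
CONTENT_MODELS = {
    "book": r"(front )?body (back )?",   # sequence, with optional front and back
    "body": r"((part )+|(chapter )+)",   # alternation: parts or chapters, not both
    "part": r"(chapter )+",              # occurrence: one or more chapters
    "chapter": r"head (section )*",      # one heading, then any number of sections
}


def violations(root):
    """List (element name, actual children) pairs that break a content model."""
    problems = []
    for element in root.iter():
        model = CONTENT_MODELS.get(element.tag)
        if model is None:
            continue  # no model recorded for this element
        children = "".join(child.tag + " " for child in element)
        if not re.fullmatch(model, children):
            problems.append((element.tag, children.strip()))
    return problems


GOOD = ET.fromstring(
    "<book><front/><body><chapter><head/><section/></chapter></body></book>"
)
BAD = ET.fromstring(
    "<book><body><part><chapter><head/></chapter></part>"
    "<chapter><head/></chapter></body><front/></book>"
)

print(violations(GOOD))
print(violations(BAD))
```

In the second sample the front matter follows the body, and the body mixes parts with chapters, so both the sequence rule and the alternation rule are reported.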
Step 4: Enrich the Collection
- non-printing information: bibliographic and
cataloging data, subject keywords, identity of encoder,
circumstances of production
- control information: status tracking, routing and
process control information, confidentiality
- gaps in semantic groups: we have three reasons
something might be italic; are there more?
- gaps in syntactic groups: what else can
occur at the beginning of a new chapter?
Knowing When to Stop
What is ‘enough’ document analysis?
How can you tell you're done?
- a place for everything and everything in its
place
- tag a sample: can you tag everything you see?
- have you identified everything you'll want to point
at, search for, search within, sort by, or process
specially?
- is the set of features good enough to go on
with?
- does this feature set provide a good foundation for
later growth and change?
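One way to approach the "tag a sample" test is to compare the element names actually used in a tagged sample with the feature list produced by the analysis. The sketch below is a minimal version of that comparison; the feature names, the sample entry, and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical feature list from an analysis of dictionary entries.
FEATURES = {"entry", "head-word", "syllable", "definition", "editorial-note"}

# Hypothetical tagged sample; the encoder needed an element not in the list.
SAMPLE = (
    "<entry><head-word><syllable>mark</syllable><syllable>up</syllable>"
    "</head-word><definition>Codes added to a text; see "
    "<xref>tag</xref>.</definition></entry>"
)


def coverage(sample_xml, features):
    """Compare the elements used in a sample with the declared features."""
    root = ET.fromstring(sample_xml)
    used = {element.tag for element in root.iter()}
    return {
        "undeclared": sorted(used - features),  # tagged but never analyzed
        "unused": sorted(features - used),      # analyzed but not exercised
    }


report = coverage(SAMPLE, FEATURES)
print(report)
```

An undeclared element suggests the feature list has a gap; an unused feature may only mean the sample set is too small, which is a reason to return to the sample-gathering step.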
What is Good Enough?
Don't over-stress the applications you foresee: in the
future, other applications and problems will take their
place. But it is not enough for markup to be useful
eventually; it can and should be made useful
now:
- can you search it acceptably?
- can you process it acceptably?
- can you display it on screen?
- can you print it?
- can you produce this level of markup with current
procedures?
Document Analysis: Conclusions
Document analysis forces you to:
- clarify your needs and interests
- identify clearly the textual features of critical
importance for your work
It thus prepares you to:
- identify TEI tags you will need
- identify desirable or necessary extensions to the
TEI encoding scheme
- define new XML elements if necessary