tea

October 31, 1998
Satoshi Sekine (New York University)

Annotates Japanese (EUC-coded) documents with SGML tags, and extracts
SGML tags from Japanese (EUC-coded) documents.

***********
** Usage **
***********

    tea {-a|-e} index_file doc_file

*************
** Options **
*************

    -a : Annotation mode
    -e : Extraction mode

****************
** Annotation **
****************

In annotation mode, tea annotates documents according to the information
described in index_file and writes the annotated data to stdout.

****************
** Extraction **
****************

In extraction mode, tea extracts tag information from documents and adds
it to index_file. (Initially, index_file has the list of tags to be
extracted.)

*************
** Purpose **
*************

The aim is to let people exchange manually or automatically tagged data
without breaking license agreements. It may also be useful for converting
tag information between a text annotation format and a Tipster
architecture-like format.

***************
** Indexfile **
***************

The contents of the index file are defined by the following BNF.

---
For extraction
   :=
For annotation
   := ;
----
   := +;
   := "TAGSET"
   := string;
   := *;
   := *;
   := "DOCNO" ;
   := string;
   := "@" ;
   := integer;
   := integer;
   := string;
   := string;
---

*******************
** Document file **
*******************

In annotation mode, the document file may contain documents that are not
specified in the index file. Such documents are simply ignored.

In extraction mode, every document in the document file(s) is processed.
Hence, the index file lists the document number of every document,
regardless of whether it has tagged items.

*****************************
** Tag overlap and nesting **
*****************************

tea does not allow tags to overlap or nest.

******************
** Restrictions **
******************

- Tag strings can't contain Japanese characters.
- Even if there are unbalanced tag pairs, extraction may still produce some data.
  It is the user's responsibility to prepare documents with balanced tags.
- Much error checking has not been implemented (yet! hopefully).

**********
** Test **
**********

For testing purposes, we provide sample files.

    sample.txt  : Plain document
    sample.key  : Document with keys
    sample.idx  : Index file
    sample0.idx : Index file (original, for extraction purposes)

First, take a look at what is in them, then try the following:

> cp sample0.idx sample.idx
   --- Don't run tea directly on the original. The program will change it.
> tea -e sample.idx sample.key
   --- Extract tag information from sample.key.
> tea -a sample.idx sample.txt > sample.out
   --- Annotate sample.txt based on the tag information in sample.idx.
       The result goes to standard output, redirected here to sample.out.
> diff sample.out sample.key
   --- Check that the output is identical to the original file.

*************************************
** Comment, Suggestion, Bug report **
*************************************

Send email to . Thank you!!
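
***********************
** Round-trip sketch **
***********************

The extract-then-annotate round trip described in the Test section can be
illustrated with a small Python sketch. This is not tea's code and does not
read or write tea's index file format. The function names extract and
annotate, the regular expression, and the (start, end, tag) tuples are all
assumptions made for illustration. In particular, offsets here count Python
characters in the plain text, which may differ from the offsets tea stores
for EUC-coded input. Like tea, the sketch assumes tags neither overlap nor
nest.

```python
import re

# Hypothetical pattern: an opening tag, its content, and the matching close.
# Tag names are ASCII only, mirroring the restriction listed above.
TAG_RE = re.compile(r"<([A-Za-z][A-Za-z0-9_]*)>(.*?)</\1>", re.S)


def extract(tagged, tagset):
    """Strip the listed tags from `tagged`.

    Returns (plain_text, spans), where each span is (start, end, tag) and
    the offsets index into plain_text (an assumption of this sketch).
    """
    pieces = []
    spans = []
    pos = 0        # read position in the tagged text
    out_len = 0    # length of the plain text built so far
    for m in TAG_RE.finditer(tagged):
        if m.group(1) not in tagset:
            continue  # tags not in the tag set stay in the text untouched
        before = tagged[pos:m.start()]
        pieces.append(before)
        out_len += len(before)
        body = m.group(2)
        spans.append((out_len, out_len + len(body), m.group(1)))
        pieces.append(body)
        out_len += len(body)
        pos = m.end()
    pieces.append(tagged[pos:])
    return "".join(pieces), spans


def annotate(plain, spans):
    """Re-insert tags into `plain` at the recorded offsets."""
    out = []
    pos = 0
    for start, end, tag in sorted(spans):
        if start < pos:
            raise ValueError("overlapping or nested spans are not supported")
        out.append(plain[pos:start])
        out.append(f"<{tag}>{plain[start:end]}</{tag}>")
        pos = end
    out.append(plain[pos:])
    return "".join(out)
```

Calling extract on a tagged document and then annotate on the resulting
plain text and spans should reproduce the tagged document, which is the
same property the diff step in the Test section checks. Keeping the spans
separate from the text is what allows tag data to be exchanged without
redistributing the licensed documents themselves.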