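Do the Maximum Token Length and Maximum Vocabulary Size of a Unigram Tokenizer Affect Dependency Parsing?
I investigated how the maximum token length M and the maximum vocabulary size V of a Unigram tokenizer affect UPOS/LAS/MLAS. Each DeBERTa model was built using only the sentences in lzh_kyoto-ud-train.conllu.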
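To make the role of M concrete, here is a minimal sketch of how a length limit changes the pool of substrings a subword vocabulary could draw from. It is a simplification and not the Unigram trainer itself, which also scores and prunes candidates; the function name `candidate_pieces` and the sample words are hypothetical.

```python
def candidate_pieces(words, max_piece_length):
    """Collect every substring of each word that is at most max_piece_length long."""
    pieces = set()
    for word in words:
        for start in range(len(word)):
            longest_end = min(start + max_piece_length, len(word))
            for end in range(start + 1, longest_end + 1):
                pieces.add(word[start:end])
    return pieces


# Made-up sample words, used only for illustration.
words = ["天下", "不知"]
for m in (1, 2):
    print(m, sorted(candidate_pieces(words, m)))
```

With a limit of 1 only single characters remain, so the tokenizer is effectively character-level; larger limits also admit multi-character pieces.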
Evaluation on lzh_kyoto-ud-dev.conllu (each cell is UPOS/LAS/MLAS)
| | V=8000 | V=16000 | V=32000 | V=64000 |
|---|---|---|---|---|
| M=1 | 86.96/72.86/68.01 | 86.85/72.65/67.67 | 86.71/72.98/67.88 | 86.88/72.63/67.86 |
| M=2 | 86.90/72.82/67.79 | 86.94/72.84/68.02 | 86.65/72.89/67.76 | 86.68/72.64/67.64 |
| M=4 | 86.92/72.72/67.85 | 86.88/72.57/67.54 | 86.54/72.75/67.85 | 86.77/72.61/67.68 |
| M=8 | 86.73/72.42/67.56 | 86.69/72.75/67.73 | 86.67/72.92/67.91 | 87.04/73.04/68.12 |
| M=16 | 86.73/72.87/67.83 | 86.95/72.81/67.93 | 86.73/72.63/67.84 | 86.75/72.63/67.85 |
Test on lzh_kyoto-ud-test.conllu (each cell is UPOS/LAS/MLAS)
| | V=8000 | V=16000 | V=32000 | V=64000 |
|---|---|---|---|---|
| M=1 | 88.20/74.12/69.12 | 88.33/74.67/69.65 | 88.07/74.41/69.13 | 88.40/74.35/69.36 |
| M=2 | 88.40/74.71/69.58 | 88.16/74.28/69.10 | 88.12/74.57/69.38 | 88.43/74.81/69.69 |
| M=4 | 88.48/74.46/69.53 | 88.34/74.77/69.45 | 88.55/74.62/69.50 | 88.48/74.62/69.38 |
| M=8 | 88.24/74.59/69.30 | 88.38/74.81/69.67 | 88.38/74.71/69.41 | 88.45/74.93/69.80 |
| M=16 | 88.37/74.56/69.21 | 88.27/74.52/69.33 | 88.24/74.76/69.59 | 88.39/74.79/69.69 |
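One way to read the test-set scores is to look at how far the best and worst settings are apart for each metric. The sketch below does this for the test figures reported above; the names `TEST_ROWS`, `TEST_SCORES`, and `summarize` are my own, and the code only re-arranges the reported numbers.

```python
VOCAB_SIZES = (8000, 16000, 32000, 64000)

# Test-set scores copied from the table, one row per M, one cell per V.
TEST_ROWS = {
    1: "88.20/74.12/69.12 88.33/74.67/69.65 88.07/74.41/69.13 88.40/74.35/69.36",
    2: "88.40/74.71/69.58 88.16/74.28/69.10 88.12/74.57/69.38 88.43/74.81/69.69",
    4: "88.48/74.46/69.53 88.34/74.77/69.45 88.55/74.62/69.50 88.48/74.62/69.38",
    8: "88.24/74.59/69.30 88.38/74.81/69.67 88.38/74.71/69.41 88.45/74.93/69.80",
    16: "88.37/74.56/69.21 88.27/74.52/69.33 88.24/74.76/69.59 88.39/74.79/69.69",
}

# Map (M, V) to a (UPOS, LAS, MLAS) triple.
TEST_SCORES = {
    (m, v): tuple(float(x) for x in cell.split("/"))
    for m, row in TEST_ROWS.items()
    for v, cell in zip(VOCAB_SIZES, row.split())
}


def summarize(scores):
    """For each metric, report the best and worst (M, V) and their difference."""
    summary = {}
    for index, name in enumerate(("UPOS", "LAS", "MLAS")):
        column = {config: triple[index] for config, triple in scores.items()}
        best = max(column, key=column.get)
        worst = min(column, key=column.get)
        spread = round(column[best] - column[worst], 2)
        summary[name] = {"best": best, "worst": worst, "spread": spread}
    return summary


for name, info in summarize(TEST_SCORES).items():
    print(name, info)
```

On the test table this gives spreads of 0.48 (UPOS), 0.81 (LAS), and 0.70 (MLAS) points across the 20 settings, with the best LAS and MLAS both at M=8, V=64000. Each setting was trained once, so this summary alone cannot separate the effect of M and V from run-to-run variation.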
Working environment
mdx, 1 GPU (NVIDIA A100-SXM4-40GB)
- tokenizers 0.12.1
- transformers 4.19.1
- esupar 1.2.7
- torch 1.11.0+cu113
- Universal Dependencies 2.10
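When reproducing the experiment, it may help to compare the installed package versions against the ones listed above. The following is a minimal sketch using only the standard library; the helper name `compare_versions` is hypothetical, and it assumes the PyPI distribution names match the names in the list.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions taken from the environment list above.
EXPECTED = {
    "tokenizers": "0.12.1",
    "transformers": "4.19.1",
    "esupar": "1.2.7",
    "torch": "1.11.0+cu113",
}


def compare_versions(expected):
    """Return {package: (expected_version, installed_version_or_None)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None  # package is not installed
        report[name] = (wanted, found)
    return report


for name, (wanted, found) in compare_versions(EXPECTED).items():
    status = "ok" if found == wanted else "differs"
    print(f"{name}: expected {wanted}, found {found} ({status})")
```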
/bin/sh script
#! /bin/sh
# Fetch the UD Classical Chinese treebank once
URL=https://github.com/UniversalDependencies/UD_Classical_Chinese-Kyoto
D=`basename $URL`
test -d $D || git clone --depth=1 $URL
# Blank out FEATS (column 6) and extract the raw sentences of each split
for F in train dev test
do nawk -F'\t' '{OFS=FS;if(NF==10)$6="_";print}' $D/*-$F*.conllu > $F.conllu
   sed -n 's/^# text = //p' $F.conllu > $F.txt
done
# token.txt: one training sentence per line, word forms separated by spaces
S='{if(NF==10&&$1~/^[1-9][0-9]*$/)printf($1>1?" %s":"%s",$2);if(NF==0)print}'
nawk -F'\t' "$S" $D/*-train.conllu > token.txt
# CoNLL 2018 evaluation script
U=http://universaldependencies.org/conll18/conll18_ud_eval.py
C=`basename $U`
test -f $C || curl -LO $U
# Grid over maximum piece length M and vocabulary size V
for M in 1 2 4 8 16
do for V in 8000 16000 32000 64000
   do test -d deberta$M-$V || python3 -c m,v=$M,$V'
from transformers import (DataCollatorForLanguageModeling,TrainingArguments,
  DebertaV2TokenizerFast,DebertaV2Config,DebertaV2ForMaskedLM,Trainer)
from tokenizers import (Tokenizer,models,pre_tokenizers,normalizers,processors,
  decoders,trainers)
s=["[CLS]","[PAD]","[SEP]","[UNK]","[MASK]"]
# Train a Unigram tokenizer on token.txt with vocabulary size v and piece length limit m
spt=Tokenizer(models.Unigram())
spt.pre_tokenizer=pre_tokenizers.Sequence([pre_tokenizers.Whitespace(),
  pre_tokenizers.Punctuation()])
spt.normalizer=normalizers.Sequence([normalizers.Nmt(),normalizers.NFKC()])
spt.post_processor=processors.TemplateProcessing(single="[CLS] $A [SEP]",
  pair="[CLS] $A [SEP] $B:1 [SEP]:1",special_tokens=[("[CLS]",0),("[SEP]",2)])
spt.decoder=decoders.WordPiece(prefix="",cleanup=True)
spt.train(trainer=trainers.UnigramTrainer(vocab_size=v,max_piece_length=m,
  special_tokens=s,unk_token="[UNK]",n_sub_iterations=2),files=["token.txt"])
spt.save("tokenizer.json")
# Wrap the trained tokenizer for use with transformers
tkz=DebertaV2TokenizerFast(tokenizer_file="tokenizer.json",
  do_lower_case=False,keep_accents=True,bos_token="[CLS]",cls_token="[CLS]",
  pad_token="[PAD]",sep_token="[SEP]",unk_token="[UNK]",mask_token="[MASK]",
  vocab_file="/dev/null",model_max_length=512,split_by_punct=True)
t=tkz.convert_tokens_to_ids(s)
# Base-size DeBERTa(V2) configuration, trained from scratch
cfg=DebertaV2Config(hidden_size=768,num_hidden_layers=12,num_attention_heads=12,
  intermediate_size=3072,max_position_embeddings=tkz.model_max_length,
  vocab_size=len(tkz),tokenizer_class=type(tkz).__name__,
  bos_token_id=t[0],pad_token_id=t[1],eos_token_id=t[2])
arg=TrainingArguments(num_train_epochs=8,per_device_train_batch_size=64,
  output_dir="/tmp",overwrite_output_dir=True,save_total_limit=2)
# Dataset that tokenizes one non-empty line of the text file per item
class ReadLineDS(object):
  def __init__(self,file,tokenizer):
    self.tokenizer=tokenizer
    with open(file,"r",encoding="utf-8") as r:
      self.lines=[s.strip() for s in r if s.strip()!=""]
  __len__=lambda self:len(self.lines)
  __getitem__=lambda self,i:self.tokenizer(self.lines[i],truncation=True,
    add_special_tokens=True,max_length=self.tokenizer.model_max_length-2)
# Masked language model pretraining on train.txt only
trn=Trainer(args=arg,data_collator=DataCollatorForLanguageModeling(tkz),
  model=DebertaV2ForMaskedLM(cfg),train_dataset=ReadLineDS("train.txt",tkz))
trn.train()
trn.save_model("deberta{}-{}".format(m,v))
tkz.save_pretrained("deberta{}-{}".format(m,v))'
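      # Fine-tune the parser with esupar on the CoNLL-U files in the current directory
      test -d upos$M-$V || python3 -m esupar.train deberta$M-$V upos$M-$V .
      # Skip settings that already have results
      test -f result$M-$V/result && continue
      mkdir -p result$M-$V
      # Parse the raw dev and test sentences, one per input line
      for F in dev test
      do cat $F.txt | python3 -c 'mdl,f="upos'$M-$V'","result'$M-$V/$F'.conllu"
import esupar
nlp=esupar.load(mdl)
with open(f,"w",encoding="utf-8") as w:
  while True:
    # Any exception, including EOFError at end of input, ends the loop
    try:
      doc=nlp(input().strip())
    except:
      quit()
    print(doc,file=w)'
      done
      # Score the parsed output against the gold files and keep the log
      ( echo '***' upos$M-$V dev
        python3 $C -v dev.conllu result$M-$V/dev.conllu
        echo '***' upos$M-$V test
        python3 $C -v test.conllu result$M-$V/test.conllu
      ) | tee result$M-$V/result
   done
done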
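The two nawk one-liners at the top of the script are terse, so here is a rough Python equivalent of what they do: blank out the FEATS column, and write each sentence as a line of space-separated word forms. This is a sketch for reading purposes, not part of the original pipeline; the function names and the sample sentences with placeholder annotation are my own.

```python
import re

WORD_ID = re.compile(r"[1-9][0-9]*")  # plain word IDs, not ranges such as 1-2


def blank_feats(conllu_text):
    """Replace column 6 (FEATS) with "_" on every 10-column line."""
    out = []
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        if len(cols) == 10:
            cols[5] = "_"
        out.append("\t".join(cols))
    return "\n".join(out)


def token_lines(conllu_text):
    """Return one string per sentence, word forms joined by single spaces."""
    sentences, forms = [], []
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        if len(cols) == 10 and WORD_ID.fullmatch(cols[0]):
            forms.append(cols[1])
        elif line == "" and forms:  # a blank line closes the sentence
            sentences.append(" ".join(forms))
            forms = []
    if forms:
        sentences.append(" ".join(forms))
    return sentences


def row(index, form, feats):
    # Placeholder annotation: only ID, FORM and FEATS matter in this sketch.
    return "\t".join([str(index), form, "_", "X", "_", feats, "0", "dep", "_", "_"])


sample = "\n".join([
    "# text = 子曰", row(1, "子", "Case=Nom"), row(2, "曰", "_"), "",
    "# text = 學而", row(1, "學", "_"), row(2, "而", "_"), "",
])
print(token_lines(sample))
print(blank_feats(row(1, "子", "Case=Nom")).split("\t")[5])
```

Unlike the awk version, `token_lines` skips repeated blank lines instead of emitting empty output lines; for well-formed CoNLL-U input the result should be the same.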