Connected texts through sentencepiece co-occurrences

Here we will see if a network constructed from co-occurrences of sentencepiece items (sp) in texts can give useful insights into the relatedness of these texts.

All texts in the Kanseki Repository are analyzed using the same SP model (m0-krp-30000.model), which has a vocabulary of 30000 sp in this case. For every text, the sp it contains and their frequency of occurrence in that text are recorded. (See here for the vocabulary lists).

These lists are then aggregated into a single list of all individual sp, together with the texts they occur in and their frequency in each text. From this aggregated list, we construct the network by treating every co-occurrence of an sp in two texts as an edge between those two texts. The combined frequency of both occurrences is added to the weight of this edge, so the total weight for each text pair is the sum of the weights contributed by all the sp items the two texts share.
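The aggregation and weighting steps can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the function names, the toy text ids and the counts are hypothetical, and "combined frequency" is assumed to mean the sum of the two frequencies.

```python
from collections import defaultdict
from itertools import combinations


def aggregate(text_counts):
    """Invert {text_id: {sp: freq}} into {sp: {text_id: freq}}."""
    by_sp = defaultdict(dict)
    for text_id, counts in text_counts.items():
        for sp, freq in counts.items():
            by_sp[sp][text_id] = freq
    return dict(by_sp)


def edge_weights(by_sp):
    """For every sp shared by two texts, add both frequencies to their edge."""
    weights = defaultdict(int)
    for texts in by_sp.values():
        for a, b in combinations(sorted(texts), 2):
            # Assumption: "combined frequency" is the sum of both counts.
            weights[(a, b)] += texts[a] + texts[b]
    return dict(weights)


# Toy input with hypothetical ids and counts, not taken from the repository.
text_counts = {
    "T1": {"天": 3, "地": 1},
    "T2": {"天": 2, "人": 5},
    "T3": {"天": 1, "地": 4, "人": 1},
}
weights = edge_weights(aggregate(text_counts))
for pair, weight in sorted(weights.items()):
    print(pair, weight)
```

Note that an sp occurring in n texts contributes n(n-1)/2 edges, so items that occur almost everywhere dominate both the edge count and the running time.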

Since the corpus builder of the sp processor ignores very rare characters, the aggregate list actually has more items than the vocabulary: 73220. Fig. 1 shows a histogram of the distribution of the number of co-occurrences in the texts. The total number of texts is 8622. Compared to the Bulei histograms, there are relatively few sp that occur in more than half of the texts.

text-hist1.png

Figure 1: Number of co-occurrences of vocabulary in the 8622 texts
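As a rough sketch of the quantity behind such a histogram, the snippet below counts, for each sp, the number of texts it occurs in, and then the share of sp found in more than half of the texts. The input structure (a mapping from sp to the texts containing it), the function names and the toy data are assumptions made for illustration.

```python
from collections import Counter


def document_frequency_histogram(by_sp):
    """Map 'number of texts an sp occurs in' to 'number of such sp'."""
    return Counter(len(texts) for texts in by_sp.values())


def share_in_majority(by_sp, n_texts):
    """Fraction of sp that occur in more than half of the texts."""
    hits = sum(1 for texts in by_sp.values() if len(texts) > n_texts / 2)
    return hits / len(by_sp)


# Hypothetical toy data: {sp: {text_id: frequency}} for four texts.
by_sp = {
    "天": {"T1": 3, "T2": 2, "T3": 1, "T4": 1},
    "地": {"T1": 1, "T3": 4},
    "人": {"T2": 5},
    "菌": {"T4": 2},
}
hist = document_frequency_histogram(by_sp)
share = share_in_majority(by_sp, n_texts=4)
print(sorted(hist.items()), share)
```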

Using the same method as for the Bulei analysis, the edges and nodes are added to a Graph object and a network can be drawn:

text-nw-1.png

Figure 2: Network of the texts

This is not very useful. So how can the network be understood? The first step is to identify "communities", that is, to see how the network can be divided into groups. However, this network has more than 1.5 million edges for 8622 nodes, which is very dense.
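To put the edge count in perspective, the figures above can be turned into an average degree and a density. Since the text only says "more than 1.5 million", both results are lower bounds; the helper names are illustrative.

```python
def average_degree(n_nodes, n_edges):
    """Mean number of neighbours per node in an undirected graph."""
    return 2 * n_edges / n_nodes


def density(n_nodes, n_edges):
    """Share of all possible undirected edges that are present."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


n_nodes = 8622
n_edges = 1_500_000  # lower bound: the text says "more than 1.5 million"

avg = average_degree(n_nodes, n_edges)
dens = density(n_nodes, n_edges)
print(f"average degree >= {avg:.0f}, density >= {dens:.3f}")
```

With these numbers every text is linked to roughly 348 others on average, which is why a plain drawing of the full network shows little structure.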

Next, we try to get fewer edges by taking only the most important edges for each node.
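One possible way to do this pruning is sketched below. It assumes that "most important" means "heaviest", and that an edge survives if it is among the k heaviest edges of either of its endpoints; under that rule a node can end up with more than k edges when other nodes select it. The graph representation and the function name are illustrative.

```python
import heapq


def prune_top_k(graph, k):
    """Keep an edge if it is among the k heaviest edges of either endpoint."""
    kept = set()
    for node, nbrs in graph.items():
        top = heapq.nlargest(k, nbrs.items(), key=lambda item: item[1])
        for nbr, _weight in top:
            kept.add(tuple(sorted((node, nbr))))
    pruned = {node: {} for node in graph}
    for a, b in kept:
        pruned[a][b] = graph[a][b]
        pruned[b][a] = graph[b][a]
    return pruned


# Hypothetical weighted graph as a symmetric adjacency mapping.
graph = {
    "A": {"B": 9, "C": 1, "D": 2},
    "B": {"A": 9, "C": 8},
    "C": {"A": 1, "B": 8, "D": 3},
    "D": {"A": 2, "C": 3},
}
pruned = prune_top_k(graph, k=1)
for node in sorted(pruned):
    print(node, sorted(pruned[node].items()))
```

In the toy graph, node B keeps two edges with k=1 because C selects the edge to B as its heaviest one.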

This is a network with the most important 50 edges for each node. Here are some examples of communities, with their size and membership. Each row lists the community number, its size, and the number of member texts per KR section:

0	4142	KR4:1131	KR6:851	KR5:783	KR3:706	KR1:671
1	1904	KR6:1724	KR5:108	KR4:037	KR3:027	KR1:008
2	0168	KR5:150	KR4:007	KR3:007	KR1:004
3	0187	KR6:187
4	0089	KR4:030	KR3:021	KR6:019	KR5:010	KR1:009
5	0173	KR6:116	KR4:034	KR3:010	KR1:008	KR5:005
6	0180	KR6:156	KR4:012	KR5:008	KR3:004

text-nw-3.png

Figure 3: Communities in the network

And here is another example:

0	3828	KR4:1070	KR6:0819	KR5:0729	KR1:0630	KR3:0580
1	1657	KR6:1454	KR5:0107	KR4:0057	KR3:0029	KR1:0010
2	0351	KR3:0136	KR6:0072	KR5:0052	KR4:0046	KR1:0045
3	0155	KR5:0136	KR4:0008	KR3:0007	KR1:0004
4	0189	KR6:0189
5	0189	KR6:0163	KR5:0010	KR4:0010	KR3:0006
6	0101	KR6:0057	KR4:0024	KR5:0007	KR1:0007	KR3:0006
7	0155	KR6:0131	KR4:0012	KR5:0008	KR3:0004
8	0126	KR6:0113	KR5:0012	KR3:0001
9	0091	KR6:0055	KR4:0024	KR3:0006	KR1:0004	KR5:0002
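Tables like the two above can be produced from any assignment of texts to communities. The sketch below does not show the community detection itself; it only tabulates a given partition by KR section, taking the section to be the first three characters of the text id. The partition used here is a made-up example, not a result of the analysis.

```python
from collections import Counter, defaultdict


def community_table(partition):
    """Return rows (community, size, [(section, count), ...]) per community."""
    members = defaultdict(list)
    for text_id, community in partition.items():
        members[community].append(text_id)
    rows = []
    for community in sorted(members):
        # The section is assumed to be the id prefix, e.g. "KR6" in "KR6j0288".
        sections = Counter(text_id[:3] for text_id in members[community])
        rows.append((community, len(members[community]), sections.most_common()))
    return rows


# Hypothetical assignment of a few text ids to two communities.
partition = {
    "KR6j0288": 0,
    "KR6j0304": 0,
    "KR5h0052": 0,
    "KR1e0030": 1,
    "KR1e0020": 1,
}
rows = community_table(partition)
for community, size, sections in rows:
    cells = "\t".join(f"{section}:{count:04d}" for section, count in sections)
    print(f"{community}\t{size:04d}\t{cells}")
```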

1 Betweenness centrality

Top 20 nodes by betweenness centrality:
KR6j0288 佛說觀自在菩薩如意心陀羅尼呪經-唐-義淨 0.00972684226351827
KR6j0233 千轉陀羅尼觀世音菩薩呪-唐-智通 0.005858058897677584
KR6j0496 佛說常瞿利毒女陀羅尼呪經-唐-瞿多 0.004700151580492199
KR6j0304 不空羂索陀羅尼經-唐-李無諂 0.004698587329389592
KR5i0052 易說-- 0.0035392531050988697
KR6d0076 法華經擊節-明-德清 0.003410178080207934
KR3i0042 菌譜-宋-陳仁玉 0.0032602025498151706
KR5a0116 黃帝陰符經註-北宋-黃居真 0.0030553726842153977
KR5h0052 玄天上帝百字聖號(玄帝感應靈籤)-- 0.0030450404681152685
KR1i0001 皇祐新樂圖記-宋-阮逸 0.0027158245441945878
KR6o0120 起信論疏記會閱-清-續法 0.0026706872898372308
KR1e0030 春秋五禮例宗-宋-張大亨 0.0026504087541593386
KR1b0008 禹貢指南-宋-毛晃 0.002605044218120542
KR6s0012 新譯大方廣佛華嚴經音義-唐-慧菀 0.002408074637176279
KR6i0453 佛說身毛喜豎經-宋-惟淨 0.002323207980597948
KR5a0123 黃帝陰符經註--劉處玄 0.002273007090428055
KR1e0020 春秋通義-宋-闕名 0.002210621023310573
KR6j0229 聖觀自在菩薩心真言瑜伽觀行儀軌-唐-不空 0.0021923937192554832
KR5c0269 太上老君玄妙枕中內德神咒經-唐- 0.0021841143637722138
KR6j0139 寶悉地成佛陀羅尼經-唐-不空 0.002181649296271316
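For reference, betweenness centrality measures how often a node lies on shortest paths between other nodes. The sketch below implements Brandes' algorithm for an unweighted, undirected graph with normalized scores. Whether the ranking above was computed with or without edge weights is not stated, so the unweighted variant is an assumption, and the toy graph is hypothetical. Graph libraries such as NetworkX provide an equivalent `betweenness_centrality` function.

```python
from collections import deque


def betweenness(graph):
    """Normalized betweenness for an unweighted, undirected graph (Brandes)."""
    score = dict.fromkeys(graph, 0.0)
    for source in graph:
        stack = []                         # nodes in order of discovery
        preds = {v: [] for v in graph}     # predecessors on shortest paths
        sigma = dict.fromkeys(graph, 0)    # number of shortest paths
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:                       # breadth-first search
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        while stack:                       # accumulate dependencies backwards
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    n = len(graph)
    # Every pair is visited from both ends, so dividing by (n-1)(n-2)
    # yields the fraction of node pairs whose shortest paths pass through v.
    scale = 1 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {v: s * scale for v, s in score.items()}


# Hypothetical path graph A - B - C - D.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
bc = betweenness(graph)
for node, value in sorted(bc.items(), key=lambda item: item[1], reverse=True):
    print(node, round(value, 4))
```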

2 Degree centrality

Top 20 nodes by degree centrality:
KR6j0288 佛說觀自在菩薩如意心陀羅尼呪經-唐-義淨 458
KR6j0304 不空羂索陀羅尼經-唐-李無諂 428
KR6j0233 千轉陀羅尼觀世音菩薩呪-唐-智通 410
KR6j0496 佛說常瞿利毒女陀羅尼呪經-唐-瞿多 370
KR5h0052 玄天上帝百字聖號(玄帝感應靈籤)-- 360
KR3i0042 菌譜-宋-陳仁玉 325
KR1i0001 皇祐新樂圖記-宋-阮逸 312
KR1b0008 禹貢指南-宋-毛晃 311
KR6j0583 東方最勝燈王陀羅尼經-隋-闍那崛多 309
KR5a0116 黃帝陰符經註-北宋-黃居真 309
KR6i0453 佛說身毛喜豎經-宋-惟淨 300
KR4c0016 唐皇甫冉詩集-唐- 297
KR1e0030 春秋五禮例宗-宋-張大亨 294
KR6o0120 起信論疏記會閱-清-續法 287
KR6o0085 釋摩訶衍論記-唐-聖法 278
KR1e0020 春秋通義-宋-闕名 278
KR6j0229 聖觀自在菩薩心真言瑜伽觀行儀軌-唐-不空 276
KR4c0051 劉賓客文集-唐-劉禹錫 276
KR5i0052 易說-- 266
KR6d0076 法華經擊節-明-德清 266
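The values in this list are plain neighbour counts rather than normalized scores. A minimal sketch of such a ranking, with the normalized value (degree divided by n-1) added for comparison, could look like this; the function name and the toy graph are illustrative. If all 8622 texts are nodes of the graph, a degree of 458 would correspond to a normalized value of about 0.053.

```python
def degree_ranking(graph, top=20):
    """Return (node, degree, degree / (n - 1)) rows, highest degree first."""
    n = len(graph)
    rows = [(node, len(nbrs), len(nbrs) / (n - 1)) for node, nbrs in graph.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top]


# Hypothetical star graph: one hub linked to three leaves.
graph = {
    "hub": ["a", "b", "c"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
}
ranking = degree_ranking(graph, top=2)
for node, degree, normalized in ranking:
    print(node, degree, round(normalized, 3))
```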

3 Eigenvector centrality

Top 20 nodes by eigenvector centrality:
KR6j0304 不空羂索陀羅尼經-唐-李無諂 0.08812291773715351
KR6j0583 東方最勝燈王陀羅尼經-隋-闍那崛多 0.07454350607657415
KR6j0288 佛說觀自在菩薩如意心陀羅尼呪經-唐-義淨 0.07130758533398501
KR6j0496 佛說常瞿利毒女陀羅尼呪經-唐-瞿多 0.06890515051197889
KR6j0233 千轉陀羅尼觀世音菩薩呪-唐-智通 0.06867247403894727
KR6j0579 佛說稱讚如來功德神呪經-唐-義淨 0.061374660142873794
KR5c0033 太上老君說解釋咒詛經-唐- 0.05864276793135707
KR6j0566 陀羅尼雜集--失譯 0.05394860958195797
KR5b0253 上清天樞院回車畢道正法-宋後期- 0.05339559046584755
KR6j0257 千眼千臂觀世音菩薩陀羅尼神呪經-唐-智通 0.05294957101334831
KR6j0121 大佛頂如來放光悉怛多般怛羅大神力都攝一切呪王陀羅尼經大威德最勝金輪三昧呪品-- 0.05252337168997488
KR5h0052 玄天上帝百字聖號(玄帝感應靈籤)-- 0.05199514135112837
KR6j0483 末利支提婆華鬘經-唐-不空 0.05168031150622622
KR6j0546 施諸餓鬼飲食及水法-唐-不空 0.05062923611737598
KR6j0139 寶悉地成佛陀羅尼經-唐-不空 0.0504659165449759
KR6j0229 聖觀自在菩薩心真言瑜伽觀行儀軌-唐-不空 0.049627458925181715
KR6j0072 陀羅尼集經-唐-阿地瞿多 0.04730942724871687
KR6j0379 地藏菩薩儀軌-唐-輸婆迦羅 0.04681546996771384
KR6j0498 使呪法經-唐-菩提留支 0.045394221877556845
KR6j0301 不空羂索呪經-隋-闍那崛多 0.04470212017532556
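Eigenvector centrality gives a node a high score when its neighbours have high scores themselves, which fits the way closely related texts cluster at the top of this list. The sketch below uses power iteration on an unweighted, undirected graph and scales the result to unit Euclidean length. The function, its parameters and the toy graph are illustrative, and ignoring edge weights is an assumption. NetworkX offers a comparable `eigenvector_centrality` function.

```python
import math


def eigenvector_centrality(graph, iterations=200, tol=1e-10):
    """Power iteration on (A + I); result has unit Euclidean norm."""
    x = dict.fromkeys(graph, 1.0)
    for _ in range(iterations):
        # Adding the node's own previous value (A + I) keeps the iteration
        # from oscillating on bipartite graphs such as stars.
        nxt = {v: x[v] + sum(x[w] for w in graph[v]) for v in graph}
        norm = math.sqrt(sum(value * value for value in nxt.values()))
        nxt = {v: value / norm for v, value in nxt.items()}
        if sum(abs(nxt[v] - x[v]) for v in graph) < tol:
            return nxt
        x = nxt
    return x


# Hypothetical star graph: one hub linked to three leaves.
graph = {
    "hub": ["a", "b", "c"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
}
ec = eigenvector_centrality(graph)
for node, value in sorted(ec.items(), key=lambda item: item[1], reverse=True):
    print(node, round(value, 4))
```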

Author: Christian Wittern

Created: 2019-07-09 Tue 09:30
