Here we will see if a network constructed from co-occurrences of sentencepiece items (sp) in texts can give useful insights into the relatedness of these texts.
All texts in the Kanseki Repository are analyzed using the same SP model (m0-krp-30000.model), which in this case has a vocabulary of 30000 sp. For every text, a list of the sp it contains and their frequency of occurrence in that text is recorded. (See here for the vocabulary lists.)
These lists are then aggregated into a single list of all individual sp, together with the texts they occur in and their frequency in each text. From this aggregated list, we construct the network by treating every co-occurrence of an sp in two texts as an edge between those two texts. The combined frequency of both occurrences is added to the weight of this edge, so the total weight for each text pair is the sum of the weights contributed by all the sp the two texts share.
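The aggregation and weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: the function names `aggregate` and `edge_weights`, the text ids and the toy frequencies are all assumptions made up for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations

def aggregate(per_text):
    """Invert {text_id: {sp: freq}} into {sp: {text_id: freq}}."""
    by_sp = defaultdict(dict)
    for text_id, counts in per_text.items():
        for sp, freq in counts.items():
            by_sp[sp][text_id] = freq
    return by_sp

def edge_weights(by_sp):
    """For every sp shared by two texts, add both frequencies to that pair's weight."""
    weights = Counter()
    for texts in by_sp.values():
        # sorted() gives each undirected pair a single canonical key (a, b) with a < b
        for a, b in combinations(sorted(texts), 2):
            weights[(a, b)] += texts[a] + texts[b]
    return weights

# Toy input: hypothetical text ids and frequencies, not taken from the corpus
per_text = {
    "T1": {"天": 3, "道": 1},
    "T2": {"天": 2, "佛": 5},
    "T3": {"佛": 1, "道": 4},
}
weights = edge_weights(aggregate(per_text))
for pair, w in sorted(weights.items()):
    print(pair, w)
```

Note that an sp occurring in n texts contributes n(n-1)/2 pairs, so the few sp that occur in most of the 8622 texts dominate both the running time and the number of edges.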
Since the corpus builder of the sp processor ignores very rare characters, the aggregated list actually has more items than the vocabulary: 73220. Fig. 1 shows a histogram of the distribution of the number of co-occurrences, that is, of the number of texts in which each sp occurs. The total number of texts is 8622. Compared to the Bulei histograms, there are relatively few sp that occur in more than half of the texts.
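A histogram of this kind can be derived directly from the aggregated list. The following sketch assumes the aggregated list is held as a mapping from each sp to the texts it occurs in; the names `document_frequency` and `histogram`, the equal-width binning and the toy data are illustrative assumptions, not the original plotting code.

```python
from collections import Counter

def document_frequency(by_sp):
    """Number of texts each sp occurs in."""
    return {sp: len(texts) for sp, texts in by_sp.items()}

def histogram(doc_freq, n_texts, n_bins=10):
    """Count sp per equal-width bin of document frequency (1..n_texts)."""
    bins = Counter()
    for df in doc_freq.values():
        idx = min((df - 1) * n_bins // n_texts, n_bins - 1)
        bins[idx] += 1
    return [bins[i] for i in range(n_bins)]

# Toy aggregated list: hypothetical sp and text ids
by_sp = {
    "a": {"T1": 1, "T2": 1, "T3": 1, "T4": 1},
    "b": {"T1": 2},
    "c": {"T2": 1, "T3": 1},
}
n_texts = 4
doc_freq = document_frequency(by_sp)
counts = histogram(doc_freq, n_texts, n_bins=2)
# sp that occur in more than half of the texts
widespread = [sp for sp, df in doc_freq.items() if df > n_texts / 2]
print(counts, widespread)
```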
Figure 1: Number of co-occurrences of vocabulary in the 8622 texts
Using the same method as for the Bulei analysis, the edges and nodes are added to a Graph object, from which the network can be drawn:
Figure 2: Network of the texts
This is not very useful. So how can the network be understood? The first step is to identify "communities", that is, to see how the network can be divided into groups of closely connected texts. However, this network has more than 1.5 million edges for 8622 nodes, which is very dense.
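To put these figures in perspective, the density and mean degree can be worked out from the two numbers given. The edge count below is only the lower bound stated in the text ("more than 1.5 million"), so the results are lower bounds as well.

```python
n_nodes = 8622
n_edges = 1_500_000  # lower bound taken from the text, not an exact count

# Number of distinct unordered pairs of texts
max_edges = n_nodes * (n_nodes - 1) // 2

# Share of all possible pairs that are actually connected
density = n_edges / max_edges

# Average number of neighbours per text (each edge has two endpoints)
mean_degree = 2 * n_edges / n_nodes

print(max_edges, round(density, 4), round(mean_degree))
```

Under this assumption every text is linked, on average, to roughly 350 others, which is why a plain drawing of the network shows little structure.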
Next, we try to get fewer edges by taking only the most important edges for each node.
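One way to do this pruning is sketched below. The text does not say how "most important" is measured or how the two endpoints of an edge are reconciled, so this sketch makes two assumptions: importance is the edge weight, and an edge is kept if it is among the k heaviest edges of either endpoint. The function name `top_k_edges` and the toy weights are hypothetical.

```python
from collections import defaultdict
import heapq

def top_k_edges(weights, k):
    """Keep an edge if it is among the k heaviest edges of at least one endpoint."""
    incident = defaultdict(list)
    for (a, b), w in weights.items():
        incident[a].append((w, a, b))
        incident[b].append((w, a, b))
    kept = {}
    for edges in incident.values():
        # nlargest compares tuples, so the weight decides and ids break ties
        for w, a, b in heapq.nlargest(k, edges):
            kept[(a, b)] = w
    return kept

# Toy weights keyed by (text_a, text_b); values are made up
weights = {("A", "B"): 10, ("A", "C"): 1, ("B", "C"): 5, ("C", "D"): 2}
pruned = top_k_edges(weights, k=1)
print(sorted(pruned.items()))
```

With this "either endpoint" rule a node can end up with more than k edges, because other nodes may select it among their own top k. That would be compatible with the degrees well above 50 in the list of top nodes by degree, though whether the same rule was used there is not stated.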
Here are some examples of communities, their size and membership:
This is a network with the most important 50 edges for each node.
Community, size, number of members per section:
0 4142 KR4:1131 KR6:851 KR5:783 KR3:706 KR1:671
1 1904 KR6:1724 KR5:108 KR4:037 KR3:027 KR1:008
2 0168 KR5:150 KR4:007 KR3:007 KR1:004
3 0187 KR6:187
4 0089 KR4:030 KR3:021 KR6:019 KR5:010 KR1:009
5 0173 KR6:116 KR4:034 KR3:010 KR1:008 KR5:005
6 0180 KR6:156 KR4:012 KR5:008 KR3:004
Figure 3: Communities in the network
And here is another example:
Community, size, number of members per section:
0 3828 KR4:1070 KR6:0819 KR5:0729 KR1:0630 KR3:0580
1 1657 KR6:1454 KR5:0107 KR4:0057 KR3:0029 KR1:0010
2 0351 KR3:0136 KR6:0072 KR5:0052 KR4:0046 KR1:0045
3 0155 KR5:0136 KR4:0008 KR3:0007 KR1:0004
4 0189 KR6:0189
5 0189 KR6:0163 KR5:0010 KR4:0010 KR3:0006
6 0101 KR6:0057 KR4:0024 KR5:0007 KR1:0007 KR3:0006
7 0155 KR6:0131 KR4:0012 KR5:0008 KR3:0004
8 0126 KR6:0113 KR5:0012 KR3:0001
9 0091 KR6:0055 KR4:0024 KR3:0006 KR1:0004 KR5:0002
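Summaries of this shape (community index, size, members per section) can be produced from a list of communities by counting the first three characters of each text id, which give the section (KR1, KR3 and so on). The sketch below assumes each community is simply a list of text ids; the function name `summarize` and the tiny example communities are illustrative only.

```python
from collections import Counter

def summarize(communities):
    """Return one line per community: index, size, and member counts per section."""
    lines = []
    for idx, members in enumerate(communities):
        # The section is the first three characters of the id, e.g. "KR6" in "KR6j0288"
        by_section = Counter(text_id[:3] for text_id in members)
        parts = " ".join(f"{sec}:{n:04d}" for sec, n in by_section.most_common())
        lines.append(f"{idx} {len(members):04d} {parts}")
    return lines

# Hypothetical grouping of a few real text ids, for illustration only
communities = [
    ["KR6j0288", "KR6j0233", "KR5i0052"],
    ["KR1b0008"],
]
summary = summarize(communities)
print("\n".join(summary))
```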
Top 20 nodes by betweenness centrality:
KR6j0288 佛說觀自在菩薩如意心陀羅尼呪經-唐-義淨 0.00972684226351827
KR6j0233 千轉陀羅尼觀世音菩薩呪-唐-智通 0.005858058897677584
KR6j0496 佛說常瞿利毒女陀羅尼呪經-唐-瞿多 0.004700151580492199
KR6j0304 不空羂索陀羅尼經-唐-李無諂 0.004698587329389592
KR5i0052 易說-- 0.0035392531050988697
KR6d0076 法華經擊節-明-德清 0.003410178080207934
KR3i0042 菌譜-宋-陳仁玉 0.0032602025498151706
KR5a0116 黃帝陰符經註-北宋-黃居真 0.0030553726842153977
KR5h0052 玄天上帝百字聖號(玄帝感應靈籤)-- 0.0030450404681152685
KR1i0001 皇祐新樂圖記-宋-阮逸 0.0027158245441945878
KR6o0120 起信論疏記會閱-清-續法 0.0026706872898372308
KR1e0030 春秋五禮例宗-宋-張大亨 0.0026504087541593386
KR1b0008 禹貢指南-宋-毛晃 0.002605044218120542
KR6s0012 新譯大方廣佛華嚴經音義-唐-慧菀 0.002408074637176279
KR6i0453 佛說身毛喜豎經-宋-惟淨 0.002323207980597948
KR5a0123 黃帝陰符經註--劉處玄 0.002273007090428055
KR1e0020 春秋通義-宋-闕名 0.002210621023310573
KR6j0229 聖觀自在菩薩心真言瑜伽觀行儀軌-唐-不空 0.0021923937192554832
KR5c0269 太上老君玄妙枕中內德神咒經-唐- 0.0021841143637722138
KR6j0139 寶悉地成佛陀羅尼經-唐-不空 0.002181649296271316
Top 20 nodes by degree centrality:
KR6j0288 佛說觀自在菩薩如意心陀羅尼呪經-唐-義淨 458
KR6j0304 不空羂索陀羅尼經-唐-李無諂 428
KR6j0233 千轉陀羅尼觀世音菩薩呪-唐-智通 410
KR6j0496 佛說常瞿利毒女陀羅尼呪經-唐-瞿多 370
KR5h0052 玄天上帝百字聖號(玄帝感應靈籤)-- 360
KR3i0042 菌譜-宋-陳仁玉 325
KR1i0001 皇祐新樂圖記-宋-阮逸 312
KR1b0008 禹貢指南-宋-毛晃 311
KR6j0583 東方最勝燈王陀羅尼經-隋-闍那崛多 309
KR5a0116 黃帝陰符經註-北宋-黃居真 309
KR6i0453 佛說身毛喜豎經-宋-惟淨 300
KR4c0016 唐皇甫冉詩集-唐- 297
KR1e0030 春秋五禮例宗-宋-張大亨 294
KR6o0120 起信論疏記會閱-清-續法 287
KR6o0085 釋摩訶衍論記-唐-聖法 278
KR1e0020 春秋通義-宋-闕名 278
KR6j0229 聖觀自在菩薩心真言瑜伽觀行儀軌-唐-不空 276
KR4c0051 劉賓客文集-唐-劉禹錫 276
KR5i0052 易說-- 266
KR6d0076 法華經擊節-明-德清 266
Top 20 nodes by eigenvector centrality:
KR6j0304 不空羂索陀羅尼經-唐-李無諂 0.08812291773715351
KR6j0583 東方最勝燈王陀羅尼經-隋-闍那崛多 0.07454350607657415
KR6j0288 佛說觀自在菩薩如意心陀羅尼呪經-唐-義淨 0.07130758533398501
KR6j0496 佛說常瞿利毒女陀羅尼呪經-唐-瞿多 0.06890515051197889
KR6j0233 千轉陀羅尼觀世音菩薩呪-唐-智通 0.06867247403894727
KR6j0579 佛說稱讚如來功德神呪經-唐-義淨 0.061374660142873794
KR5c0033 太上老君說解釋咒詛經-唐- 0.05864276793135707
KR6j0566 陀羅尼雜集--失譯 0.05394860958195797
KR5b0253 上清天樞院回車畢道正法-宋後期- 0.05339559046584755
KR6j0257 千眼千臂觀世音菩薩陀羅尼神呪經-唐-智通 0.05294957101334831
KR6j0121 大佛頂如來放光悉怛多般怛羅大神力都攝一切呪王陀羅尼經大威德最勝金輪三昧呪品-- 0.05252337168997488
KR5h0052 玄天上帝百字聖號(玄帝感應靈籤)-- 0.05199514135112837
KR6j0483 末利支提婆華鬘經-唐-不空 0.05168031150622622
KR6j0546 施諸餓鬼飲食及水法-唐-不空 0.05062923611737598
KR6j0139 寶悉地成佛陀羅尼經-唐-不空 0.0504659165449759
KR6j0229 聖觀自在菩薩心真言瑜伽觀行儀軌-唐-不空 0.049627458925181715
KR6j0072 陀羅尼集經-唐-阿地瞿多 0.04730942724871687
KR6j0379 地藏菩薩儀軌-唐-輸婆迦羅 0.04681546996771384
KR6j0498 使呪法經-唐-菩提留支 0.045394221877556845
KR6j0301 不空羂索呪經-隋-闍那崛多 0.04470212017532556
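For readers unfamiliar with these measures, the following sketch shows how two of them can be computed on an unweighted, undirected graph using only the Python standard library. It is not the code that produced the lists above, and the library used there is not stated. Degree is reported as a raw neighbour count, as in the degree list; eigenvector centrality is approximated by power iteration; betweenness centrality is left out because it needs a shortest-path algorithm that would not fit in a short example. All function names and the toy star graph are assumptions.

```python
from collections import defaultdict
import math

def adjacency(edges):
    """Build an undirected adjacency map from (a, b) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def degree(adj):
    """Raw degree: number of neighbours of each node."""
    return {node: len(nbrs) for node, nbrs in adj.items()}

def eigenvector_centrality(adj, iterations=100, tol=1e-9):
    """Power iteration on A + I, normalized to unit Euclidean length."""
    score = {node: 1.0 for node in adj}
    for _ in range(iterations):
        # Adding the node's own score (the "+ I") avoids oscillation on bipartite graphs
        new = {n: score[n] + sum(score[m] for m in adj[n]) for n in adj}
        norm = math.sqrt(sum(v * v for v in new.values()))
        new = {n: v / norm for n, v in new.items()}
        if sum(abs(new[n] - score[n]) for n in adj) < tol:
            return new
        score = new
    return score

def top(scores, n=20):
    """The n highest-scoring nodes, best first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Toy star graph with hypothetical node names
adj = adjacency([("hub", "a"), ("hub", "b"), ("hub", "c")])
deg = degree(adj)
eig = eigenvector_centrality(adj)
print(top(deg, 2), top(eig, 1))
```

Degree only counts direct neighbours, whereas eigenvector centrality also rewards being connected to other well-connected nodes, which may be why the eigenvector list is dominated by a single cluster of KR6j texts while the degree list is more mixed.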
Created: 2019-07-09 Tue 09:30