
Research on Intralingual Word Segmentation of Ancient Chinese Texts: From the Perspective of Large Language Models

Abstract: [Purpose/significance] Processing ancient texts with artificial intelligence supports the preservation and development of traditional Chinese culture in the new era. Word segmentation is a foundational task in natural language processing, so examining how domain-specific models perform on Chinese intralingual word segmentation matters for integrating ancient text studies with artificial intelligence. [Method/process] This study first constructs a Chinese intralingual word segmentation dataset of over 1 million entries drawn from segmented corpora of the Pre-Qin classics, the Twenty-Four Histories, and the New Era People's Daily Segmented Corpus (NEPD). It then applies instruction fine-tuning to a traditional deep learning model (BiLSTM-CRF), pre-trained models for the ancient text domain (SikuBert, SikuRoberta, GujiBert, GujiRoBerTa), and large language models for the ancient text domain (Xunzi-Baichuan2-7B and Xunzi-Qwen2-7B). Finally, it analyzes the segmentation performance of these models from two angles: evaluation metrics and content quality. [Result/conclusion] On the Chinese intralingual word segmentation task, BiLSTM-CRF performs poorly, while the domain-specific Bert-series pre-trained models perform very well. The domain-specific large language models perform on par with the Bert-series models, generalize well, are robust, and show strong potential for complex sequence labeling tasks.
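The abstract treats word segmentation as a sequence labeling task scored with evaluation metrics, but it does not name the tagging scheme or the metrics used. The sketch below is only an illustration under two assumptions: a BMES character tagging scheme and word-level precision, recall, and F1 computed over character spans. Both are common conventions in Chinese word segmentation, not details confirmed by this paper, and the function names `to_bmes`, `to_spans`, and `prf` are hypothetical.

```python
def to_bmes(words):
    """Convert a list of words into per-character B/M/E/S tags (assumed scheme)."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")  # single-character word
        else:
            # Begin, zero or more Middle, End
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def to_spans(words):
    """Return the set of (start, end) character offsets, one per word."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def prf(gold_words, pred_words):
    """Word-level precision, recall, and F1: a word counts only if both boundaries match."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Toy example (not taken from the paper's dataset).
gold = ["天下", "之", "人"]
pred = ["天", "下", "之", "人"]
print(to_bmes(gold))
print(prf(gold, pred))
```

In this toy case the prediction splits the first word in two, so only two of its four predicted words match the three gold words exactly.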

Version history

[V1] 2025-02-12 10:35:25 PSSXiv:202502.00521V1