Monitoring method of public opinion in minor languages using deep learning

Song Qianli^{1,2}, Lai Hua^{1,2}

doi: 10.3788/IRLA20210298

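1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China
2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming, Yunnan 650500, China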

Funding: National Natural Science Foundation of China (61972186, 61762056, 61472168); Yunnan Provincial Major Science and Technology Special Plan Project (202002AD080001)

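About the author: Song Qianli, male, master's student; his research focuses on natural language processing and cross-lingual sentiment analysis for minor languages.

CLC number: TP391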


Abstract: In public opinion monitoring for minor languages, annotated corpora are hard to obtain, so deep learning models train poorly. It is therefore difficult to accurately extract the core opinion sentences from news published by the public and the media, which in turn degrades downstream public opinion analysis. To make the problem concrete, Vietnamese is taken as an example, and a Chinese-Vietnamese cross-lingual method for extracting news opinion sentences that incorporates shared topic features is proposed. The method draws on abundant annotated Chinese corpora to offset the scarcity of minor-language resources, and uses the topic information shared between bilingual comparable corpora to improve extraction and thus public opinion monitoring. Specifically, Latent Dirichlet Allocation (LDA) topics are extracted from Chinese-Vietnamese comparable news to construct shared topic features; a bilingual word embedding model is trained with the help of a shared topic dictionary and a sentiment dictionary so that Chinese and Vietnamese share one semantic space; and the features are fused into the word vectors, combining semantic information with topic, sentiment, and position information. Experiments on a Chinese-Vietnamese comparable news dataset show that incorporating shared topic features improves the extraction of opinion sentences from minor-language news, with F₁ reaching 0.721, which supports public opinion monitoring in minor languages.

Keywords:
- public opinion monitoring in minor languages
- cross-lingual opinion sentence extraction
- Chinese-Vietnamese comparable news
- bilingual word embedding
- shared topic representation

Figure 1. Flow chart of Chinese-Vietnamese news opinion sentence extraction incorporating shared topic features

Figure 2. Opinion sentence extraction model incorporating opinion sentence discriminative features

Table 1. Distribution of the training, test, and validation sets in the Chinese and Vietnamese news corpora

Split	Chinese news articles	Vietnamese news articles
Training set	450	450
Test set	25	25
Validation set	25	25

Table 2. Comparison of opinion sentence extraction results with different models

Opinion sentence extraction model	P	R	F₁
LSTM + opinion sentence discriminative features	0.623	0.639	0.631
Bi-LSTM + opinion sentence discriminative features	0.658	0.667	0.662
Transformer + opinion sentence discriminative features	0.711	0.732	0.721
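The F₁ column in Table 2 can be reproduced from the P and R columns, assuming the usual definition of F₁ as the harmonic mean of precision and recall (the excerpt does not show the evaluation formula, so this is an assumption). The function name `f1_score` and the short model labels are illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs copied from Table 2; the keys are shortened model names.
table2 = {
    "LSTM": (0.623, 0.639),
    "Bi-LSTM": (0.658, 0.667),
    "Transformer": (0.711, 0.732),
}

# Round to three decimals, the precision used in the table.
f1_by_model = {name: round(f1_score(p, r), 3) for name, (p, r) in table2.items()}

for name, f1 in f1_by_model.items():
    print(f"{name}: F1 = {f1:.3f}")
```

Rounded to three decimals, the computed values agree with the F₁ column reported for the three models.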

Table 3. Comparison of opinion sentence extraction results with different features

Opinion sentence discriminative features	P	R	F₁
None	0.638	0.650	0.644
Shared topic	0.686	0.695	0.690
Position	0.664	0.683	0.673
Sentiment	0.699	0.707	0.703
Vietnamese topic + position + sentiment	0.676	0.709	0.650
Shared topic + position + sentiment	0.711	0.732	0.721

Table 4. Comparison of opinion sentence extraction results on a Chinese-Vietnamese news example

Item	Chinese news	Vietnamese news
Text	越通社河内—12月21日和22日，越南有关部门、越南驻美国大使馆、越南国家航空公司同当地政府有关部门配合，将在美国滞留的近360名越南公民安全接回国。 ...今后，将在国外滞留的公民接回国工作将继续根据公民的愿望和国内疫情和隔离能力等情况展开。	Trong hai ngày 21-22/12, các cơ quan chức năng Việt Nam, các cơ quan đại diện Việt Nam tại Hoa Kỳ, hãng Hàng không Quốc gia Việt Nam đã phối hợp với các cơ quan chức năng sở tại đưa gần 360 công dân Việt Nam về nước an toàn....Thời gian tới, việc đưa công dân có hoàn cảnh đặc biệt khó khăn về nước sẽ được sắp xếp theo nguyện vọng của công dân, phù hợp với tình hình dịch bệnh và năng lực cách ly trong nước.
Topic words	越南 (Vietnam), 疫情 (epidemic)...	Việt Nam (Vietnam), cách ly (quarantine)...
Without shared topic features	Sentences 1 and 8 (8 sentences in total)	Sentence 1 (9 sentences in total)
With shared topic features	Sentences 1 and 8 (8 sentences in total)	Sentences 1 and 9 (9 sentences in total)
Manually annotated opinion sentences	Sentences 1 and 8 (8 sentences in total)	Sentences 1 and 9 (9 sentences in total)

References

[1]	Pang B, Lee L, Vaithyanathan S. Thumbs up? Sentiment classification using machine learning techniques [C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2002), 2002: 79-86.
[2]	Liu P Y, Xun J, Fei S D, et al. Subjective sentence recognition based on Hidden Markov Model [J]. Journal of Chinese Information Processing, 2016, 30(4): 206-212. (in Chinese)
[3]	Zhao H J, Liu H L, Ren J W, et al. News-oriented emotional key sentence extraction and polarity determination [J]. Journal of Shanxi University (Natural Science Edition), 2014, 37(4): 588-594. (in Chinese) doi: 10.13451/j.cnki.shanxi.univ(nat.sci.).2014.04.018
[4]	Wang J, Tang S, Hang Y X, et al. Chinese-Vietnamese bilingual multi-document news opinion sentence recognition based on sentence association graph [J]. Journal of Computer Applications, 2020, 40(10): 2845-2849. (in Chinese)
[5]	Zhang M M. Cross-language sentiment classification based on shared space [J]. Information Technology and Informatization, 2020(5): 202-207. (in Chinese) doi: 10.3969/j.issn.1672-9528.2020.05.064
[6]	Liu S L. Research on extraction and analysis methods of Chinese and Vietnamese bilingual news opinion sentences [D]. Kunming: Kunming University of Science and Technology, 2017. (in Chinese)
[7]	Lin S Q, Yu Z T, Guo J J, et al. Chinese-Vietnamese news perspective sentence extraction methods incorporating multiple features [J]. Journal of Chinese Information Processing, 2019, 33(11): 101-106. (in Chinese) doi: 10.3969/j.issn.1003-0077.2019.11.012
[8]	Wang Q, Tian M J, Cui R Y, et al. Bilingual topic word embedding for Chinese-Korean cross-lingual text classification [J]. Journal of Chinese Information Processing, 2020, 34(12): 39-47. (in Chinese) doi: 10.3969/j.issn.1003-0077.2020.12.007
[9]	Kang C, Zheng S H, Li W L. Short text classification combining LDA topic model and 2D convolution [J]. Computer Applications and Software, 2020, 37(11): 127-131, 153. (in Chinese) doi: 10.3969/j.issn.1000-386x.2020.11.022
[10]	Vu T, Nguyen D Q, Nguyen D Q, et al. VnCoreNLP: A Vietnamese natural language processing toolkit [C]//Proceedings of NAACL-HLT, 2018: 56-60.
[11]	Zhang J. Research on viewpoint extraction of Chinese comments based on deep learning [D]. Chengdu: Southwest Jiaotong University, 2018. (in Chinese)
[12]	Lin S Q, Yu Z T, Guo J J, et al. Chinese-Vietnamese bilingual news sentiment classifications incorporating perspective sentence features [J]. Journal of Kunming University of Science and Technology (Natural Science Edition), 2020, 45(6): 67-73. (in Chinese) doi: 10.16112/j.cnki.53-1223/n.2020.06.009


0. Introduction

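Under the current international situation, keeping track of relations and geopolitics with neighboring countries requires, taking Vietnam as an example, real-time monitoring of public opinion trends in Vietnamese civil and official media. The mainstream approach today uses web crawlers to collect large amounts of training text and then applies deep learning to obtain monitoring results. Annotated resources for minor languages are scarce, however, so corpora in other languages are needed to assist training.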

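In public opinion monitoring, opinion sentence extraction is a key supporting task that affects the subsequent sentiment analysis of the opinions. The core of cross-lingual opinion sentence extraction is to use rich annotated resources in a source language to compensate for scarce annotated resources in the target language, and to extract the sentences that represent a document's opinions accurately and efficiently. Monolingual opinion sentence extraction has been widely studied, whereas cross-lingual opinion sentence extraction has received little attention and is therefore worth investigating.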

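The opinion sentence extraction task is a binary classification problem: given a document containing multiple sentences, identify and extract the sentences that express the document's opinions. Extracting opinion sentences from news also usually relies on opinion sentence features. For example, Pang et al.^[1] used unigram features to train SVM and naive Bayes classifiers that divide movie reviews into positive and negative classes. Other methods incorporate sentiment lexicons and sentiment features, and use weighting to strengthen attention on key information; for instance, Liu et al.^[2] obtained opinion sentences by extracting subjective and objective features and performing sequence labeling over sentences. Zhao et al.^[3] used ensemble learning to recognize opinion sentences based on sentence topic, position, sentiment, and the part of speech of feature words.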
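To make the binary-classification framing concrete, here is a minimal sketch of a unigram naive Bayes classifier with add-one smoothing, in the spirit of the unigram baselines attributed to Pang et al.^[1]. It is not the method of this paper. The labels `"opinion"` and `"fact"`, the toy English sentences, and the function names `train_nb` and `predict_nb` are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_nb(samples):
    """Count documents and unigrams per label.

    samples: list of (tokens, label) pairs.
    Returns (doc_counts, word_counts, vocab).
    """
    doc_counts = Counter(label for _, label in samples)
    word_counts = {label: Counter() for label in doc_counts}
    for tokens, label in samples:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return doc_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the label with the highest smoothed log-probability."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data (hypothetical, not from the paper's corpus).
train = [
    ("the government should extend the quarantine policy".split(), "opinion"),
    ("officials believe the flights will continue".split(), "opinion"),
    ("the flight landed in hanoi on monday".split(), "fact"),
    ("about 360 citizens were on the flight".split(), "fact"),
]
model = train_nb(train)

pred_opinion = predict_nb(model, "experts believe the policy should continue".split())
pred_fact = predict_nb(model, "the flight landed on monday".split())
print(pred_opinion, pred_fact)
```

A real system would replace the whitespace tokenizer with a language-specific segmenter and train on annotated news sentences; the sketch only shows how unigram counts drive the sentence-level decision.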

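Cross-lingual opinion sentence extraction builds on monolingual opinion sentence extraction by additionally exploiting a source language. Existing methods fall into four main groups, based on bilingual dictionaries, machine translation, parallel corpora, and bilingual word embeddings. The core idea in each case is to transfer source-language corpora into the target-language semantic space, compensating for the scarcity of target-language resources and improving target-language opinion sentence extraction.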

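Methods based on bilingual word embeddings are currently the mainstream. They align the semantic spaces of the target and source languages, so the core problem becomes how to handle differences in semantic expression between languages^[4]. Zhang^[5] used a tf-idf dictionary and an LDA (Latent Dirichlet Allocation) topic dictionary to build a shared semantic space. Liu^[6] proposed a Chinese-Vietnamese bilingual news opinion sentence extraction method that combines element association and sentiment association. Lin et al.^[7], building on a bilingual word embedding model, incorporated topic, position, and sentiment features into the classification model to recognize opinion sentences across languages. Most existing methods that combine features with cross-lingual modeling treat the incorporated features and the opinion sentence extraction as two independent parts. They do not make full use of the associations between Chinese and Vietnamese corpora and underuse the Chinese resources, which limits the final extraction performance on the target language.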
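As a rough illustration of what a tf-idf dictionary might contain, the sketch below ranks the words of each document by tf-idf and keeps the top few as dictionary entries. How the cited work actually builds its dictionaries is not described in this excerpt, so the selection rule, the function name `tfidf_keywords`, and the toy documents are assumptions.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=2):
    """Return the top_k words of each tokenized document ranked by tf-idf.

    tf is the relative frequency of a word within a document and
    idf is log(N / df), where df is the number of documents containing it.
    Ties are broken alphabetically so the result is deterministic.
    """
    n_docs = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    keywords = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            w: (count / len(doc)) * math.log(n_docs / df[w])
            for w, count in tf.items()
        }
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords.append(ranked[:top_k])
    return keywords


# Toy tokenized documents (hypothetical).
docs = [
    ["vietnam", "quarantine", "flight", "flight"],
    ["vietnam", "economy", "trade"],
    ["vietnam", "flight", "trade"],
]
keywords = tfidf_keywords(docs, top_k=2)
print(keywords)
```

Words that occur in every document (here `vietnam`) get an idf of zero and drop out of the ranking, which is why tf-idf tends to surface document-specific terms rather than corpus-wide ones.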

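Because news articles in different languages may describe the same event and topic, Chinese-Vietnamese comparable news is used as the training resource. An analysis of bilingual comparable corpora on similar topics shows that the topics described are highly consistent and the sentiment information is close, but the topic words differ and the texts cannot be translated into each other directly. Topic information is closely tied to opinion sentence extraction, so high-quality sentence-level topic information can improve extraction.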

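In summary, exploiting the fact that cross-lingual news articles may describe the same event and topic, this paper uses Chinese-Vietnamese comparable news as the training resource and proposes a cross-lingual method for extracting opinion sentences from news articles that incorporates shared topic information, combines a deep learning framework with a shared semantic space, and fuses multiple features.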
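One plausible way to picture multi-feature fusion is to append topic, position, and sentiment scores to a sentence's averaged word vector before it is passed to a classifier. The paper's actual fusion (it states that features are fused into the word vectors) is not shown in this excerpt, so everything below is a hypothetical sketch: the function name `sentence_features`, the definition of each score, and the toy vectors are assumptions.

```python
from typing import Dict, List, Set


def sentence_features(
    sentences: List[List[str]],
    word_vectors: Dict[str, List[float]],
    topic_words: Set[str],
    sentiment_words: Set[str],
    dim: int,
) -> List[List[float]]:
    """Build one vector per sentence: mean word vector + [topic, position, sentiment].

    topic / sentiment: share of tokens found in the respective word set.
    position: relative index of the sentence in the document, from 0.0 to 1.0.
    """
    n = len(sentences)
    fused = []
    for idx, tokens in enumerate(sentences):
        vecs = [word_vectors[t] for t in tokens if t in word_vectors]
        if vecs:
            mean = [sum(col) / len(vecs) for col in zip(*vecs)]
        else:
            mean = [0.0] * dim  # no known word: fall back to a zero vector
        length = max(len(tokens), 1)
        topic = sum(t in topic_words for t in tokens) / length
        sentiment = sum(t in sentiment_words for t in tokens) / length
        position = idx / (n - 1) if n > 1 else 0.0
        fused.append(mean + [topic, position, sentiment])
    return fused


# Toy inputs (hypothetical): 2-dimensional vectors in a shared space.
word_vectors = {"vietnam": [1.0, 0.0], "quarantine": [0.0, 1.0], "flight": [1.0, 1.0]}
document = [["vietnam", "quarantine", "good"], ["flight", "landed"]]
fused = sentence_features(
    document,
    word_vectors,
    topic_words={"vietnam", "quarantine"},
    sentiment_words={"good"},
    dim=2,
)
print(fused)
```

With a bilingual embedding, `word_vectors` would cover both Chinese and Vietnamese tokens, and `topic_words` would hold the shared topic words, so that sentences from either language map to comparable feature vectors.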

3. Conclusion

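The experiments with this model lead to the following conclusions. For Vietnamese opinion sentence extraction, three choices each improve extraction: introducing annotated Chinese corpora through bilingual word embeddings during training; fusing opinion sentence discriminative features (shared topic, position, and sentiment features) into the word vectors; and using a Transformer as the extraction model. Together they reach an F₁ of 0.721 in the experiments, which in turn can improve public opinion monitoring for minor languages. Future work will study how to use the extracted opinion sentences to improve sentiment classification of Vietnamese news, and how to adjust the specific parameters when applying the method to other minor languages.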

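Contents

1.1. Acquisition of shared topic information
1.2. Construction of the shared semantic space
1.3. Chinese-Vietnamese news opinion sentence extraction model incorporating shared topics
2.1. Data preparation
2.2. Evaluation metrics
2.3. Parameter settings
2.4. Comparison experiments
2.5. Ablation experiments
2.6. Case analysis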
