SXK Sensitive Word Filtering Feature Design

Overview

Modules in the SXK system such as operations management, marketing management, activity publishing, course management, and consultation management involve large volumes of text and media content. In day-to-day operations this content generates a heavy review workload, and traditional manual review consumes a great deal of the operations department's time and staff. We therefore want to build automatic sensitive-content review on top of the existing manual review, to reduce manual effort and improve efficiency.
The content types that need review include text, images, and video. Given the current state of platform development and operations, phase one covers automatic filtering of text content only.

Sensitive Word Filtering Workflow

Workflow notes:
The existing review process falls roughly into two kinds:

  1. Review before the content takes effect

Examples: comments, feedback.

  2. Create a pending copy of the content; once the copy passes review, replace the live content with it

This kind of review protects the correctness of data already live in production.
Examples: courses, teacher profiles.

Two points were considered:

  • Do not disturb the current workflow or introduce large changes.
  • Regardless of the automatic result, the existing manual review must remain, to limit misjudgments; automatic review only assists the human reviewer.

Therefore, the keyword filtering service is invoked just before the original manual review step. If sensitive words are found, a record is written and a notice is shown at the manual review step; the rest of the flow stays unchanged, and the final decision remains with a human reviewer.
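This pre-review hook could look roughly like the following sketch. It is a minimal illustration, not the SXK implementation: `PreReviewHook`, `Checker`, and `annotate` are hypothetical names, and the real service returns a richer `SensitiveContentCheckResult`.

```java
import java.util.Set;

// Hypothetical pre-review hook: run the automatic filter first, attach any hits
// as a note for the reviewer, and never block the item itself.
class PreReviewHook {
    interface Checker {                          // stands in for the filtering service
        Set<String> findHits(String text);
    }

    private final Checker checker;

    PreReviewHook(Checker checker) {
        this.checker = checker;
    }

    /** Returns a human-readable note shown at the manual review step. */
    String annotate(String content) {
        Set<String> hits = checker.findHits(content);
        if (hits.isEmpty()) {
            return "no sensitive words detected";
        }
        // A hit only produces a warning; the final decision stays with the reviewer.
        return "sensitive words detected: " + String.join(", ", hits);
    }
}
```

Because the hook only annotates and never rejects, the existing workflow is untouched, which satisfies the two constraints above.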

Model Design

  • Sensitive word entry - SensitiveText

Stores the sensitive word entries.

  • Sensitive word hit history - SensitiveContentHitHistory

A record written whenever the automatic filter hits, capturing the time, the content submitter, the original content, the sensitive content, and information about the item under review.
This data can later serve as the basis for evaluation, monitoring, and similar features.

For the database schema, refer to the model design above.
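As a rough sketch, the two models might carry fields like the following. The field names here are assumptions inferred from the service code, not the authoritative schema:

```java
import java.time.LocalDateTime;

// Hypothetical minimal shapes for the two models; field names are assumptions,
// not the actual database design.
class SensitiveModels {
    /** A single sensitive word entry. */
    record SensitiveText(Long id, String content) {}

    /** One row per automatic-filter hit; feeds later evaluation and monitoring. */
    record SensitiveContentHitHistory(
            Long id,
            String checkedText,       // the original submitted content
            String sensitiveContent,  // the matched words, comma-joined
            String submitter,         // who submitted the content
            LocalDateTime hitTime) {} // when the hit occurred
}
```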

Core Interface / Code

package io.github.jarryzhou.sensitivecontentfilter.service;

import io.github.jarryzhou.sensitivecontentfilter.dto.SensitiveContentCheckResult;
import io.github.jarryzhou.sensitivecontentfilter.entity.SensitiveContentHitHistory;
import io.github.jarryzhou.sensitivecontentfilter.entity.SensitiveText;
import io.github.jarryzhou.sensitivecontentfilter.repository.SensitiveContentHitHistoryRepository;
import io.github.jarryzhou.sensitivecontentfilter.repository.SensitiveTextRepository;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.time.LocalDateTime;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/**
 * SensitiveContentCheckService
 * <p>
 * Author: Jarry Zhou
 * Date: 2021/9/29
 * Description: Sensitive content detection service
 **/
@Service
public class SensitiveContentCheckService {

    public static final String SENSITIVE_TEXT_DELIMITER = ",";

    @Autowired
    private SensitiveTextRepository sensitiveTextRepository;
    @Autowired
    private SensitiveContentHitHistoryRepository sensitiveContentHitHistoryRepository;

    private WordNode sensitiveWordNodeTree;

    public SensitiveContentCheckResult check(String text) {
        return check(text, true);
    }

    public SensitiveContentCheckResult check(String text, boolean generateRecord) {
        if (sensitiveWordNodeTree == null) {
            initSensitiveWordNodeTree();
        }
        Set<String> hits = findSensitiveWords(text);
        if (hits != null && !hits.isEmpty() && generateRecord) {
            generateSensitiveContentCheckHistory(text, hits);
        }
        return buildCheckResult(text, hits);
    }

    // Lazily builds the word tree from all entries in the SensitiveText table.
    private void initSensitiveWordNodeTree() {
        List<SensitiveText> all = sensitiveTextRepository.findAll();
        buildDFATree(all.stream().map(SensitiveText::getContent).collect(Collectors.toList()));
    }

    // Builds a character trie (DFA-style) in which each root-to-end path spells a sensitive word.
    private void buildDFATree(List<String> strings) {
        WordNode root = new WordNode();
        strings.forEach(word -> {
            WordNode current = root;
            for (int i = 0; i < word.length(); i++) {
                char ch = word.charAt(i);
                current.putChildIfAbsent(ch, new WordNode());
                current = current.getChildren().get(ch);
                if (i == word.length() - 1) {
                    current.setEnd(true);
                }
            }
        });
        sensitiveWordNodeTree = root;
    }

    private SensitiveContentCheckResult buildCheckResult(String text, Set<String> hits) {
        SensitiveContentCheckResult result = new SensitiveContentCheckResult();
        result.setCheckTime(LocalDateTime.now());
        result.setSensitive(hits != null && !hits.isEmpty());
        result.setSensitiveContent(hits);
        result.setOriginalContent(text);
        return result;
    }

    // Scans the text once; on a match, records it and skips past the matched word.
    private Set<String> findSensitiveWords(String text) {
        Set<String> hits = new HashSet<>();
        for (int i = 0; i < text.length(); i++) {
            int matchedLength = doCheck(sensitiveWordNodeTree.getChildren(), text, i);
            if (matchedLength > 0) {
                hits.add(text.substring(i, i + matchedLength));
                i += matchedLength - 1;
            }
        }
        return hits;
    }

    // Walks the trie starting at beginIndex; returns the matched length, or 0 if no word ends here.
    private int doCheck(Map<Character, WordNode> sensitiveWords, String txt, int beginIndex) {
        if (sensitiveWords == null || sensitiveWords.isEmpty()) {
            return 0;
        }
        boolean allMatches = false;
        int matchedIndex = 0;
        for (int i = beginIndex; i < txt.length(); i++) {
            WordNode wordNode = sensitiveWords.get(txt.charAt(i));
            if (wordNode == null) {
                break;
            }
            matchedIndex++;
            sensitiveWords = wordNode.getChildren();
            if (wordNode.isEnd()) {
                allMatches = true;
                break;
            }
        }
        return allMatches ? matchedIndex : 0;
    }

    private void generateSensitiveContentCheckHistory(String text, Set<String> sensitiveWordList) {
        SensitiveContentHitHistory sensitiveContentHitHistory = new SensitiveContentHitHistory();
        sensitiveContentHitHistory.setCheckedText(text);
        sensitiveContentHitHistory.setSensitiveContent(String.join(SENSITIVE_TEXT_DELIMITER, sensitiveWordList));
        sensitiveContentHitHistoryRepository.save(sensitiveContentHitHistory);
    }
}
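The trie matching in findSensitiveWords/doCheck can be exercised standalone with a stripped-down copy, without the Spring service or repositories. Node here is a local stand-in for the project's WordNode class:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Stand-alone copy of the trie matching logic, for trying the algorithm in isolation.
class TrieDemo {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean end;
    }

    // Insert each word character by character; mark the node of the last character.
    static Node build(List<String> words) {
        Node root = new Node();
        for (String word : words) {
            Node current = root;
            for (char ch : word.toCharArray()) {
                current = current.children.computeIfAbsent(ch, c -> new Node());
            }
            current.end = true;
        }
        return root;
    }

    // Scan the text once; on a match, record it and skip past the matched word,
    // mirroring findSensitiveWords/doCheck above.
    static Set<String> find(Node root, String text) {
        Set<String> hits = new HashSet<>();
        for (int i = 0; i < text.length(); i++) {
            Node current = root;
            int length = 0;
            for (int j = i; j < text.length(); j++) {
                current = current.children.get(text.charAt(j));
                if (current == null) {
                    break;
                }
                length++;
                if (current.end) {
                    hits.add(text.substring(i, i + length));
                    i += length - 1;  // continue scanning after the match
                    break;
                }
            }
        }
        return hits;
    }
}
```

Like the service, this returns the shortest match from each starting position and skips overlapping matches.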

Features to Extend or Improve

  • Matching is currently exact. In practice, fuzzy matching is often needed, so wildcard configuration could be supported, e.g. to catch variants such as 你***妈.
  • Automatic replacement of sensitive words is worth adding; otherwise every hit must be reviewed by hand, which is a heavy workload, and once a sensitive word has been detected the reviewer cannot approve it anyway.
  • Other features, such as banning accounts, should be built on top of the hit history (SensitiveContentHitHistory).
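The auto-replacement idea could be sketched as a masking pass over the detected words. SensitiveWordMasker and the same-length `*` policy are illustrative choices, not part of the design; a production version would reuse the trie instead of one pass per word:

```java
import java.util.List;

// Illustrative auto-replacement: mask each sensitive word with '*' of the same
// length, so flagged content can be shown without exposing the word itself.
class SensitiveWordMasker {
    private final List<String> words;

    SensitiveWordMasker(List<String> words) {
        this.words = words;
    }

    String mask(String text) {
        String result = text;
        for (String word : words) {
            // Simple exact replacement, matching the filter's current exact-match behavior.
            result = result.replace(word, "*".repeat(word.length()));
        }
        return result;
    }
}
```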

Sample Repository

https://github.com/jarryscript/sensitive-content-filter

Copyright is owned by the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.