SXK Sensitive Word Filtering Feature Design

Overview

Modules in the SXK system such as operations management, marketing management, activity publishing, course management, and consultation management involve large volumes of text and media content. In day-to-day operations this content generates a heavy review workload, and traditional manual review consumes a great deal of the operations department's time and staff. We therefore want to build automatic sensitive-content review on top of the existing manual review, to reduce manual effort and improve efficiency.
The content types that need review include text, images, and video. Given the current state of platform development and operations, phase one covers automatic filtering of text content only.

Sensitive Word Filtering Workflow

Workflow notes:
The existing review process falls roughly into two kinds:

  1. Review before the content takes effect

Examples: comments, feedback.

  2. Create a pending copy of the content; once the copy passes review, replace the live content with it

This kind of review protects the correctness of data already live in production.
Examples: courses, teacher profiles.

Two points were considered:

  • Do not disturb the current workflow or introduce large changes.
  • Regardless of the automatic result, the existing manual review must remain, to limit misjudgments; automatic review only assists the human reviewer.

Therefore, the keyword filtering service is invoked just before the original manual review step. If sensitive words are found, a record is written and a notice is shown at the manual review step; the rest of the flow stays unchanged, and the final decision remains with a human reviewer.
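This pre-review hook could look roughly like the following sketch. It is a minimal illustration, not the SXK implementation: `PreReviewHook`, `Checker`, and `annotate` are hypothetical names, and the real service returns a richer `SensitiveContentCheckResult`.

```java
import java.util.Set;

// Hypothetical pre-review hook: run the automatic filter first, attach any hits
// as a note for the reviewer, and never block the item itself.
class PreReviewHook {
    interface Checker {                          // stands in for the filtering service
        Set<String> findHits(String text);
    }

    private final Checker checker;

    PreReviewHook(Checker checker) {
        this.checker = checker;
    }

    /** Returns a human-readable note shown at the manual review step. */
    String annotate(String content) {
        Set<String> hits = checker.findHits(content);
        if (hits.isEmpty()) {
            return "no sensitive words detected";
        }
        // A hit only produces a warning; the final decision stays with the reviewer.
        return "sensitive words detected: " + String.join(", ", hits);
    }
}
```

Because the hook only annotates and never rejects, the existing workflow is untouched, which satisfies the two constraints above.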

Model Design

  • Sensitive word entry - SensitiveText

Stores the sensitive word entries.

  • Sensitive word hit history - SensitiveContentHitHistory

A record written whenever the automatic filter hits, capturing the time, the content submitter, the original content, the sensitive content, and information about the item under review.
This data can later serve as the basis for evaluation, monitoring, and similar features.

For the database schema, refer to the model design above.
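As a rough sketch, the two models might carry fields like the following. The field names here are assumptions inferred from the service code, not the authoritative schema:

```java
import java.time.LocalDateTime;

// Hypothetical minimal shapes for the two models; field names are assumptions,
// not the actual database design.
class SensitiveModels {
    /** A single sensitive word entry. */
    record SensitiveText(Long id, String content) {}

    /** One row per automatic-filter hit; feeds later evaluation and monitoring. */
    record SensitiveContentHitHistory(
            Long id,
            String checkedText,       // the original submitted content
            String sensitiveContent,  // the matched words, comma-joined
            String submitter,         // who submitted the content
            LocalDateTime hitTime) {} // when the hit occurred
}
```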

Core Interface / Code

package io.github.jarryzhou.sensitivecontentfilter.service;

import io.github.jarryzhou.sensitivecontentfilter.dto.SensitiveContentCheckResult;
import io.github.jarryzhou.sensitivecontentfilter.entity.SensitiveContentHitHistory;
import io.github.jarryzhou.sensitivecontentfilter.entity.SensitiveText;
import io.github.jarryzhou.sensitivecontentfilter.repository.SensitiveContentHitHistoryRepository;
import io.github.jarryzhou.sensitivecontentfilter.repository.SensitiveTextRepository;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.time.LocalDateTime;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/**
 * SensitiveContentCheckService
 * <p>
 * Author: Jarry Zhou
 * Date: 2021/9/29
 * Description: Sensitive content detection service
 **/
@Service
public class SensitiveContentCheckService {

    public static final String SENSITIVE_TEXT_DELIMITER = ",";

    @Autowired
    private SensitiveTextRepository sensitiveTextRepository;
    @Autowired
    private SensitiveContentHitHistoryRepository sensitiveContentHitHistoryRepository;

    private WordNode sensitiveWordNodeTree;

    public SensitiveContentCheckResult check(String text) {
        return check(text, true);
    }

    public SensitiveContentCheckResult check(String text, boolean generateRecord) {
        if (sensitiveWordNodeTree == null) {
            initSensitiveWordNodeTree();
        }
        Set<String> hits = findSensitiveWords(text);
        if (hits != null && !hits.isEmpty() && generateRecord) {
            generateSensitiveContentCheckHistory(text, hits);
        }
        return buildCheckResult(text, hits);
    }

    // Lazily builds the word tree from all entries in the SensitiveText table.
    private void initSensitiveWordNodeTree() {
        List<SensitiveText> all = sensitiveTextRepository.findAll();
        buildDFATree(all.stream().map(SensitiveText::getContent).collect(Collectors.toList()));
    }

    // Builds a character trie (DFA-style) in which each root-to-end path spells a sensitive word.
    private void buildDFATree(List<String> strings) {
        WordNode root = new WordNode();
        strings.forEach(word -> {
            WordNode current = root;
            for (int i = 0; i < word.length(); i++) {
                char ch = word.charAt(i);
                current.putChildIfAbsent(ch, new WordNode());
                current = current.getChildren().get(ch);
                if (i == word.length() - 1) {
                    current.setEnd(true);
                }
            }
        });
        sensitiveWordNodeTree = root;
    }

    private SensitiveContentCheckResult buildCheckResult(String text, Set<String> hits) {
        SensitiveContentCheckResult result = new SensitiveContentCheckResult();
        result.setCheckTime(LocalDateTime.now());
        result.setSensitive(hits != null && !hits.isEmpty());
        result.setSensitiveContent(hits);
        result.setOriginalContent(text);
        return result;
    }

    // Scans the text once; on a match, records it and skips past the matched word.
    private Set<String> findSensitiveWords(String text) {
        Set<String> hits = new HashSet<>();
        for (int i = 0; i < text.length(); i++) {
            int matchedLength = doCheck(sensitiveWordNodeTree.getChildren(), text, i);
            if (matchedLength > 0) {
                hits.add(text.substring(i, i + matchedLength));
                i += matchedLength - 1;
            }
        }
        return hits;
    }

    // Walks the trie starting at beginIndex; returns the matched length, or 0 if no word ends here.
    private int doCheck(Map<Character, WordNode> sensitiveWords, String txt, int beginIndex) {
        if (sensitiveWords == null || sensitiveWords.isEmpty()) {
            return 0;
        }
        boolean allMatches = false;
        int matchedIndex = 0;
        for (int i = beginIndex; i < txt.length(); i++) {
            WordNode wordNode = sensitiveWords.get(txt.charAt(i));
            if (wordNode == null) {
                break;
            }
            matchedIndex++;
            sensitiveWords = wordNode.getChildren();
            if (wordNode.isEnd()) {
                allMatches = true;
                break;
            }
        }
        return allMatches ? matchedIndex : 0;
    }

    private void generateSensitiveContentCheckHistory(String text, Set<String> sensitiveWordList) {
        SensitiveContentHitHistory sensitiveContentHitHistory = new SensitiveContentHitHistory();
        sensitiveContentHitHistory.setCheckedText(text);
        sensitiveContentHitHistory.setSensitiveContent(String.join(SENSITIVE_TEXT_DELIMITER, sensitiveWordList));
        sensitiveContentHitHistoryRepository.save(sensitiveContentHitHistory);
    }
}
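The trie matching in findSensitiveWords/doCheck can be exercised standalone with a stripped-down copy, without the Spring service or repositories. Node here is a local stand-in for the project's WordNode class:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Stand-alone copy of the trie matching logic, for trying the algorithm in isolation.
class TrieDemo {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean end;
    }

    // Insert each word character by character; mark the node of the last character.
    static Node build(List<String> words) {
        Node root = new Node();
        for (String word : words) {
            Node current = root;
            for (char ch : word.toCharArray()) {
                current = current.children.computeIfAbsent(ch, c -> new Node());
            }
            current.end = true;
        }
        return root;
    }

    // Scan the text once; on a match, record it and skip past the matched word,
    // mirroring findSensitiveWords/doCheck above.
    static Set<String> find(Node root, String text) {
        Set<String> hits = new HashSet<>();
        for (int i = 0; i < text.length(); i++) {
            Node current = root;
            int length = 0;
            for (int j = i; j < text.length(); j++) {
                current = current.children.get(text.charAt(j));
                if (current == null) {
                    break;
                }
                length++;
                if (current.end) {
                    hits.add(text.substring(i, i + length));
                    i += length - 1;  // continue scanning after the match
                    break;
                }
            }
        }
        return hits;
    }
}
```

Like the service, this returns the shortest match from each starting position and skips overlapping matches.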

Features to Extend or Improve

  • Matching is currently exact. In practice, fuzzy matching is often needed, so wildcard configuration could be supported, e.g. to catch variants such as 你***妈.
  • Automatic replacement of sensitive words is worth adding; otherwise every hit must be reviewed by hand, which is a heavy workload, and once a sensitive word has been detected the reviewer cannot approve it anyway.
  • Other features, such as banning accounts, should be built on top of the hit history (SensitiveContentHitHistory).
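The auto-replacement idea could be sketched as a masking pass over the detected words. SensitiveWordMasker and the same-length `*` policy are illustrative choices, not part of the design; a production version would reuse the trie instead of one pass per word:

```java
import java.util.List;

// Illustrative auto-replacement: mask each sensitive word with '*' of the same
// length, so flagged content can be shown without exposing the word itself.
class SensitiveWordMasker {
    private final List<String> words;

    SensitiveWordMasker(List<String> words) {
        this.words = words;
    }

    String mask(String text) {
        String result = text;
        for (String word : words) {
            // Simple exact replacement, matching the filter's current exact-match behavior.
            result = result.replace(word, "*".repeat(word.length()));
        }
        return result;
    }
}
```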

Sample Repository

https://github.com/jarryscript/sensitive-content-filter

Copyright is owned by the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.