
Weak Supervision Data Labeling: replacing manual annotation with rule-based labeling functions

Skill-Weak-Supervision-Data-Labeling · 22-Data Collection Engineering

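causal · experiment · rag · data_collection · Advertising & Placement · Customer Service & VOC · Recommendation & Search · Knowledge Graph & RAG · Data Collection & Governance · WF-B Ad Optimization · WF-C Customer Service Triage · WF-E Review Monitoring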

Included in: Data Governance Fundamentals Handbook · Cross-Border Risk Defense War Room

Annualized ROI: ¥200,000-500,000

Implementation difficulty: ⭐⭐⭐☆☆

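Business perspective

Applicable roles: data engineer / technical lead · operations lead · product selection lead

Applicable platforms: Amazon SP API + Keepa · TikTok Shop API · cross-border multi-platform data lake

When to use: you want to monitor competitor prices, reviews, and rankings but lack a stable collection capability, and manual collection is too slow; data is scattered across platforms and costly to integrate; data pipelines are unstable and break often.

What success looks like: competitor price and review data refresh automatically every day, multi-platform data lands in a single warehouse, pipeline stability exceeds 99%, and data retrieval time drops from hours to minutes.

Business pain points

- Collecting competitor data by hand is too slow
- Platform API limits block data collection
- Data from multiple systems cannot be integrated
- Reports run on stale data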

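1. Problem solved

Training a review-classification model requires 10,000 manually labeled examples, which costs ¥30,000 and takes 3 weeks. Weak supervision instead uses 15 rule-based labeling functions to label the entire dataset automatically, in 1 day and at near-zero cost. Labeling throughput improves 10-50x, and across multiple AI projects the annualized savings in labeling cost come to ¥200,000-500,000.

2. Core algorithm logic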

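Manual labeling vs. weak-supervision labeling:
- Manual labeling: annotators label each of the 10,000 examples by hand (about ¥30,000 and 3 weeks).
- Weak-supervision labeling: 15 labeling functions each vote POSITIVE, NEGATIVE, or ABSTAIN on every example, and the votes are fused into a soft label (about 1 day, near-zero cost).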
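The weak-supervision side of this comparison can be sketched in a few lines: apply every labeling function to every text to build a label matrix, then fuse each row of votes into a soft label. The sketch below uses a plain majority vote as a stand-in for a learned label model; the two labeling functions, the sample reviews, and the helper names `build_label_matrix` and `majority_vote_soft_label` are illustrative assumptions, not part of the original template.

```python
import re
from typing import Callable, List, Optional

POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

LabelingFunction = Callable[[str], int]


def lf_mentions_numbers(text: str) -> int:
    """Illustrative rule: two or more numbers hint at concrete detail."""
    return POSITIVE if len(re.findall(r"\d+", text)) >= 2 else ABSTAIN


def lf_very_short(text: str) -> int:
    """Illustrative rule: fewer than 5 words hints at low quality."""
    return NEGATIVE if len(text.split()) < 5 else ABSTAIN


def build_label_matrix(texts: List[str], lfs: List[LabelingFunction]) -> List[List[int]]:
    """One row per text, one column per labeling function."""
    return [[lf(text) for lf in lfs] for text in texts]


def majority_vote_soft_label(votes: List[int]) -> Optional[float]:
    """Share of non-abstaining votes that are POSITIVE; None if every function abstained."""
    cast = [v for v in votes if v != ABSTAIN]
    if not cast:
        return None
    return sum(1 for v in cast if v == POSITIVE) / len(cast)


# Hypothetical sample reviews, for illustration only.
reviews = [
    "Used it for 3 months, holds 2 bottles, no leaks so far.",
    "Great!!!",
    "It arrived on time and the color matches the photos well.",
]
lfs = [lf_mentions_numbers, lf_very_short]
matrix = build_label_matrix(reviews, lfs)
soft_labels = [majority_vote_soft_label(row) for row in matrix]
```

Rows where every function abstains get no label at all, which is why coverage matters as much as accuracy when designing labeling functions.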

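3. Business application scenarios

Business problem: train a "maternal and baby product review quality classifier" (high quality / low quality), which needs 10,000 labeled examples. Manual labeling costs ¥30,000 and takes 3 weeks. With weak supervision, writing 15 labeling functions completes the labeling in 1 day at near-zero cost.

Data requirements:
- Unlabeled review data (10,000+ items)
- Domain knowledge (used to design the labeling functions)

Expected outputs:
- A soft label for each item (for example, P(high quality) = 0.78)
- A quality analysis of the labeling functions (which function has the highest accuracy)
- A weakly labeled dataset that can be used for training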
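The labeling function quality analysis listed above can be approximated with three per-function statistics: coverage (how often the function votes), conflict (how often it disagrees with another voting function), and, when a small hand-labeled sample is available, empirical accuracy. The following is a minimal sketch under those assumptions; `lf_summary`, the toy label matrix, and the `gold` labels are hypothetical and only illustrate the bookkeeping.

```python
from typing import Dict, List, Optional

POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1


def lf_summary(
    matrix: List[List[int]], gold: Optional[List[int]] = None
) -> List[Dict[str, Optional[float]]]:
    """Per-function coverage, conflict rate, and accuracy on rows where it voted."""
    n_rows = len(matrix)
    n_lfs = len(matrix[0]) if matrix else 0
    summary = []
    for j in range(n_lfs):
        voted = [i for i in range(n_rows) if matrix[i][j] != ABSTAIN]
        # A conflict is a row where this function voted and some other
        # function cast a different, non-abstaining vote.
        conflicts = sum(
            1
            for i in voted
            if any(
                v != ABSTAIN and v != matrix[i][j]
                for k, v in enumerate(matrix[i])
                if k != j
            )
        )
        accuracy = None
        if gold is not None and voted:
            accuracy = sum(1 for i in voted if matrix[i][j] == gold[i]) / len(voted)
        summary.append(
            {
                "coverage": len(voted) / n_rows if n_rows else 0.0,
                "conflict": conflicts / n_rows if n_rows else 0.0,
                "accuracy": accuracy,
            }
        )
    return summary


# Toy label matrix: 4 texts x 3 labeling functions, plus hypothetical gold labels.
toy_matrix = [
    [1, -1, 1],
    [1, 0, -1],
    [-1, 0, 0],
    [-1, -1, 1],
]
toy_gold = [1, 0, 0, 0]
report = lf_summary(toy_matrix, toy_gold)
```

A function with high coverage but low accuracy is usually a candidate for tightening its rule, while one with high accuracy but low coverage is worth keeping as a precise, narrow signal.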

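4. Input data requirements

See the original code template for the input specification.

5. Output

See the original code template for the output specification.

6. Business value / ROI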

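ROI estimate:
Labeling cost reduced by more than 90% (¥30,000 → ¥500)
Labeling time shortened by about 95% (3 weeks → 1 day)
Fast data preparation for multiple NLP and classification tasks
Annualized combined ROI: ¥200,000-500,000 (labeling savings across multiple AI projects)
Implementation difficulty: ⭐⭐⭐☆☆ (libraries such as Snorkel and Cleanlab are mature; designing labeling functions requires domain knowledge; roughly 2-3 weeks)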
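The percentages in this estimate can be checked with simple arithmetic. The short sketch below assumes 3 weeks means 21 calendar days; with the quoted figures the cost reduction works out to roughly 98%, so the stated 90% is a conservative bound.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# Figures quoted in the estimate; 21 days for "3 weeks" is an assumption.
cost_reduction = percent_reduction(30_000, 500)  # about 98.3
time_reduction = percent_reduction(21, 1)        # about 95.2
```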

7. Code template

Code blocks: 3 · Path: not detected

"""
Weak Supervision Data Labeling
Weak-supervision data labeling: Snorkel-style fusion of rule-based labeling functions
"""
import re
import numpy as np
from dataclasses import dataclass
from collections import defaultdict


# Label constants
POSITIVE = 1
NEGATIVE = 0
ABSTAIN = -1  # the function is unsure and abstains from voting


def lf_specific_details(text: str) -> int:
    """Concrete detail (two or more numbers, or at least 50 words) suggests high quality."""
    if len(re.findall(r'\d+', text)) >= 2 or len(text.split()) >= 50:
        return POSITIVE
    return ABSTAIN


def lf_empty_exclamation(text: str) -> int:
    """Excessive exclamation marks or generic praise suggests low quality."""
    exclaim_ratio = text.count('!') / max(len(text.split()), 1)
    generic = sum(1 for w in ['amazing', 'perfect', 'love it', 'great'] if w in text.lower())
    if exclaim_ratio > 0.15 or (generic >= 3 and len(text.split()) < 30):
        return NEGATIVE
    return ABSTAIN


def lf_balanced_review(text: str) -> int:
    """Mentions both pros and cons, which suggests high quality."""
    positive_words = ['good', 'great', 'love', 'excellent', 'like', 'nice']
    negative_words = ['but', 'however', 'although', 'downside', 'issue', 'problem', 'cons']
    text_lower = text.lower()
    has_positive = any(w in text_lower for w in positive_words)
    has_negative = any(w in text_lower for w in negative_words)
    if has_positive and has_negative:
        return POSITIVE
    return ABSTAIN


def lf_verified_purchase_proxy(text: str) -> int:
    """Mentions usage duration or context, so likely a real user, which suggests high quality."""
    usage_patterns = [r'\d+\s*(month|week|day|hour)', r'(office|travel|night|morning|work)', r'(used|using) (for|it|since)']
    if any(re.search(p, text.lower()) for p in usage_patterns):
        return POSITIVE
    return ABSTAIN


def lf_too_short(text: str) -> int:
    """Too short (fewer than 15 words) suggests low quality."""
    if len(text.split()) < 15:
        return NEGATIVE
    return ABSTAIN


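def lf_competitor_mention(text: str) -> int:
    # The body of this function is cut off in the source template.
    # Placeholder (assumption): abstain so the module stays importable.
    return ABSTAIN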

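Once labeling functions like the ones in this template have produced their votes, the "fusion" step can weight each function by how reliable it is instead of counting votes equally. The following is a minimal sketch under strong assumptions: functions are treated as conditionally independent, each has a single symmetric accuracy strictly between 0 and 1, and those accuracies are supplied by hand (for example, estimated on a small labeled sample). Snorkel's label model estimates such accuracies without ground truth; this sketch does not reproduce that, and `weighted_soft_label` and the accuracy values are hypothetical.

```python
import math
from typing import List

POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1


def weighted_soft_label(votes: List[int], accuracies: List[float], prior: float = 0.5) -> float:
    """P(POSITIVE | votes) via naive-Bayes log-odds; abstaining functions are skipped."""
    log_odds = math.log(prior / (1.0 - prior))
    for vote, acc in zip(votes, accuracies):
        if vote == ABSTAIN:
            continue
        # A more accurate function moves the log-odds further in the
        # direction of its vote.
        weight = math.log(acc / (1.0 - acc))
        log_odds += weight if vote == POSITIVE else -weight
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical accuracies for three labeling functions.
assumed_accuracies = [0.9, 0.6, 0.7]
p_high_quality = weighted_soft_label([POSITIVE, NEGATIVE, ABSTAIN], assumed_accuracies)
```

Here one reliable POSITIVE vote outweighs one weak NEGATIVE vote, so the soft label stays well above 0.5, whereas an unweighted majority vote would return exactly 0.5.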
8. Paper source

arXiv:1711.10160 (Ratner et al., "Snorkel: Rapid Training Data Creation with Weak Supervision")

Skill Relations


Prerequisite skills

Skill-Data-Collection-Agent-Pipeline
Skill-Ecommerce-Data-Quality-Assessment
Skill-LLM-Annotation-Weak-Supervision
Skill-NLP-Sentiment-ML-Pipeline
Skill-Review-Helpfulness-Prediction
Skill-VOC-Aspect-Sentiment-Extraction

Extension skills

Skill-Data-Collection-Agent-Pipeline
Skill-LLM-Annotation-Weak-Supervision
Skill-Review-Helpfulness-Prediction
Skill-VOC-Aspect-Sentiment-Extraction

Combinable skills

Skill-Data-Collection-Agent-Pipeline
Skill-LLM-Annotation-Weak-Supervision