Large Language Model Fundamentals - EW帮帮网

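Chapter 2: Natural Language Processing Fundamentals

🎯 Learning Objectives

After working through this chapter, you will understand:

The basic concepts and tasks of natural language processing
Methods and techniques for text preprocessing
The basic principles of language models
Core NLP concepts and terminology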

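📖 What Is Natural Language Processing?

Definition

Natural Language Processing (NLP) is an interdisciplinary field spanning computer science, artificial intelligence, and linguistics. Its aim is to enable computers to understand, interpret, and generate human language.

Core Goals

Language understanding: enable computers to grasp the meaning of human language
Language generation: enable computers to produce natural, fluent human language
Language interaction: support natural-language dialogue between people and machines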

🔍 Main NLP Tasks

1. Basic Tasks

Tokenization

Tokenization splits continuous text into meaningful units: words, characters, or subwords.

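import re

def tokenize_text(text, method='word'):
    """
    Tokenize text.
    
    Args:
        text: input text
        method: tokenization method ('word', 'char', 'subword')
    
    Returns:
        tokens: list of tokens
    """
    if method == 'word':
        # Split on word boundaries. This suits space-delimited languages;
        # unsegmented Chinese comes back as one token per run of characters.
        tokens = re.findall(r'\b\w+\b', text.lower())
    elif method == 'char':
        # Character-level tokenization
        tokens = list(text)
    elif method == 'subword':
        # Subword tokenization (BPE and similar) needs a trained vocabulary,
        # which this example does not include.
        raise NotImplementedError("subword tokenization requires a trained model")
    else:
        raise ValueError(f"unknown method: {method!r}")
    
    return tokens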

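# Example
text = "自然语言处理是一门有趣的学科。"
print("Character tokens:", tokenize_text(text, 'char'))
# Chinese has no spaces between words, so the 'word' method returns the
# whole sentence (minus punctuation) as a single token.
print("Word tokens:", tokenize_text(text, 'word'))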
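
The 'subword' option above mentions BPE (byte pair encoding) without showing it. The following is a minimal sketch of the idea, not a production tokenizer: it starts from characters and repeatedly merges the most frequent adjacent pair. The function names learn_bpe_merges, merge_pair, and bpe_tokenize are hypothetical and chosen for this illustration only.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (illustrative helper)."""
    # Each word starts as a tuple of single characters; the Counter stores word frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the new merge rule to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

def bpe_tokenize(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

merges = learn_bpe_merges(["low", "low", "lower", "lowest", "newest", "newest"], 5)
print(merges)
print(bpe_tokenize("lowest", merges))
```

Real BPE tokenizers typically work on bytes, add word-boundary markers, and learn tens of thousands of merges; this sketch leaves all of that out.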

Part-of-Speech Tagging

Part-of-speech tagging assigns each word its grammatical category (noun, verb, adjective, and so on).

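def pos_tagging(tokens):
    """
    Tag each token with a part of speech.
    
    Args:
        tokens: list of words produced by tokenization
    
    Returns:
        tagged_tokens: list of (token, tag) pairs
    """
    # Simplified lookup rules for a few function words
    pos_rules = {
        '的': 'DE',      # structural particle
        '是': 'VC',      # copula
        '有': 'VE',      # existential verb
        '在': 'P',       # preposition
        '了': 'AS',      # aspect particle
    }
    
    tagged = []
    for token in tokens:
        pos = pos_rules.get(token, 'N')  # anything not listed defaults to noun
        tagged.append((token, pos))
    
    return tagged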

Named Entity Recognition (NER)

Named entity recognition identifies specific entities in text, such as names of people, places, and organizations.

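def named_entity_recognition(text):
    """
    Recognize named entities with simple patterns.
    
    Args:
        text: input text
    
    Returns:
        entities: list of (entity, type) pairs
    """
    # Simplified rule-based recognition
    import re
    
    entities = []
    
    # People: a common surname followed by one or two characters (simple rule)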
    person_pattern = r'[张王李赵刘陈杨黄周吴徐孙胡朱高林何郭马罗梁宋郑谢韩唐冯于董萧程曹袁邓许傅沈曾彭吕苏卢蒋蔡贾丁魏薛叶阎余潘杜戴夏钟汪田任姜范方石姚谭廖邹熊金陆郝孔白崔康毛邱秦江史顾侯邵孟龙万段漕钱汤尹黎易常武乔贺赖龚文][\u4e00-\u9fa5]{1,2}'
    persons = re.findall(person_pattern, text)
    for person in persons:
        entities.append((person, 'PERSON'))
    
    # Places: two or more characters followed by an administrative or street suffix (simple rule)
    location_pattern = r'[\u4e00-\u9fa5]{2,}[省市县区镇村街道路]'
    locations = re.findall(location_pattern, text)
    for location in locations:
        entities.append((location, 'LOCATION'))
    
    return entities

2. Advanced Tasks

Syntactic Parsing

Syntactic parsing analyzes the grammatical structure of a sentence and builds a syntax tree.

class SyntaxTree:
    """
    A node in a syntax tree.
    """
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []
    
    def add_child(self, child):
        """Append a child node."""
        self.children.append(child)
    
    def __str__(self):
        if not self.children:
            return self.label
        children_str = ' '.join(str(child) for child in self.children)
        return f"({self.label} {children_str})"

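def simple_parse(sentence):
    """
    Build a toy syntax tree (for illustration only).
    
    Args:
        sentence: input sentence as whitespace-separated tokens
                  (segment Chinese text before calling this)
    
    Returns:
        tree: root node of the syntax tree
    """
    # Simplified parsing: no grammar is consulted
    tokens = sentence.split()
    
    # Build a fixed S -> NP VP skeleton
    root = SyntaxTree('S')          # sentence
    noun_phrase = SyntaxTree('NP')  # noun phrase
    verb_phrase = SyntaxTree('VP')  # verb phrase
    
    # Assume the first half is the subject and the second half the predicate
    mid = len(tokens) // 2
    for token in tokens[:mid]:
        noun_phrase.add_child(SyntaxTree(token))
    for token in tokens[mid:]:
        verb_phrase.add_child(SyntaxTree(token))
    
    root.add_child(noun_phrase)
    root.add_child(verb_phrase)
    
    return root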

Semantic Analysis

Semantic analysis is concerned with the meaning of text and the semantic relations within it.

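def semantic_similarity(text1, text2):
    """
    Approximate the similarity of two texts by word overlap.
    
    This measures shared vocabulary, not meaning: synonyms score 0.
    
    Args:
        text1, text2: input texts as whitespace-separated tokens
    
    Returns:
        similarity: similarity score in the range 0 to 1
    """
    # Simplified similarity: compare the sets of lowercased words
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    
    # Jaccard similarity: |intersection| / |union|
    intersection = len(words1.intersection(words2))
    union = len(words1.union(words2))
    
    return intersection / union if union > 0 else 0.0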

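def extract_semantic_roles(sentence):
    """
    Toy semantic role labeling.
    
    Args:
        sentence: input sentence as whitespace-separated tokens
    
    Returns:
        roles: dictionary mapping each role to a list of tokens
    """
    # Simplified semantic role labeling
    tokens = sentence.split()
    roles = {
        'agent': [],      # who performs the action
        'patient': [],    # who or what is acted upon (not filled by these rules)
        'action': [],     # the action itself (not filled by these rules)
        'location': [],   # where
        'time': []        # when
    }
    
    # Crude keyword rules; they will misfire on many real sentences
    for i, token in enumerate(tokens):
        if '在' in token:
            roles['location'].append(token)
        elif any(char in token for char in '昨今明天年月日'):
            roles['time'].append(token)
        elif i == 0:  # assume the first token is the agent
            roles['agent'].append(token)
    
    return roles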
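
The Jaccard similarity in semantic_similarity above relies on whitespace to find words, so two unsegmented Chinese sentences each count as a single "word". One possible workaround, sketched here under the assumption that character bigrams are an acceptable stand-in for words, is to compare sets of character n-grams instead. The names char_ngrams and char_jaccard are hypothetical.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams in `text`, ignoring whitespace."""
    chars = ''.join(text.split())
    if len(chars) < n:
        # Text shorter than n: treat the whole text as one unit (or nothing if empty).
        return {chars} if chars else set()
    return {chars[i:i + n] for i in range(len(chars) - n + 1)}

def char_jaccard(text1, text2, n=2):
    """Jaccard similarity over character n-gram sets (still lexical, not semantic)."""
    grams1 = char_ngrams(text1, n)
    grams2 = char_ngrams(text2, n)
    union = grams1 | grams2
    return len(grams1 & grams2) / len(union) if union else 0.0

print(char_jaccard("自然语言处理", "自然语言"))
```

Like the word-level version, this only measures surface overlap; texts that share meaning but not characters still score 0.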

🔤 Text Preprocessing

1. Text Cleaning

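import re

def clean_text(text):
    """
    Clean raw text.
    
    Args:
        text: raw text
    
    Returns:
        cleaned_text: cleaned text
    """
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    
    # Remove URLs (ASCII only, so adjacent Chinese text is left alone)
    text = re.sub(r'https?://[!-~]+', '', text)
    
    # Remove email addresses (ASCII only, for the same reason)
    text = re.sub(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}', '', text)
    
    # Remove special characters (keep Chinese, Latin letters, digits,
    # and basic ASCII and Chinese punctuation)
    text = re.sub(r'[^\u4e00-\u9fa5a-zA-Z0-9\s.,!?;:，。！？；：]', '', text)
    
    # Collapse runs of whitespace last, so gaps left by earlier removals are cleaned up too
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text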

2. Text Normalization

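import re

def normalize_text(text):
    """
    Normalize text.
    
    Args:
        text: input text
    
    Returns:
        normalized_text: normalized text
    """
    # Lowercase (affects Latin letters only)
    text = text.lower()
    
    # Traditional-to-simplified conversion needs an extra library (for example OpenCC);
    # traditional_to_simplified is a placeholder, not a real function
    # text = traditional_to_simplified(text)
    
    # Replace each number, including decimals, with a placeholder token
    text = re.sub(r'\d+(?:\.\d+)?', '<NUM>', text)
    
    # Map full-width Chinese punctuation to ASCII equivalents
    punctuation_map = {
        '，': ',', '。': '.', '！': '!', '？': '?',
        '；': ';', '：': ':', '（': '(', '）': ')',
        '【': '[', '】': ']', '「': '"', '」': '"'
    }
    
    for old, new in punctuation_map.items():
        text = text.replace(old, new)
    
    return text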
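
Part of the punctuation mapping above can also be handled by Unicode compatibility normalization from the standard library. The sketch below uses NFKC, which folds full-width letters, digits, and punctuation into their ASCII forms. It is a complement rather than a replacement: characters such as the ideographic full stop 。 have no compatibility mapping and still need an explicit table. The function name normalize_width is hypothetical.

```python
import unicodedata

def normalize_width(text):
    """Fold full-width and other compatibility characters to their standard forms (NFKC)."""
    return unicodedata.normalize('NFKC', text)

print(normalize_width('ＡＢＣ１２３'))
print(normalize_width('你好，世界！'))
```

NFKC also rewrites other compatibility characters (ligatures, circled digits, and so on), so it is worth checking that this is acceptable for the data at hand before applying it everywhere.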

3. Stopword Removal

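def remove_stopwords(tokens, stopwords=None):
    """
    Remove stopwords.
    
    Args:
        tokens: list of words produced by tokenization
        stopwords: set of stopwords (a small built-in Chinese list is used if omitted)
    
    Returns:
        filtered_tokens: list of tokens with stopwords removed
    """
    if stopwords is None:
        # Common Chinese stopwords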
        stopwords = {
            '的', '了', '在', '是', '我', '有', '和', '就', 
            '不', '人', '都', '一', '一个', '上', '也', '很',
            '到', '说', '要', '去', '你', '会', '着', '没有',
            '看', '好', '自己', '这', '那', '里', '就是'
        }
    
    return [token for token in tokens if token not in stopwords]
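
A hand-written stopword list like the one above is only a starting point. As a rough sketch of one alternative, stopword candidates can be drawn from the corpus itself by picking tokens that occur in most documents. The function name frequent_tokens and the 0.8 threshold are assumptions made for this illustration, and the resulting candidates would normally be reviewed by hand.

```python
from collections import Counter

def frequent_tokens(token_lists, min_doc_ratio=0.8):
    """Return tokens that appear in at least `min_doc_ratio` of the documents."""
    doc_freq = Counter()
    for tokens in token_lists:
        # Count each token once per document, however often it repeats.
        doc_freq.update(set(tokens))
    threshold = min_doc_ratio * len(token_lists)
    return {token for token, df in doc_freq.items() if df >= threshold}

docs = [
    ['我', '的', '书'],
    ['你', '的', '笔'],
    ['他', '的', '猫'],
]
print(frequent_tokens(docs))
```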

📊 Language Model Basics

1. N-gram Models

from collections import defaultdict, Counter

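class NGramModel:
    """
    N-gram language model.
    """
    def __init__(self, n=2):
        self.n = n
        self.ngrams = defaultdict(Counter)  # context tuple -> Counter of next tokens
        self.vocab = set()
    
    def train(self, texts):
        """
        Train the N-gram model.
        
        Args:
            texts: list of training texts, each a string of
                   whitespace-separated tokens (segment Chinese text first)
        """
        for text in texts:
            # Pad with n-1 start markers so the first real token has a full context
            tokens = ['<START>'] * (self.n - 1) + text.split() + ['<END>']
            self.vocab.update(tokens)
            
            # Count each (context, next token) pair
            for i in range(len(tokens) - self.n + 1):
                context = tuple(tokens[i:i + self.n - 1])
                next_word = tokens[i + self.n - 1]
                self.ngrams[context][next_word] += 1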
    
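    def probability(self, context, word):
        """
        Probability of a word given its context (maximum likelihood estimate).
        
        Args:
            context: tuple of the preceding n-1 tokens
            word: target word
        
        Returns:
            prob: probability value
        """
        # .get avoids inserting empty entries into the defaultdict on lookup
        counts = self.ngrams.get(context)
        if not counts:
            # Unseen context: fall back to a uniform distribution over the vocabulary
            return 1.0 / len(self.vocab) if self.vocab else 0.0
        
        # A word never seen after this context still gets probability 0
        return counts[word] / sum(counts.values())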
    
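    def generate(self, context, max_length=20):
        """
        Generate text.
        
        Args:
            context: initial context (tuple of tokens)
            max_length: maximum number of tokens to generate
        
        Returns:
            generated_text: generated tokens joined by spaces
        """
        import random
        
        result = list(context)
        history = self.n - 1  # number of tokens the model conditions on
        
        for _ in range(max_length):
            # Slice from an explicit index: result[-0:] would return
            # the whole list when n == 1
            current_context = tuple(result[len(result) - history:])
            candidates = self.ngrams.get(current_context)
            
            if not candidates:
                break
            
            # Sample the next token in proportion to its observed count
            words = list(candidates.keys())
            weights = list(candidates.values())
            next_word = random.choices(words, weights=weights)[0]
            
            if next_word == '<END>':
                break
            
            result.append(next_word)
        
        # Drop the leading context tokens from the output
        return ' '.join(result[history:])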

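# Usage example. train() splits on whitespace, so the sentences are
# written as pre-segmented tokens separated by spaces.
texts = [
    "我 喜欢 学习 自然语言处理",
    "自然语言处理 很 有趣",
    "机器学习 是 人工智能 的 基础"
]

model = NGramModel(n=2)
model.train(texts)
print("Generated text:", model.generate(('<START>',)))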
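
The maximum likelihood estimate used in probability() gives probability 0 to any word that was never seen after a given context. A common remedy is add-one (Laplace) smoothing: P(w | c) = (count(c, w) + 1) / (count(c) + |V|), where |V| is the vocabulary size. The following is a minimal, standalone bigram sketch of that formula; train_bigrams and laplace_probability are hypothetical names, and counting '<START>' as part of the vocabulary is a simplifying assumption.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count bigrams from pre-tokenized sentences (each a list of tokens)."""
    counts = defaultdict(Counter)
    vocab = set()
    for tokens in sentences:
        padded = ['<START>'] + list(tokens) + ['<END>']
        vocab.update(padded)
        for prev, word in zip(padded, padded[1:]):
            counts[prev][word] += 1
    return counts, vocab

def laplace_probability(counts, vocab, prev, word):
    """Add-one smoothed estimate of P(word | prev)."""
    # .get leaves the defaultdict unchanged for unseen contexts.
    context = counts.get(prev, Counter())
    return (context[word] + 1) / (sum(context.values()) + len(vocab))

sentences = [['我', '喜欢', '学习'], ['我', '喜欢', '音乐']]
counts, vocab = train_bigrams(sentences)
print(laplace_probability(counts, vocab, '我', '喜欢'))  # seen bigram
print(laplace_probability(counts, vocab, '我', '音乐'))  # unseen bigram, still above 0
```

Add-one smoothing is simple but takes a lot of probability mass away from observed events when the vocabulary is large; it is shown here only because it is the easiest smoothing scheme to follow.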

2. Word Vector Representations

import numpy as np
from collections import defaultdict

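class SimpleWord2Vec:
    """
    A simplified Word2Vec skeleton.
    
    It builds a vocabulary and randomly initialized vectors only. Training
    is not implemented, so the vectors carry no learned meaning.
    """
    def __init__(self, vector_size=100, window=5, min_count=1):
        self.vector_size = vector_size
        self.window = window
        self.min_count = min_count
        self.vocab = {}
        self.word_vectors = {}
    
    def build_vocab(self, sentences):
        """
        Build the vocabulary.
        
        Args:
            sentences: list of sentences, each a string of whitespace-separated tokens
        """
        word_count = defaultdict(int)
        
        for sentence in sentences:
            for word in sentence.split():
                word_count[word] += 1
        
        # Drop low-frequency words first, then number the rest 0..N-1
        # so that the indices stay contiguous
        kept = [word for word, count in word_count.items() if count >= self.min_count]
        self.vocab = {word: idx for idx, word in enumerate(kept)}
        
        # Initialize each word vector with small random values
        for word in self.vocab:
            self.word_vectors[word] = np.random.normal(0, 0.1, self.vector_size)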
    
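    def get_vector(self, word):
        """
        Look up a word vector.
        
        Args:
            word: input word
        
        Returns:
            vector: the word's vector, or a zero vector for unknown words
        """
        return self.word_vectors.get(word, np.zeros(self.vector_size))
    
    def similarity(self, word1, word2):
        """
        Compute the similarity of two words.
        
        Args:
            word1, word2: the two words
        
        Returns:
            similarity: cosine similarity
        """
        vec1 = self.get_vector(word1)
        vec2 = self.get_vector(word2)
        
        # Cosine similarity: dot product divided by the product of the norms
        dot_product = np.dot(vec1, vec2)
        norm1 = np.linalg.norm(vec1)
        norm2 = np.linalg.norm(vec2)
        
        # A zero vector (unknown word) has no direction; report 0 similarity
        if norm1 == 0 or norm2 == 0:
            return 0.0
        
        return float(dot_product / (norm1 * norm2))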
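
The class above stores a window parameter but never uses it, because no training step is included. In the skip-gram variant of Word2Vec, the window determines which (center word, context word) pairs the model is trained on. The sketch below shows only that pair-generation step, under the assumption of a fixed symmetric window; skipgram_pairs is a hypothetical name, and the actual vector updates are out of scope here.

```python
def skipgram_pairs(tokens, window=2):
    """List the (center, context) training pairs for a fixed symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clamp the window to the sentence boundaries.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

for center, context in skipgram_pairs(['我', '喜欢', '学习', '自然语言处理'], window=1):
    print(center, '->', context)
```

Words that keep appearing in each other's windows end up with similar vectors after training, which is what makes cosine similarity between trained vectors meaningful.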

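🎯 Core Concepts in NLP

1. Levels of Linguistic Structure

Phonological level: speech sounds and sound systems
Lexical level: vocabulary and morphology
Syntactic level: grammar and sentence structure
Semantic level: meaning and concepts
Pragmatic level: context and language use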

2. Ambiguity in Language

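def demonstrate_ambiguity():
    """
    Print examples of ambiguity in language.
    """
    examples = {
        "Lexical ambiguity": {
            # "bank" is ambiguous in English; the Chinese word 银行 is not
            "sentence": "I work at the bank",
            "readings": ["financial institution", "riverbank"]
        },
        "Syntactic ambiguity": {
            "sentence": "I saw a man with a telescope",
            "readings": ["I used a telescope to see a man",
                         "I saw a man who had a telescope"]
        },
        "Semantic ambiguity": {
            # 苹果 can mean the fruit or the company Apple
            "sentence": "这个苹果很大",
            "readings": ["the fruit is big", "the company is big"]
        }
    }
    
    for ambiguity_type, example in examples.items():
        print(f"{ambiguity_type}:")
        print(f"  Sentence: {example['sentence']}")
        print(f"  Possible readings: {', '.join(example['readings'])}")
        print()

demonstrate_ambiguity()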

3. Context Dependence in Language

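def context_dependency_example():
    """
    Print examples of how context shapes interpretation.
    """
    contexts = [
        {
            "context": "In a hospital",
            "sentence": "The nurse gives the patient an injection",
            "interpretation": "medical injection"
        },
        {
            "context": "On a basketball court",
            "sentence": "He passes the ball to a teammate",
            "interpretation": "sports"
        },
        {
            "context": "In programming",
            "sentence": "The function returns a value",
            "interpretation": "result of program execution"
        }
    ]
    
    for context in contexts:
        print(f"Context: {context['context']}")
        print(f"Sentence: {context['sentence']}")
        print(f"Interpretation: {context['interpretation']}")
        print()

context_dependency_example()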

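📝 Chapter Summary

Natural language processing is the foundation for understanding large language models. In this chapter we covered:

Basic NLP tasks: from tokenization and part-of-speech tagging to semantic analysis
Text preprocessing: cleaning, normalization, and stopword removal
Language model basics: N-gram models and word vectors
The complexity of language: ambiguity and context dependence

This background lays the groundwork for understanding how modern large language models work. The next chapter introduces the basic concepts of machine learning.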

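🎯 Exercises

Programming practice: implement a simple Chinese word segmenter that can handle basic word splitting.
Conceptual understanding: explain why ambiguity in language is a challenge for NLP systems, and propose possible solutions.
Model comparison: compare the strengths and weaknesses of N-gram models and modern neural language models.
Application design: design a simple text classification system and state which NLP techniques it needs.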
