Text Tokenization with NLTK

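NLTK (Natural Language Toolkit) is a Python library for text processing and natural language processing (NLP). It provides a broad set of tools and datasets for NLP tasks such as tokenization, part-of-speech tagging, syntactic parsing, sentiment analysis, and machine translation.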

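Main features of NLTK

1. Tokenization

Sentence tokenization (sent_tokenize): splits a paragraph into sentences.
Word tokenization (word_tokenize): splits a sentence into words and punctuation tokens.

from nltk.tokenize import sent_tokenize, word_tokenize

# Requires the Punkt tokenizer data: run nltk.download("punkt") once
# (recent NLTK releases ask for "punkt_tab" instead).
text = "Hello, world! How are you?"
print(sent_tokenize(text))  # Output: ['Hello, world!', 'How are you?']
print(word_tokenize(text))  # Output: ['Hello', ',', 'world', '!', 'How', 'are', 'you', '?']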
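
When the Punkt data is not installed, or a simpler rule is enough, NLTK's regular-expression tokenizers can be used without any downloaded model. The sketch below uses wordpunct_tokenize and RegexpTokenizer; the r"\w+" pattern is an illustrative choice for this example, not something the text above prescribes.

```python
from nltk.tokenize import RegexpTokenizer, wordpunct_tokenize

text = "Hello, world! How are you?"

# wordpunct_tokenize splits on the pattern \w+|[^\w\s]+,
# so each run of punctuation becomes its own token.
tokens = wordpunct_tokenize(text)
print(tokens)

# RegexpTokenizer keeps only what its pattern matches;
# r"\w+" (an assumption for this sketch) drops punctuation entirely.
word_only = RegexpTokenizer(r"\w+")
words = word_only.tokenize(text)
print(words)
```

Unlike word_tokenize, these tokenizers apply one fixed pattern and know nothing about abbreviations or contractions, so they suit quick experiments more than careful linguistic work.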

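2. Part-of-Speech (POS) Tagging

Labels each word with its part of speech (noun, verb, adjective, and so on), using Penn Treebank tags.

from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Requires the Punkt tokenizer data and the averaged perceptron tagger model (via nltk.download).
text = "I love coding in Python."
words = word_tokenize(text)
print(pos_tag(words))  # Typical output (tags can vary with the tagger model): [('I', 'PRP'), ('love', 'VBP'), ('coding', 'VBG'), ('in', 'IN'), ('Python', 'NNP'), ('.', '.')]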
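
A common next step is to select words by tag. The helper below is a hypothetical illustration (words_with_tag_prefix is not an NLTK function); it works on the list of (word, tag) pairs that pos_tag returns, here copied from the output above so that no tagger model is needed.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]


def words_with_tag_prefix(tagged: List[TaggedToken], prefix: str) -> List[str]:
    """Return the words whose Penn Treebank tag starts with `prefix`."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


# Tagged tokens copied from the pos_tag output shown above.
tagged = [("I", "PRP"), ("love", "VBP"), ("coding", "VBG"),
          ("in", "IN"), ("Python", "NNP"), (".", ".")]

nouns = words_with_tag_prefix(tagged, "NN")  # matches NN, NNS, NNP, NNPS
verbs = words_with_tag_prefix(tagged, "VB")  # matches VB, VBD, VBG, VBN, VBP, VBZ
print(nouns, verbs)
```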

3. Stopword Removal

Removes very frequent words that carry little meaning on their own (such as "the", "is", "and").

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires the stopwords corpus: run nltk.download("stopwords") once.
text = "This is a sample sentence."
words = word_tokenize(text)
stop_words = set(stopwords.words("english"))
filtered_words = [word for word in words if word.lower() not in stop_words]
print(filtered_words)  # Output: ['sample', 'sentence', '.']
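
The filtered list above still contains the "." token, because punctuation is not part of the stopword list. One possible way to drop it as well is sketched below. STOP_WORDS is a tiny hand-picked set standing in for stopwords.words("english") so the example runs without the corpus, and content_words is a hypothetical helper name.

```python
# A tiny hand-picked stopword set (an assumption for this sketch),
# standing in for set(stopwords.words("english")).
STOP_WORDS = {"this", "is", "a", "the", "and"}


def content_words(tokens):
    """Keep alphabetic tokens that are not stopwords (case-insensitive)."""
    return [t for t in tokens if t.isalpha() and t.lower() not in STOP_WORDS]


tokens = ["This", "is", "a", "sample", "sentence", "."]
filtered = content_words(tokens)
print(filtered)
```

Note that str.isalpha() also discards numbers and hyphenated tokens, which may or may not be wanted depending on the task.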

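4. Stemming

Reduces a word to its stem by stripping suffixes with fixed rules (for example "running" → "run").

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("running"))  # Output: run
print(stemmer.stem("better"))   # Output: better (rule-based stemming cannot map it to "good")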
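
To show why stemming is useful, the sketch below groups several inflected forms under their Porter stem, so that they can be counted or indexed as one term. The word list is an illustrative assumption.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["connect", "connected", "connecting", "connection", "connections"]

# Map each stem to the surface forms that reduce to it.
groups = defaultdict(list)
for word in words:
    groups[stemmer.stem(word)].append(word)

groups = dict(groups)
print(groups)
```

The stem is only a lookup key, not necessarily a real word, which is the main practical difference from lemmatization.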

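5. Lemmatization

More linguistically informed than stemming: it uses WordNet and the word's part of speech to return the dictionary form (for example "better" → "good").

from nltk.stem import WordNetLemmatizer

# Requires the WordNet corpus: run nltk.download("wordnet") once.
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))  # Output: good ("a" means adjective)
print(lemmatizer.lemmatize("running", pos="v")) # Output: run ("v" means verb)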
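
Because lemmatize() treats every word as a noun unless pos is given, the Penn Treebank tags from pos_tag are often translated into WordNet's single-letter codes first. The mapping below is a common convention rather than an NLTK function; penn_to_wordnet is a hypothetical helper name, and falling back to "n" is an assumption of this sketch.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, ...)
    if tag.startswith("R"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # noun, and the fallback for every other tag


# (word, tag) pairs in the format returned by pos_tag.
tagged = [("better", "JJR"), ("running", "VBG"), ("quickly", "RB"), ("Python", "NNP")]
pos_letters = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pos_letters)
```

Each pair can then be passed on as lemmatizer.lemmatize(word, pos=letter).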

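6. Named Entity Recognition (NER)

Identifies names of people, places, organizations, and similar entities in text.

from nltk import ne_chunk, pos_tag, word_tokenize

# Requires the tokenizer and tagger data plus the NE chunker model and the "words" corpus (via nltk.download).
text = "Apple is based in Cupertino."
words = word_tokenize(text)
tags = pos_tag(words)
print(ne_chunk(tags))  # Example output (labels depend on the model): (S (GPE Apple/NNP) is/VBZ based/VBN in/IN (GPE Cupertino/NNP) ./.)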
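
ne_chunk returns an nltk.Tree in which each recognized entity is a labelled subtree. The sketch below pulls out (entity text, label) pairs. The tree is built by hand to mirror the output above, so the example runs without the chunker model; extract_entities is a hypothetical helper name.

```python
from nltk.tree import Tree

# Hand-built tree mirroring the ne_chunk output shown above.
tree = Tree("S", [
    Tree("GPE", [("Apple", "NNP")]),
    ("is", "VBZ"), ("based", "VBN"), ("in", "IN"),
    Tree("GPE", [("Cupertino", "NNP")]),
    (".", "."),
])


def extract_entities(chunked):
    """Return (entity text, label) pairs for the top-level chunks of a tree."""
    entities = []
    for node in chunked:
        # Plain tokens are (word, tag) tuples; entities are Tree nodes.
        if isinstance(node, Tree):
            text = " ".join(token for token, _tag in node.leaves())
            entities.append((text, node.label()))
    return entities


entities = extract_entities(tree)
print(entities)
```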

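7. Sentiment Analysis

Estimates the sentiment polarity of a text (positive or negative), here with NLTK's VADER-based analyzer.

from nltk.sentiment import SentimentIntensityAnalyzer

# Requires the VADER lexicon: run nltk.download("vader_lexicon") once.
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I love Python!"))
# Prints a dict with 'neg', 'neu', 'pos', and 'compound' scores; 'pos' dominates here and 'compound' is positive.
# For comparison, the same sentence without "!" typically scores {'neg': 0.0, 'neu': 0.192, 'pos': 0.808, 'compound': 0.6369}; VADER raises intensity for exclamation marks.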
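
To turn the scores into a single label, the compound value is usually compared against a small threshold. The sketch below assumes cut-offs of ±0.05, a convention commonly used with VADER and exposed here as an adjustable parameter; label_from_compound is a hypothetical helper name.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Classify a VADER compound score as positive, negative, or neutral."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# A score dict in the format returned by polarity_scores.
scores = {"neg": 0.0, "neu": 0.192, "pos": 0.808, "compound": 0.6369}
label = label_from_compound(scores["compound"])
print(label)
```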

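Typical applications of NLTK

Text preprocessing (cleaning, tokenization, stopword removal)
Sentiment analysis (reviews, social media analysis)
Machine translation (in combination with other NLP libraries)
Chatbots (in combination with RNN/LSTM models)
Search engine optimization (SEO), for example keyword extraction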
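
As a rough illustration of the preprocessing and keyword-extraction use cases, the sketch below chains tokenization, stopword filtering, and frequency counting with nltk.FreqDist. The sample text, the lowercase-letters pattern, and the small STOP_WORDS set are all assumptions made for this example; a real pipeline would more likely use stopwords.words("english").

```python
from nltk import FreqDist
from nltk.tokenize import RegexpTokenizer

# Illustrative stopword set; stands in for the NLTK stopwords corpus.
STOP_WORDS = {"with", "and", "of", "every", "the", "a"}

text = ("NLTK makes text processing simple. Text processing with NLTK starts "
        "with tokenization, and tokenization feeds every later step of text analysis.")

# 1. Lowercase and tokenize, keeping alphabetic runs only.
tokens = RegexpTokenizer(r"[a-z]+").tokenize(text.lower())

# 2. Drop stopwords.
content = [t for t in tokens if t not in STOP_WORDS]

# 3. Count what remains and take the most frequent terms as keywords.
freq = FreqDist(content)
keywords = freq.most_common(4)
print(keywords)
```

Raw frequency is the simplest possible ranking; weighting schemes such as TF-IDF usually give better keywords across several documents.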
