Getting Started with Hugging Face

Published: 2022-12-31

1. The tasks NLP needs to solve:

1. To process text data, first tokenize it (tokenization methods vary; for Chinese, splitting into words or individual characters is common).
2. The resulting tokens are still character strings, which the computer does not understand; ultimately we want to map them to actual features (vectors).
3. Once the input is ready, the next step is to build the model (usually a pretrained model such as BERT or the GPT series).
4. To complete our own task, we essentially fine-tune the pretrained model (i.e. train it on our own data).

import warnings
warnings.filterwarnings("ignore")
from transformers import pipeline  # use a ready-made pipeline to handle simple tasks
classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)
Output:
[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]
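
By default, pipeline("sentiment-analysis") downloads a default English sentiment model. If you want to be explicit about which checkpoint is used, the model can be named directly; a minimal sketch (using the same distilbert checkpoint that the rest of this post loads):

from transformers import pipeline

# explicitly name the checkpoint instead of relying on the pipeline's default
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
classifier("I've been waiting for a HuggingFace course my whole life.")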

2. Overview of the basic pipeline

Behind the one-line pipeline call above there are three stages: a tokenizer that turns raw text into input IDs, a model that maps those IDs to outputs, and a post-processing step that turns the outputs into human-readable predictions.
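
A minimal sketch of these three stages, using the same distilbert checkpoint as the rest of this post; the following sections walk through each stage in detail:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# 1. tokenizer: raw text -> input IDs + attention mask
inputs = tokenizer("I hate this so much!", return_tensors="pt")
# 2. model: input IDs -> logits
logits = model(**inputs).logits
# 3. post-processing: logits -> probabilities -> label
probs = torch.nn.functional.softmax(logits, dim=-1)
print(model.config.id2label[int(probs.argmax())], probs.max().item())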

3. What the Tokenizer does:

1. Split the text into words or characters, plus special tokens (start, end, separator, classification tokens, etc., which you can design yourself).
2. Map each token to an ID (every token corresponds to a unique ID).
3. Produce some auxiliary information as well, e.g. which sentence the current token belongs to (and masks indicating whether a position holds an original token or a special/padding token).

from transformers import AutoTokenizer  # automatically selects the right tokenizer class

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  # load the tokenizer that matches this checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a this course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

Output:

{'input_ids': tensor([[ 101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 2023, 2607, 2026, 2878, 2166, 1012, 102], [ 101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]])}

tokenizer.decode([ 101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 2023, 2607, 2026, 2878,2166, 1012,  102])

Output:

"[CLS] i've been waiting for a this course my whole life. [SEP]"

4. Loading the model

from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

Output (the shape is batch_size, sequence_length, hidden_size):

torch.Size([2, 15, 768])

5. Basic model logic

The bare AutoModel above only outputs hidden states (one vector per token); a task-specific head then has to map those hidden states to actual predictions such as class logits.
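
As an illustration only (a hand-rolled, untrained head, not the actual weights of this checkpoint), a minimal sketch of how a classification head sits on top of the hidden states:

import torch

# outputs.last_hidden_state has shape (batch_size, seq_len, 768)
cls_hidden = outputs.last_hidden_state[:, 0]  # hidden vector at the [CLS] position

# a hypothetical 2-class head, just to show the shapes involved
head = torch.nn.Linear(768, 2)
logits = head(cls_hidden)
print(logits.shape)  # torch.Size([2, 2])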

6. Adding an output head

from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)     # run the inputs through the model with a classification head
print(outputs.logits.shape)

Output:

torch.Size([2, 2])

import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

Output:

tensor([[1.5446e-02, 9.8455e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)

model.config.id2label

Output:

{0: 'NEGATIVE', 1: 'POSITIVE'}

# id2label can be customized later; you can define the label names and their mapping yourself
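
For example (a hypothetical relabeling, just to show the mechanism), the mapping can be passed directly to from_pretrained:

id2label = {0: "BAD", 1: "GOOD"}
label2id = {"BAD": 0, "GOOD": 1}
custom_model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, id2label=id2label, label2id=label2id
)
print(custom_model.config.id2label)  # {0: 'BAD', 1: 'GOOD'}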

7. The role of padding

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)

print(model(torch.tensor(batched_ids)).logits)

Output:

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)

The role of attention_mask

Notice that in the batched result above, the second row differs from running sequence2 on its own: without a mask the model also attends to the padding token. attention_mask marks which positions are real tokens (1) and which are padding (0):

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

Output:

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
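
In practice you rarely build the mask by hand: calling the tokenizer with padding=True returns a matching attention_mask along with the input_ids, so the batched logits should match the single-sentence results. A small sketch:

batch = tokenizer(
    ["I've been waiting for a this course my whole life.", "I hate this so much!"],
    padding=True,
    return_tensors="pt",
)
print(model(**batch).logits)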

8. Different padding methods

sequences = ["I've been waiting for a this course my whole life.", "So have I!", "I played basketball yesterday."]
# pad to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")
model_inputs

Output:

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 2023, 2607, 2026, 2878, 2166, 1012, 102], [101, 2061, 2031, 1045, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2209, 3455, 7483, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}

# pad to the model's maximum length (512 by default for BERT-style models)
model_inputs = tokenizer(sequences, padding="max_length")
model_inputs

Specifying the padding length:

# pad sequences up to max_length (longer sequences are not truncated unless truncation is enabled)
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
model_inputs

Truncating at a given length:

# truncate anything longer than max_length
model_inputs = tokenizer(sequences, max_length=10, truncation=True)
model_inputs

About the return format:

# it's best to return tensors (here PyTorch tensors)
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")
model_inputs
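
Putting the pieces together, a sketch using the same tokenizer and classification model as above: tokenize with padding and truncation, run the model, and turn the logits into labeled probabilities:

batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch)
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
for sentence, p in zip(sequences, probs):
    label_id = int(p.argmax())
    print(sentence, "->", model.config.id2label[label_id], round(p[label_id].item(), 4))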
