Getting Started with Hugging Face

Published: 2022-12-31

1. The tasks NLP needs to solve:

1. To process text data, first tokenize it (tokenization methods vary; for Chinese, splitting into words or individual characters is common).
2. The resulting tokens are still character strings, which the computer does not understand; ultimately we want to map them to actual features (vectors).
3. Once the input is ready, the next step is to build the model (usually a pretrained model such as BERT or the GPT series).
4. To complete our own task, we essentially fine-tune the pretrained model (i.e. train it on our own data).

import warnings
warnings.filterwarnings("ignore")
from transformers import pipeline  # use a ready-made pipeline to handle simple tasks
classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)
Output:
[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]
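
By default, pipeline("sentiment-analysis") downloads a default English sentiment model. If you want to be explicit about which checkpoint is used, the model can be named directly; a minimal sketch (using the same distilbert checkpoint that the rest of this post loads):

from transformers import pipeline

# explicitly name the checkpoint instead of relying on the pipeline's default
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
classifier("I've been waiting for a HuggingFace course my whole life.")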

2. Overview of the basic pipeline

Behind the one-line pipeline call above there are three stages: a tokenizer that turns raw text into input IDs, a model that maps those IDs to outputs, and a post-processing step that turns the outputs into human-readable predictions.
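
A minimal sketch of these three stages, using the same distilbert checkpoint as the rest of this post; the following sections walk through each stage in detail:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# 1. tokenizer: raw text -> input IDs + attention mask
inputs = tokenizer("I hate this so much!", return_tensors="pt")
# 2. model: input IDs -> logits
logits = model(**inputs).logits
# 3. post-processing: logits -> probabilities -> label
probs = torch.nn.functional.softmax(logits, dim=-1)
print(model.config.id2label[int(probs.argmax())], probs.max().item())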

3. What the Tokenizer does:

1. Split the text into words or characters, plus special tokens (start, end, separator, classification tokens, etc., which you can design yourself).
2. Map each token to an ID (every token corresponds to a unique ID).
3. Produce some auxiliary information as well, e.g. which sentence the current token belongs to (and masks indicating whether a position holds an original token or a special/padding token).

from transformers import AutoTokenizer  # automatically selects the right tokenizer class

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  # load the tokenizer that matches this checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a this course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

Output:

{'input_ids': tensor([[ 101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 2023, 2607, 2026, 2878, 2166, 1012, 102], [ 101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]])}

tokenizer.decode([ 101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 2023, 2607, 2026, 2878,2166, 1012,  102])

Output:

"[CLS] i've been waiting for a this course my whole life. [SEP]"

4. Loading the model

from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

Output (the shape is batch_size, sequence_length, hidden_size):

torch.Size([2, 15, 768])

5. Basic model logic

The bare AutoModel above only outputs hidden states (one vector per token); a task-specific head then has to map those hidden states to actual predictions such as class logits.
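
As an illustration only (a hand-rolled, untrained head, not the actual weights of this checkpoint), a minimal sketch of how a classification head sits on top of the hidden states:

import torch

# outputs.last_hidden_state has shape (batch_size, seq_len, 768)
cls_hidden = outputs.last_hidden_state[:, 0]  # hidden vector at the [CLS] position

# a hypothetical 2-class head, just to show the shapes involved
head = torch.nn.Linear(768, 2)
logits = head(cls_hidden)
print(logits.shape)  # torch.Size([2, 2])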

6. Adding an output head

from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)     # run the inputs through the model with a classification head
print(outputs.logits.shape)

Output:

torch.Size([2, 2])

import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

Output:

tensor([[1.5446e-02, 9.8455e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)

model.config.id2label

Output:

{0: 'NEGATIVE', 1: 'POSITIVE'}

# id2label can be customized later; you can define the label names and their mapping yourself
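
For example (a hypothetical relabeling, just to show the mechanism), the mapping can be passed directly to from_pretrained:

id2label = {0: "BAD", 1: "GOOD"}
label2id = {"BAD": 0, "GOOD": 1}
custom_model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, id2label=id2label, label2id=label2id
)
print(custom_model.config.id2label)  # {0: 'BAD', 1: 'GOOD'}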

7. The role of padding

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)

print(model(torch.tensor(batched_ids)).logits)

Output:

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)

The role of attention_mask

Notice that in the batched result above, the second row differs from running sequence2 on its own: without a mask the model also attends to the padding token. attention_mask marks which positions are real tokens (1) and which are padding (0):

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

Output:

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
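
In practice you rarely build the mask by hand: calling the tokenizer with padding=True returns a matching attention_mask along with the input_ids, so the batched logits should match the single-sentence results. A small sketch:

batch = tokenizer(
    ["I've been waiting for a this course my whole life.", "I hate this so much!"],
    padding=True,
    return_tensors="pt",
)
print(model(**batch).logits)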

8. Different padding methods

sequences = ["I've been waiting for a this course my whole life.", "So have I!", "I played basketball yesterday."]
# pad to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")
model_inputs

Output:

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 2023, 2607, 2026, 2878, 2166, 1012, 102], [101, 2061, 2031, 1045, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2209, 3455, 7483, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}

# pad to the model's maximum length (512 by default for BERT-style models)
model_inputs = tokenizer(sequences, padding="max_length")
model_inputs

Specifying the padding length:

# pad sequences up to max_length (longer sequences are not truncated unless truncation is enabled)
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
model_inputs

Truncating at a given length:

# truncate anything longer than max_length
model_inputs = tokenizer(sequences, max_length=10, truncation=True)
model_inputs

About the return format:

# it's best to return tensors (here PyTorch tensors)
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")
model_inputs
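
Putting the pieces together, a sketch using the same tokenizer and classification model as above: tokenize with padding and truncation, run the model, and turn the logits into labeled probabilities:

batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch)
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
for sentence, p in zip(sequences, probs):
    label_id = int(p.argmax())
    print(sentence, "->", model.config.id2label[label_id], round(p[label_id].item(), 4))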
