This example requires the pypdf package.
Document: https://s1.q4cdn.com/806093406/files/doc_downloads/2023/414759-1-_5_Nike-NPS-Combo_Form-10-K_WR.pdf (Nike's Form 10-K filing)
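If the packages used in this section are not installed yet, a typical setup (assuming a pip-based environment) looks like this:
pip install pypdf langchain langchain-community langchain-chroma langchain-openai langchain-text-splitters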
from langchain_community.document_loaders import PyPDFLoader

file_path = "./414759-1-_5_Nike-NPS-Combo_Form-10-K_WR.pdf"
loader = PyPDFLoader(file_path)
docs = loader.load()  # one Document per PDF page

print(len(docs))
print(docs[0].page_content[:100])
print(docs[0].metadata)
- The loader reads the PDF at the specified path into memory.
- It then extracts the text data using the pypdf package.
- Finally, it creates a LangChain Document for each page of the PDF, containing the page's content and some metadata about where in the document the text came from.
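Because each page becomes its own Document, you can inspect the pages individually. A quick sketch (the "source" and "page" metadata keys shown are the ones PyPDFLoader typically emits):
for doc in docs[:3]:
    # "source" holds the file path, "page" the zero-based page index.
    print(doc.metadata.get("source"), doc.metadata.get("page"), len(doc.page_content))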
Question answering with RAG
Using a text splitter, you will split the loaded documents into smaller chunks that fit more easily into an LLM's context window, then load them into a vector store. You can then create a retriever from the vector store for use in the RAG chain.
import os

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split the page-level documents into smaller, overlapping chunks.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

# Embed the chunks and index them in Chroma.
vector_store = Chroma.from_documents(
    splits,
    embedding=OpenAIEmbeddings(
        openai_api_base="https://api.siliconflow.cn/v1/",
        openai_api_key=os.environ["siliconFlow"],
        model="Qwen/Qwen3-Embedding-8B",
    ),
)
retriever = vector_store.as_retriever()
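Before wiring up the full chain, you can sanity-check the retriever on its own; the query string below is just an example:
retrieved_docs = retriever.invoke("How many distribution centers does Nike have in the United States?")
print(len(retrieved_docs))
print(retrieved_docs[0].page_content[:200])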
Finally, you will build the final rag_chain with a couple of built-in helpers:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
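# Note: this section assumes a chat model `llm` has already been initialized earlier.
# If it hasn't, one possible setup (the model name below is only an example) is:
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(
#     openai_api_base="https://api.siliconflow.cn/v1/",
#     openai_api_key=os.environ["siliconFlow"],
#     model="Qwen/Qwen2.5-72B-Instruct",
# )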
system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)
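# Optional sanity check: render the template with placeholder values to inspect the final messages.
# print(prompt.invoke({"context": "example retrieved text", "input": "example question"}))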
question_answer_chain = create_stuff_documents_chain(llm, prompt=prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
results = rag_chain.invoke({"input": "What was Nike's revenue in 2023?"})
results
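The result is a dict that carries the original "input" along with the retrieved "context" documents and the generated "answer", so you can read the answer directly:
print(results["answer"])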
Digging further into the values under "context", you can see that each one is a Document containing a chunk of the ingested page content. Notably, these documents also retain the original metadata from when they were first loaded:
print(results["context"][0].page_content)
print(results["context"][0].metadata)
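Since that metadata survives splitting and retrieval, you can use it to point back to the source pages, for example in a simple citation sketch:
for doc in results["context"]:
    # "source" and "page" were set by PyPDFLoader and preserved through the pipeline.
    print(doc.metadata.get("source"), "page", doc.metadata.get("page"))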