Preface
This article takes a look at Spring AI's ETL Pipeline.
DocumentReader
org/springframework/ai/document/DocumentReader.java
public interface DocumentReader extends Supplier<List<Document>> {
    default List<Document> read() {
        return get();
    }
}
Implementations include TextReader, JsonReader, JsoupDocumentReader, MarkdownDocumentReader, PagePdfDocumentReader, ParagraphPdfDocumentReader, and TikaDocumentReader.
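Since DocumentReader declares no abstract method of its own beyond the get() it inherits from Supplier, a custom reader can in principle be written as a lambda. The following is a minimal sketch of that idea using only the JDK. Doc, SimpleReader, and linesOf are hypothetical stand-ins (not Spring AI types) that mirror the shape of Document and DocumentReader, so the snippet runs without the library on the classpath.

```java
import java.util.List;
import java.util.function.Supplier;

public class ReaderSketch {

    // Hypothetical stand-in for org.springframework.ai.document.Document.
    record Doc(String text) {}

    // Hypothetical stand-in that mirrors the shape of DocumentReader.
    interface SimpleReader extends Supplier<List<Doc>> {
        default List<Doc> read() {
            return get();
        }
    }

    // Builds a reader that yields one Doc per non-blank line of the given string.
    static SimpleReader linesOf(String source) {
        return () -> source.lines()
                .filter(line -> !line.isBlank())
                .map(Doc::new)
                .toList();
    }

    public static void main(String[] args) {
        SimpleReader reader = linesOf("first line\n\nsecond line\n");
        System.out.println(reader.read());
    }
}
```

The lambda only supplies get(); read() comes for free from the default method, which is the same arrangement the real interface uses.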
DocumentTransformer
org/springframework/ai/document/DocumentTransformer.java
public interface DocumentTransformer extends Function<List<Document>, List<Document>> {
    default List<Document> transform(List<Document> transform) {
        return apply(transform);
    }
}
Implementations include ContentFormatTransformer, KeywordMetadataEnricher, SummaryMetadataEnricher, and TokenTextSplitter.
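Because a transformer is a Function from a list of documents to a list of documents, several transformers can be chained with Function.andThen. Here is a minimal sketch of that composition using only the JDK. Doc, SimpleTransformer, TRIM, and chunk are hypothetical stand-ins for illustration, and the fixed-size chunking is a deliberately crude substitute for a real splitter such as TokenTextSplitter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class TransformerChainSketch {

    // Hypothetical stand-in for Spring AI's Document.
    record Doc(String text) {}

    // Hypothetical stand-in that mirrors the shape of DocumentTransformer.
    interface SimpleTransformer extends Function<List<Doc>, List<Doc>> {
        default List<Doc> transform(List<Doc> docs) {
            return apply(docs);
        }
    }

    // Trims surrounding whitespace from every document.
    static final SimpleTransformer TRIM = docs -> docs.stream()
            .map(d -> new Doc(d.text().trim()))
            .toList();

    // Splits every document into chunks of at most `size` characters.
    static SimpleTransformer chunk(int size) {
        return docs -> {
            List<Doc> out = new ArrayList<>();
            for (Doc d : docs) {
                String t = d.text();
                for (int i = 0; i < t.length(); i += size) {
                    out.add(new Doc(t.substring(i, Math.min(t.length(), i + size))));
                }
            }
            return out;
        };
    }

    static List<Doc> run(List<Doc> input) {
        // andThen is inherited from java.util.function.Function,
        // so the chain is typed as a plain Function.
        Function<List<Doc>, List<Doc>> chain = TRIM.andThen(chunk(4));
        return chain.apply(input);
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(new Doc("  abcdefghij "))));
    }
}
```

Ordering matters in such a chain: trimming before chunking keeps leading whitespace from consuming part of the first chunk.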
DocumentWriter
org/springframework/ai/document/DocumentWriter.java
public interface DocumentWriter extends Consumer<List<Document>> {
    default void write(List<Document> documents) {
        accept(documents);
    }
}
Implementations include FileDocumentWriter, SimpleVectorStore, AzureVectorStore, CassandraVectorStore, ChromaVectorStore, CoherenceVectorStore, CosmosDBVectorStore, CouchbaseSearchVectorStore, ElasticsearchVectorStore, GemFireVectorStore, HanaCloudVectorStore, MariaDBVectorStore, MilvusVectorStore, MongoDBAtlasVectorStore, Neo4jVectorStore, OpenSearchVectorStore, OracleVectorStore, PgVectorStore, PineconeVectorStore, QdrantVectorStore, RedisVectorStore, TypesenseVectorStore, and WeaviateVectorStore.
Example
@BeforeEach
void setUp() {
    // Extract: read the Markdown knowledge base into Document objects.
    DocumentReader markdownReader = new MarkdownDocumentReader(this.knowledgeBaseResource,
            MarkdownDocumentReaderConfig.defaultConfig());
    this.knowledgeBaseDocuments = markdownReader.read();
    // Load: store the documents in the vector store.
    this.pgVectorStore.add(this.knowledgeBaseDocuments);
}

@AfterEach
void tearDown() {
    // Remove the documents added in setUp() so each test starts clean.
    this.pgVectorStore.delete(this.knowledgeBaseDocuments.stream().map(Document::getId).toList());
}

@Test
void ragBasic() {
    String question = "Where does the adventure of Anacletus and Birba take place?";

    // The advisor retrieves relevant documents from the vector store for the question.
    RetrievalAugmentationAdvisor ragAdvisor = RetrievalAugmentationAdvisor.builder()
        .documentRetriever(VectorStoreDocumentRetriever.builder().vectorStore(this.pgVectorStore).build())
        .build();

    ChatResponse chatResponse = ChatClient.builder(this.openAiChatModel)
        .build()
        .prompt(question)
        .advisors(ragAdvisor)
        .call()
        .chatResponse();

    assertThat(chatResponse).isNotNull();

    String response = chatResponse.getResult().getOutput().getText();
    System.out.println(response);
    assertThat(response).containsIgnoringCase("Highlands");

    // evaluateRelevancy is a helper of the test class (not shown here).
    evaluateRelevancy(question, chatResponse);
}
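Here a MarkdownDocumentReader reads the documents, which are then written to the PgVectorStore; afterwards the RetrievalAugmentationAdvisor retrieves from that store to answer the query. Note that this example goes straight from reading to writing, with no DocumentTransformer step in between.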
Summary
Spring AI provides an ETL (Extract, Transform, and Load) pipeline for processing documents. It consists of three components: DocumentReader, DocumentTransformer, and DocumentWriter.
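Because the three components are a Supplier, a Function, and a Consumer respectively, a full pipeline run reduces to the nested call writer.accept(transformer.apply(reader.get())). The sketch below illustrates that shape with plain JDK functional interfaces rather than Spring AI types. Doc, InMemoryWriter, and runPipeline are hypothetical names introduced for illustration, and splitting on blank lines is just an assumed example of a transform step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

public class EtlPipelineSketch {

    // Hypothetical stand-in for Spring AI's Document.
    record Doc(String text) {}

    // In-memory sink standing in for a vector store or file writer.
    static class InMemoryWriter implements Consumer<List<Doc>> {
        final List<Doc> stored = new ArrayList<>();

        @Override
        public void accept(List<Doc> docs) {
            stored.addAll(docs);
        }
    }

    static List<Doc> runPipeline(String raw) {
        // Extract: wrap the raw input in a single document.
        Supplier<List<Doc>> reader = () -> List.of(new Doc(raw));

        // Transform: split each document into paragraphs on blank lines.
        Function<List<Doc>, List<Doc>> transformer = docs -> docs.stream()
                .flatMap(d -> Arrays.stream(d.text().split("\n\n")))
                .map(String::strip)
                .filter(s -> !s.isEmpty())
                .map(Doc::new)
                .toList();

        // Load: collect the transformed documents in memory.
        InMemoryWriter writer = new InMemoryWriter();
        writer.accept(transformer.apply(reader.get()));
        return writer.stored;
    }

    public static void main(String[] args) {
        System.out.println(runPipeline("alpha\n\nbeta"));
    }
}
```

Keeping each stage behind a standard functional interface means any stage can be swapped out without touching the other two.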