Uploading PDFs and Storing Them in MinIO Object Storage

Published: 2025-07-12

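This project is a full-stack application that lets users upload PDF files. The backend, built with Flask, stores the original PDF in a MinIO bucket and indexes its extracted text in OpenSearch. The frontend is a simple React app for uploading files.
Code: https://github.com/zhouruiliangxian/Awesome-demo/tree/main/Fullstack/pdf_search_app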

Project structure

pdf_search_app/
├── backend/             # Flask backend
│   ├── .env             # Environment variables for the backend
│   ├── app.py           # Main Flask application logic
│   └── requirements.txt # Python dependencies
├── frontend/            # React frontend
│   ├── public/
│   ├── src/
│   │   ├── App.css      # Frontend styles
│   │   └── App.js       # Main React component
│   └── package.json
└── docker-compose.yml   # Docker Compose file for the infrastructure services

How to run the application

Follow the steps below to get the whole application up and running.

Step 1: Start the infrastructure services

docker-compose.yml:

version: '3.8'

services:
  opensearch-node:
    image: opensearchproject/opensearch:2.19.1
    container_name: opensearch-node-pdf
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m"
      - "DISABLE_SECURITY_PLUGIN=true"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - opensearch-data:/usr/share/opensearch/data
    ports:
      - "9200:9200"
      - "9600:9600"
    networks:
      - app-network

  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:2.19.1
    container_name: opensearch-dashboards-pdf
    ports:
      - "5601:5601"
    environment:
      OPENSEARCH_HOSTS: '["http://opensearch-node:9200"]'
      DISABLE_SECURITY_DASHBOARDS_PLUGIN: "true"
    networks:
      - app-network
    depends_on:
      - opensearch-node

  minio:
    image: minio/minio:latest
    container_name: minio
    ports:
      - "9000:9000" # API Port
      - "9001:9001" # Console Port
    volumes:
      - minio-data:/data
    environment:
      - MINIO_ROOT_USER=minioadmin # Change for production
      - MINIO_ROOT_PASSWORD=minioadmin # Change for production
    command: server /data --address ":9000" --console-address ":9001"
    networks:
      - app-network

volumes:
  opensearch-data:
  minio-data:

networks:
  app-network:
    driver: bridge

This docker-compose.yml file starts OpenSearch, OpenSearch Dashboards, and MinIO.

From the root directory of pdf_search_app, run:

docker-compose up -d

Once the containers are running, the services are available at:

  • OpenSearch Dashboards: http://localhost:5601
  • MinIO console: http://localhost:9001 (log in with minioadmin / minioadmin, as configured in docker-compose.yml)
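Before moving on to the backend, it can help to confirm that the containers actually answer on their published ports. The following is a minimal sketch using only the standard library. The URLs are assumptions derived from the port mappings in docker-compose.yml: the OpenSearch root URL (which responds without credentials because the security plugin is disabled) and MinIO's liveness endpoint /minio/health/live. The names SERVICES, is_up, and check_all are made up for this example.

```python
import urllib.request

# Assumed endpoints, based on the ports published in docker-compose.yml.
SERVICES = {
    "opensearch": "http://localhost:9200",
    "minio": "http://localhost:9000/minio/health/live",
}


def is_up(url, timeout=3, opener=urllib.request.urlopen):
    """Return True if the URL answers with an HTTP 2xx status."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError and HTTPError are subclasses of OSError, so this covers
        # refused connections, timeouts, and 4xx/5xx responses.
        return False


def check_all(services=SERVICES, opener=urllib.request.urlopen):
    """Map each service name to True/False depending on reachability."""
    return {name: is_up(url, opener=opener) for name, url in services.items()}
```

Calling check_all() a few seconds after docker-compose up -d should return True for both entries once the containers have finished starting. OpenSearch in particular may need some time before it responds.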

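Step 2: Run the Flask backend

  1. Change to the backend directory

    cd backend
    
  2. Create a virtual environment and install the dependencies

    # Create a virtual environment (uv creates it in .venv by default)
    uv venv
    # Activate it (Windows)
    .\.venv\Scripts\activate
    # (macOS/Linux)
    # source .venv/bin/activate
    
    # Install the dependencies
    uv pip install -r requirements.txt
    
  3. Start the Flask server
    Important: start the server with uv run app.py so that the initialization code (such as creating the MinIO bucket) runs. That code sits under the if __name__ == '__main__': guard, which is skipped when the module is imported rather than executed, for example by flask run.
    The app.py file: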

    # -*- coding: utf-8 -*-
    import io
    import os
    import time
    
    import PyPDF2
    from dotenv import load_dotenv
    from flask import Flask, request, jsonify
    from flask_cors import CORS
    from minio import Minio
    from opensearchpy import OpenSearch
    
    # --- Initialization ---
    load_dotenv()
    
    app = Flask(__name__)
    # Enable CORS for React frontend (adjust in production)
    CORS(app, resources={r"/api/*": {"origins": "http://localhost:3000"}})
    
    # --- Client Connections ---
    
    # OpenSearch Client
    opensearch_client = OpenSearch(
        hosts=[{'host': os.getenv('OPENSEARCH_HOST'), 'port': int(os.getenv('OPENSEARCH_PORT'))}],
        http_auth=None,
        use_ssl=False,
        verify_certs=False,
        ssl_assert_hostname=False,
        ssl_show_warn=False,
    )
    
    # MinIO Client
    minio_client = Minio(
        os.getenv('MINIO_ENDPOINT'),
        access_key=os.getenv('MINIO_ACCESS_KEY'),
        secret_key=os.getenv('MINIO_SECRET_KEY'),
        secure=False # Set to True if using HTTPS
    )
    
    # --- Helper Functions ---
    
    def setup_minio_and_opensearch():
        """Ensure MinIO bucket and OpenSearch index exist, with retries."""
        max_retries = 5
        retry_delay = 3 # seconds
    
        # Setup MinIO
        for i in range(max_retries):
            try:
                bucket_name = os.getenv('MINIO_BUCKET')
                found = minio_client.bucket_exists(bucket_name)
                if not found:
                    minio_client.make_bucket(bucket_name)
                    print(f"MinIO bucket '{bucket_name}' created.")
                else:
                    print(f"MinIO bucket '{bucket_name}' already exists.")
                break # Success, exit loop
            except Exception as e:
                print(f"MinIO setup failed (attempt {i+1}/{max_retries}): {e}")
                if i + 1 == max_retries:
                    raise
                print(f"Retrying in {retry_delay} seconds...")
                time.sleep(retry_delay)
    
        # Setup OpenSearch (can also have a retry loop if needed)
        index_name = os.getenv('OPENSEARCH_INDEX')
        if not opensearch_client.indices.exists(index=index_name):
            opensearch_client.indices.create(index=index_name)
            print(f"OpenSearch index '{index_name}' created.")
        else:
            print(f"OpenSearch index '{index_name}' already exists.")
    
    def extract_text_from_pdf(pdf_file):
        """Extracts text content from a PDF file stream."""
        text = ""
        try:
            pdf_reader = PyPDF2.PdfReader(pdf_file)
            for page in pdf_reader.pages:
                text += page.extract_text() or ""
        except Exception as e:
            print(f"Error extracting PDF text: {e}")
            return None
        return text
    
    # --- API Routes ---
    
    @app.route('/api/upload', methods=['POST'])
    def upload_pdf():
        if 'file' not in request.files:
            return jsonify({"error": "No file part"}), 400
        
        file = request.files['file']
        if file.filename == '' or not file.filename.lower().endswith('.pdf'):
            return jsonify({"error": "Invalid file, please upload a PDF"}), 400
    
        try:
            # Read file into memory
            pdf_bytes = file.read()
            pdf_stream = io.BytesIO(pdf_bytes)
            file_length = len(pdf_bytes)
            file_name = file.filename
    
            # 1. Upload original PDF to MinIO
            minio_bucket = os.getenv('MINIO_BUCKET')
            minio_client.put_object(
                minio_bucket,
                file_name,
                pdf_stream,
                length=file_length,
                content_type='application/pdf'
            )
            print(f"Successfully uploaded '{file_name}' to MinIO bucket '{minio_bucket}'.")
    
            # 2. Extract text from PDF
            pdf_stream.seek(0) # Reset stream position after upload
            extracted_text = extract_text_from_pdf(pdf_stream)
            if extracted_text is None:
                return jsonify({"error": "Could not extract text from PDF"}), 500
    
            # 3. Index metadata and text into OpenSearch
            document = {
                'file_name': file_name,
                'minio_path': f"/{minio_bucket}/{file_name}",
                'content': extracted_text,
                'size_bytes': file_length
            }
            opensearch_index = os.getenv('OPENSEARCH_INDEX')
            opensearch_client.index(
                index=opensearch_index,
                body=document,
                refresh=True # Make it immediately searchable
            )
            print(f"Successfully indexed metadata for '{file_name}' in OpenSearch.")
    
            return jsonify({
                "message": "File uploaded and indexed successfully!",
                "file_name": file_name,
                "minio_path": document['minio_path']
            }), 201
    
        except Exception as e:
            print(f"An error occurred: {e}")
            return jsonify({"error": "An internal error occurred"}), 500
    
    # --- Main Execution ---
    
    if __name__ == '__main__':
        with app.app_context():
            setup_minio_and_opensearch()
        app.run(host='0.0.0.0', port=5001, debug=True)  
    
    Start the server with:

    uv run app.py
    

    The backend server starts at http://localhost:5001. On first run, it automatically creates the required MinIO bucket (pdfs) and OpenSearch index (pdf_documents).
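The backend reads all of its connection settings from backend/.env, whose contents are not shown in this post. As a rough sketch, the variable names below are the ones app.py passes to os.getenv(). The values are assumptions inferred from docker-compose.yml and from the bucket and index names mentioned above, so check them against the actual file in the repository. The helpers missing_settings and render_env_file are made up for this example.

```python
import os

# Names come from the os.getenv() calls in app.py; values are assumed.
EXPECTED = {
    "OPENSEARCH_HOST": "localhost",
    "OPENSEARCH_PORT": "9200",
    "OPENSEARCH_INDEX": "pdf_documents",
    "MINIO_ENDPOINT": "localhost:9000",
    "MINIO_ACCESS_KEY": "minioadmin",
    "MINIO_SECRET_KEY": "minioadmin",
    "MINIO_BUCKET": "pdfs",
}


def missing_settings(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in EXPECTED if not environ.get(name)]


def render_env_file(values=EXPECTED):
    """Render the values in .env syntax (KEY=value, one per line)."""
    return "\n".join(f"{key}={value}" for key, value in values.items())
```

A check like missing_settings() is worth running before the server starts: if OPENSEARCH_PORT is unset, int(os.getenv('OPENSEARCH_PORT')) in app.py raises a TypeError at import time, which is harder to interpret than a list of missing variable names.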

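Step 3: Run the React frontend

  1. Open a new terminal

  2. Change to the frontend directory

    cd frontend
    
  3. Install the dependencies and start the development server

    npm install
    npm start
    
  4. Your browser should open http://localhost:3000 automatically, where you will see the PDF upload page.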
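To exercise the backend without the React app, the upload can also be sent from a script. Below is a minimal sketch using only the standard library. It assumes the backend from Step 2 is listening on http://localhost:5001 and relies on the form field name file, which is what app.py reads from request.files. The function names build_multipart and upload_pdf are made up for this example.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field, filename, data, content_type="application/pdf"):
    """Encode one file as a multipart/form-data body.

    Returns a (body_bytes, content_type_header) tuple.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def upload_pdf(path, url="http://localhost:5001/api/upload"):
    """POST a PDF to the backend and return the decoded JSON response."""
    with open(path, "rb") as handle:
        body, content_type = build_multipart(
            "file", os.path.basename(path), handle.read()
        )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

On success the backend answers with status 201 and a JSON body containing message, file_name, and minio_path. For the 400 and 500 responses, urlopen raises urllib.error.HTTPError, so a caller that wants the error message needs to catch that exception.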


Testing the result

[Three screenshots of the test run appeared here; the images are not available.]

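How it works

  1. Upload: you select a PDF file in the React frontend and click "Upload".
  2. API call: the frontend sends the file to the /api/upload endpoint of the Flask backend.
  3. Processing: the Flask server does the following:
    a. Uploads the original PDF file to the pdfs bucket in MinIO.
    b. Extracts the text of every page with the PyPDF2 library.
    c. Builds a JSON document containing the file name, its path in MinIO, the extracted text, and the file size in bytes.
    d. Indexes this JSON document in OpenSearch.
  4. Result: you can now open OpenSearch Dashboards (http://localhost:5601), inspect the pdf_documents index, and search the content of the PDFs you uploaded.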
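The same search can be run from Python instead of the dashboard. The sketch below assumes the index is named pdf_documents and uses the field names (content, file_name, minio_path, size_bytes) of the document that app.py indexes. The functions build_search_body and search_pdfs are made up for this example and are not part of the project.

```python
def build_search_body(term, size=5):
    """Full-text match on the `content` field, with highlighted snippets."""
    return {
        "size": size,
        # Leave the (potentially large) extracted text out of the response.
        "_source": ["file_name", "minio_path", "size_bytes"],
        "query": {"match": {"content": term}},
        "highlight": {"fields": {"content": {}}},
    }


def search_pdfs(term, index="pdf_documents", host="localhost", port=9200):
    """Run the query with opensearch-py and return (file_name, score) pairs."""
    # Third-party dependency, already used by the backend (opensearch-py).
    from opensearchpy import OpenSearch

    client = OpenSearch(hosts=[{"host": host, "port": port}], use_ssl=False)
    response = client.search(index=index, body=build_search_body(term))
    return [
        (hit["_source"]["file_name"], hit["_score"])
        for hit in response["hits"]["hits"]
    ]
```

Because app.py indexes with refresh=True, a document should be visible to a query like this immediately after the upload request returns.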
