Java Web Scraping in Practice

Published: 2025-08-15

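I am currently working through Yupi's (鱼皮) "Intelligent Collaborative Cloud Image Library" (《智能协同云图库》) project, which involves reverse image search plus image scraping. I have scraped images before, but always with someone else's ready-made code and without really understanding why each step was there. This time I tried to understand every step. My fundamentals are very weak (I jumped straight into projects without learning the basics), so following the course has been a bit of a struggle. Fortunately GPT is very helpful, and I can more or less follow along now. This post summarizes the approach.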

Baidu's reverse image search accepts an image URL as input. I used the image at this URL:

https://i2.hdslb.com/bfs/archive/ad698e40cc6dd3d03ae5d0ab7bfa50faf368bd9b.jpg

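The search results page then appears.

Next, open Safari's Web Inspector (in other browsers, the developer tools).

Filter the network list to XHR only, so that just the API requests are shown.

Remember to turn on log preservation, because the upload request flashes by in an instant. On other sites the equivalent request may have a different name, for example pcsearch.

Enter the image URL and run the search again, and the upload request appears in the list.

Then look at the contents of the Headers tab.

Expanding the request data shows the submitted parameters.

sdkParams is usually a signature parameter generated by Baidu's official SDK; it may contain a timestamp, a signature, a key hash, and so on. It can be ignored here.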
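As a rough illustration of what such a form POST carries, the sketch below encodes the four fields used in this article (image, tn, from, image_source) as an application/x-www-form-urlencoded body with only the JDK. The class and method names are hypothetical, and this is not a claim about the exact bytes the browser or Hutool sends.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
final class UploadFormSketch {

    /** Encodes the fields as key=value pairs joined by '&', in iteration order. */
    static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    /** Builds the four form fields this article posts to the upload endpoint. */
    static Map<String, String> uploadFields(String imageUrl) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("image", imageUrl);
        fields.put("tn", "pc");
        fields.put("from", "pc");
        fields.put("image_source", "PC_UPLOAD_URL");
        return fields;
    }

    public static void main(String[] args) {
        // example.com is a placeholder image URL
        System.out.println(encodeForm(uploadFields("https://example.com/a.jpg")));
    }
}
```

Comparing a string like this with the request body shown in the inspector is a quick way to check that no field has been missed.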

package com.bxt.picturebackend.imageSearch.sub;

import cn.hutool.core.util.URLUtil;
import cn.hutool.http.HttpRequest;
import cn.hutool.http.HttpResponse;
import cn.hutool.json.JSONUtil;
import com.bxt.picturebackend.exception.BusinessException;
import com.bxt.picturebackend.exception.ErrorCode;
import lombok.extern.slf4j.Slf4j;


import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

@Slf4j
public class GetImagePageUrlApi {
    public static String getImagePageUrl(String imageUrl) {
        Map<String, Object> formData = new HashMap<>();
        formData.put("image", imageUrl);
        formData.put("tn","pc");
        formData.put("from", "pc");
        formData.put("image_source", "PC_UPLOAD_URL");
        long upTime = System.currentTimeMillis();
        String postUrl = "https://graph.baidu.com/upload?uptime="+ upTime;
        String acsToken = "jmM4zyI8OUixvSuWh0sCy4xWbsttVMZb9qcRTmn6SuNWg0vCO7N0s6Lffec+IY5yuqHujHmCctF9BVCGYGH0H5SH/H3VPFUl4O4CP1jp8GoAzuslb8kkQQ4a21Tebge8yhviopaiK66K6hNKGPlWt78xyyJxTteFdXYLvoO6raqhz2yNv50vk4/41peIwba4lc0hzoxdHxo3OBerHP2rfHwLWdpjcI9xeu2nJlGPgKB42rYYVW50+AJ3tQEBEROlg/UNLNxY+6200B/s6Ryz+n7xUptHFHi4d8Vp8q7mJ26yms+44i8tyiFluaZAr66/+wW/KMzOhqhXCNgckoGPX1SSYwueWZtllIchRdsvCZQ8tFJymKDjCf3yI/Lw1oig9OKZCAEtiLTeKE9/CY+Crp8DHa8Tpvlk2/i825E3LuTF8EQfzjcGpVnR00Lb4/8A";

        try {
            HttpResponse httpResponse = HttpRequest.post(postUrl)
                    .form(formData)
                    .timeout(10000)
                    .header("Acs-Token", acsToken)
                    .execute();
            if (httpResponse.getStatus() != 200) {
                log.error("Failed to get the image search page URL, status code: {}", httpResponse.getStatus());
                throw new RuntimeException("Failed to get the image search page URL, please try again later");
            }
            String body = httpResponse.body();
            System.out.println("body = " + body);
            Map<String, Object> responseMap = JSONUtil.toBean(body, Map.class);
            System.out.println("responseMap = " + responseMap);
            if (responseMap == null) {
                log.error("Failed to get the image search page URL, response body: {}", body);
                throw new RuntimeException("Failed to get the image search page URL, please try again later");
            }
            // "data" may be missing when the request fails, so guard against null before reading "url"
            Map<String, Object> data = (Map<String, Object>) responseMap.get("data");
            System.out.println("data = " + data);
            String rawUrl = data == null ? null : (String) data.get("url");
            if (rawUrl == null || rawUrl.isEmpty()) {
                throw new BusinessException(ErrorCode.OPERATION_ERROR, "No valid result returned");
            }
            // Percent-decode the URL (for example, %5B becomes [)
            return URLUtil.decode(rawUrl, StandardCharsets.UTF_8);
        } catch (BusinessException e) {
            // Rethrow as-is so the specific business error is not wrapped below
            throw e;
        } catch (Exception e) {
            // Pass the exception itself so the stack trace is logged
            log.error("Failed to get the image search page URL", e);
            throw new RuntimeException("Failed to get the image search page URL, please try again later");
        }
    }
}

Test it with a unit test class:

package com.bxt.picturebackend.imageSearch.sub;

import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertNotNull;
import static org.junit.jupiter.api.Assertions.assertTrue;

class GetImagePageUrlApiTest {
    @Test
    void testGetImagePageUrl() {
        String testImageUrl = "https://i2.hdslb.com/bfs/archive/ad698e40cc6dd3d03ae5d0ab7bfa50faf368bd9b.jpg";
        String response = GetImagePageUrlApi.getImagePageUrl(testImageUrl);
        System.out.println(response);
        // This calls the live Baidu endpoint, so only the URL prefix is asserted
        assertNotNull(response);
        assertTrue(response.startsWith("https://graph.baidu.com/s?"));
    }
}

This produces:

body = {"status":0,"msg":"Success","data":{"url":"https://graph.baidu.com/s?card_key=\u0026entrance=GENERAL\u0026extUiData%5BisLogoShow%5D=1\u0026f=all\u0026isLogoShow=1\u0026session_id=13377293787626920489\u0026sign=1260533cc766d268eaf8401755063018\u0026tpl_from=pc","sign":"1260533cc766d268eaf8401755063018"}}
responseMap = {status=0, msg=Success, data={"url":"https://graph.baidu.com/s?card_key=&entrance=GENERAL&extUiData%5BisLogoShow%5D=1&f=all&isLogoShow=1&session_id=13377293787626920489&sign=1260533cc766d268eaf8401755063018&tpl_from=pc","sign":"1260533cc766d268eaf8401755063018"}}
data = {"url":"https://graph.baidu.com/s?card_key=&entrance=GENERAL&extUiData%5BisLogoShow%5D=1&f=all&isLogoShow=1&session_id=13377293787626920489&sign=1260533cc766d268eaf8401755063018&tpl_from=pc","sign":"1260533cc766d268eaf8401755063018"}
https://graph.baidu.com/s?card_key=&entrance=GENERAL&extUiData[isLogoShow]=1&f=all&isLogoShow=1&session_id=13377293787626920489&sign=1260533cc766d268eaf8401755063018&tpl_from=pc

Process finished with exit code 0

The URL obtained here is the search results page.
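To see what this URL carries, the sketch below splits its query string into decoded key/value pairs using only the JDK; it also shows the percent-decoding step that turns %5B and %5D into square brackets in the final output line above. The class and method names are hypothetical, and note that the JDK's URLDecoder additionally turns + into a space, which may differ from other decoders.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class ResultUrlSketch {

    /** Returns the query parameters of a URL, percent-decoded, in their original order. */
    static Map<String, String> queryParams(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = url.indexOf('?');
        if (q < 0 || q == url.length() - 1) {
            return params; // no query string
        }
        for (String pair : url.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(URLDecoder.decode(key, StandardCharsets.UTF_8),
                    URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return params;
    }

    public static void main(String[] args) {
        String url = "https://graph.baidu.com/s?card_key=&entrance=GENERAL"
                + "&extUiData%5BisLogoShow%5D=1&f=all&isLogoShow=1"
                + "&session_id=13377293787626920489"
                + "&sign=1260533cc766d268eaf8401755063018&tpl_from=pc";
        queryParams(url).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The session_id and sign values are the ones that reappear in the follow-up request URLs, which is why they are worth picking out.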

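Next, analyze this page.

Filter the network list to documents only to get the page's HTML.

The images we need sit under the "Similar images" (相似图片) heading, so search the HTML around that text.

The firstUrl field looks useful.

Copy the string that follows it: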

https:\/\/graph.baidu.com\/ajax\/pcsimi?carousel=503&entrance=GENERAL&extUiData%5BisLogoShow%5D=1&inspire=general_pc&limit=30&next=2&render_type=card&session_id=13377293787626920489&sign=1260533cc766d268eaf8401755063018&tk=2e59f&tpl_from=pc

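It needs a small fix: each backslash \ here escapes the slash / that follows it inside a JSON string. JSON permits this escape, but the backslash is not part of the URL itself.

Remove every backslash "\" to get the URL below: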

https://graph.baidu.com/ajax/pcsimi?carousel=503&entrance=GENERAL&extUiData%5BisLogoShow%5D=1&inspire=general_pc&limit=30&next=2&render_type=card&session_id=13377293787626920489&sign=1260533cc766d268eaf8401755063018&tk=2e59f&tpl_from=pc
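The extraction and clean-up just described can be sketched with the JDK's regex classes. The sample fragment in main is a shortened, hypothetical stand-in for the real page source, and the class and method names are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class FirstUrlSketch {

    // Matches "firstUrl":"<value>" and captures the value lazily up to the next quote
    private static final Pattern FIRST_URL =
            Pattern.compile("\"firstUrl\"\\s*:\\s*\"(.*?)\"");

    /** Returns the firstUrl value with JSON-escaped slashes restored, if present. */
    static Optional<String> extractFirstUrl(String html) {
        Matcher matcher = FIRST_URL.matcher(html);
        if (!matcher.find()) {
            return Optional.empty();
        }
        // Turn each escaped slash (backslash + slash) back into a plain slash
        return Optional.of(matcher.group(1).replace("\\/", "/"));
    }

    public static void main(String[] args) {
        // Shortened sample; the real value has many more query parameters
        String sample = "{\"firstUrl\":\"https:\\/\\/graph.baidu.com\\/ajax\\/pcsimi?limit=30&next=2\"}";
        System.out.println(extractFirstUrl(sample).orElse("not found"));
    }
}
```

A regex is enough here because only one field is needed; if the surrounding JSON can be isolated reliably, a JSON parser would handle all escapes at once.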

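Opening this URL returns the following page:

The string that follows thumbUrl is what we need.

However, pasting it straight into the browser fails.

The cause is an escape sequence left in the copied text:
The raw string contains \u0026, which is the JSON Unicode escape for the character &. A browser does not interpret \u0026 inside a URL, so the parameters are no longer separated. Every occurrence, including the one before h=500 at the end, has to be replaced with a literal &.

After correcting it, for example to: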

http://mms1.baidu.com/it/u=771534300,3396233686&fm=253&app=138&f=JPEG?w=800&h=500

the image displays normally.
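The same fix can be done in code. The sketch below decodes any four-digit backslash-u escape rather than only the one for &, using the JDK's regex classes. The class and method names are hypothetical, and it deliberately ignores edge cases such as an escaped backslash in front of the u, which a real JSON parser would handle.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class UnicodeEscapeSketch {

    // A literal backslash, the letter u, then exactly four hex digits
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    /** Replaces each backslash-u escape in the raw text with the character it encodes. */
    static String decodeUnicodeEscapes(String raw) {
        Matcher matcher = ESCAPE.matcher(raw);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            char decoded = (char) Integer.parseInt(matcher.group(1), 16);
            // quoteReplacement keeps '$' and '\' in the decoded text from being treated specially
            matcher.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "http://mms1.baidu.com/it/u=771534300,3396233686"
                + "\\u0026fm=253\\u0026app=138\\u0026f=JPEG?w=800\\u0026h=500";
        System.out.println(decodeUnicodeEscapes(raw));
    }
}
```

For this particular response a plain string replacement of the & escape is sufficient; the general decoder only matters if other escaped characters show up in the URLs.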

Extending the earlier code gives the complete version below; calling getUrlList returns the URLs of the similar images.

package com.bxt.picturebackend.imageSearch.sub;

import cn.hutool.core.util.URLUtil;
import cn.hutool.http.HttpRequest;
import cn.hutool.http.HttpResponse;
import cn.hutool.json.JSONUtil;
import com.bxt.picturebackend.exception.BusinessException;
import com.bxt.picturebackend.exception.ErrorCode;
import lombok.extern.slf4j.Slf4j;

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;


@Slf4j
public class GetImagePageUrlApi {
    public static List<String> getUrlList(String imageUrl){
        String imagePageUrl = getImagePageUrl(imageUrl);
        if (imagePageUrl == null || imagePageUrl.isEmpty()) {
            throw new BusinessException(ErrorCode.OPERATION_ERROR, "未返回有效结果");
        }
        String acsToken = "jmM4zyI8OUixvSuWh0sCy4xWbsttVMZb9qcRTmn6SuNWg0vCO7N0s6Lffec+IY5yuqHujHmCctF9BVCGYGH0H5SH/H3VPFUl4O4CP1jp8GoAzuslb8kkQQ4a21Tebge8yhviopaiK66K6hNKGPlWt78xyyJxTteFdXYLvoO6raqhz2yNv50vk4/41peIwba4lc0hzoxdHxo3OBerHP2rfHwLWdpjcI9xeu2nJlGPgKB42rYYVW50+AJ3tQEBEROlg/UNLNxY+6200B/s6Ryz+n7xUptHFHi4d8Vp8q7mJ26yms+44i8tyiFluaZAr66/+wW/KMzOhqhXCNgckoGPX1SSYwueWZtllIchRdsvCZQ8tFJymKDjCf3yI/Lw1oig9OKZCAEtiLTeKE9/CY+Crp8DHa8Tpvlk2/i825E3LuTF8EQfzjcGpVnR00Lb4/8A";
        HttpResponse httpResponse = HttpRequest.get(imagePageUrl)
                .timeout(10000)
                .header("Acs-Token", acsToken)
                .execute();
//        System.out.println("httpResponse = " + httpResponse);
        if (httpResponse.getStatus() != 200) {
            log.error("Failed to fetch the image search results page, status code: {}", httpResponse.getStatus());
            throw new RuntimeException("Failed to fetch the image search results page, please try again later");
        }
        // firstUrl is embedded in the page HTML as a JSON string value
        Pattern pattern = Pattern.compile("\"firstUrl\"\\s*:\\s*\"(.*?)\"");
        Matcher matcher = pattern.matcher(httpResponse.body());
        String firstUrl;
        if (matcher.find()) {
            // Extract the value and turn the JSON-escaped \/ back into /
            firstUrl = matcher.group(1).replace("\\/", "/");
            System.out.println("firstUrl = " + firstUrl);
        } else {
            throw new BusinessException(ErrorCode.OPERATION_ERROR, "No valid result returned");
        }

        HttpResponse urlListPage = HttpRequest.get(firstUrl)
                .timeout(10000)
                .header("Acs-Token", acsToken)
                .execute();
//        System.out.println(urlListPage);
        if (urlListPage.getStatus() != 200) {
            log.error("Failed to fetch the similar image list, status code: {}", urlListPage.getStatus());
            throw new RuntimeException("Failed to fetch the similar image list, please try again later");
        }

        pattern = Pattern.compile("\"thumbUrl\"\\s*:\\s*\"(.*?)\"");
        matcher = pattern.matcher(urlListPage.body());

        List<String> urlList = new java.util.ArrayList<>();
        while (matcher.find()) {
            // Replace the JSON escape for '&' with the literal character (plain text replace, no regex needed)
            String thumbUrl = matcher.group(1).replace("\\u0026", "&");
            urlList.add(thumbUrl);
        }
//        System.out.println("urlList = " + urlList);
        return urlList;
    }
    public static String getImagePageUrl(String imageUrl) {
        Map<String, Object> formData = new HashMap<>();
        formData.put("image", imageUrl);
        formData.put("tn","pc");
        formData.put("from", "pc");
        formData.put("image_source", "PC_UPLOAD_URL");
        long upTime = System.currentTimeMillis();
        String postUrl = "https://graph.baidu.com/upload?uptime="+ upTime;
        String acsToken = "jmM4zyI8OUixvSuWh0sCy4xWbsttVMZb9qcRTmn6SuNWg0vCO7N0s6Lffec+IY5yuqHujHmCctF9BVCGYGH0H5SH/H3VPFUl4O4CP1jp8GoAzuslb8kkQQ4a21Tebge8yhviopaiK66K6hNKGPlWt78xyyJxTteFdXYLvoO6raqhz2yNv50vk4/41peIwba4lc0hzoxdHxo3OBerHP2rfHwLWdpjcI9xeu2nJlGPgKB42rYYVW50+AJ3tQEBEROlg/UNLNxY+6200B/s6Ryz+n7xUptHFHi4d8Vp8q7mJ26yms+44i8tyiFluaZAr66/+wW/KMzOhqhXCNgckoGPX1SSYwueWZtllIchRdsvCZQ8tFJymKDjCf3yI/Lw1oig9OKZCAEtiLTeKE9/CY+Crp8DHa8Tpvlk2/i825E3LuTF8EQfzjcGpVnR00Lb4/8A";

        try {
            HttpResponse httpResponse = HttpRequest.post(postUrl)
                    .form(formData)
                    .timeout(10000)
                    .header("Acs-Token", acsToken)
                    .execute();
            if (httpResponse.getStatus() != 200) {
                log.error("Failed to get the image search page URL, status code: {}", httpResponse.getStatus());
                throw new RuntimeException("Failed to get the image search page URL, please try again later");
            }
            String body = httpResponse.body();
            System.out.println("body = " + body);
            Map<String, Object> responseMap = JSONUtil.toBean(body, Map.class);
            System.out.println("responseMap = " + responseMap);
            if (responseMap == null) {
                log.error("Failed to get the image search page URL, response body: {}", body);
                throw new RuntimeException("Failed to get the image search page URL, please try again later");
            }
            // "data" may be missing when the request fails, so guard against null before reading "url"
            Map<String, Object> data = (Map<String, Object>) responseMap.get("data");
            System.out.println("data = " + data);
            String rawUrl = data == null ? null : (String) data.get("url");
            if (rawUrl == null || rawUrl.isEmpty()) {
                throw new BusinessException(ErrorCode.OPERATION_ERROR, "No valid result returned");
            }
            // Percent-decode the URL (for example, %5B becomes [)
            return URLUtil.decode(rawUrl, StandardCharsets.UTF_8);
        } catch (BusinessException e) {
            // Rethrow as-is so the specific business error is not wrapped below
            throw e;
        } catch (Exception e) {
            // Pass the exception itself so the stack trace is logged
            log.error("Failed to get the image search page URL", e);
            throw new RuntimeException("Failed to get the image search page URL, please try again later");
        }
    }
}

Printing the final list gives:

[http://mms1.baidu.com/it/u=771534300,3396233686&fm=253&app=138&f=JPEG?w=800&h=500, http://mms0.baidu.com/it/u=4161103281,1829674203&fm=253&app=138&f=JPEG?w=749&h=580, http://mms2.baidu.com/it/u=2706284301,789398194&fm=253&app=120&f=JPEG?w=800&h=500, http://mms1.baidu.com/it/u=1667096992,1485299432&fm=253&app=138&f=JPEG?w=800&h=500, http://mms0.baidu.com/it/u=2502213264,439196765&fm=253&app=120&f=JPEG?w=800&h=500, http://mms0.baidu.com/it/u=4000521229,3982402882&fm=253&app=120&f=JPEG?w=655&h=446, http://mms2.baidu.com/it/u=640527677,1986438968&fm=253&app=138&f=JPEG?w=455&h=256, http://mms2.baidu.com/it/u=156995109,2192672339&fm=253&app=120&f=JPEG?w=801&h=500, http://mms0.baidu.com/it/u=48011703,2549638517&fm=253&app=138&f=JPEG?w=800&h=500, http://mms2.baidu.com/it/u=1316957924,1711619045&fm=253&app=120&f=JPEG?w=800&h=500, http://mms0.baidu.com/it/u=2192255561,2552189568&fm=253&app=138&f=JPEG?w=634&h=356, http://mms0.baidu.com/it/u=2868092005,3149855400&fm=253&app=138&f=JPEG?w=500&h=500, http://mms0.baidu.com/it/u=2173262737,1364469520&fm=253&app=138&f=JPEG?w=500&h=500, http://mms0.baidu.com/it/u=896380067,3285805132&fm=253&app=138&f=JPEG?w=1053&h=800, http://mms0.baidu.com/it/u=184083361,1291046512&fm=253&app=138&f=JPEG?w=500&h=500, http://mms0.baidu.com/it/u=2147020713,3191068967&fm=253&app=138&f=JPEG?w=867&h=500, http://mms0.baidu.com/it/u=864737700,3400231159&fm=253&app=120&f=JPEG?w=800&h=500, http://mms1.baidu.com/it/u=153299186,2018689789&fm=253&app=120&f=JPEG?w=480&h=270, http://mms0.baidu.com/it/u=2253215478,3249860676&fm=253&app=120&f=JPEG?w=800&h=500, http://mms2.baidu.com/it/u=3522373714,3342355003&fm=253&app=120&f=JPEG?w=800&h=500]
 

Every single result is Kunkun (坤坤).

