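Setting up exception handling in a Java web crawler is an important step: it helps you detect and handle the problems that can arise while crawling, such as network errors and data parsing errors. The key points and code examples below show how to implement exception handling in a Java crawler.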
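1. Catching HTTP Request Exceptions
When you send requests with an HTTP client such as Apache HttpClient, various network exceptions can occur, such as connection timeouts and dropped connections. In HttpClient 4.x these surface as IOException or one of its subclasses. ClientProtocolException, which signals an HTTP protocol error, is itself a subclass of IOException, so it has to be caught before the general IOException handler.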
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class HttpCrawler {
    public static void fetchData(String url) {
        HttpGet httpGet = new HttpGet(url);
        // try-with-resources closes both the response and the client,
        // even when an exception is thrown part-way through
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            // UTF-8 is used only when the response does not declare a charset
            String responseString = EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
            // Process the response body
            System.out.println(responseString);
        } catch (ClientProtocolException e) {
            System.err.println("HTTP protocol error: " + e.getMessage());
        } catch (IOException e) {
            // Covers timeouts, dropped connections, and errors while closing
            System.err.println("I/O error: " + e.getMessage());
        }
    }
}
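Timeouts and dropped connections are often transient, so a crawler may prefer to retry a request a few times before giving up. The following is a minimal sketch of that idea using only the JDK. The class name RetryingFetcher, the method withRetry, and its parameters are illustrative assumptions and are not part of Apache HttpClient; the task passed in stands for any request that can throw IOException.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper for illustration; not part of any HTTP library.
public class RetryingFetcher {
    /**
     * Runs the task and retries it when it throws IOException.
     * At most maxAttempts calls are made in total, and the wait between
     * attempts doubles each time (exponential backoff).
     */
    public static <T> T withRetry(Callable<T> task, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts: let the caller decide what to do
                }
                System.err.println("Attempt " + attempt + " failed (" + e.getMessage()
                        + "), retrying in " + delayMs + " ms");
                Thread.sleep(delayMs);
                delayMs *= 2;
            }
        }
    }
}
```

Only IOException triggers a retry here; any other exception propagates immediately, on the assumption that it points to a bug rather than a temporary network problem.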
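2. Handling JSON Parsing Exceptions
Parsing a JSON response can fail, for example when the payload is malformed. When you use a library such as Jackson or Gson, catch the parsing exception that the library throws: JsonProcessingException for Jackson, or the unchecked JsonSyntaxException for Gson.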
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonParser {
    // ObjectMapper is thread-safe once configured, so one shared instance is enough
    private static final ObjectMapper objectMapper = new ObjectMapper();

    public static Object parseJson(String json) {
        try {
            return objectMapper.readValue(json, Object.class);
        } catch (JsonProcessingException e) {
            System.err.println("JSON parsing error: " + e.getMessage());
            // Callers must check for null before using the result
            return null;
        }
    }
}
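Returning null on a parse failure works only if every caller remembers to check for it. One alternative is to return an Optional, which makes the "no result" case explicit in the method signature. The sketch below assumes a hypothetical SafeParser class and ThrowingParser interface, both invented for illustration; the parser argument stands in for a call such as objectMapper.readValue(...), so the example itself needs only the JDK.

```java
import java.util.Optional;

// Hypothetical wrapper for illustration; not part of Jackson or Gson.
public class SafeParser {
    /** Stand-in for any parsing call that may throw, checked or unchecked. */
    @FunctionalInterface
    public interface ThrowingParser<T> {
        T parse(String input) throws Exception;
    }

    /** Returns the parsed value, or Optional.empty() if parsing fails or yields null. */
    public static <T> Optional<T> tryParse(String input, ThrowingParser<T> parser) {
        try {
            return Optional.ofNullable(parser.parse(input));
        } catch (Exception e) {
            // Broad catch for brevity; a real crawler would likely catch
            // only the parsing exception of the library it uses.
            System.err.println("Parsing error: " + e.getMessage());
            return Optional.empty();
        }
    }
}
```

With Jackson on the classpath, a call could look like SafeParser.tryParse(json, s -> objectMapper.readValue(s, Object.class)), and the caller would then use isPresent() or orElse(...) instead of a null check.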
3. Handling Data Storage Exceptions
Storing crawled data in a database or a file can fail with I/O exceptions or database connection exceptions.
import java.io.FileWriter;
import java.io.IOException;

public class DataStorage {
    public static void saveData(String data, String filePath) {
        // try-with-resources closes the writer even if write() throws
        try (FileWriter writer = new FileWriter(filePath)) {
            writer.write(data);
        } catch (IOException e) {
            System.err.println("Error writing to file: " + e.getMessage());
        }
    }
}
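The listing above covers the file case only. For the database case, failures surface as java.sql.SQLException, including a failure to open the connection. The following is a rough sketch using plain JDBC. The DbStorage class, the pages table, and its url and content columns are assumptions made for illustration, and a real run also needs a JDBC driver for the target database on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical storage class; the table and column names are assumptions.
public class DbStorage {
    /** Inserts one crawled page; returns false if the database operation fails. */
    public static boolean saveRecord(String jdbcUrl, String url, String content) {
        String sql = "INSERT INTO pages (url, content) VALUES (?, ?)";
        // try-with-resources closes the statement and the connection in reverse order
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, url);
            stmt.setString(2, content);
            stmt.executeUpdate();
            return true;
        } catch (SQLException e) {
            // Connection failures and failed statements both end up here
            System.err.println("Database error (SQLState " + e.getSQLState() + "): "
                    + e.getMessage());
            return false;
        }
    }
}
```

Returning a boolean lets the caller decide whether to retry or skip the record; the SQLState code in the message helps tell a connection problem apart from a constraint violation.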
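4. Logging Exceptions
For a crawler running in production, recording exceptions with a logging framework (such as Log4j or SLF4J) is more professional and more useful than printing them to the console.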
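<!-- Add the Log4j dependency to pom.xml -->
<!-- 2.14.1 and earlier 2.x releases are affected by Log4Shell (CVE-2021-44228); use 2.17.1 or later -->
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <version>2.17.1</version>
</dependency>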
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class DataFetcher {
    private static final Logger logger = LogManager.getLogger(DataFetcher.class);

    public static void fetchData(String url) {
        try {
            // Code that sends the request and handles the response
        } catch (Exception e) {
            // {} is replaced by url; the trailing exception is logged with its stack trace
            logger.error("Failed to fetch data from {}", url, e);
        }
    }
}
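The same pattern is useful one level up, in the loop that walks a list of URLs: catching and logging per URL keeps one bad page from stopping the whole crawl. The sketch below is an assumption-laden illustration. CrawlLoop, crawlAll, and the Fetcher interface are hypothetical names, and java.util.logging is used as a stand-in for Log4j only so that the example runs on the JDK alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical crawl loop; java.util.logging stands in for Log4j here.
public class CrawlLoop {
    private static final Logger logger = Logger.getLogger(CrawlLoop.class.getName());

    /** Stand-in for whatever fetches and processes a single URL. */
    @FunctionalInterface
    public interface Fetcher {
        void fetch(String url) throws Exception;
    }

    /** Tries every URL and returns the ones whose fetch threw an exception. */
    public static List<String> crawlAll(List<String> urls, Fetcher fetcher) {
        List<String> failed = new ArrayList<>();
        for (String url : urls) {
            try {
                fetcher.fetch(url);
            } catch (Exception e) {
                // Log with the stack trace, remember the URL, continue with the next one
                logger.log(Level.SEVERE, "Failed to fetch data from " + url, e);
                failed.add(url);
            }
        }
        return failed;
    }
}
```

The returned list of failed URLs can be written to a file or queued for a later retry pass, so that failures are recorded in the log and also remain actionable.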