Site being parsed:
https://xa.ke.com/xiaoqu/
<li >
<a > </a>
<div class="info">
<div class="title">
<a class="maidian-detail" href="https://wuhu.ke.com/xiaoqu/8822129407716432/" target="_blank" data-maidian="611520631514697728" title="芜湖碧桂园镜湖春色">芜湖碧桂园镜湖春色</a>
</div>
<div class="positionInfo">
<span class="positionIcon"></span>
<a href="https://wuhu.ke.com/xiaoqu/sanshanqu/" class="district" title="三山区小区">三山区</a> <a href="https://wuhu.ke.com/xiaoqu/longhujiedao/" class="bizcircle" title="龙湖街道小区">龙湖街道</a>
/ 2007年建成
</div>
Parsing in Jupyter:
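import re
import requests
from lxml import etree
a = requests.get("https://xa.ke.com/xiaoqu/")
res = etree.HTML(a.text)
# Regex against the raw response text: take what sits between ";" and "年建成"
re.search(";(.*?)年建成", a.text).group().split(";")[1].split("年建成")[0]
# XPath against the parsed tree, then strip the "/\xa0" prefix and the "年建成" suffix
re.sub("/\xa0|年建成", "", res.xpath("//div[@class=\"info\"]/div[3]/text()")[3].strip())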
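Both the regex and the XPath extract the value correctly here, with no problems at all, but in Scrapy the response content is completely different: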
delivery1=re.search("/ (.*?)年建成", l).group().split("/\xa0")[1].split("年建成")[0]
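When nothing is matched, always inspect the HTML text that your current request actually returned. Here every XPath came back empty. Also, do not add re.S.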
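A minimal sketch of the two pitfalls noted above, using a made-up two-line sample rather than the real ke.com response (the `<span>` line, `RAW_HTML`, `DECODED`, `YEAR` and `build_year` are all hypothetical names for illustration). It assumes the non-breaking space can show up either as the literal entity `&nbsp;` or as the decoded character `\xa0`, depending on which text you are looking at, and it shows why `re.S` is risky with a lazy `.*?`.

```python
import re

# Hypothetical sample: line 1 is invented, line 2 mimics the "年建成" line.
RAW_HTML = '<span>sample&nbsp;</span>\n/&nbsp;2007年建成\n'
# The same text with the entity decoded to U+00A0 (an assumption about
# what a parser or a different response body might contain).
DECODED = RAW_HTML.replace("&nbsp;", "\xa0")

# Accept either form of the non-breaking space, then capture four digits.
YEAR = re.compile(r"/(?:&nbsp;|\xa0|\s)*(\d{4})年建成")


def build_year(text):
    """Return the construction year as a string, or None if not found."""
    m = YEAR.search(text)
    return m.group(1) if m else None


print(build_year(RAW_HTML))   # entity form
print(build_year(DECODED))    # decoded form

# Without re.S, "." cannot cross a newline, so the match stays on one line.
line_only = re.search(";(.*?)年建成", RAW_HTML).group(1)
# With re.S, the match starts at the FIRST ";" in the text and runs across
# the newline until "年建成", dragging unrelated markup into the group.
spanning = re.search(";(.*?)年建成", RAW_HTML, re.S).group(1)
print(repr(line_only))
print(repr(spanning))
```

In a Scrapy callback the same helper could be applied to `response.text`; printing or saving that text first is the quickest way to see which form of the space the response really contains.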
Another page where extraction fails again:
http://www.ggzy.gov.cn/information/html/a/620000/0104/202208/26/006287d7dcc2a4944c659888ddacc87b80b9.shtml