
The general logic of a crawler is: start from a given page, fetch it, extract every other link the page contains, put those links into a queue, then visit the queued pages one by one until some boundary condition is reached. For the common list-page plus detail-page pattern, the link-extraction logic has to be constrained. Scrapy already provides this; the key is knowing the interface exists and using it flexibly.
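As a rough illustration (not part of the original article), here is a minimal sketch of that queue-based crawl loop without Scrapy; the start URL, the crude href regex and the page limit are placeholder assumptions:

import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(start_url, max_pages=50):
    # queue of pages still to visit, plus a set of everything already seen
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(seen) <= max_pages:   # boundary condition
        url = queue.popleft()
        html = urlopen(url).read().decode('utf-8', errors='ignore')
        # pull out candidate links and enqueue the ones not seen yet
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)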
rules = (
    Rule(SgmlLinkExtractor(allow=('category/20/index_\d+\.html'),
                           restrict_xpaths=("//div[@class='left']"))),
    Rule(SgmlLinkExtractor(allow=('a/\d+/\d+\.html'),
                           restrict_xpaths=("//div[@class='left']")),
         callback='parse_item'),
)
Rule defines how links are extracted. The two rules above match the paginated list pages and the detail pages respectively; the key point is that restrict_xpaths limits link extraction to one specific part of the page, so only the links we actually want to crawl next are picked up.
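SgmlLinkExtractor has been removed in newer Scrapy releases; the same restriction can be written with LinkExtractor. A sketch under that assumption, reusing the URL patterns and XPath from the example above (the start URL is a placeholder):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ListDetailSpider(CrawlSpider):
    name = 'list_detail'
    start_urls = ['http://example.com/category/20/index_1.html']  # placeholder

    rules = (
        # follow the pagination links of the list page (no callback, so follow defaults to True)
        Rule(LinkExtractor(allow=(r'category/20/index_\d+\.html',),
                           restrict_xpaths=("//div[@class='left']",))),
        # hand the detail pages to parse_item (follow defaults to False here)
        Rule(LinkExtractor(allow=(r'a/\d+/\d+\.html',),
                           restrict_xpaths=("//div[@class='left']",)),
             callback='parse_item'),
    )

    def parse_item(self, response):
        yield {'url': response.url,
               'title': response.xpath('//title/text()').extract_first()}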
2. What follow does
First: this is my rule for crawling Douban's new books: rules = (Rule(LinkExtractor(allow=(r'^*/'),), callback='parse_item', follow=False),). With this rule the spider only crawls the links that match the rule inside the pages of the defined start_urls. If I change follow to True, the spider keeps looking for matching URLs inside the pages it has already crawled from start_urls, and so on recursively, until the whole site has been crawled.
Second: whether or not a rule has a callback, it is handled by the same _parse_response function, which simply checks whether follow and callback are set.
3. CrawlSpider in detail
In "Scrapy Basics: Spider" I briefly introduced the Spider class. Spider can already do a lot, but if you want to crawl all of Zhihu or Jianshu you need a more powerful weapon.
CrawlSpider is built on top of Spider, and it is made for whole-site crawling.
CrawlSpider is the usual choice for crawling sites that follow a regular structure. It is based on Spider and adds a few attributes of its own:
rules: a collection of Rule objects, used to match the target links and filter out the noise.
parse_start_url: handles the responses of the start URLs and should return an Item or a Request (a small override example follows below).
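As an illustration that is not in the original article, a minimal override of parse_start_url so that the start pages themselves also yield data (the field name is made up):

# Hypothetical override: by default CrawlSpider.parse_start_url yields nothing useful.
def parse_start_url(self, response):
    yield {'start_page_title': response.xpath('//title/text()').extract_first()}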
Since rules is a collection of Rule objects, Rule deserves a quick introduction. Its parameters are: link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None.
link_extractor can be something you define yourself or the existing LinkExtractor class, whose main parameters are:
allow: URLs matching the regular expression(s) given here are extracted; if empty, everything matches.
deny: URLs matching this regular expression (or list of regular expressions) are never extracted.
allow_domains: domains from which links will be extracted.
deny_domains: domains from which links will never be extracted.
restrict_xpaths: XPath expressions that work together with allow to filter the links; there is also a similar restrict_css.
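As a standalone illustration of these parameters (the HTML below is made up, not from the article), a LinkExtractor can be applied directly to a response:

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

html = b'''
<div class="left">
  <a href="/a/12/34.html">detail</a>
  <a href="/category/20/index_2.html">next page</a>
</div>
<div class="ads"><a href="/a/99/99.html">outside restrict_xpaths</a></div>
'''
response = HtmlResponse(url='http://example.com/', body=html, encoding='utf-8')

le = LinkExtractor(allow=(r'a/\d+/\d+\.html',),
                   restrict_xpaths=("//div[@class='left']",))
for link in le.extract_links(response):
    print(link.url, link.text)   # only /a/12/34.html passes both filters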
Below is the example from the official documentation; I will use it, together with the source code, to walk through a few common questions:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    allowed_domains = ['']
    start_urls = ['']

    rules = (
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
Question: how does CrawlSpider work?
Because CrawlSpider inherits from Spider, it has all of Spider's methods.
First, start_requests issues a request for each URL in start_urls (via make_requests_from_url), and each response is received by parse. In a plain Spider we have to define parse ourselves, but CrawlSpider already defines parse to handle the response: self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True).
_parse_response then does different things depending on whether callback is set and whether follow and self._follow_links are true:
def _parse_response(self, response, callback, cb_kwargs, follow=True):
    if callback:
        cb_res = callback(response, **cb_kwargs) or ()
        cb_res = self.process_results(response, cb_res)
        for requests_or_item in iterate_spider_output(cb_res):
            yield requests_or_item
    if follow and self._follow_links:
        for request_or_item in self._requests_to_follow(response):
            yield request_or_item
_requests_to_follow in turn uses the link_extractor (the LinkExtractor we passed in) to parse the page and collect its links (link_extractor.extract_links(response)), lets process_links (user-defined) adjust the URLs, issues a Request for every matching link, and passes each request through process_request (also user-defined).
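A sketch (not from the original article) of what user-defined process_links and process_request hooks can look like on a Rule; the method names, URL pattern and filtering logic are illustrative assumptions, and process_request uses the single-argument signature that matches the source code quoted further below:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HookedSpider(CrawlSpider):
    name = 'hooked'
    start_urls = ['http://example.com/']  # placeholder

    rules = (
        Rule(LinkExtractor(allow=(r'/a/\d+\.html',)),
             callback='parse_item',
             process_links='clean_links',      # gets the list of extracted Link objects
             process_request='tag_request'),   # gets each outgoing Request
    )

    def clean_links(self, links):
        # e.g. drop links that carry a query string before any requests are built
        return [link for link in links if '?' not in link.url]

    def tag_request(self, request):
        # e.g. attach extra meta to every request produced by this rule
        request.meta['from_rule'] = True
        return request

    def parse_item(self, response):
        yield {'url': response.url}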
Question: how does CrawlSpider load its rules?
CrawlSpider calls _compile_rules from its __init__ method. There it makes a shallow copy of every Rule in rules and resolves each rule's callback, process_links and process_request (string names are turned into the corresponding methods of the spider):
def _compile_rules(self):
    def get_method(method):
        if callable(method):
            return method
        elif isinstance(method, six.string_types):
            return getattr(self, method, None)

    self._rules = [copy.copy(r) for r in self.rules]
    for rule in self._rules:
        rule.callback = get_method(rule.callback)
        rule.process_links = get_method(rule.process_links)
        rule.process_request = get_method(rule.process_request)
So how is Rule itself defined?
class Rule(object):

    def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None,
                 process_links=None, process_request=identity):
        self.link_extractor = link_extractor
        self.callback = callback
        self.cb_kwargs = cb_kwargs or {}
        self.process_links = process_links
        self.process_request = process_request
        if follow is None:
            self.follow = False if callback else True
        else:
            self.follow = follow
So the LinkExtractor we configure is what ends up in link_extractor.
A rule with a callback is handled by the function we specify; which function handles a rule without a callback?
From the walkthrough above, _parse_response processes the response when a callback is present:
cb_res = callback(response, **cb_kwargs) or ()
while _requests_to_follow sets self._response_downloaded as the callback when it issues requests for the URLs matched on the page:
r = Request(url=link.url, callback=self._response_downloaded)
How to simulate a login with CrawlSpider
Since CrawlSpider, like Spider, issues its initial requests through start_requests, the following code (borrowed from Andrew_liu) shows how to simulate a login:
# Needs: from scrapy.http import Request, FormRequest and from scrapy.selector import Selector
def start_requests(self):
    return [Request("/#signin", meta={'cookiejar': 1}, callback=self.post_login)]

def post_login(self, response):
    print('Preparing login')
    xsrf = Selector(response).xpath('//input[@name="_xsrf"]/@value').extract()[0]
    print(xsrf)
    return [FormRequest.from_response(response,
                                      meta={'cookiejar': response.meta['cookiejar']},
                                      headers=self.headers,
                                      formdata={
                                          '_xsrf': xsrf,
                                          'email': '',
                                          'password': '321324jia'
                                      },
                                      callback=self.after_login,
                                      dont_filter=True)]

def after_login(self, response):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)
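Note that after_login simply re-issues the requests for start_urls via make_requests_from_url, so once the session cookie is stored in the cookiejar, CrawlSpider's normal parse/rules machinery takes over from there.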
Finally, here is the source code of scrapy.spiders.CrawlSpider for reference:
"""
This modules implements the CrawlSpider which is the recommended spider to use
for scraping typical web sites that requires crawling pages.

See documentation in docs/topics/spiders.rst
"""

import copy
import six

from scrapy.http import Request, HtmlResponse
from scrapy.utils.spider import iterate_spider_output
from scrapy.spiders import Spider


def identity(x):
    return x


class Rule(object):

    def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None,
                 process_links=None, process_request=identity):
        self.link_extractor = link_extractor
        self.callback = callback
        self.cb_kwargs = cb_kwargs or {}
        self.process_links = process_links
        self.process_request = process_request
        if follow is None:
            self.follow = False if callback else True
        else:
            self.follow = follow


class CrawlSpider(Spider):

    rules = ()

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        self._compile_rules()

    def parse(self, response):
        return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

    def parse_start_url(self, response):
        return []

    def process_results(self, response, results):
        return results

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = Request(url=link.url, callback=self._response_downloaded)
                r.meta.update(rule=n, link_text=link.text)
                yield rule.process_request(r)

    def _response_downloaded(self, response):
        rule = self._rules[response.meta['rule']]
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        if callback:
            cb_res = callback(response, **cb_kwargs) or ()
            cb_res = self.process_results(response, cb_res)
            for requests_or_item in iterate_spider_output(cb_res):
                yield requests_or_item

        if follow and self._follow_links:
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item

    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, six.string_types):
                return getattr(self, method, None)

        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
        spider._follow_links = crawler.settings.getbool(
            'CRAWLSPIDER_FOLLOW_LINKS', True)
        return spider

    def set_crawler(self, crawler):
        super(CrawlSpider, self).set_crawler(crawler)
        self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)
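One practical consequence of from_crawler/set_crawler above: the _follow_links flag checked in _parse_response comes from the CRAWLSPIDER_FOLLOW_LINKS setting (default True), so setting CRAWLSPIDER_FOLLOW_LINKS = False in settings.py stops every rule from following links while the callbacks still run.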
The content above comes from Jianshu; there are a few parts I do not fully understand yet and will revisit later.
Letter of credit with Applicable Rules: UCP LATEST VERSION, asking for advice on the clauses.
I am new to this.
Applicable Rules :40E: UCP LATEST VERSION. For this "UCP latest version" field, what do I need to watch out for so that I can avoid mistakes?
Also, I am worried I may have misread some clauses, so I would appreciate help with the following points:
Available with ... By ... :41D:
BANCA***SPA
BY DEF PAYMENT (this is the same as Initiating Institution :51D:, is that a problem?)
2.Documents Required :46A:
+ FULL SET CLEAN ON BOARD BILL OF LADING , MADE OUT TO ORDER OF APPLICANT (FULL NAME AND
ADDRESS), NOTIFY: SAME APPLICANT (how should this TO ORDER wording be changed, and where does it show up?)
+ HEALTH / SANITARY CERTIFICATE ISSUED BY COMPETENT GOVERNMENT
AUTHORITY IN ORIGINAL PLUS ONE COPY.
THIS DOCUMENT MUST BE ISSUED AS PER ECC RULES, AS PER MODEL
INDICATED BY THE GAZZETTA UFFICIALE OF ECC DD NOV 06, 2012,
DD NOV 05, 2012.
(Could someone translate this? I cannot fully understand it.)
+CERTIFICATE OF ORIGIN ISSUED BY CHAMBER OF COMMERCE ,
CERTIFYING CHINESE ORIGIN OF GOODS IN ORIGINAL PLUS ONE COPY. (Is this the ordinary CO from CCPIT that we normally prepare?)
+ DULY SIGNED BENEFICIARY FAX SENT TO APPLICANT WITHIN 1 DAYS
FROM SHIPMENT DATE EVIDENCING ALL SHIPPING DETAILS FOR PURPOSE
TO INSURE GOODS
COPY OF FAX REPORT IS REQUIRED.
3.Additional Conditions :47A:
+ PLS READ FIELD 50A AS:
(the customer's company name and address)
+ THE PAYMENT OF THIS DOCUMENTARY CREDIT WILL NOT BE EFFECTED IN
CASE OF REJECTION OF THE GOODS BY THE ITALIAN SANITARY
AUTHORITY. UNDER THIS CIRCUMSTANCE THE PAPER REJECTION
DOCUMENT, WHATEVER NAMED, ISSUED BY THE ITALIAN SANITARY
AUTHORITY WILL BE PRESENTED BY FIORITAL SRL TO THE ISSUING
BANK WITHIN FORESEEN PAYMENT'S DATE
(Could someone translate this? I cannot fully understand it.)
+ IN CASE OF AMENDMENTS REQUIRED BY BENEFICIARY, ALL CHARGES AND
COMMISSIONS, INCLUDING OURS, WILL BE FOR BENEFICIARY'S ACCOUNT
+ A FEE OF USD 140,00 WILL BE CHARGED TO BENEFICIARY FOR EACH
SET OF DISCREPANT DOCUMENTS AS ADDITIONAL PROCESSING FEE
WHENEVER WE MUST OBTAIN APPROVAL FROM OUR CUSTOMER,
WHETHER A NOTICE OF REFUSAL HAS BEEN GIVEN OR NOT
(Could someone translate this? I cannot fully understand it.)
+ ONE ADDITIONAL COPY OF ALL DOCS REQUIRED IN FIELD 46 SHOULD BE
PRESENTED FOR ISSUING BANK'S FILE, OTHERWISE USD 15,00 WILL BE
DEDUCTED FROM PAYMENT
Details of Charges :71B: ALL CHARGES AND COMMISSIONS
OUTSIDE ITALY, INCLUDING OUR
TRANSFER PAYMENT CHARGES ARE FOR
BENEFICIARY'S ACCOUNT.
ALL CHARGES ABOUT AMENDMENT ARE
FOR BEN'S ACCT.
Period for Presentation :48: within 15 days after the shipment
date but within validity of
Documentary Credit
Applicable Rules :40E: UCP LATEST VERSION. What should I watch out for so that I can avoid mistakes?
RE: UCP latest version means the latest version of the Uniform Customs and Practice for Documentary Credits, i.e. UCP 600.
Available with ... By ... :41D:
BANCA***SPA
BY DEF PAYMENT (this is the same as Initiating Institution :51D:, is that a problem?)
RE: This nominates Banca..SPA as the nominated bank under a deferred payment credit.
If it is the same as 51D, that is not a problem.
The documents need to be presented, within the presentation period, to the nominated bank named in 41D (Banca..SPA) or to the issuing bank.
+ FULL SET CLEAN ON BOARD BILL OF LADING , MADE OUT TO ORDER OF APPLICANT (FULL NAME AND
ADDRESS), NOTIFY: SAME APPLICANT (how should this TO ORDER wording be changed, and where does it show up?)
RE: The consignee on the bill of lading would read: to order of XXX company (the named applicant company), with the address shown.
The notify party is the applicant.
With a credit calling for a B/L made out to order of applicant, the exporter should stay alert:
if the presentation is compliant, the issuing bank must honour its payment obligation;
but if the documents are discrepant, taking the documents back, reselling or returning the goods can run into problems.
Turkish customs regulations on returned goods, for instance, tend to protect the local importer: without the importer's rejection certificate or even written consent, the beneficiary cannot ship the goods back even while holding the full set of original bills of lading, and once the customs supervision period expires the goods are auctioned off, with the original importer given the right of first refusal.
So pay attention to the consignee field of the B/L: try to have it issued to order (to order, or to the order of shipper), or to the order of the issuing bank.
Avoid a straight consignment to the applicant (consigned to applicant) or a B/L made out to the order of the applicant (consigned to the order of applicant),
so that if the documents are refused over discrepancies, the exporter is not left in a passive position when trying to get the goods back.
+CERTIFICATE OF ORIGIN ISSUED BY CHAMBER OF COMMERCE ,
CERTIFYING CHINESE ORIGIN OF GOODS IN ORIGINAL PLUS ONE COPY. (Is this the ordinary CO from CCPIT that we normally prepare?)
RE: A certificate of origin issued by CCPIT with the chamber of commerce stamp is fine.
+ HEALTH / SANITARY CERTIFICATE ISSUED BY COMPETENT GOVERNMENT
AUTHORITY IN ORIGINAL PLUS ONE COPY.
THIS DOCUMENT MUST BE ISSUED AS PER ECC RULES, AS PER MODEL
INDICATED BY THE GAZZETTA UFFICIALE OF ECC DD NOV 06, 2012,
DD NOV 05, 2012.
(Could someone translate this? I cannot fully understand it.)
RE: A health/sanitary certificate issued by the competent government authority, one original plus one copy.
The certificate must be issued according to the EU (ECC) rules and follow the model required by
THE GAZZETTA UFFICIALE OF ECC DD NOV 06, 2012,
DD NOV 05, 2012.
Gazzetta ufficiale of ecc should be the standard the product has to comply with.
+ DULY SIGNED BENEFICIARY FAX SENT TO APPLICANT WITHIN 1 DAYS
FROM SHIPMENT DATE EVIDENCING ALL SHIPPING DETAILS FOR PURPOSE
TO INSURE GOODS
COPY OF FAX REPORT IS REQUIRED.
RE: Within 1 day after the shipment date, fax a shipping advice to the applicant so that they can arrange insurance of the goods.
Documents to present: the shipping advice fax duly signed by the beneficiary, plus a copy of the fax transmission report.
+ THE PAYMENT OF THIS DOCUMENTARY CREDIT WILL NOT BE EFFECTED IN
CASE OF REJECTION OF THE GOODS BY THE ITALIAN SANITARY
AUTHORITY. UNDER THIS CIRCUMSTANCE THE PAPER REJECTION
DOCUMENT, WHATEVER NAMED, ISSUED BY THE ITALIAN SANITARY
AUTHORITY WILL BE PRESENTED BY FIORITAL SRL TO THE ISSUING
BANK WITHIN FORESEEN PAYMENT'S DATE
RE: If the goods are rejected by the Italian sanitary authority, payment under this documentary credit will not be effected.
Although the applicant added this clause to protect its own interests and to comply with the importing country's sanitary regulations, it undermines
the issuing bank's primary obligation to pay against a complying presentation and is very unfavourable to the beneficiary.
Not acceptable; I suggest asking for it to be deleted.
+ IN CASE OF AMENDMENTS REQUIRED BY BENEFICIARY, ALL CHARGES AND
COMMISSIONS, INCLUDING OURS, WILL BE FOR BENEFICIARY'S ACCOUNT
If an amendment to the credit is requested by the beneficiary, all related bank charges and commissions, including the issuing bank's, are for the beneficiary's account.
+ A FEE OF USD 140,00 WILL BE CHARGED TO BENEFICIARY FOR EACH
SET OF DISCREPANT DOCUMENTS AS ADDITIONAL PROCESSING FEE
WHENEVER WE MUST OBTAIN APPROVAL FROM OUR CUSTOMER,
WHETHER A NOTICE OF REFUSAL HAS BEEN GIVEN OR NOT
RE: A discrepancy-fee clause; the wording is rather overbearing.
+ ONE ADDITIONAL COPY OF ALL DOCS REQUIRED IN FIELD 46 SHOULD BE
PRESENTED FOR ISSUING BANK'S FILE, OTHERWISE USD 15,00 WILL BE
DEDUCTED FROM PAYMENT
RE: You must present one extra copy of every document required in field 46A for the issuing bank's file, otherwise USD 15.00 will be deducted from the payment.
Reply to post #12 by beyond-sky:
Thank you very much. I also read it roughly and felt the clauses are quite unfavourable to us, but because I am not familiar with this I do not know how best to handle it.
For now it seems I can only follow the customer's requirements and be extra careful when preparing the documents; what else can I do?
Reply to post #9 by beyond-sky:
I will come back and read your suggestions carefully. I dug through the files of past deals with this customer, and this requirement has always been there.
RE: If the goods are rejected by the Italian sanitary authority, payment under the credit will not be effected.
Not acceptable; suggest deleting it.
Regarding this suggestion of yours, how should I raise it with the customer? Just ask him to delete it? At the moment the customer has only sent a DRAFT, but
it feels like he will not agree.
Since the required documents already include a sanitary certificate, there is no need for this clause making payment conditional.
The clause undermines the beneficiary's right, upon a complying presentation, to demand payment from the issuing bank.
