Scrapy and XPath to crawl my site and export URLs - what am I doing wrong?
I'm trying to set up a basic Scrapy spider to crawl my website and extract the URLs of all of its pages. I would have thought this would be fairly easy.
Here is my items.py, copied from the tutorial:
from scrapy.item import Item, Field

class Website(Item):
    name = Field()
    description = Field()
    url = Field()
Here is my spider, named example.py, from the tutorial:
from scrapy.spiders import Spider
from scrapy.selector import Selector
from cspenn.items import Website

class DmozSpider(Spider):
    name = "cspenn"
    allowed_domains = ["christopherspenn.com"]
    start_urls = ["http://www.christopherspenn.com/"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//a')
        items = []
        for site in sites:
            item = Website()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re('-\s[^\n]*\\r')
            items.append(item)
        return items
This is what I get back from the bot:
scrapy crawl cspenn
2016-04-13 13:15:25 [scrapy] INFO: Scrapy 1.0.5 started (bot: cspenn)
2016-04-13 13:15:25 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-04-13 13:15:25 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'cspenn.spiders', 'SPIDER_MODULES': ['cspenn.spiders'], 'BOT_NAME': 'cspenn'}
2016-04-13 13:15:25 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-04-13 13:15:26 [boto] DEBUG: Retrieving credentials from metadata server.
2016-04-13 13:15:27 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "/usr/lib/python2.7/urllib2.py", line 431, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 449, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1197, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2016-04-13 13:15:27 [boto] ERROR: Unable to read instance data, giving up
2016-04-13 13:15:27 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-04-13 13:15:27 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-04-13 13:15:27 [scrapy] INFO: Enabled item pipelines:
2016-04-13 13:15:27 [scrapy] INFO: Spider opened
2016-04-13 13:15:27 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-04-13 13:15:27 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-04-13 13:15:27 [scrapy] DEBUG: Crawled (200) <GET http://www.christopherspenn.com/> (referer: None)
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] DEBUG: Scraped from <200 http://www.christopherspenn.com/>
{'description': [], 'name': [], 'url': []}
2016-04-13 13:15:27 [scrapy] INFO: Closing spider (finished)
2016-04-13 13:15:27 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 222,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 14302,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 4, 13, 17, 15, 27, 262789),
'item_scraped_count': 93,
'log_count/DEBUG': 96,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 4, 13, 17, 15, 27, 77084)}
2016-04-13 13:15:27 [scrapy] INFO: Spider closed (finished)
What am I doing wrong? I followed the tutorial almost exactly. The desired output is a CSV file with the title, page URL, and description.
Basically, I'm looking for a clean export of all the URLs on the site. Crawl the whole site and export every page URL. Thank you! –
@ChristopherPenn OK, let's take it step by step. What output are you getting now? Do you see non-empty 'name' and 'url' values now? – alecxe
I do! {'description': [], 'name': [u'Marketing White Belt '], 'url': [u'http://www.christopherspenn.com/buy-the-marketing-white-belt-book/']} –
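For anyone landing here later, a minimal sketch of why the original expressions came back empty. It uses Python's standard-library ElementTree as a stand-in for Scrapy selectors (an assumption made so the snippet runs without Scrapy installed), and the PAGE markup is hypothetical, not the real site. The point is the same in both: once the loop variable is already an <a> node, a relative path starting with 'a' looks for a nested <a> child, which does not exist.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a fetched page (not the real site markup).
PAGE = """
<html><body>
  <a href="/about/">About</a>
  <p><a href="/blog/">Blog</a></p>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# Roughly what sel.xpath('//a') does: select every <a> in the document.
links = root.findall(".//a")

# Relative path 'a' means "an <a> child of the current node".
# The current node is already an <a>, so nothing matches.
nested = [site.findall("a") for site in links]

# Reading the text and href from the current <a> itself works.
rows = [{"name": site.text, "url": site.get("href")} for site in links]

print(nested)
print(rows)
```

In the spider itself, the corresponding change would presumably be site.xpath('text()') and site.xpath('@href') instead of 'a/text()' and 'a/@href', which matches the non-empty 'name' and 'url' reported above. Running scrapy crawl cspenn -o urls.csv should then write the items to a CSV file; covering the whole site rather than only the start page would additionally need the spider to follow links, for example with a CrawlSpider and a LinkExtractor rule.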