2014-04-18



The easiest option would be to extract //body//text() and join everything found.
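
The extract-and-join approach can be sketched as follows. This is a minimal sketch, not the answerer's exact code: it assumes lxml is installed and uses lxml's own xpath() call as a stand-in for a Scrapy selector, which returns an equivalent list of strings once .extract() is called on the query result. The function name join_body_text and the sample markup are hypothetical.

```python
import lxml.html


def join_body_text(html):
    """Collect every text node under <body> and join them into one string."""
    tree = lxml.html.fromstring(html)
    # The XPath query returns a list of text nodes, one per piece of text.
    chunks = tree.xpath("//body//text()")
    # Drop whitespace-only nodes and trim the rest before joining.
    return " ".join(chunk.strip() for chunk in chunks if chunk.strip())


# Hypothetical sample page, used only for illustration.
sample = (
    "<html><head><title>Ignored</title></head>"
    "<body><div><p>Hello <code>xpath()</code> world</p><p>Second</p></div></body></html>"
)
print(join_body_text(sample))
```

Note that //body//text() also matches text inside script and style elements, so pages that contain them may need an extra filter.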

Another option is to use nltk's clean_html():

>>> import nltk 
>>> html = """ 
... <div class="post-text" itemprop="description"> 
...   <p>I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework. 
... With <code>xpath('//body//text()')</code> I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !</p> 
...  </div>""" 
>>> nltk.clean_html(html) 
"I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.\nWith xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !" 

Another option is to use BeautifulSoup's get_text():


If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.

>>> from bs4 import BeautifulSoup 
>>> soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
>>> print(soup.get_text().strip())
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework. 
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks ! 

Another option is to use lxml.html's text_content():


Returns the text content of the element, including the text content of its children, with no markup.

>>> import lxml.html 
>>> tree = lxml.html.fromstring(html) 
>>> print(tree.text_content().strip())
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework. 
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks ! 

I deleted my question. I used the following code: html = sel.select("//body//text()"); tree = lxml.html.fromstring(html); item['description'] = tree.text_content().strip(). But I get an exception at is_full_html = _looks_like_full_html_unicode(html): TypeError: expected string or buffer. What went wrong? – Backtrack
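
On the TypeError reported in the comment above: lxml.html.fromstring() expects a single string of markup, whereas a selector query returns a list-like result, so passing that result straight in fails the string check. A minimal sketch of the difference, assuming lxml is installed; the helper name text_from_markup and the extracted list are hypothetical stand-ins, not Scrapy objects:

```python
import lxml.html


def text_from_markup(markup):
    """Return the text of one HTML string, with all tags removed."""
    # fromstring() needs one string of HTML, not a list of selected nodes.
    tree = lxml.html.fromstring(markup)
    return tree.text_content().strip()


# Hypothetical stand-in for an extracted selector result: a list of strings.
extracted = ["<div><p>Some <b>visible</b> text</p></div>"]

# Pass one string from the list, not the list itself.
description = text_from_markup(extracted[0])
print(description)
```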


'nltk' worked best for me – user4421975


Just as an update: 'nltk' has deprecated the 'clean_html' method and now recommends: 'NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function' – TheNastyOne


Have you tried this?


sel is a Selector instance:




This actually works quite well, but it still returns some HTML tags and other things. – tomasyany