BeautifulSoup get_text Does Not Strip All Tags and JavaScript
Solution 1:
nltk's clean_html() is quite good at this!
Assuming that you already have your HTML stored in a variable html, like
html = urllib.urlopen(address).read()
then just use
import nltk
clean_text = nltk.clean_html(html)
UPDATE
Support for clean_html and clean_url will be dropped in future versions of nltk. Please use BeautifulSoup for now... it's very unfortunate.
An example on how to achieve this is on this page:
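For reference, a minimal sketch of the BeautifulSoup route, assuming the bs4 package (BeautifulSoup 4) is installed. The helper name html_to_text and the sample markup are illustrative and not part of the original answer:

```python
from bs4 import BeautifulSoup  # assumption: the beautifulsoup4 package is installed


def html_to_text(html):
    """Return the text of an HTML document with script/style content removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose contents are never rendered as page text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # get_text() joins the remaining text nodes; split() + join() collapses whitespace.
    return " ".join(soup.get_text(separator=" ").split())


# Made-up sample document for illustration only.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello</p><script>var x = 1;</script><p>world</p></body></html>"
)
print(html_to_text(sample))
```

Removing the script and style elements before calling get_text() is what keeps JavaScript and CSS out of the result; get_text() on its own returns their contents as ordinary text.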
Solution 2:
Here's an approach which is based on the answer here: BeautifulSoup Grab Visible Webpage Text by jbochi. This approach allows for comments embedded in elements containing page text, and does a bit to clean up the output by stripping newlines, consolidating space, etc.
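import re
import urllib
import BeautifulSoup  # BeautifulSoup 3, as used in this answer
html = urllib.urlopen(address).read()
soup = BeautifulSoup.BeautifulSoup(html)
texts = soup.findAll(text=True)
def visible_text(element):
    # Skip text inside elements that are never rendered on the page.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return ''
    # Remove embedded comments and line breaks.
    result = re.sub('<!--.*-->|\r|\n', '', str(element), flags=re.DOTALL)
    # Collapse runs of whitespace into a single space.
    result = re.sub(r'\s{2,}| ', ' ', result)
    return result
visible_elements = [visible_text(elem) for elem in texts]
# Use a separate name so the visible_text function is not overwritten.
page_text = ''.join(visible_elements)
print(page_text)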
Solution 3:
This was the problem I was having. No solution seemed able to return the rendered text (the text that would actually be displayed in the web browser). Other solutions mentioned that BS is not ideal for rendering and that html2text was a good approach. I tried both html2text and nltk.clean_html and was surprised by the timing results, so I thought they warranted an answer for posterity. Of course, the speed delta might depend heavily on the contents of the data...
One answer here from @Helge was about using nltk of all things.
import nltk
%timeit nltk.clean_html(html)
returned 153 us per loop
It worked really well, returning a string of the text as rendered. This nltk module was faster than even html2text, though perhaps html2text is more robust.
betterHTML = html.decode(errors='ignore')
%timeit html2text.html2text(betterHTML)
3.09 ms per loop
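The %timeit magic used above only works inside IPython. A rough way to run the same kind of comparison in a plain script is the standard timeit module. In the sketch below, strip_with_regex and strip_with_parser are hypothetical stand-ins for whatever converters are being compared (nltk.clean_html and html2text in the measurements above), and the sample document is made up, so the numbers it prints say nothing about those libraries:

```python
import re
import timeit
from html.parser import HTMLParser


def strip_with_regex(html):
    # Hypothetical stand-in converter: crude tag removal with a regex.
    return re.sub(r"<[^>]+>", "", html)


class TextCollector(HTMLParser):
    # Hypothetical stand-in converter: gathers text nodes with the stdlib parser.
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_with_parser(html):
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.parts)


def seconds_per_call(func, arg, number=100, repeat=3):
    # Best-of-`repeat` average time for a single call, similar in spirit to %timeit.
    best = min(timeit.repeat(lambda: func(arg), number=number, repeat=repeat))
    return best / number


# Made-up sample input; real pages will give very different timings.
sample_html = "<html><body>" + "<p>some <b>sample</b> text</p>" * 200 + "</body></html>"

for func in (strip_with_regex, strip_with_parser):
    print("%s: %.1f us per loop" % (func.__name__, seconds_per_call(func, sample_html) * 1e6))
```

Taking the minimum over several repeats reduces noise from other processes, but as noted above the result still depends heavily on the input data.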