
Beautifulsoup Get_text Does Not Strip All Tags And Javascript

I am trying to use BeautifulSoup to get text from web pages. Below is a script I've written to do so. It takes two arguments, first is the input HTML or XML file, the second output

Solution 1:

nltk's clean_html() is quite good at this!

Assuming that you already have your HTML stored in a variable html, like

html = urllib.urlopen(address).read()

then just use

import nltk
clean_text = nltk.clean_html(html)

UPDATE

Support for clean_html and clean_url will be dropped for future versions of nltk. Please use BeautifulSoup for now...it's very unfortunate.

An example on how to achieve this is on this page:

BeatifulSoup4 get_text still has javascript
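
As a rough illustration of the BeautifulSoup route, here is a minimal sketch that removes script and style elements before calling get_text(). It assumes Python 3 with the beautifulsoup4 package installed; the function name extract_text and the sample markup are made up for this example and are not taken from the linked answer.

```python
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed


def extract_text(html):
    """Return the text of an HTML document without script or style content."""
    soup = BeautifulSoup(html, "html.parser")
    # Calling the soup like a function is shorthand for find_all().
    # decompose() removes each matched element together with its contents.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # get_text() joins the remaining strings; split()/join() collapses
    # runs of whitespace, including newlines, into single spaces.
    return " ".join(soup.get_text(separator=" ").split())


# Hypothetical sample page, used only to exercise the function.
sample = """<html><head><title>Demo</title>
<style>p { color: red; }</style>
<script>var x = 1;</script></head>
<body><p>Hello,   <b>world</b>!</p></body></html>"""

print(extract_text(sample))
```

Note that the page title still ends up in the output here; if that is unwanted, "title" (or "head") could be added to the list of tags that get removed.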

Solution 2:

Here's an approach which is based on the answer here: BeautifulSoup Grab Visible Webpage Text by jbochi. This approach allows for comments embedded in elements containing page text, and does a bit to clean up the output by stripping newlines, consolidating space, etc.

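import re
import urllib
import BeautifulSoup

html = urllib.urlopen(address).read()
soup = BeautifulSoup.BeautifulSoup(html)
texts = soup.findAll(text=True)  # every text node in the document

def visible_text(element):
    # Skip text that belongs to elements the browser does not render as page text
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return ''
    # Drop embedded comments, carriage returns and newlines
    result = re.sub('<!--.*-->|\r|\n', '', str(element), flags=re.DOTALL)
    # Consolidate runs of whitespace and &nbsp; entities into a single space
    result = re.sub(r'\s{2,}|&nbsp;', ' ', result)
    return result

visible_elements = [visible_text(elem) for elem in texts]
page_text = ''.join(visible_elements)
print(page_text)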

Solution 3:

This was the problem I was having. No solution seemed to be able to return the text that would actually be rendered in the web browser. Other solutions mentioned that BS is not ideal for rendering and that html2text was a good approach. I tried both html2text and nltk.clean_html and was surprised by the timing results, so I thought they warranted an answer for posterity. Of course, the speed difference may depend heavily on the contents of the data...

One answer here from @Helge was about using nltk of all things.

import nltk

%timeit nltk.clean_html(html)
was returning 153 us per loop

It did a good job of returning the rendered text of the HTML as a string. This nltk module was even faster than html2text, though perhaps html2text is more robust.

betterHTML = html.decode(errors='ignore')
%timeit html2text.html2text(betterHTML)
was returning 3.09 ms per loop
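
The %timeit lines above are IPython magics. Outside IPython, a similar comparison can be sketched with the standard library's timeit module. This is only an illustration: strip_tags is a hypothetical stand-in for whichever extractor is being measured, time_per_call is a helper invented here, and the sample input is made up, so the numbers it prints say nothing about nltk or html2text.

```python
import re
import timeit


def strip_tags(html):
    """Hypothetical stand-in extractor: replaces anything tag-like with a space."""
    return re.sub(r"<[^>]+>", " ", html)


def time_per_call(func, arg, number=1000):
    """Return the average number of seconds per call of func(arg)."""
    total = timeit.timeit(lambda: func(arg), number=number)
    return total / number


# Made-up input; real pages will give very different timings.
sample = "<html><body><p>Hello, <b>world</b>!</p></body></html>" * 50

seconds = time_per_call(strip_tags, sample)
print(f"strip_tags: {seconds * 1e6:.1f} us per call")
```

To compare two extractors, call time_per_call once for each on the same input, and preferably on several representative pages, since the result can vary a lot with the data.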
