How To Extract Certain Parts Of A Web Page In Python
Target web page: http://www.immi.gov.au/skilled/general-skilled-migration/estimated-allocation-times.htm
The section I want to extract: Skilled –
Solution 1:
Soo--oop of the e--e--evening,
--Lewis Carroll, Alice's Adventures in Wonderland
I think this is exactly what he had in mind!
The Mock Turtle would probably do something like this:
>>> from BeautifulSoup import BeautifulSoup
>>> import urllib2
>>> url = 'http://www.immi.gov.au/skilled/general-skilled-migration/estimated-allocation-times.htm'
>>> page = urllib2.urlopen(url)
>>> soup = BeautifulSoup(page)
>>> for row in soup.html.body.findAll('tr'):
...     data = row.findAll('td')
...     if data and 'subclass 885online' in data[0].text:
...         print data[4].text
...
15 May 2011
But I'm not sure it would help, since that date has already passed!
Good luck with the application!
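For anyone on Python 3, where `urllib2` and the old `BeautifulSoup` import path are gone, the same row-scanning idea might look roughly like the sketch below. It assumes the `bs4` package (Beautiful Soup 4) is installed. It runs against a small made-up table rather than the live page, so the sample HTML, the filler rows, and the helper name `allocation_date` are all hypothetical. Only the five-cell layout and the `'subclass 885online'` label are taken from the session above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page. The layout (five cells per row,
# date in the fifth cell) is assumed from the indexes used in the answer.
SAMPLE_HTML = """
<html><body><table>
  <tr><th>Visa</th><th>A</th><th>B</th><th>C</th><th>Date</th></tr>
  <tr><td>Skilled - Independent (Residence) subclass 885<br/>online</td>
      <td>-</td><td>-</td><td>-</td><td>15 May 2011</td></tr>
  <tr><td>Made-up filler row<br/>paper</td>
      <td>-</td><td>-</td><td>-</td><td>1 January 2000</td></tr>
</table></body></html>
"""


def allocation_date(html, label="subclass 885online"):
    """Return the fifth cell of the first row whose first cell contains label."""
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        # get_text() joins text nodes with no separator, so the <br/> between
        # "885" and "online" disappears, which is why the label has no space.
        if len(cells) >= 5 and label in cells[0].get_text():
            return cells[4].get_text(strip=True)
    return None


print(allocation_date(SAMPLE_HTML))
```

To try it on a real page, the HTML could be fetched with `urllib.request.urlopen(url).read()` and passed in place of `SAMPLE_HTML`. Keep in mind that the 2011 URL may no longer serve this table.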
Solution 2:
You might want to use this as a starting point:
Python 2.6.7 (r267:88850, Jun 13 2011, 22:03:32)
[GCC 4.6.1 20110608 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2, re
>>> from BeautifulSoup import BeautifulSoup
>>> urllib2.urlopen('http://www.immi.gov.au/skilled/general-skilled-migration/estimated-allocation-times.htm')
<addinfourl at 139158380 whose fp = <socket._fileobject object at 0x84aa2ac>>
>>> html = _.read()
>>> soup = BeautifulSoup(html)
>>> soup.find(text = re.compile('\\bsubclass 885\\b')).parent.parent.find('td', text = re.compile(' [0-9]{4}$'))
u'15 May 2011'
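If installing a third-party package is not an option, the same "find the row by regex, then find the date cell by regex" idea can be approximated with the standard library's `html.parser`. This is a different technique from the BeautifulSoup session above, not a port of it. The sample HTML and the names `RowCollector` and `find_date` are hypothetical; only the two regular expressions come from the session.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample; the real page's markup is assumed, not known.
SAMPLE_HTML = """
<table>
  <tr><td>Skilled - Independent (Residence) subclass 885<br/>online</td>
      <td>-</td><td>15 May 2011</td></tr>
  <tr><td>Made-up filler row</td><td>-</td><td>1 January 2000</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being parsed, or None
        self._cell = None   # text chunks of the cell being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join chunks with a space so text split by <br/> stays separated
            # and the trailing \b in the row pattern can still match.
            parts = [p.strip() for p in self._cell if p.strip()]
            self._row.append(" ".join(parts))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def find_date(html, row_pattern=r"\bsubclass 885\b", date_pattern=r" [0-9]{4}$"):
    """Return the first date-like cell in the first row matching row_pattern."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    for row in parser.rows:
        if any(re.search(row_pattern, cell) for cell in row):
            for cell in row:
                if re.search(date_pattern, cell):
                    return cell
    return None


print(find_date(SAMPLE_HTML))
```

This trades BeautifulSoup's tolerance of messy markup for zero dependencies, so it is best suited to simple, well-formed tables.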
Solution 3:
There is a library called Beautiful Soup which does the job you asked for. http://www.crummy.com/software/BeautifulSoup/