
Stripping HTML Tags Without Using HtmlAgilityPack

I need an efficient and (reasonably) reliable way to strip HTML tags from documents. It needs to be able to handle some fairly adverse circumstances: It's not known ahead of time

Solution 1:

This regex finds all tags while ignoring angle brackets that appear inside quoted strings within a tag.

<[a-zA-Z0-9/_-]+?((".*?")|([^<"']+?)|('.*?'))*?>

It can't detect escaped quotes inside quotes (but I think that is unnecessary in HTML).
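To see how the pattern behaves, here is a minimal sketch in Python. The language choice and the helper name `strip_tags` are my assumptions, not part of the answer; the pattern itself is copied unchanged and relies only on lazy quantifiers, which Python's `re` module supports.

```python
import re

# The pattern from the answer, copied unchanged. A triple-quoted raw string
# is used so the embedded " and ' characters need no escaping.
TAG_PATTERN = re.compile(r"""<[a-zA-Z0-9/_-]+?((".*?")|([^<"']+?)|('.*?'))*?>""")


def strip_tags(html: str) -> str:
    """Remove every substring the pattern recognises as a tag (hypothetical helper)."""
    return TAG_PATTERN.sub("", html)


if __name__ == "__main__":
    # The '>' characters inside the quoted attribute values do not end the tag,
    # and the bare '<' followed by a space is not treated as a tag start.
    print(strip_tags('<a href="x>y" title=\'1>0\'>link</a> and 1 < 2'))
```

Note that a `<` followed directly by a letter, as in `x <y and y> z`, would still be treated as the start of a tag, which is the limitation discussed next.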

Taking a list of all allowed tags and substituting it for the first part of the regex, as in <(tag1|tag2|...), could lead to a more precise solution. I'm afraid an exact solution can't be found given your assumption about angle brackets; think, for example, of something like <a href="test.html"> b<a </a>...
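A rough sketch of the tag-list idea, again in Python. The tag list, the function names, the optional leading `/` for closing tags, and the `\b` after the tag name are all my assumptions; the answer only suggests the `<(tag1|tag2|...)` shape.

```python
import re

# Hypothetical list of known tag names; a real list would be much longer.
KNOWN_TAGS = ["a", "b", "p", "br", "div", "span"]


def build_tag_pattern(tag_names):
    """Build a pattern that only matches tags whose name is in tag_names."""
    names = "|".join(re.escape(name) for name in tag_names)
    return re.compile(
        # "/?" allows closing tags; "\b" stops "b" from matching "<bogus>".
        r"</?(?:" + names + r")\b"
        # Same quoted-string handling as the original pattern.
        r"""(?:".*?"|[^<"']+?|'.*?')*?>""",
        re.IGNORECASE,
    )


KNOWN_TAG_PATTERN = build_tag_pattern(KNOWN_TAGS)


def strip_known_tags(html: str) -> str:
    """Remove only the tags whose names appear in KNOWN_TAGS."""
    return KNOWN_TAG_PATTERN.sub("", html)


if __name__ == "__main__":
    # "<y and y>" is left alone because "y" is not a known tag name.
    print(strip_known_tags("<p>if x <y and y> z</p>"))
```

This trades recall for precision: text that merely looks like a tag survives, but any real tag missing from the list survives too.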

EDIT:

I've updated the regex (it performs a lot better than the previous one). Moreover, if you need to strip out code, I suggest doing a little cleaning before the first pass, such as replacing <script.+?</script> with nothing.
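The two-pass cleanup could look roughly like this in Python. The function name `strip_html` is hypothetical, and the `re.DOTALL` and `re.IGNORECASE` flags are my additions: without `re.DOTALL`, `.` does not cross newlines in Python, so multi-line script blocks would be missed.

```python
import re

# First pass: drop whole script blocks, including their contents.
SCRIPT_BLOCK = re.compile(r"<script.+?</script>", re.IGNORECASE | re.DOTALL)

# Second pass: the tag pattern from the answer, copied unchanged.
TAG_PATTERN = re.compile(r"""<[a-zA-Z0-9/_-]+?((".*?")|([^<"']+?)|('.*?'))*?>""")


def strip_html(html: str) -> str:
    """Remove script blocks first, then strip the remaining tags."""
    without_scripts = SCRIPT_BLOCK.sub("", html)
    return TAG_PATTERN.sub("", without_scripts)


if __name__ == "__main__":
    page = '<p>Hi</p><script>\nif (a < b) { x("</p>"); }\n</script><b>!</b>'
    print(strip_html(page))
```

Running the script pass first keeps JavaScript source, which is full of `<` and `>` characters, from reaching the tag pattern at all. The same approach would presumably apply to `<style>` blocks, though the answer doesn't mention them.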

Solution 2:

I'm just thinking outside the box here, but you may consider leveraging something like Microsoft Word, or maybe OpenOffice.

I've used Word automation to convert HTML to DOC, RTF, or TXT. Word's native HTML-to-TXT conversion would give you exactly what you want, stripping all of the HTML tags and converting the document to plain text. Of course, this wouldn't be efficient at all if you're processing tons of tiny HTML files, since there's some overhead in all of this. But if you're dealing with massive files, it may not be a bad choice, as I'd expect Word to have plenty of optimizations around these conversions. You could test this theory by manually opening one of your largest HTML files in Word, resaving it as a TXT file, and seeing how long Word takes to save.

And although I haven't tried it, I bet it's possible to programmatically interact with OpenOffice to accomplish something similar.
