
Remove All Strings In { } Delimiter Using Regex Or Html Agility Pack In ASP.NET Web Forms

I'm trying to extract the text-only content from a web page and display it. I use HtmlAgilityPack to do the text extraction, but the returned text includes the JavaScript and CSS te

Solution 1:

I have been using HtmlAgilityPack to load a web page and extract only its text content. When I load the page and extract the text, the CSS and JavaScript text is extracted as well. I first tried a regex that removes the JavaScript and CSS from the output by detecting the { } delimiters, but that was hard. I then tried another way that works and is much simpler: using Descendants() from HtmlAgilityPack. My code is:

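    // Requires: using System.Linq; using System.Text.RegularExpressions; using HtmlAgilityPack;
    HtmlWeb web = new HtmlWeb();
    HtmlDocument doc = web.Load(url);

    // Remove <script>, <style> and comment nodes before reading the text.
    doc.DocumentNode.Descendants()
        .Where(n => n.Name == "script" || n.Name == "style" || n.Name == "#comment")
        .ToList()
        .ForEach(n => n.Remove());

    string s = doc.DocumentNode.InnerText;
    // Strip tabs, newlines and any leftover tags from the extracted text.
    TextArea1.Value = Regex.Replace(s, @"\t|\n|<.*?>", "");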

I found this approach from: THIS LINK

Everything works now.


Solution 2:

Why don't you simply try:

/\{.*?\}/g

and replace the matches with nothing?
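
In C#, the `/.../g` literal above turns into a call to `Regex.Replace`, which already replaces every match. The following is a minimal sketch rather than a definitive implementation: the names `LazyBraceStripper` and `Strip` are hypothetical, and `RegexOptions.Singleline` is an assumption added so that a block spanning several lines (typical for CSS) is matched too. Note that the lazy match stops at the first `}`, so with nested braces a stray tail is left behind.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class LazyBraceStripper
{
    // Lazy match from "{" to the nearest "}".
    // Singleline lets "." also match newline characters.
    private static readonly Regex Block =
        new Regex(@"\{.*?\}", RegexOptions.Singleline);

    // Returns the input with every matched "{...}" run removed.
    public static string Strip(string input)
    {
        return Block.Replace(input, "");
    }
}
```

For example, `LazyBraceStripper.Strip("a{b}c")` gives `"ac"`, while the nested input `"p{a{b}c}q"` gives `"pc}q"`, because the match ends at the inner closing brace.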


Solution 3:

You have nested braces.

In Perl, PHP, Ruby, you could match the nested braces using (?R) (recursion syntax). But .NET does not have recursion. Does this mean we are lost? Luckily, no.

Balancing Groups to the Rescue

C# regex cannot use recursion, but it has an awesome feature called balancing groups.

This regex will match complete nested braces.

(?<counter>{)(?>(?<counter>{)|(?<-counter>})|[^{}]+)+?(?(counter)(?!))

For instance, it will match

  1. {sdfs{sdfs}sd{d{ab}}fs}
  2. {ab}
  3. But not {aa
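
To remove the matched blocks rather than just find them, the same pattern can be passed to `Regex.Replace`. This is a small sketch under stated assumptions: the class and method names (`BalancedBraceStripper`, `Strip`) are made up for illustration, and the pattern is the one quoted above, unchanged.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class BalancedBraceStripper
{
    // Each "{" pushes a capture onto the "counter" stack and each "}" pops one.
    // (?(counter)(?!)) forces a failure if any "{" is still unclosed,
    // so only fully balanced runs are matched.
    private static readonly Regex Balanced = new Regex(
        @"(?<counter>{)(?>(?<counter>{)|(?<-counter>})|[^{}]+)+?(?(counter)(?!))");

    // Returns the input with every balanced "{...}" run removed.
    public static string Strip(string input)
    {
        return Balanced.Replace(input, "");
    }
}
```

With this sketch, `BalancedBraceStripper.Strip("a{sdfs{sdfs}sd{d{ab}}fs}b")` returns `"ab"`, and the unbalanced input `"{aa"` is returned unchanged.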

Solution 4:

If you want to match every run from '{' to the next '}', including every character between the pair that isn't '}', use the following:

/\{[^\}]+\}/g
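
A C# rendering of this pattern might look like the sketch below; `FlatBraceStripper` and `Strip` are hypothetical names. Because a negated character class also matches newlines, no extra regex option is needed for multi-line blocks. Two behaviours are worth knowing: `+` requires at least one character, so an empty `{}` is left in place, and nested braces are still cut at the first `}`.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class FlatBraceStripper
{
    // "{", then one or more characters that are not "}", then "}".
    private static readonly Regex Block = new Regex(@"\{[^}]+\}");

    // Returns the input with every matched "{...}" run removed.
    public static string Strip(string input)
    {
        return Block.Replace(input, "");
    }
}
```

For instance, `FlatBraceStripper.Strip("a{\nb\n}c")` yields `"ac"`, whereas `FlatBraceStripper.Strip("a{}b")` leaves the string as `"a{}b"`.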

Solution 5:

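    // Delete the text between each "{" and the next "}" (the braces themselves stay).
    // "text" is the string to clean; nested braces are not handled.
    int pos = 0;
    while (true)
    {
        int open = text.IndexOf('{', pos);
        if (open < 0) break;                              // no more "{"
        int close = text.IndexOf('}', open + 1);
        if (close < 0) break;                             // unmatched "{": stop
        text = text.Remove(open + 1, close - open - 1);   // strings are immutable, so reassign
        pos = open + 2;                                   // continue after the now-empty "{}"
    }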
