HtmlAgilityPack to parse content

Perfecto

Client

25.03.2023

#1

Hi,

I'm trying to extract content with HtmlAgilityPack:

C#:

using HtmlAgilityPack;

string htmlContent = project.Variables["DOM"].Value;

var doc = new HtmlDocument();
doc.LoadHtml(htmlContent);

var article = doc.DocumentNode.SelectSingleNode("//body");

// Remove all unwanted elements within the article
foreach (var node in article.SelectNodes("//*[not(starts-with(name(),'h')) and not(name()='h') and not(name()='ul') and not(name()='li') and not(name()='strong') and not(name()='b')]"))
{
    node.Remove();
}

// Print the article content with only the desired tags
string extractedContent = project.Variables["content"].Value;

I get this error:

I have installed the latest version of HtmlAgilityPack (net45).

Phoenix78

Client

Read only

25.03.2023

#2

Remove the line using HtmlAgilityPack; from the snippet.
Put it in the general code instead, on the Using tab.
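
For context, and as an assumption about how ZennoPoster builds snippets rather than something confirmed in this thread: the snippet text appears to be compiled as the body of a method, and C# does not allow a using directive inside a method body. If that is the case, another option is to leave the directive out and write the namespace in full. A minimal sketch (SnippetExample and FirstHeading are hypothetical names used only for illustration):

```csharp
public static class SnippetExample
{
    // Returns the text of the first <h1>, or an empty string if there is none.
    public static string FirstHeading(string html)
    {
        // Fully qualified type name, so no "using HtmlAgilityPack;" is needed here.
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches.
        var heading = doc.DocumentNode.SelectSingleNode("//h1");
        return heading == null ? string.Empty : heading.InnerText;
    }
}
```

The same fully qualified form should work inside the snippet itself, for example var doc = new HtmlAgilityPack.HtmlDocument();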

Perfecto

Client

25.03.2023

#3

Phoenix78 said:
Remove the line using HtmlAgilityPack; from the snippet.
Put it in the general code instead, on the Using tab.

View attachment 105167

Thanks, it works.

But there is still a problem with my code:

C#:

string htmlContent = project.Variables["DOM"].Value;

var doc = new HtmlDocument();
doc.LoadHtml(htmlContent);

if (doc.DocumentNode != null)
{
    // Find the element containing the article
    // In this example, we assume the article is contained within the <body> tag
    var article = doc.DocumentNode.SelectSingleNode("//body");

    if (article != null)
    {
        // Remove all unwanted elements within the article
        var nodesToRemove = article.SelectNodes("//*[not(starts-with(name(),'h')) and not(name()='h') and not(name()='ul') and not(name()='li') and not(name()='strong') and not(name()='b') and not(name()='p')]");

        if (nodesToRemove != null)
        {
            foreach (var node in nodesToRemove)
            {
                node.Remove();
            }
        }

        // Remove HTML comments
        var comments = article.SelectNodes("//comment()");
        if (comments != null)
        {
            foreach (var comment in comments)
            {
                comment.Remove();
            }
        }

        // Replace <p> tags with their inner text
        var pTags = article.SelectNodes("//p");
        if (pTags != null)
        {
            foreach (var pTag in pTags)
            {
                if (pTag.ParentNode != null)
                {
                    pTag.ParentNode.InsertBefore(HtmlTextNode.CreateNode(pTag.InnerText), pTag);
                    pTag.Remove();
                }
            }
        }

        // Store the article content with only the desired tags into the ZennoPoster variable
        project.Variables["clean_content"].Value = article.InnerHtml;
    }
    else
    {
        // if <body> not found
        project.Variables["clean_content"].Value = "No body";
    }
}
else
{
    // if DocumentNode is null
    project.Variables["clean_content"].Value = "DocumentNode is null";
}

My goal is to extract HTML pages from different sites and strip the markup while keeping:
the Hn, strong, b, ul and li tags and their content;
the content of the <p> tags, but without the tags themselves.
HTML comments should also be removed.

The result is not what I expected, and I can't understand why...
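
The goal described above could also be approached without deleting nodes at all, by walking the tree and writing out only what should survive. The following is a minimal sketch under several assumptions, not a tested fix for any particular page: ArticleCleaner, Clean and AppendCleaned are hypothetical names, script, style and noscript are dropped together with their content (an assumption, since they are not mentioned above), and kept tags are written without their attributes. It is written as a standalone class, so the using directives sit at the top of the file.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

public static class ArticleCleaner
{
    // Tags written to the output together with their content.
    private static readonly HashSet<string> Keep =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "h1", "h2", "h3", "h4", "h5", "h6", "ul", "li", "strong", "b" };

    // Tags skipped together with everything inside them (assumption).
    private static readonly HashSet<string> Drop =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "script", "style", "noscript" };

    public static string Clean(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Fall back to the whole document when there is no <body>.
        var root = doc.DocumentNode.SelectSingleNode("//body") ?? doc.DocumentNode;

        var output = new StringBuilder();
        foreach (var child in root.ChildNodes)
            AppendCleaned(child, output);
        return output.ToString();
    }

    private static void AppendCleaned(HtmlNode node, StringBuilder output)
    {
        if (node.NodeType == HtmlNodeType.Text)
        {
            output.Append(((HtmlTextNode)node).Text);
            return;
        }

        // Comments and dropped tags produce no output at all.
        if (node.NodeType != HtmlNodeType.Element || Drop.Contains(node.Name))
            return;

        bool keep = Keep.Contains(node.Name);
        if (keep) output.Append('<').Append(node.Name).Append('>');

        // Other tags such as <p> or <div> are unwrapped: only their children are written.
        foreach (var child in node.ChildNodes)
            AppendCleaned(child, output);

        if (keep) output.Append("</").Append(node.Name).Append('>');
    }
}
```

With the class placed in the general code, the snippet could then be reduced to project.Variables["clean_content"].Value = ArticleCleaner.Clean(project.Variables["DOM"].Value); Because nothing is removed from the tree, a heading nested inside a div is still reached even though the div itself is not written out.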

lokiys

Moderator

25.03.2023

#4

Use

C#:

project.SendInfoToLog("Your comment or data", false);

in your code to check what values are returned, then fix your code accordingly.
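
One value worth logging is how many nodes each XPath actually matches. The sketch below is only an illustration (XPathScopeDemo, CountMatches and Report are hypothetical names, written as a standalone class). It relies on the standard XPath rule that an expression starting with // is evaluated from the document root even when SelectNodes is called on a child node, while .// is relative to that node.

```csharp
using HtmlAgilityPack;

public static class XPathScopeDemo
{
    // Counts the nodes an XPath matches when it is evaluated on <body>.
    public static int CountMatches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var body = doc.DocumentNode.SelectSingleNode("//body");
        if (body == null) return 0;

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var nodes = body.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    // Builds a one-line summary that can be written to a log.
    public static string Report(string html)
    {
        return string.Format("//* matches {0} elements, .//* matches {1}",
            CountMatches(html, "//*"), CountMatches(html, ".//*"));
    }
}
```

In a ZennoPoster snippet the string from Report could be passed to project.SendInfoToLog(..., false). If the first count is larger than the second, the XPath is reaching outside <body>. It is also worth logging the names of the matched nodes: a filter that matches body or container tags such as div removes everything nested inside them, including the headings and lists that were meant to be kept.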

Perfecto

Client

25.03.2023

#5

Thank you for your quick response.
It returns the same thing as the "clean_content" variable.
In my example I used this page: https://www.lavieclaire.com/conseils/quels-sont-les-bienfaits-de-la-spiruline/
And the result is:

C#:

<!-- Google Tag Manager (noscript) -->

<!-- End Google Tag Manager (noscript) -->

    
    

        
        
                

        <!-- Cookie Axeptio -->
        
        <!-- End Cookie Axeptio -->

Pierre Paul Jacques

Active user

15.02.2024

#6

Hi, I'm having the same trouble with this pack.

Maybe I missed something?

Thanks in advance.
