Transforming HTML to Markdown: Best Methods?

Pierre Paul Jacques

Active user
Joined: 08.10.2023
Messages: 117
Thanks: 33
Points: 28
Hi Zenno Friends,

What are the most efficient methods for converting entire HTML documents to Markdown?

Context:


I'm working on automating the conversion of HTML content to Markdown. The goal is to preserve formatting such as headings, lists, bold, and italic text in the Markdown output.
I assume this would be done in C#?
(I don't want to chain ten blocks of regex search-and-replace.)

The converted content will be posted on other websites for parasite SEO, where Markdown is the only accepted syntax (HTML is not).

Thank you in advance for your assistance!
 

morpheus93

Client
Joined: 25.01.2012
Messages: 1,038
Thanks: 237
Points: 63
I think Pandoc should do the job. There are also some online converters out there. Or just use the JavaScript library Turndown; you can find examples for it online.
 

Pierre Paul Jacques

Thanks, Morpheus, Pandoc was the solution!

I wanted to share a useful C# script for automating HTML to Markdown conversion using Pandoc within ZennoPoster. This script might come in handy for those who need to process HTML content and convert it to Markdown format automatically.

The script assumes that you have Pandoc installed on your system and accessible via the command line. Here’s the generalized code snippet:

C#:
// Replace 'YourInputVariable' with your actual input variable name containing HTML content
string htmlContent = project.Variables["YourInputVariable"].Value;

// Replace 'YourOutputVariable' with your actual output variable name where Markdown will be stored
string markdownOutputVariable = "YourOutputVariable";

// Path to Pandoc executable; adjust it according to your system
string pandocPath = @"C:\Program Files\Pandoc\pandoc.exe";

// Temporary file paths for storing intermediate and output files
string tempHtmlPath = project.Directory + @"\tempHtml.html";
string tempMarkdownPath = project.Directory + @"\tempMarkdown.md";

try
{
    System.IO.File.WriteAllText(tempHtmlPath, htmlContent);

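    // Quote both paths so that directories containing spaces reach Pandoc intact
    var processInfo = new System.Diagnostics.ProcessStartInfo(pandocPath, $"-f html -t markdown \"{tempHtmlPath}\" -o \"{tempMarkdownPath}\"")
    {
        CreateNoWindow = true,
        UseShellExecute = false,
        RedirectStandardError = true
    };

    using (var process = System.Diagnostics.Process.Start(processInfo))
    {
        // Drain stderr before waiting so a full pipe buffer cannot block Pandoc
        string pandocErrors = process.StandardError.ReadToEnd();
        process.WaitForExit();

        // A non-zero exit code means the conversion failed; report it through the catch block
        if (process.ExitCode != 0)
            throw new Exception("Pandoc exited with code " + process.ExitCode + ": " + pandocErrors);
    }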

    string markdownContent = System.IO.File.ReadAllText(tempMarkdownPath);
    project.Variables[markdownOutputVariable].Value = markdownContent;
}
catch (Exception ex)
{
    project.SendInfoToLog("Error during Pandoc execution: " + ex.Message);
}
finally
{
    System.IO.File.Delete(tempHtmlPath);
    System.IO.File.Delete(tempMarkdownPath);
}
Make sure to replace the placeholder variable names with the actual ones you use in your ZennoPoster project. This script creates temporary files for the HTML input and Markdown output, which are deleted after the conversion process is completed.

This code was refined with the help of ChatGPT to make sure it is clear and functional, because I am not a dev ;)
There is surely a way to simplify it, but for the moment it works for me.

I hope this helps anyone looking to streamline their content processing workflow in ZennoPoster. If you have any questions or improvements, feel free to chime in!
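
One possible simplification of the script above, as a sketch rather than a tested ZennoPoster snippet: Pandoc reads from standard input when no input file is given and writes to standard output when `-o` is omitted, so the temporary files can be skipped. The class name `PandocPipe` and the method `Run` are hypothetical names made up for this example; where such a helper class lives inside a ZennoPoster project is also an assumption left to the reader.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class PandocPipe
{
    // Hypothetical helper: pipes text through Pandoc via stdin/stdout, without temp files.
    public static string Run(string pandocPath, string input, string fromFormat, string toFormat)
    {
        var startInfo = new ProcessStartInfo(pandocPath, "-f " + fromFormat + " -t " + toFormat)
        {
            CreateNoWindow = true,
            UseShellExecute = false,
            RedirectStandardInput = true,
            RedirectStandardOutput = true,
            RedirectStandardError = true,
            StandardOutputEncoding = Encoding.UTF8,
            StandardErrorEncoding = Encoding.UTF8
        };

        using (var process = Process.Start(startInfo))
        {
            // Read both output streams asynchronously so neither pipe can fill up and block Pandoc.
            var outputTask = process.StandardOutput.ReadToEndAsync();
            var errorTask = process.StandardError.ReadToEndAsync();

            // Pandoc expects UTF-8 input; write the raw bytes, then close stdin to signal end of input.
            byte[] bytes = new UTF8Encoding(false).GetBytes(input);
            process.StandardInput.BaseStream.Write(bytes, 0, bytes.Length);
            process.StandardInput.Close();

            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException(
                    "Pandoc exited with code " + process.ExitCode + ": " + errorTask.Result);

            return outputTask.Result;
        }
    }
}
```

With the placeholder names from the script, a call could look like `project.Variables["YourOutputVariable"].Value = PandocPipe.Run(pandocPath, htmlContent, "html", "markdown");`. Swapping the last two arguments to `"markdown", "html"` runs the conversion in the opposite direction.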
 


kagorec

Client
Joined: 24.08.2013
Messages: 923
Thanks: 476
Points: 63
C#:
...  
var processInfo = new System.Diagnostics.ProcessStartInfo(pandocPath, $"-f html -t markdown {tempHtmlPath} -o {tempMarkdownPath}")
    ...
It is better to use `commonmark_x` instead of `markdown` as the output format (`-t commonmark_x`).
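
In practice this suggestion only changes the value passed after `-t`. A minimal sketch, assuming the output format is made a parameter so that different Pandoc writers can be tried against the target site; `PandocArgs` and `Build` are hypothetical names for this example, not part of Pandoc or ZennoPoster:

```csharp
using System;

public static class PandocArgs
{
    // Hypothetical helper: builds the Pandoc argument string for an HTML input file,
    // with the output format (for example "markdown" or "commonmark_x") as a parameter.
    public static string Build(string outputFormat, string inputPath, string outputPath)
    {
        // Both paths are quoted so that directories containing spaces stay intact.
        return "-f html -t " + outputFormat + " \"" + inputPath + "\" -o \"" + outputPath + "\"";
    }
}
```

In the script from the earlier post, the result of `PandocArgs.Build("commonmark_x", tempHtmlPath, tempMarkdownPath)` would replace the interpolated argument string passed to `ProcessStartInfo`. Which format renders best is worth checking on the site where the content will be posted.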
 

winlingt

Client
Joined: 21.10.2021
Messages: 9
Thanks: 2
Points: 3
Thank you!
If it's not too complicated, could you please share the opposite version, "Markdown to HTML"?

I tried it myself, but my skills are too limited...
 
