
Multi-language Markdown/MDX Translation Pipeline

The goal is to translate human-readable content in Markdown and MDX files without breaking code, JSX, math, or metadata, while preserving formatting and structure as much as possible. A key requirement is to prevent LLM hallucination or corruption of code and structured content by ensuring that only human-language text is sent for translation.


Core Problem

To safely translate documents with LLMs, we must mask or exclude non-translatable content (such as code, JSX, math, and metadata). Treating the document as plain text is unsafe, because LLMs may:

  • Hallucinate code

  • Translate identifiers

  • Break JSX or Markdown structure

  • Corrupt frontmatter or metadata

Therefore, the document must be parsed into an AST (Abstract Syntax Tree) so that translation can be applied selectively based on node type.
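As a rough illustration of selecting by node type, the sketch below walks a small hand-written tree and yields only the nodes considered translatable. The node type names (`text`, `inlineCode`, `code`, `yaml`, `mdxjsEsm`) follow mdast conventions; the tree itself and the choice to treat only `text` nodes as translatable are assumptions made for the example, not the project's actual rules.

```python
from typing import Any, Iterator

# Assumption: only mdast "text" nodes carry translatable prose. Every other
# node that holds a literal value (code, inlineCode, yaml, mdxjsEsm, ...) is
# excluded simply because its type does not match.
TRANSLATABLE_TYPES = {"text"}


def iter_translatable(node: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield every node whose type is translatable, in document order."""
    if node.get("type") in TRANSLATABLE_TYPES:
        yield node
    for child in node.get("children", []):
        yield from iter_translatable(child)


# Hand-written tree, shaped like what a Markdown/MDX parser would emit.
tree = {
    "type": "root",
    "children": [
        {"type": "yaml", "value": "title: Intro"},
        {"type": "mdxjsEsm", "value": "import Tabs from '@theme/Tabs'"},
        {
            "type": "paragraph",
            "children": [
                {"type": "text", "value": "Run "},
                {"type": "inlineCode", "value": "npm install"},
                {"type": "text", "value": " first."},
            ],
        },
        {"type": "code", "lang": "js", "value": "console.log('hi')"},
    ],
}

texts = [n["value"] for n in iter_translatable(tree)]
print(texts)  # ['Run ', ' first.']
```

The frontmatter, the import statement, the inline code, and the fenced block never reach the translator, because nothing in the walk ever selects them.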

MDX + Docusaurus Compatibility

One of the main goals of this project is compatibility with Docusaurus, which uses MDX (Markdown + JSX components).

This introduces a major constraint: there is no mature, fully correct MDX parser in Python or C++, because an MDX parser must support JSX, ESM imports/exports, and embedded JavaScript expressions.

While Markdown parsers are available from Python (for example, tree-sitter with a Markdown grammar), they generally do not support MDX and cannot safely parse JSX inside Markdown.

Solution: unified.js via Subprocess

Docusaurus itself uses unified.js, a JavaScript ecosystem that provides parsing, AST transformation (masking), and stringification for many formats and extensions, including Markdown, MDX (with JSX + ESM), GFM (tables, task lists, etc.), frontmatter, and LaTeX math.

Because unified/remark is the de facto standard toolchain for MDX, the final design decision was to run the JavaScript unified stack as a subprocess from Python to parse and transform MD/MDX files.

Why AST-Based Masking

AST-based masking prevents LLM hallucination and corruption of code and structure. Non-translatable content is masked implicitly by excluding AST nodes by node type, rather than by relying on heuristics or regexes. This ensures that the original structure of the document is preserved and that only natural-language text is translated, so code and JSX are never misinterpreted as translatable text.

Note

While this entire pipeline could be implemented in JavaScript (and would likely be simpler), the project integrates with an existing Python-based system, so JS is used specifically for parsing and AST-based masking.

Architecture

Python (Orchestrator)

  • Reads Markdown/MDX files

  • Invokes Node.js subprocess (or long-lived worker)

  • Sends document text via stdin

  • Receives transformed Markdown/MDX via stdout

  • Writes output back to disk

  • Handles batching, routing, and integration with existing Python systems
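The orchestrator side can be sketched roughly as follows, assuming the one-shot subprocess variant (not the long-lived worker) and a transformer that reads the whole document on stdin and writes the result on stdout. The function names, the timeout value, and the script name `translate.mjs` mentioned in the comment are hypothetical; a small Python echo process stands in for Node.js so the example runs anywhere.

```python
import subprocess
import sys
from pathlib import Path


def run_transformer(cmd: list[str], source: str, timeout: float = 120.0) -> str:
    """Pipe document text to a child process and return what it writes to stdout."""
    result = subprocess.run(
        cmd,
        input=source,
        capture_output=True,
        text=True,
        encoding="utf-8",
        timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"transformer failed ({result.returncode}): {result.stderr.strip()}"
        )
    return result.stdout


def translate_file(cmd: list[str], src: Path, dst: Path) -> None:
    """Read one Markdown/MDX file, transform it, and write the result to disk."""
    transformed = run_transformer(cmd, src.read_text(encoding="utf-8"))
    dst.write_text(transformed, encoding="utf-8")


# Stand-in child process so the sketch runs without Node.js installed: it only
# echoes stdin to stdout. In the real pipeline cmd would be something like
# ["node", "translate.mjs"], where translate.mjs is a hypothetical script name.
ECHO_CMD = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]

print(run_transformer(ECHO_CMD, "# Title\n\nHello\n"))
```

Spawning one process per file keeps the sketch simple; for large batches, the long-lived worker mentioned above would avoid paying Node.js startup cost on every document.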

Node.js (Parser + AST Transformer)

  • Uses the unified / remark ecosystem to:

    • Parse Markdown + MDX into a real AST

    • Walk the AST

    • Mask / skip non-translatable nodes

    • Translate only safe text nodes

    • Stringify AST back to Markdown/MDX

  • This ensures:

    • First-class MDX + JSX support

    • Correct handling of GFM and frontmatter

    • Safe, structure-aware translation
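The walk, skip, and translate steps above happen in Node.js in the actual pipeline. The sketch below only mirrors their logic in Python over a hand-written tree, to show how an ancestor-aware visitor (the idea behind `unist-util-visit-parents`) can translate text nodes in place. The `NO_TRANSLATE_JSX` policy, the function names, and the use of `str.upper` in place of an LLM call are all assumptions made for illustration.

```python
from typing import Any, Callable

Node = dict[str, Any]
Visitor = Callable[[Node, tuple[Node, ...]], None]

# Hypothetical policy: text inside these JSX elements is left untranslated.
NO_TRANSLATE_JSX = {"code", "kbd"}
JSX_TYPES = {"mdxJsxFlowElement", "mdxJsxTextElement"}


def visit_parents(node: Node, visitor: Visitor,
                  ancestors: tuple[Node, ...] = ()) -> None:
    """Depth-first walk that hands each node its chain of ancestors."""
    visitor(node, ancestors)
    for child in node.get("children", []):
        visit_parents(child, visitor, ancestors + (node,))


def translate_tree(tree: Node, translate: Callable[[str], str]) -> None:
    """Replace the value of every safe text node in place; nothing else changes."""

    def visitor(node: Node, ancestors: tuple[Node, ...]) -> None:
        if node.get("type") != "text":
            return  # code, math, frontmatter, ESM, expressions: skipped by type
        if any(a.get("type") in JSX_TYPES and a.get("name") in NO_TRANSLATE_JSX
               for a in ancestors):
            return  # text that an enclosing JSX element marks as literal
        node["value"] = translate(node["value"])

    visit_parents(tree, visitor)


tree = {
    "type": "root",
    "children": [
        {"type": "paragraph", "children": [
            {"type": "text", "value": "Press "},
            {"type": "mdxJsxTextElement", "name": "kbd", "attributes": [],
             "children": [{"type": "text", "value": "Ctrl"}]},
            {"type": "text", "value": " to copy."},
        ]},
        {"type": "code", "lang": "sh", "value": "echo hello"},
    ],
}

translate_tree(tree, str.upper)  # str.upper stands in for the LLM call
```

After the call, the two paragraph text nodes are rewritten while the `kbd` content and the fenced code block keep their original values, and the shape of the tree is unchanged, so stringification can reproduce the original structure.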

JS Stack

Core stack (Node.js):

  • unified

  • remark-parse

  • remark-mdx

  • remark-gfm

  • remark-frontmatter

  • remark-stringify

  • unist-util-visit-parents

  • LLM backend (e.g., Ollama)

Alternatives (Not Chosen)

| Option | Why Not |
| --- | --- |
| Tree-sitter | Great for multi-language parsing, but poor for MDX printing and JSX-aware Markdown round-trip |
| Pandoc/Panflute | Good for Markdown, but weak for MDX + JSX |
| Pure Python parsers | No real MDX support |
| Regex / text heuristics | Unsafe; leads to hallucinations and document corruption |