Multi-language Markdown/MDX Translation Pipeline

The goal is to translate human-readable content in Markdown and MDX files without breaking code, JSX, math, or metadata, while preserving formatting and structure as much as possible. A key requirement is to prevent LLM hallucination or corruption of code and structured content by ensuring that only human-language text is sent for translation.

Core Problem

To safely translate documents with LLMs, we must mask or exclude non-translatable content (such as code, JSX, math, and metadata). Treating the document as plain text is unsafe, because LLMs may:

- Hallucinate code
- Translate identifiers
- Break JSX or Markdown structure
- Corrupt frontmatter or metadata

Therefore, the document must be parsed into an AST (Abstract Syntax Tree) so that translation can be applied selectively based on node type.
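As a minimal sketch of what "selectively based on node type" means, the snippet below walks a hand-written tree shaped like the mdast JSON that remark produces and collects only the text that would be sent for translation. The `NON_TRANSLATABLE` set and the function name `iter_translatable` are illustrative assumptions, not the project's actual configuration.

```python
from typing import Any, Iterator

# Assumed set of mdast/MDX node types that must never reach the translator.
NON_TRANSLATABLE = {
    "code", "inlineCode", "yaml", "toml", "html", "math", "inlineMath",
    "mdxjsEsm", "mdxFlowExpression", "mdxTextExpression",
}


def iter_translatable(node: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield every `text` node that is not inside a non-translatable node."""
    if node.get("type") in NON_TRANSLATABLE:
        return  # skip this node and its whole subtree
    if node.get("type") == "text":
        yield node
    for child in node.get("children", []):
        yield from iter_translatable(child)


# Hand-written tree: a paragraph containing inline code, then a fenced block.
tree = {
    "type": "root",
    "children": [
        {"type": "paragraph", "children": [
            {"type": "text", "value": "Run "},
            {"type": "inlineCode", "value": "npm install"},
            {"type": "text", "value": " first."},
        ]},
        {"type": "code", "lang": "js", "value": "const x = 1;"},
    ],
}

segments = [n["value"] for n in iter_translatable(tree)]
print(segments)  # ['Run ', ' first.']
```

Neither `npm install` nor `const x = 1;` appears in `segments`, so the LLM never sees them and cannot alter them.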

MDX + Docusaurus Compatibility

One of the main goals of this project is compatibility with Docusaurus, which uses MDX (Markdown + JSX components).

This introduces a major constraint: there is no mature, fully correct MDX parser in Python or C++, because such a parser must support JSX, ESM imports/exports, and embedded JavaScript expressions.

While Markdown parsers are available from Python (for example, tree-sitter grammars used through its Python bindings), they generally do not support MDX and cannot safely parse JSX inside Markdown.

Solution: unified.js via Subprocess

Docusaurus itself uses unified.js, a JavaScript ecosystem that provides parsing, AST transformation (masking), and stringification for many syntaxes, including Markdown, MDX (with JSX + ESM), GFM (tables, task lists, etc.), frontmatter, LaTeX math, and other extensions.

Because unified/remark is the de facto standard toolchain for MDX, the final design decision was to run the unified.js stack in a Node.js subprocess, called from Python, to parse and transform MD/MDX files.

Why AST-Based Masking

Non-translatable content is masked implicitly by excluding AST nodes by node type, rather than by relying on heuristics or regex. This prevents LLM hallucination and corruption of code and structure: the original structure of the document is preserved, and only natural-language text is translated, so code and JSX are never misinterpreted as translatable text.

Note

While this entire pipeline could be implemented in JavaScript (and would likely be simpler), the project integrates with an existing Python-based system, so JS is used specifically for parsing and AST-based masking.

Architecture

Python (Orchestrator)

- Reads Markdown/MDX files
- Invokes a Node.js subprocess (or a long-lived worker)
- Sends document text via stdin
- Receives transformed Markdown/MDX via stdout
- Writes output back to disk
- Handles batching, routing, and integration with existing Python systems
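The stdin/stdout round trip could be sketched as below. The helper name `transform`, the timeout value, and the script name in the comment are hypothetical. So that the example runs without Node.js installed, a small Python one-liner that upper-cases its input stands in for the real transformer.

```python
import subprocess
import sys
from typing import Sequence


def transform(source: str, command: Sequence[str], timeout: float = 60.0) -> str:
    """Pipe `source` to `command` on stdin and return what it writes to stdout."""
    result = subprocess.run(
        list(command),
        input=source,
        capture_output=True,
        encoding="utf-8",  # Markdown is exchanged as UTF-8 text
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Stand-in for the Node.js transformer: echoes stdin in upper case.
# In the real pipeline the command would be something like
# ["node", "transform.mjs"], where the script name is hypothetical.
STAND_IN = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

print(transform("# hello", STAND_IN))  # # HELLO
```

With `check=True`, a crash in the subprocess surfaces as an exception in the orchestrator instead of producing an empty or partial output file.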

Node.js (Parser + AST Transformer)

Uses the unified / remark ecosystem to:
- Parse Markdown + MDX into a real AST
- Walk the AST
- Mask / skip non-translatable nodes
- Translate only safe text nodes
- Stringify the AST back to Markdown/MDX

This ensures:
- First-class MDX + JSX support
- Correct handling of GFM and frontmatter
- Safe, structure-aware translation
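In the actual pipeline the walk, skip, and translate steps run in Node.js on the tree that remark produces. To keep the examples in one language, the sketch below expresses the same logic in Python on a hand-written, mdast-shaped dictionary. The `SKIP` set, the function name `translate_tree`, and the use of `str.upper` as a stand-in for the LLM call are all assumptions made for illustration.

```python
from typing import Any, Callable

# Assumed node types whose content must pass through untouched.
SKIP = {
    "code", "inlineCode", "yaml", "toml", "html", "math", "inlineMath",
    "mdxjsEsm", "mdxFlowExpression", "mdxTextExpression",
}


def translate_tree(node: dict[str, Any], translate: Callable[[str], str]) -> None:
    """Replace the value of each safe, non-blank `text` node in place."""
    if node.get("type") in SKIP:
        return  # leave the node and everything under it unchanged
    if node.get("type") == "text" and node["value"].strip():
        node["value"] = translate(node["value"])
    for child in node.get("children", []):
        translate_tree(child, translate)


# Hand-written tree: an ESM import, a JSX element wrapping prose, a code block.
tree = {
    "type": "root",
    "children": [
        {"type": "mdxjsEsm", "value": "import Tabs from '@theme/Tabs'"},
        {"type": "mdxJsxFlowElement", "name": "Tabs", "attributes": [], "children": [
            {"type": "paragraph", "children": [
                {"type": "text", "value": "Install the package"},
            ]},
        ]},
        {"type": "code", "lang": "sh", "value": "npm install"},
    ],
}

translate_tree(tree, str.upper)  # str.upper stands in for the LLM call
print(tree["children"][1]["children"][0]["children"][0]["value"])  # INSTALL THE PACKAGE
print(tree["children"][2]["value"])  # npm install
```

The prose inside the JSX element is translated because the walker descends into the element's children, while the import statement and the fenced code are left unchanged.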

JS Stack

Core Stack (Node.js)
- unified
- remark-parse
- remark-mdx
- remark-gfm
- remark-frontmatter
- remark-stringify
- unist-util-visit-parents
- LLM backend (e.g., Ollama)

Alternatives (Not Chosen)

| Option | Why not |
| --- | --- |
| Tree-sitter | Great for multi-language parsing, but poor at printing MDX and at JSX-aware Markdown round-trips |
| Pandoc/Panflute | Good for Markdown, but weak for MDX + JSX |
| Pure Python parsers | No real MDX support |
| Regex / text heuristics | Unsafe; leads to hallucinations and document corruption |
