Typing and Pasting Non-Latin Text (IME, RTL, Unicode)

This page collects the recurring questions we receive from customers who type or paste non-Latin scripts into the WinForms HTML Editor: "Chinese characters disappear when I type fast", "Copy/paste from a Japanese web page becomes mojibake", "Hebrew italic does not render the same as English", and "Emoji are saved as question marks". None of these are bugs in the editor itself - they are well-understood interactions between Windows IME, the WebView document, the clipboard, and the <meta charset> declaration in your saved HTML. This page tells you exactly which knob to turn for each scenario.

IME (Chinese, Japanese, Korean) input

An Input Method Editor (IME) is the Windows component that converts your keystrokes into the target script - Pinyin into Hanzi, Romaji into Hiragana/Kanji, Hangul jamo into syllabic blocks. The WinForms HTML Editor hosts a Microsoft WebView surface, and that surface participates in the standard Windows IME composition window cycle. You do not need to call any custom API to enable CJK typing - if your operating system has the IME installed and the language bar is switched to it, typing into the editor will compose characters the same way Notepad or Word would.

The one place customers run into trouble is the WinForms ImeMode property inherited from Control. Setting ImeMode to Disable on the host form (or on a parent panel) blocks the IME composition window for every child control underneath it, including the editor. We always leave the editor's ImeMode at the default NoControl so that the OS-level IME setting wins - if you have overridden it, restore the default:

// Let the WebView decide; do not force ImeMode on the editor.
htmlEditor1.ImeMode = System.Windows.Forms.ImeMode.NoControl;

The "characters disappear when I type fast" complaint is almost always one of two things. (1) A KeyDown or KeyPress handler on a parent form sets Handled = true for keys that are still being composed by the IME. If you intercept keys at the form level, check e.KeyCode == Keys.ProcessKey and bail out - that is the sentinel value Windows reports while the IME owns the keystroke. (2) A TextChanged-style handler that reads BodyHtml on every keystroke and re-assigns it. Re-assigning the document HTML resets the IME composition state, so partially composed characters are wiped. Throttle that work with a timer (250-500 ms), or move it to the HtmlChanged event, which is already debounced.
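As a sketch of case (1), a form-level KeyDown handler that claims only its own shortcuts and leaves IME keystrokes alone might look like this (RefreshPreview is a hypothetical placeholder for your own shortcut logic):

```csharp
// Form-level key filter that does not interfere with IME composition.
// Keys.ProcessKey is the sentinel WinForms reports while the IME owns
// the keystroke; setting Handled on it wipes the composition window.
private void MainForm_KeyDown(object sender, System.Windows.Forms.KeyEventArgs e)
{
    if (e.KeyCode == System.Windows.Forms.Keys.ProcessKey)
        return; // IME is still composing - leave the keystroke untouched.

    if (e.KeyCode == System.Windows.Forms.Keys.F5)
    {
        RefreshPreview(); // hypothetical app-specific shortcut
        e.Handled = true; // only claim keys you actually handle
    }
}
```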

Pasting Asian text without mojibake

Pasted Chinese or Japanese text turns into "random characters" when the source application puts UTF-8 bytes on the clipboard but tags them with the wrong code page (or vice versa). The editor handles this in ClipboardUtils.IsUtf8DecodingRequired and re-decodes the HTML through UTF-8 when the host CLR returns mis-decoded bytes. That covers the common Chrome / Edge / Word paste paths automatically.

If you still see mojibake for one specific source application, hook the Pasting event and normalize the payload yourself. The PastingHtml property is read/write - replace it before the editor inserts:

using System.Text;
using System.Text.RegularExpressions;

htmlEditor1.Pasting += (sender, e) =>
{
    if (string.IsNullOrEmpty(e.PastingHtml)) return;

    // Strip the clipboard's <meta charset> tag. Some apps emit one
    // that lies about the encoding; the editor will treat the rest of
    // the fragment as UTF-8 (which is what Windows actually put there).
    e.PastingHtml = Regex.Replace(
        e.PastingHtml,
        @"<meta[^>]*charset[^>]*>",
        string.Empty,
        RegexOptions.IgnoreCase);
};

If you receive a string from outside the editor (a database column, an API response) that is already mis-decoded, you can put it through the same UTF-8 round-trip the clipboard layer uses before inserting it:

using SpiceLogic.HtmlEditor.Infrastructure;

string fixedText = ClipboardUtils.ConvertDefaultToUtf8Encoding(rawText);
htmlEditor1.Content.InsertText(fixedText);

For pre-built HTML fragments (for example, generated server-side in Chinese), prefer Content.InsertHtml over BodyHtml = ... - the InsertHtml path goes through the editor's normalization pipeline, while a raw BodyHtml assignment can re-trigger the WebView's own charset guessing.
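For example, a server-generated Chinese fragment can go through the normalization pipeline like this (the fragment string is illustrative):

```csharp
// Insert a pre-built HTML fragment through the editor's normalization
// pipeline instead of assigning BodyHtml directly, which can re-trigger
// the WebView's own charset guessing.
string fragment = "<p>欢迎使用编辑器 (Welcome to the editor)</p>";
htmlEditor1.Content.InsertHtml(fragment);
```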

Right-to-left languages: Arabic, Hebrew, Persian, Urdu

RTL is controlled by the document's body element, not by the WinForms host. The editor exposes a clean API on top of the underlying direction: rtl body style:

using System.Windows.Forms;

// Flip the editor surface to RTL.
htmlEditor1.Content.SetRightToLeft(RightToLeft.Yes);

// Read it back (returns RightToLeft.Yes / No).
RightToLeft current = htmlEditor1.Content.GetRightToLeft();

// To toggle back to LTR:
htmlEditor1.Content.SetRightToLeft(RightToLeft.No);

Setting RTL flips text direction, list-bullet alignment, table cell flow, and indent/outdent semantics inside the document. Toolbar icons on the editor host stay in their LTR positions on purpose - mirroring the toolbar is a host-form concern (set the parent Form.RightToLeft / Form.RightToLeftLayout if you want the whole window mirrored).
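If you do want the whole window mirrored, set it on the host form rather than the editor - a minimal sketch using the standard WinForms properties:

```csharp
// In the host form: mirror the form's own layout (menus, docked
// toolbars, title bar). The editor document's direction is controlled
// separately via Content.SetRightToLeft.
this.RightToLeft = System.Windows.Forms.RightToLeft.Yes;
this.RightToLeftLayout = true;
```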

Italic and bold in RTL: Arabic and Hebrew fonts frequently ship without true italic or bold faces, so the WebView falls back to font synthesis (slanting the glyphs algorithmically). The result looks crude. If your customers complain, point the editor at a font family that ships real italic / bold faces - for Arabic, "Noto Naskh Arabic" or "Amiri"; for Hebrew, "Noto Sans Hebrew" or "David Libre". Set it via the editor's default font or by wrapping inserted text in a <span style="font-family: ...">.
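A minimal sketch of the span-wrapping approach, using one of the families suggested above (swap in whichever family is actually installed on your customers' machines):

```csharp
// Wrap inserted Arabic text in a font family that ships a real italic
// face, so the WebView does not fall back to synthetic slanting.
string arabic = "<span style=\"font-family: 'Noto Naskh Arabic', serif;\">"
              + "<em>مرحبا بكم</em></span>";
htmlEditor1.Content.InsertHtml(arabic);
```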

Unicode and the document <meta charset> tag

When you read htmlEditor1.DocumentHtml, the editor returns a full HTML document including a <meta charset> declaration. That declaration is what every downstream consumer - browsers, email clients, PDF converters - uses to decode the bytes back into characters. If the declaration disagrees with the actual encoding of the file on disk, CJK characters and emoji break.

The current charset is exposed through the Charset property. The getter returns the value declared in the source document (empty string if no <meta charset> is present); the setter writes it into the MSHTML document object.

// Inspect what the document declares today.
string declared = htmlEditor1.Charset;

// Force UTF-8 so emoji and CJK survive any downstream consumer.
htmlEditor1.Charset = "utf-8";

When you write the document to disk, write the bytes as UTF-8 to match the declared charset - File.WriteAllText(path, htmlEditor1.DocumentHtml, new UTF8Encoding(false)). Mixing a UTF-8 <meta charset> with a file that was actually saved as the system code page (windows-1252 on most Western Windows installs) is the #1 cause of broken emoji and missing CJK glyphs in the field.

If you need to fix an existing document programmatically, the EncodingUtils helper in SpiceLogic.HtmlEditor.Infrastructure rewrites the <meta charset> tag in place:

using SpiceLogic.HtmlEditor.Infrastructure;

string current = EncodingUtils.ReadCharSetValueFromDocumentHtml(html);
string fixedHtml = EncodingUtils.SetCharSetValueInDocumentHtml(html, "utf-8");
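Putting the two helpers together, a guard that only rewrites documents whose declared charset is wrong might look like this:

```csharp
using SpiceLogic.HtmlEditor.Infrastructure;

string declared = EncodingUtils.ReadCharSetValueFromDocumentHtml(html);
if (!string.Equals(declared, "utf-8", StringComparison.OrdinalIgnoreCase))
{
    // Covers both a missing <meta charset> (the getter returns an empty
    // string) and a declaration that names the wrong code page.
    html = EncodingUtils.SetCharSetValueInDocumentHtml(html, "utf-8");
}
```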

Verifying your build end-to-end

Paste the following test paragraph into a running editor, save the document to disk as UTF-8, then re-open it. Every glyph should round-trip cleanly. If any of them fall back to ? or tofu boxes, the writer side is not honoring the declared charset.

const string verificationParagraph =
    "English mixed with 中文 (Chinese), 日本語 (Japanese), "
  + "한국어 (Korean), العربية (Arabic), עברית (Hebrew), "
  + "and emoji ✨ \U0001F30D \U0001F389.";

htmlEditor1.Charset = "utf-8";
htmlEditor1.Content.InsertText(verificationParagraph);
htmlEditor1.Content.SetRightToLeft(System.Windows.Forms.RightToLeft.Yes);

System.IO.File.WriteAllText(
    @"C:\temp\rtl-cjk-emoji.html",
    htmlEditor1.DocumentHtml,
    new System.Text.UTF8Encoding(encoderShouldEmitUTF8Identifier: false));

If the round-trip fails, the culprit is almost always one of three things: an encoding mismatch in the file writer (use UTF-8 without BOM), a missing CJK or Arabic font on the rendering machine, or a downstream renderer that ignores <meta charset> (some email clients re-decode with the recipient's system code page - in that case ship Content-Type: text/html; charset=utf-8 in the MIME header as well).
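If you deliver the document by email, the standard System.Net.Mail types let you declare the charset in the MIME header explicitly - a minimal sketch:

```csharp
using System.Net.Mail;
using System.Net.Mime;
using System.Text;

var message = new MailMessage();

// CreateAlternateViewFromString with an explicit encoding emits
// "Content-Type: text/html; charset=utf-8" in the MIME part header,
// so clients that ignore <meta charset> still decode correctly.
var htmlView = AlternateView.CreateAlternateViewFromString(
    htmlEditor1.DocumentHtml,
    Encoding.UTF8,
    MediaTypeNames.Text.Html);
message.AlternateViews.Add(htmlView);
```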

Last updated on May 14, 2026