June 19, 2010 / windperson

C#: Converting a plain text string to a Unicode entity-encoded string

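I wrote this function because some foreign software (e.g. ArcGIS Server) goes haywire when it hits Chinese text while processing XML data files, even when the XML is explicitly declared as UTF-8. The Chinese content still has to be in there, so the only way out is to express the original multi-byte Chinese strings as XML character entities.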

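string GetUnicodeEntityString(string source) // needs: using System; using System.Text;
{
    StringBuilder sb = new StringBuilder();
    foreach (char c in source) // one UTF-16 code unit (char) at a time
    {
        byte[] unicodeBytes = Encoding.Unicode.GetBytes(c.ToString()); // UTF-16LE: low byte first
        sb.Append("&#x");
        for (int i = unicodeBytes.Length - 1; i >= 0; i--) // walk backwards so the high byte comes first
        {
            sb.Append(Convert.ToString(unicodeBytes[i], 16).PadLeft(2, '0')); // always two hex digits per byte
        }
        sb.Append(";");
    }
    return sb.ToString();
}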

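The tricks are on lines 6 and 10. I originally assumed Encoding.UTF8.GetBytes() was the call to use, but it turns out that the byte array produced by Encoding.Unicode.GetBytes() is what gives the correct XML character entity value. Encoding.Unicode is little-endian UTF-16, which is why the loop walks the bytes in reverse. Also, when printing the byte array as hex, if you leave out the PadLeft(2, '0') on line 10, a character like 「對」, whose encoding is "&#x5c0d;", comes out of ToString() alone as "&#x5cd;", which is wrong.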
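To make the UTF-8 versus UTF-16 difference concrete, here is a minimal sketch that dumps the bytes each encoding produces. The class name EncodingComparison and its helper methods are made up for this illustration and are not part of the original function.

```csharp
using System;
using System.Text;

// Hypothetical helper class, only for comparing the two encodings.
static class EncodingComparison
{
    // Joins bytes as two-digit lowercase hex separated by spaces.
    public static string ToHex(byte[] bytes)
    {
        StringBuilder sb = new StringBuilder();
        foreach (byte b in bytes)
        {
            if (sb.Length > 0)
            {
                sb.Append(' ');
            }
            sb.Append(b.ToString("x2")); // "x2" pads to two digits, like PadLeft(2, '0')
        }
        return sb.ToString();
    }

    // Bytes of the string in UTF-8 (what I first tried).
    public static string Utf8Hex(string s)
    {
        return ToHex(Encoding.UTF8.GetBytes(s));
    }

    // Bytes of the string in little-endian UTF-16 (Encoding.Unicode).
    public static string Utf16Hex(string s)
    {
        return ToHex(Encoding.Unicode.GetBytes(s));
    }
}
```

For 「對」 (U+5C0D), Utf8Hex returns "e5 b0 8d", which has nothing to do with the entity value, while Utf16Hex returns "0d 5c", which read backwards is 5c0d. For a single char c, ((int)c).ToString("x4") yields the same four hex digits without going through a byte array at all.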

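Really, the function ought to check whether each character is ASCII and only produce a character entity when it is not, so the whole XML doesn't fill up with these overly long strings. But I'm lazy, so everything just gets encoded the same way.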
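For the non-lazy version, something along these lines might do. This is a sketch under assumptions, not the function I actually use: EntityEncoder and EncodeNonAscii are hypothetical names, ASCII is passed through untouched (so characters like & and < still need normal XML escaping elsewhere), and surrogate pairs are combined into one code point, which the per-char loop above cannot do.

```csharp
using System;
using System.Text;

// Hypothetical variant: only non-ASCII characters become entities.
static class EntityEncoder
{
    public static string EncodeNonAscii(string source)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < source.Length; i++)
        {
            char c = source[i];
            if (c < 0x80)
            {
                // ASCII: copy as-is (no XML escaping is done here).
                sb.Append(c);
                continue;
            }

            // Combines a surrogate pair into one code point;
            // throws ArgumentException on an unpaired surrogate.
            int codePoint = char.ConvertToUtf32(source, i);
            if (char.IsHighSurrogate(c))
            {
                i++; // skip the low surrogate that was just consumed
            }

            sb.Append("&#x").Append(codePoint.ToString("x")).Append(';');
        }
        return sb.ToString();
    }
}
```

With this, EncodeNonAscii("A對") gives "A&#x5c0d;" rather than "&#x0041;&#x5c0d;". Leading zeros are not padded here, since a numeric character reference does not need them.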

Reference:
http://www.opentag.com/xfaq_charrep.htm#char_ncr
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
http://www.fileformat.info/info/unicode/char/search.htm
