Convert To UTF-8

 Keywords: ChrSetCodepage ChrStringToHex ChrStringToUnicode ChrUnicodeToHex ChrUnicodeToString VarType ANSI UTF-8

Question:

There is a large UTF-8 text file with a matching BOM that should be split into smaller parts. There is no need to work with the content, just to read and write it. How should FileOpen, FileRead and FileWrite be configured?
There is a large UTF-16 LE or BE text file with a matching BOM that should be split into smaller parts. The output files should be written in UTF-8 encoding with a BOM. How should FileOpen, FileRead and FileWrite be configured?

Is ChrSetCodePage needed, and if so, how should it be used?

Answer:

In WinBatch, the definition of a "Unicode" string is that it contains UTF16-LE encoded Unicode characters, and is terminated by a corresponding UTF16-LE NULL character, which is represented by 2 NULL bytes.

The definition of an ANSI string, or a "narrow" string, is any string that uses either a single-byte or a variable-length multi-byte encoding, with a corresponding code page that dictates which characters are associated with which values, and a single NULL byte terminator at the end of the string.

UTF8 strings in WinBatch use the Unicode character set, but UTF8 is a variable-length multi-byte encoding, so they fall under the category of ANSI or "narrow" strings.
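To make the "narrow" versus "Unicode" distinction concrete, the following sketch counts how many bytes each character occupies in UTF-8 and in UTF-16LE, and shows the width of the NULL terminator in each case. It is written in Python purely as a neutral way to inspect bytes; it is not WinBatch code, and the helper name byte_widths is hypothetical.

```python
def byte_widths(text, encoding):
    """Return the number of bytes each character occupies in the given encoding."""
    return [len(ch.encode(encoding)) for ch in text]


sample = "H\u00f6 \u20ac"  # the four characters H, o-umlaut, space, euro sign

# UTF-8 is variable length: ASCII takes 1 byte, o-umlaut 2 bytes, the euro sign 3 bytes.
utf8_widths = byte_widths(sample, "utf-8")

# UTF-16LE uses 2 bytes for each of these characters.
utf16le_widths = byte_widths(sample, "utf-16-le")

# A narrow string ends with one NULL byte, a UTF-16LE string with two.
narrow_terminator = "\0".encode("utf-8")
wide_terminator = "\0".encode("utf-16-le")

print(utf8_widths)                                   # [1, 2, 1, 3]
print(utf16le_widths)                                # [2, 2, 2, 2]
print(len(narrow_terminator), len(wide_terminator))  # 1 2
```

Because the byte count per character varies in UTF-8, a UTF-8 string cannot be treated as an array of fixed-width characters, which is why it is handled like any other multi-byte "narrow" string.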

In practice, calling the ChrSetCodePage() function with a code page value of 65001 selects UTF8 as the combination of encoding and character set for ANSI / "narrow" strings within WinBatch. Any conversion between a "Unicode" string and a "narrow" string then retains the Unicode character set's characters in the strings, transcoding them between UTF16-LE and UTF8.

; Example for text conversion from ANSI to Unicode UTF-16 to Unicode UTF-8 and back.
; DD.20120714.

; Convert ANSI to UTF-8.

strANSI     = "® € þæñ Höllë"                  ; "® € þæñ Höllë"
strHex      = ChrStringToHex (strANSI)         ; "AE208020FEE6F12048F66C6CEB"
intVarType  = VarType (strANSI)                ; 2 string

strUTF16LE  = ChrStringToUnicode (strANSI)     ; "® € þæñ Höllë"
strHex      = ChrUnicodeToHex (strUTF16LE)     ; "AE002000AC202000FE00E600F10020004800F6006C006C00EB00"
intVarType  = VarType (strUTF16LE)             ; 128 LPWSTR or "Unicode"

ChrSetCodepage (65001)                         ; 65001 Translate using UTF-8

strUTF8     = ChrUnicodeToString (strUTF16LE)  ; "Â® â‚¬ Ã¾Ã¦Ã± HÃ¶llÃ«" (UTF-8 bytes displayed as ANSI)
strHex      = ChrStringToHex (strUTF8)         ; "C2AE20E282AC20C3BEC3A6C3B12048C3B66C6CC3AB"
intVarType  = VarType (strUTF8)                ; 2 string


; Convert UTF-8 to ANSI.

strUTF8_    = "Â® â‚¬ Ã¾Ã¦Ã± HÃ¶llÃ«"          ; "Â® â‚¬ Ã¾Ã¦Ã± HÃ¶llÃ«" (UTF-8 bytes written as an ANSI literal)
strHex      = ChrStringToHex (strUTF8_)        ; "C2AE20E282AC20C3BEC3A6C3B12048C3B66C6CC3AB"
intVarType  = VarType (strUTF8_)               ; 2 string

strUTF16LE_ = ChrStringToUnicode (strUTF8_)    ; "® € þæñ Höllë"
strHex      = ChrUnicodeToHex (strUTF16LE_)    ; "AE002000AC202000FE00E600F10020004800F6006C006C00EB00"
intVarType  = VarType (strUTF16LE_)            ; 128 LPWSTR or "Unicode"

ChrSetCodepage (0)                             ; 0 ANSI code page

strANSI_    = ChrUnicodeToString (strUTF16LE_) ; "® € þæñ Höllë"
strHex      = ChrStringToHex (strANSI_)        ; "AE208020FEE6F12048F66C6CEB"
intVarType  = VarType (strANSI_)               ; 2 string

Exit
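
The hex dumps in the comments above can be cross-checked outside WinBatch. The following Python sketch assumes that the system ANSI code page is Windows-1252. The source bytes, with 0x80 for the euro sign, are consistent with that, but it remains an assumption. The helper name hex_of is hypothetical. The sketch also shows why the UTF-8 result reads as "Â® â‚¬ ..." when its bytes are displayed as ANSI text.

```python
def hex_of(text, encoding):
    """Encode text and return the resulting bytes as an uppercase hex string."""
    return text.encode(encoding).hex().upper()


# The sample string from the script, written with escapes to keep this file ASCII-only.
text = "\u00ae \u20ac \u00fe\u00e6\u00f1 H\u00f6ll\u00eb"

ansi_hex = hex_of(text, "cp1252")        # assumed ANSI code page: Windows-1252
utf16le_hex = hex_of(text, "utf-16-le")
utf8_hex = hex_of(text, "utf-8")

# Reading the UTF-8 bytes as Windows-1252 yields the garbled-looking text in the comments.
seen_as_ansi = text.encode("utf-8").decode("cp1252")

# The bytes are unchanged, so reading them as UTF-8 again recovers the original text.
round_trip = seen_as_ansi.encode("cp1252").decode("utf-8")

print(ansi_hex)      # AE208020FEE6F12048F66C6CEB
print(utf16le_hex)   # AE002000AC202000FE00E600F10020004800F6006C006C00EB00
print(utf8_hex)      # C2AE20E282AC20C3BEC3A6C3B12048C3B66C6CC3AB
print(round_trip == text)  # True
```

The garbled appearance is only a display effect: the bytes are valid UTF-8, and converting them with code page 65001 in effect recovers the original characters.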

Article ID:   W18286

Filename:   Convert To UTF-8.txt

File Created: 2012:07:16:08:21:02

Last Updated: 2012:07:16:08:21:02
