Here is the source code for a function I wrote many moons ago as an exercise for learning about Byte Order Marks (BOMs):
public function integer of_getfileencoding (string as_filename, ref encoding as_encoding);
// Determines how a file is encoded by examining the Byte Order Mark (BOM) at
// the beginning of a file. The start of the file has to be read in stream mode,
// otherwise the system skips over the BOM.
//
// There are five BOMs, plus the no-BOM case:
// 1. UTF-32 Big Endian (BE)     x0000FEFF  (byte values 0,0,254,255)   Not supported by PB
// 2. UTF-32 Little Endian (LE)  xFFFE0000  (byte values 255,254,0,0)   Not supported by PB
// 3. UTF-16 Big Endian (BE)     xFEFF      (byte values 254,255)       Recognized by PB
// 4. UTF-16 Little Endian (LE)  xFFFE      (byte values 255,254)       Recognized by PB (default for PB10 & higher)
// 5. UTF-8                      xEFBBBF    (byte values 239,187,191)   Recognized by PB
// 6. ANSI                       Any byte sequence not listed above     Recognized by PB
//
// Arguments:
// String as_filename The path, name & extension of the file to be examined.
// Encoding as_encoding [passed by reference] The encoding used by the file.
//
// Returns: Integer
// RC = 1 -> Successfully determined the file's encoding.
// RC = -1 -> Error, or encoding not supported; the Encoding argument (passed by reference) is null.
Integer li_filenum
Long ll_bytesread
Byte lbyte[]
Blob lblob

li_filenum = -1
SetNull(as_encoding)

if not FileExists(as_filename) then
    Return -1
end if

// Open the file to be examined in Stream Mode.
li_filenum = FileOpen(as_filename, StreamMode!, Read!, Shared!)
if li_filenum = -1 then
    Return -1
end if

// Read the first four bytes of the file (where the BOM resides) into a blob.
ll_bytesread = FileReadEx(li_filenum, lblob, 4)
FileClose(li_filenum)

// Files shorter than four bytes are rejected. This also keeps the
// lbyte[3] and lbyte[4] tests below within the array bounds.
if ll_bytesread < 4 then
    Return -1
end if

// Copy the four bytes in the blob into a byte array for easy examination.
lbyte = GetByteArray(lblob)

// Does the file begin with a recognized BOM? The UTF-32 LE test must precede
// the UTF-16 LE test because both BOMs start with bytes 255,254.
if lbyte[1] = 0 and lbyte[2] = 0 and lbyte[3] = 254 and lbyte[4] = 255 then
    Return -1 // UTF-32 BE not supported by PB
elseif lbyte[1] = 255 and lbyte[2] = 254 and lbyte[3] = 0 and lbyte[4] = 0 then
    Return -1 // UTF-32 LE not supported by PB
elseif lbyte[1] = 254 and lbyte[2] = 255 then
    as_encoding = EncodingUTF16BE!
elseif lbyte[1] = 255 and lbyte[2] = 254 then
    as_encoding = EncodingUTF16LE!
elseif lbyte[1] = 239 and lbyte[2] = 187 and lbyte[3] = 191 then
    as_encoding = EncodingUTF8!
else
    // No recognizable BOM, so this file is treated as ANSI encoded.
    as_encoding = EncodingANSI!
end if

Return 1
end function
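A minimal sketch of how the function might be called. The path in ls_path and the message text are placeholders for illustration, not part of the function above.

```powerscript
// Hypothetical caller: ls_path and the messages are made up for this example.
Integer li_rc
String ls_path
Encoding le_encoding

ls_path = "C:\temp\sample.txt"
li_rc = of_getfileencoding(ls_path, le_encoding)

if li_rc = 1 then
    choose case le_encoding
        case EncodingUTF8!
            MessageBox("Encoding", "UTF-8 (BOM present)")
        case EncodingUTF16LE!
            MessageBox("Encoding", "UTF-16 LE (BOM present)")
        case EncodingUTF16BE!
            MessageBox("Encoding", "UTF-16 BE (BOM present)")
        case EncodingANSI!
            MessageBox("Encoding", "No BOM found, treated as ANSI")
    end choose
else
    // -1 covers: file missing, open failed, fewer than 4 bytes, or a UTF-32 BOM.
    MessageBox("Encoding", "Encoding could not be determined")
end if
```

Note that a UTF-8 file saved without a BOM falls into the EncodingANSI! branch, because the function only looks at the first four bytes.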
HTH, John
@Michael: the conversion String(lblobData, EncodingUTF8!) is what I actually do. It works if the file is UTF-8, but it returns Asian characters if the file is ANSI.
I read the file in stream mode with FileOpen(ls_file, StreamMode!) (FileOpen with EncodingUTF8! returns -1).
Then I tried IsTextUnicode, but it always returns false.
ll_ret = filereadex(li_file,bdata)
istextunicode(bdata,ll_ret,ll_null)
Am I missing something?
I assume you need something similar to IS_TEXT_UNICODE_UNICODE_MASK (combo of multiple options).
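To make the mask idea concrete, here is a rough sketch. The external function declaration is my assumption of how the Win32 signature (buffer pointer, int size, pointer to int flags) maps to PowerBuilder types, so verify it in your environment. The value 15 is IS_TEXT_UNICODE_UNICODE_MASK (0x000F) from the Windows headers. ls_file is a placeholder.

```powerscript
// Assumed declaration, to be placed in Local/Global External Functions:
// FUNCTION boolean IsTextUnicode (blob lpv, long iSize, ref long lpiResult) LIBRARY "advapi32.dll"

Constant Long IS_TEXT_UNICODE_UNICODE_MASK = 15   // 0x000F
Integer li_file
Long ll_len, ll_flags
Blob lblb_data
Boolean lb_unicode
String ls_file

ls_file = "C:\temp\sample.txt"   // placeholder path
li_file = FileOpen(ls_file, StreamMode!, Read!, Shared!)
if li_file <> -1 then
    ll_len = FileReadEx(li_file, lblb_data)
    FileClose(li_file)

    // On input the flags select which tests to run; on output they hold
    // the tests that passed. Start from the mask instead of a null value.
    ll_flags = IS_TEXT_UNICODE_UNICODE_MASK
    lb_unicode = IsTextUnicode(lblb_data, ll_len, ll_flags)
end if
```

Keep in mind that "Unicode" in this API means UTF-16. A UTF-8 or ANSI buffer is expected to come back false either way, so this call cannot tell those two apart.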
I would probably agree completely if you had added "as implemented by PowerBuilder" to the end of these two sentences: "...wanting to interpret a text file as a different encoding that its BOM indicates." and "you will have difficulty in PowerBuilder if you insist to interpret a text file in conflict with its BOM marker."
I too went through a similar exercise, and then also looked at the files with a hex editor to try to understand it further. The text editors I use today (Notepad, Notepad++, Brackets) default to saving a file as UTF-8 without a BOM, so maybe there is an opportunity here for some changes to PowerBuilder (e.g. adding UTF-8 without BOM as an encoding type, or defaulting to UTF-8 instead of ANSI when no encoding is specified). UTF-8 without a BOM is an accepted encoding, and text editors detect it successfully most of the time (as Roland pointed out, it's still not guaranteed). So is PowerBuilder relying only on the presence of a BOM, even though the absence of one is widely accepted?
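For what it's worth, the kind of check an editor might do on a BOM-less file can be sketched in PowerScript. This is only my illustration, not how PowerBuilder or any particular editor works. The function name of_looksutf8 is hypothetical, and it tests lead/continuation byte patterns only (it does not reject every overlong or surrogate form).

```powerscript
public function boolean of_looksutf8 (blob ablb_data);
// Hypothetical helper: returns true when the bytes follow UTF-8
// lead/continuation byte patterns. Pure ASCII data also passes.
Byte lbyte[]
Long ll_i, ll_len, ll_need

lbyte = GetByteArray(ablb_data)
ll_len = UpperBound(lbyte)
ll_i = 1

do while ll_i <= ll_len
    if lbyte[ll_i] < 128 then
        ll_need = 0          // single byte (ASCII)
    elseif lbyte[ll_i] >= 194 and lbyte[ll_i] <= 223 then
        ll_need = 1          // lead byte of a 2-byte sequence
    elseif lbyte[ll_i] >= 224 and lbyte[ll_i] <= 239 then
        ll_need = 2          // lead byte of a 3-byte sequence
    elseif lbyte[ll_i] >= 240 and lbyte[ll_i] <= 244 then
        ll_need = 3          // lead byte of a 4-byte sequence
    else
        Return false         // stray continuation byte or invalid lead byte
    end if

    if ll_i + ll_need > ll_len then Return false   // sequence cut off at end of data

    do while ll_need > 0
        ll_i ++
        // Continuation bytes must be in the range 128..191.
        if lbyte[ll_i] < 128 or lbyte[ll_i] > 191 then Return false
        ll_need --
    loop
    ll_i ++
loop

Return true
end function
```

A caller might run this only when no BOM was found, and convert with String(lblob, EncodingUTF8!) when it returns true. It remains a heuristic: some ANSI byte sequences happen to be valid UTF-8 too, which is why detection is not guaranteed.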
Some additional info that I found helpful for understanding the Unicode specs is available at https://home.unicode.org/ and specifically at https://unicode.org/faq/utf_bom.html.
I've also commented here: https://community.appeon.com/index.php/qna/q-a/identify-right-encoding-for-string
Regards,