Last entry, I hope. This version got the right results on files of all the standard encodings (ANSI, plus UTF8 and UTF16 LE or BE, each with and without a BOM), and it properly rejects UTF32 LE and BE because PB can't open them anyway.
It uses the Windows API function IsTextUnicode that Roland suggested, called with two different constant arguments (to distinguish UTF16 LE from the less likely UTF16 BE). The declaration and the constants are defined on gnv_environment:
Function long IsTextUnicode ( &
    ref blob lpv, &
    long iSize, &
    ref long lpiResult &
    ) Library "advapi32.dll"
Constant Long IS_TEXT_UNICODE_STATISTICS = 2 // 0x0002
Constant Long IS_TEXT_UNICODE_REVERSE_STATISTICS = 32 // 0x0020
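The third argument is both input and output: it carries the requested test in and the outcome back out, so it has to be reloaded before every call. A minimal sketch of that pattern follows. The helper name of_passes_test is hypothetical (it is not part of the original code), it assumes it lives on the same object as the declaration above, and the "non-zero means passed" reading follows the observation in the comments of the function further down rather than anything guaranteed by the API docs.

```powerscript
function boolean of_passes_test(REF blob ablb_data, long al_test)
// Hypothetical helper: run a single IsTextUnicode test against ablb_data.
// al_test is one of the IS_TEXT_UNICODE_* constants declared above.
long ll_result

ll_result = al_test    // in: the one test we want run
IsTextUnicode(ablb_data, Len(ablb_data), ll_result)    // Len() of a blob is its size in bytes
return ll_result > 0   // out: assumed non-zero when the requested test passed
```

With something like this, the two checks in the function below would read as of_passes_test(lblbBytes, IS_TEXT_UNICODE_STATISTICS) and of_passes_test(lblbBytes, IS_TEXT_UNICODE_REVERSE_STATISTICS).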
function boolean gf_read_text_file(REF string as_result, string as_filename, string as_description)
// Read the entire contents of the text file as_filename into as_result.
// On error, give messages based on the description of the file (as_description), and return FALSE.
// On success, return TRUE.
int li_file
long ll_bytes, ll_result
blob lblbBytes
byte lbBytes[]
Encoding lEncoding
String lsTemp
// We have encountered some files recently that are UTF16 LE with no BOM (Byte Order Mark),
// so examine the file as binary and figure some things out first, then convert it to the correct encoding!
li_file = FileOpen(as_filename, StreamMode!, Read!, Shared!)
if li_file = -1 then
    gMsg.Show("Cannot open " + as_description + " " + as_filename + ":~n~n" + &
        gnv_environment.uf_last_error_message(), Exclamation!)
    return FALSE
end if
ll_bytes = FileReadEx(li_file, lblbBytes)
FileClose(li_file)
if ll_bytes = -100 then
    gMsg.Show("The " + as_description + " " + as_filename + " is empty.", Exclamation!)
    return FALSE
elseif ll_bytes = -1 then
    gMsg.Show("Cannot read " + as_description + " " + as_filename + ":~n~n" + &
        gnv_environment.uf_last_error_message(), Exclamation!)
    return FALSE
end if
// Check for UTF32 BOMs, which the PB function FileEncoding won't detect, and which PB cannot read.
// A UTF32 BOM is 4 bytes long, so only look when the file has at least 4 bytes
// (this also avoids indexing past the end of lbBytes on very short files).
if ll_bytes >= 4 then
    lbBytes = GetByteArray(BlobMid(lblbBytes, 1, 4))
    if (lbBytes[1] = 0 and lbBytes[2] = 0 and lbBytes[3] = 254 and lbBytes[4] = 255) or &
        (lbBytes[1] = 255 and lbBytes[2] = 254 and lbBytes[3] = 0 and lbBytes[4] = 0) &
    then
        gMsg.Show("The " + as_description + " " + as_filename + " is in an encoding that " + &
            "the program cannot read: UTF32.", Exclamation!)
        return FALSE
    end if
end if
// Check whether the file has a BOM showing its encoding
lEncoding = FileEncoding(as_filename) // This returns EncodingAnsi! for anything without a BOM
if lEncoding = EncodingAnsi! then
    // The determination that it was Ansi is not reliable: all files without a BOM return that.
    if Mod(ll_bytes, 2) = 0 then
        // Try the IsTextUnicode function. It is only tried for an even number of bytes, and it only seems to detect the UTF16 types.
        // Note: The Windows API docs at https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-istextunicode
        // say it returns a BOOL (non-zero or 0) for whether the buffer passes the indicated tests, with ll_result also set to show pass or fail.
        // My experience is different: for some files with no BOM the function returns 0, but ll_result is set to non-0,
        // which I take to indicate it's that file type.
        ll_result = gnv_environment.IS_TEXT_UNICODE_STATISTICS
        gnv_environment.IsTextUnicode(lblbBytes, ll_bytes, ll_result)
        if ll_result > 0 then
            lEncoding = EncodingUTF16LE!
        else
            ll_result = gnv_environment.IS_TEXT_UNICODE_REVERSE_STATISTICS
            gnv_environment.IsTextUnicode(lblbBytes, ll_bytes, ll_result)
            if ll_result > 0 then
                lEncoding = EncodingUTF16BE!
            end if
        end if
    end if
end if
if lEncoding = EncodingAnsi! then
    // It could still be UTF8 with no BOM: try converting it as UTF8 and see whether the result is shorter!
    // If it had any non-ASCII characters it would be, because there would be at least one sequence of 2 or more bytes
    // representing a single Unicode character. If it's the same length, it must be all ASCII.
    // Example: "café" is 5 bytes in UTF8 (63 61 66 C3 A9) but only 4 characters once converted.
    lsTemp = String(lblbBytes, EncodingUTF8!)
    if Len(lsTemp) < ll_bytes then
        lEncoding = EncodingUTF8! // a bit wasteful, since the blob is converted again just below
    end if
end if
as_result = String(lblbBytes, lEncoding)
return TRUE
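
For completeness, a short sketch of how the function might be called. The file path, the description text and the mle_contents control are placeholders made up for illustration, not names from the original code.

```powerscript
string ls_text

// "C:\data\customers.txt" and "customer list" are placeholder values.
if gf_read_text_file(ls_text, "C:\data\customers.txt", "customer list") then
    // ls_text now holds the whole file as a normal PB string,
    // whichever of the supported encodings the file was saved in.
    mle_contents.Text = ls_text    // mle_contents: hypothetical MultiLineEdit on the calling window
end if
// Nothing more is needed when it returns FALSE: the function has already shown its own message.
```

The description argument only feeds the error messages, so it reads best as a lower-case noun phrase that fits after "Cannot open" or "The".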