Opening Unicode files without BOMs

Resolved Opening Unicode files without BOMs

Daniel Vivier
PowerBuilder
Thursday, 12 May 2022 20:56 PM UTC

I have encountered a situation where my application is trying to open a file that turns out to be UTF16-LE with no BOM. As documented, PB's FileOpen function won't detect that and thus reads the file incorrectly.

But I could be reading files from various sources, some of which are proper Unicode files with BOMs and some of which are missing the expected BOMs, so I certainly can't just assume a file is always UTF16-LE. It could be ANSI, UTF8 (with or without BOM), or either UTF16-LE or UTF16-BE (with or without BOM).

I don't suppose anyone has any code they can share that detects the type and somehow reads any of those files correctly?

I'm also contacting the vendor supplying that file to see whether they can or will do anything about it, but I don't hold out high hopes, because they have thousands of users (at least) and I'm sure they would think any change might break things for other users.

Responses (7)

John Fauss

Friday, 13 May 2022 04:04 AM UTC
PowerBuilder
# 1

Can't say I'm surprised the FileEncoding function does not handle all cases.

Sometime around 2017 I experimented with trying to identify the encoding used in a file by examining the bytes at the start of a file in a manner much like what you've described, Dan. At the time, I was not concerned with trying to identify the encoding used in files that lack a BOM. I'm including below the code I used to do this just so you can see what I did... I hope it helps you a little.

//public function integer of_getfileencoding (string as_filename, ref encoding as_encoding)
//
// Determines how a file is encoded by examining the Byte Order Mark (BOM) at
// the beginning of a file. The start of the file has to be read in stream mode,
// otherwise the system skips over the BOM.
//
// There are five BOMs; a file that starts with none of them is treated as ANSI:
// 1. UTF 32 Big Endian (BE)    x0000FEFF (byte values 0,0,254,255) Not supported by PB
// 2. UTF 32 Little Endian (LE) xFFFE0000 (byte values 255,254,0,0) Not supported by PB
// 3. UTF 16 Big Endian (BE)    xFEFF     (byte values 254,255)     Recognized by PB
// 4. UTF 16 Little Endian (LE) xFFFE     (byte values 255,254)     Recognized by PB (default for PB10 & higher)
// 5. UTF  8                    xEFBBBF   (byte values 239,187,191) Recognized by PB
// 6. ANSI (no BOM)             Any byte sequence not listed above  Recognized by PB
//
// RC=1 -> Successful determination of the file's encoding.
// RC=-1 & Encoding argument (passed by reference) null if error or not supported.

Integer  li_filenum
Long     ll_bytesread
Byte     lbyte[]
Blob     lblob

li_filenum = -1
SetNull(as_encoding)

if not FileExists(as_filename) then
   Return -1
end if

// Open the file to be examined in Stream Mode.
li_filenum = FileOpen(as_filename, StreamMode!, Read!, Shared!)
if li_filenum = -1 then
   Return -1
end if

// Read the first four bytes of the file (where the BOM resides) into a blob.
ll_bytesread = FileReadEx(li_filenum, lblob, 4)
FileClose(li_filenum)

if ll_bytesread < 4 then
   Return -1
end if

// Copy the four bytes in the blob into a byte array for easy examination.
lbyte = GetByteArray(lblob)

// Does the file begin with a recognized BOM?
if lbyte[1] = 0 and lbyte[2] = 0 and lbyte[3] = 254 and lbyte[4] = 255 then
   Return -1    // UTF 32 BE not supported by PB
elseif lbyte[1] = 255 and lbyte[2] = 254 and lbyte[3] = 0 and lbyte[4] = 0 then
   Return -1    // UTF 32 LE not supported by PB
elseif lbyte[1] = 254 and lbyte[2] = 255 then
   as_encoding = EncodingUTF16BE!
elseif lbyte[1] = 255 and lbyte[2] = 254 then
   as_encoding = EncodingUTF16LE!
elseif lbyte[1] = 239 and lbyte[2] = 187 and lbyte[3] = 191 then
   as_encoding = EncodingUTF8!
else
   // No recognizable BOM, so assume ANSI (BOM-less UTF-8 and UTF-16 files also end up here).
   as_encoding = EncodingANSI!
end if

Return 1

From doing a web search, I found this post in the StackOverflow forum:

https://stackoverflow.com/questions/61817928/how-to-identify-file-format-of-a-file-with-out-bom-utf8-no-bom-or-ansi

Good luck!

Daniel Vivier

Friday, 13 May 2022 21:06 PM UTC
PowerBuilder
# 2

OK, I think the following function is the best I can do under the circumstances, and it definitely solves the problem of UTF16 files with no BOM.

boolean gf_read_text_file(ref string as_result, string as_filename, string as_description)
// Read the entire contents of the text file as_filename into as_result.
// On error, give messages based on the description of the file (as_description), and return FALSE
// On success, return TRUE

int li_file
long ll_bytes
blob lblbBytes
byte lbBytes[]
Encoding lEncoding = EncodingAnsi! // just any default other than a UTF16 one

// We have encountered some files recently that are UTF16 LE with no BOM (Byte Order Mark),
// so examine the file as binary first and work out its encoding before reading it as text.
li_file = FileOpen(as_filename, StreamMode!, Read!, Shared!)
if li_file = -1 then
	gMsg.Show("Cannot open " + as_description + " " + as_filename + ":~n~n" + &
					gnv_environment.uf_last_error_message(), Exclamation!)
	return FALSE
end if

ll_bytes = FileReadEx(li_file, lblbBytes, 4)

if ll_bytes <> 4 then
	FileClose(li_file)
	gMsg.Show("Cannot read " + as_description + " " + as_filename + ", or it has fewer than 4 bytes in it:~n~n" + &
					gnv_environment.uf_last_error_message(), Exclamation!)
	return FALSE
end if

lbBytes = GetByteArray(lblbBytes)
// Check for UTF32 BOMs
if (lbBytes[1] = 0 and lbBytes[2] = 0 and lbBytes[3] = 254 and lbBytes[4] = 255) or &
   (lbBytes[1] = 255 and lbBytes[2] = 254 and lbBytes[3] = 0 and lbBytes[4] = 0) &
then
	FileClose(li_file)
	gMsg.Show("The " + as_description + " " + as_filename + " is in an encoding that " + &
					"the program cannot read: UTF32.", Exclamation!)
	return FALSE
end if

// Check for UTF16 with no BOM
if lbBytes[1] = 0 and lbBytes[2] <> 0 and lbBytes[3] = 0 and lbBytes[4] <> 0 then
	lEncoding = EncodingUTF16BE!
elseif lbBytes[1] <> 0 and lbBytes[2] = 0 and lbBytes[3] <> 0 and lbBytes[4] = 0 then
	lEncoding = EncodingUTF16LE!
end if
if lEncoding <> EncodingAnsi! then
	// Read and convert it to string, hopefully will work despite no BOM
	FileSeek64(li_file, 0, FromBeginning!) // position reading back to start
	ll_bytes = FileReadEx(li_file, lblbBytes)
	FileClose(li_file)
	as_result = String(lblbBytes, lEncoding)
	return TRUE
end if

FileClose(li_file)

// If we fell through to here hopefully normal TextMode reading will work!

li_file = FileOpen(as_filename, TextMode!, Read!, Shared!)
if li_file = -1 then
	gMsg.Show("Cannot read from " + as_description + " " + as_filename + ":~n~n" + &
					gnv_environment.uf_last_error_message(), Exclamation!)
	return FALSE
end if

ll_bytes = FileReadEx(li_file, as_result)
FileClose(li_file)
if ll_bytes = -100 then // empty file
	gMsg.Show("Cannot read from " + as_description + "  " + as_filename + ":~n~n" + &
					"The file is empty.", Exclamation!)
	return FALSE
elseif ll_bytes < 0 then
	gMsg.Show("Cannot read from " + as_description + "  " + as_filename + ":~n~n" + &
					gnv_environment.uf_last_error_message(), Exclamation!)
	return FALSE
end if

return TRUE

Obviously this includes some internal stuff, like my MessageBox replacement (that handles simple HTML like <b> for bold etc., and defaults to a larger font!) and gnv_environment.uf_last_error_message (that gets the last Windows error message text) but it would be easy to replace that stuff if you need code like this.

Comment

Daniel Vivier
Friday, 13 May 2022 21:08 PM UTC

P.S. Thanks to John for the code with the BOM detection! Saved me looking those codes up.

John Fauss
Saturday, 14 May 2022 02:07 AM UTC

I'm glad to hear you have found a way to address the issue you were facing, Dan! Congratulations, and thank you for sharing your code with us.

Roland Smith

Saturday, 14 May 2022 00:32 AM UTC
PowerBuilder
# 3

Try the IsTextUnicode function. Read the first 256 bytes into a blob.

Function boolean IsTextUnicode ( &
ref blob lpv, &
long iSize, &
ref long lpiResult &
) Library "advapi32.dll"

Constant Long IS_TEXT_UNICODE_STATISTICS = 2

ll_result = IS_TEXT_UNICODE_STATISTICS
If IsTextUnicode(lblob_contents, ll_bytes, ll_result) Then
// File is Unicode
End If

My PBEditor example uses it, see of_GetEncoding in n_appobject.

https://www.topwizprogramming.com/freecode_pbeditor.html

Comment

Daniel Vivier
Saturday, 14 May 2022 13:17 PM UTC

So Roland I've examined your sample code (and the Windows API docs for the function) and it appears that the point of this is to detect Unicode text with no BOM, because you previously test for the BOMs for the two UTF16 variants and UTF8.

But I'm curious as to why when the function passes, you assume it is EncodingUTF16LE!. It seems to me that my test (checking for non-0 bytes in positions 1 and 3, and 0 bytes in positions 2 and 4, or vice versa) is pretty well as good for that, given my assumption that most of the text is likely to be plain English characters (which is probably true in my case). And the problem of distinguishing UTF8 with no BOM is still unsolved (although from the API docs I tend to think this function call is better at determining that than UTF16LE!).
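
The four-byte zero test described here can be made a little less brittle by sampling more of the file. What follows is only a sketch under the same assumption of mostly Latin text. The function name of_guess_utf16, the 256-byte sample size, and the 90% threshold are illustrative choices, not anything from PB or from the code in this thread. It counts, per byte pair, which position holds the zero byte.

```powerscript
// public function integer of_guess_utf16 (blob ablb_data, ref encoding ae_encoding)
//
// Hypothetical helper (not a PB built-in). Guesses the byte order of BOM-less
// UTF-16 data from where the zero bytes fall in the first 256 bytes.
// Assumes mostly Latin text, so one byte of nearly every pair is zero.
//
// RC=1  -> ae_encoding set to EncodingUTF16LE! or EncodingUTF16BE!
// RC=0  -> no clear zero-byte pattern; ae_encoding is null
// RC=-1 -> null blob or fewer than 4 bytes supplied; ae_encoding is null

Byte lb_sample[]
Long ll_i, ll_count, ll_pairs, ll_zero_first, ll_zero_second

SetNull(ae_encoding)

if IsNull(ablb_data) then
   Return -1
end if
if Len(ablb_data) < 4 then
   Return -1
end if

lb_sample = GetByteArray(BlobMid(ablb_data, 1, 256))
ll_count = UpperBound(lb_sample)

// Walk the sample one byte pair at a time.
for ll_i = 1 to ll_count - 1 step 2
   ll_pairs = ll_pairs + 1
   if lb_sample[ll_i] = 0 and lb_sample[ll_i + 1] <> 0 then
      ll_zero_first = ll_zero_first + 1     // zero byte first: big endian
   elseif lb_sample[ll_i] <> 0 and lb_sample[ll_i + 1] = 0 then
      ll_zero_second = ll_zero_second + 1   // zero byte second: little endian
   end if
next

// Illustrative threshold: at least 90% of the pairs must follow one pattern.
if ll_zero_second * 10 >= ll_pairs * 9 then
   ae_encoding = EncodingUTF16LE!
   Return 1
elseif ll_zero_first * 10 >= ll_pairs * 9 then
   ae_encoding = EncodingUTF16BE!
   Return 1
end if

Return 0
```

A file of pure ASCII text contains no zero bytes, so this returns 0 for it rather than guessing UTF-16. The trade-off is that UTF-16 text in scripts such as Chinese has few zero bytes and would also get 0, so a sketch like this only suits data that is known to be mostly Latin.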

Daniel Vivier

Thursday, 12 May 2022 21:17 PM UTC
PowerBuilder
# 4

I'm thinking of an algorithm something like this:

Open the file in Stream mode.
Read the first 10 bytes into a blob.
If the first 2 or 3 bytes are a normal BOM, just close the file, then re-open the file normally in Line or Text mode.
If the first and 3rd bytes are non-0 and the 2nd and 4th bytes are 0, it's UTF16-LE - read the whole file into a blob, convert it to String with EncodingUTF16LE!, then use the string.
If the 1st and 3rd bytes are 0 and the 2nd and 4th bytes are non-0, it's UTF16-BE - read the whole file into a blob, convert it to String with EncodingUTF16BE!, then use the string.

I would hope that would catch most of the possible problems, though it doesn't catch UTF8 without BOM.
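
The remaining gap, UTF8 without a BOM, can be narrowed by checking whether the bytes form structurally valid UTF-8 and whether any multi-byte sequence occurs at all. The following is a minimal sketch, not a complete validator: the function name of_is_valid_utf8 and its arguments are hypothetical, and it checks only the lead-byte and continuation-byte ranges (it does not reject every overlong or surrogate encoding). It uses range comparisons rather than bit operations.

```powerscript
// public function boolean of_is_valid_utf8 (blob ablb_data, ref boolean ab_has_multibyte)
//
// Hypothetical helper (not a PB built-in). Returns TRUE if every byte fits the
// UTF-8 pattern of lead byte plus continuation bytes. ab_has_multibyte is set
// to TRUE if at least one multi-byte sequence was seen (i.e. not pure ASCII).

Byte lb_data[]
Long ll_i, ll_k, ll_count, ll_need

ab_has_multibyte = false

if IsNull(ablb_data) then
   Return false
end if
if Len(ablb_data) = 0 then
   Return true
end if

lb_data = GetByteArray(ablb_data)
ll_count = UpperBound(lb_data)
ll_i = 1

do while ll_i <= ll_count
   // Decide how many continuation bytes the lead byte calls for.
   if lb_data[ll_i] < 128 then
      ll_need = 0                          // plain ASCII
   elseif lb_data[ll_i] >= 194 and lb_data[ll_i] <= 223 then
      ll_need = 1                          // xC2..xDF starts a 2-byte sequence
   elseif lb_data[ll_i] >= 224 and lb_data[ll_i] <= 239 then
      ll_need = 2                          // xE0..xEF starts a 3-byte sequence
   elseif lb_data[ll_i] >= 240 and lb_data[ll_i] <= 244 then
      ll_need = 3                          // xF0..xF4 starts a 4-byte sequence
   else
      Return false                         // stray continuation or invalid lead byte
   end if

   if ll_i + ll_need > ll_count then
      Return false                         // sequence is cut off at end of data
   end if

   // Every continuation byte must be in the range x80..xBF.
   for ll_k = 1 to ll_need
      if lb_data[ll_i + ll_k] < 128 or lb_data[ll_i + ll_k] > 191 then
         Return false
      end if
   next

   if ll_need > 0 then
      ab_has_multibyte = true
   end if
   ll_i = ll_i + ll_need + 1
loop

Return true
```

If this returns TRUE with ab_has_multibyte set, converting with String(blob, EncodingUTF8!) is a reasonable choice. If it returns FALSE, the file is more likely ANSI containing high-bit characters. When the data is pure ASCII the distinction does not matter, since ANSI and UTF8 decode it identically.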

John Fauss

Thursday, 12 May 2022 21:38 PM UTC
PowerBuilder
# 5

Hi, Dan -

If you are using PB 2019 or higher, you may wish to try out the FileEncoding(filename) PowerScript function. It may help simplify the issue for you.

Best regards, John

Comment

Daniel Vivier
Thursday, 12 May 2022 21:56 PM UTC

Thanks for the idea, John, I wasn't familiar with that function. Unfortunately, on my file it gave the answer EncodingAnsi! which was wrong. I suspect it's still looking for the BOMs to help it decide and when there is none, it says it's Ansi.

Miguel Leeuwe
Friday, 13 May 2022 07:21 AM UTC

Please be aware of this bug when using FileEncoding(): https://www.appeon.com/standardsupport/search/view?id=3573

regards

Daniel Vivier

Saturday, 14 May 2022 21:07 PM UTC
PowerBuilder
# 6

Last entry, I hope. The function below got the right results on files of all standard types (ANSI, plus UTF8 or UTF16 LE or BE, both with and without BOM), and properly rejects UTF32 LE or BE because PB can't open them anyway.

It uses the Windows API function IsTextUnicode that Roland suggested, with two different constant arguments (to distinguish UTF16 LE from the less likely UTF16 BE):

Function long IsTextUnicode ( &
	ref blob lpv, &
	long iSize, &
	ref long lpiResult &
	) Library "advapi32.dll"

Constant Long IS_TEXT_UNICODE_STATISTICS = 2	// 0x0002
Constant Long IS_TEXT_UNICODE_REVERSE_STATISTICS = 32 // 0x0020

function boolean gf_read_text_file(REF string as_result, string as_filename, string as_description)

// Read the entire contents of the text file as_filename into as_result.
// On error, give messages based on the description of the file (as_description), and return FALSE.
// On success, return TRUE.

int li_file
long ll_bytes, ll_result
blob lblbBytes
byte lbBytes[]
Encoding lEncoding
String lsTemp

// We have encountered some files recently that are UTF16 LE with no BOM (Byte Order Mark), 
// so examine the file as binary and figure some things out first, then convert it to the correct encoding!
li_file = FileOpen(as_filename, StreamMode!, Read!, Shared!)
if li_file = -1 then
	gMsg.Show("Cannot open " + as_description + " " + as_filename + ":~n~n" + &
					gnv_environment.uf_last_error_message(), Exclamation!)
	return FALSE
end if

ll_bytes = FileReadEx(li_file, lblbBytes)
FileClose(li_file)

if ll_bytes = -100 then
	gMsg.Show("The " + as_description + " " + as_filename + " is empty.", Exclamation!)
	return FALSE
elseif ll_bytes = -1 then
	gMsg.Show("Cannot read " + as_description + " " + as_filename + ":~n~n" + &
					gnv_environment.uf_last_error_message(), Exclamation!)
	return FALSE
end if

lbBytes = GetByteArray(BlobMid(lblbBytes, 1, 4))
// Check for UTF32 BOMs, which the PB function FileEncoding won't detect, and PB cannot read
if (lbBytes[1] = 0 and lbBytes[2] = 0 and lbBytes[3] = 254 and lbBytes[4] = 255) or &
   (lbBytes[1] = 255 and lbBytes[2] = 254 and lbBytes[3] = 0 and lbBytes[4] = 0) &
then
	gMsg.Show("The " + as_description + " " + as_filename + " is in an encoding that " + &
					"the program cannot read: UTF32.", Exclamation!)
	return FALSE
end if

// Check for the file having a BOM showing its fileencoding
lEncoding = FileEncoding(as_filename) // This returns EncodingAnsi! for anything without a BOM
if lEncoding = EncodingAnsi! then
	// The determination that it was Ansi is not reliable: all files without BOM return that.
	if Mod(ll_bytes, 2) = 0 then
		// Try the IsTextUnicode function. It only works for an even number of bytes, and only seems to detect UTF16 types.
		// Note: The Windows API docs at https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-istextunicode
		//         say it returns a BOOL (nonzero or 0) indicating whether the data passes the requested tests, with the
		//         bits of ll_result set to 1 or 0 for each requested test that passed or failed.
		//		   My experience is different: for some files with no BOM this function returns 0, but ll_result is set to non-0,
		//		   which I take to indicate it's that file type.
		ll_result = gnv_environment.IS_TEXT_UNICODE_STATISTICS
		gnv_environment.IsTextUnicode(lblbBytes, ll_bytes, ll_result) 
		if ll_result > 0 then
			lEncoding = EncodingUTF16LE!
		else
			ll_result = gnv_environment.IS_TEXT_UNICODE_REVERSE_STATISTICS
			gnv_environment.IsTextUnicode(lblbBytes, ll_bytes, ll_result) 
			if ll_result > 0 then
				lEncoding = EncodingUTF16BE!
			end if
		end if
	end if	
end if

if lEncoding = EncodingAnsi! then
	// It could still be UTF8 with no BOM, so try converting as UTF8 and see whether the result is shorter!
	// If it had any non-ASCII characters it would be, because there would be at least one sequence of 2 or more bytes
	//    representing a single Unicode character. If it's the same length, it must be all ASCII.
	lsTemp = String(lblbBytes, EncodingUTF8!)
	if Len(lsTemp) < ll_bytes then
		lEncoding = EncodingUTF8! // bit wasteful that it will be converted again below here
	end if
end if

as_result = String(lblbBytes, lEncoding)

return TRUE

Comment

Daniel Vivier
Sunday, 15 May 2022 13:53 PM UTC

I'm sorry, this still fails on some files - see https://en.wikipedia.org/wiki/Bush_hid_the_facts for a bug in IsTextUnicode. A file containing only "Bush hid the facts" will be identified by IsTextUnicode (and thus my code) as UTF16-LE, and then displayed as Chinese characters if you show it somewhere! Apparently many other strings have the same problems.
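
To see why that string is so easy to misread, here is a small illustration (a sketch for experimenting, with made-up variable names). The text is 18 ASCII bytes, which is an even count, so it can be reinterpreted as nine 16-bit UTF-16 LE code units.

```powerscript
// Illustration only: reinterpret the 18 ASCII bytes as nine UTF-16 LE code units.
Blob   lblb_raw
String ls_as_ansi, ls_as_utf16

lblb_raw    = Blob("Bush hid the facts", EncodingANSI!)   // 18 bytes, none of them zero
ls_as_ansi  = String(lblb_raw, EncodingANSI!)             // the original 18 characters
ls_as_utf16 = String(lblb_raw, EncodingUTF16LE!)          // 9 characters, all CJK ideographs

// "B" is byte 66 (x42) and "u" is byte 117 (x75). Read as one little-endian
// 16-bit unit they become U+7542, a CJK ideograph, and likewise for each pair.
MessageBox("Read as ANSI", ls_as_ansi)
MessageBox("Read as UTF-16 LE", ls_as_utf16)
```

None of the 18 bytes is zero, so the simple zero-byte test from the earlier posts would not classify this file as UTF16.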

Roland Smith
Monday, 16 May 2022 18:25 PM UTC

The docs for IsTextUnicode say that IS_TEXT_UNICODE_STATISTICS determines whether the text is probably Unicode.

Chris Pollach @Appeon

Monday, 16 May 2022 18:41 PM UTC
PowerBuilder
# 7

Hi Dan;

FWIW: My framework uses ...

FUNCTION Boolean IsTextUnicode ( ref blob lpv, long iSize, ref long lpiResult ) Library "advapi32.dll"

That matches MS's expectation as well ...

https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-istextunicode

Regards ... Chris
