1. Thierry Del Fiore
  2. PowerBuilder
  3. Monday, 4 May 2020 08:32 AM UTC

Hi,

 

We need to read text files in line mode with FileOpen() and FileReadEx().

We call FileEncoding() before FileOpen() in order to determine the encoding argument (EncodingANSI!, EncodingUTF8!, ...).

 

We have an issue with UTF-8 files.

FileEncoding() returns EncodingANSI! for UTF-8 files.

FileOpen() with the EncodingUTF8! argument returns -1 for UTF-8 files.

 

So we open the file as ANSI, but then accented characters are garbled: èéàùê becomes Ã¨Ã©Ã Ã¹Ãª.

The only way to get the correct string is to convert: string(blob(ls_line, EncodingANSI!), EncodingUTF8!)

But we cannot always do the conversion, because if the text file really is ANSI, we get Asian characters.

 

So, the question is: how can we handle both ANSI and UTF-8 files, since FileEncoding() always returns ANSI?

 

Is this a PowerBuilder bug?

 

Regards

Michael Kramer
  1. Tuesday, 5 May 2020 13:28 UTC
  2. PowerBuilder
  3. # 1

Hi Thierry,

To me this is not a bug in PowerBuilder; it is correct behavior.
It is the result of wanting to interpret a text file as a different encoding than its BOM indicates.

My test example: Small text file including Danish special characters Æ Ø Å æ ø å. 
I used NotePad on Windows 10 to save it in 3 different encodings:
1) ANSI
2) UTF8 - without BOM
3) UTF8 with BOM

I then compared the file contents at a command prompt using "FC /B". Results:
(1) and (2) are IDENTICAL - EXCEPT for the Danish special characters.
(2) and (3) are IDENTICAL - EXCEPT (3) has a BOM and (2) doesn't.

FileEncoding returns the encoding represented by the BOM marker (ANSI when no known BOM marker).
You will have difficulty in PowerBuilder if you insist on interpreting a text file in conflict with its BOM marker.

Notepad reads the full text file to best-guess its encoding (ANSI vs. UTF-8 without BOM) based on byte sequences that are potentially non-English characters. But that is error-prone second-guessing.
Example: open any ANSI-encoded text file containing only US-ASCII characters; Notepad will tell you the file is UTF-8 without BOM, because ASCII is a valid subset of UTF-8.
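If you do need to handle UTF-8 files saved without a BOM, one workaround in the spirit of Notepad's best-guessing is to read the file in stream mode and check whether its bytes form valid UTF-8 sequences. This is only my own sketch (the function name of_looks_like_utf8 is made up), and like any heuristic it can be fooled:

```
// Heuristic: returns TRUE if the blob contains at least one valid UTF-8
// multibyte sequence and no invalid ones. Pure-ASCII data returns FALSE,
// which is harmless: ANSI and UTF-8 are identical for ASCII.
public function boolean of_looks_like_utf8 (blob ablob_data)
Byte    lb[]
Long    ll_i, ll_len, ll_follow
Boolean lb_multibyte = FALSE

lb = GetByteArray(ablob_data)
ll_len = UpperBound(lb)
ll_i = 1

DO WHILE ll_i <= ll_len
   IF lb[ll_i] < 128 THEN                          // 0xxxxxxx = ASCII
      ll_i ++
      CONTINUE
   ELSEIF lb[ll_i] >= 194 AND lb[ll_i] <= 223 THEN // 110xxxxx = 2-byte lead
      ll_follow = 1
   ELSEIF lb[ll_i] >= 224 AND lb[ll_i] <= 239 THEN // 1110xxxx = 3-byte lead
      ll_follow = 2
   ELSEIF lb[ll_i] >= 240 AND lb[ll_i] <= 244 THEN // 11110xxx = 4-byte lead
      ll_follow = 3
   ELSE
      RETURN FALSE                                 // invalid lead byte
   END IF
   // Every continuation byte must be 10xxxxxx (values 128..191).
   DO WHILE ll_follow > 0
      ll_i ++
      IF ll_i > ll_len THEN RETURN FALSE
      IF lb[ll_i] < 128 OR lb[ll_i] > 191 THEN RETURN FALSE
      ll_follow --
   LOOP
   lb_multibyte = TRUE
   ll_i ++
LOOP

RETURN lb_multibyte
end function
```

If it returns TRUE you would convert with string(ablob_data, EncodingUTF8!); otherwise treat the file as ANSI.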

HTH /Michael

  1. Thierry Del Fiore
  2. Wednesday, 6 May 2020 16:20 UTC
I didn't manage to make it work.

@Michael: the conversion string(lblobData, EncodingUTF8!) is what I actually do; it works if the file is UTF-8 but returns Asian characters if the file is ANSI.

I read the file in stream mode with FileOpen(ls_file, StreamMode!), since FileOpen with EncodingUTF8! returns -1.

Then I tried IsTextUnicode, but it always returns false:

ll_ret = FileReadEx(li_file, bdata)

IsTextUnicode(bdata, ll_ret, ll_null)



Am I missing something?
  1. Michael Kramer
  2. Wednesday, 6 May 2020 18:04 UTC
I'm sorry to hear IsTextUnicode isn't giving you what you need. I have no personal experience with that function. I read the docs that Roland linked to. It lists a bunch of option values you can use to tune the function. I found this page containing values for each option > https://www.pinvoke.net/default.aspx/advapi32.istextunicode

I assume you need something similar to IS_TEXT_UNICODE_UNICODE_MASK (combo of multiple options).
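For example, a calling sketch might look like this. It is untested; the constant value comes from the pinvoke.net page above (please verify it), ls_file is assumed to hold the file path, and note that on input the third argument tells the API which tests to run and on output it receives the results. Also keep in mind that IsTextUnicode looks for UTF-16 text, not UTF-8, so FALSE for a UTF-8 file is arguably the documented behavior:

```
// IS_TEXT_UNICODE_UNICODE_MASK = 0x000F (value taken from pinvoke.net)
CONSTANT Long IS_TEXT_UNICODE_UNICODE_MASK = 15

Blob    bdata
Long    ll_ret, ll_tests
Integer li_file

li_file = FileOpen(ls_file, StreamMode!)
ll_ret = FileReadEx(li_file, bdata)
FileClose(li_file)

ll_tests = IS_TEXT_UNICODE_UNICODE_MASK   // on input: which tests to perform
IF IsTextUnicode(bdata, ll_ret, ll_tests) THEN
   // ll_tests now holds the flags of the tests that passed
END IF
```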
  1. Mark Goldsmith
  2. Wednesday, 6 May 2020 23:19 UTC
Just my thoughts... Michael, I agree with your comments for the most part, and thanks for describing how you approached the issue. I think the only way to know whether this is truly a bug (and Appeon seems to have accepted it as such, at least for the FileEncoding function, since they are going to schedule a fix) is to understand how PowerBuilder determines whether a file is UTF-8 encoded. Is it strictly relying on the presence of the 3-byte BOM at the start of the file, or is something else going on under the covers?

I would probably completely agree if you were to add "as implemented by PowerBuilder" at the end of these sentences: "...wanting to interpret a text file as a different encoding than its BOM indicates" and "you will have difficulty in PowerBuilder if you insist on interpreting a text file in conflict with its BOM marker."

I too went through a similar exercise and then also looked at the files in a hex editor to understand it further. The text editors I use today (Notepad, Notepad++, Brackets) default to saving files as UTF-8 without BOM, so maybe here is an opportunity for some changes to PowerBuilder (e.g., incorporating UTF-8 without BOM as an encoding type, or defaulting to UTF-8 rather than ANSI when no encoding is specified). UTF-8 without BOM is an accepted encoding, and text editors can usually detect it (though, as Roland pointed out, detection is still not guaranteed), so the question is: does PowerBuilder rely solely on the presence of a BOM even though its absence is widely accepted?

Some additional information I found helpful for understanding Unicode and its specifications is available at https://home.unicode.org/ and specifically at https://unicode.org/faq/utf_bom.html.



I've also commented here: https://community.appeon.com/index.php/qna/q-a/identify-right-encoding-for-string
Regards,

Roland Smith
  1. Monday, 4 May 2020 16:58 UTC
  2. PowerBuilder
  3. # 2

Another person asked pretty much the same thing earlier today. I suggested using the IsTextUnicode function to analyze a blob which does not start with a BOM.

https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-istextunicode

Function boolean IsTextUnicode ( ref blob lpv, long iSize, ref long lpiResult ) Library "advapi32.dll"

  1. Michael Kramer
  2. Tuesday, 5 May 2020 18:55 UTC
I like the description of IsTextUnicode: "is likely to contain a form of Unicode text." It has options to configure what and how it searches; nice.

It will probably return TRUE for every ANSI string identical to its own UTF8 representation.
  1. Helpful
John Fauss
  1. Monday, 4 May 2020 15:45 UTC
  2. PowerBuilder
  3. # 3

Here is the source code for a function I wrote many moons ago as an exercise in learning about Byte Order Marks (BOMs):

public function integer of_getfileencoding (string as_filename, ref encoding as_encoding);
// Determines how a file is encoded by examining the Byte Order Mark (BOM) at

// the beginning of a file. The start of the file has to be read in stream mode,
// otherwise the system skips over the BOM.
//
// Recognized byte sequences:
//   1. UTF-32 Big Endian (BE)    x0000FEFF (byte values 0,0,254,255) Not supported by PB
//   2. UTF-32 Little Endian (LE) xFFFE0000 (byte values 255,254,0,0) Not supported by PB
//   3. UTF-16 Big Endian (BE)    xFEFF     (byte values 254,255)     Recognized by PB
//   4. UTF-16 Little Endian (LE) xFFFE     (byte values 255,254)     Recognized by PB (default for PB10 & higher)
//   5. UTF-8                     xEFBBBF   (byte values 239,187,191) Recognized by PB
//   6. ANSI: any byte sequence not listed above                      Recognized by PB
//
// Arguments:
//   String   as_filename   The path, name & extension of the file to be examined.
//   Encoding as_encoding   [passed by reference] The encoding used by the file.
//
// Returns: Integer
//    1 = the file's encoding was successfully determined.
//   -1 = error or unsupported encoding; the Encoding argument (passed by reference) is set to null.

Integer li_filenum
Long ll_bytesread
Byte lbyte[]
Blob lblob

li_filenum = -1
SetNull(as_encoding)

if not FileExists(as_filename) then
   Return -1
end if

// Open the file to be examined in Stream Mode.
li_filenum = FileOpen(as_filename, StreamMode!, Read!, Shared!)
if li_filenum = -1 then
   Return -1
end if

// Read up to the first four bytes of the file (where a BOM would reside) into a blob.
ll_bytesread = FileReadEx(li_filenum, lblob, 4)
FileClose(li_filenum)

if ll_bytesread <= 0 then
   Return -1 // empty file or read error
end if

// Copy the bytes in the blob into a byte array for easy examination. Pad the
// array to four bytes with a non-BOM value so the comparisons below are safe
// for files shorter than four bytes.
lbyte = GetByteArray(lblob)
do while UpperBound(lbyte) < 4
   lbyte[UpperBound(lbyte) + 1] = 1
loop

// Does the file begin with a recognized BOM?
if lbyte[1] = 0 and lbyte[2] = 0 and lbyte[3] = 254 and lbyte[4] = 255 then
   Return -1 // UTF 32 BE not supported by PB
elseif lbyte[1] = 255 and lbyte[2] = 254 and lbyte[3] = 0 and lbyte[4] = 0 then
   Return -1 // UTF 32 LE not supported by PB
elseif lbyte[1] = 254 and lbyte[2] = 255 then
   as_encoding = EncodingUTF16BE!
elseif lbyte[1] = 255 and lbyte[2] = 254 then
   as_encoding = EncodingUTF16LE!
elseif lbyte[1] = 239 and lbyte[2] = 187 and lbyte[3] = 191 then
   as_encoding = EncodingUTF8!
else
   // No recognizable BOM, so this file is ANSI encoded.
   as_encoding = EncodingANSI!
end if

Return 1
end function
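A minimal calling sketch (untested; ls_path is assumed to hold the file name, and the FileOpen argument order should be checked against your PB version's documentation):

```
Encoding le_enc
Integer  li_rc, li_file
String   ls_line

li_rc = of_getfileencoding(ls_path, le_enc)
IF li_rc = 1 THEN
   // Re-open the file in line mode with the detected encoding.
   li_file = FileOpen(ls_path, LineMode!, Read!, Shared!, Replace!, le_enc)
   DO WHILE FileReadEx(li_file, ls_line) > -1   // -100 signals end of file
      // process ls_line ...
   LOOP
   FileClose(li_file)
END IF
```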

HTH, John

  1. Thierry Del Fiore
  2. Tuesday, 5 May 2020 07:38 AM UTC
Thanks for your code example.

I tried it, but I had the same behavior as the fileencoding() function.

It only works for "UTF-8 with BOM" files.

"UTF8" files are identified as "ANSI"
  1. Helpful
  1. Roland Smith
  2. Tuesday, 5 May 2020 12:08 PM UTC
You can use IsTextUnicode that I suggested to determine if a blob contains Unicode characters when there is no BOM.
  1. Helpful
Miguel Leeuwe
  1. Monday, 4 May 2020 08:41 AM UTC
  2. PowerBuilder
  3. # 4

Hi,

I remembered having seen this bug: 

https://www.appeon.com/standardsupport/search/view?id=3573

It had something to do with the presence or absence of a BOM.

 

Maybe this code by Yuri could be a valid workaround? I haven't tried it myself:

https://community.appeon.com/index.php/qna/q-a/identify-right-encoding-for-string#reply-15329

Comment
  1. Thierry Del Fiore
  2. Tuesday, 5 May 2020 07:29 AM UTC
Thanks for your answer; indeed, it seems to be a PB bug.
  1. Helpful