1. ahmed tawfik
  2. PowerBuilder
  3. Monday, 19 July 2021 21:28 PM UTC

I have many word documents that have been saved as blobs via Column  (longbinary(max) "blob") to the DB (SQL Server 2012). Now I need to have the ability to search text in this Blob column (such I do with "like" in varchar column )
Need to be able to do this programmatically.

another approach

what about if i convert word document(blob) stored in longbainary(max) column to text (varchar(max)) column and search using like ?

ahmed tawfik Accepted Answer Pending Moderation
  1. Wednesday, 28 July 2021 23:05 PM UTC
  2. PowerBuilder
  3. # 1

Hi Mark

 

li_file_handle = FileOpen(ls_path_filename, StreamMode!, Read!, LockReadWrite!, Append!, EncodingANSI!)
ll_chars_read = FileReadEx(li_file_handle,lblb_file_contents)
ls_contents_as_utf8 = String(lblb_file_contents, EncodingUTF8!)

 

It works fine and the text in txt file stored in DB column accurately

thank you

Regards ...

Ahmed

Comment
  1. Mark Goldsmith
  2. Thursday, 29 July 2021 01:10 AM UTC
Glad to hear that worked for you Ahmed and thanks for the update.
  1. Helpful
There are no comments made yet.
Mark Goldsmith Accepted Answer Pending Moderation
  1. Wednesday, 28 July 2021 15:43 PM UTC
  2. PowerBuilder
  3. # 2

Hi Ahmed,

Daryl's speculation on the issue being related to encodings is correct. As you can see in the picture below, when I opened the text file you provided, it shows as UTF-8 encoding, but PowerBuilder requires UTF-8 with BOM in order to successfully open it with FileOpen() and EncodingUTF8! Actually, if you manually save the file but change it to UTF-8 with BOM (with Notepad, Notepad++ etc.) and then try to process it I suspect it will work. You may wish to see if you can control the encoding that Apache Tika uses when creating the text file.

 

If you can't control the encoding there are a couple of options I can think of:

  1. Once Apache Tika creates the file, open it and manually add the BOM encoding bytes (EF BB BF) to the start of the file before you try and process it, or
  2. The following code will allow you to get around the absence of the BOM bytes:

li_file_handle = FileOpen(ls_path_filename, StreamMode!, Read!, LockReadWrite!, Append!, EncodingANSI!)
ll_chars_read = FileReadEx(li_file_handle,lblb_file_contents)
ls_contents_as_utf8 = String(lblb_file_contents, EncodingUTF8!)


The above 3 lines of code are basically opening the file in stream mode with ANSI encoding; then reading the contents of the file into a Blob variable; then converting the blob contents to a string using UTF-8 encoding...now you can process what's contained in ls_contents_as_utf8.

HTH...regards,

Mark

Comment
There are no comments made yet.
ahmed tawfik Accepted Answer Pending Moderation
  1. Wednesday, 28 July 2021 09:24 AM UTC
  2. PowerBuilder
  3. # 3

Hi Daryl

zip file attached for docx,txt,message and code

also rar file attached

 

 

 

Attachments (2)
Comment
There are no comments made yet.
ahmed tawfik Accepted Answer Pending Moderation
  1. Monday, 26 July 2021 21:26 PM UTC
  2. PowerBuilder
  3. # 4
Hi Daryl Foster
We use Apache Solr for our full text searching solution and we use Apache Tika to extract the text from .docx files to load into Solr.
thank you
I used Apatch Teka and it works fine
Regards
Ahmed
 

 

Comment
  1. ahmed tawfik
  2. Wednesday, 28 July 2021 08:03 AM UTC
if not shown here may be deleted can i send u email ?
  1. Helpful
  1. ahmed tawfik
  2. Wednesday, 28 July 2021 08:22 AM UTC
i tried torename files and send as rar , pdf , pbl ,jpg but it removoved ?
  1. Helpful
  1. Daryl Foster
  2. Wednesday, 28 July 2021 09:05 AM UTC
Hi Ahmed, you probably can’t just rename the files, you are probably better off zipping them and attaching the zip file. You’ll need to add a new reply, rather than a new comment. Are you getting an error when you attach them?
  1. Helpful
There are no comments made yet.
Mark Goldsmith Accepted Answer Pending Moderation
  1. Monday, 26 July 2021 16:25 PM UTC
  2. PowerBuilder
  3. # 5

Hi Ahmed,

Try adding 'CMD /c in front of your first double-quote and add a single quote after your last double-quote. So it looks like the following:


Run('CMD /c "java -jar tika-app-2.0.0.jar e:\a1.docx > e:\abc.txt 2> e:\err.err"')

This should open a command window, execute what you have provided in double-quotes and then close.

HTH...regards,

Mark

Comment
  1. Mark Goldsmith
  2. Monday, 26 July 2021 19:44 PM UTC
Hi Ahmed,

For the issue using a variable...your code is still placing ls_path inside quotes and so it is being evaluated literally instead of as a variable; you'll need to have a single quote prior to the variable and one just after the variable (and you do not need the extra set of double-quotes around the variable). It may be easier to build the entire Run string inside a variable, something like:



ls_command = 'cmd /c "java -jar tika-app-2.0.0.jar ' + ls_path + ' > e:\abc.txt 2> e:\err.err" '

Run(ls_command)



For the cmd window to delay closing...you may wish to create a batch file instead (and Run that) which would include your line of text to issue the command then followed by a TIMEOUT command (I think SLEEP if Win 7 and prior) to delay however long you wish.



Regards,

Mark
  1. Helpful
  1. ahmed tawfik
  2. Monday, 26 July 2021 21:23 PM UTC
Hi Mark

thank you for yours answers -

it runs

Regards

ahmed
  1. Helpful
  1. Mark Goldsmith
  2. Monday, 26 July 2021 22:26 PM UTC
You're welcome Ahmed and glad to hear that works for you.
  1. Helpful
There are no comments made yet.
ahmed tawfik Accepted Answer Pending Moderation
  1. Sunday, 25 July 2021 09:35 AM UTC
  2. PowerBuilder
  3. # 6

Hi all

I've successfully used apache tika to extract the arabic  text from the CMD, how would I do it from the powerbuilder, can i use run ??

And how could I use the output text to put it in the required column automatically (without user interaction){KomentoLock}.

Regards Ahmed

Comment
  1. ahmed tawfik
  2. Monday, 26 July 2021 11:16 AM UTC
Hi all

I used Apache teka

I can extract docx file to txt files but onlt in CMD env.

the command is as follow (in CMD)

E:\>java -jar tika-app-2.0.0.jar e:\a1.docx > e:\abc.txt 2> e:\err.err

it works fine and if there is any error it is written to e:\err.txt

but in PB env. i tried run("java -jar tika-app-2.0.0.jar e:\a1.docx > e:\abc.txt 2> e:\err.err")

no output file even no error file produced

what is error?

Regards Ahmed

  1. Helpful
There are no comments made yet.
John Fauss Accepted Answer Pending Moderation
  1. Tuesday, 20 July 2021 13:48 PM UTC
  2. PowerBuilder
  3. # 7

Hi, Ahmed -

As a test, try manually editing a Word document using Notepad. You will not see the textual content of the document.

This illustrates what Daryl describes. The text in a Word document is not in a "clear text" format... you must either use Word to search the contents of a Word document or you must first extract the text content and then search that.

The follow-up link that Chris has provided describes searching "character-based" text, NOT text in Word documents. Once you have the textual content of a Word document (Daryl has described one technique for accomplishing this), then you may be able to provide search functionality.

Regards,
John

Comment
  1. ahmed tawfik
  2. Tuesday, 20 July 2021 16:09 PM UTC


Hi, John Fauss



As a test, try manually editing a Word document using Notepad. You will not see the textual content of the document.



that what the customer (user) did He open .docx file using microsoft word then CtrlA , CtrlC Finally CtrlV (all text without photo) in nvarvhar(max) column in datawindow )if he tried to open by word pad he get unread char as "



the customer was unsatisfied saying it had a lot of time to do that and it is easier to search using file search in windows



Regards .... Ahmed

  1. Helpful
  1. John Fauss
  2. Tuesday, 20 July 2021 22:40 PM UTC
What I meant was that YOU could do this as a test; To illustrate that you cannot simply search the contents of a Word document stored in the database as a binary large object without extracting the character text minus all of the formatting. I was not suggesting that your user do this.
  1. Helpful
There are no comments made yet.
Daryl Foster Accepted Answer Pending Moderation
  1. Tuesday, 20 July 2021 02:18 AM UTC
  2. PowerBuilder
  3. # 8

Hi Ahmed, I think the issue you will have searching for text in the Word documents is that .docx files are actually binary zip files which contain a bunch of xml files, so you won't be able to search for the text inside them unless you (at least) decompress them first and then search in the relevant xml files.  An alternative is to extract the text only from the .docx and store that in a separate varchar(max) column which you could use for searching.

We use Apache Solr for our full text searching solution and we use Apache Tika to extract the text from .docx files to load into Solr.

Comment
  1. ahmed tawfik
  2. Tuesday, 20 July 2021 12:29 PM UTC
Hi Daryl Foster

An alternative is to extract the text only from the .docx and store that in a separate varchar(max) column which you could use for searching.

how to extract the text only from the .docx (the file is docx file contains text(writeen in Arabic and English) and photo ?

I tried

blob lb_object

string ls_ole_to_text // in DB the column is nvarchar(max)

selectblob ole_object

into :lb_object

from folder_photo

WHERE ( folder_photo.id = :id) ;

ls_ole_to_text = string(lb_object, EncodingAnsi!) Or even ls_ole_to_text = string(lb_object)

UPDATE folder_photo

SET ole_to_text = :ls_ole_to_text

WHERE ( folder_photo.id= :id) ;

commit ;



but when i open folder photo

the column ole_to_text contains " ذدà،±ل " why ????

Regards .... Ahmed
  1. Helpful
  1. Daryl Foster
  2. Wednesday, 21 July 2021 02:05 AM UTC
Hi Ahmed, John's reply above explains what I was trying to say, i.e. you can't just convert the docx file contents to a string and then search because it is a binary format. If you want the text from the Word document you need to do that as an extra step and programmatically extract the text before searching. You could have two columns in your database, one for the binary docx data and one for the text. You could then have an extra step when you store the docx data into the database of extracting the text and updating the text column as well as the binary column.

There are a few ways of programmatically extracting the text from a docx file, you could use Word OLE Automation, or look at Apache Tika, or something like Gembox Document.
  1. Helpful
  1. ahmed tawfik
  2. Wednesday, 21 July 2021 09:48 AM UTC
Hi Darly Foster



"You could have two columns in your database, one for the binary docx data and one for the text. You could then have an extra step when you store the docx data into the database of extracting the text and updating the text column as well as the binary column."



that is what I did but i can not extract text from doc file to text.



are there any examples using Apache Tika or Word OLE Automation, in powerbuilder?

Regards ....Ahmed

  1. Helpful
There are no comments made yet.
Chris Pollach @Appeon Accepted Answer Pending Moderation
  1. Monday, 19 July 2021 21:41 PM UTC
  2. PowerBuilder
  3. # 9

Hi Ahmed;

   This feature is very DBMS vendor dependent.

For example: http://docs.microsoft.com/en-us/azure/storage/blobs/storage-manage-find-blobs?tabs=azure-portal

Regards ... Chris

Comment
  1. Chris Pollach @Appeon
  2. Tuesday, 20 July 2021 02:51 AM UTC
PS: Blob text searching is also possible but again, a feature of your specific DBMS. For example:

https://docs.microsoft.com/en-us/sql/relational-databases/search/full-text-search?view=sql-server-ver15
  1. Helpful
  1. ahmed tawfik
  2. Tuesday, 20 July 2021 12:32 PM UTC
Hi Chris

is there any examples using powerbuilder code as done in ole powerbuilder example - tha DB is SQLServer 2016

Regards ... Ahmed
  1. Helpful
  1. Chris Pollach @Appeon
  2. Tuesday, 20 July 2021 15:34 PM UTC
For SS - There is nothing that needs to be done on the PB App side as it's all implemented on the SS side. I have personally used the "Full Text Search" feature on a few PB Apps for the Supreme Court of Canada and Federal Court of Canada to search through uploaded court documents. This SS feature worked great and was totally transparent to the related PB Apps.
  1. Helpful
There are no comments made yet.
  • Page :
  • 1


There are no replies made for this question yet.
However, you are not allowed to reply to this question.