Need help parsing a text log file and finding the EOL character in that file

    Question

  • Hi Everyone,
    I have a strange problem. I am using a simple HTA program to read a text file (.log file) and split it into lines as follows:

    FileText = objFile.ReadAll 
    ArrayLines = Split(FileText, VbCrLf)
     
    I then do some string searching and manipulation on those array lines, and it works most of the time. However, I recently found that some files were not being split correctly. For example, if I search for all the lines containing "QuickBrownFox", I get an incorrect (lower) number of lines. I ran the Windows grep function against the file, which correctly showed the number of lines where "QuickBrownFox" occurs in that file...
    Here is the strange part: if I open the file that is not read correctly and copy its contents to another file with the same extension, then the script works correctly.
    I compared the two files (the original and the newly created one) with TextPad, and both show ANSI as the encoding scheme, with the same number of characters, words and lines.
    The original file has characters in some lines which are not regular ANSI, but so does the new file...
    I have tried VbCr as well as VbCrLf, so I guess the issue is not with the script...
    Reading line by line is not an option, because I also need to read all the files in a directory, and that is not feasible due to the huge time lag.
    Should I try changing the encoding of the original file, or am I missing something else?
    Please kindly help...
    Thanks so much in advance. I could not have come this far without your help!!!
    Aray

    Wednesday, January 06, 2010 6:16 PM

Answers

  • Notepad, TextPad and WordPad all handle special characters differently.  Notepad is pretty dumb, while TextPad and WordPad have more logic built in.  Also, some advanced editors like TextPad may have extra logic for handling special characters that are not displayed (like converting them behind the scenes).  Therefore, these tools are not the best resource for troubleshooting your problem, but they do help identify it.

    There is most likely a special character in the string that needs to be removed or ignored.  Use the link below to some VBS code that removes the special characters, or modify it to display them.  Not all special characters are accounted for, so you may need to cycle through the whole ASCII/ANSI table to find which character is causing your issue.

    http://networkadminkb.com/vbslibrary/Knowledge%20Base/Components/Strings/RemoveSC.aspx

    Wednesday, January 06, 2010 6:53 PM
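The idea of removing special characters from the string can be illustrated with a minimal sketch. This is not the code behind the link above; `StripControlChars` is a hypothetical helper, and it assumes the troublesome characters are control characters below the space character (codes 0 to 31), which the thread has not confirmed. Tab, CR and LF are kept so that line splitting still works.

```vbscript
Option Explicit

' Hypothetical helper: removes control characters other than Tab, LF and CR.
' Assumption: the problem characters have codes below 32.
Function StripControlChars(sText)
    Dim re
    Set re = New RegExp
    re.Global = True
    ' Everything below space except Tab (09), LF (0A) and CR (0D)
    re.Pattern = "[\x00-\x08\x0B\x0C\x0E-\x1F]"
    StripControlChars = re.Replace(sText, "")
End Function

' Sample string with two control characters embedded in it
Dim sSample
sSample = "Quick" & Chr(7) & "Brown" & Chr(26) & "Fox"
WScript.Echo StripControlChars(sSample)
```

`WScript.Echo` only exists when the script runs under cscript or wscript; inside an HTA, `MsgBox` or writing to the page would be used instead. If `ReadAll` has already returned a damaged string, cleaning it afterwards will not recover the missing text.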
  • It's probably VbLf. Replace all VbLf with VbCrLf and then do the split

    FileText = objFile.ReadAll
    FileText = replace(FileText, vblf, vbcrlf)
    ArrayLines = Split(FileText, VbCrLf)

    ...or something like that. I don't know if you have vblf and vbcrlf as eol in the same file.

    Not sure but it may be better to replace VbCrLf with VbLf and then do the split with VbLf.

    Francis

    Thursday, January 07, 2010 7:09 AM
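The second variant suggested above (convert everything to VbLf, then split on VbLf) could look roughly like the sketch below. `SplitLines` is a hypothetical name. Replacing VbCrLf first matters: replacing every VbLf with VbCrLf, as in the first snippet, turns an existing CR LF pair into CR CR LF and leaves a stray CR at the end of those lines.

```vbscript
Option Explicit

' Hypothetical helper: normalizes CRLF and lone CR to LF, then splits on LF.
Function SplitLines(sText)
    Dim sNorm
    sNorm = Replace(sText, vbCrLf, vbLf)   ' CRLF -> LF first, so no CR is left behind
    sNorm = Replace(sNorm, vbCr, vbLf)     ' any remaining lone CR -> LF
    SplitLines = Split(sNorm, vbLf)
End Function

' Sample text mixing all three line-ending styles
Dim arrLines
arrLines = SplitLines("one" & vbCrLf & "two" & vbLf & "three" & vbCr & "four")
WScript.Echo UBound(arrLines) + 1   ' number of lines found
```

This treats every LF as a line break. If a lone VbLf inside a record has to stay part of that record, splitting on VbCrLf alone remains the right choice and this normalization should be skipped.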

All replies

  • Hi,


    Log file generated by my server has a line which I need to scan. E.g
    "Quick Brown Fox jumps over the lazy dog. Alpha Beta = Gamma "

    Now if I open the above line in Notepad (the default), it shows up as one single line, but if I open it in WordPad (Windows), it shows up as two lines with a space in front of Alpha.

    Quick Brown Fox jumps over the lazy dog.
     Alpha Beta = Gamma

    Given the above issue, I did some testing today with possible alternatives, and here is what I found.

    ==> ReadAll reads the file just like Notepad, and the script worked fine until I found that some files generated by the server are not read correctly (see my first post), meaning that somewhere VbCrLf is confusing the ReadAll function.
    ==> I then tested as follows:

    Do Until objFile.AtEndOfStream
        ReDim Preserve TextArray(i)
        TextArray(i) = objFile.ReadLine
        i = i + 1
    Loop
    FileText = Join(TextArray, vbCrLf)

    With the above code, I found that the above string is divided into two lines, just like in WordPad. I would have to make major changes in the script to scan the next line every time Quick Brown Fox appears in a line, in order to catch the Alpha and Beta values. Does this also mean that Notepad is ignoring some CR or LF character that WordPad can see?
    ==> I also tested with ADODB.Stream, with Charset set to ascii and Mode set for reading.
    This time it worked (it seems the file is read just like Notepad did), but unfortunately it took almost 5 times longer to load the file compared to the above method, which is not going to work for me.
    Function ReadLog(sFileSpec)
      Const adTypeText = 2
      Const adModeReadWrite = 3
      With CreateObject("ADODB.Stream")
        .Type = adTypeText
        ' Use the constant that is actually declared; LoadFromFile writes into the stream
        .Mode = adModeReadWrite
        .Charset = "ascii"
        .Open
        .LoadFromFile sFileSpec
        ReadLog = .ReadText
        .Close
      End With
    End Function

    So my question is: is there a way to load the file quickly for reading only using ADODB.Stream, so that I have a workaround for my issue and can avoid ReadLine, which will cause me some grief in the long run and cost additional cycles?
    Alternatively, is there a way to change the character set of a file when using objFile? Maybe that is my problem.
    Remember that the original code (FileText = objFile.ReadAll) fails only on certain files, and if I copy the contents into a new file, the same script works fine.
    Looking forward to hearing from someone, anyone, on this...
    Thanks so much...
    Aray
    Wednesday, January 06, 2010 6:16 PM
  • Thanks for your replies and for helping me move forward on this problem.
    I am curious what change is made to the text file when you open it, do CTRL+A (select all), and paste it into another text file. The reason is that once I do that, the script reads the file correctly.
    Is there a way to read the file other than objFile? I used ADODB.Stream but don't know if I am doing that right.
    I really appreciate all of your help.
    Thanks .
    Aray
    Friday, January 08, 2010 3:06 PM
  • Hi,
    I am a little lost here, as I now know the problem but don't know how to fix it :)
    I have found the following with my troublesome log files.

    There are some non-ASCII characters which show up as strange characters on separate lines when the file is opened with Notepad or WordPad.
    Previously, one way to work around this issue was to copy the contents of the whole file and paste them into a new text/log file.
    However, I have now found that if I find/replace these characters (even the first few instances) in the file, the file is read correctly by the script, which means these characters are causing the issue. They start with something like a black box with a white dot in it. I don't know how to check the ASCII hex values to tell you specifically.

    Another thing I found out is that if I open the log file with MS Word, it asks me to choose the text encoding to make the document readable. I choose Windows (default); there are two other options, MS-DOS and Other encoding, which has a long list to choose from. When I choose Windows (default), it does not show those characters. I don't know if they are still there and not showing up, or removed altogether. However, that black box with the white dot is still there in the MS Word file :)

    I guess that when I copy the contents of the file and paste them into another text file, it somehow gets encoded in the Windows encoding format?
    I am working with the server guys to find out why we have those characters in the log files in the first place, but my question is how to convert the file to the Windows default encoding using VBScript while using the ReadAll function.

    I would rather not use ReadLine, because then I have to check each line for VbLf and append those two lines separated by VbLf, which is not desired.
    I tried ADODB.Stream but it is painfully slow. All the log files are around 10 MB, and ReadAll is the quickest (a 10 MB file in just 5 sec).

    Please kindly help and looking forward to any / all replies :)
    Thanks


    PS: There is a VbLf in the log file, and I am splitting on VbCrLf because I want to read the two lines separated by VbLf as one. That was a symptom of this issue and not the cause. The issue is not VbCrLf or VbLf, but these characters causing some text encoding issue.
     
    • Edited by arayanz Tuesday, January 12, 2010 3:27 PM Added more info
    Tuesday, January 12, 2010 3:13 PM
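The post above mentions not knowing how to check the hex values of the odd characters. One possible way, sketched here under assumptions, is to walk the string and report the position and code of anything that is not printable ASCII, Tab, CR or LF. `FindOddChars` and its `nMax` limit are hypothetical names, not part of any library.

```vbscript
Option Explicit

' Hypothetical helper: reports position and hex code of each character that is
' neither printable ASCII (32-126) nor Tab, LF or CR, up to nMax entries.
Function FindOddChars(sText, nMax)
    Dim i, code, nFound, sReport
    nFound = 0
    sReport = ""
    For i = 1 To Len(sText)
        code = AscW(Mid(sText, i, 1))
        If code < 0 Then code = code + 65536   ' AscW returns a signed 16-bit value
        If (code < 32 Or code > 126) And code <> 9 And code <> 10 And code <> 13 Then
            sReport = sReport & "pos " & i & ": 0x" & Right("000" & Hex(code), 4) & vbCrLf
            nFound = nFound + 1
            If nFound >= nMax Then Exit For
        End If
    Next
    FindOddChars = sReport
End Function

' Sample string; in practice sText would be the text read from the log file
WScript.Echo FindOddChars("Quick" & Chr(7) & "Brown" & Chr(26) & "Fox", 20)
```

Looping character by character over a 10 MB string is slow in VBScript, so `nMax` stops the scan after the first few hits, which is usually enough to identify the character. The reported codes are those of the string as it was read, so they depend on how the file was decoded.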
  • I am trying to clear up the older open posts on this forum. If this is still an unresolved issue for you please let me know. If you do not post back within one week I will assume it is resolved and will close this thread.

    Thank you

    Ed Wilson
    Microsoft Scripting Guy
    Sunday, May 09, 2010 11:00 PM