The script that removes special characters from XML files performs very badly. Can it be improved?

  • Question

  • Hi All,

    I'm trying to reduce the time it takes to remove special characters (i.e. characters with an ASCII code below 32) from XML files. The removal works as expected, but the process takes about 6 hours. Could you please suggest any improvements? Thanks!

    Below is the script snippet: 

    function RemoveSplChar($string)
    {
     
    $arr = $null
    $ByteArray = [System.Text.Encoding]::ASCII.GetBytes($string)
    
    foreach ($char in $ByteArray)
    {
        if($char -gt '31')
        {
            [array]$arr += $char
        }
    }
     
    $String_new = $null
    $String_new = [System.Text.Encoding]::ASCII.GetString($arr)
    return $String_new
    }
    
    ###################################################
    #Parsing XML files to remove special characters
    ###################################################
    $temp_dir = "c:\temp"
    logwrite ("Removing special characters from the export files...")
    $xmlfiles = "database_structure.xml",
    "database_products.xml",
    "database_persons.xml",
    "database_materials.xml",
    "database_literature.xml",
    "database_disclaimer.xml",
    "database_media_asset_refs.xml",
    "database_media_assets.xml"
    
    foreach ($fl in $xmlfiles) 
    { 
        $filefullpath = $temp_dir + '\' + $fl
        $filewosplchars = $filefullpath + '.log'
        $stream = [System.IO.StreamWriter] $filewosplchars
        foreach ($line in [System.IO.File]::ReadLines($filefullpath)) 
        { 
            if ($line -match '<!\[CDATA\[([\S\s]+)?\]\]>')
            {
                $rawdata = $Matches[1]
                if (($rawdata -ne "" ) -and ($rawdata -ne $null))
                {
                    $nosplchars = RemoveSplChar($rawdata)
                    $replaceline = '<![CDATA[' + $nosplchars + ']]>'
                    $newline = $line -replace '<!\[CDATA\[([\S\s]+)?\]\]>', $replaceline
                    $stream.WriteLine($newline)
                }
                else
                {
                    $stream.WriteLine($line)
                } 
            }
            else { $stream.WriteLine($line) }
        }
        $stream.close()
        Remove-Item -Path $filefullpath -Force
        Rename-Item -Path $filewosplchars -NewName $filefullpath -Force
    }


    Tuesday, November 7, 2017 9:44 AM
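
    A likely cause of the slowness in the script above is the line [array]$arr += $char. PowerShell arrays have a fixed size, so every += builds a new array and copies all existing elements into it, which makes the cost grow roughly with the square of the string length. Below is a minimal sketch of the same loop that appends to a StringBuilder instead. The name Remove-ControlChars is made up for this illustration, and unlike the original it does not round-trip the text through ASCII bytes, so non-ASCII characters are kept rather than turned into '?'.

    ```powershell
    # Hypothetical helper: same goal as RemoveSplChar, but appends in place.
    function Remove-ControlChars([string]$Text)
    {
        # StringBuilder grows its own buffer, so appending does not recopy everything.
        $sb = [System.Text.StringBuilder]::new($Text.Length)
        foreach ($ch in $Text.ToCharArray())
        {
            # Keep only characters whose code point is above 31.
            if ([int]$ch -gt 31) { [void]$sb.Append($ch) }
        }
        $sb.ToString()
    }

    # Sample input: BEL (7) and TAB (9) are both below 32.
    $result = Remove-ControlChars ('a' + [char]7 + 'b' + [char]9 + 'c')
    ```

    With this sample input, $result should contain only the three letters.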

Answers

  • I found a method that took less than a minute! Below is the changed function:

    function RemoveSplChar($string)
    {
        $string -replace '[^\u0000-\u007F]+', ""
    }
    • Marked as answer by KarthikSN Wednesday, November 8, 2017 7:29 AM
    Wednesday, November 8, 2017 7:29 AM
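
    One thing to check before reusing the function above: the pattern [^\u0000-\u007F] matches characters outside the range U+0000 to U+007F, so it strips non-ASCII characters and leaves control characters in place. If the goal is still the one stated in the question (characters with a code below 32), a pattern along these lines may be closer. This is a sketch, not the poster's code, and the sample string is made up.

    ```powershell
    function RemoveSplChar($string)
    {
        # [\x00-\x1F] matches any character with a code point from 0 to 31.
        $string -replace '[\x00-\x1F]+', ''
    }

    # Made-up sample: 'A', BEL (7), 'B', then U+00E9 (e with acute accent).
    $sample = 'A' + [char]7 + 'B' + [char]233
    $clean  = RemoveSplChar $sample

    # For comparison, the pattern from the answer applied to the same sample.
    $posted = $sample -replace '[^\u0000-\u007F]+', ''
    ```

    Here $clean keeps the accented letter and drops the BEL, while $posted does the opposite. XML 1.0 does allow tab, line feed and carriage return, so if those should survive, the class [\x00-\x08\x0B\x0C\x0E-\x1F] is one way to skip them.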

All replies

  • You do not need to remove special characters from "CDATA" sections.

    You can use the encode function to convert special characters from text elements.

    Here is the fastest way to convert XML: https://msdn.microsoft.com/en-us/library/4zhk8s1x(v=vs.71).aspx

    If the file loads as XML, you can just re-encode each text field, which would be fast.

    [xml]$xml = Get-Content myfile.xml


    \_(ツ)_/

    Tuesday, November 7, 2017 2:12 PM
  • Also note that formatting characters do not have to be escaped in XML.

    To escape a string use the following method:

    [System.Security.SecurityElement]::Escape($string)


    \_(ツ)_/

    Tuesday, November 7, 2017 2:18 PM
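
    To make the Escape suggestion above concrete, here is a small usage sketch with a made-up input string. It shows that the method replaces XML markup characters with entities. As far as I know it only handles <, >, &, " and ', so it would not by itself remove characters with a code below 32.

    ```powershell
    # Made-up input containing markup characters.
    $raw     = 'a < b & "c"'
    $escaped = [System.Security.SecurityElement]::Escape($raw)
    # $escaped is: a &lt; b &amp; &quot;c&quot;
    ```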
  • Thank you for your response :)

    In my case, the XML is written by an application and contains raw data in a few places. I can import the file with [xml]$xml = Get-Content, navigate to the parts where the special characters are, and remove them. But all of that happens only in memory.

    Is there a way to write the result back to an XML file, similar to

    set-content

    My intention is to have an XML file with no special characters.

    Thanks a lot!



    • Edited by KarthikSN Tuesday, November 7, 2017 6:45 PM
    Tuesday, November 7, 2017 4:31 PM
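
    On writing the in-memory document back to disk: the XmlDocument object behind [xml] has a Save method that takes a file path. The sketch below assumes the file loads as XML in the first place (the parser may refuse some control characters), and it uses a made-up sample file in the temp folder in place of the real export files.

    ```powershell
    # Hypothetical sample standing in for one of the export files.
    $path = Join-Path ([System.IO.Path]::GetTempPath()) 'sample_export.xml'
    $tab  = [char]9
    Set-Content -Path $path -Value "<root><item><![CDATA[a${tab}b]]></item></root>"

    $xml = New-Object System.Xml.XmlDocument
    $xml.Load($path)

    # text() selects text and CDATA nodes; @() collects them before any change is made.
    foreach ($node in @($xml.SelectNodes('//text()')))
    {
        # Only touch CDATA sections, as the original script does.
        if ($node -is [System.Xml.XmlCDataSection])
        {
            $node.Value = $node.Value -replace '[\x00-\x1F]+', ''
        }
    }

    # Save() is a .NET call, so give it a full path rather than a relative one.
    $xml.Save($path)
    ```

    The full path matters because .NET methods resolve relative paths against the process working directory, which is not always the current PowerShell location.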
  • Please fix your post.  It is unreadable.  Do not attempt to post formatted text.  It breaks the editor.


    \_(ツ)_/

    Tuesday, November 7, 2017 5:22 PM
  • Sorry about that. Modified as suggested :)
    Tuesday, November 7, 2017 6:54 PM
  • Hi,

    I'm checking in on how this is going. Was your issue resolved?

    If the replies above were helpful, we would appreciate it if you marked them as answers. If you resolved it with your own solution, please share your experience and solution here. It will be a great help to others who have the same question.

    Thank you for your feedback.

    Best Regards,
    Albert Ling


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com

    Wednesday, November 8, 2017 5:57 AM