Replace u+fffd

  • Question

  • I have HTML pages with some embedded U+FFFD characters in them.  I am trying to use PowerShell to get rid of all of these and replace them with blanks or perhaps a single quote, so that I don't have to do it manually.  Here is my script:

    $oldfile = "C:\PS\Filename.aspx" 
    $newfile = "C:\PS\Filename2.aspx"

    $text = (Get-Content -Path $oldfile -ReadCount 0) -join "`n"
    $text -replace 'u+FFFD', ' ' | Set-Content -Path $newfile

    What happens when I run this?  Well, it executes successfully with no errors.  However, it does not replace the u+FFFD's.  Can someone explain what I am doing wrong?

     
    Wednesday, June 21, 2017 1:37 PM
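
A note on why the script above runs cleanly but changes nothing: `-replace` treats its pattern as a .NET regular expression, so in `'u+FFFD'` the `+` is a quantifier and the pattern means "one or more letters u, then the letters FFFD". The sketch below illustrates this with made-up sample strings; the variable names are illustrative only.

```powershell
# Hypothetical samples: one holds the literal letters, the other the real U+FFFD character.
$literal = 'uuFFFD'
$actual  = 'men' + [char]0xFFFD + 'a'

$literalHit = $literal -match 'u+FFFD'   # the pattern matches the letters "uuFFFD"
$actualHit  = $actual  -match 'u+FFFD'   # it never matches the character itself
$escapeHit  = $actual  -match '\uFFFD'   # \uFFFD is the regex escape for the code point
```

Whether `'\uFFFD'` matches in a real file also depends on how the file was decoded when it was read, which the rest of the thread runs into.
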

All replies

  • What are u+fffd characters?  Post a sample of the file.


    \_(ツ)_/

    Wednesday, June 21, 2017 1:45 PM
  • There are about 10 of them in this file.  In TextPad, these characters show up as a "control character" with the characteristic box.  When I push the pages up to my site, they display online as a weird upside-down question mark that has a hieroglyphic look.  When I hover over the control character box, it shows U+FFFD, and I have done an octal dump to verify as well.  Here is an excerpt:

    "Testosterone replacement therapy improves mood in hypogonadal men�a clinical

    What causes it are those single quotes that aren't really single quotes.  I need to get rid of these and replace them (ideally) with a single quote.

    Wednesday, June 21, 2017 1:52 PM
  • To replace these you will have to use the hex values in your match string.

    $text -replace '\xff\xfd'

    These are Unicode characters that are unprintable. 

    If you have "smart quotes" then you will need to replace them by referencing the smart quote characters.

    I doubt that the characters are what you say.  U+FFFD is the Unicode replacement character.  It stands in for a character that could not be decoded or displayed.

    See: http://www.fileformat.info/info/unicode/char/fffd/index.htm


    \_(ツ)_/

    Wednesday, June 21, 2017 2:00 PM
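
For reference, in .NET regular expressions `\xNN` takes exactly two hex digits, so `'\xff\xfd'` describes two characters (U+00FF followed by U+00FD) rather than the single code point U+FFFD; the four-digit form `\uFFFD` names it directly. A minimal sketch, using a sample string built from the excerpt above and assuming the text has already been decoded so that it really contains U+FFFD:

```powershell
# Sample built from the excerpt; [char]0xFFFD is the replacement character.
$sample = 'hypogonadal men' + [char]0xFFFD + 'a clinical'

# \xff\xfd = U+00FF then U+00FD, which the string does not contain.
$twoCharHit = $sample -match '\xff\xfd'

# \uFFFD matches the replacement character; here it is swapped for an apostrophe.
$fixed = $sample -replace '\uFFFD', "'"
```
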
  • Interesting.  I read the replacement character phrase in a couple of web pages but didn't understand.  So I did the following and it did not change anything.  Is this the correct syntax?  (I know you said this likely would not work, but I thought I would try it.)

    # Search and Replace characters in a file:

    $oldfile = "C:\PS\filename.aspx" 
    $newfile = "C:\PS\filename2.aspx"

    $text = (Get-Content -Path $oldfile -ReadCount 0) -join "`n"
    $text -replace '\xff\xfd', ' ' | Set-Content -Path $newfile

    • Marked as answer by clm2 Wednesday, June 21, 2017 2:44 PM
    • Unmarked as answer by clm2 Wednesday, June 21, 2017 2:45 PM
    • Marked as answer by clm2 Wednesday, June 21, 2017 2:45 PM
    Wednesday, June 21, 2017 2:06 PM
  • If that is what it is then you need to find the actual characters and replace them by Unicode values.

    $text -replace '\u0008'

    First figure out what they are using a hex editor.

    If they are smart quotes then they are \u201C and \u201D


    \_(ツ)_/


    • Edited by jrv Wednesday, June 21, 2017 2:11 PM
    Wednesday, June 21, 2017 2:10 PM
  • The single quote characters are \u2018 and \u2019

    If this is true HTML then they are in HTML escaped format.

    “  &ldquo;  (left double curly quote)

    ‘  &lsquo;  (left single curly quote)

    ”  &rdquo;  (right double curly quote)

    ’  &rsquo;  (right single curly quote)


    \_(ツ)_/



    • Edited by jrv Wednesday, June 21, 2017 2:18 PM
    Wednesday, June 21, 2017 2:15 PM
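
The code points mentioned in the last two replies can be handled with two character classes. This is only a sketch with a made-up sample string, and it helps only when the curly quotes actually survive in the decoded text; it does nothing for text where they have already been turned into U+FFFD.

```powershell
# Hypothetical sample: curly double quotes around a word, plus a curly apostrophe.
$sample = '{0}quoted{1} and it{2}s' -f [char]0x201C, [char]0x201D, [char]0x2019

$ascii = $sample -replace '[\u201C\u201D]', '"'   # curly double quotes -> straight double quote
$ascii = $ascii  -replace '[\u2018\u2019]', "'"   # curly single quotes -> straight apostrophe
```

Writing the pattern with `\u` escapes avoids pasting the curly characters into the script, where an editor or code page could alter them.
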
  • Okay, you were right (of course).  I did a hexdump - this is the first time I've done that so bear with me - and there are actually three "characters" that are not printable or that are causing problems.  Here is what they are in the hex dump:

    bf bd 33

    Does this make sense?

    Wednesday, June 21, 2017 2:19 PM
  • If I understood what you are saying, I tried this as well and it did not work:

    $oldfile = "C:\PS\filename.aspx" 
    $newfile = "C:\PS\filename2.aspx"

    $text = (Get-Content -Path $oldfile -ReadCount 0) -join "`n"
    $text -replace '\u201D', ' ' | Set-Content -Path $newfile

    (I also tried u201C and it did not work either.)  If I didn't understand, let me know.  


    Wednesday, June 21, 2017 2:25 PM
  • 0xBF is an upside-down question mark (¿) in Latin-1 / Windows-1252; it is outside the ASCII range.

    0xBD is the symbol for 1/2 (½, one half)

    0x33 is the number 3 and is printable.


    \_(ツ)_/

    Wednesday, June 21, 2017 2:31 PM
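
It may help to see where 0xBF and 0xBD come from: they are the last two bytes of the UTF-8 encoding of U+FFFD, which is EF BF BD. The sketch below shows that encoding and what the three bytes turn into when decoded one byte per character. The choice of ISO-8859-1 here is an assumption standing in for whatever single-byte code page was actually in play.

```powershell
# UTF-8 bytes of the replacement character U+FFFD.
$utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes([string][char]0xFFFD)
$hex = ($utf8Bytes | ForEach-Object { $_.ToString('X2') }) -join ' '

# Assumption: a single-byte Latin-1 decode, which maps each byte to one character.
$latin1  = [System.Text.Encoding]::GetEncoding('iso-8859-1')
$misread = $latin1.GetString($utf8Bytes)
$codes   = ($misread.ToCharArray() | ForEach-Object { '{0:X4}' -f [int]$_ }) -join ' '
```

Here `$hex` lists the three UTF-8 bytes and `$codes` lists the three separate characters they become under the single-byte decode, the second and third being the upside-down question mark and the one-half sign.
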
  • I cannot thank you enough for your help!  You were right:  the hex values that I needed were shifted to the left.  This actually worked; I have checked about half the file, and all of them have been replaced so far:

    $oldfile = "C:\PS\filename.aspx" 
    $newfile = "C:\PS\filename2.aspx"

    $text = (Get-Content -Path $oldfile -ReadCount 0) -join "`n"
    $text -replace '\xef\xbf\xbd', ' ' | Set-Content -Path $newfile

    And you are right:  it was printing something weird on the HTML page, an upside-down question mark.  The above replaced these three characters with a blank.  You just saved me hours and hours of doing this manually.  (I have over a thousand HTML - actually aspx - pages to go through.)


    • Edited by clm2 Wednesday, June 21, 2017 2:40 PM
    • Marked as answer by clm2 Wednesday, June 21, 2017 2:45 PM
    Wednesday, June 21, 2017 2:39 PM
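
A possible reading of why `'\xef\xbf\xbd'` worked: if `Get-Content` decoded the file with a single-byte ANSI code page, the UTF-8 bytes EF BF BD arrived as three separate characters, which is exactly what that pattern describes. For the remaining pages, an alternative is to decode explicitly as UTF-8 and match the one character `\uFFFD`. The following is a minimal sketch under the assumption that the files are UTF-8; `Repair-ReplacementChar` and its parameters are hypothetical names, not an existing cmdlet.

```powershell
# Hypothetical helper; the name and parameters are illustrative only.
function Repair-ReplacementChar {
    param(
        [Parameter(Mandatory)] [string] $Path,         # full path of the source file
        [Parameter(Mandatory)] [string] $Destination,  # full path of the cleaned copy
        [string] $Replacement = "'"
    )
    # Decode as UTF-8 so the bytes EF BF BD arrive as the single character U+FFFD.
    $text  = [System.IO.File]::ReadAllText($Path, [System.Text.Encoding]::UTF8)
    $fixed = $text -replace '\uFFFD', $Replacement
    # Write UTF-8 without a byte order mark.
    $utf8NoBom = New-Object System.Text.UTF8Encoding $false
    [System.IO.File]::WriteAllText($Destination, $fixed, $utf8NoBom)
}

# Possible usage over a folder (the folder and the '.fixed' suffix are assumptions):
# Get-ChildItem -Path 'C:\PS' -Filter *.aspx |
#     ForEach-Object { Repair-ReplacementChar -Path $_.FullName -Destination ($_.FullName + '.fixed') }
```

Note that U+FFFD only marks where a character was lost, so substituting a single quote is a guess based on the excerpt earlier in the thread; spot-checking a few pages before running it across all of them seems prudent.
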