I want to search for Latin words.

  • Question

  • findstr  /s /i habitação *.*

    Do I have to set the font to UTF-8? How?

    I use PowerShell version 2.0

    I appreciate help!

    Friday, March 6, 2020 2:26 PM

All replies

  • Where do you want to search for those words?
    Friday, March 6, 2020 2:50 PM
  • In all text files, in the selected folder.
    Friday, March 6, 2020 3:01 PM
  • PowerShell 2 is no longer supported and should not be used on any system due to security issues.

    The file encoding is selected automatically when the utility reads the file.

    Your question is not a PowerShell question.

    UTF-8 is not a font; it is a file encoding.


    \_(ツ)_/

    Friday, March 6, 2020 3:02 PM
  • The PowerShell command is the following:

    Select-String -Pattern 'habitação' -Path *.*
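For the original requirement (all text files under a folder, ignoring case, as with findstr /s /i), the command above could be extended roughly as follows. This is only a sketch: the function name Find-Word, its parameters and the *.txt filter are assumptions made for illustration, and -Encoding UTF8 assumes the files really are UTF-8. Select-String ignores case by default, so no extra switch is needed for that.

```powershell
# Hypothetical helper: recursive, case-insensitive literal search in text files.
function Find-Word {
    param(
        [string]$Word,
        [string]$Root = '.',
        [string]$Filter = '*.txt'   # assumption: "text files" means *.txt
    )
    Get-ChildItem -Path $Root -Recurse -Filter $Filter |
        Where-Object { -not $_.PSIsContainer } |                    # skip folders
        Select-String -Pattern $Word -SimpleMatch -Encoding UTF8     # literal match
}

# Example call, searching the current folder and everything below it:
# Find-Word -Word 'habitação' -Root '.'
```

-SimpleMatch treats the word as literal text rather than as a regular expression, which is usually what a plain word search wants.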


    \_(ツ)_/

    • Proposed as answer by Nevets24 Friday, March 6, 2020 3:26 PM
    Friday, March 6, 2020 3:10 PM
  • I think I'd add "word boundary" metacharacters to that pattern. Searching for words with unbounded patterns can lead to mismatches, especially if the expected match is a short one, or one that's a common root for other words.

    $word = "habitação"
    Select-String -Pattern "\b$word\b" -Path *.*
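To illustrate what the word boundaries change, here is a small sketch. The helper name Get-WholeWordPattern is made up for this example; [regex]::Escape is added so that a word containing regex metacharacters is still matched literally.

```powershell
# Hypothetical helper: wrap a literal word in \b ... \b word boundaries.
function Get-WholeWordPattern {
    param([string]$Word)
    '\b' + [regex]::Escape($Word) + '\b'
}

$pattern = Get-WholeWordPattern 'habitação'

'uma habitação nova' -match $pattern   # True: whole word
'habitaçãozinha'     -match $pattern   # False: only part of a longer word
```

The resulting pattern can be passed to Select-String -Pattern in the same way. In Windows PowerShell, a script file containing accented literals generally needs to be saved as UTF-8 with a BOM for them to be read correctly.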

    Also not mentioned in the posting's topic or text (and maybe it's not relevant at all) is whether or not any of the words may be split across lines (e.g., both hyphenation, and having to match line endings in the string).


    --- Rich Matheisen MCSE&I, Exchange Ex-MVP (16 years)

    Friday, March 6, 2020 8:26 PM
  • Non-ASCII search (not 0-127):


    'habitação' -match '[^\x00-\x7F]'

    True


    I was going to say '\P{IsBasicLatin}', but it matches capital and small letter 'i'.  Weird.


    'i' -match '\P{IsBasicLatin}'

    True

    'I' -match '\P{IsBasicLatin}'

    True
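When debugging this kind of search it can help to see exactly which characters fall outside 0-127. The sketch below lists them with their code points; the function name Get-NonAsciiChar is an assumption for illustration. It uses [regex]::Matches, which is case-sensitive unless told otherwise.

```powershell
# Hypothetical helper: list every non-ASCII character in a string with its code point.
function Get-NonAsciiChar {
    param([string]$Text)
    foreach ($m in [regex]::Matches($Text, '[^\x00-\x7F]')) {
        'U+{0:X4} {1}' -f ([int][char]$m.Value), $m.Value
    }
}

Get-NonAsciiChar 'habitação'
# U+00E7 ç
# U+00E3 ã
```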

    • Edited by JS2010 Saturday, March 7, 2020 11:07 PM
    Saturday, March 7, 2020 10:49 PM
  • I should also note that to correctly match characters from European character sets such as ISO-8859-n, you need to specify UTF8 explicitly if the file is ANSI/ASCII/UTF8.  "Select-String" apparently detects the encoding correctly.  "-match" will not, depending on the characters in use.

    Also see the following for other issues with European character sets, particularly ISO-8859-1.

    https://www.i18nqa.com/debug/bug-double-conversion.html 

    https://www.i18nqa.com/debug/bug-utf-8-latin1.html

    https://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html
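The kind of mismatch those pages describe can be reproduced in memory. This is a sketch, not a diagnosis of the asker's files: it takes the UTF-8 bytes of the word and decodes them once as ISO-8859-1 and once as UTF-8. The helper name Get-DecodedText is hypothetical.

```powershell
# Hypothetical helper: decode raw bytes using a named encoding.
function Get-DecodedText {
    param([byte[]]$Bytes, [string]$EncodingName)
    [System.Text.Encoding]::GetEncoding($EncodingName).GetString($Bytes)
}

$word  = 'habitação'
$bytes = [System.Text.Encoding]::UTF8.GetBytes($word)

$misread = Get-DecodedText $bytes 'iso-8859-1'   # wrong encoding
$reread  = Get-DecodedText $bytes 'utf-8'        # right encoding

$misread                 # habitaÃ§Ã£o
$misread -match $word    # False
$reread  -match $word    # True
```

Each accented letter is two bytes in UTF-8, so reading those bytes as a single-byte character set turns it into two unrelated characters and the search pattern no longer matches.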


    \_(ツ)_/


    • Edited by jrv Sunday, March 8, 2020 12:59 AM
    Sunday, March 8, 2020 12:56 AM
  • You're using a case-insensitive match operator. Try -cmatch instead.

    --- Rich Matheisen MCSE&I, Exchange Ex-MVP (16 years)

    Sunday, March 8, 2020 3:44 AM
  • You're using a case-insensitive match operator. Try -cmatch instead.

    --- Rich Matheisen MCSE&I, Exchange Ex-MVP (16 years)

    Why?  The match doesn't work at all.  If it matched the wrong thing you could be right, but case has nothing to do with this.  The issue is the character set in use and how the "RegEx" works by default.  If the file type is wrong, or the file has been converted from UTF8 to ANSI, then the match will never work. See the links I posted for why this can happen.


    \_(ツ)_/

    Sunday, March 8, 2020 4:06 AM
  • You're using a case-insensitive match operator. Try -cmatch instead.

    --- Rich Matheisen MCSE&I, Exchange Ex-MVP (16 years)

    You're correct. -cmatch is a workaround.  Otherwise it matches some Turkish character that looks similar, but has a dot on top.  It seems like a bug to me.


    'i' -cmatch '\P{IsBasicLatin}'
    False

    'I' -cmatch '\P{IsBasicLatin}'
    False

    [char]0x130
    İ

    'i' -match 'İ'
    True
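One way to keep case folding out of the picture entirely is to wrap the case-sensitive test in a small function. This is a sketch; the name Test-BasicLatinOnly is an assumption for illustration. It returns True only when every character of the input is in the Basic Latin block (U+0000 to U+007F).

```powershell
# Hypothetical helper: True only if every character is Basic Latin.
# -cmatch is used so that case-insensitive equivalences cannot affect the result.
function Test-BasicLatinOnly {
    param([string]$Text)
    $Text -cmatch '^\p{IsBasicLatin}*$'
}

Test-BasicLatinOnly 'habitacao'             # True
Test-BasicLatinOnly 'habitação'             # False
Test-BasicLatinOnly 'I'                     # True
Test-BasicLatinOnly ([string][char]0x130)   # False (capital I with dot above)
```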

    • Edited by JS2010 Sunday, March 8, 2020 5:29 PM
    Sunday, March 8, 2020 5:05 PM
  • Yes.  "cmatch" is an important issue when needing direct matches, and international character sets further aggravate things in interesting ways.


    \_(ツ)_/

    Sunday, March 8, 2020 6:18 PM
  • On the other hand, the Kelvin K passes as ASCII when case is ignored:


    [char]0x212a | select-string '\p{IsBasicLatin}'


    Sunday, March 8, 2020 7:55 PM