Need a method to check and remove string duplication within a line of a text file

    Question

  • Hello,

    Here's my situation. I need to read a text file that contains "ship to" and "sold to" information and retrieve only the "ship to" information. My problem is that the "ship to" and "sold to" addresses are on the same line (e.g. Mr. John Doe Mr. John Doe <next line> 123 Main St. 123 Main St. etc...). My idea was to pull out the first x characters of the line, check whether those characters are duplicated within the line, and then store the result in a variable for output to a CSV file. I just can't seem to come up with an easy way to do it. Any help would be appreciated.

    Thanks

    I Matthies

    • Moved by Bill_Stewart Thursday, January 02, 2014 8:58 PM Abandoned
    Monday, November 18, 2013 9:49 PM

All replies

  • Hi,

    Please post at least a few example lines from your input file. You can remove the real information, but we'll need to know what you're actually working with before we can offer any suggestions.

    EDIT: At least I will. =]


    Don't retire TechNet! - (Maybe there's still a chance for hope, over 12,300+ strong and growing)


    Monday, November 18, 2013 9:53 PM
  • Hi,

    This checks each line for a duplicated phrase and sends the deduplicated text to the pipeline:

    $path="c:\test.txt"
    Get-Content $path | foreach {
    	$words=$_.Trim() -split " "
    	$len=$words.Length
    	#no need to go any further than half way
    	$end=[Math]::Ceiling($len/2)
    	for($index=0;$index -lt $end;$index++){
    		$left=$words[0..$index] -join " "
    		$right=$words[($index+1)..($len-1)] -join " "
    		#output dupe and continue on next line
    		if ($left -eq $right){
    			$left
    			break
    		}
    	}
    }


    • Edited by Dirk_74 Tuesday, November 19, 2013 2:38 AM
    Tuesday, November 19, 2013 2:16 AM
  • If your text contains only that kind of duplicate, you could simplify further to:

    $path="c:\test.txt"
    Get-Content $path | foreach {
    	$words=$_.Trim() -split " "
    	$len=$words.Length
    	#an odd word count cannot be two identical halves; skip to the next line
    	if ($len % 2 -ne 0){ return }
    	$half=$len/2
    	$left=$words[0..($half-1)] -join " "
    	$right=$words[$half..($len-1)] -join " "
    	if ($left -eq $right){
    		#output the deduplicated text
    		$left
    	}
    }

    Tuesday, November 19, 2013 2:36 AM
  • Here is some of the data:

    John Smith John Smith
    #4, 10516-79 Ave. #4, 10516-79 Ave.
    Edmonton, AB T6E 1R8 Edmonton, AB T6E 1R8
    780-200-0217 780-200-0217
    Lisa Simpson Lisa Simpson
    5131 5131
    St-Félix-De-Valois, QC J0K 2M0 St-Félix-De-Valois, QC J0K 2M0
    450-889-4009 450-889-4009  
    Normand Leronde Normand Leronde
    15 RUE DUPRAS 15 RUE DUPRAS
    SAINT-BASILE-LE-GRAND, QC J3N 1H1 SAINT-BASILE-LE-GRAND, QC J3N 1H1
    450-461-1086 450-461-1086

    Thanks

    Tuesday, November 19, 2013 9:14 PM
  • Get-Content .\test.txt | ForEach-Object { $_ -replace '^\s*(.+)\s*\1\s*$', '$1' }
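
    For anyone reading the pattern for the first time, here is a small sketch of what it does, run against three of the sample lines posted above rather than a file. The names `$samples`, `$pattern` and `$result` are made up for this illustration. `(.+)` captures the first copy, `\1` requires the same text again, and the `\s*` parts absorb the space between the copies and any stray whitespace at the ends. On a line with trailing spaces the captured group can keep one of them, so the sketch trims the result.

```powershell
# Sample lines copied from the data posted earlier in the thread.
$samples = @(
    'John Smith John Smith',
    '#4, 10516-79 Ave. #4, 10516-79 Ave.',
    '450-889-4009 450-889-4009  '
)

# (.+) = first copy, \1 = the same text repeated, \s* = optional whitespace.
$pattern = '^\s*(.+)\s*\1\s*$'

# Lines that do not match the pattern pass through unchanged.
# Trim() removes a trailing space that the capture group may pick up.
$result = $samples | ForEach-Object { ($_ -replace $pattern, '$1').Trim() }
$result
```

    One thing to watch for: because the whitespace between the copies is optional, a line with no space at all, such as `1111`, would also be cut in half. If that can occur in the real file, changing the middle `\s*` to `\s+` is one possible guard.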

    Tuesday, November 19, 2013 9:25 PM
  • David's already posted a short way. Here's my much larger block of code:

    $finalData = @()
    
    Get-Content .\combinedInfo.txt | ForEach {
    
        $arr = $_.Split(' ')
    
        $matchFound = $false
        $i = 1
        
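        # Look for the first repeat of the opening word. The bounds check stops
        # the loop on lines that contain no repeat at all.
        Do {
    
            If ($arr[$i] -eq $arr[0]) { $matchFound = $true }
            Else { $i++ }
    
        } Until ($matchFound -or $i -ge $arr.Count)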
    
        $outstring = ''
        
        For ($j = 0 ; $j -lt $i ; $j++ ) { $outString += $arr[$j] + ' ' }
    
        $finalData += $outString 
    
    }
    
    $finalData | Out-File splitInfo.txt

    I'm sure there are better ways to go about this, but this worked for me at least.

    EDIT: There's a flaw in there that might give you trouble. If there's a duplicated word in the 'first section' of data, the code will stop there.
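
    One possible way around that flaw, as a rough sketch only: treat each repeat of the first word as a candidate split point and accept it only when the text on both sides is identical. `Get-FirstCopy` is a hypothetical helper name invented for this example, and the 'Tom Tom Jones' line is made-up test data, not from the original file.

```powershell
function Get-FirstCopy {
    param([string]$Line)

    $arr = $Line.Trim() -split '\s+'
    for ($i = 1; $i -lt $arr.Count; $i++) {
        # A repeat of the first word is only a candidate split point.
        if ($arr[$i] -ne $arr[0]) { continue }

        $left  = $arr[0..($i - 1)] -join ' '
        $right = $arr[$i..($arr.Count - 1)] -join ' '

        # Accept the split only when both sides are identical.
        if ($left -eq $right) { return $left }
    }

    # No clean duplicate found: hand the line back unchanged.
    return $Line
}

Get-FirstCopy 'Tom Tom Jones Tom Tom Jones'
Get-FirstCopy 'John Smith John Smith'
```

    The first call skips the early 'Tom Tom' repeat because the two sides differ there, and only splits where the halves agree.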




    Tuesday, November 19, 2013 9:36 PM
  • Here is some of the data:

    John Smith John Smith
    #4, 10516-79 Ave. #4, 10516-79 Ave.
    Edmonton, AB T6E 1R8 Edmonton, AB T6E 1R8
    780-200-0217 780-200-0217
    Lisa Simpson Lisa Simpson
    5131 5131
    St-Félix-De-Valois, QC J0K 2M0 St-Félix-De-Valois, QC J0K 2M0
    450-889-4009 450-889-4009  
    Normand Leronde Normand Leronde
    15 RUE DUPRAS 15 RUE DUPRAS
    SAINT-BASILE-LE-GRAND, QC J3N 1H1 SAINT-BASILE-LE-GRAND, QC J3N 1H1
    450-461-1086 450-461-1086

    Thanks

    I can almost guarantee you that it is tab delimited.

    Split tabs like this:
    Get-Content file.ext | %{ $_.Split("`t") }

    That is a backtick followed by a "t". This will split the line. The file is a label file designed to print two columns of labels; a tab sets the second label. This is a common output format. Just split every line that has a tab into its two separate parts.
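
    If the file really is tab delimited (that is still a guess until the original poster confirms it), the whole job could look something like the sketch below. It uses an in-memory array with tabs inserted by hand in place of the real file, and the names `$lines`, `$shipTo`, `$records`, the `ShipTo` column and the `shipto.csv` output file are all assumptions made for this example. Since both columns are identical, it simply keeps the text before the first tab. `[pscustomobject]` needs PowerShell 3.0 or later.

```powershell
# Hypothetical sample: two label columns separated by a tab, as assumed above.
$lines = @(
    "John Smith`tJohn Smith",
    "#4, 10516-79 Ave.`t#4, 10516-79 Ave.",
    "Edmonton, AB T6E 1R8`tEdmonton, AB T6E 1R8",
    "780-200-0217`t780-200-0217"
)

# Keep only the text before the first tab on each line.
$shipTo = $lines | ForEach-Object { ($_ -split "`t")[0].Trim() }

# One object per line so Export-Csv writes a single 'ShipTo' column.
$records = $shipTo | ForEach-Object { [pscustomobject]@{ ShipTo = $_ } }

# Write to the temp folder so the sketch does not touch any real data.
$csvPath = Join-Path ([IO.Path]::GetTempPath()) 'shipto.csv'
$records | Export-Csv -Path $csvPath -NoTypeInformation
```

    With a real file, `$lines` would come from `Get-Content` instead of the hand-built array.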


    ¯\_(ツ)_/¯

    Tuesday, November 19, 2013 10:02 PM
  • Take the shortcut.  Split it on the tab characters.

    ¯\_(ツ)_/¯

    Tuesday, November 19, 2013 10:03 PM