Folder Comparison from multiple servers

  • Question

  • I need to compare folders from multiple servers that have no direct connection to each other, so I generate a listing on each server with dir /b /s >c:\server1.txt and feed those text files to a script for comparison.

    My desired output is a CSV file in the following format:

    File name,Server1,server2,server3,Is it missing file?
    File1,yes,yes,yes,No
    file2,Yes,Yes,No,Yes
    File3,Yes,No,Yes,Yes
    File4,No,Yes,Yes,Yes

    These folders contain a few hundred thousand files, and the script takes 4 to 5 hours to run the comparison. I thought threads would speed it up through parallel processing, but they did not help. I have no experience with threads, and after searching for examples and implementing one, the script was even slower than before.
    I am probably doing something wrong. Any help is much appreciated.

    The following is the script so far. It works fine, but it takes a long time.

    [array]$contentArray = @()
    [array]$allfileslist = @()
    [array]$uniquefileslist = @()
    [array]$allfilesSRVlist = @()
    [array]$srvlist = @()
    
    $FolderName = Split-Path -parent $MyInvocation.MyCommand.Definition
    $reportName = $FolderName + "\ComparisonReport3.csv"
    
    $ListOfFiles = get-childitem $FolderName
    $List = $ListOfFiles | where {$_.extension -ieq ".txt"}
    $list.count
    $index = 0
    foreach($listitem in $List){
    	$listfilename = $listitem.FullName
    	$listname = $listitem.Name
    	$listname = $listname.replace(".txt","")
    	$srvlist = $srvlist + $listname
    	write-host $listfilename 
    	#$StrContent = get-content $listfilename 
    	$StrContent = [io.file]::ReadAllLines($listfilename)
    	$contentArray += ,@($StrContent)
    	$StrContent.count
    	$contentArray[$index].count
    	$index = $index + 1
    }
    for($i = 0;$i -lt $index;$i++){
    	$allfileslist = $allfileslist + $contentArray[$i] 
    }
    $allfileslist.count
    $uniquefileslist = $allfileslist | sort-object | get-unique
    $Stroutline = "File Name,"
    foreach($srvlistitem in $srvlist){
    	$Stroutline = $Stroutline + $srvlistitem.ToUpper() + ","
    }
    $Stroutline = $Stroutline + "Is it Missing file?"
    Write-Output $Stroutline | Out-File "$reportName"  -Force
    foreach($uniquefileslistitem in $uniquefileslist){
    	$Stroutline = ""
    	$missingfile = "No"
    	$Stroutline = $uniquefileslistitem + ","
    	for($i=0;$i -lt $index;$i++){
    		if($contentArray[$i] -contains $uniquefileslistitem){
    			$Stroutline = $Stroutline + "Yes,"
    		}
    		else{
    			$Stroutline = $Stroutline + "No,"
    			$missingfile = "Yes"
    		}
    	}
    	$Stroutline = $Stroutline + $missingfile
    	Write-Output $Stroutline | Out-File "$reportName"  -Force -Append 
    	$j++
    }
    

    Following is the script I modified using a threads example. This one runs even slower.

    [array]$contentArray = @()
    [array]$allfileslist = @()
    [array]$uniquefileslist = @()
    [array]$allfilesSRVlist = @()
    [array]$srvlist = @()
    
    $FolderName = Split-Path -parent $MyInvocation.MyCommand.Definition
    $reportName = $FolderName + "\ComparisonReport3.csv"
    
    $ListOfFiles = get-childitem $FolderName
    $List = $ListOfFiles | where {$_.extension -ieq ".txt"}
    $list.count
    $index = 0
    #$contentArray = New-Object 'object[,]' $xDim, $yDim
    foreach($listitem in $List){
    	$listfilename = $listitem.FullName
    	$listname = $listitem.Name
    	$listname = $listname.replace(".txt","")
    	$srvlist = $srvlist + $listname
    	write-host $listfilename 
    	#$StrContent = get-content $listfilename 
    	$StrContent = [io.file]::ReadAllLines($listfilename)
    	$contentArray += ,@($StrContent)
    	$StrContent.count
    	$contentArray[$index].count
    	$index = $index + 1
    }
    
    for($i = 0;$i -lt $index;$i++){
    	$allfileslist = $allfileslist+ $contentArray[$i] 
    }
    $allfileslist.count
    #$uniquefileslist = $allfileslist | select –unique
    get-date
    $uniquefileslist = $allfileslist | sort-object | get-unique
    $uniquefileslist.count
    get-date
    $Stroutline = "File Name,"
    foreach($srvlistitem in $srvlist){
    	$Stroutline = $Stroutline + $srvlistitem.ToUpper() + ","
    }
    $Stroutline = $Stroutline + "Is it Missing file?"
    Write-Output $Stroutline | Out-File "$reportName"  -Force
    $j = 1
    $count = $uniquefileslist.count
    
    $stroutput = ""
    $maxConcurrent = 50
    $results= ""
    $PauseTime = 1
    
    $uniquefileslist | %{
    	while ((Get-Job -State Running).Count -ge $maxConcurrent) {Start-Sleep -seconds $PauseTime}
    	$job = start-job -argumentList $_,$contentArray,$index -scriptblock {
    	$StrArgFileName = $args[0]
    	$ArgContentArray = $args[1]
    	$ArgIndex = $args[2]
    	$Stroutline = ""
    	$missingfile = "No"
    	$Stroutline = $StrArgFileName + ","
    	for($i=0;$i -lt $ArgIndex;$i++){
    		if($ArgContentArray[$i] -contains $StrArgFileName){
    			$Stroutline = $Stroutline + "Yes,"
    		}
    		else{
    			$Stroutline = $Stroutline + "No,"
    			$missingfile = "Yes"
    		}
    	}
    	$Stroutline = $Stroutline + $missingfile
    	$Stroutline
    	}
      While (Get-Job -State "Running")
    	{
      	Start-Sleep 1
    	}
      $results = Get-Job | Receive-Job
      
    	$results
      $stroutput = $stroutput + "`n" + $results
      Remove-Job *
    }
    Write-Output $stroutput | Out-File "$reportName"  -Force -Append 
    

    Thank you very much!

    Vamsi.

    Wednesday, January 29, 2014 3:09 PM

Answers

  • Here is the yes/no builder

    $combined=dir *.txt | %{cat $_} |sort -unique
     $h=@{}
    $combined|%{$h.Add($_,@('no','no','no'))}
    cat c1.txt | %{($h[$_])[0]='yes'}
    cat c2.txt | %{($h[$_])[1]='yes'}
    cat c3.txt | %{($h[$_])[2]='yes'}
    

    Here is the output formatter

    #header
     '{0,-60} {1,-10} {2,-10} {3,-10}' -f 'Fullname','Server01','Server02','Server03'
    $combined|
    %{
         '{0,-60} {1,-10} {2,-10} {3,-10}' -f $_,$h[$_][0],$h[$_][1],$h[$_][2]
    }

    It can actually be simplified even more but this will help you see how it works.


    ¯\_(ツ)_/¯


    • Edited by jrv Wednesday, January 29, 2014 10:14 PM
    • Marked as answer by vamsinm Friday, January 31, 2014 2:24 PM
    Wednesday, January 29, 2014 10:13 PM

All replies

  • There is no easy way to do this kind of comparison. You need to accept that some speed will be sacrificed.

    If you are just comparing by file name, your issue is even compounded. If you want to compare by folder, then grouping will help performance.

    I would do a merge by computer, parent path, and file. I would sort this list, then open it in Excel and build a pivot table, which would show the results you seek. The Excel pivot table might be faster because it will generate indexes.

    The other method would be to use a database and just add all items into a table. This would get the data in in one pass.

    I can probably do an update of a billion files into SQL Server on a quad-core Win 7 system in about 5 minutes. Maybe less.

    I do not see how you can do this fast with string comparisons. That is the slowest mechanism.


    ¯\_(ツ)_/¯

    Wednesday, January 29, 2014 3:35 PM
  • I looked a bit closer. You are not just looking for a match of file names; you are reading the content of every file. Why? You cannot compare content across systems by counting lines. You could generate a hash for each file and compare hashes. Perhaps you need to explain exactly what it is that you are doing and why it needs to be done. There are likely tools that will do this more easily.
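    The hashing idea can be sketched as follows; this is a minimal illustration rather than the poster's code, and $file is a placeholder path:

```powershell
# Fingerprint one file with MD5; identical content on two servers
# yields identical hash strings, regardless of line counts.
$file = 'C:\data\example.txt'   # placeholder path
$md5  = New-Object System.Security.Cryptography.MD5CryptoServiceProvider
$hash = [System.BitConverter]::ToString(
            $md5.ComputeHash([System.IO.File]::ReadAllBytes($file)))
$hash
```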


    ¯\_(ツ)_/¯

    Wednesday, January 29, 2014 5:06 PM
  • This will generate a CSV much faster than your method.

    Choose the hashing method you want and just add it. Run this before you add the hash and after to see the impact. Hashing will be much faster and more accurate than counting lines.

    $FolderName = Split-Path -parent $MyInvocation.MyCommand.Definition
    $reportName = Join-Path $FolderName "$env:computername.csv"
    $md5 = new-object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
    
    Get-ChildItem "$FolderName\*" -include *.txt -recurse |
        ForEach-Object{
            New-Object PsObject -property @{
                Computer = $env:computername
                FullName = $_.FullName
                Name     = $_.Name
                Path     = $_.DirectoryName
                # here we would generate a hash
                #Hash = [System.BitConverter]::ToString($md5.ComputeHash([System.IO.File]::ReadAllBytes($_.FullName)))
            }
        } | Export-Csv $reportName
    I fixed and commented out the hash so you can compare.


    ¯\_(ツ)_/¯





    • Edited by jrv Wednesday, January 29, 2014 5:27 PM
    Wednesday, January 29, 2014 5:16 PM
  • Once you get a couple of these you can just merge the files, group by fullname, then compare the hashes and report match/no-match.

    The grouping will take about 10 minutes on a quad workstation at 64 bits.  The comparison will be very fast.

    There is no way to multithread this.  It has to be done as a complete operation and you have to choose which file is the master or the one to compare "to".

    If you choose to compare each machine separately then you will have the overhead of combining the results which will also be troublesome.


    ¯\_(ツ)_/¯

    Wednesday, January 29, 2014 5:31 PM
  • Here is what I am trying to do... let me know if it makes sense.

    I have server1.txt, server2.txt, and server3.txt, generated by running dir /b /s <folder> on server1, server2, and server3. I could not connect to these servers directly using PowerShell because of security restrictions.

    What I tried to do was get the contents of the file lists from each file, combine them, and collect the unique values into a single array. Then I compare each line in the array against the three files to see whether it is present or not.

    Server1.txt has the list of files on server1 under a folder.
    File1
    File2
    File3

    Server2.txt has the list of files on server2 under a folder.
    File1
    File2
    File4

    Server3.txt has the list of files on server3 under a folder.
    File1
    File3
    File4

    The CSV report should look like this...
    File name,Server1,server2,server3,Is it missing file?
    File1,yes,yes,yes,No
    file2,Yes,Yes,No,Yes
    File3,Yes,No,Yes,Yes
    File4,No,Yes,Yes,Yes


    Wednesday, January 29, 2014 6:43 PM
  • So the script you posted was a fake and has nothing to do with what you are doing. The script you posted gets the file names and the number of lines for txt files only.

    You are saying you have lists of file names only and you want to compare the file names across servers. This is completely different from the scripts posted, so... why did you post the scripts?

    Start by merging all files into one and querying it with select * -unique. This will give you a list of all files across all servers. Next, post each individual file to the list. If the list is a hash table, you can just add the computer names to the table:

    $hash=@{filename=@()}

    As a collection, you can get the file and add the computer. Once you have this done, just run your report to print out the contents of the hashtable.

    $hash[$filename]+=$computername

    This would be very fast. It will be about as fast as you can read each file. Printing will be the slowest part because printers are slow.

    Start by doing step one and I will step you through the rest.
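    The hash-table approach described above might be sketched like this; the server*.txt files are assumed to sit in the current folder, and the variable names are illustrative:

```powershell
# One key per unique file name; the value records which servers have it.
$h = @{}
Get-ChildItem *.txt | ForEach-Object {
    $server = $_.BaseName                 # e.g. 'server1'
    Get-Content $_.FullName | ForEach-Object {
        if (-not $h.ContainsKey($_)) { $h[$_] = @() }
        $h[$_] += $server                 # i.e. $hash[$filename] += $computername
    }
}
# $h['File1'] now holds the list of servers on which File1 was found.
```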


    ¯\_(ツ)_/¯

    Wednesday, January 29, 2014 8:43 PM
  • I really appreciate your help with this... I will try your suggestion. Thank you!

    The script I posted does exactly what I described, and it is not fake.

    The code below reads the input files containing dir output from various servers, e.g. server1.txt, server2.txt, etc. These files hold the output of dir on each server. I read each file's content, store it in an array, and also merge everything into a single array to get the unique list of files across all the servers.

    $ListOfFiles = get-childitem $FolderName
    $List = $ListOfFiles | where {$_.extension -ieq ".txt"}
    $list.count
    $index = 0
    foreach($listitem in $List){
    	$listfilename = $listitem.FullName
    	$listname = $listitem.Name
    	$listname = $listname.replace(".txt","")
    	$srvlist = $srvlist + $listname
    	write-host $listfilename
    	#$StrContent = get-content $listfilename
    	$StrContent = [io.file]::ReadAllLines($listfilename)
    	$contentArray += ,@($StrContent)
    	$StrContent.count
    	$contentArray[$index].count
    	$index = $index + 1
    }
    for($i = 0;$i -lt $index;$i++){
    	$allfileslist = $allfileslist + $contentArray[$i]
    }

    $uniquefileslist = $allfileslist | sort-object | get-unique

    $allfileslist.count and $list.count are only informational output for me while the script runs.

    The following lines compare each unique line and check whether it exists on the different servers...

    foreach($uniquefileslistitem in $uniquefileslist){
    	$Stroutline = ""
    	$missingfile = "No"
    	$Stroutline = $uniquefileslistitem + ","
    	for($i=0;$i -lt $index;$i++){
    		if($contentArray[$i] -contains $uniquefileslistitem){
    			$Stroutline = $Stroutline + "Yes,"
    		}
    		else{
    			$Stroutline = $Stroutline + "No,"
    			$missingfile = "Yes"
    		}
    	}
    	$Stroutline = $Stroutline + $missingfile
    	Write-Output $Stroutline | Out-File "$reportName"  -Force -Append 
    	$j++
    }

    Wednesday, January 29, 2014 9:08 PM
  • So you are trying to compare the number of lines in each file? I don't see what that is good for.

    I can say this: the code is so unlike what you have said you are doing that I have to assume they are two unrelated things.

    If I have two lists of files and I want to tabulate them, I need some way to collect all row IDs. If the rows are file names, then I can concatenate all of the files and build a unique list, which gives me a row for every file that may be on any machine. I then pass over each file and "check" the server column.

    This can process a few hundred million file names in seconds or a few minutes.

    Do this: zip all three files and upload them, and I will show you how it is done.


    ¯\_(ツ)_/¯


    • Edited by jrv Wednesday, January 29, 2014 9:27 PM
    Wednesday, January 29, 2014 9:24 PM
  • No, I am not comparing the number of lines. The count I used was only for my information... it has nothing to do with the comparison. Here is the script without the count...

    [array]$contentArray = @()
    [array]$allfileslist = @()
    [array]$uniquefileslist = @()
    [array]$allfilesSRVlist = @()
    [array]$srvlist = @()
    
    $FolderName = Split-Path -parent $MyInvocation.MyCommand.Definition
    $reportName = $FolderName + "\ComparisonReport3.csv"
    $ListOfFiles = get-childitem $FolderName
    $List = $ListOfFiles | where {$_.extension -ieq ".txt"}
    $index = 0
    foreach($listitem in $List){
    	$listfilename = $listitem.FullName
    	$listname = $listitem.Name
    	$listname = $listname.replace(".txt","")
    	$srvlist = $srvlist + $listname
    	$StrContent = [io.file]::ReadAllLines($listfilename)
    	$contentArray += ,@($StrContent)
    	$index = $index + 1
    }
    
    for($i = 0;$i -lt $index;$i++){
    	$allfileslist = $allfileslist+ $contentArray[$i] 
    }
    $uniquefileslist = $allfileslist | sort-object | get-unique
    $strfinal = "File Name,"
    foreach($srvlistitem in $srvlist){
    	$strfinal = $strfinal + $srvlistitem.ToUpper() + ","
    }
    $strfinal = $strfinal + "Is it Missing file?"
    foreach($uniquefileslistitem in $uniquefileslist){
    	$missingfile = "No"
    	$Stroutline = $uniquefileslistitem + ","
    	for($i=0;$i -lt $index;$i++){
    		if($contentArray[$i] -contains $uniquefileslistitem){
    			$Stroutline = $Stroutline + "Yes,"
    		}
    		else{
    			$Stroutline = $Stroutline + "No,"
    			$missingfile = "Yes"
    		}
    	}
    	$Stroutline = $Stroutline + $missingfile
    	$strfinal =  $strfinal + "`n" +  $Stroutline
    }
    Write-Output $strfinal | Out-File "$reportName"  -Force -Append 
    

    If you want to test it, just create two text files and put them in the same folder as the script.

    server1.txt will have the following content

    filename1

    filename2

    filename3

    server2.txt will have the following content

    filename1

    filename2

    filename4

    It should generate the CSV file ComparisonReport3.csv:

    filename,server1,server2,Is it Missing file?

    filename1,yes,yes,no

    filename2,yes,yes,no

    filename3,yes,no,yes

    filename4,no,yes,yes

    Wednesday, January 29, 2014 9:39 PM
  • Here is the template code that combines the lists and checks each file's existence.

    $combined=dir *.txt | %{cat $_} |sort -unique
     $h=@{}
    $combined|%{$h.Add($_,@())}
    cat c1.txt | %{$h[$_]+='server01'}
    cat c2.txt | %{$h[$_]+='server02'}
    cat c3.txt | %{$h[$_]+='server03'}
    
    

    We can then flatten this into your report. Each row will have an array with one or more computer names where the file was found.

    We could also alter this to directly place yes and no in specific columns by altering two lines.
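    One way the flattening might look, assuming $h (built above) maps each file name to an array of server names; the server list here is illustrative:

```powershell
# Turn each entry into a Yes/No CSV row plus the "missing" flag.
$servers  = 'server01','server02','server03'   # illustrative names
$combined | ForEach-Object {
    $row     = $_
    $flags   = foreach ($s in $servers) {
                   if ($h[$row] -contains $s) { 'Yes' } else { 'No' }
               }
    $missing = if ($flags -contains 'No') { 'Yes' } else { 'No' }
    ($row, ($flags -join ','), $missing) -join ','
}
```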


    ¯\_(ツ)_/¯

    Wednesday, January 29, 2014 9:43 PM
  • Thank you sooooo much! This did the trick... obviously printing takes time, but it is much faster than my original script. How can we change it to place yes and no in specific columns?
    Wednesday, January 29, 2014 10:05 PM
  • Here is the yes/no builder

    $combined=dir *.txt | %{cat $_} |sort -unique
     $h=@{}
    $combined|%{$h.Add($_,@('no','no','no'))}
    cat c1.txt | %{($h[$_])[0]='yes'}
    cat c2.txt | %{($h[$_])[1]='yes'}
    cat c3.txt | %{($h[$_])[2]='yes'}
    

    Here is the output formatter

    #header
     '{0,-60} {1,-10} {2,-10} {3,-10}' -f 'Fullname','Server01','Server02','Server03'
    $combined|
    %{
         '{0,-60} {1,-10} {2,-10} {3,-10}' -f $_,$h[$_][0],$h[$_][1],$h[$_][2]
    }

    It can actually be simplified even more but this will help you see how it works.
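    If CSV output is wanted instead of the fixed-width layout, one possible variation (the column names are illustrative) is to emit objects and let Export-Csv do the formatting:

```powershell
# Emit one object per file; Export-Csv writes the header row and quoting.
$combined | ForEach-Object {
    New-Object PSObject -Property @{
        FileName = $_
        Server01 = $h[$_][0]
        Server02 = $h[$_][1]
        Server03 = $h[$_][2]
    }
} | Select-Object FileName,Server01,Server02,Server03 |   # fix column order
    Export-Csv ComparisonReport.csv -NoTypeInformation
```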


    ¯\_(ツ)_/¯


    • Edited by jrv Wednesday, January 29, 2014 10:14 PM
    • Marked as answer by vamsinm Friday, January 31, 2014 2:24 PM
    Wednesday, January 29, 2014 10:13 PM
  • Here is a fully annotated version: http://sdrv.ms/1bz38pV


    ¯\_(ツ)_/¯

    Wednesday, January 29, 2014 10:24 PM
  • Thank you for your help with this.. I guess printing is the main hindrance to performance. It took forever (more than 8 hours and still did not complete) to print it into a CSV. Really appreciate your help. Thank you again!
    Friday, January 31, 2014 2:25 PM
  • Thank you for your help with this.. I guess printing is the main hindrance to performance. It took forever (more than 8 hours and still did not complete) to print it into a CSV. Really appreciate your help. Thank you again!

    There is no reason for that unless you have hundreds of megabytes of text, which is unlikely. Perhaps you have a bad disk. Are you trying to write to a file share? Write the file locally and copy it.


    ¯\_(ツ)_/¯

    Friday, January 31, 2014 2:48 PM
  • Local copy; the file size is approximately 25 MB. I will look into it again to see what I am doing wrong and post back. Thank you! Have a great day.
    Friday, January 31, 2014 2:58 PM
  • You have mucked up something, because writing 25 MB should not take that long.

    #header
    $report = '{0,-60} {1,-10} {2,-10} {3,-10}' -f 'Fullname','Server01','Server02','Server03'
    $combined|
    %{
         $report += "`n" + ('{0,-60} {1,-10} {2,-10} {3,-10}' -f $_,$h[$_][0],$h[$_][1],$h[$_][2])
    }
    
    $report | Out-File myfile.txt
    


    ¯\_(ツ)_/¯

    Friday, January 31, 2014 3:17 PM
  • This is what I have... I used a for loop because the number of files that need to be compared varies.

    $FolderName = Split-Path -parent $MyInvocation.MyCommand.Definition
    $reportName = $FolderName + "\ComparisonReport.csv"
    $UniqueFileNames = get-childitem $FolderName | where {$_.extension -ieq ".txt"} | %{[io.file]::ReadAllLines($_.FullName)} |sort -unique
    $h=@{}
    $array= @()
    $stroutline = "File Name,"
    get-childitem $FolderName | where {$_.extension -ieq ".txt"} |%{
    	$array = $array + "No"
    	$listname = $_.Name
    	$listname = $listname.replace(".txt","").toupper()
    	$stroutline = $stroutline + $listname + ","
    }
    $array = $array + "No"
    $UniqueFileNames|%{$h.Add($_,@($array))}
    $stroutline = $stroutline + "Is it Missing file?" + "`n"
    $i = 0
    get-childitem $FolderName | where {$_.extension -ieq ".txt"} |%{
    	[io.file]::ReadAllLines($_.FullName) | %{($h[$_])[$i]="Yes"}
    	$i++
    }
    
    $UniqueFileNames|
    %{
    	$stroutline = $stroutline + $_ + ","
    	for($j=0;$j -lt $i; $j++){
    		$stroutline = $stroutline +($h[$_])[$j] + ","
    	}
    	$stroutline = $stroutline + ($h[$_])[$i] + "`n"
    }
    Write-Output $stroutline | Out-File "$reportName"  -Force
    

    Friday, January 31, 2014 3:24 PM
  • I gave you a solution that works. You went back to using your old solution.

    Sorry, but I can't help you. You don't seem to understand what I am showing you.

    You cannot concatenate strings forever. It won't work. You have also completely changed the example, so it is no longer what I posted and will not work as intended.
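    The string-concatenation concern above is commonly avoided with a StringBuilder, which appends in place instead of copying the whole accumulated string on every pass; a sketch, where $rows stands in for the generated report lines:

```powershell
# $report += $line copies the entire string each time; AppendLine does not.
$rows = 'file1,Yes,No','file2,No,Yes'          # illustrative report lines
$sb = New-Object System.Text.StringBuilder
[void]$sb.AppendLine('File Name,SERVER1,SERVER2')
foreach ($line in $rows) { [void]$sb.AppendLine($line) }
$sb.ToString() | Out-File report.csv
```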


    ¯\_(ツ)_/¯

    Friday, January 31, 2014 3:36 PM
  • All of this is just extra noise:

    $FolderName = Split-Path -parent $MyInvocation.MyCommand.Definition
    $reportName = $FolderName + "\ComparisonReport.csv"
    $UniqueFileNames = get-childitem $FolderName | where {$_.extension -ieq ".txt"} | %{[io.file]::ReadAllLines($_.FullName)} |sort -unique

    It is identical to this:

     get-childitem "$FolderName\*" -Include *.txt |
          ForEach-Object{

    All of the removal of file extensions is completely unnecessary and may give you bad results.

    What I cannot understand is why you took me through all of that, showing you how to fix your code, only to have you revert back to your same broken code. You have pretty much thrown out everything I did to optimize the process.


    ¯\_(ツ)_/¯

    Friday, January 31, 2014 4:10 PM
  • I apologize.. I am testing with your code now... I started it around 20 minutes ago and it is still running. I did not think I had thrown out your code; I used the logic you gave me. I did not know that reading a file using different methods makes a difference. I will let you know once it completes.

    thank you

    vamsi.

    Friday, January 31, 2014 4:22 PM
    I apologize.. I am testing with your code now... I started it around 20 minutes ago and it is still running. I did not think I had thrown out your code; I used the logic you gave me. I did not know that reading a file using different methods makes a difference. I will let you know once it completes.

    thank you

    vamsi.

    It is not the different methods, but the fact that it shows you are guessing about how to do things in PowerShell and copying things from the Internet without really looking at the task. You must focus on the mechanism first. Once you have a working mechanism, you can decorate it in any way that suits you and know that, if your decorations break the code, it is your changes that are wrong and not the technique or technology.

    My other criticism is that you are trying to get me to decipher code that doesn't work after I have already shown you how to do this. I don't feel like going back and unwinding what you are trying to do. You will understand this better when you learn more about PowerShell and computer/data science. PowerShell works extremely well with data, but you must throw away old batch and VB ideas to get it to work well.


    ¯\_(ツ)_/¯

    Friday, January 31, 2014 4:31 PM