none
Query a document library to find dupliates and remove them

    Question

  • We are using SharePoint 2010 Enterprise

    We are trying to remove duplicate documents from our document library.  The duplicates will be pulled based on a couple column values that match among documents found within the library, Such as:

    - Customer ID

    - Document Date

    - Document Description

    This removal process needs to be done carefully and unfortunately the actual delete of the doc needs to be done manually, once a user determines it's a true duplicate.

    So, I need to generate list of all the potential duplicate documents based on the criteria noted above so a user can then manually go through the list of documents (which would be sorted also by the criteria above) to determine if the documents are, in fact, duplicates and then delete any copies leaving just one.

    So Far,

    I have a sql query that will return the duplciates from our doc library, but I haven't found a way to apply this within our SharePoint site.  I tried an external list, but since the results are returned as an actual sharepoint list, the user can't view or delete any documents from the doc library when going through the list.  I looked at the Content Query Webpart, but from what I could tell I couldn't set up the query to compare column values among documents, I could just add a filter value for a column.

    Any thoughts on how I can accomplish this?  Hopefully using an OOB web part!

    Thanks in advance!


    zc26



    • Edited by zc26 Wednesday, September 18, 2013 4:34 AM
    Wednesday, September 18, 2013 4:32 AM

Answers

  • Hi zc26, 

    There is a similar question in the forums today that I answered, so I'll post the code, though I can't take credit for all of it. That post is here: http://social.msdn.microsoft.com/Forums/sharepoint/en-US/b1ff945e-e935-4376-9cc2-3f19453d8827/sharepoint-powershell-groupby-number-column

    You can use PowerShell to get all the items from the list and group them by one or more fields. I've created an example, below, that prints the names of all the documents in the library, then groups all "duplicate documents" based on the CustomerId and DocumentType being the same. Finally, the script prints each duplicate document, along with it's listitem id. As Alex suggests, you could use the listitem ID to update a boolean field on the listitem to indicate it's a duplicate. 

    Code:

    $w = Get-SPWeb "http://devmy101"
    $l = $w.Lists["mylistwithdatainit"]
    $l.Items | FT Title,@{Label="Item Id";Expression={$_["ID"]}},@{Label="Document Description";Expression={$_["Document_x0020_Description"]}},@{Label="Customer Id";Expression={$_["CustomId"]}} -AutoSize
    $customIdInternalFieldName = "CustomId"
    $documentTypeInternalFieldName = "Document_x0020_Description"
    $dt = $l.Items.GetDataTable()
    Write-Host $l.ItemCount"Items in the library" -ForegroundColor Green
    $ad = $l.Items.GetDataTable() | Group-Object $customIdInternalFieldName,$documentTypeInternalFieldName  | ?{$_.Count -gt 1} | %{$_.CustomId}
    Write-Host $ad.Count"Duplicates founds" -ForegroundColor Green
    $dups = $l.Items.GetDataTable() | Group-Object $customIdInternalFieldName,$documentTypeInternalFieldName  | ?{$_.Count -gt 1}
    foreach($d in $dups){$d.Group | FT Title,@{Label="Item Id";Expression={$_["ID"]}},@{Label="Document Description";Expression={$_["Document_x0020_Description"]}},@{Label="Customer Id";Expression={$_["CustomId"]}} -AutoSize}

    Screen Shot



    Regards, Matthew
    MCPD | MCITP
    My Blog
    Please remember to click "Mark As Answer" if a post solves your problem or "Vote As Helpful" if it was useful.

    I just added a webpart to the TechNet Gallery that allows administrative users to upload, crop and format user profile photos. Check it out here: Upload and Crop User Profile Photos


    Wednesday, September 18, 2013 8:44 PM
  • The PowerShell is not suitable for real time updates, although it is in theory possible to hack it together to do that.

    For real time updates you'd need/want to use the same process (transformed into C# code) in a custom Web Part.


    Tuesday, September 24, 2013 7:01 PM

All replies

  • Hi,

    You can create a view of the document library by grouping the documents based on the specified value and use that view to delete the duplicate documents in document library.

    Regards,


    Sairam Avacorp Technologies

    Wednesday, September 18, 2013 4:37 AM
  • Thanks for the reply.. I looked at this option, but with using the group I can't have it return just the documents that are potential duplicates.  It returns everything in our doc library (which is a lot) and sorts it based on the grouping. 

    I'd like to just have it return documents that match on Customer ID and Document Description and return nothing if one only document is found.  A list of duplicates will be considerable shorter and easier for a user to go through. 

    I didn't see a way where I could further filter on group count to only display if it's greater than 1.

    Thank you again.

    zc26


    zc26

    Wednesday, September 18, 2013 4:42 PM
  • Firstly, going in through SQL isn't something you should be doing often. It's bad practice and can lead to stability/performance issues. Querying is bad enough but never write to it.

    If you have to do it through the UI the way i'd do it is to create a yes/no column called 'duplicate' and a view that only shows items with 'Yes' selected.

    I'd then use powershell to identify all the items that match your list of IDs/descriptions and set the duplicate column for those items. They will then show up in your view.

    Once you've manually deleted your documents you can delete the column. For a longer term/nicer solution you could use PowerShell to do the identification and marking all as one step, without the need to go into SQL.

    Wednesday, September 18, 2013 5:34 PM
  • Hi zc26, 

    There is a similar question in the forums today that I answered, so I'll post the code, though I can't take credit for all of it. That post is here: http://social.msdn.microsoft.com/Forums/sharepoint/en-US/b1ff945e-e935-4376-9cc2-3f19453d8827/sharepoint-powershell-groupby-number-column

    You can use PowerShell to get all the items from the list and group them by one or more fields. I've created an example, below, that prints the names of all the documents in the library, then groups all "duplicate documents" based on the CustomerId and DocumentType being the same. Finally, the script prints each duplicate document, along with it's listitem id. As Alex suggests, you could use the listitem ID to update a boolean field on the listitem to indicate it's a duplicate. 

    Code:

    $w = Get-SPWeb "http://devmy101"
    $l = $w.Lists["mylistwithdatainit"]
    $l.Items | FT Title,@{Label="Item Id";Expression={$_["ID"]}},@{Label="Document Description";Expression={$_["Document_x0020_Description"]}},@{Label="Customer Id";Expression={$_["CustomId"]}} -AutoSize
    $customIdInternalFieldName = "CustomId"
    $documentTypeInternalFieldName = "Document_x0020_Description"
    $dt = $l.Items.GetDataTable()
    Write-Host $l.ItemCount"Items in the library" -ForegroundColor Green
    $ad = $l.Items.GetDataTable() | Group-Object $customIdInternalFieldName,$documentTypeInternalFieldName  | ?{$_.Count -gt 1} | %{$_.CustomId}
    Write-Host $ad.Count"Duplicates founds" -ForegroundColor Green
    $dups = $l.Items.GetDataTable() | Group-Object $customIdInternalFieldName,$documentTypeInternalFieldName  | ?{$_.Count -gt 1}
    foreach($d in $dups){$d.Group | FT Title,@{Label="Item Id";Expression={$_["ID"]}},@{Label="Document Description";Expression={$_["Document_x0020_Description"]}},@{Label="Customer Id";Expression={$_["CustomId"]}} -AutoSize}

    Screen Shot



    Regards, Matthew
    MCPD | MCITP
    My Blog
    Please remember to click "Mark As Answer" if a post solves your problem or "Vote As Helpful" if it was useful.

    I just added a webpart to the TechNet Gallery that allows administrative users to upload, crop and format user profile photos. Check it out here: Upload and Crop User Profile Photos


    Wednesday, September 18, 2013 8:44 PM
  • I see where this apporach is helpful in finding the duplicates and using a column to mark a document as a dup so it can pulled and grouped into a view.  However, if I use Powershell and update in this manner, is there a way to have PS run this script so when the view is accessed by my user in the SharePoint UI, it will be refreshed so the user is looking at the most recent list?  Or would this script need to be run separately and the user would work from a list until it's updated?   This realtime updating of the view was why I was intially thinking I had to link the view to a sql query (read only) to allow the view to refresh at the point of access (similar to how an external list works)..  I agree it's not a great option, but I couldn't see anohter option...  So if the PS will get me there, I'm all for it.

    thank you for your help..!


    zc26

    Tuesday, September 24, 2013 4:52 PM
  • The PowerShell is not suitable for real time updates, although it is in theory possible to hack it together to do that.

    For real time updates you'd need/want to use the same process (transformed into C# code) in a custom Web Part.


    Tuesday, September 24, 2013 7:01 PM