Find duplicate files

    Question

  • Hello,

     

    Is it possible to use SharePoint to find duplicate files? I need to identify all duplicate files on our network shared drives.

    Thanks

    Thursday, April 03, 2008 9:27 AM

Answers

  • I had the same need and wrote this query against the MOSS Search database (a variant that also catches renamed copies is sketched at the end of this post). Hope that helps:

     

    -- Step 1: get all files with short names, md5 signatures, and size
    select
        md5,
        right(accessurl, charindex('\', reverse(accessurl)) - 1) as ShortFileName,
        accessurl as Url,
        llVal / 1024 as FileSizeKb
    into #listingFilesMd5Size
    from MSSCrawlURL y
        inner join MSSDocProps on y.DocID = MSSDocProps.DocID
    where MSSDocProps.pid = 58      -- File size
        and llVal > 1024 * 10       -- 10 Kb minimum in size
        and md5 <> 0
        and charindex('\', reverse(accessurl)) > 1

    -- Step 2: filter duplicated items
    select count(*) as NbDuplicates, md5, ShortFileName, FileSizeKb
    into #duplicates
    from #listingFilesMd5Size
    group by md5, ShortFileName, FileSizeKb
    having count(*) > 1
    order by count(*) desc

    drop table #listingFilesMd5Size

    -- Step 3: show the report with search URLs
    select *,
        NbDuplicates * FileSizeKb as TotalSpaceKb,
        'http://srv-moss/SearchCenter/Pages/results.aspx?k=' + ShortFileName as SearchUrl
    from #duplicates
    order by NbDuplicates * FileSizeKb desc

    drop table #duplicates

     

    http://www.magesi.com/blog/?p=95 
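
    One note on the Step 2 grouping above: because it groups by md5, ShortFileName and FileSizeKb together, copies that were renamed will not show up as duplicates. If renamed copies should count too, a variant that groups on the md5 signature alone might look like the sketch below (run it before the drop table at the end of Step 2, since it reuses #listingFilesMd5Size; it is untested, so treat it as a starting point only):

    -- Variant of Step 2: group on the md5 signature only, so renamed copies
    -- of the same content are reported as well (one sample name kept per group)
    select
        count(*) as NbDuplicates,
        md5,
        min(ShortFileName) as SampleFileName,
        FileSizeKb
    into #duplicatesByContent
    from #listingFilesMd5Size
    group by md5, FileSizeKb
    having count(*) > 1

    select *, NbDuplicates * FileSizeKb as TotalSpaceKb
    from #duplicatesByContent
    order by NbDuplicates * FileSizeKb desc

    drop table #duplicatesByContent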

    Tuesday, May 06, 2008 8:01 AM

All replies

  • If SharePoint is indexing those shared drives, then you could probably use the search API to produce a list like that. Maybe do a succession of searches, starting with file names beginning with "A", then "B", and so on; a rough query sketch is at the end of this reply.

     

    Someone may have a more clever approach.

     

    How do you know they are duplicates? Name? File size? Byte-level comparison?

     

    It's a tough job I think.

     

    Good luck.
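
    As a rough sketch of what one pass of that succession of searches could look like, here is a full-text SQL query of the kind the MOSS query object model (FullTextSqlQuery) or the Query web service accepts. Title, Path and Size are default managed properties, but the file:// filter assumes the shares are crawled through a file-share content source, and Title is not always the file name, so take it only as a starting point:

    SELECT Title, Path, Size
    FROM SCOPE()
    WHERE Path LIKE 'file://%'
      AND Title LIKE 'a%'

    Repeating it for 'b%', 'c%' and so on, then grouping the results by name and size on the client side, gives a first cut at a duplicate list, though only a byte-level comparison (or a content hash) really proves two files are identical.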

    Thursday, April 03, 2008 4:46 PM
  • SharePoint search itself already tells me in the results page that it found duplicates.

    We have a lot of duplicate files (same content), and I would like to use MOSS to identify them and then delete them.

    So my question is: does the MOSS API have a feature to identify identical (duplicate) files?

     

    Thanks

     

    Karnek

    Monday, April 07, 2008 11:57 AM
  • I don't know.

     

    I was looking up something unrelated to this post earlier and found this keyword documented in a search training class I took a year ago:

     

    duplicates:"http://site"

     

    That might help.

     

     

     

    Monday, April 07, 2008 12:35 PM