What is a duplicate?

    Question

  • How does FAST determine what is a duplicate file during crawling? Is it by its MD5 hash value or something else?

    Thanks, Shane



    Thursday, July 14, 2011 3:45 PM

Answers

  • Hello Shane,

    At first I thought it calculated a checksum over the binary bytes of what was indexed (the data field).

    After investigating this a bit, I found that it calculates a 64-bit checksum based on the full title of the document and the first 1024 bytes of the text. Certainly a sub-par implementation.

    Luckily you can create a checksum of your own and use more text for the calculation (http://msdn.microsoft.com/en-us/library/ff521593.aspx).

    When creating similar functions for FAST ESP in the past I have used:

    byte size + content

    Regards,
    Mikael Svenson 


    Search Enthusiast - SharePoint MVP/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Friday, July 15, 2011 7:11 PM
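
As a rough illustration of the "byte size + content" idea in the answer above, here is a minimal Python sketch. It is not FAST ESP or FS4SP code: the function name `size_content_checksum` is hypothetical, and BLAKE2b truncated to 64 bits is only an assumed stand-in, since the thread does not say which hash function was used.

```python
import hashlib

def size_content_checksum(content: bytes) -> str:
    """Hypothetical 64-bit checksum over the byte size plus the full content."""
    h = hashlib.blake2b(digest_size=8)          # 8 bytes = 64 bits (assumed hash choice)
    h.update(len(content).to_bytes(8, "big"))   # the byte size goes in first
    h.update(content)                           # then every byte, not just a prefix
    return h.hexdigest()

a = b"lorem ipsum " * 100
b = a + b"mikael"   # same start, a few extra bytes at the end

print(size_content_checksum(a) == size_content_checksum(a))  # same input, same checksum
print(size_content_checksum(a) == size_content_checksum(b))  # a change at the end is noticed
```

Because the whole content is hashed, a difference anywhere in the document changes the checksum, which is the point of using more text than the default.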

All replies

  • Aarrgghh!!! Sometimes I think I'd ask for my money back if FS4SP had come out of my own pocket!!! "Sub-par implementation" indeed...

    Can someone from Microsoft please confirm that the checksum used by FS4SP for determining duplicates is only based on the full title of the document and the first 1024 bytes of text, not on the entire contents of the file?

    Mikael, I've read the technet article and it's definitely out of my league to implement but thanks anyway...

    Shane

    Saturday, July 16, 2011 2:39 PM
  • I tested the title + 1024 characters behavior before writing my answer, so I would call it confirmed.

    You can change the configuration to include more than 1024 characters, but doing so is unsupported.

    That said, depending on your data it could work OK, as titles do differ, as well as the first 1024 characters. Often there is a data change, a reference change etc. But it's very easy to create a scenario like I did where it will fail.

    I created two text files called "duplicate.txt", filled them with lorem ipsum for 1024 characters and appended "mikael" to one of the files. They showed up as duplicates when searching for "lorem".

    As for getting your money back... the great thing with FAST is that you can quite often change the behavior to work the way you want even if it's not out of the box, so at least you're not trapped with a set behavior.

    -m


    Search Enthusiast - SharePoint MVP/WCF4/ASP.Net4
    http://techmikael.blogspot.com/
    Saturday, July 16, 2011 5:48 PM
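
The experiment described in the reply above can be mimicked outside FS4SP. The sketch below is only a model of the reported behavior: `title_prefix_signature` is a hypothetical function, and the hash (BLAKE2b truncated to 64 bits) is an assumption, not the algorithm FS4SP actually uses. What it shows is that any signature built from the title and a 1024-character prefix cannot see text appended after that prefix.

```python
import hashlib

PREFIX_LEN = 1024  # prefix length reported in the thread

def title_prefix_signature(title: str, text: str) -> str:
    """Stand-in signature built from the title and the first 1024 characters."""
    h = hashlib.blake2b(digest_size=8)           # assumed hash, 64 bits
    h.update(title.encode("utf-8"))
    h.update(b"\x00")                            # separator between title and text
    h.update(text[:PREFIX_LEN].encode("utf-8"))  # everything past the prefix is ignored
    return h.hexdigest()

lorem = ("lorem ipsum dolor sit amet " * 40)[:1024]  # exactly 1024 characters
doc_a = lorem
doc_b = lorem + "mikael"                             # differs only after the prefix

same = (title_prefix_signature("duplicate.txt", doc_a)
        == title_prefix_signature("duplicate.txt", doc_b))
print(same)  # True
```

Changing the title, or anything inside the first 1024 characters, gives a different signature, which is why the default can still work for many real data sets.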
  • Hi Mikael,

    I wasn't doubting your findings, I just wanted a response from inside Microsoft for my records.

    Thanks, Shane

    Saturday, July 16, 2011 7:44 PM
  • Hi Mikael,

    We customized the calculation of DocumentSignature so that it is based on the file name.

    For example, we have 3 folders containing 3 versions of a fire fighting policy. The DocumentSignature is calculated from "firefighting.pdf", so the system considers them duplicate (identical) documents.

    • d:\2012.3\firefighting.pdf
    • d:\2012.1\firefighting.pdf
    • d:\2011.8\firefighting.pdf

    We have a separate property, "releaseDate", which extracts the date from "2012.3" or "2011.8" and is added to the index.

    On the first search page we group them as one document, and when the user clicks "duplicates(3)" he/she goes to a second page listing all 3 documents.

    The problem is that the first page does not show the latest document; for example, it shows d:\2012.1\firefighting.pdf.

    Is it possible to make sure the first page always shows the latest one (d:\2012.3\firefighting.pdf) while the system thinks all of them are identical (though in fact they are not)?

    sounds like a mission impossible... :S


    • Edited by Feng_Lu Tuesday, December 18, 2012 8:25 AM
    Tuesday, December 18, 2012 8:21 AM
  • Hi,

    That's a tricky one. If you set the results to sort on date (descending), does that change anything? I know that when sorting on relevance it seems random which one is picked as the original, but it might be the one with the highest relevance score, though I haven't checked or verified this.

    Thanks,
    Mikael Svenson


    Search Enthusiast - SharePoint MVP/MCT/MCPD - If you find an answer useful, please up-vote it.
    http://techmikael.blogspot.com/
    Author of Working with FAST Search Server 2010 for SharePoint

    Tuesday, December 18, 2012 9:31 AM
  • Hi Mikael,

    Your guess is correct. The system picks the highest-scoring document as the original when sorting on relevance. Changing the sorting to date does change the order of the results.

    Now the problem becomes how to tune the rank profile so that the latest document (e.g. 2012.3\firefighting.pdf) has the highest score.

    I changed the rank profile and set freshnessWeight = 500, which I think is pretty high. However, from time to time I still find older documents whose content score outweighs the freshness score and which beat the newer versions. :(

    Should I keep increasing the freshnessWeight? Or should I sort by date directly? (If I sort by date, the search result is totally different, though.)

    Maybe I can change the ranking somehow by using "releaseDate" (sample values: 2012.3, 2012.1, ...)?

    Any hints?

    Anyway, wish all of you here Merry Christmas and Happy new year.

    -Feng

    Monday, December 24, 2012 8:24 AM
  • Hi,

    Changing the freshness weight only helps "newer" documents; once an item is too old it no longer gets much of a boost.

    If the items you want sorted newest-first otherwise all have the same rank, you could try to make the date part of the rank. For example, turn the date into an integer of the needed resolution (day, hour, minute, second) and use it in the rank profile by adding a static rank property.

    You will want to look at the actual number you use, and its weight, to keep some control over the number of rank points given :)

    Take a look at TechNet for adding static rank properties.

    Thanks,
    Mikael Svenson


    Tuesday, December 25, 2012 3:05 PM
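
The suggestion above, turning the date into an integer and letting it contribute rank points, can be sketched in Python. This is a simplified model and not how FS4SP computes rank: the `Hit` class, the `dynamic_score` values and the additive `total_score` formula are all assumptions made for illustration. Only the "releaseDate" values and file paths come from the thread.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    path: str
    release_date: str     # e.g. "2012.3", taken from the folder name
    dynamic_score: float  # hypothetical content-based score

def release_date_to_int(release_date: str) -> int:
    """Turn 'YYYY.M' into a sortable integer at month resolution: '2012.3' -> 201203."""
    year, month = release_date.split(".")
    return int(year) * 100 + int(month)

def total_score(hit: Hit, weight: float) -> float:
    # Assumed model: the static value times its weight is added to the dynamic score.
    return hit.dynamic_score + weight * release_date_to_int(hit.release_date)

hits = [
    Hit(r"d:\2012.3\firefighting.pdf", "2012.3", 40.0),
    Hit(r"d:\2012.1\firefighting.pdf", "2012.1", 55.0),
    Hit(r"d:\2011.8\firefighting.pdf", "2011.8", 50.0),
]

for weight in (1.0, 10.0):
    best = max(hits, key=lambda h: total_score(h, weight))
    print(weight, best.path)
```

With these made-up scores, a weight of 1.0 still lets the 2012.1 version win because its content score is 15 points higher and the date values differ by only 2, while a weight of 10.0 puts 2012.3 on top. That is the reason to check both the size of the number and its weight. Large absolute values will also swamp relevance between unrelated documents, so a smaller number, such as months since some base year, may be easier to control.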
  • Just wanted to raise awareness that there is a third MP called "documentsignaturecontribution" that you can use to customize the default document signature.

    See http://blogs.msdn.com/b/nicolasu/archive/2013/03/13/fs4sp-document-signature-customization.aspx

     

    Wednesday, March 13, 2013 9:16 PM
  • Hi,

    Although this helps, it's no silver bullet. I wrote about this on page 340 of my book; the real issue is that the duplicate check uses a 32-bit value. With 32 bits, after about 77,000 items you already have a 50% chance of two items getting the same CRC. After 200,000 items the chance is 99%. (This is called the birthday paradox and is explained at http://en.wikipedia.org/wiki/Birthday_paradox)

    This means that duplicate collapsing will collapse items which are totally unrelated.

    I do suggest adding more content to documentsignaturecontribution, for example by mapping body to that field, but it won't take you all the way.

    Thanks,


    Search Enthusiast - SharePoint MVP/MCT/MCPD - If you find an answer useful, please up-vote it.
    http://techmikael.blogspot.com/
    Author of Working with FAST Search Server 2010 for SharePoint

    Thursday, March 14, 2013 11:27 AM
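
The figures in the reply above can be checked with the standard birthday approximation, p ≈ 1 - exp(-n(n-1) / (2·2^bits)). The sketch below assumes checksums are spread uniformly over the 32-bit space; `collision_probability` is a name chosen for this example.

```python
import math

def collision_probability(n_items: int, bits: int) -> float:
    """Approximate chance that at least two of n_items share a checksum of this width."""
    space = 2 ** bits
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * space)).
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-n_items * (n_items - 1) / (2 * space))

for n in (77_000, 200_000):
    print(n, f"{collision_probability(n, 32):.0%}")
```

This prints roughly 50% for 77,000 items and 99% for 200,000 items, matching the numbers quoted. Passing a larger `bits` value shows how quickly the risk drops as the checksum gets wider.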