Is it possible to do a fuzzy duplicate search

  • Question

  • Hi, I'm new to Power BI and loving it so far, but I have a query I can't find an answer to. I've looked on this forum and others without success, so apologies if I have missed where it is covered.

    Essentially, I was hoping to use fuzzy matching to identify duplicates in a list of filenames. The list is generated every three months and includes the filenames of lots of different files that are created daily, e.g. product001_sales_Jun_01, product001_sales_Jun_02, product001_sales_Jun_03, etc. So in this example, the Quarter 2 list would contain all the April, May, and June files. I just need to identify that there is a type of file named product001_sales_xxxx; I don't even really need to know how many there are. File names are made up in many different ways, so while in this case it would be easy to take the first xx characters and de-duplicate, in other examples the date might be at the front, in the middle, or in the folder name rather than the file name, and it might be written in letters or numbers!
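
    To make the goal concrete outside Power BI, here is a minimal Python sketch of one way to collapse daily files into a single file type: split each name on separators and mask the tokens that look like dates. The separator set, the month list, and the names file_type and distinct_file_types are all assumptions made for this illustration, not anything Power BI provides, and dates with no separators around them would need extra rules.

```python
import re

# Assumed English month tokens; extend for other naming conventions.
MONTH_TOKENS = {
    "jan", "january", "feb", "february", "mar", "march", "apr", "april",
    "may", "jun", "june", "jul", "july", "aug", "august",
    "sep", "sept", "september", "oct", "october", "nov", "november",
    "dec", "december",
}


def file_type(name):
    """Mask tokens that look like part of a date, wherever they appear."""
    # Split on runs of assumed separators, keeping the separators themselves.
    parts = re.split(r"([_\-. /\\]+)", name)
    masked = []
    for part in parts:
        if part.isdigit():
            masked.append("#")    # all-digit token, e.g. a day or a year
        elif part.lower() in MONTH_TOKENS:
            masked.append("MMM")  # month written in letters
        else:
            masked.append(part)   # e.g. "product001" is kept intact
    return "".join(masked)


def distinct_file_types(names):
    """Return each masked pattern once, sorted."""
    return sorted({file_type(n) for n in names})


names = [
    "product001_sales_Jun_01",
    "product001_sales_Jun_02",
    "product001_sales_Jun_03",
]
print(distinct_file_types(names))
```

    Because only whole tokens are masked, product001 and product002 stay distinct, and a date at the front (2019-06-01_product001_sales) is masked just as one at the end is.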

    I can see how to do a fuzzy match from one list to another (i.e. match the Q3 list to the Q2 list) to exclude file types already identified, but I'm struggling to do this within a single list. In the example above, I can see how to match the July, August, and September files in the Q3 list to the files in the Q2 list, but how do I match the June, May, and April files within the one list? I have tried copying the list to another table and matching against it, but as it's a copy it obviously finds and matches everything! I then applied a fuzzy match to generate more matches, and grouped and totalled the number of matches, but I can't see how this will help me get down to one file per type. Any help really appreciated!
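
    The within-list problem described above can be illustrated with a small Python sketch using the standard library's difflib. Instead of joining the list to a copy of itself, each name is compared only with the representative of each cluster found so far, so a name never just matches itself. The function name cluster_similar and the 0.8 threshold are assumptions for illustration; the threshold needs tuning on real data, and names that differ only in a product code (product001 vs product002) would likely be merged at this setting.

```python
from difflib import SequenceMatcher


def cluster_similar(names, threshold=0.8):
    """Greedily group names whose similarity to a cluster's first member
    is at least `threshold` (an assumed value, not a recommendation)."""
    clusters = []  # each cluster is a list; its first item is the representative
    for name in names:
        for cluster in clusters:
            score = SequenceMatcher(None, cluster[0].lower(), name.lower()).ratio()
            if score >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster was similar enough: start a new one.
            clusters.append([name])
    return clusters


names = [
    "product001_sales_Jun_01",
    "product001_sales_Jun_02",
    "product001_sales_May_31",
    "inventory_report_final",
]
clusters = cluster_similar(names)
representatives = [cluster[0] for cluster in clusters]
print(representatives)
```

    Keeping only the first member of each cluster gives one row per file type, which is the "get down to one file" step that grouping and counting matches does not provide.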

    Thursday, September 5, 2019 3:34 PM

All replies