Find duplicates and diff separate drafts?

derekvan · July 6, 2020, 9:14pm

I am looking to move my notes out of Bear and into Drafts. At some point last year, I moved a bunch of Drafts into Bear. Now, with some of the newer Drafts features, I feel like Bear is unnecessary and I’m ready to keep it all in Drafts.

My plan is to simply export all the Bear notes and import them into Drafts. This will result in some duplicates, though not too many (maybe 100-200). Some of the duplicates may have subtle differences (mainly additions on the Bear side).

Is it possible in JavaScript to:

1. query for a bunch of drafts tagged X
2. take each draft and search for another draft with same title
3. diff those two drafts
4. save the one with the most text in it and delete the other one (or maybe the one with the most recent modification date, which I can preserve in importing from Bear)

I know I can do step 1, but after that I’m not entirely sure if it’s possible. I think I can repurpose @jsamlarose's diff script for step 3 (his script diffs different versions of the same draft). I’m not entirely sure about step 2 either, since the titles could be different too. Could there be a way to search for drafts that are mostly the same?

Basically, before I started hacking on this, I was curious if others had any opinions about the overall approach or any specific guidance on my plan.
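
For illustration, steps 2 and 4 of the plan above could be sketched roughly as follows. This is only a sketch under assumptions, not a tested Drafts action: it works on plain `{ id, text }` objects (in Drafts these would presumably be built from `Draft.query` results using each draft's `uuid` and `content`), it treats the first line as the title, and `titleKey` and `planDedupe` are hypothetical helper names. It only produces a plan and deletes nothing.

```javascript
// Hypothetical helper: use the first line as the "title", dropping a leading
// Markdown heading marker and ignoring case and surrounding whitespace.
function titleKey(text) {
  const firstLine = text.split("\n")[0];
  return firstLine.replace(/^#+\s*/, "").trim().toLowerCase();
}

// Group notes by title. For every title shared by two or more notes,
// keep the note with the most text and list the others as discard candidates.
function planDedupe(notes) {
  const groups = new Map();
  for (const note of notes) {
    const key = titleKey(note.text);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(note);
  }
  const plan = [];
  for (const [title, members] of groups) {
    if (members.length < 2) continue; // no duplicate for this title
    const sorted = [...members].sort((a, b) => b.text.length - a.text.length);
    plan.push({
      title,
      keep: sorted[0].id,
      discard: sorted.slice(1).map((n) => n.id),
    });
  }
  return plan;
}

// Made-up sample data standing in for drafts.
const sample = [
  { id: "a1", text: "# Groceries\nmilk" },
  { id: "b1", text: "Groceries\nmilk\neggs" },
  { id: "c1", text: "Reading list\nDune" },
];
const plan = planDedupe(sample);
```

Sorting on a modification date instead of text length would give the "most recent wins" variant. Exact title matching will miss renamed notes, which is where a fuzzy comparison comes in.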

jsamlarose · July 6, 2020, 10:04pm

Happy to see that action getting some use! Also: many thanks for yours, particularly Tag Assign or Filter and your Editorial round trip actions.

It shouldn’t be too difficult to have that diff function compare different drafts rather than versions of the same draft, but I imagine that your biggest issue would be figuring out what constitutes sameness, if you can’t guarantee that the titles are consistent between the versions you want to compare…

mattgemmell · July 6, 2020, 10:16pm

Maybe a ratio of the Levenshtein distance to character length? That would give a sort of sameness percentage which you could then threshold, or ask the user about. You could do that on titles to find candidates, but comparing every pair of titles would have polynomial time complexity.
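
As a rough sketch of that idea (the function names `levenshtein` and `similarity` are just illustrative, and dividing by the longer string's length is one assumed way of turning the distance into a ratio):

```javascript
// Classic dynamic-programming Levenshtein distance, keeping only one
// previous row in memory at a time.
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Sameness as a value between 0 (nothing in common) and 1 (identical).
function similarity(a, b) {
  const longest = Math.max(a.length, b.length);
  if (longest === 0) return 1; // two empty strings count as identical
  return 1 - levenshtein(a, b) / longest;
}
```

Each comparison costs time proportional to the product of the two string lengths, and checking every title against every other title is quadratic in the number of drafts, so this is probably better suited to titles than to full note bodies.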

derekvan · July 6, 2020, 10:35pm

This sounds helpful, but is way over my head. Basically you’re suggesting something like what’s described in this Stack Overflow question?

derekvan · July 6, 2020, 10:39pm

OK, @mattgemmell's answer led me to this “fuzzy” JavaScript library, which purports to dedupe strings. Maybe I can use that somehow. Maybe tag all dupe possibilities and I can just eyeball them quickly. Or delete all the ones that are exactly the same and tag those that meet a fuzzy threshold.
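
The "exactly the same" half of that could be handled cheaply before any fuzzy matching. A minimal sketch, assuming plain `{ id, text }` objects rather than real drafts, and assuming that differences in line endings and trailing whitespace should not count; `normalize` and `splitExactDuplicates` are hypothetical names:

```javascript
// Ignore differences that are invisible when reading the note:
// Windows line endings, trailing spaces or tabs, leading/trailing blank lines.
function normalize(text) {
  return text.replace(/\r\n/g, "\n").replace(/[ \t]+$/gm, "").trim();
}

// Split imported notes into exact duplicates of an existing note and
// notes that still need a fuzzy comparison.
function splitExactDuplicates(imported, existing) {
  const seen = new Map();
  for (const note of existing) {
    seen.set(normalize(note.text), note.id);
  }
  const exact = [];
  const needsFuzzyCheck = [];
  for (const note of imported) {
    const match = seen.get(normalize(note.text));
    if (match !== undefined) {
      exact.push({ imported: note.id, existing: match });
    } else {
      needsFuzzyCheck.push(note.id);
    }
  }
  return { exact, needsFuzzyCheck };
}
```

Because the lookup goes through a `Map`, this pass is roughly linear in the number of notes, and only the leftovers in `needsFuzzyCheck` would have to go through the slower fuzzy comparison.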

derekvan · July 16, 2020, 5:06pm

Ok, I’ve got something working here. It uses the Fuse.js library. To run the script pasted below, you’ll need to download this library file as well.

I’m not sure how to generalize this into an action others could use, but some folks might be able to borrow from my approach. Basically, I had a bunch of notes imported from Bear that carried the “bear_import” tag. Some of those notes had duplicates in Drafts as well. So I constructed a query that found all drafts NOT tagged “bear_import” and built an array of objects holding the text and UUID of each one. Then I created a query for all the “bear_import” drafts. For each of those, I run a Fuse search that scores its similarity to the other drafts. If the score is low enough (in Fuse, a lower score means a closer match), the two matching drafts get a unique tag assigned (dupe1, dupe2, dupe3, etc.). Then I can go through the tags and verify which draft is the best one to keep.

On my Mac, this script takes about an hour to run. I’m not sure I’d try this on iOS.

// Find dupes

require("fuse.js");

// Every draft that does NOT carry the import tag.
const drafts = Draft.query("", "all", [], ["bear_import"]);
const allDrafts = [];
for (const d of drafts)
{
    allDrafts.push({ text: d.content, id: d.uuid });
}

// Every draft that was imported from Bear.
const bear = Draft.query("", "all", ["bear_import"]);

// Build the Fuse index once, instead of once per imported draft.
const options = {
    includeScore: true,
    keys: ["text"]
};
const fuse = new Fuse(allDrafts, options);

let count = 0;

for (const b of bear)
{
    const results = fuse.search(b.content);

    for (const r of results)
    {
        // Fuse scores run from 0 (perfect match) to 1 (no match).
        if (r.score < 0.6)
        {
            // Give each matching pair its own tag: dupe1, dupe2, ...
            count++;
            const tagger = "dupe" + count;
            b.addTag(tagger);
            b.update();
            const f = Draft.find(r.item.id);
            f.addTag(tagger);
            f.update();
        }
    }
}