fdupes Tutorial

February 18, 2007 at 6:58 pm 18 comments

fdupes seems to have done the job of getting rid of the horrible duplicate mess I’ve managed to get myself into, but glancing through the duplicates list it looks like I may have rendered some configuration settings (like the ones in ~/.mozilla) unusable. That’s not a real big problem though, since I planned to start those over mostly from scratch anyway.

The only real problem I have with fdupes is that it doesn’t detect whole duplicate directory trees (which would have made removing the files a little quicker), so there are a couple of empty directories left floating around. I also wish it had an option to be more verbose and print the MD5 sum along with the results (just to relieve a little bit of paranoia).

To start the removal process, run fdupes with the -r option. (fdupes -r $somedir) This will write the list of duplicate files to the console, which is probably not what you want. I like piping to tee so I can watch the progress and have a list of the files written at the same time. (fdupes -r $somedir | tee $filelist)
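
For example, with a made-up music directory and list file (substitute your own paths for $somedir and $filelist):

fdupes -r ~/music | tee ~/dupes.txt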

The results will list all the duplicates in groups of identical files, with each group separated by a blank line.
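
The output looks roughly like this (file names invented purely for illustration), each group being one set of identical files:

/home/user/music/album/track01.mp3
/home/user/music/backup/track01.mp3

/home/user/music/cover.jpg
/home/user/music/backup/cover.jpg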

Now, to make the list of files to remove, run fdupes -rf $somedir | tee $filelist-omit-first. The added -f flag tells fdupes to omit the first match in each group from the output. (this leaves one file out of every group of duplicates, so at least one copy is preserved)
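
Continuing the made-up example, the second list would be made with:

fdupes -rf ~/music | tee ~/dupes-omit-first.txt

Each group in this list should have one fewer entry than the matching group in the first list.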

Use your favorite diff program (I recommend vimdiff) to compare the first file list you made against the second, as a precaution against deleting something you don’t want. Once that’s done, it should be safe to remove all the files in the second list (the one made with fdupes -rf).
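
With the made-up names from above, the comparison is just:

vimdiff ~/dupes.txt ~/dupes-omit-first.txt

The lines that appear only in the first list are the copies that will be kept.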

Do one last bit of cleanup on the file list by sorting it and eliminating duplicate and blank lines. (sort $filelist-omit-first | uniq | grep -v '^$' > $removelist)
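
Again with the made-up names, that cleanup step looks like:

sort ~/dupes-omit-first.txt | uniq | grep -v '^$' > ~/remove-list.txt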

Finally, you are ready to remove the duplicates. (while read file; do rm -v "$file"; done < $removelist)
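
With the made-up list name from above, the final step becomes:

while read file; do rm -v "$file"; done < ~/remove-list.txt

And since fdupes leaves behind the empty directories mentioned earlier, GNU find can sweep those up afterwards (preview with -print in place of -delete before running it for real):

find ~/music -type d -empty -delete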

Good-bye duplicates!


Entry filed under: Linux.


18 Comments

  • 1. Ryan Sinn  |  March 2, 2007 at 1:22 pm

    I think your final line to remove the duplicates isn’t any good because you don’t specify the file to read from…

    This works —

    for file in `cat $removelist`; do echo $file; done

    Where $removelist is the file list to be deleted and $file is actually a variable that references the “file” at the beginning of the line… for file

    Great post / help though 🙂

  • 2. Ryan Sinn  |  March 2, 2007 at 1:26 pm

    whoops 🙂 Instead of do echo $file … you need to change that to rm -v when you have the confidence required to actually delete your files 🙂

    for file in `cat $removelist`; do rm -v "$file"; done

    I had been using the new alpha release of Mozilla Thunderbird and when I merged a few dozen mail folders containing over 8000 emails Thunderbird decided to make 20+ duplicates of over 1000 of those… YUCK! 39000 emails in a single folder… luckily I’m on linux and I can edit my Maildir by hand 🙂

  • 3. dosnlinux  |  March 2, 2007 at 7:41 pm

    Thanks for pointing that out.

    I forgot the file redirect after done. The last line should be: while read file; do rm -v "$file"; done < $removelist

    Glad you found the post helpful 🙂

  • 4. Paul  |  October 2, 2007 at 10:27 am

    The last step doesn’t seem to work for files with spaces in the names. I’m getting output like:


    rm: cannot remove `iTunes/Kansas/Dust': No such file or directory
    rm: cannot remove `In': No such file or directory
    rm: cannot remove `The': No such file or directory
    rm: cannot remove `Wind.mp3': No such file or directory

  • 5. Paul  |  October 2, 2007 at 10:29 am

    Aha, I switched to bash and now all is OK.

  • 6. Flightlessbird  |  October 13, 2007 at 7:10 am

    Fantastic help – even a total newbie like me could follow it. Very useful.

  • 8. Ahsan Ali  |  May 30, 2008 at 7:34 am

    Thanks, this was very useful even all the way over here in Pakistan. 🙂

  • […] mind that the file paths/names had difficult characters. The following fdupes tutorial was useful: fdupes Tutorial (Life at the CLI) I’ve double checked the results with variations on the following command: find . -type f -exec […]

  • 10. James  |  December 23, 2008 at 7:47 pm

    A little unclear. The sort command doesn’t specify which filelists should be passed as arguments (previous steps mention two filelists: the original -rf one, the one with first entries omitted). The while loop doesn’t mention which of the three filelists to pass, and where. Ryan’s Comment 1 defines variables as a tutorial should. Last, the while loop command mentioned in the tutorial is still incorrect lacking “< $removelist”. There are also various typos, which are annoying (“use start”).

  • 11. Sergio  |  February 4, 2009 at 7:02 am

    This is a nice approach:

    b=""
    fdupes tmp/ --recurse | \
    while read f
    do
    if [ "$f" = "" ]
    then
    b=”"
    else
    if [ "$b" = "" ]
    then
    b=”$f”
    else
    rm “$f” && echo “Removed \”$f\”"
    fi
    fi
    done

    (Source: http://www.miriamruiz.es/weblog/?p=79 )

  • 12. James McGill  |  April 18, 2009 at 12:36 pm

    I’m on the ambitious task of finding and removing duplicates on an 8 terabyte network attached storage system.

    Even breaking it down into reasonably sized chunks, fdupes takes days (maybe weeks, we don’t know yet) to run, but it doesn’t overgrow in memory when presented with huge numbers of files.
    (That’s what I was afraid it would do, and fslint choked on our filesystem.)

  • 13. none  |  September 25, 2009 at 8:28 am

    you can pipe the list of files into the iterator:

    cat dupes.txt | while read line; do rm -v "$line"; done

  • 14. none  |  October 16, 2009 at 12:58 pm

    you’ll need to adjust the quotes in my example above since the blog soft replaced them

  • 15. Eko_taas  |  November 27, 2010 at 10:55 am

    Very useful and simple, even for me (a lot to take in).

    8.9GB of music cleaned/duplicates removed (after a learning curve) in a few minutes 🙂

    Now it will be very easy (with comfort) to merge all the computers onto one NAS (w/o thinking too much) and then delete the duplicates

    I skipped the comparison part as the checksum looks to me like a very safe way to match only identical files…

  • 16. bayrouni  |  November 6, 2011 at 7:11 am

    Here is a script I put together based on the information I was able to gather. Your feedback please

    #!/bin/bash

    # directory to process
    REP=$1

    [ $# -ne 1 ] && echo "Usage: `basename $0` full path to the directory" && exit

    if [ ! -e "$REP" ]; then
        echo "Directory does not exist...!"
        exit
    fi

    beep -f 1000 -l 250 -D 250 -r 3

    # create a file listing the duplicates to delete, while keeping one copy of each
    fdupes --recurse --omitfirst --noempty --size "$REP" > omitfirst

    # sort the files alphabetically
    sort omitfirst | uniq | grep -v '^$' > omitfirst_sorted

    # keep only the files with the chosen extensions
    grep -i -e '\.\(jpg\|jpeg\)$' omitfirst_sorted > omitfirst_sorted_final

    # delete the duplicates
    i=0
    while read f; do
        # remove the '#' once you have tested
        #rm -vf "$f"

        # test first
        echo "$f"

        i=$(($i+1))
    done < omitfirst_sorted_final

    beep -f 2000 -l 250 -D 250 -r 3

  • 17. manfred  |  November 13, 2011 at 7:39 am

    fdupes fails on folders with spaces in their names, like /Black Sabbath. I can’t find an option to make it read those files too.

  • 18. kamesh  |  January 26, 2013 at 7:56 am

    lol… it’s 2013, and this post was still so helpful for me. Thank you

