Delete Duplicate Files With Bash

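I needed to delete all but one copy of each duplicated file from a batch of experiments. I didn't care about the file names, so any copy could be the one kept. Here's the bash script I used; the delete step is commented out, so as written it only reports what it would remove.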

#!/bin/bash

# Collect every regular file under the directory, one array element per line of find output
mapfile -t list < <(find ./DIRECTORY_CONTAINING_FILES -type f)
declare -a sums;
cnt=${#list[@]}

echo "creating md5sum list"
for ((x = 0; x < cnt; x++))    # checksum every file, including the last one
do
    sums[$x]=$(md5sum "${list[$x]}" | cut -d ' ' -f 1)
    #echo ${sums[$x]}
    progress=$(echo "scale=2;($x/$cnt)*100" | bc)
    echo -ne "progress $progress %\r"
done

echo "doing compare"
for ((x = 0; x < $cnt -1; x++))
do
    for ((y = x+1; y < $cnt; y++))
    do
        if [ "${sums[$x]}" == "${sums[$y]}" ]; then
            #echo ${sums[$x]} " - "  ${sums[$y]}
            if [ "${list[$x]}" != "${list[$y]}" ]; then
                # to enable deletion, remove the '#' below and join the next 2 lines
                printf 'Delete file\n%s\n%s\n' "${list[$y]}" "${list[$x]}"
                # && rm -f -- "${list[$y]}"
            fi
        fi
    done
done
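
The nested loops above compare every checksum against every other one, which gets slow as the file count grows. Here is a rough sketch of an alternative that remembers the first path seen for each checksum and prints every later path with the same content. It is not the script I ran: the function name list_duplicates is made up for illustration, and it assumes bash 4 or later (for associative arrays) plus GNU find, sort and md5sum. Like the script above, it only reports and deletes nothing.

```shell
#!/bin/bash
# Sketch only: list redundant copies instead of comparing all pairs.
# Assumes bash 4+ and GNU find/sort/md5sum; list_duplicates is a hypothetical name.

# list_duplicates DIR
# Prints, one per line, each file whose content matches an earlier file under DIR.
list_duplicates() {
    local dir=$1 file sum
    local -A seen=()    # checksum -> first path found with that checksum

    # NUL-delimited paths survive spaces; sorting makes "first" predictable.
    while IFS= read -r -d '' file; do
        # Reading from stdin keeps the file name out of md5sum's output.
        sum=$(md5sum < "$file" | cut -d ' ' -f 1)
        if [[ -n ${seen[$sum]+set} ]]; then
            # Content already seen, so this path is a redundant copy.
            printf '%s\n' "$file"
        else
            seen[$sum]=$file
        fi
    done < <(find "$dir" -type f -print0 | sort -z)
}

# Example (dry run): list_duplicates ./DIRECTORY_CONTAINING_FILES
```

Check the list before acting on it. Once it looks right, the paths can be fed to rm, for example with list_duplicates ./DIRECTORY_CONTAINING_FILES | while IFS= read -r f; do rm -f -- "$f"; done, which is fine as long as no file name contains a newline.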

2 Responses to Delete Duplicate Files With Bash

  1. Stephen Paulger says:

    If you didn't care about the filenames could you have just renamed all the files to their md5sum?

    for f in *
    do
        mv -- "$f" "$(md5sum "$f" | cut -f1 -d" ")"
    done
