Bash - Finding iTunes Duplicates

Jim's shrinking music collection: it was filled with strange duplicates, so I set about removing them

I did a backup of my friend Jim's music. He'd been really pleased with himself that he'd finally filled up his 30Gb iPod, and wanted to be sure all that work wouldn't be lost. Inspecting the backup, I couldn't help noticing that some tracks had anything up to 20 copies each. So I set out to fix this with a script to remove the duplicates.

Here is a typical example of the kind of duplication which turns up, revealed by searching for a specific track name from the terminal using a recursive file search (the unix tool 'find').



                    find . -name '*Ehmedo*'

                    ./Compilations/S© Nomads 3/03 Ehmedo.mp3
                    ./Music/Compilations/S© Nomads 3/03 Ehmedo 1.mp3
                    ./Music/Compilations/S© Nomads 3/03 Ehmedo 10.mp3
                    ./Music/Compilations/S© Nomads 3/03 Ehmedo 11.mp3
                    ./Music/Compilations/S© Nomads 3/03 Ehmedo 13.mp3
                    ./Music/Compilations/S© Nomads 3/03 Ehmedo 14.mp3
                    ./Music/Compilations/S© Nomads 3/03 Ehmedo 15.mp3
                    ./Music/Compilations/S© Nomads 3/03 Ehmedo 16.mp3
                    ./Music/Compilations/S© Nomads 3/03 Ehmedo 2.mp3
                    ./Music/Compilations/S© Nomads 3/03 Ehmedo 3.mp3
                    ./Music/Compilations/S© Nomads 3/03 Ehmedo 5.mp3
                    ./Music/Compilations/S© Nomads 3/03 Ehmedo 6.mp3
                    ./Music/Compilations/S© Nomads 3/03 Ehmedo 7.mp3
                    ./Music/Compilations/S© Nomads 3/03 Ehmedo 8.mp3
                    ./Music/Compilations/S© Nomads 3/03 Ehmedo 9.mp3
                    ./Music/Compilations/S© Nomads 3/03 Ehmedo.mp3

Now Jim's not a very technical person, and I'm not sure exactly how he got himself into this tizz, but his collection has been maintained on a succession of other people's laptops (each indexing his portable hard drive for him) until he got his Macbook this year and things started to settle down.

This could be seen as good news or bad news for Jim. He hasn't got as much music as he thinks he has, but by getting rid of these duplicates he can fit much more on his iPod. However, that's easier said than done, hence this blog entry.

How to Remove iTunes Duplicates: Practice with the Mac Terminal, Bash and AWK

So anyway, I thought I'd turn this into a little exercise in using the BSD power tools available in the Mac terminal to process all this information, eliminate duplicates, and finally delete the duplicates from the library automatically. It's a challenging problem for (at least) the following reasons...

  • Mp3s with unique number suffixes are sometimes duplicates and sometimes not...
    • 03 Ehmedo 2.mp3 and 03 Ehmedo 3.mp3 are the same track
    • Track 02.mp3 and Track 03.mp3 are different tracks ripped from the same album
  • Different file types need to be anticipated
    • The collection sometimes has m4a (AAC audio) as well as mp3 for a given track.
    • Other file types are sitting around in there too, and need to be excluded. For example, here is a Terminal command which reveals all the file types in Jim's iTunes directory:
      find . -type f | grep -o '\.[^\.]*$' | sort | uniq
      
      .DOC
      .DS_Store
      .Doc
      .JPG
      .PDF
      .apr
      .dll
      .doc
      .exe
      .gif
      .inf
      .ini
      .itl
      .jpg
      .m4a
      .m4p
      .m4v
      .mp3
      .mp4
      .ost
      .pdf
      .plist
      .ppt
      .shs
      .txt
      .wmf
      .xls
      .xml
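If the grep pattern looks cryptic: -o prints only the matched text, and '\.[^\.]*$' matches the final dot plus everything after it. A quick sanity check on a few made-up paths (the names here are hypothetical, just to show the behaviour):

```shell
# Extract and de-duplicate file extensions from some made-up paths;
# grep -o emits only the matched part of each line.
printf '%s\n' './a/song.mp3' './b/cover.JPG' './c/track.m4a' './a/other.mp3' \
  | grep -o '\.[^\.]*$' | sort | uniq
# three unique extensions: .JPG, .m4a and .mp3
```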
                                      

The Approach

In plain English, this is the approach which needs to be taken to identify the files to keep (the ones which aren't duplicates)...

  1. A list of all the mp3 and m4a files must be generated
  2. Certain files are legitimately suffixed by numbers, and these are kept (their names contain 'Track' or 'Untitled')
  3. A minimal identifier for each of the remaining files is derived by dropping .m4a, .mp3 and any trailing numbers.
  4. All files which share the same minimal identifier are duplicates of each other
  5. From each duplicate group, only one track will be kept, according to this scheme
    • The shortest track name is preferred (time.m4a is preferred to time2.m4a)
    • The suffix mp3 is preferred (time.mp3 is preferred to time.m4a) as mp3s are more portable
  6. At the end, any file NOT in the 'keep' list is deleted.
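To make step 3 concrete before we do the real thing in awk, here is the stripping idea applied to a single made-up path using plain bash parameter expansion and sed (this is just an illustrative sketch, not the code we'll actually run):

```shell
# A hypothetical duplicate file path, for illustration only
path='./Music/Compilations/Album/03 Ehmedo 11.mp3'
name="${path##*/}"     # drop the directory part  -> "03 Ehmedo 11.mp3"
name="${name%.*}"      # drop the .mp3/.m4a suffix -> "03 Ehmedo 11"
# drop any trailing spaces and digits -> the minimal identifier
id="$(printf '%s' "$name" | sed 's/[0-9 ]*$//')"
echo "$id"             # prints: 03 Ehmedo
```

All sixteen copies of that Ehmedo track reduce to the same identifier this way, which is what makes them fall into one duplicate group.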

The Implementation

The trick, as with many of these problems, is simply to choose the right tools for the job. Choose well, and the actual code ends up pretty small.

Get file names using 'find'

First of all we'll use our old friend 'find' to grab all files.



                    find . -type f

                

Which outputs the full list of all files under the current directory (.), one per line, excluding directories and other odd file handles (-type f), like this sample.



                    ...

                    ./Music/Zubop Gambia/Zubop Gambia/09 Lumbi Gada Mayo _ Jarama Zubop.mp3
                    ./Music/Zubop Gambia/Zubop Gambia/10 Gatche Hare.mp3
                    ./Music/Zubop Gambia/Zubop Gambia/11 Dawda Sane.mp3

Processing file list through awk

Then we'll pipe this output into a text processing tool called awk, which filters the content with a bespoke program. The first program we'll feed the list through keeps only files ending in mp3 or m4a. We'll redirect this output into a file called 'all' by using the '>' symbol.



                    find . -type f | awk '/mp3$|m4a$/' > all

                

This is our global list of files. Anything in this 'all' list which doesn't make it through to the 'keep' list will get deleted.

Sanity Checking Uniqueness

At this stage it's worth double-checking that the planned scheme works at all. The check simply involves spitting out the unique id calculated for each file. Here is the awk code for generating the unique id only.

printids.awk



                    # A file identifier is the file name (without path) with number and file suffixes removed
                    {
                        # STRIP FILE NAME AND PATH TO A MINIMAL IDENTIFIER
                        path = $0;
                        match(path, /^.*\//);
                        prefix = substr(path, RSTART, RLENGTH);
                        prefixend = RSTART + RLENGTH;
                        suffix = match(path, /[0-9]*\.(mp3$|m4a$)/);
                        suffixstart = RSTART;
                        id = substr(path, prefixend, suffixstart - prefixend);
                        print id;
                    }

This program is a bit long to just put on the command line, so we'll get awk to load it from a file called printids.awk using the -f option. Note we are using 'cat' to retrieve the contents of the all file we created earlier.



                    cat all | awk -f printids.awk | sort | uniq -c | sort

                

This command line also feeds the output through two more unix tools, sort and uniq. The first sort simply orders all the lines alphabetically (e.g. putting all the copies of the same file identifier next to each other). Then uniq is used with the -c option to count any duplicate lines which are adjacent (and there are LOTS of duplicates). Lastly, the output of uniq (which starts each line with the number of duplicates found) is re-sorted by that count. This creates a list which reveals the worst cases of duplication.
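If the sort | uniq -c | sort counting idiom is new to you, here it is on some toy input:

```shell
# uniq -c only collapses *adjacent* identical lines, hence the sort before it;
# the trailing sort then ranks the counted lines so big counts come last.
printf '%s\n' b a b b a c | sort | uniq -c | sort
# last line is the most duplicated item: "3 b"
```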

Sample of output...



                    ...

                    10 14 Song For Olabi
                    14 01 Just One Day
                    14 02 1000 Islands
                    14 03 Ehmedo
                    14 07 Je N'aime Que Toi
                    14 08 Houndoti
                    15 04 Chang Tzel [Zeb Bring Me Peace Remix]
                    22 05 Hora Lui Sile

These are just the worst examples (the last lines of the sorted results). They show that 05 Hora Lui Sile has 22 copies of the same track in the library!

Outputting which files to retain

Given that this works OK, and reveals a genuine problem of duplication, we can go the whole hog, outputting the file names to retain and redirecting them into a file called 'keep'.



                    cat all | awk -f nonduplicates.awk > keep

                

The full nonduplicates.awk file processes each line in turn, using pattern matching to strip each filename down to a 'file identifier' with no directories, trailing numbers or filetype suffixes. It inserts the filename into an associative array, using the 'file identifier' as the key for that entry, UNLESS there is already a file recorded against that identifier which is preferred (e.g. one with a shorter name without numbers added, or which is mp3 rather than m4a).

At the end, once all files have been processed, it dumps the file paths of the preferred files - the values which were retained in the associative array. Basically this procedure aims to preserve only one file against each identifier. Files called Track or Untitled (which commonly have legitimate numbers in their names) are just allowed through directly. Duplicates are simply forgotten.

nonduplicates.awk



                    

                    # FILES CALLED Track OR Untitled LEGITIMATELY END IN NUMBERS:
                    # PRINT THEM DIRECTLY AND SKIP THE DUPLICATE CHECK
                    /Track|Untitled/ { print; next }

                    {
                        # STRIP FILE NAME AND PATH TO A MINIMAL IDENTIFIER

                        # EACH LINE CONTAINS ONE FILE PATH
                        path = $0;

                        # IDENTIFY THE DIRECTORY PART
                        match(path, /^.*\//);
                        prefix = substr(path, RSTART, RLENGTH);
                        prefixend = RSTART + RLENGTH;

                        # IDENTIFY THE TRAILING JUNK
                        suffix = match(path, /[[:space:]0-9]*\.(mp3$|m4a$)/);
                        suffixstart = RSTART;

                        # THE ID IS THE GOOD BIT IN THE MIDDLE
                        id = substr(path, prefixend, suffixstart - prefixend);

                        # IS THERE ALREADY A PREFERRED FILE FOR THIS IDENTIFIER?
                        if (id in nonduplicates) {
                            oldpath = nonduplicates[id];
                            if (path ~ /mp3$/ && oldpath !~ /mp3$/) {
                                # PREFER MP3
                                newpath = path;
                            }
                            else if (length(path) < length(oldpath)) {
                                # PREFER SHORTEST
                                newpath = path;
                            }
                            else {
                                # ELSE KEEP THE OLD PATH
                                newpath = oldpath;
                            }
                        }
                        else {
                            newpath = path;
                        }

                        # STORE PREFERRED PATH AGAINST KEY
                        nonduplicates[id] = newpath;
                    }

                    END {
                        # OUTPUT PREFERRED FILE FOR EACH IDENTIFIER
                        for (id in nonduplicates) {
                            print(nonduplicates[id]);
                        }
                    }

Final results

To identify the files to lose rather than keep, we can use a tool called 'comm', which can tell us which lines are in the 'all' file but not in the 'keep' file. This is used to generate a 'lose' file which contains all the files to delete. (Note that comm requires both of its inputs to be sorted, which the full script takes care of.)



                    comm -23 all keep > lose

                

The comm tool normally prints out three columns:

  • lines unique to the first file
  • lines unique to the second file
  • lines common to both files

The option -23 asks it to suppress the second and third columns, preserving only the first.
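A toy demonstration of -23 on a pair of sorted throwaway files:

```shell
# comm -23 prints only the lines unique to the first file;
# both inputs must be in sorted order.
all_demo=$(mktemp); keep_demo=$(mktemp)
printf '%s\n' a b c d > "$all_demo"
printf '%s\n' a c > "$keep_demo"
comm -23 "$all_demo" "$keep_demo"    # prints: b, then d
rm -f "$all_demo" "$keep_demo"
```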

Stats

A nice command line tool called 'wc' is available to do word counts (and line counts and suchlike), and we can use this to survey the results.



                    wc -l all keep lose

                    4716 all
                    2772 keep
                    1975 lose

This tells us that there were a total of 4716 tracks originally, of which 2772 were legitimate non-duplicated originals, leaving 1975 duplicates which can safely be deleted, since preferred copies of these files are already being kept.

Deleting the files

This took more work than expected, owing to the spaces and apostrophes and special characters in some of the filenames. Ideally you'd just create a load of command lines like...



                    rm filename

                

...but if the filename contains spaces and stuff, this doesn't work. It ends up with...



                    rm file's name

                

The command line reads this as two separate filenames, and is totally confused by the fact that it's waiting for a closing quote to match the hanging apostrophe. These characters need to be escaped with a preceding backslash '\', and I ended up having to write my own routine using 'sed' (a stream editor) to do this - I had problems with 'printf %q' and other routines for quoting the lines read from the 'lose' file.

The full script

The final script is made out of two scripts and one awk file. The script 'run.sh' uses the nonduplicates.awk file to list all the files which should be dropped, and provides a simple report of the outcome. All the lines which start 'echo' in this script are just printing information on what happened - they don't do anything concrete.



                    #!/bin/bash
                    find . -type f | awk '/mp3$|m4a$/' | sort > all
                    cat all | awk -f nonduplicates.awk | sort > keep
                    comm -23 all keep > lose
                    # These lines are optional
                    echo
                    echo `wc -l all keep lose | awk '/all/ {print $1, " total tracks"} /keep/ {print $1, " real tracks"} /lose/ {print $1, " duplicates found"}'`
                    echo
                    echo 'Verify planned deletions in the file called lose'
                    echo
                    echo 'Run DMC (delete my crap) to remove all duplicates'
                    echo

The script 'dmc.sh' reads in the 'lose' file and processes it through sed to insert the proper escape characters. It then feeds each properly escaped line to 'rm' (the unix remove program) to delete each file in turn.



                    #!/bin/bash
                    cat lose | sed "s/\'/\\\'/g" | sed 's/ /\\ /g' | xargs -n 1 rm
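For what it's worth, an alternative which sidesteps the escaping problem entirely is to convert each newline in 'lose' into a NUL byte, so that xargs -0 passes each name to rm untouched (this handles spaces, apostrophes and any other character except a newline in the file name, and the BSD xargs on the Mac supports -0). A sketch:

```shell
# Delete every path listed in 'lose', one NUL-delimited record per file;
# no escaping needed, since xargs -0 does not interpret quotes or backslashes.
tr '\n' '\0' < lose | xargs -0 rm --
```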

Epilogue

So the result is that Jim actually only has 17Gig of music, not 29 as originally thought.



                    du -d0 -hc 1 orig1

                    17G    1
                    29G    orig1

Tagged: scripting (6), jim smith, itunes