Bash - Finding iTunes Duplicates
Jim's shrinking music collection: it was filled with strange duplicates, so I set about removing them
Did a backup of my friend Jim's music. He was really pleased with himself that he'd finally filled up his 30GB iPod and wanted to be sure all that work wouldn't be lost. After inspecting the backup I couldn't help noticing that on occasion there were as many as 20 copies of the same track. So I set out to fix this with a script to remove the duplicates.
Here is a typical example of the kind of duplicates which turn up, revealed by searching for a specific track name from the terminal with a recursive file search (the unix tool 'find').
find . -name '*Ehmedo*'
./Compilations/S© Nomads 3/03 Ehmedo.mp3
./Music/Compilations/S© Nomads 3/03 Ehmedo 1.mp3
./Music/Compilations/S© Nomads 3/03 Ehmedo 10.mp3
./Music/Compilations/S© Nomads 3/03 Ehmedo 11.mp3
./Music/Compilations/S© Nomads 3/03 Ehmedo 13.mp3
./Music/Compilations/S© Nomads 3/03 Ehmedo 14.mp3
./Music/Compilations/S© Nomads 3/03 Ehmedo 15.mp3
./Music/Compilations/S© Nomads 3/03 Ehmedo 16.mp3
./Music/Compilations/S© Nomads 3/03 Ehmedo 2.mp3
./Music/Compilations/S© Nomads 3/03 Ehmedo 3.mp3
./Music/Compilations/S© Nomads 3/03 Ehmedo 5.mp3
./Music/Compilations/S© Nomads 3/03 Ehmedo 6.mp3
./Music/Compilations/S© Nomads 3/03 Ehmedo 7.mp3
./Music/Compilations/S© Nomads 3/03 Ehmedo 8.mp3
./Music/Compilations/S© Nomads 3/03 Ehmedo 9.mp3
./Music/Compilations/S© Nomads 3/03 Ehmedo.mp3
Now Jim's not a very technical person, and I'm not sure exactly how he got himself into this tizz, but his collection was maintained on a bunch of different people's laptops (each indexing his portable hard drive for him) until he got his MacBook this year and things started to settle down.
This could be seen as good or bad for Jim. He hasn't got the amount of music he thinks he has, but he can get much more on his iPod by getting rid of these duplicates. However that's easier said than done, hence this blog entry.
How to Remove iTunes Duplicates: Practice with the Mac Terminal, Bash and AWK
So anyway I thought I'd turn this into a little exercise in using the BSD power tools available on the Mac terminal in order to process all this information, eliminate duplicates and finally automatically delete the duplicates from the library. It's a challenging problem for (at least) the following reasons...
- Mp3s with unique number suffixes are sometimes duplicates and sometimes not...
- 03 Ehmedo 2.mp3 and 03 Ehmedo 3.mp3 are the same track
- Track 02.mp3 and Track 03.mp3 are different tracks ripped from the same album
- Different file types need to be anticipated
- The collection sometimes has m4a (AAC audio) as well as mp3 for a given track.
- Other file types are sitting around in there too, and need to be excluded. For example, here is a Terminal command which reveals all the file types in Jim's iTunes directory:
find . -type f | grep -o '\.[^\.]*$' | sort | uniq
.DOC .DS_Store .Doc .JPG .PDF .apr .dll .doc .exe .gif .inf .ini .itl .jpg .m4a .m4p .m4v .mp3 .mp4 .ost .pdf .plist .ppt .shs .txt .wmf .xls .xml
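As a toy illustration, the same extension-listing pipeline can be tried on a hand-made list of invented file names:

```shell
# grep -o prints only the matched part (the final ".ext"),
# sort groups identical extensions together, and uniq collapses repeats.
printf '%s\n' photo.JPG song.mp3 notes.txt song2.mp3 \
  | grep -o '\.[^.]*$' | sort | uniq
# prints .JPG, .mp3 and .txt, one per line
```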
The Approach
In plain English, this is the approach which needs to be taken to identify the files to keep (the ones which aren't duplicates)...
- A list of all the mp3 and m4a files must be generated
- Certain files are suffixed by numbers legitimately, and these are kept (they contain 'Track' or 'Untitled')
- A minimal identifier for each of the remaining files is derived by dropping .m4a, .mp3 and any trailing numbers.
- All files which share the same minimal identifier are duplicates of each other
- From each duplicate group, only one track will be kept, according to this scheme:
- The shortest track name is preferred (time.m4a is preferred to time2.m4a)
- The suffix mp3 is preferred (time.mp3 is preferred to time.m4a) as mp3s are more portable
- At the end, the files NOT in the 'keep' list are deleted.
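To make the 'minimal identifier' idea concrete, here's a rough sketch (the file name is made up) of how a name collapses to its identifier once the trailing spaces, numbers and suffix are dropped:

```shell
# Strip any trailing run of spaces/digits plus the .mp3/.m4a suffix,
# leaving the minimal identifier used to group duplicates.
id=$(printf '%s\n' "03 Ehmedo 14.mp3" | sed -E 's/[[:space:]0-9]*\.(mp3|m4a)$//')
echo "$id"    # → 03 Ehmedo
```

"03 Ehmedo 14.mp3", "03 Ehmedo 2.mp3" and "03 Ehmedo.mp3" all collapse to the same identifier, which is exactly what makes them recognisable as duplicates.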
The Implementation
The trick as with many of these problems is simply to choose the right tools for the job. The actual code to do this is pretty small if you choose right.
Get file names using 'find'
First of all we'll use our old friend 'find' to grab all files.
find . -type f
Which outputs the full list of all files in the current directory (.) one per line, excluding directories and other strange file handles (-type f), like this sample.
...
./Music/Zubop Gambia/Zubop Gambia/09 Lumbi Gada Mayo _ Jarama Zubop.mp3
./Music/Zubop Gambia/Zubop Gambia/10 Gatche Hare.mp3
./Music/Zubop Gambia/Zubop Gambia/11 Dawda Sane.mp3
Processing file list through awk
Then we'll pipe this output into a text processing tool called awk, which handles the content with a bespoke filtering program. The first program we'll feed the list through keeps only the files ending in mp3 or m4a. We'll pipe this output into a file called 'all' by using the '>' symbol.
find . -type f | awk '/mp3$|m4a$/' > all
This is our global list of files. Anything in this 'all' list which doesn't make it through to the 'keep' list will get deleted.
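An awk program consisting of just a pattern with no action prints every matching line, so the filter can be sanity-checked on a few invented names:

```shell
# Lines ending in mp3 or m4a pass through; everything else is dropped.
printf '%s\n' 'a.mp3' 'b.doc' 'c.m4a' | awk '/mp3$|m4a$/'
# → a.mp3
# → c.m4a
```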
Sanity Checking Uniqueness
At this stage it's worth double checking the planned scheme works at all. This check simply involves spitting out the unique id calculated for each file. Here is the awk code for generating the unique id only.
printids.awk
# A file identifier is the file name (without path) with number and file suffixes removed
{
# EACH LINE CONTAINS ONE FILE PATH
path = $0;
# IDENTIFY THE DIRECTORY PART
match(path, /^.*\//);
prefixend = RSTART + RLENGTH;
# IDENTIFY THE TRAILING SPACES/NUMBERS AND FILE SUFFIX
match(path, /[[:space:]0-9]*\.(mp3$|m4a$)/);
suffixstart = RSTART;
# THE ID IS THE PART IN THE MIDDLE
id = substr(path, prefixend, suffixstart - prefixend);
print id;
}
This program is a bit long to just put on the command line, so we'll get awk to load it from a file called printids.awk using the -f option. Note we are using 'cat' to retrieve the contents of the all file we created earlier.
cat all | awk -f printids.awk | sort | uniq -c | sort
This command line also feeds the output through two more unix tools, sort and uniq. The first sort simply orders all the lines alphabetically (putting all copies of the same file identifier next to each other). Then uniq is used with the -c option to count runs of identical adjacent lines (and there are LOTS of duplicates). Lastly, the output of uniq (which starts each line with the number of copies found) is sorted again, pushing the worst cases of duplication to the end of the list.
Sample of output...
...
10 14 Song For Olabi
14 01 Just One Day
14 02 1000 Islands
14 03 Ehmedo
14 07 Je N'aime Que Toi
14 08 Houndoti
15 04 Chang Tzel [Zeb Bring Me Peace Remix]
22 05 Hora Lui Sile
This is just the worst examples (the last lines of the sorted results). It shows that Hora Lui Sile has 22 copies of the same track in the library!
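The counting pipeline itself is easy to verify on toy input (the words here are invented):

```shell
# sort groups the repeats, uniq -c prefixes each distinct line with its
# count, and the final sort pushes the biggest counts to the bottom.
printf '%s\n' pear apple apple apple | sort | uniq -c | sort
# the last line shows the most-repeated entry: "3 apple"
```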
Outputting which files to retain
Given that this works OK, and reveals a genuine problem of duplication, we can go the whole hog, outputting the file names to retain and piping that into a file called 'keep'.
cat all | awk -f nonduplicates.awk > keep
The full nonduplicates.awk file processes each line in turn, using pattern-matching to strip each filename to a 'file identifier', without directories, trailing numbers or filetype suffixes. It inserts the filename into an associative array, using the 'file identifier' as the key for that entry, UNLESS there is a file already recorded against that identifier which is preferred (e.g. one with a shorter name without numbers added, or which is mp3 rather than m4a).
At the end, once all files have been processed, it prints the file paths of the preferred files: the values which were retained in the associative array. Basically this procedure aims to preserve only one file against each identifier. All files called Track or Untitled (which commonly have numbers in them legitimately) are just allowed through directly. Duplicates are simply forgotten.
nonduplicates.awk
/Track/ { print; next }
/Untitled/ { print; next }
{
# STRIP FILE NAME AND PATH TO A MINIMAL IDENTIFIER
# EACH LINE CONTAINS ONE FILE PATH
path = $0;
# IDENTIFY THE DIRECTORY PART
match(path, /^.*\//);
prefix = substr(path, RSTART, RLENGTH);
prefixend = RSTART + RLENGTH;
# IDENTIFY THE TRAILING JUNK
suffix = match(path, /[[:space:]0-9]*\.(mp3$|m4a$)/);
suffixstart = RSTART;
# THE ID IS THE GOOD BIT IN THE MIDDLE
id = substr(path, prefixend, suffixstart - prefixend);
# IS THERE ALREADY A PREFERRED FILE FOR THIS IDENTIFIER?
if(id in nonduplicates){
oldpath = nonduplicates[id];
if(path ~ /\.mp3$/ && oldpath !~ /\.mp3$/){
# PREFER MP3
newpath = path;
}
else if(oldpath ~ /\.mp3$/ && path !~ /\.mp3$/){
# KEEP THE EXISTING MP3
newpath = oldpath;
}
else if(length(path) < length(oldpath)){
# PREFER SHORTEST
newpath = path;
}
else{
# ELSE KEEP THE OLD PATH
newpath = oldpath;
}
}
else{
newpath = path;
}
# STORE PREFERRED PATH AGAINST KEY
nonduplicates[id]=newpath;
}
END{
# OUTPUT PREFERRED FILE FOR EACH IDENTIFIER
for(id in nonduplicates){
print(nonduplicates[id]);
}
}
Final results
To identify the things to lose rather than keep, we can use a tool called 'comm', which tells us which lines are in the 'all' file but not in the 'keep' file (note that comm expects both input files to be sorted). This is used to generate a 'lose' file which contains all the files to delete.
comm -23 all keep > lose
The comm tool normally prints three columns:
- lines unique to the first file
- lines unique to the second file
- lines common to both files
The -23 option asks it to suppress the second and third columns, preserving only the first.
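A toy run (with invented lists, both already sorted as comm requires) shows the effect:

```shell
# 'b' appears only in the first list, so with columns 2 and 3
# suppressed it is the only line printed.
printf '%s\n' a b c > all_demo
printf '%s\n' a c > keep_demo
comm -23 all_demo keep_demo    # → b
rm all_demo keep_demo
```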
Stats
A nice command line tool called 'wc' is available to do word counts (and line counts and so on), and we can use this to survey the results.
wc -l all keep lose
4716 all
2772 keep
1975 lose
This tells us that there were 4716 tracks in total, of which 2772 were kept as legitimate non-duplicated originals, and 1975 were duplicates which can safely be deleted, since a preferred copy of each already exists.
Deleting the files
This took more work than expected, owing to the spaces and apostrophes and special characters in some of the filenames. Ideally you'd just create a load of command lines like...
rm filename
...but if the filename contains spaces and stuff, this doesn't work. It ends up with...
rm file's name
...which the command line reads as two separate filenames, while also being totally confused by the hanging apostrophe, for which it waits for a closing quote. These characters need to be escaped with a preceding backslash '\', and I ended up writing my own routine using 'sed' (a stream editor) to do this, after having problems with 'printf %q' and other routines for quoting the lines read from the 'lose' file.
The full script
The final script is made up of two shell scripts and one awk file. The script 'run.sh' uses the nonduplicates.awk file to list all the files which should be dropped, and provides a simple report of the outcome. All the lines which start 'echo' in this script just print information on what happened; they don't do anything concrete.
#!/bin/bash
find . -type f | awk '/mp3$|m4a$/' | sort > all
cat all | awk -f nonduplicates.awk | sort > keep
comm -23 all keep > lose
# These lines are optional
echo
wc -l all keep lose | awk '/all/ {print $1, " total tracks"} /keep/ {print $1, " real tracks"} /lose/ {print $1, " duplicates found"}'
echo
echo 'Verify planned deletions in the file called lose'
echo
echo 'Run DMC (delete my crap) to remove all duplicates'
echo
The script 'dmc.sh' reads in the 'lose' file and processes it through sed to escape the special characters properly. It then feeds each escaped line to 'rm' (the unix remove program) to delete each file in turn.
#!/bin/bash
cat lose | sed "s/\'/\\\'/g" | sed 's/ /\\ /g' | xargs -n 1 rm
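As an aside, a while-read loop sidesteps the escaping problem entirely, since each line is passed to rm as one quoted argument. This is an alternative sketch rather than what was actually run:

```shell
#!/bin/bash
# Read each line of 'lose' verbatim (-r keeps backslashes literal,
# IFS= preserves leading/trailing spaces) and delete it as one argument.
while IFS= read -r f; do
  rm -- "$f"
done < lose
```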
Epilogue
So the result is that Jim actually only has 17GB of music, not 29GB as originally thought.
du -d0 -hc 1 orig1
17G 1
29G orig1