By Anne-Marie Concepcion
Do you ever get multiple copies of the same catalog in the mail? Or three or four copies of the same HTML email from the same company? The problem here is a dirty list: Someone forgot to check for duplicate entries.
When you’re doing any sort of mail merging or database publishing, it’s important to start with a clean, “de-duped” list before you import the data, because they’re often impossible to find and delete in the finished product.
In the case of HTML emails, you never even see the finished product. But you might just aggravate a recipient enough to get tagged as a spammer.
If the data is still in its native database format (Access, Filemaker, etc.) getting rid of dupes is seldom an issue. You just look for the program’s “Delete Duplicate Records” command.
But sometimes, you don’t have a database, just a list of thousands of something or others that you’ve combined from various sources, like websites and Word files.
And, when you’re doing a job for a client, you’re normally given a tab- or comma-delimited text file that was *exported* from their database, which may have been in an atrocious state to begin with. Or worse, they call their Saved-As-bazillion-times Word document that Marge started in 1994 their “database” and give you that.
So … confronted with a text file containing thousands of records, one record per paragraph, how can you quickly get rid of duplicate entries?
De-dupe in Excel
If you have Microsoft Excel, you can open the tab- or comma-delimited text file in it and either use a formula operation or download and purchase an Excel add-on that adds the ability to find and delete duplicates.
If you feel like strapping on your Excel Geek gear and trying the formula method, follow the steps outlined here.
Or, you can purchase an inexpensive Excel (Windows only) add-on, like DeDupe4excel.
DeDupe4excel is more powerful than the formula method, because you can configure it to delete “close” duplicates, like “Mike” and “Michael.” The formula method only works on exact matches.
Actually, if you do an online search for the term “delete duplicate +excel” you’ll end up with dozens of Excel add-ons that you might want to investigate.
After you’ve removed the duplicates, just save the cleaned-up data back out in tab-delimited text or CSV format to use as your source file.
De-dupe in a Text Editor
Mac users can take care of the de-duping in a few seconds with the help of one of my favorite programs in the world, Bare Bones Software’s free TextWrangler.
Open the file in TextWrangler (their commercial BBEdit program has this too) and choose Text > Process Duplicate Lines. In the resulting dialog box, turn on the checkbox for Delete Duplicate Lines, and click the Process button. Done!
This powerful little dialog box also lets you specify pattern matching (aka GREP) and offers the option to move the deleted lines to a new file so you can see what it removed.
Windows users can use the slick UltraEdit ($59.95) program from IDM.
Open the text file in UltraEdit and choose File > Sort > Advanced Sort Options. Choose a sort order, turn on the Remove Duplicates checkbox, and click the Sort button. Voila!
De-dupe on the Web for Free
What could be simpler than copying and pasting onto a website? Check out http://SortMyList.com. In one simple — and a little primitive, design-wise — window, it can do all sorts of manipulations to the text you paste in.
Not only can it de-dupe (look under the Clean menu for the Remove Duplicates command), but it can transpose, clean white space, reverse items (change “firstname lastname” to “lastname firstname”), and delete special characters. And more. For free!
In fact, you don’t even have to copy/paste. The first menu on the web page, File, lets you upload .txt files to the window. When you’re done, choose File > Save to Hard Drive.
Anne-Marie “Her Geekness” Concepcion (@amarie on Twitter) is owner of Seneca Design & Training, a cross-media publication design studio that provides consulting, design, production, and training services to publishers and designers around the world. She is the co-host of the InDesignSecrets.com blog and podcast, and an author of many video tutorials at Lynda.com. Anne-Marie publishes DesignGeek, an ezine of tips and tricks that’s enjoyed by over 8,000 creative professionals. DesignGeek’s home page can be found at http://designgeek.com/.
This article is reprinted with permission. Copyright 2011 Seneca Design & Training, Inc. All Rights Reserved. No portion of this article may be reprinted by any means for any purpose without express written permission of the copyright holder.