Duplicate Contacts Manager for Thunderbird

Summary

This Thunderbird add-on searches address book(s) for pairs of matching contact entries. It can automatically delete entries that have equivalent or less information than the other one. The matching and information comparison uses advanced and partly configurable algorithms for normalizing values etc. Any remaining matches are presented for manual treatment, where the results of comparing their properties are shown. The user can choose to edit one or both of the matching entries, delete one of them, or skip the match.

Usage

This Thunderbird extension facilitates handling of redundant entries in address books.
After installation it can be invoked via the Tools->Duplicate Contacts Manager... menu entry. One can also customize the Toolbar of the Address Book window with a Find Duplicates button.

The Duplicate Contacts Manager searches address books for pairs of matching contact entries, also known as cards. It can automatically delete all cards that have equivalent or less information than some matching one. Any remaining pairs of matching cards are presented as candidate duplicates for manual treatment. Each pair of cards is shown side-by-side with a comparison of all fields containing data, including any photo. Some important fields are always shown so that they can be filled in if they have been empty so far.

[Screenshot of comparison window showing two matching cards]

When pairs of candidate duplicates are presented, various comparison information is given in the column between them.

During manual treatment of a pair of matching cards the user can skip them, can modify one or both of them, and can decide to delete one of them. When a card is deleted and it has a primary email address that is contained in one or more mailing lists and the other card does not have the same primary email address, the address is also deleted from the respective mailing lists. In order to exclude pairs of similar cards from being repeatedly presented for manual treatment they may be given different AIMScreenNames, such that they are filtered out from the search results.

At the beginning one or two address books can be selected and a couple of options can be set.

[Screenshot of start window with the options available]

At the end a summary is presented.

[Screenshot of final window showing statistics on the last run]

Matching contact entries

There are two search modes for finding matching cards:

The matching relation is designed to be rather weak, such that it tends to yield all pairs of potential duplicates. Two cards are considered matching if any of the following conditions hold, where the details are explained below.

Yet cards with non-equivalent AIMScreenName are never considered matching.

Matching of names, email addresses, and phone numbers is based upon equivalence and sub-equivalence of fields modulo abstraction, described below. As a result, for example, names differing only in letter case are considered to match. For the matching process, names are completed and their order is normalized: for example, if two name parts are detected in the DisplayName (e.g., "John Doe") or in an email address (e.g., "John.Doe@company.com"), they are taken as first and last name. Both multiple email addresses within a card and multiple phone numbers within a card are treated as sets, i.e., their order is ignored as well as their types.
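The name completion and the set treatment described above can be sketched roughly as follows. This is only an illustration in Python, not the add-on's own code: the helper names guess_name_parts and email_set are hypothetical, only the simple two-part case from the examples is handled, and the AOL exception for letter case is left out.

```python
def guess_name_parts(display_name="", email=""):
    """Return (first, last) guessed from a display name or an email address.

    Hypothetical helper: it covers only the two-part case from the text.
    """
    # Try the display name first: exactly two whitespace-separated parts.
    parts = display_name.split()
    if len(parts) == 2:
        return parts[0], parts[1]
    # Otherwise try the part of the email address before '@', split on dots.
    local = email.split("@", 1)[0]
    parts = [p for p in local.split(".") if p]
    if len(parts) == 2:
        return parts[0], parts[1]
    return "", ""


def email_set(*addresses):
    """Treat a card's email addresses as an unordered set, ignoring case.

    Simplification: everything is lowercased, including AOL addresses.
    """
    return frozenset(a.strip().lower() for a in addresses if a.strip())
```

With these definitions, guess_name_parts(display_name="John Doe") and guess_name_parts(email="John.Doe@company.com") both yield ("John", "Doe"), and two cards listing the same addresses in a different order or field type produce the same email_set.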

Abstraction of field values

Before card fields are compared their values are abstracted using the following steps.
  1. Pruning, which removes stray contents irrelevant for comparison:
    1. ignore values of certain field types — the set of ignored fields is configurable with the default being UID, UUID, CardUID, groupDavKey, groupDavVersion, groupDavVersionPrev, RecordKey, DbRowID, PhotoType, PhotoName, LowercasePrimaryEmail, LowercaseSecondEmail, unprocessed:rev, unprocessed:x-ablabel,
    2. remove leading/trailing/multiple whitespace and strip non-digit characters from phone numbers,
    3. strip any stray email address duplicates from names, which get inserted by some email clients as default names, and
    4. replace @googlemail.com by @gmail.com in email addresses.
  2. Transformation, which re-arranges information for better comparison:
    1. correct the order of first and last name (for instance, re-order "Doe, John"),
    2. move middle initials such as "M" from last name to first name, and
    3. move last name prefixes such as "von" from first name to last name.
  3. Normalization, which equalizes representation variants:
    1. convert to lowercase (except for name part of AOL email addresses),
    2. convert texts by transcribing umlauts and ligatures, and
    3. if configured, replace in phone numbers the international call prefix (such as '00') by '+' and the national trunk prefix (such as '0') by the default country calling code (such as '+49').
  4. Simplification, which strips less relevant information from texts by removing accents and punctuation.
Corresponding fields in two cards are considered equivalent if their abstracted values are equal.
Parts of names are considered sub-equivalent if their abstracted values are equal or the abstracted value of one of them is a non-empty whole-word substring of the abstracted value of the other.
Note that the value adaptations mentioned above are computed only for the comparison, i.e., they do not change the actual card fields.
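
As a rough illustration of how such an abstraction of text fields could look, here is a minimal Python sketch. It is not taken from the add-on: the function names, the transcription table, and the exact order of steps are assumptions, and the name transformations of step 2 are omitted. It covers whitespace pruning, lowercasing, transcription of umlauts and ligatures, and removal of accents and punctuation, plus the sub-equivalence test for name parts.

```python
import re
import unicodedata

# Assumed transcription table; the table actually used is not given in the text.
TRANSCRIPTIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss", "æ": "ae", "œ": "oe"}


def abstract_text(value):
    """Return the abstracted form of a text field (the input is not modified)."""
    # Pruning: remove leading/trailing whitespace and collapse multiple spaces.
    text = " ".join(value.split())
    # Normalization: lowercase, then transcribe umlauts and ligatures.
    text = unicodedata.normalize("NFC", text).lower()
    for src, dst in TRANSCRIPTIONS.items():
        text = text.replace(src, dst)
    # Simplification: drop accents (combining marks) and punctuation.
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(c for c in decomposed if not unicodedata.combining(c))
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())


def equivalent(a, b):
    """Fields are equivalent if their abstracted values are equal."""
    return abstract_text(a) == abstract_text(b)


def sub_equivalent(a, b):
    """Equal, or one is a non-empty whole-word substring of the other."""
    x, y = abstract_text(a), abstract_text(b)
    if x == y:
        return True

    def whole_word_in(needle, haystack):
        pattern = r"\b" + re.escape(needle) + r"\b"
        return bool(needle) and re.search(pattern, haystack) is not None

    return whole_word_in(x, y) or whole_word_in(y, x)
```

Under these assumptions, "Péte" and "  pete ! " are equivalent, "Müller" and "Mueller" are equivalent, and "Peter" is sub-equivalent to "Hans Peter" but "Pete" is not sub-equivalent to "Peter".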

If automatic removal is chosen, a card is removed only if it matches some other card, has equivalent or less information than that card, and is preferred for deletion; for details see below.
When a pair of matching cards is presented for manual inspection, the card flagged by default with red color for removal is the one preferred for deletion.

Equivalence of information

A card is considered to have equivalent or less information than another card if for each field:

For the above field-wise comparison, the email addresses of a card are treated as a set, the phone numbers of a card are also treated as a set, and the set of names of mailing lists a card belongs to is treated as an additional field.
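
The following Python sketch shows one possible reading of such a field-wise comparison. The per-field rules used here are assumptions rather than the add-on's definition: a text field counts as equivalent or less if it is empty, holds its default value, or equals the other card's value, and a set-valued field (emails, phones, mailing lists) counts as equivalent or less if it is a subset of the other card's set. The function name and the dictionary layout of a card are hypothetical, and values are assumed to be abstracted already.

```python
def has_equal_or_less_info(card, other, ignored=frozenset(), defaults=None):
    """Return True if `card` has equivalent or less information than `other`.

    Assumed rules (not taken from the add-on):
      - set-valued fields are compared with the subset relation,
      - a text field is fine if it is empty, equals its default, or equals
        the corresponding value of the other card.
    """
    defaults = defaults or {}
    for field in card.keys() | other.keys():
        if field in ignored:
            continue  # e.g. UUID, PopularityIndex, LastModifiedDate
        mine = card.get(field, "")
        theirs = other.get(field, "")
        if isinstance(mine, (set, frozenset)) or isinstance(theirs, (set, frozenset)):
            if not set(mine or ()) <= set(theirs or ()):
                return False
        elif mine not in ("", defaults.get(field, "")) and mine != theirs:
            return False
    return True
```

For instance, a card whose only email address also appears in the other card, and whose PreferMailFormat is the default 'unknown', passes this test against a card with two addresses and PreferMailFormat 'HTML', but not the other way round.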

Of two matching cards one is preferred for deletion such that

Here is an example. The card on the right will be preferred for deletion because it contains less information.

Field              Left card                       Right card                  Comparison note
NickName:          "Péte"                          "  pete ! "                 accent, punctuation, letter case, and whitespace ignored
FirstName:         "Peter"                         "Peter Y van"               name prefix "van" moved to last name
LastName:          "Y van Müller"                  "Mueller"                   middle initial "Y" moved to first name, umlauts transcribed
DisplayName:       "Hans Peter van Müller"         "van Müller, Peter"         first name moved to the front, name is substring
PreferDisplayName: 'yes'                           'yes'                       same value
AimScreenName:     ""                              ""                          same AIM name
PreferMailFormat:  'HTML'                          'unknown'                   default ('unknown') considered less information
PrimaryEmail:      "Peter.vanMueller@company.com"  "P.van.Mueller@gmx.de"      emails treated as sets, letter case ignored
SecondaryEmail:    "p.van.mueller@gmx.de"          ""                          emails treated as sets, letter case ignored
WorkPhone:         "089/1234-5678"                 "+49 89 12345678"           trunk prefix and international call (IDD) prefix normalized and non-digits ignored
PopularityIndex:   5                               3                           field ignored for information comparison
LastModifiedDate:  2018-02-25 07:51:28             2018-02-25 08:30:37         field ignored for information comparison
UUID:              ""                              "903a61be-64d5-4844-802a"   field ignored
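
The WorkPhone comparison in this example can be reproduced with a small Python sketch of the phone number abstraction. It is an illustration under assumptions, not the add-on's code: the function name and parameter names are hypothetical, the defaults ('00', '0', '+49') are the examples mentioned in the abstraction steps, and a leading '+' is assumed to survive the stripping of non-digit characters.

```python
import re


def abstract_phone(number, country_code="+49", idd_prefix="00", trunk_prefix="0"):
    """Return a normalized phone number for comparison purposes only."""
    stripped = number.strip()
    # Pruning: strip all non-digit characters.
    digits = re.sub(r"\D", "", stripped)
    # Assumption: a leading '+' marks a number that is already international.
    if stripped.startswith("+"):
        return "+" + digits
    # Normalization: international call prefix (e.g. '00') becomes '+'.
    if digits.startswith(idd_prefix):
        return "+" + digits[len(idd_prefix):]
    # Normalization: national trunk prefix (e.g. '0') becomes the country code.
    if digits.startswith(trunk_prefix):
        return country_code + digits[len(trunk_prefix):]
    return digits


def phone_set(*numbers):
    """Treat the phone numbers of a card as an unordered set of abstracted values."""
    return frozenset(abstract_phone(n) for n in numbers if n.strip())
```

With the assumed defaults, "089/1234-5678", "+49 89 12345678", and "0049 89 12345678" all abstract to "+498912345678", so the two WorkPhone values above compare as equal.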

Configuration variables

The options/configuration/preferences used by this Thunderbird extension are saved in configuration keys starting with extensions.DuplicateContactsManager. — for instance, the list of ignored fields is stored in the variable ignoreFields.

Update of 2017-02-27, introducing version 1.0

This is a major update, which I call Version 1.0, of the Duplicate Contacts Manager.

Work on this extension apparently stopped at the end of 2012. Meanwhile, mixed user feedback has piled up on the official Thunderbird add-on feedback page.

Recently I faced a major challenge: my address book with some 1,200 entries got inflated by a buggy CardDAV online sync tool to more than 17,000 cards. The new copies contained new types of automatically generated identification meta fields. When I tried to clean up the mess automatically using Duplicate Contacts Manager, this did not work because it considered the copies different due to the new identifiers. So I added to the extension a configurable list of field types ignored during comparison. Doing so, I started fixing several issues and adding further features:

Part of the original post of 2012-04-07, introducing version 0.9:

The previously available Version 0.8.2 was a good starting point, but since I urgently needed a more sophisticated tool, I started improving it myself. My changes have been motivated and validated using my personal address book with some 1,000 fairly diligently hand-maintained entries and using the automatically generated collected address book with some 2,500 entries including many duplicates and weird variants of names etc.
The change log is:

Questions and comments are welcome.

URL: http://ddvo.net/DuplicateContactsManager/index.html Last modified: Sun Jan 6 20:52:41 CET 2019