Creating Scores
Generating Weighted Pairs
Next, the program computes weighted pairs. This is the most time-consuming part of the process since it potentially involves comparing each spelling with every other spelling. So for the 415,000 record dataset for the 1881 census this would potentially involve (415,000 x 415,000) / 2 comparisons, i.e. c.85 billion. In practice one can reduce this number hugely by not comparing surnames that are clearly very different such as SMITH and JONES. But every surname with a given initial letter is compared with every other surname with that letter, and often with other sets of names too. Thus, all 'G' surnames are compared with each other, but also with all the 'J' names (e.g. to generate a score for GERRARD vs. JERRARD). Similarly, all F- names are compared with all PH- names, R- with WR-, etc. And all initial-vowel names (A,E,I,O,U) are compared with each other and also with H- and Y- names.
The metrics computed by the system are as follows:
Metric | Description | |
---|---|---|
1. Punctuation Issues: | ||
1a. | Pre-process penalty | A small penalty imposed if a punctuation change was necessary. |
2. Phonetic Comparison: | ||
2a. | IPA Comparison | The most important of the comparison algorithms. For each pair of surnames (head word & candidate word) the program compares their respective IPA versions and creates a 'difference score'. This uses a letter-matrix created from the two spellings, identifying a "minimum cost route" through the matrix. Each cell of the comparison matrix is populated from a pre-existing ‘penalty cost table’, i.e. a table of similarities between every possible pair of phonemes. e.g. ‘d’ compared to ‘t’ would be a lower cost than ‘d’ compared to ‘m’. The raw version of the minimum cost score is then modified in various ways:
|
2b. | Initial Letter Weight | Allows additional weight to be given to the initial letter, much as Soundex does. |
2c. | First vowel weight | Additional weighting on the first syllable on the grounds that this is where the stress lies in most surnames. |
2d. | Syllable count difference | The metric is the difference between the syllable count of the head word and the current word. |
3. Orthographic Comparison [spelling]: | ||
3a. | Edit Difference | The second most important comparison algorithm. Calculated with the classic Levenshtein edit-difference algorithm, i.e. the number of insert / delete / substitute steps required to turn one string into the other. |
Other comparison metrics were tried, such as N-gram, Guth etc. But these contributed little to the success of the overall algorithm so were eventually dropped. Old 'name bucket' approaches such as Soundex and Metaphone were not used at all.
Combining Scores
Finally, a combined score is generated from the above comparison metrics. Various different methods and weightings were tried. An overall phonetic score is computed by adding Nos. 2a, 2b, 2c, 2d above. This overall Phonetic Score is then compared to the Edit Distance Score (3a above), and the final score is the minimum of these two (suitable weighting being applied first).
The final step is to turn this combined difference score into a percentage, where 100% is an exact match. A cut-off is generally applied, e.g. discarding values lower than (say) 80%.