Nominex - How it Works (4)

Creating Scores

Generating Weighted Pairs

Next, the program computes weighted pairs. This is the most time-consuming part of the process since it potentially involves comparing each spelling with every other spelling. So for the 415,000 record dataset for the 1881 census this would potentially involve (415,000 x 415,000) / 2 comparisons, i.e. c.85 billion. In practice one can reduce this number hugely by not comparing surnames that are clearly very different such as SMITH and JONES. But every surname with a given initial letter is compared with every other surname with that letter, and often with other sets of names too. Thus, all 'G' surnames are compared with each other, but also with all the 'J' names (e.g. to generate a score for GERRARD vs. JERRARD). Similarly, all F- names are compared with all PH- names, R- with WR-, etc. And all initial-vowel names (A,E,I,O,U) are compared with each other and also with H- and Y- names.

The metrics computed by the system are as follows:

Metric		Description
1. Punctuation Issues:
1a.	Pre-process penalty	A small penalty imposed if a punctuation change was necessary.
2. Phonetic Comparison:
2a.	IPA Comparison	The most important of the comparison algorithms. For each pair of surnames (head word & candidate word) the program compares their respective IPA versions and creates a 'difference score'. This uses a letter-matrix created from the two spellings, identifying a "minimum cost route" through the matrix. Each cell of the comparison matrix is populated from a pre-existing ‘penalty cost table’, i.e. a table of similarities between every possible pair of phonemes. e.g. ‘d’ compared to ‘t’ would be a lower cost than ‘d’ compared to ‘m’. The raw version of the minimum cost score is then modified in various ways: Cluster comparisons can reduce the raw score (e.g. by comparing 'ns' & 'nds' in Simmons vs. Simmonds, which reduces the effect of an inserted 'd' - i.e. 'elision'). There are around 300 of these cluster comparisons, any one of which may affect a given pairwise comparison. ‘Tapering’ is applied, whereby lesser weight is given to the end of each word - on the grounds that suffixes in longer surnames tend to get corrupted. The score is then ‘normalized’, to take account of differing word length.
2b.	Initial Letter Weight	Allows additional weight to be given to the initial letter, much as Soundex does.
2c.	First vowel weight	Additional weighting on the first syllable on the grounds that this is where the stress lies in most surnames.
2d.	Syllable count difference	The metric is the difference between the syllable count of the head word and the current word.
3. Orthographic Comparison [spelling]:
3a.	Edit Difference	The second most important comparison algorithm. Calculated with the classic Levenshtein edit-difference algorithm, i.e. the number of insert / delete / substitute steps required to turn one string into the other.

Other comparison metrics were tried, such as N-gram, Guth etc. But these contributed little to the success of the overall algorithm so were eventually dropped. Old 'name bucket' approaches such as Soundex and Metaphone were not used at all.

Combining Scores

Finally, a combined score is generated from the above comparison metrics. Various different methods and weightings were tried. An overall phonetic score is computed by adding Nos. 2a, 2b, 2c, 2d above. This overall Phonetic Score is then compared to the Edit Distance Score (3a above), and the final score is the minimum of these two (suitable weighting being applied first).

The final step is to turn this combined difference score into a percentage, where 100% is an exact match. A cut-off is generally applied, e.g. discarding values lower than (say) 80%.