Nominex - Creating Derived Forms

Creating Derived Forms

First, a number of derived versions are produced for each surname spelling. Together with the columns already present in the original database, this gives the following list of original and derived forms for each spelling:

Database Column		Notes
1. Existing Columns:
1a.	Raw Spelling	Exactly as originally recorded in the database.
1b.	'Standardized' version	Won’t be available for most datasets, but does exist for the working datasets used in this project (NBI and 1881 census). The standard forms were assigned previously by manual inspection for each of the original projects. They are not used at all in the algorithmic generation of the ranked matches, but have provided a useful benchmark against which the program’s performance can be measured.
1c.	No. of Occurrences	The frequency of this spelling in the dataset. May or may not be immediately available in the dataset, but can usually be calculated.
2. Derived Forms:
2a.	Revised spelling	Derived from the cleaned spelling. Incorporates minor punctuation changes such as removal of extraneous spaces and quote-marks, reduction of any 3-character repeats to 2 characters, and expansion of ST to SAINT. Spaces, dashes and apostrophes are allowed, as in GRACEY-JONES, GRACEY JONES, O'BRIEN. Other anomalies may be corrected, such as reducing a double space to a single space character. For double-barrelled names three versions are generated: one for each component, and a composite version with the space (or dash) character removed. This is because some names that have been recorded as two words may originally have been a single surname, e.g. ‘Green-Field’ as a version of ‘Greenfield’.
2b.	Phonetic (IPA) version	Derived from the Revised spelling and recorded in the working database using SAMPA, a machine-readable (ASCII) rendering of the International Phonetic Alphabet (IPA). The current version of the system can create two different phonetic versions where necessary, for those cases where alternative pronunciations of a surname are either known or suspected.
2c.	Syllable count	For each surname the syllable count is estimated by counting the number of vowel sounds in its IPA version, allowing values >1.0 for long vowels and diphthongs. Syllable counts for most surnames range from 1.0 up to about 3.0.
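
The Revised spelling rules in row 2a can be illustrated with a short sketch. This is not the Nominex code itself: the function names, the exact regular expressions, the choice to strip only double-quote characters, the decision to expand ST only as a separate leading word, and the restriction to exactly two components are all assumptions made for illustration.

```python
import re


def revise_spelling(cleaned: str) -> str:
    """Hypothetical sketch of the 'Revised spelling' step (row 2a)."""
    s = cleaned.upper()
    # Assumption: only double-quote characters are stripped, because
    # apostrophes are allowed (O'BRIEN).
    s = s.replace('"', "")
    # Collapse runs of whitespace (e.g. a double space) to a single space.
    s = re.sub(r"\s+", " ", s).strip()
    # Reduce any letter repeated three or more times to two.
    s = re.sub(r"([A-Z])\1{2,}", r"\1\1", s)
    # Assumption: ST is expanded only when it is a separate leading word,
    # optionally followed by a full stop (ST JOHN, ST.CLAIR), not in STONE.
    s = re.sub(r"^ST(?:\.\s*|\s+)(?=[A-Z])", "SAINT ", s)
    return s


def double_barrelled_versions(revised: str) -> list[str]:
    """Return each component plus the composite for a two-part name."""
    parts = [p for p in re.split(r"[ -]+", revised) if p]
    if len(parts) != 2:
        # Single names (and, by assumption, longer compounds) are left as-is.
        return [revised]
    return parts + ["".join(parts)]
```

For example, `double_barrelled_versions(revise_spelling("Green-Field"))` yields `GREEN`, `FIELD` and `GREENFIELD`. Note that this sketch would also split a name such as SAINT JOHN into components; whether the real system treats such prefixes specially is not stated here.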

The Derived Forms in the above table are generated from the Cleaned Spelling column by a batch process. This can take a while, e.g. perhaps half an hour or so for a dataset of c. 400,000 spellings.
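
As an illustration of one value such a batch process computes, the syllable-count estimate (row 2c) might be sketched as below. The vowel inventories are the standard SAMPA symbols for English, but the greedy two-character matching and the weight of 1.5 for long vowels and diphthongs are assumptions: the text says only that such sounds may count for more than 1.0. The function name and the sample transcriptions are likewise illustrative, not taken from the working database.

```python
# SAMPA vowel symbols for English (RP).
SHORT_VOWELS = {"I", "e", "{", "Q", "V", "U", "@"}
LONG_VOWELS = {"i:", "A:", "O:", "u:", "3:"}
DIPHTHONGS = {"eI", "aI", "OI", "@U", "aU", "I@", "e@", "U@"}


def estimate_syllables(sampa: str, long_weight: float = 1.5) -> float:
    """Estimate a syllable count from a SAMPA transcription.

    long_weight is a hypothetical weight for long vowels and diphthongs.
    """
    total = 0.0
    i = 0
    while i < len(sampa):
        pair = sampa[i:i + 2]
        if pair in LONG_VOWELS or pair in DIPHTHONGS:
            # Greedy assumption: two-character vowel symbols win over
            # reading the same characters as separate sounds.
            total += long_weight
            i += 2
        elif sampa[i] in SHORT_VOWELS:
            total += 1.0
            i += 1
        else:
            # Consonants and other marks do not contribute.
            i += 1
    return total
```

With these assumed weights, an illustrative transcription such as `smIT` scores 1.0, `dZ@Unz` scores 1.5, and `gri:nfi:ld` scores 3.0, which is in line with the stated range of roughly 1.0 to 3.0.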
