Chapter 6 | RNI Elasticsearch Plugin

6. Gil and Yoshi's Test Chapter

The plugin includes UTF-8 text files in the bt_root/rlpnc/data/rnm/ref/override subdirectory that designate name elements to strip during indexing and queries, sample full-name pairs with match scores, and token pairs to receive enhanced scores during queries. This directory also contains sample files for performing these operations on designated entity types.

You can modify these files and add additional files in the same subdirectory to extend coverage to additional supported languages. You can also create files that only apply to a specified entity type, such as PERSON.

RNI Tuning Properties. RNI includes a number of properties that you can use to tune its matching algorithm. If you are interested in exploring this topic, please contact [email protected].

5.1 Stop Patterns and Stopword Prefixes

Stop patterns and stopword prefixes strip matching name elements during indexing and queries. Stripping prefixes (string literals) is faster than applying stop patterns (regular expressions), so rely on stopword prefixes for the efficient removal of prefixes, such as titles, that you do not want to include in name matching.

For each name, RNI first performs character-level normalization: punctuation is stripped, with the exception of periods, commas, and hyphens; whitespace is reduced to single spaces; and characters are lowercased. RNI then cycles through the stop patterns followed by the stopword prefixes. After each cycle, it drops the patterns and prefixes that stripped nothing, and it repeats the cycle until the list of stop patterns and stopword prefixes is empty.
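The normalization and stripping cycle described above can be sketched as follows. This is an illustrative approximation, not RNI's implementation: it uses Python's re module in place of java.util.regex.Pattern, handles ASCII punctuation only, and the function names (normalize, strip_stops) are hypothetical. Collapsing whitespace again after each strip is also an assumption.

```python
import re
import string

# Punctuation to strip: everything except periods, commas, and hyphens.
# Assumption: ASCII punctuation only; RNI's actual character set is not documented here.
_STRIP_TABLE = str.maketrans(
    "", "", "".join(c for c in string.punctuation if c not in ".,-")
)


def _tidy(text):
    """Reduce runs of whitespace to single spaces and trim the ends."""
    return " ".join(text.split())


def normalize(name):
    """Character-level normalization: strip punctuation, collapse spaces, lowercase."""
    return _tidy(name.translate(_STRIP_TABLE)).lower()


def strip_stops(name, stop_patterns, stop_prefixes):
    """Cycle through patterns, then prefixes, until no rule strips anything."""
    name = normalize(name)
    patterns = [re.compile(p) for p in stop_patterns]
    prefixes = list(stop_prefixes)
    while patterns or prefixes:
        active_patterns = []
        for pattern in patterns:
            candidate = pattern.sub("", name)
            if candidate != name:  # this pattern stripped something; keep it
                name = _tidy(candidate)
                active_patterns.append(pattern)
        active_prefixes = []
        for prefix in prefixes:
            if prefix and name.startswith(prefix):  # literal prefix match
                name = _tidy(name[len(prefix):])
                active_prefixes.append(prefix)
        # Rules that stripped nothing in this cycle are dropped.
        patterns, prefixes = active_patterns, active_prefixes
    return name


print(strip_stops("Mayor  Dr. John Smith", [r"^mayor\s"], ["dr. "]))
```

Each rule that survives a cycle has shortened the name, so the loop always terminates.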

Stop Pattern. A stop pattern is a regular expression that excludes matching name elements during indexing and queries. You can use any regular expression supported by the Java 1.7 java.util.regex.Pattern.

Stop patterns for a given language are specified in a UTF-8 file with the ISO 639 three-letter language code in the filename:

stopregexes_LANG[_TYPE].txt ²

where LANG is the three-letter ISO 639-3 language code. Each row in the file, with the exception of rows that begin with #, ³ is a regular expression. Leading and trailing whitespace is removed from each row, so use \s at the beginning or end of a pattern where you need to match whitespace.

Elements in the names to be processed that match any of these regular expressions are removed. Longer stop patterns are applied before shorter stop patterns, so the presence of a shorter stop pattern does not prevent the stripping of a longer pattern that includes the shorter pattern. For example, the brigadier[- ]general stop pattern is applied where applicable when general is also a stop pattern.
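The file format and the longest-first rule can be illustrated with a short sketch. It makes several assumptions that the text above does not confirm: "longer" is measured by the length of the pattern source, a # anywhere on a line starts a comment (no escaping), and Python's re stands in for java.util.regex.Pattern. The names parse_rules and apply_stop_patterns are hypothetical.

```python
import re

SAMPLE = """\
# military titles
general
brigadier[- ]general   # longer pattern that contains the shorter one
"""


def parse_rules(text):
    """Return the non-comment entries of a stop-pattern file, trimmed."""
    rules = []
    for line in text.splitlines():
        # Assumption: '#' always begins a comment, whether at the start
        # of a row or after an entry on the same row.
        entry = line.split("#", 1)[0].strip()
        if entry:
            rules.append(entry)
    return rules


def apply_stop_patterns(name, rules):
    """Apply longer patterns before shorter ones, then tidy whitespace."""
    for rule in sorted(rules, key=len, reverse=True):
        name = re.sub(rule, "", name)
    return " ".join(name.split())


rules = parse_rules(SAMPLE)
print(apply_stop_patterns("brigadier-general john smith", rules))
```

If general were applied first, the name would be left with the fragment brigadier-, which is the situation the longest-first ordering avoids.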

The plugin includes files with stop patterns for names in English (generic and ORGANIZATION) and Spanish (generic). These files are in bt_root/rlpnc/data/rnm/ref/override. The generic (non-entity-specific) English file is stopregexes_eng.txt. In that file, for example, the entry

^mayor\s

indicates that mayor is removed when it (1) appears at the beginning of a lowercased name and (2) is followed by whitespace.

You can modify the contents of this file. To add stop patterns for a different language, create an additional UTF-8 file in the same subdirectory with the three-letter language code in the filename.

For example, stopregexes_por.txt would include regular expressions for Portuguese names; stopregexes_eng_PERSON.txt would include regular expressions to remove elements from PERSON names in English text.
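To make the naming convention concrete, the following sketch splits a stop-file name into its parts. The helper parse_stop_filename is hypothetical, and the assumption that entity types consist only of uppercase letters is based solely on the PERSON and ORGANIZATION examples in this chapter.

```python
import re

# kind _ LANG [_TYPE] .txt
# Assumption: LANG is three lowercase letters; TYPE is uppercase letters only.
_STOP_FILE = re.compile(r"(stopregexes|stopprefixes)_([a-z]{3})(?:_([A-Z]+))?\.txt")


def parse_stop_filename(filename):
    """Return (kind, lang, entity_type) or None if the name does not fit."""
    match = _STOP_FILE.fullmatch(filename)
    if match is None:
        return None
    kind, lang, entity_type = match.groups()
    return kind, lang, entity_type  # entity_type is None for generic files


print(parse_stop_filename("stopregexes_por.txt"))
print(parse_stop_filename("stopregexes_eng_PERSON.txt"))
```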

Use of complex patterns may increase processing time. When possible, use stopword prefixes.

Stopword Prefixes. A stopword prefix is a string literal that strips the matching prefix from name elements during indexing and queries.

Stopword prefixes for a given language are specified in a UTF-8 file with the ISO 639 three-letter language code in the filename:

stopprefixes_LANG[_TYPE].txt ²

where LANG is a three-letter language code. Each row in the file, with the exception of rows that begin with #, ³ is a string literal.

Prefixes in the names to be processed that match any of these string literals are removed.

Like stop patterns, longer stopword prefixes take precedence over shorter prefixes that the longer stopword contains. For example, the lieutenant colonel stopword prefix is applied where applicable when colonel is also a stopword prefix.
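The precedence of longer prefixes can be sketched as below. This is a minimal illustration under stated assumptions: the name has already been normalized (lowercased, single spaces), only one prefix is removed per call, and the function name strip_prefix is hypothetical. Whether RNI requires a word boundary after the prefix is not stated above, so the sketch does a plain literal comparison.

```python
def strip_prefix(name, prefixes):
    """Remove the longest matching literal prefix from a normalized name."""
    for prefix in sorted(prefixes, key=len, reverse=True):
        if prefix and name.startswith(prefix):
            # Drop the prefix and any space that separated it from the name.
            return name[len(prefix):].lstrip()
    return name


prefixes = ["colonel", "lieutenant colonel"]
print(strip_prefix("lieutenant colonel jane doe", prefixes))
print(strip_prefix("colonel jane doe", prefixes))
```

Because the comparison is a plain startswith check rather than a regular-expression match, this step is cheap, which is consistent with the advice to prefer stopword prefixes over stop patterns where possible.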

The plugin includes files with generic stopword prefixes for names in English and Spanish. These files are in bt_root/rlpnc/data/rnm/ref/override: stopprefixes_eng.txt and stopprefixes_spa.txt. You can modify the contents of these files. To add stopword prefixes for another language, create a UTF-8 file in the same directory with the three-letter language code in the filename. For example, stopprefixes_fra.txt would include stopword prefixes for use with French names.

Overriding Name Pair Matches

You can create UTF-8 text files that specify the scores to be assigned for specified full-name pairs. The filename uses ISO 639 three-letter language codes to designate the language of each full name in each of the full-name pairs:

fullnames_LANG1_LANG2[_TYPE].txt ²

where LANG1 is the three-letter language code for the first name and LANG2 is the three-letter language code for the second name. Each row in the file, with the exception of rows that begin with #, is a tab-delimited full-name pair and score:

query_name Tab index_name Tab score

The score must be between 0 and 1.0, where 0 indicates no match and 1.0 indicates a perfect match. ⁴

The plugin includes a sample file with sample entries commented out: bt_root/rlpnc/data/rnm/ref/override/fullnames_eng_eng.txt. Any non-commented-out entries in this file assign scores to English queries applied to English names in rni_name fields. For example,

John Doe Joe Bloggs 1.0

indicates that the query name John Doe matches the index name Joe Bloggs (both used in different regions to indicate 'person unknown') with a score of 1.0.

These match patterns are commutative. The previous entry also specifies a match score of 1.0 if the query name is Joe Bloggs and the index includes a document with an rni_name field containing John Doe.
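The row format and the commutative lookup can be sketched as follows. The function names load_overrides and lookup are hypothetical, and the sketch assumes names are compared as exact strings; RNI presumably applies its own normalization before comparing, which is not modeled here.

```python
def load_overrides(text):
    """Parse rows of 'query_name<TAB>index_name<TAB>score' into a lookup table."""
    overrides = {}
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # blank row or comment row
        parts = line.split("\t")
        if len(parts) != 3:
            raise ValueError("expected 3 tab-delimited fields: %r" % line)
        query_name, index_name, score = (p.strip() for p in parts)
        value = float(score)
        if not 0.0 <= value <= 1.0:
            raise ValueError("score out of range: %r" % line)
        # An unordered key makes the pair match in either direction.
        overrides[frozenset((query_name, index_name))] = value
    return overrides


def lookup(overrides, query_name, index_name):
    """Return the override score for the pair, or None if there is none."""
    return overrides.get(frozenset((query_name, index_name)))


table = load_overrides("# sample entry\nJohn Doe\tJoe Bloggs\t1.0\n")
print(lookup(table, "John Doe", "Joe Bloggs"))
print(lookup(table, "Joe Bloggs", "John Doe"))
```

Keying the table on an unordered pair is one simple way to get the same score regardless of which name is the query and which is in the index.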

You can add entries for English to English name matches to fullnames_eng_eng.txt, and create additional override files, using the filename to specify the languages.

² Include _TYPE, where TYPE designates an entity type such as PERSON, if you want the override to apply only when the name (for stop patterns), matching names, or matching tokens have been assigned this entity type. If the filename does not include _TYPE, the override is applied to all names, regardless of entity type.

³ # may also be used after an entry on the same line to begin a comment.

⁴ Since the minimum score for names returned by RNI rescoring queries must be greater than 0, an RNI rescoring query will not return the name if the override score is 0.