simmetrics

Approximate string matching and comparison

In addition to the standard string operations, VadalogEngine also supports advanced operations for approximate string matching.

These functions are provided in the simmetrics library, which can be made available using the following annotation:

@library("sim:", "simmetrics").

The first type of operator serves an indexing role: these operators map similar strings to the same key. The supported indexing operators are:

caverphone1

computes the Caverphone phonetic algorithm (version 1)

caverphone2

computes the Caverphone phonetic algorithm (version 2)

colognePhonetic

computes the Cologne phonetics algorithm

daitchMokotoffSoundex

computes the Daitch-Mokotoff Soundex phonetic algorithm

doubleMetaphone

computes the Double Metaphone phonetic algorithm

matchRatingApproach

computes the Match Rating Approach phonetic algorithm

metaphone

computes the Metaphone phonetic algorithm

nysiis

computes the New York State Identification and Intelligence System phonetic algorithm

removeDiacritics

removes diacritics from a string

removeNonWord

removes non-word characters from a string

soundex

computes the Soundex phonetic algorithm

toLowerCase

transforms a string into lower case

toUpperCase

transforms a string into upper case
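
To illustrate the indexing role outside of Vadalog, the following Python sketch groups strings that map to the same key. The `group_by_key` helper is hypothetical, and `str.lower` merely stands in for an indexing operator such as toLowerCase; it is not how the engine implements these operators.

```python
from collections import defaultdict


def group_by_key(values, key):
    """Group the original strings by the key each one maps to."""
    groups = defaultdict(list)
    for value in values:
        groups[key(value)].append(value)
    return dict(groups)


# str.lower stands in for an indexing operator such as toLowerCase.
groups = group_by_key(["Smith", "SMITH", "smith", "Jones"], str.lower)
print(groups)  # {'smith': ['Smith', 'SMITH', 'smith'], 'jones': ['Jones']}
```

Any of the phonetic encoders could be plugged in as the key function in the same way, so that differently spelled but similar-sounding names end up in one group.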

The second type of operator measures the similarity between two strings. These operators take two strings and return a similarity value in the interval [0,1]. Some of them compute an [edit distance](https://en.wikipedia.org/wiki/Edit_distance) between the two strings; others first tokenize the two strings into words or break them into [q-grams](https://en.wikipedia.org/wiki/N-gram), and then compare the resulting token (multi-)sets. The supported similarity operators are:

blockDistance

computes the similarity based on the L1-distance between the token sets of the input strings

cosineSimilarity

computes the Cosine Similarity between the token sets of the input strings

damerauLevenshtein

computes the similarity based on the Damerau–Levenshtein Edit Distance

dice

computes the Dice Coefficient between the token sets of the input strings

euclideanDistance

computes the similarity based on the L2-distance between the token sets of the input strings

generalizedJaccard

computes the Generalised Jaccard Similarity between the token sets of the input strings

identity

returns 1 if the two strings are the same and 0 otherwise

jaccard

computes the Jaccard Similarity between the token sets of the input strings

jaro

computes the Jaro Similarity between the input strings

jaroWinkler

computes the Jaro-Winkler Similarity between the input strings

jaroWinklerSoundex

computes the Jaro-Winkler Similarity between the Soundex encodings of the input strings

leadingSubstringSimilarity

computes the common prefix similarity on the list of tokens of the two strings

levenshtein

computes the similarity based on the Levenshtein Edit Distance between the input strings

longestCommonSubsequence

computes the similarity based on the length of the Longest Common Subsequence of the input strings

longestCommonSubstring

computes the similarity based on the length of the Longest Common Substring of the input strings

mongeElkan

computes the Monge-Elkan similarity between the token sets of the two strings by lifting the Smith-Waterman-Gotoh similarity to sets

mongeElkanMax

computes the Monge-Elkan similarity between the token sets of the two strings by lifting the substring similarity to sets

needlemanWunch

computes the Needleman–Wunsch similarity between the input strings

overlapCoefficient

computes the Overlap Coefficient between the token sets of the input strings

qGramsDistance

computes the similarity based on the L1-distance between the sets of tri-grams in the input strings

simonWhite

computes the Simon-White coefficient (the multi-set version of the Dice Coefficient) between the multisets of bi-grams of the input strings

smithWaterman

computes the Smith-Waterman Similarity between the input strings

smithWatermanGotoh

computes the Gotoh version of the Smith-Waterman Similarity between the input strings

substring

returns 1 if one of the strings is a substring of the other, and 0 otherwise
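
The difference between word tokens and q-grams can be sketched in Python as follows. Both helpers are hypothetical: whitespace splitting and unpadded q-grams are assumptions made for illustration, and the tokenizers used by the engine may differ (for example, by padding the string before extracting q-grams).

```python
def word_tokens(text: str) -> list:
    """Split on whitespace (an assumed tokenizer)."""
    return text.split()


def q_grams(text: str, q: int = 3) -> list:
    """Return all contiguous substrings of length q, without padding."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]


print(word_tokens("hello world"))  # ['hello', 'world']
print(q_grams("hello", 2))         # ['he', 'el', 'll', 'lo']
print(q_grams("hello", 3))         # ['hel', 'ell', 'llo']
```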

caverphone1

Computes the Caverphone phonetic algorithm (version 1).

caverphone1(Text)

Where:

  • Text is the string to be encoded.

Example
@library("sim:", "simmetrics").
input("Marcus").
result(X) :- input(Y), X = sim:caverphone1(Y).
@output("result").
Expected results
result("MKS111")

caverphone2

Computes the Caverphone phonetic algorithm (version 2).

caverphone2(Text)

Where:

  • Text is the string to be encoded.

Example
@library("sim:", "simmetrics").
input("Markus").
result(X) :- input(Y), X = sim:caverphone2(Y).
@output("result").
Expected results
result("MKS111")

colognePhonetic

Computes the Cologne phonetics algorithm.

colognePhonetic(Text)

Where:

  • Text is the string to be encoded.

Example
@library("sim:", "simmetrics").
input("Mayer").
result(X) :- input(Y), X = sim:colognePhonetic(Y).
@output("result").
Expected results
result("67")

daitchMokotoffSoundex

Computes the Daitch-Mokotoff Soundex phonetic algorithm.

daitchMokotoffSoundex(Text)

Where:

  • Text is the string to be encoded.

Example
@library("sim:", "simmetrics").
input("Iozefovich").
result(X) :- input(Y), X = sim:daitchMokotoffSoundex(Y).
@output("result").
Expected results
result("147740")

doubleMetaphone

Computes the Double Metaphone phonetic algorithm.

doubleMetaphone(Text)

Where:

  • Text is the string to be encoded.

Example
@library("sim:", "simmetrics").
input("architect").
result(X) :- input(Y), X = sim:doubleMetaphone(Y).
@output("result").
Expected results
result("ARKT")

matchRatingApproach

Computes the Match Rating Approach phonetic algorithm.

matchRatingApproach(Text)

Where:

  • Text is the string to be encoded.

Example
@library("sim:", "simmetrics").
input("Smith").
result(X) :- input(Y), X = sim:matchRatingApproach(Y).
@output("result").
Expected results
result("SMT")

metaphone

Computes the Metaphone phonetic algorithm.

metaphone(Text)

Where:

  • Text is the string to be encoded.

Example
@library("sim:", "simmetrics").
input("Melbert").
result(X) :- input(Y), X = sim:metaphone(Y).
@output("result").
Expected results
result("MLBR")

nysiis

Computes the New York State Identification and Intelligence System phonetic algorithm.

nysiis(Text)

Where:

  • Text is the string to be encoded.

Example
@library("sim:", "simmetrics").
input("Webberley").
result(X) :- input(Y), X = sim:nysiis(Y).
@output("result").
Expected results
result("WABARL")

removeDiacritics

Removes diacritics from a string.

removeDiacritics(Text)

Where:

  • Text is the string from which to remove diacritics.

Example
@library("sim:", "simmetrics").
input("Cañon City").
result(X) :- input(Y), X = sim:removeDiacritics(Y).
@output("result").
Expected results
result("Canon City")
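
For intuition, a comparable effect can be obtained in Python with Unicode normalization. This is a minimal sketch under the assumption that "diacritics" means Unicode combining marks; the hypothetical `remove_diacritics` helper is not the engine's implementation.

```python
import unicodedata


def remove_diacritics(text: str) -> str:
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep every other character unchanged.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(remove_diacritics("Cañon City"))  # Canon City
```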

removeNonWord

Removes non-word characters from a string.

removeNonWord(Text)

Where:

  • Text is the string from which to remove non-word characters.

Example
@library("sim:", "simmetrics").
input("hello, world!").
result(X) :- input(Y), X = sim:removeNonWord(Y).
@output("result").
Expected results
result("helloworld")
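
A rough Python equivalent, assuming that "non-word characters" corresponds to the regular-expression class `\W` (anything other than letters, digits, and the underscore), is sketched below. The `remove_non_word` name is hypothetical, and the engine's exact character class may differ.

```python
import re


def remove_non_word(text: str) -> str:
    # \W matches any character that is not a letter, digit, or underscore,
    # so punctuation and whitespace are both removed.
    return re.sub(r"\W+", "", text)


print(remove_non_word("hello, world!"))  # helloworld
```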

soundex

Computes the Soundex phonetic algorithm.

soundex(Text)

Where:

  • Text is the string to be encoded.

Example
@library("sim:", "simmetrics").
input("Perotti").
result(X) :- input(Y), X = sim:soundex(Y).
@output("result").
Expected results
result("P630")
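
To show where a code such as P630 comes from, here is a minimal Python sketch of the classic American Soundex rules. It is an illustration only: the `soundex` function and `CODES` table below are hypothetical names, and the engine's implementation may treat edge cases (such as non-ASCII letters) differently.

```python
# Consonant groups and their Soundex digits; vowels, H, W and Y have no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    encoded = letters[0]                     # the first letter is kept as-is
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                         # H and W do not break a run of equal codes
        digit = CODES.get(ch, "")            # vowels map to "" and reset the run
        if digit and digit != previous:
            encoded += digit
        previous = digit
    return (encoded + "000")[:4]             # pad with zeros, keep four characters


print(soundex("Perotti"))  # P630
print(soundex("Robert"))   # R163
```

The doubled "t" in "Perotti" contributes a single 3, and the code is padded with a trailing zero to reach four characters.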

toLowerCase

Transforms a string into lower case.

toLowerCase(Text)

Where:

  • Text is the string to be transformed.

Example
@library("sim:", "simmetrics").
input("HELLO WORLD").
result(X) :- input(Y), X = sim:toLowerCase(Y).
@output("result").
Expected results
result("hello world")

toUpperCase

Transforms a string into upper case.

toUpperCase(Text)

Where:

  • Text is the string to be transformed.

Example
@library("sim:", "simmetrics").
input("hello world").
result(X) :- input(Y), X = sim:toUpperCase(Y).
@output("result").
Expected results
result("HELLO WORLD")

blockDistance

Computes the similarity based on the L1-distance between the token sets of the input strings.

blockDistance(Text1, Text2)

Where:

  • Text1 is the first string to be compared.

  • Text2 is the second string to be compared.

Example
@library("sim:", "simmetrics").
strings("hello world", "hello").
result(X) :- strings(Y1, Y2), X = sim:blockDistance(Y1, Y2).
@output("result").
Expected results
result(0.5)

cosineSimilarity

Computes the Cosine Similarity between the token sets of the input strings.

cosineSimilarity(Text1, Text2)

Where:

  • Text1 is the first string to be compared.

  • Text2 is the second string to be compared.

Example
@library("sim:", "simmetrics").
strings("hello world", "hello").
result(X) :- strings(Y1, Y2), X = sim:cosineSimilarity(Y1, Y2).
@output("result").
Expected results
result(0.707)
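
The value 0.707 is consistent with the cosine similarity of two token sets, |A ∩ B| / sqrt(|A| · |B|). The Python sketch below assumes whitespace tokenization and set (not multiset) semantics; the `cosine_similarity` helper is hypothetical and not the engine's code.

```python
import math


def cosine_similarity(text1: str, text2: str) -> float:
    a, b = set(text1.split()), set(text2.split())
    if not a or not b:
        return 0.0  # assumption: an empty token set is similar to nothing
    # Shared tokens divided by the geometric mean of the two set sizes.
    return len(a & b) / math.sqrt(len(a) * len(b))


print(round(cosine_similarity("hello world", "hello"), 3))  # 0.707
```

Here the sets are {hello, world} and {hello}, so the result is 1 / sqrt(2 · 1).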

damerauLevenshtein

Computes the similarity based on the Damerau–Levenshtein Edit Distance between the input strings.

damerauLevenshtein(Text1, Text2)

Where:

  • Text1 is the first string to be compared.

  • Text2 is the second string to be compared.

Example
@library("sim:", "simmetrics").
strings("hello", "hallo").
result(X) :- strings(Y1, Y2), X = sim:damerauLevenshtein(Y1, Y2).
@output("result").
Expected results
result(0.8)
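
One common way to turn an edit distance into a similarity is 1 - distance / max(len(Text1), len(Text2)), which yields 0.8 for the pair above (one substitution over five characters). The Python sketch below uses the restricted Damerau–Levenshtein distance (optimal string alignment); both the choice of variant and the normalization are assumptions, and the function names are hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            # Transposition of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - osa_distance(a, b) / longest


print(round(similarity("hello", "hallo"), 3))  # 0.8
print(osa_distance("hello", "hlelo"))          # 1
```

Removing the transposition step gives the plain Levenshtein distance, under which "hello" and "hlelo" would be two edits apart instead of one.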

dice

Computes the Dice Coefficient between the token sets of the input strings.

dice(Text1, Text2)

Where:

  • Text1 is the first string to be compared.

  • Text2 is the second string to be compared.

Example
@library("sim:", "simmetrics").
strings("hello", "hallo").
result(X) :- strings(Y1, Y2), X = sim:dice(Y1, Y2).
@output("result").
Expected results
result(0.667)

euclideanDistance

Computes the similarity based on the L2-distance between the token sets of the input strings.

euclideanDistance(Text1, Text2)

Where:

  • Text1 is the first string to be compared.

  • Text2 is the second string to be compared.

Example
@library("sim:", "simmetrics").
strings("hello world", "hello").
result(X) :- strings(Y1, Y2), X = sim:euclideanDistance(Y1, Y2).
@output("result").
Expected results
result(0.5)

generalizedJaccard

Computes the Generalised Jaccard Similarity between the token sets of the input strings.

generalizedJaccard(Text1, Text2)

Where:

  • Text1 is the first string to be compared.

  • Text2 is the second string to be compared.

Example
@library("sim:", "simmetrics").
strings("hello", "hallo").
result(X) :- strings(Y1, Y2), X = sim:generalizedJaccard(Y1, Y2).
@output("result").
Expected results
result(0.75)

identity

Returns 1 if the two strings are the same and 0 otherwise.

identity(Text1, Text2)

Where:

  • Text1 is the first string to be compared.

  • Text2 is the second string to be compared.

Example
@library("sim:", "simmetrics").
strings("hello", "hello").
result(X) :- strings(Y1, Y2), X = sim:identity(Y1, Y2).
@output("result").
Expected results
result(1)

jaccard

Computes the Jaccard Similarity between the token sets of the input strings.

jaccard(Text1, Text2)

Where:

  • Text1 is the first string to be compared.

  • Text2 is the second string to be compared.

Example
@library("sim:", "simmetrics").
strings("hello", "hallo").
result(X) :- strings(Y1, Y2), X = sim:jaccard(Y1, Y2).
@output("result").
Expected results
result(0.6)

jaro

Computes the Jaro Similarity between the input strings.

jaro(Text1, Text2)

Where:

  • Text1 is the first string to be compared.

  • Text2 is the second string to be compared.

Example
@library("sim:", "simmetrics").
strings("hello", "hallo").
result(X) :- strings(Y1, Y2), X = sim:jaro(Y1, Y2).
@output("result").
Expected results
result(0.84)

jaroWinkler

Computes the Jaro-Winkler Similarity between the input strings.

jaroWinkler(Text1, Text2)

Where:

  • Text1 is the first string to be compared.

  • Text2 is the second string to be compared.

Example
@library("sim:", "simmetrics").
strings("hello", "hallo").
result(X) :- strings(Y1, Y2), X = sim:jaroWinkler(Y1, Y2).
@output("result").
Expected results
result(0.87)

jaroWinklerSoundex

Computes the Jaro-Winkler Similarity between the Soundex encodings of the input strings.

jaroWinklerSoundex(Text1, Text2)

Where:

  • Text1 is the first string to be compared.

  • Text2 is the second string to be compared.

Example
@library("sim:", "simmetrics").
strings("Perotti", "Pirot").
result(X) :- strings(Y1, Y2), X = sim:jaroWinklerSoundex(Y1, Y2).
@output("result").
Expected results
result(0.89)

leadingSubstringSimilarity

Computes the common prefix similarity on the lists of tokens of the two strings. Returns |C| / max{|L1|, |L2|}, where Li is the list of tokens in the i-th input string (i = 1, 2) and C is the longest common prefix of the lists L1 and L2.

leadingSubstringSimilarity(Text1, Text2)

Where:

  • Text1 is the first string to be compared.

  • Text2 is the second string to be compared.

Example
@library("sim:", "simmetrics").
strings("hello world", "hello there").
result(X) :- strings(Y1, Y2), X = sim:leadingSubstringSimilarity(Y1, Y2).
@output("result").
Expected results
result(0.5)
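
The formula |C| / max{|L1|, |L2|} can be sketched in Python as follows, assuming whitespace tokenization. The `leading_substring_similarity` helper is hypothetical, and the handling of two empty inputs is an assumption.

```python
def leading_substring_similarity(text1: str, text2: str) -> float:
    l1, l2 = text1.split(), text2.split()
    longest = max(len(l1), len(l2))
    if longest == 0:
        return 1.0  # assumption: two empty inputs count as identical
    common = 0
    for t1, t2 in zip(l1, l2):
        if t1 != t2:
            break   # the common prefix ends at the first mismatch
        common += 1
    return common / longest


print(leading_substring_similarity("hello world", "hello there"))  # 0.5
```

Both strings have two tokens and share only the first, so the result is 1 / 2.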

levenshtein

Computes the similarity based on the Levenshtein Edit Distance between the input strings.

levenshtein(Text1, Text2)

Where:

  • Text1 is the first string to be compared.

  • Text2 is the second string to be compared.

Example
@library("sim:", "simmetrics").
strings("hello", "hallo").
result(X) :- strings(Y1, Y2), X = sim:levenshtein(Y1, Y2).
@output("result").
Expected results
result(0.8)

longestCommonSubsequence

Computes the similarity based on the length of the Longest Common Subsequence of the input strings.

longestCommonSubsequence(Text1, Text2)

Where:

  • Text1 is the first string to be compared.

  • Text2 is the second string to be compared.

Example
@library("sim:", "simmetrics").
strings("hello world", "hallo there").
result(X) :- strings(Y1, Y2), X = sim:longestCommonSubsequence(Y1, Y2).
@output("result").
Expected results
result(0.6)

longestCommonSubstring

Computes the similarity based on the length of the Longest Common Substring of the input strings.

longestCommonSubstring(Text1, Text2)

Where:

  • Text1 is the first string to be compared.

  • Text2 is the second string to be compared.

Example
@library("sim:", "simmetrics").
strings("hello world", "hallo there").
result(X) :- strings(Y1, Y2), X = sim:longestCommonSubstring(Y1, Y2).
@output("result").
Expected results
result(0.4)

mongeElkan

Computes the Monge-Elkan similarity between the token sets of the two strings by lifting the Smith-Waterman-Gotoh similarity to sets.

mongeElkan(Text1, Text2)

Where:

  • Text1 is the first string to be compared.

  • Text2 is the second string to be compared.

Example
@library("sim:", "simmetrics").
strings("hello", "helo").
result(X) :- strings(Y1, Y2), X = sim:mongeElkan(Y1, Y2).
@output("result").
Expected results
result(0.857)

mongeElkanMax

Computes the Monge-Elkan similarity between the token sets of the two strings by lifting the substring similarity to sets.

mongeElkanMax(Text1, Text2)

Where:

  • Text1 is the first string to be compared.

  • Text2 is the second string to be compared.

Example
@library("sim:", "simmetrics").
strings("hello", "helo").
result(X) :- strings(Y1, Y2), X = sim:mongeElkanMax(Y1, Y2).
@output("result").
Expected results
result(0.857)

needlemanWunch

Computes the Needleman–Wunsch similarity between the input strings.

needlemanWunch(Text1, Text2)

Where:

  • Text1 is the first string to be compared.

  • Text2 is the second string to be compared.

Example
@library("sim:", "simmetrics").
strings("hello", "helo").
result(X) :- strings(Y1, Y2), X = sim:needlemanWunch(Y1, Y2).
@output("result").
Expected results
result(0.857)

overlapCoefficient

Computes the Overlap Coefficient between the token sets of the input strings.

overlapCoefficient(Text1, Text2)

Where:

  • Text1 is the first string to be compared.

  • Text2 is the second string to be compared.

Example
@library("sim:", "simmetrics").
strings("hello world", "hello").
result(X) :- strings(Y1, Y2), X = sim:overlapCoefficient(Y1, Y2).
@output("result").
Expected results
result(1.0)
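
The overlap coefficient divides the number of shared tokens by the size of the smaller set, |A ∩ B| / min(|A|, |B|), which is why a string whose tokens are all contained in the other scores 1.0. A minimal Python sketch, assuming whitespace tokenization and a hypothetical `overlap_coefficient` helper:

```python
def overlap_coefficient(text1: str, text2: str) -> float:
    a, b = set(text1.split()), set(text2.split())
    if not a or not b:
        return 0.0  # assumption: an empty token set overlaps with nothing
    # Shared tokens relative to the smaller of the two sets.
    return len(a & b) / min(len(a), len(b))


print(overlap_coefficient("hello world", "hello"))        # 1.0
print(overlap_coefficient("hello world", "hello there"))  # 0.5
```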

qGramsDistance

Computes the similarity based on the L1-distance between the sets of tri-grams in the input strings.

qGramsDistance(Text1, Text2)

Where:

  • Text1 is the first string to be compared.

  • Text2 is the second string to be compared.

Example
@library("sim:", "simmetrics").
strings("hello", "hallo").
result(X) :- strings(Y1, Y2), X = sim:qGramsDistance(Y1, Y2).
@output("result").
Expected results
result(0.8)

simonWhite

Computes the Simon-White coefficient (the multi-set version of the Dice Coefficient) between the multisets of bi-grams of the input strings.

simonWhite(Text1, Text2)

Where:

  • Text1 is the first string to be compared.

  • Text2 is the second string to be compared.

Example
@library("sim:", "simmetrics").
strings("hello", "helo").
result(X) :- strings(Y1, Y2), X = sim:simonWhite(Y1, Y2).
@output("result").
Expected results
result(0.8)

smithWaterman

Computes the Smith-Waterman Similarity between the input strings.

smithWaterman(Text1, Text2)

Where:

  • Text1 is the first string to be compared.

  • Text2 is the second string to be compared.

Example
@library("sim:", "simmetrics").
strings("hello", "helo").
result(X) :- strings(Y1, Y2), X = sim:smithWaterman(Y1, Y2).
@output("result").
Expected results
result(0.9)

smithWatermanGotoh

Computes the Gotoh version of the Smith-Waterman Similarity between the input strings.

smithWatermanGotoh(Text1, Text2)

Where:

  • Text1 is the first string to be compared.

  • Text2 is the second string to be compared.

Example
@library("sim:", "simmetrics").
strings("hello", "helo").
result(X) :- strings(Y1, Y2), X = sim:smithWatermanGotoh(Y1, Y2).
@output("result").
Expected results
result(0.9)

substring

Returns 1 if one of the strings is a substring of the other, and 0 otherwise.

substring(Text1, Text2)

Where:

  • Text1 is the first string to be compared.

  • Text2 is the second string to be compared.

Example
@library("sim:", "simmetrics").
strings("hello", "hello world").
result(X) :- strings(Y1, Y2), X = sim:substring(Y1, Y2).
@output("result").
Expected results
result(1)
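
The containment check is symmetric: the order of the two arguments does not matter. A Python sketch of that behaviour, with a hypothetical `substring_similarity` helper and assuming a case-sensitive comparison:

```python
def substring_similarity(text1: str, text2: str) -> int:
    # 1 if either string occurs contiguously inside the other, else 0.
    return 1 if text1 in text2 or text2 in text1 else 0


print(substring_similarity("hello", "hello world"))  # 1
print(substring_similarity("hello world", "hello"))  # 1
print(substring_similarity("hello", "hallo world"))  # 0
```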