simmetrics
Approximate string matching and comparison
In addition to the standard string operations, VadalogEngine also supports advanced operations for approximate string matching.
These functions are provided in the `simmetrics` library, which can be made available using the following annotation:
@library("sim:", "simmetrics").
The first type of operator serves an indexing role: it maps similar strings to the same key. The supported indexing operators are:
| Operator | Description |
| --- | --- |
| `caverphone1` | computes the Caverphone phonetic algorithm (version 1) |
| `caverphone2` | computes the Caverphone phonetic algorithm (version 2) |
| `colognePhonetic` | computes the Cologne phonetics algorithm |
| `daitchMokotoffSoundex` | computes the Daitch-Mokotoff Soundex phonetic algorithm |
| `doubleMetaphone` | computes the Double Metaphone phonetic algorithm |
| `matchRatingApproach` | computes the Match Rating Approach phonetic algorithm |
| `metaphone` | computes the Metaphone phonetic algorithm |
| `nysiis` | computes the New York State Identification and Intelligence System phonetic algorithm |
| `removeDiacritics` | removes diacritics from a string |
| `removeNonWord` | removes non-word characters from a string |
| `soundex` | computes the Soundex phonetic algorithm |
| `toLowerCase` | transforms a string into lower case |
| `toUpperCase` | transforms a string into upper case |
The second type of operator measures the similarity between two strings.
Each operator takes two strings and returns a similarity value in the interval [0,1]. Some of the operators work by computing an [edit distance](https://en.wikipedia.org/wiki/Edit_distance) between the two strings; others first tokenize the two strings into words or break them into [q-grams](https://en.wikipedia.org/wiki/N-gram), and then compare the resulting token (multi-)sets. The supported similarity operators are:
| Operator | Description |
| --- | --- |
| `blockDistance` | computes the similarity based on the L1-distance between the token sets of the input strings |
| `cosineSimilarity` | computes the Cosine Similarity between the token sets of the input strings |
| `damerauLevenshtein` | computes the similarity based on the Damerau–Levenshtein Edit Distance |
| `dice` | computes the Dice Coefficient between the token sets of the input strings |
| `euclideanDistance` | computes the similarity based on the L2-distance between the token sets of the input strings |
| `generalizedJaccard` | computes the Generalised Jaccard Similarity between the token sets of the input strings |
| `identity` | returns 1 if the two strings are the same and 0 otherwise |
| `jaccard` | computes the Jaccard Similarity between the token sets of the input strings |
| `jaro` | computes the Jaro Similarity between the input strings |
| `jaroWinkler` | computes the Jaro-Winkler Similarity between the input strings |
| `jaroWinklerSoundex` | computes the Jaro-Winkler Similarity between the Soundex encodings of the input strings |
| `leadingSubstringSimilarity` | computes the common prefix similarity on the list of tokens of the two strings |
| `levenshtein` | computes the similarity based on the Levenshtein Edit Distance between the input strings |
| `longestCommonSubsequence` | computes the similarity based on the length of the Longest Common Subsequence of the input strings |
| `longestCommonSubstring` | computes the similarity based on the length of the Longest Common Substring of the input strings |
| `mongeElkan` | computes the Monge-Elkan similarity between the token sets of the two strings by lifting the Smith-Waterman-Gotoh similarity to sets |
| `mongeElkanMax` | computes the Monge-Elkan similarity between the token sets of the two strings by lifting the `substring` similarity to sets |
| `needlemanWunch` | computes the Needleman–Wunsch similarity between the input strings |
| `overlapCoefficient` | computes the Overlap Coefficient between the token sets of the input strings |
| `qGramsDistance` | computes the similarity based on the L1-distance between the sets of tri-grams in the input strings |
| `simonWhite` | computes the Simon-White coefficient (the multi-set version of the Dice Coefficient) between the multisets of bi-grams of the input strings |
| `smithWaterman` | computes the Smith-Waterman Similarity between the input strings |
| `smithWatermanGotoh` | computes the Gotoh version of the Smith-Waterman Similarity between the input strings |
| `substring` | returns 1 if one of the strings is a substring of the other, and 0 otherwise |
caverphone1
Computes the Caverphone phonetic algorithm (version 1).
caverphone1(Text)
Where:
- `Text` is the string to be encoded.
@library("sim:", "simmetrics").
input("Marcus").
result(X) :- input(Y), X = sim:caverphone1(Y).
@output("result").
result("MKS111")
caverphone2
Computes the Caverphone phonetic algorithm (version 2).
caverphone2(Text)
Where:
- `Text` is the string to be encoded.
@library("sim:", "simmetrics").
input("Markus").
result(X) :- input(Y), X = sim:caverphone2(Y).
@output("result").
result("MKS111")
colognePhonetic
Computes the Cologne phonetic algorithm.
colognePhonetic(Text)
Where:
- `Text` is the string to be encoded.
@library("sim:", "simmetrics").
input("Mayer").
result(X) :- input(Y), X = sim:colognePhonetic(Y).
@output("result").
result("67")
daitchMokotoffSoundex
Computes the Daitch-Mokotoff Soundex phonetic algorithm.
daitchMokotoffSoundex(Text)
Where:
- `Text` is the string to be encoded.
@library("sim:", "simmetrics").
input("Iozefovich").
result(X) :- input(Y), X = sim:daitchMokotoffSoundex(Y).
@output("result").
result("147740")
doubleMetaphone
Computes the Double Metaphone phonetic algorithm.
doubleMetaphone(Text)
Where:
- `Text` is the string to be encoded.
@library("sim:", "simmetrics").
input("architect").
result(X) :- input(Y), X = sim:doubleMetaphone(Y).
@output("result").
result("ARKT")
matchRatingApproach
Computes the Match Rating Approach phonetic algorithm.
matchRatingApproach(Text)
Where:
- `Text` is the string to be encoded.
@library("sim:", "simmetrics").
input("Smith").
result(X) :- input(Y), X = sim:matchRatingApproach(Y).
@output("result").
result("SMT")
metaphone
Computes the Metaphone phonetic algorithm.
metaphone(Text)
Where:
- `Text` is the string to be encoded.
@library("sim:", "simmetrics").
input("Melbert").
result(X) :- input(Y), X = sim:metaphone(Y).
@output("result").
result("MLBR")
nysiis
Computes the New York State Identification and Intelligence System phonetic algorithm.
nysiis(Text)
Where:
- `Text` is the string to be encoded.
@library("sim:", "simmetrics").
input("Webberley").
result(X) :- input(Y), X = sim:nysiis(Y).
@output("result").
result("WABARL")
removeDiacritics
Removes diacritics from a string.
removeDiacritics(Text)
Where:
- `Text` is the string from which to remove diacritics.
@library("sim:", "simmetrics").
input("Cañon City").
result(X) :- input(Y), X = sim:removeDiacritics(Y).
@output("result").
result("Canon City")
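To make the behaviour concrete outside Vadalog, the following is a minimal Python sketch of one common way to strip diacritics: decompose each character (Unicode NFD) and drop the combining marks. The function name `remove_diacritics` is hypothetical, and this is an assumption about the general technique, not a description of how the `simmetrics` library implements it.

```python
import unicodedata


def remove_diacritics(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD).

    Hypothetical helper for illustration only.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks (accents, tildes, ...) have a non-zero combining class.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(remove_diacritics("Cañon City"))  # Canon City
```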
removeNonWord
Removes non-word characters from a string.
removeNonWord(Text)
Where:
- `Text` is the string from which to remove non-word characters.
@library("sim:", "simmetrics").
input("hello, world!").
result(X) :- input(Y), X = sim:removeNonWord(Y).
@output("result").
result("helloworld")
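As a rough Python analogue, non-word characters can be removed with the regular-expression class `\W`. This sketch assumes that "word characters" means letters, digits and underscore, as in Python's `re` module; the library may draw the boundary differently. The name `remove_non_word` is hypothetical.

```python
import re


def remove_non_word(text: str) -> str:
    """Remove every character that is not a letter, digit or underscore."""
    return re.sub(r"\W+", "", text)


print(remove_non_word("hello, world!"))  # helloworld
```

Note that whitespace is also a non-word character, which is why the two words are joined in the output.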
soundex
Computes the Soundex phonetic algorithm.
soundex(Text)
Where:
- `Text` is the string to be encoded.
@library("sim:", "simmetrics").
input("Perotti").
result(X) :- input(Y), X = sim:soundex(Y).
@output("result").
result("P630")
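The indexing role of a phonetic key can be illustrated with a short Python sketch of the classic American Soundex rules, used here to group similar-sounding names under the same key. The helper names `soundex` and `block_by_key` are hypothetical, and the engine's implementation may differ in edge cases (for example, non-ASCII letters are simply ignored here).

```python
from collections import defaultdict

# Letter-to-digit table of classic American Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _letter in _letters:
        _CODES[_letter] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex key of `name` (ASCII letters only)."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        # H and W do not separate equal codes; vowels do.
        if ch not in "HW":
            previous = digit
    return (code + "000")[:4]


def block_by_key(names):
    """Group names that share a Soundex key."""
    blocks = defaultdict(list)
    for name in names:
        blocks[soundex(name)].append(name)
    return dict(blocks)


print(block_by_key(["Perotti", "Pirot", "Smith"]))
# {'P630': ['Perotti', 'Pirot'], 'S530': ['Smith']}
```

Grouping by key first means that the more expensive pairwise similarity operators only need to be applied within each group.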
toLowerCase
Transforms a string into lower case.
toLowerCase(Text)
Where:
- `Text` is the string to be transformed.
@library("sim:", "simmetrics").
input("HELLO WORLD").
result(X) :- input(Y), X = sim:toLowerCase(Y).
@output("result").
result("hello world")
toUpperCase
Transforms a string into upper case.
toUpperCase(Text)
Where:
- `Text` is the string to be transformed.
@library("sim:", "simmetrics").
input("hello world").
result(X) :- input(Y), X = sim:toUpperCase(Y).
@output("result").
result("HELLO WORLD")
blockDistance
Computes the similarity based on the L1-distance between the token sets of the input strings.
blockDistance(Text1, Text2)
Where:
- `Text1` is the first string to be compared.
- `Text2` is the second string to be compared.
@library("sim:", "simmetrics").
strings("hello world", "hello").
result(X) :- strings(Y1, Y2), X = sim:blockDistance(Y1, Y2).
@output("result").
result(0.5)
cosineSimilarity
Computes the Cosine Similarity between the token sets of the input strings.
cosineSimilarity(Text1, Text2)
Where:
- `Text1` is the first string to be compared.
- `Text2` is the second string to be compared.
@library("sim:", "simmetrics").
strings("hello world", "hello").
result(X) :- strings(Y1, Y2), X = sim:cosineSimilarity(Y1, Y2).
@output("result").
result(0.707)
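For plain token sets, the cosine similarity reduces to |A ∩ B| / sqrt(|A| · |B|). The Python sketch below assumes whitespace tokenization and set (not multiset) semantics, which is an assumption about the tokenizer rather than a documented detail; the name `cosine_similarity` is hypothetical. Under these assumptions it gives the value shown above.

```python
import math


def cosine_similarity(text1: str, text2: str) -> float:
    """Cosine similarity between the whitespace-token sets of two strings."""
    a, b = set(text1.split()), set(text2.split())
    if not a or not b:
        # Convention chosen for this sketch: two empty inputs are identical.
        return 1.0 if a == b else 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


print(round(cosine_similarity("hello world", "hello"), 3))  # 0.707
```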
damerauLevenshtein
Computes the similarity based on the Damerau–Levenshtein Edit Distance between the input strings.
damerauLevenshtein(Text1, Text2)
Where:
- `Text1` is the first string to be compared.
- `Text2` is the second string to be compared.
@library("sim:", "simmetrics").
strings("hello", "hallo").
result(X) :- strings(Y1, Y2), X = sim:damerauLevenshtein(Y1, Y2).
@output("result").
result(0.8)
dice
Computes the Dice Coefficient between the token sets of the input strings.
dice(Text1, Text2)
Where:
- `Text1` is the first string to be compared.
- `Text2` is the second string to be compared.
@library("sim:", "simmetrics").
strings("hello", "hallo").
result(X) :- strings(Y1, Y2), X = sim:dice(Y1, Y2).
@output("result").
result(0.667)
euclideanDistance
Computes the similarity based on the L2-distance between the token sets of the input strings.
euclideanDistance(Text1, Text2)
Where:
- `Text1` is the first string to be compared.
- `Text2` is the second string to be compared.
@library("sim:", "simmetrics").
strings("hello world", "hello").
result(X) :- strings(Y1, Y2), X = sim:euclideanDistance(Y1, Y2).
@output("result").
result(0.5)
generalizedJaccard
Computes the Generalised Jaccard Similarity between the token sets of the input strings.
generalizedJaccard(Text1, Text2)
Where:
- `Text1` is the first string to be compared.
- `Text2` is the second string to be compared.
@library("sim:", "simmetrics").
strings("hello", "hallo").
result(X) :- strings(Y1, Y2), X = sim:generalizedJaccard(Y1, Y2).
@output("result").
result(0.75)
identity
Returns 1 if the two strings are the same and 0 otherwise.
identity(Text1, Text2)
Where:
- `Text1` is the first string to be compared.
- `Text2` is the second string to be compared.
@library("sim:", "simmetrics").
strings("hello", "hello").
result(X) :- strings(Y1, Y2), X = sim:identity(Y1, Y2).
@output("result").
result(1)
jaccard
Computes the Jaccard Similarity between the token sets of the input strings.
jaccard(Text1, Text2)
Where:
- `Text1` is the first string to be compared.
- `Text2` is the second string to be compared.
@library("sim:", "simmetrics").
strings("hello", "hallo").
result(X) :- strings(Y1, Y2), X = sim:jaccard(Y1, Y2).
@output("result").
result(0.6)
jaro
Computes the Jaro Similarity between the input strings.
jaro(Text1, Text2)
Where:
- `Text1` is the first string to be compared.
- `Text2` is the second string to be compared.
@library("sim:", "simmetrics").
strings("hello", "hallo").
result(X) :- strings(Y1, Y2), X = sim:jaro(Y1, Y2).
@output("result").
result(0.84)
jaroWinkler
Computes the Jaro-Winkler Similarity between the input strings.
jaroWinkler(Text1, Text2)
Where:
- `Text1` is the first string to be compared.
- `Text2` is the second string to be compared.
@library("sim:", "simmetrics").
strings("hello", "hallo").
result(X) :- strings(Y1, Y2), X = sim:jaroWinkler(Y1, Y2).
@output("result").
result(0.87)
jaroWinklerSoundex
Computes the Jaro-Winkler Similarity between the Soundex encodings of the input strings.
jaroWinklerSoundex(Text1, Text2)
Where:
- `Text1` is the first string to be compared.
- `Text2` is the second string to be compared.
@library("sim:", "simmetrics").
strings("Perotti", "Pirot").
result(X) :- strings(Y1, Y2), X = sim:jaroWinklerSoundex(Y1, Y2).
@output("result").
result(0.89)
leadingSubstringSimilarity
Computes the common prefix similarity on the list of tokens of the two strings. Returns `|C|/max{|L1|, |L2|}`, where `Li` is the list of tokens in the `i`-th input string, `i = 1,2`, and `C` is the longest common prefix of the lists `L1` and `L2`.
leadingSubstringSimilarity(Text1, Text2)
Where:
- `Text1` is the first string to be compared.
- `Text2` is the second string to be compared.
@library("sim:", "simmetrics").
strings("hello world", "hello there").
result(X) :- strings(Y1, Y2), X = sim:leadingSubstringSimilarity(Y1, Y2).
@output("result").
result(0.5)
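The formula |C| / max{|L1|, |L2|} can be traced with a short Python sketch. It assumes whitespace tokenization, which is an assumption about the tokenizer; the name `leading_substring_similarity` is hypothetical. For the example above, the common token prefix is `["hello"]`, so the result is 1/2.

```python
def leading_substring_similarity(text1: str, text2: str) -> float:
    """|C| / max(|L1|, |L2|), with C the longest common prefix of the token lists."""
    l1, l2 = text1.split(), text2.split()
    if not l1 and not l2:
        # Convention chosen for this sketch: two empty inputs are identical.
        return 1.0
    common = 0
    for t1, t2 in zip(l1, l2):
        if t1 != t2:
            break
        common += 1
    return common / max(len(l1), len(l2))


print(leading_substring_similarity("hello world", "hello there"))  # 0.5
```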
levenshtein
Computes the similarity based on the Levenshtein Edit Distance between the input strings.
levenshtein(Text1, Text2)
Where:
- `Text1` is the first string to be compared.
- `Text2` is the second string to be compared.
@library("sim:", "simmetrics").
strings("hello", "hallo").
result(X) :- strings(Y1, Y2), X = sim:levenshtein(Y1, Y2).
@output("result").
result(0.8)
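An edit distance is typically turned into a similarity in [0,1] by normalizing it by the length of the longer string. The Python sketch below assumes the normalization `1 - distance / max(len1, len2)`, which reproduces the value in the example above but is not confirmed here as the library's exact formula; the function names are hypothetical.

```python
def levenshtein_distance(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def levenshtein_similarity(s: str, t: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(s, t) / longest


print(levenshtein_similarity("hello", "hallo"))  # 0.8
```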
longestCommonSubsequence
Computes the similarity based on the length of the Longest Common Subsequence of the input strings.
longestCommonSubsequence(Text1, Text2)
Where:
- `Text1` is the first string to be compared.
- `Text2` is the second string to be compared.
@library("sim:", "simmetrics").
strings("hello world", "hallo there").
result(X) :- strings(Y1, Y2), X = sim:longestCommonSubsequence(Y1, Y2).
@output("result").
result(0.6)
longestCommonSubstring
Computes the similarity based on the length of the Longest Common Substring of the input strings.
longestCommonSubstring(Text1, Text2)
Where:
- `Text1` is the first string to be compared.
- `Text2` is the second string to be compared.
@library("sim:", "simmetrics").
strings("hello world", "hallo there").
result(X) :- strings(Y1, Y2), X = sim:longestCommonSubstring(Y1, Y2).
@output("result").
result(0.4)
mongeElkan
Computes the Monge-Elkan similarity between the token sets of the two strings by lifting the Smith-Waterman-Gotoh similarity to sets.
mongeElkan(Text1, Text2)
Where:
- `Text1` is the first string to be compared.
- `Text2` is the second string to be compared.
@library("sim:", "simmetrics").
strings("hello", "helo").
result(X) :- strings(Y1, Y2), X = sim:mongeElkan(Y1, Y2).
@output("result").
result(0.857)
mongeElkanMax
Computes the Monge-Elkan similarity between the token sets of the two strings by lifting the `substring` similarity to sets.
mongeElkanMax(Text1, Text2)
Where:
- `Text1` is the first string to be compared.
- `Text2` is the second string to be compared.
@library("sim:", "simmetrics").
strings("hello", "helo").
result(X) :- strings(Y1, Y2), X = sim:mongeElkanMax(Y1, Y2).
@output("result").
result(0.857)
needlemanWunch
Computes the Needleman–Wunsch similarity between the input strings.
needlemanWunch(Text1, Text2)
Where:
- `Text1` is the first string to be compared.
- `Text2` is the second string to be compared.
@library("sim:", "simmetrics").
strings("hello", "helo").
result(X) :- strings(Y1, Y2), X = sim:needlemanWunch(Y1, Y2).
@output("result").
result(0.857)
overlapCoefficient
Computes the Overlap Coefficient between the token sets of the input strings.
overlapCoefficient(Text1, Text2)
Where:
- `Text1` is the first string to be compared.
- `Text2` is the second string to be compared.
@library("sim:", "simmetrics").
strings("hello world", "hello").
result(X) :- strings(Y1, Y2), X = sim:overlapCoefficient(Y1, Y2).
@output("result").
result(1.0)
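The overlap coefficient divides the size of the intersection by the size of the smaller set, so it reaches 1 whenever one token set is contained in the other. A minimal Python sketch, assuming whitespace tokenization and set semantics (the name `overlap_coefficient` is hypothetical):

```python
def overlap_coefficient(text1: str, text2: str) -> float:
    """|A ∩ B| / min(|A|, |B|) over the whitespace-token sets of two strings."""
    a, b = set(text1.split()), set(text2.split())
    if not a or not b:
        # Convention chosen for this sketch: two empty inputs are identical.
        return 1.0 if a == b else 0.0
    return len(a & b) / min(len(a), len(b))


print(overlap_coefficient("hello world", "hello"))  # 1.0
```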
qGramsDistance
Computes the similarity based on the L1-distance between the sets of tri-grams in the input strings.
qGramsDistance(Text1, Text2)
Where:
- `Text1` is the first string to be compared.
- `Text2` is the second string to be compared.
@library("sim:", "simmetrics").
strings("hello", "hallo").
result(X) :- strings(Y1, Y2), X = sim:qGramsDistance(Y1, Y2).
@output("result").
result(0.8)
simonWhite
Computes the Simon-White coefficient (the multi-set version of the Dice Coefficient) between the multisets of bi-grams of the input strings.
simonWhite(Text1, Text2)
Where:
- `Text1` is the first string to be compared.
- `Text2` is the second string to be compared.
@library("sim:", "simmetrics").
strings("hello", "helo").
result(X) :- strings(Y1, Y2), X = sim:simonWhite(Y1, Y2).
@output("result").
result(0.8)
smithWaterman
Computes the Smith-Waterman Similarity between the input strings.
smithWaterman(Text1, Text2)
Where:
- `Text1` is the first string to be compared.
- `Text2` is the second string to be compared.
@library("sim:", "simmetrics").
strings("hello", "helo").
result(X) :- strings(Y1, Y2), X = sim:smithWaterman(Y1, Y2).
@output("result").
result(0.9)
smithWatermanGotoh
Computes the Gotoh version of the Smith-Waterman Similarity between the input strings.
smithWatermanGotoh(Text1, Text2)
Where:
- `Text1` is the first string to be compared.
- `Text2` is the second string to be compared.
@library("sim:", "simmetrics").
strings("hello", "helo").
result(X) :- strings(Y1, Y2), X = sim:smithWatermanGotoh(Y1, Y2).
@output("result").
result(0.9)
substring
Returns 1 if one of the strings is a substring of the other, and 0 otherwise.
substring(Text1, Text2)
Where:
- `Text1` is the first string to be compared.
- `Text2` is the second string to be compared.
@library("sim:", "simmetrics").
strings("hello", "hello world").
result(X) :- strings(Y1, Y2), X = sim:substring(Y1, Y2).
@output("result").
result(1)