Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-Winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity ...
java
algorithm
distance
jaro-winkler
levenshtein-distance
similarity-measures
cosine-similarity
string-distance
damerau-levenshtein
shingles
distance-measure
Updated Jun 7, 2021 - Java


Need to implement a smarter tokenization method that takes into account languages that traditionally do not use spaces between words. Currently, text in these languages ends up as full-sentence tokens, which are not suitable for the current method of cosine similarity comparison.
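
To make the problem concrete, here is a minimal sketch in Java, not the project's actual tokenizer. It assumes the current method splits on whitespace, and it shows overlapping character n-grams as one possible fallback for text without spaces. The class name `TokenizerSketch`, both method names, and the Japanese sample sentence are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizerSketch {

    // Whitespace tokenization, assumed here to be the current behavior.
    public static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Overlapping character n-grams over Unicode code points.
    // Whitespace is dropped first, so n-grams may span word boundaries
    // in languages that do use spaces.
    public static List<String> charNGrams(String text, int n) {
        int[] codePoints = text.codePoints()
                .filter(cp -> !Character.isWhitespace(cp))
                .toArray();
        List<String> grams = new ArrayList<>();
        if (codePoints.length == 0) {
            return grams;
        }
        if (codePoints.length <= n) {
            // Text shorter than n: keep it as a single token.
            grams.add(new String(codePoints, 0, codePoints.length));
            return grams;
        }
        for (int i = 0; i + n <= codePoints.length; i++) {
            grams.add(new String(codePoints, i, n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Hypothetical sample: a Japanese sentence written without spaces.
        String sentence = "私は猫が好きです";
        // Whitespace splitting yields one token: the whole sentence.
        System.out.println(whitespaceTokens(sentence));
        // Character bigrams yield several short, comparable tokens.
        System.out.println(charNGrams(sentence, 2));
    }
}
```

Character n-grams need no dictionary and give cosine similarity several terms to compare, but they do not correspond to real words. Dictionary-based or statistical word segmentation would likely give better tokens at the cost of language-specific resources.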
Some of these languages include: