Most frequent k characters
In information theory, MostFreqKDistance is a string metric for quickly estimating how similar two ordered sets or strings are. The scheme was introduced by Sadi Evren Seker (2014), and initially used in text mining applications such as author recognition. The method builds on a hashing function (MostFreqKHashing); the idea originated from the classical author recognition problem and from work on data stream mining. The algorithm can be implemented in any Turing-complete programming language.
The method has two steps:
- Hash the input strings str1 and str2 separately using MostFreqKHashing, producing hstr1 and hstr2 respectively
- Calculate the string distance (or string similarity coefficient) of the two hash outputs, hstr1 and hstr2, and output an integer value
Most frequent K hashing
The first step of the algorithm is computing a hash based on the most frequent k characters. The hashing algorithm has the following steps:
String function MostFreqKHashing (String inputString, int K)
    def String outputString
    for each distinct character
        count the occurrences of the character in inputString
    for i := 0 to K
        char c = next most frequent ith character (if two characters have the same frequency, take the one occurring first in inputString)
        int count = number of occurrences of the character
        append c and count to outputString
    end for
    return outputString
The function above takes an input string and an integer K, and outputs the most frequent K characters of the input string together with their counts. The only special rule during the creation of the output string is that, when the frequencies of two characters are equal, the character occurring first in the input is appended first. Like most hashing functions, most frequent K hashing is a one-way function.
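As an illustration, the hashing step can be sketched in Python. This is a sketch following the pseudocode above, not the reference implementation; the function name is an assumption:

```python
from collections import Counter

def most_freq_k_hashing(s, k):
    """Return the k most frequent characters of s with their counts.

    Ties are broken by first occurrence in s, per the rule above.
    """
    counts = Counter(s)
    # Sort by descending frequency, then by position of first occurrence.
    ordered = sorted(counts, key=lambda c: (-counts[c], s.index(c)))
    return ''.join(f'{c}{counts[c]}' for c in ordered[:k])

# Example: 'r' and 'e' both occur twice; 'r' appears first.
print(most_freq_k_hashing('research', 2))  # r2e2
```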
Most frequent K distance
The second step of the algorithm operates on the two outputs obtained from the two input strings and produces the similarity coefficient (or distance metric).
int function MostFreqKSimilarity (String inputStr1, String inputStr2, int limit)
    def int similarity
    for each c = next character from inputStr1
        lookup c in inputStr2
        if c is null
            continue
        similarity += frequency of c in inputStr1 + frequency of c in inputStr2
    return limit - similarity
The function above takes two input strings previously produced by the MostFreqKHashing function. Each hash encodes characters together with their frequencies, so the similarity function computes the similarity by checking whether the same character appears in both hashes and, if so, adding its frequencies. The limit is usually taken to be 10, and in the end the function returns the result of subtracting the summed similarity from the limit.
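A Python sketch of the comparison step, under the assumption (consistent with the worked genetic-sequence example below) that the frequencies of a shared character from both hashes are summed; the function name and hash-parsing details are illustrative, not from the original paper:

```python
import re

def most_freq_k_similarity(hstr1, hstr2, limit):
    """Compare two MostFreqKHashing outputs such as 'r2e2' and 'e2s1'.

    For every character present in both hashes, its frequencies from
    both hashes are summed; the total is subtracted from `limit`.
    """
    parse = lambda h: {c: int(n) for c, n in re.findall(r'(\D)(\d+)', h)}
    f1, f2 = parse(hstr1), parse(hstr2)
    similarity = sum(f1[c] + f2[c] for c in f1 if c in f2)
    return limit - similarity

# Only 'L' appears in both hashes, with frequencies 9 and 8.
print(most_freq_k_similarity('L9T8', 'F9L8', 100))  # 83
```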
In some implementations a distance metric is required instead of a similarity coefficient. To convert the above similarity coefficient into a distance metric, the output can be subtracted from any constant value (such as the maximum possible output value). For this case it is also possible to implement a wrapper function over the two functions above.
String distance wrapper function
To calculate the distance between two strings, the following function can be implemented:
int function MostFreqKSDF (String inputStr1, String inputStr2, int K, int maxDistance)
    return MostFreqKSimilarity(MostFreqKHashing(inputStr1, K), MostFreqKHashing(inputStr2, K), maxDistance)
Any call to the string distance function above supplies two input strings and a maximum distance value. The function calculates the similarity and subtracts it from the maximum possible distance, so the result can be considered a simple additive inverse of the similarity.
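Putting both steps together, the wrapper can be sketched in Python. The snippet repeats minimal versions of the two functions so it runs on its own; it assumes the frequency-summing convention used in the genetic-sequence example below, and the names are illustrative:

```python
import re
from collections import Counter

def most_freq_k_hashing(s, k):
    # Most frequent k characters with counts; ties by first occurrence.
    counts = Counter(s)
    ordered = sorted(counts, key=lambda c: (-counts[c], s.index(c)))
    return ''.join(f'{c}{counts[c]}' for c in ordered[:k])

def most_freq_k_similarity(h1, h2, limit):
    # Sum the frequencies of characters shared by both hashes,
    # then subtract from the limit to obtain a distance.
    parse = lambda h: {c: int(n) for c, n in re.findall(r'(\D)(\d+)', h)}
    f1, f2 = parse(h1), parse(h2)
    return limit - sum(f1[c] + f2[c] for c in f1 if c in f2)

def most_freq_k_sdf(str1, str2, k, max_distance):
    # Hash both inputs, then turn the similarity into a distance.
    return most_freq_k_similarity(most_freq_k_hashing(str1, k),
                                  most_freq_k_hashing(str2, k),
                                  max_distance)

# 'research' -> r2e2, 'seeking' -> e2s1; shared 'e' gives 2 + 2 = 4.
print(most_freq_k_sdf('research', 'seeking', 2, 10))  # 6
```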
Consider most frequent 2 hashing over the two strings 'research' and 'seeking'.

MostFreqKHashing('research', 2) = r2e2

because 'r' and 'e' are the characters with the highest frequency (2 each), and they are returned in the order they appear in the string.

MostFreqKHashing('seeking', 2) = e2s1

Again, the character 'e' has the highest frequency; the remaining characters all have the same frequency of 1, so the first of these to occur in the string, 's', is returned. Finally the comparison is made:

MostFreqKSimilarity('r2e2', 'e2s1', 10) = 6

The only character occurring in both hashes is 'e', with frequency 2 in each, giving a similarity of 2 + 2 = 4; subtracting from the limit of 10 yields 6. Instead of running the example step by step as above, the same result can be obtained with the string distance wrapper function:

MostFreqKSDF('research', 'seeking', 2, 10) = 6
The table below holds some sample runs between example inputs for K = 2:

| Inputs | Hash outputs | SDF output (max from 10) |
The method is also suitable for bioinformatics, for comparing genetic strings such as those in the FASTA format.
- Str1 = LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
- Str2 = EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
- MostFreqKHashing(str1, 2) = L9T8
- MostFreqKHashing(str2, 2) = F9L8
- MostFreqKSDF(str1, str2, 2, 100) = 83
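The genetic-string example above can be reproduced with a short self-contained Python check (a sketch; the helper name is illustrative):

```python
from collections import Counter

def mfk_hash(s, k):
    # Most frequent k characters with counts; ties by first occurrence.
    counts = Counter(s)
    top = sorted(counts, key=lambda c: (-counts[c], s.index(c)))[:k]
    return {c: counts[c] for c in top}

str1 = "LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV"
str2 = "EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG"

h1, h2 = mfk_hash(str1, 2), mfk_hash(str2, 2)  # {'L': 9, 'T': 8}, {'F': 9, 'L': 8}
similarity = sum(h1[c] + h2[c] for c in h1 if c in h2)  # only 'L' is shared: 9 + 8
print(100 - similarity)  # 83
```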
Algorithm complexity and comparison
The motivation behind the algorithm is calculating the similarity between two input strings, so the hashing function should reduce the size of the input while preserving its characteristics. For other hashing algorithms, such as MD5 or SHA-1, the output is completely unrelated to the input, which makes them unsuitable for string similarity checks.
On the other hand, string similarity functions such as Levenshtein distance suffer from high algorithmic complexity (quadratic in the input lengths).
The time complexity of 'most frequent k characters' string similarity is simple to calculate. To obtain the most frequent K characters of a string, one approach is first to sort the string lexicographically; after this sort, the characters with the highest occurrence counts can be found in a single pass in linear time. Since the major classical sorting algorithms, such as merge sort or quicksort, run in O(n log n) time, the first string can be sorted in O(n log n) and the second in O(m log m). The total complexity is O(n log n) + O(m log m), which is O(n log n) as a worst-case upper bound (taking n ≥ m).
The table below compares the complexity of the algorithms, where n is the length of the first string and m is the length of the second string.
Success on text mining
The success of these string similarity algorithms was compared in a study. The study is based on the IMDB62 dataset, which holds 1,000 comment entries in the Internet Movie Database from each of 62 people. The dataset was used to challenge three string similarity functions, with the success rates given below:
| Algorithm | Running time | Error (RMSE) | Error (RAE) |
| Levenshtein distance | 3647286.54 sec | 29 | 0.47 |
| Jaccard index | 228647.22 sec | 45 | 0.68 |
In terms of time performance and success rates, bitwise similarity functions such as the Sørensen–Dice index, the Tversky index, and the Hamming distance are all in the same category, with similar success rates and running times. There are slight differences, but the idea behind bitwise operations loses string operations such as deletion or addition. For example, a single character added to the front of one of the input strings yields a catastrophic result on the similarity for bitwise operators, while Levenshtein distance handles it successfully.
Unfortunately, big data studies often require a faster algorithm that still has an acceptable success rate. In such situations, 'most frequent k characters' is a conceptually simpler algorithm that is also straightforward to implement.
See also
- Approximate string matching
- Bitap algorithm
- Damerau–Levenshtein distance
- Dynamic time warping
- Euclidean distance
- Fuzzy string searching
- Hamming weight
- Hirschberg's algorithm
- Homology of sequences in genetics
- Hunt–McIlroy algorithm
- Jaccard index
- Jaro–Winkler distance
- Levenshtein distance
- Longest common subsequence problem
- Lucene (an open source search engine that implements edit distance)
- Manhattan distance
- Metric space
- Needleman–Wunsch algorithm
- Optimal matching algorithm
- Sequence alignment
- Similarity space on Numerical taxonomy
- Smith–Waterman algorithm
- Sørensen similarity index
- String distance metric
- String similarity function
- Wagner–Fischer algorithm
- Locality-sensitive hashing
References
- Seker, Sadi Evren; Altun, Oguz; Ayan, Ugur; Mert, Cihan (2014). "A Novel String Distance Function based on Most Frequent K Characters". International Journal of Machine Learning and Computing (IJMLC). International Association of Computer Science and Information Technology Press (IACSIT Press). 4 (2): 177–183. arXiv:1401.6596. Bibcode:2014arXiv1401.6596E.
- Seker, Sadi Evren; Mert, Cihan (2013). "A Novel Feature Hashing For Text Mining". Journal of Technical Science and Technologies. International Black Sea University. 2 (1): 37–41. ISSN 2298-0032.
- Rosetta Code: Most frequent k chars distance, retrieved 2014-10-16