In this post we'll see how to write a Java program for Rabin-Karp string matching algorithm.
Rabin-Karp string matching algorithm uses hashing to search for a pattern in a String. The idea is to calculate hash for the pattern and for the substring of the same length as pattern in the original string.
To determine that the pattern and substring are same you compare the calculated hash values.
If hash values match, then as an added precaution (to negate the scenario of hash collision), you do character by character comparison also for the pattern and that substring.
If hash values don't match, then in the original string you move by one character and recalculate hash for the next substring of the same length as pattern string.
For example, if the string is "aabcde" and the searched pattern is "abc" then calculate hash for the pattern string "abc" and start with the substring of the same length in the original string which will be "aab" and calculate hash for this substring.
Then compare the calculated hash of the pattern string with the hash of the substring. If they don't match then slide by one character in the original string to calculate hash for the next substring.
This sliding or rolling of hash is the core idea in the Rabin-Karp algorithm to make the algorithm efficient.
Rolling hash in Rabin-Karp algorithm
The concept of rolling hash allows you to compute the hash of a substring in constant time by reusing the already calculated hash of the previous substring. Instead of recalculating hash for the next substring from scratch, sliding window concept is used to roll the hash by-
- Removing the hash of the leftmost character
- Adding a new character at the end
Rolling hash formula in the Rabin-Karp algorithm
If str is the text, patLength is the length of the pattern string, strHash is the hash of the current substring, base and modulo are the base and prime modulus variables respectively and h is the precomputed weight, then the rolling hash formula is
strHash = (BASE * (strHash - str.charAt(i) * h) + str.charAt(i + patLength)) % MODULO;
Base and modulus in the Rabin-Karp algorithm
How base and modulo are chosen for calculating hash helps in the proper distribution of hash and preventing many hash collisions.
The base in Rabin-Karp should be at least the size of the number of possible characters used to represent the string, ensuring each character can be uniquely represented. For example, if using extended ASCII characters, the base should be at least 256 to cover all characters from 0 to 255.
The modulus is typically a large prime number which helps in distributing hash values more uniformly, thus reducing the likelihood of hash collisions. We have seen the formula for calculating hash in the above section and that can give a very large value; by using the modulus operation the size of the hash value is reduced to a smaller value that will be within the range of data type used (int or long).
Rabin-Karp String Matching Algorithm - Java Program
public class RobinKarp { private static final int R = 256; // Number of characters in the input alphabet static final int MODULO = 1000000009; // A prime number for modulo operations public static void main(String[] args) { String str = "This is a test, this test should pass"; String pattern = "test"; //String str = "abcabcabc"; //String pattern = "ab"; searchPattern(str, pattern); } public static void searchPattern(String str, String pattern) { int patLength = pattern.length(); int strLength = str.length(); long patHash = 0; long strHash = 0; long h = 1; // precomputed weight of the polynomial hash for (int i = 0; i < patLength - 1; i++) h = (h * R) % MODULO; // calculate hash for the pattern string and the // substring of the same length in the original string for(int i = 0; i < patLength; i++) { patHash = (R * patHash + pattern.charAt(i)) % MODULO; strHash = (R * strHash + str.charAt(i)) % MODULO; } int j; for(int i = 0; i <= strLength - patLength; i++) { // if hashes are same if(patHash == strHash) { // do character by character comparison to confirm // strings equality and ruling out hash collision for(j = 0; j < patLength; j++) { if(str.charAt(i+j) != pattern.charAt(j)) { break; } } if(j == patLength) { System.out.println("Pattern found at index " + i); } } // Need this check to avoid index out of bound for (i+patlength) if(i < strLength - patLength) { // remove the char at the beginning and add one at the end ** Sliding Window strHash = (R * (strHash - str.charAt(i) * h) + str.charAt(i + patLength)) % MODULO; // in case calculated hash is negative bring it with in range 0..MODULO-1 if(strHash < 0) { strHash += MODULO; } } } } }
Output
Pattern found at index 10 Pattern found at index 21
Now, the confusion in this algorithm is why do need to pre-calculate this h variable and what exactly is happening in this formula for rolling hash
strHash = (R * (strHash - str.charAt(i) * h) + str.charAt(i + patLength)) % MODULO;
Let's try to understand with the help of one example string, String str = "abcabcabc"; and pattern to be searched as String pattern = "abc";
Here length of the pattern is 3 so value of h will be
h = R2 % MODULO
and strHash, which is calculated using the following formula
strHash = (R * strHash + str.charAt(i)) % MODULO;
Let's remove the Modulo part to make it easy to follow, then it will have the values as
For S0- strHash = R * 0 + S0 = S0 For S1- strHash = R * (S0) + S1 = RS0 + S1 For S2- strHash = R * (RS0 + S1) + S2 = R2S0 + RS1 + S2 For S3- strHash = R * (R2S0 + RS1 + S2) + S3
But, if we are taking substrings of length 3 then we should have removed hash for S0, while calculating hash for S3, so strHash for S3 will actually be-
strHash = R * (RS1 + S2) + S3 = R2S1+RS2+S3
If you have observed the pattern, leftmost character always have the exponent for RADIX as patternLength - 1, which is R2 in case pattern length is 3. That's what precomputed h also gives us.
Now, when you are removing S0 and adding S3, what you need to do is remove hash for S0 and add S3, in the formula if we replace the values (here again let's remove the Modulo to make it easy to follow)-
strHash = R * (R2S0 + RS1 + S2 - S0 * R2) + S3 = R * (R2S0 + RS1 + S2 - R2S0) + S3 = R * (RS1 + S2) + S3 = R2S1 + RS2 + S3
Which is same as strHash of S3 as calculated above.
Hope that makes it easy to understand why we keep precomputed value of h and how it is used to get rolling hash.
Time and Space complexity of Rabin-Karp algorithm
If original string is of length n and pattern string is of length m then-
Initial hash of the pattern and substring of the same length- O(m)
Rolling hash for the rest of the String- O(n-m) which can be considered O(n)
So, the best case scenario where there are no hash collisions (spurious hits) and no full comparisons then time complexity is O(n + m)
In worst case scenario time complexity is O(n X m), which may happen if there are spurious hits in each sliding window triggering many full length comparisons.
That's all for this topic Rabin-Karp Algorithm For Pattern Search - Java Program. If you have any doubt or any suggestions to make please drop a comment. Thanks!
>>>Return to Java Programs Page
Related Topics
You may also like-
No comments:
Post a Comment