Fuzzy Fulltext Search with Mysql

I just was thinking about Fuzzy Search and why it isn’t possible in mysql. Sure there is SOUNDEX in Mysql, but that’s not what I want. I want search results from the Levenshtein distance. But Mysql didn’t has such a function. I guess the best solution would be a search server like SOLR or Elasticsearch. But on a shared hosting you haven’t such a cool thing. A possible solution is to return all results of the table and work with the levenshtein function of php. I guess in the most cases this is the best solution. But I just want to test an other way, I guess maybe a very stupid way.

Levenshtein distance

Just a quick explanation. “The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.”wikipedia.org. For example the Magento and the word Magneto have a distance from 2. With other metrics the distance maybe would be 1 cause we just have to switch the n and the e. But the Levenshtein distance is 2, cause we can’t switch, we just can replace. You can test it on your own.

`levenshtein('Magento', 'Magneto');`

The idea behind Fuzzy Search with the Levenshtein distance

The idea is really simple. If we search for example with the query:

`SELECT * FROM `catalog_product_flat_1` WHERE `name` LIKE '%Magento%'`

We can just find products with Magento in the name. But if Mysql has no possible way to search with the Levenshtein distance and we didn’t want to return the hole table we just have to search with every regex. So we can find every entry with a Levenshtein distance of 1 for example. What does that mean. Let’s take a look at every regex that we can produce with a distance of 1.

Substitutions

All three operations together produce now 22 new regex.

```SELECT * FROM `catalog_product_flat_1` WHERE `name` LIKE '%_agento%' OR `name` LIKE '%M_gento%' OR `name` LIKE '%Ma_ento%' OR `name` LIKE '%Mag_nto%' OR `name` LIKE '%Mage_to%' OR `name` LIKE '%Magen_o%' OR `name` LIKE '%Magent_%' OR `name` LIKE '%_Magento%' OR `name` LIKE '%M_agento%' OR `name` LIKE '%Ma_gento%' OR `name` LIKE '%Mag_ento%' OR `name` LIKE '%Mage_nto%' OR `name` LIKE '%Magen_to%' OR `name` LIKE '%Magent_o%' OR `name` LIKE '%Magento_%' OR `name` LIKE '%agento%' OR `name` LIKE '%Mgento%' OR `name` LIKE '%Maento%' OR `name` LIKE '%Magnto%' OR `name` LIKE '%Mageto%' OR `name` LIKE '%Mageno%' OR `name` LIKE '%Magent%'```

This query needs more time, but we are searching fuzzy. If we search with a distance of 1 we have to produce 3*n new regex (n is the length of the word). If we want to search with a distance d, we have to produce (3*n)d. Not really cool. But I guess we can get some of these regex together. Here is a little function that can produce very naive regexs with levenshtein distance 1.

11 thoughts on “Fuzzy Fulltext Search with Mysql”

1. CaballoCarpiano says:

That’s interesting, what have been your results experimenting with this? I mean in performance and accuracy?

1. It works pretty good, performance is also good. But I didn’t tested with multiple requests.

2. Anton says:

Wouldn’t it be better to combine all expression into one regex and use RLIKE operator instead of LIKE? Ex. “Magento|.agento|M.gento” etc.

1. I can’t say if it makes a big performance benefit, but looks better. Thank you very much.

3. Joel Sager says:

Great post. I was looking for a way to do this. I ended up making a function that would take a string and build an rlike string for mysql. I use this function when we are looking for first and last names that sometimes have typos in them. Thanks again.

```function rlike(\$my_string) { \$strlen = strlen( \$my_string );   for (\$i = 0; \$i &lt;= \$strlen; \$i++) { for( \$x = 0; \$x &lt;= (\$strlen -1); \$x++ ) { if (\$x == \$i) { \$char = &#039;.&#039;; } else { \$char = substr( \$my_string, \$x, 1 ); } \$rstr[\$x] = \$char; } \$rlike[\$i] = implode(\$rstr); } return &quot;RLIKE &#039;&quot;. implode(&#039;|&#039;,array_filter(\$rlike)).&quot;&#039;&quot;; }```
4. Abhishek says:

Hi,

I like your idea to overcome this hurdle of MySQL not supporting Fuzzy Search with Full Text Search.

However, wanted to get your input on another stupid way I am planning to overcome this hurdle.

1. Creating a dictionary table of all single words from the search table (in which the query will be searching for records).
2. Implementing this Levenshtein function in MySQL https://github.com/jmcejuela/Levenshtein-MySQL-UDF
3. Now, before any full text search query ‘B’ is made, all of its ‘search text’ e.g. ‘mgento featres’ would pass through another query ‘A’
4. This ‘A’ query would break down the search text to each single word, and check their levenshtein distance in dictionary table and replace each word with their counterpart (giving least distance in ASC order)
5. Then query ‘B’ would contain search text as ‘magento features’.

Please let me know your input. I am creating the dictionary table only from selected columns (which I searching via Full Text Search) from search table.

Thanks again

1. Brett says:

Did you try it?

5. Little modified version to use with UTF-8 and REGEXP:

`mb_internal_encoding('UTF-8'); function mysql_fuzzy_regex(\$str) { \$len=mb_strlen(\$str); \$qstr=[]; for (\$i=0; \$i < \$len; \$i++) \$qstr[\$i]=preg_quote(mb_substr(\$str, \$i, 1)); \$reg='[[:<:]]('.implode(\$qstr).'[[:alnum:]]?'; for (\$i=0; \$i < \$len; \$i++) { \$reg.='|'; for (\$x=0; \$x < \$len; \$x++) \$reg.=\$x==\$i ?'([[:alnum:]]'.\$qstr[\$x].'|[[:alnum:]]?)' :\$qstr[\$x]; } return \$reg.')[[:>:]]'; }`
6. Ola!

Thanks for this solution and the comments are handy too.

I thought I’d throw in my two pennies too, I’ve been using this http://lucene.apache.org/solr/ as a fuzzymatch solution.

Its a pretty accessible Java based extension of apache and even ranks the results from the datasource.

Heres the code to link it to the mysql database:

http://stackoverflow.com/questions/15084713/how-to-connect-apache-solr-with-mysql

~ Lewis