Fuzzy Fulltext Search with Mysql

I just was thinking about Fuzzy Search and why it isn't possible in mysql. Sure there is SOUNDEX in Mysql, but that's not what I want. I want search results from the Levenshtein distance. But Mysql didn't has such a function. I guess the best solution would be a search server like SOLR or Elasticsearch. But on a shared hosting you haven't such a cool thing. A possible solution is to return all results of the table and work with the levenshtein function of php. I guess in the most cases this is the best solution. But I just want to test an other way, I guess maybe a very stupid way.

Levenshtein distance

Just a quick explanation.

"The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other."
For example the Magento and the word Magneto have a distance from 2. With other metrics the distance maybe would be 1 cause we just have to switch the n and the e. But the Levenshtein distance is 2, cause we can't switch, we just can replace. You can test it on your own.

levenshtein('Magento', 'Magneto');

The idea behind Fuzzy Search with the Levenshtein distance

The idea is really simple. If we search for example with the query:

SELECT *  FROM `catalog_product_flat_1` WHERE `name` LIKE '%Magento%'
We can just find products with Magento in the name. But if Mysql has no possible way to search with the Levenshtein distance and we didn't want to return the hole table we just have to search with every regex. So we can find every entry with a Levenshtein distance of 1 for example. What does that mean. Let's take a look at every regex that we can produce with a distance of 1.

Insertions

insertions

Deletions

deletions

Substitutions

substitutions

All three operations together produce now 22 new regex.

SELECT *  FROM `catalog_product_flat_1` WHERE
`name` LIKE '%_agento%' OR
`name` LIKE '%M_gento%' OR
`name` LIKE '%Ma_ento%' OR
`name` LIKE '%Mag_nto%' OR
`name` LIKE '%Mage_to%' OR
`name` LIKE '%Magen_o%' OR
`name` LIKE '%Magent_%' OR
`name` LIKE '%_Magento%' OR
`name` LIKE '%M_agento%' OR
`name` LIKE '%Ma_gento%' OR
`name` LIKE '%Mag_ento%' OR
`name` LIKE '%Mage_nto%' OR
`name` LIKE '%Magen_to%' OR
`name` LIKE '%Magent_o%' OR
`name` LIKE '%Magento_%' OR
`name` LIKE '%agento%' OR
`name` LIKE '%Mgento%' OR
`name` LIKE '%Maento%' OR
`name` LIKE '%Magnto%' OR
`name` LIKE '%Mageto%' OR
`name` LIKE '%Mageno%' OR
`name` LIKE '%Magent%'
This query needs more time, but we are searching fuzzy. If we search with a distance of 1 we have to produce 3*n new regex (n is the length of the word). If we want to search with a distance d, we have to produce (3*n)d. Not really cool. But I guess we can get some of these regex together. Here is a little function that can produce very naive regexs with levenshtein distance 1.
<?php

function getLevenshtein1($word)
{
    $words = array();
    for ($i = 0; $i < strlen($word); $i++) {
        // insertions
        $words[] = substr($word, 0, $i) . '_' . substr($word, $i);
        // deletions
        $words[] = substr($word, 0, $i) . substr($word, $i + 1);
        // substitutions
        $words[] = substr($word, 0, $i) . '_' . substr($word, $i + 1);
    }
    // last insertion
    $words[] = $word . '_';
    return $words;
}

Next Previous