Sat, 02/05/2009 - 23:27 — horuskol
One of my major projects at the moment is a large subscription-based knowledge-base management system. Along with the actual content management for the site, another major requirement for the site is to make the information as easy to find as possible – utilising organisation, related keywords, and, of course, a search engine. One of the design decisions was not to use an indexing script, as all of the content was already stored in the database, and a custom search engine would be easier to configure to add bias and weighting to particular sections of the content.
There were some complexities in that the content of each article is divided into three fields within the database – giving great flexibility on which content is displayed where (for example – the ‘definition’ of an article is displayed under the search results).
MySQL FULLTEXT Search
Thankfully, MySQL allows the use of FULLTEXT indexes, which can then also be used to provide a relevancy score for keyword searches:
CREATE FULLTEXT INDEX content ON article(title, definition, navigation, content);
SELECT
id,
title,
definition,
MATCH (title, definition, navigation, content) AGAINST ('keyword') AS score
FROM
article
WHERE
MATCH (title, definition, navigation, content) AGAINST ('keyword');There are a few restrictions – here were the ones most relevant to me:
* This will only work on MyISAM tables (as no other table type supports FULLTEXT indexes)
* It will not match a keyword in the stopword list – you can reconfigure the list, but it seems to be a global config which is read in when the MySQL service starts
* If a keyword is matched in more than 50% of the records, the query will return an empty result set (when using the default natural language mode)
Exact Phrase Searching
One other, not very well documented, ‘feature’ is that this particular use of the FULLTEXT index does not allow for “exact phrase†matching. So, if someone input:
The MATCH will break down those three words into three separate keywords, and not into two elements (a keyword and a keyphrase). This meant we were getting a lot of irrelevant articles in the search results because they might have a high relevance for ‘phrase’ but actually nothing to do with ‘exact phrase’.
For the kind of information resource we were building, exact phrase searches are a must, and I spent a while researching to see if there was some way to configure MySQL to allow it. Unfortunately, all I could find is that MySQL decided it was too hard – although, they don’t actually say they don’t do it – they just explain how difficult it is.
Reading up on PostgreSQL as a possible alternative database server, I came to the conclusion that fulltext with relevance scoring and exact phrase matching was not something that was supported out of the can by any SQL server.
Solution
We rewrote the FULLTEXT query above to force MySQL to perform exact matches on the text, but then still score the relevance of the discovered articles:
$keywords = $_GET['keywords']; // example, a string of 'keyword "exact phrase" $arrKeywords = preg_split( "/[\s,]*\\\"([^\\\"]+)\\\"[\s,]*|[\s,]+/", $keywords, 0, PREG_SPLIT_DELIM_CAPTURE ); // splits the keyword string - keeping the exact phrase together $sql['keywords'] = array(); foreach ($arrKeywords as $keyword) { $sql['keywords'] = $dbc->real_escape_string($keyword); // assuming $dbc to be a mysqli object } $query = "SELECT id, title, definition, MATCH (title, definition, navigation, content) AGAINST ('" . implode(" ", $sql['keywords']) . "') FROM article WHERE ( title LIKE '%" . implode("%' OR title LIKE '%", $sql['keywords']) . "%' OR definition LIKE '%" . implode("%' OR definition LIKE '%", $sql['keywords']) . "%' OR navigation LIKE '%" . implode("%' OR navigation LIKE '%", $sql['keywords']) . "%' OR content LIKE '%" . implode("%' OR content LIKE '%", $sql['keywords']) . "%' ) AND isDeleted = 0;"
The resulting query would look like:
SELECT
id,
title,
definition,
MATCH (title, definition, navigation, content) AGAINST ('keyword exact phrase') AS SCORE
FROM
article
WHERE
(
title LIKE '%keyword%' OR title LIKE '%exact phrase%'
OR
definition LIKE '%keyword%' OR definition LIKE '%exact phrase%'
OR
navigation LIKE '%keyword%' OR navigation LIKE '%exact phrase%'
OR
content LIKE '%keyword%' OR content LIKE '%exact phrase%'
)
AND
isDeleted = 0;While it may not be perfect – it is ideal for the article database, and doesn’t have any issues with slowness.
Comments
Nice idea combining LIKE but
Nice idea combining LIKE but still using the FULL-TEXT stuff for scoring – will def have to use this.
think there is a copy/paste error in the final query though, you seem to be checking `definition` multiple times. (the php bit looks ok)
Thanks Barry – glad that the
Thanks Barry – glad that the idea is proving useful to others.
You’re right about the copy/paste error – fixing that.
Post new comment