which is the best tag/search system

@deadfishJan 13.2010

I hope someone can help…

I am working on a website that has users and user-generated articles, galleries and video’s. I am trying to make a tagging system and a search for them all.

At first I was thinking in tbl_articles, tbl_galleries and tbl_videos I would have a title, description and a tags field. Then run a query like the following for each:

select * from tbl_articles where match(title, description, tags)
against (‘$search’ in boolean mode) ORDER BY match(title, description, tags)
against (‘$search’ in boolean mode) DESC, views desc LIMIT 0, 3

The same query for tbl_galleries and tbl_videos. For the users just compare the username. Then display three of each on the results page with a ‘more’ button (facebook style).

When viewing an article, gallery or video there will also have links to related content so I was thinking of using the same query only with the LIMIT set to ‘1,3’ – to avoid showing itself.

Q1 – How is this system?

I was happy with the system, until I found this

In which they have

* a ‘tags’ table which contains two columns a primary id and a uniquely indexed tag_name. * a ‘type’ table for which they have another primary id and a uniquely indexed ‘type’ (category) (I thought I could use it for user/video/article/gallery) * a ‘search’ table that contains the url of the article with a foreign id from ‘tags’ and ‘type’. (I thought instead of a full url I could just store the related foreign id so that I can generate the url e.g article.php?id=….)

Q2 – This system seems far more efficient… although how would I search the title or description?

Q3 – The other bad thing is for every page view I would have to join the tags.. so it might not be that much more efficient.

Q4 – My system only searches boolean too, would I be better with a ‘like’ query?

Q5 – I limit my users to 4 tags, but I encourage single-words (stackoverflow style)… I realise though that in my system a search for ‘train station’ will not match a tag like ‘train-station’ how do I get around this?

So many questions… Sorry it is so long. Thank you.

to post a comment

PHP

4 Comments(s) _↴

@SrWebDeveloperJan 13.2010 — #To summarize a long conversation, as to open source or GPL search engines I recommend two which both represent a "recommended" setup of filters, scoring, metrics and administrative tools:

1) Google Custom Search - all the search power of Google, you can use its tools to custom crawl your site or upload a compatible index file you can crawl on your own via cron using your own crawler, set up metrics and fancy controls. Now with Ajax search results.

2) FastFind (PHP/MySQL) - Crawl via cron, nice features, templates and easy to implement. This is one of the most popular DIY search engines beyond Google's alternative. Allows complete customization.

Actually take the time to explore all the features of each, read the FAQs and you'll get all the answers to all your questions.

-jim

@criterion9Jan 13.2010 — #%LIKE% should help with your "train-station" issue though it could also increase your results exponentially depending on the search string.

* a 'tags' table which contains two columns a primary id and a uniquely indexed tag_name.

* a 'type' table for which they have another primary id and a uniquely indexed 'type' (category) (I thought I could use it for user/video/article/gallery)

* a 'search' table that contains the url of the article with a foreign id from 'tags' and 'type'. (I thought instead of a full url I could just store the related foreign id so that I can generate the url e.g article.php?id=....)
[/quote]
Is much more expandable and customizable than adding those multiple columns to each of the other tables. JOINs are pretty common and much better in this situation (i.e. do you have to do extra processing to handle a comma separated list of tags each time you search the 'tags' column?).

Many times it is much better to separate the data as much as possible and have the database server put it together as needed (i.e. SELECT * can be slower than SELECT col1,col2,col3 when you get many cols you may not actually need in your result set) and cache-control on result sets can offset the extra processing done by the database server. Additionally temp tables caching the results (or better yet flat files caching the results (removed during add/edit events of course) to prevent even accessing the database server when it isn't needed (flat files are usually faster than a database query especially when the web server and database server are not on the same machine (though this is highly dependent on how the results are stored to prevent further processing)).

@svidgenJan 13.2010 — #For your specific needs, I second the suggestion for using a Google Custom Search Engine. Of course, if you're really hell-bent on hosting the search on your own hardware, look into something like the sphinx full text search engine. And, taking it a step further, if this is a learning project, look into trigrams and fuzzy search.

Using trigrams (n-grams of length 3), you'd basically have to index all of your text in groups of 3 characters and/or 3 words at a time, along with their number of occurrences. You'd then break the user's query into trigrams and perform some magic to locate a few approximate matches, and then evaluate the top approximates more closely with another fuzzy match (one that can produce a percent-similarity and is possibly less fuzzy than the first match).

Of course, the exact algorithm could be pretty "simple" ... or you could get pretty complex--say, breaking the search string into token/word trigrams, determining what the user probably meant to type, and then performing some more precise matching (but still fuzzy) based on a modified search string (or set of top most likely strings).

The idea is to index your data in such a way that you can look at its content statistically. Take the following short string, for instance ...

"sick pick"

You [might] get the following trigram counts (with respect to characters (and this varies based on how you treat word boundaries)):

1 sic

2 ick

1 ck_

1 k_p

1 _pi

1 pic

So, when you're user searches for "slick" and you get the following trigrams:

1 sli

1 lic

1 ick

... You know that you [I]probably[/I] don't have a good match. The only trigram from your database that matches is [I]ick[/I], and [I]ick[/I] is a "common" trigram--so, not very informative/useful.

However, when your user searches for "sick cow" and you get the following search trigrams:

1 sic

1 ick

1 ck_

1 k_c

1 _co

1 cow

You match 3 trigrams, [I]ick[/I] (matches, but is unimportant), [I]sic[/I], and [I]ck_[/I]. You've got 2 matches on trigrams that appear infrequently in your database. Those are fairly good indicators that the user searched for something related to the records that those trigrams were taken from.

Of course, in a real search, when you're dealing with much more data, you'll probably ignore trigrams that are both above and below certain thresholds. You want to avoid matching both millions of records and/or a small number of records that are probably meaningless. And you'll probably have to tweak your algorithms and thresholds pretty regularly to continue delivering quality results without demolishing your server ... A bad threshold can bring your server to its knees indefinitely.

... use a Google Custom Search Engine.

@deadfishauthorJan 15.2010 — #Thank you all for your input ?

Also in #PHP _↴

SiteMaps - How Many Levels Deep They Usually Go ?very new to mySql next and last recordeds AUTO_INCREMENT MYsql?!

Success!

Help @deadfish spread the word by sharing this article on Twitter...

Tweet This

which is the best tag/search system

4 Comments(s) _↴

Also in #PHP _↴

Success!

Social

Version

which is the best tag/search system

4 Comments(s) ↴

Also in #PHP ↴

Success!

The web is an endless sea of information. Don't miss the boat... Subscribe!

Social

Version

4 Comments(s) _↴

Also in #PHP _↴