Optimal data architecture for tagging, clouds, and searching (like StackOverflow)?

Wow I just wrote a big post and SO choked and hung on it, and when I hit my back button to resubmit, the markup editor was empty. aaargh.

So here I go again…

Regarding Stack Overflow, it turns out that they use SQL server 2005 full text search.

Regarding the OS projects recommended by @Grant:

  • *DotNetKicks uses the DB for tagging and Lucene for full-text search. There appears to be no way to combine a full text search with a tag search
  • Kigg uses Linq-to-SQL for both search and tag queries. Both queries join Stories->StoryTags->Tags.
  • Both projects have a 3-table approach to tagging as everyone generally seems to recommend

I also found some other questions on SO that I’d missed before:

  • How Do You Recommend Implementing Tags or Tagging?
  • How to structure data for searchability?
  • Database Design for Tagging

What I’m currently doing for each of the items I mentioned:

  1. In the DB, 3 tables: Entity, Tag, Entity_Tag. I use the DB to:
    • Build site-wide tag clouds
    • browse by tag (i.e. urls like SO’s /questions/tagged/ASP.NET)
  2. For search I use Lucene + NHibernate.Search
    • Tags are concat’d into a TagString that is indexed by Lucene
      • So I have the full power of the Lucene query engine (AND / OR / NOT queries)
      • I can search for text and filter by tags at the same time
      • The Lucene analyzer merges words for better tag searches (i.e. a tag search for “test” will also find stuff tagged “testing”)
    • Lucene returns a potentially enormous result set, which I paginate to 20 results
    • Then NHibernate loads the result Entities by Id, either from the DB or the Entity cache
    • So it’s entirely possible that a search results in 0 hits to the DB
  3. Not doing this yet, but I think I will probably try to find a way to build the tag cloud from the TagString in Lucene, rather than take another DB hit
  4. Haven’t done this yet either, but I will probably store the TagString in the DB so that I can show an Entity’s Tag list without having to make 2 more joins.

This means that whenever an Entity’s tags are modified, I have to:

  • Insert any new Tags that do not already exist
  • Insert/Delete from the EntityTag table
  • Update Entity.TagString
  • Update the Lucene index for the Entity

Given that the ratio of reads to writes is very big in my application, I think I’m ok with this. The only really time-consuming part is Lucene indexing, because Lucene can only insert and delete from its index, so I have to re-index the entire entity in order to update the TagString. I’m not excited about that, but I think that if I do it in a background thread, it will be fine.

Time will tell…

Leave a Comment

techhipbettruvabetnorabahisbahis forumu