Entity Extraction/Recognition with free tools while feeding Lucene Index

The problem you are facing in the ‘wicket’ example is called entity disambiguation, not entity extraction/recognition (NER). NER can be useful, but only when the categories are specific enough. Most NER systems don’t have enough granularity to distinguish between a sport and a software project (both would fall outside the typically recognized types: person, organization, location).

For disambiguation, you need a knowledge base against which the entities are disambiguated. DBpedia is a typical choice due to its broad coverage. See my answer to How to use DBPedia to extract Tags/Keywords from content?, where I explain this in more detail and mention several tools for disambiguation, including:

  • Zemanta
  • Maui-indexer
  • DBpedia Spotlight
  • Extractiv (my company)

These tools usually expose a language-independent API such as REST. As far as I know, none of them provides Lucene support directly, so you would call the service yourself and add the returned entities to your documents before indexing. I hope this helps with the problem you are trying to solve.
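To make the REST route a bit more concrete, here is a minimal sketch in Python of calling a disambiguation service and turning its answer into fields you could add to an index document. Treat all of it as assumptions on my part: the endpoint URL and the JSON keys (`Resources`, `@URI`, `@surfaceForm`) follow what the public DBpedia Spotlight service returned when I last looked and may differ for your version or deployment, and the field names `entity_uri` and `entity_surface` are hypothetical names I made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public DBpedia Spotlight endpoint. Replace with your own deployment.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def build_request(text, confidence=0.5):
    """Build a GET request that asks the service for JSON annotations of `text`."""
    query = urllib.parse.urlencode({"text": text, "confidence": confidence})
    return urllib.request.Request(
        f"{SPOTLIGHT_URL}?{query}",
        headers={"Accept": "application/json"},
    )


def entity_fields(response_json):
    """Convert a Spotlight-style response into multi-valued index fields.

    The field names are hypothetical; use whatever your index schema defines.
    """
    uris = []
    surface_forms = []
    # Assumption: "Resources" is missing when nothing was recognized.
    for resource in response_json.get("Resources", []):
        uri = resource["@URI"]
        surface = resource["@surfaceForm"]
        if uri not in uris:  # keep each disambiguated entity only once
            uris.append(uri)
        if surface not in surface_forms:
            surface_forms.append(surface)
    return {"entity_uri": uris, "entity_surface": surface_forms}


def annotate(text, confidence=0.5):
    """Call the service over HTTP and return the extracted fields."""
    with urllib.request.urlopen(build_request(text, confidence), timeout=10) as resp:
        return entity_fields(json.load(resp))
```

You would call `annotate(document_text)` once per document at indexing time. On the Lucene side, each returned URI would then typically go into a non-tokenized keyword field (for example a `StringField` in recent Lucene versions), so that a search for the software project's DBpedia URI does not match documents about the sport.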
