lucene – Page 5 – Tarik Billa

How to get a Token from a Lucene TokenStream?

February 13, 2023 by Tarik

Yeah, it’s a little convoluted (compared to the good ol’ way), but this should do it:

    TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
    OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
    TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);

    while (tokenStream.incrementToken()) {
        int startOffset = offsetAttribute.startOffset();
        int endOffset = offsetAttribute.endOffset();
        String term = termAttribute.term();
    }

Edit: The new way

According to Donotello, TermAttribute has been … Read more

What does percolator mean/do in elasticsearch?

February 2, 2023 by Tarik

What you usually do is index documents and get them back by querying. What the percolator allows you to do in a nutshell is index your queries and percolate documents against the indexed queries to know which queries they match. It’s also called reversed search, as what you do is the opposite to what you … Read more
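To make the "index your queries, then percolate documents" idea concrete, here is a toy sketch in plain Java. It does not use the Elasticsearch API at all: `ToyPercolator`, `register` and `percolate` are made-up names, and a "query" is assumed to be nothing more than a set of terms that must all appear in the document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy reverse search: queries are stored first, documents are matched against them later.
// Illustration only; a real percolator supports full query syntax, not just term sets.
final class ToyPercolator {
    // Registered queries: query id -> terms that must ALL be present in a document.
    private final Map<String, Set<String>> queries = new LinkedHashMap<>();

    void register(String queryId, String... requiredTerms) {
        queries.put(queryId, new HashSet<>(Arrays.asList(requiredTerms)));
    }

    // Returns the ids of every stored query that the given document matches.
    List<String> percolate(String document) {
        Set<String> docTerms =
                new HashSet<>(Arrays.asList(document.toLowerCase().split("\\W+")));
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : queries.entrySet()) {
            if (docTerms.containsAll(entry.getValue())) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        ToyPercolator percolator = new ToyPercolator();
        percolator.register("q1", "lucene");
        percolator.register("q2", "solr", "faceting");
        percolator.register("q3", "lucene", "index");
        // The document is the input; the matching queries are the output.
        System.out.println(percolator.percolate("Lucene builds an inverted index"));
    }
}
```

Typical uses for this pattern are alerting and saved searches: the queries sit there, and every incoming document is checked against them.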

using OR and NOT in solr query

January 28, 2023 by Tarik

I don’t know why that doesn’t work, but this one is logically equivalent and it does work: -(myField:superneat AND -myOtherField:somethingElse) Maybe it has something to do with defining the same field twice in the query… Try asking in the solr-user group, then post back here the final answer!
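The "logically equivalent" part is just De Morgan's law: NOT (A AND NOT B) is the same as (NOT A) OR B. The original query isn't quoted here, so treating it as something like `-myField:superneat OR myOtherField:somethingElse` is an assumption. The sketch below only checks the Boolean logic with a truth table; it says nothing about how Solr's query parser actually handles either form. `DeMorganCheck` is a made-up name.

```java
// Truth-table check of De Morgan's law: NOT (A AND NOT B) == (NOT A) OR B.
// A stands for "myField matches superneat", B for "myOtherField matches somethingElse".
final class DeMorganCheck {
    // The rewritten form: -(A AND -B)
    static boolean negatedConjunction(boolean a, boolean b) {
        return !(a && !b);
    }

    // The assumed original form: -A OR B
    static boolean disjunction(boolean a, boolean b) {
        return !a || b;
    }

    // Compares both forms over all four combinations of inputs.
    static boolean equivalentForAllInputs() {
        boolean[] values = {false, true};
        for (boolean a : values) {
            for (boolean b : values) {
                if (negatedConjunction(a, b) != disjunction(a, b)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("Equivalent for all inputs: " + equivalentForAllInputs());
    }
}
```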

How does Lucene work

January 15, 2023 by Tarik

Lucene is an inverted full-text index. This means that it takes all the documents, splits them into words, and then builds an index for each word. Since the index is an exact string-match, unordered, it can be extremely fast. Hypothetically, an SQL unordered index on a varchar field could be just as fast, and in … Read more
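Here is a minimal sketch of the "split into words, build an index per word" idea in plain Java. It is a toy, not how Lucene stores things: `ToyInvertedIndex` and its methods are made-up names, and the word splitting is a crude regex.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index: maps each word to the sorted set of document ids containing it.
final class ToyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Splits the text into lowercase words and records docId under each word.
    void add(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Exact-match lookup: a single hash lookup, independent of the number of documents.
    SortedSet<Integer> search(String word) {
        return postings.getOrDefault(word.toLowerCase(),
                Collections.<Integer>emptySortedSet());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, "Lucene is a search library");
        index.add(2, "Solr uses Lucene");
        index.add(3, "Sphinx is a search server");
        System.out.println(index.search("lucene"));
        System.out.println(index.search("search"));
    }
}
```

The lookup cost is what makes this fast: finding the documents for a word never scans the documents themselves.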

Comparison of Lucene Analyzers

December 27, 2022 by Tarik

In general, any analyzer in Lucene is tokenizer + stemmer + stop-words filter. Tokenizer splits your text into chunks, and since different analyzers may use different tokenizers, you can get different output token streams, i.e. sequences of chunks of text. For example, KeywordAnalyzer you mentioned doesn’t split the text at all and takes all the … Read more
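The tokenizer + stop-words filter + stemmer chain can be sketched in plain Java without Lucene. Everything here is an assumption made for illustration: the class and method names, the tiny stop-word list, and the "stemmer" that just strips a trailing "s". The second method mimics the keyword-style behaviour of not splitting the text at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy analysis chains; real Lucene analyzers are far more careful at every step.
final class ToyAnalyzers {
    // Hypothetical, deliberately tiny stop-word list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "and"));

    // Tokenizer: split on anything that is not a letter or digit, then lowercase.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase());
            }
        }
        return tokens;
    }

    // Crude stand-in for a stemmer: drops a trailing "s" from tokens longer than 3 chars.
    static String stem(String token) {
        if (token.length() > 3 && token.endsWith("s")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Full chain: tokenize, drop stop words, stem what is left.
    static List<String> analyze(String text) {
        List<String> result = new ArrayList<>();
        for (String token : tokenize(text)) {
            if (!STOP_WORDS.contains(token)) {
                result.add(stem(token));
            }
        }
        return result;
    }

    // Keyword-style analysis: the whole input becomes a single token.
    static List<String> keywordAnalyze(String text) {
        return Collections.singletonList(text);
    }

    public static void main(String[] args) {
        String text = "The Analyzers of Lucene";
        System.out.println(analyze(text));
        System.out.println(keywordAnalyze(text));
    }
}
```

Same input, two very different token streams, which is exactly why the choice of analyzer changes what a search can match.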

How does lucene index documents?

December 26, 2022 by Tarik

In a nutshell, Lucene builds an inverted index using Skip-Lists on disk, and then loads a mapping for the indexed terms into memory using a Finite State Transducer (FST). Note, however, that Lucene does not (necessarily) load all indexed terms to RAM, as described by Michael McCandless, the author of Lucene’s indexing system himself. Note … Read more

How to query SOLR for empty fields?

December 12, 2022 by Tarik

Try this: ?q=-id:["" TO *]
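If you build that query in code, remember it has to be URL-encoded (and the quotes must be plain ASCII double quotes). A small sketch with the JDK only: the class name, the helper names and the base URL in `main` are all made up for illustration, and the field name is just a parameter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Builds the "field has no value" range query suggested above and URL-encodes it.
final class EmptyFieldQuery {
    // Produces e.g. -id:["" TO *] for field "id".
    static String emptyFieldQuery(String field) {
        return "-" + field + ":[\"\" TO *]";
    }

    // Encodes the query so it can be used as the value of the q parameter.
    static String encodedQueryParam(String field) {
        try {
            return "q=" + URLEncoder.encode(emptyFieldQuery(field), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should never happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical host and core name; adjust to your own Solr setup.
        String baseUrl = "http://localhost:8983/solr/mycore/select";
        System.out.println(baseUrl + "?" + encodedQueryParam("id"));
    }
}
```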

Elasticsearch vs Cassandra vs Elasticsearch with Cassandra

December 11, 2022 by Tarik

One of our applications uses data that is stored in both Cassandra and Elasticsearch. We use Cassandra to access those records whenever we can, and have data duplicated into query tables designed to serve specific application-side requests. For searches more liberal than our query tables allow, Elasticsearch does the job nicely. We … Read more

Difference between solr and lucene

October 23, 2022 by Tarik

@darkheir: Lucene and Solr are two different Apache projects that are made to work together; I don’t understand what the aim of each project is. Solr uses Lucene under the hood. Lucene has no clue about the Solr API. Lucene is a powerful search engine framework that lets us add search capability to our application. … Read more

Choosing a stand-alone full-text search server: Sphinx or SOLR? [closed]

October 19, 2022 by Tarik

I’ve been using Solr successfully for almost 2 years now, and have never used Sphinx, so I’m obviously biased. However, I’ll try to keep it objective by quoting the docs or other people. I’ll also take patches to my answer 🙂 Similarities: Both Solr and Sphinx satisfy all of your requirements. They’re fast and designed … Read more