Friday, May 6, 2011

Using stop words with WhitespaceAnalyzer

Lucene's StandardAnalyzer strips the dots from strings and acronyms when indexing them. I want Lucene to retain the dots, so I'm using the WhitespaceAnalyzer class instead.

I can pass my list of stop words to StandardAnalyzer, but how do I pass it to WhitespaceAnalyzer?

Thanks for reading.

From stackoverflow
  • Create your own analyzer by extending WhitespaceAnalyzer and overriding the tokenStream method as follows:

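    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Start from WhitespaceAnalyzer's stream, which splits on whitespace only.
        TokenStream result = super.tokenStream(fieldName, reader);
        // Wrap it so that tokens found in stopSet are dropped.
        result = new StopFilter(result, stopSet);
        return result;
    }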

    Here stopSet is the Set of stop words, which you can populate by adding a constructor to your analyzer that accepts a list of stop words.

    You may also wish to override the reusableTokenStream() method in a similar fashion if you plan to reuse the TokenStream.
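To make the effect of that chain concrete, here is a rough plain-Java model of what "whitespace tokenizer followed by a stop filter" does to a piece of text. It deliberately does not use Lucene: the class name WhitespaceStopSketch and the method analyze are hypothetical, chosen only for illustration, and the real StopFilter works on a token stream rather than a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of the chain; NOT the Lucene API. All names are hypothetical.
public class WhitespaceStopSketch {

    // Split on whitespace only (punctuation such as dots stays in the token),
    // then drop any token found in stopSet. Matching is exact and case-sensitive.
    static List<String> analyze(String text, Set<String> stopSet) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty() && !stopSet.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopSet = new HashSet<>(Arrays.asList("the", "is", "in"));
        // Prints: [The, U.S.A., north]
        System.out.println(analyze("The U.S.A. is in the north", stopSet));
    }
}
```

Note that "U.S.A." keeps its dots, and that the capitalized "The" survives because the match is exact. WhitespaceAnalyzer does not lowercase tokens, so in the real analyzer you would likely need stop words in the same case as the text, or a case-insensitive stop filter if your Lucene version offers one.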

    Steve Chapman: Could you please have a look at my answer and comment:

