
Study notes on index operations in Lucene 3.6.0

 

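Index operations:
Deleting from the index
IndexReader: deleteDocument. In these notes deletion is done through the IndexReader class (Lucene 3.6 also offers IndexWriter.deleteDocuments).
numDocs, maxDoc: a delete at first only marks the document as deleted; the document is physically removed from disk when segments are merged. After a document is deleted, numDocs() reflects the change immediately, while maxDoc() changes only after the merge.
Restoring deleted documents:
the undeleteAll method (it only works before the deleted documents have been merged away)
Updating the index:
delete the old document, then add the new one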
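
The difference between numDocs and maxDoc can be illustrated with a small toy model. This is not Lucene code: the class ToyIndex and all of its methods are made up for this sketch, and it only assumes that a delete is a mark in a bit set and that a merge rewrites the documents without the marked ones.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy model of delete bookkeeping; hypothetical, not part of Lucene. */
class ToyIndex {
    private final List<String> docs = new ArrayList<>();
    private final BitSet deleted = new BitSet();

    void add(String doc) {
        docs.add(doc);
    }

    /** Marks a document as deleted without physically removing it. */
    void delete(int docId) {
        if (docId < 0 || docId >= docs.size()) {
            throw new IndexOutOfBoundsException("no such doc: " + docId);
        }
        deleted.set(docId);
    }

    /** Clears all delete marks; has no effect on documents already merged away. */
    void undeleteAll() {
        deleted.clear();
    }

    /** Number of live documents; drops as soon as a document is marked. */
    int numDocs() {
        return docs.size() - deleted.cardinality();
    }

    /** One greater than the largest doc id; unchanged until merge() runs. */
    int maxDoc() {
        return docs.size();
    }

    /** Rewrites the document list without the marked entries. */
    void merge() {
        List<String> live = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) {
                live.add(docs.get(i));
            }
        }
        docs.clear();
        docs.addAll(live);
        deleted.clear();
    }

    public static void main(String[] args) {
        ToyIndex idx = new ToyIndex();
        idx.add("a");
        idx.add("b");
        idx.add("c");
        idx.delete(1);
        System.out.println(idx.numDocs() + " / " + idx.maxDoc());
        idx.merge();
        System.out.println(idx.numDocs() + " / " + idx.maxDoc());
    }
}
```

In this model an update is simply delete(oldId) followed by add(newDoc), which matches the "delete, then add" rule above.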

Batch operations


Boosting or de-boosting individual fields of a document,
boosting or de-boosting a whole Document:
The following method of Document.java:
  /** Sets a boost factor for hits on any field of this document.  This value
   * will be multiplied into the score of all hits on this document.
   *
   * <p>The default value is 1.0.
   * 
   * <p>Values are multiplied into the value of {@link Fieldable#getBoost()} of
   * each field in this document.  Thus, this method in effect sets a default
   * boost for the fields of this document.
   *
   * @see Fieldable#setBoost(float)
   */
  public void setBoost(float boost) {
    this.boost = boost;
  }

Boosting or de-boosting individual fields:
The following method of AbstractField.java:
  /** Sets the boost factor hits on this field.  This value will be
   * multiplied into the score of all hits on this this field of this
   * document.
   *
   * <p>The boost is multiplied by {@link org.apache.lucene.document.Document#getBoost()} of the document
   * containing this field.  If a document has multiple fields with the same
   * name, all such values are multiplied together.  This product is then
   * used to compute the norm factor for the field.  By
   * default, in the {@link
   * org.apache.lucene.search.Similarity#computeNorm(String,
   * FieldInvertState)} method, the boost value is multiplied
   * by the {@link
   * org.apache.lucene.search.Similarity#lengthNorm(String,
   * int)} and then
   * rounded by {@link org.apache.lucene.search.Similarity#encodeNormValue(float)} before it is stored in the
   * index.  One should attempt to ensure that this product does not overflow
   * the range of that encoding.
   *
   * @see org.apache.lucene.document.Document#setBoost(float)
   * @see org.apache.lucene.search.Similarity#computeNorm(String, FieldInvertState)
   * @see org.apache.lucene.search.Similarity#encodeNormValue(float)
   */
  public void setBoost(float boost) {
    this.boost = boost;
  }


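Lucene stores every value as a string internally. Terms are compared in lexicographic (dictionary) order by default, so when dates or numbers are stored they must be encoded (for example zero-padded to a fixed width) so that their lexicographic order matches their natural order.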
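
A minimal sketch of such an encoding, using only the JDK. The class SortableText and its two methods are hypothetical helpers written for these notes; Lucene ships its own utilities for this purpose (for example DateTools and NumericField), which are not used here.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical helpers that make lexicographic order agree with natural order. */
class SortableText {

    /** Left-pads a non-negative number with zeros to a fixed width. */
    static String padNumber(long value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need a different encoding");
        }
        String digits = Long.toString(value);
        if (digits.length() > width) {
            throw new IllegalArgumentException("value does not fit in " + width + " digits");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = digits.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }

    /** Formats a date in UTC with the most significant unit first. */
    static String formatDate(Date date) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmss", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(date);
    }

    public static void main(String[] args) {
        // Unpadded: "9" sorts after "10" because '9' > '1'.
        System.out.println("9".compareTo("10") > 0);
        // Padded: "00009" sorts before "00010".
        System.out.println(padNumber(9, 5).compareTo(padNumber(10, 5)) < 0);
        System.out.println(formatDate(new Date(0L)));
    }
}
```

The width has to be chosen large enough for the biggest expected value, since every term in the field must use the same width.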


Sorting:
In IndexSearcher.java, passing a Sort object makes the search sort the hits by the given fields
  /** Search implementation with arbitrary sorting.  Finds
   * the top <code>n</code> hits for <code>query</code>, applying
   * <code>filter</code> if non-null, and sorting the hits by the criteria in
   * <code>sort</code>.
   * 
   * <p>NOTE: this does not compute scores by default; use
   * {@link IndexSearcher#setDefaultFieldSortScoring} to
   * enable scoring.
   *
   * @throws BooleanQuery.TooManyClauses
   */
  @Override
  public TopFieldDocs search(Query query, Filter filter, int n,
                             Sort sort) throws IOException {
    return search(createNormalizedWeight(query), filter, n, sort);
  }

Sort.java accepts multiple SortField objects:
  /** Sets the sort to the given criteria. */
  public void setSort(SortField field) {
    this.fields = new SortField[] { field };
  }

  /** Sets the sort to the given criteria in succession. */
  public void setSort(SortField... fields) {
    this.fields = fields;
  }
  
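SortField.java defines the rule for one sort field: it names the field and gives the type to sort by. The possible type values are the static constants defined in SortField.java. The values of a field used for sorting must be convertible to an integer, floating-point, or string object.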
  /** Creates a sort, possibly in reverse, by terms in the given field with the
   * type of term values explicitly given.
   * @param field  Name of field to sort by.  Can be <code>null</code> if
   *               <code>type</code> is SCORE or DOC.
   * @param type   Type of values in the terms.
   * @param reverse True if natural order should be reversed (descending instead of ascending).
   */
  public SortField(String field, int type, boolean reverse) {
    initFieldType(field, type);
    this.reverse = reverse;
  }


 /** Sort using term values as Strings.  Sort values are String and lower
   * values are at the front. */
  public static final int STRING = 3;

  /** Sort using term values as encoded Integers.  Sort values are Integer and
   * lower values are at the front. */
  public static final int INT = 4;

  /** Sort using term values as encoded Floats.  Sort values are Float and
   * lower values are at the front. */
  public static final int FLOAT = 5;

  /** Sort using term values as encoded Longs.  Sort values are Long and
   * lower values are at the front. */
  public static final int LONG = 6;

  /** Sort using term values as encoded Doubles.  Sort values are Double and
   * lower values are at the front. */
  public static final int DOUBLE = 7;
......
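
Why the type matters can be seen without Lucene at all. The following sketch sorts the same term values once as strings and once as integers, with a reverse flag that plays the role of the reverse parameter above. SortTypeDemo and its methods are hypothetical names; the code uses plain JDK comparators, not SortField.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical demo: the same terms ordered as strings versus as integers. */
class SortTypeDemo {

    /** Orders terms lexicographically, like a STRING sort. */
    static List<String> sortAsString(List<String> terms, boolean reverse) {
        List<String> out = new ArrayList<>(terms);
        Comparator<String> cmp = Comparator.naturalOrder();
        out.sort(reverse ? cmp.reversed() : cmp);
        return out;
    }

    /** Parses each term as an int before comparing, like an INT sort. */
    static List<String> sortAsInt(List<String> terms, boolean reverse) {
        List<String> out = new ArrayList<>(terms);
        Comparator<String> cmp = Comparator.comparingInt(Integer::parseInt);
        out.sort(reverse ? cmp.reversed() : cmp);
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("10", "9", "2");
        System.out.println(sortAsString(terms, false)); // [10, 2, 9]
        System.out.println(sortAsInt(terms, false));    // [2, 9, 10]
        System.out.println(sortAsInt(terms, true));     // [10, 9, 2]
    }
}
```

sortAsInt throws NumberFormatException for a term that is not a number, which mirrors the requirement that the field values be convertible to the chosen type.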


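Controlling the indexing process
Tuning indexing performance
1. Added documents are first buffered in memory and flushed to disk once a threshold is reached
  mergeFactor: default 10; controls how often segments are merged and how large they get. With plenty of memory it can be raised to reduce disk I/O, but the more index files there are, the slower searching becomes; with little memory it should be set lower
  maxMergeDocs: default Integer.MAX_VALUE; caps the number of documents in a segment
  minMergeDocs: default 10; the number of documents buffered in RAM before a flush, which governs RAM use during indexing (the setting is called maxBufferedDocs in Lucene 3.x)

When the JVM has a large heap, moderately raising mergeFactor and minMergeDocs can speed up indexing.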
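
The trade-off can be made concrete with a toy simulation. This is a deliberately simplified model, not LogMergePolicy: it assumes that a flush writes one segment of maxBufferedDocs documents and that a merge happens whenever the last mergeFactor segments have the same size. ToyMergeModel and its fields are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical model of buffering and merging; not Lucene's policy. */
class ToyMergeModel {
    final int maxBufferedDocs;
    final int mergeFactor;
    final List<Integer> segments = new ArrayList<>(); // doc count of each on-disk segment
    int buffered = 0;
    int flushes = 0;
    int merges = 0;

    ToyMergeModel(int maxBufferedDocs, int mergeFactor) {
        if (maxBufferedDocs < 1 || mergeFactor < 2) {
            throw new IllegalArgumentException("need maxBufferedDocs >= 1 and mergeFactor >= 2");
        }
        this.maxBufferedDocs = maxBufferedDocs;
        this.mergeFactor = mergeFactor;
    }

    /** Buffers one document and flushes when the buffer is full. */
    void addDocument() {
        buffered++;
        if (buffered == maxBufferedDocs) {
            flush();
        }
    }

    /** Writes the buffered documents as one new segment. */
    void flush() {
        if (buffered == 0) {
            return;
        }
        segments.add(buffered);
        buffered = 0;
        flushes++;
        maybeMerge();
    }

    /** Merges the last mergeFactor segments while they all have the same size. */
    private void maybeMerge() {
        while (segments.size() >= mergeFactor) {
            int n = segments.size();
            int size = segments.get(n - 1);
            boolean sameSize = true;
            for (int i = n - mergeFactor; i < n; i++) {
                if (segments.get(i).intValue() != size) {
                    sameSize = false;
                    break;
                }
            }
            if (!sameSize) {
                return;
            }
            for (int i = 0; i < mergeFactor; i++) {
                segments.remove(segments.size() - 1); // removes by index
            }
            segments.add(size * mergeFactor);
            merges++;
        }
    }

    public static void main(String[] args) {
        for (int factor : new int[] {10, 2}) {
            ToyMergeModel m = new ToyMergeModel(10, factor);
            for (int i = 0; i < 990; i++) {
                m.addDocument();
            }
            System.out.println("mergeFactor=" + factor + " segments=" + m.segments.size()
                    + " merges=" + m.merges);
        }
    }
}
```

In this model, indexing 990 documents with mergeFactor 10 leaves 18 segments after 9 merges, while mergeFactor 2 leaves 4 segments after 95 merges: a larger mergeFactor means less merge I/O during indexing but more segments to search.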

Indexing performance is tuned through the settings of LogMergePolicy.java
public abstract class LogMergePolicy extends MergePolicy {

  /** Defines the allowed range of log(size) for each
   *  level.  A level is computed by taking the max segment
   *  log size, minus LEVEL_LOG_SPAN, and finding all
   *  segments falling within that range. */
  public static final double LEVEL_LOG_SPAN = 0.75;

  /** Default merge factor, which is how many segments are
   *  merged at a time */
  public static final int DEFAULT_MERGE_FACTOR = 10;

  /** Default maximum segment size.  A segment of this size
   *  or larger will never be merged.  @see setMaxMergeDocs */
  public static final int DEFAULT_MAX_MERGE_DOCS = Integer.MAX_VALUE;

  /** Default noCFSRatio.  If a merge's size is >= 10% of
   *  the index, then we disable compound file for it.
   *  @see #setNoCFSRatio */
  public static final double DEFAULT_NO_CFS_RATIO = 0.1;

  protected int mergeFactor = DEFAULT_MERGE_FACTOR;

  protected long minMergeSize;
  protected long maxMergeSize;

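Number of open files:
The maximum number of files Lucene keeps open while indexing is roughly:
(1+mergeFactor)*FilesPerSegment
On Linux, the limit on open file descriptors can be set with ulimit -n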
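
As a worked example of the formula, the sketch below plugs in numbers. The value 8 for the files per segment is an assumption chosen for illustration only; the real number depends on the index format and on whether compound files are used. OpenFileEstimate is a hypothetical helper.

```java
/** Hypothetical helper that evaluates (1 + mergeFactor) * filesPerSegment. */
class OpenFileEstimate {

    static int maxOpenFiles(int mergeFactor, int filesPerSegment) {
        return (1 + mergeFactor) * filesPerSegment;
    }

    public static void main(String[] args) {
        // Assumed: 8 files per segment, default mergeFactor of 10.
        System.out.println(maxOpenFiles(10, 8)); // 88
    }
}
```

If the estimate approaches the limit reported by ulimit -n, either the limit has to be raised or mergeFactor lowered.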

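In-memory index: RAMDirectory.java
1. Documents can first be added to an in-memory index; then, at intervals or after a certain number of documents, the index held in the RAMDirectory is flushed to disk through an IndexWriter. This uses the RAMDirectory as an in-memory buffer for batching index updates
Indexing several indexes in parallel
1. RAMDirectory makes it possible to index in parallel across multiple CPUs, disks, or machines. On a single machine, several threads can each build an index, and the results are then merged and written to disk with IndexWriter's addIndexes(Directory...). Across machines, each machine's index is shipped to a single machine and merged there.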
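
The "build in parallel, then merge" pattern can be sketched with plain JDK collections. In this toy version each batch is turned into a partial term-to-document-ids map on its own thread (standing in for one in-memory index), and the final loop plays the role of addIndexes. ParallelToyIndexer and its methods are hypothetical, and whitespace tokenization is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch of parallel indexing followed by a merge; not Lucene code. */
class ParallelToyIndexer {

    /** Builds a term -> doc ids map for one batch (one "in-memory index"). */
    static Map<String, SortedSet<Integer>> indexBatch(Map<Integer, String> batch) {
        Map<String, SortedSet<Integer>> partial = new HashMap<>();
        for (Map.Entry<Integer, String> doc : batch.entrySet()) {
            for (String term : doc.getValue().toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!term.isEmpty()) {
                    partial.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
                }
            }
        }
        return partial;
    }

    /** Indexes every batch on its own thread, then merges the partial results. */
    static Map<String, SortedSet<Integer>> indexInParallel(List<Map<Integer, String>> batches) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, batches.size()));
        try {
            List<Future<Map<String, SortedSet<Integer>>>> futures = new ArrayList<>();
            for (Map<Integer, String> batch : batches) {
                Callable<Map<String, SortedSet<Integer>>> task = () -> indexBatch(batch);
                futures.add(pool.submit(task));
            }
            // Merge step: done by a single thread, after the workers have finished.
            Map<String, SortedSet<Integer>> merged = new TreeMap<>();
            for (Future<Map<String, SortedSet<Integer>>> future : futures) {
                for (Map.Entry<String, SortedSet<Integer>> e : future.get().entrySet()) {
                    merged.computeIfAbsent(e.getKey(), t -> new TreeSet<>()).addAll(e.getValue());
                }
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The workers never share a data structure, so no synchronization is needed while indexing; only the merge touches the combined result.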


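Limiting field size: maxFieldLength. The LimitTokenCountAnalyzer.java analyzer can be used to cap the number of tokens indexed for a field

IndexWriter's optimize() method optimizes the index. It only speeds up searching; it does not speed up adding documents.

At any given time only one process may modify an index. Lucene uses a file-based lock to prevent the problems that concurrent modification of the index files would cause.

It is best to optimize the index after indexing has finished, when the index will not be modified again for some time.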


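Lucene's rules for concurrent access to an index:
1. Read-only operations can run in parallel.
2. Any number of read-only operations can still run while the index is being modified.
3. Only one index-modifying operation is allowed at a time. In other words, at any moment an index can be opened for modification by only one IndexWriter or one IndexReader.

IndexReader and IndexWriter are thread-safe, with these restrictions:
1. While an IndexReader is deleting a document from the index, an IndexWriter cannot add documents to it.
2. While an IndexWriter is merging or optimizing, an IndexReader cannot delete documents.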

Lucene's index locking uses the files write.lock and commit.lock
The IndexWriter.java API for handling the file lock:
  /**
   * Returns <code>true</code> iff the index in the named directory is
   * currently locked.
   * @param directory the directory to check for a lock
   * @throws IOException if there is a low-level IO error
   */
  public static boolean isLocked(Directory directory) throws IOException {
    return directory.makeLock(WRITE_LOCK_NAME).isLocked();
  }

  /**
   * Forcibly unlocks the index in the named directory.
   * <P>
   * Caution: this should only be used by failure recovery code,
   * when it is known that no other process nor thread is in fact
   * currently accessing this index.
   */
  public static void unlock(Directory directory) throws IOException {
    directory.makeLock(IndexWriter.WRITE_LOCK_NAME).release();
  }
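
The idea behind a lock file can be shown with a small JDK-only sketch: whoever manages to create write.lock holds the lock, and everyone else is refused until the file is removed. ToyWriteLock is a hypothetical class written for these notes; it is not Lucene's Lock implementation and ignores details such as stale locks left behind by a crashed process.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical lock-file sketch; not Lucene's lock implementation. */
class ToyWriteLock {
    private final File lockFile;
    private boolean held = false;

    ToyWriteLock(File indexDir) {
        this.lockFile = new File(indexDir, "write.lock");
    }

    /** Tries to take the lock; returns false if the lock file already exists. */
    boolean obtain() {
        try {
            // createNewFile is atomic: it succeeds only if the file did not exist.
            if (lockFile.createNewFile()) {
                held = true;
                return true;
            }
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True if anyone currently holds the lock. */
    boolean isLocked() {
        return lockFile.exists();
    }

    /** Releases the lock, but only if this object obtained it. */
    void release() {
        if (held && lockFile.delete()) {
            held = false;
        }
    }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"), "toy-lock-" + System.nanoTime());
        if (!dir.mkdirs()) {
            throw new IllegalStateException("could not create " + dir);
        }
        ToyWriteLock first = new ToyWriteLock(dir);
        ToyWriteLock second = new ToyWriteLock(dir);
        System.out.println(first.obtain());  // true
        System.out.println(second.obtain()); // false
        first.release();
        System.out.println(second.obtain()); // true
        second.release();
        dir.delete();
    }
}
```

A forced unlock, like IndexWriter.unlock above, would correspond to deleting the lock file without being its holder, which is why it is only safe when no other process is using the index.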

Setting IndexWriter's infoStream makes it print diagnostic information about index operations.
 