







New inverted index storage scheme for Chinese search engine


MA Jian1, ZHANG Taihong1,2*, CHEN Yanhong1


1. College of Computer and Information Engineering, Xinjiang Agricultural University, Urumqi Xinjiang 830052, China;

2. College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China



After analyzing the inverted index structure and access mode of the open-source search engine ASPSeek, this paper gives an abstract definition of "inverted index". ASPSeek has two weaknesses: its inverted index is difficult to update, and it accesses the index directly through the operating system's file cache, which hurts efficiency. To address both, and taking into account the characteristics of a collection of 1.25 million Chinese agricultural Web pages, the paper proposes a new blocking inverted index storage scheme with a buffer mechanism based on the CLOCK replacement algorithm. The experimental results show that the new scheme is more efficient than ASPSeek whether the buffer system is disabled or enabled. With the buffer system enabled and a test set of either 160 thousand Chinese terms or 50 thousand high-frequency Chinese terms, the retrieval time of the new scheme approached a constant after one million accesses. Even with all 827,309 terms as the test set, the retrieval time of the new scheme began to converge after two million accesses.


英文关键词Key words:

inverted index; search engine; fulltext retrieval; blocking structure; retrieval efficiency