Abstract: In recent years there has been large growth in web content on the internet. The internet does not provide any standard mechanism for verifying web content before it is hosted on web servers, which increases the number of near and exact duplicates arriving from heterogeneous sources. These duplicates can arise either intentionally or accidentally. The problem of finding near-duplicate web pages has been a subject of research in the database and web-search communities for some years. Since most prevailing text-mining methods adopt term-based approaches, they all suffer from the problems of word synonymy and a large number of comparisons. In this paper, we address the detection of near and exact duplicate web pages by using a term document weighting scheme and sentence-level features, and by handling synonym detection. The existence of such near and exact duplicate web pages causes problems ranging from wasted network bandwidth and storage cost to reduced search-engine performance from indexing duplicated content and increased load on remote hosts.
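The abstract does not specify the exact term document weighting scheme used; as a minimal illustrative sketch, the following assumes a standard TF-IDF weighting with cosine similarity, where page pairs whose similarity exceeds a threshold would be flagged as near duplicates. The tokenization, weighting formula, and threshold here are assumptions for illustration, not the paper's method.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Weight = term frequency * inverse document frequency.
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

if __name__ == "__main__":
    pages = [
        "the cat sat on the mat",
        "the cat sat on a mat",           # near duplicate of the first page
        "different words entirely here",  # unrelated page
    ]
    vecs = tfidf_vectors(pages)
    # Pages 0 and 1 score high; pages 0 and 2 share no terms and score 0.
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

A full system would pair such term weighting with sentence-level features and a synonym dictionary (as the abstract proposes) so that pages using different but synonymous wording are still detected as near duplicates.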