Menshchikov A.A., Komarova A.V., Gatchin Y., Polev A.V. —
Development of a system for automatic categorization of web-page
// Software systems and computational methods. – 2016. – № 4.
– P. 383 - 391.
DOI: 10.7256/2454-0714.2016.4.21438
Abstract: This article reviews the problems of automatic processing of web content. Because information on the global network becomes obsolete very quickly, the prompt extraction of relevant data from the Internet is an increasingly urgent problem. The research focuses on web resources containing text that is not adapted to automated processing. The subject of the research is a set of software tools and methods. Particular attention is paid to the categorization of ads placed on specialized websites. The authors also review practical aspects of developing a universal architecture for information-gathering systems. The study relied on an analytical review of the main principles of developing systems for automated information gathering and natural language analysis. Methods of synthesis and analysis were used to obtain practice-oriented results. The authors' particular contribution is the development of an automated system for collecting, processing, and classifying the information contained on a website. The novelty of the research lies in a new approach to this problem that takes into account the semantics and structure characteristic of specific sites. The main conclusion of the study is that the classification method is applicable and effective for solving this problem.
Menshchikov A.A., Gatchin Y. —
Detection methods for web resources automated data collection
// Cybernetics and programming. – 2015. – № 5.
– P. 136 - 157.
DOI: 10.7256/2306-4196.2015.5.16589
URL: https://en.e-notabene.ru/kp/article_16589.html
Abstract: The article deals with the problem of automated data collection from web resources. The authors present a classification of detection methods that takes modern approaches into account. The article analyzes existing methods for detecting and countering web robots. The authors study the possibilities and limitations of combining methods. To date, there is no open web robot detection system suitable for use in real conditions. Developing an integrated system that combines a variety of methods, techniques, and approaches is therefore an urgent task. To solve this problem, the authors developed a software prototype of such a detection system. The system was tested on real data. The theoretical significance of the study lies in advancing this line of work in the domestic segment by building a web robot detection system based on the latest methods and improving on global best practices. The applied significance lies in the creation of a database for the development of in-demand and promising software.