Classification of Hindi News Articles Using Machine Learning Models with Challenges and Solutions

EasyChair Preprint 15739
12 pages • Date: January 20, 2025

Abstract

In today's digitized world, a large volume of Hindi text documents is generated and shared across many sectors, including public organizations, news portals, government webpages, and commercial enterprises. These news documents need to be classified into distinct classes such as business, health, science, politics, and sports. Text classification is essential because of the overwhelming amount of unorganized data that exists. Hindi news agencies still rely on manual sorting because no dedicated Hindi text classifier is available. While English text classification is well established and has ample resources, Indian languages, particularly Hindi, lack standardized benchmarks. Hindi, one of the most widely used languages in the world, still faces challenges in text processing. Despite progress in text summarization, keyword extraction, and information retrieval, classifiers that sort Hindi news articles into predefined categories are still lacking in several areas. This paper addresses this gap by preprocessing a collection of standard Hindi news articles at four levels: word, sentence, paragraph, and document. The paper also explores feature extraction techniques and applies machine learning classifiers to categorize the articles. Classifying Hindi news articles presents unique difficulties because of the language's intricate letter combinations, conjuncts, sentence structures, and multi-sense words.

Keyphrases: NLP, machine learning, text classifier
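The pipeline the abstract outlines (preprocessing, feature extraction, classification) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the regex-based tokenizer, the bag-of-words features, the choice of multinomial Naive Bayes with Laplace smoothing, the names `tokenize`, `split_sentences`, and `NaiveBayes`, and the four toy training sentences. The abstract does not say which features or classifiers the authors actually use.

```python
import math
import re
from collections import Counter, defaultdict

# Assumption: a token is a run of Devanagari code points (letters, matras,
# virama, digits). The danda (U+0964) and double danda (U+0965) are excluded
# so they act as sentence terminators rather than as parts of words.
TOKEN_RE = re.compile(r"[\u0900-\u0963\u0966-\u097F]+")
SENTENCE_END_RE = re.compile(r"[\u0964\u0965?!]+")


def split_sentences(text):
    """Sentence-level preprocessing: split on danda, '?' and '!'."""
    return [s.strip() for s in SENTENCE_END_RE.split(text) if s.strip()]


def tokenize(text):
    """Word-level preprocessing: keep conjuncts and matras inside one token."""
    return TOKEN_RE.findall(text)


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # Laplace smoothing constant
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, docs, labels):
        for doc, label in zip(docs, labels):
            tokens = tokenize(doc)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, doc):
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(doc) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)  # log prior
            for t in tokens:
                score += math.log((counts[t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, hand-written training data (hypothetical, not from the paper's corpus).
train_docs = [
    "भारत ने क्रिकेट मैच जीता। टीम ने शानदार खेल दिखाया।",
    "खिलाड़ी ने मैच में शतक बनाया।",
    "शेयर बाजार में तेजी आई। निवेशकों को लाभ हुआ।",
    "कंपनी ने बाजार में नया निवेश किया।",
]
train_labels = ["sports", "sports", "business", "business"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("टीम ने मैच जीता"))
print(model.predict("निवेशकों को बाजार में लाभ"))
```

A real system would need far more data and would likely add steps the sketch omits, such as stop-word removal, normalization of zero-width joiners in conjuncts, and weighted features such as TF-IDF.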