Towards understanding news plagiarism: theoretical and experimental analysis
Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
In this paper we aim to improve the understanding of the news manipulation using a mathematical formalism. First, we develop a software that has the ability to crawl various news domains, to collect the news and to carry out similarity search between the collected news. Then, we create a mathematical model that uses temporal graphs based on the collected data. We also designed a random data generator for the above-mentioned model which we validate using four statistical tests: Kolmogorov-Smirnov test, Fisher's Exact test, Mann-Whitney U test and permutation test. We ran each of the 4 statistical tests on 100 randomly generated data counting how many times p-value is less than 0.05 or greater. On average, on at least 75% of instances we obtained a p-value greater or equal than 0.05. Then, our main result of the paper is that we formulate several combinatorial optimization problems and explain their relation to news plagiarism. We theoretically analyze each of these problems and prove NP-hardness and hardness-of approximation results. These results show that, unless P=NP, polynomial time exact algorithms for these problems do not exist. Finally, given the NP-hardness results, we formulate our problems as integer programs and use the state of the art solver, Gurobi, to solve them both on collected data and random generated data. Using multiple tests, we observe that a bugdet of around 40-60% of the total cost achieves an influence of at least 75% of the maximum value.