Generalization Estimation in Process Mining: The Impact of Event Data Quality

Abstract

The problem of process discovery in process mining involves constructing process models from event data to describe real-world systems that generated the data, allowing those systems to be studied and improved. One quality criterion of discovered models is generalization, which assesses how well a model describes both seen and unseen processes of the system. When the system itself is unknown, event logs must be used to determine model-system relationships, such as generalization. In this work, we investigate event log representativeness, which measures how well an event log reflects its generative system, exploring the extent to which representativeness affects the accuracy of generalization estimation. Our research employs bootstrap generalization as a method for estimating model generalization, complemented by a novel approximation technique for log representativeness that correlates strongly with previous measures. This approximation is validated through a ground truth analysis, demonstrating its robustness and accuracy, and is also applicable to various event data types. Our extensive experiments show that log representativeness substantially affects generalization estimation accuracy: highly representative logs can directly represent the system for measuring model generalization, while less representative logs require additional estimations. Interestingly, applying estimations to highly representative logs can lead to further deviations from the ground truth. Our findings offer insights into the relationship between log representativeness and generalization estimation, identifying specific regions where system estimations are most beneficial. By demonstrating the consistency of our findings across multiple systems and scenarios, we provide a more robust framework for understanding this relationship. A key theoretical contribution emerges from our investigation of bootstrap generalization. Under reasonable assumptions about process discovery techniques, we show that the bootstrap method can yield more precise estimates of model-system precision than of model-system recall, guiding future research and practical applications of the bootstrap generalization method.
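To make the idea of bootstrapping over an event log concrete, the following is a minimal sketch under strong simplifying assumptions, not the estimator used in the article. It assumes a log is a list of traces (tuples of activity labels) and that a model can be represented by a finite set of accepted traces. The function name `bootstrap_estimates` and the two simple trace-level measures (a recall-like share of sampled traces accepted by the model, and a precision-like share of model traces observed in the sample) are illustrative choices introduced here.

```python
import random
from statistics import mean


def bootstrap_estimates(log, model_traces, n_samples=200, seed=0):
    """Average simple recall-like and precision-like scores over bootstrap samples.

    log          -- list of traces, each a tuple of activity labels (assumed format)
    model_traces -- finite collection of traces the model accepts (assumed format)
    """
    if not log or not model_traces:
        raise ValueError("log and model_traces must be non-empty")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    model = set(model_traces)
    recalls, precisions = [], []

    for _ in range(n_samples):
        # Resample the log with replacement, keeping its original size.
        sample = rng.choices(log, k=len(log))
        observed = set(sample)

        # Recall-like score: share of sampled traces the model accepts.
        recalls.append(sum(trace in model for trace in sample) / len(sample))

        # Precision-like score: share of model traces seen in the sample.
        precisions.append(len(model & observed) / len(model))

    return mean(recalls), mean(precisions)


# Hypothetical toy data: one frequent and one rare variant.
toy_log = [("a", "b", "c")] * 3 + [("a", "c", "b")]
toy_model = {("a", "b", "c"), ("a", "c", "b")}
recall_hat, precision_hat = bootstrap_estimates(toy_log, toy_model)
```

In this toy setting every resampled trace is accepted by the model, so the recall-like score stays at its maximum, while the precision-like score drops whenever the rare variant is missing from a resample. This only illustrates how resampling exposes sensitivity to how well the log covers the system's behavior.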
