Generalization Estimation in Process Mining: The Impact of Event Data Quality

Anandi Karunaratne
Artem Polyvyanyy
Alistair Moffat

Abstract

The problem of process discovery in process mining involves constructing process models from event data to describe the real-world systems that generated the data, allowing those systems to be studied and improved. One quality criterion of discovered models is generalization, which assesses how well a model describes both seen and unseen processes of the system. When the system itself is unknown, event logs must be used to determine model-system relationships, such as generalization. In this work, we investigate event log representativeness, which measures how well an event log reflects its generative system, exploring the extent to which representativeness affects the accuracy of generalization estimation. Our research employs bootstrap generalization as a method for estimating model generalization, complemented by a novel approximation technique for log representativeness that correlates strongly with previous measures. This approximation is validated through a ground truth analysis, demonstrating its robustness and accuracy, and is also applicable to various event data types. Our extensive experiments show that log representativeness substantially affects generalization estimation accuracy: highly representative logs can directly represent the system for measuring model generalization, while less representative logs require additional estimations. Interestingly, applying estimations to highly representative logs can lead to further deviations from the ground truth. Our findings offer insights into the relationship between log representativeness and generalization estimation, identifying specific regions where system estimations are most beneficial. By demonstrating the consistency of our findings across multiple systems and scenarios, we provide a more robust framework for understanding this relationship. A key theoretical contribution emerges from our investigation of bootstrap generalization: under reasonable assumptions about process discovery techniques, we show that the bootstrap method can yield more precise estimates of model-system precision than of model-system recall, guiding future research and practical applications of the bootstrap generalization method.
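
As a rough illustration of the bootstrap idea mentioned in the abstract, the sketch below resamples an event log with replacement and averages trace-level precision and recall of a model over the resamples. It is a deliberately simplified stand-in, not the authors' method: the function name `bootstrap_estimates`, the representation of a model as a finite set of allowed traces, the trace-set definitions of precision and recall, and the toy log are all assumptions made for this example.

```python
import random
from statistics import mean


def bootstrap_estimates(log, model_traces, n_samples=200, seed=0):
    """Average trace-level precision and recall over bootstrap resamples of a log.

    Assumptions (hypothetical, for illustration only): `log` is a non-empty list
    of traces (tuples of activity labels) and `model_traces` is a non-empty
    finite set of traces that the model allows.
    """
    rng = random.Random(seed)
    precisions, recalls = [], []
    for _ in range(n_samples):
        # Resample the log with replacement; each resample stands in for
        # behavior of the unknown system.
        sample = rng.choices(log, k=len(log))
        distinct = set(sample)
        shared = distinct & model_traces
        # Recall: share of the resample's distinct traces that the model allows.
        recalls.append(len(shared) / len(distinct))
        # Precision: share of the model's traces that occur in the resample.
        precisions.append(len(shared) / len(model_traces))
    return mean(precisions), mean(recalls)


# Toy event log: three distinct traces with different frequencies.
log = [("a", "b", "c")] * 6 + [("a", "c", "b")] * 3 + [("a", "b", "b", "c")]
# Toy model: allows two observed traces and one trace never seen in the log.
model = {("a", "b", "c"), ("a", "c", "b"), ("a", "c")}

precision, recall = bootstrap_estimates(log, model)
print(f"precision ~ {precision:.2f}, recall ~ {recall:.2f}")
```

A less representative log would leave more system behavior unobserved, so estimates of this kind would depend more heavily on how the resamples are generated; the plain resampling used here cannot produce traces that are absent from the log.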

Version published to 10.21203/rs.3.rs-5634303/v1 on Research Square
Apr 15, 2025

Related articles

Advancing Object-Centric Process Mining with Multi-Dimensional Data Operations

This article has 3 authors:
1. Shahrzad Khayatbashi
2. Najmeh Miri
3. Amin Jalali
This article has no evaluations. Latest version: Jan 21, 2026

Time-Invariant Learning and History-Based Inference for Time-Varying Survival Models in Predictive Maintenance

This article has 4 authors:
1. Iulii Vasilev
2. Mark Goverdovskiy
3. Mikhail Petrovskiy
4. Igor Mashechkin
This article has no evaluations. Latest version: Feb 3, 2026

A Discovery Technique for Expressive Yet Sound Process Models

This article has 3 authors:
1. Humam Kourani
2. Gyunam Park
3. Wil M.P. van der Aalst
This article has no evaluations. Latest version: Jan 12, 2026
