A Kubernetes-Based AI Framework for Scalable PII Detection and Redaction in Application Logs

Abstract

The growing adoption of microservice architectures has made application log management harder, particularly because personally identifiable information (PII) proliferates across logs. Logs are crucial for monitoring, debugging, and compliance, but the distributed nature of microservices, coupled with stringent data privacy regulations such as GDPR, CCPA, and HIPAA, calls for robust PII detection and redaction mechanisms. Traditional methods, such as regular expressions, are inadequate for the volume and complexity of log data in modern IT environments. This research investigates the application of Natural Language Processing (NLP), specifically transformer-based models, to automated PII detection and redaction within Kubernetes-based microservices. We compare several NLP techniques on the AI4Privacy dataset: TF-IDF, spaCy's pre-trained model, a CNN-LSTM architecture, and a specialized pre-trained PII detection model (iiiorg/piiranha-v1). Our evaluation considers accuracy, precision, recall, F1-score, resource utilization, and runtime. The results expose the trade-offs between accuracy, computational cost, and contextual understanding, and show that specialized pre-trained models balance these factors best in Kubernetes deployments. Specifically, deep learning models such as CNN-LSTM achieve high accuracy but are resource-intensive, whereas TF-IDF is efficient but lacks the contextual awareness needed for robust PII detection. Our findings indicate that specialized pre-trained models offer a compelling solution for practical PII redaction in resource-constrained Kubernetes environments.
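
To illustrate why the abstract calls regular expressions inadequate, the following is a minimal sketch of a regex-only redactor. The patterns, the `redact` helper, and the sample log line are hypothetical and are not taken from the article. The sketch shows that format-bound PII such as an email or IP address is caught, while a personal name, which has no fixed format, is not.

```python
import re

# Hypothetical patterns for illustration only; not the article's rules.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def redact(line: str) -> str:
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        line = pattern.sub(f"[{label}]", line)
    return line


# Made-up log line: the name has no fixed format, so no pattern matches it.
log_line = "user=Jane Doe email=jane.doe@example.com ip=10.0.0.7 login ok"
redacted = redact(log_line)
print(redacted)
```

In this sketch the name survives redaction because nothing about its surface form marks it as PII. That is the kind of case where the context-aware models compared in the article would be expected to help.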
