Identifying Open-Source Threat Detection Resources on GitHub: A Scalable Machine Learning Approach

Manuel Kern
Max Landauer
Florian Skopik
Edgar Weippl

Abstract

Many businesses rely on open-source software modules to build their technology stacks. However, those who lack domain expertise may struggle to find the right software due to unfamiliar terminology and specific names. As a consequence, search engines and other platforms often cannot be utilized effectively to discover appropriate solutions. There is thus a need for a more applicable approach to assist non-domain experts in navigating the vastness of available repositories, enabling them to efficiently discover and select the right solution for their business needs. To overcome these gaps, we introduce an approach that supports finding unpopular yet important open-source software repositories on GitHub using advanced machine learning techniques. For this purpose, we propose novel strategies for information gathering and data pre-processing that resolve scalability issues of existing solutions and enable clustering of repositories even when topics, descriptions, or repository names are unclear or absent. For our evaluation, we gathered a dataset of 221,971 repositories using GitHub search and keywords related to incident detection. We show that our approach is able to separate threat detection repositories from others with an F1-score of 0.93.

Version published to 10.1007/s10207-025-01069-1
Jun 17, 2025
Version published to 10.21203/rs.3.rs-5664665/v1 on Research Square
Dec 20, 2024

Related articles

ActivityRDI: A Centralized Solution Framework for Activity Retrieval and Detection Intelligence based on Knowledge Graph, Large Language Model and Imbalanced Learning

This article has 2 authors:
1. Lili Zhang
2. Quanyan Zhu
This article has no evaluations
Latest version Jan 19, 2026

DiLLaB: Discussion Labeling with LLMs for Building Datasets

This article has 6 authors:
1. Ludimila Gonçalves
2. Márcia Lima
3. André Carvalho
4. Walter Nakamura
5. Igor Steinmacher
6. Tayana Conte
This article has no evaluations
Latest version Jan 28, 2026

BH25DE report: On the path to machine-actionable training materials

This article has 11 authors:
1. Phil Reed
2. Nick Juty
3. Petra Steiner
4. Leyla Jael Castro
5. Charles Tapley Hoyt
6. Oliver Knodel
7. Martin Voigt
8. Roman Baum
9. Dilfuza Djamalova
10. Jacobo Miranda
11. Alban Gaignard
This article has no evaluations
Latest version Jan 26, 2026
