A Context-Aware Hybrid Search Framework Integrating LLM Tagging and GPU Acceleration for Enhanced E-Commerce Product Discovery

Tsung-Yin Ou
Shashika Dharmasena
Areoll Wu


Abstract

Traditional keyword-based retail search engines often struggle to deliver relevant results because they rely on exact matches, which limits user experience and product discovery. As consumer demands grow, optimizing search performance, particularly throughput and latency, has become crucial. To address these challenges, this study presents a novel hybrid search framework centered on a context-aware large language model (LLM) tag generation mechanism tailored to Traditional Chinese and specific market nuances (e.g., Taiwanese brands and trends). This core component is integrated with dense embedding and reranker models, and the entire system leverages GPU-accelerated technologies, such as RAPIDS cuDF for efficient large-scale data handling and the NVIDIA Triton Inference Server for optimized real-time inference, including dense embedding caching and dynamic batching. Results demonstrate the relevance and performance efficiency of the framework. Incorporating context-aware LLM tags dramatically improves search relevance: it raises the intent-aligned conversion rate from 5.09% to 98.16% and enables the retrieval of all relevant items in the specific tests in which a keyword search failed. Moreover, the performance optimization yields substantial gains: RAPIDS Dask-cuDF reduces the data-processing latency by 85.5% compared with CPU-based Pandas, Triton Inference Server improves the model serving throughput by nearly 800% and reduces the latency by 97% versus baseline CUDA execution, Redis caching drastically shortens the cached embedding retrieval time, and the LLM component achieves a throughput of 178.33 tokens/sec (benchmarked on Llama-3.1-8B via NIMS). The optimized search framework has been successfully deployed on the 711go e-commerce platform. The deployment resulted in a 50% increase in customer dwell time and a 40% increase in sales over the 90-day verification period, confirming the ability of the system to considerably enhance the consumer browsing experience and deliver tangible business value through improved search functionality.
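
To make the ideas in the abstract concrete, the following is a minimal sketch of how tag matching, dense similarity, and embedding caching could fit together. It is not the authors' implementation. Everything named here is an assumption for illustration: `toy_embed` is a hashed bag-of-words stand-in for a real dense embedding model, `CachedEmbedder` uses an in-process dictionary where the paper uses Redis, `hybrid_search` and its `alpha` weight are hypothetical, the sample products and tags are invented, and the reranker and Triton serving stages are omitted.

```python
import hashlib
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def toy_embed(text: str, dim: int = 16) -> Vector:
    """Hypothetical stand-in for a dense embedding model:
    hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class CachedEmbedder:
    """Memoises embeddings by text. A dict stands in for Redis here (assumption)."""

    def __init__(self, embed_fn: Callable[[str], Vector]) -> None:
        self._embed_fn = embed_fn
        self._cache: Dict[str, Vector] = {}
        self.misses = 0  # number of times the underlying model was called

    def __call__(self, text: str) -> Vector:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._embed_fn(text)
        return self._cache[key]


def hybrid_search(
    query: str,
    products: List[dict],
    embed: Callable[[str], Vector],
    alpha: float = 0.7,
) -> List[Tuple[float, str]]:
    """Rank products by alpha * tag overlap + (1 - alpha) * cosine similarity."""
    q_tokens = set(query.lower().split())
    q_vec = embed(query)
    scored = []
    for product in products:
        tags = {t.lower() for t in product["tags"]}
        # Fraction of query tokens that appear among the product's tags.
        tag_score = len(q_tokens & tags) / len(q_tokens) if q_tokens else 0.0
        # Vectors are unit length, so the dot product is the cosine similarity.
        p_vec = embed(product["title"])
        dense_score = sum(a * b for a, b in zip(q_vec, p_vec))
        score = alpha * tag_score + (1 - alpha) * dense_score
        scored.append((score, product["title"]))
    return sorted(scored, reverse=True)


# Invented sample catalogue; the tags imitate what an LLM tagger might emit.
PRODUCTS = [
    {"title": "Pineapple cake assortment", "tags": ["mid-autumn", "gift", "pastry"]},
    {"title": "Stainless steel thermos", "tags": ["drinkware", "travel"]},
]
```

Calling `hybrid_search("mid-autumn gift", PRODUCTS, CachedEmbedder(toy_embed))` ranks the pineapple cake first even though neither query word appears in its title, which is the kind of case where a title-only keyword match would return nothing. Reusing the same `CachedEmbedder` for a repeated query triggers no further calls to the embedding function, which is the effect the caching layer is meant to provide.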

Version published to 10.21203/rs.3.rs-6999016/v1 on Research Square
Nov 21, 2025

Related articles

Best Practices for Using Large Language Models at Scale

This article has 5 authors:
1. Bhargavee Kannikanti
2. Arjun Coimbatore Nagarasan
3. Alberto Rosas
4. Sriram Kothandaraman
5. Sravan Kumar Kannuri
This article has no evaluations
Latest version Dec 12, 2025

Parallel Architectures for Large-Scale Document Processing: Integrating OCR and RAG Pipelines

This article has 4 authors:
1. Alejandro Jaime
2. Veronica Gil-Costa
3. Marcelo Errecalde
4. Leticia Cagnina
This article has no evaluations
Latest version Jan 19, 2026

Predictive-LoRA: A Proactive and Fragmentation-Aware Serverless Inference System for LLMs

This article has 5 authors:
1. Yinan Ni
2. Xiao Yang
3. Zhimin Qiu
4. Chen Wang
5. Tingzhou Yuan
This article has no evaluations
Latest version Dec 24, 2025
