Decomposable Reward Modeling and Realistic Environment Design for Reinforcement Learning-Based Forex Trading

Abstract

Applying reinforcement learning (RL) to foreign exchange (Forex) trading remains challenging because realistic environments, well-defined reward functions, and expressive action spaces are all required simultaneously. Many existing studies simplify these elements through basic simulators, single scalar rewards, and limited action representations, making learned policies difficult to diagnose and limiting practical relevance. This paper introduces a modular RL framework for Forex trading that addresses these limitations. The framework comprises three components. First, a friction-aware execution engine enforces strict anti-lookahead semantics—observations are taken at close_t, orders are executed at open_{t+1}, and positions are marked-to-market at close_{t+1}—while incorporating realistic transaction costs, including spread, commission, slippage, rollover financing, and margin-triggered liquidation. Second, a decomposable 11-component reward architecture uses fixed, pre-specified weights with per-step diagnostic logging, facilitating systematic ablation and component-wise attribution analysis. Third, a 10-action discrete interface with legal-action masking defines explicit trading primitives (scaling, reduction, closure, and reversal) while enforcing margin-aware feasibility constraints during both training and evaluation. Three controlled experiment families are evaluated on EURUSD to analyze learning dynamics rather than generalization. Within this controlled setting, reward component interactions exhibit strongly non-monotonic effects—adding penalty terms does not consistently improve outcomes—with the full-reward configuration achieving the highest terminal training Sharpe (0.765) and cumulative return (57.09%). In the action-space comparison, the extended 10-action interface improves cumulative return while increasing turnover and reducing Sharpe relative to a conservative 3-action adapter, indicating a return–activity trade-off under a fixed training budget. In scaling experiments, all scaling-enabled variants reduce drawdown relative to no scaling, and the combined configuration achieves the strongest endpoint return.
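The abstract names three mechanisms: anti-lookahead bar timing, a decomposed reward with fixed weights and per-step logging, and margin-aware legal-action masking. The sketch below illustrates, in a minimal self-contained form, how such pieces might fit together. It is not the paper's implementation: the synthetic price series, the three-component weight dictionary (the paper uses 11 components), the four-action mask (the paper uses 10 actions), and all function names are illustrative assumptions.

```python
import numpy as np

# Synthetic EURUSD-like bar data (illustrative only).
rng = np.random.default_rng(0)
n = 500
open_px = 1.10 + 0.0005 * rng.standard_normal(n).cumsum()
close_px = open_px + 0.0003 * rng.standard_normal(n)

# Hypothetical fixed weights for a decomposed reward. The paper specifies
# 11 pre-specified components; only a few illustrative ones appear here.
WEIGHTS = {"pnl": 1.0, "spread": -1.0, "slippage": -1.0}

def legal_action_mask(position, free_margin, unit_margin, max_units=4):
    """Margin-aware feasibility mask over a toy discrete action set.

    Actions: 0 = hold, 1 = buy one unit, 2 = sell one unit, 3 = close.
    Opening actions are legal only while free margin covers another unit;
    'close' is legal only when a position exists.
    """
    can_open = free_margin >= unit_margin and abs(position) < max_units
    return np.array([True, can_open, can_open, position != 0.0])

def step(t, position, action_units, spread=0.0001, slip=0.00005):
    """Advance one bar under strict anti-lookahead semantics.

    The agent's observation used only data up to close_px[t]; the order is
    filled at open_px[t + 1]; the position is then marked-to-market at
    close_px[t + 1]. Reward components are returned separately so each
    term's per-step contribution can be logged and ablated.
    """
    fill = open_px[t + 1]                          # execution at open_{t+1}
    held_pnl = position * (close_px[t + 1] - close_px[t])
    trade_pnl = action_units * (close_px[t + 1] - fill)
    comps = {
        "pnl": held_pnl + trade_pnl,               # mark-to-market at close_{t+1}
        "spread": abs(action_units) * spread / 2,  # half-spread on the fill
        "slippage": abs(action_units) * slip,
    }
    reward = sum(WEIGHTS[k] * v for k, v in comps.items())
    return position + action_units, reward, comps

position, component_log = 0.0, []
for t in range(n - 2):
    mask = legal_action_mask(position, free_margin=1_000.0, unit_margin=100.0)
    units = 1.0 if (t % 100 == 0 and mask[1]) else 0.0
    position, reward, comps = step(t, position, units)
    component_log.append(comps)                    # per-step diagnostic logging
```

Returning the component dictionary alongside the scalar reward is what makes component-wise attribution possible: the training loop can log the components at every step and ablate any term by zeroing its weight.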
