Decomposable Reward Modeling and Realistic Environment Design for Reinforcement Learning-Based Forex Trading

Abstract

Applying reinforcement learning (RL) to foreign exchange (Forex) trading remains challenging because realistic environments, well-defined reward functions, and expressive action spaces are all required simultaneously. Many existing studies simplify these elements through basic simulators, single scalar rewards, and limited action representations, making learned policies difficult to diagnose and limiting practical relevance. This paper introduces a modular RL framework for Forex trading that addresses these limitations. The framework comprises three components. First, a friction-aware execution engine enforces strict anti-lookahead semantics—observations are taken at close_t, orders are executed at open_{t+1}, and positions are marked-to-market at close_{t+1}—while incorporating realistic transaction costs, including spread, commission, slippage, rollover financing, and margin-triggered liquidation. Second, a decomposable 11-component reward architecture uses fixed, pre-specified weights with per-step diagnostic logging, facilitating systematic ablation and component-wise attribution analysis. Third, a 10-action discrete interface with legal-action masking defines explicit trading primitives (scaling, reduction, closure, and reversal) while enforcing margin-aware feasibility constraints during both training and evaluation. Three controlled experiment families are evaluated on EURUSD to analyze learning dynamics rather than generalization. Within this controlled setting, reward component interactions exhibit strongly non-monotonic effects—adding penalty terms does not consistently improve outcomes—with the full-reward configuration achieving the highest terminal training Sharpe (0.765) and cumulative return (57.09%). In the action-space comparison, the extended 10-action interface improves cumulative return while increasing turnover and reducing Sharpe relative to a conservative 3-action adapter, indicating a return–activity trade-off under a fixed training budget. In scaling experiments, all scaling-enabled variants reduce drawdown relative to no scaling, and the combined configuration achieves the strongest endpoint return.
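The abstract names three mechanisms: anti-lookahead bar timing, a decomposed reward with fixed weights and per-step logging, and margin-aware legal-action masking. The sketch below illustrates, in a minimal self-contained form, how such pieces might fit together. It is not the paper's implementation: the synthetic price series, the three-component weight dictionary (the paper uses 11 components), the four-action mask (the paper uses 10 actions), and all function names are illustrative assumptions.

```python
import numpy as np

# Synthetic EURUSD-like bar data (illustrative only).
rng = np.random.default_rng(0)
n = 500
open_px = 1.10 + 0.0005 * rng.standard_normal(n).cumsum()
close_px = open_px + 0.0003 * rng.standard_normal(n)

# Hypothetical fixed weights for a decomposed reward. The paper specifies
# 11 pre-specified components; only a few illustrative ones appear here.
WEIGHTS = {"pnl": 1.0, "spread": -1.0, "slippage": -1.0}

def legal_action_mask(position, free_margin, unit_margin, max_units=4):
    """Margin-aware feasibility mask over a toy discrete action set.

    Actions: 0 = hold, 1 = buy one unit, 2 = sell one unit, 3 = close.
    Opening actions are legal only while free margin covers another unit;
    'close' is legal only when a position exists.
    """
    can_open = free_margin >= unit_margin and abs(position) < max_units
    return np.array([True, can_open, can_open, position != 0.0])

def step(t, position, action_units, spread=0.0001, slip=0.00005):
    """Advance one bar under strict anti-lookahead semantics.

    The agent's observation used only data up to close_px[t]; the order is
    filled at open_px[t + 1]; the position is then marked-to-market at
    close_px[t + 1]. Reward components are returned separately so each
    term's per-step contribution can be logged and ablated.
    """
    fill = open_px[t + 1]                          # execution at open_{t+1}
    held_pnl = position * (close_px[t + 1] - close_px[t])
    trade_pnl = action_units * (close_px[t + 1] - fill)
    comps = {
        "pnl": held_pnl + trade_pnl,               # mark-to-market at close_{t+1}
        "spread": abs(action_units) * spread / 2,  # half-spread on the fill
        "slippage": abs(action_units) * slip,
    }
    reward = sum(WEIGHTS[k] * v for k, v in comps.items())
    return position + action_units, reward, comps

position, component_log = 0.0, []
for t in range(n - 2):
    mask = legal_action_mask(position, free_margin=1_000.0, unit_margin=100.0)
    units = 1.0 if (t % 100 == 0 and mask[1]) else 0.0
    position, reward, comps = step(t, position, units)
    component_log.append(comps)                    # per-step diagnostic logging
```

Returning the component dictionary alongside the scalar reward is what makes component-wise attribution possible: the training loop can log the components at every step and ablate any term by zeroing its weight.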
