Policy Deviation Integral Guided Meta-Reinforcement Learning: Applications to High-Speed Train Trajectory Optimization

Haotong Zhang & Wanyuan Wang

IEEE Transactions on Intelligent Transportation Systems · 2026 · https://doi.org/10.1109/tits.2026.3675245 · article

ABDC A

Weight

0.50

What the paper says

Deep reinforcement learning (DRL) has emerged as a promising approach for the train trajectory optimization (TTO) problem in real-world high-speed rail (HSR) systems. However, two issues remain with current DRL-based HSR operations: 1) the driver-centric Markov decision process (MDP) suffers from sparse rewards, and 2) current methods optimize a single rail trajectory (i.e., a single task) and are therefore less adaptable to practical HSR scenarios that require rapid responses to changing conditions (i.e., multiple tasks). To address sparse rewards, this paper first proposes a looped segment-wise gradient optimization (LSGO)-centric MDP that discards human-driving-pattern imitation, in which the complete trajectory and terminal reward can be obtained directly from the trajectory state at each of the agent’s actions. For multi-task learning, meta-RL is a promising way to learn a policy that can adapt to any new task with as little data as possible. Existing meta-RL algorithms directly use the meta parameters to train new tasks, overlooking task similarity and leading to a “lazy” agent issue with high training costs. In addressing the TTO problem, this paper finds that there exists a linear relationship between task-specific optimal policies. By fully exploiting the policy similarity between tasks, this paper proposes a policy deviation integral guided meta-reinforcement learning (PDIMRL) algorithm, which linearly adjusts and initializes new-task policies by integrating deviations between known task optima. Finally, experiments show that 1) LSGO achieves a $16.8\times$ speedup in single-task training compared to driver-centric MDPs, and 2) with LSGO-centric and driver-centric MDPs, PDIMRL requires only 41.29% and 14.74%, respectively, of the meta-training time of benchmark meta-RL algorithms (e.g., Reptile).
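The abstract does not spell out the PDIMRL update rule, so the following is only a rough sketch of the general idea it describes: if task-specific optimal policy parameters vary linearly with some task descriptor, a new task can be initialized by accumulating the deviations between known optima. The function name `init_policy_from_deviations`, the scalar task descriptor (for example a planned trip time), and the flat parameter vectors are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def init_policy_from_deviations(task_values, task_optima, new_task_value):
    """Hypothetical helper (not the paper's algorithm): initialise the
    policy parameters of a new task from known task optima, assuming the
    optima change linearly with a scalar task descriptor."""
    order = np.argsort(task_values)
    c = np.asarray(task_values, dtype=float)[order]
    theta = np.asarray(task_optima, dtype=float)[order]

    # Deviation between adjacent task optima per unit change of the descriptor.
    slopes = np.diff(theta, axis=0) / np.diff(c)[:, None]
    mean_slope = slopes.mean(axis=0)

    # Start at the nearest known optimum and integrate the mean deviation
    # along the descriptor axis up to the new task.
    nearest = int(np.argmin(np.abs(c - new_task_value)))
    return theta[nearest] + mean_slope * (new_task_value - c[nearest])


# Toy data: three known tasks whose optima are exactly linear in the descriptor.
known_values = [1.0, 2.0, 3.0]
known_optima = [[2.0, 0.0], [4.0, -1.0], [6.0, -2.0]]
theta_init = init_policy_from_deviations(known_values, known_optima, 4.0)
```

Under this assumption the new task starts from an extrapolated point rather than from shared meta parameters, which is one plausible reading of how the method could reduce meta-training time relative to Reptile-style initialization.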


Evidence weight

0.50

Balanced mode · F 0.40 / M 0.15 / V 0.05 / R 0.40

F · citation impact	0.50 × 0.40 = 0.20
M · momentum	0.50 × 0.15 = 0.07
V · venue signal	0.50 × 0.05 = 0.03
R · text relevance †	0.50 × 0.40 = 0.20

† Text relevance is estimated at 0.50 on the detail page — for your query’s actual relevance score, open this paper from a search result.
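
The evidence weight appears to be a weighted sum of the four component scores. The short sketch below reproduces that arithmetic using the Balanced-mode weights and the 0.50 scores listed on this page; the dictionary names are illustrative, and the formula itself is an assumption inferred from the table rather than a documented specification.

```python
# Balanced-mode weights and per-component scores as listed on this page.
weights = {"F": 0.40, "M": 0.15, "V": 0.05, "R": 0.40}
scores = {"F": 0.50, "M": 0.50, "V": 0.50, "R": 0.50}

# Assumed formula: each component contributes score * weight.
contributions = {key: scores[key] * weights[key] for key in weights}
evidence_weight = sum(contributions.values())

for key, value in contributions.items():
    print(f"{key}: {value:.3f}")
print(f"evidence weight: {evidence_weight:.2f}")
```

At three decimals the M and V contributions are 0.075 and 0.025, which the page shows rounded to two decimals; the four contributions add up to the displayed evidence weight of 0.50.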