
Commit b2c05c9

first commit
0 parents  commit b2c05c9

16 files changed: +731 -0 lines changed

.DS_Store (10 KB)

Binary file not shown.

LICENSE (+21)

MIT License

Copyright (c) 2023 Aymen

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md (+38)

# Robot arm control with Reinforcement Learning

![anim](https://github.com/kaymen99/Robot-arm-control-with-RL/assets/83681204/224cf960-43d8-4bdc-83be-ac8fe37e5be9)

This project focuses on controlling a 7 DOF robot arm provided by the [panda-gym](https://github.com/qgallouedec/panda-gym) Reacher environment using two continuous reinforcement learning algorithms: DDPG (Deep Deterministic Policy Gradients) and TD3 (Twin Delayed Deep Deterministic Policy Gradients). The Hindsight Experience Replay (HER) technique is used to enhance the learning process of both algorithms.

## Continuous RL Algorithms

<p align="justify">
Continuous reinforcement learning deals with environments where actions are continuous, such as the precise control of robot arm joints or the throttle of an autonomous vehicle. The primary objective is to find policies that effectively map observed states to continuous actions while maximizing the expected cumulative reward. Several algorithms have been developed specifically to address this challenge, including DDPG, TD3, SAC, and PPO.
</p>
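
For instance, the action space of such an environment is a bounded box of real numbers rather than a discrete set of choices. The snippet below is illustrative only; the exact environment id and action dimension depend on the panda-gym version installed:

```python
import gymnasium as gym
import panda_gym  # registers the Panda environments on import

env = gym.make("PandaReach-v3")
print(env.action_space)          # a continuous Box space, e.g. Box(-1.0, 1.0, (n,), float32)
print(env.action_space.sample()) # a real-valued action vector, not a discrete index
```
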

### 1- DDPG (Deep Deterministic Policy Gradients)

<p align="justify">
DDPG is an actor-critic algorithm designed for continuous action spaces. It combines the strengths of policy gradients and Q-learning: an actor network learns the policy, while a critic network approximates the action-value function (Q-function). The actor network directly outputs continuous actions, which are evaluated by the critic network, allowing for fine-grained control.
</p>
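
As a rough sketch of the update this describes (simplified from the full implementation in `agents/ddpg.py`; `mu`, `q`, `mu_target`, `q_target` are placeholder names for the actor, critic, and their target copies, assumed to be callable Keras models, and `dones` is assumed to be a float tensor):

```python
import tensorflow as tf

def ddpg_losses(batch, mu, q, mu_target, q_target, gamma=0.99):
    """Sketch of the DDPG critic and actor objectives."""
    states, actions, rewards, next_states, dones = batch

    # Critic target: r + gamma * Q'(s', mu'(s')), with no bootstrapping on terminal states
    next_q = tf.squeeze(q_target(next_states, mu_target(next_states)), 1)
    y = rewards + gamma * next_q * (1.0 - dones)
    critic_loss = tf.reduce_mean(tf.square(y - tf.squeeze(q(states, actions), 1)))

    # Actor objective: maximize Q(s, mu(s)), i.e. minimize its negative
    actor_loss = -tf.reduce_mean(q(states, mu(states)))
    return critic_loss, actor_loss
```
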

### 2- TD3 (Twin Delayed Deep Deterministic Policy Gradients)

<p align="justify">
TD3 is an enhancement of DDPG that addresses issues such as overestimation bias. It introduces the concept of "twin" critics to estimate the Q-value (two critic networks instead of the single one used in DDPG, with the smaller of the two estimates used for the target), and it delays the actor and target network updates to stabilize training. TD3 is known for its robustness and improved performance over DDPG.
</p>
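
A minimal sketch of the target computation that differs from DDPG is shown below; `mu_target`, `q1_target` and `q2_target` are placeholder names for the target actor and the twin target critics, not the exact classes used in this repo. Besides the twin critics, TD3 also smooths the target action with clipped noise, and the actor (and target networks) are updated only every few critic updates, which is the "delayed" part of the name.

```python
import tensorflow as tf

def td3_critic_target(batch, mu_target, q1_target, q2_target,
                      gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    """Sketch of the TD3 target: clipped noise on the target action + min of the twin critics."""
    _, _, rewards, next_states, dones = batch

    # Target policy smoothing: perturb the target action with clipped Gaussian noise
    next_actions = mu_target(next_states)
    noise = tf.clip_by_value(tf.random.normal(tf.shape(next_actions), stddev=noise_std),
                             -noise_clip, noise_clip)
    next_actions = tf.clip_by_value(next_actions + noise, -max_action, max_action)

    # Clipped double Q-learning: take the smaller of the two critic estimates
    next_q = tf.minimum(
        tf.squeeze(q1_target(next_states, next_actions), 1),
        tf.squeeze(q2_target(next_states, next_actions), 1))
    return rewards + gamma * next_q * (1.0 - dones)
```
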

## Hindsight Experience Replay

<p align="justify">
Hindsight Experience Replay (HER) is a technique developed to address the challenge of sparse, binary rewards in RL environments. In many robotic tasks, achieving the desired goal is rare, and traditional RL algorithms struggle to learn from such feedback: the agent gets a zero reward unless the robot successfully completes the task, so it cannot tell whether the steps it took were useful or not.
</p>

<p align="justify">
HER tackles this issue by reusing past experiences for learning, even when they did not lead to the desired goal. It relabels experiences (replacing the original goal with one that was actually achieved) before storing them in the replay buffer, allowing the agent to learn from both successful and failed attempts, which significantly accelerates the learning process.
</p>
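
As an illustration, here is a simplified sketch of HER relabeling with the "final" goal-selection strategy; `compute_reward` stands for the environment's goal-conditioned reward function, and the dictionary keys are placeholders, not the exact replay buffer layout used in this repo:

```python
def her_relabel_final(episode, compute_reward):
    """Replay each transition as if the goal had been the state reached at the end of the episode."""
    final_achieved_goal = episode[-1]["achieved_goal"]
    relabeled = []
    for t in episode:
        relabeled.append({
            "obs": t["obs"],
            "action": t["action"],
            # Reward is recomputed with respect to the substituted goal
            "reward": compute_reward(t["achieved_goal"], final_achieved_goal),
            "next_obs": t["next_obs"],
            # Pretend this was the desired goal all along
            "goal": final_achieved_goal,
            "done": t["done"],
        })
    return relabeled
```
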

- You can train a given model simply by running one of the files in the `training` folder.

- You can change the hyperparameters of both algorithms (learning rates alpha/beta, discount factor gamma, ...) directly in each agent class in the `agents` folder. The architecture of the actor/critic networks can be modified in the `networks.py` file.

Link to the HER paper: https://arxiv.org/pdf/1707.01495.pdf

agents/ddpg.py (+144)

import tensorflow as tf
import tensorflow.keras as keras
from replay_memory.ReplayBuffer import ReplayBuffer
from utils.networks import ActorNetwork, CriticNetwork

## Actor-critic network parameters

# actor learning rate
alpha = 0.001

# critic learning rate
beta = 0.002

## DDPG algorithm parameters

# discount factor
gamma = 0.99

# target networks soft update factor
tau = 0.005

# replay buffer max memory size
max_size = 10**6

# exploration noise factor
noise_factor = 0.1

# training batch size
batch_size = 64

## DDPG agent class
class DDPGAgent:
    def __init__(self, env, input_dims):
        self.gamma = gamma
        self.tau = tau
        self.batch_size = batch_size
        self.noise_factor = noise_factor

        self.env = env
        self.n_actions = env.action_space.shape[0]
        self.max_action = env.action_space.high[0]
        self.min_action = env.action_space.low[0]

        self.memory = ReplayBuffer(max_size, input_dims, self.n_actions)

        self._initialize_networks(self.n_actions)
        # tau=1 hard-copies the online weights into the target networks
        self.update_parameters(tau=1)

    # Choose an action based on the actor network,
    # adding exploration noise if in training mode
    def choose_action(self, state, evaluate=False):
        state = tf.convert_to_tensor([state], dtype=tf.float32)
        actions = self.actor(state)
        if not evaluate:
            actions += tf.random.normal(shape=[self.n_actions], mean=0, stddev=self.noise_factor)
        actions = tf.clip_by_value(actions, self.min_action, self.max_action)
        return actions[0]

    def remember(self, state, action, reward, new_state, done):
        self.memory.store_transition(state, action, reward, new_state, done)

    # Main DDPG learning step
    def learn(self):
        if self.memory.counter < self.batch_size:
            return

        # Sample a batch of experiences from the replay buffer
        states, actions, rewards, new_states, dones = self.memory.sample(self.batch_size)
        states = tf.convert_to_tensor(states, dtype=tf.float32)
        actions = tf.convert_to_tensor(actions, dtype=tf.float32)
        rewards = tf.convert_to_tensor(rewards, dtype=tf.float32)
        new_states = tf.convert_to_tensor(new_states, dtype=tf.float32)

        # Calculate the critic network loss
        with tf.GradientTape() as tape:
            target_actions = self.target_actor(new_states)
            new_critic_value = tf.squeeze(self.target_critic(new_states, target_actions), 1)
            critic_value = tf.squeeze(self.critic(states, actions), 1)
            # No bootstrapping on terminal transitions
            target = rewards + self.gamma * new_critic_value * (1 - dones)
            critic_loss = tf.keras.losses.MSE(target, critic_value)

        # Apply gradient descent with the calculated critic loss
        critic_network_gradient = tape.gradient(critic_loss, self.critic.trainable_variables)
        self.critic.optimizer.apply_gradients(zip(
            critic_network_gradient, self.critic.trainable_variables
        ))

        # Calculate the actor network loss
        with tf.GradientTape() as tape:
            new_actions = self.actor(states)
            actor_loss = -self.critic(states, new_actions)
            actor_loss = tf.math.reduce_mean(actor_loss)

        # Apply gradient descent with the calculated actor loss
        actor_network_gradient = tape.gradient(actor_loss, self.actor.trainable_variables)
        self.actor.optimizer.apply_gradients(zip(
            actor_network_gradient, self.actor.trainable_variables
        ))

        # Update the actor/critic target networks
        self.update_parameters()

    # Update the actor/critic target network parameters with the soft update rule
    def update_parameters(self, tau=None):
        if tau is None:
            tau = self.tau

        weights = []
        targets = self.target_actor.weights
        for i, weight in enumerate(self.actor.weights):
            weights.append(tau * weight + (1 - tau) * targets[i])
        self.target_actor.set_weights(weights)

        weights = []
        targets = self.target_critic.weights
        for i, weight in enumerate(self.critic.weights):
            weights.append(tau * weight + (1 - tau) * targets[i])
        self.target_critic.set_weights(weights)

    def save_models(self):
        print("---- saving models ----")
        self.actor.save_weights(self.actor.checkpoints_file)
        self.critic.save_weights(self.critic.checkpoints_file)
        self.target_actor.save_weights(self.target_actor.checkpoints_file)
        self.target_critic.save_weights(self.target_critic.checkpoints_file)

    def load_models(self):
        print("---- loading models ----")
        self.actor.load_weights(self.actor.checkpoints_file)
        self.critic.load_weights(self.critic.checkpoints_file)
        self.target_actor.load_weights(self.target_actor.checkpoints_file)
        self.target_critic.load_weights(self.target_critic.checkpoints_file)

    def _initialize_networks(self, n_actions):
        model = "ddpg"
        self.actor = ActorNetwork(n_actions, name="actor", model=model)
        self.critic = CriticNetwork(name="critic", model=model)
        self.target_actor = ActorNetwork(n_actions, name="target_actor", model=model)
        self.target_critic = CriticNetwork(name="target_critic", model=model)

        self.actor.compile(keras.optimizers.Adam(learning_rate=alpha))
        self.critic.compile(keras.optimizers.Adam(learning_rate=beta))
        self.target_actor.compile(keras.optimizers.Adam(learning_rate=alpha))
        self.target_critic.compile(keras.optimizers.Adam(learning_rate=beta))
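
For context, a minimal training loop around this agent might look like the sketch below; the environment id, the dict-observation flattening, and the `input_dims` format are assumptions, HER relabeling is omitted, and the scripts in the `training` folder remain the reference.

import gymnasium as gym
import panda_gym  # registers the Panda* environments on import
import numpy as np
from agents.ddpg import DDPGAgent

def flatten(obs):
    # Goal-conditioned observations are dicts; feed the policy the observation + desired goal
    return np.concatenate([obs["observation"], obs["desired_goal"]])

# The exact environment id may differ between panda-gym versions
env = gym.make("PandaReach-v3")
obs, _ = env.reset()

# input_dims is assumed here to be the flattened observation shape expected by ReplayBuffer
agent = DDPGAgent(env, input_dims=flatten(obs).shape)

for episode in range(1000):
    obs, _ = env.reset()
    state, done = flatten(obs), False
    while not done:
        action = agent.choose_action(state)
        obs, reward, terminated, truncated, _ = env.step(np.array(action))
        done = terminated or truncated
        next_state = flatten(obs)
        agent.remember(state, action, reward, next_state, done)
        agent.learn()
        state = next_state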
