
Improvement Methods

These methods are trained to iteratively improve existing solutions, akin to local search algorithms: rather than constructing solutions from scratch, they learn to refine a given solution step by step.
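
For orientation, the sketch below shows how such a model (DACT) can be trained end to end. It is a minimal sketch, not a verified recipe: the import paths for get_env, DACT, and RL4COTrainer are assumptions based on the usual RL4CO layout, and the "tsp_kopt" environment is used with its default generator settings.

# Minimal training sketch (assumed import paths; check them against your RL4CO version)
from rl4co.envs import get_env                 # factory also used internally by the policies on this page
from rl4co.models import DACT                  # assumed top-level re-export of the DACT model below
from rl4co.utils.trainer import RL4COTrainer   # assumed Lightning-based trainer wrapper

env = get_env("tsp_kopt")                      # k-opt improvement environment expected by DACT
model = DACT(env)                              # defaults to DACTPolicy and the improvement critic

trainer = RL4COTrainer(max_epochs=1, devices=1)
trainer.fit(model)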

DACT

Classes:

  • DACTEncoder

    Dual-Aspect Collaborative Transformer Encoder as in Ma et al. (2021)

DACTEncoder

DACTEncoder(
    embed_dim: int = 64,
    init_embedding: Module = None,
    pos_embedding: Module = None,
    env_name: str = "tsp_kopt",
    pos_type: str = "CPE",
    num_heads: int = 4,
    num_layers: int = 3,
    normalization: str = "layer",
    feedforward_hidden: int = 64,
)

Bases: ImprovementEncoder

Dual-Aspect Collaborative Transformer Encoder as in Ma et al. (2021)

Parameters:

  • embed_dim (int, default: 64 ) –

    Dimension of the embedding space

  • init_embedding (Module, default: None ) –

    Module to use for the initialization of the node embeddings

  • pos_embedding (Module, default: None ) –

    Module to use for the initialization of the positional embeddings

  • env_name (str, default: 'tsp_kopt' ) –

    Name of the environment used to initialize embeddings

  • pos_type (str, default: 'CPE' ) –

    Name of the used positional encoding method (CPE or APE)

  • num_heads (int, default: 4 ) –

    Number of heads in the attention layers

  • num_layers (int, default: 3 ) –

    Number of layers in the attention network

  • normalization (str, default: 'layer' ) –

    Normalization type in the attention layers

  • feedforward_hidden (int, default: 64 ) –

    Hidden dimension in the feedforward layers

Source code in rl4co/models/zoo/dact/encoder.py
def __init__(
    self,
    embed_dim: int = 64,
    init_embedding: nn.Module = None,
    pos_embedding: nn.Module = None,
    env_name: str = "tsp_kopt",
    pos_type: str = "CPE",
    num_heads: int = 4,
    num_layers: int = 3,
    normalization: str = "layer",
    feedforward_hidden: int = 64,
):
    super(DACTEncoder, self).__init__(
        embed_dim=embed_dim,
        env_name=env_name,
        pos_type=pos_type,
        num_heads=num_heads,
        num_layers=num_layers,
        normalization=normalization,
        feedforward_hidden=feedforward_hidden,
    )

    assert self.env_name in ["tsp_kopt"], NotImplementedError()

    self.net = AdaptiveSequential(
        *(
            DACTEncoderLayer(
                num_heads,
                embed_dim,
                feedforward_hidden,
                normalization,
            )
            for _ in range(num_layers)
        )
    )
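
A brief construction sketch for the encoder above. Calling it on the environment's state TensorDict returns the pair of node-feature and positional-feature embeddings (NFE, PFE), as used in DACTPolicy.forward below; the state td is assumed to come from a "tsp_kopt" environment.

# Construction sketch (module path taken from the "Source code" location above)
from rl4co.models.zoo.dact.encoder import DACTEncoder

encoder = DACTEncoder(
    embed_dim=64,
    env_name="tsp_kopt",
    pos_type="CPE",   # cyclic positional encoding; "APE" is the alternative
    num_heads=4,
    num_layers=3,
)

# As used in DACTPolicy.forward: NFE, PFE = encoder(td)
# Both embeddings have shape (batch_size, num_nodes, embed_dim).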

Classes:

  • DACTDecoder

    DACT decoder based on Ma et al. (2021)

DACTDecoder

DACTDecoder(embed_dim: int = 64, num_heads: int = 4)

Bases: ImprovementDecoder

DACT decoder based on Ma et al. (2021). Given the environment state and the dual sets of embeddings (NFE and PFE embeddings), compute the logits for selecting the two nodes of the 2-opt local search move from the current solution.

Parameters:

  • embed_dim (int, default: 64 ) –

    Embedding dimension

  • num_heads (int, default: 4 ) –

    Number of attention heads

Methods:

  • forward

    Compute the logits for removing a node pair from the current solution

Source code in rl4co/models/zoo/dact/decoder.py
def __init__(
    self,
    embed_dim: int = 64,
    num_heads: int = 4,
):
    super().__init__()
    self.embed_dim = embed_dim
    self.n_heads = num_heads
    self.hidden_dim = embed_dim

    # for MHC sublayer (NFE aspect)
    self.compater_node = MultiHeadCompat(
        num_heads, embed_dim, embed_dim, embed_dim, embed_dim
    )

    # for MHC sublayer (PFE aspect)
    self.compater_pos = MultiHeadCompat(
        num_heads, embed_dim, embed_dim, embed_dim, embed_dim
    )

    self.norm_factor = 1 / math.sqrt(1 * self.hidden_dim)

    # for Max-Pooling sublayer
    self.project_graph_pos = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
    self.project_graph_node = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
    self.project_node_pos = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
    self.project_node_node = nn.Linear(self.embed_dim, self.embed_dim, bias=False)

    # for feed-forward aggregation (FFA)sublayer
    self.value_head = MLP(
        input_dim=2 * self.n_heads,
        output_dim=1,
        num_neurons=[32, 32],
        dropout_probs=[0.05, 0.00],
    )

forward

forward(
    td: TensorDict, final_h: Tensor, final_p: Tensor
) -> Tensor

Compute the logits for removing a node pair from the current solution

Parameters:

  • td (TensorDict) –

    TensorDict with the current environment state

  • final_h (Tensor) –

    final NFE embeddings

  • final_p (Tensor) –

    final PFE embeddings

Source code in rl4co/models/zoo/dact/decoder.py
def forward(self, td: TensorDict, final_h: Tensor, final_p: Tensor) -> Tensor:
    """Compute the logits of the removing a node pair from the current solution

    Args:
        td: TensorDict with the current environment state
        final_h: final NFE embeddings
        final_p: final pfe embeddings
    """

    batch_size, graph_size, dim = final_h.size()

    # Max-Pooling sublayer
    h_node_refined = self.project_node_node(final_h) + self.project_graph_node(
        final_h.max(1)[0]
    )[:, None, :].expand(batch_size, graph_size, dim)
    h_pos_refined = self.project_node_pos(final_p) + self.project_graph_pos(
        final_p.max(1)[0]
    )[:, None, :].expand(batch_size, graph_size, dim)

    # MHC sublayer
    compatibility = torch.zeros(
        (batch_size, graph_size, graph_size, self.n_heads * 2),
        device=h_node_refined.device,
    )
    compatibility[:, :, :, : self.n_heads] = self.compater_pos(h_pos_refined).permute(
        1, 2, 3, 0
    )
    compatibility[:, :, :, self.n_heads :] = self.compater_node(
        h_node_refined
    ).permute(1, 2, 3, 0)

    # FFA sublater
    return self.value_head(self.norm_factor * compatibility).squeeze(-1)
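
A shape-oriented sketch of the forward pass listed above: with dummy NFE/PFE tensors, the decoder produces one logit per ordered node pair, i.e. a (batch_size, graph_size, graph_size) tensor that the policy flattens before masking and sampling. The empty TensorDict is only a placeholder here, since td is not read in the code shown above.

import torch
from tensordict import TensorDict
from rl4co.models.zoo.dact.decoder import DACTDecoder  # module path from the "Source code" location above

decoder = DACTDecoder(embed_dim=64, num_heads=4)

batch_size, graph_size, embed_dim = 2, 10, 64
final_h = torch.rand(batch_size, graph_size, embed_dim)  # dummy NFE embeddings
final_p = torch.rand(batch_size, graph_size, embed_dim)  # dummy PFE embeddings

logits = decoder(TensorDict({}, batch_size=[batch_size]), final_h, final_p)
print(logits.shape)  # torch.Size([2, 10, 10]) -> one logit per candidate 2-opt node pair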

Classes:

  • DACTPolicy

    DACT Policy based on Ma et al. (2021)

DACTPolicy

DACTPolicy(
    embed_dim: int = 64,
    num_encoder_layers: int = 3,
    num_heads: int = 4,
    normalization: str = "layer",
    feedforward_hidden: int = 64,
    env_name: str = "tsp_kopt",
    pos_type: str = "CPE",
    init_embedding: Module = None,
    pos_embedding: Module = None,
    temperature: float = 1.0,
    tanh_clipping: float = 6.0,
    train_decode_type: str = "sampling",
    val_decode_type: str = "sampling",
    test_decode_type: str = "sampling",
)

Bases: ImprovementPolicy

DACT Policy based on Ma et al. (2021). This model first encodes the input graph and the current solution using a DACT encoder (:class:DACTEncoder) and then decodes the 2-opt action (:class:DACTDecoder).

Parameters:

  • embed_dim (int, default: 64 ) –

    Dimension of the node embeddings

  • num_encoder_layers (int, default: 3 ) –

    Number of layers in the encoder

  • num_heads (int, default: 4 ) –

    Number of heads in the attention layers

  • normalization (str, default: 'layer' ) –

    Normalization type in the attention layers

  • feedforward_hidden (int, default: 64 ) –

    Dimension of the hidden layer in the feedforward network

  • env_name (str, default: 'tsp_kopt' ) –

    Name of the environment used to initialize embeddings

  • pos_type (str, default: 'CPE' ) –

    Name of the used positional encoding method (CPE or APE)

  • init_embedding (Module, default: None ) –

    Module to use for the initialization of the embeddings

  • pos_embedding (Module, default: None ) –

    Module to use for the initialization of the positional embeddings

  • temperature (float, default: 1.0 ) –

    Temperature for the softmax

  • tanh_clipping (float, default: 6.0 ) –

    Tanh clipping value (see Bello et al., 2016)

  • train_decode_type (str, default: 'sampling' ) –

    Type of decoding to use during training

  • val_decode_type (str, default: 'sampling' ) –

    Type of decoding to use during validation

  • test_decode_type (str, default: 'sampling' ) –

    Type of decoding to use during testing

Methods:

  • forward

    Forward pass of the policy.

Source code in rl4co/models/zoo/dact/policy.py
def __init__(
    self,
    embed_dim: int = 64,
    num_encoder_layers: int = 3,
    num_heads: int = 4,
    normalization: str = "layer",
    feedforward_hidden: int = 64,
    env_name: str = "tsp_kopt",
    pos_type: str = "CPE",
    init_embedding: nn.Module = None,
    pos_embedding: nn.Module = None,
    temperature: float = 1.0,
    tanh_clipping: float = 6.0,
    train_decode_type: str = "sampling",
    val_decode_type: str = "sampling",
    test_decode_type: str = "sampling",
):
    super(DACTPolicy, self).__init__()

    self.env_name = env_name

    # Encoder and decoder
    self.encoder = DACTEncoder(
        embed_dim=embed_dim,
        init_embedding=init_embedding,
        pos_embedding=pos_embedding,
        env_name=env_name,
        pos_type=pos_type,
        num_heads=num_heads,
        num_layers=num_encoder_layers,
        normalization=normalization,
        feedforward_hidden=feedforward_hidden,
    )

    self.decoder = DACTDecoder(embed_dim=embed_dim, num_heads=num_heads)

    # Decoding strategies
    self.temperature = temperature
    self.tanh_clipping = tanh_clipping
    self.train_decode_type = train_decode_type
    self.val_decode_type = val_decode_type
    self.test_decode_type = test_decode_type

forward

forward(
    td: TensorDict,
    env: Union[str, RL4COEnvBase] = None,
    phase: str = "train",
    return_actions: bool = True,
    return_embeds: bool = False,
    only_return_embed: bool = False,
    actions=None,
    **decoding_kwargs
) -> dict

Forward pass of the policy.

Parameters:

  • td (TensorDict) –

    TensorDict containing the environment state

  • env (Union[str, RL4COEnvBase], default: None ) –

    Environment to use for decoding. If None, the environment is instantiated from env_name. Note that it is more efficient to pass an already instantiated environment each time for fine-grained control

  • phase (str, default: 'train' ) –

    Phase of the algorithm (train, val, test)

  • return_actions (bool, default: True ) –

    Whether to return the actions

  • actions

    Actions to use for evaluating the policy. If passed, use these actions instead of sampling from the policy to calculate log likelihood

  • decoding_kwargs

    Keyword arguments for the decoding strategy. See :class:rl4co.utils.decoding.DecodingStrategy for more information.

Returns:

  • out ( dict ) –

    Dictionary containing the reward, log likelihood, and optionally the actions and entropy

Source code in rl4co/models/zoo/dact/policy.py
def forward(
    self,
    td: TensorDict,
    env: Union[str, RL4COEnvBase] = None,
    phase: str = "train",
    return_actions: bool = True,
    return_embeds: bool = False,
    only_return_embed: bool = False,
    actions=None,
    **decoding_kwargs,
) -> dict:
    """Forward pass of the policy.

    Args:
        td: TensorDict containing the environment state
        env: Environment to use for decoding. If None, the environment is instantiated from `env_name`. Note that
            it is more efficient to pass an already instantiated environment each time for fine-grained control
        phase: Phase of the algorithm (train, val, test)
        return_actions: Whether to return the actions
        actions: Actions to use for evaluating the policy.
            If passed, use these actions instead of sampling from the policy to calculate log likelihood
        decoding_kwargs: Keyword arguments for the decoding strategy. See :class:`rl4co.utils.decoding.DecodingStrategy` for more information.

    Returns:
        out: Dictionary containing the reward, log likelihood, and optionally the actions and entropy
    """

    # Encoder: get encoder output and initial embeddings from initial state
    NFE, PFE = self.encoder(td)
    h_featrues = torch.cat((NFE, PFE), -1)

    if only_return_embed:
        return {"embeds": h_featrues.detach()}

    # Instantiate environment if needed
    if isinstance(env, str) or env is None:
        env_name = self.env_name if env is None else env
        log.info(f"Instantiated environment not provided; instantiating {env_name}")
        env = get_env(env_name)
    assert env.two_opt_mode, "DACT only support 2-opt"

    # Get decode type depending on phase and whether actions are passed for evaluation
    decode_type = decoding_kwargs.pop("decode_type", None)
    if actions is not None:
        decode_type = "evaluate"
    elif decode_type is None:
        decode_type = getattr(self, f"{phase}_decode_type")

    # Setup decoding strategy
    # we pop arguments that are not part of the decoding strategy
    decode_strategy: DecodingStrategy = get_decoding_strategy(
        decode_type,
        temperature=decoding_kwargs.pop("temperature", self.temperature),
        tanh_clipping=decoding_kwargs.pop("tanh_clipping", self.tanh_clipping),
        mask_logits=True,
        improvement_method_mode=True,
        **decoding_kwargs,
    )

    # Perform the decoding
    batch_size, seq_length = td["rec_current"].size()
    logits = self.decoder(td, NFE, PFE).view(batch_size, -1)

    # Get mask
    mask = env.get_mask(td)
    if "action" in td.keys():
        mask[torch.arange(batch_size), td["action"][:, 0], td["action"][:, 1]] = False
        mask[torch.arange(batch_size), td["action"][:, 1], td["action"][:, 0]] = False
    mask = mask.view(batch_size, -1)

    # Get action and log-likelihood
    logprob, action_sampled = decode_strategy.step(
        logits,
        mask,
        action=(
            actions[:, 0] * seq_length + actions[:, 1]
            if actions is not None
            else None
        ),
    )
    action_sampled = action_sampled.unsqueeze(-1)
    if phase == "train":
        log_likelihood = logprob.gather(1, action_sampled)
    else:
        log_likelihood = torch.zeros(batch_size, device=td.device)

    ## return
    DACT_action = torch.cat(
        (
            action_sampled // seq_length,
            action_sampled % seq_length,
        ),
        -1,
    )

    outdict = {"log_likelihood": log_likelihood, "cost_bsf": td["cost_bsf"]}
    td.set("action", DACT_action)

    if return_embeds:
        outdict["embeds"] = h_featrues.detach()

    if return_actions:
        outdict["actions"] = DACT_action

    return outdict
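
A hedged usage sketch of this forward pass. It assumes policy is a DACTPolicy, env an instantiated "tsp_kopt" environment, and td a state TensorDict produced by that environment (for example via env.reset); the exact reset and rollout details depend on the improvement environment's API.

# Greedy decoding at test time (decode_type is forwarded to the decoding strategy)
out = policy(td, env=env, phase="test", decode_type="greedy")
actions = out["actions"]      # (batch_size, 2): node pair selected for the 2-opt move
cost_bsf = out["cost_bsf"]    # best-so-far cost carried in the state

# Evaluating externally provided actions instead of sampling (to compute their log-likelihood)
out_eval = policy(td, env=env, phase="train", actions=actions)
log_likelihood = out_eval["log_likelihood"]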

Classes:

  • DACT

    DACT Model based on n_step Proximal Policy Optimization (PPO) with a DACT model policy.

DACT

DACT(
    env: RL4COEnvBase,
    policy: Module = None,
    critic: CriticNetwork = None,
    policy_kwargs: dict = {},
    critic_kwargs: dict = {},
    **kwargs
)

Bases: n_step_PPO

DACT Model based on n_step Proximal Policy Optimization (PPO) with a DACT model policy. We default to the DACT model policy and the improvement Critic Network.

Parameters:

  • env (RL4COEnvBase) –

    Environment to use for the algorithm

  • policy (Module, default: None ) –

    Policy to use for the algorithm

  • critic (CriticNetwork, default: None ) –

    Critic to use for the algorithm

  • policy_kwargs (dict, default: {} ) –

    Keyword arguments for policy

  • critic_kwargs (dict, default: {} ) –

    Keyword arguments for critic

Source code in rl4co/models/zoo/dact/model.py
def __init__(
    self,
    env: RL4COEnvBase,
    policy: nn.Module = None,
    critic: CriticNetwork = None,
    policy_kwargs: dict = {},
    critic_kwargs: dict = {},
    **kwargs,
):
    if policy is None:
        policy = DACTPolicy(env_name=env.name, **policy_kwargs)

    if critic is None:
        embed_dim = (
            policy_kwargs["embed_dim"] * 2 if "embed_dim" in policy_kwargs else 128
        )  # the critic's embed_dim must be as policy's

        encoder = MultiHeadAttentionLayer(
            embed_dim,
            critic_kwargs["num_heads"] if "num_heads" in critic_kwargs else 4,
            critic_kwargs["feedforward_hidden"] * 2
            if "feedforward_hidden" in critic_kwargs
            else 128,
            critic_kwargs["normalization"]
            if "normalization" in critic_kwargs
            else "layer",
            bias=False,
        )
        value_head = CriticDecoder(embed_dim)

        critic = CriticNetwork(
            encoder=encoder,
            value_head=value_head,
            customized=True,
        )

    super().__init__(env, policy, critic, **kwargs)
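
A construction sketch following the __init__ above. When no critic is passed, the default critic encoder's embedding dimension is set to twice the policy's embed_dim, matching the concatenated (NFE, PFE) embeddings of the policy, so overriding embed_dim in policy_kwargs keeps the two consistent automatically. Import paths are assumptions based on the source locations shown on this page.

from rl4co.envs import get_env              # assumed factory, as used inside the policies
from rl4co.models.zoo.dact import DACT      # assumed re-export from rl4co/models/zoo/dact/model.py

env = get_env("tsp_kopt")
model = DACT(
    env,
    policy_kwargs={"embed_dim": 64, "num_encoder_layers": 3},
    critic_kwargs={"num_heads": 4, "normalization": "layer"},
)
# With embed_dim=64 in policy_kwargs, the default critic encoder uses 64 * 2 = 128 dimensions.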

N2S

Classes:

  • N2SEncoder

    Neural Neighborhood Search Encoder as in Ma et al. (2022)

N2SEncoder

N2SEncoder(
    embed_dim: int = 128,
    init_embedding: Module = None,
    pos_embedding: Module = None,
    env_name: str = "pdp_ruin_repair",
    pos_type: str = "CPE",
    num_heads: int = 4,
    num_layers: int = 3,
    normalization: str = "layer",
    feedforward_hidden: int = 128,
)

Bases: ImprovementEncoder

Neural Neighborhood Search Encoder as in Ma et al. (2022). First embed the input and then process it with a Graph Attention Network.

Parameters:

  • embed_dim (int, default: 128 ) –

    Dimension of the embedding space

  • init_embedding (Module, default: None ) –

    Module to use for the initialization of the node embeddings

  • pos_embedding (Module, default: None ) –

    Module to use for the initialization of the positional embeddings

  • env_name (str, default: 'pdp_ruin_repair' ) –

    Name of the environment used to initialize embeddings

  • pos_type (str, default: 'CPE' ) –

    Name of the used positional encoding method (CPE or APE)

  • num_heads (int, default: 4 ) –

    Number of heads in the attention layers

  • num_layers (int, default: 3 ) –

    Number of layers in the attention network

  • normalization (str, default: 'layer' ) –

    Normalization type in the attention layers

  • feedforward_hidden (int, default: 128 ) –

    Hidden dimension in the feedforward layers

Source code in rl4co/models/zoo/n2s/encoder.py
def __init__(
    self,
    embed_dim: int = 128,
    init_embedding: nn.Module = None,
    pos_embedding: nn.Module = None,
    env_name: str = "pdp_ruin_repair",
    pos_type: str = "CPE",
    num_heads: int = 4,
    num_layers: int = 3,
    normalization: str = "layer",
    feedforward_hidden: int = 128,
):
    super(N2SEncoder, self).__init__(
        embed_dim=embed_dim,
        init_embedding=init_embedding,
        pos_embedding=pos_embedding,
        env_name=env_name,
        pos_type=pos_type,
        num_heads=num_heads,
        num_layers=num_layers,
        normalization=normalization,
        feedforward_hidden=feedforward_hidden,
    )

    self.pos_net = MultiHeadCompat(num_heads, embed_dim, feedforward_hidden)

    self.net = AdaptiveSequential(
        *(
            N2SEncoderLayer(
                num_heads,
                embed_dim,
                feedforward_hidden,
                normalization,
            )
            for _ in range(num_layers)
        )
    )

Classes:

  • NodePairRemovalDecoder

    N2S Node-Pair Removal decoder based on Ma et al. (2022)

  • NodePairReinsertionDecoder

    N2S Node-Pair Reinsertion decoder based on Ma et al. (2022)

NodePairRemovalDecoder

NodePairRemovalDecoder(
    embed_dim: int = 128, num_heads: int = 4
)

Bases: ImprovementDecoder

N2S Node-Pair Removal decoder based on Ma et al. (2022). Given the environment state and the node embeddings (positional embeddings are discarded), compute the logits for selecting the pair of pickup and delivery nodes to remove from the current solution.

Parameters:

  • embed_dim (int, default: 128 ) –

    Embedding dimension

  • num_heads (int, default: 4 ) –

    Number of attention heads

Methods:

  • forward

    Compute the logits for removing a node pair from the current solution

Source code in rl4co/models/zoo/n2s/decoder.py
def __init__(
    self,
    embed_dim: int = 128,
    num_heads: int = 4,
):
    super().__init__()
    self.input_dim = embed_dim
    self.n_heads = num_heads
    self.hidden_dim = embed_dim

    assert embed_dim % num_heads == 0

    self.W_Q = nn.Parameter(
        torch.Tensor(self.n_heads, self.input_dim, self.hidden_dim)
    )
    self.W_K = nn.Parameter(
        torch.Tensor(self.n_heads, self.input_dim, self.hidden_dim)
    )

    self.agg = MLP(input_dim=2 * self.n_heads + 4, output_dim=1, num_neurons=[32, 32])

    self.init_parameters()

forward

forward(
    td: TensorDict, final_h: Tensor, final_p: Tensor
) -> Tensor

Compute the logits for removing a node pair from the current solution

Parameters:

  • td (TensorDict) –

    TensorDict with the current environment state

  • final_h (Tensor) –

    final node embeddings

  • final_p (Tensor) –

    final positional embeddings

Source code in rl4co/models/zoo/n2s/decoder.py
def forward(self, td: TensorDict, final_h: Tensor, final_p: Tensor) -> Tensor:
    """Compute the logits of the removing a node pair from the current solution

    Args:
        td: TensorDict with the current environment state
        final_h: final node embeddings
        final_p: final positional embeddings
    """

    selection_recent = torch.cat(
        (td["action_record"][:, -3:], td["action_record"].mean(1, True)), 1
    )
    solution = td["rec_current"]

    pre = solution.argsort()  # pre=[1,2,0]
    post = solution.gather(
        1, solution
    )  # post=[1,2,0] # the second neighbour works better
    batch_size, graph_size_plus1, input_dim = final_h.size()

    hflat = final_h.contiguous().view(-1, input_dim)  #################   reshape

    shp = (self.n_heads, batch_size, graph_size_plus1, self.hidden_dim)

    # Calculate queries, (n_heads, batch_size, graph_size+1, key_size)
    hidden_Q = torch.matmul(hflat, self.W_Q).view(shp)
    hidden_K = torch.matmul(hflat, self.W_K).view(shp)

    Q_pre = hidden_Q.gather(
        2, pre.view(1, batch_size, graph_size_plus1, 1).expand_as(hidden_Q)
    )
    K_post = hidden_K.gather(
        2, post.view(1, batch_size, graph_size_plus1, 1).expand_as(hidden_Q)
    )

    compatibility = (
        (Q_pre * hidden_K).sum(-1)
        + (hidden_Q * K_post).sum(-1)
        - (Q_pre * K_post).sum(-1)
    )[
        :, :, 1:
    ]  # (n_heads, batch_size, graph_size) (12)

    compatibility_pairing = torch.cat(
        (
            compatibility[:, :, : graph_size_plus1 // 2],
            compatibility[:, :, graph_size_plus1 // 2 :],
        ),
        0,
    )  # (n_heads*2, batch_size, graph_size/2)

    compatibility_pairing = self.agg(
        torch.cat(
            (
                compatibility_pairing.permute(1, 2, 0),
                selection_recent.permute(0, 2, 1),
            ),
            -1,
        )
    ).squeeze()  # (batch_size, graph_size/2)

    return compatibility_pairing

NodePairReinsertionDecoder

NodePairReinsertionDecoder(
    embed_dim: int = 128, num_heads: int = 4
)

Bases: ImprovementDecoder

N2S Node-Pair Reinsertion decoder based on Ma et al. (2022). Given the environment state, the node embeddings (positional embeddings are discarded), and the node pair removed by the NodePairRemovalDecoder, compute the logits for the places at which to re-insert the removed pickup and delivery nodes to form a new solution.

Parameters:

  • embed_dim (int, default: 128 ) –

    Embedding dimension

  • num_heads (int, default: 4 ) –

    Number of attention heads

Source code in rl4co/models/zoo/n2s/decoder.py
def __init__(
    self,
    embed_dim: int = 128,
    num_heads: int = 4,
):
    super().__init__()
    self.input_dim = embed_dim
    self.n_heads = num_heads
    self.hidden_dim = embed_dim

    assert embed_dim % num_heads == 0

    self.compater_insert1 = MultiHeadCompat(
        num_heads, embed_dim, embed_dim, embed_dim, embed_dim
    )

    self.compater_insert2 = MultiHeadCompat(
        num_heads, embed_dim, embed_dim, embed_dim, embed_dim
    )

    self.agg = MLP(input_dim=4 * self.n_heads, output_dim=1, num_neurons=[32, 32])

Classes:

  • N2SPolicy

    N2S Policy based on Ma et al. (2022)

N2SPolicy

N2SPolicy(
    embed_dim: int = 128,
    num_encoder_layers: int = 3,
    num_heads: int = 4,
    normalization: str = "layer",
    feedforward_hidden: int = 128,
    env_name: str = "pdp_ruin_repair",
    pos_type: str = "CPE",
    init_embedding: Module = None,
    pos_embedding: Module = None,
    temperature: float = 1.0,
    tanh_clipping: float = 6.0,
    train_decode_type: str = "sampling",
    val_decode_type: str = "sampling",
    test_decode_type: str = "sampling",
)

Bases: ImprovementPolicy

N2S Policy based on Ma et al. (2022). This model first encodes the input graph and the current solution using an N2S encoder (:class:N2SEncoder) and then decodes the node-pair removal and reinsertion actions using the Node-Pair Removal (:class:NodePairRemovalDecoder) and Reinsertion (:class:NodePairReinsertionDecoder) decoders.

Parameters:

  • embed_dim (int, default: 128 ) –

    Dimension of the node embeddings

  • num_encoder_layers (int, default: 3 ) –

    Number of layers in the encoder

  • num_heads (int, default: 4 ) –

    Number of heads in the attention layers

  • normalization (str, default: 'layer' ) –

    Normalization type in the attention layers

  • feedforward_hidden (int, default: 128 ) –

    Dimension of the hidden layer in the feedforward network

  • env_name (str, default: 'pdp_ruin_repair' ) –

    Name of the environment used to initialize embeddings

  • pos_type (str, default: 'CPE' ) –

    Name of the used positional encoding method (CPE or APE)

  • init_embedding (Module, default: None ) –

    Module to use for the initialization of the embeddings

  • pos_embedding (Module, default: None ) –

    Module to use for the initialization of the positional embeddings

  • temperature (float, default: 1.0 ) –

    Temperature for the softmax

  • tanh_clipping (float, default: 6.0 ) –

    Tanh clipping value (see Bello et al., 2016)

  • train_decode_type (str, default: 'sampling' ) –

    Type of decoding to use during training

  • val_decode_type (str, default: 'sampling' ) –

    Type of decoding to use during validation

  • test_decode_type (str, default: 'sampling' ) –

    Type of decoding to use during testing

Methods:

  • forward

    Forward pass of the policy.

Source code in rl4co/models/zoo/n2s/policy.py
def __init__(
    self,
    embed_dim: int = 128,
    num_encoder_layers: int = 3,
    num_heads: int = 4,
    normalization: str = "layer",
    feedforward_hidden: int = 128,
    env_name: str = "pdp_ruin_repair",
    pos_type: str = "CPE",
    init_embedding: nn.Module = None,
    pos_embedding: nn.Module = None,
    temperature: float = 1.0,
    tanh_clipping: float = 6.0,
    train_decode_type: str = "sampling",
    val_decode_type: str = "sampling",
    test_decode_type: str = "sampling",
):
    super(N2SPolicy, self).__init__()

    self.env_name = env_name

    # Encoder and decoder
    self.encoder = N2SEncoder(
        embed_dim=embed_dim,
        init_embedding=init_embedding,
        pos_embedding=pos_embedding,
        env_name=env_name,
        pos_type=pos_type,
        num_heads=num_heads,
        num_layers=num_encoder_layers,
        normalization=normalization,
        feedforward_hidden=feedforward_hidden,
    )

    self.removal_decoder = NodePairRemovalDecoder(
        embed_dim=embed_dim, num_heads=num_heads
    )

    self.reinsertion_decoder = NodePairReinsertionDecoder(
        embed_dim=embed_dim, num_heads=num_heads
    )

    self.project_graph = nn.Linear(embed_dim, embed_dim, bias=False)
    self.project_node = nn.Linear(embed_dim, embed_dim, bias=False)

    # Decoding strategies
    self.temperature = temperature
    self.tanh_clipping = tanh_clipping
    self.train_decode_type = train_decode_type
    self.val_decode_type = val_decode_type
    self.test_decode_type = test_decode_type

forward

forward(
    td: TensorDict,
    env: Union[str, RL4COEnvBase] = None,
    phase: str = "train",
    return_actions: bool = True,
    return_embeds: bool = False,
    only_return_embed: bool = False,
    actions=None,
    **decoding_kwargs
) -> dict

Forward pass of the policy.

Parameters:

  • td (TensorDict) –

    TensorDict containing the environment state

  • env (Union[str, RL4COEnvBase], default: None ) –

    Environment to use for decoding. If None, the environment is instantiated from env_name. Note that it is more efficient to pass an already instantiated environment each time for fine-grained control

  • phase (str, default: 'train' ) –

    Phase of the algorithm (train, val, test)

  • return_actions (bool, default: True ) –

    Whether to return the actions

  • actions

    Actions to use for evaluating the policy. If passed, use these actions instead of sampling from the policy to calculate log likelihood

  • decoding_kwargs

    Keyword arguments for the decoding strategy. See :class:rl4co.utils.decoding.DecodingStrategy for more information.

Returns:

  • out ( dict ) –

    Dictionary containing the reward, log likelihood, and optionally the actions and entropy

Source code in rl4co/models/zoo/n2s/policy.py
def forward(
    self,
    td: TensorDict,
    env: Union[str, RL4COEnvBase] = None,
    phase: str = "train",
    return_actions: bool = True,
    return_embeds: bool = False,
    only_return_embed: bool = False,
    actions=None,
    **decoding_kwargs,
) -> dict:
    """Forward pass of the policy.

    Args:
        td: TensorDict containing the environment state
        env: Environment to use for decoding. If None, the environment is instantiated from `env_name`. Note that
            it is more efficient to pass an already instantiated environment each time for fine-grained control
        phase: Phase of the algorithm (train, val, test)
        return_actions: Whether to return the actions
        actions: Actions to use for evaluating the policy.
            If passed, use these actions instead of sampling from the policy to calculate log likelihood
        decoding_kwargs: Keyword arguments for the decoding strategy. See :class:`rl4co.utils.decoding.DecodingStrategy` for more information.

    Returns:
        out: Dictionary containing the reward, log likelihood, and optionally the actions and entropy
    """

    # Encoder: get encoder output and initial embeddings from initial state
    h_wave, final_p = self.encoder(td)
    if only_return_embed:
        return {"embeds": h_wave.detach()}
    final_h = (
        self.project_node(h_wave) + self.project_graph(h_wave.max(1)[0])[:, None, :]
    )

    # Instantiate environment if needed
    if isinstance(env, str) or env is None:
        env_name = self.env_name if env is None else env
        log.info(f"Instantiated environment not provided; instantiating {env_name}")
        env = get_env(env_name)

    # Get decode type depending on phase and whether actions are passed for evaluation
    decode_type = decoding_kwargs.pop("decode_type", None)
    if actions is not None:
        decode_type = "evaluate"
    elif decode_type is None:
        decode_type = getattr(self, f"{phase}_decode_type")

    # Setup decoding strategy
    # we pop arguments that are not part of the decoding strategy
    decode_strategy: DecodingStrategy = get_decoding_strategy(
        decode_type,
        temperature=decoding_kwargs.pop("temperature", self.temperature),
        tanh_clipping=decoding_kwargs.pop("tanh_clipping", self.tanh_clipping),
        mask_logits=True,
        improvement_method_mode=True,
        **decoding_kwargs,
    )

    ## action 1

    # Perform the decoding
    logits = self.removal_decoder(td, final_h, final_p)

    # Get mask
    mask = torch.ones_like(td["action_record"][:, 0], device=td.device).bool()
    if "action" in td.keys():
        mask = mask.scatter(1, td["action"][:, :1], 0)

    # Get action and log-likelihood
    logprob_removal, action_removal = decode_strategy.step(
        logits,
        mask,
        action=actions[:, 0] if actions is not None else None,
    )
    action_removal = action_removal.unsqueeze(-1)
    if phase == "train":
        selected_log_ll_action1 = logprob_removal.gather(1, action_removal)

    ## action 2
    td.set("action", action_removal)

    # Perform the decoding
    batch_size, seq_length = td["rec_current"].size()
    logits = self.reinsertion_decoder(td, final_h, final_p).view(batch_size, -1)

    # Get mask
    mask = env.get_mask(action_removal + 1, td).view(batch_size, -1)
    # Get action and log-likelihood
    logprob_reinsertion, action_reinsertion = decode_strategy.step(
        logits,
        mask,
        action=(
            actions[:, 1] * seq_length + actions[:, 2]
            if actions is not None
            else None
        ),
    )
    action_reinsertion = action_reinsertion.unsqueeze(-1)
    if phase == "train":
        selected_log_ll_action2 = logprob_reinsertion.gather(1, action_reinsertion)

    ## return
    N2S_action = torch.cat(
        (
            action_removal.view(batch_size, -1),
            action_reinsertion // seq_length,
            action_reinsertion % seq_length,
        ),
        -1,
    )
    if phase == "train":
        log_likelihood = selected_log_ll_action1 + selected_log_ll_action2
    else:
        log_likelihood = torch.zeros(batch_size, device=td.device)

    outdict = {"log_likelihood": log_likelihood, "cost_bsf": td["cost_bsf"]}
    td.set("action", N2S_action)

    if return_embeds:
        outdict["embeds"] = h_wave.detach()

    if return_actions:
        outdict["actions"] = N2S_action

    return outdict
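
The action tensor returned above packs three indices per instance, following the concatenation of N2S_action at the end of the forward pass: the node selected by the removal decoder, plus the row and column of the flattened reinsertion grid (interpreted as the re-insertion locations for the pickup and delivery nodes). A small unpacking sketch, where out is the dictionary returned by N2SPolicy.forward:

actions = out["actions"]        # shape: (batch_size, 3)
removal_node = actions[:, 0]    # index chosen by NodePairRemovalDecoder
reinsert_row = actions[:, 1]    # action_reinsertion // seq_length in the code above
reinsert_col = actions[:, 2]    # action_reinsertion % seq_length in the code above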

Classes:

  • N2S

    N2S Model based on n_step Proximal Policy Optimization (PPO) with an N2S model policy.

N2S

N2S(
    env: RL4COEnvBase,
    policy: Module = None,
    critic: CriticNetwork = None,
    policy_kwargs: dict = {},
    critic_kwargs: dict = {},
    **kwargs
)

Bases: n_step_PPO

N2S Model based on n_step Proximal Policy Optimization (PPO) with an N2S model policy. We default to the N2S model policy and the improvement Critic Network.

Parameters:

  • env (RL4COEnvBase) –

    Environment to use for the algorithm

  • policy (Module, default: None ) –

    Policy to use for the algorithm

  • critic (CriticNetwork, default: None ) –

    Critic to use for the algorithm

  • policy_kwargs (dict, default: {} ) –

    Keyword arguments for policy

  • critic_kwargs (dict, default: {} ) –

    Keyword arguments for critic

Source code in rl4co/models/zoo/n2s/model.py
def __init__(
    self,
    env: RL4COEnvBase,
    policy: nn.Module = None,
    critic: CriticNetwork = None,
    policy_kwargs: dict = {},
    critic_kwargs: dict = {},
    **kwargs,
):
    if policy is None:
        policy = N2SPolicy(env_name=env.name, **policy_kwargs)

    if critic is None:
        embed_dim = (
            policy_kwargs["embed_dim"] if "embed_dim" in policy_kwargs else 128
        )  # the critic's embed_dim must be as policy's

        encoder = MultiHeadAttentionLayer(
            embed_dim,
            critic_kwargs["num_heads"] if "num_heads" in critic_kwargs else 4,
            critic_kwargs["feedforward_hidden"]
            if "feedforward_hidden" in critic_kwargs
            else 128,
            critic_kwargs["normalization"]
            if "normalization" in critic_kwargs
            else "layer",
            bias=False,
        )
        value_head = CriticDecoder(embed_dim)

        critic = CriticNetwork(
            encoder=encoder,
            value_head=value_head,
            customized=True,
        )

    super().__init__(env, policy, critic, **kwargs)

NeuOpt

Classes:

  • RDSDecoder

    RDS Decoder for flexible k-opt based on Ma et al. (2023)

RDSDecoder

RDSDecoder(embed_dim: int = 128)

Bases: ImprovementDecoder

RDS Decoder for flexible k-opt based on Ma et al. (2023). Given the environment state and the node embeddings (positional embeddings are discarded), compute the logits for selecting a k-opt exchange, expressed through the basis moves (S-move, I-move, E-move), from the current solution.

Parameters:

  • embed_dim (int, default: 128 ) –

    Embedding dimension

  • num_heads

    Number of attention heads

Source code in rl4co/models/zoo/neuopt/decoder.py
def __init__(
    self,
    embed_dim: int = 128,
):
    super().__init__()
    self.embed_dim = embed_dim

    self.linear_K1 = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
    self.linear_K2 = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
    self.linear_K3 = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
    self.linear_K4 = nn.Linear(self.embed_dim, self.embed_dim, bias=False)

    self.linear_Q1 = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
    self.linear_Q2 = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
    self.linear_Q3 = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
    self.linear_Q4 = nn.Linear(self.embed_dim, self.embed_dim, bias=False)

    self.linear_V1 = nn.Parameter(torch.Tensor(self.embed_dim))
    self.linear_V2 = nn.Parameter(torch.Tensor(self.embed_dim))

    self.rnn1 = nn.GRUCell(self.embed_dim, self.embed_dim)
    self.rnn2 = nn.GRUCell(self.embed_dim, self.embed_dim)

Classes:

  • CustomizeTSPInitEmbedding

    Initial embedding for the Traveling Salesman Problem (TSP).

  • NeuOptPolicy

    NeuOpt Policy based on Ma et al. (2023)

CustomizeTSPInitEmbedding

CustomizeTSPInitEmbedding(embed_dim, linear_bias=True)

Bases: Module

Initial embedding for the Traveling Salesman Problem (TSP). Embeds the following node features into the embedding space:

- locs: x, y coordinates of the cities
Source code in rl4co/models/zoo/neuopt/policy.py
def __init__(self, embed_dim, linear_bias=True):
    super(CustomizeTSPInitEmbedding, self).__init__()
    node_dim = 2  # x, y
    self.init_embed = nn.Sequential(
        nn.Linear(node_dim, embed_dim // 2, linear_bias),
        nn.ReLU(inplace=True),
        nn.Linear(embed_dim // 2, embed_dim, linear_bias),
    )
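
A small sketch of the projection defined above: the two-layer MLP maps raw 2-D coordinates to embed_dim-dimensional node embeddings. Calling init_embed directly on a coordinate tensor illustrates the shapes; how the module's forward method extracts the coordinates from the state TensorDict is not shown in the listing above.

import torch
from rl4co.models.zoo.neuopt.policy import CustomizeTSPInitEmbedding  # module path from the "Source code" location above

embedding = CustomizeTSPInitEmbedding(embed_dim=128)

locs = torch.rand(2, 20, 2)     # (batch_size, num_cities, xy coordinates)
h = embedding.init_embed(locs)  # apply the Linear -> ReLU -> Linear stack directly
print(h.shape)                  # torch.Size([2, 20, 128])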

NeuOptPolicy

NeuOptPolicy(
    embed_dim: int = 128,
    num_encoder_layers: int = 3,
    num_heads: int = 4,
    normalization: str = "layer",
    feedforward_hidden: int = 128,
    env_name: str = "tsp_kopt",
    pos_type: str = "CPE",
    init_embedding: Module = None,
    pos_embedding: Module = None,
    temperature: float = 1.0,
    tanh_clipping: float = 6.0,
    train_decode_type: str = "sampling",
    val_decode_type: str = "sampling",
    test_decode_type: str = "sampling",
)

Bases: ImprovementPolicy

NeuOpt Policy based on Ma et al. (2023). This model first encodes the input graph and the current solution using an N2S encoder (:class:N2SEncoder) and then decodes the k-opt action (:class:RDSDecoder).

Parameters:

  • embed_dim (int, default: 128 ) –

    Dimension of the node embeddings

  • num_encoder_layers (int, default: 3 ) –

    Number of layers in the encoder

  • num_heads (int, default: 4 ) –

    Number of heads in the attention layers

  • normalization (str, default: 'layer' ) –

    Normalization type in the attention layers

  • feedforward_hidden (int, default: 128 ) –

    Dimension of the hidden layer in the feedforward network

  • env_name (str, default: 'tsp_kopt' ) –

    Name of the environment used to initialize embeddings

  • pos_type (str, default: 'CPE' ) –

    Name of the used positional encoding method (CPE or APE)

  • init_embedding (Module, default: None ) –

    Module to use for the initialization of the embeddings

  • pos_embedding (Module, default: None ) –

    Module to use for the initialization of the positional embeddings

  • temperature (float, default: 1.0 ) –

    Temperature for the softmax

  • tanh_clipping (float, default: 6.0 ) –

    Tanh clipping value (see Bello et al., 2016)

  • train_decode_type (str, default: 'sampling' ) –

    Type of decoding to use during training

  • val_decode_type (str, default: 'sampling' ) –

    Type of decoding to use during validation

  • test_decode_type (str, default: 'sampling' ) –

    Type of decoding to use during testing

Methods:

  • forward

    Forward pass of the policy.

Source code in rl4co/models/zoo/neuopt/policy.py
def __init__(
    self,
    embed_dim: int = 128,
    num_encoder_layers: int = 3,
    num_heads: int = 4,
    normalization: str = "layer",
    feedforward_hidden: int = 128,
    env_name: str = "tsp_kopt",
    pos_type: str = "CPE",
    init_embedding: nn.Module = None,
    pos_embedding: nn.Module = None,
    temperature: float = 1.0,
    tanh_clipping: float = 6.0,
    train_decode_type: str = "sampling",
    val_decode_type: str = "sampling",
    test_decode_type: str = "sampling",
):
    super(NeuOptPolicy, self).__init__()

    self.env_name = env_name
    self.embed_dim = embed_dim

    # Decoding strategies
    self.temperature = temperature
    self.tanh_clipping = tanh_clipping
    self.train_decode_type = train_decode_type
    self.val_decode_type = val_decode_type
    self.test_decode_type = test_decode_type

    # Encoder and decoder
    if init_embedding is None:
        init_embedding = CustomizeTSPInitEmbedding(self.embed_dim)

    self.encoder = N2SEncoder(
        embed_dim=embed_dim,
        init_embedding=init_embedding,
        pos_embedding=pos_embedding,
        env_name=env_name,
        pos_type=pos_type,
        num_heads=num_heads,
        num_layers=num_encoder_layers,
        normalization=normalization,
        feedforward_hidden=feedforward_hidden,
    )

    self.decoder = RDSDecoder(embed_dim=embed_dim)

    self.init_hidden_W = nn.Linear(self.embed_dim, self.embed_dim)
    self.init_query_learnable = nn.Parameter(torch.Tensor(self.embed_dim))

    self.init_parameters()

forward

forward(
    td: TensorDict,
    env: Union[str, RL4COEnvBase] = None,
    phase: str = "train",
    return_actions: bool = True,
    return_embeds: bool = False,
    only_return_embed: bool = False,
    actions=None,
    **decoding_kwargs
) -> dict

Forward pass of the policy.

Parameters:

  • td (TensorDict) –

    TensorDict containing the environment state

  • env (Union[str, RL4COEnvBase], default: None ) –

    Environment to use for decoding. If None, the environment is instantiated from env_name. Note that it is more efficient to pass an already instantiated environment each time for fine-grained control

  • phase (str, default: 'train' ) –

    Phase of the algorithm (train, val, test)

  • return_actions (bool, default: True ) –

    Whether to return the actions

  • actions

    Actions to use for evaluating the policy. If passed, use these actions instead of sampling from the policy to calculate log likelihood

  • decoding_kwargs

    Keyword arguments for the decoding strategy. See :class:rl4co.utils.decoding.DecodingStrategy for more information.

Returns:

  • out ( dict ) –

    Dictionary containing the reward, log likelihood, and optionally the actions and entropy

Source code in rl4co/models/zoo/neuopt/policy.py
def forward(
    self,
    td: TensorDict,
    env: Union[str, RL4COEnvBase] = None,
    phase: str = "train",
    return_actions: bool = True,
    return_embeds: bool = False,
    only_return_embed: bool = False,
    actions=None,
    **decoding_kwargs,
) -> dict:
    """Forward pass of the policy.

    Args:
        td: TensorDict containing the environment state
        env: Environment to use for decoding. If None, the environment is instantiated from `env_name`. Note that
            it is more efficient to pass an already instantiated environment each time for fine-grained control
        phase: Phase of the algorithm (train, val, test)
        return_actions: Whether to return the actions
        actions: Actions to use for evaluating the policy.
            If passed, use these actions instead of sampling from the policy to calculate log likelihood
        decoding_kwargs: Keyword arguments for the decoding strategy. See :class:`rl4co.utils.decoding.DecodingStrategy` for more information.

    Returns:
        out: Dictionary containing the reward, log likelihood, and optionally the actions and entropy
    """

    # Encoder: get encoder output and initial embeddings from initial state
    nfe, _ = self.encoder(td)
    if only_return_embed:
        return {"embeds": nfe.detach()}

    # Instantiate environment if needed
    if isinstance(env, str) or env is None:
        env_name = self.env_name if env is None else env
        log.info(f"Instantiated environment not provided; instantiating {env_name}")
        env = get_env(env_name)
    assert not env.two_opt_mode, "NeuOpt only support k-opt with k > 2"

    # Get decode type depending on phase and whether actions are passed for evaluation
    decode_type = decoding_kwargs.pop("decode_type", None)
    if actions is not None:
        decode_type = "evaluate"
    elif decode_type is None:
        decode_type = getattr(self, f"{phase}_decode_type")

    # Setup decoding strategy
    # we pop arguments that are not part of the decoding strategy
    decode_strategy: DecodingStrategy = get_decoding_strategy(
        decode_type,
        temperature=decoding_kwargs.pop("temperature", self.temperature),
        tanh_clipping=decoding_kwargs.pop("tanh_clipping", self.tanh_clipping),
        mask_logits=True,
        improvement_method_mode=True,
        **decoding_kwargs,
    )

    # Perform the decoding
    bs, gs, _, ll, action_sampled, rec, visited_time = (
        *nfe.size(),
        0.0,
        None,
        td["rec_current"],
        td["visited_time"],
    )
    action_index = torch.zeros(bs, env.k_max, dtype=torch.long).to(rec.device)
    k_action_left = torch.zeros(bs, env.k_max + 1, dtype=torch.long).to(rec.device)
    k_action_right = torch.zeros(bs, env.k_max, dtype=torch.long).to(rec.device)
    next_of_last_action = (
        torch.zeros_like(rec[:, :1], dtype=torch.long).to(rec.device) - 1
    )
    mask = torch.zeros_like(rec, dtype=torch.bool).to(rec.device)
    stopped = torch.ones(bs, dtype=torch.bool).to(rec.device)
    zeros = torch.zeros((bs, 1), device=td.device)

    # init queries
    h_mean = nfe.mean(1)
    init_query = self.init_query_learnable.repeat(bs, 1)
    input_q1 = input_q2 = init_query.clone()
    init_hidden = self.init_hidden_W(h_mean)
    q1 = q2 = init_hidden.clone()

    for i in range(env.k_max):
        # Pass RDS decoder
        logits, q1, q2 = self.decoder(nfe, q1, q2, input_q1, input_q2)

        # Calc probs
        if i == 0 and "action" in td.keys():
            mask = mask.scatter(1, td["action"][:, :1], 1)

        logprob, action_sampled = decode_strategy.step(
            logits,
            ~mask.clone(),
            action=actions[:, i : i + 1].squeeze() if actions is not None else None,
        )
        action_sampled = action_sampled.unsqueeze(-1)
        if i > 0:
            action_sampled = torch.where(
                stopped.unsqueeze(-1), action_index[:, :1], action_sampled
            )
        if phase == "train":
            loss_now = logprob.gather(1, action_sampled)
        else:
            loss_now = zeros.clone()

        # Record log_likelihood and Entropy
        if i > 0:
            ll = ll + torch.where(stopped.unsqueeze(-1), zeros * 0, loss_now)
        else:
            ll = ll + loss_now

        # Store and Process actions
        next_of_new_action = rec.gather(1, action_sampled)
        action_index[:, i] = action_sampled.squeeze().clone()
        k_action_left[stopped, i] = action_sampled[stopped].squeeze().clone()
        k_action_right[~stopped, i - 1] = action_sampled[~stopped].squeeze().clone()
        k_action_left[:, i + 1] = next_of_new_action.squeeze().clone()

        # Prepare next RNN input
        input_q1 = nfe.gather(
            1, action_sampled.view(bs, 1, 1).expand(bs, 1, self.embed_dim)
        ).squeeze(1)
        input_q2 = torch.where(
            stopped.view(bs, 1).expand(bs, self.embed_dim),
            input_q1.clone(),
            nfe.gather(
                1,
                (next_of_last_action % gs)
                .view(bs, 1, 1)
                .expand(bs, 1, self.embed_dim),
            ).squeeze(1),
        )

        # Process if k-opt close
        # assert (input_q1[stopped] == input_q2[stopped]).all()
        if i > 0:
            stopped = stopped | (action_sampled == next_of_last_action).squeeze()
        else:
            stopped = (action_sampled == next_of_last_action).squeeze()
        # assert (input_q1[stopped] == input_q2[stopped]).all()

        k_action_left[stopped, i] = k_action_left[stopped, i - 1]
        k_action_right[stopped, i] = k_action_right[stopped, i - 1]

        # Calc next basic masks
        if i == 0:
            visited_time_tag = (
                visited_time - visited_time.gather(1, action_sampled)
            ) % gs
        mask &= False
        mask[(visited_time_tag <= visited_time_tag.gather(1, action_sampled))] = True
        if i == 0:
            mask[visited_time_tag > (gs - 2)] = True
        mask[stopped, action_sampled[stopped].squeeze()] = (
            False  # allow next k-opt starts immediately
        )
        # if True:#i == env.k_max - 2: # allow special case: close k-opt at the first selected node
        index_allow_first_node = (~stopped) & (
            next_of_new_action.squeeze() == action_index[:, 0]
        )
        mask[index_allow_first_node, action_index[index_allow_first_node, 0]] = False

        # Move to next
        next_of_last_action = next_of_new_action
        next_of_last_action[stopped] = -1

    # Form final action
    k_action_right[~stopped, -1] = k_action_left[~stopped, -1].clone()
    k_action_left = k_action_left[:, : env.k_max]
    action_all = torch.cat((action_index, k_action_left, k_action_right), -1)

    outdict = {"log_likelihood": ll, "cost_bsf": td["cost_bsf"]}
    td.set("action", action_all)

    if return_embeds:
        outdict["embeds"] = nfe.detach()

    if return_actions:
        outdict["actions"] = action_all

    return outdict
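
The action tensor assembled at the end of this forward pass concatenates three blocks of length k_max, mirroring the construction of action_all above. A hedged unpacking sketch, where out is the dictionary returned by NeuOptPolicy.forward and env the k-opt environment:

k_max = env.k_max
actions = out["actions"]                     # shape: (batch_size, 3 * k_max)
action_index = actions[:, :k_max]            # node selected at each of the k_max decoding steps
k_action_left = actions[:, k_max:2 * k_max]  # bookkeeping indices built in the loop above,
k_action_right = actions[:, 2 * k_max:]      # used by the environment to apply the k-opt move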

Classes:

  • NeuOpt

    NeuOpt Model based on n_step Proximal Policy Optimization (PPO) with a NeuOpt model policy.

NeuOpt

NeuOpt(
    env: RL4COEnvBase,
    policy: Module = None,
    critic: CriticNetwork = None,
    policy_kwargs: dict = {},
    critic_kwargs: dict = {},
    **kwargs
)

Bases: n_step_PPO

NeuOpt Model based on n_step Proximal Policy Optimization (PPO) with a NeuOpt model policy. We default to the NeuOpt model policy and the improvement Critic Network.

Parameters:

  • env (RL4COEnvBase) –

    Environment to use for the algorithm

  • policy (Module, default: None ) –

    Policy to use for the algorithm

  • critic (CriticNetwork, default: None ) –

    Critic to use for the algorithm

  • policy_kwargs (dict, default: {} ) –

    Keyword arguments for policy

  • critic_kwargs (dict, default: {} ) –

    Keyword arguments for critic

Source code in rl4co/models/zoo/neuopt/model.py
def __init__(
    self,
    env: RL4COEnvBase,
    policy: nn.Module = None,
    critic: CriticNetwork = None,
    policy_kwargs: dict = {},
    critic_kwargs: dict = {},
    **kwargs,
):
    if policy is None:
        policy = NeuOptPolicy(env_name=env.name, **policy_kwargs)

    if critic is None:
        embed_dim = (
            policy_kwargs["embed_dim"] if "embed_dim" in policy_kwargs else 128
        )  # the critic's embed_dim must be as policy's

        encoder = MultiHeadAttentionLayer(
            embed_dim,
            critic_kwargs["num_heads"] if "num_heads" in critic_kwargs else 4,
            critic_kwargs["feedforward_hidden"]
            if "feedforward_hidden" in critic_kwargs
            else 128,
            critic_kwargs["normalization"]
            if "normalization" in critic_kwargs
            else "layer",
            bias=False,
        )
        value_head = CriticDecoder(embed_dim, dropout_rate=0.001)

        critic = CriticNetwork(
            encoder=encoder,
            value_head=value_head,
            customized=True,
        )

    super().__init__(env, policy, critic, **kwargs)