
Improvement Methods

These methods are trained to iteratively improve existing solutions, akin to local search algorithms: rather than constructing solutions from scratch, they learn to refine a given solution step by step.
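
For orientation, the sketch below shows how such a model (DACT) can be trained end to end. It is a minimal sketch, not a verified recipe: the import paths for get_env, DACT, and RL4COTrainer are assumptions based on the usual RL4CO layout, and the "tsp_kopt" environment is used with its default generator settings.

# Minimal training sketch (assumed import paths; check them against your RL4CO version)
from rl4co.envs import get_env                 # factory also used internally by the policies on this page
from rl4co.models import DACT                  # assumed top-level re-export of the DACT model below
from rl4co.utils.trainer import RL4COTrainer   # assumed Lightning-based trainer wrapper

env = get_env("tsp_kopt")                      # k-opt improvement environment expected by DACT
model = DACT(env)                              # defaults to DACTPolicy and the improvement critic

trainer = RL4COTrainer(max_epochs=1, devices=1)
trainer.fit(model)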

DACT

Classes:

  • DACTEncoder

    Dual-Aspect Collaborative Transformer Encoder as in Ma et al. (2021)

DACTEncoder

DACTEncoder(
    embed_dim: int = 64,
    init_embedding: Module = None,
    pos_embedding: Module = None,
    env_name: str = "tsp_kopt",
    pos_type: str = "CPE",
    num_heads: int = 4,
    num_layers: int = 3,
    normalization: str = "layer",
    feedforward_hidden: int = 64,
)

Bases: ImprovementEncoder

Dual-Aspect Collaborative Transformer Encoder as in Ma et al. (2021)

Parameters:

  • embed_dim (int, default: 64 ) –

    Dimension of the embedding space

  • init_embedding (Module, default: None ) –

    Module to use for the initialization of the node embeddings

  • pos_embedding (Module, default: None ) –

    Module to use for the initialization of the positional embeddings

  • env_name (str, default: 'tsp_kopt' ) –

    Name of the environment used to initialize embeddings

  • pos_type (str, default: 'CPE' ) –

    Name of the used positional encoding method (CPE or APE)

  • num_heads (int, default: 4 ) –

    Number of heads in the attention layers

  • num_layers (int, default: 3 ) –

    Number of layers in the attention network

  • normalization (str, default: 'layer' ) –

    Normalization type in the attention layers

  • feedforward_hidden (int, default: 64 ) –

    Hidden dimension in the feedforward layers

Source code in rl4co/models/zoo/dact/encoder.py
def __init__(
    self,
    embed_dim: int = 64,
    init_embedding: nn.Module = None,
    pos_embedding: nn.Module = None,
    env_name: str = "tsp_kopt",
    pos_type: str = "CPE",
    num_heads: int = 4,
    num_layers: int = 3,
    normalization: str = "layer",
    feedforward_hidden: int = 64,
):
    super(DACTEncoder, self).__init__(
        embed_dim=embed_dim,
        env_name=env_name,
        pos_type=pos_type,
        num_heads=num_heads,
        num_layers=num_layers,
        normalization=normalization,
        feedforward_hidden=feedforward_hidden,
    )

    assert self.env_name in ["tsp_kopt"], NotImplementedError()

    self.net = AdaptiveSequential(
        *(
            DACTEncoderLayer(
                num_heads,
                embed_dim,
                feedforward_hidden,
                normalization,
            )
            for _ in range(num_layers)
        )
    )
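
A brief construction sketch for the encoder above. Calling it on the environment's state TensorDict returns the pair of node-feature and positional-feature embeddings (NFE, PFE), as used in DACTPolicy.forward below; the state td is assumed to come from a "tsp_kopt" environment.

# Construction sketch (module path taken from the "Source code" location above)
from rl4co.models.zoo.dact.encoder import DACTEncoder

encoder = DACTEncoder(
    embed_dim=64,
    env_name="tsp_kopt",
    pos_type="CPE",   # cyclic positional encoding; "APE" is the alternative
    num_heads=4,
    num_layers=3,
)

# As used in DACTPolicy.forward: NFE, PFE = encoder(td)
# Both embeddings have shape (batch_size, num_nodes, embed_dim).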

Classes:

  • DACTDecoder

    DACT decoder based on Ma et al. (2021)

DACTDecoder

DACTDecoder(embed_dim: int = 64, num_heads: int = 4)

Bases: ImprovementDecoder

DACT decoder based on Ma et al. (2021). Given the environment state and the dual sets of embeddings (NFE and PFE embeddings), compute the logits for selecting the two nodes of the 2-opt local search move from the current solution.

Parameters:

  • embed_dim (int, default: 64 ) –

    Embedding dimension

  • num_heads (int, default: 4 ) –

    Number of attention heads

Methods:

  • forward

    Compute the logits for removing a node pair from the current solution

Source code in rl4co/models/zoo/dact/decoder.py
def __init__(
    self,
    embed_dim: int = 64,
    num_heads: int = 4,
):
    super().__init__()
    self.embed_dim = embed_dim
    self.n_heads = num_heads
    self.hidden_dim = embed_dim

    # for MHC sublayer (NFE aspect)
    self.compater_node = MultiHeadCompat(
        num_heads, embed_dim, embed_dim, embed_dim, embed_dim
    )

    # for MHC sublayer (PFE aspect)
    self.compater_pos = MultiHeadCompat(
        num_heads, embed_dim, embed_dim, embed_dim, embed_dim
    )

    self.norm_factor = 1 / math.sqrt(1 * self.hidden_dim)

    # for Max-Pooling sublayer
    self.project_graph_pos = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
    self.project_graph_node = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
    self.project_node_pos = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
    self.project_node_node = nn.Linear(self.embed_dim, self.embed_dim, bias=False)

    # for feed-forward aggregation (FFA)sublayer
    self.value_head = MLP(
        input_dim=2 * self.n_heads,
        output_dim=1,
        num_neurons=[32, 32],
        dropout_probs=[0.05, 0.00],
    )

forward

forward(
    td: TensorDict, final_h: Tensor, final_p: Tensor
) -> Tensor

Compute the logits for removing a node pair from the current solution

Parameters:

  • td (TensorDict) –

    TensorDict with the current environment state

  • final_h (Tensor) –

    final NFE embeddings

  • final_p (Tensor) –

    final PFE embeddings

Source code in rl4co/models/zoo/dact/decoder.py
def forward(self, td: TensorDict, final_h: Tensor, final_p: Tensor) -> Tensor:
    """Compute the logits of the removing a node pair from the current solution

    Args:
        td: TensorDict with the current environment state
        final_h: final NFE embeddings
        final_p: final pfe embeddings
    """

    batch_size, graph_size, dim = final_h.size()

    # Max-Pooling sublayer
    h_node_refined = self.project_node_node(final_h) + self.project_graph_node(
        final_h.max(1)[0]
    )[:, None, :].expand(batch_size, graph_size, dim)
    h_pos_refined = self.project_node_pos(final_p) + self.project_graph_pos(
        final_p.max(1)[0]
    )[:, None, :].expand(batch_size, graph_size, dim)

    # MHC sublayer
    compatibility = torch.zeros(
        (batch_size, graph_size, graph_size, self.n_heads * 2),
        device=h_node_refined.device,
    )
    compatibility[:, :, :, : self.n_heads] = self.compater_pos(h_pos_refined).permute(
        1, 2, 3, 0
    )
    compatibility[:, :, :, self.n_heads :] = self.compater_node(
        h_node_refined
    ).permute(1, 2, 3, 0)

    # FFA sublater
    return self.value_head(self.norm_factor * compatibility).squeeze(-1)
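
A shape-oriented sketch of the forward pass listed above: with dummy NFE/PFE tensors, the decoder produces one logit per ordered node pair, i.e. a (batch_size, graph_size, graph_size) tensor that the policy flattens before masking and sampling. The empty TensorDict is only a placeholder here, since td is not read in the code shown above.

import torch
from tensordict import TensorDict
from rl4co.models.zoo.dact.decoder import DACTDecoder  # module path from the "Source code" location above

decoder = DACTDecoder(embed_dim=64, num_heads=4)

batch_size, graph_size, embed_dim = 2, 10, 64
final_h = torch.rand(batch_size, graph_size, embed_dim)  # dummy NFE embeddings
final_p = torch.rand(batch_size, graph_size, embed_dim)  # dummy PFE embeddings

logits = decoder(TensorDict({}, batch_size=[batch_size]), final_h, final_p)
print(logits.shape)  # torch.Size([2, 10, 10]) -> one logit per candidate 2-opt node pair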

Classes:

  • DACTPolicy

    DACT Policy based on Ma et al. (2021)

DACTPolicy

DACTPolicy(
    embed_dim: int = 64,
    num_encoder_layers: int = 3,
    num_heads: int = 4,
    normalization: str = "layer",
    feedforward_hidden: int = 64,
    env_name: str = "tsp_kopt",
    pos_type: str = "CPE",
    init_embedding: Module = None,
    pos_embedding: Module = None,
    temperature: float = 1.0,
    tanh_clipping: float = 6.0,
    train_decode_type: str = "sampling",
    val_decode_type: str = "sampling",
    test_decode_type: str = "sampling",
)

Bases: ImprovementPolicy

DACT Policy based on Ma et al. (2021). This model first encodes the input graph and the current solution using a DACT encoder (:class:DACTEncoder) and then decodes the 2-opt action (:class:DACTDecoder).

Parameters:

  • embed_dim (int, default: 64 ) –

    Dimension of the node embeddings

  • num_encoder_layers (int, default: 3 ) –

    Number of layers in the encoder

  • num_heads (int, default: 4 ) –

    Number of heads in the attention layers

  • normalization (str, default: 'layer' ) –

    Normalization type in the attention layers

  • feedforward_hidden (int, default: 64 ) –

    Dimension of the hidden layer in the feedforward network

  • env_name (str, default: 'tsp_kopt' ) –

    Name of the environment used to initialize embeddings

  • pos_type (str, default: 'CPE' ) –

    Name of the used positional encoding method (CPE or APE)

  • init_embedding (Module, default: None ) –

    Module to use for the initialization of the embeddings

  • pos_embedding (Module, default: None ) –

    Module to use for the initialization of the positional embeddings

  • temperature (float, default: 1.0 ) –

    Temperature for the softmax

  • tanh_clipping (float, default: 6.0 ) –

    Tanh clipping value (see Bello et al., 2016)

  • train_decode_type (str, default: 'sampling' ) –

    Type of decoding to use during training

  • val_decode_type (str, default: 'sampling' ) –

    Type of decoding to use during validation

  • test_decode_type (str, default: 'sampling' ) –

    Type of decoding to use during testing

Methods:

  • forward

    Forward pass of the policy.

Source code in rl4co/models/zoo/dact/policy.py
def __init__(
    self,
    embed_dim: int = 64,
    num_encoder_layers: int = 3,
    num_heads: int = 4,
    normalization: str = "layer",
    feedforward_hidden: int = 64,
    env_name: str = "tsp_kopt",
    pos_type: str = "CPE",
    init_embedding: nn.Module = None,
    pos_embedding: nn.Module = None,
    temperature: float = 1.0,
    tanh_clipping: float = 6.0,
    train_decode_type: str = "sampling",
    val_decode_type: str = "sampling",
    test_decode_type: str = "sampling",
):
    super(DACTPolicy, self).__init__()

    self.env_name = env_name

    # Encoder and decoder
    self.encoder = DACTEncoder(
        embed_dim=embed_dim,
        init_embedding=init_embedding,
        pos_embedding=pos_embedding,
        env_name=env_name,
        pos_type=pos_type,
        num_heads=num_heads,
        num_layers=num_encoder_layers,
        normalization=normalization,
        feedforward_hidden=feedforward_hidden,
    )

    self.decoder = DACTDecoder(embed_dim=embed_dim, num_heads=num_heads)

    # Decoding strategies
    self.temperature = temperature
    self.tanh_clipping = tanh_clipping
    self.train_decode_type = train_decode_type
    self.val_decode_type = val_decode_type
    self.test_decode_type = test_decode_type

forward

forward(
    td: TensorDict,
    env: Union[str, RL4COEnvBase] = None,
    phase: str = "train",
    return_actions: bool = True,
    return_embeds: bool = False,
    only_return_embed: bool = False,
    actions=None,
    **decoding_kwargs
) -> dict

Forward pass of the policy.

Parameters:

  • td (TensorDict) –

    TensorDict containing the environment state

  • env (Union[str, RL4COEnvBase], default: None ) –

    Environment to use for decoding. If None, the environment is instantiated from env_name. Note that it is more efficient to pass an already instantiated environment each time for fine-grained control

  • phase (str, default: 'train' ) –

    Phase of the algorithm (train, val, test)

  • return_actions (bool, default: True ) –

    Whether to return the actions

  • actions

    Actions to use for evaluating the policy. If passed, use these actions instead of sampling from the policy to calculate log likelihood

  • decoding_kwargs

    Keyword arguments for the decoding strategy. See :class:rl4co.utils.decoding.DecodingStrategy for more information.

Returns:

  • out ( dict ) –

    Dictionary containing the reward, log likelihood, and optionally the actions and entropy

Source code in rl4co/models/zoo/dact/policy.py
def forward(
    self,
    td: TensorDict,
    env: Union[str, RL4COEnvBase] = None,
    phase: str = "train",
    return_actions: bool = True,
    return_embeds: bool = False,
    only_return_embed: bool = False,
    actions=None,
    **decoding_kwargs,
) -> dict:
    """Forward pass of the policy.

    Args:
        td: TensorDict containing the environment state
        env: Environment to use for decoding. If None, the environment is instantiated from `env_name`. Note that
            it is more efficient to pass an already instantiated environment each time for fine-grained control
        phase: Phase of the algorithm (train, val, test)
        return_actions: Whether to return the actions
        actions: Actions to use for evaluating the policy.
            If passed, use these actions instead of sampling from the policy to calculate log likelihood
        decoding_kwargs: Keyword arguments for the decoding strategy. See :class:`rl4co.utils.decoding.DecodingStrategy` for more information.

    Returns:
        out: Dictionary containing the reward, log likelihood, and optionally the actions and entropy
    """

    # Encoder: get encoder output and initial embeddings from initial state
    NFE, PFE = self.encoder(td)
    h_featrues = torch.cat((NFE, PFE), -1)

    if only_return_embed:
        return {"embeds": h_featrues.detach()}

    # Instantiate environment if needed
    if isinstance(env, str) or env is None:
        env_name = self.env_name if env is None else env
        log.info(f"Instantiated environment not provided; instantiating {env_name}")
        env = get_env(env_name)
    assert env.two_opt_mode, "DACT only support 2-opt"

    # Get decode type depending on phase and whether actions are passed for evaluation
    decode_type = decoding_kwargs.pop("decode_type", None)
    if actions is not None:
        decode_type = "evaluate"
    elif decode_type is None:
        decode_type = getattr(self, f"{phase}_decode_type")

    # Setup decoding strategy
    # we pop arguments that are not part of the decoding strategy
    decode_strategy: DecodingStrategy = get_decoding_strategy(
        decode_type,
        temperature=decoding_kwargs.pop("temperature", self.temperature),
        tanh_clipping=decoding_kwargs.pop("tanh_clipping", self.tanh_clipping),
        mask_logits=True,
        improvement_method_mode=True,
        **decoding_kwargs,
    )

    # Perform the decoding
    batch_size, seq_length = td["rec_current"].size()
    logits = self.decoder(td, NFE, PFE).view(batch_size, -1)

    # Get mask
    mask = env.get_mask(td)
    if "action" in td.keys():
        mask[torch.arange(batch_size), td["action"][:, 0], td["action"][:, 1]] = False
        mask[torch.arange(batch_size), td["action"][:, 1], td["action"][:, 0]] = False
    mask = mask.view(batch_size, -1)

    # Get action and log-likelihood
    logprob, action_sampled = decode_strategy.step(
        logits,
        mask,
        action=(
            actions[:, 0] * seq_length + actions[:, 1]
            if actions is not None
            else None
        ),
    )
    action_sampled = action_sampled.unsqueeze(-1)
    if phase == "train":
        log_likelihood = logprob.gather(1, action_sampled)
    else:
        log_likelihood = torch.zeros(batch_size, device=td.device)

    ## return
    DACT_action = torch.cat(
        (
            action_sampled // seq_length,
            action_sampled % seq_length,
        ),
        -1,
    )

    outdict = {"log_likelihood": log_likelihood, "cost_bsf": td["cost_bsf"]}
    td.set("action", DACT_action)

    if return_embeds:
        outdict["embeds"] = h_featrues.detach()

    if return_actions:
        outdict["actions"] = DACT_action

    return outdict
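
A hedged usage sketch of this forward pass. It assumes policy is a DACTPolicy, env an instantiated "tsp_kopt" environment, and td a state TensorDict produced by that environment (for example via env.reset); the exact reset and rollout details depend on the improvement environment's API.

# Greedy decoding at test time (decode_type is forwarded to the decoding strategy)
out = policy(td, env=env, phase="test", decode_type="greedy")
actions = out["actions"]      # (batch_size, 2): node pair selected for the 2-opt move
cost_bsf = out["cost_bsf"]    # best-so-far cost carried in the state

# Evaluating externally provided actions instead of sampling (to compute their log-likelihood)
out_eval = policy(td, env=env, phase="train", actions=actions)
log_likelihood = out_eval["log_likelihood"]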

Classes:

  • DACT

    DACT Model based on n_step Proximal Policy Optimization (PPO) with a DACT model policy.

DACT

DACT(
    env: RL4COEnvBase,
    policy: Module = None,
    critic: CriticNetwork = None,
    policy_kwargs: dict = {},
    critic_kwargs: dict = {},
    **kwargs
)

Bases: n_step_PPO

DACT Model based on n_step Proximal Policy Optimization (PPO) with a DACT model policy. We default to the DACT model policy and the improvement Critic Network.

Parameters:

  • env (RL4COEnvBase) –

    Environment to use for the algorithm

  • policy (Module, default: None ) –

    Policy to use for the algorithm

  • critic (CriticNetwork, default: None ) –

    Critic to use for the algorithm

  • policy_kwargs (dict, default: {} ) –

    Keyword arguments for policy

  • critic_kwargs (dict, default: {} ) –

    Keyword arguments for critic

Source code in rl4co/models/zoo/dact/model.py
def __init__(
    self,
    env: RL4COEnvBase,
    policy: nn.Module = None,
    critic: CriticNetwork = None,
    policy_kwargs: dict = {},
    critic_kwargs: dict = {},
    **kwargs,
):
    if policy is None:
        policy = DACTPolicy(env_name=env.name, **policy_kwargs)

    if critic is None:
        embed_dim = (
            policy_kwargs["embed_dim"] * 2 if "embed_dim" in policy_kwargs else 128
        )  # the critic's embed_dim must be as policy's

        encoder = MultiHeadAttentionLayer(
            embed_dim,
            critic_kwargs["num_heads"] if "num_heads" in critic_kwargs else 4,
            critic_kwargs["feedforward_hidden"] * 2
            if "feedforward_hidden" in critic_kwargs
            else 128,
            critic_kwargs["normalization"]
            if "normalization" in critic_kwargs
            else "layer",
            bias=False,
        )
        value_head = CriticDecoder(embed_dim)

        critic = CriticNetwork(
            encoder=encoder,
            value_head=value_head,
            customized=True,
        )

    super().__init__(env, policy, critic, **kwargs)
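
A construction sketch following the __init__ above. When no critic is passed, the default critic encoder's embedding dimension is set to twice the policy's embed_dim, matching the concatenated (NFE, PFE) embeddings of the policy, so overriding embed_dim in policy_kwargs keeps the two consistent automatically. Import paths are assumptions based on the source locations shown on this page.

from rl4co.envs import get_env              # assumed factory, as used inside the policies
from rl4co.models.zoo.dact import DACT      # assumed re-export from rl4co/models/zoo/dact/model.py

env = get_env("tsp_kopt")
model = DACT(
    env,
    policy_kwargs={"embed_dim": 64, "num_encoder_layers": 3},
    critic_kwargs={"num_heads": 4, "normalization": "layer"},
)
# With embed_dim=64 in policy_kwargs, the default critic encoder uses 64 * 2 = 128 dimensions.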

N2S

Classes:

  • N2SEncoder

    Neural Neighborhood Search Encoder as in Ma et al. (2022)

N2SEncoder

N2SEncoder(
    embed_dim: int = 128,
    init_embedding: Module = None,
    pos_embedding: Module = None,
    env_name: str = "pdp_ruin_repair",
    pos_type: str = "CPE",
    num_heads: int = 4,
    num_layers: int = 3,
    normalization: str = "layer",
    feedforward_hidden: int = 128,
)

Bases: ImprovementEncoder

Neural Neighborhood Search Encoder as in Ma et al. (2022). First embed the input and then process it with a Graph Attention Network.

Parameters:

  • embed_dim (int, default: 128 ) –

    Dimension of the embedding space

  • init_embedding (Module, default: None ) –

    Module to use for the initialization of the node embeddings

  • pos_embedding (Module, default: None ) –

    Module to use for the initialization of the positional embeddings

  • env_name (str, default: 'pdp_ruin_repair' ) –

    Name of the environment used to initialize embeddings

  • pos_type (str, default: 'CPE' ) –

    Name of the used positional encoding method (CPE or APE)

  • num_heads (int, default: 4 ) –

    Number of heads in the attention layers

  • num_layers (int, default: 3 ) –

    Number of layers in the attention network

  • normalization (str, default: 'layer' ) –

    Normalization type in the attention layers

  • feedforward_hidden (int, default: 128 ) –

    Hidden dimension in the feedforward layers

Source code in rl4co/models/zoo/n2s/encoder.py
def __init__(
    self,
    embed_dim: int = 128,
    init_embedding: nn.Module = None,
    pos_embedding: nn.Module = None,
    env_name: str = "pdp_ruin_repair",
    pos_type: str = "CPE",
    num_heads: int = 4,
    num_layers: int = 3,
    normalization: str = "layer",
    feedforward_hidden: int = 128,
):
    super(N2SEncoder, self).__init__(
        embed_dim=embed_dim,
        init_embedding=init_embedding,
        pos_embedding=pos_embedding,
        env_name=env_name,
        pos_type=pos_type,
        num_heads=num_heads,
        num_layers=num_layers,
        normalization=normalization,
        feedforward_hidden=feedforward_hidden,
    )

    self.pos_net = MultiHeadCompat(num_heads, embed_dim, feedforward_hidden)

    self.net = AdaptiveSequential(
        *(
            N2SEncoderLayer(
                num_heads,
                embed_dim,
                feedforward_hidden,
                normalization,
            )
            for _ in range(num_layers)
        )
    )

Classes:

  • NodePairRemovalDecoder

    N2S Node-Pair Removal decoder based on Ma et al. (2022)

  • NodePairReinsertionDecoder

    N2S Node-Pair Reinsertion decoder based on Ma et al. (2022)

NodePairRemovalDecoder

NodePairRemovalDecoder(
    embed_dim: int = 128, num_heads: int = 4
)

Bases: ImprovementDecoder

N2S Node-Pair Removal decoder based on Ma et al. (2022). Given the environment state and the node embeddings (positional embeddings are discarded), compute the logits for selecting the pair of pickup and delivery nodes to remove from the current solution.

Parameters:

  • embed_dim (int, default: 128 ) –

    Embedding dimension

  • num_heads (int, default: 4 ) –

    Number of attention heads

Methods:

  • forward

    Compute the logits for removing a node pair from the current solution

Source code in rl4co/models/zoo/n2s/decoder.py
def __init__(
    self,
    embed_dim: int = 128,
    num_heads: int = 4,
):
    super().__init__()
    self.input_dim = embed_dim
    self.n_heads = num_heads
    self.hidden_dim = embed_dim

    assert embed_dim % num_heads == 0

    self.W_Q = nn.Parameter(
        torch.Tensor(self.n_heads, self.input_dim, self.hidden_dim)
    )
    self.W_K = nn.Parameter(
        torch.Tensor(self.n_heads, self.input_dim, self.hidden_dim)
    )

    self.agg = MLP(input_dim=2 * self.n_heads + 4, output_dim=1, num_neurons=[32, 32])

    self.init_parameters()

forward

forward(
    td: TensorDict, final_h: Tensor, final_p: Tensor
) -> Tensor

Compute the logits for removing a node pair from the current solution

Parameters:

  • td (TensorDict) –

    TensorDict with the current environment state

  • final_h (Tensor) –

    final node embeddings

  • final_p (Tensor) –

    final positional embeddings

Source code in rl4co/models/zoo/n2s/decoder.py
def forward(self, td: TensorDict, final_h: Tensor, final_p: Tensor) -> Tensor:
    """Compute the logits of the removing a node pair from the current solution

    Args:
        td: TensorDict with the current environment state
        final_h: final node embeddings
        final_p: final positional embeddings
    """

    selection_recent = torch.cat(
        (td["action_record"][:, -3:], td["action_record"].mean(1, True)), 1
    )
    solution = td["rec_current"]

    pre = solution.argsort()  # pre=[1,2,0]
    post = solution.gather(
        1, solution
    )  # post=[1,2,0] # the second neighbour works better
    batch_size, graph_size_plus1, input_dim = final_h.size()

    hflat = final_h.contiguous().view(-1, input_dim)  #################   reshape

    shp = (self.n_heads, batch_size, graph_size_plus1, self.hidden_dim)

    # Calculate queries, (n_heads, batch_size, graph_size+1, key_size)
    hidden_Q = torch.matmul(hflat, self.W_Q).view(shp)
    hidden_K = torch.matmul(hflat, self.W_K).view(shp)

    Q_pre = hidden_Q.gather(
        2, pre.view(1, batch_size, graph_size_plus1, 1).expand_as(hidden_Q)
    )
    K_post = hidden_K.gather(
        2, post.view(1, batch_size, graph_size_plus1, 1).expand_as(hidden_Q)
    )

    compatibility = (
        (Q_pre * hidden_K).sum(-1)
        + (hidden_Q * K_post).sum(-1)
        - (Q_pre * K_post).sum(-1)
    )[
        :, :, 1:
    ]  # (n_heads, batch_size, graph_size) (12)

    compatibility_pairing = torch.cat(
        (
            compatibility[:, :, : graph_size_plus1 // 2],
            compatibility[:, :, graph_size_plus1 // 2 :],
        ),
        0,
    )  # (n_heads*2, batch_size, graph_size/2)

    compatibility_pairing = self.agg(
        torch.cat(
            (
                compatibility_pairing.permute(1, 2, 0),
                selection_recent.permute(0, 2, 1),
            ),
            -1,
        )
    ).squeeze()  # (batch_size, graph_size/2)

    return compatibility_pairing

NodePairReinsertionDecoder

NodePairReinsertionDecoder(
    embed_dim: int = 128, num_heads: int = 4
)

Bases: ImprovementDecoder

N2S Node-Pair Reinsertion decoder based on Ma et al. (2022). Given the environment state, the node embeddings (positional embeddings are discarded), and the node pair removed by the NodePairRemovalDecoder, compute the logits for the places at which to re-insert the removed pickup and delivery nodes to form a new solution.

Parameters:

  • embed_dim (int, default: 128 ) –

    Embedding dimension

  • num_heads (int, default: 4 ) –

    Number of attention heads

Source code in rl4co/models/zoo/n2s/decoder.py
def __init__(
    self,
    embed_dim: int = 128,
    num_heads: int = 4,
):
    super().__init__()
    self.input_dim = embed_dim
    self.n_heads = num_heads
    self.hidden_dim = embed_dim

    assert embed_dim % num_heads == 0

    self.compater_insert1 = MultiHeadCompat(
        num_heads, embed_dim, embed_dim, embed_dim, embed_dim
    )

    self.compater_insert2 = MultiHeadCompat(
        num_heads, embed_dim, embed_dim, embed_dim, embed_dim
    )

    self.agg = MLP(input_dim=4 * self.n_heads, output_dim=1, num_neurons=[32, 32])

Classes:

  • N2SPolicy

    N2S Policy based on Ma et al. (2022)

N2SPolicy

N2SPolicy(
    embed_dim: int = 128,
    num_encoder_layers: int = 3,
    num_heads: int = 4,
    normalization: str = "layer",
    feedforward_hidden: int = 128,
    env_name: str = "pdp_ruin_repair",
    pos_type: str = "CPE",
    init_embedding: Module = None,
    pos_embedding: Module = None,
    temperature: float = 1.0,
    tanh_clipping: float = 6.0,
    train_decode_type: str = "sampling",
    val_decode_type: str = "sampling",
    test_decode_type: str = "sampling",
)

Bases: ImprovementPolicy

N2S Policy based on Ma et al. (2022). This model first encodes the input graph and the current solution using an N2S encoder (:class:N2SEncoder) and then decodes the node-pair removal and reinsertion actions using the Node-Pair Removal (:class:NodePairRemovalDecoder) and Reinsertion (:class:NodePairReinsertionDecoder) decoders.

Parameters:

  • embed_dim (int, default: 128 ) –

    Dimension of the node embeddings

  • num_encoder_layers (int, default: 3 ) –

    Number of layers in the encoder

  • num_heads (int, default: 4 ) –

    Number of heads in the attention layers

  • normalization (str, default: 'layer' ) –

    Normalization type in the attention layers

  • feedforward_hidden (int, default: 128 ) –

    Dimension of the hidden layer in the feedforward network

  • env_name (str, default: 'pdp_ruin_repair' ) –

    Name of the environment used to initialize embeddings

  • pos_type (str, default: 'CPE' ) –

    Name of the used positional encoding method (CPE or APE)

  • init_embedding (Module, default: None ) –

    Module to use for the initialization of the embeddings

  • pos_embedding (Module, default: None ) –

    Module to use for the initialization of the positional embeddings

  • temperature (float, default: 1.0 ) –

    Temperature for the softmax

  • tanh_clipping (float, default: 6.0 ) –

    Tanh clipping value (see Bello et al., 2016)

  • train_decode_type (str, default: 'sampling' ) –

    Type of decoding to use during training

  • val_decode_type (str, default: 'sampling' ) –

    Type of decoding to use during validation

  • test_decode_type (str, default: 'sampling' ) –

    Type of decoding to use during testing

Methods:

  • forward

    Forward pass of the policy.

Source code in rl4co/models/zoo/n2s/policy.py
def __init__(
    self,
    embed_dim: int = 128,
    num_encoder_layers: int = 3,
    num_heads: int = 4,
    normalization: str = "layer",
    feedforward_hidden: int = 128,
    env_name: str = "pdp_ruin_repair",
    pos_type: str = "CPE",
    init_embedding: nn.Module = None,
    pos_embedding: nn.Module = None,
    temperature: float = 1.0,
    tanh_clipping: float = 6.0,
    train_decode_type: str = "sampling",
    val_decode_type: str = "sampling",
    test_decode_type: str = "sampling",
):
    super(N2SPolicy, self).__init__()

    self.env_name = env_name

    # Encoder and decoder
    self.encoder = N2SEncoder(
        embed_dim=embed_dim,
        init_embedding=init_embedding,
        pos_embedding=pos_embedding,
        env_name=env_name,
        pos_type=pos_type,
        num_heads=num_heads,
        num_layers=num_encoder_layers,
        normalization=normalization,
        feedforward_hidden=feedforward_hidden,
    )

    self.removal_decoder = NodePairRemovalDecoder(
        embed_dim=embed_dim, num_heads=num_heads
    )

    self.reinsertion_decoder = NodePairReinsertionDecoder(
        embed_dim=embed_dim, num_heads=num_heads
    )

    self.project_graph = nn.Linear(embed_dim, embed_dim, bias=False)
    self.project_node = nn.Linear(embed_dim, embed_dim, bias=False)

    # Decoding strategies
    self.temperature = temperature
    self.tanh_clipping = tanh_clipping
    self.train_decode_type = train_decode_type
    self.val_decode_type = val_decode_type
    self.test_decode_type = test_decode_type

forward

forward(
    td: TensorDict,
    env: Union[str, RL4COEnvBase] = None,
    phase: str = "train",
    return_actions: bool = True,
    return_embeds: bool = False,
    only_return_embed: bool = False,
    actions=None,
    **decoding_kwargs
) -> dict

Forward pass of the policy.

Parameters:

  • td (TensorDict) –

    TensorDict containing the environment state

  • env (Union[str, RL4COEnvBase], default: None ) –

    Environment to use for decoding. If None, the environment is instantiated from env_name. Note that it is more efficient to pass an already instantiated environment each time for fine-grained control

  • phase (str, default: 'train' ) –

    Phase of the algorithm (train, val, test)

  • return_actions (bool, default: True ) –

    Whether to return the actions

  • actions

    Actions to use for evaluating the policy. If passed, use these actions instead of sampling from the policy to calculate log likelihood

  • decoding_kwargs

    Keyword arguments for the decoding strategy. See :class:rl4co.utils.decoding.DecodingStrategy for more information.

Returns:

  • out ( dict ) –

    Dictionary containing the reward, log likelihood, and optionally the actions and entropy

Source code in rl4co/models/zoo/n2s/policy.py
def forward(
    self,
    td: TensorDict,
    env: Union[str, RL4COEnvBase] = None,
    phase: str = "train",
    return_actions: bool = True,
    return_embeds: bool = False,
    only_return_embed: bool = False,
    actions=None,
    **decoding_kwargs,
) -> dict:
    """Forward pass of the policy.

    Args:
        td: TensorDict containing the environment state
        env: Environment to use for decoding. If None, the environment is instantiated from `env_name`. Note that
            it is more efficient to pass an already instantiated environment each time for fine-grained control
        phase: Phase of the algorithm (train, val, test)
        return_actions: Whether to return the actions
        actions: Actions to use for evaluating the policy.
            If passed, use these actions instead of sampling from the policy to calculate log likelihood
        decoding_kwargs: Keyword arguments for the decoding strategy. See :class:`rl4co.utils.decoding.DecodingStrategy` for more information.

    Returns:
        out: Dictionary containing the reward, log likelihood, and optionally the actions and entropy
    """

    # Encoder: get encoder output and initial embeddings from initial state
    h_wave, final_p = self.encoder(td)
    if only_return_embed:
        return {"embeds": h_wave.detach()}
    final_h = (
        self.project_node(h_wave) + self.project_graph(h_wave.max(1)[0])[:, None, :]
    )

    # Instantiate environment if needed
    if isinstance(env, str) or env is None:
        env_name = self.env_name if env is None else env
        log.info(f"Instantiated environment not provided; instantiating {env_name}")
        env = get_env(env_name)

    # Get decode type depending on phase and whether actions are passed for evaluation
    decode_type = decoding_kwargs.pop("decode_type", None)
    if actions is not None:
        decode_type = "evaluate"
    elif decode_type is None:
        decode_type = getattr(self, f"{phase}_decode_type")

    # Setup decoding strategy
    # we pop arguments that are not part of the decoding strategy
    decode_strategy: DecodingStrategy = get_decoding_strategy(
        decode_type,
        temperature=decoding_kwargs.pop("temperature", self.temperature),
        tanh_clipping=decoding_kwargs.pop("tanh_clipping", self.tanh_clipping),
        mask_logits=True,
        improvement_method_mode=True,
        **decoding_kwargs,
    )

    ## action 1

    # Perform the decoding
    logits = self.removal_decoder(td, final_h, final_p)

    # Get mask
    mask = torch.ones_like(td["action_record"][:, 0], device=td.device).bool()
    if "action" in td.keys():
        mask = mask.scatter(1, td["action"][:, :1], 0)

    # Get action and log-likelihood
    logprob_removal, action_removal = decode_strategy.step(
        logits,
        mask,
        action=actions[:, 0] if actions is not None else None,
    )
    action_removal = action_removal.unsqueeze(-1)
    if phase == "train":
        selected_log_ll_action1 = logprob_removal.gather(1, action_removal)

    ## action 2
    td.set("action", action_removal)

    # Perform the decoding
    batch_size, seq_length = td["rec_current"].size()
    logits = self.reinsertion_decoder(td, final_h, final_p).view(batch_size, -1)

    # Get mask
    mask = env.get_mask(action_removal + 1, td).view(batch_size, -1)
    # Get action and log-likelihood
    logprob_reinsertion, action_reinsertion = decode_strategy.step(
        logits,
        mask,
        action=(
            actions[:, 1] * seq_length + actions[:, 2]
            if actions is not None
            else None
        ),
    )
    action_reinsertion = action_reinsertion.unsqueeze(-1)
    if phase == "train":
        selected_log_ll_action2 = logprob_reinsertion.gather(1, action_reinsertion)

    ## return
    N2S_action = torch.cat(
        (
            action_removal.view(batch_size, -1),
            action_reinsertion // seq_length,
            action_reinsertion % seq_length,
        ),
        -1,
    )
    if phase == "train":
        log_likelihood = selected_log_ll_action1 + selected_log_ll_action2
    else:
        log_likelihood = torch.zeros(batch_size, device=td.device)

    outdict = {"log_likelihood": log_likelihood, "cost_bsf": td["cost_bsf"]}
    td.set("action", N2S_action)

    if return_embeds:
        outdict["embeds"] = h_wave.detach()

    if return_actions:
        outdict["actions"] = N2S_action

    return outdict
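
The action tensor returned above packs three indices per instance, following the concatenation of N2S_action at the end of the forward pass: the node selected by the removal decoder, plus the row and column of the flattened reinsertion grid (interpreted as the re-insertion locations for the pickup and delivery nodes). A small unpacking sketch, where out is the dictionary returned by N2SPolicy.forward:

actions = out["actions"]        # shape: (batch_size, 3)
removal_node = actions[:, 0]    # index chosen by NodePairRemovalDecoder
reinsert_row = actions[:, 1]    # action_reinsertion // seq_length in the code above
reinsert_col = actions[:, 2]    # action_reinsertion % seq_length in the code above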

Classes:

  • N2S

    N2S Model based on n_step Proximal Policy Optimization (PPO) with an N2S model policy.

N2S

N2S(
    env: RL4COEnvBase,
    policy: Module = None,
    critic: CriticNetwork = None,
    policy_kwargs: dict = {},
    critic_kwargs: dict = {},
    **kwargs
)

Bases: n_step_PPO

N2S Model based on n_step Proximal Policy Optimization (PPO) with an N2S model policy. We default to the N2S model policy and the improvement Critic Network.

Parameters:

  • env (RL4COEnvBase) –

    Environment to use for the algorithm

  • policy (Module, default: None ) –

    Policy to use for the algorithm

  • critic (CriticNetwork, default: None ) –

    Critic to use for the algorithm

  • policy_kwargs (dict, default: {} ) –

    Keyword arguments for policy

  • critic_kwargs (dict, default: {} ) –

    Keyword arguments for critic

Source code in rl4co/models/zoo/n2s/model.py
def __init__(
    self,
    env: RL4COEnvBase,
    policy: nn.Module = None,
    critic: CriticNetwork = None,
    policy_kwargs: dict = {},
    critic_kwargs: dict = {},
    **kwargs,
):
    if policy is None:
        policy = N2SPolicy(env_name=env.name, **policy_kwargs)

    if critic is None:
        embed_dim = (
            policy_kwargs["embed_dim"] if "embed_dim" in policy_kwargs else 128
        )  # the critic's embed_dim must be as policy's

        encoder = MultiHeadAttentionLayer(
            embed_dim,
            critic_kwargs["num_heads"] if "num_heads" in critic_kwargs else 4,
            critic_kwargs["feedforward_hidden"]
            if "feedforward_hidden" in critic_kwargs
            else 128,
            critic_kwargs["normalization"]
            if "normalization" in critic_kwargs
            else "layer",
            bias=False,
        )
        value_head = CriticDecoder(embed_dim)

        critic = CriticNetwork(
            encoder=encoder,
            value_head=value_head,
            customized=True,
        )

    super().__init__(env, policy, critic, **kwargs)

NeuOpt

Classes:

  • RDSDecoder

    RDS Decoder for flexible k-opt based on Ma et al. (2023)

RDSDecoder

RDSDecoder(embed_dim: int = 128)

Bases: ImprovementDecoder

RDS Decoder for flexible k-opt based on Ma et al. (2023). Given the environment state and the node embeddings (positional embeddings are discarded), compute the logits for selecting a k-opt exchange, expressed through the basis moves (S-move, I-move, E-move), from the current solution.

Parameters:

  • embed_dim (int, default: 128 ) –

    Embedding dimension

  • num_heads

    Number of attention heads

Source code in rl4co/models/zoo/neuopt/decoder.py
def __init__(
    self,
    embed_dim: int = 128,
):
    super().__init__()
    self.embed_dim = embed_dim

    self.linear_K1 = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
    self.linear_K2 = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
    self.linear_K3 = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
    self.linear_K4 = nn.Linear(self.embed_dim, self.embed_dim, bias=False)

    self.linear_Q1 = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
    self.linear_Q2 = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
    self.linear_Q3 = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
    self.linear_Q4 = nn.Linear(self.embed_dim, self.embed_dim, bias=False)

    self.linear_V1 = nn.Parameter(torch.Tensor(self.embed_dim))
    self.linear_V2 = nn.Parameter(torch.Tensor(self.embed_dim))

    self.rnn1 = nn.GRUCell(self.embed_dim, self.embed_dim)
    self.rnn2 = nn.GRUCell(self.embed_dim, self.embed_dim)

Classes:

  • CustomizeTSPInitEmbedding

    Initial embedding for the Traveling Salesman Problem (TSP).

  • NeuOptPolicy

    NeuOpt Policy based on Ma et al. (2023)

CustomizeTSPInitEmbedding

CustomizeTSPInitEmbedding(embed_dim, linear_bias=True)

Bases: Module

Initial embedding for the Traveling Salesman Problem (TSP). Embeds the following node features into the embedding space:

- locs: x, y coordinates of the cities
Source code in rl4co/models/zoo/neuopt/policy.py
def __init__(self, embed_dim, linear_bias=True):
    super(CustomizeTSPInitEmbedding, self).__init__()
    node_dim = 2  # x, y
    self.init_embed = nn.Sequential(
        nn.Linear(node_dim, embed_dim // 2, linear_bias),
        nn.ReLU(inplace=True),
        nn.Linear(embed_dim // 2, embed_dim, linear_bias),
    )
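
A small sketch of the projection defined above: the two-layer MLP maps raw 2-D coordinates to embed_dim-dimensional node embeddings. Calling init_embed directly on a coordinate tensor illustrates the shapes; how the module's forward method extracts the coordinates from the state TensorDict is not shown in the listing above.

import torch
from rl4co.models.zoo.neuopt.policy import CustomizeTSPInitEmbedding  # module path from the "Source code" location above

embedding = CustomizeTSPInitEmbedding(embed_dim=128)

locs = torch.rand(2, 20, 2)     # (batch_size, num_cities, xy coordinates)
h = embedding.init_embed(locs)  # apply the Linear -> ReLU -> Linear stack directly
print(h.shape)                  # torch.Size([2, 20, 128])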

NeuOptPolicy

NeuOptPolicy(
    embed_dim: int = 128,
    num_encoder_layers: int = 3,
    num_heads: int = 4,
    normalization: str = "layer",
    feedforward_hidden: int = 128,
    env_name: str = "tsp_kopt",
    pos_type: str = "CPE",
    init_embedding: Module = None,
    pos_embedding: Module = None,
    temperature: float = 1.0,
    tanh_clipping: float = 6.0,
    train_decode_type: str = "sampling",
    val_decode_type: str = "sampling",
    test_decode_type: str = "sampling",
)

Bases: ImprovementPolicy

NeuOpt Policy based on Ma et al. (2023). This model first encodes the input graph and the current solution using an N2S encoder (:class:N2SEncoder) and then decodes the k-opt action (:class:RDSDecoder).

Parameters:

  • embed_dim (int, default: 128 ) –

    Dimension of the node embeddings

  • num_encoder_layers (int, default: 3 ) –

    Number of layers in the encoder

  • num_heads (int, default: 4 ) –

    Number of heads in the attention layers

  • normalization (str, default: 'layer' ) –

    Normalization type in the attention layers

  • feedforward_hidden (int, default: 128 ) –

    Dimension of the hidden layer in the feedforward network

  • env_name (str, default: 'tsp_kopt' ) –

    Name of the environment used to initialize embeddings

  • pos_type (str, default: 'CPE' ) –

    Name of the used positional encoding method (CPE or APE)

  • init_embedding (Module, default: None ) –

    Module to use for the initialization of the embeddings

  • pos_embedding (Module, default: None ) –

    Module to use for the initialization of the positional embeddings

  • temperature (float, default: 1.0 ) –

    Temperature for the softmax

  • tanh_clipping (float, default: 6.0 ) –

    Tanh clipping value (see Bello et al., 2016)

  • train_decode_type (str, default: 'sampling' ) –

    Type of decoding to use during training

  • val_decode_type (str, default: 'sampling' ) –

    Type of decoding to use during validation

  • test_decode_type (str, default: 'sampling' ) –

    Type of decoding to use during testing

Methods:

  • forward

    Forward pass of the policy.

Source code in rl4co/models/zoo/neuopt/policy.py
def __init__(
    self,
    embed_dim: int = 128,
    num_encoder_layers: int = 3,
    num_heads: int = 4,
    normalization: str = "layer",
    feedforward_hidden: int = 128,
    env_name: str = "tsp_kopt",
    pos_type: str = "CPE",
    init_embedding: nn.Module = None,
    pos_embedding: nn.Module = None,
    temperature: float = 1.0,
    tanh_clipping: float = 6.0,
    train_decode_type: str = "sampling",
    val_decode_type: str = "sampling",
    test_decode_type: str = "sampling",
):
    super(NeuOptPolicy, self).__init__()

    self.env_name = env_name
    self.embed_dim = embed_dim

    # Decoding strategies
    self.temperature = temperature
    self.tanh_clipping = tanh_clipping
    self.train_decode_type = train_decode_type
    self.val_decode_type = val_decode_type
    self.test_decode_type = test_decode_type

    # Encoder and decoder
    if init_embedding is None:
        init_embedding = CustomizeTSPInitEmbedding(self.embed_dim)

    self.encoder = N2SEncoder(
        embed_dim=embed_dim,
        init_embedding=init_embedding,
        pos_embedding=pos_embedding,
        env_name=env_name,
        pos_type=pos_type,
        num_heads=num_heads,
        num_layers=num_encoder_layers,
        normalization=normalization,
        feedforward_hidden=feedforward_hidden,
    )

    self.decoder = RDSDecoder(embed_dim=embed_dim)

    self.init_hidden_W = nn.Linear(self.embed_dim, self.embed_dim)
    self.init_query_learnable = nn.Parameter(torch.Tensor(self.embed_dim))

    self.init_parameters()

forward

forward(
    td: TensorDict,
    env: Union[str, RL4COEnvBase] = None,
    phase: str = "train",
    return_actions: bool = True,
    return_embeds: bool = False,
    only_return_embed: bool = False,
    actions=None,
    **decoding_kwargs
) -> dict

Forward pass of the policy.

Parameters:

  • td (TensorDict) –

    TensorDict containing the environment state

  • env (Union[str, RL4COEnvBase], default: None ) –

    Environment to use for decoding. If None, the environment is instantiated from env_name. Note that it is more efficient to pass an already instantiated environment each time for fine-grained control

  • phase (str, default: 'train' ) –

    Phase of the algorithm (train, val, test)

  • return_actions (bool, default: True ) –

    Whether to return the actions

  • actions

    Actions to use for evaluating the policy. If passed, use these actions instead of sampling from the policy to calculate log likelihood

  • decoding_kwargs

    Keyword arguments for the decoding strategy. See :class:rl4co.utils.decoding.DecodingStrategy for more information.

Returns:

  • out ( dict ) –

    Dictionary containing the reward, log likelihood, and optionally the actions and entropy

Source code in rl4co/models/zoo/neuopt/policy.py
def forward(
    self,
    td: TensorDict,
    env: Union[str, RL4COEnvBase] = None,
    phase: str = "train",
    return_actions: bool = True,
    return_embeds: bool = False,
    only_return_embed: bool = False,
    actions=None,
    **decoding_kwargs,
) -> dict:
    """Forward pass of the policy.

    Args:
        td: TensorDict containing the environment state
        env: Environment to use for decoding. If None, the environment is instantiated from `env_name`. Note that
            it is more efficient to pass an already instantiated environment each time for fine-grained control
        phase: Phase of the algorithm (train, val, test)
        return_actions: Whether to return the actions
        actions: Actions to use for evaluating the policy.
            If passed, use these actions instead of sampling from the policy to calculate log likelihood
        decoding_kwargs: Keyword arguments for the decoding strategy. See :class:`rl4co.utils.decoding.DecodingStrategy` for more information.

    Returns:
        out: Dictionary containing the reward, log likelihood, and optionally the actions and entropy
    """

    # Encoder: get encoder output and initial embeddings from initial state
    nfe, _ = self.encoder(td)
    if only_return_embed:
        return {"embeds": nfe.detach()}

    # Instantiate environment if needed
    if isinstance(env, str) or env is None:
        env_name = self.env_name if env is None else env
        log.info(f"Instantiated environment not provided; instantiating {env_name}")
        env = get_env(env_name)
    assert not env.two_opt_mode, "NeuOpt only support k-opt with k > 2"

    # Get decode type depending on phase and whether actions are passed for evaluation
    decode_type = decoding_kwargs.pop("decode_type", None)
    if actions is not None:
        decode_type = "evaluate"
    elif decode_type is None:
        decode_type = getattr(self, f"{phase}_decode_type")

    # Setup decoding strategy
    # we pop arguments that are not part of the decoding strategy
    decode_strategy: DecodingStrategy = get_decoding_strategy(
        decode_type,
        temperature=decoding_kwargs.pop("temperature", self.temperature),
        tanh_clipping=decoding_kwargs.pop("tanh_clipping", self.tanh_clipping),
        mask_logits=True,
        improvement_method_mode=True,
        **decoding_kwargs,
    )

    # Perform the decoding
    bs, gs, _, ll, action_sampled, rec, visited_time = (
        *nfe.size(),
        0.0,
        None,
        td["rec_current"],
        td["visited_time"],
    )
    action_index = torch.zeros(bs, env.k_max, dtype=torch.long).to(rec.device)
    k_action_left = torch.zeros(bs, env.k_max + 1, dtype=torch.long).to(rec.device)
    k_action_right = torch.zeros(bs, env.k_max, dtype=torch.long).to(rec.device)
    next_of_last_action = (
        torch.zeros_like(rec[:, :1], dtype=torch.long).to(rec.device) - 1
    )
    mask = torch.zeros_like(rec, dtype=torch.bool).to(rec.device)
    stopped = torch.ones(bs, dtype=torch.bool).to(rec.device)
    zeros = torch.zeros((bs, 1), device=td.device)

    # init queries
    h_mean = nfe.mean(1)
    init_query = self.init_query_learnable.repeat(bs, 1)
    input_q1 = input_q2 = init_query.clone()
    init_hidden = self.init_hidden_W(h_mean)
    q1 = q2 = init_hidden.clone()

    for i in range(env.k_max):
        # Pass RDS decoder
        logits, q1, q2 = self.decoder(nfe, q1, q2, input_q1, input_q2)

        # Calc probs
        if i == 0 and "action" in td.keys():
            mask = mask.scatter(1, td["action"][:, :1], 1)

        logprob, action_sampled = decode_strategy.step(
            logits,
            ~mask.clone(),
            action=actions[:, i : i + 1].squeeze() if actions is not None else None,
        )
        action_sampled = action_sampled.unsqueeze(-1)
        if i > 0:
            action_sampled = torch.where(
                stopped.unsqueeze(-1), action_index[:, :1], action_sampled
            )
        if phase == "train":
            loss_now = logprob.gather(1, action_sampled)
        else:
            loss_now = zeros.clone()

        # Record log_likelihood and Entropy
        if i > 0:
            ll = ll + torch.where(stopped.unsqueeze(-1), zeros * 0, loss_now)
        else:
            ll = ll + loss_now

        # Store and Process actions
        next_of_new_action = rec.gather(1, action_sampled)
        action_index[:, i] = action_sampled.squeeze().clone()
        k_action_left[stopped, i] = action_sampled[stopped].squeeze().clone()
        k_action_right[~stopped, i - 1] = action_sampled[~stopped].squeeze().clone()
        k_action_left[:, i + 1] = next_of_new_action.squeeze().clone()

        # Prepare next RNN input
        input_q1 = nfe.gather(
            1, action_sampled.view(bs, 1, 1).expand(bs, 1, self.embed_dim)
        ).squeeze(1)
        input_q2 = torch.where(
            stopped.view(bs, 1).expand(bs, self.embed_dim),
            input_q1.clone(),
            nfe.gather(
                1,
                (next_of_last_action % gs)
                .view(bs, 1, 1)
                .expand(bs, 1, self.embed_dim),
            ).squeeze(1),
        )

        # Process if k-opt close
        # assert (input_q1[stopped] == input_q2[stopped]).all()
        if i > 0:
            stopped = stopped | (action_sampled == next_of_last_action).squeeze()
        else:
            stopped = (action_sampled == next_of_last_action).squeeze()
        # assert (input_q1[stopped] == input_q2[stopped]).all()

        k_action_left[stopped, i] = k_action_left[stopped, i - 1]
        k_action_right[stopped, i] = k_action_right[stopped, i - 1]

        # Calc next basic masks
        if i == 0:
            visited_time_tag = (
                visited_time - visited_time.gather(1, action_sampled)
            ) % gs
        mask &= False
        mask[(visited_time_tag <= visited_time_tag.gather(1, action_sampled))] = True
        if i == 0:
            mask[visited_time_tag > (gs - 2)] = True
        mask[stopped, action_sampled[stopped].squeeze()] = (
            False  # allow next k-opt starts immediately
        )
        # if True:#i == env.k_max - 2: # allow special case: close k-opt at the first selected node
        index_allow_first_node = (~stopped) & (
            next_of_new_action.squeeze() == action_index[:, 0]
        )
        mask[index_allow_first_node, action_index[index_allow_first_node, 0]] = False

        # Move to next
        next_of_last_action = next_of_new_action
        next_of_last_action[stopped] = -1

    # Form final action
    k_action_right[~stopped, -1] = k_action_left[~stopped, -1].clone()
    k_action_left = k_action_left[:, : env.k_max]
    action_all = torch.cat((action_index, k_action_left, k_action_right), -1)

    outdict = {"log_likelihood": ll, "cost_bsf": td["cost_bsf"]}
    td.set("action", action_all)

    if return_embeds:
        outdict["embeds"] = nfe.detach()

    if return_actions:
        outdict["actions"] = action_all

    return outdict
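
The action tensor assembled at the end of this forward pass concatenates three blocks of length k_max, mirroring the construction of action_all above. A hedged unpacking sketch, where out is the dictionary returned by NeuOptPolicy.forward and env the k-opt environment:

k_max = env.k_max
actions = out["actions"]                     # shape: (batch_size, 3 * k_max)
action_index = actions[:, :k_max]            # node selected at each of the k_max decoding steps
k_action_left = actions[:, k_max:2 * k_max]  # bookkeeping indices built in the loop above,
k_action_right = actions[:, 2 * k_max:]      # used by the environment to apply the k-opt move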

Classes:

  • NeuOpt

    NeuOpt Model based on n_step Proximal Policy Optimization (PPO) with a NeuOpt model policy.

NeuOpt

NeuOpt(
    env: RL4COEnvBase,
    policy: Module = None,
    critic: CriticNetwork = None,
    policy_kwargs: dict = {},
    critic_kwargs: dict = {},
    **kwargs
)

Bases: n_step_PPO

NeuOpt Model based on n_step Proximal Policy Optimization (PPO) with a NeuOpt model policy. We default to the NeuOpt model policy and the improvement Critic Network.

Parameters:

  • env (RL4COEnvBase) –

    Environment to use for the algorithm

  • policy (Module, default: None ) –

    Policy to use for the algorithm

  • critic (CriticNetwork, default: None ) –

    Critic to use for the algorithm

  • policy_kwargs (dict, default: {} ) –

    Keyword arguments for policy

  • critic_kwargs (dict, default: {} ) –

    Keyword arguments for critic

Source code in rl4co/models/zoo/neuopt/model.py
def __init__(
    self,
    env: RL4COEnvBase,
    policy: nn.Module = None,
    critic: CriticNetwork = None,
    policy_kwargs: dict = {},
    critic_kwargs: dict = {},
    **kwargs,
):
    if policy is None:
        policy = NeuOptPolicy(env_name=env.name, **policy_kwargs)

    if critic is None:
        embed_dim = (
            policy_kwargs["embed_dim"] if "embed_dim" in policy_kwargs else 128
        )  # the critic's embed_dim must be as policy's

        encoder = MultiHeadAttentionLayer(
            embed_dim,
            critic_kwargs["num_heads"] if "num_heads" in critic_kwargs else 4,
            critic_kwargs["feedforward_hidden"]
            if "feedforward_hidden" in critic_kwargs
            else 128,
            critic_kwargs["normalization"]
            if "normalization" in critic_kwargs
            else "layer",
            bias=False,
        )
        value_head = CriticDecoder(embed_dim, dropout_rate=0.001)

        critic = CriticNetwork(
            encoder=encoder,
            value_head=value_head,
            customized=True,
        )

    super().__init__(env, policy, critic, **kwargs)