
Misalignment in LSTM Application and Subsequent Operations in _TSSequencerEncoderLayer #875

Open
prashantkhatri23 opened this issue Jan 17, 2024 · 0 comments


Issue Description

I've noticed a potential issue in the implementation of the _TSSequencerEncoderLayer class: the LSTM layer appears to be applied along the channel axis (feature size, d_model) rather than the temporal axis (sequence length, q_len). This is evident from the initialization of the LSTM layer:

  1. LSTM Layer Initialization:
    Currently, the LSTM layer is initialized as follows:

    self.bilstm = nn.LSTM(q_len, q_len, num_layers=1, bidirectional=True, bias=lstm_bias)

    This should be revised to:

    self.bilstm = nn.LSTM(d_model, d_model, num_layers=1, bidirectional=True, bias=lstm_bias)
  2. Fully Connected Layer Adjustment:
    The self.fc layer needs to be updated to accommodate the change in LSTM layer dimensions:

    self.fc = nn.Linear(2 * d_model, d_model)
  3. Modifications in Forward Pass:
    The forward method needs modifications to correctly process the data through the LSTM layer:

    • For the pre-normalization case:
      x = self.drop_path(self.dropout(self.fc(self.bilstm(self.lstm_norm(x))[0]))) + x
    • For the non-pre-normalization case:
      x = self.lstm_norm(self.drop_path(self.dropout(self.fc(self.bilstm(x)[0]))) + x)
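Putting the three proposed changes together, a minimal sketch of the revised block could look like the following. This is a simplified stand-in, not tsai's actual class: the name SequencerLSTMBlock is hypothetical, drop_path and dropout are omitted for brevity, and batch_first input of shape (bs, q_len, d_model) is assumed.

```python
import torch
import torch.nn as nn

class SequencerLSTMBlock(nn.Module):
    """Simplified sketch of the proposed _TSSequencerEncoderLayer fix
    (hypothetical name; drop_path/dropout omitted for brevity)."""
    def __init__(self, q_len, d_model, lstm_bias=True, pre_norm=True):
        super().__init__()
        self.pre_norm = pre_norm
        self.lstm_norm = nn.LayerNorm(d_model)
        # Proposed change 1: input/hidden size = d_model, so the LSTM
        # iterates over the q_len time steps of d_model-dim features.
        self.bilstm = nn.LSTM(d_model, d_model, num_layers=1,
                              bidirectional=True, bias=lstm_bias,
                              batch_first=True)
        # Proposed change 2: the bidirectional output is 2*d_model wide,
        # so project it back down to d_model for the residual connection.
        self.fc = nn.Linear(2 * d_model, d_model)

    def forward(self, x):  # x: (bs, q_len, d_model)
        # Proposed change 3: pre-norm vs. post-norm residual paths.
        if self.pre_norm:
            return self.fc(self.bilstm(self.lstm_norm(x))[0]) + x
        return self.lstm_norm(self.fc(self.bilstm(x)[0]) + x)

x = torch.randn(4, 10, 16)                 # (bs, q_len, d_model)
y = SequencerLSTMBlock(q_len=10, d_model=16)(x)
```

The output keeps the input shape (bs, q_len, d_model), so the block remains a drop-in residual layer.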

Additional Context:

These issues were identified during a detailed code review while integrating the model into my project. Specifically, I applied the model to two different tasks on the RAVDESS audiovisual (AV) emotion dataset:

  1. Emotion prediction using a facial embedding sequence extracted from a video.
  2. Emotion prediction using audio feature sequences.

To address these concerns, I tested the model's performance with the proposed changes. Interestingly, performance remained similar whether the LSTM was applied across the channel axis (as in the current implementation) or across the time steps (as in the proposed modification). This observation raises questions about the expected impact of these changes and suggests a need for further investigation into the model's behavior in different application contexts.
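The axis semantics of the two variants compared above can be illustrated with plain nn.LSTM calls. This is an illustrative shape demonstration only (it does not reproduce tsai's forward pass); the transpose in the first variant is an assumption about how a (bs, q_len, d_model) tensor would have to be fed to an LSTM initialized with q_len as the input size.

```python
import torch
import torch.nn as nn

bs, q_len, d_model = 4, 10, 16
x = torch.randn(bs, q_len, d_model)

# Current-style init: nn.LSTM(q_len, q_len) expects q_len features per
# step, so x must be transposed to (bs, d_model, q_len). The LSTM then
# steps over the d_model channels, mixing along time within each step.
lstm_ch = nn.LSTM(q_len, q_len, bidirectional=True, batch_first=True)
out_ch, _ = lstm_ch(x.transpose(1, 2))   # (bs, d_model, 2*q_len)

# Proposed-style init: nn.LSTM(d_model, d_model) steps over the q_len
# time steps, treating the d_model values as the features of each step.
lstm_t = nn.LSTM(d_model, d_model, bidirectional=True, batch_first=True)
out_t, _ = lstm_t(x)                     # (bs, q_len, 2*d_model)
```

Both variants produce tensors that a following linear projection can map back to (bs, q_len, d_model), which may partly explain why the two behave interchangeably in practice.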
