There is a mismatch between output["topic-word-matrix"] and dataset.get_vocabulary() in terms of words? #86

Open
Zay-Ben opened this issue Jan 10, 2023 · 5 comments

Comments

@Zay-Ben

Zay-Ben commented Jan 10, 2023

There is a mismatch between output["topic-word-matrix"] and dataset.get_vocabulary() in terms of words?

I created a Dataframe as follows:

df = pd.DataFrame(data = output["topic-word-matrix"], columns = dataset.get_vocabulary()).T

When I sort the Dataframe by a topic number to get the top words for a topic, why do the results differ from output["topics"][i]?

Thank you!
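
For reference, here is a minimal sketch of the comparison described above, using toy data rather than real OCTIS output. The matrix values, the five-word vocabulary, and the helper `top_words` are made up for illustration. The only assumption carried over from the question is that the matrix has one row per topic and one column per vocabulary word.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical values, NOT real OCTIS output).
# Rows are topics, columns are vocabulary words.
topic_word_matrix = np.array([
    [0.05, 0.40, 0.10, 0.30, 0.15],
    [0.35, 0.05, 0.25, 0.05, 0.30],
])
vocabulary = ["bill", "call", "data", "network", "signal"]


def top_words(matrix, vocab, topic, k=3):
    """Return the k highest-weighted words of one topic (hypothetical helper)."""
    order = np.argsort(matrix[topic])[::-1][:k]
    return [vocab[i] for i in order]


# Same construction as in the question: words as rows, topics as columns.
df = pd.DataFrame(data=topic_word_matrix, columns=vocabulary).T

# Top words for topic 0, once via the DataFrame and once via the raw matrix.
top_from_df = df[0].sort_values(ascending=False).head(3).index.tolist()
top_direct = top_words(topic_word_matrix, vocabulary, topic=0)

print(top_from_df)
print(top_direct)
```

Both lists agree here because the column labels are in the same order as the matrix columns. If they disagree on real data, the labels passed as `columns` are probably not in the order the model used.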

@Zay-Ben Zay-Ben changed the title May I ask if there is a mismatch between output["topic-word-matrix"] and dataset.get_vocabulary() in terms of words? There is a mismatch between output["topic-word-matrix"] and dataset.get_vocabulary() in terms of words? Jan 11, 2023
@silviatti
Collaborator

There should be a one-to-one correspondence between the two. It's difficult to say what is wrong. Can you share more details about the problem?

@Zay-Ben
Author

Zay-Ben commented Feb 2, 2023

Good day Dr. Silvia, nice to see you again, and thank you for your reply. Here are the details of the issue. :)

First, I created a dataset folder containing two files, corpus.txt and vocabulary.tsv, as OCTIS requires.

The corpus file:

image

The vocabulary file (sorted alphabetically):

image

Second, I loaded the dataset and trained LDA models with the dataset.

image

image

image

Third, after training, I loaded one of the LDA models and built a data frame with the model’s topic-word-matrix as the data and the dataset’s vocabulary as the columns. The resulting data frame is shown in the figure below:

image

Last, the top 5 words of the data frame’s first topic are different from the top 5 words of the model’s first topic.

image

I can't determine why there are discrepancies in the top words of the topics.

With appreciation,

Benz

@silviatti
Collaborator

Hi Benz,
sorry for the late reply. I haven't had time to work on OCTIS these past months. There's something weird, I agree.
I would suggest two experiments in case you're still interested in this issue:

  • Can you also print out dataset.get_vocabulary()? Just to see if the vocabulary matches your file.
  • Could you try to repeat the experiment with another model and see if you have the same problem? I'd like to see whether the problem is specific to LDA or general.
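
The first check could be sketched roughly as follows. The helper `compare_vocabularies` and the sample word lists are hypothetical. In practice the first argument would be the words read from the vocabulary file, one per line, and the second the list returned by dataset.get_vocabulary().

```python
def compare_vocabularies(file_words, loaded_words):
    """Compare two word lists by content and by order (hypothetical helper)."""
    file_words = list(file_words)
    loaded_words = list(loaded_words)
    # Index of the first position where the two lists differ, or None if the
    # shared prefix is identical (the lengths may still differ in that case).
    first_diff = next(
        (i for i, (a, b) in enumerate(zip(file_words, loaded_words)) if a != b),
        None,
    )
    return {
        "same_set": set(file_words) == set(loaded_words),
        "same_order": file_words == loaded_words,
        "first_diff": first_diff,
    }


# Made-up example: same words, different order.
result = compare_vocabularies(
    ["bill", "call", "data"],
    ["data", "bill", "call"],
)
print(result)
```

If "same_set" is true but "same_order" is false, the words are all present but the column labels of any data frame built from the loaded vocabulary may not line up with the file.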

Thanks for your patience.

Silvia

@Zay-Ben
Author

Zay-Ben commented Apr 15, 2023

Dear Dr. Silvia,

Thank you for taking the time to address my questions.

Regarding the first question, the order of the vocabulary differs before and after importing it with OCTIS. The vocabulary was sorted alphabetically before importing and appears to be randomly shuffled after importing, as shown in the images below with the first five terms of each vocabulary.
image
image
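
To illustrate why a difference in word order would matter, here is a toy sketch. It is not a confirmed diagnosis of this issue. The word order `model_order`, the weights, and the helper `top_k` are invented for the example. It assumes only that the matrix columns follow some internal word order that differs from the alphabetical one.

```python
import numpy as np

# Hypothetical internal word order used by a model (NOT taken from OCTIS).
model_order = ["signal", "bill", "network", "call", "data"]
# One topic's weights, with columns in model_order.
row = np.array([0.40, 0.05, 0.30, 0.10, 0.15])

# Alphabetical vocabulary, as in the file on disk.
sorted_vocab = sorted(model_order)


def top_k(weights, labels, k=3):
    """Return the k labels with the largest weights (hypothetical helper)."""
    idx = np.argsort(weights)[::-1][:k]
    return [labels[i] for i in idx]


# Correct labels give the true top words.
correct = top_k(row, model_order)
# Alphabetical labels on the same columns give different, wrong top words.
mislabeled = top_k(row, sorted_vocab)

# Realign: reorder the columns so they follow the alphabetical vocabulary.
perm = [model_order.index(word) for word in sorted_vocab]
realigned = top_k(row[perm], sorted_vocab)

print(correct, mislabeled, realigned)
```

Once the columns are permuted to match the labels, the top words agree again. The same holds if the labels are permuted to match the columns instead.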

Regarding the second question, I trained two models (ETM and NMF) using the same dataset and found that the problem persists for NMF, but not for ETM, as shown in the figure below. I noticed that OCTIS's LDA and NMF are both from Gensim. Could this be the source of the error?

ETM:
image
image

NMF:
image
image

Just to give context, the dataset consists of tweets that contain customer complaints about telecommunication companies.

Thank you again for your help! Topic modeling has never been easy without OCTIS. 😭

@silviatti
Collaborator

Hi,
just to double-check, when you load the custom dataset, do you have a file in the dataset folder called vocabulary.txt? That should be the vocabulary file where words are sorted alphabetically. I ask because I noticed that your file is called "words.txt", so it is possible that OCTIS doesn't load it.

Let me know :)
