problems partitioning custom dataset #101

Open
afriedman412 opened this issue Mar 17, 2023 · 0 comments

  • OCTIS version: 1.11.1
  • Python version: 3.9.16
  • Operating System: Ubuntu 20.04

Description

Trying to run the Optimizer on a custom dataset, following this tutorial, raises:

File /usr/local/lib/python3.9/dist-packages/octis/models/pytorchavitm/AVITM.py:77, in AVITM.train_model(self, dataset, hyperparameters, top_words)
     74 self.set_params(hyperparameters)
     76 if self.use_partitions:
---> 77     train, validation, test = dataset.get_partitioned_corpus(use_validation=True)
     79     data_corpus_train = [' '.join(i) for i in train]
     80     data_corpus_test = [' '.join(i) for i in test]

TypeError: cannot unpack non-iterable NoneType object

What I Did

Here's the code for creating the custom dataset from a list of strings...

# docs is a list of strings
import pandas as pd
from nltk.tokenize import word_tokenize
from tqdm import tqdm

# collect tokens across all documents
tokens = []
for d in tqdm(docs):
    tokens += word_tokenize(d.lower())

# write vocabulary file (one unique token per line)
with open("octis_dataset/vocabulary.txt", "w+") as f:
    for s in tqdm(set(tokens)):
        f.write(s + "\n")

# create corpus dataframe (one document per row)
df = pd.DataFrame(docs)

# partition into train/test/validation splits
tr_data = df.sample(48500, random_state=420)
te_data = df.query("index not in @tr_data.index").sample(12900, random_state=420)
val_data = df.query("index not in @tr_data.index and index not in @te_data.index")

df = pd.concat([tr_data, te_data, val_data])

# write corpus.tsv (note: pandas also writes the index as the first column by default)
df.to_csv("octis_dataset/corpus.tsv", sep="\t", header=None)
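
For reference, the OCTIS docs describe corpus.tsv as having the document text in the first column and, optionally, the partition name in the second. A minimal sketch of writing the splits that way, reusing the tr_data/te_data/val_data frames from above (the train/val/test labels are the ones the OCTIS tutorial uses, so treat them as an assumption to check against your version):

# Sketch: corpus.tsv with an explicit partition column, per the OCTIS docs.
# Assumes tr_data, te_data, and val_data are the frames built above; the
# "train"/"val"/"test" labels follow the OCTIS tutorial.
labeled = pd.concat([
    tr_data.assign(partition="train"),
    val_data.assign(partition="val"),
    te_data.assign(partition="test"),
])

# one "document<TAB>partition" row per line: no header, no index column
labeled.to_csv("octis_dataset/corpus.tsv", sep="\t", header=False, index=False)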

And here is the code to optimize the model...

import time

from octis.optimization.optimizer import Optimizer

# model, dataset, coherence, search_space, optimization_runs, and model_runs
# are defined earlier, following the tutorial
optimizer = Optimizer()

start = time.time()
optimization_result = optimizer.optimize(
    model, dataset, coherence, search_space, number_of_call=optimization_runs,
    model_runs=model_runs, save_models=True,
    extra_metrics=None,  # to keep track of other metrics
    save_path='results/test_neuralLDA/'
)
end = time.time()
duration = end - start
optimization_result.save_to_csv("results_neuralLDA.csv")
print('Optimizing model took: ' + str(round(duration)) + ' seconds.')

And this raises the error.
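
For completeness, the dataset object above comes from loading the folder written earlier; a minimal sketch, assuming octis_dataset/ contains the corpus.tsv and vocabulary.txt from the first snippet:

# Sketch: load the custom dataset from the folder written above.
from octis.dataset.dataset import Dataset

dataset = Dataset()
dataset.load_custom_dataset_from_folder("octis_dataset")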
