
Additional guidance needed #329

Open
TheMrguiller opened this issue May 3, 2024 · 4 comments
@TheMrguiller

Hi @JaMe76,

Sorry to interrupt you. I really appreciate your work. I have been investigating different tools for data extraction and I find your work to be the best so far. I haven't had the opportunity to try out the Gradio example that many people are excited about. That said, I have a few questions.

I'm new to this information extraction field, so there are many things that are a bit out of my scope. I've seen that several people have been asking about your model "xrf_layout/model_final_inf_only.pt". I understand that this model is private, but I've seen that you've made available the possibility of training a very similar model in https://deepdoctection.readthedocs.io/en/latest/tutorials/training_and_logging/.

I got a bit confused, though, with one of your comments on Hugging Face, where you mentioned that the cell/row models used are available, but I don't see where they are actually available.

Finally, I just want to confirm that the example training procedure for the layout model is the correct one, and that it's not just a toy example.

Thank you in advance for your help.

@JaMe76
Contributor

JaMe76 commented May 3, 2024

Thanks for your comments about this repo.

Regarding your questions: the training of the private layout model follows exactly the training scripts you are referring to, with the only difference that when merging datasets I had one additional dataset containing around 6k hand-labelled images. Looking at the training script, the number of samples taken from DocLayNet is around 75k and from PubLayNet 25k. That means you can use this script to train a model on a dataset that differs by only about 6%. Training takes 5-6 days on an RTX 3090.
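
(Purely for illustration, and not the deepdoctection training script itself: a minimal sketch of what sampling and merging COCO-style annotation files in those proportions could look like. The file paths are placeholders, and it assumes category ids are already harmonised across the source datasets.)

```python
# Illustrative only -- this is NOT the deepdoctection training script.
# Sample N images per COCO-style dataset and merge them into one file.
# Paths are placeholders; category ids are assumed to already be harmonised,
# and in a real merge image/annotation ids would also need re-indexing.
import json
import random

def sample_coco(path, n_images, seed=0):
    with open(path) as f:
        coco = json.load(f)
    random.seed(seed)
    images = random.sample(coco["images"], min(n_images, len(coco["images"])))
    keep = {img["id"] for img in images}
    anns = [a for a in coco["annotations"] if a["image_id"] in keep]
    return images, anns, coco["categories"]

sources = [
    ("doclaynet_train.json", 75_000),  # ~75k samples from DocLayNet
    ("publaynet_train.json", 25_000),  # ~25k samples from PubLayNet
    ("custom_train.json", 6_000),      # the extra hand-labelled set (private)
]

merged = {"images": [], "annotations": [], "categories": None}
for path, n in sources:
    imgs, anns, cats = sample_coco(path, n)
    merged["images"].extend(imgs)
    merged["annotations"].extend(anns)
    merged["categories"] = merged["categories"] or cats

with open("merged_train.json", "w") as f:
    json.dump(merged, f)
```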

Do these 6k images really matter? To be honest, I don't know; I did not train a model on the reduced dataset, and adding some good data can change the game a lot. So I would like to qualify my earlier statement about "getting a similar model": it refers to the fact that you can train a model on a slightly smaller, publicly available dataset.

The cell/row models haven't been released either. I trained those models on PubTabNet + ~1k hand-labelled samples. With the release of Table Transformer v2, I don't think they are any better.

@TheMrguiller
Author

Thank you for your invaluable insights. I've been experimenting with Table Transformer v2 based on your guidelines in the documentation, but I haven't noticed a significant improvement in my specific cases compared to other baseline methods. It's possible that my limited understanding of Table Transformer is hindering my progress, as its parameterization appears to rely heavily on trial and error. Admittedly, my tables are fairly complex, with a lot of visual embellishments. Any guidance you could provide, @JaMe76, would be greatly appreciated, though I hope not to cause any inconvenience.

@JaMe76
Contributor

JaMe76 commented May 8, 2024

TATR v2 does not require padding, as far as I remember. So reducing the default padding values to 0 in the configs should improve the segmentation results.
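
(A hedged sketch of how such an override could look when building the analyzer: `config_overwrite` follows the pattern shown in the deepdoctection docs, but the padding key names below are placeholders, so the actual keys need to be looked up in the analyzer's default config.)

```python
# Sketch only: override analyzer config values. `config_overwrite` follows the
# pattern from the deepdoctection docs, but the padding key names below are
# PLACEHOLDERS -- check the analyzer's default .yaml for the actual keys.
import deepdoctection as dd

analyzer = dd.get_dd_analyzer(
    config_overwrite=[
        "SEGMENTATION.PADDING_TOP=0",     # placeholder key name
        "SEGMENTATION.PADDING_BOTTOM=0",  # placeholder key name
        "SEGMENTATION.PADDING_LEFT=0",    # placeholder key name
        "SEGMENTATION.PADDING_RIGHT=0",   # placeholder key name
    ]
)
df = analyzer.analyze(path="document.pdf")  # placeholder path
df.reset_state()
```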

@TheMrguiller
Author

Hi,
Thank you for the feedback @JaMe76. After a bit of research and a lot of trials with TATR v2, I found this example https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Table%20Transformer/Inference_with_Table_Transformer_(TATR)_for_parsing_tables.ipynb which works really well. However, I found that the model tends to crop the table too short, which later causes problems when detecting table columns and rows. Also, by TATR v2 do you mean the base Table Transformer from Hugging Face, https://huggingface.co/microsoft/table-transformer-detection?
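
(For reference, a rough sketch of one way to work around the short crop: enlarge the detected table box by a margin before running structure recognition. The checkpoint names come from the Hugging Face hub; treating the v1.1 structure-recognition checkpoint as "TATR v2", as well as the margin value and the thresholds, are assumptions.)

```python
# Rough sketch (not the notebook code): detect the table, enlarge the crop by a
# margin so border rows/columns are not cut off, then run structure recognition.
# Margin, thresholds and the choice of the v1.1 structure checkpoint are assumptions.
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

page = Image.open("page.png").convert("RGB")  # placeholder input image

det_processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
det_model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")
str_processor = AutoImageProcessor.from_pretrained(
    "microsoft/table-transformer-structure-recognition-v1.1-all")
str_model = TableTransformerForObjectDetection.from_pretrained(
    "microsoft/table-transformer-structure-recognition-v1.1-all")

with torch.no_grad():
    det_out = det_model(**det_processor(images=page, return_tensors="pt"))
tables = det_processor.post_process_object_detection(
    det_out, threshold=0.7, target_sizes=torch.tensor([page.size[::-1]]))[0]

margin = 25  # extra pixels of context around the detected table box (tune this)
for box in tables["boxes"]:
    x0, y0, x1, y1 = box.tolist()
    crop = page.crop((int(max(0, x0 - margin)), int(max(0, y0 - margin)),
                      int(min(page.width, x1 + margin)), int(min(page.height, y1 + margin))))
    with torch.no_grad():
        str_out = str_model(**str_processor(images=crop, return_tensors="pt"))
    cells = str_processor.post_process_object_detection(
        str_out, threshold=0.6, target_sizes=torch.tensor([crop.size[::-1]]))[0]
    # cells["labels"] map to row / column / spanning-cell classes via str_model.config.id2label
```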
