How to extract the text of each category detected on a page? #83

tikitong · 2022-11-04T01:01:30Z

Hi, first of all thank you very much for this work, very interesting and useful!

I'm having a little trouble understanding the API. Take a use case like the one in the demo:

I want to extract the text section by section for each category: TEXT, TITLE, LIST, TABLE and FIGURE (exactly as in the "Contiguous text" demo output), but I don't understand which class to use for this. The exception is the TABLE category, which I can get through the deepdoctection.datapoint.page.Page class and its tables attribute. The get_text() method of the same class extracts the text for all categories without distinction.

import deepdoctection as dd
analyzer = dd.get_dd_analyzer(language='en')
df = analyzer.analyze(path="my.pdf")
for dp in df:  # loop over all pages
    ...  # Which class to use on dp to extract each detected category and get their text?

Many thanks.

Answered by JaMe76

JaMe76 · 2022-11-04T08:00:14Z

Maintainer

Hi, thanks for your question.

I have to admit that the consumer API is very confusing, and I am experimenting with a new one that will hopefully be easier to understand. The difficulty is to establish an API that can be used even with a model that predicts different categories than the ones currently in use.

For now, every layout block other than 'TABLE' is stored in Page.items. That said, you can get the name, reading order position and text of each layout block as follows:

import deepdoctection as dd

path = "/path/to/dir"
analyzer = dd.get_dd_analyzer()

df = analyzer.analyze(path=path)
df.reset_state()

for dp in df:
    for item in dp.items:
        print(f"reading order: {item.reading_order}")
        print(f"layout: {item.layout_type.value}")
        print(f"text: {item.text} \n")
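To get one bucket of text per category, as the question asks, the blocks can be grouped after sorting by reading order. The following is only a minimal sketch: `group_text_by_category` and `fake_item` are hypothetical helpers, the stand-in objects mimic just the three attributes used in the snippet above (`reading_order`, `layout_type.value`, `text`), and sorting blocks without a reading order to the end is an assumption, not library behaviour.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_text_by_category(items):
    """Return {category name: [text, ...]} with each list in reading order."""
    # Assumption: a block may lack a reading order (None); such blocks go last.
    ordered = sorted(
        items,
        key=lambda it: (it.reading_order is None, it.reading_order or 0),
    )
    grouped = defaultdict(list)
    for item in ordered:
        grouped[item.layout_type.value].append(item.text)
    return dict(grouped)


def fake_item(order, category, text):
    """Hypothetical stand-in for a layout block; not a deepdoctection class."""
    return SimpleNamespace(
        reading_order=order,
        layout_type=SimpleNamespace(value=category),
        text=text,
    )


demo_items = [
    fake_item(2, "TEXT", "First paragraph."),
    fake_item(1, "TITLE", "A heading"),
    fake_item(3, "LIST", "- a bullet"),
    fake_item(4, "TEXT", "Second paragraph."),
]

by_category = group_text_by_category(demo_items)
print(by_category)
```

Inside the page loop, the same helper would presumably be called as `group_text_by_category(dp.items)`; tables are not included because they are stored separately.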

4 replies

tikitong Nov 4, 2022
Author

Thank you very much for the quick response, it helped me understand the API better! I wrote the following extraction and sorting that may somewhat match what you did in the demo, for the layouts part:

import json
from operator import itemgetter

import deepdoctection as dd
...
annotations = {"file": path}

for dp in df:

    # layout blocks of the page, sorted by reading order
    layouts = sorted([{"reading order": item.reading_order,
                       "layout": item.layout_type.value,
                       "text": item.text,
                       "score": item.score} for item in dp.items], key=itemgetter("reading order"))

    # tables are stored separately in dp.tables
    tables = [{"layout": table.layout_type.value, "text": table.text,
               "score": table.score} for table in dp.tables]

    annotations[dp.file_name] = layouts + tables

with open('data.json', 'w', encoding='utf-8') as fp:
    json.dump(annotations, fp, ensure_ascii=False)

But I think I could also get this result faster by calling a method on df that generates the dictionary of results, with as_dict() or save(), right?

JaMe76 Nov 4, 2022
Maintainer

Your method is sensible if you are mainly interested in text and layout structures and do not need to recover the intrinsic data format.

When I work with a larger document corpus I usually save the intrinsic data format in a jsonlines file (or in several json files).

path = "path/to/large_pdf"

df = analyzer.analyze(path=path)
# save_image=True also saves the image as a b64 string in the jsonl file
df = dd.MapData(df, lambda dp: dp.get_export(save_image=True))
dd.SerializerJsonlines.save(df, "path/to/dir", "save_file.jsonl")

This allows me to load and parse my saved results page by page without worrying too much about memory consumption.

df = dd.SerializerJsonlines.load("path/to/dir/save_file.jsonl")
df = dd.MapData(df, lambda dp: dd.Page.from_dict(**dp))
df.reset_state()
for dp in df:
    ...     # now dp is again a Page instance
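The memory argument rests on the JSON Lines layout: one JSON object per line, so a reader can parse one page at a time. The following sketch illustrates that idea with the standard library only. It is not how `dd.SerializerJsonlines` is implemented; `save_jsonl`, `iter_jsonl` and the page dicts are hypothetical stand-ins.

```python
import json
import os
import tempfile


def save_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as fp:
        for record in records:
            fp.write(json.dumps(record, ensure_ascii=False) + "\n")


def iter_jsonl(path):
    """Yield one parsed record at a time instead of loading the whole file."""
    with open(path, encoding="utf-8") as fp:
        for line in fp:
            if line.strip():  # skip blank lines
                yield json.loads(line)


# Hypothetical page dicts; a real export contains many more fields.
pages = [
    {"file_name": "doc_0.pdf", "text": "page one"},
    {"file_name": "doc_1.pdf", "text": "page two"},
]

jsonl_path = os.path.join(tempfile.mkdtemp(), "save_file.jsonl")
save_jsonl(pages, jsonl_path)

loaded = []
for page in iter_jsonl(jsonl_path):  # only one page dict is parsed at a time
    loaded.append(page["file_name"])
print(loaded)
```

Because `iter_jsonl` is a generator, the file is read line by line, which is what keeps memory use roughly proportional to a single page rather than to the whole corpus.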

JaMe76 Nov 4, 2022
Maintainer

Note that there is another, more generic data format, Image, that is used while the page is processed by the analyzer pipeline. In the last step, the results are parsed into the Page format, which is what the consumer uses.

As mentioned before, I am not very happy with having two different data formats and an API that is not very consumer-friendly, and I am currently working on a common structure that will make the Page structure redundant.

tikitong Nov 6, 2022
Author

Thank you for this detailed answer. It helped me a lot. I am starting to understand the structure well. I have noted the Image data format and will look into it as well. Setting up a single common structure seems to be a very good idea.

yashsandansing · 2023-11-23T13:51:30Z

This answer does not work anymore. Any ideas on how to get boxes with their text in their respective reading order?

1 reply

JaMe76 Nov 23, 2023
Maintainer

The first code snippet now looks like this (bounding boxes included):

    import deepdoctection as dd

    path = "/path/to/file.pdf"
    analyzer = dd.get_dd_analyzer()

    df = analyzer.analyze(path=path)
    df.reset_state()

    for dp in df:
        for item in dp.layouts:
            print(f"reading order: {item.reading_order}")
            print(f"layout: {item.category_name.value}")
            print(f"text: {item.text} \n")
            print(f"bounding box: {item.bounding_box} \n")

Saving a PDF in a .jsonl file will look like this:

    import deepdoctection as dd

    path = "/path/to/file.pdf"

    analyzer = dd.get_dd_analyzer()
    df = analyzer.analyze(path=path)
    df = dd.MapData(df, lambda dp: dp.as_dict())  # This will save the image as b64 string in the jsonl file as well
    dd.SerializerJsonlines.save(df, "path/to/dir", "save_file.jsonl")
