
Extraction does not terminate #43

Open
adbar opened this issue Mar 18, 2022 · 4 comments

adbar commented Mar 18, 2022

On content-rich webpages the algorithm does not seem to terminate, leading to a deadlock which has to be interrupted. See adbar/trafilatura#189

Here is an archived version of the page where the problem has been found: https://web.archive.org/web/20220223144026/https://omnieq.com/

@miso-belica
Owner

It's really bad if it gets stuck, but the page https://omnieq.com/ does not seem very content-rich to me. There is almost no text; it seems to be only a table with some stock data? jusText was meant for extracting text from pages, not short tabular data.

@miso-belica miso-belica added the bug label Apr 9, 2022

adbar commented Apr 11, 2022 via email


miso-belica commented Apr 27, 2022

It seems it's not an infinite loop, but just very suboptimal code. There are ca. 50k small paragraphs found on the page, and the function that revises them ends up doing roughly 50,000 × 100,000 iterations because no nearby good or bad paragraph is found on the page, so it is very slow. Unfortunately, the function is not tested at all, so I need to write a set of tests first and then refactor it. It is taking more time than expected.

Below is the code that causes it. My idea for fixing it: instead of iterating outward to find the prev/next paragraph for each short paragraph, I can remember them while iterating over all the paragraphs once. That should reduce the work to ca. 50k iterations only.

jusText/justext/core.py

Lines 330 to 344 in cbd5a5c

for i, paragraph in enumerate(paragraphs):
    if paragraph.class_type != 'short':
        continue
    prev_neighbour = get_prev_neighbour(i, paragraphs, ignore_neargood=True)
    next_neighbour = get_next_neighbour(i, paragraphs, ignore_neargood=True)
    if prev_neighbour == 'good' and next_neighbour == 'good':
        new_classes[i] = 'good'
    elif prev_neighbour == 'bad' and next_neighbour == 'bad':
        new_classes[i] = 'bad'
    # it must be set(['good', 'bad'])
    elif (prev_neighbour == 'bad' and get_prev_neighbour(i, paragraphs, ignore_neargood=False) == 'neargood') or \
            (next_neighbour == 'bad' and get_next_neighbour(i, paragraphs, ignore_neargood=False) == 'neargood'):
        new_classes[i] = 'good'
    else:
        new_classes[i] = 'bad'
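For illustration, here is a rough sketch of the proposed fix: precompute each paragraph's nearest preceding and following neighbour class in two linear passes, so no per-paragraph outward scan is needed. This is a hypothetical standalone simplification, not the actual jusText code; it operates on bare class labels instead of paragraph objects, and boundary paragraphs (with no neighbour on one side) simply get `None` instead of whatever edge handling the real `get_prev_neighbour`/`get_next_neighbour` functions do.

```python
def revise_short_paragraphs(classes):
    """Reclassify 'short' paragraphs based on neighbours in O(n).

    classes: list of labels, each 'good', 'bad', 'neargood' or 'short'.
    Returns a new list with every 'short' label resolved.
    """
    n = len(classes)

    # Forward pass: for each position, remember the nearest preceding
    # neighbour class, both ignoring 'neargood' (strict) and not (loose).
    prev_strict, prev_loose = [None] * n, [None] * n
    last_strict = last_loose = None
    for i, cls in enumerate(classes):
        prev_strict[i], prev_loose[i] = last_strict, last_loose
        if cls in ('good', 'bad'):
            last_strict = last_loose = cls
        elif cls == 'neargood':
            last_loose = cls

    # Backward pass: same idea for the nearest following neighbour.
    next_strict, next_loose = [None] * n, [None] * n
    last_strict = last_loose = None
    for i in range(n - 1, -1, -1):
        next_strict[i], next_loose[i] = last_strict, last_loose
        if classes[i] in ('good', 'bad'):
            last_strict = last_loose = classes[i]
        elif classes[i] == 'neargood':
            last_loose = classes[i]

    # Final pass: apply the same decision rules as the original loop,
    # but using the precomputed neighbour arrays.
    new_classes = list(classes)
    for i, cls in enumerate(classes):
        if cls != 'short':
            continue
        if prev_strict[i] == 'good' and next_strict[i] == 'good':
            new_classes[i] = 'good'
        elif prev_strict[i] == 'bad' and next_strict[i] == 'bad':
            new_classes[i] = 'bad'
        elif (prev_strict[i] == 'bad' and prev_loose[i] == 'neargood') or \
                (next_strict[i] == 'bad' and next_loose[i] == 'neargood'):
            new_classes[i] = 'good'
        else:
            new_classes[i] = 'bad'
    return new_classes
```

Each paragraph is visited a constant number of times, so 50k paragraphs mean on the order of 50k steps rather than 50,000 × 100,000.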


adbar commented May 3, 2022

That sounds good, please work on it if you have the time!

2 participants