
Extraction does not terminate #43

Open
adbar opened this issue Mar 18, 2022 · 4 comments

adbar commented Mar 18, 2022

On content-rich webpages the algorithm does not seem to terminate, leading to a deadlock which has to be interrupted. See adbar/trafilatura#189

Here is an archived version of the page where the problem has been found: https://web.archive.org/web/20220223144026/https://omnieq.com/

@miso-belica
Owner

It's really bad if it gets stuck, but the page https://omnieq.com/ does not seem very content-rich to me. There is almost no text; it seems to be only a table with some stock data? jusText was meant for extracting text from pages, not short tabular data.

@miso-belica miso-belica added the bug label Apr 9, 2022

adbar commented Apr 11, 2022 via email


miso-belica commented Apr 27, 2022

It seems it's not an infinite loop, but just very suboptimal code. There are ca. 50k small paragraphs found on the page, and the function that revises them ends up doing roughly 50,000 × 100,000 iterations because no nearby good or bad paragraph is found on the page, so it is very slow. Unfortunately, the function is not tested at all, so I need to write a set of tests first and then refactor it. It is taking more time than expected.

Below is the code that causes it. My idea for fixing it: instead of iterating outward to find the prev/next paragraph for each short paragraph, I can remember them while iterating over all the paragraphs once. That should reduce the work to ca. 50k iterations only.

jusText/justext/core.py

Lines 330 to 344 in cbd5a5c

for i, paragraph in enumerate(paragraphs):
    if paragraph.class_type != 'short':
        continue
    prev_neighbour = get_prev_neighbour(i, paragraphs, ignore_neargood=True)
    next_neighbour = get_next_neighbour(i, paragraphs, ignore_neargood=True)
    if prev_neighbour == 'good' and next_neighbour == 'good':
        new_classes[i] = 'good'
    elif prev_neighbour == 'bad' and next_neighbour == 'bad':
        new_classes[i] = 'bad'
    # it must be set(['good', 'bad'])
    elif (prev_neighbour == 'bad' and get_prev_neighbour(i, paragraphs, ignore_neargood=False) == 'neargood') or \
            (next_neighbour == 'bad' and get_next_neighbour(i, paragraphs, ignore_neargood=False) == 'neargood'):
        new_classes[i] = 'good'
    else:
        new_classes[i] = 'bad'
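For illustration, here is a rough sketch of the proposed fix: precompute each paragraph's nearest preceding and following neighbour class in two linear passes, so no per-paragraph outward scan is needed. This is a hypothetical standalone simplification, not the actual jusText code; it operates on bare class labels instead of paragraph objects, and boundary paragraphs (with no neighbour on one side) simply get `None` instead of whatever edge handling the real `get_prev_neighbour`/`get_next_neighbour` functions do.

```python
def revise_short_paragraphs(classes):
    """Reclassify 'short' paragraphs based on neighbours in O(n).

    classes: list of labels, each 'good', 'bad', 'neargood' or 'short'.
    Returns a new list with every 'short' label resolved.
    """
    n = len(classes)

    # Forward pass: for each position, remember the nearest preceding
    # neighbour class, both ignoring 'neargood' (strict) and not (loose).
    prev_strict, prev_loose = [None] * n, [None] * n
    last_strict = last_loose = None
    for i, cls in enumerate(classes):
        prev_strict[i], prev_loose[i] = last_strict, last_loose
        if cls in ('good', 'bad'):
            last_strict = last_loose = cls
        elif cls == 'neargood':
            last_loose = cls

    # Backward pass: same idea for the nearest following neighbour.
    next_strict, next_loose = [None] * n, [None] * n
    last_strict = last_loose = None
    for i in range(n - 1, -1, -1):
        next_strict[i], next_loose[i] = last_strict, last_loose
        if classes[i] in ('good', 'bad'):
            last_strict = last_loose = classes[i]
        elif classes[i] == 'neargood':
            last_loose = classes[i]

    # Final pass: apply the same decision rules as the original loop,
    # but using the precomputed neighbour arrays.
    new_classes = list(classes)
    for i, cls in enumerate(classes):
        if cls != 'short':
            continue
        if prev_strict[i] == 'good' and next_strict[i] == 'good':
            new_classes[i] = 'good'
        elif prev_strict[i] == 'bad' and next_strict[i] == 'bad':
            new_classes[i] = 'bad'
        elif (prev_strict[i] == 'bad' and prev_loose[i] == 'neargood') or \
                (next_strict[i] == 'bad' and next_loose[i] == 'neargood'):
            new_classes[i] = 'good'
        else:
            new_classes[i] = 'bad'
    return new_classes
```

Each paragraph is visited a constant number of times, so 50k paragraphs mean on the order of 50k steps rather than 50,000 × 100,000.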


adbar commented May 3, 2022

That sounds good, please work on it if you have the time!

2 participants