Extraction does not terminate #43
Comments
It's really bad if it gets stuck, but the page https://omnieq.com/ does not seem very content-rich to me. There is almost no text. It seems to be only a table with some stock data? jusText was meant for extracting text from pages, not short tabular data.
I agree, it is not about extracting this bunch of data correctly.
It would be best not to start an endless loop on it though, using at
least a CPU core indefinitely.
It seems it's not an infinite loop, just very suboptimal code. There are approximately 50k small paragraphs found in the page, and the function that revises them iterates again to find the previous/next paragraph for each of them. The code that causes it is at lines 330 to 344 in cbd5a5c. My idea for a fix: instead of iterating to find the prev/next paragraph, I can remember them while iterating over all the paragraphs. That should reduce the iterations to approximately 50k only.
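To illustrate the idea of remembering the neighbours while iterating, here is a minimal sketch. It is not the code from the repository: the function name `nearest_decided_neighbours` and the plain list of class labels are assumptions made for illustration, and the labels `"good"`, `"bad"`, and `"short"` stand in for whatever classes the real paragraphs carry. One forward pass and one backward pass record, for every paragraph, the nearest earlier and later paragraph that already has a definite class, so no per-paragraph scan is needed.

```python
from typing import List, Optional, Tuple

# Labels treated as "decided" in this sketch (an assumption for illustration).
DECIDED = ("good", "bad")


def nearest_decided_neighbours(
    classes: List[str],
) -> Tuple[List[Optional[str]], List[Optional[str]]]:
    """For each index, return the class of the nearest previous and the
    nearest next paragraph whose label is in DECIDED (None if there is none).

    Two linear passes replace a backward/forward scan per paragraph.
    """
    n = len(classes)
    prev_decided: List[Optional[str]] = [None] * n
    next_decided: List[Optional[str]] = [None] * n

    # Forward pass: remember the last decided label seen so far.
    last: Optional[str] = None
    for i in range(n):
        prev_decided[i] = last
        if classes[i] in DECIDED:
            last = classes[i]

    # Backward pass: same idea, walking from the end.
    last = None
    for i in range(n - 1, -1, -1):
        next_decided[i] = last
        if classes[i] in DECIDED:
            last = classes[i]

    return prev_decided, next_decided


if __name__ == "__main__":
    prev_n, next_n = nearest_decided_neighbours(
        ["good", "short", "short", "bad", "short"]
    )
    print(prev_n)
    print(next_n)
```

With n paragraphs this does about 2n steps in total, whereas scanning outward from each paragraph can take on the order of n steps per paragraph when long runs of undecided paragraphs occur, as on the page reported here.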
That sounds good, please work on it if you have the time!
On content-rich webpages the algorithm does not seem to terminate, leading to a deadlock which has to be interrupted. See adbar/trafilatura#189
Here is an archived version of the page where the problem has been found: https://web.archive.org/web/20220223144026/https://omnieq.com/