Feedback #19

Open · jonashaag opened this issue Jun 23, 2022 · 18 comments

Comments
jonashaag commented Jun 23, 2022

Gave this a try :-)

Feedback:

  • If this library works as advertised, it'd be huge!
  • mlscraper.html is missing from the PyPI package.
  • When no scraper can be found, the error message could be more helpful:
    mlscraper.training.NoScraperFoundException: did not find scraper
    It would be nice if the error message gave some guidance as to which
    fields couldn't be found in the HTML. Even with DEBUG log level it's not
    really helpful. (A pre-flight check along these lines is sketched after
    the script below.)
  • See further notes in my script below.
  • Training the scraper was really slow (I gave up after 15 minutes).
import requests

from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

# Fetch the page that serves as the training sample.
jonas_url = "https://github.com/jonashaag"
resp = requests.get(jonas_url)
resp.raise_for_status()

# Annotate the values we expect the scraper to find on this page.
page = Page(resp.content)
sample = Sample(
    page,
    {
        "name": "Jonas Haag",
        "followers": "329",  # Note that this doesn't work if 329 is passed as an int.
        # "company": "@QuantCo",  # Does not work.
        "twitter": "@_jonashaag",  # Does not work without the "@".
        "username": "jonashaag",
        "nrepos": "282",
    },
)

training_set = TrainingSet()
training_set.add_sample(sample)

# Derive selectors from the single sample.
scraper = train_scraper(training_set)

# Apply the trained scraper to a different profile.
resp = requests.get("https://github.com/lorey")
result = scraper.get(Page(resp.content))
print(result)
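
As a concrete illustration of the error-message wish above, here's a hypothetical pre-flight check (not part of mlscraper) that reports which sample values don't occur verbatim in the raw HTML, since those can never be matched:

import requests

# Hypothetical helper, not mlscraper API: values that don't occur verbatim
# in the HTML can never be matched, so report them before training.
def report_missing_fields(html_text, fields):
    missing = [key for key, value in fields.items() if value not in html_text]
    if missing:
        print(f"values not found verbatim in the HTML: {missing}")

html_text = requests.get("https://github.com/jonashaag").text
report_missing_fields(html_text, {"company": "@QuantCo", "twitter": "@_jonashaag"})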
lorey (Owner) commented Jun 23, 2022

Hi Jonas, love the feedback. Thanks for taking the time. I might need to check more thoroughly, but here are some thoughts on things to be fixed/improved on my side:

  • I hope that (after fixing some of the hiccups) it soon will 😄
  • When I download the files, it's in there: https://pypi.org/project/mlscraper/1.0.0rc2/#files Did you install the release candidate with pip install --pre mlscraper?
  • Agree, will fix.
  • Will open issues for the notes, thanks.
  • As a first guess, I think the Twitter link has no specific markup, so no (simple) rule can be found. The most straightforward way would be to match it via the parent's itemprop="twitter" attribute (see the sketch below). This then leads to no scraper being found overall, which is (of course) an undesired outcome. Will think about a solution.
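
A minimal sketch of that parent-attribute idea, using plain BeautifulSoup rather than mlscraper's selector machinery, and assuming (per the guess above) that the profile wraps the link in an element carrying itemprop="twitter":

import requests
from bs4 import BeautifulSoup

# Assumption from the guess above: the Twitter link itself has no usable
# markup, but its parent carries itemprop="twitter", so we anchor there.
soup = BeautifulSoup(requests.get("https://github.com/jonashaag").content, "html.parser")
link = soup.select_one('[itemprop="twitter"] a')
if link is not None:
    print(link.get_text(strip=True))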

lorey (Owner) commented Jun 23, 2022

Resulting issues and enhancements have been added.

lorey (Owner) commented Jun 23, 2022

Just saw that @QuantCo is @Quantco on your profile. Maybe that's also related to #18

jonashaag (Author) commented:

> Just saw that @QuantCo is @Quantco on your profile

Oops, my bad.

jonashaag (Author) commented:

Re: pip, it installs 0.1.2 for me oO

pip install --pre mlscraper --no-deps
Collecting mlscraper
  Using cached mlscraper-0.1.2-py2.py3-none-any.whl (12 kB)
Installing collected packages: mlscraper
Successfully installed mlscraper-0.1.2

lorey (Owner) commented Jun 23, 2022

Okay, issue identified, cause still unclear. You would need the 1.0.0rc2 version.

Maybe it's because 1.0 is Python 3.9+? If that's not it, I'm out of ideas. I just tried with Docker and ubuntu-latest, and it worked like a charm.
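
If the Python guess is right, an explicit check like this would make the silent fallback obvious (the 3.9 bound is taken from the guess above, not from package metadata I've verified):

import sys

# Assumed bound per the comment above: mlscraper 1.0 needs Python >= 3.9,
# so on older interpreters pip silently resolves to the old 0.1.2 release.
if sys.version_info < (3, 9):
    raise RuntimeError("mlscraper 1.0.0rc2 needs Python >= 3.9; pip falls back to 0.1.2 here")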

jonashaag (Author) commented:

Yep, that's the cause. User error, case closed :)

lorey (Owner) commented Jun 23, 2022

While fixing, found #23

lorey (Owner) commented Jun 24, 2022

Have added the GitHub profiles as a test case and reworked training; it should now work reasonably fast.

CSS selectors are flaky at times; I need to find a reasonable heuristic to prefer good ones (a toy sketch of one follows below).
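
Purely as a toy sketch of such a heuristic (not mlscraper internals): score candidate CSS rules so that short, semantically anchored selectors beat brittle positional ones:

# Toy heuristic, not mlscraper code: prefer short rules with stable semantic
# anchors (ids, itemprop) over brittle positional selectors.
def selector_score(css_rule: str) -> float:
    score = -len(css_rule)   # shorter rules tend to generalize better
    if ":nth-child" in css_rule:
        score -= 50          # positional selectors break across layouts
    if "#" in css_rule or "itemprop" in css_rule:
        score += 25          # semantic hooks tend to be stable
    return score

candidates = ["h2 > span:nth-child(2)", 'meta[itemprop="url"]', "time"]
print(max(candidates, key=selector_score))  # -> 'meta[itemprop="url"]'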

jonashaag (Author) commented:

Here's another example that doesn't work, in case you're looking for work :-D

import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

article1_url = "https://www.spiegel.de/politik/kristina-haenel-nach-abstimmung-ueber-219a-im-bundestag-dieser-kampf-ist-vorbei-a-f3c04fb2-8126-4831-bc32-ac6c58e1e520"
resp = requests.get(article1_url)
resp.raise_for_status()

page = Page(resp.content)
sample = Sample(
    page,
    {
        "title": "»Dieser Kampf ist vorbei«",
        "subtitle": "Ärztin Kristina Hänel nach Abstimmung über 219a",
        "teaser": "Der umstrittene Paragraf zum »Werbeverbot« für Abtreibung ist seit heute Geschichte – und die Gießenerin Kristina Hänel, die seit Jahren dafür gekämpft hat, kann aufatmen. Wie geht es für die Medizinerin jetzt weiter?",
        "author": "Nike Laurenz",
        "published": "24.06.2022, 14.26 Uhr",
    },
)

training_set = TrainingSet()
training_set.add_sample(sample)

scraper = train_scraper(training_set)

resp = requests.get("https://www.spiegel.de/politik/deutschland/abtreibung-abschaffung-von-paragraf-219a-fuer-die-muendige-frau-kommentar-a-784cd403-f279-4124-a216-e320042d1719")
result = scraper.get(Page(resp.content))
print(result)

lorey (Owner) commented Jun 24, 2022

What does "doesn't work" mean in that context?

I think it's impossible to get this right with one sample (and especially for two slightly different pages). I would most likely fail to write a scraper myself just by looking at one page, too.

jonashaag (Author) commented:

It crashes (but only with one sample; I haven't tested more).

lorey (Owner) commented Jul 7, 2022

So regarding Spiegel Online: this was quite some work, as articles have different layouts. It took some major performance tweaks to get training to run in a sensible amount of time without sacrificing correctness. I still have issues with missing authors, because the scraper class raises an error instead of assuming None when no author is found, but that's fixable (an illustrative workaround is sketched below).

Issue #25

Here's the code: https://gist.github.com/lorey/fdb88d6c8e41b9b6bc8df264cffc68e1
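
Until that fix lands, an illustrative workaround could look like the following; the scraper_per_key attribute appears in the log output in the next comment, but the per-key get(page) call is an assumption I haven't verified against the actual classes:

# Illustrative only -- the per-key get(page) method is an assumption, not
# verified mlscraper API: fall back to None instead of raising when a
# selector (e.g. the author) matches nothing on a page.
def get_tolerant(dict_scraper, page):
    result = {}
    for key, value_scraper in dict_scraper.scraper_per_key.items():
        try:
            result[key] = value_scraper.get(page)
        except Exception:  # raised when nothing matches
            result[key] = None
    return result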

lorey (Owner) commented Jul 7, 2022

Fixed the authors issue; training now takes around 30s on my machine. Formatting of the output is mine:

INFO:root:found DictScraper (scraper_per_key={
    'published': <ValueScraper self.selector=<CssRuleSelector self.css_rule='time'>, self.extractor=<TextValueExtractor>>, 
    'subtitle': <ValueScraper self.selector=<CssRuleSelector self.css_rule='h2 .font-sansUI'>, self.extractor=<TextValueExtractor>>, 
    'title': <ValueScraper self.selector=<CssRuleSelector self.css_rule='h2 > span:nth-child(2)'>, self.extractor=<TextValueExtractor>>, 
    'teaser': <ValueScraper self.selector=<CssRuleSelector self.css_rule='meta[name="description"]'>, self.extractor=<AttributeValueExtractor self.attr='content'>>, 
    'authors': <ListScraper self.selector=<CssRuleSelector self.css_rule='header a.border-b'> self.scraper=<ValueScraper self.selector=<mlscraper.selectors.PassThroughSelector object at 0x7efda0f969a0>, self.extractor=<TextValueExtractor>>>
})

# results of newly scraped pages
{'published': '07.07.2022, 11.34 Uhr', 'subtitle': 'Absage an Forderung der Union', 'title': 'Lambrecht will keine Transportpanzer in die Ukraine liefern', 'teaser': 'CDU und CSU fordern eine kurzfristige Lieferung von 200 Fuchs-Panzern an die Ukraine. Die Bundesverteidigungsministerin erteilt dem Vorschlag eine klare Absage – mit Hinweis auf eigene Sicherheitsinteressen.', 'authors': []}
{'published': '07.07.2022, 11.32 Uhr', 'subtitle': 'Größter Vermieter Deutschlands', 'title': 'Vonovia will nachts die Heizungen herunterdrehen', 'teaser': 'Um Energie zu sparen, will Deutschlands größter Wohnungskonzern während der Nachtstunden die Vorlauftemperatur der Heizungsanlage absenken. Die Räume werden dann allenfalls noch rund 17 Grad warm.', 'authors': []}

jonashaag (Author) commented:

Impressive work 🤩

jonashaag (Author) commented Jul 7, 2022

Example from a commercial application: the price doesn't work, everything else works great.

"""
To use this:
pip install requests
pip install --pre mlscraper

To automatically build any scraper, check out https://github.com/lorey/mlscraper
"""

import logging

import requests

from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

ARTICLES = (
    {
        'url': 'https://www.rahm24.de/schlafen-und-wohnen/komfortmatratzen/schaumstoffmatratze-burmeier-basic-fit',
        'title': 'Schaumstoffmatratze Burmeier Basic-Fit',
        # 'price': '230,00 € *',
        'manufacturer': 'Burmeier',
    },
    {
        'url': 'https://www.rahm24.de/medizintechnik/inhalationstherapie/inhalationsgeraet-omron-ne-c28p',
        'title': 'Inhalationsgerät Omron NE-C28P',
        # 'price': '87,00 € *',
        'manufacturer': 'Omron',
    },
    {
        'url': 'https://www.rahm24.de/schlafen-und-wohnen/aufstehsessel/ruhe-und-aufstehsessel-innov-cocoon',
        'title': 'Ruhe- und Aufstehsessel Innov Cocoon',
        # 'price': '1.290,00 € *',
        'manufacturer': 'Innov`Sa',
    },
)


def train_and_scrape():
    """
    Train the scraper, then scrape another page.
    """
    scraper = train_medical_aid_scraper()

    urls_to_scrape = [
        'https://www.rahm24.de/pflegeprodukte/stoma/stoma-vlieskompressen-saliomed',
    ]
    for url in urls_to_scrape:
        # fetch page
        article_resp = requests.get(url)
        article_resp.raise_for_status()
        page = Page(article_resp.content)

        # extract result
        result = scraper.get(page)
        print(result)


def train_medical_aid_scraper():
    training_set = make_training_set_for_articles(ARTICLES)
    scraper = train_scraper(training_set, complexity=2)
    return scraper


def make_training_set_for_articles(articles):
    """
    This creates a training set to automatically derive selectors based on the given samples.
    """
    training_set = TrainingSet()
    for article in articles:
        # fetch page
        article_url = article['url']
        html_raw = requests.get(article_url).content
        page = Page(html_raw)

        # create and add sample
        sample = Sample(page, article)
        training_set.add_sample(sample)

    return training_set


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    train_and_scrape()

lorey (Owner) commented Jul 7, 2022

There's some weird whitespace causing issues. But it works if you change the price to a proper dot-notation price (which is hidden in the HTML):

ARTICLES = (
    {
        'url': 'https://www.rahm24.de/schlafen-und-wohnen/komfortmatratzen/schaumstoffmatratze-burmeier-basic-fit',
        'title': 'Schaumstoffmatratze Burmeier Basic-Fit',
        'price': '230.00',
        'manufacturer': 'Burmeier',
    },
    {
        'url': 'https://www.rahm24.de/medizintechnik/inhalationstherapie/inhalationsgeraet-omron-ne-c28p',
        'title': 'Inhalationsgerät Omron NE-C28P',
        'price': '87.00',
        'manufacturer': 'Omron',
    },
    {
        'url': 'https://www.rahm24.de/schlafen-und-wohnen/aufstehsessel/ruhe-und-aufstehsessel-innov-cocoon',
        'title': 'Ruhe- und Aufstehsessel Innov Cocoon',
        'price': '1290.00',
        'manufacturer': 'Innov`Sa',
    },
)

returns:

INFO:root:found DictScraper (scraper_per_key={
    'title': <ValueScraper self.selector=<CssRuleSelector self.css_rule='section header'>, self.extractor=<TextValueExtractor>>, 
    'manufacturer': <ValueScraper self.selector=<CssRuleSelector self.css_rule='li:nth-child(2) > span'>, self.extractor=<TextValueExtractor>>, 
    'price': <ValueScraper self.selector=<CssRuleSelector self.css_rule='.product--price meta'>, self.extractor=<AttributeValueExtractor self.attr='content'>>, 
    'url': <ValueScraper self.selector=<CssRuleSelector self.css_rule='meta[itemprop="url"]'>, self.extractor=<AttributeValueExtractor self.attr='content'>>
})
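
For reference, the '.product--price meta' selector above shows where the dot-notation price hides; a quick BeautifulSoup check (independent of mlscraper) confirms the idea:

import requests
from bs4 import BeautifulSoup

# The trained selector points at a <meta> tag inside .product--price whose
# content attribute carries the machine-readable price, e.g. "230.00".
html = requests.get("https://www.rahm24.de/schlafen-und-wohnen/komfortmatratzen/schaumstoffmatratze-burmeier-basic-fit").content
meta = BeautifulSoup(html, "html.parser").select_one(".product--price meta")
if meta is not None:
    print(meta.get("content"))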

lorey (Owner) commented Jul 7, 2022

I think generally this needs to be fixed by #15.
