redirection to a forbidden domain happened without slash suffix character in the web crawler #738

Open
msmygit opened this issue Nov 27, 2023 · 1 comment

msmygit (Collaborator) commented Nov 27, 2023

Setup

% langstream -V
LangStream CLI 0.5.0 (8162f382)

Web crawler configuration

pipeline:
  - name: "Crawl the WebSite"
    type: "webcrawler-source"
    configuration:
      seed-urls:
        - "https://aws.amazon.com/about-aws/whats-new/2023/11"
      allowed-domains:
        - "https://aws.amazon.com/about-aws/whats-new/2023/11"
      forbidden-paths: []
      ...

When we run the following command,

langstream docker run test -app examples/docker-chatbot -s ./secrets.yaml

we get the following warning in the log:

15:23:56.896 [crawler-webcrawler-source-1-runner-465eeb4a-f140-4b8d-b683-ee51ee76f401] INFO  a.l.a.webcrawler.WebCrawlerSource -- The last cycle didn't produce any new documents
15:23:56.896 [crawler-webcrawler-source-1-runner-465eeb4a-f140-4b8d-b683-ee51ee76f401] INFO  a.l.a.webcrawler.crawler.WebCrawler -- Crawling url: https://aws.amazon.com/about-aws/whats-new/2023/11
15:23:57.086 [crawler-webcrawler-source-1-runner-465eeb4a-f140-4b8d-b683-ee51ee76f401] WARN  a.l.a.webcrawler.crawler.WebCrawler -- A redirection to a forbidden domain happened (from https://aws.amazon.com/about-aws/whats-new/2023/11 to /about-aws/whats-new/2023/11/)
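
One possible reading of this warning, sketched in Python: the redirect target printed in the log is a relative path, so a plain prefix comparison against the allowed-domains entries cannot match it. The `is_allowed` helper below is hypothetical and is not taken from LangStream's source; it only shows how resolving the redirect against the crawled URL changes the result of such a check.

```python
from urllib.parse import urljoin


def is_allowed(url: str, allowed_prefixes: list[str]) -> bool:
    """Hypothetical prefix check, NOT LangStream's actual implementation."""
    return any(url.startswith(prefix) for prefix in allowed_prefixes)


# Values copied from the configuration and the log line above.
seed = "https://aws.amazon.com/about-aws/whats-new/2023/11"
location = "/about-aws/whats-new/2023/11/"  # redirect target as printed in the warning
allowed = [seed]

# Comparing the raw, relative target: it can never start with an absolute prefix.
raw_ok = is_allowed(location, allowed)

# Resolving the target against the URL being crawled before comparing.
resolved = urljoin(seed, location)
resolved_ok = is_allowed(resolved, allowed)

print(raw_ok, resolved, resolved_ok)
```

If this is what happens, the trailing slash probably helps because the server no longer has to redirect at all, though nothing in this issue confirms that.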

Workaround

Adding a trailing slash (/) to the entries in seed-urls and allowed-domains made the warning go away.

pipeline:
  - name: "Crawl the WebSite"
    type: "webcrawler-source"
    configuration:
      seed-urls:
        - "https://aws.amazon.com/about-aws/whats-new/2023/11/"
      allowed-domains:
        - "https://aws.amazon.com/about-aws/whats-new/2023/11/"
      forbidden-paths: []
      ...
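
For anyone generating these entries from a script, here is a minimal sketch of the same workaround in Python. The `with_trailing_slash` helper is a hypothetical name, not part of LangStream or its CLI, and it assumes every URL points at a directory-style page rather than a file such as `index.html`.

```python
from urllib.parse import urlsplit, urlunsplit


def with_trailing_slash(url: str) -> str:
    """Append '/' to the URL path if it is missing (hypothetical helper)."""
    parts = urlsplit(url)
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    # Rebuild the URL, leaving scheme, host, query and fragment untouched.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


seed_urls = ["https://aws.amazon.com/about-aws/whats-new/2023/11"]
normalized = [with_trailing_slash(u) for u in seed_urls]
print(normalized)
```

The output can then be pasted into both seed-urls and allowed-domains.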

@eolivelli (Member)

Happy that you have found a solution.

I have one question: do you want to index only one page?
