feat: Firecrawl Web Extractor #3988

Draft: wants to merge 7 commits into base: main

nickscamara

Description

This is a Dify x Firecrawl collab for creating knowledge bases from web pages. The Firecrawl team went ahead and created the Firecrawl web extractor, which takes care of scraping and crawling websites and turning them into LLM-ready markdown.

There are 3 modes to help y'all implement the front-end: crawl, scrape, and crawl_return_urls.

  • Crawl: crawls the entire website and outputs a list of documents.
  • Scrape: scrapes a single page only and outputs a list containing one document.
  • Crawl Return Urls (optional): a helper util that lets your team display the URLs found on a website when a user enters it. You can show that list, let the user pick which URLs they want ingested as a data source, and then feed the selected URLs to the /scrape endpoint.
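To make the difference between the three modes concrete, here is a minimal sketch of what each one returns. It is not the code in this PR: `Document`, `Site`, `discover_urls`, and `extract` are hypothetical names, and the "website" is an in-memory dict standing in for real HTTP calls to Firecrawl.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class Document:
    url: str
    markdown: str


# Hypothetical in-memory "website": URL -> (markdown, outgoing links).
# A real extractor would fetch pages over HTTP instead.
Site = Dict[str, Tuple[str, List[str]]]


def discover_urls(site: Site, start_url: str) -> List[str]:
    """Breadth-first walk over the links reachable from start_url."""
    seen = {start_url}
    order = [start_url]
    queue = deque([start_url])
    while queue:
        _, links = site[queue.popleft()]
        for link in links:
            # Skip links that leave the site or were already visited.
            if link in site and link not in seen:
                seen.add(link)
                order.append(link)
                queue.append(link)
    return order


def extract(site: Site, url: str, mode: str) -> Union[List[Document], List[str]]:
    """Return documents or URLs depending on the mode."""
    if mode == "scrape":
        # Single page: a list with exactly one document.
        return [Document(url, site[url][0])]
    if mode == "crawl":
        # Whole site: one document per reachable page.
        return [Document(u, site[u][0]) for u in discover_urls(site, url)]
    if mode == "crawl_return_urls":
        # URLs only, so a UI can let the user choose what to ingest.
        return discover_urls(site, url)
    raise ValueError(f"unknown mode: {mode}")
```

Under these assumptions, a front end could call `extract(site, url, "crawl_return_urls")`, show the returned list, and then call `extract(site, chosen_url, "scrape")` for each URL the user selects.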

Questions / Doubts:

  • We were a bit unsure about the auth side of the web extractor. We were going to follow the Notion example, but because Notion authenticates via OAuth and we only use a Bearer API key, it didn't make much sense to copy that. So we ended up not touching that side of things for now. It would be good to get some clarification on the best approach there, but y'all can probably implement it much faster than we can, so feel free to tackle it.
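For reference, Bearer auth on the request side is just a header, so the open question is mostly where Dify stores the key. A minimal sketch, assuming the API key is already available as a string; `bearer_headers` is a hypothetical helper, not something in this PR:

```python
from typing import Dict


def bearer_headers(api_key: str) -> Dict[str, str]:
    """Build request headers for an API that expects a Bearer API key."""
    key = api_key.strip()
    if not key:
        raise ValueError("API key must not be empty")
    return {
        # Standard HTTP Bearer scheme: "Authorization: Bearer <token>".
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

Unlike the OAuth flow used for Notion, there is no redirect or token exchange here: the key is stored once and attached to every request.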

I left [REVIEW] tags in a few places where we weren't too sure about the approach.

Type of Change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update, included: Dify Document

How Has This Been Tested?

  • Unit tests for Firecrawl web extractor

TODO:

  • Firecrawl Web extractor (Firecrawl team)
  • Firecrawl Authentication / Bearer Auth data source (?) (probably Dify team, but Firecrawl team can help if needed; we just need further info)
  • Front-end for web loader (Dify team)

ccing: @guchenhe @rafaelsideguide

@dosubot (bot) added the size:L (this PR changes 100-499 lines, ignoring generated files), 💪 enhancement (new feature or request), and 📚 documentation (improvements or additions to documentation) labels on Apr 30, 2024
@takatost marked this pull request as draft on April 30, 2024 05:39