feat: Firecrawl Web Extractor #3988

nickscamara · 2024-04-30T00:38:31Z

Description

Dify x Firecrawl collab for creating a knowledge base from web pages. The Firecrawl team went ahead and created the Firecrawl web extractor, which takes care of scraping and crawling websites and turning them into LLM-ready markdown.

There are 3 modes to help y'all implement the front-end: crawl, scrape, and crawl_return_urls.

Crawl: crawls the entire website and outputs a list of documents.
Scrape: scrapes a single page only and outputs a list containing 1 document.
Crawl Return Urls (optional): a helper util that lets your team show the URLs found when a person enters a website. That way you can display the list and have them pick which ones they want ingested as a data source. You can then just feed those URLs to the /scrape endpoint.
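
To make the three modes concrete, here is a minimal sketch of how a caller might dispatch on them. It is not the code in this PR: `Document`, `extract`, `fetch_markdown`, and `discover_urls` are hypothetical names, and the fetching is replaced by an in-memory stand-in so the example runs without network access or a Firecrawl API key.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Document:
    """Hypothetical container for one extracted page."""
    url: str
    markdown: str


def extract(
    mode: str,
    url: str,
    fetch_markdown: Callable[[str], str],
    discover_urls: Callable[[str], List[str]],
) -> Union[List[Document], List[str]]:
    """Dispatch on the three modes described above (illustrative only)."""
    if mode == "scrape":
        # Single page: still a list, but with exactly one document.
        return [Document(url, fetch_markdown(url))]
    if mode == "crawl":
        # Whole site: one document per discovered URL.
        return [Document(u, fetch_markdown(u)) for u in discover_urls(url)]
    if mode == "crawl_return_urls":
        # Only the URLs, so a front-end can let the user pick.
        return discover_urls(url)
    raise ValueError(f"unknown mode: {mode!r}")


# In-memory stand-in for a website (an assumption, not real Firecrawl output).
FAKE_SITE = {
    "https://example.com": "# Home",
    "https://example.com/docs": "# Docs",
}


def fake_fetch(url: str) -> str:
    return FAKE_SITE[url]


def fake_discover(url: str) -> List[str]:
    return [u for u in FAKE_SITE if u.startswith(url)]


# Front-end flow: list URLs, let the user pick some, then scrape only those.
urls = extract("crawl_return_urls", "https://example.com", fake_fetch, fake_discover)
picked = urls[:1]  # pretend the user ticked the first URL
docs = [d for u in picked for d in extract("scrape", u, fake_fetch, fake_discover)]
```

The last three lines mirror the suggested UI flow: list the URLs first, then scrape only the ones the user selected.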

Questions / Doubts:

We were a bit unsure about the auth side of the web extractor. We were going to follow the Notion example, but because Notion authenticates via OAuth and we only use a Bearer API key, it didn't make much sense to copy that. So we ended up not touching that side of things for now. It would be good to get some clarification on the best approach there, but y'all can probably implement that way faster than we can, so feel free to tackle it.
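
For reference, Bearer auth needs far less than an OAuth flow: the stored key is simply sent in the `Authorization` header on each request. A minimal sketch, where `bearer_headers` is a hypothetical helper and how Dify stores or looks up the key is left open (that is the question above):

```python
from typing import Dict


def bearer_headers(api_key: str) -> Dict[str, str]:
    """Build request headers for Bearer-token auth (hypothetical helper)."""
    if not api_key:
        raise ValueError("an API key is required")
    return {
        # Standard Bearer scheme: "Authorization: Bearer <token>"
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }


# "fc-example-key" is a placeholder, not a real credential.
headers = bearer_headers("fc-example-key")
```

Because there is no token exchange or refresh step, the data source would only need to persist one secret string per workspace.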

I ended up leaving [REVIEW] tags in a few places where we weren't too sure.

Type of Change

New feature (non-breaking change which adds functionality)
This change requires a documentation update, included: Dify Document

How Has This Been Tested?

Unit tests for Firecrawl web extractor

TODO:

Firecrawl Web extractor (Firecrawl team)
Firecrawl Authentication / Bearer Auth data source (?) (Probably Dify team, but Firecrawl team can help if needed; we just need further info)
Front-end for web loader (Dify team)

ccing: @guchenhe @rafaelsideguide

nickscamara added 6 commits April 29, 2024 16:47

Nick: init

5be033d

Nick: more readable

bd2fbbe

Nick: added tests and envs

b318b5e

Nick:

21cfd5e

Merge branch 'main' into nsc/firecrawl-integration

271ef1b

Update extract_setting.py

0cb6bed

dosubot bot added the size:L (this PR changes 100-499 lines, ignoring generated files), 💪 enhancement, and 📚 documentation labels Apr 30, 2024

takatost requested review from guchenhe, JohnJyong and VincePotato April 30, 2024 04:57

takatost marked this pull request as draft April 30, 2024 05:39

Update firecrawl_app.py

01b0bac
