Add MediaWiki and Wikipedia Connectors #1250

Merged on May 24, 2024 (18 commits)

Conversation

qthequartermasterman
Contributor

qthequartermasterman commented Mar 23, 2024

Resolves #1141.

I am happy to iterate on this with inputs from the devs.

Summary

This PR adds a general MediaWikiConnector that can connect to most MediaWiki sites, including Wikipedia, Fandom sites, and many others. It also adds a WikipediaConnector subclass, a light wrapper around MediaWikiConnector that uses pywikibot's special handling for Wikipedia.

The connector is based on pywikibot.

It will optionally recurse over categories to obtain additional pages.

It supports both polling and loading.
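To illustrate the optional category recursion, here is a minimal sketch of a depth-limited traversal. It is not the code in this PR and does not call pywikibot: the `WIKI` dictionary, `pages_in_category`, and `collect_pages` are hypothetical stand-ins, and a real connector would fetch category members from the site instead of an in-memory dict.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

# Hypothetical in-memory wiki: category name -> (pages, subcategories).
# A real connector would query the MediaWiki site for these instead.
WIKI: Dict[str, Tuple[List[str], List[str]]] = {
    "Science": (["Scientific method"], ["Physics", "Biology"]),
    "Physics": (["Gravity", "Scientific method"], ["Quantum mechanics"]),
    "Biology": (["Cell"], []),
    "Quantum mechanics": (["Qubit"], ["Science"]),  # cycle back to the root
}


def pages_in_category(
    name: str, depth: int, seen: Optional[Set[str]] = None
) -> Iterator[str]:
    """Yield page titles in `name`, descending at most `depth` subcategory levels."""
    seen = set() if seen is None else seen
    if name in seen:  # guard against category cycles
        return
    seen.add(name)
    pages, subcategories = WIKI.get(name, ([], []))
    yield from pages
    if depth > 0:
        for sub in subcategories:
            yield from pages_in_category(sub, depth - 1, seen)


def collect_pages(categories: List[str], depth: int) -> List[str]:
    """Collect unique page titles across categories, keeping first-seen order."""
    titles: Dict[str, None] = {}
    for category in categories:
        for title in pages_in_category(category, depth):
            titles.setdefault(title, None)
    return list(titles)
```

With `depth=0` only the pages directly in each listed category are returned; larger values pull in pages from subcategories, and the `seen` set keeps a cyclic category graph from recursing forever.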

Possible Future improvements

There is a solution for handling general MediaWiki sites that generates a Family class automatically by querying a given site using several heuristics (built into pywikibot). It does not handle special cases, however. Wikipedia, for example, has some extra language sites that the generic technique would not find. The special Family class for Wikipedia is built into pywikibot and is used here. pywikibot ships many more special Family classes for other sites. None of those are included, because it's not clear to me which ones would be useful.

Additionally, there is no special handling for other page types, such as talk pages; only regular pages and categories are supported.


vercel bot commented Mar 23, 2024

@qthequartermasterman is attempting to deploy a commit to the Danswer Team on Vercel.

A member of the Team first needs to authorize it.

@qthequartermasterman
Contributor Author

@yuhongsun96 How can I make this easier to review?

# Conflicts:
#	backend/danswer/configs/constants.py
#	backend/danswer/connectors/factory.py
#	web/src/components/icons/icons.tsx
#	web/src/lib/types.ts
@yuhongsun96
Contributor

Hi! Will try to get to it soon. Apologies for the delay, and thanks for your patience with us.

Thanks also for the great work and contribution!

@qthequartermasterman
Contributor Author

@yuhongsun96 Any update on this?

@yuhongsun96
Contributor

Taking a look now 🫡, thanks!

@yuhongsun96
Contributor

Looks good, a couple requests:

  • Let's make these the Polling type (so it should show in the bottom section and pull in updated documents every day or so)
  • Let's place the icons before request tracker in the bottom list
  • Would be really nice if you also created a guide page for it in the docs: https://github.com/danswer-ai/danswer-docs
  • Please rebase it, looks like only minor conflicts
[Image: Screenshot 2024-05-23 at 2 15 49 PM]

Thanks for the amazing work!

@qthequartermasterman
Contributor Author

@yuhongsun96

  • Let's make these the Polling type (so it should show in the bottom section and pull in updated documents every day or so)

The connectors already inherit from PollConnector: MediaWikiConnector directly, and WikipediaConnector via MediaWikiConnector.

class MediaWikiConnector(LoadConnector, PollConnector):
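To make the inheritance chain concrete, here is a small self-contained sketch. The base classes below are simplified stand-ins rather than Danswer's real interfaces: the `Page` type, the method names `load_from_state` and `poll_source`, and the float timestamps are assumptions for illustration only.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Page:
    """Hypothetical document record used only in this sketch."""
    title: str
    last_edited: float  # seconds since the Unix epoch


class LoadConnector(ABC):
    """Stand-in interface: fetch everything in one pass."""

    @abstractmethod
    def load_from_state(self) -> Iterator[Page]: ...


class PollConnector(ABC):
    """Stand-in interface: fetch only what changed in a time window."""

    @abstractmethod
    def poll_source(self, start: float, end: float) -> Iterator[Page]: ...


class MediaWikiConnector(LoadConnector, PollConnector):
    def __init__(self, pages: List[Page]) -> None:
        # A real connector would read pages from the wiki, not from a list.
        self.pages = pages

    def load_from_state(self) -> Iterator[Page]:
        yield from self.pages

    def poll_source(self, start: float, end: float) -> Iterator[Page]:
        for page in self.pages:
            if start <= page.last_edited <= end:
                yield page


class WikipediaConnector(MediaWikiConnector):
    """Gets both loading and polling through MediaWikiConnector."""
```

Because WikipediaConnector subclasses MediaWikiConnector, `isinstance(WikipediaConnector([]), PollConnector)` holds without the subclass naming PollConnector itself.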

I also swapped category: SourceCategory.ImportedKnowledge for category: SourceCategory.AppConnection in sources.py, so that the connectors appear in the bottom section of the admin page.

Is that what you're referring to?

Also, would you like me to update the refreshFreq in page.tsx for both connectors to one day? It's currently the default of 10 minutes.

  • Let's place the icons before request tracker in the bottom list

Does this look like what you're asking?

[Image: Screenshot 2024-05-23 at 9 39 32 PM]

For the docs guide page, I will open a PR shortly. It may be a few days given the upcoming holiday weekend.

@yuhongsun96
Contributor

Ya, that's perfect. The bottom section is for "poll" connectors and the top for "load"; that's the way most users think about it! Granted, the Web connector does update, but a lot of people already have it mentally associated the other way, so we never moved it :P

I can change the poll frequency myself, that's trivial. A day seems reasonable!

Thanks for the amazing work and looking forward to the docs!

yuhongsun96 merged commit 94018e8 into danswer-ai:main May 24, 2024
1 check failed
Development

Successfully merging this pull request may close these issues.

Add connector for MediaWiki sites (including Wikipedia and Fandom)
2 participants