Add connector for MediaWiki sites (including Wikipedia and Fandom) #1141

Closed · qthequartermasterman opened this issue Feb 27, 2024 · 0 comments · Fixed by #1250

qthequartermasterman (Contributor) commented Feb 27, 2024
I think that supporting MediaWiki sites (including Wikipedia and Fandom) as connectors would be helpful. Below is a background investigation I did into what would be needed to support this connector, along with some sample proof-of-concept code that I threw together for my own personal needs but that isn't ready for a PR in this repo.

I would be happy to help contribute the general MediaWiki connector if the maintainers feel that such a contribution would be helpful and if they could provide some guidance on design.

Background

MediaWiki is an open-source wiki software powering many of the largest wiki sites on the web, including Wikipedia, Fandom (formerly Wikia), wikiHow, and many others. According to its Wikipedia page, it is also used internally at many large corporations and government organizations:

MediaWiki is also used internally by a large number of companies, including Novell and Intel.[34][35]

Notable usages of MediaWiki within governments include Intellipedia, used by the United States Intelligence Community, Diplopedia, used by the United States Department of State, and milWiki, a part of milSuite used by the United States Department of Defense. United Nations agencies such as the United Nations Development Programme and INSTRAW chose to implement their wikis using MediaWiki, because "this software runs Wikipedia and is therefore guaranteed to be thoroughly tested, will continue to be developed well into the future, and future technicians on these wikis will be more likely to have exposure to MediaWiki than any other wiki software."[36]

I can attest that it is frequently used for internal knowledge stores, and it would be incredibly helpful to be able to use MediaWiki sites as a source for Danswer.

MediaWiki API + pywikibot

MediaWiki instances maintain a public-facing API, and most (if not all) MediaWiki sites allow access to that API. This API makes it easy to request pages (including metadata such as revision timestamps) and other content without having to scrape the pages manually. The ability to filter by category, page name, revision date, or other metadata additionally makes it convenient in a context such as Danswer, since Danswer wouldn't have to scrape every page each time it updates its index.
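
For illustration, a raw Action API request (a sketch using plain requests, nothing Danswer-specific) looks roughly like this:

# Minimal sketch of a raw MediaWiki Action API call with the requests library.
import requests

API_URL = "https://en.wikipedia.org/w/api.php"  # Wikipedia's standard endpoint

params = {
    "action": "query",
    "prop": "revisions",
    "titles": "MediaWiki",
    "rvprop": "timestamp|content",
    "rvslots": "main",
    "format": "json",
    "formatversion": "2",
}

response = requests.get(API_URL, params=params, timeout=30)
response.raise_for_status()

page = response.json()["query"]["pages"][0]
revision = page["revisions"][0]
print(page["title"], revision["timestamp"])
print(revision["slots"]["main"]["content"][:200])  # first 200 characters of wikitext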

Although this API is powerful (and well documented), it is a little clunky to use. The Wikimedia Foundation (which owns the MediaWiki project) also publishes a corresponding Python package called pywikibot that makes API requests easier. I have personally used this package to implement a custom Danswer connector (far too specific to my use case to submit as a general connector to this repository), and I can attest that it is incredibly easy to use and integrate as part of a Danswer connector.
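
For a sense of how little code pywikibot needs, here is a minimal sketch (it assumes a default pywikibot setup; without a user-config.py you may need to set PYWIKIBOT_NO_USER_CONFIG=1):

# Minimal pywikibot sketch using the bundled 'wikipedia' family.
import pywikibot

site = pywikibot.Site("en", "wikipedia")
page = pywikibot.Page(site, "MediaWiki")

print(page.title())
print(page.oldest_revision.timestamp)  # creation time
print(page.latest_revision.timestamp)  # last edit time
print(page.text[:200])                 # raw wikitext
print([category.title() for category in page.categories()])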

Danswer Connector

Configuration

API details

To use a MediaWiki site's API in pywikibot, you need to define what it calls a Family: a collection of related wikis that share some common structure (usually multi-lingual versions of the same site), including a hostname, sub-wiki names, and an API endpoint. Family classes also define a version (referring to the targeted MediaWiki version), but in my limited experimentation it is actually not relevant.

There would need to be some way for the Danswer user to specify this Family (or for Danswer to construct the Family from a hostname). According to the official pywikibot docs, there are scripts that can automatically build the Family definition from a wiki's metadata (and there are many Families already built into pywikibot itself). Theoretically, this could be used to create that Family definition, so the Danswer user would only need to put in a hostname.
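
For example, recent pywikibot versions ship a family.AutoFamily helper that builds a single-site Family from an API URL; something along these lines might let the user provide only the URL (a sketch, which I haven't verified against older pywikibot releases):

# Sketch: build a Family from an API endpoint URL instead of hand-writing a class.
# family.AutoFamily is available in recent pywikibot releases.
import pywikibot
from pywikibot import family

fam = family.AutoFamily("fallout", "https://fallout.fandom.com/api.php")
site = pywikibot.Site(fam=fam, code=fam.code)  # single-site family: code defaults to the family name
print(site)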

Each MediaWiki site defines its own API endpoint (i.e., the API itself is the same, but the URL used to access it may change). Knowing this API endpoint is necessary to talk to the site at all.

Many major sites (especially those hosted by Wikimedia) follow the standard /w/api.php path (for example, https://en.wikipedia.org/w/api.php), but other sites are free to deviate from this. Many sites in the Fandom network do, for example the Fallout wiki: https://fallout.fandom.com/api.php. This path would need to be specified by the user.

Theoretically, if the user only has the hostname, this API endpoint could be scraped dynamically from the Special:Version page, which outputs metadata about the MediaWiki instance, including its API endpoint. I haven't personally needed to do this, so as of now I can only say it's "theoretically possible".
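
A related approach (also untested by me): every MediaWiki page advertises its API endpoint via a rel="EditURI" link tag in the HTML head, so the endpoint could be discovered from just the hostname with something like:

# Sketch: discover a wiki's api.php endpoint from its hostname via the
# rel="EditURI" <link> tag that MediaWiki emits on every page.
import re
import requests


def discover_api_endpoint(hostname: str) -> str | None:
    html = requests.get(f"https://{hostname}/", timeout=30).text
    match = re.search(r'rel="EditURI"[^>]*href="([^"?]+)', html)
    if match is None:
        return None
    href = match.group(1)
    return f"https:{href}" if href.startswith("//") else href


print(discover_api_endpoint("en.wikipedia.org"))    # e.g. https://en.wikipedia.org/w/api.php
print(discover_api_endpoint("fallout.fandom.com"))  # e.g. https://fallout.fandom.com/api.php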

What to Index

Many MediaWiki sites are very, very large. It would be nice to let a Danswer user specify which specific pages and/or categories and/or portals they would like to index (perhaps including pages recursively linked), to avoid having to index an entire MediaWiki site.
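
With pywikibot's page generators, restricting the crawl to a handful of categories (optionally recursing into subcategories) is straightforward; a rough sketch:

# Sketch: index only selected categories instead of an entire wiki.
import itertools

import pywikibot
from pywikibot import pagegenerators

site = pywikibot.Site("en", "wikipedia")
category = pywikibot.Category(site, "Category:MediaWiki")

# recurse=1 also walks one level of subcategories.
pages = pagegenerators.CategorizedPageGenerator(category, recurse=1)
for page in itertools.islice(pages, 10):  # cap purely for demonstration
    print(page.title(), page.latest_revision.timestamp)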

Sample Code

A proof-of-concept connector is included at the bottom of this issue (see the Proof-of-Concept section below). It is incomplete, hyperspecific, and unoptimized, but it should provide some inspiration for a future design.

Supporting Wikipedia as a Connector

Supporting MediaWiki sites in this general way means supporting Wikipedia basically for free. The only difference would be specifying the hardcoded API link and adding another connector to the list.

Another consideration might be limiting the user to indexing only smaller portions of the site (some collection of categories or specific pages, for example), since Wikipedia is very, very large: the current English version of Wikipedia, text only and with no edit history, measures over 50 GB.

Supporting Fandom as a Connector

Supporting arbitrary Fandom sites would require little work beyond the general MediaWiki connector described above.

Proof-of-Concept

Below is an incomplete, hyperspecific, and unoptimized version of a Danswer connector that I put together for a very specific problem I have (scraping game lore from a specific Fandom site to be indexed by Danswer). It should provide some inspiration for a future design, however. The parts that would need to be made more general are constructing the Family object for the pywikibot.Site object and parsing the Sections of each page.

import itertools
from collections.abc import Generator
from typing import Any

import pywikibot
from pywikibot import family, pagegenerators

from danswer.configs.app_configs import INDEX_BATCH_SIZE
from danswer.configs.constants import DocumentSource
from danswer.connectors.interfaces import GenerateDocumentsOutput
from danswer.connectors.interfaces import LoadConnector
from danswer.connectors.interfaces import PollConnector
from danswer.connectors.interfaces import SecondsSinceUnixEpoch
from danswer.connectors.models import Document
from danswer.connectors.models import Section


class MediaWikiConnector(LoadConnector, PollConnector):
    def __init__(
        self,
        categories: list[str],
        pages: list[str],
        hostname: str,
        script_path: str,
        recurse_depth: int | None,
        batch_size: int = INDEX_BATCH_SIZE,
    ) -> None:

        # Define a single-site pywikibot Family on the fly from the user-provided
        # hostname and script path.
        class Family(family.Family):
            name = 'hostname'  # NOTE: placeholder family identifier; could be derived from the hostname

            langs = {
                'en': None,
            }

            # A few selected big languages for things that we do not want to loop over
            # all languages. This is only needed by the titletranslate.py module, so
            # if you carefully avoid the options, you could get away without these
            # for another wiki family.
            languages_by_size = ['en']

            def hostname(self, code):
                return hostname

            def scriptpath(self, code):
                return script_path

            def version(self, code):
                return "1.39.6"  # Which version of MediaWiki is used?

        self.site = pywikibot.Site(fam=Family(), code='en')
        self.batch_size = batch_size

        self.categories = [
            pywikibot.Category(self.site, f"Category:{category.replace(' ', '_')}")
            for category in categories
        ]
        self.pages = [pywikibot.Page(self.site, page) for page in pages]
        self.recurse_depth = recurse_depth


    def load_credentials(self, credentials: dict[str, Any]) -> dict[str, Any] | None:
        return None

    def _get_doc_from_page(self, page: pywikibot.Page) -> Document:
        return Document(
            source=DocumentSource.MEDIAWIKI,
            title=page.title(),
            text=page.text,
            url=page.full_url(),
            created_at=page.oldest_revision.timestamp,
            updated_at=page.latest_revision.timestamp,
            sections=[
                # TODO: extract individual sections of the page out using the WikiText markup language.
                Section(
                    link=page.full_url(),
                    text=page.text,
                )
            ],
            semantic_identifier=page.title(),
            metadata={
                "categories": [category.title() for category in page.categories()]
            },
            id=page.pageid,
        )

    def _get_doc_batch(
            self,
            start: SecondsSinceUnixEpoch | None = None,
            end: SecondsSinceUnixEpoch | None = None,
    ) -> Generator[list[Document], None, None]:
        doc_batch: list[Document] = []

        all_pages = itertools.chain(
            self.pages,
            *[
                pagegenerators.CategorizedPageGenerator(category, recurse=self.recurse_depth)
                for category in self.categories
            ],
        )
        for page in all_pages:
            if start and page.latest_revision.timestamp.timestamp() < start:
                continue
            if end and page.oldest_revision.timestamp.timestamp() > end:
                continue
            doc_batch.append(
                self._get_doc_from_page(page)
            )
            if len(doc_batch) >= self.batch_size:
                yield doc_batch
                doc_batch = []
        if doc_batch:
            yield doc_batch

    def load_from_state(self) -> GenerateDocumentsOutput:
        return self.poll_source(None, None)

    def poll_source(
            self, start: SecondsSinceUnixEpoch | None, end: SecondsSinceUnixEpoch | None
    ) -> GenerateDocumentsOutput:
        return self._get_doc_batch(start, end)
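
For reference, this is roughly how the class above would be used (the parameter values are hypothetical; the Fallout wiki's API lives at https://fallout.fandom.com/api.php, so its script path is empty, whereas Wikipedia's would be /w):

# Hypothetical usage of the proof-of-concept connector above.
connector = MediaWikiConnector(
    categories=["Fallout: New Vegas characters"],  # hypothetical category name
    pages=["Fallout: New Vegas"],                  # hypothetical page name
    hostname="fallout.fandom.com",
    script_path="",
    recurse_depth=1,
)

for doc_batch in connector.load_from_state():
    for doc in doc_batch:
        print(doc.semantic_identifier)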