I think that supporting MediaWiki sites (including Wikipedia and Fandom) as connectors would be helpful. Below is a background investigation into what supporting this connector would require, along with some proof-of-concept code that I threw together for my own needs but that isn't ready for a PR in this repo.
I would be happy to help contribute a general MediaWiki connector if the maintainers feel such a contribution would be helpful and could provide some guidance on design.
Background
MediaWiki is open source wiki software powering many of the largest wikis on the web, including Wikipedia, Fandom (formerly Wikia), wikiHow, and many others. According to its Wikipedia page, it is also used internally at many large corporations and government organizations:
> MediaWiki is also used internally by a large number of companies, including Novell and Intel.
I can attest that it is frequently used for internal knowledge stores, and it would be incredibly helpful to be able to use MediaWiki sites as a source for Danswer.
MediaWiki API + pywikibot
MediaWiki instances expose a public-facing API, and most (if not all) MediaWiki sites allow access to it. This API makes it easy to request pages (including metadata such as revision timestamps) and other content without scraping the pages manually. The ability to filter by category, page name, revision date, and other metadata also makes it convenient in a context such as Danswer, since the connector wouldn't have to re-fetch every page each time it updates its index.
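To give a sense of the API's shape, here is a minimal sketch (using the requests package; the endpoint and page title are just illustrative) that fetches a page's latest revision timestamp and wikitext:

```python
import requests

# Ask the MediaWiki Action API for the latest revision of one page,
# including its timestamp and content.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "revisions",
        "titles": "MediaWiki",
        "rvprop": "timestamp|content",
        "rvslots": "main",
        "format": "json",
    },
)

# The response groups results by page ID under query.pages.
for page in resp.json()["query"]["pages"].values():
    print(page["title"], page["revisions"][0]["timestamp"])
```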
Although this API is powerful (and well documented), it is a little clunky to use directly. The Wikimedia Foundation (which owns the MediaWiki project) also publishes a corresponding Python package called pywikibot that makes API requests easier. I have personally used this package to implement a custom Danswer connector (far too specific to my use case to submit as a general connector to this repository), and I can attest that it is incredibly easy to use and integrate into a Danswer connector.
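As a quick illustration of how little code this takes for a family that ships with pywikibot (the page title is illustrative):

```python
import pywikibot

# Wikipedia's Family definition is bundled with pywikibot, so no custom
# Family class is needed. Note: running pywikibot without a user-config.py
# may require setting the PYWIKIBOT_NO_USER_CONFIG=1 environment variable.
site = pywikibot.Site("en", "wikipedia")
page = pywikibot.Page(site, "MediaWiki")

print(page.title())
print(page.latest_revision.timestamp)  # timestamp of the most recent edit
print(page.text[:200])                 # raw wikitext of the page
```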
Danswer Connector
Configuration
API details
To use a MediaWiki site's API in pywikibot, you need to define what it calls a Family: a collection of related wikis that share some common structure (usually multilingual versions of the same site), including a hostname, subwiki names, and an API endpoint. Family classes also define a version (referring to the targeted MediaWiki version), but in my limited experimentation it does not actually matter.
There would need to be some way for the Danswer user to specify this Family (or for Danswer to construct the Family from a hostname). According to the official pywikibot docs, there are scripts that can automatically build a Family definition from a wiki's metadata (and many Family definitions are already built into pywikibot itself). Theoretically, Danswer could use this to create the Family definition, so the user would only need to put in a hostname.
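For concreteness, a minimal hand-written Family might look like the following sketch; the family name, hostname, and script path are placeholders, and pywikibot's bundled generate_family_file.py script can produce the equivalent automatically from a URL:

```python
from pywikibot import family

# Minimal single-language Family for a hypothetical wiki at wiki.example.com.
class ExampleFamily(family.Family):
    name = "example"
    langs = {"en": None}

    def hostname(self, code):
        return "wiki.example.com"

    def scriptpath(self, code):
        return "/w"  # path prefix for the API, i.e. /w/api.php; varies per site
```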
Each MediaWiki site defines its own API endpoint (i.e. the API itself is the same, but the URL used to access it may change), and the Family definition needs it. Many major sites (especially those hosted by Wikimedia) follow the standard /w/api.php (for example, https://en.wikipedia.org/w/api.php), but other sites are free to deviate from this. Many sites in the Fandom network do, for example the Fallout wiki: https://fallout.fandom.com/api.php. This would need to be specified by the user.
Theoretically, if the user only has the hostname, this API endpoint could be discovered dynamically by scraping the Special:Version page, which outputs metadata about the MediaWiki instance, including its API endpoint. I haven't personally needed to do this, so as of now I can only say it's "theoretically possible".
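If someone does pursue this, one plausible approach (an untested sketch, not something from the Danswer codebase) is to scan any page's HTML for the EditURI link tag that MediaWiki emits in its head, which points at api.php; this sidesteps needing to know the Special:Version URL in advance:

```python
import re
import requests

def discover_api_endpoint(hostname: str) -> str | None:
    """Best-effort discovery of a wiki's api.php endpoint.

    MediaWiki pages advertise the API via an EditURI <link> tag,
    e.g. <link rel="EditURI" ... href="//host/w/api.php?action=rsd"/>,
    so fetching the site root and scanning for it avoids needing to
    know the script path up front.
    """
    html = requests.get(f"https://{hostname}/").text
    match = re.search(r'rel="EditURI"[^>]*href="([^"]*api\.php)', html)
    if match:
        # Note: the href may be protocol-relative, e.g. //host/w/api.php
        return match.group(1)
    return None
```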
What to Index
Many MediaWiki sites are very, very large. It would be nice to let a Danswer user specify which pages, categories, and/or portals they would like to index (perhaps including recursively linked pages), to avoid indexing an entire MediaWiki site.
Supporting Wikipedia as a Connector
Supporting MediaWiki sites in this general way means supporting Wikipedia basically for free. The only difference would be hardcoding the API details and adding another connector to the list.
Another consideration might be limiting the user to indexing smaller portions of the site (some collection of categories or specific pages, for example), since Wikipedia is very, very large: the current English version of Wikipedia, text only and with no edit history, measures over 50 GB.
Supporting Fandom as a Connector
Supporting arbitrary Fandom sites would require little work beyond the general MediaWiki connector described above.
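To make the "basically for free" claim concrete, here is a hedged sketch (class names and signatures are hypothetical, building on the proof-of-concept below) of how both could be thin preconfigurations of the general connector:

```python
# Hypothetical thin wrappers over the general MediaWikiConnector defined in
# the proof-of-concept below; only the per-site API details differ.
class WikipediaConnector(MediaWikiConnector):
    def __init__(self, categories: list[str], pages: list[str], **kwargs) -> None:
        super().__init__(
            categories=categories,
            pages=pages,
            hostname="en.wikipedia.org",
            script_path="/w",  # standard Wikimedia-hosted layout: /w/api.php
            **kwargs,
        )


class FandomConnector(MediaWikiConnector):
    def __init__(
        self, wiki: str, categories: list[str], pages: list[str], **kwargs
    ) -> None:
        super().__init__(
            categories=categories,
            pages=pages,
            hostname=f"{wiki}.fandom.com",
            script_path="",  # Fandom serves the API at /api.php, not /w/api.php
            **kwargs,
        )
```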
Proof-of-Concept
Below is an incomplete, hyperspecific, and unoptimized version of a Danswer connector that I put together for a very specific problem of mine (scraping game lore from a specific Fandom site to be indexed by Danswer). It should provide some inspiration for a future design, however. The parts that would need to be generalized are constructing the Family object for the pywikibot.Site object and parsing the Sections of each page.
```python
import itertools
from collections.abc import Generator
from typing import Any

import pywikibot
from pywikibot import pagegenerators, family

from danswer.configs.app_configs import INDEX_BATCH_SIZE
from danswer.configs.constants import DocumentSource
from danswer.connectors.interfaces import (
    GenerateDocumentsOutput,
    LoadConnector,
    PollConnector,
    SecondsSinceUnixEpoch,
)
from danswer.connectors.models import Document, Section


class MediaWikiConnector(LoadConnector, PollConnector):
    def __init__(
        self,
        categories: list[str],
        pages: list[str],
        hostname: str,
        script_path: str,
        recurse_depth: int | None,
        batch_size: int = INDEX_BATCH_SIZE,
    ) -> None:
        # Dynamically define a single-language Family for the target wiki.
        # A general connector would construct this from user-supplied config.
        class Family(family.Family):
            name = "hostname"  # placeholder family name
            langs = {"en": None}
            # A few selected big languages for things where we do not want to
            # loop over all languages. This is only needed by the
            # titletranslate.py module, so if you carefully avoid the options,
            # you could get away without these for another wiki family.
            languages_by_size = ["en"]

            def hostname(self, code):
                return hostname

            def scriptpath(self, code):
                return script_path

            def version(self, code):
                # Which version of MediaWiki is used? Hardcoded here; in my
                # experimentation the exact value did not matter.
                return "1.39.6"

        self.site = pywikibot.Site(fam=Family(), code="en")
        self.batch_size = batch_size
        self.categories = [
            pywikibot.Category(self.site, f"Category:{category.replace(' ', '_')}")
            for category in categories
        ]
        self.pages = [pywikibot.Page(self.site, page) for page in pages]
        self.recurse_depth = recurse_depth

    def load_credentials(self, credentials: dict[str, Any]) -> dict[str, Any] | None:
        return None

    def _get_doc_from_page(self, page: pywikibot.Page) -> Document:
        return Document(
            source=DocumentSource.MEDIAWIKI,
            title=page.title(),
            text=page.text,
            url=page.full_url(),
            created_at=page.oldest_revision.timestamp,
            updated_at=page.latest_revision.timestamp,
            sections=[
                # TODO: extract individual sections of the page out of the
                # wikitext markup.
                Section(
                    link=page.full_url(),
                    text=page.text,
                )
            ],
            semantic_identifier=page.title(),
            metadata={
                "categories": [category.title() for category in page.categories()]
            },
            id=page.pageid,
        )

    def _get_doc_batch(
        self,
        start: SecondsSinceUnixEpoch | None = None,
        end: SecondsSinceUnixEpoch | None = None,
    ) -> Generator[list[Document], None, None]:
        doc_batch: list[Document] = []
        # Explicitly requested pages, plus all pages in the requested
        # categories (recursing into subcategories up to recurse_depth).
        all_pages = itertools.chain(
            self.pages,
            *[
                pagegenerators.CategorizedPageGenerator(
                    category, recurse=self.recurse_depth
                )
                for category in self.categories
            ],
        )
        for page in all_pages:
            # Skip pages outside the requested revision-time window.
            if start and page.latest_revision.timestamp.timestamp() < start:
                continue
            if end and page.oldest_revision.timestamp.timestamp() > end:
                continue
            doc_batch.append(self._get_doc_from_page(page))
            if len(doc_batch) >= self.batch_size:
                yield doc_batch
                doc_batch = []
        if doc_batch:
            yield doc_batch

    def load_from_state(self) -> GenerateDocumentsOutput:
        return self.poll_source(None, None)

    def poll_source(
        self, start: SecondsSinceUnixEpoch | None, end: SecondsSinceUnixEpoch | None
    ) -> GenerateDocumentsOutput:
        return self._get_doc_batch(start, end)
```
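For reference, instantiating this connector against the Fallout wiki mentioned above would look something like the following (the category, page, and recursion depth are illustrative):

```python
# Illustrative usage: the category and page names here are made up.
connector = MediaWikiConnector(
    categories=["Fallout: New Vegas locations"],
    pages=["Fallout: New Vegas"],
    hostname="fallout.fandom.com",
    script_path="",  # Fandom's API lives at /api.php rather than /w/api.php
    recurse_depth=1,
)

for doc_batch in connector.load_from_state():
    for doc in doc_batch:
        print(doc.semantic_identifier)
```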