Indexing Atlassian Confluence #154

pudo · 2021-01-14T10:34:07Z

We have a recurring request from some editors to index project Confluence wikis into Aleph. The idea is to index all the reporters' notes from a given wiki space into an investigation casefile. What we'd need to figure out:

- How do we authenticate with Confluence in a way that gets around 2FA on SSO? Do they have some sort of app passwords?
- Do we want to index wiki pages as HTML, or is it better to index, for example, a PDF export?
- What do we do with comments?
- How do we represent the hierarchy of wiki pages? Do we create pseudo-folders?
- We need to make sure we also pull in page attachments.
- We need to generate good foreign IDs so changed pages don't duplicate based on hash.
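
To make the foreign ID and pseudo-folder questions concrete, here is a minimal sketch. It assumes each page has a stable Confluence page ID and a list of ancestor records with a `title` key (the shape the Confluence REST API returns when `ancestors` is expanded). The `confluence:` prefix and both function names are hypothetical, not an existing Aleph convention.

```python
from typing import Iterable, Mapping


def make_foreign_id(base_url: str, space_key: str, page_id: str) -> str:
    """Build an ID from where the page lives, not from its content.

    Because the content hash is not part of the ID, an edited page maps
    to the same foreign ID and can replace the earlier version.
    """
    return f"confluence:{base_url.rstrip('/')}:{space_key}:{page_id}"


def pseudo_folder_path(ancestors: Iterable[Mapping[str, str]]) -> str:
    """Join ancestor titles (root first) into a folder-like path."""
    # Replace slashes inside titles so they cannot create extra levels.
    return "/".join(a["title"].replace("/", "_") for a in ancestors)
```

Attachments could use the same scheme with the attachment ID in place of the page ID, so that re-uploading a changed file does not create a second document.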

pudo · 2021-01-18T19:56:42Z

https://atlassian-python-api.readthedocs.io/confluence.html
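
A rough sketch of how paging through a space with that library might look. It assumes `client` is an `atlassian.Confluence` instance (or anything with a compatible `get_all_pages_from_space` method); the helper name `iter_space_pages` and the choice of `expand` fields are mine, and authentication is left out entirely since that is still the open question.

```python
from typing import Any, Dict, Iterator


def iter_space_pages(
    client: Any, space_key: str, page_size: int = 50
) -> Iterator[Dict[str, Any]]:
    """Yield every page in a space, fetching page_size results at a time."""
    start = 0
    while True:
        batch = client.get_all_pages_from_space(
            space_key,
            start=start,
            limit=page_size,
            # Ask for the stored body and the ancestor chain in one call.
            expand="body.storage,ancestors",
        )
        if not batch:
            return
        yield from batch
        if len(batch) < page_size:
            # A short batch means we have reached the end of the space.
            return
        start += len(batch)
```

Taking the client as a parameter keeps the loop testable with a stub, and leaves the decision about tokens, app passwords or a plugin outside this function.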

Rosencrantz · 2021-02-10T11:06:31Z

Hi @pudo, ex-Confluence developer here. Great to hear that Connie is getting used by some editors. I might be able to provide a couple of random thoughts that may be useful in getting that data into Aleph...

The first thing to consider is whether the editor is using Confluence Cloud or Confluence Server. Although the products have the same name, the codebases are (now) pretty divergent, and the way you achieve things can be significantly different depending on which product you want to interact with. Fun times.

One aspect of Confluence that is common to both Cloud and Server is the export function. If the space is relatively static, meaning the editor has finished working on their notes and simply wants to import them into Aleph, then it might be easier to have the editor export the space using the Confluence export feature (there are numerous export options, HTML and XML for example). This export could then be ingested and transformed into something that Aleph/FtM can handle.
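
As a first step towards ingesting such an export, something like the following could take stock of what the archive contains. This is only a sketch: it assumes the export is a zip file in which pages end in `.html` and everything else is an attachment or asset, and `inventory_export` is a hypothetical helper name.

```python
import zipfile
from typing import BinaryIO, List, Tuple, Union


def inventory_export(
    archive: Union[str, BinaryIO],
) -> Tuple[List[str], List[str]]:
    """Split the entries of an export zip into HTML pages and other files."""
    pages: List[str] = []
    others: List[str] = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            if info.filename.lower().endswith(".html"):
                pages.append(info.filename)
            else:
                others.append(info.filename)
    return sorted(pages), sorted(others)
```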

If that's not viable, then you'll either want to get the rendered content for each page using the Confluence API or find a way of scraping the pages with Memorious, which would leave you with the SSO/2FA challenge.

To work around challenges with SSO and 2FA you might be able to create a plugin that is installed on the Confluence instance. This plugin would have access to page content, comments, and attachments and could call back to an API to record that same information in Aleph.

Cloud plugins are effectively microservices and can be written in a bunch of different languages, whereas Server plugins are built in Java. So that might be something else to consider.

Another entirely random thought here would be to switch things around and, rather than export data from Confluence into Aleph, build an integration from Aleph into Confluence.

Rosencrantz · 2021-02-10T16:13:58Z

Confluence-space-export-155300.html.zip

The attached is a basic Confluence space export in HTML format. It contains content and attachments but unfortunately no comments. Importing this directly into Aleph produces output similar to the following:

It also exports a page which holds the structure of the space (subpages etc.). What is somewhat annoying is that the links don't work, so you can't navigate the space easily once it has been uploaded into Aleph. With that said, it might be possible to extend the HTML ingestor to handle this?
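
To illustrate what handling the links could involve, here is a stdlib-only sketch that collects the links in one exported page and resolves the relative ones against that page's path inside the export. It is not based on the actual Aleph ingestor code; `LinkCollector` and `internal_links` are hypothetical names, and the file names in the usage example are made up.

```python
import posixpath
from html.parser import HTMLParser
from typing import List
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(html: str, page_path: str) -> List[str]:
    """Return export-relative paths for the links that stay inside the export."""
    parser = LinkCollector()
    parser.feed(html)
    base_dir = posixpath.dirname(page_path)
    resolved = []
    for href in parser.hrefs:
        parts = urlsplit(href)
        # Skip absolute URLs, mailto: links and pure #fragment links.
        if parts.scheme or parts.netloc or not parts.path:
            continue
        resolved.append(posixpath.normpath(posixpath.join(base_dir, parts.path)))
    return resolved
```

For a page stored at `INV/index.html`, a link to `Child_123.html` resolves to `INV/Child_123.html`. An ingestor could then look up the document created for that path and rewrite the link to point at it.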

pudo added the task label Jan 14, 2021
