Indexing Atlassian Confluence #154

Open
pudo opened this issue Jan 14, 2021 · 3 comments
Comments


pudo commented Jan 14, 2021

We have a recurring request from some editors to index project Confluence wikis into Aleph. The idea is to index all the reporters' notes from a given wiki space into an investigation casefile. What we'd need to figure out:

  • How do we authenticate with Confluence in a way that works around 2FA on SSO? Do they have some sort of app passwords?
  • Do we want to index wiki pages as HTML, or is it better to index, for example, a PDF export?
    • What do we do with comments?
    • How do we represent the hierarchy of wiki pages? Do we create pseudo-folders?
  • Need to make sure we also pull in page attachments
  • Need to generate good foreign IDs so changed pages don't duplicate based on hash
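
To make the last two questions more concrete, here is a minimal sketch of how stable foreign IDs and pseudo-folder paths could be derived. It assumes each wiki page exposes a space key, a page ID that survives edits and renames, and the titles of its parent pages. The `WikiPage` class, the ID format and both helper functions are hypothetical and not part of Aleph or Confluence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WikiPage:
    """Hypothetical minimal description of a wiki page."""
    space_key: str
    page_id: str                   # assumed to stay the same across edits
    title: str
    ancestor_titles: tuple = ()    # titles of parent pages, root first


def page_foreign_id(base_url: str, page: WikiPage) -> str:
    """Build an ID from the page's identity rather than from its content,
    so that re-indexing an edited page updates it instead of duplicating it."""
    return f"{base_url.rstrip('/')}#space={page.space_key}&page={page.page_id}"


def folder_path(page: WikiPage) -> str:
    """Build a pseudo-folder path that mirrors the wiki page hierarchy."""
    parts = (page.space_key, *page.ancestor_titles, page.title)
    # Slashes in titles would otherwise create spurious folder levels.
    return "/".join(part.replace("/", "_") for part in parts)
```

Because the ID ignores the title and body, a renamed or edited page keeps the same foreign ID, while the folder path is free to change with the hierarchy.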
@pudo pudo added the task label Jan 14, 2021

pudo commented Jan 18, 2021


Rosencrantz commented Feb 10, 2021

Hi @pudo, ex-Confluence developer here. Great to hear that Connie is getting used by some editors. I might be able to provide a couple of random thoughts that may be useful in getting that data into Aleph...

The first thing to consider is whether the editor is using Confluence Cloud or Confluence Server. Although the products have the same name, the codebases are (now) pretty divergent and the way you achieve things can be significantly different depending on which product you want to interact with. Fun times.

One aspect of Confluence that is common to both Cloud and Server is the export function. If the space is relatively static, meaning the editor has finished working on their notes and simply wants to import them into Aleph, then it might be easier to have the editor export the space using the Confluence export feature (there are numerous export options, HTML and XML for example). This export could then be ingested and transformed into something that Aleph/FtM can handle.

If that's not viable then you'll either want to get the rendered content for each page using the Confluence API or find a way of scraping the pages with Memorious, which would leave you with the SSO/2FA challenge.
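
As a rough illustration of the API route, the sketch below pages through the content of one space. It assumes the REST content resource lives at `/rest/api/content` and accepts `spaceKey`, `type`, `expand`, `start` and `limit` query parameters, and that the instance accepts HTTP basic auth with a user name and an API token; Cloud instances usually serve the API under a `/wiki` prefix, and Server instances may need a different auth scheme. All of this should be checked against the documentation for the product and version in use. The function names are hypothetical.

```python
import base64
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def content_url(base_url, space_key, start=0, limit=25):
    """Build the listing URL for one batch of pages (assumed endpoint)."""
    query = urlencode({
        "spaceKey": space_key,
        "type": "page",
        "expand": "body.view,ancestors",  # rendered HTML plus parent pages
        "start": start,
        "limit": limit,
    })
    return f"{base_url.rstrip('/')}/rest/api/content?{query}"


def basic_auth_header(user, token):
    """Encode user and token as an HTTP basic auth header value."""
    raw = f"{user}:{token}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def fetch_json(url, user, token):
    """Perform one authenticated GET and decode the JSON body."""
    headers = {
        "Authorization": basic_auth_header(user, token),
        "Accept": "application/json",
    }
    with urlopen(Request(url, headers=headers)) as resp:
        return json.load(resp)


def iter_pages(base_url, space_key, fetch, limit=25):
    """Yield every page in a space, requesting one batch at a time.

    `fetch` is any callable that maps a URL to the decoded JSON response,
    which is assumed to hold the batch under a "results" key.
    """
    start = 0
    while True:
        data = fetch(content_url(base_url, space_key, start, limit))
        results = data.get("results", [])
        yield from results
        if len(results) < limit:  # a short batch means this was the last one
            break
        start += limit
```

A caller would pass something like `lambda url: fetch_json(url, user, token)` as `fetch`. Keeping the HTTP call behind a callable also makes the paging logic testable without a live Confluence instance.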

To work around challenges with SSO and 2FA you might be able to create a plugin that is installed on the Confluence instance. This plugin would have access to page content, comments, and attachments and could call back to an API to record that same information in Aleph.

Cloud plugins are effectively microservices and can be written in a bunch of different languages, whereas Server plugins are built in Java. So, that might be something else to consider.

Another entirely random thought here would be to switch things around and, rather than export data from Confluence into Aleph, build an integration from Aleph into Confluence.


Rosencrantz commented Feb 10, 2021

Confluence-space-export-155300.html.zip

The attached file is a basic Confluence space export in HTML format. It contains content and attachments but unfortunately no comments. Importing this directly into Aleph produces output similar to the following:

[Screenshot: aleph-confluence]

The export also includes a page which holds the structure of the space (sub-pages and so on). What is somewhat annoying is that the links don't work, so you can't navigate the space easily once it has been uploaded into Aleph. With that said, it might be possible to extend the HTML ingestor to handle this?
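
One possible starting point for that, sketched with the Python standard library only: read the structure page and recover which exported file is the parent of which. This assumes the structure page lists pages as links inside nested `<ul>` elements, which would need to be checked against a real export. `IndexTreeParser` and `parent_map` are hypothetical names, not existing ingestor code.

```python
from html.parser import HTMLParser


class IndexTreeParser(HTMLParser):
    """Collect (depth, href, title) for every link inside nested <ul> lists."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # current <ul> nesting level
        self.links = []
        self._href = None   # href of the link being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.depth += 1
        elif tag == "a" and self.depth > 0:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_endtag(self, tag):
        if tag == "ul":
            self.depth = max(0, self.depth - 1)
        elif tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.links.append((self.depth, self._href, title))
            self._href = None

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)


def parent_map(links):
    """Map each href to the href of its parent page (None for top level)."""
    parents, stack = {}, []  # stack holds the current chain of ancestors
    for depth, href, _title in links:
        del stack[depth - 1:]  # drop entries at this depth or deeper
        parents[href] = stack[-1] if stack else None
        stack.append(href)
    return parents
```

With a mapping like this, an ingestor could create the pseudo-folders first and then rewrite each relative link to point at the corresponding document in Aleph.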
