
Sync job gets marked as Succeeded even if there was an error that ended the job prematurely #2393

Open
3 of 5 tasks
gcasar opened this issue Apr 16, 2024 · 5 comments

@gcasar

gcasar commented Apr 16, 2024

Bug Description

After an internal connector error prematurely ends a Full sync, the sync job still proceeds: it reconciles the index by deleting documents that were previously in the system, and the job is marked as Sync complete.

To Reproduce

Steps to reproduce the behavior:

This behavior is hard to reproduce reliably because it is 1) connector specific and 2) depends on a third-party error. There are at least two bugs that surface this problem.

Please note that the example I'm giving below is not the only place this happens.

06:59:51.107
enterprise_search.connectors
[enterprise_search.connectors][warning] Skipping data for type content from https://stackoverflow.atlassian.net/wiki/rest/api/content/search?next=true&cursor=_f_NTA%3D_sa_WyJcdDM2NjM5OTMwIFI2KF8wZUpoJWhYdDA8L2w8a2NcIiBjcCJd&expand=children.attachment,history.lastUpdated,body.storage,space,space.permissions,restrictions.read.restrictions.user,restrictions.read.restrictions.group&limit=50&start=285750&cql=space+in+('LTDD','BATD','MADCOW')+AND+type=page. Exception: Response payload is not completed.
06:59:53.811
enterprise_search.connectors
[enterprise_search.connectors][info] Sync ended with status completed -- created: 0 | updated: 285753 | deleted: 23748 (took 14338 seconds)

Expected behavior

When an exception occurs that forces a Full sync to end prematurely,
the Sync job should be marked as Failed,
and the existing documents in the index should not get deleted.

As a user of a connector,
I expect this problem to be handled in a shared way,
so that uncaught exceptions in connectors are not silently discarded,
giving a false sense of completed syncs.
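
For illustration, here is a minimal sketch of the kind of shared handling I mean. Every name in it (run_full_sync, sync_job.fail, index.delete_docs_not_seen) is a placeholder and not part of the actual connectors framework; the point is only that the "completed" status and the deletion phase should be gated on the connector finishing cleanly.

# Hypothetical sketch only; none of these names come from the real
# connectors codebase.
async def run_full_sync(connector, sync_job, index):
    try:
        async for doc in connector.get_docs():
            await index.upsert(doc)
    except Exception as exc:
        # Surface the failure instead of silently finishing the job.
        await sync_job.fail(error=str(exc))
        raise
    else:
        # Deletions happen only after the connector finished cleanly.
        await index.delete_docs_not_seen(sync_job)
        await sync_job.done()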

Screenshots

[screenshot omitted] Please note that this is an older image, but we are still able to reproduce this in 8.13.2, which is where the errors reported above are from.

Environment

  • Elastic Cloud, version 8.13.2, native Confluence Cloud and GitHub Cloud connectors

Additional context

The second example is related and comes from the GitHub connector:

06:39:29.252
enterprise_search.connectors
[enterprise_search.connectors][warning] Something went wrong while fetching the issues. Exception: 403, message='Forbidden', url=URL('https://api.github.com/graphql')
  File "/usr/share/enterprise-search/lib/python3.10/site-packages/connectors/sources/github.py", line 1542, in _fetch_issues
    async for response in self.github_client.paginated_api_call(
[... several lines omitted ...]
06:39:29.253
enterprise_search.connectors
[enterprise_search.connectors][info] Fetching files from repo: 'elastic/elasticsearch' (branch: 'main')
06:39:40.101
enterprise_search.connectors
[enterprise_search.connectors][info] Sync ended with status completed -- created: 2543 | updated: 34290 | deleted: 34693 (took 13176 seconds)

While I'm by no means an expert in your codebase, for Confluence at least the following code paths could explain the behavior:

1.) The third party (Confluence) closes the session (this should be somewhat expected and recoverable).
2.) The session is closed in a bad way: it is not nulled out, and the code should likely use the internal close_session method to unset the private variable.

3.) The call is retried a few times, but the session is not recreated because it was never unset:

except Exception as exception:
    if isinstance(
        exception,
        ServerDisconnectedError,
    ):
        await self.session.close()  # pyright: ignore
    retry_counter += 1
    if retry_counter > self.retry_count:
        raise exception
    self._logger.warning(
        f"Retry count: {retry_counter} out of {self.retry_count}. Exception: {exception}"
    )
    await self._sleeps.sleep(RETRY_INTERVAL**retry_counter)
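
As a sketch of what the fix suggested in point 2 could look like, assuming close_session both closes the underlying session and unsets the private attribute so that the retried call recreates it, the same except block might become:

except Exception as exception:
    if isinstance(exception, ServerDisconnectedError):
        # Close through the client's own helper so the private session
        # attribute is unset and the retried call builds a fresh session.
        await self.close_session()
    retry_counter += 1
    if retry_counter > self.retry_count:
        raise exception
    self._logger.warning(
        f"Retry count: {retry_counter} out of {self.retry_count}. Exception: {exception}"
    )
    await self._sleeps.sleep(RETRY_INTERVAL**retry_counter)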

4.) The paginated API call eats the propagated exception:
except Exception as exception:
    self._logger.warning(
        f"Skipping data for type {url_name} from {url}. Exception: {exception}."
    )
    break

5.) The wrapper code continues and reconciles the index state (by updating and deleting).

Step 4 is the problem in this connector's case.
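
One possible shape of a fix for step 4, purely as a sketch (deciding which errors are genuinely safe to skip is of course up to the maintainers): log and re-raise instead of breaking out of the pagination loop, so the sync framework can mark the job Failed before any reconciliation runs.

except Exception as exception:
    self._logger.warning(
        f"Error while fetching type {url_name} from {url}. Exception: {exception}."
    )
    # Propagate instead of break: an unexpected fetch error should fail
    # the sync job rather than let the delete/reconcile phase run on
    # partial data.
    raise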

The above links are pinned to a commit from 8.13.2, but I can also find a similar smell in the latest main branch for Jira (main branch corresponding problem), although the error handling is more robust there (example), while Confluence does not handle rate limits at all.
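
For reference, this is roughly the kind of 429 handling I mean. It is a generic aiohttp sketch, not a copy of the linked Jira code, and the fallback wait of 60 seconds is an assumption:

import asyncio

import aiohttp

DEFAULT_RETRY_AFTER = 60  # seconds; assumed fallback when the header is absent


async def get_with_rate_limit(session: aiohttp.ClientSession, url: str):
    # Honour a 429 by waiting for Retry-After before retrying, instead of
    # retrying immediately or giving up and letting the sync "complete".
    while True:
        async with session.get(url) as response:
            if response.status == 429:
                retry_after = int(response.headers.get("Retry-After", DEFAULT_RETRY_AFTER))
                await asyncio.sleep(retry_after)
                continue
            response.raise_for_status()
            return await response.json()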

Subtasks

  1. community-driven confluence enhancement team:external
    praveen-elastic
  2. community-driven enhancement github team:external
  3. bug community-driven team:external
    praveen-elastic
  4. bug community-driven priority:high team:external
    parthpuri-elastic
  5. community-driven enhancement team:external
    moxarth-elastic
gcasar added the bug label on Apr 16, 2024
@seanstory
Member

Hi @gcasar , thanks for filing.

4.) [Something] eats the propagated exception
Step 4 is the problem in this connector's case.

Agreed, this looks like the problem. We try to balance between the ideologies of:

  • "don't just fail the sync if there's some minor error, move on"
  • "fail fast if there's an error"

While we shoot for a reasonable balance, you've definitely identified a few instances here where we've missed the mark. Unfortunately for us, and as you've already noted, fixing this needs to be done on a case-by-case (and connector-by-connector) basis. There are at least two instances here:

  • confluence swallows any error that occurs during pagination
  • github swallows errors even if fetching issues fails

On top of those, it looks like the underlying bugs that cause this to be a problem are:

  • confluence session close should unset session variable
  • github 403 not being caught as an UnauthorizedError (link); a rough sketch of what that mapping could look like follows below
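
Purely as an illustrative sketch of that last bullet (every name below is hypothetical, not the real GitHub client code), the idea is to map a 401/403 from the GraphQL endpoint to an authorization error that fails the sync loudly instead of being skipped:

class UnauthorizedException(Exception):
    """Raised when GitHub rejects the token (401/403)."""


def check_graphql_status(status: int) -> None:
    # Hypothetical helper: called on each GraphQL response before the
    # payload is consumed, so a bad token can't be skipped silently.
    if status in (401, 403):
        raise UnauthorizedException(
            f"GitHub GraphQL returned {status}; check the token's scopes and permissions."
        )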

If the above sounds right to you, we can make those 4 into discrete subtasks, and work to prioritize them.

@gcasar
Author

gcasar commented Apr 16, 2024

Thank you for the fast response; the balancing makes sense in this context and seems like a smart approach 😄

I see how splitting them up would help; in that case I'd add one more:

  • Confluence does not handle rate limiting

I'll follow up on this tomorrow morning if you'd like me to split the issue up myself - I'm happy to do that!

@seanstory
Member

see how splitting them up would help

Excellent, I'll do that now.

I'd add one more
...
if you'd like me to split the issue up myself - I'm happy to do that!

I've got enough info to create the first 4 subtasks I mentioned. If you'd create a separate issue for the confluence rate limiting (definitely something we should handle) and provide relevant logs/traces/links, that would be excellent. Thanks!

@gcasar
Author

gcasar commented Apr 17, 2024

I was not able to replicate the Confluence 429 that we encountered previously, so I can't get a proper bug report back to you yet. I don't want to distract from the other items in that list until I can do a more diligent report on my end and reproduce it; for now I can only highlight the discrepancy between Jira and Confluence, where Jira handles rate limiting nicely but Confluence retries a few times without waiting. Sadly, there are no docs from Atlassian around this, but it might be fair to assume that both systems behave in a similar way.

This is the code that I'm referring to: https://github.com/elastic/connectors/blame/8c8d4a3f7775c6edbb598a81a2b67348a7244f3a/connectors/sources/jira.py#L168-L182

@seanstory
Member

Ok, no worries. I've just made a pretty minimal issue in #2403. Please add any details/stack traces/logs etc. to that if you hit it again. Otherwise, we'll work with what we have.
