
Sync job gets marked as Succeeded even if there was an error that ended the job prematurely #2393

Open
3 of 5 tasks
gcasar opened this issue Apr 16, 2024 · 5 comments

@gcasar

gcasar commented Apr 16, 2024

Bug Description

After an internal connector error prematurely ends a Full sync, the sync job still proceeds: it reconciles the index by deleting documents that were previously in the system, and the job is marked as Sync complete.

To Reproduce

Steps to reproduce the behavior:

This behavior is hard to reproduce reliably because it is 1) connector specific and 2) depends on a third-party error. There are at least two bugs that surface this problem.

Please note that the example I'm giving below is not the only place this happens.

06:59:51.107
enterprise_search.connectors
[enterprise_search.connectors][warning] Skipping data for type content from https://stackoverflow.atlassian.net/wiki/rest/api/content/search?next=true&cursor=_f_NTA%3D_sa_WyJcdDM2NjM5OTMwIFI2KF8wZUpoJWhYdDA8L2w8a2NcIiBjcCJd&expand=children.attachment,history.lastUpdated,body.storage,space,space.permissions,restrictions.read.restrictions.user,restrictions.read.restrictions.group&limit=50&start=285750&cql=space+in+('LTDD','BATD','MADCOW')+AND+type=page. Exception: Response payload is not completed.
06:59:53.811
enterprise_search.connectors
[enterprise_search.connectors][info] Sync ended with status completed -- created: 0 | updated: 285753 | deleted: 23748 (took 14338 seconds)

Expected behavior

When an exception occurs that forces a Full sync to end prematurely,
the Sync job should be marked as Failed,
and the existing documents in the index should not get deleted.

As a user of a connector,
I expect this problem to be handled in a shared way,
so that uncaught exceptions in connectors are not silently discarded,
giving a false sense of completed syncs.
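
For illustration, here is a minimal sketch of the kind of shared handling I mean. Every name in it (run_full_sync, sync_job.fail, index.delete_docs_not_seen) is a placeholder and not part of the actual connectors framework; the point is only that the "completed" status and the deletion phase should be gated on the connector finishing cleanly.

# Hypothetical sketch only; none of these names come from the real
# connectors codebase.
async def run_full_sync(connector, sync_job, index):
    try:
        async for doc in connector.get_docs():
            await index.upsert(doc)
    except Exception as exc:
        # Surface the failure instead of silently finishing the job.
        await sync_job.fail(error=str(exc))
        raise
    else:
        # Deletions happen only after the connector finished cleanly.
        await index.delete_docs_not_seen(sync_job)
        await sync_job.done()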

Screenshots

[screenshot omitted] Please note that this is an older image, but we are still able to reproduce this in 8.13.2, which is where the errors reported above are from.

Environment

  • Elastic Cloud, version 8.13.2, native Confluence Cloud and GitHub Cloud connectors

Additional context

The second example is related and comes from the GitHub connector:

06:39:29.252
enterprise_search.connectors
[enterprise_search.connectors][warning] Something went wrong while fetching the issues. Exception: 403, message='Forbidden', url=URL('https://api.github.com/graphql')
  File "/usr/share/enterprise-search/lib/python3.10/site-packages/connectors/sources/github.py", line 1542, in _fetch_issues
    async for response in self.github_client.paginated_api_call(
[... several lines omitted ...]
06:39:29.253
enterprise_search.connectors
[enterprise_search.connectors][info] Fetching files from repo: 'elastic/elasticsearch' (branch: 'main')
06:39:40.101
enterprise_search.connectors
[enterprise_search.connectors][info] Sync ended with status completed -- created: 2543 | updated: 34290 | deleted: 34693 (took 13176 seconds)

While I'm by no means an expert in your codebase, for Confluence at least the following code paths could explain the behavior:

1.) The third party (Confluence) closes the session (this should be somewhat expected and recoverable).
2.) The session is closed in a bad way: it is not nulled out, and the code should likely use the internal close_session method to unset the private variable.

3.) The call is retried a few times, but the session is not recreated because it was never unset:

except Exception as exception:
    if isinstance(
        exception,
        ServerDisconnectedError,
    ):
        await self.session.close()  # pyright: ignore
    retry_counter += 1
    if retry_counter > self.retry_count:
        raise exception
    self._logger.warning(
        f"Retry count: {retry_counter} out of {self.retry_count}. Exception: {exception}"
    )
    await self._sleeps.sleep(RETRY_INTERVAL**retry_counter)
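
As a sketch of what the fix suggested in point 2 could look like, assuming close_session both closes the underlying session and unsets the private attribute so that the retried call recreates it, the same except block might become:

except Exception as exception:
    if isinstance(exception, ServerDisconnectedError):
        # Close through the client's own helper so the private session
        # attribute is unset and the retried call builds a fresh session.
        await self.close_session()
    retry_counter += 1
    if retry_counter > self.retry_count:
        raise exception
    self._logger.warning(
        f"Retry count: {retry_counter} out of {self.retry_count}. Exception: {exception}"
    )
    await self._sleeps.sleep(RETRY_INTERVAL**retry_counter)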

4.) The paginated API call eats the propagated exception:
except Exception as exception:
    self._logger.warning(
        f"Skipping data for type {url_name} from {url}. Exception: {exception}."
    )
    break

5.) The wrapper code continues and reconciles the index state (by updating and deleting).

Step 4 is the problem in this connector's case.
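
One possible shape of a fix for step 4, purely as a sketch (deciding which errors are genuinely safe to skip is of course up to the maintainers): log and re-raise instead of breaking out of the pagination loop, so the sync framework can mark the job Failed before any reconciliation runs.

except Exception as exception:
    self._logger.warning(
        f"Error while fetching type {url_name} from {url}. Exception: {exception}."
    )
    # Propagate instead of break: an unexpected fetch error should fail
    # the sync job rather than let the delete/reconcile phase run on
    # partial data.
    raise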

The above links are pinned to a commit from 8.13.2, but I can also find a similar smell in the latest main branch for Jira (main branch corresponding problem), although the error handling is more robust there (example), while Confluence does not handle rate limits at all.
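
For reference, this is roughly the kind of 429 handling I mean. It is a generic aiohttp sketch, not a copy of the linked Jira code, and the fallback wait of 60 seconds is an assumption:

import asyncio

import aiohttp

DEFAULT_RETRY_AFTER = 60  # seconds; assumed fallback when the header is absent


async def get_with_rate_limit(session: aiohttp.ClientSession, url: str):
    # Honour a 429 by waiting for Retry-After before retrying, instead of
    # retrying immediately or giving up and letting the sync "complete".
    while True:
        async with session.get(url) as response:
            if response.status == 429:
                retry_after = int(response.headers.get("Retry-After", DEFAULT_RETRY_AFTER))
                await asyncio.sleep(retry_after)
                continue
            response.raise_for_status()
            return await response.json()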

Subtasks

  1. community-driven confluence enhancement team:external
    praveen-elastic
  2. community-driven enhancement github team:external
  3. bug community-driven team:external
    praveen-elastic
  4. bug community-driven priority:high team:external
    parthpuri-elastic
  5. community-driven enhancement team:external
    moxarth-elastic
gcasar added the bug label on Apr 16, 2024
@seanstory
Member

Hi @gcasar , thanks for filing.

4.) [Something] eats the propagated exception
Step 4 is the problem in this connector's case.

Agreed, this looks like the problem. We try to balance between the ideologies of:

  • "don't just fail the sync if there's some minor error, move on"
  • "fail fast if there's an error"

While we shoot for a reasonable balance, you've definitely identified a few instances here where we've missed the mark. Unfortunately for us, and as you've already noted, fixing this needs to be done on a case-by-case (and connector-by-connector) basis. There are at least two instances here:

  • confluence swallows any error that occurs during pagination
  • github swallows errors even if fetching issues fails

On top of those, it looks like the underlying bugs that cause this to be a problem are:

  • confluence session close should unset session variable
  • github 403 not being caught as an UnauthorizedError (link); a rough sketch of what that mapping could look like follows below
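
Purely as an illustrative sketch of that last bullet (every name below is hypothetical, not the real GitHub client code), the idea is to map a 401/403 from the GraphQL endpoint to an authorization error that fails the sync loudly instead of being skipped:

class UnauthorizedException(Exception):
    """Raised when GitHub rejects the token (401/403)."""


def check_graphql_status(status: int) -> None:
    # Hypothetical helper: called on each GraphQL response before the
    # payload is consumed, so a bad token can't be skipped silently.
    if status in (401, 403):
        raise UnauthorizedException(
            f"GitHub GraphQL returned {status}; check the token's scopes and permissions."
        )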

If the above sounds right to you, we can make those 4 into discrete subtasks, and work to prioritize them.

@gcasar
Author

gcasar commented Apr 16, 2024

Thank you for the fast response; the balancing makes sense in this context and seems like a smart approach 😄

I see how splitting them up would help; in that case I'd add one more:

  • Confluence does not handle rate limiting

I'll follow up on this tomorrow morning if you'd like me to split the issue up myself - I'm happy to do that!

@seanstory
Member

see how splitting them up would help

Excellent, I'll do that now.

I'd add one more
...
if you'd like me to split the issue up myself - I'm happy to do that!

I've got enough info to create the first 4 subtasks I mentioned. If you'd create a separate issue for the confluence rate limiting (definitely something we should handle) and provide relevant logs/traces/links, that would be excellent. Thanks!

@gcasar
Author

gcasar commented Apr 17, 2024

I was not able to replicate the Confluence 429 that we encountered previously, so I can't get a proper bug report back to you yet. I don't want to distract from the other items in that list until I can do a more diligent report on my end and reproduce it; for now I can only highlight the discrepancy between Jira and Confluence, where Jira handles rate limiting nicely but Confluence retries a few times without waiting. Sadly, there are no docs from Atlassian around this, but it might be fair to assume that both systems behave in a similar way.

This is the code that I'm referring to: https://github.com/elastic/connectors/blame/8c8d4a3f7775c6edbb598a81a2b67348a7244f3a/connectors/sources/jira.py#L168-L182

@seanstory
Member

Ok, no worries. I've just made a pretty minimal issue in #2403. Please add any details/stack traces/logs etc. to that if you hit it again. Otherwise, we'll work with what we have.
