Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Box] add logs #2554

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

[Box] add logs #2554

wants to merge 6 commits into from

Conversation

moxarth-elastic
Copy link
Collaborator

Part Of #2299

Adding more logs in Box connector.

Log file: https://drive.google.com/file/d/1TnFFvRBvMb2wGKQWz9N6dVkQtXUO7AIK/view?usp=drive_link

Checklists

Pre-Review Checklist

  • this PR does NOT contain credentials of any kind, such as API keys or username/passwords (double check config.yml.example)
  • this PR has a meaningful title
  • this PR links to all relevant github issues that it fixes or partially addresses
  • if there is no GH issue, please create it. Each PR should have a link to an issue
  • this PR has a thorough description
  • Tested the changes locally
  • Added a label for each target release version (example: v7.13.2, v7.14.0, v8.0.0)

@moxarth-elastic moxarth-elastic requested a review from a team as a code owner May 14, 2024 10:14
@moxarth-elastic moxarth-elastic changed the title add logging for box [Box] add logs May 14, 2024
@artem-shelkovnikov
Copy link
Member

artem-shelkovnikov commented May 14, 2024

From the log lines:

  1. "Successfully connected to Box" should be DEBUG, failure to connect should be WARN
  2. "Fetching files and folders recursively for folder ID: 0" - can we have folder path instead of id?
  3. "Retrieving access token from the cache" we should delete, it spams too much. Instead, we should just output in DEBUG when we're fetching token not from the cache, but from Box
  4. "Making a GET call for url: /2.0/folders/258475338133/items with params: {'fields': 'name,modified_at,size,type,sequence_id,etag,created_at,modified_at,content_created_at,content_modified_at,description,created_by,modified_by,owned_by,parent,item_status', 'offset': 0, 'limit': 1000}" - can we just make it "Calling GET {host}/2.0/folders/258475338133/items?fields=name,modified_at,size,type,...&offset=0&limit=1000"? This will help a lot if we need to emulate this specific query and you can just copy-paste url into postman or curl?

logger.debug("Fetching content starting from root folder")
await self.fetchers.put(partial(self._fetch, doc_id=root_folder))
self._logger.info("Fetching data from Box's Free Account")
await self.fetchers.put(partial(self._fetch, doc_id="0"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having "0" be defined with a variable name root_folder (basically as it was before) is better for clarity. Someone without understanding of box will not know what 0 is at first glance. Could you revert that please?

already_processed_ids = set()
root_folder = "0"

stored_id = set()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think already_processed_ids was clearer than stored_id. If I look at this code later, I will think "why is it stored?"

If you want to shorten the variable name, perhaps seen_ids is better?

@moxarth-elastic
Copy link
Collaborator Author

buildkite test this

Copy link
Contributor

@navarone-feekery navarone-feekery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, just left one final change request

Comment on lines 180 to 185
query_string = (
"&".join(f"{key}={value}" for key, value in params.items())
if params
else ""
)
self._logger.debug(f"Calling GET {url}?{query_string}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think query_string should only built when debug logs are enabled. How it is here, this string is being built every run even though it isn't used unless debug logs are enabled. It's minor, but every bit counts when looking at performance.

This could be moved to a func elsewhere called something like debug_query_string and could be called in-line like this:

Suggested change
query_string = (
"&".join(f"{key}={value}" for key, value in params.items())
if params
else ""
)
self._logger.debug(f"Calling GET {url}?{query_string}")
self._logger.debug(f"Calling GET {url}?{debug_query_string(params)}")

@navarone-feekery navarone-feekery self-assigned this May 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants