Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add git to filesystem source 301 #954

Draft
wants to merge 8 commits into
base: devel
Choose a base branch
from

Conversation

deanja
Copy link
Contributor

@deanja deanja commented Feb 9, 2024

Description

  • Add gitpythonfs fsspec implementation
  • add dynamic registration of fsspec implementations
  • Urls do not require a netloc (in rfc1808)
  • Use 'mtime' as default input for modification_date

Related Issues

Additional Context

gitpythonfs uses code snippets and ideas from other fsspec implementations. . I have given thanks in the code but I don't ensure this meets any fsspec or dlt licensing requirements.

CICD:

  • make test-common is passing
  • lots mypy errors in make lint in local dev

Copy link

netlify bot commented Feb 9, 2024

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit b473fb5
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/65cf3d6464f66200088c6ff6

Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is quality work! thanks for all the code and tests. also the fs registration is a nice addition.

in the meantime we merged gdrive implementation. you could merge current devel into this branch. conflicts should not be significant

"File path or netloc missing. Field bucket_url of FilesystemClientConfiguration"
" must contain valid url with a path or host:password component."
)
if not url.scheme in ("gitpythonfs", "github", "git", "s3"):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. pls use not in operator. linter will complain somewhere
  2. why s3 is exempt? AFAIK it must contain bucket name which is a path component

also could you remind me when gitpythonfs does not need netloc and path

Copy link
Contributor Author

@deanja deanja Feb 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. pls use not in operator. linter will complain somewhere

Accepted

  1. why s3 is exempt? AFAIK it must contain bucket name which is a path component

You are right. I will take s3 out of the the exception list.

My thinking was (would need a separate github issue):
s3fs already gives nice error messages for invalid bucket name, unauthorised for bucket etc. My guess is that adlfs an , gcsfs do too. I doubt the rule saves users from confusion/harm. I was heading towards proving that, and then removing this rule entirely or only testing for specific known problem protocols (not exclusion list). I was trying to reduce fsspec implementation-specific conditional logic in the dlt codebase.

also could you remind me when gitpythonfs does not need netloc and path

gitpythonfs:// returns the root of the repo. Expected use case: bucket_url = "gitpythonfs://", file_glob = "**.md" .

We decided it's neater and more common to put repo path in the config (kwargs), rather than stuff it into the url netloc. I looked at dropboxdrivefs and I think it will be same pattern as this - dropbox knows which account based on auth token, then the rest is globbable paths.

At some point we could extend public filesystem dlt source/resource API with methods that just have url parameter (which can include glob - s3://bucket/csv/**.csv, gitpythonfs://csv/**.csv) instead of separate bucket_url and file_glob parameters. It would be more fsspec-onic , in line with open_files("s3://bucket/csv/**.csv"). There could be an optional path_mask to shorten the path string that ends up in the pipeline metadata.

dlt/common/storages/fsspec_filesystem.py Show resolved Hide resolved
dlt/common/storages/fsspec_filesystem.py Show resolved Hide resolved
try:
return ensure_pendulum_datetime(file_metadata[field_name])
except KeyError:
if protocol not in MTIME_FIELD_NAMES:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I was thinking a lot about it recently. My take is that for protocols not supporting modification time we would just return pendulum.now(). what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's good enough. I set to pendulum.now() ...for now 🤣 ...

Always Hard to tell with these unknowns 😓. now() duplicates loaded_at etc, which pipelines already have. And it could be misleading.

It would be better to use something that indicates that no info was available:

  • None. will it break the pipeline downstream?
  • Jan 1st 1970. ie, epoch zero. It's ugly but people know what it means.

dlt/common/storages/implementations/gitpythonfs.py Outdated Show resolved Hide resolved
tests/load/filesystem/test_filesystem_common.py Outdated Show resolved Hide resolved
@deanja deanja force-pushed the add-git-to-filesystem-source-301 branch from 350113e to 89b3ca5 Compare February 15, 2024 00:37
- Intended to be used in filesystem verified Source.
-read-only
- due to depedency of gitpython on git binaries, gitpythonfs should only be conditionally imported, for example by registering the module using  `register_implementation_in_fsspec()`
- See https://www.rfc-editor.org/rfc/rfc1808.html#section-2.1 . So, for example, gitpythonfs:// is valid and works in fsspec.
- However in bucket-based systems, the bucket name is where the netloc would be - eg s3://bucket_name.  But luckily s3fs (and probably az, gcs etc) already gives helpful error messages if no bucket name given, wrong bucket name given etc.
- As a ToDo, this rule could test only those protocols whose fsspec implementation needs netloc, rather than excluding a few protocols as is done here. But we don't know which protocols this rule was initially added for.
-reduces need to modify code every time a new fsspec implementation is added
- `mtime` is idiomatic in *nix file systems.
@deanja deanja force-pushed the add-git-to-filesystem-source-301 branch from 89b3ca5 to ea0557f Compare February 15, 2024 04:51
@@ -157,29 +199,31 @@ class FileItemDict(DictStrAny):
def __init__(
self,
mapping: FileItem,
credentials: Optional[Union[FileSystemCredentials, AbstractFileSystem]] = None,
fs_details: Optional[Union[AbstractFileSystem, FilesystemConfiguration, FileSystemCredentials]] = None,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have simply added FilesystemConfiguration as an additional type, and tidied up the naming. Is that far enough for this PR?

Possible fast-follower to this PR:

  1. Limit the types that can be passed to FileItemDict constructor. I get the feeling this is an important internal API. The following calls are currently in use:
  • AbstractFileSystem, in tests/common/storages/test_fsspec_filesystem.py
  • FileSystemCredentials, in tests/common/storages/utils.py
  • None, in verified-sources sources/inbox/init.py
    Could there be other usage in dlt example projects?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

support git filesystem in filesystem source
2 participants