-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue 85 - Remove S3 depencies #89
Conversation
ba6206a
to
855f371
Compare
"logger.info(f\"jsonl_files={jsonl_files}\")\n", | ||
"# Read and concatenate only the .jsonl files\n", | ||
"df = pd.concat([pd.read_json(io.BytesIO(s3_client.get_object(Bucket=config['s3_read_data']['read_bucket'], Key=file_key)['Body'].read()), lines=True) \n", | ||
"# df = pd.concat([pd.read_json(io.BytesIO(s3_client.get_object(Bucket=config['s3_read_data']['read_bucket'], Key=file_key)['Body'].read()), lines=True) \n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please remove commented out code (same for other places as well).
@@ -7,7 +7,11 @@ aws: | |||
# AWS region, this parameter is templatized, no need to change | |||
region: {region} | |||
# SageMaker execution role used to run FMBench, this parameter is templatized, no need to change | |||
sagemaker_execution_role: {role_arn} | |||
sagemaker_execution_role: arn:aws:iam::266000547366:role/fmbench-us-east-1-role |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be the {role_arn}
, we can programatically determine it by doing get_caller_identity
from the boto3 sts
client and that would work for both the SageMaker Notebook case and the EC2 case.
# Use S3 only, local file system only, or both (values are s3, local or both) | ||
# If set to local or both, set the local_file_system_path parameter | ||
s3_and_or_local_file_system: local | ||
local_file_system_path: /Users/rajramch/WorkDocsDownloads/workspace/aws/services/sagemaker/fmbench/sagemaker-fmbench-write |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should parameterize this so maybe something like {temp_dir}
which gets replaced with tempfile.gettempdir()
when the config is read (similar to how we handle the read_bucket
and other parameterized config file variable).
# Use S3 only or local file system only (values are s3 or local) | ||
# If set to local, set the local_file_system_path parameter | ||
s3_or_local_file_system: local | ||
local_file_system_path: /Users/rajramch/WorkDocsDownloads/workspace/aws/services/sagemaker/fmbench/sagemaker-fmbench-read |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above.
@@ -13,11 +13,11 @@ | |||
from typing import Dict, List | |||
from transformers import AutoTokenizer | |||
from botocore.exceptions import NoCredentialsError | |||
import shutil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this above to its appropriate place so that we have the imports in an increasing order of length, just a stylistic preference.
@@ -168,17 +168,68 @@ def process_item(item, prompt_template_keys: List, prompt_fmt: str) -> Dict: | |||
def nt_to_posix(p: str) -> str: | |||
return p.replace("\\", "/") | |||
|
|||
def is_read_local() -> str: | |||
is_read_local = globals.config.get('s3_read_data').get('s3_or_local_file_system') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggested tweak:
mode = globals.config.get('s3_read_data').get('s3_or_local_file_system', globals.FS_S3)
return mode == globals.FS_LOCAL
and we can define FS_LOCAL
and FS_S3
in globals.py
.
return local_read_path | ||
|
||
def _is_write_local_only(): | ||
is_write_local_only = globals.config.get('aws').get('s3_and_or_local_file_system') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same recommendation as above.
Issue #85
Replaced S3 dependencies with local file system. Made it configurable. It is also backward compatible with the existing configuration files.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.