PXP-10845: Add utilities for dsv file operations and manipulation #179
base: master
Conversation
The style in this PR agrees with black. This formatting comment was generated automatically by a script in uc-cdis/wool.
output_writer.writerow(record)

def convert_type(file_name: str, new_type: str, has_headers=False):
So is this only changing the extension? For a TSV to a CSV, wouldn't we need to replace the tabs with commas?
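For reference, a minimal sketch of what rewriting the delimiter (not just the extension) could look like. The function and the delimiter map below are hypothetical illustrations, not code from this PR:

import csv

# Hypothetical sketch: rewrite the rows with the new delimiter instead of
# only renaming the file.
DELIMITERS = {"csv": ",", "tsv": "\t"}

def convert_type_sketch(file_name: str, new_type: str) -> str:
    old_type = file_name.rsplit(".", 1)[-1]
    new_file_name = file_name.rsplit(".", 1)[0] + "." + new_type
    with open(file_name, newline="") as infile, open(new_file_name, "w", newline="") as outfile:
        reader = csv.reader(infile, delimiter=DELIMITERS[old_type])
        writer = csv.writer(outfile, delimiter=DELIMITERS[new_type])
        for row in reader:
            writer.writerow(row)
    return new_file_name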
write(records, new_file_name)

def merge_files(files: Union[str, list], output_file_name: str, has_headers=False):
What if the files have different headers or are different file types? I think we should check for that.
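One way such a check could look, assuming the PR's get_headers helper is available. The validation function itself is a hypothetical sketch, not part of the PR:

import os

def validate_files_for_merge(files: list, has_headers=False):
    """Raise if the files do not share an extension and (optionally) the same headers."""
    extensions = {os.path.splitext(f)[1] for f in files}
    if len(extensions) > 1:
        raise ValueError(f"Cannot merge files of different types: {extensions}")
    if has_headers:
        headers = [get_headers(f) for f in files]
        if any(sorted(h) != sorted(headers[0]) for h in headers):
            raise ValueError("Cannot merge files with different headers")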
Args:
    file_name(str):
        Filename or file path
Returns:
Could you add an example in the description here?
    has_headers=False,
):
    """
    Chunk records into files of input size
I think this could use a better description, something like "chunk manifests into x different chunks", and maybe log how many records there will be per manifest.
def chunk(
    file_name: str,
    chunk_size: int,
Oh, actually, a fun idea would be adding another input: the number of records per manifest. So you're required to give either a number of chunks or a number of records per chunk, do the math, and create chunks accordingly. If both are given, maybe check whether the chunk count and records-per-chunk numbers are compatible.
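A sketch of that dual-input idea, with the log line suggested in the previous comment included. Parameter and function names are hypothetical, and the records are assumed to be loaded into a list already:

import logging
import math

def plan_chunks(records: list, num_chunks: int = None, records_per_chunk: int = None) -> list:
    """Split records by either a chunk count or a records-per-chunk size."""
    if num_chunks is None and records_per_chunk is None:
        raise ValueError("Provide num_chunks or records_per_chunk")
    if num_chunks and records_per_chunk:
        # If both are given, check that they are compatible.
        if math.ceil(len(records) / records_per_chunk) != num_chunks:
            raise ValueError("num_chunks and records_per_chunk are incompatible")
    if records_per_chunk is None:
        records_per_chunk = max(1, math.ceil(len(records) / num_chunks))
    chunks = [
        records[i : i + records_per_chunk]
        for i in range(0, len(records), records_per_chunk)
    ]
    logging.info(f"Writing {len(chunks)} chunks with up to {records_per_chunk} records each")
    return chunks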
if has_headers:
    file_one_headers = get_headers(file_one)
    file_two_headers = get_headers(file_two)
    if file_one_headers != file_two_headers and sorted(file_one_headers) == sorted(
I guess I don't understand what is going on here. If we have headers:
A = ["a", "b", "c"]
B = ["b", "c", "a"]
print(A == B)   # False
print(sorted(A) == sorted(B))   # True
So I see you're moving the headers; I think you should add logs notifying that the columns aren't arranged correctly and that we're moving the columns.
    if file_one_headers != file_two_headers and sorted(file_one_headers) == sorted(
        file_two_headers
    ):
        list_two = move_columns(file_one_headers, file_two)
This should be sorted(A) != sorted(B).
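Putting the two comments above together, one possible shape for this logic, assuming the PR's existing get_headers, move_columns, and file_to_list helpers. This is a sketch, not the PR's implementation, and the log wording is only a suggestion:

import logging

# Hypothetical helper combining both suggestions; relies on the PR's
# get_headers, move_columns, and file_to_list functions.
def load_aligned_rows(file_one: str, file_two: str, has_headers: bool):
    if not has_headers:
        return file_to_list(file_two, has_headers)
    file_one_headers = get_headers(file_one)
    file_two_headers = get_headers(file_two)
    if sorted(file_one_headers) != sorted(file_two_headers):
        # The files do not describe the same set of columns at all.
        raise ValueError(f"{file_one} and {file_two} have different headers")
    if file_one_headers != file_two_headers:
        # Same columns, different order: log it and reorder before comparing rows.
        logging.warning(
            f"Columns in {file_two} are not in the same order as in {file_one}; reordering"
        )
        return move_columns(file_one_headers, file_two)
    return file_to_list(file_two, has_headers)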
    else:
        list_two = file_to_list(file_two, has_headers)

if not strict:
I think you need to handle cases where "filename" and "file_name" are the same thing, and this code should be smart enough to figure that out. Luckily, there are functions in the SDK that already handle that; check here: https://github.com/uc-cdis/gen3sdk-python/blob/master/gen3/tools/utils.py. Maybe right after we open up the files we standardize the header formats.
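Not the SDK code itself, but a rough sketch of the kind of header normalization being suggested; the alias map and function name here are hypothetical illustrations:

# Hypothetical alias map; per the comment above, the gen3 SDK already has
# helpers for standardizing manifest column names.
HEADER_ALIASES = {
    "filename": "file_name",
    "file name": "file_name",
    "filesize": "file_size",
}

def standardize_headers(headers: list) -> list:
    """Normalize header spelling so e.g. 'filename' and 'file_name' compare equal."""
    normalized = [h.strip().lower() for h in headers]
    return [HEADER_ALIASES.get(h, h) for h in normalized]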
logging.info(f"Taking intersection between {file_one} & {file_two}...")

list_one = file_to_list(file_one, has_headers)
Maybe we should add validation for all manifests that these utils process.
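For example, a shared sanity check that each util could call before processing. The function and the exact rules below are only an illustration of the idea:

import csv

def validate_manifest(file_name: str, delimiter: str = "\t"):
    """Basic sanity checks: non-empty file and a consistent column count per row."""
    with open(file_name, newline="") as f:
        rows = list(csv.reader(f, delimiter=delimiter))
    if not rows:
        raise ValueError(f"{file_name} is empty")
    expected = len(rows[0])
    for i, row in enumerate(rows):
        if len(row) != expected:
            raise ValueError(
                f"{file_name} row {i} has {len(row)} columns, expected {expected}"
            )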
return [row for row in list_one if row in list_two]

# TODO
Can you create a ticket for this TODO and link it here as well?
Jira Ticket: PXP-10845
New Features
Breaking Changes
Bug Fixes
Improvements
Dependency updates
Deployment changes