Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand file tokenisation #53

Merged
merged 17 commits into from
Mar 2, 2024

Conversation

LibraChris
Copy link
Contributor

Expand the tokenisation of files and directories as described in #52.

Add file and directory terms used for AFSO
Update Tokenization for Directories by using the new Tokens from 5d22ada
Update Tokenization for Files by using the new Tokens from 5d22ada
Rename specific Tokens for clarity
@LibraChris
Copy link
Contributor Author

@kMutagene

Copy link
Member

@kMutagene kMutagene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LibraChris looks good, but i think we should separate the layers a bit cleaner.

  • The API in FileSystem should provide a function that simply tokenizes a single URI (e.g. signature System.Uri -> Cvparam)
  • That weay, the calling function can decide on the type of uris it provides, and we do not need to provide one function for paths and files each here.
  • I like the idea of "traversing" the URI segments, and i think all these match cases could be unified to a single one
  • These changes definately need some tests

src/ARCTokenization/Tokenization.fs Outdated Show resolved Hide resolved
src/ARCTokenization/Tokenization.fs Outdated Show resolved Hide resolved
src/ARCTokenization/Tokenization.fs Outdated Show resolved Hide resolved
Copy link
Member

@kMutagene kMutagene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after some small naming tweaks and added tests, lgtm

@@ -158,3 +159,40 @@ module Tokenization =
at.Columns
|> Array.map CompositeColumn.tokenize
|> List.ofArray

module SpecificTokens =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename SpecificTokens -> ArcFileSystem



let private normalisePath (path:string) =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use internal instead of private

| _ -> StructuralOntology.AFSO.``File Path`` |> fun t -> CvParam(t,relativePath)

/// Gets CvParams based on the root path, file system type, and full path
let getSpecificTokens (rootPath:string) (pType:PType) (path:string) =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename getSpecificTokens -> getArcFileSystemTokens

| Directory

/// Matches a CvParam based on the relative path and file system type
let matchFileSystemTokenByRelativePath (pType:PType) (relativePath: string) =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename matchFileSystemTokenByRelativePath -> convertRelativePath

@LibraChris
Copy link
Contributor Author

@kMutagene
All ISA.tryParseMetadataSheetFromToken functions are based on the absolute path of a token, either this needs to be adjusted or the ARC Tokenisation should be based on absolute paths

@kMutagene
Copy link
Member

kMutagene commented Feb 28, 2024

this needs to be adjusted

I would suggest by adding the root path as argument

- Rework MetadataSheet parsers by using the new Tokens
- Add parseProcessGraphColumnsFromToken  and parseProcessGraphColumnsFromTokens
Copy link
Member

@kMutagene kMutagene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice how this comes together now.

I have a few naming nitpicks. Also, you should adapt the xml documentation for the new token-based top level parsers. Most annotations for the token arguments are

/// he path to the xlsx file

Which is not true.

@@ -9,54 +9,68 @@ module internal ISA =

open System.IO

let tryParseMetadataSheetFromToken (isaFileName: string) (isaMdsParsingF: string -> IParam list) (absFileToken: IParam) =
let tryParseMetadataSheetFromToken (rootPath:string) (isaCvTerm: CvTerm) (isaMdsParsingF: string -> IParam list) (absFileToken: IParam) =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: rename absFileToken -> relFileToken, since we ar enow tokenizing the farc file system as relative paths by convention

with _ ->
None
else None

let parseMetadataSheetsFromTokens (isaFileName: string) (isaMdsParsingF: string -> IParam list) (absFileTokens: #IParam seq) =
let parseMetadataSheetsFromTokens (rootPath:string) (isaCvTerm: CvTerm) (isaMdsParsingF: string -> IParam list) (absFileTokens: #IParam seq) =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absFileToken -> relFileToken

/// </summary>
/// <param name="rootPath">ARC root path</param>
/// <param name="absFileToken">he path to the study xlsx file</param>
let parseProcessGraphColumnsFromToken (rootPath:string) (absFileToken: IParam) =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absFileToken -> relFileToken

/// <param name="rootPath">ARC root path</param>
/// <param name="isaCvTerm">The term to which the token should be found</param>
/// <param name="absFileToken">he path to the study xlsx file</param>
let parseProcessGraphColumnsFromTokens (rootPath:string) (isaCvTerm: CvTerm) (absFileTokens: #IParam seq) =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absFileTokens -> relFileTokens

/// </summary>
/// <param name="rootPath">ARC root path</param>
/// <param name="absFileToken">he path to the study xlsx file</param>
static member parseProcessGraphColumnsFromToken (rootPath:string) (absFileToken: IParam) =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absFileToken -> relFileToken

/// </summary>
/// <param name="rootPath">ARC root path</param>
/// <param name="absFileToken">he path to the xlsx file</param>
static member parseProcessGraphColumnsFromTokens (rootPath:string) (isaCsvTerm) (absFileTokens: #IParam seq) =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absFileTokens -> relFileTokens

/// <param name="rootPath">ARC root path</param>
/// <param name="absFileToken">he path to the study xlsx file</param>
static member parseProcessGraphColumnsFromToken (rootPath:string) (absFileToken: IParam) =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absFileToken -> relFileToken

/// </summary>
/// <param name="rootPath">ARC root path</param>
/// <param name="absFileToken">he path to the xlsx file</param>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absFileTokens -> relFileTokens

Adress the requested Changes
| true -> Some (parseProcessGraphColumnsFromToken rootPath token)
| false -> None
)
|> fun x ->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these functions should not fail but just return an empty sequence (maybe additionally printing warnings). Imagine if this is used in a validation package. It would lead to the parsing failing, without any validation being performed.

refFileTokens
|> Seq.choose (fun token -> tryParseMetadataSheetFromToken rootPath isaCvTerm isaMdsParsingF token)

let parseProcessGraphColumnsFromToken (rootPath:string) (refFileToken: IParam) =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refFileToken

can you elaborate on that naming? i'd have gone with relFileToken or relativeFileToken

let parseProcessGraphColumnsFromTokens (rootPath:string) (isaCvTerm: CvTerm) (refFileTokens: #IParam seq) =
refFileTokens
|> Seq.choose (fun token ->
match token.Name = isaCvTerm.Name with
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we can do a full match here. There are terms out there with the same name from different ontologies.

/// <param name="rootPath">ARC root path</param>
/// <param name="refFileToken">IParam seq of the ARC Tokens</param>
static member parseProcessGraphColumnsFromTokens (rootPath:string) (refFileTokens: #IParam seq) =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The xml comments still do not help that much.

IParam seq of the ARC Tokens

says nothing to me.

Also

Returns a seq of annotation tables from an IParam seq if the given tokens contains a filepath with the standard assay file name ("isa.assay.xlsx")

is only technically true, as we now match the term. So i would suggest something like this:

Returns a seq of annotation tables from an IParam seq for each contained token that is annotated with the term 'Assay File' (e.g, files in the ARC named isa.assay.xlsx).

and for refFileTokens (typo there as well)

a seq of IParams that may contain relevant tokens

)
|> fun x ->
match Seq.length x with
| 0 -> failwith "No token found"
| 0 -> Seq.empty
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convoluted way of just returning x :D

@kMutagene kMutagene merged commit 776ca5a into nfdi4plants:main Mar 2, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants