It's pretty helpful for teaching purposes to be able to demonstrate small tsinfer input files without needing a VCF. I wonder if we could define a class method like this:
import numpy as np
import zarr

class VariantData:
    ...
    @classmethod
    def create_from_arrays(
        cls,
        path,
        variant_positions,
        variant_matrix_phased,
        alleles,
        ancestral_allele,
        sample_id=None,
    ):
        """
        Create a VariantData instance directly from array data. Only useful for small
        datasets. Larger datasets should use e.g. bio2zarr to create a zarr datastore
        containing the required data and call VariantData(path_to_zarr)
        :param path str: The path used to store the data
        :param variant_positions array: a 1D array of variant positions
        :param variant_matrix_phased array: a 3D array of variants x samples x ploidy,
            giving an index into the allele array for each corresponding variant. Values
            must be coercible into 8-bit (np.int8) integers. Data for all samples is
            assumed to be phased
        :param alleles array: a 2D string array of variants x max_num_alleles at a site.
            Each allele list for a variant must be the same length, at least one greater
            than the maximum value in the `variant_matrix_phased` array. Each allele list
            can be padded with `""` to ensure the correct length
        :param ancestral_allele array: a 1D string array specifying the ancestral allele
            for each variant. For unknown ancestral alleles, any character which is not
            in the allele list can be used.
        :param sample_id: a 1D string array of sample names. If None, samples (each
            corresponding to `ploidy` variant values) will be allocated the IDs
            "0", "1", "2", .. etc.
        """
        call_genotype = np.array(variant_matrix_phased, dtype=np.int8)
        if sample_id is None:
            sample_id = np.arange(call_genotype.shape[1]).astype(str)
        # Assume all genotypes are phased
        call_genotype_phased = np.ones(call_genotype.shape[:2], dtype=bool)
        zarr.save_group(
            path,
            variant_position=np.array(variant_positions),
            call_genotype=call_genotype,
            call_genotype_phased=call_genotype_phased,
            variant_allele=np.array(alleles),
            variant_ancestral_allele=np.array(ancestral_allele),
            sample_id=sample_id,
        )
        return cls(path)
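The shape and range requirements in the docstring could be checked before anything is written to disk. Here is a rough sketch of such a check using only NumPy. The helper name `check_variant_arrays` and its error messages are made up for illustration and are not part of tsinfer.

```python
import numpy as np


def check_variant_arrays(variant_positions, variant_matrix_phased, alleles, ancestral_allele):
    """Hypothetical helper (not part of tsinfer): validate inputs, return int8 genotypes."""
    positions = np.asarray(variant_positions)
    genotypes = np.asarray(variant_matrix_phased)
    alleles = np.asarray(alleles)
    ancestral = np.asarray(ancestral_allele)

    if genotypes.ndim != 3:
        raise ValueError("genotypes must be 3D: variants x samples x ploidy")
    num_variants = genotypes.shape[0]
    if positions.shape != (num_variants,):
        raise ValueError("need exactly one position per variant")
    if alleles.ndim != 2 or alleles.shape[0] != num_variants:
        raise ValueError("alleles must be 2D: variants x max_num_alleles")
    if ancestral.shape != (num_variants,):
        raise ValueError("need exactly one ancestral allele per variant")
    if genotypes.size > 0:
        limits = np.iinfo(np.int8)
        if genotypes.min() < limits.min or genotypes.max() > limits.max:
            raise ValueError("genotype values do not fit in np.int8")
        # Genotypes are 0-based indexes, so the largest one must be a valid column.
        if genotypes.max() >= alleles.shape[1]:
            raise ValueError("genotype index has no corresponding allele")
    return genotypes.astype(np.int8)


# Two variants, one diploid sample, two alleles per site.
genotypes = check_variant_arrays(
    [1, 2], [[[0, 1]], [[1, 0]]], [["A", "T"], ["C", "G"]], ["A", "C"]
)
```

Checking the range explicitly matters because converting an existing integer array to `np.int8` can wrap out-of-range values silently rather than raising.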
We're going to need something like this for testing anyway - even better if it was just an in-memory store.
hyanwong changed the title from "Class method to create simple SgkitSampleData files for demos" to "Class method to create simple VariantData files for demos" on Aug 28, 2024
This could also be an alternative to sgkit-dev/bio2zarr#232, as a method for getting a tree sequence into VariantData format. It would be useful to have an in-memory way to feed the data from a tree sequence into tsinfer, which I think is part of #783.
Then we could use it like this:
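As an illustration only, the inputs for a tiny demo dataset might be built as below. The array names follow the parameters proposed in this issue, the values and the `"demo.vcz"` path are invented, and the final call is left commented out because `create_from_arrays` is a proposal here, not an existing method.

```python
import numpy as np

# Three variants for two diploid samples.
variant_positions = np.array([10, 25, 40])

# Shape (variants, samples, ploidy); each value is an index into `alleles`.
variant_matrix_phased = np.array([
    [[0, 1], [1, 1]],
    [[0, 0], [0, 1]],
    [[2, 0], [1, 0]],
])

# Shape (variants, max_num_alleles); shorter allele lists are padded with "".
alleles = np.array([
    ["A", "T", ""],
    ["C", "G", ""],
    ["G", "A", "T"],
])

# "N" is not among the alleles at the last site, so it marks an unknown state.
ancestral_allele = np.array(["A", "C", "N"])

# Hypothetical call to the method proposed in this issue:
# vdata = VariantData.create_from_arrays(
#     "demo.vcz", variant_positions, variant_matrix_phased, alleles, ancestral_allele
# )
```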