Bump scikit-learn from 1.0.2 to 1.5.0 #146

Open
wants to merge 5 commits into
base: main
16 changes: 15 additions & 1 deletion .gitignore
@@ -6,7 +6,21 @@
__pycache__
kdb/__pycache__
**.kdb
docs/generated/
examples/example_report/*.png
examples/example_report
test/data
pypi_token.foo
kmerdb.egg-info/
kmerdb.log
test.*
build
dist
.reinstall.sh
examples/*/*.png
examples/*/*.jpg






2 changes: 1 addition & 1 deletion .readthedocs.yaml
@@ -17,7 +17,7 @@ build:
# Build documentation in the "docs/" directory with Sphinx
sphinx:
builder: html
configuration: docs/conf.py
configuration: docs/source/conf.py
fail_on_warning: true
# You can configure Sphinx to use a different builder, for instance use the dirhtml builder for simpler URLs
# builder: "dirhtml"
110 changes: 106 additions & 4 deletions TODO.org
@@ -6,6 +6,98 @@
# .kdb files should be de Bruijn graph databases
# The final prototype would be .bgzf format from biopython

* 8/14/24 compositional ideas

** Notes:
It's a regression problem. The coefficients should sum to unity, and the sum of the k-mer count vectors is the total k-mer count.


the count vector is the 'coefficient' times the aggregate vector (.kdb file)

Therefore, the 'coefficients' (the proportion each species contributes), along with regression statistics,

can be estimated by performing least squares on the count-matrix system Ax = b, where b is the aggregate/collated/composite count-vector profile of multiple species, from which the decomposition proceeds.

A is the count matrix obtained by collating suspected species of the 'metagenome' into one k-mer profile.

b is the 'observed' or artificially generated metagenome k-mer profile.
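
The least-squares idea in these notes could be sketched as follows. This is a minimal pure-Python illustration and not kmerdb's implementation: the function names (`normalize`, `solve`, `estimate_proportions`) are hypothetical, each species profile is assumed to be normalized to frequencies so that it forms one column of A, and the sum-to-unity condition is applied by rescaling after the fit instead of as a true constraint.

```python
def normalize(counts):
    """Convert a k-mer count vector into a frequency vector that sums to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("empty k-mer profile")
    return [c / total for c in counts]


def solve(m, rhs):
    """Solve the square system m x = rhs by Gaussian elimination with partial pivoting."""
    n = len(m)
    aug = [[float(v) for v in m[i]] + [float(rhs[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: species profiles are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def estimate_proportions(species_counts, observed_counts):
    """Estimate x in A x = b via the normal equations (A^T A) x = A^T b.

    species_counts: one k-mer count vector per suspected species (columns of A).
    observed_counts: the composite 'metagenome' count vector b.
    Assumption: the result is rescaled to sum to 1 after the unconstrained fit.
    """
    cols = [normalize(c) for c in species_counts]
    b = normalize(observed_counts)
    n = len(cols)
    ata = [[sum(p * q for p, q in zip(cols[i], cols[j])) for j in range(n)]
           for i in range(n)]
    atb = [sum(p * q for p, q in zip(cols[i], b)) for i in range(n)]
    x = solve(ata, atb)
    total = sum(x)
    return [v / total for v in x]


if __name__ == "__main__":
    # Toy profiles over four k-mers; the composite is a 25% / 75% mixture.
    species = [[10, 20, 30, 40], [40, 30, 20, 10]]
    composite = [325, 275, 225, 175]
    print(estimate_proportions(species, composite))
```

Plain least squares can return negative coefficients on noisy data, so a non-negative solver such as `scipy.optimize.nnls` may be a better fit for real profiles; that choice is not made in these notes.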

I want to do this in either Cython, Numba, and/or CUDA.

Then do the D2 statistics.
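
For reference, the basic D2 statistic is the inner product of two k-mer count vectors. The sketch below assumes plain string sequences and uses hypothetical helper names (`kmer_counts`, `d2`); it does not cover the normalized variants (D2S, D2*) or kmerdb's own file format.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(x, y):
    """D2 statistic: sum over shared k-mers of count_x(w) * count_y(w)."""
    return sum(count * y[w] for w, count in x.items() if w in y)


if __name__ == "__main__":
    a = kmer_counts("ACGTACGT", 2)
    b = kmer_counts("ACGTTT", 2)
    print(d2(a, b))
```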
**

**



* TODO 8/13/24 priority: compositional analysis

** Lots of priorities, currently. Comp. analysis, custom SVD implementation for the matrix command? A = UsigmaVt

** Markov chain submodules? Log-likelihood ratio

** TODO [/] T O D O list

*** IN-PROGRESS [ x ] Quarto journals elsevier preprint for biorxiv
:LOGBOOK:
- State "IN-PROGRESS" from "NEXT" [2024-08-16 Fri 20:33]
:END:

*** IN-PROGRESS [ x ] compositional analysis (p0)
:LOGBOOK:
- State "IN-PROGRESS" from "NEXT" [2024-08-16 Fri 20:33]
:END:

*** DONE [ x ]Are all species downloaded
CLOSED: [2024-08-16 Fri 20:33]
:LOGBOOK:
- State "DONE" from "WAITING" [2024-08-16 Fri 20:33]
- State "WAITING" from "DONE" [2024-08-14 Wed 16:36]
- State "DONE" from "WAITING" [2024-08-14 Wed 16:36]
- State "WAITING" from "IN-PROGRESS" [2024-08-14 Wed 16:36]
- State "IN-PROGRESS" from "NEXT" [2024-08-14 Wed 16:36]
:END:

** Description from Xiong et al Mouse IBD.

Once GF status was confirmed as described, GF NOD mice were colonized by co-housing with gnotobiotic mice
colonized with defined cultured bacterial species (Altered Schaedler's flora; ASF - [35]) which were prepared
in the laboratory from cloned bacteria using sterile technique. The ASF consists of Lactobacillus acidophilus
(ASF 360), Lactobacillus murinus (ASF 361), Bacteroides distasonis, (ASF 519), Mucispirillum schaedleri (ASF 457),
Eubacterium plexicaudatum (ASF 492), a Fusiform-shaped bacterium (ASF 356) and two Clostridium species (ASF 500, ASF 502).
These ASF-colonized gnotobiotic mice were then bred in isolators to ensure no additional species were introduced.
The presence of the ASF species was confirmed by species-specific bacterial qPCR [58].

**

**

* 8/9/24 profile decomposition/recomposition(simulation) problem
** Merge profiles with ratios (0.25% B. bifidum)
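
A simulated composite profile might be built by weighting each species' frequency vector by its ratio and scaling to a target total count. This is only a sketch: `merge_profiles` and its parameters are hypothetical names, and rounding to integer counts is an assumption about how the simulated profile would be stored.

```python
import math


def merge_profiles(profiles, ratios, total_count):
    """Combine per-species k-mer count vectors into one composite profile.

    profiles: dict mapping species name -> k-mer count vector (equal lengths).
    ratios: dict mapping species name -> proportion; must sum to 1.
    total_count: total number of k-mers in the simulated composite.
    """
    if not math.isclose(sum(ratios.values()), 1.0, abs_tol=1e-9):
        raise ValueError("ratios must sum to 1")
    length = len(next(iter(profiles.values())))
    composite = [0.0] * length
    for name, counts in profiles.items():
        species_total = sum(counts)
        weight = ratios[name]
        for i, c in enumerate(counts):
            # Weight the species' k-mer frequency by its share of the mixture.
            composite[i] += weight * c / species_total
    return [round(f * total_count) for f in composite]


if __name__ == "__main__":
    profiles = {"species_a": [10, 20, 30, 40], "species_b": [40, 30, 20, 10]}
    ratios = {"species_a": 0.25, "species_b": 0.75}
    print(merge_profiles(profiles, ratios, 1000))
```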

**

* 8/8/24 Taking Notes on Xuejiang Xiong Mouse model IBD study

** SRA Accession id

*** SRA051354
SRA051354
***
** What is the purpose of this study?

The goal of this study is to recreate a mouse model of the disease called "Irritable Bowel Disease", using agents that induce responses and irritation to the point where the induced condition and the condition known as "irritable bowel disorder" are functionally similar.

The mice are NOD (non-obese diabetic) and susceptible to germs. They are colonized with 8 symbiotic bowel microbes, known as Altered Schaedler flora (ASF).

Samples taken from the bowels of these mice reveal the effect of the irritant/inducer agent on the gut microflora as measurable by Illumina High-throughput sequencing (HTS). Specifically, transcriptional libraries are prepared following RiboMinus treatment, to enrich for mRNAs and other non-rRNAs.

The mRNA libraries were processed on a Genome Analyzer IIx in this study. The SRA accession id for the single-end fastq datasets, bulk RNA for metatranscriptomics and assembly, is SRA051354.

The study used


* 8/3/24 Kolmogorov complexity and Generalized Suffix Arrays

** Suffix array
@@ -125,18 +217,23 @@ Kolmogorov complexity comes in two flavors: prefix-free (K(x)) and simple comple


** TODO core species choices
*** chicken farm estuary system changes (algination, asphyxia, microbiological changes
*** chicken farm runoff - estuary system changes (algination, asphyxia, microbiological changes)
*** anti-human leaky gut syndrome changes.
**** i.e. looking at the human leaky gut syndrome, but in reverse. What are bioprotective species and niches that provide resilience to leaky-gut syndrome
**** TODO chemophore SMILES and gastrotoxic footprints
**** mouse model (SRA051354) currently being studied from Xuejiang Xiong
**** looking to assess the Altered Schaedler flora/formula changes in irritable bowel syndrome.
**** Currently, only have the accession and brief notes, still reading as of 8/12/24
****


*** pathology of lupus or auto-immune skin condition microbiome/metagenomic changes.
*** vaginal microbiome changes
***
** Perspective 1 from review on distance metrics
**
* IN PROGRESS 7/10/24 - [IMPORTANT] Needs a choice [cython d2 x graph algorithm features ]:
** [Key choice needed]: 1 [ 2 reviews + cython D2 metrics ] path 2 [ 2 reviews + graph algorithm ]

** cython d2 metrics including the delta distance : |pab(A)-pab(B)| (Karlin et al, tetra,tri,di- nucleotide frequencies)
** (describe Karlin delta, algorithm to calculate)
*** Karlin delta first requires the least ambiguous k-mer (4-mer) frequency, i.e. the frequency of self
@@ -145,13 +242,18 @@ Kolmogorov complexity comes in two flavors: prefix-free (K(x)) and simple comple
*** this specifies the numerator for the tetranucleotide frequency (lowercase tau)
*** the denominator is only the most specific tetra and 1-neighboring trinucleotide frequencies, and the mononucleotide frequencies. [ f(acc) f(accg) f(ccg) f(a) f(c) f(t) f(g) ]
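
As a starting point for the delta distance, the dinucleotide form can be sketched as below. This is a simplification under stated assumptions: it uses the dinucleotide relative abundance rho_xy = f_xy / (f_x f_y) and averages |rho_xy(A) - rho_xy(B)| over the 16 dinucleotides, it works on a single strand without reverse-complement symmetrization, and it does not implement the tetranucleotide (tau) formula described in the notes. The function names are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def relative_abundance(seq):
    """Dinucleotide relative abundances rho_xy = f_xy / (f_x * f_y)."""
    seq = seq.upper()
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono = sum(mono[b] for b in BASES)
    n_di = sum(di[x + y] for x, y in product(BASES, repeat=2))
    if n_mono == 0 or n_di == 0:
        raise ValueError("sequence too short or contains no unambiguous bases")
    rho = {}
    for x, y in product(BASES, repeat=2):
        fx = mono[x] / n_mono
        fy = mono[y] / n_mono
        fxy = di[x + y] / n_di
        # Assumption: dinucleotides with an absent base get abundance 0.
        rho[x + y] = fxy / (fx * fy) if fx > 0 and fy > 0 else 0.0
    return rho


def karlin_delta(seq_a, seq_b):
    """Mean absolute difference of relative abundances over all 16 dinucleotides."""
    rho_a = relative_abundance(seq_a)
    rho_b = relative_abundance(seq_b)
    return sum(abs(rho_a[d] - rho_b[d]) for d in rho_a) / 16


if __name__ == "__main__":
    print(karlin_delta("ACGTACGTAC", "AAAACCCCGGGGTTTT"))
```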
**
** new graph file format specification ( walk,path is a subclass of unlabeled graph, where node labels can be visited, path order, and progressive or retro in the walk.
** new graph file format specification (walk, path is a subclass of unlabeled graph, where node labels can be visited, path order, and progressive or retro in the walk.
** contig generator method, and contig boundary definition specification
**
**
**
**
* 6/28/24 - [ ...whoops, forgot the date by 3 x24hr blocxz. ] okay, so the 0.8.4 release should have the graph labeling done.
* TODO 6/28/24 - [ ...whoops, forgot the date by 3 x24hr blocxz. ] okay, so the 0.8.4 release should have the graph labeling done.
:LOGBOOK:
- State "DELEGATED" from "CANCELED" [2024-08-12 Mon 17:02]
- State "CANCELED" from "DELEGATED" [2024-08-12 Mon 17:02]
- State "DELEGATED" from [2024-08-12 Mon 17:02]
:END:

** graph node labeling and classification, and walk strategy

20 changes: 20 additions & 0 deletions docs/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build

# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
38 changes: 38 additions & 0 deletions docs/conf.py.old
@@ -0,0 +1,38 @@
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information

project = 'kmerdb'
copyright = '2024, Matt Ralston'
author = 'Matt Ralston'
release = '0.8.6'

# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

extensions = [
#'sphinx.ext.duration',
#'sphinx.ext.doctest',
'sphinx.ext.intersphinx',
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.mathjax',
'sphinx.ext.viewcode',
# 'sphinx_rtd_theme',
]


templates_path = ['_templates']
exclude_patterns = []



# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

html_theme = 'alabaster'
html_static_path = ['_static']
123 changes: 117 additions & 6 deletions docs/index.rst.old
@@ -1,5 +1,4 @@


kmerdb
===================

@@ -70,17 +69,18 @@ IUPAC residues (ATCG+RYSWKM+BDHV) are kept throughout the k-mer counting. But no
## Development


```bash
python setup.py test
```
::

python setup.py test


## License

Created by Matthew Ralston - [Scientist, Programmer, Musician](http://matthewralston.github.io) - [Email](mailto:[email protected])

Distributed under the Apache license. See `LICENSE.txt` for the copy distributed with this project. Open source software is not for everyone, and I'm the author and maintainer. Cheers, on me. You may use and distribute this software, gratis, so long as the original LICENSE.txt is distributed along with the software. This software is distributed AS IS and provides no warranties of any kind.

```
::
Copyright 2020 Matthew Ralston

Licensed under the Apache License, Version 2.0 (the "License");
@@ -94,7 +94,7 @@ Distributed under the Apache license. See `LICENSE.txt` for the copy distributed
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```


## Contributing

@@ -113,3 +113,114 @@ Thank you to the authors of kPAL and Jellyfish for the inspiration and bit shift
The intention is that more developers would want to add functionality to the codebase or even just utilize things downstream, but to build out directly with numpy and scipy/scikit as needed to suggest the basic infrastructure for the ML problems and modeling approaches that could be applied to such datasets. This project began under GPL v3.0 and was relicensed with Apache v2. Hopefully this project could gain some interest. I have so much fun working on this project. There's more to it than meets the eye. I'm working on a preprint, and the draft is included in some of the latest versions of the codebase, specifically .Rmd files.

More on the flip-side. It's so complex with technology these days...

.. kmerdb documentation master file, created by
sphinx-quickstart on Thu Aug 8 00:32:04 2024.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.

kmerdb documentation
====================

Add your content using 'reStructuredText' syntax. See the
`reStructuredText <https://www.sphinx-doc.org/en/master/usage/restructuredtext/index.html>`_
documentation for details.


.. toctree::
:maxdepth: 2
:caption: Contents:

.. automodule:: kmerdb.fileutil
:members:

.. automodule:: kmerdb.graph
:members:

.. automodule:: kmerdb.parse
:members:

.. automodule:: kmerdb.kmer
:members:

.. automodule:: kmerdb.util
:members:

.. automodule:: kmerdb.index
:members:

.. automodule:: kmerdb.logger
:members:

.. automodule:: kmerdb.distance
:members:

.. automodule:: kmerdb.python_distances
:members:

.. automodule:: kmerdb.probability
:members:

.. autoclass:: kmerdb.fileutil
:members:

.. autoclass:: kmerdb.graph
:members:

.. autoclass:: kmerdb.parse
:members:

.. autoclass:: kmerdb.kmer
:members:

.. autoclass:: kmerdb.util
:members:

.. autoclass:: kmerdb.index
:members:

.. autoclass:: kmerdb.logger
:members:

.. autoclass:: kmerdb.distance
:members:

.. autoclass:: kmerdb.python_distances
:members:

.. autoclass:: kmerdb.probability
:members:

.. autoexception:: kmerdb.fileutil
:members:

.. autoexception:: kmerdb.graph
:members:

.. autoexception:: kmerdb.parse
:members:

.. autoexception:: kmerdb.kmer
:members:

.. autoexception:: kmerdb.util
:members:

.. autoexception:: kmerdb.index
:members:

.. autoexception:: kmerdb.logger
:members:

.. autoexception:: kmerdb.distance
:members:

.. autoexception:: kmerdb.python_distances
:members:

.. autoexception:: kmerdb.probability
:members:



