Sourcery refactored master branch #1

Open: sourcery-ai[bot] wants to merge 1 commit into master

sourcery-ai[bot]
@sourcery-ai sourcery-ai bot commented Dec 4, 2023

Branch master refactored by Sourcery.

If you're happy with these changes, merge this Pull Request using the Squash and merge strategy.

See our documentation here.

Run Sourcery locally

Reduce the feedback loop during development by using the Sourcery editor plugin.

Review changes via command line

To manually merge these changes, make sure you're on the master branch, then run:

git fetch origin sourcery/master
git merge --ff-only FETCH_HEAD
git reset HEAD^

Help us improve this pull request!

@sourcery-ai sourcery-ai bot requested a review from ben-silke December 4, 2023 23:01

@sourcery-ai sourcery-ai bot left a comment

Due to GitHub API limits, only the first 60 comments can be shown.

Comment on lines 91 to -112

fig = plt.figure()
ax = fig.gca(projection='3d')
ax.plot_trisurf(df['Chunk Size'], df['p4,p11'], df['Match Rate'], linewidth=0.2)
ax.set_xlabel("Chunk Size")
ax.set_ylabel("Parameters")
ax.set_zlabel("Match Rate")
plt.show()

df2 = df[df["Tool"] == "mgm2"].groupby(["p4", "p11"], as_index=False).mean()

idx = df2["Match Rate"].argmax()
p4 = df2.at[idx, "p4"]
p11 = df2.at[idx, "p11"]
df_best = df[(df["p4"] == p4) & (df["p11"] == p11)]
df_alex = df[(df["p4"] == 10) & (df["p11"] == 20)]
fig, ax = plt.subplots()
sns.lineplot("Chunk Size", "Match Rate", data=df_best, label="Optimized")
sns.lineplot("Chunk Size", "Match Rate", data=df[df["Tool"] == "mprodigal"], label="MProdigal")
sns.lineplot("Chunk Size", "Match Rate", data=df_alex, label="Original")
ax.set_ylim(0, 1)
plt.show()

Function main refactored with the following changes:

Comment on lines -90 to +94
- labels_per_seqname[lab.seqname()] = list()
+ labels_per_seqname[lab.seqname()] = []

  labels_per_seqname[lab.seqname()].append(lab)

- counter = 0
- for seqname in labels_per_seqname:
+ for counter, (seqname, value) in enumerate(labels_per_seqname.items()):

Function get_features_from_prediction refactored with the following changes:
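
The loop change in this hunk swaps a hand-maintained counter for enumerate() over the dict's items. A minimal sketch of why the two forms produce the same indices, assuming the original loop incremented counter once per iteration (the hunk does not show that part) and using made-up data in place of the project's label objects:

```python
# Hypothetical stand-in data; the real values are label objects from the project.
labels_per_seqname = {"contig_1": ["a", "b"], "contig_2": ["c"]}


def with_manual_counter(mapping):
    """Old style: counter maintained by hand, values looked up by key."""
    seen = []
    counter = 0
    for seqname in mapping:
        seen.append((counter, seqname, mapping[seqname]))
        counter += 1
    return seen


def with_enumerate(mapping):
    """New style: enumerate() supplies the index, items() supplies the value."""
    return [
        (counter, seqname, value)
        for counter, (seqname, value) in enumerate(mapping.items())
    ]


old_result = with_manual_counter(labels_per_seqname)
new_result = with_enumerate(labels_per_seqname)
```

One difference worth checking in review: after the loop, the manual counter equals the number of entries, whereas the enumerate() variable holds the last index and is unbound if the dict is empty.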

Comment on lines -187 to -189
list_entries = list()



Function build_gcode_features_for_gi_for_chunk refactored with the following changes:

Comment on lines -208 to +202
- list_df = list()
+ list_df = []

Function build_gcode_features_for_gi refactored with the following changes:

Comment on lines -230 to +227
- # type: (Environment, GenomeInfoList, str, List[int], Dict[str, Any]) -> pd.DataFrame
- list_df = list()

- for gi in gil:
-     list_df.append(
-         build_gcode_features_for_gi(env, gi, tool, chunks, **kwargs)
-     )

+ list_df = [
+     build_gcode_features_for_gi(env, gi, tool, chunks, **kwargs)
+     for gi in gil
+ ]

Function build_gcode_features refactored with the following changes:

This removes the following comments ( why? ):

# type: (Environment, GenomeInfoList, str, List[int], Dict[str, Any]) -> pd.DataFrame
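
Sourcery flags that this refactor drops the "# type:" comment along with the loop. A minimal sketch of one way to keep both the comprehension and the type information, by moving the types into annotations; build_for_item and its parameters are hypothetical stand-ins, not the project's functions:

```python
from typing import Any, Dict, List


def build_for_item(item: int, scale: int = 1) -> Dict[str, Any]:
    """Hypothetical stand-in for the per-genome worker."""
    return {"item": item, "value": item * scale}


def build_all_loop(items, **kwargs):
    # type: (List[int], **Any) -> List[Dict[str, Any]]
    # Old style: empty list plus append in a loop, types in a comment.
    results = list()
    for item in items:
        results.append(build_for_item(item, **kwargs))
    return results


def build_all_comprehension(items: List[int], **kwargs: Any) -> List[Dict[str, Any]]:
    # New style: same behavior as a list comprehension. The types live in
    # annotations, so they are not tied to a comment inside the body.
    return [build_for_item(item, **kwargs) for item in items]
```

Whether to restore the type comment or switch to annotations is a call for the maintainers; the sketch only shows that the comprehension itself does not require losing the types.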

Comment on lines -130 to +132
- list_entries = list()
+ list_entries = []

Function add_codon_probabilities refactored with the following changes:

Comment on lines -235 to +238
- x_out = list()
- y_out = list()
+ x_out = []
+ y_out = []

Function compute_bin_averages refactored with the following changes:

Comment on lines -352 to +354
- for gc_tag in sc_gc.keys():
+ for gc_tag in sc_gc:

Function add_start_context_probabilities refactored with the following changes:

Comment on lines -448 to +450
- list_mgm_models = list() # type: List[List[float, float, MGMMotifModelV2]]
+ list_mgm_models = []

Function build_mgm_motif_models_for_all_gc refactored with the following changes:

This removes the following comments ( why? ):

# type: List[List[float, float, MGMMotifModelV2]]

Comment on lines -493 to +511
if True or "RBS" in output_tag:
# create a label for each shift
for shift, prob in motif._shift_prior.items():
prob /= 100.0
output_tag_ws = f"{output_tag}_{int(shift)}"
try:
mgm.items_by_species_and_gc[genome_tag][str(gc)].items[f"{output_tag_ws}_MAT"] = motif._motif[shift]
mgm.items_by_species_and_gc[genome_tag][str(gc)].items[f"{output_tag_ws}_POS_DISTR"] = motif._spacer[
shift]
except KeyError:
pass

mgm.items_by_species_and_gc[genome_tag][str(gc)].items[f"{output_tag_ws}"] = 1
mgm.items_by_species_and_gc[genome_tag][str(gc)].items[f"{output_tag_ws}_ORDER"] = 0
mgm.items_by_species_and_gc[genome_tag][str(gc)].items[f"{output_tag_ws}_WIDTH"] = width
mgm.items_by_species_and_gc[genome_tag][str(gc)].items[f"{output_tag_ws}_MARGIN"] = 0
mgm.items_by_species_and_gc[genome_tag][str(gc)].items[f"{output_tag_ws}_MAX_DUR"] = dur
mgm.items_by_species_and_gc[genome_tag][str(gc)].items[f"{output_tag_ws}_SHIFT"] = prob
else:
# promoter aren't shifted (for now)
best_shift = max(motif._shift_prior.items(), key=operator.itemgetter(1))[0]
mgm.items_by_species_and_gc[genome_tag][str(gc)].items[f"{output_tag}_MAT"] = motif._motif[best_shift]
mgm.items_by_species_and_gc[genome_tag][str(gc)].items[f"{output_tag}_POS_DISTR"] = motif._spacer[best_shift]

mgm.items_by_species_and_gc[genome_tag][str(gc)].items[f"{output_tag}"] = 1
mgm.items_by_species_and_gc[genome_tag][str(gc)].items[f"{output_tag}_ORDER"] = 0
mgm.items_by_species_and_gc[genome_tag][str(gc)].items[f"{output_tag}_WIDTH"] = width
mgm.items_by_species_and_gc[genome_tag][str(gc)].items[f"{output_tag}_MARGIN"] = 0
mgm.items_by_species_and_gc[genome_tag][str(gc)].items[f"{output_tag}_MAX_DUR"] = dur
# create a label for each shift
for shift, prob in motif._shift_prior.items():
prob /= 100.0
output_tag_ws = f"{output_tag}_{int(shift)}"
try:
mgm.items_by_species_and_gc[genome_tag][str(gc)].items[f"{output_tag_ws}_MAT"] = motif._motif[shift]
mgm.items_by_species_and_gc[genome_tag][str(gc)].items[f"{output_tag_ws}_POS_DISTR"] = motif._spacer[
shift]
except KeyError:
pass

mgm.items_by_species_and_gc[genome_tag][str(gc)].items[f"{output_tag_ws}"] = 1
mgm.items_by_species_and_gc[genome_tag][str(gc)].items[f"{output_tag_ws}_ORDER"] = 0
mgm.items_by_species_and_gc[genome_tag][str(gc)].items[f"{output_tag_ws}_WIDTH"] = width
mgm.items_by_species_and_gc[genome_tag][str(gc)].items[f"{output_tag_ws}_MARGIN"] = 0
mgm.items_by_species_and_gc[genome_tag][str(gc)].items[f"{output_tag_ws}_MAX_DUR"] = dur
mgm.items_by_species_and_gc[genome_tag][str(gc)].items[f"{output_tag_ws}_SHIFT"] = prob

Function add_motif_probabilities refactored with the following changes:

This removes the following comments ( why? ):

# promoter aren't shifted (for now)
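
The removed else branch sat behind "if True or ...", which always evaluates to True, so that branch (and its comment) could never run. A minimal sketch of the equivalence, with hypothetical function names and return strings standing in for the two code paths:

```python
def choose_strategy_old(output_tag: str) -> str:
    # "True or <anything>" short-circuits to True, so the else branch is dead code.
    if True or "RBS" in output_tag:
        return "one label per shift"
    else:
        return "best shift only"


def choose_strategy_new(output_tag: str) -> str:
    # After the refactor only the always-taken branch remains.
    return "one label per shift"
```

The behavior is unchanged, but the deleted comment recorded an intent ("for now") that reviewers may still want to keep somewhere.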

Comment on lines -563 to +618
mgm,
"RBS", f"RBS_{o}", genome_type, plot=plot
)

for o, l in zip(output_group, learn_from):
add_motif_probabilities(
env,
df_type[(df_type["GENOME_TYPE"].isin(l))],
mgm,
"RBS", f"RBS_{o}", genome_type, plot=plot
)
if "Promoter" in components:
df_type = df[df["Type"] == genome_type]
if genome_type == "Archaea":
output_group = ["D"]
learn_from = [{"D"}] # always learn Promoter form group D

df_type = df[df["Type"] == genome_type]
for o, l in zip(output_group, learn_from):
add_motif_probabilities(
env,
df_type[(df_type["GENOME_TYPE"].isin(l))],
mgm,
"PROMOTER", f"PROMOTER_{o}", genome_type, plot=plot
)
else:
output_group = ["C"]
learn_from = [{"C"}] # always learn Promoter form group C

df_type = df[df["Type"] == genome_type]
for o, l in zip(output_group, learn_from):
add_motif_probabilities(
env,
df_type[(df_type["GENOME_TYPE"].isin(l))],
mgm,
"PROMOTER", f"PROMOTER_{o}", genome_type, plot=plot
)

for o, l in zip(output_group, learn_from):
add_motif_probabilities(
env,
df_type[(df_type["GENOME_TYPE"].isin(l))],
mgm,
"PROMOTER", f"PROMOTER_{o}", genome_type, plot=plot
)
# Start Context
if "Start Context" in components:
if genome_type == "Archaea":
output_group = ["A", "D"]
learn_from = learn_from_arc

for o, l in zip(output_group, learn_from):
df_curr = df[(df["Type"] == genome_type) & (df["GENOME_TYPE"].isin(l))]
add_start_context_probabilities(df_curr, mgm, "SC_RBS", f"SC_RBS_{o}", genome_type=genome_type,
plot=plot)
else:
output_group = ["A", "B", "C", "X"]
learn_from = [{"A"}, {"B"}, {"C"}, {"A"}]

for o, l in zip(output_group, learn_from):
df_curr = df[(df["Type"] == genome_type) & (df["GENOME_TYPE"].isin(l))]
add_start_context_probabilities(df_curr, mgm, "SC_RBS", f"SC_RBS_{o}", genome_type=genome_type,
plot=plot)

for o, l in zip(output_group, learn_from):
df_curr = df[(df["Type"] == genome_type) & (df["GENOME_TYPE"].isin(l))]
add_start_context_probabilities(df_curr, mgm, "SC_RBS", f"SC_RBS_{o}", genome_type=genome_type,
plot=plot)
# promoter
if genome_type == "Archaea":
output_group = ["D"]
learn_from = [{"A", "D"}] # always learn RBS form group A

for o, l in zip(output_group, learn_from):
df_curr = df[(df["Type"] == genome_type) & (df["GENOME_TYPE"].isin(l))]

# NOTE: SC_PROMOTER is intentionally learned from SC_RBS. This is not a bug
# GMS2 has equal values for SC_RBS and SC_PROMOTER. Training from SC_RBS therefore allows us
# to learn from group A genomes as well.
add_start_context_probabilities(df_curr, mgm, "SC_RBS", f"SC_PROMOTER_{o}", genome_type=genome_type,
plot=plot)
else:
output_group = ["C"]
learn_from = [{"C"}]

for o, l in zip(output_group, learn_from):
df_curr = df[(df["Type"] == genome_type) & (df["GENOME_TYPE"].isin(l))]
# NOTE: SC_PROMOTER is intentionally learned from SC_RBS. This is not a bug
# GMS2 has equal values for SC_RBS and SC_PROMOTER. Training from SC_RBS therefore allows us
# to learn from group A genomes as well.
add_start_context_probabilities(df_curr, mgm, "SC_RBS", f"SC_PROMOTER_{o}", genome_type=genome_type,
plot=plot)
for o, l in zip(output_group, learn_from):
df_curr = df[(df["Type"] == genome_type) & (df["GENOME_TYPE"].isin(l))]

# NOTE: SC_PROMOTER is intentionally learned from SC_RBS. This is not a bug
# GMS2 has equal values for SC_RBS and SC_PROMOTER. Training from SC_RBS therefore allows us
# to learn from group A genomes as well.
add_start_context_probabilities(df_curr, mgm, "SC_RBS", f"SC_PROMOTER_{o}", genome_type=genome_type,
plot=plot)

Function build_mgm_models_from_gms2_models refactored with the following changes:

This removes the following comments ( why? ):

#     add_stop_codon_probabilities(df, mgm, genome_type=genome_type, plot=plot)
# NOTE: SC_PROMOTER is intentionally learned from SC_RBS. This is not a bug
# to learn from group A genomes as well.
# add_stop_codon_probabilities(df, mgm, genome_type="Archaea", plot=plot)
# if "Stop Codons" in components:
# GMS2 has equal values for SC_RBS and SC_PROMOTER. Training from SC_RBS therefore allows us
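
Most of this hunk hoists a loop that was duplicated in both arms of an if/else so that it runs once after the branch. A minimal sketch of that pattern, reusing the Promoter group values from the diff; the functions and the recorded tuples are hypothetical stand-ins for the add_motif_probabilities calls:

```python
def promoter_calls_old(genome_type):
    """Old shape: identical loop repeated in each branch."""
    calls = []
    if genome_type == "Archaea":
        output_group = ["D"]
        learn_from = [{"D"}]
        for o, l in zip(output_group, learn_from):
            calls.append((f"PROMOTER_{o}", sorted(l)))
    else:
        output_group = ["C"]
        learn_from = [{"C"}]
        for o, l in zip(output_group, learn_from):
            calls.append((f"PROMOTER_{o}", sorted(l)))
    return calls


def promoter_calls_new(genome_type):
    """New shape: branches only pick the groups, the loop runs once afterwards."""
    if genome_type == "Archaea":
        output_group = ["D"]
        learn_from = [{"D"}]
    else:
        output_group = ["C"]
        learn_from = [{"C"}]
    return [(f"PROMOTER_{o}", sorted(l)) for o, l in zip(output_group, learn_from)]
```

Hoisting is only safe when the duplicated blocks are truly identical, comments included, which is why the removed NOTE comments listed above deserve a second look.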

- list_entries = list()
+ list_entries = []

Function collect_start_info_from_gil refactored with the following changes:

- list_entries = list()
+ list_entries = []

Function collect_start_info_from_gil_and_print_to_file refactored with the following changes:

Comment on lines -123 to +125
- df[f"CONSENSUS_RBS_MAT"] = df.apply(lambda r: get_consensus_sequence(r["Mod"].items["RBS_MAT"]), axis=1)
+ df["CONSENSUS_RBS_MAT"] = df.apply(
+     lambda r: get_consensus_sequence(r["Mod"].items["RBS_MAT"]), axis=1
+ )

Function load_gms2_models_from_pickle refactored with the following changes:

Comment on lines -153 to +161
- peak_to_list_pos_dist[peak] = list()
+ peak_to_list_pos_dist[peak] = []
  peak_to_list_pos_dist[peak].append(l)

  # average positions (per peak)
  values = dict()
  peak_counter = 0
- for peak in peak_to_list_pos_dist.keys():
+ for peak in peak_to_list_pos_dist:

Function merge_spacers_by_peak refactored with the following changes:

Comment on lines -119 to +123
- elif tool == "mprodigal" or tool == "prodigal":
+ elif tool in ["mprodigal", "prodigal"]:
      gcode_per_contig = get_gcode_per_contig_for_mprodigal(pf_prediction)
  else:
      raise ValueError("Unknown tool")

- num_matches = sum([1 for v in gcode_per_contig.values() if str(v) == gcode_true])
- num_mismatches = sum([1 for v in gcode_per_contig.values() if str(v) != gcode_true])
+ num_matches = sum(1 for v in gcode_per_contig.values() if str(v) == gcode_true)
+ num_mismatches = sum(
+     1 for v in gcode_per_contig.values() if str(v) != gcode_true
+ )

Function get_accuracy_gcode_predicted refactored with the following changes:
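
Two small rewrites appear in this hunk: a chained "or" comparison becomes a membership test, and sum() takes a generator expression instead of building a temporary list. A minimal sketch with made-up predictions; the helper names and data are hypothetical:

```python
def is_prodigal_family(tool: str) -> bool:
    # Membership test replaces: tool == "mprodigal" or tool == "prodigal"
    return tool in ["mprodigal", "prodigal"]


def count_matches(gcode_per_contig, gcode_true):
    # Generator expressions feed sum() directly, with no intermediate list.
    matches = sum(1 for v in gcode_per_contig.values() if str(v) == gcode_true)
    mismatches = sum(1 for v in gcode_per_contig.values() if str(v) != gcode_true)
    return matches, mismatches


# Hypothetical per-contig genetic code predictions.
example = {"contig_1": 11, "contig_2": 4, "contig_3": 11}
result = count_matches(example, "11")
```

Because every value either matches or does not, the two counts always add up to the number of contigs.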

Comment on lines -149 to +147
- list_entries = list()
+ list_entries = []

Function compute_gcode_accuracy_for_tool_on_sequence refactored with the following changes:

Comment on lines -198 to +196
- list_entries = list()
+ list_entries = []

Function compute_gcode_accuracy_for_tools_on_chunk_deprecated refactored with the following changes:

Comment on lines -252 to +250
- list_df = list()
+ list_df = []

Function compute_gcode_accuracy_for_tools_on_chunk refactored with the following changes:

Comment on lines -310 to +308
- list_df = list()
+ list_df = []

Function compute_gcode_accuracy_for_gi refactored with the following changes:

Comment on lines -333 to +334
- # type: (Environment, GenomeInfoList, List[str], List[int], Dict[str, Any]) -> pd.DataFrame
- list_df = list()

- for gi in gil:
-     list_df.append(
-         compute_gcode_accuracy_for_gi(env, gi, tools, chunks, **kwargs)
-     )

+ list_df = [
+     compute_gcode_accuracy_for_gi(env, gi, tools, chunks, **kwargs)
+     for gi in gil
+ ]

Function compute_gcode_accuracy refactored with the following changes:

This removes the following comments ( why? ):

# type: (Environment, GenomeInfoList, List[str], List[int], Dict[str, Any]) -> pd.DataFrame

- if p_true == 0:
-     return float('inf')
- return 1.0 / p_true - 1
+ return float('inf') if p_true == 0 else 1.0 / p_true - 1

Function ratio_false_true refactored with the following changes:
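
The if/return pair collapses into a single conditional expression. A minimal sketch of the refactored body; the signature and annotations are assumptions, since the hunk shows only the function body:

```python
def ratio_false_true(p_true: float) -> float:
    # 1/p - 1 equals (1 - p)/p, i.e. the odds of "false" relative to "true".
    # The p_true == 0 case is guarded to avoid a ZeroDivisionError.
    return float("inf") if p_true == 0 else 1.0 / p_true - 1
```

For example, a true-probability of 0.25 gives a ratio of 3.0: three false outcomes for every true one.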

- logger.critical("Random-seed: {}".format(rs))
+ logger.critical(f"Random-seed: {rs}")

Function main refactored with the following changes:

Comment on lines -59 to +67
-     return list()
+     return []

  if len(list_tag_value_pairs) % 2 != 0:
      raise ValueError("Tag/value pairs list must have a length multiple of 2.")

- list_parsed = list()
- for i in range(0, len(list_tag_value_pairs), 2):
-     list_parsed.append((list_tag_value_pairs[i], list_tag_value_pairs[i + 1]))
- return list_parsed
+ return [
+     (list_tag_value_pairs[i], list_tag_value_pairs[i + 1])
+     for i in range(0, len(list_tag_value_pairs), 2)
+ ]

Function parse_tags_from_list refactored with the following changes:
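
The refactored function pairs up a flat [tag, value, tag, value, ...] list by stepping through it two items at a time. A runnable sketch assembled from the lines in the hunk; the signature and the condition guarding the early "return []" are assumptions, because the diff does not show them:

```python
def parse_tags_from_list(list_tag_value_pairs):
    # Assumption: the early return handles a missing argument.
    if list_tag_value_pairs is None:
        return []

    if len(list_tag_value_pairs) % 2 != 0:
        raise ValueError("Tag/value pairs list must have a length multiple of 2.")

    # Step by 2 so that index i is a tag and index i + 1 is its value.
    return [
        (list_tag_value_pairs[i], list_tag_value_pairs[i + 1])
        for i in range(0, len(list_tag_value_pairs), 2)
    ]


pairs = parse_tags_from_list(["tool", "mgm2", "chunk", "5000"])
```

The sample tags and values passed in above are made up for illustration.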

Comment on lines +116 to +130
elif prl_options["use-pbs"]:
# setup PBS jobs
pbs = PBS(env, prl_options, splitter=split_gil, merger=merge_identity)
pbs.run(
data={"gil": gil},
func=helper_run_mgm_on_genome_list,
func_kwargs={"env": env, "pf_mgm_mod": pf_mgm_mod, **kwargs}
)
else:
# PBS Parallelization
if prl_options["use-pbs"]:
# setup PBS jobs
pbs = PBS(env, prl_options, splitter=split_gil, merger=merge_identity)
pbs.run(
data={"gil": gil},
func=helper_run_mgm_on_genome_list,
func_kwargs={"env": env, "pf_mgm_mod": pf_mgm_mod, **kwargs}
)
# Multithreading parallelization
else:
# parallel using threads
run_n_per_thread(
list(gil), run_mgm_on_gi, "gi",
{"env": env, "pf_mgm_mod": pf_mgm_mod, **kwargs},
simultaneous_runs=prl_options.safe_get("num-processors")
)
# parallel using threads
run_n_per_thread(
list(gil), run_mgm_on_gi, "gi",
{"env": env, "pf_mgm_mod": pf_mgm_mod, **kwargs},
simultaneous_runs=prl_options.safe_get("num-processors")
)

Function run_mgm_on_genome_list refactored with the following changes:

This removes the following comments ( why? ):

# Multithreading parallelization
# PBS Parallelization

- # type: (Environment, GenomeInfoList, List[str], List[int], Dict[str, Any]) -> None
- list_df = list()
- for gi in gil:
-     list_df.append(run_tools_on_gi(env, gi, tools, chunks, **kwargs))

+ list_df = [run_tools_on_gi(env, gi, tools, chunks, **kwargs) for gi in gil]

Function run_tools_on_gil refactored with the following changes:

This removes the following comments ( why? ):

# type: (Environment, GenomeInfoList, List[str], List[int], Dict[str, Any]) -> None

Comment on lines -70 to +76
- if len(pf_predictions) == 0:
+ if not pf_predictions:
      return pd.DataFrame()

  name_to_labels = {
-     t: read_labels_from_file(pf_predictions[t], shift=-1, name=t) for t in pf_predictions.keys()
- } # type: Dict[str, Labels]
+     t: read_labels_from_file(pf_predictions[t], shift=-1, name=t)
+     for t in pf_predictions
+ }

Function stats_per_gene_for_gi refactored with the following changes:

This removes the following comments ( why? ):

#     df[f"5p-{t}"] = df[f"5p-{t}"].astype(int)
# for t in tools.keys():
# type: Dict[str, Labels]
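
The emptiness check in this hunk changes from len(...) == 0 to a truthiness test. For a dict the two agree, but they differ for None. A minimal sketch; the helper names and inputs are hypothetical:

```python
def is_empty_len(pf_predictions) -> bool:
    # Old check: requires an object that supports len().
    return len(pf_predictions) == 0


def is_empty_truthy(pf_predictions) -> bool:
    # New check: also treats None (and any other falsy value) as empty.
    return not pf_predictions
```

If callers can ever pass None, the old check raises TypeError while the new one quietly returns an empty result, so the refactor is behavior-preserving only when the argument is always a real container.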

Comment on lines -146 to -153
- list_df = list()
- for gi in gil:
-     list_df.append(stats_per_gene_for_gi(env, gi, tools, **kwargs))

- if len(list_df) == 0:
+ if list_df := [
+     stats_per_gene_for_gi(env, gi, tools, **kwargs) for gi in gil
+ ]:
+     return pd.concat(list_df, ignore_index=True, sort=False)
+ else:
      return pd.DataFrame()

- return pd.concat(list_df, ignore_index=True, sort=False)

Function helper_stats_per_gene refactored with the following changes:
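
This refactor uses an assignment expression (the ":=" operator) to build the list and test it for emptiness in one step. A minimal sketch of the same shape with plain lists instead of pandas; the worker function and the flattening step are hypothetical stand-ins for stats_per_gene_for_gi and pd.concat:

```python
def stats_for_item(item):
    """Hypothetical per-item worker returning a list of rows."""
    return [item, item * 2]


def collect_loop(items):
    # Old shape: build with append, then check the length.
    rows = []
    for item in items:
        rows.append(stats_for_item(item))
    if len(rows) == 0:
        return []
    return [x for row in rows for x in row]


def collect_walrus(items):
    # New shape: ":=" binds the comprehension result and tests it at once.
    if rows := [stats_for_item(item) for item in items]:
        return [x for row in rows for x in row]
    else:
        return []
```

Assignment expressions need Python 3.8 or newer, so this change raises the project's minimum supported version if it was lower before.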

Comment on lines -174 to -176
df = pd.concat(list_df, ignore_index=True, sort=False)

# threading

Function stats_per_gene refactored with the following changes:

This removes the following comments ( why? ):

# threading

Comment on lines -207 to +199
- tool_to_dir = {a: b for a, b in zip(tools, dn_tools)}
+ tool_to_dir = dict(zip(tools, dn_tools))

Function main refactored with the following changes:
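
The identity dict comprehension over zip() is replaced by passing the zip directly to dict(). A minimal sketch showing that both build the same mapping; the tool names come from elsewhere in this diff, while the directory names are made up:

```python
tools = ["mgm2", "mprodigal"]
dn_tools = ["runs_mgm2", "runs_mprodigal"]  # hypothetical directory names

# Old form: comprehension that just re-emits each pair.
tool_to_dir_old = {a: b for a, b in zip(tools, dn_tools)}

# New form: dict() consumes the (key, value) pairs from zip() directly.
tool_to_dir = dict(zip(tools, dn_tools))

# zip() stops at the shorter input, so a missing directory drops the tool silently.
truncated = dict(zip(["mgm2", "mprodigal"], ["runs_mgm2"]))
```

Neither form checks that the two lists have equal length, so that validation, if needed, has to happen separately.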
