[Proof-of-concept] Added configurable reduction step after longform chunk transcript generation #205
Hey, thanks for this PR, it's a clever idea. However, I wonder whether it defeats the purpose of the longform feature. Thanks!
You make a good point. The current long-form transcript is generated iteratively and, at least with the Gemini model, that leads to a lot of duplicated lines across chunks, for whatever reason. The idea is that a post-processing step that looks over the final transcript holistically can find these issues and fix them all at once. This will need to be verified by experimentation. Perhaps the other LLMs do a better job of not duplicating lines. It should probably be off by default and toggleable through configuration.
podcastfy/content_generator.py (outdated)

rewrite_prompt = PromptTemplate(
    input_variables=["transcript", "analysis"],
    template=config.get("rewrite_prompt_template", "Rewrite the podcast transcript based on the following recommendations: \n\n{analysis}\n\nOriginal Transcript: \n\n{transcript}\n\nRewritten Transcript:")
Should probably be off by default.
# Run rewriting chain
llm = self.llm

analysis_prompt = PromptTemplate(
Should probably be off by default.
@@ -503,6 +503,7 @@ def clean(self,
# Then apply additional long-form specific cleaning
return self._clean_transcript_response(standard_clean, config)

def _clean_transcript_response(self, transcript: str, config: Dict[str, Any]) -> str:
Perhaps this code should not be in the _clean_transcript_response function.
Example:
It does seem to shorten the transcript a little TOO much as of now.
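@souzatharsis I guess the difference between shortform and longform-with-post-processing is:
- shortform can only give me podcasts under 3 minutes long
- longform with NO post-processing results in longer podcasts, but with many repetitions
- longform-with-post-processing can give me 5 to 8 minutes, with a cleaner transcript (fewer repetitions), with the following settings:
  "max_num_chunks": 7,
  "min_chunk_size": 800,
Hope that makes sense. Ideally there's a way to consistently set a desired depth or length of podcast without it degenerating into repetition.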
Would you be able to share an example input that is generating a longform output with repetition so I can reproduce the issue?
…On Mon, Dec 2, 2024, 10:22 PM Ivan Cheung ***@***.***> wrote:
@souzatharsis <https://github.com/souzatharsis> I guess the difference between shortform and longform-with-post-processing is:
- shortform can only give me <3min length podcasts
- longform with NO post-processing results in longer podcasts, but with many repetitions.
- longform-with-post-processing can give me 5 to 8 minutes, with a cleaner transcript (less repetitions) with the following settings:
  "max_num_chunks": 7,
  "min_chunk_size": 800,
Hope that makes sense.
Ideally there's a way to consistently set a desired depth or length of podcast without it degenerating into repetition.
@souzatharsis take a look at this: Code
Transcript
Semantically Duplicate Sentences from the Conversation:
1. Introduction to the Research Focus
2. Exploring Merging vs. Mixing Training Data
3. Merging Improves Safety and Performance
4. The Four Merging Techniques
5. SLERP as the Best Trade-off
6. Language-based Merging Outperforms Mixed-language Merging
9. Trade-offs in Merging Techniques
I "think" that if you have repeating sections in the "Dialogue structures" in the config, it may increase the chance of repetition in the transcript.
Thank you for the incredibly detailed analysis! I think the issue might be that repetition in the input leads to repetition in the output. Input papers often follow an intro, main body, conclusion structure, and the intro and conclusion usually repeat points from the main body in summarized form. That may lead the longform podcast to repeat those points. The solution may be to (i) improve the prompt or (ii) do post-processing after per-chunk response generation via one LLM call conditioned on previous responses.
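One possible reading of option (ii) can be sketched as follows. This is only an illustration, not podcastfy code: `dedup_chunk_responses`, the prompt wording, and the `llm` callable (any function mapping a prompt string to a completion string) are all hypothetical names chosen for the example.

```python
from typing import Callable, List


def dedup_chunk_responses(chunk_responses: List[str], llm: Callable[[str], str]) -> str:
    """Revise each chunk response against the dialogue accepted so far.

    Hypothetical helper: `llm` is an assumed prompt -> text callable.
    """
    accepted: List[str] = []
    for response in chunk_responses:
        if not accepted:
            # The first chunk has nothing earlier to repeat, so keep it as-is.
            accepted.append(response)
            continue
        prompt = (
            "Earlier dialogue:\n" + "\n".join(accepted)
            + "\n\nNew segment:\n" + response
            + "\n\nRewrite the new segment, removing points already made in "
            "the earlier dialogue. Keep everything else."
        )
        # One LLM call per chunk, conditioned on all previously accepted text.
        accepted.append(llm(prompt).strip())
    return "\n".join(accepted)
```

Because each call only rewrites a single chunk, the output-token limit applies per chunk rather than to the whole transcript, which is the concern raised about a single post-processing pass.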
Sounds good! I'll play with the prompt to see if it can be mitigated. Great work with the project! I see other people with a similar question: #202
I would love to see this get merged! I am seeing this happen all the time; I am currently generating ~10 podcasts a week to test.
Maybe just implement this as a flag? --post_process_cleanup
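A flag like this could look roughly as follows. The sketch uses the standard-library `argparse` purely for illustration; it is not podcastfy's actual CLI, and `build_parser` is a hypothetical name. The flag defaults to off, in line with the review comments above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI with an opt-in post-processing flag."""
    parser = argparse.ArgumentParser(description="Illustrative CLI sketch")
    parser.add_argument(
        "--post_process_cleanup",
        action="store_true",  # absent -> False, so cleanup is off by default
        help="Run the LLM reduction step over the final longform transcript.",
    )
    return parser
```

With this shape, `build_parser().parse_args(["--post_process_cleanup"]).post_process_cleanup` is `True`, and omitting the flag leaves it `False`.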
Hi
But wouldn't this PR defeat the purpose of longform, since running a single LLM post-processing call reduces the output size dramatically, effectively turning long form back into short form again?
Unless you are observing repetition in short form too.
Would love your feedback.
Best,
Thársis
…On Sun, Dec 8, 2024, 9:03 PM jtoy ***@***.***> wrote:
maybe just implement this as a flag? --post_process_cleanup
This PR is essentially a hack, but longform doesn't really work. Are people using longform successfully without duplicates?
So I understand the issue with this PR, which is that the limited number of output tokens would constrain the final longform transcript. I developed a better system that processes the source material outside of podcastfy in chunks and extracts key information, with a final reduction step that dedups the information. The result is then passed into podcastfy. It's similar to @jtoy's suggestion, but the key is that the reduction step must not constrain the quantity of information. I made the reduction produce a git-diff-like edit list which gets applied deterministically.
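The comment does not show the edit-list format, so the following is only a guess at the general shape of "LLM proposes edits, code applies them". The `("delete", index)` and `("replace", index, text)` tuples and the `apply_edits` name are hypothetical.

```python
from typing import List, Tuple

# Hypothetical edit format: ("delete", line_index) or ("replace", line_index, new_text).
Edit = Tuple


def apply_edits(lines: List[str], edits: List[Edit]) -> List[str]:
    """Apply an edit list to a copy of `lines` without calling an LLM."""
    result = list(lines)
    # Work from the highest index down so deletions do not shift
    # the positions that the remaining edits refer to.
    for edit in sorted(edits, key=lambda e: e[1], reverse=True):
        op, idx = edit[0], edit[1]
        if not 0 <= idx < len(result):
            raise IndexError(f"edit refers to missing line {idx}")
        if op == "delete":
            del result[idx]
        elif op == "replace":
            result[idx] = edit[2]
        else:
            raise ValueError(f"unknown edit operation: {op}")
    return result
```

Since the model only has to emit the edits, its output size scales with the number of duplicates rather than with the length of the material, which is what keeps the reduction from shrinking the content.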
A configurable post-processing/reduction step via LLM.
First calls the LLM to analyze and recommend changes/dedups.
Second call applies recommendations.
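The two calls described above can be sketched without any framework. This is a minimal illustration under stated assumptions, not the PR's code: the rewrite template is the default from the diff, while `reduce_transcript`, the `post_process` and `analysis_prompt_template` config keys, the analysis prompt wording, and the `llm` callable are hypothetical.

```python
from typing import Any, Callable, Dict

# Assumed wording; the PR's actual analysis prompt is not shown in the diff.
DEFAULT_ANALYSIS_PROMPT = (
    "Analyze the podcast transcript below and list duplicated or redundant "
    "lines that should be merged or removed.\n\n"
    "Transcript:\n\n{transcript}\n\nAnalysis:"
)
# Default taken from the rewrite_prompt_template shown in the diff.
DEFAULT_REWRITE_PROMPT = (
    "Rewrite the podcast transcript based on the following recommendations: "
    "\n\n{analysis}\n\nOriginal Transcript: \n\n{transcript}\n\nRewritten Transcript:"
)


def reduce_transcript(
    transcript: str, config: Dict[str, Any], llm: Callable[[str], str]
) -> str:
    """Optionally run an analyze-then-rewrite pass over the full transcript."""
    if not config.get("post_process", False):  # hypothetical key, off by default
        return transcript
    analysis_prompt = config.get("analysis_prompt_template", DEFAULT_ANALYSIS_PROMPT)
    rewrite_prompt = config.get("rewrite_prompt_template", DEFAULT_REWRITE_PROMPT)
    # First call: ask for recommendations only.
    analysis = llm(analysis_prompt.format(transcript=transcript))
    # Second call: apply the recommendations to the original transcript.
    return llm(rewrite_prompt.format(transcript=transcript, analysis=analysis))
```

Note that the second call regenerates the whole transcript, so its length is bounded by the model's output-token limit, which is the limitation discussed in the thread.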