[Proof-of-concept] Added configurable reduction step after longform chunk transcript generation #205

Closed
wants to merge 2 commits into from


@ivanmkc ivanmkc commented Dec 2, 2024

A configurable post-processing/reduction step via LLM.

First calls the LLM to analyze and recommend changes/dedups.
Second call applies recommendations.
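The two calls described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation: `reduce_transcript`, the `llm` callable, and the `analysis_prompt_template` key with its default text are hypothetical names chosen for the example. Only the `rewrite_prompt_template` key and its default wording come from the diff quoted in the review thread.

```python
from typing import Any, Callable, Dict

# Default wording taken from the rewrite prompt quoted in this PR's review thread.
DEFAULT_REWRITE_TEMPLATE = (
    "Rewrite the podcast transcript based on the following recommendations: "
    "\n\n{analysis}\n\nOriginal Transcript: \n\n{transcript}\n\nRewritten Transcript:"
)

# Hypothetical default: the PR's real analysis prompt is not shown in this thread.
DEFAULT_ANALYSIS_TEMPLATE = (
    "Analyze the podcast transcript below and recommend changes that remove "
    "duplicated or redundant lines.\n\nTranscript:\n\n{transcript}\n\nAnalysis:"
)


def reduce_transcript(
    transcript: str,
    llm: Callable[[str], str],  # assumption: any function mapping a prompt to a completion
    config: Dict[str, Any],
) -> str:
    """Run the analyze-then-rewrite reduction over a full transcript."""
    # First call: ask the model for recommendations (changes and dedups).
    analysis_template = config.get("analysis_prompt_template", DEFAULT_ANALYSIS_TEMPLATE)
    analysis = llm(analysis_template.format(transcript=transcript))

    # Second call: apply those recommendations to the original transcript.
    rewrite_template = config.get("rewrite_prompt_template", DEFAULT_REWRITE_TEMPLATE)
    return llm(rewrite_template.format(analysis=analysis, transcript=transcript))
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client; in the PR itself the prompts are built with `PromptTemplate` and run against `self.llm`.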

@ivanmkc ivanmkc force-pushed the imkc--longform-reducer branch 2 times, most recently from ee81435 to 3afba34 Compare December 2, 2024 19:43
@souzatharsis
Owner

Hey, thanks for this PR, it's a clever idea. However, I wonder whether it defeats the purpose of the longform feature.
Wouldn't adding an LLM reduction step after longform effectively turn the response back into shortform?
Curious about your ideas.

thanks!

@ivanmkc
Author

ivanmkc commented Dec 2, 2024

You make a good point.

The current long-form transcript is generated iteratively and, at least with the Gemini model, this leads to a lot of duplicated lines across chunks, for whatever reason.

The idea is that a post-processing step that looks over the final transcript holistically can find these issues and fix them all at once. This will need to be verified by experimentation. Perhaps the other LLMs do a better job of not duplicating lines.

It should probably be off by default and toggle-able through configuration.
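One way the experiment mentioned above might be set up is to count repeated speaker turns before and after post-processing. The sketch below is only an assumption about how to measure this: `duplicated_turns` and `normalize` are hypothetical helpers, and the tag format is inferred from the `<Person1>`/`<Person2>` transcripts in this thread. It catches only turns that are identical after normalization, so paraphrased repeats would need fuzzy matching on top.

```python
import re
from collections import Counter
from typing import List, Tuple

# Matches one speaker turn, e.g. <Person1>...</Person1>; DOTALL lets a turn span lines.
TURN_RE = re.compile(r"<(Person[12])>(.*?)</\1>", re.DOTALL)


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace so trivial variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def duplicated_turns(transcript: str) -> List[Tuple[str, int]]:
    """Return (normalized turn, count) for every turn that occurs more than once."""
    counts = Counter(normalize(text) for _, text in TURN_RE.findall(transcript))
    return [(turn, n) for turn, n in counts.items() if n > 1 and turn]
```

Running this on the same source material with different models would give a rough, comparable duplicate count per model.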


rewrite_prompt = PromptTemplate(
input_variables=["transcript", "analysis"],
template=config.get("rewrite_prompt_template", "Rewrite the podcast transcript based on the following recommendations: \n\n{analysis}\n\nOriginal Transcript: \n\n{transcript}\n\nRewritten Transcript:")
Author

Should probably be off by default.

# Run rewriting chain
llm = self.llm

analysis_prompt = PromptTemplate(
Author

Should probably be off by default.

@@ -503,6 +503,7 @@ def clean(self,
# Then apply additional long-form specific cleaning
return self._clean_transcript_response(standard_clean, config)


def _clean_transcript_response(self, transcript: str, config: Dict[str, Any]) -> str:
Author

Perhaps this code should not be in the `_clean_transcript_response` function.

@ivanmkc
Author

ivanmkc commented Dec 2, 2024

Example:

2024-12-02 15:52:54,530 - DEBUG - Successfully analyzed transcript: 

## Analysis and Suggestions for Podcast Transcript Improvement

This transcript suffers from significant repetition, particularly in the latter half, as the speakers recap and summarize multiple times.  The flow can be improved by consolidating these sections and removing redundant information.  Here's a breakdown with suggestions:

**General Flow and Repetition:**

* **Large Sections of Recap (Lines 157 onwards):**  The multiple recap sections toward the end are excessive.  Most of the information is already covered earlier.  Condense these into a single, concise summary highlighting the key takeaways. Aim to keep the ending focused on the core contributions and open questions, rather than re-explaining the techniques in detail.

* **Repetitive Questioning (Throughout):**  Person2 often asks questions that simply rephrase what Person1 just said. This adds little value and slows down the pace.  Streamline these questions to be more insightful or to prompt deeper discussion. For example, instead of "So they can see what the model is actually “looking” at?", perhaps a question like, "How does this visual interface compare to other interpretability methods?" would be more engaging.

* **Overuse of Filler Words (Lines 1, 3):**  Reduce filler words like "uh," "um," "you know," and "like." This will make the podcast sound more professional and polished.

**Specific Instances of Redundancy:**

* **Lines 18-22:**  The discussion about attributing positive value to coins and negative value to enemies is restated multiple times. Keep the initial explanation and remove the subsequent rephrasing.

* **Lines 33-40:** The "hallucination" concept is explained twice in quick succession.  Combine these explanations into a single, clear description.

* **Lines 45-64, and repeated later:** The circuit editing and its limitations (not achieving complete blindness) are discussed multiple times throughout the transcript.  Consolidate these points into a single section.

* **Lines 65-118:** The discussion about the diversity hypothesis and gradient-based visualization is particularly repetitive.  Streamline this section significantly, focusing on the core issue: gradient-based methods don't work well due to the lack of visual diversity and reliance on pixel shortcuts.

* **Lines 120-156 and repeated later:**  The description of the interface, sparsity, unexpected signs, outlier frames, and open questions are repeated towards the end.  Remove the second instance and ensure the first explanation is clear and comprehensive.

* **Appendix Discussions (Lines 182 onwards, and repeated later):** The discussions about the appendices are repeated nearly verbatim.  Condense these into a single mention of the supplementary material and what each appendix covers, without going into excessive detail.


**Suggestions for Improved Coherency:**

* **Transitioning Between Topics:**  Use clearer transitions between different aspects of the paper.  For example, when moving from the interface discussion to the diversity hypothesis, use a bridging sentence like, "While this interface provides valuable insights, the researchers also explored the impact of training data diversity on interpretability."

* **Focus on Key Takeaways:**  Emphasize the most important contributions of the paper, such as the novel interface, the circuit editing technique, and the diversity hypothesis.  Avoid getting bogged down in overly technical details.

* **Engage the Listener:**  Pose open-ended questions to the audience or offer perspectives on the implications of the research. This will make the podcast more interactive and thought-provoking.


By addressing these issues, the podcast can become significantly more concise, engaging, and easier to follow.  The key is to remove redundancy, improve transitions, and maintain a clear focus on the most important aspects of the research.

2024-12-02 15:52:54,531 - DEBUG - Executing rewriting chain
2024-12-02 15:53:16,138 - DEBUG - Successfully rewrote transcript
2024-12-02 15:53:16,139 - DEBUG - Successfully rewrote transcript, BEFORE = 

<Person1>Welcome to AI Digest - Your daily AI research review. Uh, today we're diving into, um, this fascinating 2020 Distill publication, "Understanding RL Vision."
</Person1><Person2>Right.  A really interesting look at interpretability techniques applied to a reinforcement learning model.  Specifically, one trained to play… CoinRun?
</Person2><Person1>Yeah, CoinRun.  It's a, you know, side-scrolling platformer.  The agent has to, like, navigate obstacles,  dodge enemies, and grab a coin at the end of the level.
</Person1><Person2>Classic platformer setup. So, how are they tackling interpretability here?
</Person2><Person1>Well, they use attribution combined with dimensionality reduction—specifically, non-negative matrix factorization, or NMF.  This lets them build an interface to explore how the model detects objects and how those objects influence its decisions.
</Person1><Person2>So they can see what the model is actually “looking” at?
</Person2><Person1>Exactly. And not just what it sees, but how those things affect its value function and policy.  Like, is it seeing a coin and thinking, “Good news!” or an enemy and going “Oh no!”
</Person1><Person2>I see. So, like, attributing positive value to the coin and negative value to the enemy?
</Person2><Person1>Precisely. And the NMF helps break down those attributions into components that correspond to different types of objects. They even color-code them in the interface.
</Person1><Person2>Clever. This sounds very visual.  Are they able to use this to diagnose problems with the model?
</Person2><Person1>Oh, absolutely.  They analyze cases where the agent fails, and the interface helps them pinpoint why.  Sometimes it's just bad luck, you know, unlucky action sampling. But other times…
</Person1><Person2>Other times it’s something more systematic?
</Person2><Person1>Right.  Like, an obstacle being temporarily hidden.  Or even the model “hallucinating” features that aren’t there.  They call these “hallucinations.”
</Person1><Person2>Hallucinations? So the model is, what, seeing things that aren’t actually in the game?
</Person2><Person1>Yeah, like maybe a random shape in the background gets interpreted as a coin. It's fascinating. These hallucinations usually don’t affect the agent's behavior, though, because they're so brief.
</Person1><Person2>Interesting. But temporary glitches can still reveal a lot about how the model works, right?
</Person2><Person1>For sure.  And get this – they even edit the model weights to intentionally blind it to certain hazards. Like, making it ignore buzzsaws or left-moving enemies.
</Person1><Person2>Wait, they’re editing the model itself?  How does that work?
</Person2><Person1>It's a kind of “circuit editing,” as they call it. They project out the NMF direction corresponding to the feature they want to eliminate. Pretty ingenious, right?
</Person1><Person2>Yeah, that’s wild.  So, does it actually work? Does the agent become blind to these hazards?
</Person2><Person1>It does, to a significant extent.  They verify it by checking which hazards cause the edited agents to fail. It's a form of quantitative validation of their analysis. Though, interestingly, they couldn’t achieve complete blindness.
</Person1><Person2>Meaning…?
</Person2><Person1>The model seems to have other, less interpretable ways of detecting these hazards, even when the primary feature is suppressed.
</Person1><Person2>So there’s more going on under the hood than their interface reveals? Hmm, that’s something to consider. What about this “diversity hypothesis” they mention?
</Person2><Person1>So, they found these, uh… “hallucinations,” right? Where the model briefly misinterprets something in the background.
</Person1><Person2>Yeah, like that curved shape it thought was a coin, causing a spike in the value function.  It’s like a temporary visual glitch.
</Person2><Person1>Exactly.  And they use these glitches to, you know, understand how the model works.  But they don't stop there. They go a step further and actually edit the model.
</Person1><Person2>Oh?  The "circuit editing" they mentioned?
</Person2><Person1>Right.  They basically try to blind the model to specific hazards—buzzsaws and left-moving enemies.
</Person1><Person2>So, like, surgically removing the model’s ability to see those things?
</Person2><Person1>Essentially.  And it partially works. The edited models fail more often due to those specific hazards.
</Person1><Person2>But not completely blind, you said?
</Person2><Person1>Yeah, the model seems to have backup systems, other ways of detecting those hazards, even without the main feature.  It's like… redundancy.
</Person1><Person1>Makes it more robust, I suppose. But less interpretable.
</Person1><Person2>Right. So, what about this diversity hypothesis?  Something about training data diversity and interpretability?
</Person2><Person1>So, uh, they tried a bunch of things to make gradient-based feature visualization work with this CoinRun model, right?
</Person1><Person2>Yeah, like jittering, rotating, scaling the image… the whole nine yards.
</Person2><Person1>But, uh, nothing seemed to work.  It just didn't produce good visualizations, especially for the, you know, intermediate layers.
</Person1><Person2>Right. It only worked for the first layer, which is… interesting.
</Person2><Person1>Yeah, and they compared it to ImageNet, where gradient-based visualization works pretty well.  Big difference.
</Person1><Person2>So, CoinRun is a tougher nut to crack, visually speaking.
</Person2><Person1>Seems that way. They even tried penalizing extreme colors, you know, because the visualizations were kind of… garish.
</Person1><Person2>Oh? And did that help?
</Person2><Person1>Not really.  It just shrunk the really bright areas, but didn't make the visualizations any more interpretable.
</Person1><Person2>Huh.  What about different optimization objectives?  Like, uh… the caricature objective?
</Person2><Person1>Tried that too.  No dice.  Also tried dimensionality reduction, but… nothing.
</Person1><Person2>So, they threw the kitchen sink at it, and still no good visualizations.  What gives?
</Person2><Person1>Well, they think it's because CoinRun doesn't actually require that much visual processing. The model can get away with using, like, simple pixel configurations as shortcuts.
</Person1><Person2>Ah, so it's not really "seeing" the game in the same way we do?
</Person2><Person1>Exactly.  And these shortcuts work fine on the training data, but they fall apart when you start exploring the wider image space, like gradient-based methods do.
</Person1><Person2>Makes sense.  So, what about this "diversity hypothesis" they mention?  How does that fit in?
</Person2><Person1>Well, they say the lack of low-level visual diversity in the training data might be hurting interpretability at all levels.
</Person1><Person2>Interesting. Because the gradient has to backpropagate through those early layers, right?
</Person2><Person1>Yeah, something like that. They also suggest that diversity needs to be considered in the context of the task.
</Person1><Person2>Meaning…?
</Person2><Person1>Like, just increasing the visual diversity of the training data might not be enough if the task itself doesn't require the model to actually use that diversity.
</Person1><Person1>So, they built this interface, right?  Overlaying visualizations on the game screen.  Pretty slick.
</Person1><Person2>Yeah, using feature visualization to show what each channel is picking up.  Gives you a sense of what the model's "seeing."
</Person2><Person1>And they extend it to whole trajectories, adding video controls and a timeline.  Makes it much easier to analyze the agent's behavior.
</Person1><Person2>Right.  They also talk about some interesting properties of the attributions, like how sparse they are.
</Person2><Person1>Sparse?  You mean concentrated in a few spots?
</Person1><Person2>Exactly.  Just a handful of position-channel pairs account for most of the attribution. They think it’s because the model uses simple pixel configurations to identify objects.
</Person2><Person1>Kind of like shortcuts, then? Instead of, like, a full-blown object recognition system.
</Person1><Person2>Yeah. And because of this sparsity, they smooth the attribution out a bit in the interface.
</Person2><Person1>Makes sense. Easier to see the important areas. But they also mention unexpected signs in the attribution.
</Person1><Person2>Right, like sometimes a buzzsaw gets positive attribution, even though it’s a hazard.  They suspect it’s due to statistical collinearity.
</Person2><Person1>Collinearity?  So, correlations in the level generation, like coins and walls always appearing at the end?
</Person1><Person2>Yeah, something like that. The model might be picking up on these correlations and assigning value in unexpected ways.
</Person2><Person1>Interesting. And what about those outlier frames?  The ones where the attribution goes haywire?
</Person1><Person2>They don’t have a great explanation, but they think it might be due to complex feature interactions. Plus, there's often a large residual component outside the NMF directions.
</Person2><Person1>So, the NMF isn't capturing everything that's going on in those cases?  Makes you wonder how much we can really trust the attributions, right?
</Person1><Person2>For sure.  It highlights the limitations of these techniques.  Then they go on to discuss some open questions, like the "diversity hypothesis.”
</Person2><Person1>Ah, yes. The idea that low-level visual diversity in the training data affects interpretability at all levels.
</Person1><Person2>Right.  They also wonder about the relationship between diversity, interpretable features, and generalization.
</Person2><Person1>And how to quantify the amount of diversity needed for interpretable features.  Tricky stuff.
</Person1><Person2>Definitely.  They also raise questions about interpretability in the absence of diversity, like how to cope with non-diverse levels of abstraction.
</Person2><Person1>And why gradient-based visualization breaks down in those cases.  Plus, the trustworthiness of dataset examples and attribution.
</Person1><Person2>Lots of unanswered questions.  They also talk about interpretability in the RL framework specifically, like interpreting non-visual features and improving the reliability of RL models.
</Person2><Person1>So, uh, they really emphasize the importance of interpretability research being, like, scalable, trustworthy, exhaustive, and low-cost.
</Person1><Person2>Yeah, makes sense.  If we want these techniques to be actually useful, they can't just work on toy problems, right?
</Person2><Person1>Exactly.  And they have to give us reliable information. No use if the explanations are misleading.
</Person1><Person2>Right.  And ideally, we'd want to understand everything the model's doing, not just a small piece of the puzzle.
</Person2><Person1>Totally.  Plus, it can't be so computationally expensive that no one can actually use it.  Low cost is key.
</Person1><Person2>For sure.  And they tie these criteria back to their questions about diversity, particularly exhaustiveness.
</Person2><Person1>Yeah, they argue that if non-diverse features are still present even with diversity, then our current methods might be missing something important.
</Person1><Person2>Makes sense.  We might be overlooking key aspects of how these models work.
</Person2><Person1>They seem to think that just applying existing techniques more rigorously could yield valuable insights, especially as models tackle more complex tasks.
</Person1><Person2>Kind of like early biologists with a magnifying glass, they say.  Just observing and carefully analyzing what's already there.
</Person2><Person1>And they provide a ton of supplementary material, too.  Code, model weights, interfaces… the whole nine yards.
</Person1><Person2>Oh?  So, other researchers can easily build on their work? That's great.
</Person2><Person1>Absolutely.  They even detail their model editing method in Appendix A.  The whole projection matrix thing, for blinding the agent to specific features.
</Person1><Person2>Right, the "circuit editing."  So they're actually manipulating the model's weights directly.
</Person2><Person1>Yeah, pretty ingenious. And in Appendix B, they dive into integrated gradients for a hidden layer, explaining how they attribute the value function.
</Person1><Person2>So, breaking down how the model assigns value to different parts of the input, right?  The "good news" and "bad news" signals.
</Person2><Person1>Exactly.  They use a baseline value and calculate how each activation contributes to the difference from that baseline.
</Person1><Person2>Interesting.  So, they're really trying to understand how the model makes decisions, step by step.
</Person2><Person1>So, to recap, they really dug into how this RL model "sees" the game, using attribution and NMF, building this, uh… really cool interface.
</Person1><Person2>Yeah, like overlaying the visualizations right onto the game screen.  Super helpful for understanding what the model's focusing on.
</Person2><Person1>And they could even edit the model!  Blinding it to certain hazards, like buzzsaws.  Circuit editing, they called it.
</Person1><Person2>Right, manipulating the weights based on the NMF components.  Pretty wild. But it didn't achieve complete blindness, right?
</Person2><Person1>Yeah, the model had these… backup systems. Other ways of detecting the hazards.  Redundancy, I guess.
</Person1><Person2>Makes it robust, but less interpretable.  So, then they went into these open questions, like the diversity hypothesis.
</Person2><Person1>The whole thing about low-level visual diversity in the training data impacting interpretability at all levels.  And how to even quantify that.
</Person1><Person2>Right. And how interpretability changes in the absence of diversity.  Lots of tricky questions there.
</Person2><Person1>They also talked about scaling these interpretability techniques to more complex tasks, the importance of trustworthiness, exhaustiveness…
</Person1><Person2>Yeah, making sure these methods are actually useful in practice.  Not just for toy problems.
</Person2><Person1>And then there's all the supplementary material.  Code, weights, interfaces… everything you need to build on their work.
</Person1><Person2>Oh, and they detail the model editing in Appendix A, right? The projection matrix stuff.
</Person2><Person1>Exactly. And Appendix B breaks down the integrated gradients, how they attribute the value function.  The "good news, bad news" signals.
</Person1><Person2>Step-by-step decision-making.  Pretty thorough.  What about Appendix C?  The architecture details?
</Person2><Person1>Oh, right.  They lay out all the layers.  Convolutional, pooling, dense… the whole shebang.  It's based on IMPALA, but with some tweaks.
</Person1><Person2>Tweaks for interpretability, they said.  Fewer convolutional layers, more dense layers…
</Person2><Person1>Yeah, and no residual connections.  Plus, L2 pooling instead of max pooling.  For smoother gradients, I think.
</Person1><Person2>Interesting.  They really considered interpretability from the ground up, huh? Even in the architecture itself.
</Person2><Person1>So, uh, Appendix C, right?  They get into the nitty-gritty of the model architecture.
</Person1><Person2>Yeah, all the layers, convolutional, pooling, dense… the whole nine yards.  It's based on IMPALA, but with some modifications.
</Person2><Person1>They tweaked it for interpretability, right? Fewer convolutional layers, more dense layers.
</Person1><Person2>And no residual connections. Plus, they used L2 pooling instead of max pooling.  For smoother gradients, they said.
</Person2><Person1>Makes sense.  They really thought about interpretability from the ground up, huh? Even in the architecture itself.
</Person1><Person2>Yeah.  And they mention using the caricature objective for feature visualization, too.  Maximizing the dot product between activations.
</Person2><Person1>Oh, right.  Like comparing the input image activations to a reference image. A good way to get a first look at the features.
</Person1><Person2>They also talk about NMF, non-negative matrix factorization.  For dimensionality reduction.
</Person2><Person1>Yeah, flattening the spatial dimensions of the activations.  So, no spatial info used in the dimensionality reduction. Footnote 14, I think.
</Person1><Person2>And they sample observations from gameplay, compute attributions, flatten, then apply NMF.  Footnote 15.
</Person2><Person1>They even have a workaround for the positive and negative attributions.  Concatenating them along the batch dimension.  Or the channel dimension, I guess. Footnote 16.
</Person1><Person2>Lots of clever tricks.  And they discuss generalization too, using CoinRun levels and the distribution of procedurally-generated levels. Footnote 17.
</Person2><Person1>But they acknowledge the need for a wider class of games to really formalize the diversity, or lack thereof, in CoinRun.
</Person1><Person2>Right.  And a warning about heavily regularized feature visualization, how it might not be trustworthy. Footnote 18.
</Person2><Person1>Could be confusing correlation with causation.  Tricky stuff.  They also get into integrated gradients, using a baseline value. Footnote 19.
</Person1><Person2>And they numerically approximate the integral, evaluating the integrand at different alpha values.  Footnote 20, if I remember correctly.
</Person2><Person1>So, wrapping up our deep dive into "Understanding RL Vision," this Distill 2020 publication, uh… really fascinating stuff.
</Person1><Person2>Yeah, definitely.  A really insightful look into interpretability in reinforcement learning, specifically with their CoinRun model.
</Person2><Person1>Right.  Using that platformer game as a testbed, they explored how the model "sees" the game world, using attribution and NMF.
</Person1><Person2>That interface they built, overlaying the visualizations on the game screen, was pretty slick.  Really helped visualize what the model was focusing on.
</Person2><Person1>And the circuit editing!  Actually manipulating the model weights to blind it to hazards.  Ingenious, but…
</Person1><Person2>But not completely effective.  The model had those backup systems, other ways of detecting the hazards.  Redundancy, making it robust but less interpretable.
</Person2><Person1>Exactly.  And then those open questions they raised.  The diversity hypothesis, how low-level visual diversity in the training data affects interpretability.
</Person1><Person2>Yeah, and how to quantify that diversity.  Plus, the challenges of interpretability in the absence of diversity.  Lots of food for thought there.
</Person2><Person1>They also stressed the need for scalable, trustworthy, exhaustive, and low-cost interpretability techniques.  If we want these methods to be practical, right?
</Person1><Person2>Absolutely.  No use if they only work on toy problems or give us misleading explanations.
</Person2><Person1>And all that supplementary material! Code, weights, interfaces… really makes it easy for other researchers to build on their work. 
</Person1><Person2>Yeah, the detailed explanations of the model editing in Appendix A, the integrated gradients in Appendix B, and the architecture details in C.  They really covered all the bases.  Even tweaking the IMPALA architecture for interpretability.
</Person2><Person1>Fewer convolutional layers, more dense layers, no residual connections, L2 pooling… they really thought about it from the ground up.
</Person1><Person2>Right.  Using the caricature objective, NMF for dimensionality reduction, workarounds for positive and negative attributions…  A lot of clever tricks.
</Person2><Person1>And their discussion of generalization, using CoinRun levels and procedurally generated ones.  Though, they acknowledge the limitations and the need for more diverse games.
</Person1><Person2>For sure.  Plus, the warnings about regularized feature visualization and the potential for confusing correlation with causation.  Important caveats.
</Person2><Person1>It really highlights the complexities of interpretability research in RL. Still so much to explore.
</Person1><Person2>Definitely. But this paper provides a solid foundation and a lot of promising directions for future work. A great read for anyone interested in understanding how RL agents "see" the world.
</Person2><Person1>Thanks for tuning in to AI Digest, and a special shout-out to our listeners. Catch you next time for another deep dive into the world of AI.  Bye everyone!
</Person1><Person2>See you all next time!</Person2>
2024-12-02 15:53:16,139 - DEBUG - Successfully rewrote transcript, AFTER = 

<Person1>Welcome to AI Digest - Your daily AI research review. Today we're diving into this fascinating 2020 Distill publication, "Understanding RL Vision."
</Person1><Person2>This paper offers an interesting look at interpretability techniques applied to a reinforcement learning model trained to play CoinRun.
</Person2><Person1>CoinRun is a side-scrolling platformer where the agent navigates obstacles, dodges enemies, and collects a coin.  The researchers use attribution combined with non-negative matrix factorization (NMF) to build an interface exploring how the model detects objects and how they influence its decisions.
</Person1><Person2>How does this visual interface compare to other interpretability methods?
</Person2><Person1>It's a novel approach. It allows us to see not just what the model sees, but how those objects affect its value function and policy.  For example, whether a coin evokes a "Good news!" response or an enemy triggers an "Oh no!" reaction. The NMF helps decompose attributions into components corresponding to different object types, which are color-coded in the interface.
</Person1><Person2>Can this be used to diagnose problems?
</Person2><Person1>Absolutely.  Analyzing agent failures reveals instances of bad luck, hidden obstacles, or even "hallucinations" where the model misinterprets background shapes as coins. These hallucinations are usually brief and don't significantly impact behavior.  Even more interesting, the researchers use "circuit editing" to project out NMF directions and intentionally blind the model to specific hazards like buzzsaws.  While effective, it doesn't achieve complete blindness, suggesting the model has other, less interpretable ways of detecting these hazards.
</Person1><Person2>So the interface doesn't reveal the full picture? What about the "diversity hypothesis"?
</Person2><Person1>The researchers explored why gradient-based feature visualization techniques, even with modifications like jittering and scaling, failed to produce interpretable visualizations for CoinRun, especially in intermediate layers. They hypothesize that the lack of low-level visual diversity in the training data might be hindering interpretability.  CoinRun's simplicity allows the model to rely on pixel configuration shortcuts, which work well on training data but break down when exploring the wider image space, as gradient-based methods do. This also suggests that diversity needs to be considered within the task context; simply increasing visual diversity might not be enough if the task doesn't require its utilization.
</Person1><Person2>So, back to the interface, what specific insights did it provide?
</Person2><Person1>The interface overlays visualizations on the game screen, extending to whole trajectories with video controls and a timeline.  The attributions are sparse, with a few position-channel pairs accounting for most of the attribution, likely due to the model's reliance on simple pixel configurations.  This sparsity is smoothed in the interface for clarity.  Interestingly, some attributions have unexpected signs, like buzzsaws receiving positive attribution, potentially due to statistical collinearities in level generation.  Outlier frames with erratic attribution remain less understood, possibly arising from complex feature interactions or the NMF not capturing the full picture.
</Person1><Person2>It seems like there are still many open questions.
</Person2><Person1>Yes, the researchers highlight several, including how to quantify the diversity needed for interpretable features, how interpretability changes in the absence of diversity, and the trustworthiness of attributions.  They also discuss broader questions regarding interpretability in RL, such as interpreting non-visual features and improving model reliability.  They emphasize the need for scalable, trustworthy, exhaustive, and low-cost interpretability research.  They argue that applying existing techniques more rigorously could yield valuable insights, especially as models tackle more complex tasks.
</Person1><Person2>And they provide supplementary material?
</Person2><Person1>Extensive supplementary material, including code, model weights, and detailed explanations of the model editing (Appendix A), integrated gradients used for attributing the value function (Appendix B), and the model architecture, which is based on IMPALA but tweaked for interpretability with fewer convolutional layers, more dense layers, no residual connections, and L2 pooling (Appendix C). They also discuss their use of the caricature objective for initial feature visualization, NMF for dimensionality reduction, and a workaround for handling positive and negative attributions.  Finally, they address generalization performance and offer caveats about the trustworthiness of heavily regularized feature visualizations.
</Person1><Person2>This paper seems like a great starting point for anyone interested in this area.
</Person2><Person1>Absolutely. Thanks for tuning in to AI Digest. Catch you next time!
</Person1><Person2>See you all next time!</Person2>

@ivanmkc
Author

ivanmkc commented Dec 2, 2024

It does seem to shorten the transcript a little TOO much as of now.

@ivanmkc
Author

ivanmkc commented Dec 3, 2024

@souzatharsis I guess the difference between shortform and longform-with-post-processing is:

  • shortform can only give me <3min length podcasts

  • longform with NO post-processing results in longer podcasts, but with many repetitions.

  • longform-with-post-processing can give me 5 to 8 minutes, with a cleaner transcript (fewer repetitions), with the following settings:

    "max_num_chunks": 7,
    "min_chunk_size": 800,

Hope that makes sense.

Ideally there's a way to consistently set a desired depth or length of podcast without it degenerating into repetition.

@souzatharsis
Owner

souzatharsis commented Dec 3, 2024 via email

@ivanmkc ivanmkc force-pushed the imkc--longform-reducer branch from 3afba34 to 4526958 Compare December 3, 2024 01:57
@ivanmkc
Author

ivanmkc commented Dec 5, 2024

@souzatharsis take a look at this:

Code

from podcastfy.client import generate_podcast

custom_config = {
    "max_num_chunks": 4,
    "min_chunk_size": 800,
    "podcast_name": "the AI Podcast",
    "creativity": 0.7,
    "user_instructions": """
    Targeted towards researchers and ML engineers. 
    You must not do any speculation but only stick to facts presented in the paper. 
    Avoid smalltalk, um's and ah's
    Do not repeat the same information more than once.
    Do not add show notes.
    """,
    "dialogue_structure": [
        "Brief acknowledgement of authors and publication date", 
        "Follow structure of original paper",
    ],
}

audio_file = generate_podcast(
    urls=["content/cohere/Mix Data or Merge Models.pdf"],
    llm_model_name="gemini-1.5-pro-latest",
    api_key_label="GEMINI_API_KEY",
    tts_model="geminimulti",    
    longform=True,
    conversation_config=custom_config
)

Transcript

1 <Person1>Welcome to the AI Podcast.  This paper really dives into a fascinating challenge with Large Language Models: balancing safety and performance, especially across multiple languages.</Person1>
2 <Person2>Right, and it tackles a pretty critical issue, yeah?  Keeping LLMs safe is hard enough in one language, let alone trying to make them safe and effective across many.</Person2>
3 <Person1>Exactly.  They call out the Western-centric bias in a lot of safety training data, which means current methods don't always translate well to other languages and cultural contexts.</Person1>
4 <Person2>So, what's their approach?  How do they try to solve this?</Person2>
5 <Person1>They're looking at model merging. Instead of mixing all your training data together, you train separate models on different data subsets–like one for safety, one for general performance–and then merge them.</Person1>
6 <Person2>Interesting.  Kind of like creating specialists and then combining their expertise.</Person2>
7 <Person1>Precisely.  They compare this merging approach to the traditional method of just mixing all the data together and training one model.</Person1>
8 <Person2>And? Which one works better?</Person2>
9 <Person1>Well, their findings show that merging is significantly more effective, particularly for balancing safety and general performance in multilingual settings.</Person1>
10 <Person2>Oh?  How much more effective are we talking here?</Person2>
11 <Person1>They saw improvements of up to 8% in general performance and a whopping 10% in safety, which is pretty substantial.</Person1>
12 <Person2>That's impressive.  What merging techniques did they use?</Person2>
13 <Person1>They explored four different methods: Linear merging, SLERP, TIES-Merging, and DARE-TIES.</Person1>
14 <Person2>So, a pretty thorough investigation then.  Did they have a clear winner?</Person2>
15 <Person1>It seems SLERP, or Spherical Linear Interpolation, consistently produced the best trade-off between safety and general performance.</Person1>
16 <Person2>Interesting.  So, not just merging, but the way you merge matters. What about the data itself? What languages did they look at?</Person2>
17 <Person1>They tested across six languages: English, Hindi, French, Spanish, Arabic, and Russian.  A good representation of different language families.</Person1>
18 <Person2>Definitely.  And did they find that merging was equally effective across all languages?</Person2>
19 <Person1>That's where it gets even more interesting. They also explored language-based merging, where they trained models on monolingual data and then merged them.  This proved incredibly effective, even outperforming the mixed-data approach with the same data.</Person1>
20 <Person2>So, training on individual languages and then merging is even better? That's counterintuitive.</Person2>
21 <Person1>It is, but it highlights the importance of addressing the unique challenges each language presents.  They found up to a 4% increase in general performance and a 7% reduction in harm with this method.</Person1>
22 <Person2>Wow. So, merging wins on both objective-based and language-based approaches.  Any other key takeaways?</Person2>
23 <Person1>They also found that not all merging algorithms are created equal. Some, like TIES, were great for safety but negatively impacted general performance, while others, like DARE-TIES, offered a better balance.</Person1>
24 <Person2>So, choosing the right merging technique is crucial.  What about applying these techniques in the real world?</Person2>
25 <Person1>The research provides a strong foundation for building safer and more effective multilingual language models.  It suggests that merging could be a key strategy for navigating the complexities of multilingual alignment and mitigating harm in a more nuanced way.</Person1>
26 <Person2>Absolutely.  This work opens up exciting possibilities for developing truly global and responsible LLMs.  Thanks for breaking it down.</Person2>
27 <Person1>So, we're diving into some fascinating research on multilingual language model safety and performance, focusing on this merging technique.</Person1>
28 <Person2>Yeah, this is really interesting stuff.  They're basically trying to find the best way to make these models both safe and effective across multiple languages, which is a huge challenge.</Person2>
29 <Person1>Absolutely. They're using the multilingual Aya Red-teaming dataset and generating adversarial prompts and safe completions, kind of like a safety stress test.</Person1>
30 <Person2>Right, and they're also using a general-purpose dataset based on translated Ultrafeedback data – 10,000 English prompts translated into other languages.</Person2>
31 <Person1>Exactly. And a key aspect is the mix of training data. They look at 0%, 15%, and 100% safety mixes to see how the balance affects the models.</Person1>
32 <Person2>So, 100% safety data is like the ideal, but 15% is probably closer to a real-world scenario, right?</Person2>
33 <Person1>Precisely. Now, the core of their work is exploring different merging techniques: linear, SLERP, TIES, and DARE-TIES.</Person1>
34 <Person2>Okay, so they're training separate models on different objectives—like safety and general performance—and then merging them, rather than just mixing all the training data together.</Person2>
35 <Person1>Right. And the results are pretty compelling.  They saw some significant improvements, up to 8% in general performance and 10% in safety with merging, compared to just mixed training data.</Person1>
36 <Person2>Wow, those are not insignificant gains. Which merging method came out on top?</Person2>
37 <Person1>SLERP, or Spherical Linear Interpolation, seems to be the sweet spot for balancing safety and performance.</Person1>
38 <Person2>Interesting. And they evaluated this across six languages, right? English, Hindi, French, Spanish, Arabic, and Russian.</Person2>
39 <Person1>Yes, a good mix of language families.  And they didn't stop there. They also explored language-based merging, training models on monolingual data and then merging.</Person1>
40 <Person2>So, training separate models for each language and then combining them? How did that compare?</Person2>
41 <Person1>Even better, surprisingly. Up to 4% better general performance and 7% reduction in harm compared to the mixed-language approach.</Person1>
42 <Person2>Huh. So, specializing by language and then merging is the way to go?  That's counterintuitive, but fascinating.</Person2>
43 <Person1>It is.  They also compared merging at different training stages – SFT and DPO.  DPO merging consistently showed larger improvements.</Person1>
44 <Person2>So the when matters as much as the how.  What about the where? Did they see consistent gains across all languages?</Person2>
45 <Person1>Not exactly. While overall merging was beneficial, the impact varied across languages. For example, Russian saw the biggest safety gains with DPO and TIES merging, while Spanish showed the largest general performance improvements with SLERP.</Person1>
46 <Person2>So a nuanced picture then.  It's not a one-size-fits-all solution, but merging clearly offers a promising path towards safer and more effective multilingual LLMs.  Great breakdown. Thanks!</Person2>
47 <Person1>So, uh, when they merged these DPO checkpoints, English didn't benefit as much as other languages. In fact, it kinda backfired a bit, right?</Person1>
48 <Person2>Yeah, a bit of a surprise.  They saw a pretty significant drop in safety, almost 25%, and about 15% in general performance compared to just the 15% safety mix.</Person2>
49 <Person1>And for SFT checkpoints, Spanish and Hindi had the least gains in safety and general performance respectively.</Person1>
50 <Person2>But, I mean, even with those dips, merging still performed better than the base model in over half the cases. So, it's not a total loss.</Person2>
51 <Person1>Right. Now, they also looked at merging monolingual models, which is, um, training models on data from just one language and then merging them.</Person1>
52 <Person2>So, instead of mixing all the languages in the training data, they train separate models on, say, just French data, just Spanish data, and so on?</Person2>
53 <Person1>Exactly. They tested this with three languages—English, French, and Spanish—and then with all six languages.</Person1>
54 <Person2>And? What happened?</Person2>
55 <Person1>Well, merging the six monolingual models actually did better than the mixed-data baseline.  Like, a 7% reduction in harm and a 4% boost in general performance.</Person1>
56 <Person2>Oh, wow.  So, training on individual languages and then merging is superior? Interesting.</Person2>
57 <Person1>Yeah, but there's a catch. Merging just the three languages – English, French, and Spanish – performed even better than merging all six.</Person1>
58 <Person2>So there's some kind of interference when you merge too many languages?</Person2>
59 <Person1>Seems like it. Maybe something to do with cross-lingual influences messing things up a bit. But overall, merging monolingual models still looks promising.</Person1>
60 <Person2>Definitely. They also looked at how the weighting of the safety model affects the merged model's performance, right?</Person2>
61 <Person1>Yeah, uh, they found that giving more weight to the safety model, unsurprisingly, improved safety but at the cost of general performance.</Person1>
62 <Person2>Makes sense.  But even with different weightings, merging still outperformed the mixed-data approach.  Consistent, yeah?</Person2>
63 <Person1>Right. Finally, they explored different training pipelines, seeing if it's better to merge before or after preference tuning, like with DPO.</Person1>
64 <Person2>And which one worked better?</Person2>
65 <Person1>Merging before DPO led to greater improvements in general performance, but merging after SFT and then doing DPO was better for safety.</Person1>
66 <Person2>So, the when matters as much as the how. Fascinating stuff. It really highlights the complexities of building safe and effective multilingual LLMs.</Person2>
67 <Person1>So, we're looking at this research on merging language models, specifically for multilingual safety and performance.</Person1>
68 <Person2>Right, and they're using some interesting datasets to test this: the Aya Red-teaming set for safety, and a translated version of Ultrafeedback for general performance.</Person2>
69 <Person1>Yeah, that multilingual aspect is key here.  They're testing across six languages, representing different families – English, Hindi, French, Spanish, Arabic, and Russian.</Person1>
70 <Person2>And they're comparing merging techniques – linear, SLERP, TIES, and DARE-TIES – against different training data mixes, including a 15% safety mix, which is probably more realistic than 100% safety data.</Person2>
71 <Person1>Exactly.  Now, for SFT checkpoints, SLERP seems to be the winner overall, offering the best balance between safety and general performance.</Person1>
72 <Person2>But the results vary quite a bit by language.  For example, in Table 3, Russian shows the biggest safety improvement with TIES merging, while Spanish has the least gain.</Person2>
73 <Person1>And for general performance with SFT models, Table 4 shows that TIES merging actually gives the biggest boost to English and Spanish, while Hindi sees the smallest improvement with SLERP.</Person1>
74 <Person2>Interesting.  What about DPO checkpoints?  Anything change there?</Person2>
75 <Person1>Yeah, the picture gets more complex.  Looking at Table 5 for safety, Russian again sees the biggest gains with TIES merging, but English actually suffers a pretty big drop with linear merging.</Person1>
76 <Person2>Oh, wow, almost a 25% decrease in safety for English with DPO and linear merging!  That's a significant setback.</Person2>
77 <Person1>It is.  And for general performance with DPO, Table 6 shows SLERP giving the biggest boost to Spanish, while TIES merging dramatically hurts English.</Person1>
78 <Person2>So the best approach really depends on both the language and the training method.  What other merging strategies did they explore?</Person2>
79 <Person1>They also looked at merging monolingual models, training separate models on individual languages and then combining them.</Person1>
80 <Person2>And how did that perform?</Person2>
81 <Person1>Surprisingly well! Merging six monolingual models actually outperformed the mixed-data baseline, with a 7% reduction in harm and a 4% boost in general performance.</Person1>
82 <Person2>Huh. So, specializing and then merging is better than training on a mix of languages?  That's counterintuitive, but interesting.</Person2>
83 <Person1>It is!  But merging just three monolingual models—English, French, and Spanish—did even better than merging all six.  Maybe there's some interference when you combine too many languages.</Person1>
84 <Person2>Possibly.  They also looked at the impact of weighting the safety model, right?  More weight to safety should mean better safety, but at a cost to general performance.</Person2>
85 <Person1>Exactly.  And that's what they found.  But even with different weightings, merging still consistently outperformed the mixed-data approach.</Person1>
86 <Person2>So, merging remains a promising technique, even with these trade-offs.  What's the bottom line here?  What are the key takeaways from this research?</Person2>
87 <Person1>Well, merging definitely shows potential for improving both the safety and performance of multilingual LLMs.  But the choice of merging method, the training stage, and the specific language all play a crucial role.  It's not a one-size-fits-all solution.</Person1>
88 <Person2>Right.  It highlights the complexities of working with multiple languages and the need for more nuanced approaches to safety and alignment.  Fascinating stuff! Thanks for breaking it down.</Person2>
89 <Person1>So, we're zeroing in on Table 6, right?  DPO checkpoints, general performance—how the different merging techniques shake out.</Person1>
90 <Person2>Yeah, and it's… well, it's a mixed bag, isn't it? Some interesting wins, but also some surprising dips.</Person2>
91 <Person1>For sure.  Like, Spanish showing a nearly 9% jump with SLERP merging. That's impressive.</Person1>
92 <Person2>Definitely.  But then English takes a major hit with TIES, almost a 20% drop.  What's going on there?</Person2>
93 <Person1>Good question.  It seems TIES merging, while sometimes beneficial for safety, can really hurt general performance in certain languages, like we're seeing here with English.</Person1>
94 <Person2>And it's not just English.  French takes a hit too with TIES, though not as dramatically.  Around 6% down.</Person2>
95 <Person1>Right.  But then look at Russian—a solid 7% gain with TIES.  It really underscores how language-specific these effects can be. Absolutely. Hindi also sees a nice boost with TIES, close to 5%. So, it's not a universal negative, but definitely something to watch out for.</Person1>
96 <Person2>Yeah, and SLERP seems like the safer bet, consistently delivering decent gains across the board. Nothing spectacular, but generally positive. Except for Hindi, a small drop there.</Person2>
97 <Person1>True.  Around a 2% improvement for English and French with SLERP, and almost 5% for Russian.  A more predictable pattern than TIES.</Person1>
98 <Person2>Linear merging seems to be the most volatile. Big win for Arabic, almost 10%, but a significant drop for English, nearly 15%.  Hard to draw any firm conclusions there.</Person2>
99 <Person1>Definitely a higher risk, higher reward approach.  DARE-TIES, on the other hand, seems to be the most consistent, hovering around 2-3% improvement for most languages, except for Hindi with a small drop.</Person1>
100 <Person2>So, the takeaway here is that there's no magic bullet. The best merging strategy really depends on the specific language and the trade-offs you're willing to make between safety and general performance.</Person2>
101 <Person1>Precisely.  And this research really highlights the complexities of building truly robust and effective multilingual LLMs.  It's a fascinating area, and clearly, there's still much to explore.</Person1>
102 <Person2>Absolutely.  Lots of exciting avenues for future research.  Thanks for joining us today on AI Digest. We hope you found this insightful.  Until next time, keep exploring the world of AI!</Person2>

Semantically Duplicate Sentences from the Conversation:

1. Introduction to the Research Focus

  • Line 1:

    "Welcome to the AI Podcast. This paper really dives into a fascinating challenge with Large Language Models: balancing safety and performance, especially across multiple languages."
    (Person1)

  • Line 27:

    "So, we're diving into some fascinating research on multilingual language model safety and performance, focusing on this merging technique."
    (Person1)

  • Line 67:

    "So, we're looking at this research on merging language models, specifically for multilingual safety and performance."
    (Person1)


2. Exploring Merging vs. Mixing Training Data

  • Line 5:

    "They're looking at model merging. Instead of mixing all your training data together, you train separate models on different data subsets–like one for safety, one for general performance–and then merge them."
    (Person1)

  • Line 7:

    "Precisely. They compare this merging approach to the traditional method of just mixing all the data together and training one model."
    (Person1)

  • Line 34:

    "Okay, so they're training separate models on different objectives—like safety and general performance—and then merging them, rather than just mixing all the training data together."
    (Person2)


3. Merging Improves Safety and Performance

  • Line 9:

    "Well, their findings show that merging is significantly more effective, particularly for balancing safety and general performance in multilingual settings."
    (Person1)

  • Line 35:

    "Right. And the results are pretty compelling. They saw some significant improvements, up to 8% in general performance and 10% in safety with merging, compared to just mixed training data."
    (Person1)


4. The Four Merging Techniques

  • Line 13:

    "They explored four different methods: Linear merging, SLERP, TIES-Merging, and DARE-TIES."
    (Person1)

  • Line 33:

    "Precisely. Now, the core of their work is exploring different merging techniques: linear, SLERP, TIES, and DARE-TIES."
    (Person1)

  • Line 70:

    "And they're comparing merging techniques – linear, SLERP, TIES, and DARE-TIES – against different training data mixes, including a 15% safety mix, which is probably more realistic than 100% safety data."
    (Person2)


5. SLERP as the Best Trade-off

  • Line 15:

    "It seems SLERP, or Spherical Linear Interpolation, consistently produced the best trade-off between safety and general performance."
    (Person1)

  • Line 37:

    "SLERP, or Spherical Linear Interpolation, seems to be the sweet spot for balancing safety and performance."
    (Person1)

  • Line 71:

    "Exactly. Now, for SFT checkpoints, SLERP seems to be the winner overall, offering the best balance between safety and general performance."
    (Person1)


6. Language-based Merging Outperforms Mixed-language Merging

  • Line 19:

    "That's where it gets even more interesting. They also explored language-based merging, where they trained models on monolingual data and then merged them. This proved incredibly effective, even outperforming the mixed-data approach with the same data."
    (Person1)

  • Line 41:

    "Even better, surprisingly. Up to 4% better general performance and 7% reduction in harm compared to the mixed-language approach."
    (Person1)

  • Line 55:

    "Well, merging the six monolingual models actually did better than the mixed-data baseline. Like, a 7% reduction in harm and a 4% boost in general performance."
    (Person1)


9. Trade-offs in Merging Techniques

  • Line 43:

    "It is. They also compared merging at different training stages – SFT and DPO. DPO merging consistently showed larger improvements."
    (Person1)

  • Line 65:

    "Merging before DPO led to greater improvements in general performance, but merging after SFT and then doing DPO was better for safety."
    (Person1)

  • Line 100:

    "So, the takeaway here is that there's no magic bullet. The best merging strategy really depends on the specific language and the trade-offs you're willing to make between safety and general performance."
    (Person2)


I suspect (but have not verified) that having repeated sections in "dialogue_structure" in the config may increase the chance of repetition in the transcript.
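
For illustration, a grouping like the one above can be roughly approximated without an LLM. The sketch below is an assumption, not what was used to produce the list: it compares transcript lines with Python's standard `difflib.SequenceMatcher`, which measures lexical rather than semantic similarity, so it only catches near-verbatim repeats. The helper names and the 0.6 threshold are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Matches the <Person1>/<Person2> speaker tags used in the transcript.
SPEAKER_TAG = re.compile(r"</?Person[12]>")


def normalize(line: str) -> str:
    """Drop speaker tags, lowercase, and keep only word-like tokens."""
    text = SPEAKER_TAG.sub(" ", line).lower()
    return " ".join(re.findall(r"[a-z0-9%]+", text))


def find_near_duplicates(lines, threshold=0.6):
    """Return (i, j, ratio) for 1-based line pairs at or above the threshold."""
    cleaned = [normalize(line) for line in lines]
    pairs = []
    for i in range(len(cleaned)):
        for j in range(i + 1, len(cleaned)):
            ratio = SequenceMatcher(None, cleaned[i], cleaned[j]).ratio()
            if ratio >= threshold:
                pairs.append((i + 1, j + 1, round(ratio, 2)))
    return pairs


# Three lines taken from the transcript above (lines 15, 8 and 37).
sample = [
    "<Person1>It seems SLERP, or Spherical Linear Interpolation, consistently "
    "produced the best trade-off between safety and general performance.</Person1>",
    "<Person2>And? Which one works better?</Person2>",
    "<Person1>SLERP, or Spherical Linear Interpolation, seems to be the sweet "
    "spot for balancing safety and performance.</Person1>",
]
for first, second, ratio in find_near_duplicates(sample):
    print(f"lines {first} and {second} look similar (ratio {ratio})")
```

The comparison is quadratic in the number of lines, which is fine for a transcript of about a hundred lines. Catching paraphrases would need an embedding model or an LLM pass instead.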

@souzatharsis
Owner

Thank you for the incredibly detailed analysis!

I think the issue might be repetition in the input -> repetition in the output.

Meaning, input papers often follow an intro, main body, conclusion structure, and the intro and conclusion usually repeat points from the main body in summarized form. That may lead the longform podcast to repeat the same points.

The solution may be to (i) improve the prompt or (ii) post-process each chunk's response as it is generated, via one LLM call conditioned on the previous responses.
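
Option (ii) could be sketched roughly as follows. This is an assumption about how such a step might look, not podcastfy code: `llm` stands for any callable that takes a prompt string and returns text, and `build_dedup_prompt` and `postprocess_chunks` are hypothetical names.

```python
from typing import Callable, List


def build_dedup_prompt(previous: List[str], chunk_response: str) -> str:
    """Build a prompt that shows earlier responses next to the new chunk."""
    history = "\n".join(previous)
    return (
        "Already covered in earlier chunks:\n"
        f"{history}\n\n"
        "New chunk transcript:\n"
        f"{chunk_response}\n\n"
        "Rewrite the new chunk so it does not repeat points already covered. "
        "Keep all new information."
    )


def postprocess_chunks(
    chunk_responses: List[str], llm: Callable[[str], str]
) -> List[str]:
    """Clean each chunk with one LLM call conditioned on the cleaned history."""
    cleaned: List[str] = []
    for response in chunk_responses:
        if not cleaned:
            # The first chunk has nothing earlier to repeat, so keep it as-is.
            cleaned.append(response)
            continue
        cleaned.append(llm(build_dedup_prompt(cleaned, response)))
    return cleaned
```

Because each call only rewrites one chunk, the output-token limit applies per chunk rather than to the whole transcript. The history grows with every chunk, so a real version would probably need to summarize or truncate it.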

@ivanmkc
Author

ivanmkc commented Dec 6, 2024

Sounds good! I'll play with the prompt to see if it can be mitigated.

Great work with the project!

I see other people with a similar question: #202

@ivanmkc ivanmkc marked this pull request as draft December 6, 2024 01:15
@jtoy

jtoy commented Dec 9, 2024

I would love to see this get merged! I am seeing this happen all the time; I am currently generating ~10 podcasts a week to test.

@jtoy

jtoy commented Dec 9, 2024

maybe just implement this as a flag? --post_process_cleanup
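
A minimal sketch of what such an opt-in flag might look like, using the standard `argparse` module. This is a hypothetical wrapper, not podcastfy's actual CLI; only the flag name comes from the comment above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical CLI parser with the cleanup step off by default."""
    parser = argparse.ArgumentParser(description="Hypothetical podcast CLI wrapper")
    parser.add_argument(
        "--post_process_cleanup",
        action="store_true",
        help="Run an LLM reduction step on the longform transcript.",
    )
    return parser


args = build_parser().parse_args(["--post_process_cleanup"])
if args.post_process_cleanup:
    print("post-processing enabled")
```

With `action="store_true"` the step stays off unless the flag is passed, which matches the earlier suggestion that it should be off by default.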

@souzatharsis
Owner

souzatharsis commented Dec 9, 2024 via email

@jtoy

jtoy commented Dec 11, 2024

This PR is essentially a hack, but longform doesn't really work. Are people using longform successfully without duplicates?
I would say that in my testing, about 70% of the generated podcasts have duplicates.
I built a similar system for blog posts a while ago, and we did it using this basic process:
read all the content and write an outline, then write each piece of the outline. More involved, but it got rid of all the duplication.
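
The outline-first process described above could be sketched like this. It is an illustration under assumptions, not the system from the comment: `llm` is any callable from prompt string to text, and `OUTLINE_PROMPT`, `parse_outline` and `outline_then_write` are hypothetical names.

```python
from typing import Callable, List

OUTLINE_PROMPT = (
    "Write an outline of the content below, one non-overlapping point per line:\n\n"
)


def parse_outline(outline_text: str) -> List[str]:
    """Turn the LLM's outline into a list of points, dropping bullets and blanks."""
    stripped = (line.strip("-*• \t") for line in outline_text.splitlines())
    return [point for point in stripped if point]


def outline_then_write(content: str, llm: Callable[[str], str]) -> str:
    """Ask for an outline once, then write each outline point separately."""
    points = parse_outline(llm(OUTLINE_PROMPT + content))
    outline = "\n".join(points)
    sections = []
    for point in points:
        prompt = (
            f"Full outline:\n{outline}\n\n"
            f"Write only the section for: {point}\n"
            "Do not cover the other outline points.\n\n"
            f"Source content:\n{content}"
        )
        sections.append(llm(prompt))
    return "\n\n".join(sections)
```

Each section prompt sees the full outline, so the model knows which points belong elsewhere. This costs one extra call for the outline plus one call per point.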

@ivanmkc
Author

ivanmkc commented Dec 12, 2024

I understand the issue with this PR: the limited number of output tokens would constrain the length of the final longform transcript. I developed a better system that processes the source material in chunks outside of podcastfy and extracts key information; a final reduction step then dedups that information.

The result is then passed into podcastfy. It's similar to @jtoy's suggestion, but the key is that the reduction step must not constrain the quantity of information. I made the reduction step produce a git-diff-like edit list, which gets applied deterministically.
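
The edit-list idea might look roughly like the sketch below. The real format is not shown in this thread, so the `Edit` shape (1-based inclusive line range plus optional replacement) and `apply_edits` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edit:
    start: int  # first line to replace, 1-based and inclusive
    end: int  # last line to replace, 1-based and inclusive
    replacement: Optional[str] = None  # None deletes the span


def apply_edits(lines: List[str], edits: List[Edit]) -> List[str]:
    """Apply non-overlapping edits to a copy of lines, bottom to top."""
    result = list(lines)
    previous_start = len(lines) + 1
    # Working from the bottom up keeps earlier line numbers valid.
    for edit in sorted(edits, key=lambda e: e.start, reverse=True):
        if not (1 <= edit.start <= edit.end <= len(lines)):
            raise ValueError(f"edit out of range: {edit}")
        if edit.end >= previous_start:
            raise ValueError(f"overlapping edit: {edit}")
        new_lines = [] if edit.replacement is None else [edit.replacement]
        result[edit.start - 1 : edit.end] = new_lines
        previous_start = edit.start
    return result


notes = ["merging beats mixing", "SLERP is best", "SLERP wins", "merging beats mixing"]
deduped = apply_edits(notes, [Edit(3, 3), Edit(4, 4)])
print(deduped)
```

With this design the LLM only emits the edits, so its output size scales with the number of changes rather than with the length of the text, and untouched lines are carried over exactly.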

@ivanmkc ivanmkc closed this Dec 12, 2024