bad underscores in imported data #1160

Open
pdpinch opened this issue Mar 24, 2022 · 10 comments

@pdpinch
Member

pdpinch commented Mar 24, 2022

tl;dr There is a significant (though perhaps not huge) amount of existing data that will continue to exhibit the original bad-underscore behavior until someone fixes the markdown. This CSV lists many examples (it contains all instances of _{{ in OCW): https://docs.google.com/spreadsheets/d/1YR9TjoknAM_lP5zVfI6Wv25ynQ0Q6cij8TqxOoU0tDk/edit?usp=sharing

Details

Setting aside whether or not authors should be specifying italics..., italics need to be written with asterisks, not underscores, in order for some things to display. To expand on what @gumaerc said above, here are some examples:

[Screenshot, 2022-03-24: the numbered examples referred to in the notes below; image not available]

Some notes:

  • 3 does not seem realistic, but 4 is, as we've seen

  • this sort of content can come out of ocw-to-hugo. It may never have interacted with studio

  • Delightfully: GitHub-flavored Markdown (GFM) DOES interpret 4 as italics: ant _bear_{{< sup 2 >}}_ cat --> "ant bear{{< sup 2 >}} cat" because GFM doesn't replace the shortcode with alphanumeric characters.

    This is really useful to us: It means that although the published content looks wrong, if you open it in studio and edit+save, it will be fixed.

    Caveat: This won't work if the underscore has been escaped, which will be true if the content was edited prior to Wassaf's PR. But that is only the case for one page, which someone can fix in the admin panel.
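
Editor's sketch of the flanking rule behind the notes above. This is a simplified Python version of the CommonMark test for whether a delimiter may close emphasis. It assumes, as the note implies, that Hugo swaps the shortcode for an alphanumeric placeholder before rendering; the token PLACEHOLDER and the function names are hypothetical, and only ASCII punctuation is considered.

```python
import string


def _is_space(ch: str) -> bool:
    # Start/end of text counts as whitespace, as in CommonMark.
    return ch == "" or ch.isspace()


def _is_punct(ch: str) -> bool:
    # Simplification: ASCII punctuation only.
    return ch != "" and ch in string.punctuation


def can_close_emphasis(text: str, i: int) -> bool:
    """Return True if the single delimiter at text[i] may close emphasis."""
    delim = text[i]
    before = text[i - 1] if i > 0 else ""
    after = text[i + 1] if i + 1 < len(text) else ""
    left_flanking = not _is_space(after) and (
        not _is_punct(after) or _is_space(before) or _is_punct(before)
    )
    right_flanking = not _is_space(before) and (
        not _is_punct(before) or _is_space(after) or _is_punct(after)
    )
    if delim == "*":
        return right_flanking
    # "_" may not close emphasis in the middle of a word.
    return right_flanking and (not left_flanking or _is_punct(after))


def closer_index(text: str) -> int:
    # Index of the delimiter right after "bear" in the examples below.
    return text.index("bear") + len("bear")


raw = "ant _bear_{{< sup 2 >}} cat"          # what GFM sees
substituted = "ant _bear_PLACEHOLDER cat"    # assumed post-substitution text
asterisks = "ant *bear*PLACEHOLDER cat"      # same text with asterisks
for s in (raw, substituted, asterisks):
    print(s, "->", can_close_emphasis(s, closer_index(s)))
```

Under these assumptions the underscore can close emphasis only while the shortcode's `{{` (punctuation) follows it; once an alphanumeric placeholder follows, only the asterisk form still closes.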

Here is an exhaustive list of all pages that have _{{ in their markdown: https://docs.google.com/spreadsheets/d/1YR9TjoknAM_lP5zVfI6Wv25ynQ0Q6cij8TqxOoU0tDk/edit?usp=sharing

Originally posted by @ChristopherChudzicki in mitodl/ocw-hugo-themes#501 (comment)

@pdpinch
Member Author

pdpinch commented Mar 24, 2022

@ChristopherChudzicki in your sheet, there is a column "needs_admin_fix". What does that mean?

pdpinch transferred this issue from mitodl/ocw-hugo-themes on Mar 24, 2022

@ChristopherChudzicki
Contributor

ChristopherChudzicki commented Mar 24, 2022

@pdpinch I didn't actually test this, but I suspect that "open in studio, edit, save" won't work if the italicizing underscores have been escaped. Those two pages have escaped italicizing underscores**, so I think we need to remove the escapes in admin panel, then save. Or just change to asterisks in admin panel, too.

** or rather... escaped underscores that were probably intended to italicize but no longer do.

@pdpinch
Member Author

pdpinch commented Mar 24, 2022

The three pages with needs_admin_fix set to TRUE are all in 16.90. I would fix these myself, but we should probably re-import 16.90 first, after mitodl/ocw-to-hugo#493 is deployed.

@ChristopherChudzicki
Contributor

ChristopherChudzicki commented Mar 24, 2022

@pdpinch Are you re-importing all of it?

If so / if there are any courses that are being entirely re-imported, it might be worth changing ocw-to-hugo to use * instead of _ for italicizing. It's a simple Turndown config (emDelimiter).

@Wassaf-Shahzad
Contributor

@pdpinch Are you re-importing all of it?

If so / if there are any courses that are being entirely re-imported, it might be worth changing ocw-to-hugo to use * instead of _ for italicizing. It's a simple Turndown config (emDelimiter).

Already done in this PR: #1147

@pdpinch
Member Author

pdpinch commented Feb 21, 2023

@ChristopherChudzicki do you think we can close this?

@ChristopherChudzicki
Contributor

@pdpinch I do not think it is resolved, but it is better (520 instances before; now 233). I added a new sheet with the updated info: https://docs.google.com/spreadsheets/d/1YR9TjoknAM_lP5zVfI6Wv25ynQ0Q6cij8TqxOoU0tDk/edit?usp=sharing

I had forgotten about this issue, but according to my previous comment we should be able to fix these by opening them in studio and saving them without edits (since studio normalizes underscore-italics to asterisk italics).

@pdpinch
Member Author

pdpinch commented May 16, 2023

@ChristopherChudzicki would you mind updating the Google Sheet? And please make a note here of the query you are using.

@ChristopherChudzicki
Contributor

ChristopherChudzicki commented May 17, 2023

@pdpinch I've updated the sheet, which also has the query.

Again, the resolution is "replace _ with * for italics", which can be done by opening the page in Studio, saving, and re-publishing. I retested a few on RC, and the results look good.

Here's the query as well:

SELECT
        w_name AS site_name,
        COUNT(1),
        wc_title AS page_title,
        REGEXP_REPLACE(FORMAT(
                'https://ocw.mit.edu/courses/%1$s/%2$s/%3$s',
                w_name,
                RIGHT(wc_dirpath, LENGTH(wc_dirpath) - 8),
                wc_filename
        ), '_index$', '') AS ocw_url,
        FORMAT(
                'https://github.mit.edu/mitocwcontent/%1$s/blob/main/%2$s/%3$s.md',
                w_short_id,
                wc_dirpath,
                wc_filename
        ) AS github_url,
        FORMAT(
                'https://ocw-studio.odl.mit.edu/admin/websites/websitecontent/%1$s/change/',
                wc_id
        ) AS admin_url,
        FORMAT(
                'https://ocw-studio.odl.mit.edu/sites/%1$s/type/%2$s/%3$s',
                w_name,
                wc_type,
                wc_text_id
        ) AS studio_url
FROM (
SELECT
        w.name AS w_name,
        w.short_id AS w_short_id,
        wc.id AS wc_id,
        wc.type AS wc_type,
        wc.dirpath AS wc_dirpath,
        wc.filename AS wc_filename,
        wc.text_id AS wc_text_id,
        wc.title AS wc_title,
        w.publish_date AS w_publish_date,
/*
                Italics via underscores preceding shortcodes are problematic:
                ...shortcode >}}_should be italics_{{< shortcode... is problematic

                Search for instances of '_{{' NOT preceded by whitespace.
                Exclude 'preceded by whitespace' because all such existing occurrences
                are of the form _{{% resource_link %}}_, which is perfectly fine.
*/
        regexp_matches(markdown, '[^\s]_\{\{.*?\}\}', 'g')
FROM websites_websitecontent wc
JOIN websites_website w ON wc.website_id = w.uuid
) x
GROUP BY (
w_name,
wc_title,
github_url,
ocw_url,
studio_url,
admin_url
)
ORDER BY count DESC;
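
Editor's sketch: the same pattern can be tried against a single markdown string in Python, for example to spot-check a page before and after it is re-saved in Studio. The function name and the sample strings are hypothetical. `re.DOTALL` is used on the assumption that it mirrors PostgreSQL's default behavior, where `.` also matches newlines.

```python
import re
from typing import List

# Same pattern as in the query above: '_{{' NOT preceded by whitespace,
# followed by the shortest run of text up to the next '}}'.
BAD_UNDERSCORE = re.compile(r"[^\s]_\{\{.*?\}\}", re.DOTALL)


def find_bad_underscores(markdown: str) -> List[str]:
    """Return every match of the problematic underscore-plus-shortcode pattern."""
    return BAD_UNDERSCORE.findall(markdown)


# Hypothetical sample content, not taken from any real OCW page.
problematic = "see >}}_should be italics_{{< sup 2 >}} here"
harmless = 'text _{{% resource_link "abc" "title" %}}_ more'

print(find_bad_underscores(problematic))
print(find_bad_underscores(harmless))
```

The first sample yields one match, which starts at the character before the underscore; the second yields none because its only `_{{` follows whitespace.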

@pdpinch
Member Author

pdpinch commented May 19, 2023

Since this is self-healing, I'll ask the OCW team if they want to do it.
