bad underscores in imported data #1160
Comments
@ChristopherChudzicki in your sheet, there is a column "needs_admin_fix". What does that mean? |
@pdpinch I didn't actually test this, but I suspect that "open in studio, edit, save" won't work if the italicizing underscores have been escaped. Those two pages have escaped italicizing underscores**, so I think we need to remove the escapes in admin panel, then save. Or just change to asterisks in admin panel, too. ** or rather... escaped underscores that were probably intended to italicize but no longer do. |
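For anyone who wants to check a page's markdown for this case before trying "open in studio, edit, save", here is a minimal sketch. It is an illustration only: the name `has_escaped_underscore` and the regex are hypothetical (not part of ocw-studio), and it assumes the escape shows up as a literal backslash directly before an underscore that touches a shortcode.

```python
import re

# Hypothetical pattern: a backslash-escaped underscore that sits directly
# before a shortcode opener '{{' or directly after a shortcode closer '}}'.
ESCAPED_UNDERSCORE = re.compile(r"\\_(?=\{\{)|(?<=\}\})\\_")


def has_escaped_underscore(markdown: str) -> bool:
    """Return True if the markdown contains an escaped underscore touching a shortcode."""
    return ESCAPED_UNDERSCORE.search(markdown) is not None


escaped = r"{{< sup 1 >}}\_text\_{{< sup 2 >}}"
unescaped = "{{< sup 1 >}}_text_{{< sup 2 >}}"

print(has_escaped_underscore(escaped))    # True
print(has_escaped_underscore(unescaped))  # False
```

Pages flagged this way would be the ones that need the admin-panel fix rather than a plain studio save.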
The three pages with |
@pdpinch Are you re-importing all of it? If so / if there are any courses that are being entirely re-imported, it might be worth changing ocw-to-hugo to use |
Already done #1147 in this PR |
@ChristopherChudzicki do you think we can close this? |
@pdpinch I do not think it is resolved, but it is better (520 instances before; now 233). I added a new sheet with the updated info: https://docs.google.com/spreadsheets/d/1YR9TjoknAM_lP5zVfI6Wv25ynQ0Q6cij8TqxOoU0tDk/edit?usp=sharing I had forgotten about this issue, but according to my previous comment we should be able to fix these by opening them in studio and saving them without edits (since studio normalizes underscore-italics to asterisk italics). |
@ChristopherChudzicki would you mind updating the googlesheet? and please make a note here of the query you are using. |
@pdpinch I've updated the sheet, which also has the query. Again, the resolution is "replace
Here's the query as well:
SELECT
w_name AS site_name,
COUNT(1),
wc_title AS page_title,
REGEXP_REPLACE(FORMAT(
'https://ocw.mit.edu/courses/%1$s/%2$s/%3$s',
w_name,
RIGHT(wc_dirpath, LENGTH(wc_dirpath) - 8),
wc_filename
), '_index$', '') AS ocw_url,
FORMAT(
'https://github.mit.edu/mitocwcontent/%1$s/blob/main/%2$s/%3$s.md',
w_short_id,
wc_dirpath,
wc_filename
) AS github_url,
FORMAT(
'https://ocw-studio.odl.mit.edu/admin/websites/websitecontent/%1$s/change/',
wc_id
) AS admin_url,
FORMAT(
'https://ocw-studio.odl.mit.edu/sites/%1$s/type/%2$s/%3$s',
w_name,
wc_type,
wc_text_id
) AS studio_url
FROM (
SELECT
w.name AS w_name,
w.short_id AS w_short_id,
wc.id AS wc_id,
wc.type AS wc_type,
wc.dirpath AS wc_dirpath,
wc.filename AS wc_filename,
wc.text_id AS wc_text_id,
wc.title AS wc_title,
w.publish_date AS w_publish_date,
    /*
    Italics via underscore preceding shortcodes are problematic:
    ...shortcode >}}_should be italics_{{< shortcode... is problematic
    Search for instances of '_{{' NOT preceded by whitespace.
    Exclude 'preceded by whitespace' because all such existing occurrences
    are of the form _{{% resource_link %}}_, which is perfectly fine.
    */
regexp_matches(markdown, '[^\s]_\{\{.*?\}\}', 'g')
FROM websites_websitecontent wc
JOIN websites_website w ON wc.website_id = w.uuid
) x
GROUP BY (
w_name,
wc_title,
github_url,
ocw_url,
studio_url,
admin_url
)
ORDER BY count DESC; |
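To sanity-check the pattern outside the database, the same regular expression can be run against sample strings in Python. This is only a sketch: the helper name `find_bad_underscores` and the sample strings are made up for illustration, and `re.DOTALL` is an assumption meant to mirror PostgreSQL's default of letting `.` match newlines.

```python
import re

# Same pattern as in the query: '_{{' NOT preceded by whitespace,
# followed by the rest of the shortcode up to the first '}}'.
# re.DOTALL is an assumption (see the note above).
BAD_UNDERSCORE = re.compile(r"[^\s]_\{\{.*?\}\}", re.DOTALL)


def find_bad_underscores(markdown: str) -> list[str]:
    """Return every match of the pattern in a markdown string."""
    return BAD_UNDERSCORE.findall(markdown)


# Illustrative inputs, not taken from real OCW content.
problematic = "{{< sup 1 >}}_should be italics_{{< sup 2 >}}"
fine = 'see _{{% resource_link "abc" "title" %}}_ for details'

print(find_bad_underscores(problematic))  # ['s_{{< sup 2 >}}']
print(find_bad_underscores(fine))         # []
```

The second string is the "preceded by whitespace" form that the query deliberately excludes, so it yields no matches.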
Since this is self-healing, I'll ask the OCW team if they want to do it. |
tl;dr There is a significant (but perhaps not huge) amount of existing data that will continue to exhibit the original bad-underscore behavior until someone fixes the markdown. This CSV lists many examples. (The CSV is all instances of _{{ in OCW.) https://docs.google.com/spreadsheets/d/1YR9TjoknAM_lP5zVfI6Wv25ynQ0Q6cij8TqxOoU0tDk/edit?usp=sharing
Details
Setting aside whether or not authors should be specifying italics..., the italics need to use asterisks, not underscores, in order for some things to display. To expand on what @gumaerc said above, here are some examples:
Some notes:
3 does not seem realistic, but 4 is, as we've seen
this sort of content can come out of ocw-to-hugo. It may never have interacted with studio
Delightfully: GitHub-flavored Markdown DOES interpret 4 as italics:
ant _bear_{{< sup 2 >}}_ cat
--> "ant bear{{< sup 2 >}} cat" because GFM doesn't replace the shortcode with alphanumeric characters.
This is really useful to us: it means that although the published content looks wrong, if you open it in studio and edit+save, it will be fixed.
Caveat: This won't work if the underscore has been escaped, which will be true if the content was edited prior to Wassaf's PR. But that is only the case for one page, which someone can fix in the admin panel.
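A rough sketch of what "fixing the markdown" amounts to in the common case, in case a scripted pass is ever wanted instead of opening pages one by one. This is not studio's normalization code: the function name `underscores_to_asterisks` and the regex are hypothetical. It only handles an underscore-italic span whose closing underscore touches a shortcode, and escaped underscores are deliberately left alone, per the caveat above.

```python
import re

# Hypothetical pattern: an underscore-delimited span whose closing underscore
# is immediately followed by a shortcode opener '{{'. The opening underscore
# must not be escaped (preceded by a backslash) or sit inside a word.
UNDERSCORE_ITALIC_BEFORE_SHORTCODE = re.compile(
    r"(?<![\\\w])_([^_\s][^_\n]*?)_(?=\{\{)"
)


def underscores_to_asterisks(markdown: str) -> str:
    """Rewrite _text_{{...}} as *text*{{...}}, leaving everything else untouched."""
    return UNDERSCORE_ITALIC_BEFORE_SHORTCODE.sub(r"*\1*", markdown)


print(underscores_to_asterisks("{{< sup 1 >}}_should be italics_{{< sup 2 >}}"))
# {{< sup 1 >}}*should be italics*{{< sup 2 >}}

# The resource_link form that is already fine stays as it is.
print(underscores_to_asterisks('see _{{% resource_link "abc" "title" %}}_ here'))
# see _{{% resource_link "abc" "title" %}}_ here

# Escaped underscores are not touched; those need a manual fix.
print(underscores_to_asterisks(r"ant \_bear\_{{< sup 2 >}} cat"))
```

Leaving the escaped case untouched is intentional: a script cannot tell whether an escaped underscore was meant as italics or as a literal underscore.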
Here is an exhaustive list of all pages that have _{{ in their markdown: https://docs.google.com/spreadsheets/d/1YR9TjoknAM_lP5zVfI6Wv25ynQ0Q6cij8TqxOoU0tDk/edit?usp=sharing
Originally posted by @ChristopherChudzicki in mitodl/ocw-hugo-themes#501 (comment)