Add more rows to data set; discuss reconciliation, GREL, extensions/packages #29

Closed
ErinBecker opened this issue Jun 29, 2018 · 7 comments
Labels
help wanted (Looking for Contributors), status:refer to cac (Curriculum Advisory Committee input needed), type:enhancement (Propose enhancement to the lesson)

Comments

@ErinBecker
Contributor

ErinBecker commented Jun 29, 2018

The Social Sciences CAC met June 15th and 19th to discuss the full Social Sciences curriculum and provide recommendations to the Maintainers about work for these lessons between now and their publication (September 2018). Their specific action items for this lesson are as follows:

  • Getting more rows of the SAFI data to use in the lessons
  • Incorporating data reconciliation, packages, and more GREL into the OpenRefine lesson

Please see the meeting minutes for more details.

@bencomp
Contributor

bencomp commented Oct 21, 2022

I understand this is an issue from the previous Curriculum Advisory Committee, when other people maintained this lesson. Regardless, I would still like to thank that CAC for bringing up these action items and see how we can address them. I have referred to these suggestions on multiple occasions, so they haven't been ignored all these years.

TL;DR: regarding the action items, I would say:

  • yes, more rows of data (and more importantly, more errors)
  • not sure about reconciliation for this dataset
  • packages/extensions can go in 'Discussion'
  • yes, more GREL.

As an instructor of this lesson, I agree that 131 rows of data is not a lot. Instead of using facets or clustering to find outliers, you could fix incorrect values in these rows manually. What would be a good number? 1000, which according to the minutes is half the dataset?
Having more rows is said to be not critical, and I can see that. I have argued in #108 that the data need too few fixes. I think that is more important (although I am biased).

Reconciliation is very useful too, but we would need things to reconcile, preferably with example research questions that help learners understand why you would reconcile. Perhaps the easiest example would be to reconcile the names of the villages, districts, province and wards. There may be a possibility to use this to spot that in one row the village is said to be in the incorrect ward or district.
This dataset is probably not the best for demonstrating reconciliation, which is about connecting strings in a dataset to more widely used identifiers. Ethics come into play here, as I start to think about how reconciling items in this dataset could help deanonymise the respondents.

If I understand correctly, packages in the action items refers to extensions and distributions for OpenRefine. I think that could be a topic for the discussion page/section, because support for extensions across OR versions varies a lot and the distributions appear to be niche products. They can be powerful, but I feel they are less suited to people who first discover OR.
Because I have installed the RDF extension, my screen looks a bit different from the learners' screens, which I point out at the beginning of the lesson.

More GREL: yes! GREL is of course the way to transform the data. I wonder if GREL should be introduced with simpler examples than .replace on strings, like incrementing numbers (value + 1) or combining strings to create URLs ("https://example.org/" + value).

Overall, I would like to check in with the current CAC for their views and suggestions. I suggested several potential improvements for this lesson in #102, #122, #108 that would require or benefit from CAC input. They influence how much time opens up for other learning objectives.
Suggestions from others (not in the CAC) are also very welcome, of course!

@bencomp added the help wanted, status:refer to cac, and type:enhancement labels Oct 21, 2022
@bencomp
Contributor

bencomp commented Nov 30, 2022

I posted a link to this issue in the lesson's Slack channel, with questions that are discussed here. @ostephens responded as follows (copied with permission):

How many rows should the dataset have?

In the Library Carpentry OR lesson we have a dataset of 1001 rows which seems to work as a good size - it's big enough you can appreciate that a tool helps, but small enough that we don't have any performance issues.

What kind of GREL expressions should be added to the lesson?

Assuming the current dataset, the cells with lists in look good for some GREL examples. So for instance the "items_owned" column can be manipulated using GREL to give a count of the most common items that are owned (mobile phones and radios just ahead of ploughs).
The current format of those lists makes the GREL for getting a clean list slightly complicated. Done correctly, I think a series of steps that goes through the process of 'cleaning' this column could provide a really good set of learning materials. One of the great things about OpenRefine is the ability to get real-time feedback on changes as you work with the data.
OTOH if a more accessible example is needed the data set could be updated to simplify those lists to be just semi-colon separated which would make the process much simpler.
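
A minimal sketch of that clean-then-count sequence, written in Python rather than GREL so it can be run outside OpenRefine. The sample cells and their bracket-and-quote formatting are assumptions for illustration, not rows copied from the SAFI data, and `clean_items` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical sample cells; the real "items_owned" formatting may differ.
cells = [
    "['bicycle' ;  'television' ;  'solar_panel']",
    "['radio' ;  'mobile_phone']",
    "['mobile_phone' ;  'radio' ;  'cow_plough']",
    "",  # a respondent with no items recorded
]


def clean_items(cell):
    """Remove brackets and quotes, split on semicolons, trim whitespace."""
    stripped = cell.replace("[", "").replace("]", "").replace("'", "")
    return [item.strip() for item in stripped.split(";") if item.strip()]


# Count how often each item appears across all cells.
counts = Counter(item for cell in cells for item in clean_items(cell))

for item, n in counts.most_common():
    print(item, n)
```

Each `replace` call corresponds to one small transform that learners could apply and inspect in turn. If the lists were already plain semicolon-separated values, only the split and trim steps would remain.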

Another GREL example that would work with the current dataset would be the formatting of the "interview_date" column which is currently in dd-MMM-yyyy (vs the start and end columns which use ISO-8601). So something like:
value.toDate("dd-MMM-yy").toString("yyyy-MM-dd")
could provide a good example and an opportunity to talk more generally about date manipulation in OR (I would have guessed that date issues come up commonly in social science datasets, but I may be wrong as this is not my area).
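
To make the conversion concrete, here is a hedged Python equivalent of the GREL expression above. The sample dates are invented and `to_iso_date` is a hypothetical helper name. The pattern has to match the data: a two-digit year needs `%y` (GREL `yy`), while a four-digit year needs `%Y` (GREL `yyyy`).

```python
from datetime import datetime


def to_iso_date(text, source_format="%d-%b-%y"):
    """Parse a day-month-year string and return it as yyyy-MM-dd.

    %b matches abbreviated month names such as "Nov" in the default
    (English/C) locale; other locales may need different handling.
    """
    return datetime.strptime(text, source_format).strftime("%Y-%m-%d")


# Two-digit year, matching the "dd-MMM-yy" pattern.
print(to_iso_date("17-Nov-16"))

# Four-digit year, matching a "dd-MMM-yyyy" pattern.
print(to_iso_date("17-Nov-2016", source_format="%d-%b-%Y"))
```

If the pattern does not match the text, `strptime` raises a `ValueError`, which is a useful reminder to check the format of the column before transforming it.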

Is reconciliation useful for this dataset?

The province data and most of the district data will reconcile nicely against Wikidata, which could make a good example and allow the user to bring in data from Wikidata (e.g. the coordinate location for the district, although the data set already has some GPS coordinates, so this isn't a strong example here).
Unfortunately, at the moment the ward and village information doesn't have matching Wikidata entries, although of course someone could fix that 🙂

Should we discuss extensions and alternative distributions of OpenRefine?

Unless there is something really specific I'd say these are worth mentioning but not including in detail (this is the approach we take in Library Carpentry).

@bencomp
Contributor

bencomp commented Jan 25, 2023

@datacarpentry/curriculum-advisors-social-science Your input would be very welcome.

@ndporter

One comment from teaching this recently with the list of items column - the lesson uses GREL to facet by subsets of the column but doesn't demonstrate how to change that column to something more usable (such as dummy variables for each category of item once they're cleaned). As a bonus, parsing it to columns also highlights for learners the difference between cell transforms, multi-valued cell splits, and column splits.
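
For readers unfamiliar with dummy variables, the target shape can be sketched in Python as follows. This is only an illustration of the desired output, not a description of how OpenRefine would do it. The rows, the `key_ID` column name, and the `add_dummies` helper are all hypothetical, and the item lists are assumed to be already cleaned to semicolon-separated values.

```python
# Hypothetical rows with an already-cleaned, semicolon-separated item list.
rows = [
    {"key_ID": 1, "items_owned": "bicycle;radio"},
    {"key_ID": 2, "items_owned": "radio;mobile_phone"},
    {"key_ID": 3, "items_owned": ""},
]


def add_dummies(rows, column, sep=";"):
    """Return copies of the rows with one 0/1 column per distinct item."""
    # Collect every distinct, non-empty item across all rows.
    categories = sorted(
        {
            item.strip()
            for row in rows
            for item in row[column].split(sep)
            if item.strip()
        }
    )
    result = []
    for row in rows:
        owned = {item.strip() for item in row[column].split(sep)}
        new_row = dict(row)  # leave the input rows unchanged
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if category in owned else 0
        result.append(new_row)
    return result


for row in add_dummies(rows, "items_owned"):
    print(row)
```

Each distinct item becomes its own column holding 1 when the household owns it and 0 otherwise, while the original list column is kept alongside.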

All of that said, adding more GREL is also tricky when learners don't have programming experience because chaining functions can rapidly become confusing to novice coders.

@eirini-zormpa
Member

thank you for these points @bencomp and @ndporter ✨ The CAC haven't had a meeting in a while, but we'll make sure to discuss this issue next time we do!

@bencomp
Contributor

bencomp commented Feb 16, 2023

Thanks for your responses, @ndporter and @eirini-zormpa! I look forward to the results of your discussion.

As to your comment, @ndporter: the idea of using OpenRefine to create dummy variables from the items column had not yet crossed my mind. I like it. After trying and going through the manual and StackOverflow for a little bit, I think it is doable, but not in this workshop. It requires exporting the ID and items columns, doing the transformation in a new project and then importing the new columns (crossing them one by one, potentially) into the project. That is madness. Perhaps there are easier ways using column splitting, but I guess the current exercise of splitting to count is good enough. I'm open to other suggestions for introducing more GREL.

@bencomp changed the title from "action items from Curriculum Advisors" to "Add more rows to data set; discuss reconciliation, GREL, extensions/packages" Jul 19, 2023
@bencomp
Contributor

bencomp commented Sep 20, 2023

As a Maintainer, I would like to be able to close this issue after five years. It has been open for so long because it is a collection of suggestions. Some suggestions can be worked on, but others are probably out of scope for the lesson.

To allow for more targeted discussion and decisions, as well as progress on incorporating them, I updated #108 to also track the expansion of the data set with more rows and I created separate issues for the other suggestions.

I will copy relevant comments to these other topics, so we can continue the discussions and close this issue.

@bencomp bencomp closed this as completed Sep 20, 2023