Expanded correct queries for 3 questions + minor prompt/typo fixes #151

Merged
merged 4 commits into from
Jun 3, 2024

Conversation

rishsriv
Member

@rishsriv rishsriv commented Jun 3, 2024

The 3 questions fixed were:

What is the total number of citations received by each author?
Previously, only sum(publication.citation_num) was accepted as the correct answer. Now, COUNT(c.cited) is also accepted.

Subtract 2 weeks from the most recent predicted graduation date and give the month.
Allow for more date formats

Which papers have the highest number of authors, ordered by the number of authors in descending order?
Allow answers that include both the paper id and paper title, instead of just one
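
As an illustration of the first fix, the two accepted formulations agree whenever `publication.citation_num` matches the number of rows in `cite` that point at the paper. A minimal sketch on toy data follows; the rows, the author names, and the `citing` column are assumptions made up for this example, and SQLite stands in for the benchmark's database:

```python
import sqlite3

# Toy data (hypothetical): citation_num is kept consistent with the cite table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author (aid INTEGER, name TEXT);
CREATE TABLE publication (pid INTEGER, citation_num INTEGER);
CREATE TABLE writes (aid INTEGER, pid INTEGER);
CREATE TABLE cite (citing INTEGER, cited INTEGER);
INSERT INTO author VALUES (1, 'Ada'), (2, 'Bo');
INSERT INTO publication VALUES (10, 2), (11, 1), (12, 1);
INSERT INTO writes VALUES (1, 10), (1, 11), (2, 12);
INSERT INTO cite VALUES (11, 10), (12, 10), (12, 11), (10, 12);
""")

# Formulation 1: sum the precomputed citation counts per author.
by_sum = """
SELECT author.name, SUM(publication.citation_num) AS total_citations
FROM author
JOIN writes ON author.aid = writes.aid
JOIN publication ON writes.pid = publication.pid
GROUP BY author.name ORDER BY total_citations DESC
"""

# Formulation 2: count the rows in cite that reference each author's papers.
by_count = """
SELECT a.name, COUNT(c.cited) AS total_citations
FROM author a
JOIN writes w ON a.aid = w.aid
JOIN publication p ON w.pid = p.pid
JOIN cite c ON p.pid = c.cited
GROUP BY a.name ORDER BY total_citations DESC
"""

sum_rows = conn.execute(by_sum).fetchall()
count_rows = conn.execute(by_count).fetchall()
print(sum_rows == count_rows)  # True
```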

@@ -5,7 +5,7 @@ What is the average number of citations received by publications in the last 5 y
Which authors have published papers in journals within the past 20 years?,"SELECT DISTINCT {author.name, author.aid} FROM author JOIN writes ON author.aid = writes.aid JOIN publication ON writes.pid = publication.pid WHERE publication.year >= extract(YEAR FROM CURRENT_DATE) - 20;",academic,date_functions,
What's the difference in time between the first and last paper published?,SELECT max(YEAR) - min(YEAR) AS time_difference FROM publication;,academic,date_functions,
"Which authors have written publications in both the domain ""Machine Learning"" and the domain ""Data Science""?","SELECT {author.name,author.aid} FROM author WHERE author.aid IN (SELECT domain_author.aid FROM domain_author WHERE domain_author.did IN (SELECT domain.did FROM DOMAIN WHERE domain.name IN ('Machine Learning', 'Data Science') ) GROUP BY 1 HAVING COUNT(DISTINCT domain_author.did) = 2);",academic,group_by,
What is the total number of citations received by each author?,"SELECT {author.name, author.aid}, sum(publication.citation_num) AS total_citations FROM author JOIN writes ON author.aid = writes.aid JOIN publication ON writes.pid = publication.pid GROUP BY {} ORDER BY total_citations DESC NULLS LAST;",academic,group_by,
What is the total number of citations received by each author?,"SELECT {author.name, author.aid}, sum(publication.citation_num) AS total_citations FROM author JOIN writes ON author.aid = writes.aid JOIN publication ON writes.pid = publication.pid GROUP BY {} ORDER BY total_citations DESC NULLS LAST;SELECT a.name, COUNT(c.cited) AS total_citations FROM author a JOIN writes w ON a.aid = w.aid JOIN publication p ON w.pid = p.pid JOIN cite c ON p.pid = c.cited GROUP BY a.name ORDER BY total_citations DESC;",academic,group_by,
Contributor

Should we expand it to:
SELECT {a.name, a.aid}, COUNT(c.cited) AS total_citations FROM author a JOIN writes w ON a.aid = w.aid JOIN publication p ON w.pid = p.pid JOIN cite c ON p.pid = c.cited GROUP BY {} ORDER BY total_citations DESC; to include author.aid too?

Member Author

will do, thank you!
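
The `{a.name, a.aid}` and `GROUP BY {}` notation in these reference queries can be read as "any non-empty subset of the braced columns is acceptable", with `;` separating alternative queries. A minimal sketch of such an expansion, assuming that reading; `expand_template` is a hypothetical helper written for illustration and is not taken from sql-eval:

```python
import itertools
import re


def expand_template(template: str) -> list[str]:
    """Expand a ';'-separated template into every concrete query it allows."""
    queries = []
    for query in template.split(";"):
        query = query.strip()
        if not query:
            continue
        # First non-empty brace group, e.g. "{a.name, a.aid}"; "{}" is skipped.
        match = re.search(r"\{([^{}]+)\}", query)
        if match is None:
            queries.append(query)
            continue
        columns = [col.strip() for col in match.group(1).split(",")]
        # Every non-empty subset of the columns, keeping their original order.
        for size in range(1, len(columns) + 1):
            for subset in itertools.combinations(columns, size):
                column_str = ", ".join(subset)
                expanded = query[: match.start()] + column_str + query[match.end():]
                # Mirror the chosen columns in the GROUP BY placeholder.
                expanded = expanded.replace("GROUP BY {}", f"GROUP BY {column_str}")
                queries.append(expanded)
    return queries


template = (
    "SELECT {a.name, a.aid}, COUNT(c.cited) AS total_citations "
    "FROM author a JOIN writes w ON a.aid = w.aid "
    "JOIN publication p ON w.pid = p.pid "
    "JOIN cite c ON p.pid = c.cited "
    "GROUP BY {} ORDER BY total_citations DESC;"
)
expanded_queries = expand_template(template)
for q in expanded_queries:
    print(q)
```

Under this reading the template yields three queries: one selecting and grouping by `a.name`, one by `a.aid`, and one by both.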

@@ -160,7 +160,7 @@ What is the total number of papers published in each year?,"SELECT paper.year, C
What is the total number of papers associated with each dataset?,"SELECT paperdataset.datasetid, COUNT(DISTINCT paperdataset.paperid) AS total_papers FROM paperdataset GROUP BY paperdataset.datasetid;SELECT dataset.datasetname, COUNT(paperdataset.paperid) AS total_papers FROM paperdataset JOIN dataset ON paperdataset.datasetid = dataset.datasetid GROUP BY dataset.datasetname;",scholar,group_by,
How many keyphrases are associated with each paper?,"SELECT paperkeyphrase.paperid, COUNT(paperkeyphrase.keyphraseid) AS keyphrase_count FROM paperkeyphrase GROUP BY paperkeyphrase.paperid ORDER BY keyphrase_count DESC NULLS LAST;SELECT p.title, COUNT(pk.keyphraseid) AS num_keyphrases FROM paper p JOIN paperkeyphrase pk ON p.paperid = pk.paperid GROUP BY p.title ORDER BY num_keyphrases DESC NULLS LAST;",scholar,group_by,
How many authors have published more than 2 papers?,SELECT COUNT(*) AS number_of_authors FROM (SELECT writes.authorid FROM writes GROUP BY writes.authorid HAVING COUNT(writes.paperid) > 2) AS subquery;,scholar,group_by,
"Which papers have the highest number of authors, ordered by the number of authors in descending order?","SELECT writes.paperid, COUNT(writes.authorid) AS num_authors FROM writes GROUP BY writes.paperid ORDER BY num_authors DESC NULLS LAST;SELECT paper.title, COUNT(DISTINCT writes.authorid) AS num_authors FROM paper JOIN writes ON paper.paperid = writes.paperid GROUP BY paper.title ORDER BY num_authors DESC;",scholar,order_by,
"Which papers have the highest number of authors, ordered by the number of authors in descending order?","SELECT {paper.paperid, paper.title}, COUNT(DISTINCT writes.authorid) AS num_authors FROM paper JOIN writes ON paper.paperid = writes.paperid GROUP BY {} ORDER BY num_authors DESC;",scholar,order_by,
Contributor

Thanks for consolidating. But I think we should keep the writes.paperid SQL because the two queries produce different dataframes:

SELECT writes.paperid, COUNT(writes.authorid) AS num_authors FROM writes GROUP BY writes.paperid ORDER BY num_authors DESC NULLS LAST;
 paperid | num_authors 
---------+-------------
       4 |           3
       2 |           3
       1 |           3
       5 |           2
       3 |           1

SELECT paper.paperid, COUNT(DISTINCT writes.authorid) AS num_authors FROM paper JOIN writes ON paper.paperid = writes.paperid GROUP BY paper.paperid ORDER BY num_authors DESC;
 paperid | num_authors 
---------+-------------
       1 |           3
       2 |           3
       4 |           3
       5 |           2
       3 |           1

Member Author

@rishsriv rishsriv Jun 3, 2024

It's the same results – just ordered differently! :) The new ordering logic in sql-eval takes care of the ordering when doing the eval
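
To illustrate why the two result sets above count as the same: under a tie-aware comparison, rows that share a sort key may appear in any order. A minimal sketch, assuming the check only cares about the ORDER BY column; `same_up_to_ties` is a hypothetical helper and not the actual sql-eval logic:

```python
from itertools import groupby


def same_up_to_ties(rows_a, rows_b, key_index):
    """Compare two ordered result sets, ignoring row order within ties."""

    def tie_groups(rows):
        # Consecutive rows with the same sort key form one group;
        # sorting inside the group discards the arbitrary tie order.
        return [
            (key, sorted(group))
            for key, group in groupby(rows, key=lambda row: row[key_index])
        ]

    return tie_groups(rows_a) == tie_groups(rows_b)


# (paperid, num_authors) rows from the two queries quoted above.
from_writes_only = [(4, 3), (2, 3), (1, 3), (5, 2), (3, 1)]
from_paper_join = [(1, 3), (2, 3), (4, 3), (5, 2), (3, 1)]

print(same_up_to_ties(from_writes_only, from_paper_join, key_index=1))  # True
```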

Contributor

Oh we already fixed that? Ok my bad pls ignore! Then I think we might have a few other redundant queries lying around :P

@wendy-aw
Contributor

wendy-aw commented Jun 3, 2024

Thanks for the fixes! Just two small comments from me

Collaborator

@wongjingping wongjingping left a comment

LGTM, thank you for the detailed review and improvements to make the benchmark more robust and informative!

@@ -5,7 +5,7 @@ What is the average number of citations received by publications in the last 5 y
Which authors have published papers in journals within the past 20 years?,"SELECT DISTINCT {author.name, author.aid} FROM author JOIN writes ON author.aid = writes.aid JOIN publication ON writes.pid = publication.pid WHERE publication.year >= extract(YEAR FROM CURRENT_DATE) - 20;",academic,date_functions,
What's the difference in time between the first and last paper published?,SELECT max(YEAR) - min(YEAR) AS time_difference FROM publication;,academic,date_functions,
"Which authors have written publications in both the domain ""Machine Learning"" and the domain ""Data Science""?","SELECT {author.name,author.aid} FROM author WHERE author.aid IN (SELECT domain_author.aid FROM domain_author WHERE domain_author.did IN (SELECT domain.did FROM DOMAIN WHERE domain.name IN ('Machine Learning', 'Data Science') ) GROUP BY 1 HAVING COUNT(DISTINCT domain_author.did) = 2);",academic,group_by,
What is the total number of citations received by each author?,"SELECT {author.name, author.aid}, sum(publication.citation_num) AS total_citations FROM author JOIN writes ON author.aid = writes.aid JOIN publication ON writes.pid = publication.pid GROUP BY {} ORDER BY total_citations DESC NULLS LAST;",academic,group_by,
What is the total number of citations received by each author?,"SELECT {author.name, author.aid}, sum(publication.citation_num) AS total_citations FROM author JOIN writes ON author.aid = writes.aid JOIN publication ON writes.pid = publication.pid GROUP BY {} ORDER BY total_citations DESC NULLS LAST;SELECT a.name, COUNT(c.cited) AS total_citations FROM author a JOIN writes w ON a.aid = w.aid JOIN publication p ON w.pid = p.pid JOIN cite c ON p.pid = c.cited GROUP BY a.name ORDER BY total_citations DESC;",academic,group_by,
Collaborator

Shall we also allow the COUNT(c.cited) query to accept any combination of a.name/a.aid? It would look something like:

SELECT {a.name, a.aid}, COUNT(c.cited) AS total_citations FROM author a JOIN writes w ON a.aid = w.aid JOIN publication p ON w.pid = p.pid JOIN cite c ON p.pid = c.cited GROUP BY {} ORDER BY total_citations DESC;

Member Author

Yup agreed, adding a fix for this!

@@ -33,7 +33,7 @@ What month were most students admitted?,"SELECT date_trunc('month', s.admit_term
What's the average predicted time to graduation since admission in no. of days?,SELECT avg(predicted_graduation_semester - admit_term) AS average_predicted_time_to_graduation FROM student;,advising,date_functions,
How many students were predicted to graduate in the last 10 years?,"SELECT count(*) AS num_students_graduated FROM student WHERE predicted_graduation_semester >= DATE_TRUNC('year', CURRENT_DATE) - interval '10 year';",advising,date_functions,
How long has it been in days since the last admitted student?,SELECT CURRENT_DATE - max(admit_term) AS duration_since_last_admitted_student FROM student;,advising,date_functions,
Subtract 2 weeks from the most recent predicted graduation date and give the month.,"SELECT DATE_TRUNC('month', s.predicted_graduation_semester - INTERVAL '2 weeks') AS month FROM student s ORDER BY s.predicted_graduation_semester DESC LIMIT 1;SELECT extract(MONTH FROM predicted_graduation_semester - interval '2 weeks') AS month FROM student ORDER BY predicted_graduation_semester DESC LIMIT 1;",advising,date_functions,
Subtract 2 weeks from the most recent predicted graduation date and give the month.,"SELECT DATE_TRUNC('month', s.predicted_graduation_semester - INTERVAL '2 weeks') AS month FROM student s ORDER BY s.predicted_graduation_semester DESC LIMIT 1;SELECT extract(MONTH FROM predicted_graduation_semester - interval '2 weeks') AS month FROM student ORDER BY predicted_graduation_semester DESC LIMIT 1;SELECT to_char(s.predicted_graduation_semester - interval '14 days', 'Month') AS MONTH FROM student s ORDER BY s.predicted_graduation_semester DESC LIMIT 1;",advising,date_functions,
Collaborator

good catch!
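
For context on why each format needs its own reference query: the three accepted queries return the month in three different shapes, a value truncated to the start of the month (`DATE_TRUNC`), a month number (`extract`), and a month name (`to_char`). A rough Python analogue, using a made-up date rather than anything from the advising data:

```python
from datetime import date, timedelta

# Hypothetical "most recent predicted graduation date" for illustration.
latest = date(2024, 6, 10)
shifted = latest - timedelta(weeks=2)  # two weeks earlier: 2024-05-27

month_start = shifted.replace(day=1)  # roughly DATE_TRUNC('month', ...)
month_number = shifted.month          # roughly extract(MONTH FROM ...)
month_name = shifted.strftime("%B")   # roughly to_char(..., 'Month'), which also pads the name

print(month_start, month_number, month_name)
```

All three describe the same month, but they would not compare equal as raw result values.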

@@ -160,7 +160,7 @@ What is the total number of papers published in each year?,"SELECT paper.year, C
What is the total number of papers associated with each dataset?,"SELECT paperdataset.datasetid, COUNT(DISTINCT paperdataset.paperid) AS total_papers FROM paperdataset GROUP BY paperdataset.datasetid;SELECT dataset.datasetname, COUNT(paperdataset.paperid) AS total_papers FROM paperdataset JOIN dataset ON paperdataset.datasetid = dataset.datasetid GROUP BY dataset.datasetname;",scholar,group_by,
How many keyphrases are associated with each paper?,"SELECT paperkeyphrase.paperid, COUNT(paperkeyphrase.keyphraseid) AS keyphrase_count FROM paperkeyphrase GROUP BY paperkeyphrase.paperid ORDER BY keyphrase_count DESC NULLS LAST;SELECT p.title, COUNT(pk.keyphraseid) AS num_keyphrases FROM paper p JOIN paperkeyphrase pk ON p.paperid = pk.paperid GROUP BY p.title ORDER BY num_keyphrases DESC NULLS LAST;",scholar,group_by,
How many authors have published more than 2 papers?,SELECT COUNT(*) AS number_of_authors FROM (SELECT writes.authorid FROM writes GROUP BY writes.authorid HAVING COUNT(writes.paperid) > 2) AS subquery;,scholar,group_by,
"Which papers have the highest number of authors, ordered by the number of authors in descending order?","SELECT writes.paperid, COUNT(writes.authorid) AS num_authors FROM writes GROUP BY writes.paperid ORDER BY num_authors DESC NULLS LAST;SELECT paper.title, COUNT(DISTINCT writes.authorid) AS num_authors FROM paper JOIN writes ON paper.paperid = writes.paperid GROUP BY paper.title ORDER BY num_authors DESC;",scholar,order_by,
"Which papers have the highest number of authors, ordered by the number of authors in descending order?","SELECT {paper.paperid, paper.title}, COUNT(DISTINCT writes.authorid) AS num_authors FROM paper JOIN writes ON paper.paperid = writes.paperid GROUP BY {} ORDER BY num_authors DESC;",scholar,order_by,
Collaborator

As there is a tie here in the top 3 papers, I'll add some more data to defog-data to ensure that we break the ties via the data since we only need to verify the ordering by number of authors and not the paper id/name:

 paperid |                    title                     | num_authors 
---------+----------------------------------------------+-------------
       1 | A Study on Machine Learning Algorithms       |           3
       2 | The Effects of Climate Change on Agriculture |           3
       4 | COVID-19 Impact on Society                   |           3
       5 | Machine Learning in Tackling Climate Change  |           2
       3 | Social Media and Mental Health               |           1

Member Author

Good point, though the eval logic now takes "tie-breaks" into account for order by questions

Member Author

New logic here! I think we should be good to go without changes to defog-data

if order_by_clause:

@rishsriv rishsriv merged commit 9139aa3 into main Jun 3, 2024
2 checks passed
@rishsriv rishsriv deleted the rishabh/update-sqleval branch June 3, 2024 04:23