Expanded correct queries for 3 questions + minor prompt/typo fixes #151
data/questions_gen_postgres.csv
Outdated
@@ -5,7 +5,7 @@ What is the average number of citations received by publications in the last 5 y | |||
Which authors have published papers in journals within the past 20 years?,"SELECT DISTINCT {author.name, author.aid} FROM author JOIN writes ON author.aid = writes.aid JOIN publication ON writes.pid = publication.pid WHERE publication.year >= extract(YEAR FROM CURRENT_DATE) - 20;",academic,date_functions, | |||
What's the difference in time between the first and last paper published?,SELECT max(YEAR) - min(YEAR) AS time_difference FROM publication;,academic,date_functions, | |||
"Which authors have written publications in both the domain ""Machine Learning"" and the domain ""Data Science""?","SELECT {author.name,author.aid} FROM author WHERE author.aid IN (SELECT domain_author.aid FROM domain_author WHERE domain_author.did IN (SELECT domain.did FROM DOMAIN WHERE domain.name IN ('Machine Learning', 'Data Science') ) GROUP BY 1 HAVING COUNT(DISTINCT domain_author.did) = 2);",academic,group_by, | |||
What is the total number of citations received by each author?,"SELECT {author.name, author.aid}, sum(publication.citation_num) AS total_citations FROM author JOIN writes ON author.aid = writes.aid JOIN publication ON writes.pid = publication.pid GROUP BY {} ORDER BY total_citations DESC NULLS LAST;",academic,group_by, | |||
What is the total number of citations received by each author?,"SELECT {author.name, author.aid}, sum(publication.citation_num) AS total_citations FROM author JOIN writes ON author.aid = writes.aid JOIN publication ON writes.pid = publication.pid GROUP BY {} ORDER BY total_citations DESC NULLS LAST;SELECT a.name, COUNT(c.cited) AS total_citations FROM author a JOIN writes w ON a.aid = w.aid JOIN publication p ON w.pid = p.pid JOIN cite c ON p.pid = c.cited GROUP BY a.name ORDER BY total_citations DESC;",academic,group_by, |
Should we expand it to:
SELECT {a.name, a.aid}, COUNT(c.cited) AS total_citations FROM author a JOIN writes w ON a.aid = w.aid JOIN publication p ON w.pid = p.pid JOIN cite c ON p.pid = c.cited GROUP BY {} ORDER BY total_citations DESC;
to include the author id (a.aid) too?
will do, thank you!
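For readers unfamiliar with the curly-brace notation used in these queries, here is a minimal sketch of how such a template might be expanded. It assumes that `{col1, col2}` means "any non-empty combination of the listed columns" and that an empty `{}` repeats whichever combination was chosen. The function name `expand_template` is hypothetical and is not taken from sql-eval.

```python
import re
from itertools import combinations


def expand_template(query: str) -> list[str]:
    """Expand the first {col1, col2, ...} group into concrete queries.

    Assumption (not confirmed by the thread): every non-empty combination
    of the listed columns is acceptable, and an empty {} placeholder
    repeats the chosen combination.
    """
    match = re.search(r"\{([^{}]+)\}", query)
    if match is None:
        return [query]
    columns = [col.strip() for col in match.group(1).split(",")]
    expanded = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            chosen = ", ".join(combo)
            # Substitute the column list, then fill any empty {} placeholder.
            filled = query[: match.start()] + chosen + query[match.end():]
            expanded.append(filled.replace("{}", chosen))
    return expanded


template = (
    "SELECT {a.name, a.aid}, COUNT(c.cited) AS total_citations "
    "FROM author a JOIN writes w ON a.aid = w.aid "
    "JOIN publication p ON w.pid = p.pid "
    "JOIN cite c ON p.pid = c.cited "
    "GROUP BY {} ORDER BY total_citations DESC;"
)
variants = expand_template(template)
for variant in variants:
    print(variant)
```

Under these assumptions the template yields three concrete queries: one grouped by a.name, one by a.aid, and one by both.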
@@ -160,7 +160,7 @@ What is the total number of papers published in each year?,"SELECT paper.year, C | |||
What is the total number of papers associated with each dataset?,"SELECT paperdataset.datasetid, COUNT(DISTINCT paperdataset.paperid) AS total_papers FROM paperdataset GROUP BY paperdataset.datasetid;SELECT dataset.datasetname, COUNT(paperdataset.paperid) AS total_papers FROM paperdataset JOIN dataset ON paperdataset.datasetid = dataset.datasetid GROUP BY dataset.datasetname;",scholar,group_by, | |||
How many keyphrases are associated with each paper?,"SELECT paperkeyphrase.paperid, COUNT(paperkeyphrase.keyphraseid) AS keyphrase_count FROM paperkeyphrase GROUP BY paperkeyphrase.paperid ORDER BY keyphrase_count DESC NULLS LAST;SELECT p.title, COUNT(pk.keyphraseid) AS num_keyphrases FROM paper p JOIN paperkeyphrase pk ON p.paperid = pk.paperid GROUP BY p.title ORDER BY num_keyphrases DESC NULLS LAST;",scholar,group_by, | |||
How many authors have published more than 2 papers?,SELECT COUNT(*) AS number_of_authors FROM (SELECT writes.authorid FROM writes GROUP BY writes.authorid HAVING COUNT(writes.paperid) > 2) AS subquery;,scholar,group_by, | |||
"Which papers have the highest number of authors, ordered by the number of authors in descending order?","SELECT writes.paperid, COUNT(writes.authorid) AS num_authors FROM writes GROUP BY writes.paperid ORDER BY num_authors DESC NULLS LAST;SELECT paper.title, COUNT(DISTINCT writes.authorid) AS num_authors FROM paper JOIN writes ON paper.paperid = writes.paperid GROUP BY paper.title ORDER BY num_authors DESC;",scholar,order_by, | |||
"Which papers have the highest number of authors, ordered by the number of authors in descending order?","SELECT {paper.paperid, paper.title}, COUNT(DISTINCT writes.authorid) AS num_authors FROM paper JOIN writes ON paper.paperid = writes.paperid GROUP BY {} ORDER BY num_authors DESC;",scholar,order_by, |
Thanks for consolidating. But I think we should keep the writes.paperid SQL, because the two queries produce different dataframes:
SELECT writes.paperid, COUNT(writes.authorid) AS num_authors FROM writes GROUP BY writes.paperid ORDER BY num_authors DESC NULLS LAST;
paperid | num_authors
---------+-------------
4 | 3
2 | 3
1 | 3
5 | 2
3 | 1
SELECT paper.paperid, COUNT(DISTINCT writes.authorid) AS num_authors FROM paper JOIN writes ON paper.paperid = writes.paperid GROUP BY paper.paperid ORDER BY num_authors DESC;
paperid | num_authors
---------+-------------
1 | 3
2 | 3
4 | 3
5 | 2
3 | 1
It's the same results – just ordered differently! :) The new ordering logic in sql-eval takes care of the ordering when doing the eval
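To illustrate the point, here is a small sketch that compares the two result sets shown above as multisets, so row order does not matter. It only demonstrates the idea and is not the comparison code used by sql-eval; the helper name `same_rows_ignoring_order` is hypothetical.

```python
from collections import Counter


def same_rows_ignoring_order(rows_a, rows_b) -> bool:
    """Return True when both results contain the same rows, in any order."""
    return Counter(rows_a) == Counter(rows_b)


# (paperid, num_authors) rows copied from the two outputs in this thread.
writes_only = [(4, 3), (2, 3), (1, 3), (5, 2), (3, 1)]
with_join = [(1, 3), (2, 3), (4, 3), (5, 2), (3, 1)]

print(writes_only == with_join)                          # False: row order differs
print(same_rows_ignoring_order(writes_only, with_join))  # True: same rows
```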
Oh we already fixed that? Ok my bad pls ignore! Then I think we might have a few other redundant queries lying around :P
Thanks for the fixes! Just two small comments from me.
LGTM, thank you for the detailed review and improvements to make the benchmark more robust and informative!
data/questions_gen_postgres.csv
Outdated
@@ -5,7 +5,7 @@ What is the average number of citations received by publications in the last 5 y | |||
Which authors have published papers in journals within the past 20 years?,"SELECT DISTINCT {author.name, author.aid} FROM author JOIN writes ON author.aid = writes.aid JOIN publication ON writes.pid = publication.pid WHERE publication.year >= extract(YEAR FROM CURRENT_DATE) - 20;",academic,date_functions, | |||
What's the difference in time between the first and last paper published?,SELECT max(YEAR) - min(YEAR) AS time_difference FROM publication;,academic,date_functions, | |||
"Which authors have written publications in both the domain ""Machine Learning"" and the domain ""Data Science""?","SELECT {author.name,author.aid} FROM author WHERE author.aid IN (SELECT domain_author.aid FROM domain_author WHERE domain_author.did IN (SELECT domain.did FROM DOMAIN WHERE domain.name IN ('Machine Learning', 'Data Science') ) GROUP BY 1 HAVING COUNT(DISTINCT domain_author.did) = 2);",academic,group_by, | |||
What is the total number of citations received by each author?,"SELECT {author.name, author.aid}, sum(publication.citation_num) AS total_citations FROM author JOIN writes ON author.aid = writes.aid JOIN publication ON writes.pid = publication.pid GROUP BY {} ORDER BY total_citations DESC NULLS LAST;",academic,group_by, | |||
What is the total number of citations received by each author?,"SELECT {author.name, author.aid}, sum(publication.citation_num) AS total_citations FROM author JOIN writes ON author.aid = writes.aid JOIN publication ON writes.pid = publication.pid GROUP BY {} ORDER BY total_citations DESC NULLS LAST;SELECT a.name, COUNT(c.cited) AS total_citations FROM author a JOIN writes w ON a.aid = w.aid JOIN publication p ON w.pid = p.pid JOIN cite c ON p.pid = c.cited GROUP BY a.name ORDER BY total_citations DESC;",academic,group_by, |
Shall we also allow the COUNT(c.cited) query to accept either permutation of a.name/a.aid? It would look something like:
SELECT {a.name, a.aid}, COUNT(c.cited) AS total_citations FROM author a JOIN writes w ON a.aid = w.aid JOIN publication p ON w.pid = p.pid JOIN cite c ON p.pid = c.cited GROUP BY {} ORDER BY total_citations DESC;
Yup agreed, adding a fix for this!
@@ -33,7 +33,7 @@ What month were most students admitted?,"SELECT date_trunc('month', s.admit_term | |||
What's the average predicted time to graduation since admission in no. of days?,SELECT avg(predicted_graduation_semester - admit_term) AS average_predicted_time_to_graduation FROM student;,advising,date_functions, | |||
How many students were predicted to graduate in the last 10 years?,"SELECT count(*) AS num_students_graduated FROM student WHERE predicted_graduation_semester >= DATE_TRUNC('year', CURRENT_DATE) - interval '10 year';",advising,date_functions, | |||
How long has it been in days since the last admitted student?,SELECT CURRENT_DATE - max(admit_term) AS duration_since_last_admitted_student FROM student;,advising,date_functions, | |||
Subtract 2 weeks from the most recent predicted graduation date and give the month.,"SELECT DATE_TRUNC('month', s.predicted_graduation_semester - INTERVAL '2 weeks') AS month FROM student s ORDER BY s.predicted_graduation_semester DESC LIMIT 1;SELECT extract(MONTH FROM predicted_graduation_semester - interval '2 weeks') AS month FROM student ORDER BY predicted_graduation_semester DESC LIMIT 1;",advising,date_functions, | |||
Subtract 2 weeks from the most recent predicted graduation date and give the month.,"SELECT DATE_TRUNC('month', s.predicted_graduation_semester - INTERVAL '2 weeks') AS month FROM student s ORDER BY s.predicted_graduation_semester DESC LIMIT 1;SELECT extract(MONTH FROM predicted_graduation_semester - interval '2 weeks') AS month FROM student ORDER BY predicted_graduation_semester DESC LIMIT 1;SELECT to_char(s.predicted_graduation_semester - interval '14 days', 'Month') AS MONTH FROM student s ORDER BY s.predicted_graduation_semester DESC LIMIT 1;",advising,date_functions, |
good catch!
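The three accepted queries return the month in three different shapes: DATE_TRUNC gives the start of the month, extract gives the month number, and to_char with 'Month' gives the month name (blank-padded to 9 characters in Postgres). The sketch below mirrors those shapes in Python. The sample date is made up for illustration, and `month_representations` is a hypothetical helper rather than part of the benchmark.

```python
from datetime import date, timedelta


def month_representations(latest: date) -> dict:
    """Subtract two weeks from a date and return the month in three shapes."""
    shifted = latest - timedelta(weeks=2)
    return {
        # Analogous to DATE_TRUNC('month', ...): first day of that month.
        "date_trunc": shifted.replace(day=1),
        # Analogous to extract(MONTH FROM ...): month number 1-12.
        "extract": shifted.month,
        # Analogous to to_char(..., 'Month'): month name; strftime("%B")
        # depends on the locale and does not pad the way Postgres does.
        "to_char": shifted.strftime("%B"),
    }


# Hypothetical "most recent predicted graduation date".
result = month_representations(date(2024, 1, 10))
print(result)
```

Because the three shapes are not equal to one another, each one has to be listed as its own accepted query for the question.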
@@ -160,7 +160,7 @@ What is the total number of papers published in each year?,"SELECT paper.year, C | |||
What is the total number of papers associated with each dataset?,"SELECT paperdataset.datasetid, COUNT(DISTINCT paperdataset.paperid) AS total_papers FROM paperdataset GROUP BY paperdataset.datasetid;SELECT dataset.datasetname, COUNT(paperdataset.paperid) AS total_papers FROM paperdataset JOIN dataset ON paperdataset.datasetid = dataset.datasetid GROUP BY dataset.datasetname;",scholar,group_by, | |||
How many keyphrases are associated with each paper?,"SELECT paperkeyphrase.paperid, COUNT(paperkeyphrase.keyphraseid) AS keyphrase_count FROM paperkeyphrase GROUP BY paperkeyphrase.paperid ORDER BY keyphrase_count DESC NULLS LAST;SELECT p.title, COUNT(pk.keyphraseid) AS num_keyphrases FROM paper p JOIN paperkeyphrase pk ON p.paperid = pk.paperid GROUP BY p.title ORDER BY num_keyphrases DESC NULLS LAST;",scholar,group_by, | |||
How many authors have published more than 2 papers?,SELECT COUNT(*) AS number_of_authors FROM (SELECT writes.authorid FROM writes GROUP BY writes.authorid HAVING COUNT(writes.paperid) > 2) AS subquery;,scholar,group_by, | |||
"Which papers have the highest number of authors, ordered by the number of authors in descending order?","SELECT writes.paperid, COUNT(writes.authorid) AS num_authors FROM writes GROUP BY writes.paperid ORDER BY num_authors DESC NULLS LAST;SELECT paper.title, COUNT(DISTINCT writes.authorid) AS num_authors FROM paper JOIN writes ON paper.paperid = writes.paperid GROUP BY paper.title ORDER BY num_authors DESC;",scholar,order_by, | |||
"Which papers have the highest number of authors, ordered by the number of authors in descending order?","SELECT {paper.paperid, paper.title}, COUNT(DISTINCT writes.authorid) AS num_authors FROM paper JOIN writes ON paper.paperid = writes.paperid GROUP BY {} ORDER BY num_authors DESC;",scholar,order_by, |
As there is a tie here in the top 3 papers, I'll add some more data to defog-data
to ensure that we break the ties via the data since we only need to verify the ordering by number of authors and not the paper id/name:
paperid | title | num_authors
---------+----------------------------------------------+-------------
1 | A Study on Machine Learning Algorithms | 3
2 | The Effects of Climate Change on Agriculture | 3
4 | COVID-19 Impact on Society | 3
5 | Machine Learning in Tackling Climate Change | 2
3 | Social Media and Mental Health | 1
Good point, though the eval logic now takes "tie-breaks" into account for order by questions
New logic here! I think we should be good to go without changes to defog-data
Line 42 in b8b7704
if order_by_clause:
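As a rough illustration of tie-aware ordering checks, the sketch below accepts any permutation of rows that share the same sort-key value but still rejects rows that appear out of order. It is not the sql-eval implementation (the thread shows only the `if order_by_clause:` line); the function name `matches_with_ties` and its parameters are assumptions made for this example.

```python
from itertools import groupby


def matches_with_ties(expected, actual, key_index: int) -> bool:
    """Compare two ordered results, ignoring order among rows tied on the key.

    Rows are grouped into consecutive runs sharing the same sort-key value;
    each run is sorted so that order inside a tie does not matter.
    """
    def tie_groups(rows):
        return [
            (key, sorted(group))
            for key, group in groupby(rows, key=lambda row: row[key_index])
        ]

    return tie_groups(expected) == tie_groups(actual)


# (paperid, title, num_authors) rows from the table in this thread.
expected = [
    (1, "A Study on Machine Learning Algorithms", 3),
    (2, "The Effects of Climate Change on Agriculture", 3),
    (4, "COVID-19 Impact on Society", 3),
    (5, "Machine Learning in Tackling Climate Change", 2),
    (3, "Social Media and Mental Health", 1),
]
# Same rows, with the three tied papers in a different order.
shuffled_ties = [expected[2], expected[0], expected[1], expected[3], expected[4]]
# Paper 5 (2 authors) placed before paper 4 (3 authors): a real ordering error.
wrong_order = [expected[0], expected[1], expected[3], expected[2], expected[4]]

print(matches_with_ties(expected, shuffled_ties, key_index=2))  # True
print(matches_with_ties(expected, wrong_order, key_index=2))    # False
```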
The 3 questions fixed were:
1. What is the total number of citations received by each author?
   Previously, this only relied on sum(publication.citation_num) for the correct answer. Now, COUNT(c.cited) is also accepted as a correct answer.
2. Subtract 2 weeks from the most recent predicted graduation date and give the month.
   Allow for more date formats.
3. Which papers have the highest number of authors, ordered by the number of authors in descending order?
   Allow answers that include both the paper id and paper title, instead of just one.
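In the CSV diffs above, several accepted answers for one question sit in a single cell, separated by semicolons. The sketch below shows one way such a cell could be split into individual queries. It assumes that no query contains a semicolon inside a string literal, and the helper name `split_alternatives` is hypothetical rather than taken from the repository.

```python
def split_alternatives(cell: str) -> list[str]:
    """Split a semicolon-separated cell into individual SQL statements.

    Assumption: semicolons only ever terminate statements, so a plain
    split is enough; a literal containing ';' would break this.
    """
    return [part.strip() + ";" for part in cell.split(";") if part.strip()]


# The cell for the "subtract 2 weeks" question after this change.
cell = (
    "SELECT DATE_TRUNC('month', s.predicted_graduation_semester - INTERVAL '2 weeks') AS month "
    "FROM student s ORDER BY s.predicted_graduation_semester DESC LIMIT 1;"
    "SELECT extract(MONTH FROM predicted_graduation_semester - interval '2 weeks') AS month "
    "FROM student ORDER BY predicted_graduation_semester DESC LIMIT 1;"
    "SELECT to_char(s.predicted_graduation_semester - interval '14 days', 'Month') AS MONTH "
    "FROM student s ORDER BY s.predicted_graduation_semester DESC LIMIT 1;"
)
alternatives = split_alternatives(cell)
for query in alternatives:
    print(query)
```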