[#425] Extend harvester to support include/exclude filtering by tags #427

Open
wants to merge 84 commits into base: master
c20cff7
[#425] Extend harvester to support include/exclude filtering by tags
scottlimmer Nov 12, 2020
762ee14
Fix version check in templates to work on 2.10+
amercader Nov 23, 2020
faed627
Use pg 12 on 2.9
amercader Nov 23, 2020
8365eb3
Test master allowing failures
amercader Nov 24, 2020
646a060
Fix syntax
amercader Nov 24, 2020
d3432e9
Merge branch 'travis-pg-again'
amercader Nov 24, 2020
edccfc4
Fix flaky redis test
amercader Dec 11, 2020
76c0118
Migrate tests from Travis CI to GitHub Actions
amercader Dec 15, 2020
2cb3ab3
Show badge just for master
amercader Dec 15, 2020
95e82c2
Remove test-ga.ini, use env instead
amercader Dec 15, 2020
d005538
Replace default path to CKAN core config file with the one on the con…
amercader Dec 16, 2020
eb24643
Fix last harvest time to avoid timeouts
avdata99 Feb 3, 2021
6222e38
Merge pull request #431 from avdata99/fix_timeout
amercader Feb 4, 2021
15849ba
Add check that redis key is available before running .strptime to avo…
thejuliekramer Feb 4, 2021
3c83f41
Add flake8 step to workflow
Zharktas Mar 5, 2021
078077c
flake8
Zharktas Mar 5, 2021
2f1a2d3
flake8
Zharktas Mar 5, 2021
44bf788
Merge pull request #436 from ckan/add_flake8_step_to_workflow
tino097 Mar 21, 2021
52b2139
Optimize last_error_free_job to be more efficient
Zharktas Mar 12, 2021
1c64d0f
Oops. wrong syntax
Zharktas Mar 12, 2021
4f27de9
Use .outerjoin instead to support old sqlalchemy
Zharktas Mar 12, 2021
106e205
flake8
Zharktas Mar 22, 2021
94b17d4
Include webassets.yml in MANIFEST.in
amercader Mar 26, 2021
80e4e60
Merge pull request #437 from ckan/optimize_last_error_free_job
amercader Mar 26, 2021
ae554f0
Merge pull request #432 from thejuliekramer/fix_strptime_error_redis_key
amercader Mar 26, 2021
dff1fce
Merge branch 'master' of github.com:ckan/ckanext-harvest
amercader Mar 26, 2021
e2e8acb
Bump version
amercader Mar 26, 2021
d4064eb
Update README -fix run-test command name
alantygel Apr 18, 2021
5458ca2
Update README - replace _ by - in all commands
alantygel Apr 20, 2021
fd5a6b7
fix for py3 also works for py2; https://stackoverflow.com/questions/4…
nickumia-reisys Jul 19, 2021
8ee5eb3
Removes call to no longer used `render_jinja2' function
benjwadams Jul 30, 2021
a5dac48
Add summary of changes to CHANGELOG.rst
benjwadams Jul 30, 2021
35a9ba2
Allow GET requests to harvester.refresh
bzar Aug 2, 2021
ceaabd4
Merge pull request #439 from alantygel/patch-1
metaodi Sep 1, 2021
415b7d8
Add note about changed command names to README
metaodi Sep 1, 2021
e3c70bf
Merge pull request #454 from ckan/readme-changed-command-names
amercader Sep 1, 2021
eb7f6fc
Allow clearing a harvest source with a GET request
bzar Sep 13, 2021
d91ea66
Merge pull request #452 from bzar/patch-1
amercader Sep 13, 2021
d06444e
Add harvest info to solr
jbrown-xentity Sep 20, 2021
cb38910
Use py2 compatible formatting
jbrown-xentity Sep 20, 2021
fc8e723
replace render_jinja2 with render
TomeCirun Sep 22, 2021
51f85ee
Merge pull request #459 from TomeCirun/456-replace-render_jinja2-with…
Zharktas Sep 22, 2021
3d2b398
set the default MQ_TYPE to be redis
TomeCirun Sep 28, 2021
7e14f92
Merge pull request #463 from TomeCirun/change-default-MQ_TYPE
tino097 Sep 29, 2021
99538a4
Merge pull request #21 from GSA/json-serialization-fix
nickumia-reisys Sep 30, 2021
b653113
Merge pull request #20 from GSA/fix-pkg-extras
nickumia-reisys Sep 30, 2021
b3b519a
Revert "Fix pkg extras"
jbrown-xentity Sep 30, 2021
eca0ae0
Merge pull request #23 from GSA/revert-20-fix-pkg-extras
jbrown-xentity Sep 30, 2021
7f565cc
Merge pull request #451 from benjwadams/fix_render_function
amercader Oct 1, 2021
63701ad
Merge pull request #450 from GSA/json-serialization-fix
amercader Oct 1, 2021
6995fbc
Adjust solr index from live tests
jbrown-xentity Oct 4, 2021
9d2faea
Merge branch 'datagov-py3' into fix-pkg-extras
jbrown-xentity Oct 4, 2021
c0b3823
Fix lint
jbrown-xentity Oct 4, 2021
a418ed4
Remove unnecessary code, add tests
jbrown-xentity Oct 4, 2021
496a0df
Merge pull request #24 from GSA/fix-pkg-extras
nickumia-reisys Oct 5, 2021
bfa0da7
Explicitly set harvest extra info
jbrown-xentity Oct 7, 2021
2cd3d8d
Contain fix for both data_dict and validated
jbrown-xentity Oct 7, 2021
14ba290
Only update if extras exist
jbrown-xentity Oct 7, 2021
ff40069
Create extras if not there
jbrown-xentity Oct 7, 2021
925b0ee
Fix lint
jbrown-xentity Oct 7, 2021
04586fe
Merge pull request #26 from GSA/bug/clean-harvest-info
jbrown-xentity Oct 7, 2021
2910693
Merge pull request #458 from GSA/fix-pkg-extras
amercader Oct 8, 2021
254ffb0
remove force_import_val
TomeCirun Oct 15, 2021
878b81f
use render_jinja2 for when it is available for ckan core <= 2.9.4
FuhuXia Oct 20, 2021
7783919
flake8
FuhuXia Oct 21, 2021
419befe
Merge pull request #470 from FuhuXia/469-use-render_jinja2-when-avail…
amercader Oct 22, 2021
8f8cde8
Merge branch 'master' into 466-cli-run-test-throws-an-error
TomeCirun Oct 22, 2021
9d5679f
Merge pull request #467 from TomeCirun/466-cli-run-test-throws-an-error
tino097 Oct 22, 2021
0a06cad
Adds test for queue.resubmit_objects()
seitenbau-govdata Nov 19, 2021
2891b46
Fixes timeout date calculation on different time zones by using utcnow
seitenbau-govdata Nov 19, 2021
a934f27
Adds debug logging for re-sending objects to queue
seitenbau-govdata Nov 20, 2021
d2cf517
Clears redis db before queue test
seitenbau-govdata Nov 20, 2021
d84d847
Merge pull request #482 from GovDataOfficial/improve-test-coverage-an…
metaodi Nov 29, 2021
57a551d
Update changelog and setup.py for version 1.3.4
seitenbau-govdata Dec 1, 2021
6b238f8
Add option keep-actual to clearsource_history command
seitenbau-govdata Dec 15, 2021
d2b7340
Rename keep-actual to keep-current and updated documentation
seitenbau-govdata Dec 22, 2021
e650be8
Futureproof version check to work also on 3.0+
Zharktas Jan 18, 2022
b702bf9
Merge pull request #484 from GovDataOfficial/fix-clearsourcehistory-c…
amercader Jan 24, 2022
ddc891f
Merge pull request #487 from ckan/futureproof_version_check
amercader Jan 24, 2022
9d8d346
Merge branch 'release-version-1-3-4' of https://github.com/GovDataOff…
amercader Jan 24, 2022
0ae24e1
Add more changelog entries
amercader Jan 24, 2022
694a8cc
Merge branch 'GovDataOfficial-release-version-1-3-4'
amercader Jan 24, 2022
be6822a
[#425] Extend harvester to support include/exclude filtering by tags
scottlimmer Nov 12, 2020
edf6c7d
Merge branch 'feature/425-tags-filter-include-exclude' of https://git…
scottlimmer Apr 4, 2022
Fix last harvest time to avoid timeouts
avdata99 committed Feb 3, 2021
commit eb246439c682164963757bbf1b003769581f3d6c
19 changes: 19 additions & 0 deletions ckanext/harvest/model/__init__.py
@@ -176,12 +176,31 @@ def get_last_finished_object(self):

        return query

    def get_last_gathered_object(self):
        ''' Determine the last gathered object in this job
            Helpful to know if a job is running or not and
            to avoid timeouts when the source is running
        '''

        query = Session.query(HarvestObject)\
            .filter(HarvestObject.harvest_job_id == self.id)\
            .order_by(HarvestObject.gathered.desc())\
            .first()

        return query

    def get_last_action_time(self):
        last_object = self.get_last_finished_object()
        if last_object is not None:
            return last_object.import_finished

        if self.gather_finished is not None:
            return self.gather_finished

        last_gathered_object = self.get_last_gathered_object()
        if last_gathered_object is not None:
            return last_gathered_object.gathered

        return self.created

    def get_gather_errors(self):
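
Review note: the fallback order in `get_last_action_time` can be tried without a CKAN install. The following is a minimal sketch, not the extension's code. `FakeJob` and `FakeObject` are hypothetical stand-ins for `HarvestJob` and `HarvestObject`, and "last finished object" is simplified here to "latest non-null `import_finished`", which is an assumption about what `get_last_finished_object` returns.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class FakeObject:
    """Hypothetical stand-in for HarvestObject (not the real model)."""
    gathered: Optional[datetime] = None
    import_finished: Optional[datetime] = None


@dataclass
class FakeJob:
    """Hypothetical stand-in for HarvestJob (not the real model)."""
    created: datetime
    gather_finished: Optional[datetime] = None
    objects: List[FakeObject] = field(default_factory=list)

    def _latest(self, attr: str) -> Optional[datetime]:
        # Latest non-null timestamp across this job's objects, or None.
        values = [getattr(o, attr) for o in self.objects
                  if getattr(o, attr) is not None]
        return max(values) if values else None

    def get_last_action_time(self) -> datetime:
        # Same fallback order as the diff: last finished import, then the
        # end of the gather stage, then the newest gathered object, then
        # the job creation time.
        last_finished = self._latest('import_finished')
        if last_finished is not None:
            return last_finished
        if self.gather_finished is not None:
            return self.gather_finished
        last_gathered = self._latest('gathered')
        if last_gathered is not None:
            return last_gathered
        return self.created


t0 = datetime(2021, 2, 3, 12, 0)
job = FakeJob(created=t0)
empty_job_time = job.get_last_action_time()      # falls back to created

job.objects.append(FakeObject(gathered=t0 + timedelta(minutes=5)))
gathering_time = job.get_last_action_time()      # newest gathered object

job.objects[0].import_finished = t0 + timedelta(minutes=9)
importing_time = job.get_last_action_time()      # finished import wins
```

The middle case is the one this commit adds: a job that is still gathering now reports recent activity instead of its creation time, which appears to be what keeps a long gather stage from being flagged as timed out.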
22 changes: 18 additions & 4 deletions ckanext/harvest/tests/test_timeouts.py
@@ -13,9 +13,10 @@
 @pytest.mark.usefixtures('with_plugins', 'clean_db', 'harvest_setup', 'clean_queues')
 @pytest.mark.ckan_config('ckan.plugins', 'harvest test_action_harvester')
 class TestModelFunctions:
+    dataset_counter = 0

     def test_timeout_jobs(self):
-        """ Create harvest spurce, job and objects
+        """ Create harvest source, job and objects
             Validate we read the last object fished time
             Validate we raise timeout in harvest_jobs_run_action
         """
@@ -69,6 +70,17 @@ def test_no_gathered_job(self):
         assert_equal(job.get_last_finished_object(), None)
         assert_equal(job.get_last_action_time(), job.created)

+    def test_gather_get_last_action_time(self):
+        """ Test get_last_action_time at gather stage """
+        source, job = self.get_source()
+
+        ob1 = self.add_object(job=job, source=source, state='WAITING')
+        ob2 = self.add_object(job=job, source=source, state='WAITING')
+        ob3 = self.add_object(job=job, source=source, state='WAITING')
+
+        assert_equal(job.get_last_gathered_object(), ob3)
+        assert_equal(job.get_last_action_time(), ob3.gathered)
+
     def run(self, timeout, source, job):
         """ Run the havester_job_run and return the errors """

@@ -118,9 +130,10 @@ def get_source(self):

         return source, job

-    def add_object(self, job, source, state, minutes_ago):
+    def add_object(self, job, source, state, minutes_ago=0):
         now = datetime.utcnow()
-        name = 'dataset-{}-{}'.format(state.lower(), minutes_ago)
+        self.dataset_counter += 1
+        name = 'dataset-{}-{}'.format(state.lower(), self.dataset_counter)
         dataset = ckan_factories.Dataset(name=name)
         obj = harvest_factories.HarvestObjectObj(
             job=job,
@@ -132,6 +145,7 @@ def add_object(self, job, source, state, minutes_ago):
         )

         obj.state = state
-        obj.import_finished = now - timedelta(minutes=minutes_ago)
+        if minutes_ago > 0:
+            obj.import_finished = now - timedelta(minutes=minutes_ago)
         obj.save()
         return obj
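
Review note: the switch from `minutes_ago` to a counter in the dataset name is presumably needed because the new test creates three `WAITING` objects with the default `minutes_ago=0`, which would produce the same name three times under the old scheme. A small sketch of the difference follows. `Helper`, `old_name` and `new_name` are hypothetical names used only for illustration, and no CKAN factories are involved.

```python
class Helper:
    dataset_counter = 0  # class-level default, as in the diff

    def old_name(self, state, minutes_ago=0):
        # Old scheme: the name depends only on state and minutes_ago.
        return 'dataset-{}-{}'.format(state.lower(), minutes_ago)

    def new_name(self, state):
        # New scheme: the first "+=" reads the class attribute and then
        # stores the result on the instance, so the class value stays 0.
        self.dataset_counter += 1
        return 'dataset-{}-{}'.format(state.lower(), self.dataset_counter)


helper = Helper()
old_names = [helper.old_name('WAITING') for _ in range(3)]
new_names = [helper.new_name('WAITING') for _ in range(3)]

print(old_names)  # three identical names
print(new_names)  # three distinct names
print(Helper.dataset_counter, helper.dataset_counter)
```

Because pytest creates a fresh instance of a test class for each test method, a counter kept this way restarts for every test.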