
scraper, project, superblock: Implement automated greylisting #2778

Draft
wants to merge 36 commits into base: development
Conversation

jamescowens
Member

@jamescowens jamescowens commented Oct 2, 2024

This PR is for tracking the progress of implementing automated greylisting.

Please see the following set of notes for design considerations that need to be discussed.

  • Basic manual greylisting and scraper machinery for determining automatic greylisting complete
      • Manual greylisting is an administrative contract type that rides normal transactions
        • Scrapers will now collect statistics on projects that have a greylisted status of either AUTO_GREYLISTED or MANUAL_GREYLISTED. Credits and average credits will be recorded in the project payloads, but the project magnitude will be zero, and they will not contribute to CPID magnitude
        • The Scraper code does not deal DIRECTLY with greylisting rules as this is an individual node responsibility
        • The convergence rules, in terms of the required number of projects, use ACTIVE projects and do not include greylisted projects, even though stats are being collected for greylisted projects. This is because a greylisted project may be available or literally not available at all, so convergences cannot always be expected at the project level for greylisted projects.
  • TODO
    • Wire up automatic greylisting
      • These exist along superblock boundaries, like superblocks themselves (claims with a valid superblock contract) and beacon activation
      • Need to compute ZCD and WAS
      • ZCD rule is <= 7 zero credit days out of 40, WAS rule is last 7 days average project credits / 40 days average project credits >= 0.1
      • Since this needs to be on superblock boundaries, we can slightly change the rules for implementation to be in superblocks rather than days. Since almost all of the time superblocks are very close to one day, this is almost the same.
      • Implies an algorithm that operates over 40 days of superblock history
      • Missing stats for a whitelisted project in a superblock (e.g. because the project is hard down) need to be counted in ZCD, with a zero entry for WAS averaging, even though the last project convergence may come from the 48 hour stats carryover. This may require a tweak to the scraper convergence code
      • Choice of
        • stateless methods that repeatedly iterate over 40 superblock history to apply rules
        • methods over a cache structure that stores a subset of information from up to 40 superblocks relevant to the rule computation
        • Advantage of stateless is simplicity
        • Disadvantage is that it is fairly expensive, as the superblock registry has to be iterated over and processed – this involves disk I/O.
        • Advantage of caching is speed
        • Disadvantage of caching is complexity
        • Once the client is synced, this is only called when a superblock is staked and processed by the client, roughly once per 24 hours.
        • During sync it will be called approximately once per 960 blocks.
      • Need to define the order of precedence of the manual greylist versus the automatic one. The manual greylist status must always override the automatic one: the whole point of manual greylisting, once automatic greylisting is in place, is to deal with corner-case issues that are not handled by the ZCD and WAS rules.
        • Manual greylisting is granular to each block (i.e. an administrative contract of type project)
        • Automatic greylisting is granular to the superblock (valid staked superblock claim)
        • Walking this through…
          • Example 1

block → MAN_GREYLISTED

superblock → AUTO_GREYLISTED → status still MAN_GREYLISTED

superblock → removed from AUTO_GREYLISTED → status still MAN_GREYLISTED

block → removal from MAN_GREYLISTED → ACTIVE

          • Example 2

block → MAN_GREYLISTED

superblock → AUTO_GREYLISTED → status still MAN_GREYLISTED

block → removed from MAN_GREYLISTED → AUTO_GREYLISTED

superblock → removed from AUTO_GREYLISTED → ACTIVE

        • I think this means we have to do the cache. The most convenient way to deal with this order of precedence is to store the underlying AUTO_GREYLIST status in the cache and have methods to utilize this information
          - Have status in cache of something like AUTO_GREYLIST_QUAL which means project has met the conditions for AUTO_GREYLIST by the rules, but was already MAN_GREYLISTED
          - This would be checked for each contract injection to change MAN_GREYLIST status, to decide whether new status is either AUTO_GREYLIST or ACTIVE
          - The AUTO_GREYLIST_QUAL is a flag on the project at the current (head) state
          - Maybe this really belongs in the in memory superblock structure? This does not need to be in the on-disk (chain) superblock structure at the cost of some computations.
          - There is an existing superblock cache (SuperblockIndex g_superblock_index) that currently stores the last three superblocks and could be expanded to 40 superblocks as an easy way to do the cache. This means more computation on top of the cache but much faster because it operates on in memory structures rather than reading from the superblock registry (disk I/O). It also means more memory usage.
          - Maybe best to modify the cache to be a hybrid and store more limited information for superblocks 4 – 40. But this makes the cache updating more complicated.
          - The memory usage of the additional superblocks is minimal compared to the current size of other data structures with the current active beacon count and chain length; however, when the benefactor contracts are implemented, this will no longer be true.
      • Create more detailed automated greylist reporting
          • Simple listing status on project not sufficient, because users will want to know the details of why a project is greylisted (this is ZCD/WAS reporting)
            • Probably should be something that operates on the project grid in the GUI and allows “clicking” the project whitelisting status and then having a pop-up window that displays the details of ZCD/WAS.
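The ZCD and WAS checks described in the notes above can be sketched as a stateless pass over per-superblock total-credit (TC) snapshots. Everything here is illustrative: the struct and function names are hypothetical, the thresholds are the ones stated above (greylist when ZCD exceeds 7 out of 40, or WAS falls below 0.1), and a real implementation would pull the snapshots from the superblock registry or a cache.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch, not actual Gridcoin code.
struct GreylistCheck {
    unsigned int zcd = 0;    // zero-credit superblock intervals out of 40
    double was = 0.0;        // 7-SB average credit delta / 40-SB average
    bool greylisted = false;
};

// tc_newest_first: one total-credit snapshot per superblock, newest first.
GreylistCheck EvaluateProject(const std::vector<double>& tc_newest_first)
{
    GreylistCheck result;
    const std::size_t n = tc_newest_first.size();
    if (n < 2) return result; // need at least one superblock-to-superblock delta

    // ZCD: count intervals with no credit gain, over up to 40 intervals.
    const std::size_t intervals = std::min<std::size_t>(n - 1, 40);
    for (std::size_t i = 0; i < intervals; ++i) {
        if (tc_newest_first[i] - tc_newest_first[i + 1] <= 0.0) ++result.zcd;
    }

    // WAS: average per-superblock credit delta over the last 7 superblocks,
    // divided by the same average over the last 40.
    const std::size_t span7 = std::min<std::size_t>(n - 1, 7);
    const std::size_t span40 = std::min<std::size_t>(n - 1, 40);
    const double avg7 = (tc_newest_first[0] - tc_newest_first[span7]) / span7;
    const double avg40 = (tc_newest_first[0] - tc_newest_first[span40]) / span40;
    result.was = avg40 > 0.0 ? avg7 / avg40 : 0.0;

    // Greylist when either rule fails: ZCD > 7 or WAS < 0.1.
    result.greylisted = result.zcd > 7 || result.was < 0.1;
    return result;
}
```

Benchmarking this stateless form against a cached variant would settle the cache-versus-recompute question discussed above.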

@jamescowens jamescowens self-assigned this Oct 2, 2024
@jamescowens jamescowens added this to the Natasha milestone Oct 2, 2024
@jamescowens jamescowens force-pushed the implement_greylist branch 2 times, most recently from 715dba6 to 11cbd3b on October 6, 2024 at 20:19
@div72
Member

div72 commented Oct 6, 2024

Scrapers will now collect statistics on projects that have a greylisted status of either AUTO_GREYLISTED or MANUAL_GREYLISTED.

Makes sense for automatic greylisted projects for de-greylisting, but why are statistics collected for manually greylisted projects? The greylister can then operate on projects with either ACTIVE or AUTO_GREYLISTED status.

Also, considering the WAS & ZCD calculation is done once per day, I am not sure bothering with caching is worth it. It might be worthwhile to make a dumb implementation first and benchmark it.

Could adding a separate -projectnotify parameter be useful? I've been thinking about making a mailing list for new polls, adding project state changes doesn't sound bad.

@jamescowens
Member Author

jamescowens commented Oct 7, 2024

Excellent question. Depending on the reasons for the manual greylist, statistics may still be available for a project. If so, they should continue to be collected, because the ZCD and WAS rules would then apply if the manual greylist status were removed.

What are you thinking in terms of functionality for the -projectnotify parameter?

@div72
Member

div72 commented Oct 8, 2024

If so they should continue to be collected, because the ZCD and WAS rules would then apply if the manual greylist status was removed.

Good point.

block → removal from MAN_GREYLISTED → ACTIVE

Manual greylisting should instantly take effect, but should ungreylisting do so too? It would be simpler to make manual ungreylisting put the project in the auto greylist state; it'll take until the next superblock for the project to become active, but that's OK imo. I'm imagining an FSM like this:

                   ( AUTO_GREYLIST )
                    /             \
                   /               \
    ( MANUAL_GREYLIST ) ──────── ( ACTIVE )
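The proposed transitions could be captured in a small transition function. The names here are hypothetical, not the actual implementation; the key properties are that a manual ungreylist contract only drops the project to the auto greylist state (the next superblock decides whether it becomes ACTIVE), and that the manual status overrides superblock-driven changes.

```cpp
#include <stdexcept>

// Hypothetical sketch of div72's proposed FSM, not actual Gridcoin code.
enum class ProjectStatus { ACTIVE, AUTO_GREYLIST, MANUAL_GREYLIST };

enum class Event {
    MANUAL_GREYLIST_CONTRACT,    // block: administrative greylist contract
    MANUAL_UNGREYLIST_CONTRACT,  // block: greylist removal contract
    SUPERBLOCK_RULES_FAIL,       // superblock: ZCD/WAS rules failed
    SUPERBLOCK_RULES_PASS        // superblock: ZCD/WAS rules passed
};

ProjectStatus Transition(ProjectStatus current, Event event)
{
    switch (event) {
    case Event::MANUAL_GREYLIST_CONTRACT:
        // Manual greylisting takes effect immediately at the block.
        return ProjectStatus::MANUAL_GREYLIST;
    case Event::MANUAL_UNGREYLIST_CONTRACT:
        // Do not go straight to ACTIVE; wait for the next superblock.
        return current == ProjectStatus::MANUAL_GREYLIST
            ? ProjectStatus::AUTO_GREYLIST : current;
    case Event::SUPERBLOCK_RULES_FAIL:
        // Manual status always overrides automatic.
        return current == ProjectStatus::MANUAL_GREYLIST
            ? current : ProjectStatus::AUTO_GREYLIST;
    case Event::SUPERBLOCK_RULES_PASS:
        return current == ProjectStatus::MANUAL_GREYLIST
            ? current : ProjectStatus::ACTIVE;
    }
    throw std::logic_error("unreachable");
}
```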

What are you thinking in terms of functionality for the -projectnotify parameter?

Similar to other notify commands: it should be triggered on project status changes (added to whitelist, removed, greylisted, etc.) and should call a script with the contract hash (or the superblock hash in the case of an automatic greylist).
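A sketch of how such a hook could plug in, modeled on Bitcoin-style `-blocknotify` handling (substitute `%s` in the configured command with the relevant hash, then hand it to the shell). The function names here are illustrative assumptions, not actual Gridcoin code.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Replace every "%s" in the command template with the given hash.
std::string BuildNotifyCommand(std::string command_template, const std::string& hash)
{
    const std::string token = "%s";
    for (std::size_t pos = command_template.find(token);
         pos != std::string::npos;
         pos = command_template.find(token, pos + hash.size())) {
        command_template.replace(pos, token.size(), hash);
    }
    return command_template;
}

// Fire-and-forget invocation of the user's -projectnotify command.
void RunProjectNotify(const std::string& command_template, const std::string& hash)
{
    if (command_template.empty()) return;
    std::system(BuildNotifyCommand(command_template, hash).c_str());
}
```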

@jamescowens
Member Author

That is a good simplification actually.

@jamescowens jamescowens force-pushed the implement_greylist branch 2 times, most recently from 6dbd145 to 23e6b12 on November 25, 2024 at 01:00
@jamescowens
Member Author

jamescowens commented Jan 6, 2025

Well... back to the drawing board. With "Initial implementation of AutoGreylist" I have created an auto greylist class that executes the ZCD and WAS rules. It is not wired into the whitelist or the superblock yet, but you can see the results using the getautogreylist RPC.

The idea is to use the AutoGreylist as an override in the Whitelist class, marking projects that meet the automatic greylisting criteria as AUTO_GREYLISTED. This would take precedence over ACTIVE or even MANUALLY_GREYLISTED. No project entry (registry) status updates are made for the automatic greylist; it is maintained as a global singleton cache that is intended to be refreshed when a new superblock is staked. So when a project comes off the auto greylist, it will automatically revert to the status dictated by the last valid project entry.

There is a big problem with the input to the AutoGreylist class, however. The scrapers filter the projects to select only the statistics of actively beaconed crunchers, to save processing time and network bandwidth/node memory. This means that the TC reported in the project files provided by the scrapers, for greylisted projects and indeed all projects, when summed, covers only the active beacon holders for each project, not ALL crunchers. The AutoGreylist class uses the information in the historical superblocks, which is a reduced form of the scraper statistics, so it inherits the same problem.

Unfortunately, this causes serious issues with the current rules. For example, here is an output after the SB at 3473369 was posted on mainnet:

{
  "auto_greylist_projects": [
    {
      "project:": "SiDock@home",
      "zcd": 8,
      "WAS": 0.05824032176973353
    },
    {
      "project:": "asteroids@home",
      "zcd": 6,
      "WAS": 0
    },
    {
      "project:": "gpugrid",
      "zcd": 4,
      "WAS": 0.02413605533979817
    },
    {
      "project:": "rosetta@home",
      "zcd": 10,
      "WAS": 0.01227473310563061
    }
  ]
}

I haven't fully traced the WAS, but for the asteroids ZCD I traced void AutoGreylist::RefreshWithSuperblock(SuperblockPtr superblock_ptr_in) as it updated the greylist. What became immediately apparent is that the asteroids TC for the latest superblock at the time of the run (3473369) was actually LESS than the TC for the previous SB (3472381). How can this be? It is absolutely possible: beacon holders that contributed to the TC for that project may expire between SBs, causing the TC summed across all active beacon holders to decline. This is clearly not what the rules intend.

The rules using the script based greylist checker operate on the total TC for the entire project across ALL CPID's, whether registered beacon holders or not, and as such are much more stable.

I was hoping to get away with minimal changes to the scraper plumbing and the project stats objects that are put in the scrapers' manifests, to avoid additional points of possible problems, but it looks like I am going to have to bite the bullet and

  1. Sum the TC for all of the users in the source stats export file for a project as it is being processed by the scraper, and
  2. Modify the manifest processing to handle that information, which will have to be provided as another manifest object.
  3. Modify the superblock to store the project-wide TC's for all projects across ALL CPID's, not just the TC project sum for active beaconed CPIDs.
  4. Have the AutoGreylist class use this info instead of the existing project TC sums in the superblock.

This is going to cause the scraper processing to go up some.

Ugh.

@jamescowens jamescowens force-pushed the implement_greylist branch 6 times, most recently from df06937 to 9331288 on January 8, 2025 at 05:13
Also implement corresponding -projectnotify startup parameter that
provides the ability to run a cmd based on a change to the whitelist.
This adds a boolean argument to the listprojects RPC function. This
boolean defaults to false, which only lists active projects. When
true, it will list projects of every status.
This supports project status in the superblock.
This is the initial implementation of the AutoGreylist class. It
also includes a diagnostic getautogreylist RPC to see the
current results of the algorithm.
…(part 1)

This commit adds the internal machinery to the scraper to
compute the total credit across ALL cpids for each project,
regardless of whether they are active beaconholders or not.

This is to support auto greylisting.

This commit does NOT include the network manifest changes.
This allows the automatic greylist to be overridden for a project
if for some reason it is not functioning correctly.
@jamescowens jamescowens force-pushed the implement_greylist branch 13 times, most recently from 380f0c3 to e9c0e11 on January 21, 2025 at 06:31
@jamescowens
Member Author

jamescowens commented Jan 23, 2025

Alright. I am nearing the end of the work on the automated greylisting functionality. There are two important issues that have come out of my testing that I would like some opinions on:

  1. Unlike @RoboticMind's script checker, which operates on the BOINC site data, the scrapers will only start collecting overall total credit data to evaluate auto greylisting status when the project is first whitelisted. This means the dataset is constrained at the beginning, so the rules are constrained too. For example, if there have been only three superblocks since the whitelisting, there can't possibly be 7 ZCD, so that rule must pass; it will not have a chance to fail until there are 7 superblocks after the whitelisting. To me this is fine.
  2. If the first superblock doesn't collect stats due to a problem with the project after the whitelisting, then at the second superblock the WAS will evaluate to zero and the project will be greylisted. This can be avoided by requiring a minimum of two superblocks back from the baseline before the rule is applied.
  3. The definition of WAS as the 7-day credit average divided by the 40-day credit average is actually imprecise. One must calculate it by measuring the CHANGE in credit between the superblocks at the beginning and end of the appropriate intervals. So does "7" mean 7 SBs inclusive of the endpoints, which is only 6 superblocks of deltas back from the present, or does it mean 7 superblocks back from the present (baseline), which measures 7 superblocks of change across 8 superblocks inclusive? The corresponding question applies to the 40-superblock denominator. Based on my read of his script, @RoboticMind seems to take the latter view: 7 superblocks of delta for the numerator of WAS (8 superblocks inclusive), and 40 superblocks of delta for the denominator (41 superblocks inclusive).
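The endpoint question in item 3 can be made concrete with a tiny helper (the name is hypothetical): measuring n superblocks of credit change requires n + 1 snapshots, endpoint inclusive.

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: average per-superblock credit change over `deltas`
// superblock-to-superblock intervals, which spans deltas + 1 snapshots.
double AverageDeltaPerSuperblock(const std::vector<double>& tc_newest_first,
                                 std::size_t deltas)
{
    return (tc_newest_first.at(0) - tc_newest_first.at(deltas)) / deltas;
}
```

With 41 snapshots available, `deltas = 40` uses the oldest snapshot as the baseline; asking for 40 deltas from only 40 snapshots would run off the end, which is exactly the off-by-one ambiguity described above.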

Thoughts?

@jamescowens jamescowens force-pushed the implement_greylist branch 2 times, most recently from 0ec7cef to 8350062 on January 23, 2025 at 20:28
Made small adjustments as a result of unit testing. Note that
a tough-to-track-down error was occurring due to an incorrect
action from the MarkAsSuperblock() call on the CBlockIndex object.
It turns out this was an enum confusion caused by SUPERBLOCK
being used in two different enums in the same namespace. This
has been corrected by putting the MinedType enum in the GRC
namespace.

The rules have also been finalized to mean the 7 out of 20
ZCD's means 20 SB's lookback from the current, which is
actually 21 SB's including the baseline, and correspondingly
for the other intervals.

Also a grace period of 7 SB's from the baseline is used before the
rules are applied to allow enough statistics to be collected to
be meaningful.
@jamescowens jamescowens changed the title scraper, project, superblock: Implement automated greylisting (tracking PR for WIP) scraper, project, superblock: Implement automated greylisting Jan 25, 2025