Dump of Steam appid list is broken #166

Open
romatthe opened this issue Feb 26, 2024 · 7 comments
Labels
good first issue Good for newcomers

Comments

@romatthe

romatthe commented Feb 26, 2024

This concerns the SteamAppList projects, but I thought it was more appropriate to file the issue here.

For the longest time, I thought my games were missing from the list when opening the application because the database had to be updated (via your SteamAppList and SteamAppListDumps projects). Because updating the database of games often takes a couple of hours, I usually just used SamRewritten by typing in a game's appid whenever a game was missing from the list (great feature btw).

However, I think the problem is that the script to build the database actually just doesn't work properly.

Feel free to correct me (as my JavaScript-fu is pretty terrible), but from what I understand, the process for updating the game database is as follows:

  1. The existing list is checked out from git.
  2. Via the Steam API, a complete list of all known apps on Steam is fetched.
  3. That list is diffed versus the existing list. In other words, we don't want to query apps that are already in the existing list, only ones that aren't yet.
  4. Details for the remaining games are then fetched.
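
The four steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual SteamAppList code: `findUnqueriedAppIds` and the sample inputs are hypothetical, and only the entry shape (`appid`, `name`, `type`, `achievements`) is taken from the dump entries quoted in this issue.

```javascript
// Hypothetical sketch of step 3: diff the full Steam app list against the existing dump.
// Anything already present in the dump is skipped, regardless of what its entry contains.
function findUnqueriedAppIds(existingApps, allApps) {
  const known = new Set(existingApps.map((app) => app.appid));
  return allApps
    .filter((app) => !known.has(app.appid))
    .map((app) => app.appid);
}

// Existing dump (step 1): Paradise Killer was stored before it had achievements.
const existingApps = [
  { appid: 1160220, name: "Paradise Killer", type: "game", achievements: null },
];

// Complete list from the Steam API (step 2).
const allApps = [
  { appid: 1160220, name: "Paradise Killer" },
  { appid: 668580, name: "Atomic Heart" },
];

// Step 4 would fetch details only for these appids.
const toQuery = findUnqueriedAppIds(existingApps, allApps);
console.log(toQuery); // 1160220 is absent: its stale entry is never refreshed
```

If the script really works this way, the stale `null` entry for 1160220 can never be corrected, because the appid is filtered out before any details are fetched.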

If this is correct, it doesn't work very well. There are two problematic cases:

  1. Sometimes, games receive achievements later in their lifecycle. Early Access games often don't have achievements yet, and 1.0 games sometimes get achievements in major updates as well. As a completely random example of the latter: Paradise Killer (appid 1160220) received achievements only several months after the initial release.
  2. Games often pop up in the list of apps from the Steam API that don't have achievements yet, most likely because they simply haven't been released yet. Completely random examples for this: Atomic Heart (appid 668580) as a game that had achievements on day 1 but does not appear in the list of games in SamRewritten, and Black Myth: Wukong (appid 2358720) as an example of a game that isn't out yet.

What happens in these cases is that the game gets added to the master list with the achievements field set to null.

Here's what the example games look like in the database dump I generated a few hours ago:

/// Paradise Killer
{"appid":1160220,"name":"Paradise Killer","type":"game","achievements":null}
/// Atomic Heart
{"appid":668580,"name":"Atomic Heart","type":"game","achievements":null}
/// Black Myth: Wukong
{"appid":2358720,"name":"Black Myth: Wukong","type":"game","achievements":null}

Obviously, if you query Steam right now, you get a much different answer:

Paradise Killer: https://store.steampowered.com/api/appdetails/?filters=basic,achievements&appids=1160220:

{
    "1160220":{
       "success":true,
       "data":{
          "type":"game",
          "name":"Paradise Killer",
          "steam_appid":1160220,

          // Etc, etc...

          "achievements":{
             "total":39,
             "highlighted":[
                {
                   "name":"Scholar of the Pantheon",
                   "path":"https:\/\/cdn.akamai.steamstatic.com\/steamcommunity\/public\/images\/apps\/1160220\/23086a11a2e2b415fca83cd7b2b975fd706ac863.jpg"
                },
                // Etc, etc...
             ]
          }
       }
    }
 }

Atomic Heart: https://store.steampowered.com/api/appdetails/?filters=basic,achievements&appids=668580:

{
    "668580":{
       "success":true,
       "data":{
          "type":"game",
          "name":"Atomic Heart",
          "steam_appid":668580,

          // Etc, etc,....

          "achievements":{
             "total":51,
             "highlighted":[
                {
                   "name":"The Motherland Does Not Forget its Heroes",
                   "path":"https:\/\/cdn.akamai.steamstatic.com\/steamcommunity\/public\/images\/apps\/668580\/fbf6ef189b6beb1e0828e3a4909e6a460543e64b.jpg"
                },
                // Etc, etc...
             ]
          }
       }
    }
 }

However, because these games were originally written into app_list.json with "achievements: null", and because games are never queried again once they are in app_list.json, their entries are never updated to show that they have achievements. As a result, they never appear in the list of games in SamRewritten.

Black Myth: Wukong is a little different of course, since it currently doesn't list any achievements:

https://store.steampowered.com/api/appdetails/?filters=basic,achievements&appids=2358720:

{
    "2358720":{
       "success":true,
       "data":{
          "type":"game",
          "name":"Black Myth: Wukong",
          "steam_appid":2358720,

          // No achievements to be found...
       }
    }
 }

In this case, the absence of achievements is indeed correct. But that's just because the game hasn't been released yet. Odds are extremely high it will have achievements upon release. But again, this game will never be queried again, because it's already on the list.

If what I'm saying is correct, then the real question becomes, of course, what can be done about that? The easy solution would be NOT to write games with "achievements: null" into app_list.json, so that they get queried anew every time a Steam app dump is taken. Or you could change the logic to query games again even if they are already listed with "achievements: null".
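
A minimal sketch of the second option, assuming the dump stores `null` for "no achievements" and some non-null value otherwise. The function name `selectAppIdsToQuery` and the second sample entry are hypothetical and not part of SteamAppList.

```javascript
// Hypothetical selection rule: skip an appid only when the existing entry
// already has achievements; new appids and null entries are both queried.
function selectAppIdsToQuery(existingApps, allAppIds) {
  const settled = new Set(
    existingApps
      .filter((app) => app.achievements != null)
      .map((app) => app.appid)
  );
  return allAppIds.filter((appid) => !settled.has(appid));
}

const existing = [
  { appid: 1160220, name: "Paradise Killer", type: "game", achievements: null },
  // Made-up entry standing in for a game whose achievements are already recorded.
  { appid: 1, name: "Hypothetical game", type: "game", achievements: 10 },
];

const selected = selectAppIdsToQuery(existing, [1, 668580, 1160220]);
console.log(selected); // the null entry and the unseen appid are both selected
```

The cost of this rule is exactly the concern raised in the thread: every entry that stays `null` is queried again on every run.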

However, I think this is also pretty problematic. Let's do a quick search to see how many games Steam lists:

$ cat app_list.json | jq '[.applist.apps[] | select(.type=="game")] | length'
111222

Now how many of those do NOT have achievements?

$ cat app_list.json | jq '[.applist.apps[] | select(.type=="game") | select(.achievements==null)] | length'
63234

This means that you'd need to query over half of all the games in the list (63,234 of 111,222) if you were to take a dump for the first time with the modification proposed above.
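
For reference, the same counts and the resulting share can be reproduced in JavaScript. This is only a sketch: `countGames` is a hypothetical helper, and the `applist.apps` path is assumed from the jq queries above.

```javascript
// Hypothetical helper mirroring the two jq queries: count all games, and
// count games whose achievements field is null (or missing).
function countGames(dump) {
  const games = dump.applist.apps.filter((app) => app.type === "game");
  const withoutAchievements = games.filter((app) => app.achievements == null);
  return { games: games.length, withoutAchievements: withoutAchievements.length };
}

// Share of games that would be re-queried, using the numbers reported above.
const share = (63234 / 111222) * 100;
console.log(share.toFixed(1) + "%"); // 56.9%
```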

Now, of course, I'd imagine at least several thousand games would get filled in with the correct achievement count (the examples Paradise Killer and Atomic Heart being among them), so the next time you take the dump, the number of games that need to be queried would be lower. But you'd still be looking at tens of thousands of games that don't have achievements and never will (and thus will always be marked with "achievements: null") that you'd be querying each and every time you take an app list dump. In other words, this would dramatically increase the time it takes to perform a new dump.

So I think the simple solution proposed above wouldn't be very practical: it may end up doubling or tripling the time it takes to dump the entire database.

Any ideas? Did I get something wrong here?

@romatthe
Author

After thinking about this for a few more minutes, a relatively easy solution that would weed out some of the most glaring issues (but still not entirely circumvent some of the problems) is the following:

When fetching achievement details for a specific appid, change the query from
https://store.steampowered.com/api/appdetails/?filters=basic,achievements&appids=<APPID>
to
https://store.steampowered.com/api/appdetails/?filters=achievements,release_date&appids=<APPID>

This gives you an additional piece of data, release_date, which has a field coming_soon that's set to true if the game hasn't been released yet. (It also drops the "basic" info to decrease the size of the payload, since I don't think any of that info is used anyway, right?).
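
As a small illustration, the request URL could be built like this. `buildAppDetailsUrl` is a hypothetical helper; the endpoint and the `filters` and `appids` parameters are the ones quoted above.

```javascript
const APP_DETAILS_BASE = "https://store.steampowered.com/api/appdetails/";

// Hypothetical helper: build the storefront URL for one appid, asking only
// for the achievements and release_date sections by default.
function buildAppDetailsUrl(appid, filters = ["achievements", "release_date"]) {
  return `${APP_DETAILS_BASE}?filters=${filters.join(",")}&appids=${appid}`;
}

console.log(buildAppDetailsUrl(1160220));
```

The URL is assembled by hand rather than through `URLSearchParams` so that the comma between filters stays unencoded, matching the URLs in this thread.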

Example, Paradise Killer (1160220):

{
   "1160220":{
      "success":true,
      "data":{
         "achievements":{
            "total":39,
            "highlighted":[
               ...
            ]
         },
         "release_date":{
            "coming_soon":false,
            "date":"4 Sep, 2020"
         }
      }
   }
}

Example, Black Myth: Wukong (2358720):

{
   "2358720":{
      "success":true,
      "data":{
         "release_date":{
            "coming_soon":true,
            "date":"20 Aug, 2024"
         }
      }
   }
}

So, what you could do is this: while going through the entire list of appids and querying the Storefront API for each one, whenever you retrieve a game that has its coming_soon flag set to true, simply discard it and do NOT add it to app_list.json. That means the games that DO get added to app_list.json with "achievements: null" are games that, at the very least, have been released and do not have achievements.

As soon as the games then get released, they get included in app_list.json in the next dump, and you get a more definitive statement on whether or not they actually have achievements.
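
The discard rule could look roughly like the sketch below. It assumes the response shape shown in the two examples above. `toListEntry` is a hypothetical name, and storing `achievements.total` as the entry's value is an assumption about the dump format, not something confirmed in this thread.

```javascript
// Hypothetical mapping from a storefront response to a dump entry.
// Returns null when the app should NOT be written to app_list.json yet.
function toListEntry(appid, response) {
  const result = response[String(appid)];
  if (!result || !result.success || !result.data) {
    return null; // lookup failed: leave the app out so it is retried later
  }
  const releaseDate = result.data.release_date;
  if (releaseDate && releaseDate.coming_soon) {
    return null; // unreleased: discard so a later dump picks it up
  }
  const achievements = result.data.achievements;
  // Assumption: the dump keeps the achievement count, or null if there are none.
  return { appid, achievements: achievements ? achievements.total : null };
}

const paradiseKiller = {
  "1160220": {
    success: true,
    data: {
      achievements: { total: 39, highlighted: [] },
      release_date: { coming_soon: false, date: "4 Sep, 2020" },
    },
  },
};
const blackMythWukong = {
  "2358720": {
    success: true,
    data: { release_date: { coming_soon: true, date: "20 Aug, 2024" } },
  },
};

console.log(toListEntry(1160220, paradiseKiller)); // entry with 39 achievements
console.log(toListEntry(2358720, blackMythWukong)); // null, not written yet
```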

This still has three issues:

  1. Games that receive achievement support AFTER release still won't get picked up.
  2. You're still increasing the number of appids you need to check during each dumping process.
  3. You effectively need to start over, throw the existing database in the trash and start a dump from complete scratch (since the current database contains a ton of incorrect entries that we have to throw out).

Despite those issues, it would at least stop putting games into the database that haven't been released yet, which I'd wager is probably 98% of the games affected by this entire issue.

(Also maybe just filter out appids that have their type field not set to game? Doesn't that just increase the size of the JSON file stored in the git repo for no reason?)

Hopefully I'm still making sense here. Sorry for bothering you with such long posts, I'm not good at being concise.

@PaulCombal
Owner

Hello,

What an impressive post! I will try to answer all of your questions and lay down my thoughts here. If I'm not clear enough or forgot something, please ask me again. You're not bothering me, quite the opposite!

The process you described is wrong. Here is how SamRewritten uses the list:

  1. SAM downloads it from github
  2. On every appId encountered in the list, SamRewritten checks if the app is owned by the current user
  3. If the app is owned, the app is added to the main menu view, using only data from the list.

SamRewritten only uses the Steamworks SDK, and not the web api. Therefore, it is not able to fetch additional info about games itself, without impersonating an app.

You pointed out a great flaw in the refreshing process of the list. Apps already processed by SteamAppList are not processed again, which means changes to those apps are never registered.

I implemented the list this way since it takes a lot of time to refresh it, and I did not have in mind to automate it one day. Thinking about it now, a better solution would be to refresh all the apps on a schedule, probably using GitHub scheduled actions, for example every week. That way the list would be fully updated every week, the problem would be solved, and no human intervention would be needed. If you're looking to contribute, fiddling with SteamAppList to do this might be a cool little project.

What you say about the app sorting and the information to collect about them makes a lot of sense. But again, unfortunately I do not have any more time to spend on this project. However, if you want to contribute and need guidance, I will be very happy to help!

I already merged your PR in SteamApplistDump. I'm not sure what more I can do with this issue, but I will leave it open for discussion.

Let me know what you think, have a good day :)

@romatthe
Author

Hey, thanks for the response. Sorry again for the lack of brevity.

> The process you described is wrong. Here is how SamRewritten uses the list:
>
>   1. SAM downloads it from github
>   2. On every appId encountered in the list, SamRewritten checks if the app is owned by the current user
>   3. If the app is owned, the app is added to the main menu view, using only data from the list.

Totally my mistake for not being clear. What I meant was the process of GENERATING the initial list via your SteamAppList projects, aka the JS script, not the process of how SamRewritten uses this list.

I totally understand you don't have a lot of time anymore, like you announced previously. If I find the time, I could try over the weekend to make some changes to the script that takes the dump, and look into automating the execution of said script.

I think your suggestion of looking into GitHub Actions is certainly a really good one. I have some experience with it, but not a lot. From what I recall, you can schedule a workflow with a cron-style expression. I'd need to have a look at what the potential limitations are for very long-running tasks, of course.

Feel free to leave this issue open, and I'll report if I was able to achieve something. It might take a while though.

P.S.: I also indicated in your post mentioning your lack of time that I'd personally be interested in attempting a rewrite of the core functionality of SamRewritten. I can't guarantee anything, since I don't have a lot of time and suffer from a chronic lack of energy, but I do hope it might be something I can work towards in the future.

@PaulCombal
Owner

You don't need to apologize!

Indeed, I misunderstood what you meant. So yes, you're totally right, this is exactly how the SteamAppList project works.

Even though I'm very busy with other things, I'm very open to contributions, so please feel free to experiment with new things. You can even attempt to rewrite SteamAppList from scratch altogether if you wish. As far as I remember, this wasn't the most elegant code.

That's exactly what I was thinking about for the Cron-style github action, glad we have the same vision! I will leave the thread up for discussion and more.

To answer your PS: if you attempt to rewrite the core functionality of SamRewritten, don't make the same mistake I did: clearly separate the server from the client. Defining the process architecture is key. How will you implement your process pool? Are you going to be using threading? That kind of stuff.

@romatthe
Author

romatthe commented Mar 5, 2024

Small update (and note to myself for future reference):

I did a very quick session of fiddling with the script over the weekend. The result is as follows:

I was able to very quickly make some adjustments that allow you to take a dump from scratch if you set the correct environment variable, while also discarding titles that haven't been released yet (so we can grab the achievement details during a later dump). The big downside is of course that it takes several lightyears to perform a complete dump. It's still running right now, but I think it will probably end up taking somewhere around 100 hours.

The second thing I did was to explore the use of the IStoreService web API for collecting all known appids, more specifically the https://api.steampowered.com/IStoreService/GetAppList/v1/ endpoint. This one is actually (semi) documented, and has the following benefits:

  1. You can actually select the type of app you want to retrieve, which by default only includes games (but you can filter as desired).
  2. It also allows you to show only apps that have "been updated" or "changed" since a specific timestamp. What exactly it means for an app to be updated is not clear to me, but it is interesting to investigate.
  3. It's a paginated API where you have to provide the last appid from your previous request to get the next page, but this perhaps might also be used to fetch new apps since the last time you performed the dump. Again, I'd need to look deeper into this.

There are some downsides as well:

  1. Like I said, it's paginated, so you need to page through it, but that's no biggie really.
  2. It requires you to have a Web API key. Again, this is also not a big deal at all if it's part of an automated pipeline, but is VERY annoying if taking the dump happens manually and requires someone to always provide a key (because it requires a linked account, etc etc).
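
The paging loop described above might look like the sketch below. The helper names are hypothetical. The query parameters (`key`, `include_games`, `max_results`, `last_appid`) and the response fields (`apps`, `have_more_results`, `last_appid`) reflect my understanding of this semi-documented endpoint and should be verified against a real response before relying on them.

```javascript
const GET_APP_LIST_URL = "https://api.steampowered.com/IStoreService/GetAppList/v1/";

// Generic paging loop: keeps requesting pages, passing the last appid of the
// previous page, until the service reports that there are no more results.
async function collectAllApps(fetchPage) {
  const apps = [];
  let lastAppId = 0;
  let hasMore = true;
  while (hasMore) {
    const page = await fetchPage(lastAppId);
    apps.push(...(page.apps ?? []));
    hasMore = Boolean(page.have_more_results);
    if (hasMore && page.last_appid === undefined) {
      throw new Error("more results reported, but no last_appid to continue from");
    }
    lastAppId = page.last_appid;
  }
  return apps;
}

// Assumed request shape for one page; requires a Web API key and Node 18+ (global fetch).
function makeSteamFetchPage(apiKey, maxResults = 10000) {
  return async (lastAppId) => {
    const params = new URLSearchParams({
      key: apiKey,
      include_games: "true",
      max_results: String(maxResults),
      last_appid: String(lastAppId),
    });
    const res = await fetch(`${GET_APP_LIST_URL}?${params}`);
    if (!res.ok) throw new Error(`GetAppList failed with HTTP ${res.status}`);
    const body = await res.json();
    return body.response ?? {};
  };
}

// Usage (performs real requests, so it is left commented out):
// collectAllApps(makeSteamFetchPage(process.env.STEAM_API_KEY)).then((apps) => console.log(apps.length));
```

Passing the page fetcher in as a function keeps the loop testable without a key, which matters given the concern about manual runs needing a linked account.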

Using the new endpoint significantly cuts down on the time required to perform a complete dump, because you're no longer querying the details of appids not associated with games. I think this full dump takes about 40 hours. I have already performed one with a quick POC, but I have yet to validate the results.

@PaulCombal
Owner

Interesting findings! I had no idea about IStoreService! Maybe it's a new thing? Either way, don't worry about the web API key: if we turn it into an automated github action, I don't think it's going to be necessary to change it often.

Although I don't contribute to SamRewritten I can review or do some parts on this if you need help. Looks like you're already on your way though!

PaulCombal added the "good first issue" label on Mar 6, 2024
@Tropingenie

Tropingenie commented Oct 20, 2024

I've also been working on my own fork of SteamAppsList.

One note I'd make is that we do not want to take a dump from scratch. Per PaulCombal/SteamAppsList#1, the script already has trouble with games that have been removed, so we don't want to lose any games by deleting the whole dump. A rough way of doing this is to only update games without previous achievements. This still misses games that had their achievement list updated later in life (e.g. Hardspace Shipbreaker), but it doesn't cause any issues. It also has the problem that it re-queries a lot of games that weren't updated, so it's still a band-aid until the POC API fork works.
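
One way to read "don't delete the whole dump" is a merge that overlays refreshed entries on the previous dump. The sketch below is not taken from the fork: `mergeDump` and the sample entries are hypothetical.

```javascript
// Hypothetical merge: start from the previous dump so that games no longer
// returned by Steam are kept, then overwrite or add the refreshed entries.
function mergeDump(previousApps, refreshedApps) {
  const merged = new Map(previousApps.map((app) => [app.appid, app]));
  for (const app of refreshedApps) {
    merged.set(app.appid, app);
  }
  return [...merged.values()].sort((a, b) => a.appid - b.appid);
}

const previous = [
  { appid: 1, name: "Removed game", type: "game", achievements: 5 },
  { appid: 2, name: "Late achievements", type: "game", achievements: null },
];
const refreshed = [
  { appid: 2, name: "Late achievements", type: "game", achievements: 12 },
  { appid: 3, name: "New game", type: "game", achievements: null },
];

const mergedDump = mergeDump(previous, refreshed);
console.log(mergedDump.map((app) => app.appid)); // removed game 1 is still present
```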

PaulCombal/SteamAppsList#5 also notes another issue with the long retrieval times. The end goal would be to check if a game has been updated, like how SteamDB does it (or presumably the POC API fork). However, a band-aid is to just save the dump more often. While this doesn't help the time it takes, it at least makes it more bearable for a human to run this script (or to recover a failed github action).
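
"Saving the dump more often" can be sketched as a checkpoint after every batch. Everything named here is hypothetical: `processInBatches`, the injected `fetchDetails` and `saveCheckpoint` callbacks, and the batch size are illustrative rather than the fork's actual implementation.

```javascript
// Hypothetical batch runner: fetch details sequentially and persist progress
// after each batch, so a failed run loses at most one batch of work.
async function processInBatches(appIds, fetchDetails, saveCheckpoint, batchSize = 100) {
  const results = [];
  for (let start = 0; start < appIds.length; start += batchSize) {
    const batch = appIds.slice(start, start + batchSize);
    for (const appid of batch) {
      results.push(await fetchDetails(appid));
    }
    await saveCheckpoint(results); // e.g. write the partial dump to disk
  }
  return results;
}

// Usage with stand-in callbacks (a real run would call the storefront API
// and write app_list.json instead):
// processInBatches(appIds, fetchDetailsFromSteam, writeDumpToDisk, 500);
```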

I've implemented both of these (and the required interaction, since we can't do a hard update after a failed run) in my fork and they seem to be working well. I'm running a few thousand batches to verify, and then plan to merge some of @romatthe's improvements before I submit a PR.


3 participants