Validate input data before training the models #342

Open
barjin opened this issue Dec 12, 2024 · 9 comments · May be fixed by #349
Labels
debt: Code quality improvement or decrease of technical debt.
t-tooling: Issues with this label are in the ownership of the tooling team.

Comments

@barjin
Collaborator

barjin commented Dec 12, 2024

As mentioned in #339 (and the related comments), the collected input data can contain arbitrary values (e.g. as a result of a penetration test run against the collecting server). This leads to the generation of less believable (or even potentially dangerous) fingerprints.

The input data should be validated before training the models with generator-networks-creator to ensure we only generate real fingerprints. This could be simple for some properties (e.g. Navigator.appCodeName should always be Mozilla), but may be impossible for others (e.g. Navigator.userAgent can be pretty much any string, as long as it follows the expected syntax).
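As a rough illustration of the "simple" case, here is a minimal sketch. The `NavigatorSample` type, the `FIXED_VALUES` table and the `isPlausibleNavigator` helper are hypothetical names that are not part of fingerprint-suite, and the loose user agent pattern is an assumption rather than a complete syntax check.

```typescript
// Hypothetical shape: only the navigator properties used in this sketch.
interface NavigatorSample {
    appCodeName: string;
    appName: string;
    product: string;
    userAgent: string;
}

// Properties that have exactly one legitimate value are easy to validate.
const FIXED_VALUES = {
    appCodeName: 'Mozilla',
    appName: 'Netscape',
    product: 'Gecko',
} as const;

function isPlausibleNavigator(sample: NavigatorSample): boolean {
    const fixedOk = (Object.keys(FIXED_VALUES) as (keyof typeof FIXED_VALUES)[])
        .every((key) => sample[key] === FIXED_VALUES[key]);

    // Assumption: a user agent starts with "Mozilla/5.0 (" and continues with
    // at most 200 printable ASCII characters. This only checks the rough
    // syntax; it cannot tell whether the value describes a real browser.
    const userAgentOk = /^Mozilla\/5\.0 \([\x20-\x7E]{1,200}$/.test(sample.userAgent);

    return fixedOk && userAgentOk;
}
```

A check like this rejects obviously injected values, but a well-formed yet fake user agent would still pass, which is the hard part described above.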

Note that this blocks re-enabling the automatic updates of the models.

barjin added the debt and t-tooling labels Dec 12, 2024
@0xARYA

0xARYA commented Jan 4, 2025

Hey @barjin

I wanted to check in on the progress of this issue. Has anyone internally started work on it? I have been looking at potential solutions and would love to help in any way.

@barjin
Collaborator Author

barjin commented Jan 6, 2025

Hello @0xARYA and thank you for your interest in this project.

There was an open community PR adding basic validation before the model generation step, but the author decided to delete it (I can find the GitHub notifications in my email inbox, but the links are dead). We didn't get much time to look into this yet, so any expertise or ideas on how to validate separate parts of the fingerprints are definitely welcome!

Btw today, while solving an unrelated issue, I regenerated the models in the packages, manually checked those for the bad values and triggered a new release. This means there is a new version (2.1.62) of the fingerprint-suite packages with fresh models available on npm.

@0xARYA

0xARYA commented Jan 21, 2025

https://github.com/kkoooqq/fakebrowser/blob/586e85c0ed872513d2e0703d8c516250a8a4365b/src/core/DeviceDescriptor.ts#L239

I think this could be a good reference for a basic starting point. Obviously, dealing with the poisoning issue is a whole other can of worms. I haven't reached a conclusion on whether the poisoning issue is better solved with a blacklist or a whitelist.

@0xARYA

0xARYA commented Jan 21, 2025

I assume any sort of filtering logic would be implemented in the following function?

@0xARYA

0xARYA commented Jan 22, 2025

I'm now trying to tackle this issue and hopefully increase quality across the board. One really trivial step is eliminating fingerprints with a truthy webdriver.
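A minimal sketch of that step, assuming the flag is stored at `browserFingerprint.webdriver` as in the sample record posted later in this thread. The `CollectedRecord` type and the `dropAutomatedRecords` helper are hypothetical names, not existing functions in the repository.

```typescript
// Hypothetical slice of a collected record; only the fields used here.
interface CollectedRecord {
    id: string;
    browserFingerprint: { webdriver?: unknown };
}

// Keeps only the records whose `webdriver` value is falsy, i.e. drops every
// fingerprint that reports itself as driven by automation (or carries any
// other truthy value in that field).
function dropAutomatedRecords<T extends CollectedRecord>(records: T[]): T[] {
    return records.filter((record) => !record.browserFingerprint.webdriver);
}
```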

I am currently just stuck on trying to understand the structure of the records. I could probably reverse engineer the structure, but I would appreciate some guidance, as I cannot currently download the dataset to inspect it myself.

@0xARYA

0xARYA commented Jan 22, 2025

Another thing we need to address to bring this library back up to speed is the new(-er?!) client hint headers. We're missing a sizeable number of them, which causes issues with sites that do pre-response validation, like Amazon and Google.

@barjin
Collaborator Author

barjin commented Jan 23, 2025

I assume any sort of filtering logic would be implemented in the following function?

fingerprint-suite/packages/generator-networks-creator/src/generator-networks-creator.ts

Line 59 in b42c60a

Yes, this sounds about right. The prepareRecords method takes the collected browser fingerprints and parses / filters those. You see, we do some similar steps as the devs of fakebrowser in the checkLegal method (nice catch btw, we can definitely use that as inspiration).

The format of our collected fingerprint is as follows:

[{
  "id": "jGK6LWYyfaJ8c7Y5lhw2V",
  "collectedAt": "2025-01-11T18:59:52.513Z",
  "requestFingerprint": {
    "headers": {
      ":method": "GET",
      ":authority": "hostname",
      ":scheme": "https",
      ":path": "path",
      "sec-ch-ua": "\"Google Chrome\";v=\"131\", \"Chromium\";v=\"131\", \"Not_A Brand\";v=\"24\"",
      "sec-ch-ua-mobile": "?1",
      "sec-ch-ua-platform": "\"Android\"",
      "upgrade-insecure-requests": "1",
      "user-agent": "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Mobile Safari/537.36",
      "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
      "sec-fetch-site": "same-site",
      "sec-fetch-mode": "navigate",
      "sec-fetch-dest": "iframe",
      "referer": "https://apify.com/",
      "accept-encoding": "gzip, deflate, br, zstd",
      "accept-language": "en-US,en;q=0.9,es;q=0.8",
      "priority": "u=0, i"
    },
    "httpVersion": "2.0",
    "tlsVersion": "TLSv1.3",
    "tlsName": "TLS_AES_256_GCM_SHA384",
    "tlsStandardName": "TLS_AES_256_GCM_SHA384"
  },
  "browserFingerprint": {
    "language": "en-US",
    "oscpu": null,
    "doNotTrack": null,
    "product": "Gecko",
    "vendorSub": "",
    "appCodeName": "Mozilla",
    "appName": "Netscape",
    "appVersion": "5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Mobile Safari/537.36",
    "webdriver": false,
    "maxTouchPoints": 5,
    "userAgentData": {
      "brands": [
        {
          "brand": "Google Chrome",
          "version": "131"
        },
        {
          "brand": "Chromium",
          "version": "131"
        },
        {
          "brand": "Not_A Brand",
          "version": "24"
        }
      ],
      "mobile": true,
      "platform": "Android",
      "architecture": "",
      "bitness": "",
      "fullVersionList": [
        {
          "brand": "Google Chrome",
          "version": "131.0.6778.260"
        },
        {
          "brand": "Chromium",
          "version": "131.0.6778.260"
        },
        {
          "brand": "Not_A Brand",
          "version": "24.0.0.0"
        }
      ],
      "model": "23026RN54G",
      "platformVersion": "13.0.0",
      "uaFullVersion": "131.0.6778.260"
    },
    "extraProperties": {
      "vendorFlavors": [
        "chrome"
      ],
      "globalPrivacyControl": null,
      "pdfViewerEnabled": null,
      "installedApps": []
    },
    "userAgent": "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Mobile Safari/537.36",
    "platform": "Linux armv81",
    "languages": [
      "en-US",
      "en",
      "es"
    ],
    "videoCard": {
      "vendor": "Imagination Technologies",
      "renderer": "PowerVR Rogue GE8320"
    },
    "multimediaDevices": {
      "speakers": [
        {
          "deviceId": "",
          "kind": "audiooutput",
          "label": "",
          "groupId": ""
        }
      ],
      "micros": [
        {
          "deviceId": "",
          "kind": "audioinput",
          "label": "",
          "groupId": ""
        }
      ],
      "webcams": [
        {
          "deviceId": "",
          "kind": "videoinput",
          "label": "",
          "groupId": ""
        }
      ]
    },
    "productSub": "20030107",
    "battery": {
      "charging": false,
      "chargingTime": null,
      "dischargingTime": 5008,
      "level": 0.47
    },
    "deviceMemory": 2,
    "audioCodecs": {
      "ogg": "probably",
      "mp3": "probably",
      "wav": "probably",
      "m4a": "maybe",
      "aac": "probably"
    },
    "videoCodecs": {
      "ogg": "",
      "h264": "probably",
      "webm": "probably"
    },
    "screen": {
      "availHeight": 800,
      "availWidth": 360,
      "pixelDepth": 24,
      "height": 800,
      "width": 360,
      "availTop": 0,
      "availLeft": 0,
      "colorDepth": 24,
      "innerHeight": 0,
      "outerHeight": 752,
      "outerWidth": 360,
      "innerWidth": 0,
      "screenX": 0,
      "pageXOffset": 0,
      "pageYOffset": 0,
      "devicePixelRatio": 2,
      "clientWidth": 0,
      "clientHeight": 19,
      "hasHDR": false
    },
    "hardwareConcurrency": 8,
    "plugins": [],
    "mimeTypes": [],
    "fonts": [
      "sans-serif-thin"
    ],
    "vendor": "Google Inc."
  }
}]

Hopefully, you can see how this structure maps to the properties accessed in prepareRecords.

Regarding the filtering of the malicious payloads - the whitelist approach (i.e. specifically describing what is allowed, even with regexes and similar) makes more sense to me, as it sounds more future-proof. Since this project is open-source, sharing a blacklist of forbidden patterns might just inspire the pentesters to craft malicious payloads that don't match these conditions.

Speaking technically, we could specify e.g. a zod schema with regexes and conditions for every property of the collected records.
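To illustrate the whitelist idea, here is a hand-rolled sketch in plain TypeScript. It deliberately does not use zod so that it stays self-contained; a zod schema would express the same rules declaratively. The `Rule` type, the `BROWSER_FINGERPRINT_RULES` table and the `violations` helper are hypothetical, and the patterns and bounds are illustrative guesses, not vetted limits.

```typescript
type Rule = (value: unknown) => boolean;

// Rule builders: a value passes only if it has the right type AND matches.
const matches = (pattern: RegExp): Rule =>
    (value) => typeof value === 'string' && pattern.test(value);
const intBetween = (min: number, max: number): Rule =>
    (value) => typeof value === 'number' && Number.isInteger(value)
        && value >= min && value <= max;

// Hypothetical whitelist for a few properties of `browserFingerprint`.
// Anything not explicitly described as allowed is treated as a violation.
const BROWSER_FINGERPRINT_RULES: Record<string, Rule> = {
    appCodeName: matches(/^Mozilla$/),
    productSub: matches(/^\d{8}$/),
    language: matches(/^[a-z]{2,3}(-[A-Za-z0-9]{2,8})*$/),
    hardwareConcurrency: intBetween(1, 256),
    webdriver: (value) => value === false,
};

// Returns the names of all whitelisted properties that fail their rule.
function violations(fingerprint: Record<string, unknown>): string[] {
    return Object.entries(BROWSER_FINGERPRINT_RULES)
        .filter(([key, rule]) => !rule(fingerprint[key]))
        .map(([key]) => key);
}
```

A record would then be kept for training only when `violations(record.browserFingerprint)` returns an empty array.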

@0xARYA

0xARYA commented Jan 23, 2025

Thanks so much for the help, I really appreciate it! I'll have a PR out soon. The fingerprint filtering adjustments shouldn't be too tricky, and I believe that with proper filtering we can also tackle the fuzzing contamination issue as a side benefit.

I’m really excited to bring this library up to speed. One of the key things we need to address is filling in some of the missing data points from the RFCs that have been implemented in the meantime—especially when it comes to client hints. For example, we’re currently missing formFactors and wow64 on the client side. As for headers, there are quite a few missing ones (see: browserleaks.com/client-hints).

This gap is becoming a bigger issue since many sites depend on that data. Take Amazon, for instance—they use it to test if the client sends the requested hints and to apply a basic fingerprint to requests before even sending a response.

Another missing data point we might want to consider is speechSynthesis voice data.

I’ve been working on a fork of this library with updated evasions, but I’m holding off on pushing a PR until we’re closer to getting everything else in place. I want to make sure we don’t inadvertently reintroduce gaps along the way.

@0xARYA

0xARYA commented Jan 23, 2025

Cursory Initial Validation

I’ve already worked through some of the checks. Now, I’m just figuring out the best way to handle the screen dimension check.

I’ll also be adding a few more checks, like ensuring the fonts match the OS, validating client hints against headers, and so on.
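A sketch of what such cross-checks could look like, based on the record format shown earlier in this thread. The `RecordSlice` type and the `inconsistencies` helper are hypothetical, and the rules themselves (the header must equal `navigator.userAgent`, the available screen area cannot exceed the full screen) are assumptions that may need exceptions for some real browsers.

```typescript
// Hypothetical slice of a collected record; only the fields used here.
interface RecordSlice {
    requestFingerprint: { headers: Record<string, string | undefined> };
    browserFingerprint: {
        userAgent: string;
        userAgentData?: { mobile: boolean; platform: string } | null;
        screen: { width: number; height: number; availWidth: number; availHeight: number };
    };
}

// Client hint string values arrive quoted, e.g. "\"Android\"".
const unquote = (value: string): string => value.replace(/^"|"$/g, '');

// Returns the names of the checks that failed for the given record.
function inconsistencies(record: RecordSlice): string[] {
    const { headers } = record.requestFingerprint;
    const { userAgent, userAgentData, screen } = record.browserFingerprint;
    const problems: string[] = [];

    // The HTTP header and the JS-side value should describe the same browser.
    if (headers['user-agent'] !== userAgent) problems.push('user-agent');

    // Compare client hint headers with userAgentData when both are present.
    const platformHint = headers['sec-ch-ua-platform'];
    if (userAgentData && platformHint !== undefined
        && unquote(platformHint) !== userAgentData.platform) {
        problems.push('sec-ch-ua-platform');
    }
    const mobileHint = headers['sec-ch-ua-mobile'];
    if (userAgentData && mobileHint !== undefined
        && (mobileHint === '?1') !== userAgentData.mobile) {
        problems.push('sec-ch-ua-mobile');
    }

    // The available area is part of the screen, so it cannot be larger.
    if (screen.availWidth > screen.width || screen.availHeight > screen.height) {
        problems.push('screen');
    }
    return problems;
}
```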

I think it’d be really useful to bring in a compat table (something like this: MDN Browser Compatibility Data) to make sure that missing data is actually due to a lack of browser support, rather than something being blocked by a malicious actor.

@0xARYA 0xARYA linked a pull request Jan 24, 2025 that will close this issue