PhD placeholder: learn-to-rank, decentralised AI, on-device AI, something. #7586
Hmmm, very difficult choice.
Re-read papers regarding learn-to-rank and learned how to use IPv8. With it I created an algorithm which simulates a number of nodes that send messages to one another. From there I worked with Marcel and started implementing a system whereby one node sends a query to the swarm and then receives recommendations of content back from it. The progress is detailed in ticket 7290. There are 2 design choices: One issue discovered concerns the size of the IPv8 network packet, which is currently smaller than the entire model serialized with PyTorch; Marcel is working on that. We have 720k weights at the moment (already roughly 2.9 MB as float32), and the maximum network packet size for IPv8 is 2.7 MB, so we have to fit in as many weight updates per packet as possible. You can see a demonstration of the prototype below: I'm currently working on how to aggregate the recommendations of the swarm (for example, what happens if the recommendations of each node which received the query are entirely different?). My branch on Marcel's repository: https://github.com/mg98/p2p-ol2r/tree/petrus-branch
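A minimal sketch of the packet-size workaround discussed above: serialize the model state and split it into chunks that each fit under the IPv8 ceiling. The 2.7 MB figure comes from the comment; the header margin and all function names are assumptions, not the actual prototype code.

```python
# Sketch only: chunk a serialized PyTorch model into IPv8-sized packets.
import io
import torch

PACKET_BUDGET = 2_700_000  # bytes; the IPv8 packet ceiling mentioned above
HEADER_MARGIN = 4_096      # assumed slack for IPv8/serialization overhead

def serialize_model(model: torch.nn.Module) -> bytes:
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    return buf.getvalue()

def to_packets(blob: bytes, budget: int = PACKET_BUDGET - HEADER_MARGIN):
    """Yield (seq, total, chunk) triples, each small enough for one packet."""
    total = -(-len(blob) // budget)  # ceiling division
    for seq in range(total):
        yield seq, total, blob[seq * budget:(seq + 1) * budget]

def from_packets(packets) -> bytes:
    """Reassemble chunks; assumes every packet arrived (no retransmit logic)."""
    ordered = sorted(packets, key=lambda p: p[0])
    return b"".join(chunk for _, _, chunk in ordered)
```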
It's beyond amazing what you accomplished in 6 weeks after starting your PhD. 🦄 🦄 🦄 Can we upgrade to transformers? That is the cardinal question for scientific output. We had distributed AI in unusable form deployed already in 2012 within our Tribler network. Doing model updates is too complex compared to simply starting with sending training triplets around in an IPv8 community. The key is simplicity, ease of deployment, correctness, and ease of debugging. Nobody has a self-organising live AI with lifelong learning, as you have today in embryonic form. We even removed our deployed clicklog code in 2015 because it was not good enough. Options:
For a YouTube-alternative smartphone app we have a single simple network primitive:
Next sprint goal: get a performance graph!
After looking into what datasets we could use for training a hypothetical model, I found ORCAS, which consists of almost 20 million queries and the relevant website link for each query. It is compiled by Microsoft and represents searches made on Bing over a period of a few months (with a few caveats to preserve privacy, such as showing only queries which have been searched a number of times, and not including user IDs and the like). The data seems good, but the fact that we have links instead of titles of documents makes it impossible to use the triplet model we have right now (where we need to calculate the 768-dimension embedding of the title of the document: since we only have a link and no document title, we cannot do that). So I was looking for another model architecture usable in our predicament and found Transformer Memory as a Differentiable Search Index. The paper argues that instead of using a dual-encoder method (where we encode the query and the document in the same space and then find the document which is the nearest neighbour to the query), we can use the differentiable search index (DSI), where a neural network maps the query directly to the document. The paper presents a number of methods to achieve this, but the easiest one for me to implement at this time was to simply assign each document a number, have the output layer of the network be composed of as many neurons as there are documents, and make the network essentially assign probabilities to each document given a query. Additionally, the paper performs this work with a Transformer architecture, raising the possibility of us integrating nanoGPT into the future architecture. I implemented an intermediate version of the network whereby the same encoder that Marcel used (the allenai/specter language model) encodes a query and the output is the probability of each document individually; a sketch of this variant is shown below. The rest of the architecture is left unmodified. Moving forward, I'm looking to finally implement a good number of peers in a network that send each other the query and answer (from ORCAS) and get the model to train.
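To make the intermediate DSI variant concrete, here is a hedged sketch: a frozen allenai/specter encoder produces the 768-d query embedding and a linear head assigns a probability to every document ID. The document count and class structure are illustrative assumptions, not the actual implementation.

```python
# Sketch: DSI as classification over document IDs, with a frozen specter encoder.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

NUM_DOCS = 100_000  # illustrative; ORCAS itself is far larger

tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
encoder = AutoModel.from_pretrained("allenai/specter")

class DSIHead(nn.Module):
    def __init__(self, num_docs: int, dim: int = 768):
        super().__init__()
        self.classifier = nn.Linear(dim, num_docs)

    def forward(self, query: str) -> torch.Tensor:
        tokens = tokenizer(query, return_tensors="pt", truncation=True)
        with torch.no_grad():  # encoder kept frozen in this sketch
            emb = encoder(**tokens).last_hidden_state[:, 0]  # [CLS] embedding
        return self.classifier(emb).softmax(dim=-1)  # P(doc | query)

head = DSIHead(NUM_DOCS)
probs = head("deep learning for information retrieval")
print(probs.argmax(dim=-1))  # index of the most probable document ID
```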
Cool stuff 👍 Could you tell me more about your performance metrics? I have two questions:
This matters a lot for deployment in Tribler.
But keep in mind, this is extremely preliminary; I did not implement NanoGPT with this setup, so that's bound to increase computing requirements.
Paper idea to try out for 2 weeks:
Related-work example of LLMs for search on GitHub, called vimGPT: vimgpt.mov
I got the T5 LLM to generate the IDs of ORCAS documents.
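A hedged sketch of what this step might look like with Hugging Face transformers: in DSI-style training each document ID is treated as a target string for T5 to generate. The checkpoint name is a placeholder and the fine-tuning loop on (query, doc_id) pairs is omitted.

```python
# Sketch: generating candidate ORCAS doc IDs with a (placeholder) T5 model.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")  # in practice, fine-tuned on (query, doc_id) pairs

def generate_doc_ids(query: str, num_candidates: int = 5):
    inputs = tokenizer(query, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_length=12,              # doc IDs are short token sequences
        num_beams=num_candidates,   # beam search over candidate IDs
        num_return_sequences=num_candidates,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

print(generate_doc_ids("symptoms of vitamin d deficiency"))
```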
I was looking into what to do moving forward. I found a survey paper on the use of LLMs in the context of information retrieval. It was very informative; there's a LOT of research in this area at the moment. I made a list of 23 papers referenced there that I'm planning to go through at an accelerated pace. At the moment I'm still wondering what to do next to make the work I've already performed publishable by the conference deadline on the 5th of January.
update
In the past weeks I've introduced 10 users who send each other query–doc_id pairs. The mechanism implemented is the following:
For the future I think trying to use DAS6 to perform a test with 100 peers may be worthwhile, to check the integrity of the model and its evolution as the number of peers increases. A rough sketch of this kind of exchange follows below.
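Since the mechanism itself isn't spelled out above, here is only a minimal, assumption-laden simulation of the exchange as described: N peers gossip (query, doc_id) pairs to random neighbours and buffer what they receive for local training. All names and the fanout are illustrative.

```python
# Sketch: peers gossiping (query, doc_id) pairs; training step left abstract.
import random

class Peer:
    def __init__(self, pid: int):
        self.pid = pid
        self.seen_pairs = []  # stands in for a local training buffer

    def receive(self, query: str, doc_id: str):
        self.seen_pairs.append((query, doc_id))
        # in the real prototype, a training step on the local model goes here

def gossip_round(peers, pairs, fanout: int = 2):
    for query, doc_id in pairs:
        for target in random.sample(peers, fanout):
            target.receive(query, doc_id)

peers = [Peer(i) for i in range(10)]   # 10 users, as in the experiment above
gossip_round(peers, [("best c++ tutorial", "D1834952")])  # ORCAS-style pair (illustrative)
print(sum(len(p.seen_pairs) for p in peers))  # pairs delivered this round
```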
AI with access to all human knowledge, art, and entertainment. AGI could help humanity by developing new drugs and treatments for diseases, and by turbocharging the global economy.
Related: How is AI impacting science? (Metascience 2023 Conference in Washington, D.C., May 2023.)
Public AI with associative democracy
Who owns AI? Who owns The Internet, Bitcoin, and Bittorrent? We applied public infrastructure principles to AI. We build an AI ecosystem which is owned by both nobody and everybody. The result is a democratically self-governing association for AI. We pioneered 1) a new ownership model for AI, 2) a novel model for training, and 3) competitive access to GPU hardware. AI should be public and contribute to the common good. More than just open weights, we envision full democratic self-governance. AI improvements are a social process! The way to create long-enduring communities is to slowly grow and evolve them. The first permissionless open source machine learning infrastructure was Internet-deployed in 2012.
Solid progress! Operational decentralised machine learning 🚀 🚀 🚀 De-DSI for the win. A possible next step is enabling unbounded scalability and on-device LLMs. See Enabling On-Device Large Language Model Personalization with Self-Supervised Data Selection and Synthesis or the knowledge graph direction. We might want to schedule both! New hardware will come for the on-device 1-bit LLM era.
update: Nature paper 😲 Uses an LLM for parsing 1200 sentences and 1100 abstracts of scientific papers. Avoids the hard work of PDF knowledge extraction. Structured information extraction from scientific text with large language models
Poster for the De-DSI paper:
In the last few days I've read papers on
I also thought about how a mixture-of-experts with multi-layered semantic sharding would work (a rough routing sketch is below). At the moment, something I could try would be:
I also haven't found any paper on personalized models in decentralized federated learning, so it would be an unexplored gap and thus maybe easy to publish about.
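One possible reading of the semantic-sharding intuition, sketched under heavy assumptions: cluster item embeddings, treat each cluster as an "expert" shard, and route a query to the shard whose centroid is nearest. Deeper sharding layers would re-cluster within a shard; none of this is the actual design.

```python
# Sketch: first layer of semantic sharding via k-means routing.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(1000, 64))  # placeholder semantic vectors

NUM_SHARDS = 8  # assumed shard count for the first sharding layer
router = KMeans(n_clusters=NUM_SHARDS, n_init="auto").fit(item_embeddings)

def route(query_embedding: np.ndarray) -> int:
    """Return the shard (expert) responsible for this query."""
    return int(router.predict(query_embedding.reshape(1, -1))[0])

print(route(rng.normal(size=64)))
```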
Focus on finding a PhD problem to solve. Avoid the "technology push" that makes much science useless. We need GPUs for training. We need a dataset. We need a publishable problem. Perhaps it is time to dive for 3 weeks into a production system? Some ideas and links
Hipster publishable idea: secure information dissemination for decentralised AI (e.g. MeritRank, clicklog, long-lived ID, sharing data, not an unverifiable vector of gradient descent)
btw about teaching... prepare for helping out with MSc students more + the master course on Blockchain Engineering.
update: machine learning for 1) personalisation 2) De-DSI content discovery 3) decentralised seeder content discovery {the DHT becomes 👉 IPv4 generative AI} 4) sybil protection 5) spam protection 6) learn-to-rank
In the last few weeks I was on vacation. After that, I got a recommendation engine working based on collaborative filtering of the MovieLens dataset. Nothing too fancy, just an SVD algorithm applied to the MovieLens-1M data (a sketch follows after the next paragraph). I've also read a few papers, including a literature review on foundation models in recommendation algorithms. I got two preliminary ideas for future research that I haven't yet seen implemented:
The two ideas could be used together as well, I imagine.
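For reference, a hedged sketch of the SVD baseline mentioned above, using the `surprise` library (one common choice; the actual implementation may differ in library and hyperparameters).

```python
# Sketch: plain SVD collaborative filtering on MovieLens-1M with `surprise`.
from surprise import SVD, Dataset, accuracy
from surprise.model_selection import train_test_split

data = Dataset.load_builtin("ml-1m")  # prompts a download on first use
trainset, testset = train_test_split(data, test_size=0.2)

algo = SVD(n_factors=100)  # latent dimensionality is an assumption
algo.fit(trainset)
accuracy.rmse(algo.test(testset))

# Predict how user "1" would rate movie "1193"
print(algo.predict(uid="1", iid="1193").est)
```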
Still a few months left to find a great paper idea 🕙 "As simple as possible" architecture: 3 items sent; 3 recommended items received.
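A minimal sketch of that primitive, under stated assumptions: a peer receives 3 item IDs and answers with the 3 items that co-occur most often with them in its own local, never-shared clicklog. The co-occurrence heuristic and all data are illustrative.

```python
# Sketch: 3 items in, 3 recommended items out, from a local clicklog.
from collections import Counter

LOCAL_CLICKLOG = [  # illustrative sessions of MovieLens-style item IDs
    [1, 32, 260], [1, 260, 1196], [32, 50, 260], [50, 1196, 2571],
]

def recommend(liked: list[int], k: int = 3) -> list[int]:
    counts = Counter()
    for session in LOCAL_CLICKLOG:
        if any(item in session for item in liked):
            counts.update(i for i in session if i not in liked)
    return [item for item, _ in counts.most_common(k)]

print(recommend([1, 32, 50]))  # -> 3 recommended item IDs
```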
Paper idea: aim to have a recommender without clicklog leakage. No text queries. Peers do not explicitly exchange profiles. Spread real clicklog snippets from an unknown peer. Focus on unlinkability. Peers replay old recommendation requests to hide their own requests. Use this as a naive approach, with a known spam vulnerability.
goal for 19 Aug 2024: the above architecture. 100 IPv8 peers listening; send 3 items to a random peer and you get 3 recommended items back. MovieLens. Outcome format: a single amazing .GIF .... 🎉
update: share the embedding with another user. This could somehow be used to train a model. On-device model. One protocol query/response for both real-time search/recommendation and online continual learning in the background. Build upon our strength: permissionless gen-AI with full scalability. Possible goal:
In the last 2 months I went with Marcel to the Oxford NLP summer school, took a vacation back home, and worked on an idea I had recently. I refreshed my understanding of the topic, not having touched it professionally in the last few years. The professor was from King Abdullah University in Saudi Arabia; his name is Naeemullah Khan. While there I thought more deeply about an idea I came up with previously, and pitched it to Prof. Khan and a postdoc from a lab at Oxford, Dr. Naman Goel. The idea is to use the upcoming Microsoft Recall feature (which takes screenshots of the activity on the PC every few minutes) to get an idea of the preferences of the user. These preferences can be used to generate query recommendations for web services, including Tribler. Both Prof. Khan and Dr. Goel gave their approval, and Dr. Goel even said he's willing to contribute with weekly calls and analysis of results (the code would be my task).
Venue: LCN or the Collective Intelligence journal: https://journals.sagepub.com/editorial-board/COL
A potentially interesting topic for your PhD is to check out self-evolving distributed ontologies based on tries and, at least in this text-based proof of concept, on Gemini (but other models like ChatGPT should also work). Of course, communicating with the Gemini model using human language is (probably) not a good way forward, and this would need some more sophisticated hooking into the underlying model (i.e., Gemini here). My txt-based intuition is here: learningtrees.txt
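My reading of the trie intuition, as a minimal sketch (the actual proposal lives in learningtrees.txt; everything here is assumed): ontology concepts are paths in a trie, and each observed label path reinforces the nodes it traverses, so the ontology "evolves" with usage.

```python
# Sketch: a usage-weighted trie as a self-evolving ontology.
class TrieNode:
    def __init__(self):
        self.children: dict[str, "TrieNode"] = {}
        self.weight = 0  # how often this concept has been observed

    def insert(self, path: list[str]):
        node = self
        for concept in path:
            node = node.children.setdefault(concept, TrieNode())
            node.weight += 1  # self-evolving: usage reinforces the branch

    def dump(self, depth: int = 0):
        for concept, child in sorted(self.children.items()):
            print("  " * depth + f"{concept} ({child.weight})")
            child.dump(depth + 1)

root = TrieNode()
root.insert(["science", "ai", "nlp"])
root.insert(["science", "ai", "vision"])
root.dump()
```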
Great progress! For next sprint
update
update2
Since you're doing automatic content analysis for decentralized search, here are some papers for related work:
For the past 2 weeks I was reading papers and trying to understand the cutting edge in distributed training. In particular, I focused on a recent preprint paper. I spent time understanding the mathematics of the issue (convergence and privacy guarantees) and made good progress. I realised that in order to perform this kind of work I would need to go through the references to understand the theorems used in this field. This would take a while; it remains to be decided whether it's a good use of my time. Additionally, I ran their algorithm, posted here.
A systems or networking storyline for publication: IEEE LCN or PETS or Middleware. The future ambition is NeurIPS or ICML. For the next meeting in 2 weeks: attack ideas, IPv8 porting effort, get an experiment graph out of SHATTER
I have further looked into the code from SHATTER, data inference/reconstruction attack methods, and (as per Jeremie's recommendation) into MixNN, which does similar work, though more basic. I presented the attack idea on models which mix their parameters and send them to different people to Dr. Naman Goel from the Oxford lab, and he suggested that since the method is not widely accepted, it may be an attack on an architecture which not many people use, thus being not very interesting. I thought of looking into byzantine attacks in decentralized networks, then saw that a normal gradient-similarity method was already published in June this year, so I'd have to see if I can come up with something new. I found a literature review on the topic which I believe would be useful to read.
Idea: a user has consumed some content, each item with a semantic coordinate (calculated with an LLM, for example). Then we calculate the semantic coordinate of the user as the average of the coordinates of the content they have consumed. If I search with a query, I get the coordinates of the query and then check around me for the people whose semantic coordinates are closest to the query; I then ask them, as they are the users most likely to have content in which I'm interested. A sketch of this routing follows below.
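A hedged sketch of that routing idea: a user's semantic coordinate is the mean of the embeddings of consumed content, and a query is routed to the peers whose coordinates lie closest to it. The embeddings here are random stand-ins for LLM-derived vectors.

```python
# Sketch: route a query to the peers with the nearest semantic coordinates.
import numpy as np

rng = np.random.default_rng(42)

def user_coordinate(consumed_embeddings: np.ndarray) -> np.ndarray:
    return consumed_embeddings.mean(axis=0)  # average of consumed content

def nearest_peers(query_emb, peer_coords, k=3):
    # cosine similarity between the query and every peer coordinate
    sims = peer_coords @ query_emb / (
        np.linalg.norm(peer_coords, axis=1) * np.linalg.norm(query_emb)
    )
    return np.argsort(-sims)[:k]  # indices of the k most promising peers

peer_coords = np.stack(
    [user_coordinate(rng.normal(size=(20, 64))) for _ in range(50)]
)
print(nearest_peers(rng.normal(size=64), peer_coords))
```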
A document is needed for the PhD progress meeting. Mixture-of-Experts scaling is a great opportunity for decentralisation, which we already talked about on 18 Oct 2023. Idea outline:
update: much related work exists on 6G federated learning. Yet it is highly theoretical, impractical, and immature. Great stuff to help realise for real 😃 IEEE/ACM Transactions on Networking CfP
15 Jan 2025 deadline, super rush! 🤔
Idea 1: Decentralized file-search based on taste embeddings
Description: When searching for a file in a decentralized network, instead of flooding the network with the query, the system finds people who have items similar to my query and only queries them.
Idea 2: Decentralized learning with model-parallelism
Description: Investigate different aspects of model training in decentralized networks when single nodes can hold only a section of the model (a sketch follows below). Methodology:
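A minimal sketch of the model-parallel setting in Idea 2, under assumptions: two simulated nodes each hold only a section of the model, and the forward pass moves activations (not parameters) between them. A real deployment would replace the direct function call with a network hop; layer sizes are illustrative.

```python
# Sketch: pipeline-style model parallelism across two simulated nodes.
import torch
import torch.nn as nn

class NodeA(nn.Module):  # holds the first section of the model
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(nn.Linear(64, 128), nn.ReLU())

    def forward(self, x):
        return self.layers(x)

class NodeB(nn.Module):  # holds the second section
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(nn.Linear(128, 10))

    def forward(self, activations):
        return self.layers(activations)

node_a, node_b = NodeA(), NodeB()
x = torch.randn(8, 64)
logits = node_b(node_a(x))  # "sending" activations from node A to node B
print(logits.shape)  # torch.Size([8, 10])
```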
Brief update after a discussion with Naman: both ideas should be pursued at the same time. If the second fails to deliver because the field is too crowded, at least I have the first one. So the general plan with De-DSI:
And the general plan with the decentralized model-parallel training:
Some reading pointers:
- Semantic Overlay Networks. Arturo Crespo and Hector Garcia-Molina
- Kademlia: A Peer-to-Peer Information System Based on the XOR Metric
- Epidemic Broadcast Trees
ToDo: determine PhD focus and scope
PhD funding project: https://www.tudelft.nl/en/2020/tu-delft/eur33m-research-funding-to-establish-trust-in-the-internet-economy
Duration: 1 Sep 2023 - 1 Sep 2027
First weeks: reading and learning. See this looong Tribler reading list of 1999-2023 papers, the "short version". The long version is 236 papers 😄 . Run Tribler from the sources.
Before doing fancy decentralised machine learning and learn-to-rank, first have stability, semantic search, and classical algorithms deployed. Current dev team focus: #3868
update: Sprint focus? Read more Tribler articles and get this code going again: https://github.com/devos50/decentralized-rules-prototype
Dreams from a young man 👴 From IETF Journal Oct 2012, "Moving Toward a Censorship-free Internet" (page 16), using phone-to-phone communication as used during the Arab Spring uprising.
Wise words on the difficulty of distributed systems for young engineers/scientists (see also the discussion on Hacker News)