Sort On Demand #328
To me, this is a "don't fix it if it ain't broke" situation. I'm not even sure that at 900 news items this will be a significant factor. Looping over arrays in JavaScript isn't that heavy; it's very light compared to database queries in terms of time to response. The current implementation isn't exactly efficient, though (I hadn't thought about looping over every vote for every post; that's a little scary). I would, however, be in favor of profiling the front-page render. Going to town optimizing everything is the wrong way to approach it, but if we find that one of the items in the list above is taking a really large amount of time, we might be able to optimize just that piece. The trouble with storing the score with the news item itself is that while all posts age at the same rate, the score decays exponentially rather than linearly, so you really do need to recalculate the score for each item to get the proper order. If we ever got a really large user base, I would say heavy caching would be appropriate for rendering the front page (as I'm sure reddit does), but I don't think constantly keeping scores up to date really works.
If we can make that query more efficient without caching, I would be fine with that. Something needs to be done, though. Right now I'm getting 2s load times for the main page (before assets, just the first request), which is a long time. Yesterday we were getting requests timing out in the morning, which showed users big ugly application-down pages. I'm pretty sure this was because requests were queuing up, and 2s per request was too slow to clear them out before the requests timed out.
I agree, it's too slow in its current form. I was chalking that page load ...
Here's a possible solution I thought up. I'll have to be brief now, but maybe I can flesh this out more when I have more time on my hands next week. What we could do is find a way to cache the news list. We could even do this "cache" in mongo: create a new collection, and add documents with an article id and a score as calculated by our algorithm. Every minute or so, have a task go over these documents and update their scores. We could also eagerly update a document when an item's votes change. Then the front-page query is a very simple query over this collection, ordered by score and limited to the number of news items on a page.
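A minimal sketch of that cache-collection idea, assuming Mongoose and hypothetical model/field names (`ScoreCache`, `NewsItem`, `created`, and a `calculateScore` function, none of which exist in the codebase as written here):

```js
var mongoose = require('mongoose');

// Hypothetical cache collection: one document per news item, holding its current score.
var scoreCacheSchema = new mongoose.Schema({
  item:  { type: mongoose.Schema.Types.ObjectId, ref: 'NewsItem', index: true },
  score: { type: Number, index: true }
});
var ScoreCache = mongoose.model('ScoreCache', scoreCacheSchema);

// Recalculate and upsert the cached score for a single item,
// e.g. called eagerly whenever that item's votes change.
function refreshScore(item, voteCount, callback) {
  var score = calculateScore(voteCount, item.created); // assumed existing scoring function
  ScoreCache.update(
    { item: item._id },
    { $set: { score: score } },
    { upsert: true },
    callback
  );
}

// The front page then becomes a single indexed query:
function frontPage(pageSize, callback) {
  ScoreCache.find()
    .sort('-score')
    .limit(pageSize)
    .populate('item')
    .exec(callback);
}
```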
I didn't realize the main page was taking 2s to load - that's clearly a problem. I do think we need to profile the main page to figure out what's causing the issue, although looping over votes and items is a likely culprit, and could be addressed with a cached collection.
@dfjones This is exactly how I've implemented this in the past; in that case I just stored the score with the news item in the DB. Then, as you said, displaying the front page is a simple descending-order query, and who cares if the score is slightly inaccurate in between refreshes. We certainly don't have enough news item volume to notice any problems anyway; better to optimize the front page loading speed and give up some accuracy on the scoring.
@megamattron I'd be more in favor of a separate collection that is just a front-page cache (or first n pages) rather than storing the score in the main collection.
@treygriffith Just wondering, why would you prefer a full cache rather than just an additional field on the news item collection? Is there a good reason for the data duplication?

There is an unfortunate requirement that we need to keep in mind here: Heroku gives us one dyno for free - anything above that is going to be $30(ish) per month. I don't know what @megamattron's plans are for scaling, but I would like to avoid anyone having to pay for Pullup for as long as possible. Having a scheduled job that recalculates scores every few minutes would require a separate worker dyno (it could be done on the main thread, but that will not scale well with Node's single-process architecture).

I think we have to either speed up the current live scoring process or cache the scores on demand (i.e. on a new vote or a new post). I'll think on it more, in hopes of finding a way that the age/gravity stuff could still work with this approach.
You're going to have to have a background process whether you have a cache collection or not. Having a process that is constantly updating all of the news items, as well as updating on votes, etc., starts to create problems because you can potentially have multiple actors simultaneously updating the same record. Imagine if we add the ability to edit self posts: any time you save your edits at the same time that the background process is updating, or that a user is casting a vote, there will be a collision. If we're caching data, I think it's clearer to actually use a cache than to have parts of every record be cached and other parts not.
My initial thinking about a separate collection is that it might be slightly easier to manage. You can limit the number of documents stored there, so whatever is working over them can be limited in the time/resources it consumes. We can do this too if we just add scores to the existing documents, but it feels cleaner to me to separate this out. As for the update worker, I vote to do the simplest thing that could possibly work. I'm not sure how this fits with Heroku, but Node has a setTimeout-like function that we could use to call the updater periodically. Node is a single process, but anything that runs as an event callback should be fine.
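A minimal sketch of that setTimeout-based updater, assuming a hypothetical async `updateScores` function that recalculates and stores scores (nothing like it exists in the codebase yet):

```js
var UPDATE_INTERVAL = 60 * 1000; // roughly once a minute

// Recursive setTimeout rather than setInterval, so a slow run
// can never overlap with the next one.
function scheduleScoreUpdates() {
  setTimeout(function () {
    updateScores(function (err) { // assumed async updater: recalculates and stores scores
      if (err) console.error('score update failed:', err);
      scheduleScoreUpdates();     // schedule the next run only after this one finishes
    });
  }, UPDATE_INTERVAL);
}

scheduleScoreUpdates();
```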
@dfjones If we intend to run the updater every minute or so, I think we will be creating more overhead than we already have. I don't think we currently have someone loading the homepage every minute (although maybe our crashes only occur with burst traffic). It is (sort of) true that setTimeout will wait for the event queue to empty out before it runs, but any new requests that come in while the updater is running will have to wait for it to finish. I'm not entirely sure (I would like to build in some profiling), but I think most of the time is spent processing inside of Node rather than waiting on IO, which means the thread will be blocking all other requests for most of those 2 seconds.

@treygriffith I'm trying to work towards a solution that doesn't have a process constantly updating all of the news items. I'd like to find a solution that only updates the scores and/or cache when an event that would cause the order to change happens; that should solve the collision issue. I don't yet know if such a solution is possible, but it seems like it should be. I am going to spend some time this afternoon thinking on it; hopefully there's a clever solution that meets all of the goals here.
Does something like https://github.com/ncb000gt/node-cron execute on the web-serving thread? If not, that really seems to me like the best solution. It'll be doing roughly the same amount of work we're doing now every time we load the news list, but only infrequently; even every 10 mins will probably cover the level of activity we have for the foreseeable future. As for where to store the score, creating a duplicate structure to store the news just to add an additional int value seems to have dubious benefits and concrete downsides. Please don't read the above with a negative tone, none is intended! In general, though, I think an important goal for this site should be to do things in the simplest way possible. It will help new users get up to speed and just keep things moving forward in general. We can always fix things when they start having problems in the future. @treygriffith If you end up finding a simple alternative to that I'm all for it, but I'm hesitant to get involved in anything too complicated.
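For reference, a rough sketch of wiring the updater to that cron package (`updateScores` is still the hypothetical function that recalculates and stores scores). Note that its callback runs on the normal event loop, so it shares the single thread with web requests:

```js
var CronJob = require('cron').CronJob;

// Run the score updater every 10 minutes on the same process/event loop.
var job = new CronJob('0 */10 * * * *', function () {
  updateScores(function (err) {
    if (err) console.error('score update failed:', err);
  });
});

job.start();
```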
Ok, I can agree to keeping the score on the current news items. As for the actual processing of scores, it is my understanding that Node.js has a single thread. To get around this, I think we will need to use something like ... Perhaps there is a library or technique we can use that is a little more convenient than calling ... Edit: did some searching and found event-stream. Maybe this, in combination with the cron library @megamattron linked, will be a nice way to structure this work?
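Not event-stream itself, but to illustrate the underlying idea (yield back to the event loop between items so web requests aren't starved), here is a sketch using only core Node's setImmediate, with an assumed array of Mongoose documents and a hypothetical `calculateScore` function:

```js
// Process items one per event-loop turn instead of in one long
// synchronous loop, so incoming requests can be served in between.
function updateScoresIncrementally(items, done) {
  var i = 0;

  function next() {
    if (i >= items.length) return done(null);

    var item = items[i++];
    item.score = calculateScore(item.voteCount, item.created); // assumed scoring function and fields
    item.save(function (err) {                                 // assumed Mongoose document
      if (err) return done(err);
      setImmediate(next); // yield to the event loop before the next item
    });
  }

  next();
}
```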
As a test, I loaded my dev database with 1000 posts and randomly generated between 1 and 6 votes for each post (3471 votes total). With this data set, response times were averaging in the mid-13-second range per request. For reference, the ...

Profiling (and please correct me if I'm misinterpreting these results) seems to indicate that the culprit is the ... (note that when running the profiler my response times rose to ~15s+). Caching or pre-calculating the scores is premature IMO, and we should at least attempt to improve the way votes get totaled for submissions.

For the record, I definitely think that caching/saving scores between requests is a sound idea overall (at the very least we shouldn't be re-calculating them /every/ request if we can avoid it), but we should improve (within reason) the scoring functionality before we deal with caching. Has anyone else tried loading larger datasets? If so, I'd be interested in hearing if your experience differed from my own.

Lastly, one other thing to consider which I don't think I've seen mentioned in this thread: perhaps we went down the wrong path by declaring a reference relationship between the Vote and NewsItem collections (rather than embedding votes).
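A rough sketch of how that kind of test data could be generated, assuming the app's `NewsItem` and `Vote` Mongoose models are already registered and a connection is open; the field names here are guesses, not the real schemas:

```js
// Create `count` posts, each with 1-6 randomly generated votes, one post at a time.
function seed(NewsItem, Vote, count, done) {
  if (count === 0) return done(null);

  NewsItem.create({
    title: 'Load test post ' + count,         // hypothetical fields
    url: 'http://example.com/' + count
  }, function (err, item) {
    if (err) return done(err);

    var votes = [];
    var n = 1 + Math.floor(Math.random() * 6); // 1-6 votes per post
    for (var i = 0; i < n; i++) {
      votes.push({ item: item._id });          // hypothetical reference field
    }

    Vote.create(votes, function (err) {
      if (err) return done(err);
      seed(NewsItem, Vote, count - 1, done);   // next post
    });
  });
}
```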
Yikes, that's pretty bad. As far as the design of the database goes (i.e. the more relational approach), I'm the one that did that, and here's my reasoning: #146 (comment). I hadn't considered putting Votes as a sub-document on comments and news stories, which may have been the better decision. I'd definitely be willing to revisit that.
Or rather, I did consider it, but the benefits might outweigh the drawbacks I listed in that comment.
I totally agree with your rationale in #146. I do kind of like the idea of embedding the vote data in the NewsItems, but I can't help but think that, as the site grows, we'd simply be trading a "now" problem for a "later" problem. I do think there is an easier solution, however: building on the idea of pre-calculating/caching the scores, we could very easily store the aggregate vote count and continue to live-calculate the scores. My tests show that assembling the vote/item relationship is what is really taking most of the time. I just pushed an example to my iss328 branch which demonstrates this (warning: this was quick and dirty, so there are bugs; e.g., scores on Issues don't show). In short: I've added a post-save hook to the Vote model so every saved vote increments an aggregated vote count on the associated item. Now when scoring we can skip the time-consuming lookup of votes for all items. With only this change, my tests show a significant performance increase. Downsides: ...
As I said, this is just a quick example to get some feedback on whether you all think this is worth pursuing, so any comments/concerns are appreciated.
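A minimal sketch of what that post-save hook could look like, assuming Mongoose and hypothetical field names (an `item` reference on Vote, a `voteCount` on NewsItem); the actual iss328 branch may differ:

```js
var mongoose = require('mongoose');

// In the Vote model definition (voteSchema assumed to exist there):
voteSchema.post('save', function (vote) {
  // Keep an aggregated voteCount on the associated news item, so scoring
  // no longer has to look up every vote for every item on each request.
  mongoose.model('NewsItem').update(
    { _id: vote.item },
    { $inc: { voteCount: 1 } },
    function (err) {
      if (err) console.error('failed to increment voteCount:', err);
    }
  );
});
```

Scoring can then read `item.voteCount` directly instead of tallying votes on every request.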
I think you're right: we shouldn't focus on caching the scores themselves, but should cache the vote count. I took a look at your branch; I think it looks good, and probably the right way to go. It doesn't introduce too many future problems that I can think of (especially as compared to embedding Votes). The only thing I think should be added is some sort of balancing routine that can occasionally (once a day?) go through the Votes collection and update items with their proper vote count. That would be a way to get to a sort of eventual consistency. When we implement that, it would also solve your second concern, as the first run of the consistency routine would update all of the existing items with the correct vote count.
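A sketch of what that occasional balancing routine could look like, using the same hypothetical `item`/`voteCount` field names; MongoDB's aggregation framework makes the recount a single query:

```js
var mongoose = require('mongoose');

// Recount votes per item and overwrite each item's cached voteCount,
// correcting any drift from the incremental $inc updates.
function reconcileVoteCounts(done) {
  var Vote = mongoose.model('Vote');
  var NewsItem = mongoose.model('NewsItem');

  Vote.aggregate([
    { $group: { _id: '$item', count: { $sum: 1 } } }
  ], function (err, results) {
    if (err) return done(err);

    var remaining = results.length;
    if (remaining === 0) return done(null);

    results.forEach(function (row) {
      NewsItem.update({ _id: row._id }, { $set: { voteCount: row.count } }, function (err) {
        if (err) console.error('failed to update voteCount for', row._id, err);
        if (--remaining === 0) done(null);
      });
    });
  });
}
```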
Thanks for the reply. Given how little other feedback there has been on this idea, I think I'm going to break my suggestion out into three smaller parts: ...
When I have some free time I'll refactor my code and send a pull request for step 1. If that is approved, we can revisit how we want to proceed.
That sounds like a good plan of action to me. You're right that anything ... Since we'll probably have to sync up initially, we can just make that ... But we can discuss all that once you've done part 1. Thanks for putting in ...
I sent a PR (#334) a couple of days ago for comment/discussion, as I mentioned above. If you all would prefer to go another way, please feel free to close it. Thanks.
I'd vote to merge @cgroner's PR #334 and implement the rest of his plan outlined above. Having the vote count pre-calculated for each news item will speed up the score-based sorting. After we have this in place, we can evaluate where we are and whether we need to take further action to improve the performance of score-based sorting.
Right now, when you load the homepage (or any subsequent news page), here's what happens:

1. Every news item is fetched from the database.
2. For every post, we loop over every vote to total up that item's votes.
3. A score is calculated for each item from its vote total and age.
4. The items are sorted by score and the page is rendered.
As you can see, this is a terribly heavy process. We're not feeling the full weight of it yet, because we don't have 900 posts, but Pullup will continue to degrade in performance as we post more topics.
Along with some improvements to the scoring process, I suggest that we update the score on demand. Currently, posts will only change order based on two events: a new vote or a new post. The score is calculated based on age, but all posts age at the same rate, so the scores should age similarly. I think we should recalculate scores every time a vote is placed or a post is added, and store that value in the database.
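A minimal sketch of that on-demand approach, assuming a hypothetical `calculateScore` function and a stored `score`/`voteCount` on each news item (none of which exist yet): the score is refreshed from the vote and post-creation handlers, and the news page just sorts on the stored field.

```js
var mongoose = require('mongoose');

// Recalculate and store an item's score. Call this from the vote and
// post-creation handlers instead of scoring every item on every page load.
function updateItemScore(itemId, done) {
  var NewsItem = mongoose.model('NewsItem');

  NewsItem.findById(itemId, function (err, item) {
    if (err || !item) return done(err);
    item.score = calculateScore(item.voteCount, item.created); // assumed scoring function and fields
    item.save(done);
  });
}

// The news page then reduces to a single indexed sort:
function newsPage(page, pageSize, callback) {
  mongoose.model('NewsItem').find()
    .sort('-score')
    .skip(page * pageSize)
    .limit(pageSize)
    .exec(callback);
}
```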