UER Forum > Private Boards Index > UER Database Talk > How UER's 'advanced' readstates system works on the LDB (Viewed 16477 times)
Avatar-X 

Alpha Husky


Location: West Coast
Gender: Male
Total Likes: 765 likes


yay!

How UER's 'advanced' readstates system works on the LDB
< on 12/6/2010 10:33 PM >
Taking a minute of time to talk about how it works.

UER's DB is running MySQL 5 on 64-bit Windows 2003 Server.

The key requirement of the 'advanced' readstates system is that it must be able to produce a list of locations with an icon or status indicator next to each one, showing that something has changed.

The current status indicators are:
- New location
- New photo
- New comment
- New story
- New thread
- New post in thread
- Change to location

Here's how I do it.

First, each location has a series of fields on the 'location' table that keep a count of the number of items in that location. The fields are nPics, nGals, nStory, nThreads, nPosts. When a photo is added, the nPics count is incremented. When a photo is deleted, nPics is decremented. The same for all of the other counts.

If the counts get out of sync, it is relatively simple to fix them by simply counting all of the photos in the location and then updating it.
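As a runnable sketch of that counter-plus-resync idea (SQLite stands in for MySQL here, and the minimal schema is invented, not UER's real one):

```python
import sqlite3

# Invented minimal schema: a location with an nPics counter, plus the photos themselves.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE location (locid INTEGER PRIMARY KEY, nPics INTEGER DEFAULT 0);
CREATE TABLE photo (photoid INTEGER PRIMARY KEY, locid INTEGER);
""")
db.execute("INSERT INTO location (locid) VALUES (1)")

def add_photo(photoid, locid):
    db.execute("INSERT INTO photo VALUES (?, ?)", (photoid, locid))
    db.execute("UPDATE location SET nPics = nPics + 1 WHERE locid = ?", (locid,))

for pid in range(1, 6):
    add_photo(pid, 1)

# If the counter ever drifts, recompute it from the ground truth:
db.execute("""
UPDATE location
SET nPics = (SELECT COUNT(*) FROM photo WHERE photo.locid = location.locid)
""")
npics = db.execute("SELECT nPics FROM location WHERE locid = 1").fetchone()[0]
```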

Next there is another table called 'locseen'. This table stores, for each location, for each user, the counts of what you have SEEN on that location. This table contains a User ID, a Location ID, and nPics, nGals, nThreads, etc.

So, if a location has 5 pictures, the nPics on the location will be 5. If you've seen 3 of those pictures, your nPics entry for that location will be 3.

Now it is relatively simple to generate a list of all locations and include a field for how many UNSEEN pictures there are:

SELECT L.locid, (L.nPics - COALESCE(S.nPics, 0)) FROM location AS L LEFT JOIN locseen AS S ON S.locid = L.locid AND S.userid = 'userid'

where 'userid' is your user id from your login data.

This creates a list of locations, with a count next to each one, indicating how many pictures there are that we haven't seen (total pics - seen pics = unseen pics). With this data we can easily display a blue dot next to any location where this number is > 0.

Now, we need to be able to roll this data up to a City, Province, and finally Country level. This is quite simple. Here we generate a list of locations, grouped by Country:

SELECT L.country, SUM(L.nPics - COALESCE(S.nPics, 0)) FROM location AS L LEFT JOIN locseen AS S ON S.locid = L.locid AND S.userid = 'userid'
GROUP BY L.country

Adding the SUM in there gives us the total number of unseen pictures in all locations, grouped by country. So now we can display a blue dot next to 'Canada' if there is more than 0 unseen pictures in any location in Canada. This makes it easy to "drill down" the list and find where the unseen pictures are.
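Both queries can be demoed end to end; a sketch in SQLite (invented toy data, with COALESCE so a location the user has no 'locseen' row for counts as fully unseen):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE location (locid INTEGER PRIMARY KEY, country TEXT, nPics INTEGER);
CREATE TABLE locseen (userid INTEGER, locid INTEGER, nPics INTEGER,
                      PRIMARY KEY (userid, locid));
INSERT INTO location VALUES (1, 'Canada', 5), (2, 'Canada', 3);
INSERT INTO locseen VALUES (42, 1, 3);  -- user 42 has seen 3 of the 5 pics at loc 1
""")

# Per-location unseen counts (total pics - seen pics):
rows = db.execute("""
SELECT L.locid, L.nPics - COALESCE(S.nPics, 0) AS unseen
FROM location AS L
LEFT JOIN locseen AS S ON S.locid = L.locid AND S.userid = 42
""").fetchall()

# Rolled up by country, as on the front page:
by_country = db.execute("""
SELECT L.country, SUM(L.nPics - COALESCE(S.nPics, 0))
FROM location AS L
LEFT JOIN locseen AS S ON S.locid = L.locid AND S.userid = 42
GROUP BY L.country
""").fetchall()
```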

So, there's a few problems with this approach already:
- Simply producing the front-page list of the UEDB's countries requires the DB to go through every single location record and join it with every single "seen" location record for that user. This process is quite slow, and often results in load times of 30+ seconds for that page.
- The information itself is extremely volatile. I'll cover that next.

So, the big problem is: How do I know how many photos you've seen? I can't just increment a counter each time you look at a photo, because then I'd count the same photo twice. I need to specifically store exactly which photos you have seen, on a per-photo basis.

So, enter a table called "gread". This table stores one entry per user per photo. If an entry is present, the photo has been seen. If an entry is not present, the photo is new.

Now, when you view a photo, I check this table to see if you've already seen it. If there is no matching record, I increase your "view count" by 1 and insert the record. If you have already seen it, I do nothing.
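That check-and-insert can be a single statement guarded by the primary key; a sketch using SQLite's INSERT OR IGNORE (MySQL's INSERT IGNORE plays the same role; the schema here is invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE gread (userid INTEGER, photoid INTEGER,
                    PRIMARY KEY (userid, photoid));
CREATE TABLE locseen (userid INTEGER, locid INTEGER, nPics INTEGER DEFAULT 0,
                      PRIMARY KEY (userid, locid));
INSERT INTO locseen VALUES (42, 1, 0);
""")

def view_photo(userid, photoid, locid):
    # The primary key makes a repeat view a no-op, so nothing is counted twice.
    cur = db.execute("INSERT OR IGNORE INTO gread VALUES (?, ?)", (userid, photoid))
    if cur.rowcount == 1:  # first time this user has seen this photo
        db.execute("UPDATE locseen SET nPics = nPics + 1 "
                   "WHERE userid = ? AND locid = ?", (userid, locid))

view_photo(42, 7, 1)
view_photo(42, 7, 1)  # repeat view: no double count
seen = db.execute("SELECT nPics FROM locseen WHERE userid = 42").fetchone()[0]
```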

This table is the one that is huge. It has over 10 million records and doing anything with this table is a bitch. Even a simple "optimize" takes over 2 hours.

And there's yet another problem too. This table only tracks which photos you've seen. What happens when the counts get out of sync? As you'll see, this happens frequently.

When a photo is deleted from the DB (or moved to another location or another gallery or whatever), the counts get screwed up. Let's say you had seen all 5 photos in a particular location, but now one of those photos is deleted. The count will now be 4 photos, but you'll have seen 5. This means that if a 5th photo is added later, it will not show as a new photo for you.

In addition, due to the roll-up, this also messes up the other locations. 1 new photo + -1 seen photos = 0 new photos. That could be solved with a simple IF statement, but that doesn't solve the first problem.

So, to take care of this issue, I run a script every 24 hours that fixes these counts. It does this by calculating the totals for photos seen for each user, and then overwriting the "seen" counts. This process works, but it is extremely slow, and becomes slower as more users use the Readstates system. This is why I implemented the switch to turn off readstates in the first place, years ago... with everyone using Readstates it was so slow it would choke up the server for hours. Even now, this process takes several hours to complete, every day.

Essentially the script goes through and says 'Av may think he's seen 5 photos here, but I can only find records for 4 photos (since one was deleted) so we're just going to set that to 4'.
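That fix-up amounts to one correlated UPDATE per pass; a toy version (SQLite, invented schema) replaying the deleted-photo scenario:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE photo (photoid INTEGER PRIMARY KEY, locid INTEGER);
CREATE TABLE gread (userid INTEGER, photoid INTEGER);
CREATE TABLE locseen (userid INTEGER, locid INTEGER, nPics INTEGER);
-- user 42 saw photos 1-5 at location 1, then photo 5 was deleted:
INSERT INTO photo VALUES (1, 1), (2, 1), (3, 1), (4, 1);
INSERT INTO gread VALUES (42, 1), (42, 2), (42, 3), (42, 4), (42, 5);
INSERT INTO locseen VALUES (42, 1, 5);
""")

# Overwrite the cached 'seen' count with the number of seen photos that still exist.
db.execute("""
UPDATE locseen
SET nPics = (SELECT COUNT(*)
             FROM gread AS G JOIN photo AS P ON P.photoid = G.photoid
             WHERE G.userid = locseen.userid AND P.locid = locseen.locid)
""")
fixed = db.execute("SELECT nPics FROM locseen").fetchone()[0]
```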

Anyway, there's the UEDB's readstates system in a nutshell. Combined, the tables require some 2.5 GB of HD space and represent 40 million rows of data. You can see why this is a big problem.

In summation, the problem isn't even necessarily my schema, but simply the fact that accessing that much data takes time. The system works, it is just very very slow.

Suggestions?



[last edit 12/6/2010 10:34 PM by Avatar-X - edited 1 times]

huskies - such fluff.
EatsTooMuchJam 


Location: Minneapolis, MN
Gender: Male
Total Likes: 24 likes


Squirty "Stickybuns" von Cherrypants

Re: How UER's 'advanced' readstates system works on the LDB
< Reply # 1 on 12/7/2010 12:02 AM >
First thought:
Joins can be pretty toxic to database performance, and calculating all of that stuff in real time is probably never going to scale well. If your requirements are that granular for all of the database states, you may consider not calculating the data in real time, but rather doing a daily roll-up of all of it. That would also let you run a number of queries once and then, instead of doing a join, do the join work in your own code.
Which type of database table are you using? MyISAM? InnoDB? InnoDB will probably be (overall) more performant.

Second thought:
Shard the 10,000,000 row table that sucks so much to deal with. One of the simplest ways to do it would be to implement a hashing algorithm which takes a username, for instance, and computes a numerical figure from it within a certain range, and then create tables for each part of the range - gread-1-10000, gread-10001-20000, gread-20001-30000, etc.
It's still not optimal and it's likely that some tables will still grow bigger than others, but at least now for each table scan, you're doing a lot less i/o.
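The routing half of that scheme is tiny; a sketch (shard count and table names are invented for illustration):

```python
N_SHARDS = 5

def shard_table(userid: int) -> str:
    # The same user always lands on the same shard table.
    return f"gread_{userid % N_SHARDS}"

t1 = shard_table(12345)  # every query for this user goes to this one table
t2 = shard_table(12346)  # a different user may land on a different shard
```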

Third thought:
Do you have indexes on the fields you're querying most frequently? They'll make writes slower, of course, but a lot of your big reads are going to be wretched if they have to do a full table scan. 10,000,000 rows is a lot, sure, but it's not huge. A full table scan on that many rows is going to be suicide, but an indexed query will be faster.

Fourth thought:
What data type are you using for userids? I assume it's a numeric type and not a string, right?

Fifth thought:
There's a subset of users who don't care about advanced read states in the LDB (including at least me and it sounds like Kowalski too). How hard would it be to make the read state settings more granular such that forum read states could be set differently than LDB states? That could be big.

Sixth thought:
How often do you recompute the read states mentioned below and how often do you update them? Is it all done synchronously on each page view? You could put a queue into a static class and drop updates into the queue. Then have a thread come along every few minutes and do a batch update. If you want to still provide the appearance of updates happening in real time, you could store your result sets in a relatively small local memcached instance. Update it synchronously since memcached is fast and does no disk i/o and then update the database asynchronously behind the scenes. The danger is that if memcached crashes, everything becomes slow - and if the server crashes, any data still in queue will not be written and some users may have their latest few read state changes not reflected after the server comes back.
You're not a financial app. You can tolerate reduced consistency for those operations.
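The queue-and-batch-flush idea might look like this (pure-Python sketch with invented names; the real flush would be one batched write to MySQL, and the dict here just stands in for the table):

```python
from collections import deque

pending = deque()    # in-process queue of read-state updates
table_rows = {}      # stand-in for the real read-state table

def mark_read(userid, threadid, last_read):
    pending.append((userid, threadid, last_read))  # fast path: no disk i/o

def flush_batch():
    # A background thread would run this every few minutes as one batched write.
    # If the server dies first, only the queued (unflushed) updates are lost.
    while pending:
        userid, threadid, last_read = pending.popleft()
        table_rows[(userid, threadid)] = last_read

mark_read(42, 1001, "2010-12-07 00:02")
mark_read(42, 1002, "2010-12-07 00:05")
flush_batch()
```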

Then you could have something like (pseudocode, sorry, no mysql stuff in front of me)


class States {
    public int nPics { get; set; }
    public int nGals { get; set; }
    public int nStory { get; set; }
    public int nThreads { get; set; }
    public int nPosts { get; set; }
}

List<States> userStates = memcached.get("states-" + userid.toString());
if (userStates == null)
{
    userStates = new List<States>();  // cache miss: rebuild from the DB results
    for (int x = 0; x < mysql.resultcount; x++) {
        userStates.add(new States { nPics = result.nPics, nGals = result.nGals,
            nStory = result.nStory, nThreads = result.nThreads, nPosts = result.nPosts });
    }

    memcached.store("states-" + userid.toString(), userStates);
}

Given that there are around 8k locations in the database and assuming you're using a 32-bit int for each identifier, the total sum of that data should be 200k, which is pretty big for memcached, certainly, but if you spun up memcached and gave it 256M of memory, you should be able to accommodate your entire working set at a given time - without incurring lots of real-time database i/o.

Note that you could also use the internal cache object within IIS pretty safely since you're working with only a single server - and it lets you set differing levels of priority for data so the results from extremely expensive queries can be stickier than the results from cheap ones.

Bear in mind that simply implementing caching still has the penalty of the first page load being very slow, but every page load after should be blazingly fast. The slowness of the first will also be mitigated by the reduced i/o contention from all of the real-time queries you're no longer doing... and that's also not to mention that inserts tend to do sort of yucky things to the MySQL query cache so doing fewer real-time inserts will also make overall database performance a bit better.

I'll probably have more thoughts in time... that's all just off the top of my head.


edit: just got rid of too-wide line



[last edit 12/7/2010 3:36 AM by Avatar-X - edited 1 times]

"The large print giveth and the small print taketh away."
-Tom Waits
Avatar-X 

Re: How UER's 'advanced' readstates system works on the LDB
< Reply # 2 on 12/7/2010 3:47 AM >
1.
Yeah, calculating it in real time is slow, but that's what people want. Also, calculating it for each user (even if they are not going to look at it) is even slower. Plus, you'd see there's a new pic, go look at it, and it would still be marked as "unseen" since it won't refresh the data until tomorrow. So I don't think that would be a good solution.

I was using MyISAM for the longest time, and it did perform a lot better than Inno, but it suffered heavily from locking problems. So I switched to Inno, which performs well once it gets "warmed up" but the initial load times can be brutal.

2. The sharding would work, and probably wouldn't be too difficult to implement... of all your suggestions, this is the best one. But it also creates just that much more stuff to deal with. A simple (mod 5) on the user ID would work fine. This is something I might look at trying... but it is still a lot of data, and while it might speed it up 4x, that's still a lot of data and also still quite slow.

3. Yup. Tons of indexes. Heavily used "explain" to make sure my queries were as indexed as possible. That's how I got the initial page loads so quick when the DB wasn't quite as huge as it is now.

4. Yes.

5. Wouldn't be that difficult, but the forum readstates suffer from many of the same problems. For some people (turbozutek, for example) it often takes 26 seconds just to load the "new posts" page.

6. I'm not really sure what you mean by this point... which recompile? nPics/etc is updated when new photos are added, so that's not a big deal. The "re-sync" is done every 24 hours, and that is a big deal.

Looking at your code, remember that those "nPics" values are stored per user, PER LOCATION. So for 8,000 locations and 36,000 users, that's 288,000,000 potential records.

Looking at your code again, it looks like you want me to store the npics/etc counts for the master locations only? It would still need to be indexed against the "locseen" table to see how that location differs from what you have seen... so still a lot of IO.


The biggest problem is also the 24-hour sync, which often takes several seconds per user... multiply by 3000 people actively using the DB and you've got hours and hours of time, and this only gets worse as the size of the DB and the userbase increases.

-av




Avatar-X 

Re: How UER's 'advanced' readstates system works on the LDB
< Reply # 3 on 12/7/2010 3:51 AM >
Forum Readstates:

This system is much simpler than the location DB one, but again due to the large dataset, can really slow things down.

I simply have a table called "read" and it stores one record per user per thread that has been read.

The record contains the user ID, the thread ID, and the date of the last read post.

Now I can see:

  • A thread is new if it exists in the DB but no corresponding record exists in the 'read' table
  • A thread has new posts if the most recent post is newer than the corresponding 'read date' in the table


Again, the slowdown happens from the aggregation of all of that data for thousands of threads into one page (the main forum page).
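Those two rules collapse into one LEFT JOIN with a CASE; a toy version in SQLite (schema and dates invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE threads (threadid INTEGER PRIMARY KEY, lastpost TEXT);
CREATE TABLE "read" (userid INTEGER, threadid INTEGER, lastread TEXT);
INSERT INTO threads VALUES (1, '2010-12-07'), (2, '2010-12-07'), (3, '2010-12-01');
-- user 42 read thread 2 before its latest post, and thread 3 after it:
INSERT INTO "read" VALUES (42, 2, '2010-12-05'), (42, 3, '2010-12-02');
""")

rows = db.execute("""
SELECT T.threadid,
       CASE WHEN R.threadid IS NULL THEN 'new thread'
            WHEN T.lastpost > R.lastread THEN 'new posts'
            ELSE 'read' END
FROM threads AS T
LEFT JOIN "read" AS R ON R.threadid = T.threadid AND R.userid = 42
""").fetchall()
```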

-av




EatsTooMuchJam 


Re: How UER's 'advanced' readstates system works on the LDB
< Reply # 4 on 12/7/2010 4:41 AM >
Posted by Avatar-X
1.
Yeah, calculating it in real time is slow, but that's what people want. Also, calculating it for each user (even if they are not going to look at it) is even slower. Plus, you'd see there's a new pic, go look at it, and it would still be marked as "unseen" since it won't refresh the data until tomorrow. So I don't think that would be a good solution.


You're probably right. I need to think about that one some more, but it's likely that there's a way to have a bit of both worlds.


I was using MyISAM for the longest time, and it did perform a lot better than Inno, but it suffered heavily from locking problems. So I switched to Inno, which performs well once it gets "warmed up" but the initial load times can be brutal.


Yeah - full table locks in MyISAM for nearly everything are pretty horrible. It's what makes InnoDB a lot faster in practice much of the time when accessing from more than one thread.


2. The sharding would work, and probably wouldn't be too difficult to implement... of all your suggestions, this is the best one. But it also creates just that much more stuff to deal with. A simple (mod 5) on the user ID would work fine. This is something I might look at trying... but it is still a lot of data, and while it might speed it up 4x, that's still a lot of data and also still quite slow.


Sure - a mod 5 works nicely. The only real limitation is that if you decide to change things, you'll end up having to completely rebalance. I was suggesting something more akin to a consistent hashing algorithm, which usually allows for somewhat more granular expansion. But the mod 5 method will work nicely as a start.
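A toy comparison of the two approaches (all numbers and names invented): going from 5 to 6 shards, plain mod-N remaps most users, while a consistent-hash ring remaps only the users that land near the new node:

```python
import hashlib
from bisect import bisect

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

users = [f"user{i}" for i in range(1000)]

# mod-N: a user stays put only if h % 5 happens to equal h % 6.
moved_mod = sum(1 for u in users if h(u) % 5 != h(u) % 6)

def make_ring(nodes, vnodes=100):
    # Each node gets many points on the ring; a key maps to the next point clockwise.
    points = sorted((h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))
    hashes = [p for p, _ in points]
    def lookup(key):
        return points[bisect(hashes, h(key)) % len(points)][1]
    return lookup

old = make_ring(["s0", "s1", "s2", "s3", "s4"])
new = make_ring(["s0", "s1", "s2", "s3", "s4", "s5"])
moved_ring = sum(1 for u in users if old(u) != new(u))
# Roughly 5/6 of users move under mod-N; only about 1/6 move on the ring.
```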


5. Wouldn't be that difficult, but the forum readstates suffer from many of the same problems. For some people (turbozutek, for example) it often takes 26 seconds just to load the "new posts" page.


I'll comment on the readstates below where you explained a bit more. I'll still stick with allowing people to disable the feature where they don't care about it (and maybe turning it off by default for the LDB wouldn't be a bad thing).


6. I'm not really sure what you mean by this point... which recompile? nPics/etc is updated when new photos are added, so that's not a big deal. The "re-sync" is done every 24 hours, and that is a big deal.


Ergh. I suck at explaining things. Sorry about that.
I'm not talking about recompiles. I'm talking about synchronous database I/O. Even if it's not a big deal, it's unnecessary I/O and it'll be making other queries run more slowly.


Looking at your code, remember that those "nPics" values are stored per user, PER LOCATION. So for 8,000 locations and 36,000 users, that's 288,000,000 potential records.


To be clear, I'm talking more about buffering your working set in RAM and eliminating repeated queries to the database by storing computed objects in a memory cache. While you have 36,000 users, only 1,000 have been active in the last month. That's 8,000,000 records - and memcached will expire objects when it fills (usually using an LRU algorithm IIRC), as will the local cache within IIS (though the latter also has some smarts, as I mentioned, for priority levels of cached data if you choose to set them).


Looking at your code again, it looks like you want me to store the npics/etc counts for the master locations only? It would still need to be indexed against the "locseen" table to see how that location differs from what you have seen... so still a lot of IO.


Potentially as an initial load thing, yes, but after that every page should be really fast. It's a start down the path toward better scalability.


The biggest problem is also the 24-hour sync, which often takes several seconds per user... multiply by 3000 people actively using the DB and you've got hours and hours of time, and this only gets worse as the size of the DB and the userbase increases.
-av


Agreed. The 24-hour sync sounds brutal - and that's going to be partly due to the heavy use of joins and whatnot. I suspect that by using simpler queries and doing some of the join work yourself, that process could be sped up significantly (plus the sharding won't hurt either).




"The large print giveth and the small print taketh away."
-Tom Waits
Avatar-X 

Re: How UER's 'advanced' readstates system works on the LDB
< Reply # 5 on 12/7/2010 4:55 AM >
I forgot to mention that the site is coded in ASP3, no .NET.

-av




pianissimo357 


Total Likes: 0 likes




Re: How UER's 'advanced' readstates system works on the LDB
< Reply # 6 on 12/7/2010 5:03 AM >
For what it's worth, I don't care about using the LDB read states, and would gladly turn them off to save a few server processor cycles if the option were made available. I wouldn't be surprised if there were others who feel the same. The LDB read states system doesn't keep track of all locations (even non-public) for registered users that don't have full member status, does it? Not keeping view records for locations they aren't authorized to access anyway would help.

@ current stats
(
Number of registered users: 37083
Number of contributing users (post in last 30 days): 1061
Number of active users (seen in last 30 days): 2984
Number of Full Members: 2753
Number of Basic Members users waiting for Full Member status: 22
Number of Locations: 7994
Number of Publically Viewable Locations: 2485
)

you're looking at (37000-2800)*2500 + 2800*8000 = 107,900,000




edit for number crunching



[last edit 12/7/2010 5:10 AM by pianissimo357 - edited 1 times]

Avatar-X 

Re: How UER's 'advanced' readstates system works on the LDB
< Reply # 7 on 12/7/2010 5:20 AM >
It only keeps records for locations you've actually seen -- since non-full-members can't see full-member locations, it doesn't track those states.

-av




EatsTooMuchJam 


Re: How UER's 'advanced' readstates system works on the LDB
< Reply # 8 on 12/7/2010 6:07 AM >
Ergh. Started this a while ago and then got sidetracked. I bet there has been more discussion since I started. Anyway, here goes!

Posted by Avatar-X
Forum Readstates:

This system is much simpler than the location DB one, but again due to the large dataset, can really slow things down.

I simply have a table called "read" and it stores one record per user per thread that has been read.


The "read" table sounds like it could be sharded as well. As always, scanning through 1,000,000 records is better than scanning through 10,000,000 records (probably by a factor of 10, even!).


The record contains the user ID, the thread ID, and the date of the last read post.

Now I can see:

  • A thread is new if it exists in the DB but no corresponding record exists in the 'read' table
  • A thread has new posts if the most recent post is newer than the corresponding 'read date' in the table


Again, the slowdown happens from the aggregation of all of that data for thousands of threads into one page (the main forum page).

-av


This is actually most likely a lot simpler than the LDB problem - easier yet if you can tolerate a few seconds at a time of potentially inaccurate circles on the main page.

1) When the user connects, instead of doing a join-type query, you do a simple request for all of the user's current read states and you store them in a cache - either webserver-local or memcached.

class readState {
    public int ThreadID;
    public DateTime lastRead;
}

List<readState> curStates = (List<readState>)cache.Get(userID.toString() + "ReadStates");

if (curStates == null)
{
    curStates = new List<readState>();  // cache miss: load this user's read states once
    resultSet = query("select ThreadID, lastRead from read where userid='" + userid.toString() + "'");
    for (int x = 0; x < resultSet.count; x++)
    {
        curStates.Add(new readState { ThreadID = resultSet[x].GetInt(0), lastRead = resultSet[x].GetDateTime(1) });
    }
    Cache.Store(userID.toString() + "ReadStates", curStates);
}

2) Are the threads' most recent post ids stored in a single table as well? I'll avoid any further pseudo-code until you confirm that since the layout of those tables is relevant as well.




"The large print giveth and the small print taketh away."
-Tom Waits
EatsTooMuchJam 


Re: How UER's 'advanced' readstates system works on the LDB
< Reply # 9 on 12/7/2010 6:11 AM >
Posted by Avatar-X
I forgot to mention that the site is coded in ASP3, no .NET.

-av


Oh man. Ouch.
That will impact my offer for help. I know very little of ASP 3. What language do you end up using for codebehind?




"The large print giveth and the small print taketh away."
-Tom Waits
Avatar-X 

Re: How UER's 'advanced' readstates system works on the LDB
< Reply # 10 on 12/7/2010 2:44 PM >
ASP 3 is basically VBScript for the web. It's powerful enough and can call out to ActiveX components for added functionality. It doesn't do any of that memcaching stuff you're talking about, though.

Threads are stored in a table called "threads" and contain data about the category, the thread ID, the poster, the most recent post, etc.

-av




trent 

I'm Trent! Get Bent!


Location: Drainwhale hunting
Gender: Male
Total Likes: 9 likes


Not on UER anymore.

Re: How UER's 'advanced' readstates system works on the LDB
< Reply # 11 on 12/7/2010 3:10 PM >
I'm quite interested in this topic. I'm a DBA at work, so while I'm at work, I might as well think about UER instead.

I'm sympathetic to Av's dilemma; it's a difficult balance.

1.) I as well would be completely okay with throwing away UE DB read states in an attempt to allow them to continue as normal for the forums. Tracking the DB in such a detailed manner is overkill. Or, minimally, use a very basic readstate for UE DB entries: one counter of the total number of items associated with that DB entry, and whether or not it has changed since it was last viewed.

2.) From scanning this thread it sounds like the table in charge of read states is massive and has become a major bottleneck, especially in join queries with other tables.

3.) I have a wild idea. It's so wild that it's something I would never dare to attempt in a DB of my own... What about creating a unique read-state table for each user and tracking only their states in their individual table, instead of in one HUGE table? When read states are processed for a user, the query would run much more quickly, since the user's readstate table would be a few hundred records long rather than the million-plus the current table has to filter through. The obvious downfall is the OCD nightmare of having 30,000+ user tables mixed in with the other 'normal' tables UER uses, though their names could all start with a 'z' so they're pushed to the end of the table list. I wouldn't wish that setup upon my worst enemy. But right now, in my head, it seems like it could significantly lighten the load on processing forum readstates (as for UE DB readstates - meh, who cares about those). One challenge I see is writing SQL statements that dynamically pull from the correct read-state table for the current user -- usually the table name is hard-coded into a SQL query, but here it would need to be a variable. I'm sure that's possible somehow. Anyway, what about the feasibility of this overall idea? Would it work?








He who rules the underground, rules the city above.
Avatar-X 

Re: How UER's 'advanced' readstates system works on the LDB
< Reply # 12 on 12/7/2010 4:36 PM >
One table per user wouldn't actually be the end of the world -- I could create a new database just for those tables so they don't muck up my 'main' database.

It is a logistical nightmare if I need to change or add to the tables... I'd have to do it 8 bazillion times (or write a script to do it).

I'm not sure if it is worth the huge amount of work that doing this would require. I'd have to migrate over all the old data and then manage it. Don't even get me started on backups.

-av




trent 

Re: How UER's 'advanced' readstates system works on the LDB
< Reply # 13 on 12/7/2010 5:08 PM >
Posted by Avatar-X
One table per user wouldn't actually be the end of the world -- I could create a new database just for those tables so they don't muck up my 'main' database.

It is a logistical nightmare if I need to change or add to the tables... I'd have to do it 8 bazillion times (or write a script to do it).

I'm not sure if it is worth the huge amount of work that doing this would require. I'd have to migrate over all the old data and then manage it. Don't even get me started on backups.

-av


You bring up a good point on working with such a large number of tables. Any type of maintenance on them would probably require some kind of script. Hopefully, though, if something along these lines was set up and kept fairly simple, it would just work and wouldn't need maintenance.


"I'm not sure if it is worth the huge amount of work that doing this would require. I'd have to migrate over all the old data and then manage it. Don't even get me started on backups."

It's true that it would take a little extra work updating SQL statements on the site's pages, creating new individual tables for existing users, and adding functionality to auto-create a table for new users joining UER. But in theory, it will give UER the exact same read-states that the users have come to love, with minimal impact on performance.

For converting the HUGE read states table to user-specific tables, you could probably use the user's uid number as the name of their new read-state table, e.g. user #34152's read-state table is called '34152'. A nice little script could be written to loop through all existing user IDs, run a CREATE TABLE query for each user, and then run an append query to migrate their existing read state data over. (Or just start fresh with read-states and import none of the old data -- I don't think people will mind when seeing the end result.)
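That migration loop could look like this (SQLite sketch; the table-name scheme and all names are invented, and since the uid is numeric, building the table name from it is injection-safe):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE gread (userid INTEGER, photoid INTEGER);
INSERT INTO gread VALUES (34152, 1), (34152, 2), (99, 7);
""")

def migrate_user(uid: int) -> str:
    table = f"readstate_{uid}"  # e.g. readstate_34152
    db.execute(f"CREATE TABLE IF NOT EXISTS {table} (photoid INTEGER)")
    db.execute(f"INSERT INTO {table} SELECT photoid FROM gread WHERE userid = ?",
               (uid,))
    return table

t = migrate_user(34152)
count = db.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
```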

As for backups, I'm actually interested in those too. For backing up do you copy records or tables from a main DB to a backup DB? Or do you just backup the entire databases as whole? I'm just not following where backups would be a challenge to implement with this.

I hope you consider this idea. I think it would be pretty bad-ass to keep the existing read state system which is very enjoyable, but update it so that it runs in a way which doesn't severely kill the system/DB performance.




EatsTooMuchJam 


Re: How UER's 'advanced' readstates system works on the LDB
< Reply # 14 on 12/7/2010 5:59 PM >
Posted by trent
3.) I have a wild idea. It's so wild that it's something I would never dare to attempt in a DB of my own... What about creating a unique read-state table for each user and tracking only their states in their individual table, instead of in one HUGE table? When read states are processed for a user, the query would run much more quickly, since the user's readstate table would be a few hundred records long rather than the million-plus the current table has to filter through. The obvious downfall is the OCD nightmare of having 30,000+ user tables mixed in with the other 'normal' tables UER uses, though their names could all start with a 'z' so they're pushed to the end of the table list. I wouldn't wish that setup upon my worst enemy. But right now, in my head, it seems like it could significantly lighten the load on processing forum readstates (as for UE DB readstates - meh, who cares about those). One challenge I see is writing SQL statements that dynamically pull from the correct read-state table for the current user -- usually the table name is hard-coded into a SQL query, but here it would need to be a variable. I'm sure that's possible somehow. Anyway, what about the feasibility of this overall idea? Would it work?


That's basically the same as the sharding I suggested earlier, but (in my opinion) too many shards.
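For readers following along, the coarser sharding ETMJ is alluding to can be sketched in a few lines: route each user to one of a small, fixed number of readstate tables instead of one table per user. This is only an illustration -- the table and column names are hypothetical, and SQLite stands in for the MySQL server UER actually runs.

```python
import sqlite3

N_SHARDS = 16  # a handful of shards, rather than one table per user


def shard_table(user_id: int) -> str:
    # Route each user to one of N readstate shards by user id.
    return f"readstate_{user_id % N_SHARDS}"


conn = sqlite3.connect(":memory:")
for i in range(N_SHARDS):
    conn.execute(
        f"CREATE TABLE readstate_{i} "
        "(userid INTEGER, threadid INTEGER, lastpost INTEGER)"
    )


def mark_read(user_id: int, thread_id: int, last_post: int) -> None:
    # The table name is computed, but the values still go through
    # ordinary parameter substitution.
    conn.execute(
        f"INSERT INTO {shard_table(user_id)} VALUES (?, ?, ?)",
        (user_id, thread_id, last_post),
    )


mark_read(12345, 42, 7)
row = conn.execute(
    f"SELECT lastpost FROM {shard_table(12345)} "
    "WHERE userid = ? AND threadid = ?",
    (12345, 42),
).fetchone()
# row[0] is the last post this user has seen in thread 42
```

With 16 shards the backup job sees 16 extra files instead of 30,000, which is the trade-off being debated above.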




"The large print giveth and the small print taketh away."
-Tom Waits
trent 

I'm Trent! Get Bent!


Location: Drainwhale hunting
Gender: Male
Total Likes: 9 likes


Not on UER anymore.

 |  |  | infinitedecay
Re: How UER's 'advanced' readstates system works on the LDB
< Reply # 15 on 12/7/2010 6:18 PM >
Reply with Quote
Posted on Forum: UER Forum
Posted by EatsTooMuchJam


That's basically the same as the sharding I suggested earlier, but (in my opinion) too many shards.


Sorry, I've been slacking on reading every post in this topic in detail. Yeah, I agree, that much sharding is against every DB fundamental I've ever learned. Yet in this case, I don't know how else to keep the same functionality while increasing its performance.

I also wonder if the existing system could be kept, but with minor tweaks to make it run smoother. It's been working fine for years, right? So let's reproduce the old conditions under which it ran smoothly. Many have mentioned some of these ideas:

-Deleting inactive user accounts, as well as purging their read-state data upon deletion.
-When a thread becomes archived, disable any read states on it (not sure if that's happening).
-Completely kill the UE DB read states or severely lessen them.
-For new users, disable read states by default. Let them think it's just a normal forum until they find out how to turn them on. In the meantime, there's less read-state data to deal with.
-Consider limiting the number of read states a person can have, and give them the ability to review a list of their states and delete the ones they don't care about, so even when maxed out they can still clear some out while keeping the important ones.

It's hard to tell w/o seeing the data, but I would think the 'fat' could be trimmed down to make the states run smoothly again without redoing how they work. Would it run smoothly again if that million-plus-record table were shrunk back down to a quarter million? I could see 1/4 million being doable with user read-state quotas and purging of users.
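The quota idea in the list above amounts to evicting a user's oldest readstate rows whenever they exceed a cap. A minimal sketch, with a hypothetical schema and a tiny cap for illustration (SQLite standing in for MySQL):

```python
import sqlite3

MAX_STATES = 3  # per-user readstate quota (tiny here, for illustration)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readstate (userid INTEGER, threadid INTEGER, seen_at INTEGER)"
)


def mark_read(userid: int, threadid: int, now: int) -> None:
    conn.execute(
        "INSERT INTO readstate VALUES (?, ?, ?)", (userid, threadid, now)
    )
    # Evict this user's rows beyond the quota, keeping the most recent ones.
    conn.execute(
        """DELETE FROM readstate
           WHERE userid = ? AND rowid NOT IN (
               SELECT rowid FROM readstate
               WHERE userid = ? ORDER BY seen_at DESC LIMIT ?)""",
        (userid, userid, MAX_STATES),
    )


# Mark five threads read; only the three most recent survive.
for t in range(5):
    mark_read(1, t, t)
rows = conn.execute(
    "SELECT threadid FROM readstate WHERE userid = 1 ORDER BY threadid"
).fetchall()
```

The appeal is that the table's size becomes bounded by users x quota rather than growing without limit.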


-Limiting how far back read states go.




He who rules the underground, rules the city above.
Avatar-X 

Alpha Husky


Location: West Coast
Gender: Male
Total Likes: 765 likes


yay!

 |  |  | AvBrand
Re: How UER's 'advanced' readstates system works on the LDB
< Reply # 16 on 12/7/2010 6:27 PM >
Reply with Quote
Posted on Forum: UER Forum


As for backups, I'm actually interested in those too. For backing up do you copy records or tables from a main DB to a backup DB? Or do you just backup the entire databases as whole? I'm just not following where backups would be a challenge to implement with this.


I back up each table into a text file, which is then RARed and downloaded to an offsite server. This would be way too many files with a table per user.


-Deleting inactive user accounts well as purging their read-state data upon deletion.

Already do this. It doesn't help much. Also, since the tables are so large, operations like this are extremely slow.

-When a thread becomes archived, disable any read states on it (not sure if that's happening).

Yes, that is actually the entire reason why I implemented the Archive system: to lessen the readstates overhead.
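The archive purge described here boils down to deleting every readstate row that points at an archived thread. A sketch with hypothetical table and column names (SQLite standing in for MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE thread (threadid INTEGER PRIMARY KEY, archived INTEGER)"
)
conn.execute("CREATE TABLE readstate (userid INTEGER, threadid INTEGER)")
conn.executemany("INSERT INTO thread VALUES (?, ?)", [(1, 0), (2, 1), (3, 1)])
conn.executemany(
    "INSERT INTO readstate VALUES (?, ?)", [(7, 1), (7, 2), (8, 3)]
)

# Drop every readstate row that points at an archived thread.
conn.execute(
    """DELETE FROM readstate
       WHERE threadid IN (SELECT threadid FROM thread WHERE archived = 1)"""
)
remaining = conn.execute(
    "SELECT userid, threadid FROM readstate ORDER BY userid"
).fetchall()
# only the readstate on the unarchived thread survives
```

Run periodically (or at archive time), this keeps the readstate table proportional to the live forum rather than its whole history.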

-Completely kill the UE DB read states or severely lessen them.

People are bitching about it now with the lessened readstates... how is killing them a solution at all?


-For new users, disable read-states by default. Let them think it's just a normal forum until they find how to turn it on. In the mean time, less read state data to deal with.

That's how it's been for at least 5 years. All of the people who are complaining have specifically turned on the readstates.


-Consider limiting the number of read-states a person can have and give them the ability to review a list of their states and delete ones they don't care about so even when maxed out they can still clear some out while keeping their important ones.


Who's going to seriously do that? "Hmm, I don't think I care about this thread anymore..." Nobody. The most that people will do is click "Mark Read".

It's been working fine for years, right?


Not really. It's always been slow and problematic.




huskies - such fluff.
trent 

I'm Trent! Get Bent!


Location: Drainwhale hunting
Gender: Male
Total Likes: 9 likes


Not on UER anymore.

 |  |  | infinitedecay
Re: How UER's 'advanced' readstates system works on the LDB
< Reply # 17 on 12/7/2010 6:40 PM >
Reply with Quote
Posted on Forum: UER Forum
Posted by Avatar-X
I back up each table into a text file, which is then RARed and downloaded to an offsite server. This would be way too many files with a table per user.

Do you do that by hand? If the process is automated, I guess I'm not understanding how anything would change in the backup process, except a larger file size due to the overhead of each table's structure instead of one big table.

Posted by Avatar-X
People are bitching about it now with the lessened readstates... how is killing them a solution at all?

From reading the forums, the general consensus seems to be that people would gladly give up UE DB read states to preserve the traditional read-state system for the forums -- in other words, people would prefer the old read states on the forums only over neutered read states on the forums.

Posted by Avatar-X
Not really. It'a always been slow and problematic.

That's news to me. Aside from when backups are running, the site has run just fine for me for years, except for the past month or two.



Anyway, was just offering some suggestions. It's your site. I hope the changes work out.



[last edit 12/7/2010 6:40 PM by trent - edited 1 times]

He who rules the underground, rules the city above.
Avatar-X 

Alpha Husky


Location: West Coast
Gender: Male
Total Likes: 765 likes


yay!

 |  |  | AvBrand
Re: How UER's 'advanced' readstates system works on the LDB
< Reply # 18 on 12/7/2010 6:40 PM >
Reply with Quote
Posted on Forum: UER Forum
So, let's focus on the forum readstates for the moment because they represent only two tables and are much simpler.

One of the key problems with speed is generating the 'category' page (the front page of the forum). In order to get the proper read states, it needs to total up the readstates for ALL of the threads in each category -- essentially, all of the unarchived threads on the whole site. These are then joined to the read table.
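In query terms, that category rollup looks roughly like the following: every unarchived thread, left-joined to the current user's readstate rows, summed per category. The schema here is hypothetical and SQLite stands in for MySQL, but the join shape matches the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE thread (threadid INTEGER PRIMARY KEY, catid INTEGER,"
    " nposts INTEGER, archived INTEGER)"
)
conn.execute(
    "CREATE TABLE readstate (userid INTEGER, threadid INTEGER, nseen INTEGER)"
)
conn.executemany(
    "INSERT INTO thread VALUES (?, ?, ?, ?)",
    [(1, 10, 5, 0), (2, 10, 3, 0), (3, 11, 4, 0), (4, 10, 9, 1)],
)
conn.executemany(
    "INSERT INTO readstate VALUES (?, ?, ?)", [(7, 1, 5), (7, 3, 1)]
)

# Per-category unread totals for user 7. Threads with no readstate row
# count as fully unread (COALESCE to 0 seen); archived threads drop out.
rows = conn.execute(
    """SELECT T.catid, SUM(T.nposts - COALESCE(S.nseen, 0)) AS unread
       FROM thread AS T
       LEFT JOIN readstate AS S
              ON S.threadid = T.threadid AND S.userid = ?
       WHERE T.archived = 0
       GROUP BY T.catid
       ORDER BY T.catid""",
    (7,),
).fetchall()
# one (catid, unread) row per category
```

The expense is exactly what the post says: the join and SUM have to touch a row for every unarchived thread on the site, for every page view.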

Shrinking the size of the read table will help speed this up, but it is still a LOT of rows to tot up each time.

Any ideas on how to speed this up?
-av




huskies - such fluff.
Avatar-X 

Alpha Husky


Location: West Coast
Gender: Male
Total Likes: 765 likes


yay!

 |  |  | AvBrand
Re: How UER's 'advanced' readstates system works on the LDB
< Reply # 19 on 12/7/2010 6:45 PM >
Reply with Quote
Posted on Forum: UER Forum
Posted by trent

Do you do that by hand? If the process is automated I guess I'm not understanding how anything would change on the backup process except a larger file size due to overhead of each tables table structure instead of one big table.


No, it is automated. But it means the difference between 120 files and 30,000 files. That's just an unmanageable number of files.


From reading the forums, it seems the general consensus is that people would gladly give up UE DB read states to preserve the traditional read state system for the forums. It seems people would prefer the old read states on the forums only over neutered read states on the forums.


I'm not sure where this idea came from, but the two are not one and the same. Giving up LDB read states won't solve the problems of the forum read states.


That's news to me. Besides when backups are running the site has run just fine for me for years, except for for the past month or two.


Well, you don't run the site -- I have to deal with this stuff. So, it is not news to me. As I said above, the Archive system was implemented only as a way to keep the size of the DB down.


Anyway, was just offering some suggestions. It's your site. I hope the changes work out.


I am extremely grateful for your suggestions. I originally wrote all of this readstates stuff ~7 years ago, when I was 21, with no DB experience or degree or anything. I started this thread to see if people who do this for a living have better ideas than I did 7 years ago, but so far the only really workable solution has been sharding, and that has its own problems too.

I wonder how Google does it -- in my Google reader, I have about 50 RSS feeds. Google knows exactly which posts I have read and which I haven't, and can display that information instantly. They can even tot it up into an "all unread items" count. So whatever system they've implemented, it's got item-level granularity and works very quickly.

-av





huskies - such fluff.