I actually just ran a search-and-replace, changing uint64_t to uint32_t in both the KISSDB header and .cpp file. Running the benchmark on a smaller data set, I see the same checksums, etc., with the 64-bit and 32-bit versions, so I think it worked. I also switched fseeko to 32-bit mode.
At first, I tried to figure out which were hash values and which were other values, but I don't think it matters.
Though the previous run with its 1.2GiB file is making me nervous about the 4GiB limit...
Offline
Yes, KISSDB was thrashing, and the 32-bit version is way better:
Opening DB (table size 30000000) used 5840 bytes, took 0.000000 sec
Inserted 15752961
Inserts used 360001536 bytes, took 179.170000 sec
Flushing system disk caches.
Random lookup for 500 batchs of 3000 (1500000/1500000 hits)
Random look used 0 bytes, took 3.840000 sec
Checksum = 186209500
Flushing system disk caches.
Last-inserted lookup for 500 batchs of 2401 (1200500/1200500 hits)
Last-inserted lookup used 0 bytes, took 3.102000 sec
Checksum = 283318000
Flushing system disk caches.
Random lookup for non-existing 500 repeating batchs of 3000 (0/1500000 hits)
Random look/miss used 0 bytes, took 0.934000 sec
Flushing system disk caches.
Random lookup for non-existing 1500000 non-repeating values (0/1500000 hits)
Random look/miss used 0 bytes, took 4.576000 sec
Flushing system disk caches.
Inserts for previously non-existing 500 batchs of 3000
Inserts after miss used 0 bytes, took 7.170000 sec
Flushing system disk caches.
Iterated 15755961, checksum 1953944243
Iterating used 0 bytes, took 46.211000 sec
Max bin depth = 3
real 4m6.418s
user 0m50.763s
sys 2m10.474s
-rw-r--r-- 1 root root 675119248 Jul 25 00:53 test.db
Offline
That's pretty neat performance. Just one small final touch: start with 40% load to give it some breathing room for inserts. Beyond 50% load, longer chains start to form, so it's better to avoid that; the cost of longer chains with KISSDB is very high.
Now run it with the full game, to check whether there are other bottlenecks elsewhere.
Offline
Max bin depth = 3
I wonder how it got down from 4 to 3. Is there some non-seeded randomness involved?
Offline
No non-seeded randomness.
I changed the hash function from 64-bit to 32-bit, though, for the djb2 algorithm. I think that will change the hash values... maybe? I haven't thought it through, though, and I'm tired.
Testing MurmurHash2 now.... it's SLOW.....
179 sec was long enough for those 15 million inserts.
Offline
Also, while I'm waiting, this has got me nervous:
top - 01:39:14 up 5:15, 2 users, load average: 1.23, 1.25, 1.07
Tasks: 93 total, 2 running, 52 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.3 us, 19.2 sy, 0.0 ni, 0.0 id, 78.5 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 1009052 total, 73708 free, 755852 used, 179492 buff/cache
KiB Swap: 262140 total, 219032 free, 43108 used. 110848 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10293 root 20 0 717164 704724 1612 R 21.2 69.8 1:58.68 kissTest
10226 root 20 0 0 0 0 I 2.6 0.0 0:15.84 kworker/u2:2
34 root 20 0 0 0 0 S 2.3 0.0 1:38.73 kswapd0
153 root 0 -20 0 0 0 I 0.3 0.0 0:21.79 kworker/0:1H
1 root 20 0 77896 1592 1288 S 0.0 0.2 0:02.51 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
Does this mean Murmur2 has likely caused more collisions, just by random chance?
The problem with KISS is that collisions are so expensive in terms of resource consumption.
Offline
Does this mean Murmur2 has likely caused more collisions, just by random chance?
Looks like it. I'm quite surprised. Before seeing that, I'd have bet it would perform better. Are you using the 32-bit version?
The problem with KISS is that collisions are so expensive in terms of resource consumption.
That's why I think it's only a stopgap solution, to buy some time for something better.
Offline
Well, I've bought time already by using STACKDB and taking server1 and 2 out of circulation, and turning the cap down to 40 on the other servers.
The process of getting all the servers to rebuild their DBs with a larger table size, with or without switching back to KISS, is non-trivial. I'd like to get a better solution (low RAM, constant time, cheap collisions) before I go through the process of writing the conversion code.
Also, the big insert still hasn't finished with Murmur2. Yes, the 32-bit version, along with the 32-bit KISSDB mod. Looks like it's thrashing:
top - 02:08:01 up 5:44, 2 users, load average: 1.06, 1.09, 1.09
Tasks: 93 total, 2 running, 52 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.4 us, 15.9 sy, 0.0 ni, 0.0 id, 81.7 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 1009052 total, 71200 free, 897900 used, 39952 buff/cache
KiB Swap: 262140 total, 129396 free, 132744 used. 17896 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10293 root 20 0 951540 849604 1612 R 14.3 84.2 5:58.56 kissTest
34 root 20 0 0 0 0 S 1.7 0.0 2:01.37 kswapd0
10226 root 20 0 0 0 0 I 1.0 0.0 0:30.45 kworker/u2:2
153 root 0 -20 0 0 0 I 0.3 0.0 0:27.58 kworker/0:1H
1 root 20 0 77896 2660 2384 S 0.0 0.3 0:02.59 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/0:0H
Just killed it. It ran for 42 minutes without finishing the insert.
Will try a different Murmur2 seed and see if I get luckier.
Offline
Well, even with a different seed, same problem. I added some print-outs to KISS to document new pages being added.
Local epoch time = 0
GMT epoch time = 0
Opening DB (table size 30000000) used 5840 bytes, took 0.000000 sec
Adding new page to the hash table
Adding new page to the hash table
Adding new page to the hash table
Adding new page to the hash table
Adding new page to the hash table
^C
real 1m3.300s
Switching back to djb2, I'm seeing only two pages added in the first three or so minutes, and way less RAM:
top - 02:29:00 up 6:05, 2 users, load average: 1.70, 1.03, 1.03
Tasks: 93 total, 2 running, 53 sleeping, 0 stopped, 0 zombie
%Cpu(s): 23.8 us, 75.8 sy, 0.0 ni, 0.0 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 1009052 total, 122636 free, 290052 used, 596364 buff/cache
KiB Swap: 262140 total, 224400 free, 37740 used. 572156 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10440 root 20 0 248412 237144 2828 R 87.4 23.5 1:41.81 kissTest
10226 root 20 0 0 0 0 D 11.0 0.0 1:09.49 kworker/u2:2
153 root 0 -20 0 0 0 I 1.3 0.0 0:34.52 kworker/0:1H
10315 root 20 0 44048 1804 1448 R 0.3 0.2 0:02.05 top
1 root 20 0 77896 1376 1072 S 0.0 0.1 0:02.63 systemd
Wonder what's going on with Murmur2?
I copy-pasted the code for the 32-bit function directly from here:
https://github.com/aappleby/smhasher/bl … rHash2.cpp
It's the first function in the file.... I didn't change it at all.
In the meantime, the djb2 version has finished the insert with only 3 pages added:
Local epoch time = 0
GMT epoch time = 0
Opening DB (table size 30000000) used 5840 bytes, took 0.000000 sec
Adding new page to the hash table
Adding new page to the hash table
Adding new page to the hash table
Inserted 15752961
Inserts used 360001536 bytes, took 170.821000 sec
I'm looking to verify that the MurmurHash2 code I'm using is working correctly, but there are no published test vectors? What kind of hash has no test vectors? I do see that it's supposed to produce different results on different platforms, so maybe test vectors would be useless, but still....
Also, found this note:
Update October 28, 2010 - A couple of people have reported a flaw in MurmurHash2, and I've confirmed it - repetitions of 4-byte patterns have a much higher chance of collision than expected. This is not fixable, but should not cause problems in most cases (it applies only to keys that are identical except for the repeating sections and with those sections 4-byte-aligned). While investigating the flaw I've found a new mix function that improves the performance yet again, so I'll be publishing that as MurmurHash3.
Offline
Well, I've bought time already by using STACKDB and taking server1 and 2 out of circulation, and turning the cap down to 40 on the other servers.
The process of getting all the servers to rebuild their DBs with a larger table size, with or without switching back to KISS, is non-trivial. I'd like to get a better solution (low RAM, constant time, cheap collisions) before I go through the process of writing the conversion code.
Ok. So here's the deal. I'll finish up my implementation of "linear hashing with overflow handling by linear probing". I'll add all the tests to have good coverage. From what I've seen so far, I can promise "almost constant RAM usage, O(1) lookup time including misses, cheap collisions". It's "almost" constant only because in rare cases it needs some transient RAM to deal with overflow islands during a split (it's several pages of data, less than a MB, so nothing large). For plain inserts, lookups, or misses it never needs additional memory, no matter whether it has to deal with 1K records, 1M records, or 1B records.
Then I'll explain all the details of exactly how it works, so if you encounter problems in the future, you can profile and fix it yourself. I'll place all the code in the public domain.
If it's used and works well, after some time I'll extract the battle-hardened version into a separate library on GitHub, so other people can use it as well.
I'd prefer to go that route, instead of making a library on GitHub first, because then I'd have "a solution looking for a problem to solve", which is never good.
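For reference, the core of linear hashing is just how a bucket index is computed from a level and a split pointer. A minimal sketch of that textbook addressing step (not my actual implementation - names and struct layout are made up here):

#include <stdint.h>

/* Textbook linear-hashing address calculation (illustrative only).
   n0 is the initial bucket count, level counts completed table
   doublings, split is the next bucket due to be split. */
typedef struct {
    uint32_t n0;
    uint32_t level;
    uint32_t split;
} LinearHashState;

static uint32_t lh_bucket( const LinearHashState *s, uint32_t hash ) {
    uint32_t b = hash % ( s->n0 << s->level );      /* h mod (n0 * 2^level) */
    if ( b < s->split ) {
        /* this bucket was already split this round, so address it
           with the next round's hash function */
        b = hash % ( s->n0 << ( s->level + 1 ) );   /* h mod (n0 * 2^(level+1)) */
    }
    return b;
}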
Last edited by sc0rp (2018-07-25 02:47:05)
Offline
I really need to write my own version of this.
It's part of what I do.
I do appreciate the offer, though.
I would also like to run the same benchmark on yours and mine. I mean, obviously, I'll be shooting for speed on all operations that is better than KISS or STACK, but beyond that, I'll have nothing to compare it to.
Also, I did read something today about k-wise independent hashing, and the proof that 5-independent hashing is necessary to guarantee O(1) inserts if you're doing linear probing. But I've seen no examples of actual hash functions that have this 5-independent property.
Offline
Wonder what's going on with Murmur2?
Without debugging it, I don't know. I certainly would have expected it to perform better. Heck, djb2 is as weak as you can go; it's just a bit shift and an add. My first reaction to seeing something like that would be to change it to at least "on each round, multiply by a prime and take the middle bits, where the most mixing occurs".
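For context, this is the whole of djb2 as it's usually written (the copy in kissdb32.cpp may differ in minor details) - the mixing really is just a shift-and-add multiply by 33:

#include <stdint.h>
#include <stddef.h>

/* Bernstein's djb2: multiply the running hash by 33 via shift-and-add,
   then add the next byte.  Shown only to illustrate how little mixing
   it does. */
static uint32_t djb2( const uint8_t *data, size_t len ) {
    uint32_t hash = 5381;
    for ( size_t i = 0; i < len; i++ ) {
        hash = ( ( hash << 5 ) + hash ) + data[i];  /* hash * 33 + byte */
    }
    return hash;
}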
Without debugging it further, maybe just switch to SipHash and see what happens. If it also behaves badly, you have a serious problem worth investigating.
Offline
I tried a few other hash functions, including Murmur3 and Jenkins "one-at-a-time." I'm seeing similar behavior, with 5-6 collisions.
The only other function that worked with only 3 collisions was the similarly simple sdbm hash.
This makes me wonder if there's some problem in the way that I plugged these other hash functions in. I tried various seeds for the seeded one, and the Murmur2 code literally takes the same parameters as djb2.
All the hash functions that I've tried are at the top of the file here:
https://github.com/jasonrohrer/OneLife/ … ssdb32.cpp
Does anything obvious jump out? Am I using these incorrectly somehow?
Also, it's not an INSANE number of collisions, like thousands. Just five or six slowly over time, but that's enough to cause it to start using too much RAM.
The test key data here are extremely non-random, literally marching through all possible x,y,s,b values from (0,0,0,0) up to (63,63,63,63), such that we get 15M combinations. That does mean there are a lot of zero bits in there.
Offline
I tried a few other hash functions, including Murmur3 and Jenkins "one-at-a-time." I'm seeing similar behavior, with 5-6 collisions.
Does anything obvious jump out? Am I using these incorrectly somehow?
This gives a non-uniform distribution; the larger the hash table size, the worse the bias:
uint32_t hash = KISSDB_hash(key,db->key_size) % (uint32_t)db->hash_table_size;
Use a power-of-two table size as a quick fix. You can mix in the top hash bits as well.
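Something along these lines (just a sketch of the idea, not a patch against KISSDB; the shift amount is an assumption):

#include <stdint.h>

/* With a power-of-two table size, the reduction becomes a mask, and
   folding the top bits down first keeps them from being thrown away
   when the low bits of the hash are weak. */
static uint32_t bucket_index( uint32_t hash, uint32_t table_size ) {
    /* table_size must be a power of two */
    hash ^= hash >> 16;                 /* mix top bits into low bits */
    return hash & ( table_size - 1 );   /* same as hash % table_size  */
}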
Last edited by sc0rp (2018-07-25 09:22:07)
Offline
I've run a few tests and I think I know what happened. The modulo with a 32-bit hash gives a tiny bias (<1%), but the main thing is: if you put 15M items randomly into 30M bins, you get a distribution looking more or less like this:
Counter({0: 18196821, 1: 9096015, 2: 2275788, 3: 378792, 4: 47342, 5: 4811, 6: 409, 7: 20, 8: 2})
aka birthday paradox
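You can reproduce a distribution like that with a quick balls-into-bins simulation. This sketch uses a throwaway xorshift RNG, so the exact counts will differ a bit from the Counter output above:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BINS  30000000u
#define BALLS 15000000u

static uint32_t xorshift32( uint32_t *s ) {
    *s ^= *s << 13;  *s ^= *s >> 17;  *s ^= *s << 5;
    return *s;
}

int main( void ) {
    /* one occupancy counter per bin (max occupancy here is single digits) */
    uint8_t *count = calloc( BINS, 1 );
    if ( count == NULL ) return 1;

    uint32_t state = 12345u;
    for ( uint32_t i = 0; i < BALLS; i++ ) {
        count[ xorshift32( &state ) % BINS ]++;
    }

    /* histogram: how many bins ended up with 0, 1, 2, ... items */
    uint32_t hist[16] = { 0 };
    for ( uint32_t b = 0; b < BINS; b++ ) {
        hist[ count[b] < 15 ? count[b] : 15 ]++;
    }
    for ( int d = 0; d < 16; d++ ) {
        if ( hist[d] > 0 ) printf( "%d items: %u bins\n", d, hist[d] );
    }
    free( count );
    return 0;
}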
So the funny thing is, djb2 got such good results precisely because it is so bad. In this particular case it worked more like a counter than a hash.
I wouldn't feel safe keeping djb2. It turned out to be in our favor here, but it may just as quickly degenerate into a massive number of collisions.
The solution is simple, but not trivial: use a smaller number of larger buckets to flatten it out. With 30M one-item buckets, the expected max chain length is 8. With 15M two-item buckets, it's 5. With 7.5M four-item buckets, it's 4. With 1.5M twenty-item buckets, it's 2 (with slight overflow - I got 29 items in the largest one).
Or maybe switch to StackDB with a large hash table. Having 8 items in the longest chain shouldn't cause much of a problem.
Last edited by sc0rp (2018-07-25 14:00:21)
Offline
One simple question.
Is storing every single tile as a unique entry really a good idea?
If database search time is the bottleneck, why not "group" tiles into bigger chunks of binary data and store those in the DB?
Even in the simple case of 10x10 tiles per "chunk", you reduce the number of entries in the DB by 100. Basically, with an NxN chunk it's an N^2 reduction in the number of elements in the table (see the sketch below). Of course, this way you need to load bigger parts of the map and "decode" them into actual tiles every time (and encode them back to save), but you could hold those in RAM a bit longer, until they're not needed for sure.
Some engines even use this method twice for file storage - basically grouping data into huge chunks made of smaller chunks, and storing the actual data two directory layers deep.
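A minimal sketch of the chunk-key idea (not code from the actual server; the chunk size and function name are made up for illustration):

#include <stdint.h>

#define CHUNK_SIZE 10

/* Map a tile coordinate to the coordinate of the NxN chunk containing
   it, so every tile in a chunk shares one database key.  Floor
   division keeps negative tile coordinates in the correct chunk
   instead of rounding toward zero. */
static int32_t chunk_coord( int32_t tile ) {
    return ( tile >= 0 ) ? ( tile / CHUNK_SIZE )
                         : ( ( tile - ( CHUNK_SIZE - 1 ) ) / CHUNK_SIZE );
}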
Offline
I proposed the same:
https://onehouronelife.com/forums/viewt … 350#p25350
Originally as an idea for how to get rid of map wipes.
Offline
I proposed the same:
https://onehouronelife.com/forums/viewt … 350#p25350
Originally as an idea for how to get rid of map wipes.
Sorry, I must be really tired... I scrolled through the thread and somehow missed that...
Offline
Yeah, birthday paradox. Another reason why KISSDB's expensive collisions are no good. If there were a linked list to handle the occasional collision, it would be no big deal. But making a whole new page of a 30M-cell table is a big deal - not only because of RAM and thrashing, but because it's not a great realtime moment when you decide, upon insert, to make the new page.
Anyway, you're essentially saying that, given the data set (32-bit int 4-tuples between (0,0,0,0) and (63,63,63,63)), djb2 is so bad that it evenly distributes the data in an artificial way, avoiding the birthday paradox, whereas the other hashes are so good and so random that they are subject to things that truly random distributions are subject to, like the birthday paradox.
Joriom, as for why we don't just store big chunks like that, a few things (at least what I was thinking when I designed it this way):
1. The man-made map is sparse, and most of it is empty. If you go and drop a stone in some spot in the wilderness, we need to remember that one tile out there, but the rest of the stuff around there is "empty" as far as the database is concerned---either truly empty tiles, or proc-genned wilderness that we don't need to store. We don't want to record a whole 10x10 chunk around it. We could use a hybrid system where we record chunks in cities and record sparse cells separately, but see the other points. The downside of a sparse system is that the NULL result (there's nothing here) is the most common result, and it's also rather expensive, because you have to ask about each empty cell separately. A chunk system would get a whole page of NULL results together. BUT, it would also be storing all these known-to-be-empty cells, which I believe(d) would be space-prohibitive in the long run.
2. Even densely-filled chunks of the map (villages) have an unbounded amount of information potential, due to containers. It's not just 10x10 objectID integers, or 3200 bytes. It's also what's in those containers and sub-containers. How big those are is currently independent of the database code. In the editor, I can make a container with 8 slots or 17 slots or 30 slots, and it will just work, with no server code changes. And the containers, like everything else, are sparse. An object with 30 slots that only has 2 items in it will only have two things stored in the database (along with the number 2, to tell how many contained items are there).
3. Early on, I settled on a database system with fixed-size keys and data, in a binary format. This simplifies a lot of things, and provides a nice abstraction on top of which you can build all sorts of things very easily (a sketch of such a fixed-size key follows after this list). An arbitrary-length blob storage system could be an alternative (where you key by x,y and then have some variable-length encoded representation of what's there). But that's a lot more complicated to code, probably more complicated to make fast as a database engine, and maybe less space efficient (depending on the encoding format of the blob).
4. Technical debt! I'm just one guy, the existing server is complicated, and overhauling it is a very expensive proposition. The existing storage system is at least functional and bug-free. The current problem is too many hash table collisions. Much easier to solve that problem than to rewrite the entire storage system from scratch.
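To be concrete about what "fixed-size keys" means here, this is roughly the kind of 16-byte binary key the benchmark above was exercising - the x, y, s, b fields come from the test description, but the exact packing in the real server may differ:

#include <stdint.h>
#include <string.h>

/* Pack four 32-bit values back to back into a fixed 16-byte key
   (illustrative layout, not necessarily the server's exact one). */
static void makeMapKey( unsigned char keyOut[16],
                        int32_t x, int32_t y, int32_t s, int32_t b ) {
    memcpy( keyOut,      &x, 4 );
    memcpy( keyOut + 4,  &y, 4 );
    memcpy( keyOut + 8,  &s, 4 );
    memcpy( keyOut + 12, &b, 4 );
}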
Offline
A chunk system would get a whole page of NULL results together. BUT, it would also be storing all these known-to-be-empty cells, which I believe(d) would be space-prohibitive in the long run.
Real-time compression like LZ4 will keep it small.
2. Even densely-filled chunks of the map (villages) have an unbounded amount of information potential, due to containers. [...] An object with 30 slots that only has 2 items in it will only have two things stored in the database (along with the number 2, to tell how many contained items are there).
You can store it pretty much the same way it's transmitted to the client: "123,10:5:5".
3. Early on, I settled on a database system with fixed-size keys and data, in a binary format. This simplifies a lot of things, and provides a nice abstraction on top of which you can build all sorts of things very easily. An arbitrary-length blob storage system could be an alternative (where you key by x,y and then have some variable-length encoded representation of what's there). But that's a lot more complicated to code, probably more complicated to make fast as a database engine, and maybe less space efficient (depending on the encoding format of the blob).
The simplest way to create a variable-sized DB out of a fixed one is indirection. Split it into two tables: "main" with small cells, and "blob" with big cells. If the data is small, write it directly to main. If it's too big, store in main the IDs of objects from the blob table. If it's huge, make the top-level blobs hold the IDs of all the other blobs with the data.
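A rough sketch of what such a main/blob split could look like (the record layout here is invented purely for illustration):

#include <stdint.h>

#define MAIN_CELL_PAYLOAD 28

/* A main-table cell either holds the value inline, or, if the value
   is too big to fit, a list of up to 7 blob-table IDs to fetch and
   concatenate (illustrative layout only). */
typedef struct {
    uint8_t isIndirect;    /* 0 = payload is the value itself       */
                           /* 1 = payload is an array of uint32_t   */
                           /*     blob-table IDs                    */
    uint8_t payloadLen;    /* bytes (or IDs) actually used          */
    uint8_t payload[MAIN_CELL_PAYLOAD];
} MainCell;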
4. Technical debt! I'm just one guy, the existing server is complicated, and overhauling it is a very expensive proposition. The existing storage system is at least functional and bug-free. The current problem is too many hash table collisions. Much easier to solve that problem than to rewrite the entire storage system from scratch.
IMO it's comparable. The only layer that needs to know about chunks is the cache above the map DB. It can easily provide the illusion to the rest of the code that things come and go as single items, so no code changes above it are required. Only if some performance bottleneck is identified would you make part of the server code aware of chunks to speed it up.
Offline
Just implemented a simple disk-based linear probing hash table.
I'm tracking the deepest probe needed in the benchmark.
22K records loaded into an 80K-slot table.
With murmur2, the deepest linear probe is 11 steps.
With djb2, the deepest linear probe is 7 steps.
I've run the tests that come with murmur2, and it passes.
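For reference, the lookup side of such a table is only a few lines. This is an illustrative sketch of a disk-based linear probe with probe-depth tracking, not my actual implementation - the record layout and sizes are assumptions:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define KEY_SIZE    16
#define VALUE_SIZE  4
#define RECORD_SIZE ( 1 + KEY_SIZE + VALUE_SIZE )  /* used flag + key + value */

int maxProbeDepth = 0;   /* deepest probe seen, reported by the benchmark */

/* The table file is assumed to be a preallocated flat array of
   RECORD_SIZE slots.  Returns 1 on hit (value copied out), 0 on miss. */
int lookup( FILE *tableFile, uint32_t numSlots, uint32_t hash,
            const uint8_t key[KEY_SIZE], uint8_t valueOut[VALUE_SIZE] ) {
    uint8_t record[RECORD_SIZE];

    for ( uint32_t probe = 0; probe < numSlots; probe++ ) {
        uint32_t slot = ( hash + probe ) % numSlots;

        fseeko( tableFile, (off_t)slot * RECORD_SIZE, SEEK_SET );
        if ( fread( record, RECORD_SIZE, 1, tableFile ) != 1 ) {
            return 0;   /* read failure treated as a miss in this sketch */
        }
        if ( (int)( probe + 1 ) > maxProbeDepth ) {
            maxProbeDepth = (int)( probe + 1 );
        }
        if ( record[0] == 0 ) {
            return 0;   /* empty slot reached: key is not in the table */
        }
        if ( memcmp( record + 1, key, KEY_SIZE ) == 0 ) {
            memcpy( valueOut, record + 1 + KEY_SIZE, VALUE_SIZE );
            return 1;   /* hit */
        }
        /* occupied by a different key: keep probing linearly */
    }
    return 0;
}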
Offline
Just a few hints. Instead of 80K single-item buckets, use e.g. 4K 20-item buckets - that will even out the random spikes. And it won't be any slower, because you can't read less than a block from disk anyway.
Prefetch 4-8 buckets on each read. If a bucket is full, you'll already have the next one in RAM. Unless you're running massively parallel I/O, you won't see any slowdown.
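The layout change is small. A sketch of the bucketed read (sizes are illustrative - the point is that one read pulls in a whole bucket, and reading a few consecutive buckets would act as the prefetch):

#include <stdint.h>
#include <stdio.h>

#define RECORD_SIZE      21   /* e.g. 1 used flag + 16 key + 4 value */
#define ITEMS_PER_BUCKET 20
#define BUCKET_BYTES     ( RECORD_SIZE * ITEMS_PER_BUCKET )

/* Read one whole bucket in a single call, then compare keys in RAM.
   Returns 1 if the whole bucket was read. */
static int readBucket( FILE *tableFile, uint32_t bucket,
                       uint8_t bucketOut[BUCKET_BYTES] ) {
    fseeko( tableFile, (off_t)bucket * BUCKET_BYTES, SEEK_SET );
    return fread( bucketOut, 1, BUCKET_BYTES, tableFile ) == BUCKET_BYTES;
}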
Last edited by sc0rp (2018-07-25 19:10:08)
Offline
The big difference that I'm seeing between the no-RAM, 100% disk linear probing implementation and the KISS implementation is that KISS is way faster to return the truly NULL result (if that slot in the hash table is actually empty), because it keeps the hash table in RAM.
But oddly, if there is a non-NULL entry in the RAM hash table, it still has to go to disk to check the key for that entry. Yet it's still storing a uint64_t file pointer for each cell in RAM.
Part of this is because of the append-only file structure, where new hash table pages and data blocks are interleaved. We don't automatically know the file position if we know the hash value, so we have to store that somewhere.
So it seems that a simple 1-byte existence map in RAM (or maybe even smaller, a bitmap) would suffice for this purpose, assuming that the file structure is isomorphic to your hash table.
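A sketch of what that existence bitmap could look like - one bit per hash-table slot, so a 30M-slot table needs under 4 MiB of RAM instead of 8 bytes per slot (names are made up, error handling omitted):

#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t *bits;        /* one bit per hash-table slot */
    uint32_t numSlots;
} ExistenceMap;

static ExistenceMap *newExistenceMap( uint32_t numSlots ) {
    ExistenceMap *m = malloc( sizeof( ExistenceMap ) );
    m->numSlots = numSlots;
    m->bits = calloc( ( numSlots + 7 ) / 8, 1 );
    return m;
}

static void markExists( ExistenceMap *m, uint32_t slot ) {
    m->bits[ slot / 8 ] |= (uint8_t)( 1 << ( slot % 8 ) );
}

/* 0 means the slot is definitely empty, so no disk read is needed */
static int slotExists( const ExistenceMap *m, uint32_t slot ) {
    return ( m->bits[ slot / 8 ] >> ( slot % 8 ) ) & 1;
}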
Offline
The big difference that I'm seeing between the no-RAM, 100% disk linear probing implementation and the KISS implementation is that KISS is way faster to return the truly NULL result (if that slot in the hash table is actually empty), because it keeps the hash table in RAM.
So with a 50% load, only half the time.
But oddly, if there is a non-NULL entry in the RAM hash table, it still has to go to disk to check the key for that entry. Yet it's still storing a uint64_t file pointer for each cell in RAM.
For everything in the chain, to be precise.
So it seems that a simple 1-byte existence map in RAM (or maybe even smaller, a bitmap) would suffice for this purpose, assuming that the file structure is isomorphic to your hash table.
That doesn't buy much - only half of the reads are avoided. If that turns out to be the main hotspot, make a Bloom filter. At 10 bits/key it will allow you to skip 99% of null lookups.
Another option is to store information about what other data is available in some record you read anyway, e.g. alongside the biome data for a tile.
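A minimal Bloom-filter sketch along those lines - with about 10 bits per expected key and 7 probes, the false-positive rate is roughly 1%, so about 99% of null lookups never touch the disk. The mixing hash and the double-hashing scheme are my own illustrative choices, not a specific library:

#include <stdint.h>
#include <stdlib.h>

#define BLOOM_HASHES 7

typedef struct {
    uint8_t *bits;
    uint64_t numBits;
} BloomFilter;

static uint64_t mix64( uint64_t x ) {          /* splitmix64 finalizer */
    x ^= x >> 30;  x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;  x *= 0x94d049bb133111ebULL;
    return x ^ ( x >> 31 );
}

static BloomFilter *bloomNew( uint64_t expectedKeys ) {
    BloomFilter *f = malloc( sizeof( BloomFilter ) );
    f->numBits = expectedKeys * 10;            /* ~10 bits per key */
    f->bits = calloc( ( f->numBits + 7 ) / 8, 1 );
    return f;
}

static void bloomSetBit( BloomFilter *f, uint64_t b ) {
    f->bits[ b / 8 ] |= (uint8_t)( 1 << ( b % 8 ) );
}

static int bloomGetBit( const BloomFilter *f, uint64_t b ) {
    return ( f->bits[ b / 8 ] >> ( b % 8 ) ) & 1;
}

/* hash the key bytes, then use double hashing: probe i is h1 + i*h2 */
static void bloomAdd( BloomFilter *f, const void *key, size_t len ) {
    const uint8_t *p = key;
    uint64_t h = 0;
    for ( size_t i = 0; i < len; i++ ) h = mix64( h ^ p[i] );
    uint64_t h1 = h, h2 = mix64( h ) | 1;
    for ( int i = 0; i < BLOOM_HASHES; i++ )
        bloomSetBit( f, ( h1 + (uint64_t)i * h2 ) % f->numBits );
}

/* returns 0 only if the key is definitely not present */
static int bloomMightContain( const BloomFilter *f, const void *key, size_t len ) {
    const uint8_t *p = key;
    uint64_t h = 0;
    for ( size_t i = 0; i < len; i++ ) h = mix64( h ^ p[i] );
    uint64_t h1 = h, h2 = mix64( h ) | 1;
    for ( int i = 0; i < BLOOM_HASHES; i++ )
        if ( !bloomGetBit( f, ( h1 + (uint64_t)i * h2 ) % f->numBits ) ) return 0;
    return 1;
}

You'd call bloomAdd() for every key at insert time and check bloomMightContain() before going to disk; a zero answer is definitive, a one answer still needs the normal lookup.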
Last edited by sc0rp (2018-07-25 21:16:01)
Offline
Another thing that I'm noticing about KISS:
The hash table is random access, but on reads, the disk version of the hash table isn't touched. We get a file position out of the RAM hash table, and then go to that spot on disk to read the key and see if it matches. But those records are appended to the file in the order in which they are inserted.
Thus, if there are patterns in the order in which the data records are accessed, relative to the order in which they were inserted, we get good cache locality on reads.
Offline