I actually just ran a search-and-replace, changing uint64_t to uint32_t in both the KISSDB header and .cpp file. Running the benchmark on a smaller data set, I see the same checksums, etc., with the 64-bit and 32-bit versions, so I think it worked. I also switched fseeko to 32-bit mode.
At first, I tried to figure out which were hash values and which were other values, but I don't think it matters.
Though the previous run with its 1.2GiB file is making me nervous about the 4GiB limit...
Offline
Yes, KISSDB was thrashing, and the 32-bit version is way better:
Opening DB (table size 30000000) used 5840 bytes, took 0.000000 sec
Inserted 15752961
Inserts used 360001536 bytes, took 179.170000 sec
Flushing system disk caches.
Random lookup for 500 batchs of 3000 (1500000/1500000 hits)
Random look used 0 bytes, took 3.840000 sec
Checksum = 186209500
Flushing system disk caches.
Last-inserted lookup for 500 batchs of 2401 (1200500/1200500 hits)
Last-inserted lookup used 0 bytes, took 3.102000 sec
Checksum = 283318000
Flushing system disk caches.
Random lookup for non-existing 500 repeating batchs of 3000 (0/1500000 hits)
Random look/miss used 0 bytes, took 0.934000 sec
Flushing system disk caches.
Random lookup for non-existing 1500000 non-repeating values (0/1500000 hits)
Random look/miss used 0 bytes, took 4.576000 sec
Flushing system disk caches.
Inserts for previously non-existing 500 batchs of 3000
Inserts after miss used 0 bytes, took 7.170000 sec
Flushing system disk caches.
Iterated 15755961, checksum 1953944243
Iterating used 0 bytes, took 46.211000 sec
Max bin depth = 3
real 4m6.418s
user 0m50.763s
sys 2m10.474s
-rw-r--r-- 1 root root 675119248 Jul 25 00:53 test.db
Offline
That's pretty neat performance. Just one small final touch: start with 40% load to give it some breathing room for inserts. Beyond 50% load, longer chains start to form, so it's better to avoid that; the cost of longer chains with KISSDB is very high.
Now run it with the full game, to check whether there are other bottlenecks elsewhere.
Offline
Max bin depth = 3
I wonder how it got down from 4 to 3. Is there some non-seeded randomness involved?
Offline
No non-seeded randomness.
I changed the hash function from 64-bit to 32-bit, though, for the djb2 algorithm. I think that will change the hash values... maybe? I haven't thought it through, though, and I'm tired.
Testing MurmurHash2 now.... it's SLOW.....
179 sec was long enough for those 15 million inserts.
Offline
Also, while I'm waiting, this has got me nervous:
top - 01:39:14 up 5:15, 2 users, load average: 1.23, 1.25, 1.07
Tasks: 93 total, 2 running, 52 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.3 us, 19.2 sy, 0.0 ni, 0.0 id, 78.5 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 1009052 total, 73708 free, 755852 used, 179492 buff/cache
KiB Swap: 262140 total, 219032 free, 43108 used. 110848 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10293 root 20 0 717164 704724 1612 R 21.2 69.8 1:58.68 kissTest
10226 root 20 0 0 0 0 I 2.6 0.0 0:15.84 kworker/u2:2
34 root 20 0 0 0 0 S 2.3 0.0 1:38.73 kswapd0
153 root 0 -20 0 0 0 I 0.3 0.0 0:21.79 kworker/0:1H
1 root 20 0 77896 1592 1288 S 0.0 0.2 0:02.51 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
Does this mean Murmur2 has likely caused more collisions, just by random chance?
The problem with KISS is that collisions are so expensive in terms of resource consumption.
Offline
Does this mean Murmur2 has likely caused more collisions, just by random chance?
Looks like it. I'm quite surprised. Before seeing that, I'd have bet it would perform better. Are you using the 32-bit version?
The problem with KISS is that collisions are so expensive in terms of resource consumption.
That's why I think it's only a stopgap solution, to buy some time for something better.
Offline
Well, I've bought time already by using STACKDB and taking server1 and 2 out of circulation, and turning the cap down to 40 on the other servers.
The process of getting all the servers to rebuild their DBs with a larger table size, with or without switching back to KISS, is non-trivial. I'd like to get a better solution (low RAM, constant time, cheap collisions) before I go through the process of writing the conversion code.
Also, the big insert still hasn't finished with Murmur2. Yes, the 32-bit version, along with the 32-bit KISSDB mod. Looks like it's thrashing:
top - 02:08:01 up 5:44, 2 users, load average: 1.06, 1.09, 1.09
Tasks: 93 total, 2 running, 52 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.4 us, 15.9 sy, 0.0 ni, 0.0 id, 81.7 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 1009052 total, 71200 free, 897900 used, 39952 buff/cache
KiB Swap: 262140 total, 129396 free, 132744 used. 17896 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10293 root 20 0 951540 849604 1612 R 14.3 84.2 5:58.56 kissTest
34 root 20 0 0 0 0 S 1.7 0.0 2:01.37 kswapd0
10226 root 20 0 0 0 0 I 1.0 0.0 0:30.45 kworker/u2:2
153 root 0 -20 0 0 0 I 0.3 0.0 0:27.58 kworker/0:1H
1 root 20 0 77896 2660 2384 S 0.0 0.3 0:02.59 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/0:0H
Just killed it. It ran for 42 minutes without finishing the insert.
Will try a different Murmur2 seed and see if I get luckier.
Offline
Well, even with a different seed, same problem. I added some print-outs to KISS to document new pages being added.
Local epoch time = 0
GMT epoch time = 0
Opening DB (table size 30000000) used 5840 bytes, took 0.000000 sec
Adding new page to the hash table
Adding new page to the hash table
Adding new page to the hash table
Adding new page to the hash table
Adding new page to the hash table
^C
real 1m3.300s
Switching back to djb2, I'm seeing only two pages added in the first three or so minutes, and way less RAM:
top - 02:29:00 up 6:05, 2 users, load average: 1.70, 1.03, 1.03
Tasks: 93 total, 2 running, 53 sleeping, 0 stopped, 0 zombie
%Cpu(s): 23.8 us, 75.8 sy, 0.0 ni, 0.0 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 1009052 total, 122636 free, 290052 used, 596364 buff/cache
KiB Swap: 262140 total, 224400 free, 37740 used. 572156 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10440 root 20 0 248412 237144 2828 R 87.4 23.5 1:41.81 kissTest
10226 root 20 0 0 0 0 D 11.0 0.0 1:09.49 kworker/u2:2
153 root 0 -20 0 0 0 I 1.3 0.0 0:34.52 kworker/0:1H
10315 root 20 0 44048 1804 1448 R 0.3 0.2 0:02.05 top
1 root 20 0 77896 1376 1072 S 0.0 0.1 0:02.63 systemd
Wonder what's going on with Murmur2?
I copy-pasted the code for the 32-bit function directly from here:
https://github.com/aappleby/smhasher/bl … rHash2.cpp
It's the first function in the file.... I didn't change it at all.
In the meantime, the djb2 version has finished the insert with only 3 pages added:
Local epoch time = 0
GMT epoch time = 0
Opening DB (table size 30000000) used 5840 bytes, took 0.000000 sec
Adding new page to the hash table
Adding new page to the hash table
Adding new page to the hash table
Inserted 15752961
Inserts used 360001536 bytes, took 170.821000 sec
I'm looking to verify that the MurmurHash2 code I'm using is working correctly, but there are no published test vectors? What kind of hash has no test vectors? I do see that it's supposed to produce different results on different platforms, so maybe test vectors would be useless, but still....
Also, found this note:
Update October 28, 2010 - A couple of people have reported a flaw in MurmurHash2, and I've confirmed it - repetitions of 4-byte patterns have a much higher chance of collision than expected. This is not fixable, but should not cause problems in most cases (it applies only to keys that are identical except for the repeating sections and with those sections 4-byte-aligned). While investigating the flaw I've found a new mix function that improves the performance yet again, so I'll be publishing that as MurmurHash3.
Offline
Well, I've bought time already by using STACKDB and taking server1 and 2 out of circulation, and turning the cap down to 40 on the other servers.
The process of getting all the servers to rebuild their DBs with a larger table size, with or without switching back to KISS, is non-trivial. I'd like to get a better solution (low RAM, constant time, cheap collisions) before I go through the process of writing the conversion code.
Ok. So here's the deal. I'll finish up my implementation of "linear hashing with overflow handling by linear probing". I'll add all the tests to have good coverage. From what I've seen so far, I can promise "almost constant RAM usage, O(1) lookup time including misses, cheap collisions". It's "almost" constant only because in rare cases it needs some transient RAM to deal with overflow islands during a split (it's several pages of data, less than a MB, so nothing large). For plain inserts, lookups, or misses it never needs additional memory, no matter whether it has to deal with 1K records, 1M records, or 1B records.
Then I'll explain all the details of exactly how it works, so if you encounter problems in the future, you can profile and fix it yourself. I'll place all the code in the public domain.
If it's used and works well, after some time I'll extract the battle-hardened version into a separate library on GitHub, so other people can use it as well.
I'd prefer to go that route, instead of making a library on GitHub first, because then I'd have "a solution looking for a problem to solve", which is never good.
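For reference, the core of linear hashing is just how a bucket index is computed from a level and a split pointer. A minimal sketch of that textbook addressing step (not my actual implementation - names and struct layout are made up here):

#include <stdint.h>

/* Textbook linear-hashing address calculation (illustrative only).
   n0 is the initial bucket count, level counts completed table
   doublings, split is the next bucket due to be split. */
typedef struct {
    uint32_t n0;
    uint32_t level;
    uint32_t split;
} LinearHashState;

static uint32_t lh_bucket( const LinearHashState *s, uint32_t hash ) {
    uint32_t b = hash % ( s->n0 << s->level );      /* h mod (n0 * 2^level) */
    if ( b < s->split ) {
        /* this bucket was already split this round, so address it
           with the next round's hash function */
        b = hash % ( s->n0 << ( s->level + 1 ) );   /* h mod (n0 * 2^(level+1)) */
    }
    return b;
}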
Last edited by sc0rp (2018-07-25 02:47:05)
Offline
I really need to write my own version of this.
It's part of what I do.
I do appreciate the offer, though.
I would also like to run the same benchmark on yours and mine. I mean, obviously, I'll be shooting for speed on all operations that is better than KISS or STACK, but beyond that, I'll have nothing to compare it to.
Also, I did read something today about k-wise independent hashing, and the proof that 5-independent hashing is necessary to guarantee O(1) inserts if you're doing linear probing. But I've seen no examples of actual hash functions that have this 5-independent property.
Offline
Wonder what's going on with Murmur2?
Without debugging it, I don't know. I certainly would have expected it to perform better. Heck, djb2 is as weak as you can go; it's just a bit shift and an add. My first reaction to seeing something like that would be to change it to at least "on each round, multiply by a prime and take the middle bits, where the most mixing occurs".
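For context, this is the whole of djb2 as it's usually written (the copy in kissdb32.cpp may differ in minor details) - the mixing really is just a shift-and-add multiply by 33:

#include <stdint.h>
#include <stddef.h>

/* Bernstein's djb2: multiply the running hash by 33 via shift-and-add,
   then add the next byte.  Shown only to illustrate how little mixing
   it does. */
static uint32_t djb2( const uint8_t *data, size_t len ) {
    uint32_t hash = 5381;
    for ( size_t i = 0; i < len; i++ ) {
        hash = ( ( hash << 5 ) + hash ) + data[i];  /* hash * 33 + byte */
    }
    return hash;
}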
Without debugging it further, maybe just switch to SipHash and see what happens. If it also behaves badly, you have a serious problem worth investigating.
Offline
I tried a few other hash functions, including Murmur3 and Jenkins "one-at-a-time." I'm seeing similar behavior, with 5-6 collisions.
The only other function that worked with only 3 collisions was the similarly simple sdbm hash.
This makes me wonder if there's some problem in the way that I plugged these other hash functions in. I tried various seeds for the seeded one, and the Murmur2 code literally takes the same parameters as djb2.
All the hash functions that I've tried are at the top of the file here:
https://github.com/jasonrohrer/OneLife/ … ssdb32.cpp
Does anything obvious jump out? Am I using these incorrectly somehow?
Also, it's not an INSANE number of collisions, like thousands. Just five or six slowly over time, but that's enough to cause it to start using too much RAM.
The test key data here are extremely non-random, literally marching through all possible x,y,s,b values from (0,0,0,0) up to (63,63,63,63), such that we get 15M combinations. That does mean there are a lot of zero bits in there.
Offline
I tried a few other hash functions, including Murmur3 and Jenkins "one-at-a-time." I'm seeing similar behavior, with 5-6 collisions.
Does anything obvious jump out? Am I using these incorrectly somehow?
This gives a non-uniform distribution; the larger the hash table size, the worse the bias:
uint32_t hash = KISSDB_hash(key,db->key_size) % (uint32_t)db->hash_table_size;
Use a power-of-two table size as a quick fix. You can mix in the top hash bits as well.
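Something along these lines (just a sketch of the idea, not a patch against KISSDB; the shift amount is an assumption):

#include <stdint.h>

/* With a power-of-two table size, the reduction becomes a mask, and
   folding the top bits down first keeps them from being thrown away
   when the low bits of the hash are weak. */
static uint32_t bucket_index( uint32_t hash, uint32_t table_size ) {
    /* table_size must be a power of two */
    hash ^= hash >> 16;                 /* mix top bits into low bits */
    return hash & ( table_size - 1 );   /* same as hash % table_size  */
}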
Last edited by sc0rp (2018-07-25 09:22:07)
Offline
I've run a few tests and I think I know what happened. The modulo with a 32-bit hash gives a tiny bias (<1%), but the main thing is: if you put 15M items randomly into 30M bins, you get a distribution looking more or less like this:
Counter({0: 18196821, 1: 9096015, 2: 2275788, 3: 378792, 4: 47342, 5: 4811, 6: 409, 7: 20, 8: 2})
aka birthday paradox
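You can reproduce a distribution like that with a quick balls-into-bins simulation. This sketch uses a throwaway xorshift RNG, so the exact counts will differ a bit from the Counter output above:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BINS  30000000u
#define BALLS 15000000u

static uint32_t xorshift32( uint32_t *s ) {
    *s ^= *s << 13;  *s ^= *s >> 17;  *s ^= *s << 5;
    return *s;
}

int main( void ) {
    /* one occupancy counter per bin (max occupancy here is single digits) */
    uint8_t *count = calloc( BINS, 1 );
    if ( count == NULL ) return 1;

    uint32_t state = 12345u;
    for ( uint32_t i = 0; i < BALLS; i++ ) {
        count[ xorshift32( &state ) % BINS ]++;
    }

    /* histogram: how many bins ended up with 0, 1, 2, ... items */
    uint32_t hist[16] = { 0 };
    for ( uint32_t b = 0; b < BINS; b++ ) {
        hist[ count[b] < 15 ? count[b] : 15 ]++;
    }
    for ( int d = 0; d < 16; d++ ) {
        if ( hist[d] > 0 ) printf( "%d items: %u bins\n", d, hist[d] );
    }
    free( count );
    return 0;
}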
So the funny thing is, djb2 got such good results precisely because it is so bad. In this particular case it worked more like a counter than a hash.
I wouldn't feel safe keeping djb2. It turned out to be in our favor here, but it may just as quickly degenerate into a massive number of collisions.
The solution is simple, but not trivial: use a smaller number of larger buckets to flatten it out. With 30M one-item buckets, the expected max chain length is 8. With 15M two-item buckets, it's 5. With 7.5M four-item buckets, it's 4. With 1.5M twenty-item buckets, it's 2 (with slight overflow - I got 29 items in the largest one).
Or maybe switch to StackDB with a large hash table. Having 8 items in the longest chain shouldn't cause much of a problem.
Last edited by sc0rp (2018-07-25 14:00:21)
Offline
One simple question.
Is storing every single tile as a unique entry really a good idea?
If database search time is the bottleneck, why not "group" tiles into bigger chunks of binary data and store those in the DB?
Even in the simple case of 10x10 tiles per "chunk", you reduce the number of entries in the DB by 100. Basically, with an NxN chunk it's an N^2 reduction in the number of elements in the table (see the sketch below). Of course, this way you need to load bigger parts of the map and "decode" them into actual tiles every time (and encode them back to save), but you could hold those in RAM a bit longer, until they're not needed for sure.
Some engines even use this method twice for file storage - basically grouping data into huge chunks made of smaller chunks, and storing the actual data two directory layers deep.
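A minimal sketch of the chunk-key idea (not code from the actual server; the chunk size and function name are made up for illustration):

#include <stdint.h>

#define CHUNK_SIZE 10

/* Map a tile coordinate to the coordinate of the NxN chunk containing
   it, so every tile in a chunk shares one database key.  Floor
   division keeps negative tile coordinates in the correct chunk
   instead of rounding toward zero. */
static int32_t chunk_coord( int32_t tile ) {
    return ( tile >= 0 ) ? ( tile / CHUNK_SIZE )
                         : ( ( tile - ( CHUNK_SIZE - 1 ) ) / CHUNK_SIZE );
}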
Offline
I proposed the same:
https://onehouronelife.com/forums/viewt … 350#p25350
Originally as an idea for how to get rid of map wipes.
Offline
I proposed the same:
https://onehouronelife.com/forums/viewt … 350#p25350
Originally as an idea for how to get rid of map wipes.
Sorry, I must be really tired... I scrolled through the thread and somehow missed that...
Offline
Yeah, birthday paradox. Another reason why KISSDB's expensive collisions are no good. If there were a linked list to handle the occasional collision, it would be no big deal. But making a whole new page of a 30M-cell table is a big deal - not only because of RAM and thrashing, but because it's not a great realtime moment when you decide, upon insert, to make the new page.
Anyway, you're essentially saying that, given the data set (32-bit int 4-tuples between (0,0,0,0) and (63,63,63,63)), djb2 is so bad that it evenly distributes the data in an artificial way, avoiding the birthday paradox, whereas the other hashes are so good and so random that they are subject to things that truly random distributions are subject to, like the birthday paradox.
Joriom, as for why we don't just store big chunks like that, a few things (at least what I was thinking when I designed it this way):
1. The man-made map is sparse, and most of it is empty. If you go and drop a stone in some spot in the wilderness, we need to remember that one tile out there, but the rest of the stuff around there is "empty" as far as the database is concerned---either truly empty tiles, or proc-genned wilderness that we don't need to store. We don't want to record a whole 10x10 chunk around it. We could use a hybrid system where we record chunks in cities and record sparse cells separately, but see the other points. The downside of a sparse system is that the NULL result (there's nothing here) is the most common result, and it's also rather expensive, because you have to ask about each empty cell separately. A chunk system would get a whole page of NULL results together. BUT, it would also be storing all these known-to-be-empty cells, which I believe(d) would be space-prohibitive in the long run.
2. Even densely-filled chunks of the map (villages) have an unbounded amount of information potential, due to containers. It's not just 10x10 objectID integers, or 3200 bytes. It's also what's in those containers and sub-containers. How big those are is currently independent of the database code. In the editor, I can make a container with 8 slots or 17 slots or 30 slots, and it will just work, with no server code changes. And the containers, like everything else, are sparse. An object with 30 slots that only has 2 items in it will only have two things stored in the database (along with the number 2, to tell how many contained items are there).
3. Early on, I settled on a database system with fixed-size keys and data, in a binary format. This simplifies a lot of things, and provides a nice abstraction on top of which you can build all sorts of things very easily (a sketch of such a fixed-size key follows after this list). An arbitrary-length blob storage system could be an alternative (where you key by x,y and then have some variable-length encoded representation of what's there). But that's a lot more complicated to code, probably more complicated to make fast as a database engine, and maybe less space efficient (depending on the encoding format of the blob).
4. Technical debt! I'm just one guy, the existing server is complicated, and overhauling it is a very expensive proposition. The existing storage system is at least functional and bug-free. The current problem is too many hash table collisions. Much easier to solve that problem than to rewrite the entire storage system from scratch.
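To be concrete about what "fixed-size keys" means here, this is roughly the kind of 16-byte binary key the benchmark above was exercising - the x, y, s, b fields come from the test description, but the exact packing in the real server may differ:

#include <stdint.h>
#include <string.h>

/* Pack four 32-bit values back to back into a fixed 16-byte key
   (illustrative layout, not necessarily the server's exact one). */
static void makeMapKey( unsigned char keyOut[16],
                        int32_t x, int32_t y, int32_t s, int32_t b ) {
    memcpy( keyOut,      &x, 4 );
    memcpy( keyOut + 4,  &y, 4 );
    memcpy( keyOut + 8,  &s, 4 );
    memcpy( keyOut + 12, &b, 4 );
}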
Offline
A chunk system would get a whole page of NULL results together. BUT, it would also be storing all these known-to-be-empty cells, which I believe(d) would be space-prohibitive in the long run.
Real-time compression like LZ4 will keep it small.
2. Even densely-filled chunks of the map (villages) have an unbounded amount of information potential, due to containers. [...] An object with 30 slots that only has 2 items in it will only have two things stored in the database (along with the number 2, to tell how many contained items are there).
You can store it pretty much the same way it's transmitted to the client: "123,10:5:5".
3. Early on, I settled on a database system with fixed-size keys and data, in a binary format. This simplifies a lot of things, and provides a nice abstraction on top of which you can build all sorts of things very easily. An arbitrary-length blob storage system could be an alternative (where you key by x,y and then have some variable-length encoded representation of what's there). But that's a lot more complicated to code, probably more complicated to make fast as a database engine, and maybe less space efficient (depending on the encoding format of the blob).
The simplest way to create a variable-sized DB out of a fixed one is indirection. Split it into two tables: "main" with small cells, and "blob" with big cells. If the data is small, write it directly to main. If it's too big, store in main the IDs of objects from the blob table. If it's huge, make the top-level blobs hold the IDs of all the other blobs with the data.
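A rough sketch of what such a main/blob split could look like (the record layout here is invented purely for illustration):

#include <stdint.h>

#define MAIN_CELL_PAYLOAD 28

/* A main-table cell either holds the value inline, or, if the value
   is too big to fit, a list of up to 7 blob-table IDs to fetch and
   concatenate (illustrative layout only). */
typedef struct {
    uint8_t isIndirect;    /* 0 = payload is the value itself       */
                           /* 1 = payload is an array of uint32_t   */
                           /*     blob-table IDs                    */
    uint8_t payloadLen;    /* bytes (or IDs) actually used          */
    uint8_t payload[MAIN_CELL_PAYLOAD];
} MainCell;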
4. Technical debt! I'm just one guy, the existing server is complicated, and overhauling it is a very expensive proposition. The existing storage system is at least functional and bug-free. The current problem is too many hash table collisions. Much easier to solve that problem than to rewrite the entire storage system from scratch.
IMO it's comparable. The only layer that needs to know about chunks is the cache above the map DB. It can easily provide the illusion to the rest of the code that things come and go as single items, so no code changes above it are required. Only if some performance bottleneck is identified would you make part of the server code aware of chunks to speed it up.
Offline
Just implemented a simple disk-based linear probing hash table.
I'm tracking the deepest probe needed in the benchmark.
22K records loaded into an 80K-slot table.
With murmur2, the deepest linear probe is 11 steps.
With djb2, the deepest linear probe is 7 steps.
I've run the tests that come with murmur2, and it passes.
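For reference, the lookup side of such a table is only a few lines. This is an illustrative sketch of a disk-based linear probe with probe-depth tracking, not my actual implementation - the record layout and sizes are assumptions:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define KEY_SIZE    16
#define VALUE_SIZE  4
#define RECORD_SIZE ( 1 + KEY_SIZE + VALUE_SIZE )  /* used flag + key + value */

int maxProbeDepth = 0;   /* deepest probe seen, reported by the benchmark */

/* The table file is assumed to be a preallocated flat array of
   RECORD_SIZE slots.  Returns 1 on hit (value copied out), 0 on miss. */
int lookup( FILE *tableFile, uint32_t numSlots, uint32_t hash,
            const uint8_t key[KEY_SIZE], uint8_t valueOut[VALUE_SIZE] ) {
    uint8_t record[RECORD_SIZE];

    for ( uint32_t probe = 0; probe < numSlots; probe++ ) {
        uint32_t slot = ( hash + probe ) % numSlots;

        fseeko( tableFile, (off_t)slot * RECORD_SIZE, SEEK_SET );
        if ( fread( record, RECORD_SIZE, 1, tableFile ) != 1 ) {
            return 0;   /* read failure treated as a miss in this sketch */
        }
        if ( (int)( probe + 1 ) > maxProbeDepth ) {
            maxProbeDepth = (int)( probe + 1 );
        }
        if ( record[0] == 0 ) {
            return 0;   /* empty slot reached: key is not in the table */
        }
        if ( memcmp( record + 1, key, KEY_SIZE ) == 0 ) {
            memcpy( valueOut, record + 1 + KEY_SIZE, VALUE_SIZE );
            return 1;   /* hit */
        }
        /* occupied by a different key: keep probing linearly */
    }
    return 0;
}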
Offline
Just a few hints. Instead of 80K single-item buckets, use e.g. 4K 20-item buckets - that will even out the random spikes. And it won't be any slower, because you can't read less than a block from disk anyway.
Prefetch 4-8 buckets on each read. If a bucket is full, you'll already have the next one in RAM. Unless you're running massively parallel I/O, you won't see any slowdown.
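The layout change is small. A sketch of the bucketed read (sizes are illustrative - the point is that one read pulls in a whole bucket, and reading a few consecutive buckets would act as the prefetch):

#include <stdint.h>
#include <stdio.h>

#define RECORD_SIZE      21   /* e.g. 1 used flag + 16 key + 4 value */
#define ITEMS_PER_BUCKET 20
#define BUCKET_BYTES     ( RECORD_SIZE * ITEMS_PER_BUCKET )

/* Read one whole bucket in a single call, then compare keys in RAM.
   Returns 1 if the whole bucket was read. */
static int readBucket( FILE *tableFile, uint32_t bucket,
                       uint8_t bucketOut[BUCKET_BYTES] ) {
    fseeko( tableFile, (off_t)bucket * BUCKET_BYTES, SEEK_SET );
    return fread( bucketOut, 1, BUCKET_BYTES, tableFile ) == BUCKET_BYTES;
}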
Last edited by sc0rp (2018-07-25 19:10:08)
Offline
The big difference that I'm seeing between the no-RAM, 100% disk linear probing implementation and the KISS implementation is that KISS is way faster to return the truly NULL result (if that slot in the hash table is actually empty), because it keeps the hash table in RAM.
But oddly, if there is a non-NULL entry in the RAM hash table, it still has to go to disk to check the key for that entry. Yet it's still storing a uint64_t file pointer for each cell in RAM.
Part of this is because of the append-only file structure, where new hash table pages and data blocks are interleaved. We don't automatically know the file position if we know the hash value, so we have to store that somewhere.
So it seems that a simple 1-byte existence map in RAM (or maybe even smaller, a bitmap) would suffice for this purpose, assuming that the file structure is isomorphic to your hash table.
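A sketch of what that existence bitmap could look like - one bit per hash-table slot, so a 30M-slot table needs under 4 MiB of RAM instead of 8 bytes per slot (names are made up, error handling omitted):

#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t *bits;        /* one bit per hash-table slot */
    uint32_t numSlots;
} ExistenceMap;

static ExistenceMap *newExistenceMap( uint32_t numSlots ) {
    ExistenceMap *m = malloc( sizeof( ExistenceMap ) );
    m->numSlots = numSlots;
    m->bits = calloc( ( numSlots + 7 ) / 8, 1 );
    return m;
}

static void markExists( ExistenceMap *m, uint32_t slot ) {
    m->bits[ slot / 8 ] |= (uint8_t)( 1 << ( slot % 8 ) );
}

/* 0 means the slot is definitely empty, so no disk read is needed */
static int slotExists( const ExistenceMap *m, uint32_t slot ) {
    return ( m->bits[ slot / 8 ] >> ( slot % 8 ) ) & 1;
}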
Offline
The big difference that I'm seeing between the no-RAM, 100% disk linear probing implementation and the KISS implementation is that KISS is way faster to return the truly NULL result (if that slot in the hash table is actually empty), because it keeps the hash table in RAM.
So with a 50% load, only half the time.
But oddly, if there is a non-NULL entry in the RAM hash table, it still has to go to disk to check the key for that entry. Yet it's still storing a uint64_t file pointer for each cell in RAM.
For everything in the chain, to be precise.
So it seems that a simple 1-byte existence map in RAM (or maybe even smaller, a bitmap) would suffice for this purpose, assuming that the file structure is isomorphic to your hash table.
That doesn't buy much - only half of the reads are avoided. If that turns out to be the main hotspot, make a Bloom filter. At 10 bits/key it will allow you to skip 99% of null lookups.
Another option is to store information about what other data is available in some record you read anyway, e.g. alongside the biome data for a tile.
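A minimal Bloom-filter sketch along those lines - with about 10 bits per expected key and 7 probes, the false-positive rate is roughly 1%, so about 99% of null lookups never touch the disk. The mixing hash and the double-hashing scheme are my own illustrative choices, not a specific library:

#include <stdint.h>
#include <stdlib.h>

#define BLOOM_HASHES 7

typedef struct {
    uint8_t *bits;
    uint64_t numBits;
} BloomFilter;

static uint64_t mix64( uint64_t x ) {          /* splitmix64 finalizer */
    x ^= x >> 30;  x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;  x *= 0x94d049bb133111ebULL;
    return x ^ ( x >> 31 );
}

static BloomFilter *bloomNew( uint64_t expectedKeys ) {
    BloomFilter *f = malloc( sizeof( BloomFilter ) );
    f->numBits = expectedKeys * 10;            /* ~10 bits per key */
    f->bits = calloc( ( f->numBits + 7 ) / 8, 1 );
    return f;
}

static void bloomSetBit( BloomFilter *f, uint64_t b ) {
    f->bits[ b / 8 ] |= (uint8_t)( 1 << ( b % 8 ) );
}

static int bloomGetBit( const BloomFilter *f, uint64_t b ) {
    return ( f->bits[ b / 8 ] >> ( b % 8 ) ) & 1;
}

/* hash the key bytes, then use double hashing: probe i is h1 + i*h2 */
static void bloomAdd( BloomFilter *f, const void *key, size_t len ) {
    const uint8_t *p = key;
    uint64_t h = 0;
    for ( size_t i = 0; i < len; i++ ) h = mix64( h ^ p[i] );
    uint64_t h1 = h, h2 = mix64( h ) | 1;
    for ( int i = 0; i < BLOOM_HASHES; i++ )
        bloomSetBit( f, ( h1 + (uint64_t)i * h2 ) % f->numBits );
}

/* returns 0 only if the key is definitely not present */
static int bloomMightContain( const BloomFilter *f, const void *key, size_t len ) {
    const uint8_t *p = key;
    uint64_t h = 0;
    for ( size_t i = 0; i < len; i++ ) h = mix64( h ^ p[i] );
    uint64_t h1 = h, h2 = mix64( h ) | 1;
    for ( int i = 0; i < BLOOM_HASHES; i++ )
        if ( !bloomGetBit( f, ( h1 + (uint64_t)i * h2 ) % f->numBits ) ) return 0;
    return 1;
}

You'd call bloomAdd() for every key at insert time and check bloomMightContain() before going to disk; a zero answer is definitive, a one answer still needs the normal lookup.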
Last edited by sc0rp (2018-07-25 21:16:01)
Offline
Another thing that I'm noticing about KISS:
The hash table is random access, but on reads, the disk version of the hash table isn't touched. We get a file position out of the RAM hash table, and then go to that spot on disk to read the key and see if it matches. But those records are appended to the file in the order in which they are inserted.
Thus, if there are patterns in the order in which the data records are accessed, relative to the order in which they were inserted, we get good cache locality on reads.
Offline