Conversation
nwf-msr left a comment:
Comments so far; posting before switching threads.
nwf-msr left a comment:
Generally looks quite nice. ISTR snmalloc of old had the ability to conditionally keep stats or not; perhaps it would be worth having an empty implementation of the Stat and MonotoneStat interfaces and either templating or having a namespace snmalloc-scoped using to pick between them?
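The suggestion could be sketched roughly as follows. This is only an illustration, not snmalloc's actual interfaces: the names `CountingStat`, `NoOpStat`, and `SNMALLOC_ENABLE_STATS` are hypothetical stand-ins. The idea is a no-op class with the same shape as the counting one, so a single namespace-scoped `using` picks the implementation at compile time and the no-op calls inline away to nothing.

```cpp
#include <atomic>
#include <cstddef>

// A stat that actually counts, using relaxed atomics.
class CountingStat
{
  std::atomic<size_t> value{0};

public:
  void increase(size_t n)
  {
    value.fetch_add(n, std::memory_order_relaxed);
  }
  void decrease(size_t n)
  {
    value.fetch_sub(n, std::memory_order_relaxed);
  }
  size_t get_curr()
  {
    return value.load(std::memory_order_relaxed);
  }
};

// Same interface, but every operation is a no-op that the
// compiler can eliminate entirely.
class NoOpStat
{
public:
  void increase(size_t) {}
  void decrease(size_t) {}
  size_t get_curr()
  {
    return 0;
  }
};

// Compile-time selection; hypothetical flag name.
#ifdef SNMALLOC_ENABLE_STATS
using Stat = CountingStat;
#else
using Stat = NoOpStat;
#endif
```

With this shape, call sites are written once against `Stat` and the cost of the statistics disappears in builds that do not define the flag.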
src/snmalloc/mem/globalalloc.h (Outdated)

    if (
      result == nullptr && RemoteDeallocCache::remote_inflight.get_curr() != 0)
Something has happened (TM) with the syntax there. Can this be a SNMALLOC_ASSERT_MSG?
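If the intent is that the quoted condition should never hold, the check could become an assertion along these lines. This is a hedged sketch only: the macro below is a stand-in for snmalloc's `SNMALLOC_ASSERT_MSG` (the real one lives in snmalloc and compiles out of release builds), and `remote_inflight_curr` stands in for `RemoteDeallocCache::remote_inflight.get_curr()`.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Stand-in for snmalloc's SNMALLOC_ASSERT_MSG, shown only to
// illustrate the suggested shape of the check.
#define SNMALLOC_ASSERT_MSG(cond, msg) \
  do \
  { \
    if (!(cond)) \
    { \
      std::fputs(msg "\n", stderr); \
      std::abort(); \
    } \
  } while (0)

// Hypothetical helper: assert that a failed allocation never leaves
// remote deallocations still in flight.
inline void check_flushed(void* result, size_t remote_inflight_curr)
{
  SNMALLOC_ASSERT_MSG(
    !(result == nullptr && remote_inflight_curr != 0),
    "Allocation failed with remote deallocations still in flight");
}
```

Whether an assertion or a runtime branch is correct depends on the intended semantics of the original code, which the diff context does not fully show.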
I was going to profile to see how much the operations cost. If they are noticeable, then I will macro it away as you suggest.
Force-pushed from bfc415a to 5bc8fd8.
So I have benchmarked this, and it has a perf regression. The worst case seems to be 3% (glibc-thread), but most tests are below 1%. I am going to investigate moving more of the statistics off the fast path. This will over-approximate the current user allocations quite a bit, but should make the overhead practically zero. Alternatively, we could look at making this a compile-time option.
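The proposed trade-off could look something like the following sketch (the names are illustrative, not snmalloc's code): instead of updating a counter on every allocation and deallocation on the fast path, only count when a whole slab of objects is acquired from or returned to the backend on the slow path. Current usage is then over-approximated, since partially used slabs count as full.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical coarse-grained per-sizeclass stat: counts slabs,
// not individual objects.
class CoarseSizeclassStat
{
  std::atomic<size_t> slabs_in_use{0};
  size_t objects_per_slab;

public:
  explicit CoarseSizeclassStat(size_t per_slab) : objects_per_slab(per_slab) {}

  // Called only on the slow path, when a slab is acquired.
  void slab_acquired()
  {
    slabs_in_use.fetch_add(1, std::memory_order_relaxed);
  }

  // Called only on the slow path, when a slab is returned.
  void slab_returned()
  {
    slabs_in_use.fetch_sub(1, std::memory_order_relaxed);
  }

  // Upper bound on live objects of this sizeclass: every held slab
  // is counted as if it were completely full.
  size_t current_usage_upper_bound()
  {
    return slabs_in_use.load(std::memory_order_relaxed) * objects_per_slab;
  }
};
```

The fast path then carries no statistics work at all; the cost is that the reported usage can exceed the true live count by up to one slab's worth of objects per partially filled slab.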
It would be great if this could be rebased onto main (even if not landed); I gave it a shot, but some of the conflicts were too weird for me to figure out. I don't care about landing or a perf regression, I just want to get some better stats around utilization for some local testing/comparisons.
This adds a collection of per-sizeclass statistics for tracking how many allocations have occurred on each thread. These are racily combined to provide basic tracking information.
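The racy combination described above could be sketched as follows (illustrative only, not snmalloc's actual code): each allocator keeps its own relaxed atomic counters per sizeclass, and a reader sums them without stopping the threads. The totals can be slightly stale, but individual counters never tear.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Small sizeclass count purely for illustration.
constexpr size_t NUM_SIZECLASSES = 4;

// Hypothetical per-allocator statistics block.
struct AllocatorStats
{
  std::array<std::atomic<size_t>, NUM_SIZECLASSES> allocated{};

  // Fast-path update: relaxed, no synchronisation with readers.
  void count_alloc(size_t sizeclass)
  {
    allocated[sizeclass].fetch_add(1, std::memory_order_relaxed);
  }
};

// Racy global snapshot: sum each allocator's counter with relaxed
// loads. Not a consistent point-in-time view, but good enough for
// coarse tracking.
size_t combined_count(const AllocatorStats* allocs, size_t n, size_t sizeclass)
{
  size_t total = 0;
  for (size_t i = 0; i < n; i++)
    total += allocs[i].allocated[sizeclass].load(std::memory_order_relaxed);
  return total;
}
```

Because both sides use relaxed operations, the reader may miss in-flight updates, which matches the "racily combined" caveat in the description.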
@akrieger I have rebased and it seems to pass tests, but this is currently minimally tested. I'll set off a perf run to check what the regression is. Please let me know what kind of API you would like for accessing the stats. Also, how accurate do you want the statistics to be? This currently tracks individual allocations and deallocations for each sizeclass, but we might want to track the number of allocations and deallocations at a coarser granularity to reduce the performance impact. I would either make the reasonably accurate statistics available under a compile flag, or an over-approximating system which is always on. Or possibly both.
The perf results look similar to before, but now we have prettier results. Currently, I don't think the performance is good enough for an always-on feature, so either we need to add compile flags and some more CI targets, or reduce the accuracy and make it always on.
I personally would like an accurate system over a performant one, but that's because I'm doing offline evaluation of various allocator options to decide which to use :)

There are two main questions I have not gotten good answers for when comparing/evaluating the various allocators: what is my fragmentation/utilization like, and can I tune my size classes to get better results for my specific workloads (and to what specific sizes)? Right now all I can see are very high-level patterns like "on this test suite, snmalloc is consistently 0-100MB higher in RSS than mimalloc v3, but also seems to more aggressively return memory to the kernel". But that extra 100MB might come at a bad time for an old Android device and cause it to OOM instead, so I want to know whether that's usable memory that will absorb incoming allocations, or relatively permanently fragmented space.

The API doesn't have to be particularly fast to answer either question. Something like a function call which returns or prints a list of stats, say the amount of used/fragmented/free space per slab or bucket or whatever the internal unit of allocation is (apologies, I haven't dug into it that deeply), which I can then print out at my convenience and post-process in a spreadsheet app. It can take however long it needs to walk the internal structures in that case.

I wrote up this entire comment, by the way, without having reminded myself what the original PR summary was, and I see now that what I'm asking for is exactly what this PR was originally intended for :)

(For comparison, until now my memory debugging tool has been the Visual Studio memory profiler, which is... great for debugging specific allocations but not good for high-level statistics.)
@akrieger thanks. I think what is there should be fairly usable based on your description. It dumps to stderr, so hopefully not too interleaved with your output. We can move to a file, but that would be a reasonable amount of work to add for all platforms (we have a lot of them). You can then grep and post-process the output.

This doesn't stop the threads and uses relaxed reads and writes, so it isn't a technically correct snapshot, but for the kind of analysis you want, and that I used it for, it should be accurate enough.

The 100MB you mentioned: what is the overall footprint? I am interested to know how much overhead we have over mimalloc for your application.
Up to 100MB out of 1-2GB, so 5-10%, but it's rounded to 0.1GB in the UI and I haven't dug deeper into that aspect yet (I'm still mostly spending my time analyzing CPU/wall time).
This adds some statistics for tracking allocations. The per-sizeclass statistics are tracked per allocator, and a racy read is done to combine the results for display.
These statistics were used while debugging #615 to calculate the fragmentation. The displayed statistics are intended for post-processing to calculate fragmentation/utilisation.
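The kind of post-processing the dumped statistics enable could look like this sketch. The field names below are hypothetical, not the actual dump format: utilisation of a sizeclass is the bytes in live objects over the bytes held in slabs for that sizeclass, and the remainder is fragmentation/overhead.

```cpp
#include <cstddef>

// Hypothetical snapshot of one sizeclass, reconstructed from the
// dumped statistics.
struct SizeclassSnapshot
{
  size_t object_size;  // bytes per object in this sizeclass
  size_t live_objects; // current allocations minus deallocations
  size_t slab_bytes;   // bytes of slabs held for this sizeclass
};

// Returns utilisation in [0, 1]; 1 - utilisation is the
// fragmented/unused share of the memory held for this sizeclass.
double utilisation(const SizeclassSnapshot& s)
{
  if (s.slab_bytes == 0)
    return 1.0; // nothing held, nothing wasted
  return static_cast<double>(s.live_objects * s.object_size) /
    static_cast<double>(s.slab_bytes);
}
```

For example, 100 live 16-byte objects in 4096 bytes of slabs gives a utilisation of about 0.39, i.e. roughly 61% of that sizeclass's memory is currently unused or fragmented.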
The interface just prints the results to stderr via the message output; this could be improved with a better logging infrastructure.