Added long running commands runtime to log files. #5260

annapetrosyan26 · 2024-06-18T11:59:27Z

Added logging of the runtime and peak memory for the following commands:

global_place
rtl_macro_placer
repair_design
repair_timing
global_route
detailed_route

Updated the regression tests diff checking to ignore runtime lines.

Signed-off-by: Anna Petrosyan <[email protected]>

github-actions · 2024-06-18T12:09:17Z

clang-tidy review says "All clean, LGTM! 👍"

annapetrosyan26 · 2024-06-18T12:15:51Z

@stefanottili @rovinski @oharboe @maliberty @QuantamHD this PR addresses the #5003 issue.
Could you please take a look? The changes have been locally tested.

oharboe

Wrote up some feedback on what I can think of.

Also, there was a request to make this non-default option so that logs/reports don't become any more non-idempotent than they already are.

If OpenROAD wants to be an "examplary citizen" for bazel type build systems, then it should be a way to configure OpenROAD logs and reports should to idempotent... This of course means forgoing some types of logging and reporting like running times and memory consumption but this is understood and accepted in a bazel context.

oharboe · 2024-06-18T12:24:54Z

src/utl/src/timer.cpp

+
+ScopedStatistics::~ScopedStatistics()
+{
+  logger_->report("Runtime {} seconds, memory used {} KB.",


This output leaves me with questions...

Scope? What took this long and used this much memory. What was the memory usage at start of scope compared to the peak usage.

The request to add runtime info is originated from #3539 (comment) , so the scope is going to be clear, however I agree with your note about start memory consumption, so I suggest the following format for the log:
Runtime: 0.000545613 seconds, memory: start - 96123222 KB, peak - 105279488 KB.

Will this work?

This doesn't describe the scope, what was run during this time.

I think I mostly would care to know the peak on top of the starting memory consumption and in megabytes. kilobytes is too small a unit in 2024 for humans...

DRT already outputs in this format

[INFO DRT-0267] cpu time = 00:00:04, elapsed time = 00:00:01, memory = 2197.27 (MB), peak = 4031.35 (MB)

Is there a reason to have another?

That prefix is nice and gives some context, but there should be a human description of the scope of what is being measured. In some cases it would be unmabigious from that prefix, but e.g. detailed routing measures for each iteration, for instance. So some further description of context could be useful.

We can also add the command name before reporting its runtime to be clear about the scope.

I'm fine to add a prefix

The command name before reporting its runtime is already added. But by adding prefix like this [INFO DRT-0267] we should also create a unique message ID for each tool as Logger::info requires it, is this your suggestion?

oharboe · 2024-06-18T12:29:08Z

src/utl/src/timer.cpp

+size_t ScopedStatistics::getPeakMemoryUsage()
+{
+  struct rusage rusage;
+  getrusage(RUSAGE_SELF, &rusage);


error handling for getrusage missing. There may not be access rights to get this information or otherwise it is not available?

Actually, what I want to know is maximum virtual memory, not maximum resident memory set. The only reason why ORFS is settling for maximum resident memory set is because ps doesn't measure maximum virtual memory.

A discussion on cores and max virtual memory vs. maximum resident memory set.

If you can't measure it you can't report it so I'm not sure what is requested here.

I would argue that max virtual memory is not very interesting in 64bit computing. You'll never run out of virtual address space and thrashing is a resident set size issue.

max virtual address space can be measured in C++ easily enough, it just the time command that doesn't support it, which makes it awkward for ORFS to report it.

I disagree that virtual memory size isn't interesting. resident set size will always be smaller than physical memory and and doesn't give any indication that you have too large a working set and that the system could be thrashing. That said, even if max virtual memory is larger than physical memory, doesn't mean that there will be thrashing.

How is it easily measured?

rss ~= physical memory is a warning sign. vsz > physical memory doesn't say much.

rss says something about what is happening on the computer it is running on, vsz says something about how much memory is required. As an OpenROAD developer, I don't care about my own computer, I only care about how much memory OpenROAD is using. As a user of OpenROAD, I don't care about the computer that is failing, I already know it is failing, I want to know how much memory is needed.

So, I'm interested in knowing vsz, more than rss. I did run tests with rss and it only tells me something about how much memory is used when I'm not overcommitted. If I'm overcommitted, rss doesn't tell me anything as it will be somewhat close to physical memory. It doesn't tell me how much memory I need.

I don't have a mac, but when I compile and run this on Ubuntu, I get the output below:

#include <iostream> #include <unistd.h> #ifdef __linux__ #include <fstream> #elif __APPLE__ #include <mach/mach.h> #endif void getMemoryUsage(size_t &rss, size_t &vsz) { #ifdef __linux__ std::ifstream statm("/proc/self/statm"); if (!statm.is_open()) { std::cerr << "Error opening /proc/self/statm" << std::endl; return; } size_t pageSize = sysconf(_SC_PAGESIZE); size_t size, resident, share, text, lib, data, dt; statm >> size >> resident >> share >> text >> lib >> data >> dt; rss = resident * pageSize; // Convert from pages to bytes vsz = size * pageSize; // Convert from pages to bytes statm.close(); #elif __APPLE__ mach_task_basic_info info; mach_msg_type_number_t infoCount = MACH_TASK_BASIC_INFO_COUNT; if (task_info(mach_task_self(), MACH_TASK_BASIC_INFO, (task_info_t)&info, &infoCount) != KERN_SUCCESS) { std::cerr << "Error getting task info" << std::endl; return; } rss = info.resident_size; vsz = info.virtual_size; #endif } int main() { size_t rss = 0, vsz = 0; getMemoryUsage(rss, vsz); std::cout << "RSS: " << rss << " bytes" << std::endl; std::cout << "VSZ: " << vsz << " bytes" << std::endl; return 0; }

Output:

$ ./a.out RSS: 3440640 bytes VSZ: 6377472 bytes

Would it be more convenient to have both RSS and VSZ information?

For my part, I think we need both. Over time, we may develop better understanding of how they interact with OpenROAD and decide we only need rsz or rss and why only one is sufficient, but for now I'm definitely missing rsz when analyzing how much more resources I need to give a build server...

oharboe · 2024-06-18T12:33:24Z

src/utl/test/test_statistics.py

@@ -0,0 +1,42 @@
+import os


hook up to CI?

Isn't using pathlib or os a matter of taste ?

What I mean with "CI" is continuous integration, I wanted to know if this test is hooked up to the tests run on pull requests in github.

Regarding os vs pathlib, I think pathlib is more recent and a bit more elegant(but a little bit more complex than os) in the Python world but that os is more familiar(I know it better). I have no strong opinion on either for such simple scripts as these. It depends on what my colleagues prefer.

This test is included in the regression, so it's in CI.

oharboe · 2024-06-18T12:35:10Z

test/regression.tcl

-              puts " *FAIL*$error_msg"
-              append_failure $test
-              incr errors(fail)
+


this proc is already quite long. Break out a new proc and give the proc a good name so that it is clear what is being tested. Testing for regressions in maximum resident set is bound to have very significant variations...

annapetrosyan26

Regarding to make this as an option: the request has a pending question which is not answered yet so I made it as default.

annapetrosyan26 · 2024-06-18T15:10:21Z

src/utl/test/test_statistics.py

@@ -0,0 +1,42 @@
+import os


Isn't using pathlib or os a matter of taste ?

annapetrosyan26 · 2024-06-18T15:10:30Z

src/utl/src/timer.cpp

+
+ScopedStatistics::~ScopedStatistics()
+{
+  logger_->report("Runtime {} seconds, memory used {} KB.",


The request to add runtime info is originated from #3539 (comment) , so the scope is going to be clear, however I agree with your note about start memory consumption, so I suggest the following format for the log:
Runtime: 0.000545613 seconds, memory: start - 96123222 KB, peak - 105279488 KB.

Will this work?

stefanottili · 2024-06-18T20:46:26Z

Would it be possible to put the name of the command in the same line ?
The whole idea was to be able to grep for a keyword to easily get runtime info out of the verbose log file.

maliberty · 2024-06-18T21:25:29Z

Also, there was a request to make this non-default option so that logs/reports don't become any more non-idempotent than they already are.

I disagreed then and still do. The goal is to have an informative log file by default. If you want to give up information to get idempotency that should be the non-default option.

oharboe · 2024-06-19T01:57:31Z

Also, there was a request to make this non-default option so that logs/reports don't become any more non-idempotent than they already are.

I disagreed then and still do. The goal is to have an informative log file by default. If you want to give up information to get idempotency that should be the non-default option.

I wanted to mention it, because it came up last time.

I do prefer this sort of information in logs by default and I don't know bazel well enough to understand why or how non-idempotent log artifacts cause a problem or if they are just an eyesore of sorts. I haven't observed any problems as such. Non-idempotent logs don't surprise me and isn't a problem for me. Looking forward to an explanation of the issue here for such build systems. bazel is one example, there are other... Certainly with build times of 24 hours, artifacts are required, ORFS alone doesn't cut it.

annapetrosyan26 · 2024-06-19T09:30:24Z

Would it be possible to put the name of the command in the same line ? The whole idea was to be able to grep for a keyword to easily get runtime info out of the verbose log file.

We can put the name of the command in the same line.

oharboe · 2024-06-19T10:39:28Z

Could you show some example output in this PR after the next update?

annapetrosyan26 · 2024-06-19T11:00:58Z

Could you show some example output in this PR after the next update?

The log file should contain an information like the example below:
repair_design: runtime = 0.000545613 (seconds), rsz start = 2197.27 (MB), vsz start = 4031.35 (MB), rsz peak = 2197.27 (MB), vsz peak = 4031.35 (MB).
Will this work?

oharboe · 2024-06-19T11:23:02Z

Could you show some example output in this PR after the next update?

The log file should contain an information like the example below: repair_design: runtime = 0.000545613 (seconds), rsz start = 2197.27 (MB), vsz start = 4031.35 (MB), rsz peak = 2197.27 (MB), vsz peak = 4031.35 (MB). Will this work?

Round off to integer seconds & megabytes.
Now I have to mentally do the math of peak - start, which is the information I care about. So the log line could do this math for me. net usage rsz = 1234/MB, vsz 12552/MB, peak rsz = 13534/MB, vsz 1552/MB
If this log line line can be squeezed into 80 chars while being legible, that's a boon...

Signed-off-by: Anna Petrosyan <[email protected]>

github-actions

clang-tidy made some suggestions

src/utl/src/timer.cpp

Signed-off-by: Anna Petrosyan <[email protected]>

github-actions · 2024-06-20T06:38:29Z

clang-tidy review says "All clean, LGTM! 👍"

src/utl/include/utl/timer.h

oharboe · 2024-06-20T07:07:06Z

src/utl/src/timer.cpp

+      while (*p < '0' || *p > '9') {
+        p++;
+      }
+      result = (size_t) atoi(p);


if this input was human made, I would recommend more error handling, like checking that the number is a valid decimal integer and not something like 00031243, but this is well defined machine generated input, an API to the OS, so I think this is fine as it is.

oharboe

👍 Looks good to me. I don't know anything about that tcl testing code/CI integration. Leaving that for others to review...

maliberty · 2024-06-20T15:33:09Z

src/gpl/src/nesterovPlace.cpp

@@ -62,6 +63,7 @@ NesterovPlace::NesterovPlace(const NesterovPlaceVars& npVars,
                             utl::Logger* log)
    : NesterovPlace()
 {
+  utl::ScopedStatistics stat(log, "global_place");


This should go in Replace::doNesterovPlace to capture initial placement as well.

maliberty · 2024-06-20T15:37:41Z

src/rsz/src/RepairDesign.cc

@@ -93,6 +94,7 @@ void RepairDesign::repairDesign(double max_wire_length,
                                bool verbose)
 {
  init();
+  utl::ScopedStatistics stat(logger_, "repair_design");


Move to Resizer::repairDesign. This API might be called internally.

maliberty · 2024-06-20T15:38:50Z

src/rsz/src/RepairHold.cc

@@ -93,6 +94,7 @@ void RepairHold::repairHold(
    const bool verbose)
 {
  init();
+  utl::ScopedStatistics stat(logger_, "repair_timing");


Move to Resizer::repairHold

maliberty · 2024-06-20T15:39:15Z

src/rsz/src/RepairHold.cc

@@ -132,6 +134,7 @@ void RepairHold::repairHold(const Pin* end_pin,
                            const int max_passes)
 {
  init();
+  utl::ScopedStatistics stat(logger_, "repair_timing");


Move to Resizer::repairHold

maliberty · 2024-06-20T15:41:32Z

src/utl/test/test_statistics.py

+def main():
+    script_dir = os.path.dirname(os.path.abspath(__file__))
+    base_dir = os.path.abspath(os.path.join(script_dir, "../../rsz/test"))
+    log_path = os.path.join(base_dir, "results", "repair_design4-tcl.log")
+    regression_script_path = os.path.join(base_dir, "regression")
+
+    if not os.path.exists(log_path):
+        if not run_regression_script(regression_script_path, "repair_design4"):
+            exit(1)


Why is a regression test calling regression again? This will create a conflict with the results of running repair_design4 (especially if run in parallel with this test). It is better to make this test self-contained.

maliberty · 2024-06-20T16:00:33Z

src/utl/src/timer.cpp

+
+ScopedStatistics::~ScopedStatistics()
+{
+  logger_->report(msg_ + ": runtime {} seconds, usage: rsz = {} MB, vsz = {} MB, peak: rsz = {} MB, vsz = {} MB", static_cast<int>(Timer::elapsed()), (getPeakRSZ() - start_rsz_),


runtime is unclear as it could be elapsed or cpu time. I think using elapsed is clearer.

Personally I think peak is sufficient (@oharboe any opinion)?

If we don't know the starting memory consumption, we can miss false positives: the problem is not with the algorithm that is reported as too high usage, but something in the setup.

This doesn't tell you the starting memory - only the end - start

correct. pick a way to present two numbers....

start memory is trivially calculated from peak and usagr, true for any cpmbination of these two numbers. most time usage and peak is what is of interested

usage and peak do not give start. The peak may well be in the middle not at the end.

Which is what peak should say and why I don't think usage matters much.

Agreed. Keep it simple. If there is something wrong(large initial memory allocation outside of scope), then it can be debugged otherwise. I think normal operation, determining required RAM, is more important and other information clouds that...

so let's go with the DRT format above (with prefix)

From the conversation below it's not clear what the final message should be. Please confirm if it should look like the DRT message you provided.
[INFO DRT-0267] cpu time = 00:00:04, elapsed time = 00:00:01, memory = 2197.27 (MB), peak = 4031.35 (MB)
Otherwise please provide the exact way you would like the message to be.

I was thinking the prefix would be about what is being timed, eg

repair_design: cpu time = 00:00:04, elapsed time = 00:00:01, memory = 2197.27 (MB), peak = 4031.35 (MB)

Though the prefix would be up to the client code.

stefanottili · 2024-06-22T04:49:08Z

It would be nice to have one common prefix that one can grep for.

maliberty · 2024-06-24T12:00:04Z

src/utl/src/timer.cpp

+  FILE* file = fopen("/proc/self/status", "r");
+  if (file == nullptr) {
+    perror("Failed to open /proc/self/status");
+    exit(EXIT_FAILURE);


This is too harsh a form of error handling. Issuing a logger warning is sufficient.

maliberty · 2024-06-24T12:00:59Z

src/utl/src/timer.cpp

+
+size_t ScopedStatistics::getMemoryUsage(const char* tag)
+{
+  FILE* file = fopen("/proc/self/status", "r");


procfs doesn't exist on OSX unfortunately. src/drt/src/db/infra/frTime_helper.cpp attempts to handle OSX

Signed-off-by: Anna Petrosyan <[email protected]>

github-actions

clang-tidy made some suggestions

src/utl/include/utl/timer.h

src/utl/src/timer.cpp

Signed-off-by: Anna Petrosyan <[email protected]>

github-actions

clang-tidy made some suggestions

src/utl/src/timer.cpp

Signed-off-by: Anna Petrosyan <[email protected]>

github-actions

clang-tidy made some suggestions

src/utl/src/timer.cpp

Signed-off-by: Anna Petrosyan <[email protected]>

github-actions · 2024-06-28T06:45:05Z

clang-tidy review says "All clean, LGTM! 👍"

Signed-off-by: Anna Petrosyan <[email protected]>

github-actions

clang-tidy made some suggestions

src/utl/src/timer.cpp

Signed-off-by: Anna Petrosyan <[email protected]>

github-actions · 2024-06-28T09:27:41Z

clang-tidy review says "All clean, LGTM! 👍"

annapetrosyan26 · 2024-06-28T10:36:32Z

@maliberty the appropriate changes have been made. Could you please review them?

annapetrosyan26 added 2 commits June 18, 2024 11:56

Added long running commands runtime to log files, added unit test

b4746e1

Signed-off-by: Anna Petrosyan <[email protected]>

Merge branch 'master' into issue5003

3d2793c

Signed-off-by: Anna Petrosyan <[email protected]>

oharboe suggested changes Jun 18, 2024

View reviewed changes

annapetrosyan26 commented Jun 18, 2024

View reviewed changes

Added vsz, rsz memory usage, added command name in log file.

139a866

Signed-off-by: Anna Petrosyan <[email protected]>

github-actions bot reviewed Jun 19, 2024

View reviewed changes

Removed unused define

fe02c84

Signed-off-by: Anna Petrosyan <[email protected]>

annapetrosyan26 requested review from maliberty and oharboe June 20, 2024 07:02

oharboe reviewed Jun 20, 2024

View reviewed changes

src/utl/include/utl/timer.h Outdated Show resolved Hide resolved

oharboe reviewed Jun 20, 2024

View reviewed changes

oharboe approved these changes Jun 20, 2024

View reviewed changes

maliberty requested changes Jun 20, 2024

View reviewed changes

maliberty reviewed Jun 24, 2024

View reviewed changes

annapetrosyan26 added 2 commits June 25, 2024 16:01

Merge branch 'master' into issue5003

1feca9a

Signed-off-by: Anna Petrosyan <[email protected]>

Change logging format and unit test

d38cfde

Signed-off-by: Anna Petrosyan <[email protected]>

annapetrosyan26 requested a review from maliberty June 25, 2024 13:53

Restore RepairHold

a3adf9f

Signed-off-by: Anna Petrosyan <[email protected]>

github-actions bot reviewed Jun 25, 2024

View reviewed changes

src/utl/include/utl/timer.h Outdated Show resolved Hide resolved

src/utl/include/utl/timer.h Outdated Show resolved Hide resolved

src/utl/src/timer.cpp Outdated Show resolved Hide resolved

rovinski reviewed Jun 26, 2024

View reviewed changes

src/utl/src/timer.cpp Outdated Show resolved Hide resolved

annapetrosyan26 added 2 commits June 27, 2024 10:26

Merge branch 'master' into issue5003

7757589

Signed-off-by: Anna Petrosyan <[email protected]>

Change header files

f86f009

Signed-off-by: Anna Petrosyan <[email protected]>

github-actions bot reviewed Jun 27, 2024

View reviewed changes

src/utl/src/timer.cpp Outdated Show resolved Hide resolved

Improve format

2fba1c3

Signed-off-by: Anna Petrosyan <[email protected]>

github-actions bot reviewed Jun 27, 2024

View reviewed changes

src/utl/src/timer.cpp Outdated Show resolved Hide resolved

Remove unused parameters

e18535d

Signed-off-by: Anna Petrosyan <[email protected]>

Changes related to defines

d101470

Signed-off-by: Anna Petrosyan <[email protected]>

github-actions bot reviewed Jun 28, 2024

View reviewed changes

src/utl/src/timer.cpp Show resolved Hide resolved

Add checks for unsupported OS

fd70ec0

Signed-off-by: Anna Petrosyan <[email protected]>

oharboe mentioned this pull request Jun 29, 2024

Out of Memory Error The-OpenROAD-Project/megaboom#65

Closed

Added long running commands runtime to log files. #5260

Are you sure you want to change the base?

Added long running commands runtime to log files. #5260

Conversation

annapetrosyan26 commented Jun 18, 2024

github-actions bot commented Jun 18, 2024

annapetrosyan26 commented Jun 18, 2024

oharboe left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

annapetrosyan26 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stefanottili commented Jun 18, 2024

maliberty commented Jun 18, 2024

oharboe commented Jun 19, 2024

annapetrosyan26 commented Jun 19, 2024

oharboe commented Jun 19, 2024

annapetrosyan26 commented Jun 19, 2024

oharboe commented Jun 19, 2024

github-actions bot left a comment

Choose a reason for hiding this comment

github-actions bot commented Jun 20, 2024

Choose a reason for hiding this comment

oharboe left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stefanottili commented Jun 22, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot left a comment

Choose a reason for hiding this comment

github-actions bot left a comment

Choose a reason for hiding this comment

github-actions bot left a comment

Choose a reason for hiding this comment

github-actions bot commented Jun 28, 2024

github-actions bot left a comment

Choose a reason for hiding this comment

github-actions bot commented Jun 28, 2024

annapetrosyan26 commented Jun 28, 2024