00:21:03  * dbevenius joined
00:26:09  * dbevenius quit (Ping timeout: 272 seconds)
00:36:40  <Trott>AIX is totally the source of the backlog now, not Windows at all (since that question comes up every week).
00:36:52  <Trott>But the fix will be coming soon with better disk I/O. Go team!
00:37:20  <refack>I'm looking for low-hanging perf fruit
00:44:10  <Trott>Is there a "how coverage.nodejs.org daily reports are generated" doc somewhere?
00:44:16  <Trott>Or a Jenkins job I can look at or something?
00:45:25  <Trott>Maybe this? https://ci.nodejs.org/view/All/job/node-test-commit-linux-coverage-daily/
00:45:29  <refack>https://ci.nodejs.org/view/Node.js%20Daily/job/node-test-commit-linux-coverage-daily/configure
00:45:34  <refack>Yep
00:52:33  <Trott>So I'm going to add `test-pummel` and `test-benchmark` to stuff run in the coverage job. My understanding is that there's not really anywhere we document changes like this, but I suppose I can open an issue in the repo explaining what I did and then close the issue?
00:53:28  <refack>current practice is an issue with `ci-change` title and label 🤷‍♂️
00:54:00  <refack>If there's previous issue discussion, you can also add a reference in a comment
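Going back to the suites Trott mentions adding above, the extra step could look roughly like the sketch below. This is an assumption-heavy sketch: it supposes the coverage job drives the stock `tools/test.py` runner directly (the real job may go through a `make coverage` target instead), and `$WORKSPACE` is the usual Jenkins-provided variable.

```bash
#!/usr/bin/env bash
# Sketch of an extra shell step for the coverage job (how the real job invokes
# the tests is an assumption; it may use a make target instead).
set -euo pipefail

cd "$WORKSPACE"   # the job's checkout of nodejs/node

# Run the additional suites against the instrumented build so their execution
# counts land in the same coverage data as the default suites.
python tools/test.py --mode=release pummel benchmark
```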
00:56:29  * node-gh joined
00:56:29  * node-gh part
00:58:49  * node-gh joined
00:58:49  * node-gh part
00:59:23  <Trott>Thanks. I opened an issue.
00:59:26  * node-gh joined
00:59:26  * node-gh part
01:04:05  <Trott>Whoa. That coverage machine is fast.
01:04:51  <refack>It's one of 3 monsters provisioned for benchmarking
01:09:55  <Trott>Some pummel tests are still timing out on it, which suggests deadlock/race-conditions to me. The pummel stuff has been fun, to be honest. I'm wondering why I put it off for so long.
01:10:24  <refack>JJJ
01:10:47  <refack>But they probably still contribute to the coverage anyway
01:14:45  <Trott>Will the results automatically appear at coverage.nodejs.org or is there something else that I would have to do for that?
01:15:01  <Trott>(I mean, I guess I could just wait for the daily job, but if I want the results of the job I just kicked off to appear there?)
01:15:10  <refack>Some cache thing
01:16:57  <Trott>So I just need to wait for a little bit? Adding pummel and benchmark brought the C++ coverage back up to 91% which I expected. But it also looks like the JS coverage *decreased* from 95.41% to 92.75%. I want to dive more into the results to see what happened, and coverage.nodejs.org is the easiest way to do that.
01:17:25  <refack>There should be a way... I'm checking the configs
01:21:40  <refack>Best I can do is https://ci.nodejs.org/view/Node.js%20Daily/job/node-test-commit-linux-coverage-daily/nodes=benchmark/ws/coverage/*zip*/coverage.zip
01:21:52  <refack>From the bottom of https://ci.nodejs.org/view/Node.js%20Daily/job/node-test-commit-linux-coverage-daily/nodes=benchmark/ws/coverage/
01:24:28  <Trott>OK, we'll see if I can get motivated to dig into the zip file once it finishes downloading. :-D Thanks!
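For anyone following along, grabbing that workspace zip from the command line could look like this. The URL is the one pasted above; the local paths are just examples, and the workspace view may require being signed in to Jenkins.

```bash
#!/usr/bin/env bash
# Pull the raw coverage report straight from the Jenkins workspace and unpack it.
set -euo pipefail

url='https://ci.nodejs.org/view/Node.js%20Daily/job/node-test-commit-linux-coverage-daily/nodes=benchmark/ws/coverage/*zip*/coverage.zip'

curl -fSL "$url" -o /tmp/coverage.zip
unzip -q /tmp/coverage.zip -d /tmp/coverage
ls /tmp/coverage   # browse index.html etc. locally
```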
01:34:17  <refack>So there's a cron that promotes the uploaded coverage report at some time in some TZ
01:34:18  <refack>https://github.com/nodejs/build/blob/55714b80ab6aed964da19501900d53d2704ee831/setup/www/tasks/site-setup.yaml#L118
01:36:36  <Trott>I wonder if the build-site.sh iojs two lines above can go since iojs.org now just redirects to nodejs.org?
01:46:58  <refack>Probably, but it might also be that this playbook is out of sync. It's mostly for reference, not a source of truth
01:50:50  * dbevenius joined
01:50:52  <refack>Ohh, look what I found https://github.com/nodejs/build/blob/master/jenkins/scripts/coverage/README.md
01:55:10  * dbevenius quit (Ping timeout: 244 seconds)
02:12:47  <Trott>Well, all right!
02:13:11  * dbevenius joined
02:14:00  <Trott>I guess https://github.com/nodejs/build/blob/master/jenkins/scripts/coverage/README.md#transfer-to-benchmarking-data-machine is what needs to happen...
02:14:21  <Trott>Oh, well, I guess the rsync from the previous section actually?
02:14:57  <Trott>Oh, but I guess since it's an infra-* machine, neither you nor I have access....
02:15:06  <Trott>Harumph.
02:15:13  <refack>First `rsync` is part of the job (copy to benchmark machine)
02:15:37  <refack>Cron that runs `* */4` does second rsync from `benchmark` to `www`
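A rough sketch of the two-hop copy refack describes, with placeholder hosts and paths rather than the real infra values:

```bash
#!/usr/bin/env bash
# Rough shape of the two-hop promotion described above
# (hosts and paths are placeholders, not the real machines).
set -euo pipefail

# 1) End of the coverage job: push the generated report to the benchmark host.
rsync -az --delete coverage/ coverage@benchmark.example.org:/home/coverage/out/

# 2) On the www host, a cron entry on the `* */4` schedule mentioned above then
#    pulls the report into the docroot served as coverage.nodejs.org. Roughly:
#      * */4 * * *  rsync -az coverage@benchmark.example.org:/home/coverage/out/ /var/www/coverage/
```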
02:18:01  * dbevenius quit (Ping timeout: 268 seconds)
02:31:02  * dbevenius joined
02:33:20  * node-gh joined
02:33:20  * node-gh part
02:33:35  * node-gh joined
02:33:35  * node-gh part
02:35:58  * dbevenius quit (Ping timeout: 272 seconds)
02:52:01  * dbevenius joined
02:56:43  * dbevenius quit (Ping timeout: 245 seconds)
03:29:06  * dbevenius joined
05:10:43  * node-gh joined
05:10:43  * node-gh part
05:15:43  * node-gh joined
05:15:43  * node-gh part
05:17:01  * node-gh joined
05:17:01  * node-gh part
05:46:31  * node-slack-bot part
05:46:43  * node-slack-bot joined
11:18:09  <Trott>Jenkins has been saying it's going to shut down for many hours now, and jobs aren't starting, but...it's not shutting down. Anyone know what's going on?
12:07:32  <Trott>Jenkins no longer says it's going to shut down, but jobs still don't seem to be starting?
12:08:08  <Trott>"All nodes of label ‘jenkins-workspace’ are offline"
12:44:34  <refack>All nodes are offline. Seems like the workers can't phone home:
12:44:45  <refack>https://www.irccloud.com/pastebin/LFPpuwC9/
12:45:02  <refack>Too early for me to dig into this... I'll try later
12:49:29  * dbevenius quit (Ping timeout: 268 seconds)
12:53:04  * dbevenius joined
12:57:47  * dbevenius quit (Client Quit)
12:58:59  * dbevenius joined
12:59:34  * dbevenius quit (Client Quit)
13:46:09  * node-gh joined
13:46:09  * node-gh part
14:02:12  <mhdawson__>Took a quick look at the machine and nothing jumped out
14:02:19  <mhdawson__>same for the logs in Jenkins
14:05:39  <mhdawson__>firewall tables still look ok on a quick sniff check.
14:05:50  <mhdawson__>and it seems like workers can ping ci.nodejs.org
14:12:58  <mhdawson__>Talking with George about just doing a restart.
14:15:53  <node-slack-bot>[george.adams] restarted using the GUI but no luck
14:23:32  <mhdawson__>Is it normal to have 2 processes running with -jar .../jenkins.war?
14:24:00  <mhdawson__>Both of the processes also seem to show a Jan 29 start date
14:24:29  <node-slack-bot>[george.adams] we shouldn't do
14:24:54  <mhdawson__>maybe we should shutdown, verify that everything is stopped and then restart
14:25:05  <node-slack-bot>[george.adams] yeah I would restart the machine and see what happens
14:25:13  <mhdawson__>although not up to speed on how we restart jenkins
14:25:29  <mhdawson__>I was thinking just jenkins but I guess a machine restart should accomplish the same
14:25:40  <node-slack-bot>[george.adams] yeah
14:25:53  <mhdawson__>ok reboot then. Do you want to do that or should I?
14:26:05  <node-slack-bot>[george.adams] I don't have access :\
14:26:12  <mhdawson__>ok I'll do it now
14:26:32  <mhdawson__>although is there a shutdown option? Might be nice to shut down as cleanly as we can
14:26:53  <mhdawson__>a /stop or something like that?
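The checks being discussed could look roughly like this; the service name, the use of `systemctl`, and the "admin" user plus API-token auth are assumptions about how the master is set up.

```bash
#!/usr/bin/env bash
# Sketch of the pre-restart checks discussed above (service name, user and
# auth method are assumptions; expects $API_TOKEN in the environment).
set -euo pipefail

# Is more than one master process running?
pgrep -af 'jenkins\.war' || echo "no jenkins.war process found"

# Ask Jenkins to stop scheduling new builds before a restart
# (same effect as the "Prepare for shutdown" GUI action).
curl -fsS -X POST -u "admin:$API_TOKEN" https://ci.nodejs.org/quietDown

# Then restart the service, and only fall back to a machine reboot if needed.
sudo systemctl restart jenkins
```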
14:27:09  <node-slack-bot>[george.adams] I've prepared Jenkins for shutdown
14:27:12  <node-slack-bot>[george.adams] should be safe now
14:27:26  <mhdawson__>ok will do
14:28:50  <mhdawson__>that was a fast reboot
14:28:56  <mhdawson__>jenkins seems to be restarting now
14:30:11  <mhdawson__>hmm, still no luck connecting
14:31:30  <mhdawson__>I'm not really sure it rebooted
14:31:35  <mhdawson__>nevermind
14:31:52  <mhdawson__>was looking at wrong machine
14:38:49  <mhdawson__>https://www.irccloud.com/pastebin/dtA1c203/
14:39:09  <mhdawson__>Would have expected to see the tcp line in there as well
14:40:08  <node-slack-bot>[george.adams] yeah that's strange
14:47:39  <mhdawson__>Does look like jenkins is configured to use 49187 though
14:51:50  <mhdawson__>looks like the port is active and accessible locally though
14:52:01  <mhdawson__>tried it via telnet from the ci machine and it connected
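The local-versus-remote check described here, sketched out (assuming `netstat` and `nc` are available on the hosts involved; the port number is the one under discussion at this point):

```bash
# Local-vs-remote reachability check for the JNLP agent port.
port=49187   # the port Jenkins was (at that point) configured to advertise

# On the master: is anything actually listening on it?
sudo netstat -tlnp | grep ":${port}" || echo "nothing listening on ${port}"

# From the master itself (should succeed if the listener is up):
nc -vz localhost "${port}"

# From a worker (fails if a firewall in between drops the port):
nc -vz ci.nodejs.org "${port}"
```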
15:58:34  <mhdawson__>Trying to find any logs that would indicate if firewall is involved or not
16:02:22  <refack>The interwebs has this advice: add `-Dhudson.TcpSlaveAgentListener.hostName=ci.nodejs.org`
16:03:02  <refack>Seems like the master is advertising `https://ci.nodejs.org/` which is a URL not a valid hostname
16:03:36  <mhdawson__>what would have changed that makes that necessary?
16:03:48  <mhdawson__>where would that go?
16:06:16  <refack>In the service script that starts java `/usr/bin/java -Djava.awt.headless=true -Dhudson.model.User ...... -jar /usr/share/jenkins/jenkins.war`
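With the suggested property added, the launch line would look approximately like this (most of the real service script's flags are omitted; only the new `-D` option matters here):

```bash
# Approximate shape of the launch line with the advertised-hostname property added.
/usr/bin/java \
  -Djava.awt.headless=true \
  -Dhudson.TcpSlaveAgentListener.hostName=ci.nodejs.org \
  -jar /usr/share/jenkins/jenkins.war
```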
16:06:48  <refack>Rod did a security update, but it seems like Jenkins restarted only last night
16:07:07  <mhdawson__>ah that would at least explain why we see a problem now
16:08:31  <mhdawson__>It might be better to fix Jenkins than having to update all of the workers
16:12:05  <mhdawson__>I added that and tried to connect and still got the same problem
16:26:24  <mhdawson__>How easy is it to roll back updates?
16:27:39  <node-slack-bot>[george.adams] depends if Rod kept the old war file
16:27:40  <refack>I think Rod uses `apt` to update, so it could be pinned https://help.ubuntu.com/community/PinningHowto
16:31:19  <mhdawson__>I see somebody restarted jenkins again?
16:34:01  <refack>Here's a list of all jenkins debs https://pkg.jenkins.io/debian-stable/
16:34:39  <refack>I think LKGR was 150.1
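If a rollback were needed, pinning the `jenkins` package with apt could look like the sketch below; it assumes the Debian-stable repo from pkg.jenkins.io is configured and that the last-known-good LTS refack recalls is 2.150.1.

```bash
#!/usr/bin/env bash
# Sketch: roll back to and pin a known-good Jenkins LTS (2.150.1 assumed per above).
set -euo pipefail

apt-cache madison jenkins                                    # versions available from pkg.jenkins.io
sudo apt-get install --allow-downgrades jenkins=2.150.1      # downgrade to the known-good .deb

# Pin it so a routine `apt-get upgrade` doesn't pull it forward again:
sudo tee /etc/apt/preferences.d/jenkins <<'EOF'
Package: jenkins
Pin: version 2.150.1
Pin-Priority: 1001
EOF
```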
16:45:14  <mhdawson__>I don't see anything in the change logs that would suggest problems related to this
16:46:21  <refack>Ok, I think I found the problem...
16:46:25  <refack>The port should be 41913
16:46:41  <refack>It probably got overwritten during the update
16:46:56  <mhdawson__>that would explain it
16:47:07  <mhdawson__>Seems like workers are coming back online
16:47:14  <refack>Most workers are coming back online
16:47:49  <refack>pffffffffffffffppppppffff 😤
16:48:14  <mhdawson__>I had seen 41913 in the firewall rules and wondered about that
16:48:35  <mhdawson__>but did not have enough context to know if that was setting the port for jnlp
16:49:04  <mhdawson__>Wish I knew iptables better, as it would have been obvious :)
16:49:49  <mhdawson__>That explains why I could telnet to it locally but not from external machines
16:50:16  <mhdawson__>I had suspected the firewall was blocking but had not yet found logs to confirm that
16:50:33  <mhdawson__>Will remember for next time to check the port
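The cross-check mhdawson describes, sketched out; the config path and the firewall rule layout are assumptions about how the master is set up.

```bash
# Does the firewall allow the same fixed JNLP port Jenkins is configured to advertise?

# Fixed agent port as stored in the master's config.xml
# (default Debian home dir assumed):
sudo grep -o '<slaveAgentPort>[0-9-]*</slaveAgentPort>' /var/lib/jenkins/config.xml

# Ports the firewall actually accepts:
sudo iptables -L INPUT -n | grep -E 'dpt:(41913|49187)' || echo "no matching rule"
```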
16:51:24  <refack>Found it in the config audit log
16:51:25  <refack>https://ci.nodejs.org/jobConfigHistory/showDiffFiles?name=config&timestamp1=2019-01-30_20-06-17&timestamp2=2019-01-31_07-01-47
16:51:34  <refack>It's not even highlighted
16:54:41  <mhdawson__>So the update changed it from 41913 to 0?
16:54:54  <refack>So it seems
16:55:15  <mhdawson__>but it was set as 49187
16:56:49  <refack>I flicked the check box from "random" to "fixed" and that's what came up. Then I googled it and found references so I assumed
16:57:39  <refack>(It might have been in the middle of the night...)
16:59:13  <mhdawson__>This seems to be the one https://ci.nodejs.org/jobConfigHistory/showDiffFiles?name=config&timestamp1=2019-01-31_07-01-47&timestamp2=2019-01-31_11-53-55
16:59:28  <mhdawson__>ok so that explains the 49187 bit
16:59:53  <mhdawson__>If I'd known that had been changed, I would have looked harder at the firewall config to see if it matched
17:01:20  <refack>Looks like that was done only 10m ago 🤔
17:01:55  <refack>So I might have reverted what I did at night since it didn't work
17:01:57  <mhdawson__>The compare is probably just backwards
17:03:33  <mhdawson__>There is an entry at 2019-01-31_07-01-47 which says "unknown" for the user
17:03:47  <mhdawson__>and the link is a comparison of what we have now to that
17:03:54  <refack>AFAICT that was when Jenkins restarted
17:04:32  <mhdawson__>Which I guess would make sense if the restart triggered an update to the config that somehow overwrote the entry
17:05:26  <mhdawson__>This shows the before/after for that change https://ci.nodejs.org/jobConfigHistory/showDiffFiles?name=config&timestamp1=2019-01-31_07-01-47&timestamp2=2019-01-30_20-06-17
17:05:33  <refack>https://ci.nodejs.org/jobConfigHistory/showDiffFiles?name=config&timestamp1=2019-01-31_07-19-44&timestamp2=2019-01-31_07-25-55
17:05:47  <refack>Found it. Yeah, I did it half asleep
17:07:11  <mhdawson__>So the restart overwrote the config, and you updated to try and fix it but just got the number wrong, right?
17:07:36  <refack>That's what I figure
17:07:36  <mhdawson__>It's bad that the update clobbered the configuration entry
17:08:25  <mhdawson__>I'm going to get back to my regular day. Thanks for figuring it out
17:08:47  <mhdawson__>At least I've learned a bit more so if something similar happens I'll have a better idea what to look for
17:08:53  <refack>Thanks all for the help and cross-checking
17:09:13  <mhdawson__>:)
17:09:53  <refack>And thanks to Trott for flagging the issue
17:12:41  <Trott>Should we add an item to the build-agenda to have a post-mortem on this? We can do it either during a regular meeting or else do it separately, but we should probably do something formal-ish and devise some process changes based on what we learn.
17:13:03  <refack>👍
17:13:09  <refack>I'll write up what I know
17:13:51  <Trott>Also: Let's watch for a new fanned job to finish? Looks to me like only Resume Build fanned jobs are green right now.
17:15:18  <refack>There was another bit that was clobbered (related to the temp git repo)
17:15:39  <refack>The bottom `hudson.slaves.EnvironmentVariablesNodeProperty` https://ci.nodejs.org/jobConfigHistory/showDiffFiles?name=config&timestamp1=2019-01-30_20-06-17&timestamp2=2019-01-31_07-01-47
17:15:47  <refack>I saw that on some other update
17:16:31  <mhdawson__>Worth capturing that in an issue so we can figure out what that has affected
17:16:38  <mhdawson__>I agree
17:16:48  <mhdawson__>I think we should force a restart immediately after an update
17:16:59  <mhdawson__>so that we have context as to what might have caused problems
17:17:07  <mhdawson__>as opposed to finding out at some later random time
18:21:03  <joyee>Is there a job that allows you to test that `make binary` works?
18:21:30  <joyee>It seems quite easy to break install.py accidentally without noticing it
18:30:30  <richardlau>It's possible to do test builds on the release CI but only a few people have access to it.
18:33:57  <refack>joyee: What richardlau hints at... AFAIR the nightly release job needs `install.py` to work properly, so we at least get daily coverage
18:35:08  <joyee>but only when someone notices that a nightly build is missing?
18:35:30  <joyee>(seems a bit random)
18:35:31  <refack>I get an email if the nightly job fails
18:35:40  <joyee>cool
18:37:27  <refack>...but it would be nice to find a way to test it in the public CI as well...
18:41:01  <richardlau>It's definitely doable... we used to do so internally at IBM. The trick was the release check that stops the build if the release bit isn't set in node_version.h.
18:42:25  <refack>I'm worried about too much friction...
18:43:09  <refack>Maybe we could wrap that into the arm jobs passing the binaries...
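Until there is a public CI job for this, a local smoke test for `install.py` could look like the sketch below; `make install` and `make binary` are the stock Makefile targets, and the paths are just examples.

```bash
#!/usr/bin/env bash
# Local smoke test for install.py (targets are the stock Makefile ones;
# staging paths are examples only).
set -euo pipefail

cd node   # a checkout of nodejs/node

# Exercise install.py the way `make install` does, into a throwaway prefix:
make install DESTDIR=/tmp/node-staging PREFIX=/usr
test -x /tmp/node-staging/usr/bin/node

# Or build the release tarball, which also goes through install.py:
make binary
ls node-v*-*.tar.gz
```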
18:56:58  * node-gh joined
18:56:58  * node-gh part
19:03:42  <Trott>Any idea what could have caused https://ci.nodejs.org/job/node-test-commit-aix/20771/nodes=aix61-ppc64/console and if we should do something to prevent it from happening again? Or was it a one-time thing? (Maybe someone doing work on the AIX machine at the time?)
19:10:24  <refack>Trott: looks like either a bug in the TAP reporter, or in tap2junit
19:10:53  <refack>Seems like one test does not have a duration (actually it seems like it has no YAML content at all)
19:14:29  <richardlau>it doesn't look like all the tests ran (it bailed out early)?
19:19:21  <refack>So it might be a memory issue... I'll monitor that worker
19:22:49  <refack>2524 tests 😰 last number I remember was 2000+change
19:25:16  * node-gh joined
19:25:16  * node-gh part
19:50:02  <mhdawson__>I'm wondering what git-nodesource-update-reference is and why it fails so often
19:50:36  <mhdawson__>at least for me it has been 50/50
19:52:14  <refack>It is related to moving the binaries in the arm fanned job
19:55:27  <refack>Seems like it is timing out because the temp repo is too big (git does lazy GC on demand, and in this case it causes the request to time out)
19:55:56  <mhdawson__>Is it new?
19:56:10  <mhdawson__>just wondering since I've seen it a lot recently
19:56:51  <refack>AFAIK been there for as long as the arm-fanned job was there
19:57:50  <refack>Ever since we started keeping the binaries to enable "resume", the temp git repo gets too big once in a while
19:58:27  <refack>Also when Rod had trouble with his ISP that was the first job in the sequence to fail
19:58:29  <mhdawson__>k, I guess I just never noticed it
19:59:35  <refack>Since it only does housekeeping, if it fails it's an infra problem. It doesn't run any tests or compilation...
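That housekeeping could also be done proactively with something like the sketch below; the repo path is a placeholder for wherever the binary hand-off repo lives on the worker.

```bash
#!/usr/bin/env bash
# Compact the binary hand-off repo up front, instead of a fetch tripping git's lazy GC.
# (Path is a placeholder, not the real location.)
set -euo pipefail

cd /home/iojs/binary_tmp.git

git count-objects -vH            # how many loose objects / how much disk the repo uses
git gc --aggressive --prune=now  # repack and drop unreachable objects now
git count-objects -vH
```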
20:00:29  <mhdawson__>can I resume if that fails?
20:01:24  <refack>I think so. Since it depends on performance.
20:02:16  <refack>If the cross compiler is green, the binaries are in the temp repo.
20:02:38  <refack>Getting them to Australia is the tricky part
20:12:23  <mhdawson__>It's a bit strange that the arm binary tests are running even though that failed
20:12:59  <mhdawson__>so in the end I could have a CI where that was the only job that failed.
20:13:55  <refack>It only needs to succeed once. IIUC, if a previous run already fetched the branch you need, it'll work
20:14:14  <refack>I assume you are resuming?
20:16:43  <mhdawson__>waiting for it to finish first
20:17:07  <mhdawson__>I don't think I can before then
20:51:38  * node-gh joined
20:51:38  * node-gh part
21:25:45  * chorrell joined
21:26:55  * chorrell quit (Client Quit)
21:37:48  * node-gh joined
21:37:48  * node-gh part