00:01:24  * benjamincoe joined
00:04:21  * node-gh joined
00:04:21  * node-gh part
00:04:34  <refack>can anyone with infra access kick Jenkins? joaocgreis rvagg mhdawson__
00:05:28  <rvagg>Kick? As in a full restart?
00:05:44  <rvagg>I did that yesterday, but now it's worse?
00:05:47  <refack>Maybe it's perma 504 for me
00:05:55  <refack>Yes it's worse
00:05:57  * benjamincoe quit (Remote host closed the connection)
00:06:08  <rvagg>Ok, inspecting
00:07:34  * benjamincoe joined
00:11:27  * benjamincoe quit (Remote host closed the connection)
00:14:06  * benjamincoe joined
00:15:16  <rvagg>yeah, java process doesn't want to die, so it's certainly not in a good way
00:22:02  <rvagg>ok, so it's looking to me like we're not trimming quite as aggressively as we should, there are builds in there from the 19th of August but we should be on a 7-day trim cycle, so something's getting skipped
00:22:34  <rvagg>yeah, and I've run the trim job from the backup server and it's removed more than half of the jobs now, I bet this fixes the Jenkins slowness but it doesn't explain why it hasn't been trimming since late August
00:22:53  <rvagg>jenkins back up now, will probably behave nicer but we have some investigation to do
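rvagg's diagnosis above (builds older than the 7-day trim window piling up on disk) can be sketched as a `find`-based check. The `$JENKINS_HOME/jobs/<job>/builds/<n>` layout and the 7-day window are assumptions drawn from the conversation, not the project's actual trim job.

```shell
# Hypothetical sketch of a 7-day build-trim check, assuming the standard
# $JENKINS_HOME/jobs/<job>/builds/<n> layout. Prints candidates only;
# add `-exec rm -rf {} +` to actually trim after reviewing the list.
trim_candidates() {
  jenkins_home="$1"
  days="${2:-7}"
  find "$jenkins_home"/jobs/*/builds -mindepth 1 -maxdepth 1 \
    -type d -mtime +"$days"
}
```

Run non-destructively (e.g. `trim_candidates /var/lib/jenkins`), this would have surfaced the August leftovers without deleting anything.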
00:25:39  <rvagg>can someone explain why alpine38 is still offline? "test"
00:26:01  <rvagg>actually only 1/2 of them
00:38:08  <rvagg>ok, maybe a disk problem on backup server, seems to be behaving properly again but we need to keep an eye on this
01:01:52  <rvagg>jenkins CPU 1001%, what a beast
01:37:41  * benjamincoe quit (Remote host closed the connection)
01:46:55  * nemix quit (Ping timeout: 244 seconds)
03:09:41  * node-gh joined
03:09:41  * node-gh part
03:28:47  <Trott>test-linuxonecc-rhel72-s390x-3 has multiple consecutive build failures. Anybody around to take a look at it or should I take it offline? Most recent failure at https://ci.nodejs.org/job/node-test-commit-linuxone/nodes=rhel72-s390x/4916/console.
03:38:15  * benjamincoe joined
03:42:45  * benjamincoe quit (Ping timeout: 252 seconds)
04:08:57  * gabrielschulhof quit (Ping timeout: 252 seconds)
04:44:27  * lucalanziani quit (Ping timeout: 240 seconds)
04:46:30  * lucalanziani joined
05:08:05  <Trott>Also seems like AIX is stalled. https://ci.nodejs.org/job/node-test-commit-aix/
08:34:52  * srl295 quit (Quit: Connection closed for inactivity)
12:39:12  * benjamincoe joined
12:43:57  * benjamincoe quit (Ping timeout: 252 seconds)
13:02:28  <node-slack-bot>[trott] Removed a stale `index.lock` file on `test-softlayer-ubuntu1404-x86-1` that had it in perpetual build failure.
13:02:39  <node-slack-bot>[trott] Probably caused by a terminated run.
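The stale `index.lock` cleanup trott describes can be automated with an age check, so a lock held by a live `git` run is not clobbered. The one-hour threshold and the workspace path argument are assumptions, not existing tooling.

```shell
# Sketch: remove a git index.lock only if it looks abandoned (older than
# 60 minutes). A lock left behind by a terminated run never goes away on
# its own; a lock belonging to a live git process should be left alone.
clear_stale_lock() {
  lock="$1/.git/index.lock"
  if [ -f "$lock" ] && [ -n "$(find "$lock" -mmin +60)" ]; then
    rm -f "$lock"
    echo "removed stale $lock"
  fi
}
```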
13:04:26  <node-slack-bot>[trott] The problems above with LinuxONE and AIX are still ongoing and I'm not sure what to do there.
13:11:54  <Trott>Marking test-linuxonecc-rhel72-s390x-3 temporarily offline. (Is that a machine IBM folks have to take care of? mhdawson__)
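Marking a worker temporarily offline, as trott does here, can also be scripted against Jenkins' REST endpoint `POST /computer/<name>/toggleOffline`. Authentication handling is omitted here, and the `CURL` override exists only so the sketch can be dry-run; treat the whole thing as an illustration, not existing infra tooling.

```shell
# Sketch: toggle a Jenkins node offline with an explanatory message.
# Auth (e.g. --user "$USER:$API_TOKEN") is omitted; set CURL=echo for a
# dry run that prints the request instead of sending it.
offline_node() {
  name="$1"
  msg="${2:-multiple consecutive build failures}"
  : "${CURL:=curl}"
  "$CURL" -s -X POST \
    --data-urlencode "offlineMessage=$msg" \
    "https://ci.nodejs.org/computer/$name/toggleOffline"
}
```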
13:16:44  * nemix joined
13:26:47  * nemix quit (Ping timeout: 240 seconds)
13:32:16  <Trott>Not sure what to do about AIX. Jobs have been waiting for 10 hours and no AIX host picks it up to run. Going to try rebooting. (Running `ansible-playbook playbooks/jenkins/worker/create.yml --limit "test-osuosl-aix61-ppc64_be-2"` resulted in an error for me.)
13:32:43  <refack>I'll give it a look
13:33:19  <Trott>OK, I won't reboot then. (I was thinking maybe instead of rebooting it, I can just disable testing on AIX for the weekend until IBM folks are back on Monday.)
13:33:31  <refack>They are offline - https://ci.nodejs.org/label/aix61-ppc64/
13:33:34  <Trott>Sorry to report issues on a weekend to a volunteer team.
13:34:49  <Trott>(I tried the `java -jar agent.jar ...` command and that couldn't connect either.)
13:36:00  <Trott>In the vein of "don't be afraid to ask 'stupid' questions": How do we get them back online?
13:36:34  <refack>Well they are ssh-online, they are just not running the jenkins agent.
13:36:52  <refack>Simplest case is to run the command that's on the worker-node screen
13:37:01  <refack>(just be sure to run it as `iojs`)
13:37:27  <refack>but there should be some daemon-izing script on the machine
13:37:41  <refack>it's just different for each platform
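refack's recipe above (run the JNLP command from the worker's node page, as the `iojs` user) amounts to something like the function below. The jar path, secret file, and agent name are assumptions; the authoritative command is the one Jenkins shows on the node page. `JAVA` is overridable purely so the sketch can be dry-run.

```shell
# Sketch of a manual agent start, to be run as `iojs` (e.g. via
# `sudo -u iojs`). Jar location and secret file are assumptions; set
# JAVA=echo to dry-run without a JVM.
start_agent() {
  name="$1"
  : "${JAVA:=java}"
  "$JAVA" -jar /home/iojs/agent.jar \
    -jnlpUrl "https://ci.nodejs.org/computer/$name/slave-agent.jnlp" \
    -secret @/home/iojs/secret
}
```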
13:40:28  <Trott>Should I try running the command as user `iojs` or will I just get in your way?
13:41:23  <refack>I'm trying to remember how to start the daemon on AIX
13:42:19  <Trott>`/etc/rc.d/rc2.d/S20jenkins` maybe?
13:42:51  <refack>probably, how do I kick it?
13:43:00  * refack googling start daemon on aix
13:43:39  <Trott>`/etc/rc.d/rc2.d/S20jenkins start` as root?
13:44:39  <refack>doesn't err but doesn't catch
13:44:46  <refack>(no java proc)
13:45:23  <Trott>I'm seeing `startsrc` in Google but you're probably 10 steps ahead of me at least so I'll stop at this point.
13:49:41  <refack>Ok, `/etc/rc.d/rc2.d/S20jenkins start` worked. It's just that the `ps axu` output looks weird
13:52:02  <refack>Ok, both AIX machines are online
13:53:40  <Trott>Thanks! Hopefully I'll retain enough of this information to be able to do it myself next time. :-D
13:55:00  <refack>There is general work on a "fix.yml" playbook. Ansible should know the minutiae of the different platforms
13:55:17  * gabrielschulhof joined
13:55:46  * gabrielschulhof quit (Client Quit)
15:07:10  <Trott>It looks like yesterday's issue of slooooooow Jenkins with lots of 504 errors is back...
15:51:09  * nemix joined
16:05:24  <refack>Trott: re the linuxone failing:
16:05:47  <refack>https://www.irccloud.com/pastebin/j9oIun0z/
16:05:52  <refack>it was a leftover
16:07:35  <Trott>Oh, maybe that was the host Ali was using and ran a test as a user different than iojs or something? Just a guess...
16:07:52  <refack>Most probably
16:08:11  <refack>We should have a kill script with setuid
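The leftover-process situation behind refack's setuid-kill idea can be sketched as an ownership scan: list processes matching a pattern that are not owned by `iojs`, which a privileged wrapper (setuid or sudo-invoked) could then feed into `kill`. The process pattern and the `iojs` default are assumptions for illustration.

```shell
# Sketch: report "<pid> <owner>" for processes matching $pattern whose
# owner is NOT the expected user. A privileged wrapper could pipe this
# into `kill`; this function itself only reports.
find_leftovers() {
  pattern="$1"
  expected_owner="${2:-iojs}"
  for pid in $(pgrep -f "$pattern"); do
    owner=$(ps -o user= -p "$pid" | tr -d ' ')
    [ "$owner" != "$expected_owner" ] && echo "$pid $owner"
  done
  return 0
}
```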
17:08:05  * nemix_ joined
17:11:36  * nemix quit (Ping timeout: 252 seconds)
18:00:19  * bcoe joined
18:32:00  * bcoe quit (Remote host closed the connection)
18:45:26  * bcoe joined
18:46:15  * bcoe_ joined
18:49:34  * bcoe quit (Ping timeout: 240 seconds)
18:52:26  * bcoe_ quit (Remote host closed the connection)
18:52:42  * bcoe joined
18:53:12  * bcoe quit (Remote host closed the connection)
18:53:32  * bcoe joined
18:54:00  * bcoe quit (Remote host closed the connection)
18:54:15  * bcoe joined
20:22:50  * node-gh joined
20:22:50  * node-gh part
22:42:16  * nemix__ joined
22:45:28  * nemix_ quit (Ping timeout: 246 seconds)