00:00:16  * paddybyers quit (Quit: paddybyers)
00:33:11  * TooTallNate quit (Read error: Connection reset by peer)
00:33:14  * nathan joined
00:33:41  * nathan changed nick to Guest37993
00:41:51  * Guest37993 changed nick to TooTallNate
01:06:28  * sh1mmer quit (Quit: sh1mmer)
01:17:49  * indexzero quit (Quit: indexzero)
01:19:47  * indexzero joined
01:19:47  * sh1mmer joined
01:25:44  * piscisaureus_ joined
01:36:03  <piscisaureus_>ryah bnoordhuis: http://groups.google.com/group/nodejs/msg/f5273f0a7b1c665e
01:36:13  <piscisaureus_>do we know why?
01:39:33  <ryah>piscisaureus_: no
01:39:46  <ryah>piscisaureus_: that code has not been touched much between v0.4 and v0.6
01:40:02  <ryah>but we should add a benchmark to the tree and get it up on http://arlolra.no.de/
01:40:09  <piscisaureus_>ryah: I am afraid libuv could be slower than libev
01:40:21  <piscisaureus_>that is, the number of rounds the event loop can make per second
01:41:06  <ryah>yeah
01:41:09  <ryah>could be
01:41:19  <piscisaureus_>maybe it's that we compile with EV_MULTIPLICITY and the overhead of uv_default_loop()
01:41:42  <piscisaureus_>because afaict libuv-unix does not do much more than libev otherwise
01:41:55  <ryah>i seriously doubt that is the cause
01:42:03  <ryah>more likely some random V8 shit
01:42:15  <piscisaureus_>oh yeah
01:42:31  <piscisaureus_>I should run it with --trace-bailout --trace-deopt
01:42:45  <ryah>piscisaureus_: commit the benchmark into the benchmark dir
01:42:58  <piscisaureus_>k
01:43:51  * sh1mmer quit (Quit: sh1mmer)
01:45:07  * sh1mmer joined
01:47:54  * sh1mmer quit (Client Quit)
02:07:14  <piscisaureus_>ryah: we already have it in our tree :-/
02:08:09  <piscisaureus_>but my benchmark tests something else so I am going to land that too
02:10:22  <CIA-111>node: Bert Belder v0.6 * rc6347dc / benchmark/next-tick-2.js :
02:10:22  <CIA-111>node: Add another nextTick benchmark
02:10:22  <CIA-111>node: It tests how many iterations the event loop can make per second. - http://git.io/wyBnDA
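A benchmark of that sort only takes a few lines; as a rough sketch of the idea being described (counting process.nextTick round-trips per second), not the actual benchmark/next-tick-2.js that landed:

    // Schedule a callback with process.nextTick on every turn and count
    // how many turns the event loop completes in one second.
    var ticks = 0;
    var start = Date.now();

    function spin() {
      ticks++;
      if (Date.now() - start < 1000) {
        process.nextTick(spin);   // queue another turn of the loop
      } else {
        console.log(ticks + ' ticks/sec');
      }
    }

    spin();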
02:18:10  * travis-ci joined
02:18:11  <travis-ci>[travis-ci] joyent/node#138 (v0.6 - c6347dc : Bert Belder): The build passed.
02:18:11  <travis-ci>[travis-ci] Change view : https://github.com/joyent/node/compare/cf2513e...c6347dc
02:18:11  <travis-ci>[travis-ci] Build details : http://travis-ci.org/joyent/node/builds/439460
02:18:11  * travis-ci part
02:25:22  <ryah>piscisaureus_: thanks
02:29:07  <ryah>arlolra: yt?
02:29:16  <ryah>arlolra: can you add https://github.com/joyent/node/commit/c6347dcfb43273214dc872e60c8cd94a93fee027 to the website?
02:29:48  <arlolra>for sure
02:44:37  <ryah>arlolra: thanks
02:51:47  * sh1mmer joined
03:07:52  * indexzero quit (Quit: indexzero)
03:18:19  * TooTallNate quit (Quit: Linkinus - http://linkinus.com)
03:22:16  * indexzero joined
03:54:08  * pietern quit (Quit: pietern)
04:20:58  * rmustacc part
04:23:46  <piscisaureus_>ryah: https://gist.github.com/d50183eb434370cafe6a <-- think about it
04:31:14  <ryah>isn't finally a keyword?
04:31:26  <piscisaureus_>ryah: yeah
04:31:34  <piscisaureus_>ryah: I just used it as a function name
04:31:54  <piscisaureus_>ryah: replace it by `onend` or by nameless functions
04:32:45  <ryah>this seems like a math problem to me
04:32:56  <ryah>it feels the same :)
04:33:06  <piscisaureus_>to me too
04:33:16  <piscisaureus_>it is extremely difficult to reason about
04:33:36  <piscisaureus_>good luck you got a degree in math :-)
04:33:52  <piscisaureus_>s/good luck/luckily/
04:34:28  <piscisaureus_>the createServer(function(socket) { ... is a nasty problem
04:35:01  <ryah>what does createDomain pass to its first arg?
04:35:14  <piscisaureus_>nothing
04:35:27  <ryah>var d2 = createDomain(function finally2(err) { if (err) console.log(err); console.log("the end");
04:35:30  <ryah>});
04:35:34  <ryah>^-- is err ever non-null ?
04:35:38  <piscisaureus_>oh yeah
04:35:46  <piscisaureus_>if someone threw within the domain
04:35:58  <ryah>oh i see
04:36:04  <benvie>finally will throw because it's a keyword, but it'll only throw after the user throws? is that how it works?
04:36:22  <ryah>i think you should rename finally piscisaureus_ - just so it's not confusing
04:36:30  <ryah>`finally`
04:37:31  <piscisaureus_>yeah
04:37:43  <ryah>piscisaureus_: "but not before at least one piece of code has been run in the domain."
04:37:56  <ryah>what does "one piece of code" mean?
04:39:18  <piscisaureus_>ryah: just like node. It will exit when the refcount drops to 0
04:39:57  <ryah>piscisaureus_: i was thinking something more built in: net.createDomainServer(function (domain) { /* called for each connection */ })
04:40:09  <ryah>piscisaureus_: it blocks?
04:40:18  <piscisaureus_>ryah: oh no
04:40:57  <ryah>then we don't have to say "The socket cannot have any callbacks attached at that point."
04:41:03  <ryah>^-- this is a funny restriction
04:41:06  <ryah>would be best to avoid
04:41:13  <ryah>by a different API
04:41:59  <ryah>although what you have is okay
04:42:02  <piscisaureus_>ryah: no I mean, the refcount is the number of io handles attached to the domain. But after one domain.run(...) call the domain may still hold io handles so it doesn't exit.
04:42:02  <piscisaureus_>If after the first run() call the refcount is 0 the domain yields immediately after that.
04:42:16  <piscisaureus_>but right after the createDomain call the refcount is also 0
04:42:22  <piscisaureus_>but it should not yield right away
04:42:23  <ryah>yep
04:42:29  <ryah>man domains are going to be awesome
04:42:40  <ryah>if this actually works
04:42:48  <ryah>node will destroy erlang
04:43:07  <ryah>especially with isolates
04:45:23  <piscisaureus_>I hope we won't find any nasty edge cases
04:45:40  * benvie quit
04:45:47  <piscisaureus_>the guarantee that we *must* make is that every domain always yields reliably
04:46:54  <ryah>piscisaureus_: yes, the ref count thing i also thought about before
04:46:57  <ryah>i think it's a good idea
04:47:06  <piscisaureus_>I like that part best. People want to encapsulate a complex operation and then they just want one result: did it work or did it not?
04:47:16  <ryah>yes
04:47:23  <ryah>it will be good for doing the debugger
04:47:33  <ryah>we won't return until the V8 RPC returns
04:47:40  <ryah>we can do that with a domain
04:47:56  <ryah>that basically saves you from passing a callback through the system
04:48:28  <piscisaureus_>yeah
04:48:29  <ryah>well.. hm
04:48:39  <ryah>if we opened the socket for each RPC - it would be easy
04:48:44  <ryah>maybe in this case it doesnt
04:49:39  <piscisaureus_>I am not quite sure about black holing a run() to a domain that has already yielded
04:49:46  <piscisaureus_>it may be better to make it throw
04:50:11  <ryah>try-catch for events
04:50:19  <ryah>this is what it is
04:50:51  <piscisaureus_>that's what I wanted
04:51:05  <piscisaureus_>the only annoying thing is that you cannot really "detect" callbacks
04:51:07  <ryah>oh right - that's why you call it finally
04:52:43  <piscisaureus_>it needs work for sure ... this is just a draft
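To make the shape of the draft easier to follow, here is a rough usage sketch; createDomain, run() and the error/completion callback come from the gist discussed above, while everything else (attaching a socket, the exact yield rules) is an assumption about the draft, not a finished API:

    var net = require('net');

    // Completion callback: called once, when the domain "yields" --
    // either because something inside it threw (err is set), or because
    // its refcount of attached I/O handles dropped back to zero.
    var d = createDomain(function onend(err) {
      if (err) console.log('domain failed:', err);
      else console.log('the end');
    });

    // run() executes code inside the domain; handles and callbacks
    // created here are attached to it and keep its refcount above zero.
    d.run(function() {
      var socket = net.connect(80, 'example.com');        // refcount: 1
      socket.on('error', function(err) { throw err; });   // caught by the domain
      socket.on('close', function() {
        // refcount drops back to 0 here, so the domain yields and onend runs
      });
    });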
05:32:17  <piscisaureus_>http://groups.google.com/group/nodejs/browse_thread/thread/9d608f19d9f7b5c4 hmmm
05:33:36  <arlolra>piscisaureus_: from your gist above, on line 46, the d2.run(something) ... would that run if the above call to run hadn't yielded yet? that seems unreliable if it would
05:35:28  <arlolra>maybe just ignore that question
05:37:05  <piscisaureus_>arlolra: yeah I agree there may be a problem here
05:37:54  <arlolra>ok, the rest seemed pretty clear though
05:37:57  <piscisaureus_>maybe I should not allow a second run() call like that
05:41:05  <piscisaureus_>ryah: why is the stream.write callback not documented? Is it not supposed to be used?
05:44:57  * benvie joined
05:47:49  * einaros- quit (Ping timeout: 276 seconds)
06:25:46  <ryah>piscisaureus_: it should be documented
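For reference, the callback in question is the optional last argument to write(); a minimal sketch of how it was typically used at the time (semantics hedged -- roughly, it fires once the data has been flushed to the kernel):

    var net = require('net');

    var socket = net.connect(80, 'example.com', function() {
      socket.write('GET / HTTP/1.0\r\n\r\n', function() {
        // data handed off to the kernel; safe to queue the next chunk
        console.log('request flushed');
        socket.end();
      });
    });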
06:28:49  * piscisaureus_ quit (Ping timeout: 255 seconds)
07:21:31  * sh1mmer quit (Read error: Connection reset by peer)
07:21:48  * sh1mmer joined
07:23:44  * sh1mmer quit (Client Quit)
07:26:39  * sh1mmer joined
07:33:43  * AvianFlu quit (Quit: Leaving)
08:01:13  * paddybyers joined
10:00:27  * paddybyers_ joined
10:01:47  * Skomski joined
10:03:16  * paddybyers quit (Ping timeout: 240 seconds)
10:03:16  * paddybyers_ changed nick to paddybyers
10:45:59  * felixge joined
10:45:59  * felixge quit (Changing host)
10:45:59  * felixge joined
11:44:18  * dshaw_ joined
12:03:22  * dshaw_ quit (Ping timeout: 240 seconds)
13:02:26  * Skomski quit (Quit: Nettalk6 - www.ntalk.de)
14:02:24  * piscisaureus_ joined
14:31:14  * felixge quit (Quit: http://www.debuggable.com/)
14:47:20  * CoverSli1e joined
14:47:54  * CoverSlide quit (Read error: Connection reset by peer)
17:46:54  <piscisaureus_>ryah: bnoordhuis: no call tonight right?
17:55:26  * sh1mmer quit (Quit: sh1mmer)
17:58:54  * AvianFlu joined
18:02:13  * dshaw_ joined
18:24:02  * piscisaureus_ quit (Quit: ~ Trillian Astra - www.trillian.im ~)
18:47:30  * TooTallNate joined
19:06:24  * mraleph joined
20:37:24  <mjr_>Hey guys, looks like we are still getting cores from http parser.
20:37:38  <mjr_>I'll send up a few.
20:42:07  * dshaw_ quit (Quit: Leaving.)
20:42:20  <mjr_>https://gist.github.com/a9722d25619c0ef7eb21
20:43:12  * piscisaureus_ joined
20:44:41  <mjr_>So it looks like it's still breaking in a similar place, although perhaps for a different reason.
20:45:09  <mjr_>It seems like our client might be sending a seriously mangled header, but perhaps it is also node getting confused.
20:46:31  <mmalecki>mjr_: do you maybe know which headers the client sends?
20:46:48  <mjr_>oh sure, I have the source code
20:47:08  <mjr_>But every time I look, the client is doing the right thing.
20:47:54  <mmalecki>mjr_: hm, can you replicate it? or does it only appear under some bigger load?
20:48:02  <mjr_>Only in production
20:50:51  <mmalecki>mjr_: which version are you guys running?
20:51:01  <mjr_>tracking v0.6 branch
20:53:24  <mjr_>cores and current node executable are here: http://voxer.com/media/cores2.tar.gz
20:53:48  <mjr_>arch is x86_64, libs are Ubuntu 10.04
20:56:36  <mmalecki>line 92 is weird, I'd expect length to be equal to the header's length
21:15:56  * sh1mmer joined
21:17:28  * AndreasMadsen joined
21:18:30  * AndreasMadsen quit (Client Quit)
21:19:25  * perezd_ joined
21:20:11  * perezd quit (Ping timeout: 252 seconds)
21:20:11  * perezd_ changed nick to perezd
21:35:31  <pquerna>ryah: did that failure detector code ever make it into an npm package somewhere?
22:28:25  <mjr_>pquerna: you guys ever run into this "TCP incast" thing in your big cassandra cluster or other big things?
22:28:26  <mjr_>http://bradhedlund.com/2011/05/01/tcp-incast-and-cloud-application-performance/
22:29:10  <pquerna>mjr_: yes, in memcached years ago
22:29:23  <pquerna>mjr_: once we were above about 40 memcached nodes with a wide multi-get
22:29:36  <mjr_>What did you do about it?
22:30:30  <pquerna>short term, we changed our code to do multiget in batches of like 30 servers at once, and partitioned the memcached cluster along keys to keep the number of machines down
22:30:42  <pquerna>longterm, this is why memcache added the udp protocol
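The short-term batching fix described above is essentially "don't ask every server at once"; a rough sketch of that idea in node terms (the fetchFromServer helper and the batch size of 30 are made up for illustration):

    // Query servers in groups of 30 so all the replies don't land on the
    // same switch port at the same instant (the TCP incast pattern).
    function multiGetBatched(servers, key, batchSize, done) {
      var results = [];
      var i = 0;

      (function nextBatch() {
        if (i >= servers.length) return done(null, results);
        var batch = servers.slice(i, i + batchSize);
        i += batchSize;

        var pending = batch.length;
        batch.forEach(function(server) {
          fetchFromServer(server, key, function(err, value) {  // hypothetical helper
            if (!err) results.push(value);
            if (--pending === 0) nextBatch();  // only then hit the next group
          });
        });
      })();
    }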
22:31:14  <mjr_>Seems like for small responses UDP would have the same issue though, no?
22:31:35  <pquerna>in cassandra this isn't a real issue, because most of the time our replica count is like 5 or 7, and most queries have 'row' locality
22:31:49  <pquerna>and a row is the level of partitioning for cross-machine
22:32:24  <mjr_>We seem to be hitting this in both our riak clusters and our node code.
22:32:39  <pquerna>mjr_: well, back then, with tcp it would make the linux kernel do dumb stuff, like back off with a hard-coded 200ms backoff; just switching to udp eliminated that. but yes, if you start overloading switch ports, boo humbug
22:32:46  <mjr_>Our replica count is only 3, but somehow we get into these nasty burst/loss cycles.
22:33:34  <mjr_>Average bitrate is 500mbits or so, but there are 10ms bursts of over 1000, all the time.
22:33:42  <pquerna>right
22:34:36  <pquerna>do you have managed switches in there?
22:34:43  <pquerna>should be able to get stats out
22:35:03  <mjr_>yeah, they claim tons of output discards
22:35:31  <mjr_>sampling interface counters at 10Hz reveals the awful, bursty truth.
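The same 10Hz sampling idea can be done from node against a host interface; a minimal sketch, assuming a Linux box where the byte counters live in /proc/net/dev (the interface name and the 800 Mbit/s threshold are illustrative):

    var fs = require('fs');

    function rxBytes(iface) {
      var lines = fs.readFileSync('/proc/net/dev', 'utf8').split('\n');
      for (var i = 0; i < lines.length; i++) {
        var cols = lines[i].trim().split(/[:\s]+/);
        if (cols[0] === iface) return parseInt(cols[1], 10);  // rx bytes column
      }
      return 0;
    }

    var last = rxBytes('eth0');
    setInterval(function() {                     // 10 Hz sampling
      var now = rxBytes('eth0');
      var mbits = (now - last) * 8 * 10 / 1e6;   // bytes per 100ms -> Mbit/s
      if (mbits > 800) console.log('burst: ' + mbits.toFixed(0) + ' Mbit/s');
      last = now;
    }, 100);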
22:36:59  <pquerna>hrm. What's really causing that kind of burst though?
22:37:21  <pquerna>is this to do with how you span out audio data to large groups?
22:38:03  <mjr_>I think riak does it on its own, based on how we are using it, for some reason.
22:38:13  <pquerna>i guess i don't grok why such high bursts would be a property of riak, but don't know riak internals that well
22:38:15  <mjr_>But I also think our node clusters are doing it to themselves.
22:38:19  <pquerna>unless you have a 'very' wide query
22:38:56  <mjr_>We don't do any of the coverage query stuff, it's all straight kv.
22:39:04  <pquerna>something that gets 1mb of data from every riak node
22:39:13  <pquerna>(or less, just asks every node for something)
22:39:16  <pquerna>hrm
22:39:38  <mjr_>So any given node only ever asks 3 other nodes for the answer. Those 3 probably do all answer at once.
22:39:51  <pquerna>right, but thats really not that much data
22:40:03  <pquerna>should be just fine unless the switch is otherwise crappy
22:40:04  <mjr_>Yeah, I know. Hence the mystery.
22:40:46  <mjr_>It's a bunch of Cisco 3750's. Legend has it that these are the most susceptible to this incast thing.
22:41:11  <pquerna>span ports? then maybe you can at least see what's bursting
22:41:47  <pquerna>even regular tcpdump should be able to see some 'smaller' bursts and maybe isolate it
22:41:48  <mjr_>It's kind of impossible to monitor because the backup is happening inside the switch on the way to the output port, so spanning would just cause another backup.
22:42:41  <mjr_>I don't yet know how to make tcpdump take 500 mbits of traffic for a second and pick out the bursts.
22:43:04  <pquerna>true :-/
22:43:11  <mjr_>Anyway, interesting problems for sure.
22:43:12  <pquerna>wireshark has some graphing tools built into it
22:43:20  <pquerna>so you can graph the throughput
22:44:26  <mjr_>I really just want bigger output buffers, but these switches have relatively small ones.
22:44:54  <mjr_>I also want ryan to fix my segfaults. :)
22:46:16  * mraleph quit (Quit: Leaving.)
22:46:24  <pquerna>https://github.com/vrv/linux-microsecondrto
22:47:18  <pquerna>ah nice
22:47:22  <pquerna>they added a tunable for the tro
22:47:24  <pquerna>rto
22:47:38  <pquerna>yeahaaa
22:48:06  <mjr_>That looks cool, at least if there's no way to avoid it, you can more effectively cope with it.
22:48:45  <pquerna>makes sense though, 200ms just doesn't make sense in a cluster of 100 machines on a LAN :)
22:50:19  <mjr_>Yeah, which normally have rtt of less than 1ms
22:53:19  <pquerna>mjr_: simple hack maybe, depending on what ports are overloading
22:53:32  <pquerna>mjr_: plug in another port, split your client traffic over both
22:53:44  <mjr_>Yeah, that's what we are doing actually.
22:53:54  <mjr_>Dumb way to double the receive port buffers
22:54:32  <mjr_>Just painful to add another NIC to every machine, and for some reason these machines won't do 10Gb.
22:55:29  <pquerna>another physical card? that sucks
22:55:39  <pquerna>most of our machines have 4 of 'em in nowadays :-/
22:58:24  <mjr_>10Gb is the real solution to this problem, but our Ubuntu-supplied kernels are unhappy with it.
23:02:44  <CIA-111>node: Ryan Dahl isolates2 * r1a3b283 / (6 files in 3 dirs): move isolate V8 functions out of node.cc - http://git.io/fGzaEw
23:02:44  <CIA-111>node: Ryan Dahl master * r624f70e / (.gitignore Makefile configure tools/gyp_node):
23:02:45  <CIA-111>node: GYP: rename options.gypi to config.gypi
23:02:45  <CIA-111>node: Sounds more familiar to unix users used to config.h - http://git.io/eLlIzA
23:02:58  <ryah>pquerna: probably not - but let me get it for you...
23:04:07  <pquerna>really wish delayed acks and rto could be set by-subnets
23:04:27  <ryah>mjr_: did the frequency of segfaults decrease?
23:04:54  <ryah>piscisaureus_: sorry - everyone is on vacation so i didn't go online this morning
23:06:06  * pietern joined
23:07:42  <mjr_>ryah: I think so, but there are still quite a lot of them.
23:07:52  <ryah>pquerna: https://gist.github.com/454b33101532a2e43340
23:07:53  <mjr_>Perhaps they come in bursts, the timestamps seem to support this.
23:08:07  <pquerna>ryah: thanks
23:08:48  <ryah>mjr_: damn. i thought we nailed that bug :///
23:09:01  <mjr_>Yeah, I guess not.
23:09:14  <mjr_>It was clean on that one server for a day.
23:09:54  <mjr_>Aside from moving this all to Joyent, which can't happen until your guys deliver the machines, is there a better way to see what's going on here?
23:11:08  <CIA-111>node: Ryan Dahl v0.6 * rd85c85a / doc/api/addons.markdown : Change 'real example' in addon doc - http://git.io/I9j8GQ
23:12:08  <ryah>what is the freq of crashes?
23:12:20  <mjr_>I'll get the full list of timestamps
23:13:57  <mjr_>https://gist.github.com/7a371382bc8ed825f89c
23:14:15  * travis-ci joined
23:14:16  <travis-ci>[travis-ci] joyent/node#139 (isolates2 - 1a3b283 : Ryan Dahl): The build is still failing.
23:14:16  <travis-ci>[travis-ci] Change view : https://github.com/joyent/node/compare/2ac02f4...1a3b283
23:14:16  <travis-ci>[travis-ci] Build details : http://travis-ci.org/joyent/node/builds/442139
23:14:16  * travis-ci part
23:14:39  <mjr_>Is there some obvious unix way to sort those, or do I have to employ elaborate awk/perl magic?
23:15:12  * travis-ci joined
23:15:12  <travis-ci>[travis-ci] joyent/node#140 (master - 624f70e : Ryan Dahl): The build is still failing.
23:15:12  <travis-ci>[travis-ci] Change view : https://github.com/joyent/node/compare/6ac22bf...624f70e
23:15:12  <travis-ci>[travis-ci] Build details : http://travis-ci.org/joyent/node/builds/442141
23:15:12  * travis-ci part
23:16:59  <pquerna>ryah: cool with making it a little npm package (crediting joyent of course)?
23:18:59  * travis-ci joined
23:18:59  <travis-ci>[travis-ci] joyent/node#141 (v0.6 - d85c85a : Ryan Dahl): The build passed.
23:18:59  <travis-ci>[travis-ci] Change view : https://github.com/joyent/node/compare/c6347dc...d85c85a
23:18:59  <travis-ci>[travis-ci] Build details : http://travis-ci.org/joyent/node/builds/442174
23:18:59  * travis-ci part
23:19:54  <ryah>mjr_: sort -n --key=7 cores.txt
23:19:57  <ryah>pquerna: sure
23:20:31  <ryah>mjr_: oops i guess no -n
23:20:45  <mjr_>also, it spans days
23:21:00  <mjr_>That's the tricky part.
23:21:13  <ryah>sort --key=6,7 cores.txt
23:21:14  * piscisaureus_ quit (Ping timeout: 240 seconds)
23:21:35  <mjr_>oh shit, I didn't know you could do two keys like that.
23:21:47  <mjr_>well, I'm glad I asked.
23:21:56  <ryah>me neither :)
23:22:03  <mjr_>https://gist.github.com/815d097ba175bccfedf0
23:22:36  <mjr_>looks like it roughly follows the traffic pattern
23:22:51  <ryah>this is one box?
23:23:02  <mjr_>No, this is the entire cluster.
23:23:29  <ryah>k
23:23:56  <ryah>just to check - all the machines are upgrade to v0.6 HEAD ?
23:24:24  <mjr_>Not sure if this helps or not, and I hate to even mention it, but we are corrupting data somewhere. Getting reports of mangled JPEG files and JSON with bits of previous messages interspersed.
23:24:52  <mjr_>Could be our fault somehow, like we are mangling it between the client and server, actually uploading corrupted data.
23:25:10  <mjr_>Yes, all machines are v0.6 branch as of right after you landed that patch.
23:25:17  <ryah>it might be worthwhile to load efence into one of these processes?
23:25:50  <ryah>i think i have code for that in wscript... one sec.
23:26:03  <ryah>or maybe you can do it with LD_PRELOAD or whatever
23:26:30  <ryah>yeah- i think you should be able to do LD_PRELOAD - you know efence?
23:26:45  <ryah>it's like a ghetto version of valgrind
23:26:50  <mjr_>I don't. Does it have a significant performance impact?
23:26:58  <ryah>but i think it runs closer to production speed
23:27:18  <ryah>guessing - completely off the top of my head - maybe 20% hit
23:27:30  <ryah>i haven't used it for over a year though so im not sure.
23:27:36  * indexzero quit (Quit: indexzero)
23:27:44  <ryah>that will let us know if we're hitting any uninitialized memory
23:28:14  <ryah>you should have it in apt-get
23:29:35  <mjr_>OK, I'll try it on your user's router process. :)
23:29:47  <ryah>:)
23:30:46  <ryah>meanwhile i'm going to poke around in the parser code again
23:33:10  <ryah>i really hate these fucking C features to tell functions to inline
23:33:28  <ryah>the VM should be deciding that not me
23:34:02  <mjr_>yes, in this post-llvm world that is more reasonable.
23:34:38  <ryah>mjr_: i have a patch for you to try - i had some handlescopes in the wrong places
23:34:47  <mjr_>heh, sure
23:34:58  <mjr_>want me to try both that and efence, or one at a time?
23:34:59  <ryah>mjr_: https://gist.github.com/52e8075223f7e36c00c5
23:35:06  <ryah>mjr_: one at a time please
23:35:11  <mjr_>patch first?
23:35:22  <ryah>in parallel if you're down for it
23:35:53  <ryah>(on different processes)
23:36:09  <ryah>im also removing always_inline
23:36:14  <ryah>which i dont trust
23:36:25  <ryah>and i doubt is actually measurable
23:53:15  <mjr_>Sorry, got distracted by splunk. Have you seen this thing? Pretty great.
23:56:25  <ryah>the head of ops in the office next to me keeps telling me how great it is
23:56:28  <ryah>havent used it
23:57:46  <mjr_>So we generate 1TB of log data every day, and this thing indexes the shit out of it.
23:58:15  <mjr_>grep became laughably slow