00:06:58  * chorrelljoined
00:18:05  * peterbraden__quit (Ping timeout: 250 seconds)
00:19:07  * yunongquit (Quit: Leaving.)
00:21:56  * nfitchquit (Quit: Leaving.)
00:23:34  * fredkquit (Quit: Leaving.)
00:58:23  * bcantrillquit (Quit: Leaving.)
01:09:15  * yunongjoined
01:20:31  * yunongquit (Quit: Leaving.)
01:27:26  * abraxasjoined
02:07:40  * elijah-mbpquit (Remote host closed the connection)
02:09:12  * elijah-mbpjoined
02:09:27  * elijah-mbpquit (Remote host closed the connection)
02:20:12  * bcantrilljoined
02:27:44  * chorrellquit (Quit: My MacBook Pro has gone to sleep. ZZZzzz…)
02:28:53  * ghostbar_quit (Ping timeout: 240 seconds)
03:08:48  * elijah-mbpjoined
03:26:19  * mcavagequit (Remote host closed the connection)
03:46:46  * ghostbarjoined
03:50:46  * ghostbarquit (Ping timeout: 240 seconds)
04:26:46  * _Tenchi_quit (Quit: Leaving)
04:44:41  * ghostbarjoined
05:35:26  * yunongjoined
05:55:51  * yunongquit (Quit: Leaving.)
05:58:26  * marsellquit (Quit: marsell)
06:07:29  * marselljoined
06:32:24  * marsellquit (Quit: marsell)
07:23:10  * ghostbar_joined
07:23:46  * ghostbarquit (Ping timeout: 240 seconds)
07:39:41  * mamashjoined
07:58:40  * mamashpart
07:58:46  * mamashjoined
08:05:03  * bcantrillquit (Quit: Leaving.)
08:59:41  * marselljoined
10:44:02  * abraxasquit (Remote host closed the connection)
12:57:51  * chorrelljoined
13:00:25  * chorrell_joined
13:50:44  * bcantrilljoined
13:55:44  * chorrell_quit (Quit: My iMac has gone to sleep. ZZZzzz…)
14:27:44  * bcantrillquit (Quit: Leaving.)
14:45:57  * dcrawfordjoined
15:05:05  * almostobsoletejoined
15:05:23  <almostobsolete>Hello, I'm just having a play with Manta
15:06:16  <marsell>Have fun. :)
15:06:48  <almostobsolete>Is there a way of having the map phase output under different keys and the reduce phase to run once per key?
15:07:10  <almostobsolete>That's how I'm used to writing map-reduce jobs, or should I be doing things differently here?
15:07:17  <rmustacc>Sounds like you want to just send it to another map phase before the reduce phase.
15:07:49  <almostobsolete>I'm doing log file analysis and what I want to do is produce stats for each user (a given log entry has a user property). So there are lots of log files (1 per day) each of which has lots of lines.
15:08:32  <rmustacc>Naively wouldn't you just build up that hash from each file and then sum them together in reduce?
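A minimal sketch of the approach rmustacc describes: each map task builds one table per log file, and the reduce step sums the tables. The log line layout (user id as the first whitespace-separated field) and the function names are assumptions made for illustration; nothing here is Manta-specific.

```python
from collections import Counter

def count_users_in_file(lines):
    """Map step: build one user -> count table for a single log file."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        counts[fields[0]] += 1  # assumed layout: user id is the first field
    return counts

def merge_counts(tables):
    """Reduce step: sum the per-file tables into one."""
    total = Counter()
    for table in tables:
        total.update(table)  # Counter.update adds counts rather than replacing
    return total

# Invented sample data standing in for two daily log files.
day1 = ["alice GET /a", "bob GET /b", "alice GET /c"]
day2 = ["bob GET /a", "", "bob GET /d"]
totals = merge_counts([count_users_in_file(day1), count_users_in_file(day2)])
print(dict(totals))
```

The merged table still has to fit in the memory of the single reducer, which is the limit raised later in the conversation.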
15:09:15  * chorrellquit (Quit: My MacBook Pro has gone to sleep. ZZZzzz…)
15:10:06  <almostobsolete>The way I've done it as a map reduce before is to emit something like (user_id, {foocount: 1}) in the map phase then in the reduce phase it's grouped by the key (user id) and I just add up all the foocounts
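The grouped style almostobsolete is used to can be sketched in plain Python as follows. The record shape (`{"user": ...}`) and the helper names are hypothetical; the grouping is done locally with `sorted` and `groupby` to stand in for the shuffle step a classic map-reduce framework would perform.

```python
from itertools import groupby
from operator import itemgetter

def map_entry(entry):
    """Emit one (user_id, partial stats) pair per log entry."""
    yield entry["user"], {"foocount": 1}

def reduce_user(user_id, values):
    """Called once per key: add up all the foocounts for one user."""
    return user_id, {"foocount": sum(v["foocount"] for v in values)}

# Invented sample entries.
entries = [{"user": "u1"}, {"user": "u2"}, {"user": "u1"}]

# Sort on the key only, so the dict values are never compared.
pairs = sorted((kv for e in entries for kv in map_entry(e)), key=itemgetter(0))
results = [
    reduce_user(key, [value for _, value in group])
    for key, group in groupby(pairs, key=itemgetter(0))
]
print(results)
```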
15:10:49  <almostobsolete>rmustacc: I'm not sure I follow?
15:11:26  * papertigersjoined
15:11:55  <rmustacc>I think we're saying the same thing, but at least for me, I don't see the need for automated grouping.
15:12:25  <almostobsolete>There are probably too many users to fit into memory all at once
15:13:05  <almostobsolete>Are you talking about the technique used here: http://apidocs.joyent.com/manta/job-patterns.html#word-frequency-count
15:13:51  <almostobsolete>Because it seems to rely on the AWK reducer process keeping all the stats in memory as it processes. I may be totally missing the point here though :p
15:15:34  <rmustacc>Well, if awk isn't working for you, you can use something else.
15:15:48  <rmustacc>But I don't really have a sense for how large you're referring to.
15:16:35  <marsell>You can have multiple reducers.
15:16:44  <rmustacc>But what I would consider doing is something like a multi-phase job.
15:16:52  <rmustacc>multiple reducers isn't quite what he wants.
15:16:58  <almostobsolete>It's not specifically awk, it's just that as far as I can see if the reduce phase is just a single run of a single program then it's always going to be limited by the machine it's running on.
15:17:21  <rmustacc>That is, well, the definition of reduce generally.
15:17:22  <almostobsolete>So, for example, if instead of a single count I'm actually generating quite a lot of information per-user
15:17:50  <rmustacc>Well, you can do multiple reducer rounds as Marsell mentioned.
15:17:57  <rmustacc>You can also look at the msplit documentation.
15:17:58  <almostobsolete>That's what I've used grouping by key for in Map-Reduce before, as I understand it that's what lets Map-Reduce work on large data sets
15:18:20  <rmustacc>Have you tried it yet and know that it doesn't work?
15:18:20  <almostobsolete>Multiple reducer rounds may be what I want, I'll check out msplit docs as well
15:18:25  <almostobsolete>thanks!
15:19:48  <almostobsolete>ah, msplit is exactly what I wanted
15:19:48  * mcavagejoined
15:21:19  <mcavage>almostobsolete: fwiw, we basically do exactly what you're describing (split by $userid) for internal manta things.
15:21:47  <mcavage>that's what msplit was written for - so you can get around the limitations of one box with reduce -- you likely still want a final reduce phase to make a single output object, but that's not strictly necessary I suppose.
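The idea of splitting by user id so that no single reducer has to hold every user can be sketched like this. It is only a model of the concept: the choice of `zlib.crc32` as the hash and the record shape are assumptions for this sketch, not a description of how msplit assigns records.

```python
import zlib

def bucket_for(user_id, n_reducers):
    """Pick a reducer index for a user; crc32 is an arbitrary stable hash."""
    return zlib.crc32(user_id.encode("utf-8")) % n_reducers

def partition(records, n_reducers):
    """Route each record to one bucket, keyed on its user id."""
    buckets = [[] for _ in range(n_reducers)]
    for record in records:
        buckets[bucket_for(record["user"], n_reducers)].append(record)
    return buckets

# Invented sample records.
records = [{"user": u, "n": i} for i, u in enumerate(["u1", "u2", "u3", "u1", "u2"])]
buckets = partition(records, 4)
```

Because the bucket depends only on the user id, every record for a given user lands on the same reducer, and each reducer only needs memory for its own share of users.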
15:23:30  * chorrelljoined
15:23:56  <almostobsolete>It even understands JSON, that's really useful :)
15:24:04  <mcavage>well, we needed it too ;)
15:24:59  <almostobsolete>Good to know you guys are dog-fooding it all
15:26:12  <mcavage>well we're going to be talking about manta in the fall at several conferences, this is one of the topics for sure: how manta built manta ;)
15:26:25  <mcavage>we implemented all the upstack batch stuff on top of it.
15:26:31  <almostobsolete>nice
15:28:31  * chorrellquit (Read error: Connection reset by peer)
15:32:55  <almostobsolete>mput is giving me "url is a required argument" when I type "mput -f map.py /andymitchell/stor/map.py"
15:32:55  * nfitchjoined
15:33:07  <almostobsolete>sure it's something really silly :p
15:33:25  <mcavage>env | grep MANTA
15:33:44  <mcavage>almostobsolete: you either need to have MANTA_URL set or pass in --url
15:34:10  <almostobsolete>ah, I haven't restarted that terminal window since I added that to my .profile
15:34:18  <almostobsolete>thanks, knew it'd be something simple :)
15:34:21  <mcavage>. ~/.profile
15:34:26  <mcavage>you don't need to restart ;)
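For scripts that wrap the CLI tools, a rough Python counterpart of the `env | grep MANTA` check might look like the sketch below. Only the `MANTA_URL` name comes from the conversation; the helper names and the example URL are made up.

```python
import os

def manta_settings(environ=None):
    """Return the environment variables whose names contain MANTA."""
    environ = os.environ if environ is None else environ
    return {name: value for name, value in environ.items() if "MANTA" in name}

def missing_url(environ=None):
    """True when MANTA_URL is not set, the case that triggered the mput error."""
    return "MANTA_URL" not in manta_settings(environ)

# Example with an explicit mapping instead of the real environment.
example_env = {"HOME": "/home/example", "MANTA_URL": "https://manta.example.invalid"}
print(manta_settings(example_env), missing_url(example_env))
```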
15:34:47  <almostobsolete>thanks
15:36:19  * chorrelljoined
15:37:10  <almostobsolete>Is there an easy way to get mjob to output errors if they occur? "job 9491e2dc-70d7-4773-bc0a-47519bbd4a06 had 565 errors" isn't super helpful...
15:38:46  <mcavage>mjob errors 9491e2dc-70d7-4773-bc0a-47519bbd4a06
15:38:47  * chorrellquit (Read error: Connection reset by peer)
15:38:53  <mcavage>also:
15:39:28  <mcavage>mjob errors 9491e2dc-70d7-4773-bc0a-47519bbd4a06 | json -ga stderr | xargs -L 10 mget
15:39:46  <mcavage>that *should* give you all the stderr spew -- you might need to tweak
15:40:10  <mcavage>anyway, the errors come back as a json stream that has things like the reason it exited as a switchable code, and we save stderr for you off to a manta object.
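What the `json -ga stderr` step does can be approximated in Python as below. The sketch assumes the errors arrive as one JSON object per line and that the `stderr` field holds the path of the saved object, as the command above implies; the `code` field name and the sample paths are invented for the example.

```python
import json

def stderr_paths(stream_text):
    """Collect the stderr object path from each error record that has one."""
    paths = []
    for line in stream_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the stream
        record = json.loads(line)
        if "stderr" in record:
            paths.append(record["stderr"])
    return paths

# Invented sample stream: field names other than "stderr" are hypothetical.
sample = "\n".join([
    json.dumps({"code": "ExampleError", "stderr": "/example/stor/stderr-a"}),
    json.dumps({"code": "ExampleError"}),
    json.dumps({"code": "ExampleError", "stderr": "/example/stor/stderr-b"}),
])
print(stderr_paths(sample))
```

Each returned path could then be fetched, which is the role `xargs -L 10 mget` plays in the shell pipeline.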
15:40:39  <almostobsolete>great
15:41:13  <almostobsolete>Would be a useful addition to mjob if it had an option to do that for you in the event of an error (much like the -o option does for non-error output)
15:41:40  <mcavage>yeah, probably so.
15:41:51  <mcavage>if you want to put an issue on github, appreciated ;)
15:42:44  * chorrelljoined
15:42:45  <almostobsolete>Oh, I hadn't noticed it was on github, will do
15:43:51  <mcavage>oh are you just using the mac installer?
15:43:53  <mcavage>good to know :)
15:43:57  <mcavage>anyway, joyent/node-manta
15:44:09  <almostobsolete>I installed via npm
15:44:20  <mcavage>oh ok.
15:44:28  <almostobsolete>But didn't really look much beyond that, should have guessed it was on github
15:44:38  <mcavage>no worries :)
15:46:22  <almostobsolete>https://github.com/joyent/node-manta/issues/87
15:46:22  * chorrellquit (Read error: Connection reset by peer)
15:46:30  <mcavage>thanks!
15:47:43  <almostobsolete>Are you guys accepting pull requests? If I get a chance I might add it
15:47:52  <mcavage>yes! :)
15:51:06  * chorrelljoined
15:53:13  * chorrellquit (Read error: Connection reset by peer)
15:55:26  * papertigersquit (Ping timeout: 240 seconds)
15:57:09  * chorrelljoined
15:59:43  * chorrellquit (Read error: Connection reset by peer)
16:00:03  * CarlosCjoined
16:00:53  * fredkjoined
16:01:24  * fredk1joined
16:01:24  * fredkquit (Read error: Connection reset by peer)
16:02:27  * fredk1quit (Client Quit)
16:03:59  <almostobsolete>Any idea what I'm doing wrong here: https://gist.github.com/almost/9e154f0e1b2ff8f21c6c (I'm getting "bash: /assets/andymitchell/stor/reducer.py: No such file or directory", so it's finding the map.py but not the reducer.py)
16:04:26  * papertigersjoined
16:05:00  * fredkjoined
16:05:39  <jperkin>-s ordering is sensitive, try re-arranging so it is -s .. -m .. -s .. -r
16:06:04  <almostobsolete>oh right, so the -s only applies to the next -m or -r?
16:06:32  <jperkin>yes
16:06:40  <almostobsolete>cool, makes sense
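The rule confirmed here, that each `-s` attaches to the next `-m` or `-r`, can be modelled with a few lines of Python. This is a toy model of the described behaviour, not mjob's actual option parser; the exec strings are hypothetical and the asset paths are only illustrative.

```python
def assets_per_phase(options):
    """Attach each pending -s asset to the next -m or -r phase."""
    phases, pending = [], []
    for flag, value in options:
        if flag == "-s":
            pending.append(value)
        elif flag in ("-m", "-r"):
            phases.append({"type": flag, "exec": value, "assets": pending})
            pending = []  # assets do not carry over to later phases
    return phases

MAP_ASSET = "/andymitchell/stor/map.py"
REDUCE_ASSET = "/andymitchell/stor/reducer.py"

# Both assets before -m: under this model the reduce phase gets none.
grouped = assets_per_phase([
    ("-s", MAP_ASSET), ("-s", REDUCE_ASSET),
    ("-m", "run-map"), ("-r", "run-reduce"),
])

# Interleaved as suggested: -s .. -m .. -s .. -r
interleaved = assets_per_phase([
    ("-s", MAP_ASSET), ("-m", "run-map"),
    ("-s", REDUCE_ASSET), ("-r", "run-reduce"),
])
print(grouped[1]["assets"], interleaved[1]["assets"])
```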
16:07:45  * mamashpart
16:08:07  <almostobsolete>any examples of how I put msplit into the pipeline? Should it be another mapper with the -m option?
16:09:12  <nfitch>almostobsolete: There are some examples here: https://github.com/joyent/manta-compute-bin/blob/master/docs/man/msplit.md
16:10:12  <almostobsolete>Thanks, those examples show how to use the command on its own, but I'm not entirely clear where to slot it into a call to mjob
16:11:20  * papertigers_joined
16:11:34  <nfitch>Oh, it's used as part of the commands run in marlin compute. It can be used in a map or reduce command as long as there are subsequent phases. It is usually the last thing in the command.
16:11:35  * papertigersquit (Ping timeout: 264 seconds)
16:11:45  <nfitch>Let me dig up a trivial example....
16:12:39  <almostobsolete>Oh right, so I should pipe my other mapper through to it rather than specify it as a map step on its own?
16:12:52  <nfitch>mjob create -m 'for i in {1..1000}; do echo $i; done | msplit -n 4' --count=4 -r 'cat'
16:14:09  <almostobsolete>ah, I see, I need to specify number of reducers for mjob as well. I assume I can then run a second level of reducer to combine them?
16:14:11  <nfitch>Yes. So ^^ will split the 1000 number stream into 4 streams. The 'cat' in the reduce phase is useless as the maps would already output 4 files.
16:14:21  <nfitch>Exactly!
16:14:53  <almostobsolete>Perfect, my reducer is already re-entrant so I can just use the same reducer twice
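The shape of that job, a split into four streams, four reducers, then a second reduce round using the same reducer, can be simulated locally. Round-robin assignment is a stand-in chosen for this sketch and is not claimed to be how msplit distributes lines; the reducer here is a simple sum, which is safe to apply twice.

```python
def split(items, n):
    """Distribute items over n streams (round-robin, for illustration only)."""
    streams = [[] for _ in range(n)]
    for index, item in enumerate(items):
        streams[index % n].append(item)
    return streams

def reducer(values):
    """A reducer whose output can be fed back into itself."""
    return sum(values)

numbers = range(1, 1001)                      # the 1..1000 stream from the example
partials = [reducer(s) for s in split(numbers, 4)]  # first reduce round, 4 reducers
total = reducer(partials)                     # second round combines the partials
print(total)  # 500500
```

The same reducer works in both rounds because summing partial sums gives the same answer as summing the original values.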
16:19:32  * almostobsoletequit (Remote host closed the connection)
16:20:53  * AvianFlujoined
16:21:39  * papertigers_quit (Ping timeout: 276 seconds)
16:24:23  <nfitch>Another example is how we aggregate metrics in our log files. We have a bunch of bunyan (https://github.com/trentm/node-bunyan) objects, each has a time field. We use msplit with a -j and -e to bucketize the log records to the set of reducers. The reduce phase calculates metrics for the time periods, and a final reducer combines into an hourly summary.
16:24:40  <nfitch>Here are the phases in a recent job (slightly modified :)
16:24:41  <nfitch>https://gist.github.com/nfitch/6010266
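The three stages nfitch outlines (bucket records by time, compute metrics per time period, combine into an hourly summary) might be sketched as follows. The sketch assumes each record carries an ISO 8601 `time` string; the per-minute period, the crc32 bucketing, the record count as the metric, and the sample timestamps are all assumptions for illustration and are not taken from the linked gist.

```python
import zlib
from collections import Counter

def minute_key(record):
    """Truncate an ISO 8601 timestamp to the minute: YYYY-MM-DDTHH:MM."""
    return record["time"][:16]

def bucketize(records, n_reducers):
    """Send all records from the same minute to the same reducer."""
    buckets = [[] for _ in range(n_reducers)]
    for record in records:
        index = zlib.crc32(minute_key(record).encode("utf-8")) % n_reducers
        buckets[index].append(record)
    return buckets

def per_minute_counts(records):
    """First reduce round: count records per minute within one bucket."""
    return Counter(minute_key(r) for r in records)

def hourly_summary(partials):
    """Final reducer: roll the per-minute counts up to the hour."""
    hours = Counter()
    for partial in partials:
        for minute, count in partial.items():
            hours[minute[:13]] += count  # keep YYYY-MM-DDTHH
    return hours

# Invented sample records.
records = [
    {"time": "2013-01-01T15:05:23.000Z"},
    {"time": "2013-01-01T15:05:59.000Z"},
    {"time": "2013-01-01T15:42:00.000Z"},
    {"time": "2013-01-01T16:01:00.000Z"},
]
partials = [per_minute_counts(b) for b in bucketize(records, 4)]
summary = hourly_summary(partials)
print(dict(summary))
```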
16:32:32  * ryancnelsonjoined
16:46:22  * papertigersjoined
16:47:43  * papertigersquit (Client Quit)
16:56:05  * papertigersjoined
17:07:27  * elijah-mbpquit (*.net *.split)
17:09:58  * elijah-mbpjoined
17:13:21  * papertigersquit (Quit: papertigers)
17:27:39  * rmustacc_joined
17:28:45  * rmustaccquit (Disconnected by services)
17:28:48  * rmustacc_changed nick to rmustacc
17:32:04  * almostob`joined
17:37:36  * almostob`quit (Remote host closed the connection)
17:38:30  * _Tenchi_joined
17:47:32  * mamashjoined
18:05:00  * mamashpart
18:05:42  * bcantrilljoined
18:11:53  * mamashjoined
18:16:51  * mamashpart
18:31:59  * bcantrillquit (Quit: Leaving.)
18:35:18  * ryancnelsonquit (Quit: Leaving.)
18:50:34  * dcrawfordquit (Ping timeout: 262 seconds)
19:06:05  * papertigersjoined
19:31:47  * ghostbar_quit (Remote host closed the connection)
19:32:16  * ghostbarjoined
19:36:34  * ghostbarquit (Ping timeout: 246 seconds)
19:47:36  * papertigersquit (Quit: papertigers)
19:53:15  * papertigersjoined
19:55:43  * bcantrilljoined
19:58:56  * ghostbarjoined
20:04:45  * papertigersquit (Quit: papertigers)
20:22:07  * mamashjoined
21:50:43  * bixujoined
22:20:28  * elijah-mbpquit (Remote host closed the connection)
22:31:37  * elijah-mbpjoined
22:42:14  * stonecobraquit (*.net *.split)
22:42:20  * bcantrillquit (*.net *.split)
22:42:20  * _Tenchi_quit (*.net *.split)
22:42:20  * CarlosCquit (*.net *.split)
22:42:21  * ed209quit (*.net *.split)
22:42:22  * konobiquit (*.net *.split)
22:42:23  * asongequit (*.net *.split)
22:44:08  * _Tenchi_joined
22:44:36  * bcantrilljoined
22:44:36  * 45PAA35F7joined
22:44:36  * CarlosCjoined
22:44:36  * ed209joined
22:44:36  * konobijoined
22:44:36  * asongejoined
22:45:25  * stonecobrajoined
22:45:47  * papertigersjoined
22:47:14  * 45PAA35F7quit (Write error: Connection reset by peer)
22:47:16  * mamashpart
22:50:08  * papertigersquit (Client Quit)
23:04:04  * papertigersjoined
23:06:22  * papertigersquit (Client Quit)
23:31:58  * mcavagequit
23:44:16  * papertigersjoined
23:50:46  * papertigersquit (Ping timeout: 240 seconds)
23:53:19  * nfitchquit (Quit: Leaving.)
23:59:36  * chorrelljoined