[vsnet-grb-info 7237] The Perfect Storm: Lost/Delayed GCN Notices

GCN Circulars gcncirc at capella.gsfc.nasa.gov
Wed Jan 14 13:29:12 JST 2009


TO: All GCN Customers
RE: Delayed and lost Notices and Circulars
DT: 14-Jan-09

The GCN computer experienced a heavy load factor problem that started
at 18:48 UT (13-Jan-09) and lasted for a few hours.  The cause of the
high load factor was the confluence of six factors:
1) Both Fermi-GBM and -LAT triggered on GRB 090113 which resulted
in their full complements of TDRSS messages to GCN (and the resulting
generation and distribution of those notices).
2) Swift-BAT also triggered on this burst.  But it happened during
a Malindi telemetry downlink pass.  As such, the real-time TDRSS messages
for this burst were delayed in an on-board holding buffer until the end
of the telemetry pass.  Once that was done, all the messages from
BAT, XRT and UVOT for that burst came down to GCN in a highly
time-compressed sequence.  There was not the usual gap of a few to tens
of seconds between messages in which to do the generation and distribution
for each message before the next message arrived.
3) The Swift and ROTSE teams submitted their usual rapid-response
Circulars to GCN.  The circulars are processed on the same computer
as the notices.
4) Nine minutes after the arrival at GCN of the TDRSS messages
from the GRB trigger, Swift-BAT triggered on an outburst from
the HMXB 1A 1118-61 source, which generated a second series
of TDRSS messages to GCN.
5) And 11 minutes after that second trigger, BAT triggered again
on the 1A 1118-61 source, generating a third sequence of TDRSS
messages to GCN.  GCN was trying to process these 2nd and 3rd sequences
while the first sequence was still being processed.
6) And all during the time of these messages from the 5 instruments
on the 2 spacecraft from the 3 events, the Swift, AGILE, and INTEGRAL
missions were producing a somewhat higher than normal rate
of Pointing Direction messages.

These multiple converging streams of messages to GCN
pushed the load factor on the computer into the 90s.
The normal load factor during a Swift burst is 3-5 for about 10 minutes.
Once the load factor goes above 8, the sendmail daemon temporarily
suspends distribution of the emails that were being generated
for all these Notices.  Normally I would have quickly noticed
this high load factor and taken steps to mitigate the delays
in notice distribution, but I was in a meeting for about two hours,
so the high load factor and email suspension lasted a long time.
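
As a rough illustration of the threshold behavior described above (this
is not the GCN or sendmail code), the sketch below defers delivery while
the load is above 8.  The function names are hypothetical, and it is an
assumption that the "load factor" corresponds to the 1-minute Unix load
average.  The notice does not say how sendmail was configured on the GCN
computer; the behavior resembles sendmail's QueueLA option.

```python
import os

# Threshold quoted in the notice. Assumption: it refers to the
# 1-minute Unix load average.
QUEUE_LOAD_THRESHOLD = 8.0


def should_defer_delivery(load_1min, threshold=QUEUE_LOAD_THRESHOLD):
    """Return True when the load is above the threshold, i.e. when
    outgoing mail should be queued instead of delivered right away."""
    return load_1min > threshold


def current_load():
    """Return the 1-minute load average (Unix only).

    os.getloadavg() returns the (1-min, 5-min, 15-min) averages and
    raises OSError if the load average cannot be obtained.
    """
    return os.getloadavg()[0]
```

With the numbers from this episode, should_defer_delivery(90.0) is True,
while the normal Swift-burst range of 3-5 gives False.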

Once I noticed the problem, I started mitigation procedures
(mostly just manually suspending most of the 500+ processes spawned
to generate and distribute the notices from these 5 instruments
and 3 triggers so that the remaining (active) processes would
have CPU cycles to get their processing completed).  After about an hour,
I got the load factor down to the 2-5 range.  Sendmail was just resuming
distribution of the email notices when I accidentally sent a wrong load-control
command that killed the parent process.  So all 500+ child processes
also died, thus resulting in the loss to the GCN customers
of an unknown fraction of the notices generated for these 3 events.
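
The suspend-and-resume approach described above can be sketched as
follows.  This is only an illustration under stated assumptions: the
notice does not say which commands or signals were used, so the use of
SIGSTOP/SIGCONT, the function names, and the choice to keep the
lowest-numbered PIDs running (a crude proxy for "oldest") are all
hypothetical.  The point of the sketch is that signals go to individual
worker processes, never to the parent.

```python
import os
import signal


def split_workers(pids, max_active):
    """Partition worker PIDs into (keep_running, to_suspend).

    Assumption: lower PIDs are treated as older work and kept running.
    """
    ordered = sorted(pids)
    limit = max(0, max_active)
    return ordered[:limit], ordered[limit:]


def suspend(pids):
    """Pause each worker without terminating it."""
    for pid in pids:
        os.kill(pid, signal.SIGSTOP)


def resume(pids):
    """Let previously suspended workers continue."""
    for pid in pids:
        os.kill(pid, signal.SIGCONT)
```

For example, split_workers([30, 10, 20, 40], 2) keeps 10 and 20 running
and marks 30 and 40 for suspension; they can be resumed later once the
load has dropped.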

My apologies to everyone who did not get their copies of some of the
notices and circulars.  I have already implemented an improvement
to the dynamic load-control program running within GCN that will
help mitigate this "perfect storm" scenario.  It will reduce the
duration of such an episode from hours to 20-30 minutes, thus reducing
the probability that multiple triggers will fall within an overlapping
time window.  While there have already been several simultaneous Fermi-n-Swift
burst detections without any load factor problems, it was the
time-bunching of the Swift messages from the first BAT trigger
plus the extra messages from the second and third BAT triggers
that pushed GCN over the edge.  I am working on ways to handle
that special scenario as well.

Again, my apologies for the lost notices and the lost opportunities.

