Sunday, August 20, 2006

SIGGRAPH.org's Interpretive Dance

So, how have you been sleeping lately? Glad to hear it. Me? Not much at all.

Seriously, our Plone / Zope has been crashing like me after three pitchers of margaritas. Our setup had been running fine for many months, through a number of Zope and Plone versions, from somewhere in Zope 2.6 or 2.7-land with Plone 2.0 up to Zope 2.8.7 and Plone 2.1.3. About two weeks ago, our Zope started acting up: hanging (or 'spinning', in zspeak) a couple of times a day and not responding to any requests until restarted.

Since this started, which roughly corresponds to the time I checked out of my hotel in Boston on August 4, after SIGGRAPH 2006, we have expanded the list of addresses which receive site-down notifications from our monitoring system. A handful of us have lost a lot of sleep / rest by keeping our eyelids peeled back in case our Zope needs to be restarted again. On some days, this has had to happen as often as once per hour. We even went to the extreme of moving our site between machines in hopes of isolating a hardware problem such as faulty memory, which hasn't turned out to be a likely cause of our troubles.

I removed as many products as I could and asked everyone within an e-mail's reach not to add or modify content until further notice without notifying me, and the site has continued to crash anyway. Last night, finally, I switched a version of the main site to Zope 2.9.3 and Plone 2.5 in a load-balanced setup, with some ad-hoc auto-restart scripts which I hope to get working soon in case the problems persist.
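For anyone wondering what an auto-restart script of this sort might look like, here is a minimal sketch, not the actual scripts mentioned above. It assumes a watchdog that polls a URL and runs a restart command after a few consecutive failed checks. The function names, the threshold of three failures, the 30-second interval, and the URL and `zopectl` path in the usage comment are all hypothetical.

```python
import subprocess
import time
import urllib.request


def is_responding(url, timeout=10.0):
    """Return True if the URL answers without error inside the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except Exception:
        # Timeouts, refused connections and HTTP error statuses all count as "down".
        return False


def watchdog_step(probe, restart, failures, threshold=3):
    """Run one check and return the updated count of consecutive failures.

    probe   -- callable returning True when the site is healthy
    restart -- callable that restarts the server
    """
    if probe():
        return 0
    failures += 1
    if failures >= threshold:
        restart()
        return 0  # start counting again after a restart
    return failures


def run(url, restart_cmd, interval=30.0):
    """Poll forever, restarting after repeated failures (assumed policy)."""
    failures = 0
    while True:
        failures = watchdog_step(
            lambda: is_responding(url),
            lambda: subprocess.call(restart_cmd),
            failures,
        )
        time.sleep(interval)


# Hypothetical usage; the port and instance path are placeholders:
# run("http://localhost:8080/", ["/path/to/instance/bin/zopectl", "restart"])
```

Requiring several failures in a row before restarting keeps one slow response from bouncing a client that is merely busy, at the cost of a slightly longer outage when it really is spinning.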

One thing is for sure: the new setup is a helluva lot faster and, with two clients, should hopefully take at least twice as long to fail.