Spinitron outage Apr 20th and 21st 2022

When we go to KXCI 91.3FM Tucson AZ – Real People Real Radio we get the error message “An internal server error occurred.”

Hi Amanda,

Thanks for reporting it. We’re working on it as fast as we possibly can! Hopefully the site will be back very soon.

I apologize for the outage and thank you for your patience.


UPDATE (from @tom) Mon April 25. I wanted to put a “technical difficulties” graphic near the top of this page but they were all too cute for me given how raw I still feel. So let’s have this photo.

Spinitron rents servers from French provider OVH, in their data center near Montreal, where the database software that was crashing last week runs. Read about the OVH fire in Strasbourg last year, shown here, in this Reuters article: Millions of websites offline after fire at French cloud services firm | Reuters


UPDATE (from @tom) Thu Apr 28. Sorry if that image I chose on Monday was alarming. That isn’t what happened to our servers. Spinitron is still safely at home in this building.

Which is right next door to a hydro-electric power station in Beauharnois, Quebec, Canada.

I’m afraid it’s all still rather menacing to look at, as large industrial installations often are. So if it helps, here’s a more relaxing image of the inside of a computer.

I assume the fans were not rotating at the time that snap got shot. Right?


The database servers have been crashing. Very mysterious:

Apr 20 16:08:32 bhs6 mysqld[10356]: realloc(): invalid next size
Apr 20 16:08:32 bhs6 mysqld[10356]: 220420 16:08:32 [ERROR] mysqld got signal 6 ;

I managed to recover the cluster and the service at least for now.

the cluster is down again. i’m going to try recovering from a recent backup


service is back up for now. i rebooted all the servers in the cluster. monitoring to see if this bug recurs. if it does we will need to install a different mariadb server version


no luck. all the db servers crashed again. i’ll attempt downgrading


i’ve restarted one of the nodes, bhs5, without the Galera cluster software, i.e. as a standalone. Spinitron service is available again but i don’t know if it will last. even if it does, this is still just a temporary fix
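
for the record, “standalone” here just means starting mariadb with the Galera/wsrep settings switched off. roughly like this, though the exact config file name and option values on our boxes may differ a little:

# in the Galera section of the mariadb config, e.g. /etc/mysql/mariadb.conf.d/60-galera.cnf
[galera]
# run bhs5 as a plain standalone server: no replication, no cluster quorum
wsrep_on = OFF
# wsrep_provider        = /usr/lib/galera/libgalera_smm.so
# wsrep_cluster_address = gcomm://…

# then restart the service
sudo systemctl restart mariadb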

no joy. same db server crash. trying to load from a recent db backup now. will take an hour or so

backup loaded. service up for now. i don’t expect this to last.

can anyone help me downgrade Debian packages? i’m stuck with aptitude errors like this:

 mariadb-server-10.4 : Conflicts: mariadb-server (< 1:10.4.24+maria~buster) but 1:10.4.21+maria~buster is to be installed

i downgraded from mariadb-server 10.4.24 to 10.4.21 and i have some hope that this stabilizes things.

there’s more repair to do but i will try to deal with it tomorrow, after giving this latest change some time to see if it helps
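
in case anyone else hits the same wall: the way out of that aptitude conflict is to downgrade all of the related mariadb packages to the same version in one command. roughly what it came down to (package list abbreviated; the exact set depends on what’s installed):

# see which older versions the MariaDB repo still offers
apt policy mariadb-server-10.4

# downgrade server and client packages together so the versions stay consistent
sudo apt-get install --allow-downgrades \
  mariadb-server=1:10.4.21+maria~buster \
  mariadb-server-10.4=1:10.4.21+maria~buster \
  mariadb-server-core-10.4=1:10.4.21+maria~buster \
  mariadb-client-10.4=1:10.4.21+maria~buster \
  mariadb-client-core-10.4=1:10.4.21+maria~buster

# keep apt from immediately upgrading them back
sudo apt-mark hold mariadb-server-10.4 mariadb-client-10.4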


damn. that didn’t help. i don’t know what to try next

i’ll keep nursing the currently functioning server along while i buy and set up new hardware

I just got off the phone with Tom. If anyone questions his commitment to y’all getting this problem fixed, know that he has been up all night nursing the servers and working hard to fix this problem. Tom and Eva have pulled out all the stops to get this strange and obscure problem corrected and make the servers as stable as they’ve been for many years.


got a new server. built it up with all new software versions. installed our apps. migrated databases and search engines to it. and now it’s running and seems to be working more or less. we’ll have to wait a few hours to confirm if the same crashes are going to happen.

many apologies, of course, for all the service interruptions.

damn. it’s just gone down again


i’m disabling the metadata push service for the time being to see if that helps


push is back on. turning it off seems not to have helped

Tom, thank you so much for the focused attention on these issues. I know you understand this critical connection between stations and their audience. KBCS looks OK right now for how we are implementing the playlist for app and web. I have access on Spinitron for creating playlists.
Please let us know if there is anything we can do. Again, thank you for your attention to this server crash.


I noticed that all image files I uploaded in the past 18 hours have vanished from the playlists. This is not a complaint, just FYI, but you probably already know this. I will re-upload them. Just some images of Ukrainian album covers.

I’m impressed by your efforts to fix everything, Tom & Eva. You’re not alone: Brainwashed.com has been suffering from server issues for weeks now :frowning:


sorry about the loss of some images. i was focused on the database and didn’t think to copy those over for a while.

thanks for the words of support. this is a very miserable experience. turning out to be eye-wateringly expensive too. we hired mariadb enterprise support this afternoon. $yikes.

The database server has behaved well for the last 11 hours since I made a configuration change recommended by support staff at MariaDB, the firm that develops the DB software that we use.

Now I will start to put the cluster back into normal operation. You may notice some features of Spinitron not working properly until it is all done (things like searches not finding the most recent spins or new releases). This will take many days.
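
(for the nerds following along: “back into normal operation” roughly means bootstrapping a fresh cluster from the one good node, then letting the other nodes rejoin one at a time and copy the data over before they take any traffic. a sketch of that, assuming the wsrep settings disabled earlier have been turned back on; not necessarily the exact commands we use:)

# on the node holding the current data, bootstrap a new single-node cluster
sudo galera_new_cluster

# on each remaining node, one at a time; the node joins and pulls a full state transfer (SST)
sudo systemctl start mariadb

# after each join, confirm the cluster has grown
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';"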


News for nerds

Apr 20 14:27:32 left coast time we get the first of these crashes. mariadb (the process) makes a fubar move in memory allocation and (I’m guessing now) glibc’s memory allocation software notices the mistake and aborts the process (i.e. glibc raises signal 6, SIGABRT, in mariadb’s main process). mariadb doesn’t have a way to recover from this kind of error and attempts to shut itself down in the most urgent manner it knows, but hangs while trying to write a backtrace. This much looks like this:

realloc(): invalid next size
220421 19:08:13 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.5.15-MariaDB-0+deb11u1
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=11
max_threads=202
thread_count=11
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 575744 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x7f9ba8000c58
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fa478ba2d78 thread_stack 0x49000

Then I manually send a SIGKILL to mariadb and systemd restarts it. mariadb then attempts to recover the database from what remains in the files. Initially this was complicated by the procedures for restarting a stuck Galera cluster, until I reconfigured to just one server. Then I experimented with different software versions and then new hardware, and eventually we got advice from MariaDB (the firm) to try dynamically linking mariadb to jemalloc (*). Then mariadb (the process) stopped crashing.
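
For the curious, the usual way to get mariadb onto jemalloc on a Debian system without rebuilding anything is to preload the library via systemd. A sketch of that, not necessarily byte-for-byte what we ran (the library path varies by distro and architecture):

# install jemalloc and tell systemd to preload it into mariadb
sudo apt-get install libjemalloc2
sudo systemctl edit mariadb     # opens an override file; add the two lines below

[Service]
Environment="LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"

sudo systemctl daemon-reload
sudo systemctl restart mariadb

# confirm the running server is using jemalloc rather than the system allocator
mysql -e "SHOW GLOBAL VARIABLES LIKE 'version_malloc_library';"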

So much for the good news, but for glass-half-empty types like me (the type I imagine you’d want looking after the IT you rely on) the experience raises a number of questions. What changed some time before Apr 20 14:27:32 that led to this pathological behavior? (I guess some new pattern of requests.) Exactly how did mariadb crash? (The instruction sequence.) What, if anything, needs to be fixed in the mariadb+glibc configuration? How did mariadb get stuck in its emergency exit, requiring a manual kill -9 to restart?

(*) I wonder if the business model of offering a “Community” software version for $0 together with a superior but vaguely defined “Enterprise” version that comes with support services for $x might create a structural incentive to hoard magic tricks like this.