The accepted solution is, unfortunately, wrong. It is right as far as it says,
Kill the connection with lock
This is indeed (almost surely; see below) what to do. But then it suggests,
Kill 1398
…and 1398 is not the connection with the lock. How could it be? 1398 is the connection waiting for the lock. This means it does not yet have the lock, and therefore killing it avails nothing. The process holding the lock will still hold the lock, and the next thread trying to do something will therefore also stall and enter “Waiting for metadata lock” in due course.
You have no guarantee that the processes “waiting for metadata lock” (WFML) won’t also block, but you can be certain that killing only WFML processes will achieve exactly nothing.
The real cause is that another process is holding the lock and, more importantly, SHOW FULL PROCESSLIST will not tell you directly which one it is.
The one thing you can be sure of is that the culprit is none of the processes labeled “Waiting for metadata lock”. Those are, so to speak, the victims.
SHOW FULL PROCESSLIST will tell you whether a process is doing something, yes. Usually that works. Here, however, the process holding the lock is doing nothing, and it hides among the other threads also doing nothing, all reported as “Sleep”.
If SHOW FULL PROCESSLIST shows you a process running a DML statement, or one in “Sending data” status, then that is almost certainly the culprit, and the other processes are waiting for it to release its locks. The locks can be implicit; the process need not have issued a LOCK TABLES at all (that would actually lock in a different way). But a process can also be holding a lock while doing nothing, and be duly marked as “Sleep”.
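To see how such an implicit lock arises (a sketch; the table name t is made up), open a transaction that merely reads a table and then do nothing: the shared metadata lock is kept until COMMIT or ROLLBACK, even while the connection idles.

-- Session A: touches table t inside a transaction, then goes idle;
-- the shared metadata lock on t is held until COMMIT/ROLLBACK.
START TRANSACTION;
SELECT * FROM t LIMIT 1;
-- (connection now shows as "Sleep", still holding the lock)

-- Session B: any DDL on t now stalls in "Waiting for table metadata lock".
ALTER TABLE t ADD COLUMN c INT;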
In the OP’s case the culprit is almost certainly process 1396, which started before process 1398 and has now been in Sleep state for 46 seconds. As far as MySQL is concerned, 1396 did everything it needed to do (that is why it has been sleeping for 46 seconds), yet it never released its locks. And no thread that went to sleep before 1396 can be the one holding the lock, or 1396 itself would also have stalled.
Due to MySQL’s “deadlock-free” locking strategy, no process can hold a lock, release it, and then take it up again while others wait; a lock wait is invariably caused by a process that acquired the lock first and is still holding it. This is useful (we will make use of this fact below), because it guarantees that the lock “queue”, so to speak, is sequential.
IMPORTANT: if you connected to MySQL as a limited user, SHOW FULL PROCESSLIST will not show all the processes. So the lock might be held by a process that you cannot see.
So: if SHOW FULL PROCESSLIST shows you everything and shows one running process, then that process is probably responsible, and you need to wait for it to finish whatever it is doing (or you may kill it, at your peril).
The rest of this answer deals with the baffling case where processes are waiting without an apparent reason and no one seems to be doing anything.
A better SHOW PROCESSLIST
SELECT ID, TIME, USER, HOST, DB, COMMAND, STATE, INFO
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE DB IS NOT NULL
  AND (INFO NOT LIKE '%INFORMATION_SCHEMA%' OR INFO IS NULL)
ORDER BY DB, TIME DESC;
The above can be tuned to show only the processes in Sleep state (see the variant below), and in any case it sorts them by time descending, so it is easier to find the process that is hanging: due to sequentiality, it is usually the Sleep’ing one immediately before the ones “waiting for metadata lock”, and it will always be one of those that have slept longer than any of the waiting ones.
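For instance, a minimal variant restricted to sleeping threads (a sketch; note that “Sleep” appears in the COMMAND column, not in STATE):

SELECT ID, TIME, USER, HOST, DB
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE DB IS NOT NULL
  AND COMMAND = 'Sleep'
ORDER BY DB, TIME DESC;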
The important thing
Leave any “waiting for metadata lock” process alone.
Quick and dirty solution, not really recommended but quick
Kill all processes in “Sleep” state, on the same database, that are older than the oldest thread in “waiting for metadata lock” state (a query sketching this selection follows the example below). This is what Arnaud Amaury would have done:
- for each database that has at least one thread in WaitingForMetadataLock:
- the oldest connection in WFML on that DB turns out to be Z seconds old
- ALL the “Sleep” threads on that DB and older than Z must go. Start with the freshest ones, just in case.
- If one older and non-sleeping connection exists on that DB, then maybe that is the one holding the lock, but it is doing something. You may of course kill it, but especially if it is an UPDATE/INSERT/DELETE, you do so at your own peril.
- After each KILL, re-evaluate the situation and restart the procedure accordingly. Processes that were waiting might now be running, or they may have run briefly and be sleeping now. They might even be the ones, now, holding a new metadata lock.
Ninety-nine times out of one hundred, the thread to be killed is the youngest among those in Sleep state that are older than the oldest one waiting for metadata lock:
TIME STATUS
319 Sleep
205 Sleep
19 Sleep <--- one of these two "19"
19 Sleep <--- and probably this one(*)
15 Waiting for metadata lock <--- oldest WFML
15 Waiting for metadata lock
14 Waiting for metadata lock
(*) the TIME ordering actually has millisecond resolution, or so I was told; it just doesn’t show it. So while both processes have a TIME value of 19, the lowest one ought to be the younger.
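The selection described above can be sketched as a query that merely suggests the KILL statements, leaving you to review each candidate before running any of them (the exact STATE string is “Waiting for table metadata lock” on recent versions; adjust it to whatever your server reports):

SELECT CONCAT('KILL ', s.ID, ';') AS suggested_kill,
       s.ID, s.USER, s.TIME, s.DB
FROM INFORMATION_SCHEMA.PROCESSLIST AS s
JOIN (
    -- Z: age of the oldest waiting thread, per database
    SELECT DB, MAX(TIME) AS z
    FROM INFORMATION_SCHEMA.PROCESSLIST
    WHERE STATE = 'Waiting for table metadata lock'
    GROUP BY DB
) AS w ON w.DB = s.DB
WHERE s.COMMAND = 'Sleep'
  AND s.TIME > w.z
ORDER BY s.TIME ASC;   -- freshest candidates first, as advised above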
More focused fix
Run SHOW ENGINE INNODB STATUS and look at the “TRANSACTIONS” section. You will find, among others, something like

TRANSACTION 1701, ACTIVE 58 sec
2 lock struct(s), heap size 376, 1 row lock(s), undo log entries 1
MySQL thread id 1396, OS thread handle 0x7fd06d675700, query id 1138 hostname 1.2.3.4 whatever
Now check with SHOW FULL PROCESSLIST what thread id 1396 is doing with its transaction #1701. Chances are it is in “Sleep” status. So: an active transaction (#1701) holding a lock, which has even made some changes, since it has an undo log entry… and yet currently idle. That, and no other, is the thread you need to kill. It will lose those changes.
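The same information can be pulled programmatically from INFORMATION_SCHEMA.INNODB_TRX, which maps each InnoDB transaction to its processlist thread id; an idle-but-dirty transaction like the one above shows up with trx_rows_modified greater than zero while its thread sleeps. A minimal sketch:

SELECT trx_mysql_thread_id AS processlist_id,
       trx_id, trx_state, trx_started, trx_rows_modified
FROM INFORMATION_SCHEMA.INNODB_TRX
ORDER BY trx_started;   -- oldest transactions first

-- Once satisfied it is the culprit (its changes will be rolled back):
-- KILL 1396;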
Remember that doing nothing in MySQL does not mean doing nothing in general: if you fetch some records from MySQL and build a CSV for FTP upload, the MySQL connection is idle for the whole duration of the FTP upload.
Actually, if the process using MySQL and the MySQL server run on the same machine, that machine runs Linux, and you have root privileges, there is a way to find out which OS process owns the connection that requested the lock. This in turn allows you to determine (from CPU usage or, at worst, strace -ff -p pid) whether that process is really doing something or not, which helps decide whether it is safe to kill.
Why does this happen?
I see this happening with webapps that use “persistent” or “pooled” MySQL connections, which nowadays usually save very little time: the webapp instance terminated, but the connection did not, so its lock is still alive… and blocking everyone else.
Another interesting way I found of triggering this, under the hypotheses above, is to run a query returning some rows and then retrieve only some of them. If the query is not set to “auto-clean” (however the underlying DB layer does it), it will keep the connection busy and prevent a full lock on the table from going through. I had this happen to me in a piece of code that verified whether a row existed by selecting that row and checking whether it got an error (does not exist) or not (it must exist), but without ever actually retrieving the row.
PHP and PDO
PDO has persistent connection capability. This is how I ensure PDO does not pool connections and closes every one of them. It’s messy.
When opening, set the options (fourth argument to new PDO()):
PDO::ATTR_PERSISTENT => false
When disconnecting:
// We should have no transactions and no locks.
// So we discard them.
try {
    $pdo->exec('ROLLBACK WORK');
    $pdo->exec('UNLOCK TABLES');
} catch (Exception $err) {
    // Send a mail
}

// No cooperative locks. So this will not hurt a bit.
try {
    $pdo->exec('DO RELEASE_ALL_LOCKS()');
} catch (Exception $err) {
    // Send a mail
}

// Ensure the connection withers on the vine, but not too soon.
$pdo->exec('SET wait_timeout = 5');
// $pdo->setAttribute(PDO::ATTR_TIMEOUT, 5);

// If nothing else works!
// try {
//     $pdo->exec('KILL CONNECTION_ID()');
// } catch (Exception $err) {
//     // Exception here is expected: "Query execution was interrupted"
// }

// Invoke the garbage collector
$pdo = null;
Ask the DB
Another way to get the culprit, if you have a recent MySQL version (but not too recent, since the tables below are due to be deprecated), is to run (again, you need privileges on the information schema):
SELECT * FROM INFORMATION_SCHEMA.INNODB_LOCKS
WHERE LOCK_TRX_ID IN
(SELECT BLOCKING_TRX_ID FROM INFORMATION_SCHEMA.INNODB_LOCK_WAITS);
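On MySQL 5.7 and later, the metadata locks themselves can be inspected through performance_schema instead (a sketch; it requires the mdl instrument, which is off by default before 8.0, plus suitable privileges; note also that on 8.0 the INNODB_LOCKS and INNODB_LOCK_WAITS tables above are gone, replaced by performance_schema.data_locks and data_lock_waits):

-- Enable metadata lock instrumentation (on by default from 8.0):
UPDATE performance_schema.setup_instruments
SET ENABLED = 'YES', TIMED = 'YES'
WHERE NAME = 'wait/lock/metadata/sql/mdl';

-- Who holds or waits for which metadata lock, with processlist ids:
SELECT ml.OBJECT_SCHEMA, ml.OBJECT_NAME, ml.LOCK_TYPE, ml.LOCK_STATUS,
       t.PROCESSLIST_ID
FROM performance_schema.metadata_locks AS ml
JOIN performance_schema.threads AS t
  ON t.THREAD_ID = ml.OWNER_THREAD_ID
WHERE ml.OBJECT_TYPE = 'TABLE';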
Actual solution, requiring time and work
The problem is usually caused by this architecture:
webapp (jar, php) --> container or app connection pool (mysqldb, php_module, fastcgi...) --> MySQL
When the webapp dies, or the webapp’s lightweight thread instance dies, the container or connection pool might not. And it is the container that keeps the connection open, so obviously the connection does not close. Quite predictably, MySQL does not consider the operation complete.
If the webapp didn’t clean up after itself (no ROLLBACK or COMMIT for a transaction, no UNLOCK TABLES, etc.), then whatever that webapp started doing is still extant, and might still be blocking everyone else.
There are then two solutions. The worse one is to lower the idle timeout. But guess what happens if you wait too long between two queries (exactly: “MySQL server has gone away”). You could then use mysql_ping if available (it is soon to be deprecated; there are workarounds for PDO), or you could check for that error and reopen the connection when it happens (this is the Python way). So, for a small performance fee, it is doable.
The better, smarter solution is less straightforward to implement: endeavour to have the script clean up after itself, ensuring that it retrieves all rows or frees all query resources, catches all exceptions and deals with them properly, or, if possible, skips persistent connections altogether. Let each instance create its own connection, or use a smart pool driver (in PHP PDO, with PDO::ATTR_PERSISTENT explicitly set to false).
Alternatively (e.g. in PHP) you can have destructors and exception handlers force-clean the connection, committing or rolling back transactions (this ought to be enough), and maybe even issuing explicit table unlocks and RELEASE_ALL_LOCKS(), or committing connection suicide (KILL CONNECTION_ID()) for good measure.
I do not know of a way of querying for extant result-set resources in order to free them; the only way would be to save those resources in a private array as they are created.