本文主要学习记录一些问题的分析思路和方法。一、 问题现象
“有一个SQL可以在数据库中执行1ms,但由业务压力测试启动的SQL将始终执行。结果找不到,过程没有锁定,CPU占用率也很高。”二、 本文调查了思路1.pg_stat_activity查看
主要看stateee是否处于活跃状态和等待事件,即stateee 和 wait_event两个字段,一般也会看执行计划。 select * from pg_stat_activity where pid = '105430'; 2.GDB抓运行堆栈
步骤是: gdb SQL进程跟踪问题 105430(问题SQL需要正在运行) 在这里,gdb执行bt命令显示堆栈 LWLockAcquire 是轻量锁的获取函数 #0 0x000000000071f5e3 in pg_atomic_read_u32_impl (ptr=0x2b7667e03e04) at ../../../../src/include/port/atomics/generic.h:48#1 pg_atomic_read_u32 (ptr=0x2b7667e03e04) at ../../../../src/include/port/atomics.h:247#2 LWLockAttemptLock (lock=lock@antry=0x2b7667e03e00,mode=mode@entry=LW_SHARED) at lwlock.c:750#3 0x00000000007201b7 in LWLockAcquire (lock=lock@antry=0x2b7667e03e00,mode=mode@entry=LW_SHARED) at lwlock.c:1187
根据这一步,您可以模拟和捕获慢SQL运行堆栈
会话1
会话2 gdb -p 13411
会话1执行慢sql select count(*) from t_insert a,t_insert b;
会话2,执行c和bt。您可以看到,Execnestlop函数和整个调用栈目前正在执行。
3. strace分析
常用命令参考strace分析sqlplus登录慢问题_Hehuyi_In博客 strace -p 进程号# 输出如下futex(0x2b767dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)futex(0x2b767dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)futex(0x2b767dfbd38,FUTEX_WAKE,1) = 0 futex(0x2b767dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)futex(0x2b767dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)futex(0x2b767dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)futex(0x2b767dfbd38,FUTEX_WAKE,1) = 0
可以看出,此时不断唤醒等待,唤醒等待,如此重复,仿佛陷入了一个死循环,自锁自锁。自锁本身没有等待队列,死锁检测机制,需要手动释放,所以现象似乎遇到了死锁。 ///uadr指向一个地址,val代表该地址的预期值*uaddr==waitintint会在val时进行。 futex_wait(int *uaddr, int val);///唤醒n在uadr指向的锁变量上挂起等待的过程int futex_wake(int *uaddr, int n); 4. perf抓取热点函数 perf record -g -p pid 抓取30s数据
热点函数可以看到Pinbuffer、LWLockAttempLock、LWLockrelease等,这些函数都涉及到spinlock。但到目前为止,我们只发现了基本现象,但我们不知道原因。即使我们知道锁的特性,我们也不知道哪个链接有问题。5.打开全日志 log_statement = all
观察发现,有些SQL可以正确执行同一类SQL,但只有一个参数不能执行,只有BIND过程,没有EXECUTE过程! 2022-08-05 17:36:29.234 CST,"xxxx","xxxx241835,30.69.79.67:41914,62ece183.3bab,6102,"BIND",2022-08-05 17:23:15 CST,44/1713,0,LOG,00000,"""2022-08-05 17:36:29.234 CST,"xxxx","xxxx241835,30.69.79.67:41914,62ece183.3bab,6103,"SELECT",2022-08-05 17:23:15 CST,44/1713,0,LOG,00000,"""2022-08-05 17:36:29.234 CST,"xxxx","xxxx241840,30.69.79.67:46722,62ece183.3b0b0,2648UPDATE",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,""# 以下是2022-08-05 17:36:29.235 CST,"xxxx","xxxx241840,30.69.79.67:46722,62ece183.3b0b0,2649BIND",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,",""2022-08-05 17:36:29.235 CST,"xxxx","xxxx241835,30.69.79.67:41914,62ece183.3bab,6104,"BIND",2022-08-05 17:23:15 CST,44/1713,0,LOG,00000,"""2022-08-05 17:36:29.235 CST,"xxxx","xxxx241840,30.69.79.67:62ece183.3b0b0,2650,46722UPDATE,,,""2022-08-05 17:36:29.236 CST,"xxxx","xxxx241840,30.69.79.67:46722,62ece183.3b0b0,2651BIND",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"""2022-08-05 17:36:29.236 CST,"xxxx","xxxx241840,30.69.79.67:46722,62ece183.3b0b0,2652UPDATE",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"'2022-08-05 17:36:29.237 CST,"xxxx","xxxx241840,30.69.79.67:46722,62ece183.3b0b0,2653BIND",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"""2022-08-05 17:36:29.237 CST,"xxxx","xxxx241840,30.69.79.67:46722,62ece183.3b0b0,2654UPDATE",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"""
可以看出,当最后一次17岁时:36:29.235SQL实施后,241835过程BIND后,没有后续日志!其他类似的SQL可以正常执行,所以这个参数可以在数据库中执行,你可以看到其他参数可以立即执行,但这个SQL一跑就死了,ctrl + c停不下来!
你可以看到第一次51.927ms出来了,但第二次还没有结束。执行计划显示是索引。根据我之前的经验,我隐约觉得索引出了问题。6. amcheck插件检测索引索引 SELECT bt_index_check(index => c.oid, heapallindexed => i.indisunique), c.relname, c.relpagesFROM pg_index iJOIN pg_opclass op ON i.indclass[0] = op.oidJOIN pg_am am ON op.opcmethod = am.oidJOIN pg_class c ON i.indexrelid = c.oidJOIN pg_namespace n ON c.relnamespace = n.oidWHERE am.amname = 'btree' AND n.nspname = 'public'-- Don't check temp tables, which may be from another session:AND c.relpersistence != 't'AND c.relname = '索引名'-- Function may throw an error when this is omitted:AND c.relkind = 'i' AND i.indisready AND i.indisvalidORDER BY c.relpages DESC LIMIT 10;
果然。。索引又出错了!
此外,还显示了报错函数staticvoid bt_target_page_check(BtreeCheckState*state)三、 小结
这个问题的现象非常令人费解。当它被引入一定值时,它将无法停止运行。现象是它一直在旋转,等待锁定数据页面。
根据结果,查询阅读错误的索引页面后,尝试添加锁,但数据异常,锁失败,加上旋转锁的显著特点是“死等”,需要获得Spinlock过程将主动检查是否可以获得锁,直到锁将继续执行以下逻辑,锁将被线程持有,直到线程本身主动释放,这也是这个过程无法停止的原因,同一参数的传输过程正在等待,堆积越多!当然,解决方案也很好,reindex可以重建索引。