Channel: PostgreSQL research

Making sysbench Support PostgreSQL Server-Side Bind Variables


First, a look at how bind variables are expressed for several databases in sysbench lua scripts.
.1. For PostgreSQL, a bind variable is written as ? at each position (the driver turns these into $1, $2, ... for the server), for example:

select info from test where id=? and c1=?;

.2. Oracle uses :var for bind variables, for example:

   stmt = db_prepare("UPDATE ".. table_name .." SET k=k+1 WHERE id=to_number(:x) and 'a' = :y")
   params = {}
   params[1] = '444'
   params[2] = 'a'
   db_bind_param(stmt, params)

.3. MySQL uses ? for bind variables, for example:

   points = ""
   for i = 1,random_points do
      points = points .. "?, "
   end

   -- Get rid of last comma and space.
   points = string.sub(points, 1, string.len(points) - 2)

   stmt = db_prepare([[
        SELECT id, k, c, pad
          FROM sbtest
          WHERE k IN (]] .. points .. [[)
        ]])

   params = {}
   for j = 1,random_points do
      params[j] = 1
   end

   db_bind_param(stmt, params)

However, sysbench's support for PostgreSQL bind variables is not good. For example:

vi lua/oltp_pg.lua

pathtest = string.match(test, "(.*/)") or ""

dofile(pathtest .. "common.lua")

function thread_init(thread_id)
   set_vars()

   stmt = db_prepare([[
        SELECT info
          FROM test
          WHERE id = ? and 'a' = ?
        ]])

   params = {}
   params[1] = 1
   params[2] = 'a'

   db_bind_param(stmt, params)
end

function event(thread_id)
   params[1] = string.format("%d", sb_rand(1, oltp_table_size))
   params[2] = 'a'
   db_query('BEGIN')
   db_execute(stmt)
   db_query('COMMIT')
end

Run the test

./sysbench_pg \
--test=lua/oltp_pg.lua   \
--db-driver=pgsql   \
--pgsql-host=127.0.0.1   \
--pgsql-port=1921   \
--pgsql-user=postgres   \
--pgsql-password=postgres   \
--pgsql-db=postgres   \
--oltp-tables-count=1   \
--oltp-table-size=1000000   \
--num-threads=1    \
--max-time=120    \
--max-requests=0   \
--report-interval=1   \
run

It fails with the following error:

sysbench 0.5:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Report intermediate results every 1 second(s)
Random number generator seed is 0 and will be ignored


Threads started!

FATAL: query execution failed: -872378608
FATAL: failed to execute function `event': (null)

The error on the database side shows that the BIND step failed: the two parameters were not submitted correctly.

2016-04-28 22:44:12.330 CST,"postgres","postgres",10763,"[local]",572221bc.2a0b,1,"BIND",2016-04-28 22:44:12 CST,1/4646740,0,ERROR,22P02,"invalid input syntax for integer: """"",,,,,,"        SELECT info
          FROM test
          WHERE id = $1 and 'a' = $2
        ",,"pg_atoi, numutils.c:52",""

Tracing with gdb, the error comes from:

int sb_lua_db_execute(lua_State *L)@sysbench/scripting/script_lua.c

Before fixing that bug, the goal of this article is to make sysbench use server-side bind variables, by means of PostgreSQL's PREPARE and EXECUTE statements.
In the example below, PREPARE is issued once in thread_init and EXECUTE runs in event.
To let sysbench report per-second TPS, explicit begin; and commit; statements must be used.
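In plain SQL, the server-side mechanism itself looks like this (a minimal psql sketch, not part of the original script, run against the test table created further below):

prepare p1(int) as select info from test where id = $1;   -- define a named server-side statement with one int parameter
execute p1(10);                                           -- run it with a concrete value; the server can reuse the plan
deallocate p1;                                            -- release the statement when it is no longer needed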

vi lua/oltp_pg.lua

pathtest = string.match(test, "(.*/)") or ""

dofile(pathtest .. "common.lua")

function thread_init(thread_id)
   set_vars()

   db_query('prepare p' .. thread_id .. '(int) as select * from test where id=$1')
end

function event(thread_id)
   local rs
   local i

   i = sb_rand(1, oltp_table_size)
   rs = db_query("begin" )
   rs = db_query("execute p".. thread_id .. "(" .. i .. ")" )
   -- rs = db_query("select * from test where id=" .. i )
   rs = db_query("commit" )

end

Run the test: first create the test data

create table test (id int primary key, info text);
insert into test select generate_series(1,1000000),'test';

sysbench_pg

./sysbench_pg \
--test=lua/oltp_pg.lua   \
--db-driver=pgsql   \
--pgsql-host=127.0.0.1   \
--pgsql-port=1921   \
--pgsql-user=postgres   \
--pgsql-password=postgres   \
--pgsql-db=postgres   \
--oltp-tables-count=1   \
--oltp-table-size=1000000   \
--num-threads=1    \
--max-time=120    \
--max-requests=0   \
--report-interval=1   \
run


[   1s] threads: 128, tps: 468353.17, reads/s: 0.00, writes/s: 0.00, response time: 0.33ms (95%)
[   2s] threads: 128, tps: 474536.37, reads/s: 0.00, writes/s: 0.00, response time: 0.32ms (95%)
[   3s] threads: 128, tps: 476768.82, reads/s: 0.00, writes/s: 0.00, response time: 0.32ms (95%)
[   4s] threads: 128, tps: 477219.36, reads/s: 0.00, writes/s: 0.00, response time: 0.32ms (95%)
[   5s] threads: 128, tps: 476848.04, reads/s: 0.00, writes/s: 0.00, response time: 0.32ms (95%)

Connect via the unix socket

./sysbench_pg \
--test=lua/oltp_pg.lua   \
--db-driver=pgsql   \
--pgsql-host=$PGDATA   \
--pgsql-port=1921   \
--pgsql-user=postgres   \
--pgsql-password=postgres   \
--pgsql-db=postgres   \
--oltp-tables-count=1   \
--oltp-table-size=1000000   \
--num-threads=1    \
--max-time=120    \
--max-requests=0   \
--report-interval=1   \
run

[   1s] threads: 128, tps: 534132.82, reads/s: 0.00, writes/s: 0.00, response time: 0.29ms (95%)
[   2s] threads: 128, tps: 539569.98, reads/s: 0.00, writes/s: 0.00, response time: 0.29ms (95%)
[   3s] threads: 128, tps: 542427.96, reads/s: 0.00, writes/s: 0.00, response time: 0.29ms (95%)
[   4s] threads: 128, tps: 542168.03, reads/s: 0.00, writes/s: 0.00, response time: 0.28ms (95%)

Now test the performance without server-side bind variables, by changing oltp_pg.lua as follows:

   -- rs = db_query("execute p".. thread_id .. "(" .. i .. ")" )
   rs = db_query("select * from test where id=" .. i )

Test results

./sysbench_pg \
--test=lua/oltp_pg.lua   \
--db-driver=pgsql   \
--pgsql-host=127.0.0.1   \
--pgsql-port=1921   \
--pgsql-user=postgres   \
--pgsql-password=postgres   \
--pgsql-db=postgres   \
--oltp-tables-count=1   \
--oltp-table-size=1000000   \
--num-threads=1    \
--max-time=120    \
--max-requests=0   \
--report-interval=1   \
run

[   1s] threads: 128, tps: 367946.22, reads/s: 367985.22, writes/s: 0.00, response time: 0.40ms (95%)
[   2s] threads: 128, tps: 371138.13, reads/s: 371137.13, writes/s: 0.00, response time: 0.40ms (95%)
[   3s] threads: 128, tps: 371514.94, reads/s: 371525.94, writes/s: 0.00, response time: 0.40ms (95%)
[   4s] threads: 128, tps: 371680.18, reads/s: 371663.18, writes/s: 0.00, response time: 0.40ms (95%)

./sysbench_pg \
--test=lua/oltp_pg.lua   \
--db-driver=pgsql   \
--pgsql-host=$PGDATA   \
--pgsql-port=1921   \
--pgsql-user=postgres   \
--pgsql-password=postgres   \
--pgsql-db=postgres   \
--oltp-tables-count=1   \
--oltp-table-size=1000000   \
--num-threads=1    \
--max-time=120    \
--max-requests=0   \
--report-interval=1   \
run

[   1s] threads: 128, tps: 410439.59, reads/s: 410484.59, writes/s: 0.00, response time: 0.37ms (95%)
[   2s] threads: 128, tps: 414555.41, reads/s: 414568.41, writes/s: 0.00, response time: 0.36ms (95%)
[   3s] threads: 128, tps: 415483.61, reads/s: 415468.61, writes/s: 0.00, response time: 0.36ms (95%)
[   4s] threads: 128, tps: 416120.30, reads/s: 416125.30, writes/s: 0.00, response time: 0.36ms (95%)

Comparison with pgbench:

vi test.sql

\setrandom id 1 1000000
begin;
select info from test where id=:id;
commit;

Without bind variables

pgbench -M simple -n -r -P 1 -f ./test.sql -c 128 -j 128 -T 100 -h 127.0.0.1
progress: 2.0 s, 406356.7 tps, lat 0.314 ms stddev 0.065
progress: 3.0 s, 408601.2 tps, lat 0.312 ms stddev 0.053
progress: 4.0 s, 409713.9 tps, lat 0.311 ms stddev 0.048
progress: 5.0 s, 410598.9 tps, lat 0.311 ms stddev 0.046

pgbench -M simple -n -r -P 1 -f ./test.sql -c 128 -j 128 -T 100 -h $PGDATA
progress: 2.0 s, 455661.7 tps, lat 0.279 ms stddev 0.042
progress: 3.0 s, 456656.3 tps, lat 0.279 ms stddev 0.078
progress: 4.0 s, 458107.1 tps, lat 0.278 ms stddev 0.033
progress: 5.0 s, 458687.4 tps, lat 0.278 ms stddev 0.033

With bind variables

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 128 -j 128 -T 100 -h 127.0.0.1
progress: 2.0 s, 575148.0 tps, lat 0.222 ms stddev 0.057
progress: 3.0 s, 577477.6 tps, lat 0.221 ms stddev 0.060
progress: 4.0 s, 578402.7 tps, lat 0.220 ms stddev 0.058
progress: 5.0 s, 580408.2 tps, lat 0.220 ms stddev 0.043

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 128 -j 128 -T 100 -h $PGDATA
progress: 2.0 s, 650961.8 tps, lat 0.195 ms stddev 0.033
progress: 3.0 s, 653079.1 tps, lat 0.195 ms stddev 0.027
progress: 4.0 s, 653964.2 tps, lat 0.194 ms stddev 0.034
progress: 5.0 s, 655027.3 tps, lat 0.194 ms stddev 0.027
progress: 6.0 s, 655417.3 tps, lat 0.194 ms stddev 0.039

With auto commit

vi test.sql

begin;
\setrandom id 1 1000000
select info from test where id=:id;

Without bind variables

pgbench -M simple -n -r -P 1 -f ./test.sql -c 128 -j 128 -T 100 -h 127.0.0.1
progress: 2.0 s, 582766.5 tps, lat 0.218 ms stddev 0.034
progress: 3.0 s, 585359.1 tps, lat 0.217 ms stddev 0.033
progress: 4.0 s, 585994.5 tps, lat 0.217 ms stddev 0.070

pgbench -M simple -n -r -P 1 -f ./test.sql -c 128 -j 128 -T 100 -h $PGDATA
progress: 2.0 s, 623373.9 tps, lat 0.204 ms stddev 0.397
progress: 3.0 s, 626771.3 tps, lat 0.203 ms stddev 0.260
progress: 4.0 s, 623826.0 tps, lat 0.204 ms stddev 0.590
progress: 5.0 s, 625747.0 tps, lat 0.203 ms stddev 0.679

With bind variables

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 128 -j 128 -T 100 -h 127.0.0.1
progress: 2.0 s, 1024293.8 tps, lat 0.124 ms stddev 0.024
progress: 3.0 s, 1027868.7 tps, lat 0.123 ms stddev 0.026
progress: 4.0 s, 1030192.5 tps, lat 0.123 ms stddev 0.024
progress: 5.0 s, 1031413.8 tps, lat 0.123 ms stddev 0.022

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 128 -j 128 -T 100 -h $PGDATA
progress: 3.0 s, 1134774.5 tps, lat 0.112 ms stddev 0.018
progress: 4.0 s, 1136322.2 tps, lat 0.111 ms stddev 0.020
progress: 5.0 s, 1138087.2 tps, lat 0.111 ms stddev 0.020
progress: 6.0 s, 1138184.3 tps, lat 0.111 ms stddev 0.018

Unsurprisingly, the best performance comes from combining bind variables, auto commit, and the unix socket.
(The original post includes a chart comparing the TPS of the key-lookup tests above.)

The next article will look at network latency as a TPS bottleneck.


Rewriting sysbench oltp.lua to Support PostgreSQL Server-Side Bind Variables


The source code is here:
https://github.com/digoal/sysbench_lua/tree/master/lua
oltp.lua has been rewritten to use server-side bind variables for the 10 SQL statements listed below (adjust them as needed).
Because sysbench does not recognize EXECUTE statements, they are all counted as "other" queries; in practice these are exactly the queries that use server-side bind variables.
On an ordinary x86 machine with 15 GB of data, running the SQL below reaches roughly 470,000 QPS.

   -- select c from tbl where id = $1;
   -- select id,k,c,pad from tbl where id in ($1,...$n);
   -- select c from tbl where id between $1 and $2;
   -- select sum(k) from tbl where id between $1 and $2;
   -- select c from tbl where id between $1 and $2 order by c;
   -- select distinct c from tbl where id between $1 and $2 order by c;
   -- update tbl set k=k+1 where id = $1;
   -- update tbl set c=$2 where id = $1;
   -- delete from tbl where id = $1;
   -- insert into tbl(id, k, c, pad) values ($1,$2,$3,$4);

oltp_pg.lua source

-- use case

--     ./sysbench_pg --test=lua/parallel_init_pg.lua \
--       --db-driver=pgsql \
--       --pgsql-host=$PGDATA \
--       --pgsql-port=1921 \
--       --pgsql-user=postgres \
--       --pgsql-password=postgres \
--       --pgsql-db=postgres \
--       --oltp-tables-count=64 \
--       --oltp-table-size=1000000 \
--       --num-threads=64 \
--       cleanup

--     ./sysbench_pg --test=lua/parallel_init_pg.lua \
--       --db-driver=pgsql \
--       --pgsql-host=$PGDATA \
--       --pgsql-port=1921 \
--       --pgsql-user=postgres \
--       --pgsql-password=postgres \
--       --pgsql-db=postgres \
--       --oltp-tables-count=64 \
--       --oltp-table-size=1000000 \
--       --num-threads=64 \
--       run

--    ./sysbench_pg   \
--    --test=lua/oltp_pg.lua   \
--    --db-driver=pgsql   \
--    --pgsql-host=$PGDATA   \
--    --pgsql-port=1921   \
--    --pgsql-user=postgres   \
--    --pgsql-password=postgres   \
--    --pgsql-db=postgres   \
--    --oltp-tables-count=64   \
--    --oltp-table-size=1000000   \
--    --num-threads=64  \
--    --max-time=120  \
--    --max-requests=0 \
--    --report-interval=1 \
--    run

pathtest = string.match(test, "(.*/)") or ""

dofile(pathtest .. "common.lua")

function thread_init(thread_id)
   set_vars()

   oltp_point_selects = 10  -- query 10 times
   random_points = 10       -- query id in (10 vars)
   oltp_simple_ranges = 1   --  query 1 times
   oltp_sum_ranges = 1      --  query 1 times
   oltp_order_ranges = 1    --  query 1 times
   oltp_distinct_ranges = 1   --  query 1 times
   oltp_index_updates = 1     --  query 1 times
   oltp_non_index_updates = 1   --  query 1 times
   oltp_range_size = 100        --  query between $1 and $1+100-1
   oltp_read_only = false       -- query delete,update,insert also

   local table_name
   local pars
   local vars
   local i

   begin_query = "BEGIN"
   commit_query = "COMMIT"

   table_name = "sbtest" .. (thread_id+1)

   -- select c from tbl where id = $1;
   db_query("prepare p1(int) as select c from " .. table_name .. " WHERE id=$1")

   -- select id,k,c,pad from tbl where id in ($1,...$n);
   pars = ""
   vars = ""
   for i = 1,random_points do
      pars = pars .. "int, "
      vars = vars .. "$" .. i .. ", "
   end
   pars = string.sub(pars, 1, string.len(pars) - 2)
   vars = string.sub(vars, 1, string.len(vars) - 2)
   db_query("prepare p2(" .. pars .. ") as select id,k,c,pad from " .. table_name .. " WHERE id in (" .. vars .. ")")

   -- select c from tbl where id between $1 and $2;
   db_query("prepare p3(int,int) as SELECT c FROM " .. table_name .. " WHERE id BETWEEN $1 and $2")

   -- select sum(k) from tbl where id between $1 and $2;
   db_query("prepare p4(int,int) as SELECT sum(k) FROM " .. table_name .. " WHERE id BETWEEN $1 and $2")

   -- select c from tbl where id between $1 and $2 order by c;
   db_query("prepare p5(int,int) as SELECT c FROM " .. table_name .. " WHERE id BETWEEN $1 and $2 order by c")

   -- select distinct c from tbl where id between $1 and $2 order by c;
   db_query("prepare p6(int,int) as SELECT distinct c FROM " .. table_name .. " WHERE id BETWEEN $1 and $2 order by c")

   -- update tbl set k=k+1 where id = $1;
   db_query("prepare p7(int) as update " .. table_name .. " set k=k+1 where id = $1")

   -- update tbl set c=$2 where id = $1;
   db_query("prepare p8(int,text) as update " .. table_name .. " set c=$2 where id = $1")

   -- delete from tbl where id = $1;
   db_query("prepare p9(int) as delete from " .. table_name .. " where id = $1")

   -- insert into tbl(id, k, c, pad) values ($1,$2,$3,$4);
   db_query("prepare p10(int,int,text,text) as insert into " .. table_name .. "(id, k, c, pad) values ($1,$2,$3,$4)")
end

function event(thread_id)
   local i
   local evars
   local range_start
   local c_val
   local pad_val

   db_query(begin_query)

   for i=1, oltp_point_selects do
     db_query("execute p1(" .. sb_rand(1, oltp_table_size) .. ")")
   end

   evars = ""
   for i = 1,random_points do
     evars = evars .. sb_rand(1, oltp_table_size) .. ", "
   end
   evars = string.sub(evars, 1, string.len(evars) - 2)
   db_query("execute p2(" .. evars .. ")")

   for i=1, oltp_simple_ranges do
      range_start = sb_rand(1, oltp_table_size)
      db_query("execute p3(" .. range_start .. "," .. (range_start + oltp_range_size - 1) .. ")")
   end

   for i=1, oltp_sum_ranges do
      range_start = sb_rand(1, oltp_table_size)
      db_query("execute p4(" .. range_start .. "," .. (range_start + oltp_range_size - 1) .. ")")
   end

   for i=1, oltp_order_ranges do
      range_start = sb_rand(1, oltp_table_size)
      db_query("execute p5(" .. range_start .. "," .. (range_start + oltp_range_size - 1) .. ")")
   end

   for i=1, oltp_distinct_ranges do
      range_start = sb_rand(1, oltp_table_size)
      db_query("execute p6(" .. range_start .. "," .. (range_start + oltp_range_size - 1) .. ")")
   end

   if not oltp_read_only then

     for i=1, oltp_index_updates do
        db_query("execute p7(" .. sb_rand(1, oltp_table_size) .. ")")
     end

     for i=1, oltp_non_index_updates do
        c_val = sb_rand_str("###########-###########-###########-###########-###########-###########-###########-###########-###########-###########")
        db_query("execute p8(" .. sb_rand(1, oltp_table_size) .. ", '" .. c_val .. "')")
     end

     -- delete then insert
     i = sb_rand(1, oltp_table_size)
     c_val = sb_rand_str([[
###########-###########-###########-###########-###########-###########-###########-###########-###########-###########]])
     pad_val = sb_rand_str([[
###########-###########-###########-###########-###########]])

     db_query("execute p9(" .. i .. ")")
     db_query("execute p10" .. string.format("(%d, %d, '%s', '%s')",i, sb_rand(1, oltp_table_size) , c_val, pad_val) )

   end -- oltp_read_only

   db_query(commit_query)

end

Quantitative Analysis of the Network Latency Bottleneck in PostgreSQL


When benchmarking a database with sysbench or pgbench, the difference between connecting over a unix socket and over the loopback address is very large, especially for very small transactions such as key lookups or a simple select 1.
The reason is that such queries are processed very quickly on the database side, so network latency takes a large share of the total elapsed time.
Large result sets are another case where network latency dominates the total time.
So how can this be quantified?
.1. Measure the packets: capture them with tcpdump to get the size and number of packets transferred during a database request.
.2. Or compute the packet size for a given query from the PostgreSQL source code, e.g. libpq.

Take select 1; as the example.

How do we split the elapsed time of a request into database processing time and network transfer time?

The time spent in network transfer can be obtained from the packet size measured above, using a network latency tool such as qperf.
Example:
Suppose tcpdump shows that the select 1; request is 16 bytes (excluding the TCP header).
Use qperf to measure the single-session TCP latency of a 16-byte message.

Start the qperf server

yum install -y qperf

qperf -lp 8888 &

Measure TCP latency over the loopback address (8-64 byte messages; the 16-byte result is shown)

qperf 127.0.0.1 -lp 8888 -t 6 -oo msg_size:8:64:*2 -v tcp_lat &  
latency        =  5.16 us
    msg_rate       =   194 K/sec
    msg_size       =    16 bytes
    time           =     6 sec
    loc_cpus_used  =  86.8 % cpus
    rem_cpus_used  =  86.8 % cpus

Measure the QPS of select 1

vi test.sql  
select 1;  

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 1 -j 1 -T 10 -h 127.0.0.1  
tps = 36938.282331 (including connections establishing)  

From this QPS and the network latency measured by qperf, compute the database-side processing time of select 1. Note that the reply packet must be counted as well, hence TCP latency * 2:

(1000000/36938.03) - (5.16*2) = 16.75 us

Latency test against a host in the same LAN

qperf xxx.xxx.xxx.xxx -lp 8888 -t 6 -oo msg_size:8:64:*2 -v tcp_lat &
tcp_lat:
    latency        =  13.6 us
    msg_rate       =  73.8 K/sec
    msg_size       =    16 bytes
    time           =     6 sec
    loc_cpus_used  =  8.67 % cpus
    rem_cpus_used  =   9.5 % cpus

Using this value and the database processing latency obtained earlier, the theoretical TPS when benchmarking from this host against the database server should be:

1000000 / (16.75 + 13.6*2) = 22753  tps

The measured TPS is in close agreement:

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 1 -j 1 -T 10 -h xxx.xxx.xxx.xxx
tps = 24724.666217 (including connections establishing)

Network latency analysis under concurrency.
Start multiple qperf server instances:

qperf -lp 8888 &
qperf -lp 8889 &
qperf -lp 8890 &
qperf -lp 8891 &
qperf -lp 8892 &
qperf -lp 8893 &
qperf -lp 8894 &
qperf -lp 8895 &
qperf -lp 8896 &
qperf -lp 8897 &
qperf -lp 8898 &
qperf -lp 8899 &
qperf -lp 8900 &
qperf -lp 8901 &
qperf -lp 8902 &
qperf -lp 8903 &
qperf -lp 8904 &
qperf -lp 8905 &
qperf -lp 8906 &
qperf -lp 8907 &
qperf -lp 8908 &
qperf -lp 8909 &
qperf -lp 8910 &
qperf -lp 8911 &
qperf -lp 8912 &
qperf -lp 8913 &
qperf -lp 8914 &
qperf -lp 8915 &
qperf -lp 8916 &
qperf -lp 8917 &
qperf -lp 8918 &
qperf -lp 8919 &
qperf -lp 8920 &
qperf -lp 8921 &
qperf -lp 8922 &
qperf -lp 8923 &
qperf -lp 8924 &
qperf -lp 8925 &
qperf -lp 8926 &
qperf -lp 8927 &
qperf -lp 8928 &
qperf -lp 8929 &
qperf -lp 8930 &
qperf -lp 8931 &
qperf -lp 8932 &
qperf -lp 8933 &
qperf -lp 8934 &
qperf -lp 8935 &
qperf -lp 8936 &
qperf -lp 8937 &
qperf -lp 8938 &
qperf -lp 8939 &
qperf -lp 8940 &
qperf -lp 8941 &
qperf -lp 8942 &
qperf -lp 8943 &
qperf -lp 8944 &
qperf -lp 8945 &
qperf -lp 8946 &
qperf -lp 8947 &
qperf -lp 8948 &
qperf -lp 8949 &
qperf -lp 8950 &
qperf -lp 8951 &

Measure latency with concurrent clients

qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8888 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8889 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8890 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8891 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8892 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8893 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8894 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8895 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8896 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8897 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8898 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8899 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8900 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8901 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8902 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8903 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8904 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8905 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8906 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8907 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8908 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8909 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8910 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8911 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8912 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8913 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8914 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8915 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8916 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8917 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8918 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8919 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8920 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8921 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8922 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8923 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8924 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8925 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8926 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8927 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8928 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8929 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8930 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8931 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8932 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8933 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8934 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8935 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8936 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8937 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8938 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8939 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8940 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8941 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8942 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8943 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8944 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8945 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8946 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8947 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8948 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8949 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8950 tcp_lat &
qperf 127.0.0.1 -t 6 -oo msg_size:16:16:*2 -v -lp 8951 tcp_lat &

Over the loopback address, the latency with 64 concurrent sessions is about 11.8 us:

latency        =   11.8 us

Measure the TPS with 64 concurrent sessions and compute the database-side time (on a 64-core machine):

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 64 -j 64 -T 10 -h 127.0.0.1

tps = 1575989.161924 (including connections establishing)

Compute the database RT.
It is almost identical to the single-session RT, which shows that PostgreSQL handles high concurrency very well: it makes full use of the CPU cores and scales linearly.

(1000000/(1575989.161924/64)) - (11.8*2) = 17 us

Measure the network latency from the remote host.
The result is far larger than the database's local processing time: the network has become the dominant bottleneck.

qperf xxx.xxx.xxx.xxx -t 6 -oo msg_size:16:16:*2 -v -lp 8888 tcp_lat &
...
qperf xxx.xxx.xxx.xxx -t 6 -oo msg_size:16:16:*2 -v -lp 8951 tcp_lat &

latency        =  61.8 us

Estimate the TPS with 64 concurrent sessions:

(1000000/(17 + 61.8*2)) * 64 = 455192

The estimated TPS closely matches the measured TPS:

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 64 -j 64 -T 10 -h xxx.xxx.xxx.xxx
tps = 466737.781999 (including connections establishing)

That is the quantitative analysis of network latency; in highly concurrent database applications its impact is considerable.

References
man tcpdump
man qperf
http://blog.yufeng.info/archives/2234

Benchmarking Alibaba Cloud RDS PostgreSQL with sysbench


There are many ways to benchmark PostgreSQL, for example pgbench and sysbench.
sysbench is scripted in Lua and multi-threaded, so it is more flexible; for testing complex business logic, sysbench is recommended.
pgbench is also very good: written in pure C, it has very low overhead of its own, and is recommended for high-concurrency, low-latency scenarios.

First, purchase an RDS PG instance and create a database user.
You also need an ECS instance in the same region, either in the same VPC as the RDS PG instance or in the same classic network.
Install the PostgreSQL client on the ECS host:

useradd digoal
su - digoal

wget https://ftp.postgresql.org/pub/source/v9.5.2/postgresql-9.5.2.tar.bz2
tar -jxvf postgresql-9.5.2.tar.bz2
cd postgresql-9.5.2
./configure --prefix=/home/digoal/pgsql9.5
gmake world -j 16
gmake install-world -j 16

vi ~/env_pg.sh
export PS1="$USER@`/bin/hostname -s`-> "
export PGPORT=1921
export LANG=en_US.utf8
export PGHOME=/home/digoal/pgsql9.5
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH
export DATE=`date +"%Y%m%d%H%M"`
export PATH=$PGHOME/bin:$PATH:.
export MANPATH=$PGHOME/share/man:$MANPATH
export PGHOST=$PGDATA
export PGUSER=postgres
export PGDATABASE=postgres
alias rm='rm -i'
alias ll='ls -lh'
unalias vi

. ~/env_pg.sh

Install sysbench

cd ~
mkdir sysbench
cd sysbench

git clone https://github.com/digoal/sysbench_bin.git
git clone https://github.com/digoal/sysbench2.git
git clone https://github.com/digoal/sysbench_lua.git

mkdir sysbench
cd sysbench
cp -r ../sysbench_bin/bin/* ./
cp -r ../sysbench_lua/lua ./
cp ../sysbench_bin/gendata.c ./
gcc -o gendata gendata.c

Initialize the test data

./sysbench_pg --test=lua/parallel_init_pg.lua \
  --db-driver=pgsql \
  --pgsql-host=xxx.xxx.xxx.xxx \
  --pgsql-port=3432 \
  --pgsql-user=digoal \
  --pgsql-password=pwd \
  --pgsql-db=postgres \
  --oltp-tables-count=16 \
  --oltp-table-size=1000000 \
  --num-threads=16 \
  cleanup


./sysbench_pg --test=lua/parallel_init_pg.lua \
  --db-driver=pgsql \
  --pgsql-host=xxx.xxx.xxx.xxx \
  --pgsql-port=3432 \
  --pgsql-user=digoal \
  --pgsql-password=pwd \
  --pgsql-db=postgres \
  --oltp-tables-count=16 \
  --oltp-table-size=1000000 \
  --num-threads=16 \
  run

The test uses oltp_pg.lua, which contains the SQL below; the first statement runs 10 times per transaction:

   -- select c from tbl where id = $1;
   -- select id,k,c,pad from tbl where id in ($1,...$n);
   -- select c from tbl where id between $1 and $2;
   -- select sum(k) from tbl where id between $1 and $2;
   -- select c from tbl where id between $1 and $2 order by c;
   -- select distinct c from tbl where id between $1 and $2 order by c;
   -- update tbl set k=k+1 where id = $1;
   -- update tbl set c=$2 where id = $1;
   -- delete from tbl where id = $1;
   -- insert into tbl(id, k, c, pad) values ($1,$2,$3,$4);

Each transaction executes 19 SQL statements.

./sysbench_pg --test=lua/oltp_pg.lua \
  --db-driver=pgsql \
  --pgsql-host=xxx.xxx.xxx.xxx \
  --pgsql-port=3432 \
  --pgsql-user=digoal \
  --pgsql-password=pwd \
  --pgsql-db=postgres \
  --oltp-tables-count=16 \
  --oltp-table-size=1000000 \
  --num-threads=16 \
  --max-time=120  \
  --max-requests=0 \
  --report-interval=1 \
  run

OLTP test statistics:
    queries performed:
        read:                            0
        write:                           0
        other:                           566572
        total:                           566572
    transactions:                        26972  (224.62 per sec.)
    deadlocks:                           0      (0.00 per sec.)
    read/write requests:                 0      (0.00 per sec.)
    other operations:                    566572 (4718.32 per sec.)

General statistics:
    total time:                          120.0791s
    total number of events:              26972
    total time taken by event execution: 1919.7217s
    response time:
         min:                                 39.35ms
         avg:                                 71.17ms
         max:                               3159.62ms
         approx.  95 percentile:             124.54ms

Threads fairness:
    events (avg/stddev):           1685.7500/85.94
    execution time (avg/stddev):   119.9826/0.02

Bottleneck analysis
Connect to the Alibaba Cloud RDS console and watch the resource usage during the benchmark window; whichever resource hits its limit is the one to upgrade.
If the network is the problem, raising the benchmark concurrency will increase TPS, because the link latency of a single session cannot be reduced any further.
For a quantitative analysis of link latency, see
https://yq.aliyun.com/articles/35176

RDS PG tuning

alter role all set random_page_cost=1.2;
alter role all set synchronous_commit=off;

Because the RDS network path is long, the latency is much higher than for a local connection.
But how do we quantify it?
We cannot run qperf on the RDS PG server, so we have to use the database itself to measure the latency.

alter role all set random_page_cost=1.2;  
alter role all set synchronous_commit=off;  

Reconnect to the database and measure the RT of SQL handled by the database itself:

create table test(crt_time timestamp);  

do language plpgsql $$
declare
begin
  for i in 1..10000 loop
    insert into test values (clock_timestamp());
  end loop;
end;
$$;

postgres=> select avg(rt) from (select lead(extract(microseconds from crt_time)) over (order by crt_time)-extract(microseconds from crt_time) rt from test) t;
       avg        
------------------
 10.1338133813381
(1 row)

The average database processing RT is about 10 microseconds.
Create a function for measuring the network RT:

create or replace function f() returns void as $$
  insert into test values(clock_timestamp());  
$$ language sql;  

Clear the data

truncate test;  

Create the test script on the ECS host

vi test.sql
select f();

Run the benchmark

export PGPASSWORD=pwd; pgbench -M prepared -n -r -P 1 -f ./test.sql -c 1 -j 1 -T 10 -h xxx.xxx.xxx.xxx -p 3432 -U digoal postgres
tps = 197.976441 (including connections establishing)

Compute the RT

postgres=> select avg(rt) from (select lead(extract(microseconds from crt_time)) over (order by crt_time)-extract(microseconds from crt_time) rt from test) t;
       avg        
------------------
 5045.96513390601  
(1 row)

Subtracting the database's own 10-microsecond processing cost, the network RT is about 5.036 milliseconds.
That is a considerable latency.

Concurrency can compensate for this link latency; for example, test again with 300 concurrent connections:

truncate test;  
export PGPASSWORD=pwd; pgbench -M prepared -n -r -P 1 -f ./test.sql -c 300 -j 300 -T 10 -h xxx.xxx.xxx.xxx -p 3432 -U digoal postgres
tps = 27368.404844 (including connections establishing)
postgres=> select avg(rt) from (select lead(extract(microseconds from crt_time)) over (order by crt_time)-extract(microseconds from crt_time) rt from test) t;
       avg        
------------------
 37.5476444551323
(1 row)

Throughput goes up, but the RT of each individual transaction is still what it is.
One more point: with a cloud database, use UDFs liberally to reduce the number of round trips between the application and the database and thereby shorten the response time of the whole business operation.
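As a rough sketch of that idea (the account and transfer_log tables here are hypothetical, not part of this benchmark), one plpgsql function can replace three separate statements, so the client pays for a single network round trip instead of three:

create or replace function transfer(p_from int, p_to int, p_amount numeric) returns void as $$
begin
  -- all three statements run inside one server-side call
  update account set balance = balance - p_amount where id = p_from;
  update account set balance = balance + p_amount where id = p_to;
  insert into transfer_log(from_id, to_id, amount, crt_time)
    values (p_from, p_to, p_amount, clock_timestamp());
end;
$$ language plpgsql;

-- the application then issues a single call:
-- select transfer(1, 2, 100);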

Debugging sysbench with gdb


A few days ago, while writing the article below, I found that sysbench's support for PostgreSQL libpq bind variables was poor:
"Making sysbench Support PostgreSQL Server-Side Bind Variables"
https://yq.aliyun.com/articles/34870
So how do we track down the failing code?
gdb is one way, but sysbench exits as soon as the PostgreSQL libpq bind fails, so attaching by pid is not practical; instead we can use gdb's run command (I had not studied gdb closely before, and the method came from a member of the RDS PG kernel team; with a reliable team, the right person is always at hand).
For example, debugging the date program:

gdb date
(gdb) run
Starting program: /bin/date 
[Thread debugging using libthread_db enabled]
Thu Apr 28 22:32:24 CST 2016

Program exited normally.

Arguments after run are passed to the program, exactly as if they were given to the date command itself:

gdb date
(gdb) run +%F%t
Starting program: /bin/date +%F%t
[Thread debugging using libthread_db enabled]
2016-04-28
Program exited normally.

For sysbench_pg, since it exits as soon as the error occurs, we need to set a breakpoint first and then run. For example, having identified a function that sysbench_pg is certain to execute, set a breakpoint on it and then single-step from there.

(gdb) break [<file-name>:]<func-name>
(gdb) break [<file-name>:]<line-num>

Example:

gdb ./sysbench_pg

(gdb) b sb_lua_db_execute
or
(gdb) b script_lua.c:sb_lua_db_execute
Breakpoint 1 at 0x40f130: file script_lua.c, line 851.

(gdb) run --test=lua/oltp_pg1.lua   --db-driver=pgsql   --pgsql-host=$PGDATA   --pgsql-port=1921   --pgsql-user=postgres   --pgsql-password=postgres   --pgsql-db=postgres   --oltp-tables-count=1   --oltp-table-size=1000000   --num-threads=1  --max-time=120  --max-requests=0 --report-interval=1 run

[Thread debugging using libthread_db enabled]
sysbench 0.5:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Report intermediate results every 1 second(s)
Random number generator seed is 0 and will be ignored

[New Thread 0x7ffff7e6c700 (LWP 10898)]
[New Thread 0x7ffff7e5b700 (LWP 10899)]
Threads started!

[Switching to Thread 0x7ffff7e5b700 (LWP 10899)]

Breakpoint 1, sb_lua_db_execute (L=0x8ab080) at script_lua.c:851
851     script_lua.c: No such file or directory.
        in script_lua.c
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.80.2.alios6.x86_64 libaio-0.3.107-10.1.alios6.x86_64

(gdb) n
863     in script_lua.c

(gdb) s
sb_lua_get_context (L=0x8ab080) at script_lua.c:1109
1109    in script_lua.c

Look at the corresponding code:

$vi sysbench/scripting/script_lua.c
:set nu
:1109

1108 sb_lua_ctxt_t *sb_lua_get_context(lua_State *L)
1109 {

Print variable values in the current frame

(gdb) p L
$1 = (lua_State *) 0x8ab080

(gdb) p *L
$2 = {next = 0x7ffff00097b0, tt = 8 '\b', marked = 97 'a', status = 0 '\000', top = 0x8ab3a0, base = 0x8ab390, l_G = 0x8ab138, ci = 0x8a20a0, savedpc = 0x8b6d78, stack_last = 0x8ab560, stack = 0x8ab2f0, end_ci = 0x8a2168, 
  base_ci = 0x8a2050, stacksize = 45, size_ci = 8, nCcalls = 1, hookmask = 0 '\000', allowhook = 1 '\001', basehookcount = 0, hookcount = 0, hook = 0, l_gt = {value = {gc = 0x8aa560, p = 0x8aa560, n = 9086304, b = 9086304}, tt = 5}, 
  env = {value = {gc = 0x8af150, p = 0x8af150, n = 9105744, b = 9105744}, tt = 5}, openupval = 0x0, gclist = 0x0, errorJmp = 0x7ffff7e5ac20, errfunc = 0}

(gdb) p *L->savedpc
$3 = 147525

Keep pressing Enter (repeating the last command); the exception is thrown at this point:

sb_lua_db_execute (L=0x8ab080) at script_lua.c:943
943     script_lua.c: No such file or directory.
        in script_lua.c
(gdb) 
942     in script_lua.c
(gdb) 
943     in script_lua.c
(gdb) 
946     in script_lua.c
(gdb) 
945     in script_lua.c
(gdb) 
946     in script_lua.c
(gdb) 
948     in script_lua.c
(gdb) 
lua_error (L=0x8ab080) at lapi.c:957
957     lapi.c: No such file or directory.
        in lapi.c
(gdb) 
960     in lapi.c
(gdb) 
luaG_errormsg (L=0x8ab080) at ldebug.c:600
600     ldebug.c: No such file or directory.
        in ldebug.c
(gdb) 
601     in ldebug.c
(gdb) 
610     in ldebug.c
(gdb) 
609     in ldebug.c
(gdb) 
610     in ldebug.c
(gdb) 
609     in ldebug.c
(gdb) 
luaD_throw (L=0x8ab080, errcode=2) at ldo.c:94
94      ldo.c: No such file or directory.
        in ldo.c
(gdb) 
95      in ldo.c
(gdb) 
94      in ldo.c
(gdb) 
95      in ldo.c
(gdb) 
96      in ldo.c
(gdb) 
97      in ldo.c
(gdb) 

FATAL: failed to execute function `event': (null)
[Thread 0x7ffff7e5b700 (LWP 11124) exited]
[Thread 0x7ffff7e6c700 (LWP 11123) exited]

Start over, this time breaking directly on the line number

gdb ./sysbench_pg
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.1.alios6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/dege.zzz/sysbench/sysbench_pg...done.
(gdb) b script_lua.c:943
Breakpoint 1 at 0x40f2cd: file script_lua.c, line 943.
(gdb) run --test=lua/oltp_pg1.lua   --db-driver=pgsql   --pgsql-host=$PGDATA   --pgsql-port=1921   --pgsql-user=postgres   --pgsql-password=postgres   --pgsql-db=postgres   --oltp-tables-count=1   --oltp-table-size=1000000   --num-threads=1  --max-time=120  --max-requests=0 --report-interval=1 run
Starting program: /home/dege.zzz/sysbench/sysbench_pg --test=lua/oltp_pg1.lua   --db-driver=pgsql   --pgsql-host=$PGDATA   --pgsql-port=1921   --pgsql-user=postgres   --pgsql-password=postgres   --pgsql-db=postgres   --oltp-tables-count=1   --oltp-table-size=1000000   --num-threads=1  --max-time=120  --max-requests=0 --report-interval=1 run
[Thread debugging using libthread_db enabled]
sysbench 0.5:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Report intermediate results every 1 second(s)
Random number generator seed is 0 and will be ignored


[New Thread 0x7ffff7e6c700 (LWP 11347)]
[New Thread 0x7ffff7e5b700 (LWP 11348)]
Threads started!

FATAL: query execution failed: -268398832
[Switching to Thread 0x7ffff7e5b700 (LWP 11348)]

Breakpoint 1, sb_lua_db_execute (L=0x8ab080) at script_lua.c:943
943     script_lua.c: No such file or directory.
        in script_lua.c
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.80.2.alios6.x86_64 libaio-0.3.107-10.1.alios6.x86_64
(gdb) n
942     in script_lua.c

The corresponding code. The problem appears to be in the handling of the parameter types:

 908   /* Rebind if needed */
 909   if (needs_rebind)
 910   {
 911     binds = (db_bind_t *)calloc(stmt->nparams, sizeof(db_bind_t));
 912     if (binds == NULL)
 913       luaL_error(L, "Memory allocation failure");
 914 
 915     for (i = 0; i < stmt->nparams; i++)
 916     {
 917       param = stmt->params + i;
 918       binds[i].type = param->type;
 919       binds[i].is_null = &param->is_null;
 920       if (*binds[i].is_null != 0)
 921         continue;
 922       switch (param->type)
 923       {
 924         case DB_TYPE_INT:
 925           binds[i].buffer = param->buf;
 926           break;
 927         case DB_TYPE_CHAR:
 928           binds[i].buffer = param->buf;
 929           binds[i].data_len = &stmt->params[i].buflen;
 930           binds[i].is_null = 0;
 931           break;
 932         default:
 933           luaL_error(L, "Unsupported variable type");
 934       }
 935     }

 937     if (db_bind_param(stmt->ptr, binds, stmt->nparams))
 938       luaL_error(L, "db_bind_param() failed");
 939     free(binds);
 940   }
 941 
 942   ptr = db_execute(stmt->ptr);
 943   if (ptr == NULL)
 944   {
 945     stmt->rs = NULL;
 946     if (ctxt->con->db_errno == SB_DB_ERROR_DEADLOCK)
 947       lua_pushnumber(L, SB_DB_RESTART_TRANSACTION);

See the gdb manual for detailed usage.

How to Analyze Processes in the D State


When looking at process states with top, we sometimes see processes in the D state.

       w: S  --  Process Status
          The status of the task which can be one of:
             ’D’ = uninterruptible sleep
             ’R’ = running
             ’S’ = sleeping
             ’T’ = traced or stopped
             ’Z’ = zombie

D is an uninterruptible sleep. If you see a large number of processes in the D state, those processes are not actually doing any business work at that moment.
For example, when bulk-loading data into PostgreSQL, if the data arrives faster than the OS can write back dirty pages and the foreground dirty-page threshold is reached, the user process has to flush dirty pages itself.

vm.dirty_background_ratio = 10
vm.dirty_background_bytes = 0
vm.dirty_ratio = 20
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 50
vm.dirty_expire_centisecs = 6000  

With the settings above, once dirty pages exceed 20% of memory, a process calling write must flush dirty pages itself, and it will sit in the D state until the dirty-page level falls back below 10%.
Of course there are other reasons a process can enter the D state; we need to look at the process's kernel stack to see what it is doing.
For example, what does the stack of a PostgreSQL COPY process look like while it is in the R state?

cat /proc/17944/status ; echo -e "\n"; cat /proc/17944/stack
Name:   postgres
State:  R (running)
Tgid:   17944
Pid:    17944
PPid:   57925
TracerPid:      0
Uid:    123293  123293  123293  123293
Gid:    100     100     100     100
Utrace: 0
FDSize: 64
Groups: 100 19001 
VmPeak: 272294920 kB
VmSize:   119788 kB
VmLck:         0 kB
VmHWM:      3244 kB
VmRSS:      2812 kB
VmData:     2140 kB
VmStk:       152 kB
VmExe:      5852 kB
VmLib:      2400 kB
VmPTE:        64 kB
VmSwap:        0 kB
Threads:        1
SigQ:   0/4131614
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000001301800
SigCgt: 0000000180006287
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffffffffffff
Cpus_allowed:   ffffffff,ffffffff
Cpus_allowed_list:      0-63
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list:      0
voluntary_ctxt_switches:        55758
nonvoluntary_ctxt_switches:     103995


[<ffffffff8121b40a>] sys_semtimedop+0x81a/0x840
[<ffffffffffffffff>] 0xffffffffffffffff

And what does the stack of the same COPY process look like while it is in the D state?
It is being throttled while flushing dirty pages, inside the ext4 write path.

cat /proc/17944/status ; echo -e "\n"; cat /proc/17944/stack
Name:   postgres
State:  D (disk sleep)
Tgid:   17944
Pid:    17944
PPid:   57925
TracerPid:      0
Uid:    123293  123293  123293  123293
Gid:    100     100     100     100
Utrace: 0
FDSize: 64
Groups: 100 19001 
VmPeak: 272294920 kB
VmSize:   119788 kB
VmLck:         0 kB
VmHWM:      3244 kB
VmRSS:      2812 kB
VmData:     2140 kB
VmStk:       152 kB
VmExe:      5852 kB
VmLib:      2400 kB
VmPTE:        64 kB
VmSwap:        0 kB
Threads:        1
SigQ:   0/4131614
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000001301800
SigCgt: 0000000180006287
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffffffffffff
Cpus_allowed:   ffffffff,ffffffff
Cpus_allowed_list:      0-63
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list:      0
voluntary_ctxt_switches:        55922
nonvoluntary_ctxt_switches:     104189


[<ffffffff81133cf0>] balance_dirty_pages_ratelimited_nr+0x2d0/0x9a0
[<ffffffff8111f19a>] generic_file_buffered_write+0x1da/0x2e0
[<ffffffff81120fe0>] __generic_file_aio_write+0x260/0x490
[<ffffffff81121298>] generic_file_aio_write+0x88/0x100
[<ffffffffa00b9463>] ext4_file_write+0x43/0xe0 [ext4]
[<ffffffff8118863a>] do_sync_write+0xfa/0x140
[<ffffffff81188938>] vfs_write+0xb8/0x1a0
[<ffffffff81189231>] sys_write+0x51/0x90
[<ffffffff8100c072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

PostgreSQL Oracle Compatibility: psql prompt like Oracle SQL*Plus


Oracle's SQL*Plus client can prompt for a variable value and then use that value elsewhere in the script.
Many dbms scripts rely on this, for example statspack prompting for the begin and end snapshot IDs.
https://docs.oracle.com/cd/B19306_01/server.102/b14357/ch12032.htm
The psql client in PostgreSQL provides a similar feature, for example:

postgres=# create table test(id int, info text);
CREATE TABLE
postgres=# insert into test select generate_series(1,100),'test';
INSERT 0 100

Prompt for an ID and return the matching row from test:

vi test.sql
\prompt "please enter a id: " id
select * from test where id=:id;

dege.zzz@r10k04474-> psql -h 127.0.0.1 -p 1922 -f ./test.sql
"please enter a id: "1
 id | info 
----+------
  1 | test
(1 row)

Running it inside the psql command line:

postgres=# \ir test.sql
"please enter a id: "1
 id | info 
----+------
  1 | test
(1 row)

where:
  \i FILE                execute commands from file
  \ir FILE               as \i, but relative to location of current script

See the documentation:

man psql

       \prompt [ text ] name
           Prompts the user to supply text, which is assigned to the variable name. An optional prompt string, text, can be specified. (For multiword prompts, surround the text with single quotes.)

           By default, \prompt uses the terminal for input and output. However, if the -f command line switch was used, \prompt uses standard input and standard output.

PostgreSQL Reliability and Consistency: Code Analysis


PostgreSQL's data reliability is built on the XLOG (WAL): before any modified data block is written to disk, the REDO generated by that change is guaranteed to have been written to the XLOG and flushed to disk first.
In other words, the flow is:
.1. Read the block to be modified from its file into the shared buffer.
.2. Modify the block's content in the shared buffer.
.3. Write the change to the XLOG; if this is the first change to the block since the last checkpoint, write a full page image (whether full pages are written is controlled by a parameter).
.4. Before bgwriter writes a dirty block from the shared buffer to the OS page cache, it makes sure the corresponding XLOG has been flushed, using the dirty block's LSN.
So the question: what happens if the user enables asynchronous commit, i.e. synchronous_commit=off?
Still no problem, because step 4 always guarantees that the XLOG covering a dirty page reaches disk first.
So synchronous_commit=off can only lose some XLOG (recently committed transactions); it can never cause data inconsistency.
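As a quick illustration (a sketch, not taken from the code below; a table like test(id int, info text) is assumed), asynchronous commit can be enabled per session or per transaction:

set synchronous_commit = off;          -- session level

begin;
set local synchronous_commit = off;    -- this transaction only
insert into test values (1, 'async');
commit;                                -- returns before the WAL record is flushed to disk;
                                       -- a crash may lose this commit, but never leaves inconsistent data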
The code that ensures this reliability and consistency is as follows:


/*
 * Main entry point for bgwriter process
 *
 * This is invoked from AuxiliaryProcessMain, which has already created the
 * basic execution environment, but not enabled signals yet.
 */
void
BackgroundWriterMain(void)
{
...
        /*
         * Do one cycle of dirty-buffer writing.
         */
        can_hibernate = BgBufferSync();

...



/*
 * BgBufferSync -- Write out some dirty buffers in the pool.
 *
 * This is called periodically by the background writer process.
 *
 * Returns true if it's appropriate for the bgwriter process to go into
 * low-power hibernation mode.  (This happens if the strategy clock sweep
 * has been "lapped" and no buffer allocations have occurred recently,
 * or if the bgwriter has been effectively disabled by setting
 * bgwriter_lru_maxpages to 0.)
 */
bool
BgBufferSync(void)
{
...

    /* Execute the LRU scan */
    while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
    {
        int         buffer_state = SyncOneBuffer(next_to_clean, true);

...


/*
 * SyncOneBuffer -- process a single buffer during syncing.
 *
 * If skip_recently_used is true, we don't write currently-pinned buffers, nor
 * buffers marked recently used, as these are not replacement candidates.
 *
 * Returns a bitmask containing the following flag bits:
 *  BUF_WRITTEN: we wrote the buffer.
 *  BUF_REUSABLE: buffer is available for replacement, ie, it has
 *      pin count 0 and usage count 0.
 *
 * (BUF_WRITTEN could be set in error if FlushBuffers finds the buffer clean
 * after locking it, but we don't care all that much.)
 *
 * Note: caller must have done ResourceOwnerEnlargeBuffers.
 */
static int
SyncOneBuffer(int buf_id, bool skip_recently_used)
{

...

    FlushBuffer(bufHdr, NULL);
...


/*
 * FlushBuffer
 *      Physically write out a shared buffer.
 *
 * NOTE: this actually just passes the buffer contents to the kernel; the
 * real write to disk won't happen until the kernel feels like it.  This
 * is okay from our point of view since we can redo the changes from WAL.
 * However, we will need to force the changes to disk via fsync before
 * we can checkpoint WAL.
 *
 * The caller must hold a pin on the buffer and have share-locked the
 * buffer contents.  (Note: a share-lock does not prevent updates of
 * hint bits in the buffer, so the page could change while the write
 * is in progress, but we assume that that will not invalidate the data
 * written.)
 *
 * If the caller has an smgr reference for the buffer's relation, pass it
 * as the second parameter.  If not, pass NULL.
 */
static void
FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
{

...

    /*
     * Force XLOG flush up to buffer's LSN.  This implements the basic WAL
     * rule that log updates must hit disk before any of the data-file changes
     * they describe do.
     *
     * However, this rule does not apply to unlogged relations, which will be
     * lost after a crash anyway.  Most unlogged relation pages do not bear
     * LSNs since we never emit WAL records for them, and therefore flushing
     * up through the buffer LSN would be useless, but harmless.  However,
     * GiST indexes use LSNs internally to track page-splits, and therefore
     * unlogged GiST pages bear "fake" LSNs generated by
     * GetFakeLSNForUnloggedRel.  It is unlikely but possible that the fake
     * LSN counter could advance past the WAL insertion point; and if it did
     * happen, attempting to flush WAL through that location would fail, with
     * disastrous system-wide consequences.  To make sure that can't happen,
     * skip the flush if the buffer isn't permanent.
     */
    if (buf->flags & BM_PERMANENT)
        XLogFlush(recptr);

...




/*
 * Ensure that all XLOG data through the given position is flushed to disk.
 *
 * NOTE: this differs from XLogWrite mainly in that the WALWriteLock is not
 * already held, and we try to avoid acquiring it if possible.
 */
void
XLogFlush(XLogRecPtr record)
{
    XLogRecPtr  WriteRqstPtr;
    XLogwrtRqst WriteRqst;

...
        XLogWrite(WriteRqst, false);

...



/*
 * Write and/or fsync the log at least as far as WriteRqst indicates.
 *
 * If flexible == TRUE, we don't have to write as far as WriteRqst, but
 * may stop at any convenient boundary (such as a cache or logfile boundary).
 * This option allows us to avoid uselessly issuing multiple writes when a
 * single one would do.
 *
 * Must be called with WALWriteLock held. WaitXLogInsertionsToFinish(WriteRqst)
 * must be called before grabbing the lock, to make sure the data is ready to
 * write.
 */
static void
XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
{

...
    /*
     * If asked to flush, do so
     */
    if (LogwrtResult.Flush < WriteRqst.Flush &&
        LogwrtResult.Flush < LogwrtResult.Write)

    {
        /*
         * Could get here without iterating above loop, in which case we might
         * have no open file or the wrong one.  However, we do not need to
         * fsync more than one file.
         */
        if (sync_method != SYNC_METHOD_OPEN &&
            sync_method != SYNC_METHOD_OPEN_DSYNC)
        {
            if (openLogFile >= 0 &&
                !XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo))
                XLogFileClose();
            if (openLogFile < 0)
            {
                XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo);
                openLogFile = XLogFileOpen(openLogSegNo);
                openLogOff = 0;
            }

            issue_xlog_fsync(openLogFile, openLogSegNo);
        }

        /* signal that we need to wakeup walsenders later */
        WalSndWakeupRequest();

        LogwrtResult.Flush = LogwrtResult.Write;
    }
...

The asynchronous commit code is as follows:

/*
     * Check if we want to commit asynchronously.  We can allow the XLOG flush
     * to happen asynchronously if synchronous_commit=off, or if the current
     * transaction has not performed any WAL-logged operation or didn't assign
     * a xid.  The transaction can end up not writing any WAL, even if it has
     * a xid, if it only wrote to temporary and/or unlogged tables.  It can
     * end up having written WAL without an xid if it did HOT pruning.  In
     * case of a crash, the loss of such a transaction will be irrelevant;
     * temp tables will be lost anyway, unlogged tables will be truncated and
     * HOT pruning will be done again later. (Given the foregoing, you might
     * think that it would be unnecessary to emit the XLOG record at all in
     * this case, but we don't currently try to do that.  It would certainly
     * cause problems at least in Hot Standby mode, where the
     * KnownAssignedXids machinery requires tracking every XID assignment.  It
     * might be OK to skip it only when wal_level < hot_standby, but for now
     * we don't.)
     *
     * However, if we're doing cleanup of any non-temp rels or committing any
     * command that wanted to force sync commit, then we must flush XLOG
     * immediately.  (We must not allow asynchronous commit if there are any
     * non-temp tables to be deleted, because we might delete the files before
     * the COMMIT record is flushed to disk.  We do allow asynchronous commit
     * if all to-be-deleted tables are temporary though, since they are lost
     * anyway if we crash.)
     */
    if ((wrote_xlog && markXidCommitted &&
         synchronous_commit > SYNCHRONOUS_COMMIT_OFF) ||
        forceSyncCommit || nrels > 0)
    {
        XLogFlush(XactLastRecEnd);

        /*
         * Now we may update the CLOG, if we wrote a COMMIT record above
         */
        if (markXidCommitted)
            TransactionIdCommitTree(xid, nchildren, children);
    }
    else
    {
        /*
         * Asynchronous commit case:
         *
         * This enables possible committed transaction loss in the case of a
         * postmaster crash because WAL buffers are left unwritten. Ideally we
         * could issue the WAL write without the fsync, but some
         * wal_sync_methods do not allow separate write/fsync.
         *
         * Report the latest async commit LSN, so that the WAL writer knows to
         * flush this commit.
         */
        XLogSetAsyncXactLSN(XactLastRecEnd);

        /*
         * We must not immediately update the CLOG, since we didn't flush the
         * XLOG. Instead, we store the LSN up to which the XLOG must be
         * flushed before the CLOG may be updated.
         */
        if (markXidCommitted)
            TransactionIdAsyncCommitTree(xid, nchildren, children, XactLastRecEnd);
    }

Background on PostgreSQL SERIALIZABLE READ ONLY DEFERRABLE Transactions


Before discussing serializable read only deferrable, we need a little background on the serializable isolation level.
https://wiki.postgresql.org/wiki/Serializable
http://www.postgresql.org/docs/9.5/static/transaction-iso.html#XACT-SERIALIZABLE
http://blog.163.com/digoal@126/blog/static/1638770402013111795219541

serializable is the highest isolation level: it emulates serial execution of concurrent transactions, and when serial consistency cannot be achieved, the session that commits later has to roll back, so serializable transactions do interfere with one another.
PostgreSQL describes the DEFERRABLE variant of serializable in three places.
Sometimes a user needs to run reporting SQL under a serializable-consistent view; such SQL may run for a very long time but only needs read access. A very long-running serializable transaction, even a read-only one, can interfere with other serializable transactions that are already running, forcing them to roll back.
For this kind of long-running read-only serializable transaction we can use DEFERRABLE mode: the transaction waits until the other serializable transactions have finished, or until it is certain that the running ones cannot conflict with it, and only then starts executing the read-only SQL.
How long that wait takes is not under your control.
The benefit is that once your long read-only serializable SQL is running, it can no longer disturb anyone else's serializable transactions, so it can run as long as you like.
pg_dump also uses this; see the pg_dump documentation:
(because pg_dump usually runs for a long time, --serializable-deferrable dumps the database from a serializable viewpoint, yielding a serially consistent backup)
By default pg_dump runs its backup at the repeatable read isolation level.

       --serializable-deferrable
           Use a serializable transaction for the dump, to ensure that the snapshot used is consistent with later database states; but do this by waiting for a point in the transaction stream at which no anomalies can be present,
           so that there isn't a risk of the dump failing or causing other transactions to roll back with a serialization_failure. See Chapter 13, Concurrency Control, in the documentation for more information about transaction
           isolation and concurrency control.

           This option is not beneficial for a dump which is intended only for disaster recovery. It could be useful for a dump used to load a copy of the database for reporting or other read-only load sharing while the original
           database continues to be updated. Without it the dump may reflect a state which is not consistent with any serial execution of the transactions eventually committed. For example, if batch processing techniques are used,
           a batch may show as closed in the dump without all of the items which are in the batch appearing.

           This option will make no difference if there are no read-write transactions active when pg_dump is started. If read-write transactions are active, the start of the dump may be delayed for an indeterminate length of time.
           Once running, performance with or without the switch is the same.

The SET TRANSACTION syntax also supports the DEFERRABLE property.
DEFERRABLE only takes effect for READ ONLY SERIALIZABLE transactions.

http://www.postgresql.org/docs/9.5/static/sql-set-transaction.html

The DEFERRABLE transaction property has no effect unless the transaction is also SERIALIZABLE and READ ONLY. When all three of these properties are selected for a transaction, the transaction may block when first acquiring its snapshot, after which it is able to run without the normal overhead of a SERIALIZABLE transaction and without any risk of contributing to or being canceled by a serialization failure. This mode is well suited for long-running reports or backups.
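A minimal usage sketch (the reporting query and table name here are hypothetical):

begin transaction isolation level serializable, read only, deferrable;
-- the transaction may block while first acquiring its snapshot,
-- until no concurrently running serializable transaction can conflict with it
select count(*) from some_big_table;   -- long-running report queries go here
commit;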

There is also a parameter that controls whether serializable read only transactions are deferrable by default.

http://www.postgresql.org/docs/9.5/static/runtime-config-client.html#GUC-DEFAULT-TRANSACTION-DEFERRABLE

default_transaction_deferrable (boolean)
When running at the serializable isolation level, a deferrable read-only SQL transaction may be delayed before it is allowed to proceed. However, once it begins executing it does not incur any of the overhead required to ensure serializability; so serialization code will have no reason to force it to abort because of concurrent updates, making this option suitable for long-running read-only transactions.

This parameter controls the default deferrable status of each new transaction. It currently has no effect on read-write transactions or those operating at isolation levels lower than serializable. The default is off.

Compressing the PostgreSQL Backup Link with sslcompression


Link compression can improve PostgreSQL backup performance over a narrow-bandwidth network.
It relies on PostgreSQL's SSL support; for setup see
http://blog.163.com/digoal@126/blog/static/163877040201342233131835

Both the streaming replication protocol and pg_dump support SSL, because they both go through libpq, and libpq supports SSL.

http://www.postgresql.org/docs/9.3/static/libpq-envars.html

PGSSLCOMPRESSION behaves the same as the sslcompression connection parameter.

http://www.postgresql.org/docs/9.5/static/libpq-connect.html#LIBPQ-CONNECT-SSLCOMPRESSION

sslcompression
If set to 1 (default), data sent over SSL connections will be compressed (this requires OpenSSL version 0.9.8 or later). If set to 0, compression will be disabled (this requires OpenSSL 1.0.0 or later). This parameter is ignored if a connection without SSL is made, or if the version of OpenSSL used does not support it.

Compression uses CPU time, but can improve throughput if the network is the bottleneck. Disabling compression can improve response time and throughput if CPU performance is the limiting factor.

Note that on Linux you may run into a Linux bug: both the database server and the client must first set the following environment variable

export OPENSSL_DEFAULT_ZLIB=1

and then start the database and the client.

psql postgresql://xxx.xxx.xxx.xxx:1921/postgres?user=postgres\&sslcompression=1\&application_name=myapp\&password=postgres\&sslmode=require
psql (9.5.2)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: on)
Type "help" for help.

postgres=# select * from pg_stat_ssl;
 pid  | ssl | version |           cipher            | bits | compression | clientdn 
------+-----+---------+-----------------------------+------+-------------+----------
 9091 | t   | TLSv1.2 | ECDHE-RSA-AES256-GCM-SHA384 |  256 | t           | 
 9106 | t   | TLSv1.2 | ECDHE-RSA-AES256-GCM-SHA384 |  256 | t           | 
(2 rows)

Logical backup over a compressed SSL connection

pg_dump postgresql://xxx.xxx.xxx.xxx:1921/postgres?user=postgres\&sslcompression=1\&application_name=myapp\&password=postgres\&sslmode=require -F c -f ./test.dmp

Streaming replication over a compressed SSL connection

primary_conninfo = 'host=xxx.xxx.xxx.xxx port=xxxx user=xxx password=xxx sslmode=require sslcompression=1'

On the Importance of Database Programming in the Cloud


The cloud gives us convenience and lowers development and operations costs.
But we also have to think about the network latency between our cloud components
(compared with servers sitting next to each other on a LAN).
You can verify it with whatever test method you like.
Traditionally we put the data in the database, the database only did simple CRUD, and most of the business logic ran on the application servers.
In the cloud era, if we keep doing that, an application that needs many round trips to the database wastes a lot of time.
We should make full use of the database's programming capabilities. PostgreSQL, for example, is powerful enough that business logic can run inside the database:
procedural languages such as plv8, plpython, plpgsql, plperl and pltcl;
very rich data type support, e.g. jsonb, GIS, text, composite types, key-value types and more;
index types including btree, hash, gin, gist, spgist and brin;
SQL features such as window functions, recursive queries and grouping sets;
joins via hash join, merge join and nestloop join;
an optimizer with configurable cost factors, CBO and a genetic query optimizer.
On top of that, PostgreSQL can use GPUs to accelerate computation, both implicitly and explicitly.
Implicit means the custom scan provider API the database exposes, for which plugins already exist.
Explicit means combining a procedural language with CUDA, for example PyCUDA.
Julia can also be used for convenient parallel programming.
PostgreSQL can satisfy the needs of most applications.

Isn't it a bit of a waste to use such a capable database only for CRUD? When the network is the bottleneck, moving business logic close to the data makes full use of the database, greatly improves efficiency, lowers RT, and improves the user experience of the business system.

For cloud network latency, see the articles I wrote a few days ago.
PostgreSQL network latency: a quantitative analysis
https://yq.aliyun.com/articles/35176

Benchmarking Alibaba Cloud RDS PostgreSQL with sysbench
(includes how to measure the network latency of a cloud database)
https://yq.aliyun.com/articles/35517

This article also uses sysbench to demonstrate how significant the performance gain from database server-side programming is.
The test environment is again Alibaba Cloud RDS PostgreSQL; the ECS client is a 32-core machine in the same data center as the RDS PG instance.

Steps
Purchase an RDS PG database instance.
Create a database user.
Purchase an ECS instance in the same data center, on the same VPC (or the same classic network) as the RDS PG instance.
Install the PostgreSQL client on the ECS instance.

useradd digoal  
su - digoal  

wget https://ftp.postgresql.org/pub/source/v9.5.2/postgresql-9.5.2.tar.bz2  
tar -jxvf postgresql-9.5.2.tar.bz2  
cd postgresql-9.5.2  
./configure --prefix=/home/digoal/pgsql9.5  
gmake world -j 16  
gmake install-world -j 16  

vi ~/env_pg.sh  
export PS1="$USER@`/bin/hostname -s`-> "  
export PGPORT=1921  
export LANG=en_US.utf8  
export PGHOME=/home/digoal/pgsql9.5  
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH  
export DATE=`date +"%Y%m%d%H%M"`  
export PATH=$PGHOME/bin:$PATH:.  
export MANPATH=$PGHOME/share/man:$MANPATH  
export PGHOST=$PGDATA  
export PGUSER=postgres  
export PGDATABASE=postgres  
alias rm='rm -i'  
alias ll='ls -lh'  
unalias vi  

. ~/env_pg.sh  

Install sysbench (from github)

cd ~  

git clone https://github.com/digoal/sysbench.git  

Initialize the test data in parallel.
Create 256 tables, each with 1 million rows.

cd sysbench/sysbench

./sysbench_pg --test=lua/parallel_init_pg.lua \  
  --db-driver=pgsql \  
  --pgsql-host=xxx.xxx.xxx.xxx \  
  --pgsql-port=3432 \  
  --pgsql-user=digoal \  
  --pgsql-password=pwd \  
  --pgsql-db=postgres \  
  --oltp-tables-count=256 \  
  --oltp-table-size=1000000 \  
  --num-threads=256 \  
  cleanup  

./sysbench_pg --test=lua/parallel_init_pg.lua \  
  --db-driver=pgsql \  
  --pgsql-host=xxx.xxx.xxx.xxx \  
  --pgsql-port=3432 \  
  --pgsql-user=digoal \  
  --pgsql-password=pwd \  
  --pgsql-db=postgres \  
  --oltp-tables-count=256 \  
  --oltp-table-size=1000000 \  
  --num-threads=256 \  
  run  

The table structure and a data sample are shown below.

postgres=# \d sbtest1
                        Unlogged table "public.sbtest1"
 Column |      Type      |                      Modifiers                       
--------+----------------+------------------------------------------------------
 id     | integer        | not null default nextval('sbtest1_id_seq'::regclass)
 k      | integer        | 
 c      | character(120) | not null default ''::bpchar
 pad    | character(60)  | not null default ''::bpchar
Indexes:
    "sbtest1_pkey" PRIMARY KEY, btree (id)
    "k_1" btree (k)

postgres=# select * from sbtest1 limit 5;
 id |   k    |                                                            c                                                             |                             pad                              
----+--------+--------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------
  1 | 927400 | 78247013700-08372511066-37272961232-38016471864-11589387900-80841931510-50966088603-62786739920-93329627701-94363926684  | 20366050401-55867147298-47128473450-44371584107-36273281249 
  2 | 462112 | 33336348704-49028541945-04338357184-44674729632-69224541153-27063217868-14496534686-77928030196-63919798937-80588593810  | 38824762880-21605093266-59283997376-03159087192-70078005827 
  3 | 609690 | 17141976536-38472427836-27892280734-53859074932-31683911066-84549350288-65797420080-49379319521-63297303760-61130825562  | 36249349424-23238674070-77120648190-02671383694-80399189992 
  4 | 442570 | 67075106316-33193756800-10093726800-79829712284-63470268100-62589769080-83382855836-21662325414-74934263415-54280518945  | 73517378377-96791797586-54757886848-05144609036-20409864730 
  5 | 126743 | 21608327653-47776651750-16637007643-12991848186-40427635184-24941570285-23769806501-34607807466-88813292380-75665466083  | 06891362272-65143041120-84598756285-94704681508-91545142862 
(5 rows)

The test contains the following SQL; the first statement is executed 10 times per transaction (see the sketch after this list):

   -- select c from tbl where id = $1;  --  executed 10 times  
   -- select id,k,c,pad from tbl where id in ($1,...$n);  
   -- select c from tbl where id between $1 and $2;  
   -- select sum(k) from tbl where id between $1 and $2;  
   -- select c from tbl where id between $1 and $2 order by c;  
   -- select distinct c from tbl where id between $1 and $2 order by c;  
   -- update tbl set k=k+1 where id = $1;  
   -- update tbl set c=$2 where id = $1;  
   -- delete from tbl where id = $1;  
   -- insert into tbl(id, k, c, pad) values ($1,$2,$3,$4);  
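
As a rough sketch of the server-side approach (this is not the actual content of lua/oltp_pg_udf.lua; the function name f_oltp and its parameters are made up for illustration), the statements above could be wrapped in a single PL/pgSQL function so that each event costs one round trip and the whole body runs in one transaction:

-- hypothetical wrapper around the per-event statements, shown against sbtest1 only
create or replace function f_oltp(v_id int, v_id2 int, v_c text, v_pad text) returns void as $$
begin
  -- read queries; PERFORM discards the result, only the work matters for the benchmark
  perform c from sbtest1 where id = v_id;                     -- the real script runs this 10 times with different ids
  perform id, k, c, pad from sbtest1 where id in (v_id, v_id2);
  perform c from sbtest1 where id between v_id and v_id2;
  perform sum(k) from sbtest1 where id between v_id and v_id2;
  perform c from sbtest1 where id between v_id and v_id2 order by c;
  perform distinct c from sbtest1 where id between v_id and v_id2 order by c;
  -- write queries
  update sbtest1 set k = k + 1 where id = v_id;
  update sbtest1 set c = v_c where id = v_id;
  delete from sbtest1 where id = v_id;
  insert into sbtest1 (id, k, c, pad) values (v_id, v_id2, v_c, v_pad);
end;
$$ language plpgsql;

-- one round trip per event, e.g.:
-- select f_oltp(4321, 5321, 'new c value', 'new pad value');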

First, test the performance without server-side programming, at 16 and 256 concurrent connections.

./sysbench_pg --test=lua/oltp_pg.lua \  
  --db-driver=pgsql \  
  --pgsql-host=xxx.xxx.xxx.xxx \  
  --pgsql-port=3432 \  
  --pgsql-user=digoal \  
  --pgsql-password=pwd \  
  --pgsql-db=postgres \  
  --oltp-tables-count=16 \  
  --oltp-table-size=1000000 \  
  --num-threads=16 \  
  --max-time=120  \  
  --max-requests=0 \  
  --report-interval=1 \  
  run  

./sysbench_pg --test=lua/oltp_pg.lua \  
  --db-driver=pgsql \  
  --pgsql-host=xxx.xxx.xxx.xxx \  
  --pgsql-port=3432 \  
  --pgsql-user=digoal \  
  --pgsql-password=pwd \  
  --pgsql-db=postgres \  
  --oltp-tables-count=256 \  
  --oltp-table-size=1000000 \  
  --num-threads=256 \  
  --max-time=120  \  
  --max-requests=0 \  
  --report-interval=1 \  
  run  

Results
16 connections:
tps 248.27
qps 4717.13
256 connections:
tps 1243.61
qps 23628.59

Next, test the performance with server-side programming, at 16 and 256 concurrent connections.

./sysbench_pg --test=lua/oltp_pg_udf.lua \  
  --db-driver=pgsql \  
  --pgsql-host=xxx.xxx.xxx.xxx \  
  --pgsql-port=3432 \  
  --pgsql-user=digoal \  
  --pgsql-password=pwd \  
  --pgsql-db=postgres \  
  --oltp-tables-count=16 \  
  --oltp-table-size=1000000 \  
  --num-threads=16 \  
  --max-time=120  \  
  --max-requests=0 \  
  --report-interval=1 \  
  run  

./sysbench_pg --test=lua/oltp_pg_udf.lua \  
  --db-driver=pgsql \  
  --pgsql-host=xxx.xxx.xxx.xxx \  
  --pgsql-port=3432 \  
  --pgsql-user=digoal \  
  --pgsql-password=pwd \  
  --pgsql-db=postgres \  
  --oltp-tables-count=256 \  
  --oltp-table-size=1000000 \  
  --num-threads=256 \  
  --max-time=120  \  
  --max-requests=0 \  
  --report-interval=1 \  
  run  

Results
16 connections:
tps 1533.44
qps 29135.36
256 connections:
tps 1684.45
qps 32004.55

The test data clearly shows that RT has a huge impact on small transactions (a single connection achieves only 15.56 TPS; the more round trips, the lower the TPS).
Server-side programming largely sidesteps the network problem, and works extremely well for highly concurrent small transactions with many round trips.
Among relational databases, PostgreSQL probably supports the richest set of server-side programming languages, e.g. C, Python, Java, JavaScript, Lua, Perl, Tcl, and more.


PostgreSQL csvlog source code analysis


PostgreSQL's csvlog format records a great deal of information; by loading it into a table (or exposing it as a CSV foreign table) you can analyze the logs with SQL.
Example from the documentation:
http://www.postgresql.org/docs/9.5/static/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-CSVLOG

CREATE TABLE postgres_log
(
  log_time timestamp(3) with time zone,
  user_name text,
  database_name text,
  process_id integer,
  connection_from text,
  session_id text,
  session_line_num bigint,
  command_tag text,
  session_start_time timestamp with time zone,
  virtual_transaction_id text,
  transaction_id bigint,
  error_severity text,
  sql_state_code text,
  message text,
  detail text,
  hint text,
  internal_query text,
  internal_query_pos integer,
  context text,
  query text,
  query_pos integer,
  location text,
  application_name text,
  PRIMARY KEY (session_id, session_line_num)
);
To import a log file into this table, use the COPY FROM command:

COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
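
Note that csvlog output requires the logging collector. The settings and the sample analysis query below are a sketch based on the standard parameters (adapt paths and filters to your environment):

# postgresql.conf: required for csvlog output
logging_collector = on
log_destination = 'csvlog'

-- after importing a file, e.g. count errors per user and SQLSTATE:
select user_name, sql_state_code, count(*)
from postgres_log
where error_severity in ('ERROR', 'FATAL', 'PANIC')
group by 1, 2
order by 3 desc;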

So what does each csvlog field mean?
Some are self-explanatory, others less so. No worries: the PostgreSQL code is very clean, so just read the source for the fields you are unsure about:

src/include/utils/elog.h
/*
 * ErrorData holds the data accumulated during any one ereport() cycle.
 * Any non-NULL pointers must point to palloc'd data.
 * (The const pointers are an exception; we assume they point at non-freeable
 * constant strings.)
 */
typedef struct ErrorData
{
    int         elevel;         /* error level */
    bool        output_to_server;       /* will report to server log? */
    bool        output_to_client;       /* will report to client? */
    bool        show_funcname;  /* true to force funcname inclusion */
    bool        hide_stmt;      /* true to prevent STATEMENT: inclusion */
    const char *filename;       /* __FILE__ of ereport() call */
    int         lineno;         /* __LINE__ of ereport() call */
    const char *funcname;       /* __func__ of ereport() call */
    const char *domain;         /* message domain */
    const char *context_domain; /* message domain for context message */
    int         sqlerrcode;     /* encoded ERRSTATE */
    char       *message;        /* primary error message */
    char       *detail;         /* detail error message */
    char       *detail_log;     /* detail error message for server log only */
    char       *hint;           /* hint message */
    char       *context;        /* context message */
    char       *schema_name;    /* name of schema */
    char       *table_name;     /* name of table */
    char       *column_name;    /* name of column */
    char       *datatype_name;  /* name of datatype */
    char       *constraint_name;    /* name of constraint */
    int         cursorpos;      /* cursor index into query string */
    int         internalpos;    /* cursor index into internalquery */
    char       *internalquery;  /* text of internally-generated query */
    int         saved_errno;    /* errno at entry */

    /* context containing associated non-constant strings */
    struct MemoryContextData *assoc_context;
} ErrorData;

The write_csvlog interface

src/backend/utils/error/elog.c
/*
 * Constructs the error message, depending on the Errordata it gets, in a CSV
 * format which is described in doc/src/sgml/config.sgml.
 */
static void
write_csvlog(ErrorData *edata)
{
        StringInfoData buf;
        bool            print_stmt = false;

        /* static counter for line numbers */
        static long log_line_number = 0;

        /* has counter been reset in current process? */
        static int      log_my_pid = 0;

        /*
         * This is one of the few places where we'd rather not inherit a static
         * variable's value from the postmaster.  But since we will, reset it when
         * MyProcPid changes.
         */
        if (log_my_pid != MyProcPid)
        {
                log_line_number = 0;
                log_my_pid = MyProcPid;
                formatted_start_time[0] = '\0';
        }
        log_line_number++;

        initStringInfo(&buf);

        // From here on you can see what each field means; the fields are separated by appendStringInfoChar(&buf, ',').
        /*
         * timestamp with milliseconds
         *
         * Check if the timestamp is already calculated for the syslog message,
         * and use it if so.  Otherwise, get the current timestamp.  This is done
         * to put same timestamp in both syslog and csvlog messages.
         */
        if (formatted_log_time[0] == '\0')
                setup_formatted_log_time();

        appendStringInfoString(&buf, formatted_log_time);
        appendStringInfoChar(&buf, ',');

        /* username */
        if (MyProcPort)
                appendCSVLiteral(&buf, MyProcPort->user_name);
        appendStringInfoChar(&buf, ',');

        /* database name */
        if (MyProcPort)
                appendCSVLiteral(&buf, MyProcPort->database_name);
        appendStringInfoChar(&buf, ',');

        /* Process id  */
        if (MyProcPid != 0)
                appendStringInfo(&buf, "%d", MyProcPid);
        appendStringInfoChar(&buf, ',');

        /* Remote host and port */
        if (MyProcPort && MyProcPort->remote_host)
        {
                appendStringInfoChar(&buf, '"');
                appendStringInfoString(&buf, MyProcPort->remote_host);
                if (MyProcPort->remote_port && MyProcPort->remote_port[0] != '\0')
                {
                        appendStringInfoChar(&buf, ':');
                        appendStringInfoString(&buf, MyProcPort->remote_port);
                }
                appendStringInfoChar(&buf, '"');
        }
        appendStringInfoChar(&buf, ',');

        /* session id */  // the session id is made of two parts, the backend start time and the PID, so it is unique
        appendStringInfo(&buf, "%lx.%x", (long) MyStartTime, MyProcPid);
        appendStringInfoChar(&buf, ',');

        /* Line number */
        appendStringInfo(&buf, "%ld", log_line_number);
        appendStringInfoChar(&buf, ',');

        /* PS display */
        if (MyProcPort)
        {
                StringInfoData msgbuf;
                const char *psdisp;
                int                     displen;

                initStringInfo(&msgbuf);

                psdisp = get_ps_display(&displen);
                appendBinaryStringInfo(&msgbuf, psdisp, displen);
                appendCSVLiteral(&buf, msgbuf.data);

                pfree(msgbuf.data);
        }
        appendStringInfoChar(&buf, ',');

        /* session start timestamp */
        if (formatted_start_time[0] == '\0')
                setup_formatted_start_time();
        appendStringInfoString(&buf, formatted_start_time);
        appendStringInfoChar(&buf, ',');

        /* Virtual transaction id */
        /* keep VXID format in sync with lockfuncs.c */
        if (MyProc != NULL && MyProc->backendId != InvalidBackendId)
                appendStringInfo(&buf, "%d/%u", MyProc->backendId, MyProc->lxid);
        appendStringInfoChar(&buf, ',');
        /* Transaction id */
        appendStringInfo(&buf, "%u", GetTopTransactionIdIfAny());
        appendStringInfoChar(&buf, ',');

        /* Error severity */
        appendStringInfoString(&buf, error_severity(edata->elevel));
        appendStringInfoChar(&buf, ',');

        /* SQL state code */
        appendStringInfoString(&buf, unpack_sql_state(edata->sqlerrcode));
        appendStringInfoChar(&buf, ',');

        /* errmessage */
        appendCSVLiteral(&buf, edata->message);
        appendStringInfoChar(&buf, ',');

        /* errdetail or errdetail_log */  // prefer the server-log-only detail message when present
        if (edata->detail_log)
                appendCSVLiteral(&buf, edata->detail_log);
        else
                appendCSVLiteral(&buf, edata->detail);
        appendStringInfoChar(&buf, ',');

        /* errhint */
        appendCSVLiteral(&buf, edata->hint);
        appendStringInfoChar(&buf, ',');

        /* internal query */
        appendCSVLiteral(&buf, edata->internalquery);
        appendStringInfoChar(&buf, ',');

        /* if printed internal query, print internal pos too */
        if (edata->internalpos > 0 && edata->internalquery != NULL)
                appendStringInfo(&buf, "%d", edata->internalpos);
        appendStringInfoChar(&buf, ',');

        /* errcontext */
        if (!edata->hide_ctx)
                appendCSVLiteral(&buf, edata->context);
        appendStringInfoChar(&buf, ',');

        /* user query --- only reported if not disabled by the caller */
        if (is_log_level_output(edata->elevel, log_min_error_statement) &&
                debug_query_string != NULL &&
                !edata->hide_stmt)
                print_stmt = true;
        if (print_stmt)
                appendCSVLiteral(&buf, debug_query_string);
        appendStringInfoChar(&buf, ',');
        if (print_stmt && edata->cursorpos > 0)
                appendStringInfo(&buf, "%d", edata->cursorpos);
        appendStringInfoChar(&buf, ',');

        /* file error location */
        if (Log_error_verbosity >= PGERROR_VERBOSE)
        {
                StringInfoData msgbuf;

                initStringInfo(&msgbuf);

                if (edata->funcname && edata->filename)
                        appendStringInfo(&msgbuf, "%s, %s:%d",
                                                         edata->funcname, edata->filename,
                                                         edata->lineno);
                else if (edata->filename)
                        appendStringInfo(&msgbuf, "%s:%d",
                                                         edata->filename, edata->lineno);
                appendCSVLiteral(&buf, msgbuf.data);
                pfree(msgbuf.data);
        }
        appendStringInfoChar(&buf, ',');

        /* application name */
        if (application_name)
                appendCSVLiteral(&buf, application_name);

        appendStringInfoChar(&buf, '\n');

        /* If in the syslogger process, try to write messages direct to file */
        if (am_syslogger)
                write_syslogger_file(buf.data, buf.len, LOG_DESTINATION_CSVLOG);
        else
                write_pipe_chunks(buf.data, buf.len, LOG_DESTINATION_CSVLOG);

        pfree(buf.data);
}

One more note: if the process writing the log is the syslogger, it writes the file directly; any other process sends the message to the pipe.
If SQL audit logging is enabled, highly concurrent small transactions can be noticeably affected; this code is a good place to start if you want to optimize that.

PostgreSQL pg_backup_start_time() CST time zone conversion issue


One of PostgreSQL's physical backup methods:
after creating a backup point with the pg_start_backup() function, the user can start copying PG's data files.

postgres=# select pg_start_backup('a'),now();
 pg_start_backup |              now              
-----------------+-------------------------------
 0/50000028      | 2016-05-06 11:03:30.917509+08
(1 row)

Calling pg_start_backup creates a checkpoint and writes a backup_label file into $PGDATA.
It contains a START TIME entry, which is the time taken after the checkpoint completed.

START WAL LOCATION: 0/50000028 (file 000000010000000000000014)
CHECKPOINT LOCATION: 0/50000028
BACKUP METHOD: pg_start_backup
BACKUP FROM: master
START TIME: 2016-05-06 11:03:33 CST
LABEL: a

However, the time returned by pg_backup_start_time does not match it.

postgres=# select pg_backup_start_time();
  pg_backup_start_time  
------------------------
 2016-05-07 01:03:33+08
(1 row)

To analyze the cause, let's first look at the pg_backup_start_time code.

postgres=# \df+ pg_backup_start_time
                                                                                         List of functions
   Schema   |         Name         |     Result data type     | Argument data types |  Type  | Security | Volatility |  Owner   | Language |     Source code      |          Description           
------------+----------------------+--------------------------+---------------------+--------+----------+------------+----------+----------+----------------------+--------------------------------
 pg_catalog | pg_backup_start_time | timestamp with time zone |                     | normal | invoker  | stable     | postgres | internal | pg_backup_start_time | start time of an online backup
(1 row)

The code is as follows.

/*
 * Returns start time of an online exclusive backup.
 *
 * When there's no exclusive backup in progress, the function
 * returns NULL.
 */
Datum
pg_backup_start_time(PG_FUNCTION_ARGS)
{
        Datum           xtime;
        FILE       *lfp;
        char            fline[MAXPGPATH];
        char            backup_start_time[30];

        /*
         * See if label file is present
         */
        lfp = AllocateFile(BACKUP_LABEL_FILE, "r");
        if (lfp == NULL)
        {
                if (errno != ENOENT)
                        ereport(ERROR,
                                        (errcode_for_file_access(),
                                         errmsg("could not read file \"%s\": %m",
                                                        BACKUP_LABEL_FILE)));
                PG_RETURN_NULL();
        }

        /*
         * Parse the file to find the START TIME line.
         */
        backup_start_time[0] = '\0';
        while (fgets(fline, sizeof(fline), lfp) != NULL)
        {
                if (sscanf(fline, "START TIME: %25[^\n]\n", backup_start_time) == 1)
                        break;
        }

        /* Check for a read error. */
        if (ferror(lfp))
                ereport(ERROR,
                                (errcode_for_file_access(),
                           errmsg("could not read file \"%s\": %m", BACKUP_LABEL_FILE)));

        /* Close the backup label file. */
        if (FreeFile(lfp))
                ereport(ERROR,
                                (errcode_for_file_access(),
                          errmsg("could not close file \"%s\": %m", BACKUP_LABEL_FILE)));

        if (strlen(backup_start_time) == 0)
                ereport(ERROR,
                                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                                 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));

        /*
         * Convert the time string read from file to TimestampTz form.
         */
        xtime = DirectFunctionCall3(timestamptz_in,
                                                                CStringGetDatum(backup_start_time),
                                                                ObjectIdGetDatum(InvalidOid),
                                                                Int32GetDatum(-1));

        PG_RETURN_DATUM(xtime);
}

So from the code we can see that pg_backup_start_time reads the start time from backup_label and converts it into a timestamp with time zone.

CST is ambiguous; it can stand for any of the following 4 time zones:

Central Standard Time (USA) UT-6:00
Central Standard Time (Australia) UT+9:30
China Standard Time UT+8:00
Cuba Standard Time UT-4:00

So the problem actually lies in the time zone conversion:

postgres=# show timezone;
 TimeZone 
----------
 PRC
(1 row)

postgres=# select timestamp '2016-05-06 11:03:33 CST';
      timestamp      
---------------------
 2016-05-06 11:03:33
(1 row)

postgres=# select timestamptz '2016-05-06 11:03:33 CST';
      timestamptz       
------------------------
 2016-05-07 01:03:33+08
(1 row)

PostgreSQL's pg_backup_start_time apparently interprets CST as the USA zone (UT-6:00):

postgres=# set timezone='-6';
SET
postgres=# select pg_backup_start_time();
  pg_backup_start_time  
------------------------
 2016-05-06 11:03:33-06
(1 row)
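
A possible workaround sketch, assuming China Standard Time is the intended zone: treat the backup_label value as a plain timestamp and attach the zone explicitly, instead of relying on the ambiguous CST abbreviation.

-- interpret the backup_label START TIME as Asia/Shanghai explicitly;
-- with TimeZone = PRC this yields 2016-05-06 11:03:33+08
select timestamp '2016-05-06 11:03:33' at time zone 'Asia/Shanghai';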

Optimizing Chinese fuzzy-query performance with PostgreSQL trgm


Prefix fuzzy, suffix fuzzy, infix fuzzy and regex matching are all common text-search requirements.
Besides full-text search, PostgreSQL has trgm, which ordinary databases do not have and which many people may never have heard of.
For prefix fuzzy queries, PG, like other databases, can use a btree to accelerate them; suffix fuzzy queries can be accelerated with a functional index on the reversed string (see the sketch after this paragraph).
For infix fuzzy and regex matching you can use trgm. TRGM is a very powerful extension and is extremely effective for this kind of text search; on about 1 million rows the speedup is more than 500x.
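
A minimal sketch of the reversed-string trick for suffix matches (the index name is illustrative, and the text_pattern_ops operator class is only needed when the collation is not C):

create index idx_tbl_rev on tbl (reverse(info) text_pattern_ops);
-- "info ends with 'f817d0'" becomes a prefix match on the reversed string:
select * from tbl where reverse(info) like reverse('%f817d0');   -- i.e. like '0d718f%'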

Example:
Generate 1 million rows of data.

postgres=# create table tbl (id int, info text);
CREATE TABLE
postgres=# insert into tbl select generate_series(1,1000000), md5(random()::text);
INSERT 0 1000000
postgres=# create extension if not exists pg_trgm;  -- required for the gin_trgm_ops operator class
CREATE EXTENSION
postgres=# create index idx_tbl_1 on tbl using gin(info gin_trgm_ops);
CREATE INDEX

postgres=# select * from tbl limit 10;
 id |               info               
----+----------------------------------
  1 | dc369f84738f7fa4dc38c364cef817d0
  2 | 4912b0b16670c4f2390d44ae790b9809
  3 | eb442b00bf3b5bc6863d004a2c8fa3bb
  4 | 0b4b8a8ad0cdf2e6870afbb94813eba4
  5 | 661e895ee982ec4d9f944b10adffb897
  6 | 09c4e7476d4bdfc1ccbdfe92ba0fdbdf
  7 | 8b6e442faed938d066dda5e552100277
  8 | e5cdeca599d5068a8d3bb6ce9f370827
  9 | ddbbfbeaa9199219b7c909fb395d9a69
 10 | 96f254f64df1ec43bb0cb4801222c919
(10 rows)

postgres=# select * from tbl where info ~ '670c4f2';
 id |               info               
----+----------------------------------
  2 | 4912b0b16670c4f2390d44ae790b9809
(1 row)
Time: 2.668 ms

postgres=# explain analyze select * from tbl where info ~ '670c4f2';
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbl  (cost=28.27..138.43 rows=100 width=37) (actual time=1.957..1.958 rows=1 loops=1)
   Recheck Cond: (info ~ '670c4f2'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on idx_tbl_1  (cost=0.00..28.25 rows=100 width=0) (actual time=1.939..1.939 rows=1 loops=1)
         Index Cond: (info ~ '670c4f2'::text)
 Planning time: 0.342 ms
 Execution time: 1.989 ms
(7 rows)

Without the TRGM optimization, the same kind of query takes 1657 milliseconds.
postgres=# set enable_bitmapscan=off;
SET
Time: 0.272 ms
postgres=# select * from tbl where info ~ 'e770044a';
 id |               info               
----+----------------------------------
  6 | 776c3cdf5fa818a324ef3e770044a488
(1 row)
Time: 1657.231 ms

For ASCII characters the performance improvement is very significant.

Because trgm does not support wide characters (wchar), a conversion is needed.
Chinese:

postgres=# explain analyze select * from tbl where info ~ '中国';
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbl  (cost=149.62..151.82 rows=2 width=37) (actual time=8.624..8.624 rows=0 loops=1)
   Recheck Cond: (info ~ '中国'::text)
   Rows Removed by Index Recheck: 10103
   Heap Blocks: exact=156
   ->  Bitmap Index Scan on idx_tbl_1  (cost=0.00..149.61 rows=2 width=0) (actual time=1.167..1.167 rows=10103 loops=1)
         Index Cond: (info ~ '中国'::text)
 Planning time: 0.244 ms
 Execution time: 8.657 ms
(8 rows)
Time: 9.388 ms

Although the Chinese query uses the index, it does not tokenize correctly, so everything ends up in the recheck step.
It is actually no better than a full table scan:

postgres=# set enable_bitmapscan=off;
SET
postgres=# explain analyze select * from tbl where info ~ '中国';
                                           QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 Seq Scan on tbl  (cost=0.00..399.75 rows=2 width=37) (actual time=6.899..6.899 rows=0 loops=1)
   Filter: (info ~ '中国'::text)
   Rows Removed by Filter: 10103
 Planning time: 0.213 ms
 Execution time: 6.921 ms
(5 rows)
Time: 7.593 ms

But you can achieve this with a PostgreSQL functional index plus a bytea conversion (turning the text into its ASCII hex representation).
For example:

postgres=# select text(textsend(info)) from tbl limit 10;
                                                                                       text                                                                                       
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 \xe7abbde69b8ce7b5a4e8b197e5afa9e58c88e991a6e7b18ce5b495e8a79fe7ae8ee882bce7a283e7af9de8a086e7ac8de59e81e5a6bae9bcb6e6ba9fe981bbe4bda8e7928de98ab0e5a18de697b5e79fabe9b0a5e9b0a5
 \xe5aa8ee69ab5e58996e892b0e89484e587b0e8bcbce69f80e79eb8e89390e7baa8e79f93e582b6e98f81e9a18ee9b48ee9ba8ce784a6e8b5a2e5a797e9a3b5e5a4aee986b1e9919de6b19ce9bdb9e6bbb6e8b5bde8b5bd
 \xe7b4a4e5b2b3e7ac96e79481e78dbce5b28ae6b9b6e88dafe5aebce4bcbde8a3a3e4be98e78e93e5848ae4b888e5b0b5e5aeaee9aeb2e99982e59a98e6b0b2e583b3e9b799e893a5e5ba89e8949fe7868ee78cbde78cbd
 \xe797a3e4b991e8baaee9ae88e69db5e78c99e9a8abe9bd80e7bd98e8b3bae89cb5e799bbe78d89e990a7e5b989e6a484e6a1a1e6939ce9b490e890b4e9a5abe6b392e58a9be5adaae9b895e89985e8a79ee8b889e8b889
 \xe687a4e9b795e58094e9b0a6e6a58ee4bd80e6898ae6bdbee7828de788bde79897e8be83e59b93e7908ae9879be7b093e89eaae6a3bce792bee59e9ae8b5abe7a89fe9b6aae99bbae9a18fe6b3abe7b7aae89282e89282
 \xe996b8e5a4b7e6b2b7e8a397e6a898e58a94e6a4a5e586b3e9b8b5e5ba98e99ba4e99c90e6be90e88d94e99dade89892e594abe59d98e5a7afe592a0e58c9be59590e8a299e7bb86e9abace7a5bee881bde793a7e793a7
 \xe795aee7bba4e4bc86e7b29ae780b2e7bd9fe8a9bee8bf97e68486e5a4bde8a79ee6bf8be98cb8e8b6bfe4bb8ae88ba3e8ba98e6acb8e6aa94e59ab5e697bfe78b96e6859be7afb9e9bb85e799a7e798a3e6a982e6a982
 \xe98987e7828be585ace9808ce5959be6b4a0e582ade59fbfe7b18ee792b9e8bd87e8849ce89d98e4b8b4e7af9ce6abb3e98a8ce89490e897bde59ea7e8a5a8e98a94e7848be59abae5bb9be890b6e58188e6acb8e6acb8
 \xe7898de88880e89abfe99dbfe5bab9e5b387e8b3a7e8a0bfe9a4a7e5aa9be6a18ee68ca7e9b2b2e58b8de6a088e6a4abe5a481e58297e4bb90e5b780e786b4e6958de58bb4e78884e9ae98e9909ae8b19be984a8e984a8
 \xe6b4a8e8b99ee6b789e8bfb9e9b69de9b0a6e9b7bde59fbae6a886e793a1e691ace9a185e5bba1e699a5e9bcace78598e9adaee9b199e59eb5e897b6e88f92e69caee8b9ade8beade4bdbae5b3b6e599b9e7bea1e7bea1
(10 rows)
Time: 0.457 ms

Create a gin index on the bytea text.

postgres=# create or replace function textsend_i (text) returns bytea as $$
  select textsend($1);
$$ language sql strict immutable;
CREATE FUNCTION

postgres=# drop index idx_tbl_1 ;
DROP INDEX
Time: 10.179 ms
postgres=# create index idx_tbl_1 on tbl using gin(text(textsend_i(info)) gin_trgm_ops);
CREATE INDEX

With the gin index on the bytea text, the performance improvement is very significant; the more data there is, the better it performs.

postgres=# set enable_bitmapscan=on;
postgres=# explain analyze select * from tbl where text(textsend_i(info)) ~ ltrim(text(textsend_i('中国')), '\x');
                                                      QUERY PLAN                                                      
----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbl  (cost=369.28..504.93 rows=100 width=37) (actual time=0.099..0.099 rows=0 loops=1)
   Recheck Cond: ((textsend_i(info))::text ~ 'e4b8ade59bbd'::text)
   ->  Bitmap Index Scan on idx_tbl_1  (cost=0.00..369.25 rows=100 width=0) (actual time=0.097..0.097 rows=0 loops=1)
         Index Cond: ((textsend_i(info))::text ~ 'e4b8ade59bbd'::text)
 Planning time: 0.494 ms
 Execution time: 0.128 ms
(6 rows)

postgres=# select * from tbl limit 10;
 id |                            info                            
----+------------------------------------------------------------
  1 | 竽曌絤豗審匈鑦籌崕觟箎肼碃篝蠆笍垁妺鼶溟遻佨璍銰塍旵矫鰥鰥
  2 | 媎暵剖蒰蔄凰輼柀瞸蓐纨矓傶鏁顎鴎麌焦赢姗飵央醱鑝汜齹滶赽赽
  3 | 紤岳笖甁獼岊湶药宼伽裣侘玓儊丈尵宮鮲陂嚘氲僳鷙蓥庉蔟熎猽猽
  4 | 痣乑躮鮈杵猙騫齀罘賺蜵登獉鐧幉椄桡擜鴐萴饫泒力孪鸕虅觞踉踉
  5 | 懤鷕倔鰦楎佀扊潾炍爽瘗较囓琊釛簓螪棼璾垚赫稟鶪雺顏泫緪蒂蒂
  6 | 閸夷沷裗樘劔椥决鸵庘雤霐澐荔靭蘒唫坘姯咠匛啐袙细髬祾聽瓧瓧
  7 | 畮绤伆粚瀲罟詾迗愆夽觞濋錸趿今苣躘欸檔嚵旿狖慛篹黅癧瘣橂橂
  8 | 鉇炋公逌啛洠傭埿籎璹轇脜蝘临篜櫳銌蔐藽垧襨銔焋嚺廛萶偈欸欸
  9 | 牍舀蚿靿庹峇賧蠿餧媛桎挧鲲勍栈椫夁傗仐巀熴敍勴爄鮘鐚豛鄨鄨
 10 | 洨蹞淉迹鶝鰦鷽基樆瓡摬顅廡晥鼬煘魮鱙垵藶菒朮蹭辭佺島噹羡羡
(10 rows)

postgres=# explain analyze select * from tbl where text(textsend_i(info)) ~ ltrim(text(textsend_i('坘')), '\x');
                                                      QUERY PLAN                                                      
----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbl  (cost=149.88..574.79 rows=320 width=37) (actual time=0.063..0.063 rows=0 loops=1)
   Recheck Cond: ((textsend_i(info))::text ~ 'e59d98'::text)
   ->  Bitmap Index Scan on idx_tbl_1  (cost=0.00..149.80 rows=320 width=0) (actual time=0.061..0.061 rows=0 loops=1)
         Index Cond: ((textsend_i(info))::text ~ 'e59d98'::text)
 Planning time: 0.303 ms
 Execution time: 0.087 ms
(6 rows)

postgres=# select * from tbl where text(textsend_i(info)) ~ ltrim(text(textsend_i('坘')), '\x');
  id  |                            info                            
------+------------------------------------------------------------
    6 | 閸夷沷裗樘劔椥决鸵庘雤霐澐荔靭蘒唫坘姯咠匛啐袙细髬祾聽瓧瓧
  432 | 飒莭鮊鍥?笩妳琈笈慻儘轴轧坘碠郎蚿呙偓鍹脆鼺蹔谕蚱畨縫鱳鱳
  934 | 咓僨復圼峷奁扉羰滵樞韴迬猰優鰸獤溅躐瓜抵権纀懶粯坘蚲纾鴁鴁
 3135 | 倣稽蛯巭瘄皮蓈睫柨苧眱賴髄猍乱歖痐坘恋顎东趥谓鰪棩剔烱茟茟
 3969 | 崴坘螏顓碴鵰邰欴苄蛨簰瘰膪菷栱镘衟齘觊诀忮繈憘痴峣撋梆澝澝
 4688 | 围豁啖顫诬呅尥腥缾郸熛枵焐篯坘僇矟銘隨譼鎶舰肳礞婛轲蠟慕慕
 6121 | 窳研稼旅唣疚褣鬾韨赑躽坘浒攁舑遬鳴滴抓嗠捒铗牜欘質丛姤騖騖
 6904 | 飘稘輔鬄枠舶婬儁噈坘裎姖爙炃苖隽斓堯鈶摙蚼疁兗快鐕鎒墩譭譭
 8854 | 叒鐲唬鞩泍糕懜坘戚靥鎿鋂炿尟汜阢甌鲖埁顔胳邉謾宱肦劰責戆戆
 9104 | 鵬篱爯俌坘柉誵孀漴纞錀澁摫螭芄餜爹綅俆逨哒猈珢輿廄陲欗缷缷
 9404 | 民坘謤齏隽紽峐荟頩胯頴傳蠂枯滦榦陠帡疃鈶遽艌瘧蒭嗍龞瓈嚍嚍
 9727 | 夃坘慫逹壪泵偉鸶揺雠倴矸虠覾芽齏遬儂錞鐴焑劽疁擯蛛倞瑫菰菰
(12 rows)

If you are interested, the following articles are also worth reading.
How to solve a small AI semantic-deduplication problem with PostgreSQL
https://yq.aliyun.com/articles/25899

PostgreSQL: regex and fuzzy queries with second-level response on tens of billions of rows
https://yq.aliyun.com/articles/7444

PostgreSQL: regex matching on 100 billion rows, fast and furious
https://yq.aliyun.com/articles/7549

PostgreSQL case study: extreme optimization of a range query with per-ID latest-record (group/sort/window) retrieval at the 10-billion-row scale


This article walks through an extreme optimization of a case: for an arbitrary time range, return the latest record for each ID.
After optimization the performance stays within a controllable range: millisecond-level responses regardless of data volume, stable and predictable.
That is a 10,000x improvement over the unoptimized version.

CASE

There is a table with the following structure:

CREATE TABLE target_position ( 
target_id varchar(80), 
time bigint, 
content text 
); 

The table has about 10 billion rows.
There are roughly 200,000 distinct target_id values.

The database is PostgreSQL 9.4.

Requirement:
for each target, return the latest record within a given time range, with results back within 1 second.
The time range is not fixed.

It is currently implemented with a window function, as follows:
select target_id,time,content from (select *,row_number() over (partition by target_id order by time desc) rid from target_position where time>start_time and time<=end_time) as t where rid=1; 
The performance is poor.

Let's analyze why this is slow: the query has to scan the whole time range, then group and sort, and finally pick the latest record for each target_id within that range.
With this statement, the larger the time range, the more data may be scanned, and the longer it takes.
Going straight to the optimal approach: the case says there are about 200,000 target_ids, so in theory, no matter how large the range is, at most 200,000 tuples need to be visited.
How do we do that? With a function.
First, maintain a separate table holding the distinct target_id values to make the lookups easy; this needs a little cooperation from the application layer, but it is not hard, it is just decoupling the relation.
The test sample is below.

postgres=# create unlogged table t1(id int, crt_time timestamp);
CREATE TABLE
postgres=# create unlogged table t2(id int primary key);
CREATE TABLE
postgres=# insert into t1 select trunc(random()*200000),clock_timestamp() from generate_series(1,100000000);
INSERT 0 100000000
postgres=# create index idx_t1_1 on t1(id,crt_time desc);
CREATE INDEX
postgres=# select * from t1 limit 10;
   id   |          crt_time          
--------+----------------------------
  49092 | 2016-05-06 16:50:29.88595
    947 | 2016-05-06 16:50:29.887553
 179124 | 2016-05-06 16:50:29.887562
 197308 | 2016-05-06 16:50:29.887564
  93558 | 2016-05-06 16:50:29.887566
 127133 | 2016-05-06 16:50:29.887568
 163507 | 2016-05-06 16:50:29.887569
 110546 | 2016-05-06 16:50:29.887571
  65363 | 2016-05-06 16:50:29.887573
 122666 | 2016-05-06 16:50:29.887575
(10 rows)
postgres=# insert into t2 select generate_series(1,200000);
INSERT 0 200000

Let's look at the unoptimized plan and timing. The plan itself already looks good, but because the given range contains more than 4.5 million rows, the query ends up taking 15 seconds.

postgres=# explain analyze select * from (select *,row_number() over(partition by id order by crt_time desc) rn from t1 where crt_time between '2016-05-06 16:50:29.887566' and '2016-05-06 16:50:34.887566') t where rn=1;
                                                                                   QUERY PLAN                                                                                    
----------------------------------------------------------------------------------------------------------------------------
 Subquery Scan on t  (cost=0.57..1819615.87 rows=2500 width=20) (actual time=0.083..15301.915 rows=200000 loops=1)
   Filter: (t.rn = 1)
   Rows Removed by Filter: 4320229
   ->  WindowAgg  (cost=0.57..1813365.87 rows=500000 width=12) (actual time=0.078..14012.867 rows=4520229 loops=1)
         ->  Index Only Scan using idx_t1_1 on t1  (cost=0.57..1804615.87 rows=500000 width=12) (actual time=0.066..10603.161 rows=4520229 loops=1)
               Index Cond: ((crt_time >= '2016-05-06 16:50:29.887566'::timestamp without time zone) AND (crt_time <= '2016-05-06 16:50:34.887566'::timestamp without time zone))
               Heap Fetches: 4520229
 Planning time: 0.202 ms
 Execution time: 15356.066 ms
(9 rows)

Optimization stage 1

Looping with an anonymous code block (DO) brings the time down to the second level.

postgres=# do language plpgsql $$  
declare
x int;
begin
  for x in select id from t2 loop
    perform * from t1 where id=x and crt_time between '2016-05-06 16:50:29.887566' and '2016-05-06 16:50:34.887566' order by crt_time desc limit 1;
  end loop;
end;
$$;
DO
Time: 2311.081 ms

Writing it as a function is more reusable.

postgres=# create or replace function f(start_time timestamp, end_time timestamp) returns setof t1 as $$
declare
  x int;
begin
  for x in select id from t2 loop
    return query select * from t1 where id=x and crt_time between start_time and end_time order by crt_time desc limit 1;
  end loop;
  return;
end;
$$ language plpgsql strict;
CREATE FUNCTION

postgres=# explain analyze select * from f('2016-05-06 16:50:29.887566', '2016-05-06 16:50:34.887566');
                                                   QUERY PLAN                                                   
----------------------------------------------------------------------------------------------------------------
 Function Scan on f  (cost=0.25..10.25 rows=1000 width=12) (actual time=2802.565..2850.445 rows=199999 loops=1)
 Planning time: 0.036 ms
 Execution time: 2885.924 ms
(3 rows)
Time: 2886.314 ms

postgres=# select * from f('2016-05-06 16:50:29.887566', '2016-05-06 16:50:34.887566') limit 10;
 id |          crt_time          
----+----------------------------
  1 | 2016-05-06 16:50:32.507124
  2 | 2016-05-06 16:50:32.774655
  3 | 2016-05-06 16:50:32.48621
  4 | 2016-05-06 16:50:32.874258
  5 | 2016-05-06 16:50:32.677812
  6 | 2016-05-06 16:50:32.091517
  7 | 2016-05-06 16:50:32.724287
  8 | 2016-05-06 16:50:32.669251
  9 | 2016-05-06 16:50:32.815634
 10 | 2016-05-06 16:50:32.812239
(10 rows)
Time: 3108.222 ms

Now enlarge the time range so that about 50 million records fall within it.
The original method needs 104 seconds, and the time grows as the range grows.

postgres=# explain analyze select * from (select *,row_number() over(partition by id order by crt_time desc) rn from t1 where crt_time between '2016-05-06 16:50:29.887566' and '2016-05-06 16:51:19.887566') t where rn=1;
                                                                                   QUERY PLAN                                                                                    
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Subquery Scan on t  (cost=0.57..1819615.87 rows=2500 width=20) (actual time=0.042..103886.966 rows=200000 loops=1)
   Filter: (t.rn = 1)
   Rows Removed by Filter: 46031611
   ->  WindowAgg  (cost=0.57..1813365.87 rows=500000 width=12) (actual time=0.037..92722.913 rows=46231611 loops=1)
         ->  Index Only Scan using idx_t1_1 on t1  (cost=0.57..1804615.87 rows=500000 width=12) (actual time=0.030..62673.221 rows=46231611 loops=1)
               Index Cond: ((crt_time >= '2016-05-06 16:50:29.887566'::timestamp without time zone) AND (crt_time <= '2016-05-06 16:51:19.887566'::timestamp without time zone))
               Heap Fetches: 46231611
 Planning time: 0.119 ms
 Execution time: 103950.955 ms
(9 rows)
Time: 103951.638 ms

With the optimized method the time stays the same: the result comes back in only 2.9 seconds.

postgres=# explain analyze select * from f('2016-05-06 16:50:29.887566', '2016-05-06 16:51:19.887566');
                                                   QUERY PLAN                                                   
----------------------------------------------------------------------------------------------------------------
 Function Scan on f  (cost=0.25..10.25 rows=1000 width=12) (actual time=2809.562..2858.468 rows=199999 loops=1)
 Planning time: 0.037 ms
 Execution time: 2894.181 ms
(3 rows)
Time: 2894.605 ms

Optimization stage 2

Continue optimizing: abstract the per-ID SQL into a function.

postgres=# create or replace function f1(int, timestamp, timestamp) returns t1 as $$
  select * from t1 where id=$1 and crt_time between $2 and $3 order by crt_time desc limit 1;
$$ language sql strict;
CREATE FUNCTION
Time: 0.564 ms

Driving the loop from the outer query is more efficient than the FOR loop inside a function (less overhead in the server code), so the time improves to 2.3 seconds.

postgres=# explain analyze select f1(id,'2016-05-06 16:50:29.887566','2016-05-06 16:50:34.887566') from t2;
                                                 QUERY PLAN                                                  
-------------------------------------------------------------------------------------------------------------
 Seq Scan on t2  (cost=0.00..59560.50 rows=225675 width=4) (actual time=0.206..2213.069 rows=200000 loops=1)
 Planning time: 0.121 ms
 Execution time: 2261.185 ms
(3 rows)
Time: 2261.740 ms

postgres=# select count(*) from (select f1(id,'2016-05-06 16:50:29.887566','2016-05-06 16:50:34.887566') from t2)t;
 count  
--------
 200000
(1 row)
Time: 2359.005 ms

Because the loop is now outside the function, you can use a cursor or a LIMIT, and the 200,000-row result can be paginated, which greatly improves the user experience.

postgres=# select f1(id,'2016-05-06 16:50:29.887566','2016-05-06 16:50:34.887566') from t2 limit 10;
                f1                 
-----------------------------------
 (1,"2016-05-06 16:50:34.818639")
 (2,"2016-05-06 16:50:34.874603")
 (3,"2016-05-06 16:50:34.741072")
 (4,"2016-05-06 16:50:34.727868")
 (5,"2016-05-06 16:50:34.507418")
 (6,"2016-05-06 16:50:34.715711")
 (7,"2016-05-06 16:50:34.817961")
 (8,"2016-05-06 16:50:34.786087")
 (9,"2016-05-06 16:50:34.76778")
 (10,"2016-05-06 16:50:34.836663")
(10 rows)
Time: 0.771 ms

Optimization stage 3

But returning all records still does not get under 1 second, right? Is there still room for optimization?
My goal, besides optimizing, is to squeeze everything out of the hardware.
So if you have enough hardware resources, this is where parallelism comes in: fetching a single record is fast, but looping 200,000 times is slow.
Let's see how long 10,000 iterations take: down to 115 milliseconds, which meets the requirement.

postgres=# select count(*) from (select f1(id,'2016-05-06 16:50:29.887566','2016-05-06 16:50:34.887566') from (select * from t2 limit 10000) t) t;
 count 
-------
 10000
(1 row)
Time: 115.690 ms

So to get under 1 second, you can run 20 workers in parallel, each querying a subset of the IDs, and merge the results into one big result set.
Parallelism inside the database is not supported yet; PG 9.6 will support it in the future.
For now you can do this at the application layer, but how do you keep the parallel workers consistent?
This is where one of PG's hidden gems comes in: exported snapshots, which let sessions share a transaction snapshot so that all of them see the same state. This technique is already used by parallel backup.
If the application needs a consistent view across sessions, it can use this too. For example:
First,
open session 1:

postgres=# begin transaction isolation level repeatable read;
BEGIN
Time: 0.173 ms
postgres=# select pg_export_snapshot();
 pg_export_snapshot 
--------------------
 0FC9C2A3-1
(1 row)

Open session 2 and import the snapshot:

postgres=# begin transaction isolation level repeatable read;
BEGIN
postgres=# SET TRANSACTION SNAPSHOT '0FC9C2A3-1';
SET

Open session 3 and import the snapshot:

postgres=# begin transaction isolation level repeatable read;
BEGIN
postgres=# SET TRANSACTION SNAPSHOT '0FC9C2A3-1';
SET

Run the following in the three sessions in parallel:

postgres=# select count(*) from (select f1(id,'2016-05-06 16:50:29.887566','2016-05-06 16:50:34.887566') from (select * from t2 order by id limit 70000 offset 0) t) t;
 count 
-------
 70000
(1 row)
Time: 775.071 ms
postgres=# select count(*) from (select f1(id,'2016-05-06 16:50:29.887566','2016-05-06 16:50:34.887566') from (select * from t2 order by id limit 70000 offset 70000) t) t;
 count 
-------
 70000
(1 row)
Time: 763.747 ms
postgres=# select count(*) from (select f1(id,'2016-05-06 16:50:29.887566','2016-05-06 16:50:34.887566') from (select * from t2 order by id limit 70000 offset 140000) t) t;
 count 
-------
 60000
(1 row)

Time: 665.743 ms

With parallel execution the time drops below 1 second.
The queries above still have room for optimization, namely the offset: since id is the primary key there is no need for offset; an id range predicate would work better (see the sketch below).
But the bottleneck is not the scan of t2 anyway, so let's not bother.
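
A sketch of the offset-free variant (the boundaries 70000 and 140000 are just the slice edges used above; in practice the application would compute them from the id distribution):

select count(*) from (
  select f1(id,'2016-05-06 16:50:29.887566','2016-05-06 16:50:34.887566')
  from t2 where id > 70000 and id <= 140000
) t;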

If you want to go further, split t2 into finer slices; getting to 10 milliseconds is not a problem, i.e. a 10,000x improvement over a range covering tens of millions of rows.
By the same reasoning, the performance stays the same even at tens of billions of rows; try it if you don't believe it.

Optimization stage 4

Is that the end of the optimization? Of course not. The previous optimization abstracted out the IDs, so no matter how large the target range is, every ID is probed; each lookup uses an index, but there is still room to improve.
There is another technique that reduces the number of ID probes. For example, if the range covers 1 million rows but contains only 100 distinct IDs, in theory only 100 probes are needed, whereas the previous method still does 200,000.
The method is simple:
(Assuming the time column behaves like a stream, i.e. it is ever-increasing, PostgreSQL's brin index can be used to speed this up; if it is not stream-like, a traditional btree index on (crt_time,id) with index only scans is needed.)
This index is used to quickly obtain the maximum ID within the range.

postgres=# create index idx_t2_1 on t1 using brin(crt_time);
CREATE INDEX

Insert 1 million stream-like rows, but with only 100 distinct IDs among them.

postgres=# insert into t1 select trunc(random()*100),clock_timestamp() from generate_series(1,1000000);
INSERT 0 1000000
Time: 4065.084 ms
postgres=# select now();
             now              
------------------------------
 2016-05-07 11:32:12.93416+08
(1 row)
Time: 0.346 ms

Create a function that returns, for the ID following the input ID, its latest record within the time range; it will be used inside the recursive query.

create or replace function f2(int,timestamp,timestamp) returns t1 as $$
  select * from t1 where id is not null and id>$1 and crt_time between $2 and $3 order by id,crt_time desc limit 1;
$$ language sql strict set enable_sort=off;

Create another function that uses a recursive query to get the latest record of every ID within the given range.

create or replace function f3(start_time timestamp, end_time timestamp) returns setof t1 as $$
declare
maxid int;
begin
  select max(id) into maxid from t1 where crt_time between start_time and end_time;
  return query with recursive skip as (
  (
    select id,crt_time from t1 where crt_time between start_time and end_time order by id,crt_time desc limit 1
  )
  union all
  (
    select (f2(s1.id, start_time, end_time)).* from skip s1 where s1.id <> maxid and s1.id is not null
  ) 
) select * from skip;
end;
$$ language plpgsql strict;

postgres=# select * from f3('2016-05-07 09:50:29.887566','2016-05-07 16:50:29.987566');
 id |          crt_time          
----+----------------------------
  0 | 2016-05-07 11:32:00.983203
  1 | 2016-05-07 11:32:00.982906
...
 97 | 2016-05-07 11:32:00.983281
 98 | 2016-05-07 11:32:00.983206
 99 | 2016-05-07 11:32:00.983107
(100 rows)
Time: 177.203 ms

Blazing fast: only 177 milliseconds.

The stage-3 method takes a constant amount of time, a bit over 3 seconds.

select count(*) from (select * from (select (f1(id,'2016-05-07 09:50:29.887566','2016-05-07 16:50:29.987566')).* from t2) t where t.* is not null) t;
 count 
-------
   100
(1 row)
Time: 3153.508 ms

But the stage-4 optimization is not a silver bullet either: it does not suit cases where the range contains many distinct IDs.
See:

postgres=# select count(*) from f3('2016-05-06 16:50:29.887566','2016-05-06 16:50:34.887566');
 count  
--------
 200000
(1 row)
Time: 13344.261 ms

When the range contains many distinct IDs, the stage-3 method is still recommended.

postgres=#  select count(*) from (select * from (select (f1(id,'2016-05-06 16:50:29.887566','2016-05-06 16:50:34.887566')).* from t2) t where t.* is not null) t;
 count  
--------
 200000
(1 row)
Time: 3846.156 ms

Optimization stage 5

How do we automatically estimate the number of distinct IDs within the chosen range?
We can use the method mentioned in my earlier article, with the following estimation function:

CREATE FUNCTION count_estimate(query text) RETURNS INTEGER AS
$func$
DECLARE
    rec   record;
    ROWS  INTEGER;
BEGIN
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        ROWS := SUBSTRING(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN ROWS IS NOT NULL;
    END LOOP;

    RETURN ROWS;
END
$func$ LANGUAGE plpgsql;

postgres=# explain select distinct id from t1 where crt_time between '2016-05-06 16:50:29.887566' and '2016-05-06 16:50:34.887566';
                                                                                   QUERY PLAN                                                                                    
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=672240.13..672329.49 rows=8936 width=4)
   Group Key: id
   ->  Bitmap Heap Scan on t1  (cost=46663.05..660864.26 rows=4550347 width=4)
         Recheck Cond: ((crt_time >= '2016-05-06 16:50:29.887566'::timestamp without time zone) AND (crt_time <= '2016-05-06 16:50:34.887566'::timestamp without time zone))
         ->  Bitmap Index Scan on idx_t2_1  (cost=0.00..45525.47 rows=4550347 width=0)
               Index Cond: ((crt_time >= '2016-05-06 16:50:29.887566'::timestamp without time zone) AND (crt_time <= '2016-05-06 16:50:34.887566'::timestamp without time zone))
(6 rows)
Time: 0.645 ms

postgres=# explain select distinct id from t1 where crt_time between '2016-05-07 09:50:29.887566' and '2016-05-07 16:50:29.987566';
                                                                                   QUERY PLAN                                                                                    
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=23.12..23.13 rows=1 width=4)
   Group Key: id
   ->  Bitmap Heap Scan on t1  (cost=22.00..23.12 rows=1 width=4)
         Recheck Cond: ((crt_time >= '2016-05-07 09:50:29.887566'::timestamp without time zone) AND (crt_time <= '2016-05-07 16:50:29.987566'::timestamp without time zone))
         ->  Bitmap Index Scan on idx_t2_1  (cost=0.00..22.00 rows=1 width=0)
               Index Cond: ((crt_time >= '2016-05-07 09:50:29.887566'::timestamp without time zone) AND (crt_time <= '2016-05-07 16:50:29.987566'::timestamp without time zone))
(6 rows)
Time: 0.641 ms


postgres=# select count_estimate($$select distinct id from t1 where crt_time between '2016-05-06 16:50:29.887566' and '2016-05-06 16:50:34.887566'$$);
 count_estimate 
----------------
           8936
(1 row)
Time: 1.139 ms

postgres=# select count_estimate($$select distinct id from t1 where crt_time between '2016-05-07 09:50:29.887566' and '2016-05-07 16:50:29.987566'$$);
 count_estimate 
----------------
              1
(1 row)
Time: 0.706 ms

From here you know the drill: use the estimate to decide whether the stage-3 or the stage-4 method should be applied.

As a bonus, the optimizations of count(distinct xx) and distinct xx below are also quite extreme.


On the misuse of count, and optimizing pagination


Pagination is a very common application scenario, yet few people ever think about how to optimize it.
Instead, they just blame the database for being so slow at computing the page count with count(*).
Many developers like to run a count first to get the result-set size, which tells them how many pages are needed.
Then they fetch the corresponding data from the database and show it to the user.
Problem 1:
count scans the data once, and fetching the data scans it again. Duplicated work.
Problem 2: many people like to paginate with order by ... offset ... limit.
This is also a big problem, because it inflates the amount of data scanned; even with an index on the order by column the scan volume is inflated,
since the rows skipped by offset still have to be scanned.

Optimizing problem 1
Use the estimated row count, as follows.
Create a function that extracts the estimated number of rows from EXPLAIN:

CREATE FUNCTION count_estimate(query text) RETURNS INTEGER AS
$func$
DECLARE
    rec   record;
    ROWS  INTEGER;
BEGIN
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        ROWS := SUBSTRING(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN ROWS IS NOT NULL;
    END LOOP;

    RETURN ROWS;
END
$func$ LANGUAGE plpgsql;

The estimated row count is close to the actual count; the precision depends on the statistics histograms.
The PostgreSQL autovacuum process automatically refreshes a table's statistics based on the proportion of its data that has changed.
You can also configure, per table, how often statistics are refreshed and whether refreshing is enabled at all (see the sketch below).
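
For illustration, these per-table storage parameters control the auto-analyze behavior (the threshold values here are arbitrary examples):

-- analyze sbtest1 once roughly 2% of its rows have changed, instead of the global default
alter table sbtest1 set (autovacuum_analyze_scale_factor = 0.02, autovacuum_analyze_threshold = 1000);
-- or switch autovacuum (including auto-analyze) off for this table entirely
alter table sbtest1 set (autovacuum_enabled = false);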

postgres=# select count_estimate('select * from sbtest1 where id between 100 and 100000');
 count_estimate 
----------------
         102166
(1 row)

postgres=# explain select * from sbtest1 where id between 100 and 100000;
                                      QUERY PLAN                                       
---------------------------------------------------------------------------------------
 Index Scan using sbtest1_pkey on sbtest1  (cost=0.43..17398.14 rows=102166 width=190)
   Index Cond: ((id >= 100) AND (id <= 100000))
(2 rows)

postgres=# select count(*) from sbtest1 where id between 100 and 100000;
 count 
-------
 99901
(1 row)

In other words, the application can simply use the estimated row count to compute the number of pages.
This way the table does not need to be scanned at all, and the performance gain is considerable.
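
For example, with 20 rows per page the page count can be computed directly from the estimate (a trivial sketch on top of the count_estimate function defined above):

select ceil(count_estimate('select * from sbtest1 where id between 100 and 100000')::numeric / 20) as pages;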

Optimizing problem 2
Problem 2 is essentially that the data may be scanned multiple times; a cursor solves this.
Without optimization, fetching the first pages is fast:

postgres=# explain analyze select * from sbtest1 where id between 100 and 1000000 order by id offset 0 limit 100;
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..9.74 rows=100 width=190) (actual time=0.019..0.088 rows=100 loops=1)
   ->  Index Scan using sbtest1_pkey on sbtest1  (cost=0.43..93450.08 rows=1003938 width=190) (actual time=0.018..0.051 rows=100 loops=1)
         Index Cond: ((id >= 100) AND (id <= 1000000))
 Planning time: 0.152 ms
 Execution time: 0.125 ms
(5 rows)

Fetching later pages is noticeably slower, because all the preceding rows also have to be scanned:

postgres=# explain analyze select * from sbtest1 where id between 100 and 1000000 order by id offset 900000 limit 100;
                                                                  QUERY PLAN                                                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=83775.21..83784.52 rows=100 width=190) (actual time=461.941..462.009 rows=100 loops=1)
   ->  Index Scan using sbtest1_pkey on sbtest1  (cost=0.43..93450.08 rows=1003938 width=190) (actual time=0.025..308.865 rows=900100 loops=1)
         Index Cond: ((id >= 100) AND (id <= 1000000))
 Planning time: 0.179 ms
 Execution time: 462.053 ms
(5 rows)

With many pages, you can imagine how badly the efficiency degrades.

The optimization:

postgres=# begin;
BEGIN
Time: 0.152 ms
postgres=# declare cur1 cursor for select * from sbtest1 where id between 100 and 1000000 order by id;
DECLARE CURSOR
Time: 0.422 ms
postgres=# fetch 100 from cur1;
。。。

Even when you reach the end of the data, the per-fetch cost stays the same.
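
A sketch of consuming the remaining pages with the same cursor (standard cursor commands; the transaction has to stay open, or the cursor must be declared WITH HOLD):

fetch 100 from cur1;   -- next page, same cost as the first page
move 100 in cur1;      -- or skip a page without returning it
close cur1;
commit;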

A wild recursive optimization for distinct xx and count(distinct xx)


Today's optimization is an extension of an earlier article, "performance tuning case : use cursor or trigger replace group by and order by":
http://blog.163.com/digoal@126/blog/static/16387704020128142829610/

CASE

For example, a table has a gender column; no matter how many rows the table holds, that column generally has only 2 values.
select count(distinct sex) from table; of course returns 2. But with a large amount of data this computation becomes very expensive. Let's test it:

PostgreSQL

Test table

digoal=> create table sex (sex char(1), otherinfo text);  
CREATE TABLE  

Test data

digoal=> insert into sex select 'm', generate_series(1,10000000)||'this is test';  
INSERT 0 10000000  
digoal=> insert into sex select 'w', generate_series(1,10000000)||'this is test';  
INSERT 0 10000000  

Test SQL 1

digoal=> \timing on  
digoal=> select count(distinct sex) from sex;  
 count   
-------  
     2  
(1 row)  
Time: 47254.221 ms  

Test SQL 2

digoal=> select sex from sex t group by sex;  
 sex   
-----  
 w  
 m  
(2 rows)  
Time: 6534.433 ms  

Execution plans

digoal=> explain select count(distinct sex) from sex;  
                             QUERY PLAN                                
---------------------------------------------------------------------  
 Aggregate  (cost=377385.25..377385.26 rows=1 width=2)  
   ->  Seq Scan on sex  (cost=0.00..327386.00 rows=19999700 width=2)  

digoal=> explain select sex from sex t group by sex;  
                              QUERY PLAN                                 
-----------------------------------------------------------------------  
 HashAggregate  (cost=377385.25..377385.27 rows=2 width=2)  
   ->  Seq Scan on sex t  (cost=0.00..327386.00 rows=19999700 width=2)  

Create an index

digoal=> create index idx_sex_1 on sex(sex);  
CREATE INDEX  
digoal=> set enable_seqscan=off;  
SET  

Execution plans with the index in place; PostgreSQL can use an Index Only Scan.

digoal=> explain select count(distinct sex) from sex;  
                                         QUERY PLAN                                           
--------------------------------------------------------------------------------------------  
 Aggregate  (cost=532235.01..532235.02 rows=1 width=2)  
   ->  Index Only Scan using idx_sex_1 on sex  (cost=0.00..482234.97 rows=20000016 width=2)  

digoal=> explain select sex from sex t group by sex;  
                                          QUERY PLAN                                            
----------------------------------------------------------------------------------------------  
 Group  (cost=0.00..532235.01 rows=2 width=2)  
   ->  Index Only Scan using idx_sex_1 on sex t  (cost=0.00..482234.97 rows=20000016 width=2)  

Query times after creating the index

digoal=> select count(distinct sex) from sex;  
 count   
-------  
     2  
(1 row)  
Time: 49589.947 ms  

digoal=> select sex from sex t group by sex;  
 sex   
-----  
 m  
 w  
(2 rows)  
Time: 6608.053 ms  

O

Test table

SQL> create table sex(sex char(1), otherinfo varchar2(64));  
Table created.  

Test data

SQL> insert into sex select 'm', rownum||'this is test' from dual connect by level <=10000001;  
10000001 rows created.  
SQL> commit;  
Commit complete.  
SQL> insert into sex select 'w', rownum||'this is test' from dual connect by level <=10000001;  
10000001 rows created.  
SQL> commit;  
Commit complete.  

Test SQL 1:

SQL> set autotrace on  
SQL> set timing on  
SQL> select count(distinct sex) from sex;  
COUNT(DISTINCTSEX)  
------------------  
                 2  
Elapsed: 00:00:03.62  

Execution Plan  
----------------------------------------------------------  
Plan hash value: 2096505595  
---------------------------------------------------------------------------  
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |  
---------------------------------------------------------------------------  
|   0 | SELECT STATEMENT   |      |     1 |     3 | 13106   (3)| 00:02:38 |  
|   1 |  SORT GROUP BY     |      |     1 |     3 |            |          |  
|   2 |   TABLE ACCESS FULL| SEX  |    24M|    69M| 13106   (3)| 00:02:38 |  
---------------------------------------------------------------------------  
Note  
-----  
   - dynamic sampling used for this statement  
Statistics  
----------------------------------------------------------  
          0  recursive calls  
          0  db block gets  
      74074  consistent gets  
          0  physical reads  
          0  redo size  
        525  bytes sent via SQL*Net to client  
        487  bytes received via SQL*Net from client  
          2  SQL*Net roundtrips to/from client  
          1  sorts (memory)  
          0  sorts (disk)  
          1  rows processed  

Test SQL 2

SQL> select sex from sex t group by sex;  
S  
-  
w  
m  
Elapsed: 00:00:03.23  
Execution Plan  
----------------------------------------------------------  
Plan hash value: 2807610159  
---------------------------------------------------------------------------  
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |  
---------------------------------------------------------------------------  
|   0 | SELECT STATEMENT   |      |    24M|    69M| 14908  (14)| 00:02:59 |  
|   1 |  HASH GROUP BY     |      |    24M|    69M| 14908  (14)| 00:02:59 |  
|   2 |   TABLE ACCESS FULL| SEX  |    24M|    69M| 13106   (3)| 00:02:38 |  
---------------------------------------------------------------------------  
Note  
-----  
   - dynamic sampling used for this statement  
Statistics  
----------------------------------------------------------  
          0  recursive calls  
          0  db block gets  
      74074  consistent gets  
          0  physical reads  
          0  redo size  
        563  bytes sent via SQL*Net to client  
        487  bytes received via SQL*Net from client  
          2  SQL*Net roundtrips to/from client  
          0  sorts (memory)  
          0  sorts (disk)  
          2  rows processed  

Create an index

SQL> create index idx_sex_1 on sex(sex);  
Index created.  
Elapsed: 00:00:33.40  

Tests after creating the index; the execution time does not change noticeably.

SQL> select count(distinct sex) from sex;  
COUNT(DISTINCTSEX)  
------------------  
                 2  
Elapsed: 00:00:04.32  
Execution Plan  
----------------------------------------------------------  
Plan hash value: 1805173869  
-----------------------------------------------------------------------------------  
| Id  | Operation             | Name      | Rows  | Bytes | Cost (%CPU)| Time     |  
-----------------------------------------------------------------------------------  
|   0 | SELECT STATEMENT      |           |     1 |     3 |  6465   (3)| 00:01:18 |  
|   1 |  SORT GROUP BY        |           |     1 |     3 |            |          |  
|   2 |   INDEX FAST FULL SCAN| IDX_SEX_1 |    24M|    69M|  6465   (3)| 00:01:18 |  
-----------------------------------------------------------------------------------  
Note  
-----  
   - dynamic sampling used for this statement  
Statistics  
----------------------------------------------------------  
          5  recursive calls  
          0  db block gets  
      36421  consistent gets  
      36300  physical reads  
          0  redo size  
        525  bytes sent via SQL*Net to client  
        487  bytes received via SQL*Net from client  
          2  SQL*Net roundtrips to/from client  
          1  sorts (memory)  
          0  sorts (disk)  
          1  rows processed  
SQL> select sex from sex t group by sex;  
S  
-  
w  
m  
Elapsed: 00:00:03.21  
Execution Plan  
----------------------------------------------------------  
Plan hash value: 2807610159  
---------------------------------------------------------------------------  
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |  
---------------------------------------------------------------------------  
|   0 | SELECT STATEMENT   |      |    24M|    69M| 14908  (14)| 00:02:59 |  
|   1 |  HASH GROUP BY     |      |    24M|    69M| 14908  (14)| 00:02:59 |  
|   2 |   TABLE ACCESS FULL| SEX  |    24M|    69M| 13106   (3)| 00:02:38 |  
---------------------------------------------------------------------------  
Note  
-----  
   - dynamic sampling used for this statement  
Statistics  
----------------------------------------------------------  
          5  recursive calls  
          0  db block gets  
      74170  consistent gets  
          0  physical reads  
          0  redo size  
        563  bytes sent via SQL*Net to client  
        487  bytes received via SQL*Net from client  
          2  SQL*Net roundtrips to/from client  
          0  sorts (memory)  
          0  sorts (disk)  
          2  rows processed  

Comparing the tests above, Oracle clearly outperforms PostgreSQL here.
After rewriting count(distinct sex) as below, PostgreSQL runs noticeably faster, but it is still behind Oracle by roughly half.

digoal=> select count(*) from (select sex from sex t group by sex) t;  
 count   
-------  
     2  
(1 row)  
Time: 6231.965 ms  

Time to optimize

So how do we optimize?
This is where PostgreSQL's recursive SQL really shines: combined with a btree index scan, performance can improve by tens of thousands of times.
The optimization process is as follows.
Create a test table:

create table user_download_log (user_id int not null, listid int not null, apkid int not null, get_time timestamp(0) not null, otherinfo text);  

Insert test data

insert into user_download_log select generate_series(0,10000000),generate_series(0,10000000),generate_series(0,10000000),generate_series(clock_timestamp(),clock_timestamp()+interval '10000000 min',interval '1 min'), 'this is test';  

Create indexes:

create index i1 on user_download_log (user_id);  
create index i2 on user_download_log (otherinfo);  

Check the data distribution:
this shows which kind of data the recursive-SQL optimization is suited for.

select count(distinct user_id), count(distinct otherinfo) from user_download_log;  
  count   | count   
----------+-------  
 10000001 |     1  

Execution plan and elapsed time of the following SQL before optimization:

digoal=> explain analyze select count(distinct otherinfo) from user_download_log;  
                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=208334.36..208334.37 rows=1 width=13) (actual time=6295.493..6295.494 rows=1 loops=1)
   ->  Seq Scan on user_download_log  (cost=0.00..183334.29 rows=10000029 width=13) (actual time=0.014..1612.333 rows=10000001 loops=1)
 Total runtime: 6295.550 ms

The optimized SQL:

digoal=> with recursive skip as (  
digoal(>   (  
digoal(>     select min(t.otherinfo) as otherinfo from user_download_log t where t.otherinfo is not null  
digoal(>   )  
digoal(>   union all  
digoal(>   (  
digoal(>     select (select min(t.otherinfo) from user_download_log t where t.otherinfo > s.otherinfo and t.otherinfo is not null)   
digoal(>       from skip s where s.otherinfo is not null  
digoal(>   )  -- the "where s.otherinfo is not null" here is mandatory, otherwise the recursion never terminates.
digoal(> )   
digoal-> select count(distinct otherinfo) from skip;  
 count   
-------  
     1  
(1 row)  

Execution plan and elapsed time of the optimized SQL: roughly 36,390x faster (6295.550 ms / 0.173 ms), which is also more than ten thousand times faster than Oracle.

digoal=> explain analyze with recursive skip as (  
  (  
    select min(t.otherinfo) as otherinfo from user_download_log t where t.otherinfo is not null  
  )  
  union all  
  (  
    select (select min(t.otherinfo) from user_download_log t where t.otherinfo > s.otherinfo and t.otherinfo is not null)   
      from skip s where s.otherinfo is not null  
  )  -- the "where s.otherinfo is not null" here is mandatory, otherwise the recursion never terminates.
)   
select count(distinct otherinfo) from skip;  
                                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=10.55..10.56 rows=1 width=32) (actual time=0.094..0.094 rows=1 loops=1)  
   CTE skip  
     ->  Recursive Union  (cost=0.03..8.28 rows=101 width=32) (actual time=0.044..0.073 rows=2 loops=1)  
           ->  Result  (cost=0.03..0.04 rows=1 width=0) (actual time=0.042..0.042 rows=1 loops=1)  
                 InitPlan 1 (returns $1)  
                   ->  Limit  (cost=0.00..0.03 rows=1 width=13) (actual time=0.038..0.039 rows=1 loops=1)  
                          ->  Index Only Scan using i2 on user_download_log t  (cost=0.00..296844.61 rows=10000029 width=13) (actual time=0.037..0.037 rows=1 loops=1)  
                               Index Cond: (otherinfo IS NOT NULL)  
                               Heap Fetches: 1  
           ->  WorkTable Scan on skip s  (cost=0.00..0.62 rows=10 width=32) (actual time=0.013..0.013 rows=0 loops=2)  
                 Filter: (otherinfo IS NOT NULL)  
                 Rows Removed by Filter: 0  
                 SubPlan 3  
                   ->  Result  (cost=0.03..0.04 rows=1 width=0) (actual time=0.018..0.018 rows=1 loops=1)  
                         InitPlan 2 (returns $3)  
                           ->  Limit  (cost=0.00..0.03 rows=1 width=13) (actual time=0.017..0.017 rows=0 loops=1)  
                                  ->  Index Only Scan using i2 on user_download_log t  (cost=0.00..107284.96 rows=3333343 width=13) (actual time=0.015..0.015 rows=0 loops=1)  
                                       Index Cond: ((otherinfo > s.otherinfo) AND (otherinfo IS NOT NULL))  
                                       Heap Fetches: 0  
   ->  CTE Scan on skip  (cost=0.00..2.02 rows=101 width=32) (actual time=0.047..0.077 rows=2 loops=1)  
 Total runtime: 0.173 ms  
(21 rows)  

Now switch to a column whose data is widely distributed (many distinct values) and apply the same optimization to see whether it is still appropriate. First, the execution plan and elapsed time of the original SQL:

digoal=> explain analyze select count(distinct user_id) from user_download_log;  
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=208334.36..208334.37 rows=1 width=4) (actual time=4008.858..4008.858 rows=1 loops=1)  
   ->  Seq Scan on user_download_log  (cost=0.00..183334.29 rows=10000029 width=4) (actual time=0.014..1606.607 rows=10000001 loops=1)  
 Total runtime: 4008.916 ms  

The same optimization applied to the widely distributed column; below is the execution plan and elapsed time using the recursive SQL.
Performance clearly degrades, so recursive SQL is not suitable for group by or count(distinct) on columns with many distinct values.

digoal=> explain analyze with recursive skip as (  
  (  
    select min(t.user_id) as user_id from user_download_log t where t.user_id is not null  
  )  
  union all  
  (  
    select (select min(t.user_id) from user_download_log t where t.user_id > s.user_id and t.user_id is not null)   
      from skip s where s.user_id is not null  
  )  -- the "where s.user_id is not null" here is mandatory, otherwise the recursion never terminates.
)   
select count(distinct user_id) from skip;  
                                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=10.44..10.45 rows=1 width=4) (actual time=186741.338..186741.339 rows=1 loops=1)  
   CTE skip  
     ->  Recursive Union  (cost=0.03..8.17 rows=101 width=4) (actual time=0.047..178296.238 rows=10000002 loops=1)  
           ->  Result  (cost=0.03..0.04 rows=1 width=0) (actual time=0.046..0.046 rows=1 loops=1)  
                 InitPlan 1 (returns $1)  
                   ->  Limit  (cost=0.00..0.03 rows=1 width=4) (actual time=0.042..0.042 rows=1 loops=1)  
                          ->  Index Only Scan using i1 on user_download_log t  (cost=0.00..285759.50 rows=10000029 width=4) (actual time=0.040..0.040 rows=1 loops=1)  
                               Index Cond: (user_id IS NOT NULL)  
                               Heap Fetches: 1  
           ->  WorkTable Scan on skip s  (cost=0.00..0.61 rows=10 width=4) (actual time=0.017..0.017 rows=1 loops=10000002)  
                 Filter: (user_id IS NOT NULL)  
                 Rows Removed by Filter: 0  
                 SubPlan 3  
                   ->  Result  (cost=0.03..0.04 rows=1 width=0) (actual time=0.016..0.016 rows=1 loops=10000001)  
                         InitPlan 2 (returns $3)  
                           ->  Limit  (cost=0.00..0.03 rows=1 width=4) (actual time=0.015..0.015 rows=1 loops=10000001)  
                                  ->  Index Only Scan using i1 on user_download_log t  (cost=0.00..103588.85 rows=3333343 width=4) (actual time=0.014..0.014 rows=1 loops=10000001)  
                                       Index Cond: ((user_id > s.user_id) AND (user_id IS NOT NULL))  
                                       Heap Fetches: 10000000  
   ->  CTE Scan on skip  (cost=0.00..2.02 rows=101 width=4) (actual time=0.050..183449.391 rows=10000002 loops=1)  
 Total runtime: 186909.323 ms  
(21 rows)  
Time: 186910.482 ms  

Below, the same schema and test data are tested on Oracle.

SQL> create table test (id int, otherinfo varchar2(32)) nologging;  
Table created.  
SQL> insert into test select rownum,'this is test' from dual connect by level <=10000001;  
10000001 rows created.  
SQL> commit;  
SQL> create index i1 on test(id);  
SQL> create index i2 on test(otherinfo);  
SQL> explain plan for select count(distinct id) from test;  
Explained.  
SQL> select * from table(dbms_xplan.display());  
PLAN_TABLE_OUTPUT  
--------------------------------------------------------------------------------------------------------------------------------------------  
Plan hash value: 1403727100  
------------------------------------------------------------------------------  
| Id  | Operation             | Name | Rows  | Bytes | Cost (%CPU)| Time     |  
------------------------------------------------------------------------------  
|   0 | SELECT STATEMENT      |      |     1 |    13 |  4178   (3)| 00:00:51 |  
|   1 |  SORT GROUP BY        |      |     1 |    13 |            |          |  
|   2 |   INDEX FAST FULL SCAN| I1   |  9834K|   121M|  4178   (3)| 00:00:51 |  
------------------------------------------------------------------------------  
Note  
-----  
   - dynamic sampling used for this statement  
13 rows selected.  
SQL> explain plan for select count(distinct otherinfo) from test;  
Explained.  
SQL> select * from table(dbms_xplan.display());  
PLAN_TABLE_OUTPUT  
--------------------------------------------------------------------------------------------------------------------------------------------  
Plan hash value: 2603667166  
---------------------------------------------------------------------------  
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |  
---------------------------------------------------------------------------  
|   0 | SELECT STATEMENT   |      |     1 |    18 |  5837   (3)| 00:01:11 |  
|   1 |  SORT GROUP BY     |      |     1 |    18 |            |          |  
|   2 |   TABLE ACCESS FULL| TEST |  9834K|   168M|  5837   (3)| 00:01:11 |  
---------------------------------------------------------------------------  
Note  
-----  
   - dynamic sampling used for this statement  
13 rows selected.  
SQL> set timing on  
SQL> select count(distinct otherinfo) from test;  
COUNT(DISTINCTOTHERINFO)  
------------------------  
                       1  
Elapsed: 00:00:02.49  
SQL> select count(distinct id) from test;  
COUNT(DISTINCTID)  
-----------------  
         10000001  
Elapsed: 00:00:07.13  

The elapsed times show that, on a column with very few distinct values, PostgreSQL with the recursive-SQL optimization is about 41,213x faster than Oracle (7.13 s / 0.173 ms).

Additional notes

Aggregate functions are not allowed in the recursive term of a recursive query:

with recursive skip as (  
  (  
    select min(t.otherinfo) as otherinfo from user_download_log t where t.otherinfo is not null  
  )  
  union all  
  (  
    select min(t.otherinfo) from user_download_log t, skip s   
      where t.otherinfo > s.otherinfo   
      and t.otherinfo is not null  
      and s.otherinfo is not null  
  )  -- the "where s.otherinfo is not null" here is mandatory, otherwise the recursion never terminates.
)   
select * from skip;  
ERROR:  aggregate functions not allowed in a recursive query's recursive term  
LINE 7:     select min(t.otherinfo) from user_download_log t, skip s...  
                   ^  
Time: 0.581 ms  

Rewrite it as follows and it works:

with recursive skip as (  
  (  
    select min(t.otherinfo) as otherinfo from user_download_log t where t.otherinfo is not null  
  )  
  union all  
  (  
    select (select min(t.otherinfo) from user_download_log t where t.otherinfo > s.otherinfo and t.otherinfo is not null)   
      from skip s where s.otherinfo is not null  
  )  -- the "where s.otherinfo is not null" here is mandatory, otherwise the recursion never terminates.
)   
select * from skip;  

Careful readers will have noticed that the table was never analyzed in the Oracle test. Below are the results after analyzing; the execution plans do not change:

SQL> analyze table sex estimate statistics for all columns sample 10 percent;  
Table analyzed.  
SQL> analyze index idx_sex_1 estimate statistics sample 10 percent;  
Index analyzed.  
SQL> select sex from sex t group by sex;  

S  
-  
w  
m  

Elapsed: 00:00:03.17  

Execution Plan  
----------------------------------------------------------  
Plan hash value: 2807610159  

---------------------------------------------------------------------------  
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |  
---------------------------------------------------------------------------  
|   0 | SELECT STATEMENT   |      |     2 |     2 | 14519  (12)| 00:02:55 |  
|   1 |  HASH GROUP BY     |      |     2 |     2 | 14519  (12)| 00:02:55 |  
|   2 |   TABLE ACCESS FULL| SEX  |    20M|    19M| 13062   (2)| 00:02:37 |  
---------------------------------------------------------------------------  


Statistics  
----------------------------------------------------------  
          1  recursive calls  
          0  db block gets  
      74074  consistent gets  
          0  physical reads  
          0  redo size  
        563  bytes sent via SQL*Net to client  
        487  bytes received via SQL*Net from client  
          2  SQL*Net roundtrips to/from client  
          0  sorts (memory)  
          0  sorts (disk)  
          2  rows processed  

SQL> select count(distinct sex) from sex;  

COUNT(DISTINCTSEX)  
------------------  
                 2  

Elapsed: 00:00:03.85  

Execution Plan  
----------------------------------------------------------  
Plan hash value: 1805173869  

-----------------------------------------------------------------------------------  
| Id  | Operation             | Name      | Rows  | Bytes | Cost (%CPU)| Time     |  
-----------------------------------------------------------------------------------  
|   0 | SELECT STATEMENT      |           |     1 |     1 |  6454   (3)| 00:01:18 |  
|   1 |  SORT GROUP BY        |           |     1 |     1 |            |          |  
|   2 |   INDEX FAST FULL SCAN| IDX_SEX_1 |    20M|    19M|  6454   (3)| 00:01:18 |  
-----------------------------------------------------------------------------------  


Statistics  
----------------------------------------------------------  
          1  recursive calls  
          0  db block gets  
      36325  consistent gets  
          0  physical reads  
          0  redo size  
        525  bytes sent via SQL*Net to client  
        487  bytes received via SQL*Net from client  
          2  SQL*Net roundtrips to/from client  
          1  sorts (memory)  
          0  sorts (disk)  
          1  rows processed  

Oracle has one more option for this kind of workload: the bitmap index.
Here is a short excerpt introducing Oracle bitmap indexes.
Bitmap index
Use case: low-cardinality, enumerable columns with many duplicate values, where the data is rarely updated.
Principle: one key value maps to many rows (rowids); the entry format is key value, start_rowid, end_rowid, bitmap.
Advantages: OLAP workloads such as reporting databases with highly duplicated data; query patterns such as count, OR and AND are served by pure bit operations.
Disadvantages: unsuitable for columns with few duplicates and for frequent DML (insert, update, delete), because the locking cost of a bitmap index is very high: modifying a bitmap index segment affects the whole segment, e.g. changing one key value touches every row sharing that key. For OLTP systems bitmap indexes are therefore basically unusable.

Because bitmap indexes are rarely useful for OLTP, PostgreSQL does not implement this index type on disk (it does, however, build bitmaps in memory at query time; see the sketch after the links below).
http://www.postgresql.org/message-id/flat/27879.1098227105@sss.pgh.pa.us#27879.1098227105@sss.pgh.pa.us
https://en.wikipedia.org/wiki/Bitmap_index
http://grokbase.com/t/postgresql/pgsql-hackers/051xeh5b0a/implementing-bitmap-indexes
For more on PostgreSQL indexes, see
http://leopard.in.ua/2015/04/13/postgresql-indexes
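
Although there is no on-disk bitmap index, PostgreSQL can build bitmaps in memory from ordinary btree indexes and combine them at query time (BitmapAnd / BitmapOr). A minimal sketch follows; the table and index names (t_flag, idx_flag_sex, idx_flag_prov) are made up for illustration, and whether the planner actually chooses a BitmapAnd depends on the statistics.

create table t_flag (sex char(1), province int, info text);
insert into t_flag
  select case when random() < 0.5 then 'm' else 'w' end,
         (random()*100)::int,
         'this is test'
    from generate_series(1, 1000000);
create index idx_flag_sex  on t_flag (sex);
create index idx_flag_prov on t_flag (province);
analyze t_flag;
-- With two moderately selective conditions the planner may scan both btree
-- indexes, build a bitmap for each, and combine them with BitmapAnd before
-- visiting the heap -- the run-time analogue of a bitmap index.
explain select count(*) from t_flag where sex = 'm' and province = 1;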

PostgreSQL Oracle compatibility - add_months


Some users reported that add_months in PostgreSQL's orafce extension disagrees with Oracle's add_months on certain dates.
The Oracle reference defines add_months as follows: if the input date is the last day of its month, or the resulting month has fewer days than the day component of the input date, the result is the last day of the resulting month.
Example:
2015-02-28 is the last day of February, so by Oracle's rule adding or subtracting any number of months should always land on the last day of the target month, but PostgreSQL does not behave this way:

postgres=# select timestamp '2015-02-28' - interval '1 month';
      ?column?       
---------------------
 2015-01-28 00:00:00
(1 row)

postgres=# select oracle.add_months('2015-02-28 11:11:11+08',-1);
     add_months      
---------------------
 2015-01-28 11:11:11
(1 row)

In Oracle, the query above should return January 31.

When the target month does not contain the input day, the last day of the target month is used; for example, March 30 minus one month cannot be February 30, so the last day of February is returned. This rule is consistent with Oracle.

postgres=# select timestamp '2015-03-30' - interval '1 month';
      ?column?       
---------------------
 2015-02-28 00:00:00
(1 row)

Oracle documents add_months as follows:
http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions004.htm

ADD_MONTHS returns the date date plus integer months. The date argument can be a datetime value or any value that can be implicitly converted to DATE. The integer argument can be an integer or any value that can be implicitly converted to an integer. The return type is always DATE, regardless of the datatype of date. If date is the last day of the month or if the resulting month has fewer days than the day component of date, then the result is the last day of the resulting month. Otherwise, the result has the same day component as date.

The add_months implementation in orafce is simply:
SELECT ($1 + interval '1 month' * $2)::oracle.date;
and this is exactly where the problem lies.
To be fully compatible with Oracle, create two functions like the following: if the input date is the last day of its month, snap the result to the last day of the target month; otherwise fall back to PostgreSQL's normal interval arithmetic.

create or replace function add_months(timestamp, int) returns timestamp as $$
declare
  i interval := ($2 || 'month');
  -- d1: last day of the month that $1 falls in
  d1 date := date(to_timestamp($1::text,'yyyy-mm') + interval '1 month' - interval '1 day');
  -- d2: date part of $1
  d2 date := date($1);
  res timestamp;
begin
  -- if $1 is the last day of its month, return the last day of the target month
  -- (keeping the time component); otherwise use plain interval arithmetic
  select case when d1=d2 then ((to_char($1+i+interval '1 month', 'yyyy-mm')||'-01')::date - 1) + $1::time else $1+i end into res;
  return res;
end;
$$ language plpgsql strict;

create or replace function add_months(timestamptz, int) returns timestamptz as $$
declare
  i interval := ($2 || 'month');
  d1 date := date(to_timestamp($1::text,'yyyy-mm') + interval '1 month' - interval '1 day');
  d2 date := date($1);
  res timestamptz;
begin
  select case when d1=d2 then ((to_char($1+i+interval '1 month', 'yyyy-mm')||'-01')::date - 1) + $1::timetz else $1+i end into res;
  return res;
end;
$$ language plpgsql strict;

Testing confirms the desired behavior:

postgres=# select add_months('2015-02-28 11:11:11+08',-1);
       add_months       
------------------------
 2015-01-31 11:11:11+08
(1 row)

postgres=# select add_months('2015-02-28 11:11:11+08',-12);
       add_months       
------------------------
 2014-02-28 11:11:11+08
(1 row)

postgres=# select add_months('2015-02-28 11:11:11+08',-24);
       add_months       
------------------------
 2013-02-28 11:11:11+08
(1 row)

postgres=# select add_months('2015-02-28 11:11:11+08',-36);
       add_months       
------------------------
 2012-02-29 11:11:11+08
(1 row)

postgres=# select add_months('2015-03-30 11:11:11+08',-1);
       add_months       
------------------------
 2015-02-28 11:11:11+08
(1 row)

postgres=# select add_months('2015-03-31 11:11:11+08',-1);
       add_months       
------------------------
 2015-02-28 11:11:11+08
(1 row)

postgres=# select add_months('2015-03-31 11:11:11+08',1);
       add_months       
------------------------
 2015-04-30 11:11:11+08
(1 row)

postgres=# select add_months('2015-03-30 11:11:11+08',1);
       add_months       
------------------------
 2015-04-30 11:11:11+08
(1 row)

postgres=# select add_months('2015-02-28 11:11:11+08',1);
       add_months       
------------------------
 2015-03-31 11:11:11+08
(1 row)

Why LIMIT is slow with PostgreSQL GIN indexes


The structure of a PostgreSQL GIN index is roughly as follows (figure: GIN index layout):
suppose the table has two columns, one storing an INT and the other an INT array, and the leftmost part denotes the row's TID.
If we build a GIN index on the INT array column, the index records, for every array element, the TIDs of the rows containing it; when an element matches many rows, the TIDs are stored as a posting list that the index entry points to (figure: posting lists).
Now for the reason LIMIT is slow: it comes down to how GIN indexes are scanned. Currently GIN only supports bitmap index scans, i.e. all matching TIDs are collected and sorted before the heap is visited.
So no matter how small the LIMIT, the TID sort cannot be avoided, and that is why LIMIT is slower than with btree, gist and other index methods that do not require a bitmap index scan.
Example:

postgres=# create table t3(id int, info int[]);
CREATE TABLE
postgres=# insert into t3 select generate_series(1,10000),array[1,2,3,4,5];
INSERT 0 10000
postgres=# create index idx_t3_info on t3 using gin(info);
CREATE INDEX
postgres=# set enable_seqscan=off;
SET

The array predicate uses the index; note that it is a bitmap index scan, so if the matching array value corresponds to 10,000 rows, those 10,000 TIDs are sorted first and only then is the heap scanned to fetch the rows.

postgres=# explain analyze select * from t3 where info  && array [1] ;
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t3  (cost=83.00..302.00 rows=10000 width=45) (actual time=1.156..3.565 rows=10000 loops=1)
   Recheck Cond: (info && '{1}'::integer[])
   Heap Blocks: exact=94
   ->  Bitmap Index Scan on idx_t3_info  (cost=0.00..80.50 rows=10000 width=0) (actual time=1.129..1.129 rows=10000 loops=1)
         Index Cond: (info && '{1}'::integer[])
 Planning time: 0.107 ms
 Execution time: 5.272 ms
(7 rows)

Because a bitmap index scan is used, even with LIMIT 1 the TID sort cannot be skipped, so the overhead is still noticeable.

postgres=# explain analyze select * from t3 where info  && array [1] limit 1;
                                                            QUERY PLAN                                                             
----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=83.00..83.02 rows=1 width=45) (actual time=1.121..1.121 rows=1 loops=1)
   ->  Bitmap Heap Scan on t3  (cost=83.00..302.00 rows=10000 width=45) (actual time=1.119..1.119 rows=1 loops=1)
         Recheck Cond: (info && '{1}'::integer[])
         Heap Blocks: exact=1
         ->  Bitmap Index Scan on idx_t3_info  (cost=0.00..80.50 rows=10000 width=0) (actual time=1.095..1.095 rows=10000 loops=1)
               Index Cond: (info && '{1}'::integer[])
 Planning time: 0.113 ms
 Execution time: 1.175 ms
(8 rows)

That is the reason LIMIT is slow with GIN indexes.
GIN is designed this way for a reason: arrays can contain large numbers of duplicate values.
Suppose, for example, we are looking for three elements 1, 2, 3 and 100,000 rows match; the ctids matching 1, 2 and 3 will likely hit many of the same heap pages, so a bitmap index scan greatly reduces random page access.
It works wonders when fetching large amounts of heap data scattered across many pages.
But when only a few rows are needed and shared buffers are large enough, the bitmap index scan brings little benefit.
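
A small hedged check of the claim that GIN only offers bitmap scans, reusing the t3 table above: with bitmap scans disabled, the planner cannot fall back to an ordinary index scan on the GIN index and is expected to choose a sequential scan instead.

set enable_bitmapscan = off;
set enable_seqscan = on;
-- GIN has no ordinary (tuple-at-a-time) index scan, so the expectation here is
-- a Seq Scan even though idx_t3_info matches the predicate.
explain select * from t3 where info && array[1] limit 1;
reset enable_bitmapscan;
reset enable_seqscan;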

Going one step further, here is another example: the btree_gin extension makes GIN support some standard scalar types, so it can be used to build a multi-column GIN index.
Multi-column indexes are generally used when a single column is not selective enough but several columns combined are.
Example:

postgres=# create extension btree_gin;
CREATE EXTENSION

postgres=# create table t4(id int, info int[]);
CREATE TABLE
postgres=# insert into t4 select trunc(random()*1000), array_append(array[1,2,3], trunc(random()*1000)::int) from generate_series(1,100000);
INSERT 0 100000
postgres=# select * from t4 limit 10;
 id  |    info     
-----+-------------
 588 | {1,2,3,835}
 382 | {1,2,3,332}
 817 | {1,2,3,476}
 478 | {1,2,3,597}
 928 | {1,2,3,714}
 645 | {1,2,3,539}
 457 | {1,2,3,536}
 713 | {1,2,3,246}
 842 | {1,2,3,545}
 194 | {1,2,3,70}
(10 rows)

postgres=# create index idx_t4 on t4 using gin(id,info);
CREATE INDEX
postgres=# explain (analyze,verbose,costs,timing,buffers) select * from t4 where id=10 and info && array[1,2,3];
                                                            QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.t4  (cost=10000000010.89..10000000111.71 rows=97 width=44) (actual time=1.572..1.737 rows=97 loops=1)
   Output: id, info
   Recheck Cond: ((t4.id = 10) AND (t4.info && '{1,2,3}'::integer[]))
   Heap Blocks: exact=92
   Buffers: shared hit=179
   ->  Bitmap Index Scan on idx_t4  (cost=0.00..10.87 rows=97 width=0) (actual time=1.554..1.554 rows=97 loops=1)
         Index Cond: ((t4.id = 10) AND (t4.info && '{1,2,3}'::integer[]))
         Buffers: shared hit=87
 Planning time: 0.262 ms
 Execution time: 1.786 ms
(10 rows)

So where does a multi-column GIN index fit best?
When the conditions on the indexed columns can narrow the result down to a very small set of rows.
If they cannot, or if a btree index alone can already narrow it down that far, a plain BTREE index is enough.
Likewise, if a LIMIT caps the number of rows to fetch, btree is also the better choice, because btree supports both plain index scans and bitmap index scans, and therefore works well for small result sets as well as large ones; see the sketch below.
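
A hedged sketch of that LIMIT case, reusing the t4 table from the example above (the index name idx_t4_id is made up): a plain btree index on the scalar column can serve the query with an ordinary index scan and stop after the first match, avoiding GIN's TID sort.

create index idx_t4_id on t4 using btree (id);
-- An ordinary Index Scan can return the first matching row immediately;
-- there is no bitmap build / TID sort step as with the GIN index.
explain analyze select * from t4 where id = 10 limit 1;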

An introduction to PostgreSQL's logical structure and privilege system


This article aims to help users understand PostgreSQL's logical structure and privilege system, so that database privileges can be understood and managed quickly.

Logical structure

At the top is the instance; an instance may hold multiple databases, each database may hold multiple schemas, and each schema may hold multiple objects.
Objects include tables, materialized views, operators, indexes, views, sequences, functions, and so on.
In the database, all privileges are tied to roles (users); public is a special role that stands for everyone.
Superusers may operate on any object; ordinary users may only operate on objects they own or have been granted privileges on.
In addition, some objects carry default privileges granted to the public role, so as soon as they are created, everyone has those default privileges. A minimal sketch of the hierarchy follows.
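
All names below (mydb, app, app_user) are made up for illustration:

create role app_user login password 'secret';
create database mydb owner app_user;
\c mydb app_user
create schema app;                                    -- schema inside the database
create table app.t (id int primary key, info text);  -- object inside the schema
create view  app.v as select * from app.t;           -- another object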

Privilege system

Instance-level access is controlled by pg_hba.conf, for example:

# TYPE  DATABASE        USER            ADDRESS                 METHOD
# "local" is for Unix domain socket connections only
local   all             all                                     trust
# IPv4 local connections:
host    all             all             127.0.0.1/32            trust
host all postgres 0.0.0.0/0 reject
host all all 0.0.0.0/0 md5

The configuration above means:
any local (Unix socket) user may connect to any database without a password;
the postgres user may not connect to any database from any remote address;
any other user may connect to any database from a remote address with an md5 password.
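
Note that edits to pg_hba.conf only take effect after a reload; a sketch:

-- reload the configuration without restarting the server
select pg_reload_conf();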

Database-level privileges cover connecting to the database and creating schemas in it.
By default, once a database is created, the public role may connect to it, i.e. anyone can connect.
By default, no one other than superusers and the owner may create schemas in a newly created database.
By default, a new database contains a schema named public whose ALL privileges have been granted to the public role, i.e. anyone may create objects in it.
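
A hedged sketch of tightening these defaults (the database name mydb and role app_user are the hypothetical names used above):

revoke connect on database mydb from public;   -- stop 'anyone can connect'
grant  connect on database mydb to app_user;   -- allow only the intended role
grant  create  on database mydb to app_user;   -- let it create schemas as well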

Schema-level privileges cover looking up objects in the schema (USAGE) and creating objects in it (CREATE).
By default a newly created schema grants nothing to the public role, so apart from superusers and the owner nobody can look up objects in the schema or create new objects in it.
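
A minimal sketch (schema app and roles app_reader, app_user are hypothetical):

grant usage         on schema app to app_reader;  -- may reference objects in app
grant usage, create on schema app to app_user;    -- may also create objects in app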

Schema usage: a special caution

According to the SQL standard, the owner of a schema always owns all objects within it. PostgreSQL allows schemas to contain objects owned by users other than the schema owner. This can happen only if the schema owner grants the CREATE privilege on his schema to someone else, or a superuser chooses to create objects in it.

By default the schema owner owns all objects within the schema, but PostgreSQL also allows users to create objects in someone else's schema, so an object can effectively have two owners, and the schema owner has the privilege to drop objects in it.
That both owners can drop the object is, in my personal opinion, a bug.

So never create your objects in someone else's schema; it is dangerous.

Object-level privileges differ for each object type; for details see
http://www.postgresql.org/docs/9.5/static/sql-grant.html
Taking tables as an example, the available privileges are SELECT, INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES and TRIGGER.

GRANT { { SELECT | INSERT | UPDATE | DELETE | TRUNCATE | REFERENCES | TRIGGER }
    [, ...] | ALL [ PRIVILEGES ] }
    ON { [ TABLE ] table_name [, ...]
         | ALL TABLES IN SCHEMA schema_name [, ...] }
    TO role_specification [, ...] [ WITH GRANT OPTION ]

GRANT { { SELECT | INSERT | UPDATE | REFERENCES } ( column_name [, ...] )
    [, ...] | ALL [ PRIVILEGES ] ( column_name [, ...] ) }
    ON [ TABLE ] table_name [, ...]
    TO role_specification [, ...] [ WITH GRANT OPTION ]

GRANT { { USAGE | SELECT | UPDATE }
    [, ...] | ALL [ PRIVILEGES ] }
    ON { SEQUENCE sequence_name [, ...]
         | ALL SEQUENCES IN SCHEMA schema_name [, ...] }
    TO role_specification [, ...] [ WITH GRANT OPTION ]

GRANT { { CREATE | CONNECT | TEMPORARY | TEMP } [, ...] | ALL [ PRIVILEGES ] }
    ON DATABASE database_name [, ...]
    TO role_specification [, ...] [ WITH GRANT OPTION ]

GRANT { USAGE | ALL [ PRIVILEGES ] }
    ON DOMAIN domain_name [, ...]
    TO role_specification [, ...] [ WITH GRANT OPTION ]

GRANT { USAGE | ALL [ PRIVILEGES ] }
    ON FOREIGN DATA WRAPPER fdw_name [, ...]
    TO role_specification [, ...] [ WITH GRANT OPTION ]

GRANT { USAGE | ALL [ PRIVILEGES ] }
    ON FOREIGN SERVER server_name [, ...]
    TO role_specification [, ...] [ WITH GRANT OPTION ]

GRANT { EXECUTE | ALL [ PRIVILEGES ] }
    ON { FUNCTION function_name ( [ [ argmode ] [ arg_name ] arg_type [, ...] ] ) [, ...]
         | ALL FUNCTIONS IN SCHEMA schema_name [, ...] }
    TO role_specification [, ...] [ WITH GRANT OPTION ]

GRANT { USAGE | ALL [ PRIVILEGES ] }
    ON LANGUAGE lang_name [, ...]
    TO role_specification [, ...] [ WITH GRANT OPTION ]

GRANT { { SELECT | UPDATE } [, ...] | ALL [ PRIVILEGES ] }
    ON LARGE OBJECT loid [, ...]
    TO role_specification [, ...] [ WITH GRANT OPTION ]

GRANT { { CREATE | USAGE } [, ...] | ALL [ PRIVILEGES ] }
    ON SCHEMA schema_name [, ...]
    TO role_specification [, ...] [ WITH GRANT OPTION ]

GRANT { CREATE | ALL [ PRIVILEGES ] }
    ON TABLESPACE tablespace_name [, ...]
    TO role_specification [, ...] [ WITH GRANT OPTION ]

GRANT { USAGE | ALL [ PRIVILEGES ] }
    ON TYPE type_name [, ...]
    TO role_specification [, ...] [ WITH GRANT OPTION ]

where role_specification can be:

    [ GROUP ] role_name
  | PUBLIC
  | CURRENT_USER
  | SESSION_USER

GRANT role_name [, ...] TO role_name [, ...] [ WITH ADMIN OPTION ]

A quick note on the common GRANT options.
WITH GRANT OPTION (and, for role membership, WITH ADMIN OPTION) means that the grantee, besides getting the privilege, may grant it onward to others; without it the grantee holds the privilege but cannot pass it on.
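
A short hedged sketch; sbtest2 and digoal appear later in this article, while dba_role is a hypothetical role:

-- digoal may now grant SELECT on sbtest2 to other roles as well
grant select on sbtest2 to digoal with grant option;
-- the role-membership counterpart uses WITH ADMIN OPTION
grant dba_role to digoal with admin option;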

Users

In PostgreSQL, users and roles are one and the same concept.

public

The public role stands for everyone.

How to view and interpret an object's current privileges

Taking tables as an example:

select relname,relacl from pg_class where relkind='r';

Or run:

SELECT n.nspname as "Schema",
  c.relname as "Name",
  CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN 'view' WHEN 'm' THEN 'materialized view' WHEN 'S' THEN 'sequence' WHEN 'f' THEN 'foreign table' END as "Type",
  pg_catalog.array_to_string(c.relacl, E'\n') AS "Access privileges",
  pg_catalog.array_to_string(ARRAY(
    SELECT attname || E':\n  ' || pg_catalog.array_to_string(attacl, E'\n  ')
    FROM pg_catalog.pg_attribute a
    WHERE attrelid = c.oid AND NOT attisdropped AND attacl IS NOT NULL
  ), E'\n') AS "Column privileges",
  pg_catalog.array_to_string(ARRAY(
    SELECT polname
    || CASE WHEN polcmd != '*' THEN
           E' (' || polcmd || E'):'
       ELSE E':' 
       END
    || CASE WHEN polqual IS NOT NULL THEN
           E'\n  (u): ' || pg_catalog.pg_get_expr(polqual, polrelid)
       ELSE E''
       END
    || CASE WHEN polwithcheck IS NOT NULL THEN
           E'\n  (c): ' || pg_catalog.pg_get_expr(polwithcheck, polrelid)
       ELSE E''
       END    || CASE WHEN polroles <> '{0}' THEN
           E'\n  to: ' || pg_catalog.array_to_string(
               ARRAY(
                   SELECT rolname
                   FROM pg_catalog.pg_roles
                   WHERE oid = ANY (polroles)
                   ORDER BY 1
               ), E', ')
       ELSE E''
       END
    FROM pg_catalog.pg_policy pol
    WHERE polrelid = c.oid), E'\n')
    AS "Policies"
FROM pg_catalog.pg_class c
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'v', 'm', 'S', 'f')
  AND n.nspname !~ '^pg_' AND pg_catalog.pg_table_is_visible(c.oid)
ORDER BY 1, 2;

This yields a privilege listing such as:

 Schema |      Name       |   Type   |       Access privileges        | Column privileges | Policies 
--------+-----------------+----------+--------------------------------+-------------------+----------
 public | sbtest1         | table    | postgres=arwdDxt/postgres     +|                   | 
        |                 |          | digoal=a*r*w*d*D*x*t*/postgres |                   | 
 public | sbtest10        | table    | postgres=arwdDxt/postgres      |                   | 
 public | sbtest10_id_seq | sequence |                                |                   | 
 public | sbtest11        | table    | postgres=arwdDxt/postgres      |                   | 
 public | sbtest11_id_seq | sequence |                                |                   | 
 public | sbtest12        | table    | postgres=arwdDxt/postgres      |                   | 
 public | sbtest12_id_seq | sequence |                                |                   | 

How to read the Access privileges column:
rolename=xxx -- rolename is the user the privileges were granted to, i.e. who received them;
=xxx -- the privileges were granted to the public role, i.e. everyone;
/yyyy -- the role that granted the privilege.
The privilege letters mean the following:

rolename=xxxx -- privileges granted to a role
        =xxxx -- privileges granted to PUBLIC

            r -- SELECT ("read")
            w -- UPDATE ("write")
            a -- INSERT ("append")
            d -- DELETE
            D -- TRUNCATE
            x -- REFERENCES
            t -- TRIGGER
            X -- EXECUTE
            U -- USAGE
            C -- CREATE
            c -- CONNECT
            T -- TEMPORARY
      arwdDxt -- ALL PRIVILEGES (for tables, varies for other objects)
            * -- grant option for preceding privilege

        /yyyy -- role that granted this privilege

Example:
the grantor is the postgres user, and SELECT on table sbtest2 has been granted to the digoal user.

postgres=# grant select on sbtest2 to digoal;
GRANT
postgres=# \dp+ sbtest2
                                  Access privileges
 Schema |  Name   | Type  |     Access privileges     | Column privileges | Policies 
--------+---------+-------+---------------------------+-------------------+----------
 public | sbtest2 | table | postgres=arwdDxt/postgres+|                   | 
        |         |       | digoal=r/postgres         |                   | 
(1 row)

To revoke, target exactly the privileges that are currently listed; if you still see the privilege here, revoke it as shown.
For example:

revoke select on sbtest2 from digoal;  

References

grant
revoke

Higher-level security controls

On top of the basic privilege system, PostgreSQL also supports security policies of the kind usually found only in enterprise commercial databases.

Row security policies

https://yq.aliyun.com/articles/4271

SELinux-PostgreSQL
