Channel: PostgreSQL research

PostgreSQL ECPG preprocessor directives: ifdef, include, and more


The ecpg preprocessor in community PostgreSQL differs from Oracle's Pro*C in how it handles some preprocessor directives, and users need to be aware of this.
For example, community ecpg does not support C-style #ifdef / #ifndef preprocessing; an alternative notation is required.
So if you write #ifdef in a .pgc file and ecpg reports an error at compile time, you may well be puzzled.
Example:

$ vi t.pgc
#include <stdio.h>
#include <stdlib.h>
#include <pgtypes_numeric.h>

EXEC SQL WHENEVER SQLERROR STOP;

int
main(void)
{
EXEC SQL BEGIN DECLARE SECTION;
    numeric *num;
    numeric *num2;
    decimal *dec;

#ifdef ABC    // ecpg copies this whole block into the generated .c file, but every statement inside it must still pass through the parser, which is why it can raise an error
    errtype *abc;
#endif

EXEC SQL END DECLARE SECTION;

    EXEC SQL CONNECT TO tcp:postgresql://xxx.xxxcs.com:3433/postgres AS db_digoal USER digoal USING pwd;

    num = PGTYPESnumeric_new();
    dec = PGTYPESdecimal_new();

    EXEC SQL SELECT 12.345::numeric(4,2), 23.456::decimal(4,2) INTO :num, :dec;

    printf("numeric = %s\n", PGTYPESnumeric_to_asc(num, 0));
    printf("numeric = %s\n", PGTYPESnumeric_to_asc(num, 1));
    printf("numeric = %s\n", PGTYPESnumeric_to_asc(num, 2));

    /* Convert decimal to numeric to show a decimal value. */
    num2 = PGTYPESnumeric_new();
    PGTYPESnumeric_from_decimal(dec, num2);

    printf("decimal = %s\n", PGTYPESnumeric_to_asc(num2, 0));
    printf("decimal = %s\n", PGTYPESnumeric_to_asc(num2, 1));
    printf("decimal = %s\n", PGTYPESnumeric_to_asc(num2, 2));

    PGTYPESnumeric_free(num2);
    PGTYPESdecimal_free(dec);
    PGTYPESnumeric_free(num);

    EXEC SQL COMMIT;
    EXEC SQL DISCONNECT ALL;
    return 0;
}

Compilation fails:

ecpg -t -c -I/home/digoal/pgsql9.6/include -o t.c t.pgc
t.pgc:16: ERROR: unrecognized data type name "errtype"

How do we fix it?

#ifdef ABC
    errtype *abc;
#endif

Change it to:

EXEC SQL ifdef ABC;
    errtype *abc;
EXEC SQL endif;

It now compiles:

digoal@iZ25zysa2jmZ-> ecpg -t -c -I/home/digoal/pgsql9.6/include -o t.c t.pgc

Only with the ABC macro defined does the enclosed section go through the parser:

digoal@iZ25zysa2jmZ-> ecpg -t -c -I/home/digoal/pgsql9.6/include -o t.c -D ABC t.pgc
t.pgc:16: ERROR: unrecognized data type name "errtype"

PostgreSQL ecpg also supports include preprocessing.
If an included header file contains ECPG constructs, it must be pulled in with one of the following forms:

EXEC SQL INCLUDE filename;
EXEC SQL INCLUDE <filename>;
EXEC SQL INCLUDE "filename";

If the included file contains no ECPG syntax, this is unnecessary; the usual notation works, and ecpg copies it straight through to the .c output:

#include <filename.h>



The ecpg shipped by EDB does support the #ifdef notation, for compatibility with Oracle's Pro*C.

References
https://www.postgresql.org/docs/9.5/static/ecpg-preproc.html

EDB ECPG
https://www.enterprisedb.com/docs/en/9.5/ecpg/Postgres_Plus_Advanced_Server_ecpgPlus_Guide.1.24.html#

The ECPGPlus C-preprocessor enforces two behaviors that are dependent on the mode in which you invoke ECPGPlus:
- PROC mode
- non-PROC mode

Compiling in PROC mode
In PROC mode, ECPGPlus allows you to:
- Declare host variables outside of an EXEC SQL BEGIN/END DECLARE SECTION.
- Use any C variable as a host variable as long as it is of a data type compatible with ECPG.

When you invoke ECPGPlus in PROC mode (by including the -C PROC keywords), the ECPG compiler honors the following C-preprocessor directives:
#include
#if expression
#ifdef symbolName
#ifndef symbolName
#else
#elif expression
#endif
#define symbolName expansion
#define symbolName([macro arguments]) expansion
#undef symbolName
#defined(symbolName)

Pre-processor directives are used to effect or direct the code that is received by the compiler. For example, using the following code sample:
#if HAVE_LONG_LONG == 1
#define BALANCE_TYPE long long
#else
#define BALANCE_TYPE double
#endif
...
BALANCE_TYPE customerBalance;

If you invoke ECPGPlus with the following command-line arguments:
ecpg -C PROC -DHAVE_LONG_LONG=1
ECPGPlus will copy the entire fragment (without change) to the output file, but will only send the following tokens to the ECPG parser:
long long customerBalance;
On the other hand, if you invoke ECPGPlus with the following command-line arguments:
ecpg -C PROC -DHAVE_LONG_LONG=0
The ECPG parser will receive the following tokens:
double customerBalance;

If your code uses preprocessor directives to filter the code that is sent to the compiler, the complete code is retained in the original code, while the ECPG parser sees only the processed token stream.
Compiling in non-PROC mode
If you do not include the -C PROC command-line option:
- C preprocessor directives are copied to the output file without change.
- You must declare the type and name of each C variable that you intend to use as a host variable within an EXEC SQL BEGIN/END DECLARE section.

When invoked in non-PROC mode, ECPG implements the behavior described in the PostgreSQL core documentation.

PostgreSQL parallel query performance on xfs and ext4


I have written several earlier articles on PostgreSQL 9.6's CPU-based parallel query; see:
https://yq.aliyun.com/articles/44655
https://yq.aliyun.com/articles/44649

This article compares EXT4 and XFS, looking at how their performance differs across degrees of parallelism.
The filesystems were formatted and mounted with the following parameters.
XFS

mkfs.xfs -f -b size=4096 -l logdev=/dev/dfc1,size=2047868928,sunit=16 -d agsize=536862720 /dev/dfc2

/dev/dfc2 /u03 xfs defaults,allocsize=16M,inode64,nobarrier,nolargeio,logbsize=262144,noatime,nodiratime,swalloc,logdev=/dev/dfc1 0 0

EXT4

mkfs.ext4 /dev/dfc1

e2label /dev/dfc1 u03

LABEL=u03               /u03            ext4            defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0



The test data set is the same 90 GB set used in https://yq.aliyun.com/articles/44655.
xfs
Testing shows performance does not hit an inflection point until 22 parallel workers, and the drop at 64 workers is less pronounced than on EXT4.

postgres=# set max_parallel_degree =64 ;  
postgres=# select count(*) from t_bit2 ;
   count    
------------
 1600000000
(1 row)
Time: 18310.130 ms

postgres=# set max_parallel_degree =32;
postgres=# select count(*) from t_bit2 ;
   count    
------------
 1600000000
(1 row)
Time: 21144.919 ms

postgres=# set max_parallel_degree =17;
postgres=# select count(*) from t_bit2 ;
   count    
------------
 1600000000
(1 row)
Time: 8905.510 ms

postgres=# set max_parallel_degree =21;
postgres=# select count(*) from t_bit2;
   count    
------------
 1600000000
(1 row)
Time: 7583.344 ms

ext4
Testing shows the inflection point already appears at 17 parallel workers.

postgres=# set max_parallel_degree =64;
postgres=# select count(*) from t_bit2 ;
   count    
------------
 1600000000
(1 row)
Time: 32580.853 ms

postgres=# set max_parallel_degree =32;
postgres=# select count(*) from t_bit2 ;
   count    
------------
 1600000000
(1 row)
Time: 30209.980 ms

postgres=# set max_parallel_degree =17;
postgres=# select count(*) from t_bit2 ;
   count    
------------
 1600000000
(1 row)
Time: 9313.369 ms



From these results, XFS beats EXT4: it sustains higher parallelism and delivers better performance.
The test environment was CentOS 6; on CentOS 7, XFS would likely do even better.


XFS's allocation-group design helps parallel write I/O considerably (for example, blocks and inodes can be allocated concurrently in different groups; see the agcount option at mkfs time).
XFS's advantage is therefore even more pronounced under highly concurrent writes, e.g. many instances on one host, or Greenplum, both typical scenarios.

Allocation groups
An XFS filesystem is internally divided into multiple "allocation groups", equal-sized linear regions of the filesystem.
Each allocation group manages its own inodes and free space. Files and directories can span allocation groups.
This mechanism gives XFS its scalability and parallelism: multiple threads and processes can perform I/O on the same filesystem simultaneously.
This internal partitioning via allocation groups is especially useful when a filesystem spans multiple physical devices, making it possible to optimize throughput across the underlying storage components.



References
http://baike.baidu.com/view/1222157.htm
man xfs

agcount=value
This is used to specify the number of allocation groups. 
The data section of the filesystem is divided into allocation groups to improve the performance of XFS. More allocation groups imply that more parallelism can be achieved when allocating blocks and inodes.
The minimum allocation group size is 16 MiB;
the maximum size is just under 1 TiB.
The data section of the filesystem is divided into value allocation groups (the default value is scaled automatically based on the underlying device size).

agsize=value
This is an alternative to using the agcount suboption. 
The value is the desired size of the allocation group expressed in bytes (usually using the m or g suffixes).  
This value must be a multiple of the filesystem block size, and must be at least 16MiB, and no more than 1TiB, and may be automatically adjusted to properly align with the stripe geometry.
The agcount and agsize suboptions are mutually exclusive.

Generating test data for the k-means algorithm in PostgreSQL


Generate test data for k-means.
For example, use multiples of 10,000 as cluster seeds, creating 10 of them, and generate a group of points around each seed by adding a random number below 100.

postgres=# create table test(id int, rand int);
CREATE TABLE

postgres=# insert into test select id*10000,trunc(random()*100 + id*10000) from generate_series(1,10) t(id), generate_series(1,100000) t1(rand);
INSERT 0 1000000

postgres=# select id,count(*) from test group by id;
   id   | count  
--------+--------
  10000 | 100000
  60000 | 100000
  40000 | 100000
  30000 | 100000
  90000 | 100000
  20000 | 100000
 100000 | 100000
  50000 | 100000
  70000 | 100000
  80000 | 100000
(10 rows)
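As a cross-check outside the database, the same seed-data construction can be sketched in Python (a hypothetical equivalent of the SQL insert above; `make_kmeans_data` is an illustrative name, not part of any library):

```python
import random

def make_kmeans_data(centers=10, points_per_center=100000):
    """Mirror the SQL: trunc(random()*100 + id*10000) around 10 seeds."""
    rows = []
    for i in range(1, centers + 1):
        center = i * 10000  # seeds at 10000, 20000, ..., 100000
        for _ in range(points_per_center):
            rows.append((center, center + int(random.random() * 100)))
    return rows

rows = make_kmeans_data(points_per_center=1000)  # smaller sample than the SQL
print(len(rows))  # 10000
# every point lies within 100 of its seed, so k-means fed the right
# seeds should separate the groups perfectly
assert all(0 <= v - c < 100 for c, v in rows)
```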

Clustering directly into 10 classes with kmeans, without supplying seeds, the split is not very accurate.

postgres=# select k,id,count(*) from (select kmeans(array[rand], 10) over () k, id from test) t group by 1,2 order by 1,2;
 k |   id   | count  
---+--------+--------
 0 |  10000 | 100000
 0 |  20000 | 100000
 1 |  30000 |  49707
 2 |  30000 |  50293
 3 |  40000 | 100000
 4 |  50000 | 100000
 5 |  60000 | 100000
 6 |  70000 | 100000
 7 |  80000 |  49871
 8 |  80000 |  50129
 9 |  90000 | 100000
 9 | 100000 | 100000
(12 rows)

With the right seeds, the classification is exact.

postgres=# select k,id,count(*) from (select kmeans(array[rand], 10, array[10000,20000,30000,40000,50000,60000,70000,80000,90000,100000]) over () k, id from test) t group by 1,2 order by 1,2;
 k |   id   | count  
---+--------+--------
 0 |  10000 | 100000
 1 |  20000 | 100000
 2 |  30000 | 100000
 3 |  40000 | 100000
 4 |  50000 | 100000
 5 |  60000 | 100000
 6 |  70000 | 100000
 7 |  80000 | 100000
 8 |  90000 | 100000
 9 | 100000 | 100000
(10 rows)

References
http://blog.163.com/digoal@126/blog/static/163877040201571745048121/
http://pgxn.org/dist/kmeans/

Generating linearly correlated test data in PostgreSQL


Generate linearly correlated test data.
Again, generate_series plus random numbers does the job.
Example
Generate 100,000 random numbers:

select trunc(10000 + 1000000*random()) id from generate_series(1,100000);  

From that set, produce a second column by adding a random offset within +/-5:

select id, trunc(id + 5-random()*10) from 
(select trunc(10000 + 1000000*random()) id from generate_series(1,100000)) t;

Like this:

postgres=# create table corr_test(c1 int, c2 int);
CREATE TABLE
postgres=# insert into corr_test select id, trunc(id + 5-random()*10) from (select trunc(10000 + 1000000*random()) id from generate_series(1,100000)) t;
INSERT 0 100000

The linear correlation looks like this:

postgres=# select corr(id, trunc(id + 5-random()*10)) from (select trunc(10000 + 1000000*random()) id from generate_series(1,100000)) t;
       corr        
-------------------
 0.999999999954681
(1 row)
... ...
postgres=# select corr(id, trunc(id + 5-random()*10)) from (select trunc(10000 + 1000000*random()) id from generate_series(1,100000)) t;
       corr        
-------------------
 0.999999999954898
(1 row)

Test data for multiple (p-variable) regression can be generated with the same approach.
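The correlation can also be verified outside the database; here is a minimal Python sketch using the textbook Pearson formula (the `pearson` helper is written for illustration, not taken from any library):

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# mirror the SQL: base value in [10000, 1010000), second column within +/-5
xs = [int(10000 + 1000000 * random.random()) for _ in range(100000)]
ys = [int(x + 5 - random.random() * 10) for x in xs]
print(pearson(xs, ys))  # extremely close to 1, matching corr() above
```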

What midstream urine samples teach us about database benchmarking


Anyone who has had an annual physical knows: when collecting the urine sample, the doctor tells you to collect midstream urine! Collect midstream urine! Collect midstream urine! Important things get said three times.
Why does urinalysis call for a midstream sample?
Because the first and last portions of the stream are easily contaminated, midstream collection is recommended for both routine urinalysis and urine culture.

Something similar happens in database benchmarking. Take a TPC-C run lasting several days: the tps-over-time curve will typically look like this.
tps climbs slowly at first, then holds a long, stable plateau (with perhaps some brief jitter along the way), and finally performance declines along a very gentle slope.


A brief analysis

Why TPS climbs at the start:
Initially the database's shared buffers are not yet populated, so every query reads directly from the block device, and the speed gap between block device and memory makes this phase slow.
As shared buffers fill up, along with the OS-level cache (if direct I/O is not used), the hit ratio rises, response times drop, and TPS naturally climbs.

Why TPS is stable in the middle:
Shared buffers and the OS cache are filled with hot data, so response times during the TPC-C run are fairly uniform.

Why TPS jitters along the way:
Once enough dirty pages have accumulated, the database has to run a checkpoint, flushing dirty pages from shared buffers to disk, which adds extra staged, batched write I/O (checkpoint I/O tuning is out of scope here; I have written before about PostgreSQL checkpoint optimization). The jitter is most visible when many dirty pages pile up between two checkpoints.

Why TPS declines at the end:
TPC-C involves a lot of updates and inserts.
As rows keep arriving and tables grow, the corresponding indexes may gain levels; deeper indexes mean more blocks scanned per index access, which is one cause of the slowdown.
Also, as data is updated, dead tuples accumulate and indexes can bloat and deepen; the extra pages that must be visited are another cause.



Not every benchmark shows this end-stage decline; read-only workloads, for example, do not.
For insert/update workloads, keeping table sizes under control (e.g. with partitioning) also avoids it, because index depth stays bounded.



For example, consider a set of results like this: each round runs for 2 minutes, n rounds in a row.

    transactions:                        147549 (1035.36 per sec.)
    transactions:                        149521 (1245.82 per sec.)
    transactions:                        159201 (1326.20 per sec.)
    transactions:                        152378 (1268.96 per sec.)
    transactions:                        153969 (1282.87 per sec.)
    transactions:                        154719 (1289.09 per sec.)
    transactions:                        160117 (1333.84 per sec.)
    transactions:                        161628 (1346.59 per sec.)
    transactions:                        160033 (1332.50 per sec.)
    transactions:                        154718 (1289.12 per sec.)
    transactions:                        155586 (1296.09 per sec.)
    transactions:                        153503 (1278.71 per sec.)
    transactions:                        151012 (1258.08 per sec.)
    transactions:                        162499 (1353.82 per sec.)
    transactions:                        153878 (1281.24 per sec.)
    transactions:                        158137 (1317.45 per sec.)
    transactions:                        157630 (1312.76 per sec.)
    transactions:                        151530 (1262.64 per sec.)
    transactions:                        152966 (1274.54 per sec.)
    transactions:                        154235 (1284.25 per sec.)
    transactions:                        153674 (1280.25 per sec.)
    transactions:                        152721 (1272.19 per sec.)
    transactions:                        154113 (1284.07 per sec.)
    transactions:                        162871 (1356.21 per sec.)
    transactions:                        150610 (1254.76 per sec.)
    transactions:                        152196 (1267.36 per sec.)
    transactions:                        158429 (1319.31 per sec.)
    transactions:                        152625 (1271.77 per sec.)
    transactions:                        159619 (1329.89 per sec.)

The recommendation is likewise to take the middle segment: discard the lowest and highest values and average the rest.
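A quick sketch of that rule in Python (the `trimmed_mean` helper is written for illustration), applied to the per-second rates above:

```python
def trimmed_mean(values):
    """Drop the single lowest and highest rounds, average the rest."""
    middle = sorted(values)[1:-1]
    return sum(middle) / len(middle)

# tps of each 2-minute round, taken from the table above
tps = [1035.36, 1245.82, 1326.20, 1268.96, 1282.87, 1289.09, 1333.84,
       1346.59, 1332.50, 1289.12, 1296.09, 1278.71, 1258.08, 1353.82,
       1281.24, 1317.45, 1312.76, 1262.64, 1274.54, 1284.25, 1280.25,
       1272.19, 1284.07, 1356.21, 1254.76, 1267.36, 1319.31, 1271.77,
       1329.89]

# the warm-up outlier 1035.36 (and the single peak) are discarded
print(round(trimmed_mean(tps), 2))
```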

Accurately measuring how much memory a process uses on Linux


On Linux, there is no better place to learn about a process than the proc filesystem.
For a detailed introduction to proc, see the kernel documentation, which covers a lot of ground:

yum install -y kernel-doc
cat /usr/share/doc/kernel-doc-3.10.0/Documentation/filesystems/proc.txt

Main contents:

Table of Contents
-----------------

  0     Preface
  0.1   Introduction/Credits
  0.2   Legal Stuff

  1     Collecting System Information
  1.1   Process-Specific Subdirectories
  1.2   Kernel data
  1.3   IDE devices in /proc/ide
  1.4   Networking info in /proc/net
  1.5   SCSI info
  1.6   Parallel port info in /proc/parport
  1.7   TTY info in /proc/tty
  1.8   Miscellaneous kernel statistics in /proc/stat
  1.9 Ext4 file system parameters

  2     Modifying System Parameters

  3     Per-Process Parameters
  3.1   /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer
                                                                score
  3.2   /proc/<pid>/oom_score - Display current oom-killer score
  3.3   /proc/<pid>/io - Display the IO accounting fields
  3.4   /proc/<pid>/coredump_filter - Core dump filtering settings
  3.5   /proc/<pid>/mountinfo - Information about mounts
  3.6   /proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
  3.7   /proc/<pid>/task/<tid>/children - Information about task children
  3.8   /proc/<pid>/fdinfo/<fd> - Information about opened file

  4     Configuring procfs
  4.1   Mount options

A few entries related to process memory:

 maps           Memory maps to executables and library files    (2.4)
 statm          Process memory status information
 status         Process status in human readable form
 smaps          an extension based on maps, showing the memory consumption of
                each mapping and flags associated with it

Details

status

This gives an overview of the process's memory statistics.
After a program starts, its footprint may include the program's own space, shared memory, mmap regions, malloc allocations, and so on:

 VmPeak                      peak virtual memory size
 VmSize                      total program size
 VmLck                       locked memory size
 VmHWM                       peak resident set size ("high water mark")
 VmRSS                       size of memory portions
 VmData                      size of data, stack, and text segments
 VmStk                       size of data, stack, and text segments
 VmExe                       size of text segment
 VmLib                       size of shared library code
 VmPTE                       size of page table entries
 VmSwap                      size of swap usage (the number of referred swapents)

statm

Memory statistics, in units of pages; the operating system's page size can be obtained with getconf:
getconf PAGE_SIZE

 Field    Content
 size     total program size (pages)            (same as VmSize in status)
 resident size of memory portions (pages)       (same as VmRSS in status)
 shared   number of pages that are shared       (i.e. backed by a file)
 trs      number of pages that are 'code'       (not including libs; broken,
                                                        includes data segment)
 lrs      number of pages of library            (always 0 on 2.6)
 drs      number of pages of data/stack         (including libs; broken,
                                                        includes library text)
 dt       number of dirty pages                 (always 0 on 2.6)
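Since statm reports pages, converting to kilobytes is just a multiplication by the page size; a small illustrative sketch (the `statm_kb` helper and the sample line are made up; the field order follows the table above):

```python
def statm_kb(statm_line, page_size=4096):
    """Convert the first three /proc/<pid>/statm fields from pages to KiB."""
    size, resident, shared = (int(f) for f in statm_line.split()[:3])
    to_kb = page_size // 1024
    return {"size_kB": size * to_kb,
            "resident_kB": resident * to_kb,
            "shared_kB": shared * to_kb}

# on a live system use page_size = os.sysconf("SC_PAGE_SIZE"),
# the same value that `getconf PAGE_SIZE` prints
print(statm_kb("2134 448 330 9 0 81 0"))
# → {'size_kB': 8536, 'resident_kB': 1792, 'shared_kB': 1320}
```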

maps

The process's mappings to executables and library files:

address           perms offset  dev   inode      pathname

08048000-08049000 r-xp 00000000 03:00 8312       /opt/test
08049000-0804a000 rw-p 00001000 03:00 8312       /opt/test
0804a000-0806b000 rw-p 00000000 00:00 0          [heap]
a7cb1000-a7cb2000 ---p 00000000 00:00 0
a7cb2000-a7eb2000 rw-p 00000000 00:00 0
a7eb2000-a7eb3000 ---p 00000000 00:00 0
a7eb3000-a7ed5000 rw-p 00000000 00:00 0          [stack:1001]
a7ed5000-a8008000 r-xp 00000000 03:00 4222       /lib/libc.so.6
a8008000-a800a000 r--p 00133000 03:00 4222       /lib/libc.so.6
a800a000-a800b000 rw-p 00135000 03:00 4222       /lib/libc.so.6
a800b000-a800e000 rw-p 00000000 00:00 0
a800e000-a8022000 r-xp 00000000 03:00 14462      /lib/libpthread.so.0
a8022000-a8023000 r--p 00013000 03:00 14462      /lib/libpthread.so.0
a8023000-a8024000 rw-p 00014000 03:00 14462      /lib/libpthread.so.0
a8024000-a8027000 rw-p 00000000 00:00 0
a8027000-a8043000 r-xp 00000000 03:00 8317       /lib/ld-linux.so.2
a8043000-a8044000 r--p 0001b000 03:00 8317       /lib/ld-linux.so.2
a8044000-a8045000 rw-p 0001c000 03:00 8317       /lib/ld-linux.so.2
aff35000-aff4a000 rw-p 00000000 00:00 0          [stack]
ffffe000-fffff000 r-xp 00000000 00:00 0          [vdso]

1. "address" is the address space in the process that it occupies; "perms" is a set of permissions:

 r = read
 w = write
 x = execute
 s = shared
 p = private (copy on write)

2. "offset" is the offset into the mapping,

3. "dev" is the device (major:minor),

4. "inode" is the inode on that device.

0 indicates that no inode is associated with the memory region, as would be the case with BSS (uninitialized data).

5. The "pathname" shows the name of the file associated with this mapping.
If the mapping is not associated with a file:

 [heap]                   = the heap of the program
 [stack]                  = the stack of the main process
 [stack:1001]             = the stack of the thread with tid 1001
 [vdso]                   = the "virtual dynamic shared object",
                            the kernel system call handler

 or if empty, the mapping is anonymous.

smaps

Per-mapping memory consumption in detail:

08048000-080bc000 r-xp 00000000 03:02 13130      /bin/bash
Size:               1084 kB
Rss:                 892 kB
Pss:                 374 kB
Shared_Clean:        892 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:          892 kB
Anonymous:             0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:              374 kB
VmFlags: rd ex mr mw me de

1. the size of the mapping (size),
2. the amount of the mapping that is currently resident in RAM (RSS),
3. the process' proportional share of this mapping (PSS),
4. the number of clean and dirty private pages in the mapping.
Note that even a page which is part of a MAP_SHARED mapping, but has only a single pte mapped, i.e. is currently used by only one process, is accounted as private and not as shared.
5. "Referenced" indicates the amount of memory currently marked as referenced or accessed.
6. "Anonymous" shows the amount of memory that does not belong to any file.
Even a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE and a page is modified, the file page is replaced by a private anonymous copy.
7. "Swap" shows how much would-be-anonymous memory is also used, but out on swap.
8. The "VmFlags" field deserves a separate description.
This member represents the kernel flags associated with the particular virtual memory area, encoded in two-letter codes.
The codes are the following:
    rd  - readable
    wr  - writeable
    ex  - executable
    sh  - shared
    mr  - may read
    mw  - may write
    me  - may execute
    ms  - may share
    gd  - stack segment grows down
    pf  - pure PFN range
    dw  - disabled write to the mapped file
    lo  - pages are locked in memory
    io  - memory mapped I/O area
    sr  - sequential read advise provided
    rr  - random read advise provided
    dc  - do not copy area on fork
    de  - do not expand area on remapping
    ac  - area is accountable
    nr  - swap space is not reserved for the area
    ht  - area uses huge tlb pages
    nl  - non-linear mapping
    ar  - architecture specific flag
    dd  - do not include area into core dump
    mm  - mixed map area
    hg  - huge page advise flag
    nh  - no-huge page advise flag
    mg  - mergable advise flag



Generally speaking, the memory used by a business process falls into the following categories:
(1) Anonymous pages in user-mode address spaces, e.g. memory allocated with malloc, or mmap with MAP_ANONYMOUS; when the system runs low on memory, the kernel can swap these out.
(2) Mapped pages in user-mode address spaces, covering both file mappings and tmpfs mappings; the former e.g. an mmap of a named file, the latter e.g. IPC shared memory. When memory is low the kernel can reclaim these pages, though it may first have to sync data back to the file.
(3) Pages in the page cache of disk files, created when a program reads or writes files via ordinary read/write; reclaimable under memory pressure, possibly after syncing data back to the file.
(4) Buffer pages, which also belong to the page cache, e.g. from reading block device files.


A process's RSS is all the physical memory it uses (file_rss + anon_rss), i.e. anonymous pages + mapped pages (including shared memory).

Resident Set Size: 
  number of pages the process has in real memory.  
  This is just the pages which count toward text, data, or stack space.  
  This does not include pages which have not been demand-loaded in,  
  or which are swapped out.  

Obviously, summing RSS across all processes can exceed the actual amount of RAM, because RSS contains duplication: shared memory, for example, is counted again in every process that maps it.
smaps makes it very convenient to eliminate that duplication.

For example, if several processes have loaded the same library file, that memory is apportioned among them; the apportioned shared part plus the process's private memory is reported as Pss.

Pss:                 374 kB

Private memory is accounted in the Private fields:

Private_Clean:         0 kB
Private_Dirty:         0 kB

Linux has a tool called smem whose statistics are in fact derived from smaps.
PSS is the sum of the Pss fields;
USS is the sum of the Private fields.

yum install -y smem smemstat

smem can report proportional set size (PSS), which is a more meaningful representation of the amount of memory used by libraries and applications in a virtual memory system.

Because large portions of physical memory are typically shared among multiple applications, the standard measure of memory usage known as resident set size (RSS) will significantly overestimate memory usage. PSS instead measures each application's "fair share" of each shared area to give a realistic measure.

Example

smem
  PID User     Command                         Swap      USS      PSS      RSS 
23716 digoal   postgres: postgres postgres        0     4924     5387     7040 

The RSS, PSS, and USS columns equal the sums of the corresponding smaps fields:
# cat /proc/23716/smaps | grep Rss
# cat /proc/23716/smaps | grep Pss
# cat /proc/23716/smaps | grep Private_
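Those three greps can be summed directly; a rough Python sketch (the `smaps_totals` helper is illustrative; the sample text is the bash mapping shown earlier):

```python
def smaps_totals(smaps_text):
    """Sum Rss, Pss and Private_* fields (in kB) from /proc/<pid>/smaps text."""
    rss = pss = uss = 0
    for line in smaps_text.splitlines():
        if line.startswith("Rss:"):
            rss += int(line.split()[1])
        elif line.startswith("Pss:"):
            pss += int(line.split()[1])
        elif line.startswith("Private_"):  # Private_Clean + Private_Dirty
            uss += int(line.split()[1])
    return rss, pss, uss

sample = """\
08048000-080bc000 r-xp 00000000 03:02 13130      /bin/bash
Size:               1084 kB
Rss:                 892 kB
Pss:                 374 kB
Shared_Clean:        892 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
"""
print(smaps_totals(sample))  # (892, 374, 0)
```

On a live system you would feed it `open(f"/proc/{pid}/smaps").read()` instead of the sample string.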


Other references

https://www.selenic.com/smem/
http://hustcat.github.io/memory-usage-in-process-and-cgroup/
http://blog.hellosa.org/2010/02/26/pmap-process-memory.html

First, ps to see which processes my system is running:

$ ps aux

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
...
czbug     1980  0.0  1.7 180472 34416 ?        Sl   Feb25   0:01 /usr/bin/yakuake
...

I'll take the small program yakuake as an example.

The memory-related columns are VSZ and RSS.

man ps explains what they mean:

rss       RSS    resident set size, the non-swapped physical memory that a task has used (in kiloBytes). (alias rssize, rsz).

vsz       VSZ    virtual memory size of the process in KiB (1024-byte units). Device mappings are currently excluded; this is subject to change. (alias vsize).

Put simply, RSS is the physical memory the process actually occupies, while VSZ is the process's virtual memory, which includes memory the process is not using now but may allocate later.

The ps output here is actually a bit misleading: if you add up the RSS of every program, it can easily exceed your physical RAM. Why? Because the RSS column in ps includes shared memory. Let's take a look with pmap.

$ pmap -d 1980

1980:   /usr/bin/yakuake
Address   Kbytes Mode  Offset           Device    Mapping
00110000    2524 r-x-- 0000000000000000 008:00002 libkio.so.5.3.0
00387000       4 ----- 0000000000277000 008:00002 libkio.so.5.3.0
00388000      32 r---- 0000000000277000 008:00002 libkio.so.5.3.0
00390000      16 rw--- 000000000027f000 008:00002 libkio.so.5.3.0
00394000     444 r-x-- 0000000000000000 008:00002 libQtDBus.so.4.5.2
00403000       4 ----- 000000000006f000 008:00002 libQtDBus.so.4.5.2
00404000       4 r---- 000000000006f000 008:00002 libQtDBus.so.4.5.2
00405000       4 rw--- 0000000000070000 008:00002 libQtDBus.so.4.5.2
00407000     228 r-x-- 0000000000000000 008:00002 libkparts.so.4.3.0
00440000       8 r---- 0000000000039000 008:00002 libkparts.so.4.3.0
00442000       4 rw--- 000000000003b000 008:00002 libkparts.so.4.3.0
00443000    3552 r-x-- 0000000000000000 008:00002 libkdeui.so.5.3.0
007bb000      76 r---- 0000000000377000 008:00002 libkdeui.so.5.3.0
007ce000      24 rw--- 000000000038a000 008:00002 libkdeui.so.5.3.0
007d4000       4 rw--- 0000000000000000 000:00000   [ anon ]
....
mapped: 180472K    writeable/private: 19208K    shared: 20544K

I have omitted part of the output (it is all similar); the key is the last line.

Linux loads shared libraries into memory; in pmap output their names are usually lib*.so, e.g. libX11.so.6.2.0. That library is loaded into many processes' address spaces, and the RSS that ps reports includes it for every one of those processes, even though it was actually loaded only once, so naively summing ps output double-counts it.

In the pmap output, writeable/private: 19208K is the physical memory yakuake truly occupies on its own, excluding shared libraries. Here it is only 19208K, whereas ps reported an RSS of 34416K.

While reading up on this I also came across some notes on virtual memory, recorded here.

Either of the following commands shows the vmsize:

$ cat /proc/<pid>/stat | awk '{print $23 / 1024}'
$ cat /proc/<pid>/status | grep -i vmsize

Normally the value you get matches the VSZ column of ps, with one exception: the X server.

An example:

$ ps aux|grep /usr/bin/X|grep -v grep | awk '{print $2}'   # get the X server's pid   ...
1076

$ cat /proc/1076/stat | awk '{print $23 / 1024}'
139012

$ cat /proc/1076/status | grep -i vmsize
VmSize:      106516 kB

Here ps reports a VSZ of 106516, which agrees with the latter.

Reportedly, this is because

VmSize = memory + memory-mapped hardware (e.g. video card memory).


Preventing short-lived connections from exhausting your ephemeral TCP ports


Use pgbench to stress a PostgreSQL database with short-lived connections (select 1); other databases behave the same way.

$ vi test.sql
select 1;

$ export PGPASSWORD=digoal
$ pgbench -M simple -C -n -r -P 1 -c 800 -j 80 -T 1000 -h xxx.xxx.xxx.xxx -p xxxx -U xxx dbname

After some time under load, the client may exhaust its local (client-side) ports and report errors such as:

connection to database "postgres" failed:
could not connect to server: Cannot assign requested address
        Is the server running on host "xxx.xxx.xxx.xxx" and accepting
        TCP/IP connections on port 1925?
connection to database "postgres" failed:
could not connect to server: Cannot assign requested address
        Is the server running on host "xxx.xxx.xxx.xxx" and accepting
        TCP/IP connections on port 1925?

The reason: the client must dynamically allocate a TCP port for every connection, so each connection consumes one port.
After the client actively closes a connection, the socket enters the TIME_WAIT state.


See the TCP protocol for details:
https://en.wikipedia.org/wiki/Transmission_Control_Protocol
(TCP state diagram)
TIME_WAIT has a time window, though; on Linux the default is 60 seconds.
So continuously creating and closing TCP sessions can lead to the problem described above.
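The arithmetic behind the exhaustion is worth spelling out. Assuming the common Linux default ephemeral port range of 32768-60999 (net.ipv4.ip_local_port_range; check your own system) and the 60-second TIME_WAIT window:

```python
low, high = 32768, 60999      # assumed default ephemeral port range
time_wait = 60                # seconds a closed socket lingers in TIME_WAIT

ports = high - low + 1        # client ports available per (src ip, dst ip, dst port)
max_rate = ports / time_wait  # sustainable new short connections per second
print(ports, round(max_rate, 1))  # 28232 470.5
```

Anything above roughly that rate of connect/disconnect cycles will eventually hit "Cannot assign requested address", which is exactly what the pgbench run above provokes with 800 short-lived connections.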


On a Linux client, tuning a few kernel parameters works around the problem:

net.ipv4.tcp_syncookies=1   # enable SYN cookies: when the SYN backlog overflows, fall back to cookies, which also mitigates light SYN floods
net.ipv4.tcp_tw_recycle=1   # enable fast recycling of TIME-WAIT sockets
net.ipv4.tcp_tw_reuse=1     # enable reuse: allow TIME-WAIT sockets to be reused for new TCP connections
net.ipv4.tcp_timestamps=1   # reduce time_wait buildup (tw_reuse/tw_recycle depend on timestamps)
net.ipv4.tcp_tw_timeout=3   # shrink the recycling window of TIME_WAIT sockets (a vendor-kernel parameter, not in mainline Linux)

PostgreSQL V8.4 Warm-Standby design and implement

Although PostgreSQL 9.0 is already quite mature, many companies are still running PostgreSQL 8.4.
So how do you design and implement a database standby and a backup strategy on 8.4? Here is a simple case study:
(architecture diagram: PostgreSQL V8.4 Warm-Standby design and implement - 德哥@Digoal)

The setup involves three hosts; in practice, the backup host and the standby host can be the same machine.
The configuration follows:
Primary node 10.0.0.10
Remote NFS archive directory (read/write for user postgres): /data/recoverydir/pg_arch
Local archive directory (read/write for user postgres): /data1/archivedir/pg_arch
Backup directory (read/write for user postgres): /data1/pg_backup
Log directory (read/write for user postgres): /data1/pg_run_log ; ln -s /data1/pg_run_log /var/log/pg_run_log
  df
/dev/sdb1 670G 48G 589G 8% /data1
10.0.0.20:/data1/archivedir 670G 66G 571G 11% /data/recoverydir
  /etc/fstab
10.0.0.20:/data1/archivedir /data/recoverydir nfs  rw,rsize=8192,wsize=8192,noatime       1 3
  /etc/exports
/data1/archivedir 10.0.0.20/32(rw,sync,wdelay,no_root_squash,anonuid=0,anongid=0)
exportfs -av
Pin the NFS listening ports, e.g. by editing /etc/services:
# Local services
mountd 845/tcp #rpc.mountd
mountd 842/udp #rpc.mountd
rquotad 790/tcp #rpc.rquotad
rquotad 787/udp #rpc.rquotad


Standby node 10.0.0.20
Remote NFS archive directory (read/write for user postgres): /data/recoverydir/pg_arch
Local archive directory (read/write for user postgres): /data1/archivedir/pg_arch
Backup directory (read/write for user postgres): /data1/pg_backup
Log directory (read/write for user postgres): /data1/pg_run_log ; ln -s /data1/pg_run_log /var/log/pg_run_log
  df
/dev/sdb1 670G 48G 589G 8% /data1
10.0.0.10:/data1/archivedir 670G 66G 571G 11% /data/recoverydir
  /etc/fstab
10.0.0.10:/data1/archivedir /data/recoverydir nfs  rw,rsize=8192,wsize=8192,noatime       1 3
  /etc/exports
/data1/archivedir 10.0.0.10/32(rw,sync,wdelay,no_root_squash,anonuid=0,anongid=0)
Pin the NFS listening ports, e.g. by editing /etc/services:
# Local services
mountd 845/tcp #rpc.mountd
mountd 842/udp #rpc.mountd
rquotad 790/tcp #rpc.rquotad
rquotad 787/udp #rpc.rquotad


# On the standby node, configure rsync
# Rsyncd On Standby Database Server & Backup Server
cat /etc/rsyncd.postgres.conf 
# Rsyncd On Standby Database Server & Backup Server
port = 873
hosts deny = 0.0.0.0/0
read only = false
write only = false
gid = 0
uid = 0
use chroot = no
max connections = 10
pid file = /var/run/rsync.pid
lock file = /var/run/rsync.lock
log file = /var/log/rsync.log

[pgdata]
path = /data1/pg_data
comment = Building Database Dir.
hosts allow = 10.0.0.20,10.0.0.10

[pgbackup]
path = /data1/pg_backup
comment = Database Backup Dir.
hosts allow = 10.0.0.20,10.0.0.10

# On the standby node, start the rsync daemon; if the configuration changes, run the same command again to reload it
rsync -v --daemon --config=/etc/rsyncd.postgres.conf

# On the primary node, enable archiving
fsync = on
full_page_writes = on
archive_mode = on
archive_command = 'cp -f %p /data1/archivedir/pg_arch/%f 2>>/var/log/pg_run_log/archive_cp_5432.log' 
archive_timeout = 300

# If archive_mode was previously off, this change requires a database restart; otherwise a config reload is enough

# On the standby node, build the standby, step 1: remove any existing data directory (we will rebuild it)
rm -rf /data1/pg_data
# On the primary node, sync the data files over to the standby
psql -h 127.0.0.1 postgres postgres -c "select pg_start_backup(now()::text);"
rsync -acvz --exclude=pg_xlog /data1/pg_data/* 10.0.0.20::pgdata
psql -h 127.0.0.1 postgres postgres -c "select pg_stop_backup();"

# On the standby node, build the standby, step 2: configure the standby
# recovery.conf
restore_command = 'pg_standby -d -s 2 -t /tmp/pgsql.trigger.5432 /data/recoverydir/pg_arch %f %p 2>>/var/log/pg_run_log/restore_standby_5432.log'
recovery_target_timeline = 'latest'
recovery_end_command = 'rm -f /tmp/pgsql.trigger.5432'
# Create the directories
mkdir /data1/pg_data/pg_xlog
rm -f /data1/pg_data/recovery.done
chown -R postgres:postgres /data1/pg_data
chmod -R 700 /data1/pg_data
# Start the standby; it enters recovery mode
pg_ctl start -D /data1/pg_data

# At this point the standby is up.

Now let's talk about backups:

# On the primary node, back up the database once a week; crontab: 1 3 * * 2 /usr/local/postgres/backupsh/backup_database_5432.sh >>/var/log/pg_run_log/backup_database_5432.log 2>&1
vi /usr/local/postgres/backupsh/backup_database_5432.sh
chmod 500 /usr/local/postgres/backupsh/backup_database_5432.sh
#!/bin/bash
DATE=`date +%Y%m%d`
TIME=`date +%F`
echo -e "$TIME : select pg_start_backup();\n"
psql -h 127.0.0.1 postgres postgres -c "select pg_start_backup(now()::text);"
RESULT=$?
echo -e "$TIME : select pg_start_backup(); result:$RESULT\n"
if [ $RESULT -ne 0 ]; then
exit $RESULT
fi
echo -e "$TIME : Backup Database use rsync.\n"
rsync -acvz --exclude=pg_xlog /data1/pg_data/* 10.0.0.20::pgbackup/data_backup_$DATE
RESULT=$?
echo -e "$TIME : Backup Database use rsync. result:$RESULT\n"
echo -e "$TIME : select pg_stop_backup();\n"
psql -h 127.0.0.1 postgres postgres -c "select pg_stop_backup();"
RESULT=$?
echo -e "$TIME : select pg_stop_backup(); result:$RESULT\n"
exit $RESULT

# On the primary node, back up the archived WAL once a day; crontab: 1 1 * * * /usr/local/postgres/backupsh/backup_archive_5432.sh >>/var/log/pg_run_log/backup_archive_5432.log 2>&1
vi /usr/local/postgres/backupsh/backup_archive_5432.sh
chmod 500 /usr/local/postgres/backupsh/backup_archive_5432.sh
#!/bin/bash
TIME=`date +%F`
echo -e "$TIME : Backup archivelog use rsync.\n"
rsync -acvz /data1/archivedir/pg_arch/* 10.0.0.20::pgbackup/arch_backup
RESULT=$?
echo -e "$TIME : Backup archivelog use rsync. result:$RESULT\n"
exit $RESULT

# At this point, the backups are in place
Next, a policy to periodically purge historical backups. Assume we purge backups older than 15 days.

# On the standby node, purge data beyond the retention window, once a day. crontab: 1 6 * * * /usr/local/postgres/backupsh/purge_backup_5432.sh >>/var/log/pg_run_log/purge_backup_5432.log 2>&1
vi /usr/local/postgres/backupsh/purge_backup_5432.sh
chmod 500 /usr/local/postgres/backupsh/purge_backup_5432.sh
#!/bin/bash
TIME=`date +%F`
echo -e "$TIME : Delete Archivelog From Primary Host Dir.\n"
find /data/recoverydir/pg_arch/* -mtime +15 -exec rm -rf {} \;
echo -e "$TIME : Delete Archivelog From Backup Host Dir.\n"
find /data1/pg_backup/arch_backup/* -mtime +15 -exec rm -rf {} \;
echo -e "$TIME : Delete Datafile From Backup Host Dir.\n"
for i in `ls -1rt /data1/pg_backup/|grep data_backup_|head --lines=-1`
do
rm -rf /data1/pg_backup/$i
done
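Before pointing `find -mtime +15` at real backups, the retention logic can be rehearsed in a sandbox. A minimal sketch using temporary paths (not the real /data1/pg_backup directories); `touch -d` here assumes GNU coreutils:

```shell
#!/bin/bash
# Rehearse the 15-day retention rule against fake files.
tmp=$(mktemp -d)
touch -d '20 days ago' "$tmp/arch_old"   # beyond the 15-day window
touch "$tmp/arch_new"                    # fresh file, must survive
find "$tmp"/* -mtime +15 -exec rm -rf {} \;
ls "$tmp"                                # prints: arch_new
rm -rf "$tmp"
```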

# At this point, purging of historical backups is also deployed.
Next, PITR: a point-in-time recovery test.

# On the standby node, recovery test: restore the data
mkdir /data1/pg_backup/data_backup_$DATE/pg_xlog
chown -R postgres:postgres /data1/pg_backup
chmod 700 /data1/pg_backup/data_backup_$DATE
rm -f /data1/pg_backup/data_backup_$DATE/recovery.done
# Edit postgresql.conf; change the listen port so it does not clash with ports already in use locally.
port = 5433
archive_command = 'cp -f %p /data1/archivedir/pg_arch/%f 2>>/var/log/pg_run_log/archive_cp_5433.log'
# Edit the recovery configuration file, recovery.conf
restore_command = 'pg_standby -d -s 2 -t /tmp/pgsql.trigger.5433 /data/recoverydir/pg_arch %f %p 2>>/var/log/pg_run_log/restore_standby_5433.log'
# recovery_target_timeline = 'latest'
# The time below comes from the database's select now() style output, then adjusted as needed;
recovery_target_time = '2011-07-20 11:13:15.64642+02'
recovery_target_inclusive = 'true'
recovery_end_command = 'rm -f /tmp/pgsql.trigger.5433'

# Start the backup instance to begin recovery
pg_ctl start -D /data1/pg_backup/data_backup_$DATE
# Trigger activation (choose one of the three; an empty file defaults to smart. fast means activate immediately without looking for the next WAL file; smart means keep fetching and applying the next WAL file until none remains, then activate)
touch /tmp/pgsql.trigger.5433
echo "fast" > /tmp/pgsql.trigger.5433
echo "smart" > /tmp/pgsql.trigger.5433
# On the standby node, recovery test: REINDEX hash indexes

# This concludes the PITR test.
Next, which logs should be monitored.

# Log files to monitor
# Primary node
/var/log/pg_run_log/archive_cp_5432.log
/var/log/pg_run_log/backup_database_5432.log
/var/log/pg_run_log/backup_archive_5432.log
# Standby node
/var/log/pg_run_log/restore_standby_5432.log
/var/log/rsync.log
/var/log/pg_run_log/purge_backup_5432.log

【Note】
Because a backup puts some load on the system's I/O, it is recommended to prefix the backup script, or the rsync inside it, with nice -n 19. This minimizes the impact on the system.
Nicenesses range from -20 (most favorable scheduling) to 19 (least favorable).
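A quick way to see the effect, since `nice` with no command prints the current niceness:

```shell
# nice -n 19 gives the wrapped command the lowest CPU priority.
# Verify: the inner `nice` prints the niceness it runs at.
nice -n 19 nice        # prints: 19
# In the backup script this becomes, e.g.:
# nice -n 19 rsync -acvz --exclude=pg_xlog /data1/pg_data/* 10.0.0.20::pgbackup/data_backup_$DATE
```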

【Supplement】
Other test points:
crontab test of the data-file backup script
crontab test of the archived-WAL backup script
crontab test of the script that purges data beyond the retention window
point-in-time recovery test; compare smart vs fast activation and check they produce consistent results
standby activation test
storage space estimation
log file monitoring
whether recovery and standby creation still work after the timeline history files are deleted

【References】
pg_standby activation modes : 
Smart Failover
In smart failover, the server is brought up after applying all WAL files available in the archive. This results in zero data loss, even if the standby server has fallen behind, but if there is a lot of unapplied WAL it can be a long time before the standby server becomes ready. To trigger a smart failover, create a trigger file containing the word smart, or just create it and leave it empty.

Fast Failover
In fast failover, the server is brought up immediately. Any WAL files in the archive that have not yet been applied will be ignored, and all transactions in those files are lost. To trigger a fast failover, create a trigger file and write the word fast into it. pg_standby can also be configured to execute a fast failover automatically if no new WAL file appears within a defined interval.

# Recovery file format, recovery.conf example
#recovery_target_timeline = '33'  # number or 'latest'
#recovery_target_inclusive = 'true'             # 'true' or 'false'
#recovery_target_xid = '1100842'
#recovery_target_time = '2004-07-14 22:39:00 EST' # format as output by select now();
#recovery_end_command = ''
#restore_command = 'cp /mnt/server/archivedir/%f %p'

【Caveats】

24.3.6. Caveats

At this writing, there are several limitations of the continuous archiving technique. These will probably be fixed in future releases:

  • Operations on hash indexes are not presently WAL-logged, so replay will not update these indexes. The recommended workaround is to manually REINDEX each such index after completing a recovery operation.

  • If a CREATE DATABASE command is executed while a base backup is being taken, and then the template database that the CREATE DATABASE copied is modified while the base backup is still in progress, it is possible that recovery will cause those modifications to be propagated into the created database as well. This is of course undesirable. To avoid this risk, it is best not to modify any template databases while taking a base backup.

  • CREATE TABLESPACE commands are WAL-logged with the literal absolute path, and will therefore be replayed as tablespace creations with the same absolute path. This might be undesirable if the log is being replayed on a different machine. It can be dangerous even if the log is being replayed on the same machine, but into a new data directory: the replay will still overwrite the contents of the original tablespace. To avoid potential gotchas of this sort, the best practice is to take a new base backup after creating or dropping tablespaces.

It should also be noted that the default WAL format is fairly bulky since it includes many disk page snapshots. These page snapshots are designed to support crash recovery, since we might need to fix partially-written disk pages. Depending on your system hardware and software, the risk of partial writes might be small enough to ignore, in which case you can significantly reduce the total volume of archived logs by turning off page snapshots using the full_page_writes parameter. (Read the notes and warnings in Chapter 28 before you do so.) Turning off page snapshots does not prevent use of the logs for PITR operations. An area for future development is to compress archived WAL data by removing unnecessary page copies even when full_page_writes is on. In the meantime, administrators might wish to reduce the number of page snapshots included in WAL by increasing the checkpoint interval parameters as much as feasible.

a simple skill : postgresql slow sql report

Strictly speaking this is not purely a PostgreSQL trick; it is really a shell trick.
First, configure postgresql.conf so that PG records slow queries in a fixed log format, as follows:
log_destination = 'csvlog'
log_min_duration_statement = 100ms
pg_ctl reload -D $PGDATA

Here is a counting example:
vi digoal.sh

#!/bin/bash
if [ $# -ne 2 ]; then
echo "Usage: $0 <logfile-glob> <threshold-ms>"
exit 1
fi

file=$1
slow=$2
cnt=0
for i in `grep duration $file|grep SELECT|awk '{print $6}'|awk -F "." '{print $1}'`
do
if [ $i -gt $slow ]; then
cnt=$(($cnt+1))
fi
done
echo "Count Slow Sql (>$slow ms) In The $file : $cnt"

chmod 500 digoal.sh

For example, count how many slow queries on July 14th took more than 500 ms:
./digoal.sh "/var/log/pg_log/postgresql-2011-07-14*.csv" 500
Count Slow Sql (>500 ms) In The /var/log/pg_log/postgresql-2011-07-14*.csv : 13324
That is 13,324 SQL statements taking more than 500 ms in a single day.
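The loop above forks a comparison per log line; the same count can be done in one awk pass. A sketch under the same format assumption the original script makes (the number follows the token "duration:"); count_slow is a hypothetical helper:

```shell
#!/bin/bash
# One-pass variant: let awk do the filtering and counting instead of
# spawning a shell test per line.
count_slow() {
    awk -v slow="$1" '
        /duration/ && /SELECT/ {
            # find the "duration:" token and compare the number after it
            for (i = 1; i < NF; i++)
                if ($i == "duration:") { if ($(i+1) + 0 > slow) n++; break }
        }
        END { print n + 0 }'
}
# demo with two fake log lines:
printf '%s\n' \
  'LOG: duration: 650.123 ms statement: SELECT 1' \
  'LOG: duration: 120.000 ms statement: SELECT 2' | count_slow 500
# prints: 1
```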



PostgreSQL's pgsql_tmp is like Oracle's temp tablespace

  In Oracle, different users can be assigned different default temporary tablespaces.
  In PostgreSQL, the temporary directory pgsql_tmp lives in the database's default tablespace. If no default tablespace was specified when the database was created, pgsql_tmp is located at
$PGDATA/base/pgsql_tmp; if one was specified, it lives inside that default tablespace.
For example : 
digoal=> \db
                   List of tablespaces
    Name    |  Owner   |            Location             
------------+----------+---------------------------------
 pg_default | postgres | 
 pg_global  | postgres | 
 tbs_digoal | digoal   | /home/pgdata/pg_root/tbs_digoal
The default tablespace of database digoal is tbs_digoal, so /home/pgdata/pg_root/tbs_digoal must contain a pgsql_tmp directory.
postgres@db5-> ll /home/pgdata/pg_root/tbs_digoal/PG_9.1_201105231/
total 16K
drwx------ 2 postgres postgres  12K Jul 22 15:52 16386
drwx------ 2 postgres postgres 4.0K Jul 22 16:29 pgsql_tmp

  pgsql_tmp serves a purpose similar to Oracle's temp tablespace: it stores temporary data produced by sorts, hash tables and the like, e.g. from order by, group by, distinct and merge join operations.
  Files in the pgsql_tmp directory are named as follows:
pgsql_tmpPPP.NNN
where PPP is the PID of the backend process and NNN is the sequence number of the temp file created by that backend.
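A small shell sketch of splitting such a name back into its parts, handy when matching temp files against backend PIDs:

```shell
# Split a pgsql_tmpPPP.NNN file name into backend PID and sequence
# number using plain parameter expansion (no external tools needed).
f=pgsql_tmp23123.0
pid=${f#pgsql_tmp}; pid=${pid%%.*}   # strip prefix, then the .NNN tail
seq=${f##*.}                         # keep only what follows the dot
echo "backend pid=$pid file#=$seq"   # prints: backend pid=23123 file#=0
```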
Usage of pgsql_tmp is closely tied to the work_mem parameter: when work_mem is small, a sort that exceeds the size work_mem allows will spill into pgsql_tmp. Moreover, PostgreSQL does not guarantee that pgsql_tmp is avoided whenever work_mem would suffice; sometimes it is used anyway. The examples below show when that happens:

First, create some test data : 
digoal=> create table user_info (id bigint,firstname text,lastname text,corp text,post text,age int,crt_time timestamp without time zone,comment text);
CREATE TABLE
digoal=> insert into user_info select generate_series(1,5000000),'zhou','digoal'||generate_series(2,5000001),'sky-mobi','dba team leader',28,clock_timestamp(),'flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj'||generate_series(3,5000002);
INSERT 0 5000000
digoal=> analyze user_info;
ANALYZE
digoal=> select pg_relation_size('user_info')/1024/1024;
 ?column? 
----------
      831
The next table is mainly for estimating roughly how much space the ID column occupies. (Shortly we will group by / distinct on ID, and can then see how much temporary space the database needs for such operations. You will find it needs about as much space as the data actually being sorted.)
digoal=> create table user_id (id bigint);
CREATE TABLE
digoal=> insert into user_id select generate_series(1,5000000);
INSERT 0 5000000
digoal=> select pg_relation_size('user_id')/1024/1024;
 ?column? 
----------
      172
Because of page and tuple header overhead, the ID data effectively occupies about 120MB.
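A quick back-of-envelope check of that overhead claim: 5,000,000 bigint values are 8 bytes each, yet the table measures 172MB, i.e. about 36 bytes per row, which matches a 24-byte tuple header plus a 4-byte line pointer plus the 8-byte value:

```shell
# Arithmetic check for the 172MB user_id table (5,000,000 bigint rows).
rows=5000000
raw=$(( rows * 8 / 1024 / 1024 ))        # payload only, in MB
per_row=$(( 172 * 1024 * 1024 / rows ))  # observed bytes per row
echo "raw payload ~${raw}MB, observed ~${per_row} bytes/row"
# prints: raw payload ~38MB, observed ~36 bytes/row
```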

Check the current work_mem setting : 
digoal=> show work_mem;
 work_mem 
----------
 1MB

Since the ID data is about 120MB, grouping or DISTINCT on ID will, as we would expect, have to use pgsql_tmp. The following confirms it.
# Sort Method: external sort  Disk: 87984kB
digoal=> explain analyze verbose select distinct id from user_info;
                                                                QUERY PLAN                                                          
      
------------------------------------------------------------------------------------------------------------------------------------
------
 Unique  (cost=859215.04..884215.22 rows=5000036 width=8) (actual time=10747.954..20153.818 rows=5000000 loops=1)
   Output: id
   ->  Sort  (cost=859215.04..871715.13 rows=5000036 width=8) (actual time=10747.951..14051.498 rows=5000000 loops=1)
         Output: id
         Sort Key: user_info.id
         Sort Method: external sort  Disk: 87984kB
         ->  Seq Scan on digoal.user_info  (cost=0.00..156383.36 rows=5000036 width=8) (actual time=0.014..3689.517 rows=5000000 loo
ps=1)
               Output: id
 Total runtime: 22895.580 ms
(9 rows)

While this executes, a file pgsql_tmp23123.0 of size 87984kB is created in /home/pgdata/pg_root/tbs_digoal/PG_9.1_201105231/pgsql_tmp.
It is deleted automatically once execution finishes.

Now raise work_mem to 120MB and see whether pgsql_tmp is still used:
digoal=> set work_mem='120 MB';
SET
digoal=> explain analyze verbose select distinct id from user_info;
                                                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=168883.45..218883.81 rows=5000036 width=8) (actual time=8561.560..13222.270 rows=5000000 loops=1)
   Output: id
   ->  Seq Scan on digoal.user_info  (cost=0.00..156383.36 rows=5000036 width=8) (actual time=0.021..3665.240 rows=5000000 loops=1)
         Output: id
 Total runtime: 16016.834 ms
Clearly no disk file is used anymore.

Now change the SQL to distinct id,comment and see whether disk is used. Since id+comment, the content being sorted, exceeds 120MB, we expect disk to be used.
The result matches our expectation:
digoal=> explain analyze verbose select distinct id,comment from user_info;
                                                                QUERY PLAN                                                          
       
------------------------------------------------------------------------------------------------------------------------------------
-------
 Unique  (cost=883625.04..921125.31 rows=5000036 width=83) (actual time=14416.251..25803.966 rows=5000000 loops=1)
   Output: id, comment
   ->  Sort  (cost=883625.04..896125.13 rows=5000036 width=83) (actual time=14416.248..18399.329 rows=5000000 loops=1)
         Output: id, comment
         Sort Key: user_info.id, user_info.comment
         Sort Method: external merge  Disk: 458352kB
         ->  Seq Scan on digoal.user_info  (cost=0.00..156383.36 rows=5000036 width=83) (actual time=0.011..4205.069 rows=5000000 lo
ops=1)
               Output: id, comment
 Total runtime: 28704.160 ms
(9 rows)
Next, raise work_mem to 458352kB + 120MB and see what happens:
digoal=> explain analyze verbose select distinct id,comment from user_info;
                                                             QUERY PLAN                                                             
 
------------------------------------------------------------------------------------------------------------------------------------
-
 HashAggregate  (cost=181383.54..231383.90 rows=5000036 width=83) (actual time=10108.302..14901.176 rows=5000000 loops=1)
   Output: id, comment
   ->  Seq Scan on digoal.user_info  (cost=0.00..156383.36 rows=5000036 width=83) (actual time=0.020..4080.396 rows=5000000 loops=1)
         Output: id, comment
 Total runtime: 17653.570 ms
(5 rows)
Clearly, disk is no longer needed.

Now set work_mem back to a value below 458352kB, say 370 MB, and see what happens:
digoal=> set work_mem='370 MB';
SET
digoal=> explain analyze verbose select distinct id,comment from user_info;
                                                                QUERY PLAN                                                          
       
------------------------------------------------------------------------------------------------------------------------------------
-------
 Unique  (cost=883625.04..921125.31 rows=5000036 width=83) (actual time=14656.450..25803.112 rows=5000000 loops=1)
   Output: id, comment
   ->  Sort  (cost=883625.04..896125.13 rows=5000036 width=83) (actual time=14656.447..18389.175 rows=5000000 loops=1)
         Output: id, comment
         Sort Key: user_info.id, user_info.comment
         Sort Method: external merge  Disk: 458352kB
         ->  Seq Scan on digoal.user_info  (cost=0.00..156383.36 rows=5000036 width=83) (actual time=0.014..4138.056 rows=5000000 lo
ops=1)
               Output: id, comment
 Total runtime: 28735.377 ms
Surprising, isn't it? The 370MB of work_mem was not used; the sort went straight to disk. So when work_mem cannot satisfy the whole operation, the entire sort is done on disk.

Next, set work_mem to an extremely large value, to see whether work_mem is allocated in one go or only as much as needed. To keep things fair, we open a new session.
postgres@db5-> psql -h 127.0.0.1 digoal digoal
psql (9.1beta2)
Type "help" for help.

digoal=> select pg_backend_pid();
 pg_backend_pid 
----------------
          23496
Check how much memory backend process 23496 is using now. As shown below, it is 4176KB.
postgres@db5-> ps -eo pid,rss,cmd|grep 23496
23496  4176 postgres: digoal digoal 127.0.0.1(2486) idle

Now run a SQL statement and watch RSS change.
digoal=> set work_mem='10240 MB';
SET
digoal=> explain analyze verbose select distinct id,comment from user_info;
                                                             QUERY PLAN                                                             
 
------------------------------------------------------------------------------------------------------------------------------------
-
 HashAggregate  (cost=181383.54..231383.90 rows=5000036 width=83) (actual time=10480.524..15375.754 rows=5000000 loops=1)
   Output: id, comment
   ->  Seq Scan on digoal.user_info  (cost=0.00..156383.36 rows=5000036 width=83) (actual time=0.052..4387.955 rows=5000000 loops=1)
         Output: id, comment
 Total runtime: 18207.895 ms
(5 rows)
Memory consumption is now 1767168KB, about 1.7GB.
postgres@db5-> ps -eo pid,rss,cmd|grep 23496
23496 1767168 postgres: digoal digoal 127.0.0.1(2486) EXPLAIN
After the SQL finishes, memory consumption drops to 877276KB, about 870MB. In other words, the distinct operation used roughly 830MB of memory.
postgres@db5-> ps -eo pid,rss,cmd|grep 23496
23496 877276 postgres: digoal digoal 127.0.0.1(2486) idle
Why does memory need about 830MB while disk needed only 458352kB, nearly double?
The reason is simple: recall that the table itself is about 830MB, and compare the two execution plans, they differ. After execution the process still holds about 870MB of memory, which is essentially the user_info table's data plus other memory allocated to the backend process. The disk-based sort, by contrast, only wrote the id and comment columns, about 458352kB.
Let's insert more data into the table to confirm this idea:
digoal=> insert into user_info select generate_series(1,5000000),'zhou','digoal'||generate_series(2,5000001),'sky-mobi','dba team leader',28,clock_timestamp(),'flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj'||generate_series(3,5000002);
INSERT 0 5000000
digoal=> analyze user_info;
ANALYZE
digoal=> select pg_relation_size('user_info')/1024/1024;
 ?column? 
----------
     1698
The table is now 1698MB. Run the same SQL again:

digoal=> explain analyze verbose select distinct id,comment from user_info;
                                                              QUERY PLAN                                                            
   
------------------------------------------------------------------------------------------------------------------------------------
---
 HashAggregate  (cost=369825.15..471442.58 rows=10161743 width=86) (actual time=20834.803..30828.287 rows=10000000 loops=1)
   Output: id, comment
   ->  Seq Scan on digoal.user_info  (cost=0.00..319016.43 rows=10161743 width=86) (actual time=0.021..8309.780 rows=10000000 loops=
1)
         Output: id, comment
 Total runtime: 36475.840 ms
Check memory usage again; sure enough, it is now 3.6GB.
postgres@db5-> ps -eo pid,rss,cmd|grep 23496
23496 3632540 postgres: digoal digoal 127.0.0.1(2486) EXPLAIN
After execution, 1.8GB stays resident, which matches the sort having used roughly the table's 1698MB. Indeed:
postgres@db5-> ps -eo pid,rss,cmd|grep 23496
23496 1858568 postgres: digoal digoal 127.0.0.1(2486) idle
Note: the 1858568KB of resident memory should essentially be OS page cache. This caching behavior can be tuned via posix_fadvise; pg_fincore is a good tool for that.

Next, let's test what happens if the files under pgsql_tmp are deleted while the query is still running.
First, turn work_mem back down:
digoal=> set work_mem='1 MB';
SET
Then run the following SQL:
digoal=> select distinct id,comment from user_info limit 100;
and immediately delete pgsql_tmp23496.* from /home/pgdata/pg_root/tbs_digoal/PG_9.1_201105231/pgsql_tmp.
Remarkably, the result is still returned correctly:
 id |                                  comment                                   
----+----------------------------------------------------------------------------
  1 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj3
  1 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj3
  2 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj4
  2 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj4
  3 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj5
  3 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj5
  4 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj6
  4 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj6
  5 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj7
  5 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj7
  6 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj8
  6 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj8
  7 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj9
  7 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj9
  8 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj10
  8 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj10
  9 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj11
  9 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj11
 10 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj12
 10 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj12
 11 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj13
 11 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj13
 12 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj14
 12 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj14
 13 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj15
 13 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj15
 14 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj16
 14 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj16
 15 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj17
 15 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj17
 16 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj18
 16 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj18
 17 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj19
 17 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj19
 18 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj20
 18 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj20
 19 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj21
 19 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj21
 20 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj22
 20 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj22
 21 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj23
 21 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj23
 22 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj24
 22 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj24
 23 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj25
 23 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj25
 24 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj26
 24 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj26
 25 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj27
 25 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj27
 26 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj28
 26 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj28
 27 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj29
 27 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj29
 28 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj30
 28 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj30
 29 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj31
 29 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj31
 30 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj32
 30 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj32
 31 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj33
 31 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj33
 32 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj34
 32 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj34
 33 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj35
 33 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj35
 34 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj36
 34 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj36
 35 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj37
 35 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj37
 36 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj38
 36 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj38
 37 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj39
 37 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj39
 38 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj40
 38 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj40
 39 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj41
 39 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj41
 40 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj42
 40 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj42
 41 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj43
 41 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj43
 42 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj44
 42 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj44
 43 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj45
 43 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj45
 44 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj46
 44 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj46
 45 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj47
 45 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj47
 46 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj48
 46 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj48
 47 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj49
 47 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj49
 48 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj50
 48 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj50
 49 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj51
 49 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj51
 50 | flkejwkhnlkjahg()JIOlk  fkljkejhnKHJLKJHFKWEhgfoi2o43it09KHKJFnhrksj52
 50 | flkejwkhnlkjahgJLIJFOIJOIJEGF fkljkejhnKHJL你好Ehgfoi2o43it09KHKJFnhrksj52
(100 rows)
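This works because on Unix, rm only removes the directory entry; the backend still holds an open file descriptor on the temp file, and the data remains reachable through that descriptor until it is closed. A minimal shell sketch of the same unlink-while-open behavior:

```shell
#!/bin/bash
# Demo of unlink-while-open semantics: the data stays readable through
# an open descriptor even after rm removes the file's name.
tmp=$(mktemp)
echo "sort spill data" > "$tmp"
exec 3< "$tmp"      # backend-style: keep an fd open on the temp file
rm "$tmp"           # like deleting pgsql_tmp23496.* mid-query
cat <&3             # still prints: sort spill data
exec 3<&-
```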

So what exactly do the files under pgsql_tmp store?
I found the relevant code in the source tree:
src/include/storage/fd.h
/* Filename components for OpenTemporaryFile */
#define PG_TEMP_FILES_DIR "pgsql_tmp"
#define PG_TEMP_FILE_PREFIX "pgsql_tmp"
src/backend/storage/file/fd.c
/*
 * Remove temporary and temporary relation files left over from a prior
 * postmaster session
 *
 * This should be called during postmaster startup.  It will forcibly
 * remove any leftover files created by OpenTemporaryFile and any leftover
 * temporary relation files created by mdcreate.
 *
 * NOTE: we could, but don't, call this during a post-backend-crash restart
 * cycle.  The argument for not doing it is that someone might want to examine
 * the temp files for debugging purposes.  This does however mean that
 * OpenTemporaryFile had better allow for collision with an existing temp
 * file name.
 */
        /*
         * In EXEC_BACKEND case there is a pgsql_tmp directory at the top level of
         * DataDir as well.
         */
#ifdef EXEC_BACKEND
        RemovePgTempFilesInDir(PG_TEMP_FILES_DIR);
#endif
/* Process one pgsql_tmp directory for RemovePgTempFiles */
static void
RemovePgTempFilesInDir(const char *tmpdirname)
{
        DIR                *temp_dir;
        struct dirent *temp_de;
        char            rm_path[MAXPGPATH];

        temp_dir = AllocateDir(tmpdirname);
        if (temp_dir == NULL)
        {
                /* anything except ENOENT is fishy */
                if (errno != ENOENT)
                        elog(LOG,
                                 "could not open temporary-files directory \"%s\": %m",
                                 tmpdirname);
                return;
        }

        while ((temp_de = ReadDir(temp_dir, tmpdirname)) != NULL)
        {
                if (strcmp(temp_de->d_name, ".") == 0 ||
                        strcmp(temp_de->d_name, "..") == 0)
                        continue;

                snprintf(rm_path, sizeof(rm_path), "%s/%s",
                                 tmpdirname, temp_de->d_name);

                if (strncmp(temp_de->d_name,
                                        PG_TEMP_FILE_PREFIX,
                                        strlen(PG_TEMP_FILE_PREFIX)) == 0)
                        unlink(rm_path);        /* note we ignore any error */
                else
                        elog(LOG,
                                 "unexpected file found in temporary-files directory: \"%s\"",
                                 rm_path);
        }

        FreeDir(temp_dir);
}

The answer should be found in src/backend/storage/file/fd.c.

Let's also verify the manual's statement that work_mem contents are not reused across backends.
Session 1 : 
postgres@db5-> psql -h 127.0.0.1 digoal digoal
psql (9.1beta2)
Type "help" for help.

digoal=> set work_mem='10240 MB';
SET
digoal=> explain analyze verbose select distinct id,comment from user_info order by id,comment desc;
                                                                 QUERY PLAN                                                         
         
------------------------------------------------------------------------------------------------------------------------------------
---------
 Unique  (cost=1501672.83..1577885.90 rows=10161743 width=86) (actual time=24547.376..44742.270 rows=10000000 loops=1)
   Output: id, comment
   ->  Sort  (cost=1501672.83..1527077.19 rows=10161743 width=86) (actual time=24547.371..31096.446 rows=10000000 loops=1)
         Output: id, comment
         Sort Key: user_info.id, user_info.comment
         Sort Method: quicksort  Memory: 1799467kB
         ->  Seq Scan on digoal.user_info  (cost=0.00..319016.43 rows=10161743 width=86) (actual time=0.028..8872.090 rows=10000000 
loops=1)
               Output: id, comment
 Total runtime: 51242.900 ms
(9 rows)

Session 2 : 
postgres@db5-> psql -h 127.0.0.1 digoal digoal
psql (9.1beta2)
Type "help" for help.

digoal=> set work_mem='10240 MB';
SET
digoal=> explain analyze verbose select distinct id,comment from user_info order by id,comment desc;
                                                                 QUERY PLAN                                                         
         
------------------------------------------------------------------------------------------------------------------------------------
---------
 Unique  (cost=1501672.83..1577885.90 rows=10161743 width=86) (actual time=24529.584..44689.774 rows=10000000 loops=1)
   Output: id, comment
   ->  Sort  (cost=1501672.83..1527077.19 rows=10161743 width=86) (actual time=24529.579..31043.949 rows=10000000 loops=1)
         Output: id, comment
         Sort Key: user_info.id, user_info.comment
         Sort Method: quicksort  Memory: 1799467kB
         ->  Seq Scan on digoal.user_info  (cost=0.00..319016.43 rows=10161743 width=86) (actual time=0.021..8925.400 rows=10000000 
loops=1)
               Output: id, comment
 Total runtime: 51170.098 ms
(9 rows)
# Note that a new Sort Method appears here : quicksort  Memory: 1799467kB
Now let's analyze memory usage as seen from the OS.
1. Before explain analyze runs
top - 07:01:00 up 17 days, 15:30,  3 users,  load average: 0.54, 0.57, 0.25
Tasks: 160 total,   1 running, 159 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us,  0.1%sy,  0.0%ni, 99.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16438912k total,  9507680k used,  6931232k free,   163756k buffers
Swap: 16777208k total,    18432k used, 16758776k free,  8979204k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                
 4444 postgres  15   0 2324m 3672 2060 S    0  0.0   0:00.00 postgres: digoal digoal 127.0.0.1(42278) idle                                                      
 4446 postgres  15   0 2324m 3672 2060 S    0  0.0   0:00.00 postgres: digoal digoal 127.0.0.1(42279) idle                          
2. While explain analyze runs
top - 07:01:54 up 17 days, 15:31,  3 users,  load average: 1.20, 0.73, 0.32
Tasks: 160 total,   3 running, 157 sleeping,   0 stopped,   0 zombie
Cpu(s): 25.0%us,  0.1%sy,  0.0%ni, 74.7%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16438912k total, 12805700k used,  3633212k free,   163848k buffers
Swap: 16777208k total,    18432k used, 16758776k free,  8979204k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                
 4444 postgres  25   0 4085m 3.3g 1.7g R  100 20.8   0:41.43 postgres: digoal digoal 127.0.0.1(42278) EXPLAIN                       
 4446 postgres  25   0 4085m 3.3g 1.7g R   99 20.8   0:42.97 postgres: digoal digoal 127.0.0.1(42279) EXPLAIN     
# Note the change in free: in total about 3.2GB is used (=3.3-1.7 + 3.3-1.7); SHR=1.7G is the OS filesystem cache of the user_info table, which is why cached is unchanged before and after
3. After explain analyze finishes
top - 07:02:07 up 17 days, 15:31,  3 users,  load average: 1.22, 0.76, 0.34
Tasks: 160 total,   1 running, 159 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us,  0.0%sy,  0.0%ni, 99.6%id,  0.2%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16438912k total,  9518956k used,  6919956k free,   163868k buffers
Swap: 16777208k total,    18432k used, 16758776k free,  8979196k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                
4444 postgres  25   0 2324m 1.7g 1.7g S    0 10.8   0:51.15 postgres: digoal digoal 127.0.0.1(42278) idle                          
4446 postgres  25   0 2324m 1.7g 1.7g S    0 10.8   0:51.17 postgres: digoal digoal 127.0.0.1(42279) idle 
    After execution, free returns to its pre-run level, and each process's RES now equals its SHR, because the user_info table is still in the OS filesystem cache; that part is shared overhead. Using pgfadv_willnotneed from pg_fincore to adjust the caching behavior of the user_info table, you can watch it be released quickly (not recommended, since PostgreSQL relies heavily on the OS cache).
    This shows that the work_mem used for sorting is neither shared nor reused across backends.

Next, lower work_mem to 1MB to show the effect even more clearly:
set work_mem='1 MB';
Then run the same two sessions as above. During execution, two temporary files, pgsql_tmp4444.0 and pgsql_tmp4446.0, appear in the tablespace's pgsql_tmp directory (/home/pgdata/pg_root/tbs_digoal/PG_9.1_201105231/pgsql_tmp). Clearly they are not shared.

Summary : 
1. A large work_mem setting is not itself a problem, because the memory is not all allocated up front. The phrase "as much as" in the manual had previously misled me about work_mem.
2. That said, an oversized work_mem really can cause excessive memory consumption, as seen above.
3. Early Oracle releases configured per-process work memory the same way; later Oracle introduced the PGA, which caps the total work memory of all processes within one limit instead of limiting each individual process. The benefit is obvious; it remains to be seen when PostgreSQL will improve this mechanism.
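
As a small follow-up, the spill behavior can be observed per session without touching postgresql.conf. This sketch assumes the user_info table from the test above and an id column (the column name is an assumption):

```sql
-- Session-level tuning: takes effect immediately, for this backend only.
SET work_mem = '1MB';
-- Log every temporary file this backend creates (0 = no size threshold).
SET log_temp_files = 0;
-- A sort exceeding work_mem spills to pgsql_tmpPPP.NNN under the
-- tablespace's pgsql_tmp directory; the server log reports file and size.
EXPLAIN ANALYZE SELECT * FROM user_info ORDER BY id;
```

log_temp_files is a standard GUC (superuser-settable), so each backend's spill files show up in the log without watching the directory by hand.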


【参考】
Oracle : 

A temporary tablespace contains transient data that persists only for the duration of the session. Temporary tablespaces can improve the concurrency of multiple sort operations, reduce their overhead, and avoid Oracle Database space management operations. A temporary tablespace can be assigned to users with the CREATE USER or ALTER USER statement and can be shared by multiple users.

Within a temporary tablespace, all sort operations for a given instance and tablespace share a single sort segment. Sort segments exist for every instance that performs sort operations within a given tablespace. The sort segment is created by the first statement that uses a temporary tablespace for sorting, after startup, and is released only at shutdown. An extent cannot be shared by multiple transactions.

You can view the allocation and deallocation of space in a temporary tablespace sort segment using the V$SORT_SEGMENT view. The V$TEMPSEG_USAGE view identifies the current sort users in those segments.

You cannot explicitly create objects in a temporary tablespace.


PostgreSQL : 


Temporary files (for operations such as sorting more data than can fit in memory) are created within PGDATA/base/pgsql_tmp, or within a pgsql_tmp subdirectory of a tablespace directory if a tablespace other than pg_default is specified for them. The name of a temporary file has the form pgsql_tmpPPP.NNN, where PPP is the PID of the owning backend and NNN distinguishes different temporary files of that backend.

work_mem (integer)

Specifies the amount of memory to be used by internal sort operations and hash tables before writing to temporary disk files. The value defaults to one megabyte (1MB). Note that for a complex query, several sort or hash operations might be running in parallel; each operation will be allowed to use as much memory as this value specifies before it starts to write data into temporary files. Also, several running sessions could be doing such operations concurrently. Therefore, the total memory used could be many times the value of work_mem; it is necessary to keep this fact in mind when choosing the value. Sort operations are used for ORDER BY, DISTINCT, and merge joins. Hash tables are used in hash joins, hash-based aggregation, and hash-based processing of IN subqueries.


rsync bwlimit

As it turns out, rsync cannot throttle the total bandwidth on the server side; only the initiating (client) side's bandwidth can be limited.

1. 
cat /etc/rsyncd.conf
port = 873
hosts deny = 0.0.0.0/0
read only = false
write only = false
gid = 0
uid = 0
use chroot = no
max connections = 10
pid file = /var/run/rsync.pid
lock file = /var/run/rsync.lock
log file = /var/log/rsync.log
bwlimit = 1024

[dbbak_sh]
path = /dbbak_sh
comment = Database Backup from shanghai.
hosts allow = 172.16.3.33,172.16.3.40

Server command: rsync -v --daemon --config=/etc/rsyncd.conf
Client command: rsync -acvz  --delete --delete-after /root/SLES-11-SP1-DVD-x86_64-GM-DVD2.iso1 172.16.3.176::dbbak_sh
Client output:
building file list ... done
SLES-11-SP1-DVD-x86_64-GM-DVD2.iso1

sent 4594258875 bytes  received 38 bytes  18713885.59 bytes/sec
total size is 4605421568  speedup is 1.00
Throttling had no effect.

2. 
cat /etc/rsyncd.conf
port = 873
hosts deny = 0.0.0.0/0
read only = false
write only = false
gid = 0
uid = 0
use chroot = no
max connections = 10
pid file = /var/run/rsync.pid
lock file = /var/run/rsync.lock
log file = /var/log/rsync.log

[dbbak_sh]
path = /dbbak_sh
comment = Database Backup from shanghai.
hosts allow = 172.16.3.33,172.16.3.40

Server command: rsync -v --daemon --bwlimit=1024 --config=/etc/rsyncd.conf
Client command: rsync -acvz  --delete --delete-after /root/SLES-11-SP1-DVD-x86_64-GM-DVD2.iso1 172.16.3.176::dbbak_sh
Client output:
building file list ... done
SLES-11-SP1-DVD-x86_64-GM-DVD2.iso1

sent 4594258859 bytes  received 38 bytes  17110833.88 bytes/sec
total size is 4605421568  speedup is 1.00
Throttling had no effect.

3. 
cat /etc/rsyncd.conf
port = 873
hosts deny = 0.0.0.0/0
read only = false
write only = false
gid = 0
uid = 0
use chroot = no
max connections = 10
pid file = /var/run/rsync.pid
lock file = /var/run/rsync.lock
log file = /var/log/rsync.log

[dbbak_sh]
path = /dbbak_sh
comment = Database Backup from shanghai.
hosts allow = 172.16.3.33,172.16.3.40

Server command: rsync -v --daemon --config=/etc/rsyncd.conf
Client command: rsync -acvz --daemon --bwlimit=5000 --delete --delete-after /root/SLES-11-SP1-DVD-x86_64-GM-DVD2.iso1 172.16.3.176::dbbak_sh
Client error : 
rsync: -acvz: unknown option (in daemon mode)
(Type "rsync --daemon --help" for assistance with daemon mode.)
rsync error: syntax or usage error (code 1) at options.c(879) [client=2.6.8]
Throttling attempt failed.

Client command:  rsync -acvz --bwlimit=5000 --delete --delete-after /root/SLES-11-SP1-DVD-x86_64-GM-DVD2.iso1 172.16.3.176::dbbak_sh
building file list ... done
SLES-11-SP1-DVD-x86_64-GM-DVD2.iso1

sent 4594258875 bytes  received 38 bytes  5023793.23 bytes/sec
total size is 4605421568  speedup is 1.00
Throttling worked.

[Reference]
            --bwlimit=KBPS          limit I/O bandwidth; KBytes per second
Rsync can also be run as a daemon, in which case the following options are accepted:
            --daemon                run as an rsync daemon
            --address=ADDRESS       bind to the specified address
            --bwlimit=KBPS          limit I/O bandwidth; KBytes per second  (this presumably means server-side throttling, but it clearly did not take effect)
            --config=FILE           specify alternate rsyncd.conf file
            --no-detach             do not detach from the parent
            --port=PORT             listen on alternate port number
            --sockopts=OPTIONS      specify custom TCP options
        -v, --verbose               increase verbosity
        -4, --ipv4                  prefer IPv4
        -6, --ipv6                  prefer IPv6
        -h, --help                  show this help (if used after --daemon)



mongoDB and numa policy interleave

Today I read an article by HelloDBA about running multiple MySQL instances on a single server under a NUMA architecture (link).

mongoDB has a similar situation: you may need to adjust the NUMA memory allocation policy when starting the database, like this:
su - mongo -c "numactl --interleave=all mongod -f /opt/mongo/conf/mongod5281.conf"
This changes the memory allocation policy from the default to interleave mode; see the documentation for what the different modes mean.

A process's NUMA allocation can be inspected through its numa_maps.
On a machine with two physical NUMA nodes, with the interleave policy already enabled:
cat /proc/$pid/numa_maps

00400000 interleave=0-1 file=/opt/mongo/bin/mongod mapped=539 active=210 N0=269 N1=270
00d46000 interleave=0-1 file=/opt/mongo/bin/mongod anon=12 dirty=12 mapped=23 active=12 N0=12 N1=11
00d63000 interleave=0-1 anon=15 dirty=15 N0=7 N1=8
11d18000 interleave=0-1 heap anon=45 dirty=45 active=44 N0=24 N1=21
4021f000 interleave=0-1
40220000 interleave=0-1 anon=2 dirty=2 N0=1 N1=1
41131000 interleave=0-1
41132000 interleave=0-1 anon=2 dirty=2 N0=1 N1=1
41b32000 interleave=0-1
41b33000 interleave=0-1 anon=2 dirty=2 N0=1 N1=1
42533000 interleave=0-1
42534000 interleave=0-1 anon=2 dirty=2 N0=1 N1=1
42f34000 interleave=0-1
42f35000 interleave=0-1 anon=2 dirty=2 N0=1 N1=1
43935000 interleave=0-1
43936000 interleave=0-1 anon=2 dirty=2 N0=1 N1=1
44336000 interleave=0-1
44337000 interleave=0-1 anon=5 dirty=5 N0=3 N1=2
44d37000 interleave=0-1
44d38000 interleave=0-1 anon=3 dirty=3 N0=1 N1=2
45738000 interleave=0-1
45739000 interleave=0-1 anon=4 dirty=4 N0=2 N1=2
46139000 interleave=0-1
4613a000 interleave=0-1 anon=3 dirty=3 N0=1 N1=2
46b3a000 interleave=0-1
46b3b000 interleave=0-1 anon=3 dirty=3 N0=2 N1=1
3adea00000 interleave=0-1 file=/lib64/ld-2.5.so mapped=20 mapmax=44 N0=20
3adec1b000 interleave=0-1 file=/lib64/ld-2.5.so anon=1 dirty=1 N1=1
3adec1c000 interleave=0-1 file=/lib64/ld-2.5.so anon=1 dirty=1 N0=1
3adee00000 interleave=0-1 file=/lib64/libc-2.5.so mapped=126 mapmax=49 N0=126
3adef4e000 interleave=0-1 file=/lib64/libc-2.5.so
3adf14e000 interleave=0-1 file=/lib64/libc-2.5.so anon=2 dirty=2 mapped=4 mapmax=26 N0=3 N1=1
3adf152000 interleave=0-1 file=/lib64/libc-2.5.so anon=1 dirty=1 N0=1
3adf153000 interleave=0-1 anon=5 dirty=5 N0=2 N1=3
3adf600000 interleave=0-1 file=/lib64/libpthread-2.5.so mapped=16 mapmax=21 N0=16
3adf616000 interleave=0-1 file=/lib64/libpthread-2.5.so
3adf815000 interleave=0-1 file=/lib64/libpthread-2.5.so anon=1 dirty=1 N1=1
3adf816000 interleave=0-1 file=/lib64/libpthread-2.5.so anon=1 dirty=1 N0=1
3adf817000 interleave=0-1 anon=1 dirty=1 N0=1
3adfa00000 interleave=0-1 file=/lib64/libm-2.5.so mapped=27 mapmax=10 active=11 N0=27
3adfa82000 interleave=0-1 file=/lib64/libm-2.5.so
3adfc81000 interleave=0-1 file=/lib64/libm-2.5.so anon=1 dirty=1 N1=1
3adfc82000 interleave=0-1 file=/lib64/libm-2.5.so anon=1 dirty=1 N0=1
3aed000000 interleave=0-1 file=/lib64/libgcc_s-4.1.2-20080825.so.1 mapped=2 mapmax=5 N0=2
3aed00d000 interleave=0-1 file=/lib64/libgcc_s-4.1.2-20080825.so.1
3aed20d000 interleave=0-1 file=/lib64/libgcc_s-4.1.2-20080825.so.1 anon=1 dirty=1 N1=1
3af0c00000 interleave=0-1 file=/usr/lib64/libstdc++.so.6.0.8 mapped=66 mapmax=3 active=65 N0=66
3af0ce6000 interleave=0-1 file=/usr/lib64/libstdc++.so.6.0.8
3af0ee5000 interleave=0-1 file=/usr/lib64/libstdc++.so.6.0.8 anon=5 dirty=5 mapped=6 mapmax=2 N0=4 N1=2
3af0eeb000 interleave=0-1 file=/usr/lib64/libstdc++.so.6.0.8 anon=3 dirty=3 N0=1 N1=2
3af0eee000 interleave=0-1 anon=3 dirty=3 N0=1 N1=2
2aaaaaac5000 interleave=0-1 file=/lib64/libnss_files-2.5.so mapped=5 mapmax=16 N0=5
2aaaaaacf000 interleave=0-1 file=/lib64/libnss_files-2.5.so
2aaaaacce000 interleave=0-1 file=/lib64/libnss_files-2.5.so anon=1 dirty=1 N1=1
2aaaaaccf000 interleave=0-1 file=/lib64/libnss_files-2.5.so anon=1 dirty=1 N0=1
2aaaac000000 interleave=0-1 anon=17 dirty=17 N0=7 N1=10
2aaaac021000 interleave=0-1
2aaab0000000 interleave=0-1 file=/data1/mongodata/local/local.ns
2aaaf0000000 interleave=0-1 file=/data1/mongodata/local/local.ns mapped=262144 active=262132 N0=131072 N1=131072
2aab30000000 interleave=0-1 file=/data1/mongodata/local/local.0
2aab34000000 interleave=0-1 file=/data1/mongodata/local/local.0 mapped=3 active=1 N0=2 N1=1
2aab38000000 interleave=0-1 file=/data1/mongodata/local/local.1
2aabb7f00000 interleave=0-1 file=/data1/mongodata/local/local.1 mapped=2 active=0 N1=2
2aac37e00000 interleave=0-1 file=/data1/mongodata/local/local.2
2aacb7d00000 interleave=0-1 file=/data1/mongodata/local/local.2 mapped=2 active=0 N0=2
2aad37c00000 interleave=0-1 file=/data1/mongodata/local/local.3
2aadb7b00000 interleave=0-1 file=/data1/mongodata/local/local.3 mapped=2 active=0 N0=2
2aae37a00000 interleave=0-1 file=/data1/mongodata/local/local.4
2aaeb7900000 interleave=0-1 file=/data1/mongodata/local/local.4 mapped=2 active=0 N0=2
2aaf37800000 interleave=0-1 file=/data1/mongodata/local/local.5
2aafb7700000 interleave=0-1 file=/data1/mongodata/local/local.5 mapped=2 active=0 N0=2
2b3f9d648000 interleave=0-1 anon=3 dirty=3 N0=2 N1=1
2b3f9d662000 interleave=0-1 anon=4 dirty=4 N0=2 N1=2
7fff14f57000 interleave=0-1 stack anon=6 dirty=6 N0=3 N1=3
7fff14f76000 interleave=0-1 mapped=1 mapmax=32 active=0 N0=1

On another server, mongod was started without adjusting the NUMA memory allocation policy; the result looks like this : 
00400000 default file=/opt/mongo/bin/mongod mapped=542 N0=525 N1=17
00d46000 default file=/opt/mongo/bin/mongod anon=12 dirty=12 mapped=24 N0=12 N1=12
00d63000 default anon=15 dirty=15 N1=15
1e78b000 default heap anon=38 dirty=38 N0=3 N1=35
40c02000 default
40c03000 default anon=2 dirty=2 N0=1 N1=1
41603000 default
41604000 default anon=2 dirty=2 N0=1 N1=1
42004000 default
42005000 default anon=2 dirty=2 active=0 N1=2
42a05000 default
42a06000 default anon=2 dirty=2 N0=1 N1=1
43406000 default
43407000 default anon=2 dirty=2 N0=2
43e07000 default
43e08000 default anon=2 dirty=2 N0=2
44808000 default
44809000 default anon=5 dirty=5 N0=5
45209000 default
4520a000 default anon=4 dirty=4 active=3 N0=1 N1=3
45c0a000 default
45c0b000 default anon=2 dirty=2 N0=2
32c4e00000 default file=/lib64/ld-2.5.so mapped=20 mapmax=46 N0=20
32c501b000 default file=/lib64/ld-2.5.so anon=1 dirty=1 N1=1
32c501c000 default file=/lib64/ld-2.5.so anon=1 dirty=1 N1=1
32c5200000 default file=/lib64/libc-2.5.so mapped=124 mapmax=51 N0=120 N1=4
32c534e000 default file=/lib64/libc-2.5.so
32c554e000 default file=/lib64/libc-2.5.so anon=2 dirty=2 mapped=4 mapmax=27 N0=2 N1=2
32c5552000 default file=/lib64/libc-2.5.so anon=1 dirty=1 N1=1
32c5553000 default anon=5 dirty=5 N0=1 N1=4
32c5a00000 default file=/lib64/libpthread-2.5.so mapped=16 mapmax=21 N1=16
32c5a16000 default file=/lib64/libpthread-2.5.so
32c5c15000 default file=/lib64/libpthread-2.5.so anon=1 dirty=1 N1=1
32c5c16000 default file=/lib64/libpthread-2.5.so anon=1 dirty=1 N1=1
32c5c17000 default anon=1 dirty=1 N1=1
32c5e00000 default file=/lib64/libm-2.5.so mapped=27 mapmax=10 N0=14 N1=13
32c5e82000 default file=/lib64/libm-2.5.so
32c6081000 default file=/lib64/libm-2.5.so anon=1 dirty=1 N1=1
32c6082000 default file=/lib64/libm-2.5.so anon=1 dirty=1 N1=1
32d4000000 default file=/lib64/libgcc_s-4.1.2-20080825.so.1 mapped=9 mapmax=5 N0=7 N1=2
32d400d000 default file=/lib64/libgcc_s-4.1.2-20080825.so.1
32d420d000 default file=/lib64/libgcc_s-4.1.2-20080825.so.1 anon=1 dirty=1 N1=1
32d7000000 default file=/usr/lib64/libstdc++.so.6.0.8 mapped=77 mapmax=3 N0=77
32d70e6000 default file=/usr/lib64/libstdc++.so.6.0.8
32d72e5000 default file=/usr/lib64/libstdc++.so.6.0.8 anon=5 dirty=5 mapped=6 mapmax=2 N0=1 N1=5
32d72eb000 default file=/usr/lib64/libstdc++.so.6.0.8 anon=3 dirty=3 N1=3
32d72ee000 default anon=3 dirty=3 N1=3
2aaaaaac5000 default file=/lib64/libnss_files-2.5.so mapped=5 mapmax=18 N1=5
2aaaaaacf000 default file=/lib64/libnss_files-2.5.so
2aaaaacce000 default file=/lib64/libnss_files-2.5.so anon=1 dirty=1 N0=1
2aaaaaccf000 default file=/lib64/libnss_files-2.5.so anon=1 dirty=1 N0=1
2aaaac000000 default anon=18 dirty=18 active=13 N0=7 N1=11
2aaaac021000 default
2aaab0000000 default file=/opt/mongodata/local/local.ns
2aaaf0000000 default file=/opt/mongodata/local/local.ns mapped=262144 N0=237592 N1=24552
2aab30000000 default file=/opt/mongodata/local/local.0
2aab34000000 default file=/opt/mongodata/local/local.0 mapped=2 N0=2
2aab38000000 default file=/opt/mongodata/local/local.1
2aabb7f00000 default file=/opt/mongodata/local/local.1 mapped=1 N0=1
2aac37e00000 default file=/opt/mongodata/local/local.2
2aacb7d00000 default file=/opt/mongodata/local/local.2 mapped=1 N0=1
2aad37c00000 default file=/opt/mongodata/local/local.3
2aadb7b00000 default file=/opt/mongodata/local/local.3 mapped=1 N1=1
2aae37a00000 default file=/opt/mongodata/local/local.4
2aaeb7900000 default file=/opt/mongodata/local/local.4 mapped=1 N1=1
2aaf37800000 default file=/opt/mongodata/local/local.5
2aafb7700000 default file=/opt/mongodata/local/local.5 mapped=1 N1=1
2b2a3fa63000 default anon=3 dirty=3 N1=3
2b2a3fa7d000 default anon=5 dirty=5 N1=5
7fffe3123000 default stack anon=7 dirty=7 N1=7
7fffe31fc000 default mapped=1 mapmax=34 active=0 N0=1

[Reference]
numactl

DESCRIPTION
       numactl runs processes with a specific NUMA scheduling or memory placement policy. The policy is set for command and inherited by all of its children. In addition it can set persistent policy for shared memory segments or files.

       Policy settings are:

       --interleave=nodes, -i nodes
              Set a memory interleave policy. Memory will be allocated using round robin on nodes. When memory cannot be allocated on the current interleave target fall back to other nodes.

       --membind=nodes, -m nodes
              Only  allocate  memory  from  nodes.   Allocation will fail when there is not enough memory available on
              these nodes.

       --cpunodebind=nodes, -N nodes
              Only execute process on the CPUs of nodes.  Note that nodes may consist of multiple CPUs.

       --physcpubind=cpus, -C cpus
              Only execute process on cpus.  This accepts physical cpu numbers as shown in  the  processor  fields  of
              /proc/cpuinfo.

       --localalloc, -l
              Do always local allocation on the current node.

No single policy is absolutely best; each only suits certain scenarios.
A while ago, a badly chosen NUMA policy caused some Oracle database processes to use only their local node's memory while the other node still had memory free; swap space then got used and performance dropped. In a situation like that, the interleave policy is worth considering.

PostgreSQL 9.1 escape behavior changed warning

In 9.0 the default value of the standard_conforming_strings parameter is off; 9.1 changes the default to on, which brings some changes in behavior, as follows:

Behavior on 9.0:
postgres@db-172-16-3-33-> psql -h 127.0.0.1
psql (9.0.4)
Type "help" for help.

postgres=# select '\\';
WARNING:  nonstandard use of \\ in a string literal
LINE 1: select '\\';
               ^
HINT:  Use the escape string syntax for backslashes, e.g., E'\\'.
 ?column? 
----------
 \
(1 row)

postgres=# select E'\\';
 ?column? 
----------
 \
(1 row)

postgres=# show standard_conforming_strings;
 standard_conforming_strings 
-----------------------------
 off
(1 row)

postgres=# 

Behavior on 9.1:
postgres@db5-> psql -h 127.0.0.1
psql (9.1beta2)
Type "help" for help.

postgres=# select '\\';
 ?column? 
----------
 \\
(1 row)

postgres=# select E'\\';
 ?column? 
----------
 \
(1 row)

postgres=# show standard_conforming_strings;
 standard_conforming_strings 
-----------------------------
 on
(1 row)

When upgrading a database from 9.0 to 9.1, be aware of this parameter change, or after moving to 9.1 set the parameter back to the 9.0 default.
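
The two workarounds can be sketched as follows; either one gives predictable behavior across both versions:

```sql
-- Option 1: restore the 9.0 default for this session
-- (can also be set in postgresql.conf for the whole cluster).
SET standard_conforming_strings = off;

-- Option 2 (preferred): always write escapes with E'' syntax,
-- which behaves the same regardless of the setting.
SELECT E'\\';   -- one backslash on both 9.0 and 9.1
```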

PostgreSQL 9.1 performance improve

1. Merge duplicate fsync requests
This greatly improves performance under heavy write loads.
2. Allow inheritance table scans to return meaningfully-sorted results (Greg Stark, Hans-Jurgen Schonig, Robert Haas, Tom Lane)

This allows better optimization of queries that use ORDER BY, LIMIT, or MIN/MAX with inherited tables.

These items alone are a feast for the eyes.

E.1.3.1.1. Performance

  • Support unlogged tables using the UNLOGGED option in CREATE TABLE (Robert Haas)

    Such tables provide better update performance than regular tables, but are not crash-safe: their contents are automatically cleared in case of a server crash. Their contents do not propagate to replication slaves, either.

  • Allow FULL OUTER JOIN to be implemented as a hash join, and allow either side of a LEFT OUTER JOIN or RIGHT OUTER JOIN to be hashed (Tom Lane)

    Previously FULL OUTER JOIN could only be implemented as a merge join, and LEFT OUTER JOIN and RIGHT OUTER JOIN could hash only the nullable side of the join. These changes provide additional query optimization possibilities.

  • Merge duplicate fsync requests (Robert Haas, Greg Smith)

    This greatly improves performance under heavy write loads.

  • Improve performance of commit_siblings (Greg Smith)

    This allows the use of commit_siblings with less overhead.

  • Reduce the memory requirement for large ispell dictionaries (Pavel Stehule, Tom Lane)

  • Avoid leaving data files open after "blind writes" (Alvaro Herrera)

    This fixes scenarios where backends might hold open files that were long since deleted, preventing the kernel from reclaiming disk space.

    ----

    Allow inheritance table scans to return meaningfully-sorted results (Greg Stark, Hans-Jurgen Schonig, Robert Haas, Tom Lane)

    This allows better optimization of queries that use ORDER BY, LIMIT, or MIN/MAX with inherited tables.

    ----

    Allow ALTER TABLE ... SET DATA TYPE to avoid table rewrites in appropriate cases (Noah Misch, Robert Haas)

    For example, converting a varchar column to text no longer requires a rewrite of the table. However, increasing the length constraint on a varchar column still requires a table rewrite.

    ----

    • Add nearest-neighbor (order-by-operator) searching to GiST indexes (Teodor Sigaev, Tom Lane)

      This allows GiST indexes to quickly return the N closest values in a query with LIMIT. For example

      SELECT * FROM places ORDER BY location <-> point '(101,456)' LIMIT 10;

      finds the ten places closest to a given target point.

    • Allow GIN indexes to index null and empty values (Tom Lane)

      This allows full GIN index scans, and fixes various corner cases in which GIN scans would fail.

    • Allow GIN indexes to better recognize duplicate search entries (Tom Lane)

      This reduces the cost of index scans, especially in cases where it avoids unnecessary full index scans.

    • Fix GiST indexes to be fully crash-safe (Heikki Linnakangas)

      Previously there were rare cases where a REINDEX would be required (you would be informed).

      ----

      Allow numeric to use a more compact, two-byte header in common cases (Robert Haas)

      Previously all numeric values had four-byte headers; this saves on disk storage.

      ----

      E.1.3.13.2. Contrib Performance

      • Add support for LIKE and ILIKE index searches to contrib/pg_trgm (Alexander Korotkov)

      • Add levenshtein_less_equal() function to contrib/fuzzystrmatch, which is optimized for small distances (Alexander Korotkov)

      • Improve performance of index lookups on contrib/seg columns (Alexander Korotkov)

      • Improve performance of pg_upgrade for databases with many relations (Bruce Momjian)

      • Add flag to contrib/pgbench to report per-statement latencies (Florian Pflug)
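
The UNLOGGED option from the list above can be exercised as follows; the table and column names are illustrative:

```sql
-- No WAL is written for this table: inserts are faster, but the
-- contents are truncated after a crash and do not reach standbys.
CREATE UNLOGGED TABLE session_cache (
    sid  text PRIMARY KEY,
    data text
);
INSERT INTO session_cache VALUES ('abc', 'payload');
```

This trade-off makes unlogged tables a fit for rebuildable data such as caches or staging areas, not for anything that must survive a crash.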


PostgreSQL advisory locks

PostgreSQL 9.1 adds transaction-level advisory locks; previously only session-level ones existed.
Transaction-level advisory locks cannot be released explicitly.
Advisory locks are a different concept from the system's MVCC locking and are essentially unrelated to it. The number of available locks is determined by max_locks_per_transaction and max_connections.
Advisory locks are visible in the pg_locks view.
Once acquired, a session-level advisory lock must be released manually, or it is released automatically when the session ends. Session-level advisory locks also behave differently from transaction locks.
For example:
1. When a session-level advisory lock is acquired inside begin; ... end;, it is not rolled back even if the transaction is rolled back or fails.

digoal=> begin;
BEGIN
digoal=> select pg_advisory_lock(1);
 pg_advisory_lock 
------------------
 
(1 row)

digoal=> rollback;
ROLLBACK
digoal=> select * from pg_locks where objid=1;
 locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid
  |     mode      | granted 
----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-----
--+---------------+---------
 advisory |    16386 |          |      |       |            |               |       0 |     1 |        1 | 2/9186             | 2598
8 | ExclusiveLock | t
(1 row)


2. When a session-level advisory lock is released inside begin; ... end;, it stays released even if the transaction is rolled back or fails.

digoal=> select pg_advisory_lock(1);
 pg_advisory_lock 
------------------
 
(1 row)

digoal=> begin;
BEGIN
digoal=> select pg_advisory_unlock(1);
 pg_advisory_unlock 
--------------------
 t
(1 row)

digoal=> rollback;
ROLLBACK
digoal=> select * from pg_locks where objid=1;
 locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid 
| mode | granted 
----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-----
+------+---------
(0 rows)


3. The same session-level advisory lock can be acquired multiple times within one session, but it must then be released the same number of times.
Below, the lock is acquired twice and released twice.

digoal=> select pg_advisory_lock(1);
 pg_advisory_lock 
------------------
 
(1 row)

digoal=> select pg_advisory_lock(1);
 pg_advisory_lock 
------------------
 
(1 row)

digoal=> select pg_advisory_unlock(1);
 pg_advisory_unlock 
--------------------
 t
(1 row)

digoal=> select * from pg_locks where objid=1;
 locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid
  |     mode      | granted 
----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-----
--+---------------+---------
 advisory |    16386 |          |      |       |            |               |       0 |     1 |        1 | 2/9198             | 2598
8 | ExclusiveLock | t
(1 row)

digoal=> select pg_advisory_unlock(1);
 pg_advisory_unlock 
--------------------
 t
(1 row)

digoal=> select * from pg_locks where objid=1;
 locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid 
| mode | granted 
----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-----
+------+---------
(0 rows)


4. Session-level and transaction-level locks share the same lock space within a session, so a lock already held by a transaction can also be acquired again in the same session, while another session cannot acquire it. A transaction advisory lock cannot be released explicitly; it is released automatically when the transaction ends (commit or rollback, clean or not).
For example:
SESSION A:

digoal=> begin;
BEGIN
digoal=> select pg_advisory_xact_lock(1);
 pg_advisory_xact_lock 
-----------------------
 
(1 row)

The transaction-level advisory lock is acquired here.

digoal=> select pg_advisory_lock(1);
 pg_advisory_lock 
------------------
 
(1 row)

The session-level advisory lock is acquired here.

digoal=> end;
COMMIT

The transaction advisory lock is released automatically here.
SESSION B:

digoal=> select pg_advisory_lock(1);

Waiting...

SESSION A:

digoal=> select * from pg_locks where objid=1;
 locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid
  |     mode      | granted 
----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-----
--+---------------+---------
 advisory |    16386 |          |      |       |            |               |       0 |     1 |        1 | 2/9202             | 2598
8 | ExclusiveLock | t
 advisory |    16386 |          |      |       |            |               |       0 |     1 |        1 | 4/76               | 2645
0 | ExclusiveLock | f
(2 rows)

SESSION B can be seen waiting for the lock.

digoal=> select pg_advisory_unlock(1);
 pg_advisory_unlock 
--------------------
 t
(1 row)

SESSION A's session advisory lock is released here.

digoal=> select * from pg_locks where objid=1;

 locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid
  |     mode      | granted 
----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-----
--+---------------+---------
 advisory |    16386 |          |      |       |            |               |       0 |     1 |        1 | 4/0                | 2645
0 | ExclusiveLock | t
(1 row)

B has now acquired the lock.

Since advisory locks have nothing to do with MVCC, they never conflict with MVCC locks. They suit special scenarios, reducing lock contention or the pressure a long-held lock puts on the database (for example, blocking VACUUM from reclaiming space, which I wrote about in an earlier blog post).

An example use case for advisory locks (application-controlled locking):
Suppose the database stores a mapping between files and IDs, and the application needs to hold a lock for a long time while it modifies a file, then release the lock.
Test data:

digoal=> create table tbl_file_info (id int primary key,file_path text);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "tbl_file_info_pkey" for table "tbl_file_info"
CREATE TABLE
digoal=> insert into tbl_file_info values (1,'/home/postgres/advisory_lock_1.txt');
INSERT 0 1
digoal=> insert into tbl_file_info values (2,'/home/postgres/advisory_lock_2.txt');
INSERT 0 1
digoal=> insert into tbl_file_info values (3,'/home/postgres/advisory_lock_3.txt');
INSERT 0 1


SESSION A:

digoal=> select pg_advisory_lock(id),file_path from tbl_file_info where id=1;
 pg_advisory_lock |             file_path              
------------------+------------------------------------
                  | /home/postgres/advisory_lock_1.txt
(1 row)

After the application finishes editing /home/postgres/advisory_lock_1.txt, it releases the advisory lock.

SESSION B:
While SESSION A is editing /home/postgres/advisory_lock_1.txt, SESSION B cannot obtain the lock, which guarantees the file is not edited concurrently.


If we do not use advisory locks and rely on MVCC instead, it looks like this (still using the test data above):
SESSION A:

digoal=> begin;
BEGIN
digoal=> select file_path from tbl_file_info where id=1 for update;
             file_path              
------------------------------------
 /home/postgres/advisory_lock_1.txt
(1 row)

After the application finishes editing /home/postgres/advisory_lock_1.txt, it releases the lock:

digoal=> end;
COMMIT

So while editing the file this session must hold the transaction open, which both reduces database concurrency and creates a long-running transaction that prevents dead-tuple space from being reclaimed (see my earlier blog post):
DETAIL:  xxxx dead row versions cannot be removed yet.

SESSION B:
While SESSION A is editing /home/postgres/advisory_lock_1.txt, a session that wants to change the same file must likewise acquire the lock first:

digoal=> select file_path from tbl_file_info where id=1 for update;

It waits for SESSION A to release the lock.
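
When blocking is undesirable, the same pattern can be sketched with the non-blocking variant pg_try_advisory_lock, which returns a boolean immediately instead of waiting:

```sql
-- true  = lock obtained, go ahead and edit the file;
-- false = another session holds it, so retry later or skip.
SELECT pg_try_advisory_lock(id), file_path
  FROM tbl_file_info
 WHERE id = 1;

-- After editing the file:
SELECT pg_advisory_unlock(1);
```

This lets an application poll or move on to another file rather than queue up behind the current editor.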

[Reference]

[Repost] How to completely remove the hibernation file on Windows 7

Let me say something about hibernation in Windows 7. After installing the system, did you turn hibernation off? If not, it takes up space on the C: drive, a problem everyone knows about.

  The first thing I did after installing Windows 7 was disable hibernation: in Power Options, I set the sleep timeout to Never for both the plugged-in and on-battery states and confirmed. But does that really disable hibernation? If you disabled it that way, open the C: drive, open Folder Options, choose View, untick "Hide protected operating system files", and confirm. Among the pile of hidden files that appear, look for one named hiberfil.sys: that is the annoying hibernation file, and it still exists even after disabling hibernation as above (if it isn't there, forget I said anything). Deleting the file directly is not the proper approach; when I tried, the system refused to delete it. I didn't try a file shredder; interested readers can experiment, but I still think it should be removed correctly rather than forcibly.

Here is how to turn hibernation off completely, including that file.
Open c:\windows\system32, find the cmd program, right-click it, and choose Run as administrator.
It must be run as administrator, otherwise this has no effect.
In the command prompt window, type powercfg -h off
and press Enter.

Hibernation is now completely disabled.

Check the C: drive now; hasn't some space been freed? To re-enable hibernation, use the same method: open a command prompt,
again as administrator,
and type powercfg -h on
Hibernation is back on, and the file reappears.

EVA6400 Preferred path/mode

HP EVA6400 storage offers four options for Preferred path/mode, as shown:
[Figure: EVA6400 Preferred path/mode options]
The default is No preference. 
Suppose the Linux multipath.conf is configured as follows : 
blacklist {
         devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
         devnode "^hd[a-z]"
         devnode "^sda$"
         #devnode "^cciss!c[0-9]d[0-9]*"
}
defaults {
         udev_dir                 /dev
         polling_interval         10
         selector                 "round-robin 0"
         path_grouping_policy     failover
         getuid_callout           "/sbin/scsi_id -g -u -s /block/%n"
         prio_callout             /bin/true
         path_checker             readsector0
         rr_min_io                100
         rr_weight                priorities
         failback                 immediate
         no_path_retry            fail
         user_friendly_names      yes
         flush_on_last_del        yes
}
multipaths {
         multipath {
                 wwid     36001438005de97860000b00001400000
                 alias    e06_eva_vd13
        }
}

Viewing the multipath state on the OS produces output like this.
multipath -ll
e06_eva_vd13 (36001438005de97860000b00001400000) dm-0 HP,HSV400
[size=600G][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=100][active]
 \_ 1:0:0:1 sdb 8:16  [active][ready]
 \_ 2:0:0:1 sdd 8:48  [active][ready]
\_ round-robin 0 [prio=20][enabled]
 \_ 1:0:1:1 sdc 8:32  [active][ready]
 \_ 2:0:1:1 sde 8:64  [active][ready]

After changing the setting to Path B-Failover/failback, the result is as follows.
multipath -ll
e06_eva_vd13 (36001438005de97860000b00001400000) dm-0 HP,HSV400
[size=600G][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=20][active]
 \_ 1:0:0:1 sdb 8:16  [active][ready]
 \_ 2:0:0:1 sdd 8:48  [active][ready]
\_ round-robin 0 [prio=100][enabled]
 \_ 1:0:1:1 sdc 8:32  [active][ready]
 \_ 2:0:1:1 sde 8:64  [active][ready]

Preferred path/mode can be modified online.

2012: a patch for 2011.

My self-assessment for 2012:
[Figure: 2012 self-assessment]

Comments from colleagues in my department and from other departments and grades; thank you all.
[Figure: colleagues' comments]

Excerpts from early web articles on SSD fragmentation

Article 1 : 

There is another matter worth discussing: caching involves heavy reading and writing of large numbers of fragmented files (simply put, small files). MLC SSDs already have short lifespans, so they are even less suitable for a frequently rewritten cache. Fortunately the Gigabyte GA-Z68XP-UD3-iSSD motherboard carries an SLC SSD, which has the advantage over MLC for caching duty.

    The GA-Z68XP-UD3-iSSD used in this review bundles an Intel 311 SLC SSD; combined with Gigabyte EZ Smart Response, users can easily set the SSD up as a cache for the main mechanical drive.


Article 2 : 

Some notes on SSDs (noatime, allocation blocks)

SSD prices keep sliding, so SSDs are increasingly accepted by ordinary users: zero noise, shock resistance, and fast reads and writes make many users flock to them. Yet most users' understanding of SSDs remains superficial, especially regarding performance. Most only look at the sequential read/write numbers printed on the drive, which is a pity in itself. This article is a brief guide, for those using or about to use an SSD under Mac OS X, to the issues worth noting when selecting and using one.

1: Concepts to know when choosing an SSD

An SSD has a few key components: the controller chip, the NAND flash chips, and the cache.

The controller chip is like the computer's CPU: its quality directly affects the SSD's stability and efficiency. Mainstream SSD controllers come mainly from Intel, JMicron, Indilinx, Toshiba, Samsung, and SandForce. Intel's controller is used in its own products and in rebadged products for brands such as Kingston; its hallmark is an industry-leading 10-channel parallel design with optimized algorithms, which for a long time kept Intel SSDs far ahead of other vendors in stability and random read/write performance.

JMicron chips are typically used by low-end brands, including RunCore, PQI, Kingshare, and early OCZ and Photofast DIY products; the upside is that they are cheap, while stability and performance are almost nonexistent. Indilinx is another low-cost controller like JMicron, but with much better stability and performance, and many mainstream DIY SSD brands now use it. SandForce is a rising star with strong performance; nearly all high-performance DIY-brand SSDs on the market now use SandForce solutions, and the article names the well-known Crucial C300 and OCZ Extreme as examples. Toshiba and Samsung controllers are used in their own SSDs: middling performance, but stable.

NAND flash chips need little introduction; we have been using them since USB flash drives. Mainstream NAND comes from Samsung, Toshiba, and Intel/Micron (an Intel-Micron joint venture). It comes in three kinds: SLC, MLC, and the newer eMLC. SLC is what the famous X25-E uses: single-level cells, extremely fast, long-lived, and very stable, but small in capacity and expensive, so it mostly appears in the enterprise market. MLC is the most common SSD NAND; the X25-M, Samsung, Toshiba, and most other DIY-brand products use it: multi-level cells, decent performance, large capacity, low price, acceptable lifespan. The third kind is the eMLC to be used in the second-generation X25-E that Intel is set to announce at CES 2011. It is a variant of MLC: SLC performs superbly but costs too much per unit of storage, and eMLC solves this by adding ECC checking and other data-safety features to ordinary MLC, making it reliable enough for business-critical enterprise use.

Cache: early no-name SSDs had no cache at all, which made them very unstable and prone to read/write errors. Mainstream SSDs today all carry a cache chip. A cache has two parameters, frequency and capacity; as with RAM, the higher the frequency and the larger the capacity, the better. Intel, however, seems very confident in its controller design: the cache on the X25-E and X25-M is unimpressive in both frequency and capacity, yet the drives still perform this well, which shows how strong the controller is. Mainstream DIY brands now start at 128MB or 256MB of cache; even with weaker controllers that yields very high sequential numbers, and paired with a good SandForce controller the figures look spectacular. If the next Intel generation raises cache capacity and frequency, outclassing the DIY brands should not be hard.

2: How to read SSD specifications

An SSD has several important parameters: sequential read/write performance, random read/write performance, shock resistance, and lifespan.

Sequential read/write is the performance of reading and writing one large contiguous file, and it is usually the number printed on the SSD's box. Honestly, it does not reflect an SSD's real performance: with a cache that is large and fast enough, even a junk no-name SSD can hit a scary 200MB/s+ in sequential transfers. So treat this number as a curiosity.

Random read/write performance is the parameter that truly reflects a controller's quality and the drive's real performance. Files on a file system are not stored contiguously: as a disk is used, new files are added and old ones deleted, so at the physical level everything ends up scattered, which is one source of fragmentation. Statistics also show that about 90% of the files a computer actually reads and writes frequently are roughly 4KB to 2048KB (2MB) in size, so random performance on these small files is a key measure of a drive. A traditional HDD's mechanical design forces the head to seek constantly across the platters when handling small files, so its small-file random performance is poor; an SSD has no mechanical parts, so its seek and random read/write performance is far beyond any HDD. This makes random I/O an important spec when buying an SSD. Note, though, that every brand may attach a footnote marker to its random performance figures to indicate they were obtained under special conditions, as in the X25-M G2's specs:

Random I/O Operations per Second (IOPS)*
  1. Random 4KB Reads: up to 35,000 IOPS
  2. up to 6,600 IOPS (80GB drive)
  3. up to 8,600 IOPS (120GB drive)
  4. up to 8,600 IOPS (160GB drive)

* Measurement performed on 8GB span.

35,000 4KB IOPS looks very impressive, but note the footnote marker after IOPS: read on and you discover the figure was measured over a specific contiguous 8GB span. In normal day-to-day use you cannot expect to reproduce that number.
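To put the headline figure in perspective, the quoted 35,000 random 4KB reads per second translate to only about 136 MB/s of bandwidth, far below the sequential numbers on the box. A quick shell calculation (the IOPS and block size are the values quoted above; the arithmetic is my own):

```shell
# Bandwidth implied by 35,000 random 4KB reads per second
iops=35000
block_kb=4
echo "$(( iops * block_kb / 1024 )) MB/s"   # prints: 136 MB/s
```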

Shock resistance needs no elaboration; nobody throws an SSD around anyway.

Lifespan: both SLC and MLC have a finite number of erase cycles, but this is generally nothing to worry about. Mainstream SSD controllers use wear-leveling algorithms that spread erases evenly across all NAND cells; an MLC drive typically lasts 3 to 5 years, and SLC lasts longer.

3: SSDs and file systems

There is still no defragmentation program designed for SSDs. Never run any defragmenter on an SSD other than one supplied by the SSD's manufacturer.

Consider the Mac OS X Extended (HFS+) file system. It automatically defragments files under 20 MiB, and you cannot turn this off; it has no material impact on SSD performance. Beyond that, it also records each file's access time (atime), and this does affect performance. By recording access times, the file system works out which files are Hot Files; once a file qualifies, the file system automatically moves it toward the front of the volume, even adding it to the volume's metadata tree to speed up access. This helps a great deal on an HDD, but on an SSD it brings no benefit and only adds unnecessary writes.

Allocation blocks and SSD RAID: by default, Snow Leopard allows at most 2^32 allocation blocks per volume, each 4KB (by default), which is why we usually care about 4KB random read/write. With RAID, an extra hardware (or software) layer is added and the allocation block size becomes a choice, e.g. 32KB, 64KB, or 128KB. The block size directly affects 4KB random performance, CPU usage, disk utilization, and so on; in short, it is a trade-off among space, time, and efficiency. For software RAID, blocks that are too small drive CPU usage up and system performance down, while blocks that are too large waste disk space; and the presence of the software layer itself causes SSD 4KB random performance to plummet. Hardware RAID has similar problems.
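As a side note, the 2^32 allocation-block limit combined with 4KiB blocks implies a maximum volume size of 16 TiB; a quick shell check (the limit and block size are the figures stated above, the derivation is my own):

```shell
# 2^32 allocation blocks x 4 KiB per block, expressed in TiB
blocks=$(( 1 << 32 ))
block_kib=4
echo "$(( blocks * block_kib / 1024 / 1024 / 1024 )) TiB"   # prints: 16 TiB
```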

That's all for now; I hope it helps.


Article 3:
Improving the write performance of Intel MLC SSDs by setting an HPA
We know that Intel MLC SSDs accumulate fragmentation as they are used, and performance drops sharply as fragmentation grows. Intel engineers suggest using the HPA feature to reserve part of the capacity for the SSD's internal use, which effectively reduces fragmentation.
First, what is an HPA?
HPA stands for "host protected area". Loosely, it sets the highest sector number the host may read, hiding everything above it. This maximum sector number can be stored on the drive itself, so even if you attach the disk to another machine you normally cannot see the hidden area, and tools like fdisk or PartitionMagic simply treat the disk as one of slightly smaller capacity. HPA is part of the ATA standard (included since ATA-4) and must be supported by the drive's firmware.
On Linux, a recent version of hdparm can set the HPA; the hdparm shipped with RHEL 5.x is too old for this. Download hdparm from SourceForge: http://sourceforge.net/projects/hdparm/. I downloaded hdparm-9.27.tar.gz and put it under /usr/src:
#cd /usr/src 
#tar zxvf hdparm-9.27.tar.gz 
#cd hdparm-9.27 
#make 
#make install 
This installs the new hdparm.
Check that the installed hdparm is the new version:
#hdparm -V
hdparm v9.27 
Note that setting an HPA destroys the existing data on the disk.
View the current HPA setting:
#hdparm -N /dev/sdh 
/dev/sdh:
 max sectors   = 146800640/312581808, HPA is disabled 
As shown, the HPA is currently disabled.
Now set the HPA. At 2^30 bytes/GB, a 160GB SSD works out to about 149GB; we set the visible size to 120GB, leaving 29GB for internal use.
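The sector count to pass to hdparm can be derived directly: 120 GiB at 512 bytes per sector is 251,658,240 sectors. A shell sanity check, using the 2^30 bytes/GB convention from the text (the arithmetic is mine, the target size is the article's):

```shell
# 120 GiB visible area / 512-byte sectors = sector count for hdparm -N
size_gib=120
echo $(( size_gib * 1024 * 1024 * 1024 / 512 ))   # prints: 251658240
```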
  
Note: the disk must not be in use when you set the HPA, and the machine must be rebooted afterwards for the setting to take effect.
#hdparm -N p251658240 /dev/sdh 
/dev/sdh:
 setting max visible sectors to 251658240 (permanent)
Use of -Nnnnnn is VERY DANGEROUS.
You have requested reducing the apparent size of the drive.
This is a BAD idea, and can easily destroy all of the drive's contents.
Please supply the --yes-i-know-what-i-am-doing flag if you really want this.
Program aborted. 
Because this operation destroys all data on the disk, hdparm warns you; you must add --yes-i-know-what-i-am-doing before it will actually set the HPA.
  
#hdparm -N p251658240 --yes-i-know-what-i-am-doing /dev/sdh 
  
The leading p in pNNNNNNN marks the setting as persistent.
  
  
Reboot Linux.

Verify that the setting took effect:

#hdparm -N /dev/sdh 
/dev/sdh:
 max sectors   = 251658240/312581808, HPA is enabled

Article 4:

SSD fragmentation explained in plain terms

We have already explained wear-leveling algorithms, fragmentation, and related concepts systematically, but we worry the treatment was too sketchy to be clear. Below we use something everyone knows, a student dormitory, to explain how SSD fragmentation arises and why it hurts performance.

    We know that an SSD's internal space is divided into blocks; in this simplified picture, a block holds 4 pages, and a page is usually 4KB. Viewed as a student dormitory: the building has several floors (blocks), each floor has 4 rooms (pages), and each room houses 4 students (each student standing for 1KB).

    Normally, each room is filled with 4 students before a new room is opened, which is exactly how a wear-leveling algorithm behaves. But once every room has housed students, problems begin.

    The population of each room is not perfectly stable: after a while, some students in some rooms leave the school, freeing up beds, and the school assigns new students to fill them. It is when placing the newcomers that the dorm manager gets confused: every room has housed students at some point, so which rooms actually have free beds?

    Before the manager has figured it out, the principal's office issues an order (the wear-leveling algorithm): rooms must be filled one after another. The poor manager, now thoroughly lost, can only reshuffle the whole dorm: call all the students out and then assign them back in one by one. When this happens inside an SSD, the result is obvious: the system slows down, because the drive is busy reorganizing data internally.

    And that is not all. Once the manager has finally rearranged the dorm, a class teacher asks him to gather all the students of his class for lessons; but after the reshuffle, that class is scattered across rooms on every floor of the building, and to notify them the manager must climb from the ground floor to the top. In SSD terms, applications run slowly; in disk terms, fragmentation is hurting software performance.

    As the description suggests, chaos is most likely when the SSD's capacity is nearly used up, because the manager (the SSD controller), the principal's office (the wear-leveling algorithm), and the class teachers (applications) are constantly in conflict. If the disk is mostly empty, things are much easier to arrange: at worst a few extra rooms get used, and the repeated data reshuffling is far less likely to occur.

    So is there a good way to solve this? For now, one option is to house fewer people per room, say 1KB per page. But that multiplies the manager's daily workload: managing 100 rooms is already work, and managing 400 is much harder.

    Of course, things would also improve without the wear-leveling algorithm getting in the way: just squeeze in wherever a room is free. But then the SSD's lifespan can no longer be guaranteed.

    In short, fragmentation remains a very hard problem for SSDs. Adding more cache may help, but without better algorithms (management schemes), it is unlikely to help much. We can only hope that upstream vendors come up with smarter designs. In any case, the outlook for SSDs is bright: no product is born problem-free, and the scenario discussed here is quite extreme; users may never actually hit it.


Article 5:

Why does my SSD keep getting slower?

It was quite fast at first, very satisfying; but now, with about the same amount of data as right after the Ghost restore, it keeps getting more sluggish.
A fragmentation analyzer says no defragmentation is needed.
It's a 30GB no-name SSD.
Any fix?
After changing the cluster size to 512 bytes, the stalls mostly went away. It seems the SSD writes multiple clusters at once; when utilization is high, the choice of writable clusters degrades, so the drive tends to freeze up.


