
Hit the Right Business Sweet Spot - A Guide to Extending the PostgreSQL Kernel


A general-purpose database is like a house that comes already fitted out. Ranked by functionality, databases come in luxury, premium and basic grades.

Judged on SQL compatibility, features, performance and stability, PostgreSQL is firmly in the luxury class: you can move in with nothing but your suitcase.

But general-purpose is still general-purpose. If it does not hit your sweet spot, even the most luxurious fit-out will not feel great. That is the weakness of many general-purpose databases, and it is exactly the perception PostgreSQL is about to overturn.

Build the best tailor-made database on top of PostgreSQL

I spent two all-nighters writing this guide to PostgreSQL kernel extension; time was short, so the content is mostly introductory.

I hope it gives more people a first impression of PostgreSQL kernel extension. Extending the kernel does not require deep knowledge of the database internals; just focus on your business and use the open APIs PostgreSQL provides to extend the database's capabilities and build a database of your own.

Why extend the database

Before answering that, let us answer another question first.

Is a database only there to store data, with all computation handed to the application?

In the era of centralized data and expensive hardware, and on fairly general hardware, you might have done exactly that to keep database computation from becoming the bottleneck: use the database as simply as possible, do almost no computation in it, and stick to plain CRUD.

As database technology evolved, horizontal sharding became more and more widely used. Hardware kept advancing too: CPU core counts, memory bandwidth, and block-device bandwidth and IOPS have all grown rapidly, and even GPU-assisted computation is becoming a focus of acceleration.

The hardware a database runs on is now extremely powerful. Given that, what problems arise if the database is still used only for simple reads and writes?

I previously wrote "On the importance of programmability in cloud databases"; it is worth a read and may offer some inspiration on this question.
https://yq.aliyun.com/articles/38377

With the rapid progress of hardware, combined with the development of database sharding, a database on general-purpose hardware is no longer the bottleneck.

For OLTP queries the database can often respond at the microsecond level, while the network layer may cost milliseconds. The more complex the business logic and the more round trips to the database, the more the network RT is multiplied, hurting the user experience.

In more complex scenarios the data has to be pulled to the application side and processed there. That means moving data, which amplifies the network RT even further. The move-data pattern is gradually becoming the main culprit behind poor user experience, low efficiency and wasted cost.

If the database can be turned into a product that combines data storage, management and processing, then, whenever the database host has spare resources, handing the logic the database is able to process over to the database will greatly reduce latency. This is very effective for high-concurrency, low-latency workloads, as the small sketch below illustrates.
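As a minimal sketch of the idea (the accounts table and transfer logic are hypothetical, not from the original article): instead of several application-side round trips, the whole unit of logic runs in a single server-side call.

-- One network round trip instead of many: the logic executes inside the database.
CREATE OR REPLACE FUNCTION transfer(from_acct int, to_acct int, amount numeric)
RETURNS void AS $$
BEGIN
    UPDATE accounts SET balance = balance - amount WHERE id = from_acct;
    UPDATE accounts SET balance = balance + amount WHERE id = to_acct;
END;
$$ LANGUAGE plpgsql;

-- The application now issues a single statement:
-- SELECT transfer(1, 2, 100.00);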

What this tests is the database's extensibility.

Why PostgreSQL is especially well suited to kernel extension

I have distilled three reasons it is well suited to kernel extension; if I missed any, please add them, thanks.
.1. Rich interfaces
What open interfaces does PostgreSQL provide?
UDF (ordinary, aggregate and window functions can all be extended)
https://www.postgresql.org/docs/9.5/static/xfunc-c.html

GiST, SP-GiST, GIN, BRIN: generic index interfaces that allow custom indexing of arbitrary types
https://www.postgresql.org/docs/9.5/static/gist.html
... ...

Extensible index access method interface (see the bloom example)
https://www.postgresql.org/docs/9.6/static/bloom.html
https://www.postgresql.org/docs/9.6/static/xindex.html

Operators: operators can be created for any type
https://www.postgresql.org/docs/9.5/static/sql-createoperator.html

Custom data types
https://www.postgresql.org/docs/9.5/static/sql-createtype.html

FDW, the foreign data wrapper interface: external data sources can be used as if they were local tables
https://www.postgresql.org/docs/9.5/static/fdwhandler.html

Procedural-language handlers: any high-level language can be integrated as a server-side function language (for example Java, Python, Swift, Lua, ...)
https://www.postgresql.org/docs/9.5/static/plhandler.html

Background workers: dynamically forked processes and dynamically created shared memory segments.
https://www.postgresql.org/docs/9.5/static/bgworker.html

Table sampling methods: define your own data sampling method, for example to build a test environment with sampling rules tailored to your needs.
https://www.postgresql.org/docs/9.5/static/tablesample-method.html

Custom scan providers: extend the built-in sequential scan, index scan and so on with your own scan methods (for example, a GPU compute unit can access block devices directly via DMA, bypassing user space and greatly improving transfer throughput).
https://www.postgresql.org/docs/9.5/static/custom-scan.html

Generic WAL: custom redo-log encode/decode interfaces, which can be used, for example, to build a black-hole database.
https://www.postgresql.org/docs/9.6/static/generic-wal.html

Users can use these interfaces to build a tailor-made database that fits their business and adapts to all kinds of special scenarios.

The key point is that you do not need to understand the database's internal implementation; using these extension interfaces is enough.

PostGIS, the most widely used geographic information system in the world, is a PostgreSQL plugin built through exactly these interfaces.
(It combines custom data types, custom operators, and indexes built on GIN, GiST, SP-GiST and B-tree.)

.2. PostgreSQL uses a process model
Is the process model really an advantage? Absolutely.

Compared with a thread model, multiple processes are more robust: if one process crashes you simply start it again, whereas a crashing thread takes the whole process down with it.

You certainly do not want adding a feature to bring the database down, and with a thread model you would have to be extremely careful when extending it.
PostgreSQL's extension interfaces have been around for many years and countless plugins have been built on them, so they are very stable. Add the process model and you can extend PostgreSQL boldly. Later I will show you just how many plugins there are.

.3. BSD license
What, the BSD license is an advantage too? Absolutely.

If you want to package your enhanced PostgreSQL as a product and sell it, you will appreciate BSD: it allows distribution in any form.

Kernel extension guide

(screenshots omitted)

An overview of the PostgreSQL kernel

screenshot

How to analyze bottlenecks in the database code

(screenshots omitted)

How to write a custom UDF

screenshot

  1. Mapping between C types and SQL types
    screenshot

Macros for fetching SQL arguments

(screenshots omitted)

Macros for returning results to the SQL function

(screenshots omitted)

C UDF example: composite-type SQL input

screenshot

C UDF example: returning a record type

screenshot

screenshot

C UDF example: returning a table (SRF)

screenshot

screenshot

C UDF example: reversing a string

screenshot

screenshot
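Since the original screenshots are not reproduced here, the following is a minimal sketch of such a C UDF (the function name my_reverse is made up, and the byte-wise reversal is only correct for single-byte encodings); it also illustrates the argument-fetching and result-returning macros discussed above.

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(my_reverse);

/* Reverse the bytes of a text value (illustrative; single-byte encodings only). */
Datum
my_reverse(PG_FUNCTION_ARGS)
{
    text   *src = PG_GETARG_TEXT_PP(0);       /* fetch argument 0 as text      */
    int     len = VARSIZE_ANY_EXHDR(src);     /* payload length in bytes       */
    char   *sp  = VARDATA_ANY(src);
    text   *dst = (text *) palloc(VARHDRSZ + len);
    char   *dp  = VARDATA(dst);
    int     i;

    SET_VARSIZE(dst, VARHDRSZ + len);
    for (i = 0; i < len; i++)
        dp[i] = sp[len - 1 - i];

    PG_RETURN_TEXT_P(dst);                    /* hand the result back to SQL   */
}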

How to compile the C function and create the SQL function

screenshot

C functions are the most fundamental building block of extensions and must be mastered.

Aggregates, window functions, data types, operators, indexes, FDWs and so on are all built around, or directly on top of, C functions.

You will understand this better later, especially after seeing the syntax.
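As a hedged illustration of the build step (the file and function names follow the my_reverse sketch above and are not from the original article), a PGXS Makefile and the matching CREATE FUNCTION might look like this:

# Makefile -- assumes the C code above is saved as my_reverse.c
MODULES   = my_reverse
PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)

-- after "make && make install":
CREATE FUNCTION my_reverse(text) RETURNS text
    AS '$libdir/my_reverse', 'my_reverse'
    LANGUAGE C STRICT IMMUTABLE;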

How aggregate functions work

Make sure you understand the state transition function, its input arguments, the initial state value, the intermediate state, the final function and the final result type.
screenshot

Custom aggregate functions

screenshot
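A minimal sketch (the sum-of-squares aggregate is made up for illustration) showing how the pieces named above map onto CREATE AGGREGATE:

-- state transition function: adds the square of each input to the running state
CREATE FUNCTION sum_sq_step(numeric, numeric) RETURNS numeric
    AS $$ SELECT $1 + $2 * $2 $$ LANGUAGE sql STRICT IMMUTABLE;

CREATE AGGREGATE sum_sq(numeric) (
    SFUNC    = sum_sq_step,   -- state transition (iterate) function
    STYPE    = numeric,       -- intermediate (state) type
    INITCOND = '0'            -- initial state value
);

-- usage: SELECT sum_sq(x) FROM my_table;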

Custom window functions

screenshot

Custom data types

The most basic parts of a data type are its input and output functions, which convert the SQL text form into the C representation and the C output back into SQL text.
Text depends on character encoding, so PG also supports binary input and output functions. These are typically used for logical replication of the data without worrying about encoding conversion: what you see is what you get.
(screenshots omitted)
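A minimal sketch of the SQL side (the complex type and its I/O functions are the classic example from the PostgreSQL documentation, not from this article; the C functions would look much like the pcpoint_in / pcpoint_out functions shown in the point cloud article later on this page):

CREATE TYPE complex;   -- shell type, so the I/O functions can reference it

CREATE FUNCTION complex_in(cstring) RETURNS complex
    AS '$libdir/complex', 'complex_in' LANGUAGE C IMMUTABLE STRICT;

CREATE FUNCTION complex_out(complex) RETURNS cstring
    AS '$libdir/complex', 'complex_out' LANGUAGE C IMMUTABLE STRICT;

CREATE TYPE complex (
    internallength = 16,        -- two float8 fields
    input          = complex_in,
    output         = complex_out,
    alignment      = double
);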

Custom operators

An operator is really an abstraction over a function. Its definition covers the operator's arity, the types of its operands, and its commutator and negator operators (which the query rewriter uses to rewrite SQL so that more plan choices become available).

For example, a<>1 is equivalent to not (a=1); forms like these are interchangeable.

Also attached to an operator are planner-related options such as the restriction and join selectivity estimators.

(screenshots omitted)

Custom operator example

screenshot

screenshot
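A minimal sketch (reusing the hypothetical complex type from above; complex_add would be another C function):

CREATE FUNCTION complex_add(complex, complex) RETURNS complex
    AS '$libdir/complex', 'complex_add' LANGUAGE C IMMUTABLE STRICT;

CREATE OPERATOR + (
    LEFTARG    = complex,
    RIGHTARG   = complex,
    PROCEDURE  = complex_add,
    COMMUTATOR = +          -- a + b is equivalent to b + a, which helps the planner
);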

Custom index syntax

Defining a custom index is also very simple: implement the support functions required by the index access method and add the operators to the index's operator class.

Those operators can then use the index.

screenshot
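A minimal sketch of the syntax (the btree operator class for the hypothetical complex type is the standard example from the PostgreSQL documentation; the comparison operators and the complex_abs_cmp support function are assumed to exist already):

CREATE OPERATOR CLASS complex_abs_ops
    DEFAULT FOR TYPE complex USING btree AS
        OPERATOR        1       < ,
        OPERATOR        2       <= ,
        OPERATOR        3       = ,
        OPERATOR        4       >= ,
        OPERATOR        5       > ,
        FUNCTION        1       complex_abs_cmp(complex, complex);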

The GIN index interface

screenshot

screenshot

The GiST index interface

screenshot

screenshot

The SP-GiST index interface

screenshot

screenshot

The BRIN, B-tree and hash index interfaces

For the gin, gist, sp-gist and brin interfaces the strategy numbers are not fixed; users can add strategies according to what the index is meant to do.

For btree and hash the strategy numbers are fixed.

screenshot

screenshot

Custom GIN index example

Taken from contrib

(screenshots omitted)
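The screenshots above come from contrib; as a hedged usage example, this is what one of the contrib-provided GIN operator classes (pg_trgm; the table and query are made up) looks like from the SQL side once installed:

CREATE EXTENSION pg_trgm;

CREATE TABLE docs (id int, body text);

-- index the text column with the GIN operator class shipped by the extension
CREATE INDEX docs_body_trgm ON docs USING gin (body gin_trgm_ops);

-- the % (similarity) operator, and LIKE patterns, can now use this GIN index
SELECT * FROM docs WHERE body % 'postgres';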

Summary of the PostgreSQL kernel extension interfaces

screenshot

How to package and release a PostgreSQL plugin

(screenshots omitted)

Deep integration of GPUs and FPGAs with PostgreSQL

screenshot

An introduction to PG-Strom

screenshot

How PG-Strom works

PG-Strom uses the planner hook: when the execution plan is generated, it runs its own plan generator to build a custom plan.

At the same time, through the custom scan provider, the GPU compute units can access block devices directly via DMA, bypassing the buffer cache and raising access throughput.

It also defines custom compute nodes, so joins, sorts, grouping and other computation can be handed to the GPU.

That is how GPU acceleration is achieved.

(screenshots omitted)

Where GPU acceleration helps

Bulk data computation.

For example:

Dynamic route planning.

Profiles of people, crowds, companies, neighborhoods and cities built on bit operations.

Text analysis and machine learning over large volumes of data.

Dynamic route planning

screenshot

Bit logic operations

screenshot

screenshot

PostGIS point-in-polygon tests

(A slip of the pen: this is probably not a GPU strength. GPUs shine at bulk computation, where latency does not matter but throughput does.)

(Point-in-polygon testing is an OLTP scenario and does not need a GPU.)

(screenshots omitted)

Besides GPUs, LLVM is another way to accelerate bulk computation, and its performance is excellent.

Deepgreen, VitesseDB and Redshift all use LLVM to speed up bulk-computation workloads.

References

screenshot

Example extensions

Saying that PostgreSQL is well suited to kernel extension is not just empty talk.
Here are some examples.

Gene sequencing plugin
https://colab.mpi-bremen.de/wiki/display/pbis/PostBIS
Chemistry types plugin
http://rdkit.org/
Fingerprint types plugin
Geographic information management plugin
http://postgis.org/
K-V plugins: hstore, json
Streaming data processing plugin
http://www.pipelinedb.com/
Time series plugin
https://github.com/tgres/tgres
Approximate string matching: pg_trgm
ElasticSearch plugin
https://github.com/Mikulas/pg-es-fdw
R language plugin
http://www.joeconway.com/plr/
Sharding (distributed) plugin
https://github.com/citusdata/citus
Columnar storage plugin
https://github.com/citusdata/cstore_fdw
In-memory table plugin
https://github.com/knizhnik/imcs
Foreign data source plugins
https://wiki.postgresql.org/wiki/Fdw
hll, bloom and other plugins
Data mining plugin
http://madlib.incubator.apache.org/
Chinese word segmentation plugins
https://github.com/jaiminpan/pg_jieba
https://github.com/jaiminpan/pg_scws
Cassandra plugin
https://github.com/jaiminpan/cassandra2_fdw
Alibaba Cloud object storage plugin oss_fdw
https://yq.aliyun.com/articles/51199
... ...

Where to find open-source PostgreSQL plugins
https://git.postgresql.org/gitweb/
http://pgxn.org/
http://pgfoundry.org/
https://github.com/
http://postgis.org/
http://pgrouting.org/
https://github.com/pgpointcloud/pointcloud
https://github.com/postgrespro
... ...

All of the above testify that PostgreSQL is very well suited to kernel extension.

Imagine the industries this could be extended into

Image recognition
Location-based O2O task scheduling
Circuit board inspection
Mold making (脚模)
Route planning
Transparent hot/cold data separation
The IoT industry
The finance industry
... ...
PostgreSQL can go deep into almost any field.

Summary

.1. PostgreSQL's process model provides a very reliable safety net for kernel extension.
.2. You do not need to know how the PG kernel is written; you only need to understand your business and use the APIs PG provides to extend its functionality.
.3. Almost all extensions are based on C functions, so make sure you master how PostgreSQL C functions are written.
.4. PostgreSQL's BSD license is an advantage; large companies that have been burned by other open-source licenses now take licensing very seriously. (If you do not care now, are you just waiting to be fattened up for the kill? ^-^)
.5. Extensibility is one of PostgreSQL's core strengths; make good use of it.

Come build your own database, unleash the real power of PostgreSQL, and open a new database era.

Welcome to join Alibaba Cloud

PostgreSQL, Greenplum, MySQL, Redis, MongoDB, Hadoop, Spark, SQL Server, SAP, ... Whatever database you have seen, you may well meet it on Alibaba Cloud.
Technology raises productivity; let us create value for society together.


PostgreSQL kernel extension - an ElasticSearch synchronization plugin


Background

Elasticsearch is a newcomer among open-source search platforms and a powerful tool for real-time data analysis. It is developing rapidly: built on Lucene, RESTful, distributed, designed for the cloud, with real-time search, full-text search, stability, high reliability, scalability, and easy installation and use.

PostgreSQL is an open-source database that originated at the University of California, Berkeley. It has a long history, an extremely extensible kernel, and users across every industry.
For a guide to PostgreSQL kernel extension see
https://yq.aliyun.com/articles/55981

How a traditional database is kept in sync with the ES search engine

For example, when some data in the database needs to be synchronized to ES for indexing, the traditional approach makes the application responsible for the synchronization.
That adds development cost, and the synchronization is not particularly timely.

What do you gain by combining PostgreSQL with ES

The PostgreSQL plugin pg-es-fdw uses PostgreSQL's foreign data wrapper interface so that ES can be read and written directly from the database, which makes it easy to index data in ES in real time.
This approach needs no extra program, and timeliness is guaranteed.

case

Install PostgreSQL 9.5

Omitted here; the build must be configured with --with-python

Install ES on CentOS 7

de  ># yum install -y java-1.7.0-openjdk

# rpm --import https://packages.elastic.co/GPG-KEY-elasticsearch

# vi /etc/yum.repos.d/es.repo
[elasticsearch-2.x]
name=Elasticsearch repository for 2.x packages
baseurl=https://packages.elastic.co/elasticsearch/2.x/centos
gpgcheck=1
gpgkey=https://packages.elastic.co/GPG-KEY-elasticsearch
enabled=1

# yum install -y elasticsearch

# /bin/systemctl daemon-reload
# /bin/systemctl enable elasticsearch.service
# /bin/systemctl start elasticsearch.service

# python --version
Python 2.7.5

# curl -X GET 'http://localhost:9200'
{
  "name" : "Red Wolf",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.3.3",
    "build_hash" : "218bdf10790eef486ff2c41a3df5cfa32dadcfde",
    "build_timestamp" : "2016-05-17T15:40:04Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.0"
  },
  "tagline" : "You Know, for Search"
}
de>

python client

de  ># easy_install pip
# pip install elasticsearch
de>

The PostgreSQL plugin multicorn

de  ># wget http://api.pgxn.org/dist/multicorn/1.3.2/multicorn-1.3.2.zip
# unzip multicorn-1.3.2.zip
# cd multicorn-1.3.2
# export PATH=/home/digoal/pgsql9.5/bin:$PATH
# make && make install
# su - digoal
$ psql
postgres=# create extension multicorn ;
CREATE EXTENSION
de>

The PostgreSQL plugin pg-es-fdw (a foreign server based on multicorn)

de  ># git clone https://github.com/Mikulas/pg-es-fdw /tmp/pg-es-fdw
# cd /tmp/pg-es-fdw
# export PATH=/home/digoal/pgsql9.5/bin:$PATH
# python setup.py install
# su - digoal
$ psql
de>

Usage example

Create the ES foreign server based on multicorn

de  >CREATE SERVER multicorn_es FOREIGN DATA WRAPPER multicorn
OPTIONS (
  wrapper 'dite.ElasticsearchFDW'
);
de>

Create the test table

de  >CREATE TABLE articles (
    id serial PRIMARY KEY,
    title text NOT NULL,
    content text NOT NULL,
    created_at timestamp
);
de>

Create the foreign table

de  >CREATE FOREIGN TABLE articles_es (
    id bigint,
    title text,
    content text
) SERVER multicorn_es OPTIONS (host '127.0.0.1', port '9200', node 'test', index 'articles');
de>

Create the triggers

Create trigger functions on the base table so that inserts, deletes and updates are automatically synchronized to the corresponding ES foreign table.
The synchronization goes through the FDW interface, which creates, updates or deletes the corresponding index entries in ES.

de  >CREATE OR REPLACE FUNCTION index_article() RETURNS trigger AS $def$
    BEGIN
        INSERT INTO articles_es (id, title, content) VALUES
            (NEW.id, NEW.title, NEW.content);
        RETURN NEW;
    END;
$def$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION reindex_article() RETURNS trigger AS $def$
    BEGIN
        UPDATE articles_es SET
            title = NEW.title,
            content = NEW.content
        WHERE id = NEW.id;
        RETURN NEW;
    END;
$def$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION delete_article() RETURNS trigger AS $def$
    BEGIN
        DELETE FROM articles_es a WHERE a.id = OLD.id;
        RETURN OLD;
    END;
$def$ LANGUAGE plpgsql;

CREATE TRIGGER es_insert_article
    AFTER INSERT ON articles
    FOR EACH ROW EXECUTE PROCEDURE index_article();

CREATE TRIGGER es_update_article
    AFTER UPDATE OF title, content ON articles
    FOR EACH ROW
    WHEN (OLD.* IS DISTINCT FROM NEW.*)
    EXECUTE PROCEDURE reindex_article();

CREATE TRIGGER es_delete_article
    BEFORE DELETE ON articles
    FOR EACH ROW EXECUTE PROCEDURE delete_article();
de>

Test

de  >curl 'localhost:9200/test/articles/_search?q=*:*&pretty'
psql -c 'SELECT * FROM articles'

Insert into the base table; the row is synchronized to ES automatically
psql -c "INSERT INTO articles (title, content, created_at) VALUES ('foo', 'spike', Now());"
psql -c 'SELECT * FROM articles'

Query ES to check that the data has been synchronized
curl 'localhost:9200/test/articles/_search?q=*:*&pretty'

Update the base table; the change is synchronized to ES automatically
psql -c "UPDATE articles SET content='yeay it updates\!' WHERE title='foo'"

Query ES to check that the data has been updated
curl 'localhost:9200/test/articles/_search?q=*:*&pretty'
de>

References

https://www.elastic.co/guide/en/elasticsearch/reference/current/setup-repositories.html
http://www.vpsee.com/2014/05/install-and-play-with-elasticsearch/
https://github.com/Mikulas/pg-es-fdw
https://wiki.postgresql.org/wiki/Fdw
http://multicorn.org/
http://pgxn.org/dist/multicorn/
http://multicorn.readthedocs.io/en/latest/index.html

Summary

  1. The FDW interface provided by PostgreSQL lets users operate on external data sources directly from within the database, so ES support is just one example; many more data sources can be supported.
    These are the wrappers that already exist; they cover almost every data source.
    https://wiki.postgresql.org/wiki/Fdw
  2. multicorn adds another abstraction layer on top of the FDW interface so that FDWs can be written in Python, which makes quick experiments easy. If the performance requirements are not extreme, using multicorn directly is perfectly fine.
  3. How does a developer write an FDW? See:
    http://multicorn.readthedocs.io/en/latest/index.html
    https://yq.aliyun.com/articles/55981
    https://www.postgresql.org/docs/9.6/static/fdwhandler.html

Appendix

de  >###
### Author: Mikulas Dite
### Time-stamp: <2015-06-09 21:54:14 dwa>

from multicorn import ForeignDataWrapper
from multicorn.utils import log_to_postgres as log2pg

from functools import partial

import httplib
import json
import logging

class ElasticsearchFDW (ForeignDataWrapper):

    def __init__(self, options, columns):
        super(ElasticsearchFDW, self).__init__(options, columns)

        self.host = options.get('host', 'localhost')
        self.port = int(options.get('port', '9200'))
        self.node = options.get('node', '')
        self.index = options.get('index', '')

        self.columns = columns

    def get_rel_size(self, quals, columns):
        """Helps the planner by returning costs.
        Returns a tuple of the form (nb_row, avg width)
        """

        conn = httplib.HTTPConnection(self.host, self.port)
        conn.request("GET", "/%s/%s/_count" % (self.node, self.index))
        resp = conn.getresponse()
        if not 200 == resp.status:
            return (0, 0)

        raw = resp.read()
        data = json.loads(raw)
        # log2pg('MARK RESPONSE: >>%d<<' % data['count'], logging.DEBUG)
        return (data['count'], len(columns) * 100)

    def execute(self, quals, columns):
        conn = httplib.HTTPConnection(self.host, self.port)
        conn.request("GET", "/%s/%s/_search&size=10000" % (self.node, self.index))
        resp = conn.getresponse()
        if not 200 == resp.status:
            yield {0, 0}

        raw = resp.read()
        data = json.loads(raw)
        for hit in data['hits']['hits']:
            row = {}
            for col in columns:
                if col == 'id':
                    row[col] = hit['_id']
                elif col in hit['_source']:
                    row[col] = hit['_source'][col]
            yield row

    @property
    def rowid_column(self):
        """Returns a column name which will act as a rowid column,
        for delete/update operations. This can be either an existing column
        name, or a made-up one.
        This column name should be subsequently present in every
        returned resultset.
        """
        return 'id';

    def es_index(self, id, values):
        content = json.dumps(values)

        conn = httplib.HTTPConnection(self.host, self.port)
        conn.request("PUT", "/%s/%s/%s" % (self.node, self.index, id), content)
        resp = conn.getresponse()
        if not 200 == resp.status:
            return

        raw = resp.read()
        data = json.loads(raw)

        return data

    def insert(self, new_values):
        log2pg('MARK Insert Request - new values:  %s' % new_values, logging.DEBUG)

        if not 'id' in new_values:
             log2pg('INSERT requires "id" column.  Missing in: %s' % new_values, logging.ERROR)

        id = new_values['id']
        new_values.pop('id', None)
        return self.es_index(id, new_values)

    def update(self, id, new_values):
        new_values.pop('id', None)
        return self.es_index(id, new_values)

    def delete(self, id):
        conn = httplib.HTTPConnection(self.host, self.port)
        conn.request("DELETE", "/%s/%s/%s" % (self.node, self.index, id))
        resp = conn.getresponse()
        if not 200 == resp.status:
            log2pg('Failed to delete: %s' % resp.read(), logging.ERROR)
            return

        raw = resp.read()
        return json.loads(raw)

## Local Variables: ***
## mode:python ***
## coding: utf-8 ***
## End: ***de>

How to generate and read EnterpriseDB (PPAS) diagnostic reports


PPAS is a commercial product based on PostgreSQL with a high degree of Oracle compatibility.

Not only is its syntax Oracle-compatible, its features are also very similar to Oracle's.

For example, it can generate reports similar to statspack or AWR reports.

How to create a snapshot

Set the parameter timed_statistics=true (or set timed_statistics=true in the client session),
then create a snapshot:

de  >edb=# SELECT * FROM edbsnap();
       edbsnap        
----------------------
 Statement processed.
(1 row)
de>

Snapshots can be created periodically.

How to generate a diagnostic report

Pick two snapshots and generate a diagnostic report for the interval between them.

de  >SELECT * FROM edbsnap();
de>

List the snapshots that have been created

de  >edb=# SELECT * FROM get_snaps();
          get_snaps           
------------------------------
 1  11-FEB-10 10:41:05.668852
 2  11-FEB-10 10:42:27.26154
 3  11-FEB-10 10:45:48.999992
 4  11-FEB-10 11:01:58.345163
 5  11-FEB-10 11:05:14.092683
 6  11-FEB-10 11:06:33.151002
 7  11-FEB-10 11:11:16.405664
 8  11-FEB-10 11:13:29.458405
 9  11-FEB-10 11:23:57.595916
 10 11-FEB-10 11:29:02.214014
 11 11-FEB-10 11:31:44.244038
(11 rows)
de>

Return the top N system-wide wait events within the given snapshot range

de  >sys_rpt(beginning_id, ending_id, top_n)

edb=# SELECT * FROM sys_rpt(9, 10, 10);
                                   sys_rpt                                   
-----------------------------------------------------------------------------
 WAIT NAME                                COUNT      WAIT TIME       % WAIT
 ---------------------------------------------------------------------------
 wal write                                21250      104.723772      36.31
 db file read                             121407     72.143274       25.01
 wal flush                                84185      51.652495       17.91
 wal file sync                            712        29.482206       10.22
 infinitecache write                      84178      15.814444       5.48
 db file write                            84177      14.447718       5.01
 infinitecache read                       672        0.098691        0.03
 db file extend                           190        0.040386        0.01
 query plan                               52         0.024400        0.01
 wal insert lock acquire                  4          0.000837        0.00
(12 rows)
de>

Return the top N session-level wait events within the given snapshot range

de  >sess_rpt(beginning_id, ending_id, top_n)

SELECT * FROM sess_rpt(18, 19, 10);

                              sess_rpt                                       
-----------------------------------------------------------------------------
ID    USER       WAIT NAME              COUNT TIME(ms)   %WAIT SES  %WAIT ALL
 ----------------------------------------------------------------------------

 17373 enterprise db file read           30   0.175713   85.24      85.24
 17373 enterprise query plan             18   0.014930   7.24       7.24
 17373 enterprise wal flush              6    0.004067   1.97       1.97
 17373 enterprise wal write              1    0.004063   1.97       1.97
 17373 enterprise wal file sync          1    0.003664   1.78       1.78
 17373 enterprise infinitecache read     38   0.003076   1.49       1.49
 17373 enterprise infinitecache write    5    0.000548   0.27       0.27
 17373 enterprise db file extend         190  0.040386   0.03       0.03
 17373 enterprise db file write          5    0.000082   0.04       0.04
 (11 rows)
de>

Return the diagnostic report for the specified backend

de  >sessid_rpt(beginning_id, ending_id, backend_id)

SELECT * FROM sessid_rpt(18, 19, 17373);

                                sessid_rpt                                 
-----------------------------------------------------------------------------
 ID    USER       WAIT NAME             COUNT TIME(ms)  %WAIT SES   %WAIT ALL
 ----------------------------------------------------------------------------
 17373 enterprise db file read           30   0.175713  85.24       85.24
 17373 enterprise query plan             18   0.014930  7.24        7.24
 17373 enterprise wal flush              6    0.004067  1.97        1.97
 17373 enterprise wal write              1    0.004063  1.97        1.97
 17373 enterprise wal file sync          1    0.003664  1.78        1.78
 17373 enterprise infinitecache read     38   0.003076  1.49        1.49
 17373 enterprise infinitecache write    5    0.000548  0.27        0.27
 17373 enterprise db file extend         190  0.040386  0.03        0.03
 17373 enterprise db file write          5    0.000082  0.04        0.04
(11 rows)
de>

Return the wait history for the specified session

de  >sesshist_rpt(snapshot_id, session_id)

edb=# SELECT * FROM sesshist_rpt (9, 5531);
                              sesshist_rpt                                  
----------------------------------------------------------------------------
 ID    USER       SEQ  WAIT NAME                
   ELAPSED(ms)   File  Name                 # of Blk   Sum of Blks 
 ----------------------------------------------------------------------------
 5531 enterprise 1     db file read 
   18546        14309  session_waits_pk     1          1           
 5531 enterprise 2     infinitecache read       
   125          14309  session_waits_pk     1          1           
 5531 enterprise 3     db file read             
   376          14304  edb$session_waits    0          1           
 5531 enterprise 4     infinitecache read       
   166          14304  edb$session_waits    0          1           
 5531 enterprise 5     db file read             
   7978         1260   pg_authid            0          1           
 5531 enterprise 6     infinitecache read       
   154          1260   pg_authid            0          1           
 5531 enterprise 7     db file read             
   628          14302  system_waits_pk      1          1           
 5531 enterprise 8     infinitecache read       
   463          14302  system_waits_pk      1          1           
 5531 enterprise 9     db file read             
   3446         14297  edb$system_waits     0          1           
 5531 enterprise 10    infinitecache read       
   187          14297  edb$system_waits     0          1           
 5531 enterprise 11    db file read             
   14750        14295  snap_pk              1          1           
 5531 enterprise 12    infinitecache read       
   416          14295  snap_pk              1          1           
 5531 enterprise 13    db file read             
   7139         14290  edb$snap             0          1           
 5531 enterprise 14    infinitecache read       
   158          14290  edb$snap             0          1           
 5531 enterprise 15    db file read             
   27287        14288  snapshot_num_seq     0          1           
 5531 enterprise 16    infinitecache read       
(17 rows)
de>

Purge the snapshots in the given range

de  >purgesnap(beginning_id, ending_id)

SELECT * FROM purgesnap(6, 9);

             purgesnap              
------------------------------------
 Snapshots in range 6 to 9 deleted.
(1 row)

edb=# SELECT * FROM get_snaps();
          get_snaps           
------------------------------
 1  11-FEB-10 10:41:05.668852
 2  11-FEB-10 10:42:27.26154
 3  11-FEB-10 10:45:48.999992
 4  11-FEB-10 11:01:58.345163
 5  11-FEB-10 11:05:14.092683
 10 11-FEB-10 11:29:02.214014
 11 11-FEB-10 11:31:44.244038
(7 rows)
de>

Purge all snapshots

de  >truncsnap()

SELECT * FROM truncsnap();

      truncsnap       
----------------------
 Snapshots truncated.
(1 row)

SELECT * FROM get_snaps();
 get_snaps 
-----------
(0 rows)
de>

Generating AWR-style reports

Comprehensive system report

edbreport(beginning_id, ending_id)

Database report

stat_db_rpt(beginning_id, ending_id)

Table-level report for the given range

stat_tables_rpt(beginning_id, ending_id, top_n, scope)

Table-level I/O report for the given range

statio_tables_rpt(beginning_id, ending_id, top_n, scope)

Index-level report for the given range

stat_indexes_rpt(beginning_id, ending_id, top_n, scope)

Index-level I/O report for the given range

statio_indexes_rpt(beginning_id, ending_id, top_n, scope)

Scope

de  >scope determines which tables the function returns statistics about. Specify SYS, USER or ALL:

SYS indicates that the function should return information about system defined tables. 
A table is considered a system table if it is stored in one of the following schemas: 
  pg_catalog, information_schema, sys, or dbo.

USER indicates that the function should return information about user-defined tables.

ALL specifies that the function should return information about all tables.
de>
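As an illustration (the snapshot IDs are arbitrary; the signatures are the ones listed above):

SELECT * FROM edbreport(1, 5);                       -- full report between snapshots 1 and 5
SELECT * FROM stat_tables_rpt(1, 5, 10, 'USER');     -- top 10 user tables
SELECT * FROM statio_indexes_rpt(1, 5, 10, 'ALL');   -- top 10 indexes by I/O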

Note for RDS PPAS users

RDS PPAS users are ordinary (non-super) users. To use the functions above you must add the rds_ prefix; the following shows how to list the available rds functions.

Once you find the corresponding rds function you can run it.

de  >postgres=# \df rds*

                                List of functions
 Schema |           Name           |     Result data type     |                                                                                                                                                                              
                                                          Argument data types                                                                                                                                                                
                                                                        |  Type  
--------+--------------------------+--------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------+--------
 sys    | rds_add_policy           | void                     | object_schema text DEFAULT NULL::text, object_name text, policy_name text, function_schema text DEFAULT NULL::text, policy_function text, statement_types text DEFAULT 'inser
t,update,delete,select'::text, update_check boolean DEFAULT false, enable boolean DEFAULT true, static_policy boolean DEFAULT false, policy_type integer DEFAULT NULL::integer, long_predicate boolean DEFAULT false, sec_relevant_cols text 
DEFAULT NULL::text, sec_relevant_cols_opt integer DEFAULT NULL::integer | normal
 sys    | rds_drop_policy          | void                     | object_schema text DEFAULT NULL::text, object_name text, policy_name text                                                                                                    

                                                                        | normal
 sys    | rds_enable_policy        | void                     | object_schema text DEFAULT NULL::text, object_name text, policy_name text, enable boolean                                                                                    

                                                                        | normal
 sys    | rds_get_snaps            | SETOF text               |                                                                                                                                                                              

                                                                        | normal
 sys    | rds_manage_extension     | boolean                  | operation text, pname text, schema text DEFAULT NULL::text, logging boolean DEFAULT false                                                                                    

                                                                        | normal
 sys    | rds_pg_cancel_backend    | boolean                  | upid integer                                                                                                                                                                 

                                                                        | normal
 sys    | rds_pg_stat_activity     | SETOF pg_stat_activity   |                                                                                                                                                                              

                                                                        | normal
 sys    | rds_pg_stat_statements   | SETOF pg_stat_statements |                                                                                                                                                                              

                                                                        | normal
 sys    | rds_pg_terminate_backend | boolean                  | upid integer                                                                                                                                                                 

                                                                        | normal
 sys    | rds_report               | SETOF text               | beginsnap bigint, endsnap bigint                                                                                                                                             

                                                                        | normal
 sys    | rds_snap                 | text                     |                                                                                                                                                                              

                                                                        | normal
 sys    | rds_truncsnap            | text                     |                                                                                                                                                                              

                                                                        | normal
(12 rows)
de>

References

https://www.enterprisedb.com/docs/en/9.5/eeguide/toc.html
https://www.enterprisedb.com/docs/en/9.5/eeguide/Postgres_Plus_Enterprise_Edition_Guide.1.141.html

PostgreSQL kernel extension - managing billions of 3D scan points (LiDAR point cloud data)


Background

Remember the zodiac heads replicated with 3D scanning and printing in Jackie Chan's movie "CZ12 (Chinese Zodiac)"?
screenshot
3D printing has taken off in recent years; besides the positions of points on an object's surface, the data also carry color, density and other attributes.
3D scanning itself has existed in the military field for a very long time.
With an ordinary database you would have to split these attributes apart to store them.
With PostgreSQL you do not need to split them at all: they belong together, and the extension interfaces make that work.
PostgreSQL extension guide:
https://yq.aliyun.com/articles/55981

What is LiDAR

Some 3D scanning basics, from Baidu Baike:
LiDAR stands for Light Detection And Ranging.
It is airborne laser scanning that combines GPS (Global Positioning System) with an IMU (Inertial Measurement Unit).
The measured data are a discrete-point representation of a Digital Surface Model (DSM) and contain 3D spatial information and laser intensity.
Applying classification techniques to remove buildings, man-made objects and vegetation from the raw DSM yields a Digital Elevation Model (DEM), together with the heights of the ground cover.

Airborne LiDAR has been developed and applied abroad for over a decade, while research and application in China are only getting started; producing DEM, DOM and DLG data products for difficult areas from airborne laser scanning data is one of today's research hotspots.
The technique has broad prospects in topographic mapping, environmental monitoring, 3D city modeling and many other fields, and may bring a new technical revolution to the surveying industry.
Depending on the application and the required deliverables, and combined with flexible mounting options, LiDAR can be widely used in basic surveying, road engineering, power grids, water conservancy, oil pipelines, coastlines and islands, digital cities and other areas, delivering high-precision, large-scale (1:500 to 1:10000) spatial data products.

Example
A laser scanner can record the first return, several intermediate returns and the last return of a single emitted pulse; by recording the time of each return, multiple elevations are obtained at once. With the IMU/DGPS system integrated with the laser scanner, the aircraft flies forward while the scanner sweeps the ground with a continuous laser beam and receives the reflected returns; the IMU/DGPS records the instantaneous position and attitude of every laser emission, from which the spatial position of each reflection point can be computed.

A LiDAR beam scanning an airport glide path to detect the wind shear, i.e. the change in headwind, that aircraft will encounter.

What is a point cloud

https://en.wikipedia.org/wiki/Point_cloud

As mentioned above, the data captured by LiDAR include the position plus whatever the other sensors record at that position: RGB, time, temperature, humidity and so on (use your imagination, it can be anything).
In other words, every point carries a great deal of information.

de  >The variables captured by LIDAR sensors varies by sensor and capture process.   
Some data sets might contain only X/Y/Z values.   
Others will contain dozens of variables: X, Y, Z; intensity and return number;   
red, green, and blue values; return times; and many more.   
de>

A point cloud is a collection of many such points, so the amount of information is enormous.

de  >A point cloud is a set of data points in some coordinate system.  

In a three-dimensional coordinate system, these points are usually defined by X, Y, and Z coordinates, and often are intended to represent the external surface of an object.  

Point clouds may be created by 3D scanners. These devices measure a large number of points on an object's surface, and often output a point cloud as a data file.   
The point cloud represents the set of points that the device has measured.  
de>


How point clouds are stored in PostgreSQL

PostgreSQL stores point clouds in a format consistent with the PDAL (Point Data Abstraction Library) library,
reusing PDAL's data structures and operations very effectively.
Because PDAL is so general, compatibility is also very good.

de  >PostgreSQL Pointcloud deals with all this variability by using a "schema document" to describe the contents of any particular LIDAR point.   
Each point contains a number of dimensions, and each dimension can be of any data type, with scaling and/or offsets applied to move between the actual value and the value stored in the database.   
The schema document format used by PostgreSQL Pointcloud is the same one used by the PDAL library.  
de>
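A hedged sketch of such a schema document (abridged; the element names follow the pgpointcloud README, and only one dimension is shown):

<?xml version="1.0" encoding="UTF-8"?>
<pc:PointCloudSchema xmlns:pc="http://pointcloud.org/schemas/PC/1.1"
                     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <pc:dimension>
    <pc:position>1</pc:position>
    <pc:size>4</pc:size>
    <pc:name>X</pc:name>
    <pc:interpretation>int32_t</pc:interpretation>
    <pc:scale>0.01</pc:scale>
  </pc:dimension>
  <!-- ... further dimensions: Y, Z, Intensity, ... -->
  <pc:metadata>
    <Metadata name="compression">dimensional</Metadata>
  </pc:metadata>
</pc:PointCloudSchema>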

Object types supported by PostgreSQL Pointcloud

PcPoint

The basic type; an example:

de  >{  
    "pcid" : 1,  
      "pt" : [0.01, 0.02, 0.03, 4]  
}  
de>

PcPatch

An aggregate of PcPoints: a PcPatch stores a group of nearby PcPoints.

de  >The structure of database storage is such that storing billions of points as individual records in a table is not an efficient use of resources. 
Instead, we collect a group of PcPoint into a PcPatch. Each patch should hopefully contain points that are near together.    
de>

Example

de  >{  
    "pcid" : 1,  
     "pts" : [  
              [0.02, 0.03, 0.05, 6],  
              [0.02, 0.03, 0.05, 8]  
             ]  
}  
de>

Functions supported by PostgreSQL Pointcloud

PC_MakePoint(pcid integer, vals float8[]) returns pcpoint

Construct a pcpoint

de  >SELECT PC_MakePoint(1, ARRAY[-127, 45, 124.0, 4.0]);  
010100000064CEFFFF94110000703000000400  

INSERT INTO points (pt)  
SELECT PC_MakePoint(1, ARRAY[x,y,z,intensity])  
FROM (  
  SELECT    
  -127+a/100.0 AS x,   
    45+a/100.0 AS y,  
         1.0*a AS z,  
          a/10 AS intensity  
  FROM generate_series(1,100) AS a  
) AS values;  
pcid is the ID of the schema entry (the key into POINTCLOUD_SCHEMAS) that describes the point
de>

PC_AsText(p pcpoint) returns text

Convert a pcpoint into human-readable text

de  >SELECT PC_AsText('010100000064CEFFFF94110000703000000400'::pcpoint);  

{"pcid":1,"pt":[-127,45,124,4]}  
de>

PC_PCId(p pcpoint) returns integer (from 1.1.0)

Get the pcid

de  >SELECT PC_PCId('010100000064CEFFFF94110000703000000400'::pcpoint));  

1   
de>

PC_AsBinary(p pcpoint) returns bytea

Convert a pcpoint into OGC WKB-encoded binary

de  >SELECT PC_AsBinary('010100000064CEFFFF94110000703000000400'::pcpoint);  

\x01010000800000000000c05fc000000000008046400000000000005f40  
de>

PC_Get(pt pcpoint, dimname text) returns numeric

Read the value of one dimension (a given attribute) of a pcpoint

de  >SELECT PC_Get('010100000064CEFFFF94110000703000000400'::pcpoint, 'Intensity');  

4  
de>

PC_Get(pt pcpoint) returns float8

Read the attributes of a pcpoint, returned as an array

de  >SELECT PC_Get('010100000064CEFFFF94110000703000000400'::pcpoint);  

{-127,45,124,4}  
de>

PC_Patch(pts pcpoint[]) returns pcpatch

Aggregate multiple pcpoints into one pcpatch

de  >INSERT INTO patches (pa)  
SELECT PC_Patch(pt) FROM points GROUP BY id/10;  
de>

PC_NumPoints(p pcpatch) returns integer

How many pcpoints the pcpatch contains

de  >SELECT PC_NumPoints(pa) FROM patches LIMIT 1;  

9       
de>

PC_PCId(p pcpatch) returns integer (from 1.1.0)

Return the pcid of the pcpoints contained in the pcpatch

de  >SELECT PC_PCId(pa) FROM patches LIMIT 1;  

1     
de>

PC_Envelope(p pcpatch) returns bytea

Return the pcpatch (its envelope) as OGC WKB-encoded binary

de  >SELECT PC_Envelope(pa) FROM patches LIMIT 1;  

\x0103000000010000000500000090c2f5285cbf5fc0e17a  
14ae4781464090c2f5285cbf5fc0ec51b81e858b46400ad7  
a3703dba5fc0ec51b81e858b46400ad7a3703dba5fc0e17a  
14ae4781464090c2f5285cbf5fc0e17a14ae47814640  
de>

PC_AsText(p pcpatch) returns text

Convert a pcpatch into text

de  >SELECT PC_AsText(pa) FROM patches LIMIT 1;  

{"pcid":1,"pts":[  
 [-126.99,45.01,1,0],[-126.98,45.02,2,0],[-126.97,45.03,3,0],  
 [-126.96,45.04,4,0],[-126.95,45.05,5,0],[-126.94,45.06,6,0],  
 [-126.93,45.07,7,0],[-126.92,45.08,8,0],[-126.91,45.09,9,0]  
]}  
de>

PC_Summary(p pcpatch) returns text (from 1.1.0)

Return a summary as JSON-formatted text

de  >SELECT PC_Summary(pa) FROM patches LIMIT 1;  

{"pcid":1, "npts":9, "srid":4326, "compr":"dimensional","dims":[{"pos":0,"name":"X","size":4,"type":"int32_t","compr":"sigbits","stats":{"min":-126.99,"max":-126.91,"avg":-126.95}},{"pos":1,"name":"Y","size":4,"type":"int32_t","compr":"sigbits","stats":{"min":45.01,"max":45.09,"avg":45.05}},{"pos":2,"name":"Z","size":4,"type":"int32_t","compr":"sigbits","stats":{"min":1,"max":9,"avg":5}},{"pos":3,"name":"Intensity","size":2,"type":"uint16_t","compr":"rle","stats":{"min":0,"max":0,"avg":0}}]}  
de>

PC_Uncompress(p pcpatch) returns pcpatch

Uncompress a pcpatch

de  >SELECT PC_Uncompress(pa) FROM patches   
   WHERE PC_NumPoints(pa) = 1;  

01010000000000000001000000C8CEFFFFF8110000102700000A00   
de>

PC_Union(p pcpatch[]) returns pcpatch

Aggregate multiple pcpatches into one pcpatch

de  >-- Compare npoints(sum(patches)) to sum(npoints(patches))  
SELECT PC_NumPoints(PC_Union(pa)) FROM patches;  
SELECT Sum(PC_NumPoints(pa)) FROM patches;  

100   
de>

PC_Intersects(p1 pcpatch, p2 pcpatch) returns boolean

Check whether two pcpatches intersect

de  >-- Patch should intersect itself  
SELECT PC_Intersects(  
         '01010000000000000001000000C8CEFFFFF8110000102700000A00'::pcpatch,  
         '01010000000000000001000000C8CEFFFFF8110000102700000A00'::pcpatch);  

t  
de>

PC_Explode(p pcpatch) returns SetOf[pcpoint]

Explode a pcpatch into its pcpoints

de  >SELECT PC_AsText(PC_Explode(pa)), id   
FROM patches WHERE id = 7;  

              pc_astext               | id   
--------------------------------------+----  
 {"pcid":1,"pt":[-126.5,45.5,50,5]}   |  7  
 {"pcid":1,"pt":[-126.49,45.51,51,5]} |  7  
 {"pcid":1,"pt":[-126.48,45.52,52,5]} |  7  
 {"pcid":1,"pt":[-126.47,45.53,53,5]} |  7  
 {"pcid":1,"pt":[-126.46,45.54,54,5]} |  7  
 {"pcid":1,"pt":[-126.45,45.55,55,5]} |  7  
 {"pcid":1,"pt":[-126.44,45.56,56,5]} |  7  
 {"pcid":1,"pt":[-126.43,45.57,57,5]} |  7  
 {"pcid":1,"pt":[-126.42,45.58,58,5]} |  7  
 {"pcid":1,"pt":[-126.41,45.59,59,5]} |  7  
de>

PC_PatchAvg(p pcpatch, dimname text) returns numeric

Average value of a given dimension over the pcpoints in a pcpatch

de  >SELECT PC_PatchAvg(pa, 'intensity')   
FROM patches WHERE id = 7;  

5.0000000000000000  
de>

PC_PatchMax(p pcpatch, dimname text) returns numeric

Maximum value of a given dimension in a pcpatch

PC_PatchMin(p pcpatch, dimname text) returns numeric

Minimum value of a given dimension in a pcpatch

PC_PatchAvg(p pcpatch) returns pcpoint (from 1.1.0)

Average of every dimension over all pcpoints in the pcpatch

PC_PatchMax(p pcpatch) returns pcpoint (from 1.1.0)

Maximum of every dimension over all pcpoints in the pcpatch

PC_PatchMin(p pcpatch) returns pcpoint (from 1.1.0)

Minimum of every dimension over all pcpoints in the pcpatch

PC_FilterGreaterThan(p pcpatch, dimname text, float8 value) returns pcpatch

Return the pcpoints in the pcpatch whose value in the given dimension is greater than the given value

de  >SELECT PC_AsText(PC_FilterGreaterThan(pa, 'y', 45.57))   
FROM patches WHERE id = 7;  

 {"pcid":1,"pts":[[-126.42,45.58,58,5],[-126.41,45.59,59,5]]}  
de>

PC_FilterLessThan(p pcpatch, dimname text, float8 value) returns pcpatch

Return the pcpoints in the pcpatch whose value in the given dimension is less than the given value

PC_FilterBetween(p pcpatch, dimname text, float8 value1, float8 value2) returns pcpatch

Return the pcpoints in the pcpatch whose value in the given dimension lies within the given range

PC_FilterEquals(p pcpatch, dimname text, float8 value) returns pcpatch

Return the pcpoints in the pcpatch whose value in the given dimension equals the given value

PC_Compress(p pcpatch,global_compression_scheme text,compression_config text) returns pcpatch (from 1.1.0)

Compress a pcpatch

de  >Allowed global compression schemes are:  
auto -- determined by pcid  
ght -- no compression config supported  
laz -- no compression config supported  
dimensional configuration is a comma-separated list of per-dimension compressions from this list:  
  auto -- determined automatically, from values stats  
  zlib -- deflate compression  
  sigbits -- significant bits removal  
  rle -- run-length encoding  
de>

PC_PointN(p pcpatch, n int4) returns pcpoint

Return the n-th pcpoint of the pcpatch; positive n counts from the start, negative n counts from the end.


Using Pointcloud together with PostGIS

de  >CREATE EXTENSION postgis;  
CREATE EXTENSION pointcloud;  
CREATE EXTENSION pointcloud_postgis;  
de>

PC_Intersects(p pcpatch, g geometry) returns boolean

PC_Intersects(g geometry, p pcpatch) returns boolean

Check whether a pcpatch and a geometry intersect

de  >SELECT PC_Intersects('SRID=4326;POINT(-126.451 45.552)'::geometry, pa)  
FROM patches WHERE id = 7;  

t  
de>

PC_Intersection(pcpatch, geometry) returns pcpatch

Return a new pcpatch made up of the points in the pcpatch that intersect the geometry

de  >SELECT PC_AsText(PC_Explode(PC_Intersection(  
      pa,   
      'SRID=4326;POLYGON((-126.451 45.552, -126.42 47.55, -126.40 45.552, -126.451 45.552))'::geometry  
)))  
FROM patches WHERE id = 7;  

             pc_astext                 
--------------------------------------  
 {"pcid":1,"pt":[-126.44,45.56,56,5]}  
 {"pcid":1,"pt":[-126.43,45.57,57,5]}  
 {"pcid":1,"pt":[-126.42,45.58,58,5]}  
 {"pcid":1,"pt":[-126.41,45.59,59,5]}  
de>

Geometry(pcpoint) returns geometry

pcpoint::geometry returns geometry

Convert the position attributes of a pcpoint into the geometry type

de  >SELECT ST_AsText(PC_MakePoint(1, ARRAY[-127, 45, 124.0, 4.0])::geometry);  

POINT Z (-127 45 124)  
de>


Point cloud compression

When data is loaded via the schema document, PostgreSQL Pointcloud lets you specify the compression method.
It is written as follows:

de  ><pc:metadata>  
  <Metadata name="compression">dimensional</Metadata>  
</pc:metadata>  
de>

The supported compression methods are:

de  >None,   
  which stores points and patches as byte arrays using the type and formats described in the schema document.  

Dimensional,   
  which stores points the same as 'none' but stores patches as collections of dimensional data arrays, with an "appropriate" compression applied.   
Dimensional compression makes the most sense for smaller patch sizes, since small patches will tend to have more homogeneous dimensions.  

GHT or "GeoHash Tree",   
  which stores the points in a tree where each node stores the common values shared by all nodes below.   
For larger patch sizes, GHT should provide effective compression and performance for patch-wise operations.   
You must build Pointcloud with libght support to make use of the GHT compression.  

LAZ or "LASZip".   
  You must build Pointcloud with LAZPERF support to make use of the LAZ compression.  

If no compression is declared in <pc:metadata>, then a compression of "none" is assumed.  
de>


The point cloud binary formats

The point and patch binary formats start with a common header, which provides:

endianness flag, to allow portability between architectures
pcid number, to look up the schema information in the pointcloud_formats table

Point Binary

de  >byte:     endianness (1 = NDR, 0 = XDR)  
uint32:   pcid (key to POINTCLOUD_SCHEMAS)  
uchar[]:  pointdata (interpret relative to pcid)  
de>

The patch binary formats have additional standard header information:

the compression number, which indicates how to interpret the data
the number of points in the patch

Patch Binary (Uncompressed)

de  >byte:         endianness (1 = NDR, 0 = XDR)  
uint32:       pcid (key to POINTCLOUD_SCHEMAS)  
uint32:       0 = no compression  
uint32:        npoints  
pointdata[]:  interpret relative to pcid  
de>

For the binary representation of the compressed pcpatch formats see
https://github.com/pgpointcloud/pointcloud

How to import data into Pointcloud

There are two import formats:
From WKB

From PDAL

See
https://github.com/pgpointcloud/pointcloud

The SQL definitions of the pcpoint and pcpatch types

de  >CREATE TYPE pcpoint (  
    internallength = variable,  
    input = pcpoint_in,  
    output = pcpoint_out,  
    -- send = geometry_send,  
    -- receive = geometry_recv,  
    typmod_in = pc_typmod_in,  
    typmod_out = pc_typmod_out,  
    -- delimiter = ':',  
    -- alignment = double,  
    -- analyze = geometry_analyze,  
    storage = external -- do not try to compress it please  
);  

CREATE TYPE pcpatch (  
    internallength = variable,  
    input = pcpatch_in,  
    output = pcpatch_out,  
    -- send = geometry_send,  
    -- receive = geometry_recv,  
    typmod_in = pc_typmod_in,  
    typmod_out = pc_typmod_out,  
    -- delimiter = ':',  
    -- alignment = double,  
    -- analyze = geometry_analyze,  
    storage = external  
);  

CREATE TYPE pointcloud_abs (  
    internallength = 8,  
    input = pointcloud_abs_in,  
    output = pointcloud_abs_out,  
    alignment = double  
);  
de>


The C functions behind the pcpoint type's input and output

de  >PG_FUNCTION_INFO_V1(pcpoint_in);  
Datum pcpoint_in(PG_FUNCTION_ARGS)  
{  
    char *str = PG_GETARG_CSTRING(0);  
    /* Datum pc_oid = PG_GETARG_OID(1); Not needed. */  
    int32 typmod = 0;  
    uint32 pcid = 0;  
    PCPOINT *pt;  
    SERIALIZED_POINT *serpt = NULL;  

    if ( (PG_NARGS()>2) && (!PG_ARGISNULL(2)) )  
    {  
        typmod = PG_GETARG_INT32(2);  
        pcid = pcid_from_typmod(typmod);  
    }  

    /* Empty string. */  
    if ( str[0] == '\0' )  
    {  
        ereport(ERROR,(errmsg("pcpoint parse error - empty string")));  
    }  

    /* Binary or text form? Let's find out. */  
    if ( str[0] == '0' )  
    {  
        /* Hex-encoded binary */  
        pt = pc_point_from_hexwkb(str, strlen(str), fcinfo);  
        pcid_consistent(pt->schema->pcid, pcid);  
        serpt = pc_point_serialize(pt);  
        pc_point_free(pt);  
    }  
    else  
    {  
        ereport(ERROR,(errmsg("parse error - support for text format not yet implemented")));  
    }  

    if ( serpt ) PG_RETURN_POINTER(serpt);  
    else PG_RETURN_NULL();  
}  

PG_FUNCTION_INFO_V1(pcpoint_out);  
Datum pcpoint_out(PG_FUNCTION_ARGS)  
{  
    PCPOINT *pcpt = NULL;  
    PCSCHEMA *schema = NULL;  
    SERIALIZED_POINT *serpt = NULL;  
    char *hexwkb = NULL;  

    serpt = PG_GETARG_SERPOINT_P(0);  
    schema = pc_schema_from_pcid(serpt->pcid, fcinfo);  
    pcpt = pc_point_deserialize(serpt, schema);  
    hexwkb = pc_point_to_hexwkb(pcpt);  
    pc_point_free(pcpt);  
    PG_RETURN_CSTRING(hexwkb);  
}  
de>


The C functions behind the pcpatch type's input and output

de  >PG_FUNCTION_INFO_V1(pcpatch_in);  
Datum pcpatch_in(PG_FUNCTION_ARGS)  
{  
    char *str = PG_GETARG_CSTRING(0);  
    /* Datum geog_oid = PG_GETARG_OID(1); Not needed. */  
    uint32 typmod = 0, pcid = 0;  
    PCPATCH *patch;  
    SERIALIZED_PATCH *serpatch = NULL;  

    if ( (PG_NARGS()>2) && (!PG_ARGISNULL(2)) )  
    {  
        typmod = PG_GETARG_INT32(2);  
        pcid = pcid_from_typmod(typmod);  
    }  

    /* Empty string. */  
    if ( str[0] == '\0' )  
    {  
        ereport(ERROR,(errmsg("pcpatch parse error - empty string")));  
    }  

    /* Binary or text form? Let's find out. */  
    if ( str[0] == '0' )  
    {  
        /* Hex-encoded binary */  
        patch = pc_patch_from_hexwkb(str, strlen(str), fcinfo);  
        pcid_consistent(patch->schema->pcid, pcid);  
        serpatch = pc_patch_serialize(patch, NULL);  
        pc_patch_free(patch);  
    }  
    else  
    {  
        ereport(ERROR,(errmsg("parse error - support for text format not yet implemented")));  
    }  

    if ( serpatch ) PG_RETURN_POINTER(serpatch);  
    else PG_RETURN_NULL();  
}  

PG_FUNCTION_INFO_V1(pcpatch_out);  
Datum pcpatch_out(PG_FUNCTION_ARGS)  
{  
    PCPATCH *patch = NULL;  
    SERIALIZED_PATCH *serpatch = NULL;  
    char *hexwkb = NULL;  
    PCSCHEMA *schema = NULL;  

    serpatch = PG_GETARG_SERPATCH_P(0);  
    schema = pc_schema_from_pcid(serpatch->pcid, fcinfo);  
    patch = pc_patch_deserialize(serpatch, schema);  
    hexwkb = pc_patch_to_hexwkb(patch);  
    pc_patch_free(patch);  
    PG_RETURN_CSTRING(hexwkb);  
}  
de>


References

https://en.wikipedia.org/wiki/Lidar
http://baike.baidu.com/view/2922098.htm
https://github.com/pgpointcloud/pointcloud
http://pointcloud.org/
https://en.wikipedia.org/wiki/Point_cloud
http://www.pdal.io/
http://www.pdal.io/quickstart.html

A little PostgreSQL trick: 44x faster per-group TOP-N


Business background

Fetching the TOP values per group is a very common requirement,
for example the 10 most downloaded tracks of each singer, or the top 10 taxpayers (people or companies) in each city.

The traditional approach

The traditional approach is a window query, which PostgreSQL supports.
Example
Test table and data: 10,000 groups and 10 million rows.

de  >postgres=# create table tbl(c1 int, c2 int, c3 int);
CREATE TABLE
postgres=# create index idx1 on tbl(c1,c2);
CREATE INDEX
postgres=# insert into tbl select mod(trunc(random()*10000)::int, 10000), trunc(random()*10000000) from generate_series(1,10000000);
INSERT 0 10000000
de>

Execution plan of the window query

de  >postgres=# explain select * from (select row_number() over(partition by c1 order by c2) as rn,* from tbl) t where t.rn<=10;
                                       QUERY PLAN                                       
----------------------------------------------------------------------------------------
 Subquery Scan on t  (cost=0.43..770563.03 rows=3333326 width=20)
   Filter: (t.rn <= 10)
   ->  WindowAgg  (cost=0.43..645563.31 rows=9999977 width=12)
         ->  Index Scan using idx1 on tbl  (cost=0.43..470563.72 rows=9999977 width=12)
(4 rows)
de>

Sample output of the window query

de  >postgres=# select * from (select row_number() over(partition by c1 order by c2) as rn,* from tbl) t where t.rn<=10;
 rn |  c1  |   c2   | c3 
----+------+--------+----
  1 |    0 |   1657 |   
  2 |    0 |   3351 |   
  3 |    0 |   6347 |   
  4 |    0 |  12688 |   
  5 |    0 |  16991 |   
  6 |    0 |  19584 |   
  7 |    0 |  24694 |   
  8 |    0 |  36646 |   
  9 |    0 |  40882 |   
 10 |    0 |  41599 |   
  1 |    1 |  14465 |   
  2 |    1 |  29032 |   
  3 |    1 |  39969 |   
  4 |    1 |  41094 |   
  5 |    1 |  69481 |   
  6 |    1 |  70919 |   
  7 |    1 |  75575 |   
  8 |    1 |  81102 |   
  9 |    1 |  87496 |   
 10 |    1 |  90603 |   
......
de>

Runtime of the window query: about 20 seconds

de  >postgres=# explain (analyze,verbose,costs,timing,buffers) select * from (select row_number() over(partition by c1 order by c2) as rn,* from tbl) t where t.rn<=10;
                                                                     QUERY PLAN                                                                     
----------------------------------------------------------------------------------------------------------------------------------------------------
 Subquery Scan on t  (cost=0.43..770563.03 rows=3333326 width=20) (actual time=0.040..20813.469 rows=100000 loops=1)
   Output: t.rn, t.c1, t.c2, t.c3
   Filter: (t.rn <= 10)
   Rows Removed by Filter: 9900000
   Buffers: shared hit=10035535
   ->  WindowAgg  (cost=0.43..645563.31 rows=9999977 width=12) (actual time=0.035..18268.027 rows=10000000 loops=1)
         Output: row_number() OVER (?), tbl.c1, tbl.c2, tbl.c3
         Buffers: shared hit=10035535
         ->  Index Scan using idx1 on public.tbl  (cost=0.43..470563.72 rows=9999977 width=12) (actual time=0.026..11913.677 rows=10000000 loops=1)
               Output: tbl.c1, tbl.c2, tbl.c3
               Buffers: shared hit=10035535
 Planning time: 0.110 ms
 Execution time: 20833.747 ms
(13 rows)
de>

The trick

How can this be optimized?
See my earlier article on using a recursive query to optimize count(distinct).
https://yq.aliyun.com/articles/39689
This article likewise uses a recursive query, here to enumerate the group IDs:

de  >postgres=# with recursive t1 as (
postgres(#  (select min(c1) c1 from tbl )
postgres(#   union all
postgres(#  (select (select min(tbl.c1) c1 from tbl where tbl.c1>t.c1) c1 from t1 t where t.c1 is not null)
postgres(# )
postgres-# select * from t1;
de>

Wrapped into an SRF function, as follows:

de  >postgres=# create or replace function f() returns setof tbl as $$
postgres$# declare
postgres$#   v int;
postgres$# begin
postgres$#   for v in with recursive t1 as (                                                                           
postgres$#    (select min(c1) c1 from tbl )                                                                   
postgres$#     union all                                                                                      
postgres$#    (select (select min(tbl.c1) c1 from tbl where tbl.c1>t.c1) c1 from t1 t where t.c1 is not null) 
postgres$#   )                                                                                                
postgres$#   select * from t1
postgres$#   LOOP
postgres$#     return query select * from tbl where c1=v order by c2 limit 10;
postgres$#   END LOOP;
postgres$# return;
postgres$# 
postgres$# end;
postgres$# $$ language plpgsql strict;
CREATE FUNCTION
de>

Sample output of the optimized query

de  >postgres=# select * from f();
  c1  |   c2   | c3 
------+--------+----
    0 |   1657 |   
    0 |   3351 |   
    0 |   6347 |   
    0 |  12688 |   
    0 |  16991 |   
    0 |  19584 |   
    0 |  24694 |   
    0 |  36646 |   
    0 |  40882 |   
    0 |  41599 |   
    1 |  14465 |   
    1 |  29032 |   
    1 |  39969 |   
    1 |  41094 |   
    1 |  69481 |   
    1 |  70919 |   
    1 |  75575 |   
    1 |  81102 |   
    1 |  87496 |   
    1 |  90603 |   
......
de>

After optimization it takes only 464 milliseconds to return the TOP 10 of all 10,000 groups.

de  >postgres=# explain (analyze,verbose,timing,costs,buffers) select * from f();
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Function Scan on public.f  (cost=0.25..10.25 rows=1000 width=12) (actual time=419.218..444.810 rows=100000 loops=1)
   Output: c1, c2, c3
   Function Call: f()
   Buffers: shared hit=170407, temp read=221 written=220
 Planning time: 0.037 ms
 Execution time: 464.257 ms
(6 rows)
de>

Summary

  1. The traditional window-query approach outputs the TOP 10 of every group but has to scan all rows, so it is slow.
  2. Since there are not many groups (only 10,000), recursion can be used to walk the groups and the index can supply each group's TOP 10, which is very fast.
  3. PostgreSQL's recursive CTE syntax currently does not allow the recursive starting relation to be written as this kind of subquery, nor ORDER BY on it inside the recursive query, so the result cannot be produced by recursion alone; for now it has to be wrapped in a function.

A design for multi-stream parallel xlog in PostgreSQL


This article is excerpted from "A survey of performance optimization techniques for transactional databases on multi-core processors".
http://www.cnki.com.cn/Article/CJFDTotal-JSJX201509012.htm

A database's redo log records the information needed to redo transactions; one of its most important uses is recovery: after a crash, the database replays the redo log starting from a consistent checkpoint.
To guarantee the replay order of transactions, XLOG is serial, so writing XLOG also requires a lock.
To improve write performance, PostgreSQL uses the xlog buffer to relieve write pressure.
Once the xlog buffer is in use, or XLOG is placed on SSD, the serial write itself can become the bottleneck.
This article studies an implementation of multiple parallel XLOG streams that removes unnecessary waiting and raises write throughput.

(screenshots omitted)

With multiple XLOG streams there is no lock contention between the streams, which greatly increases XLOG write concurrency.
At the same time, to preserve replay order, every XLOG record obtains a timestamp that is written into the XLOG file, and replay strictly follows timestamp order.
As for checkpoints, a checkpoint is only valid once every XLOG stream has completed it.
The test results show that the higher the test concurrency, the better multi-stream XLOG performs compared with non-parallel XLOG.

The relcache memory-hogging pitfall of long-lived PostgreSQL connections


Background

While using Alibaba Cloud RDS PG, the attentive DBAs of a business team inside Alibaba noticed that some long-lived connections were holding a large amount of memory without releasing it. A way to reproduce it was later found; the usage pattern is somewhat extreme.

With seasoned in-house users like this alongside it, RDS PG is in very good hands.

PostgreSQL caches

Besides the familiar plan cache and data cache, PostgreSQL also provides caches for catalogs, relations and so on to speed up plan generation.

The cache code supported in PostgreSQL 9.5 is as follows:

de  >ll src/backend/utils/cache/

attoptcache.c  catcache.c  evtcache.c  inval.c  lsyscache.c  plancache.c  relcache.c  relfilenodemap.c  relmapper.c  spccache.c  syscache.c  ts_cache.c  typcache.c
de>

The cache problem with long-lived connections

Some of these caches are never released proactively, so long-lived connections may hold on to large amounts of memory.

Typically, in applications using long connections, one connection may have served many client sessions, so it is very likely to have touched a large share of the catalog. The memory footprint of such connections is therefore very high.

What is the impact?
If there are many long connections, each hogging a lot of memory, memory is exhausted quickly and an OOM is unavoidable.
In reality most of that memory may be relcache (plus a few other caches); when memory is needed, the relcache could perfectly well be released to free up space instead of being held forever.


Example

When a database contains a huge number of tables, PostgreSQL caches the metadata of every object the current session has accessed. If a session has queried every object in the database since it started, it ends up caching all of their definitions and uses a lot of memory; the amount depends on how many objects it has touched.

Reproduction (taken from a stackoverflow question): create a large number of objects and access them all, making the session's relcache and friends grow rapidly.
Create a large number of objects
functions :
-- MTDB_destroy

de  >CREATE OR REPLACE FUNCTION public.mtdb_destroy(schemanameprefix character varying)
 RETURNS integer
 LANGUAGE plpgsql
AS $function$
declare
   curs1 cursor(prefix varchar) is select schema_name from information_schema.schemata where schema_name like prefix || '%';
   schemaName varchar(100);
   count integer;
begin
   count := 0;
   open curs1(schemaNamePrefix);
   loop
      fetch curs1 into schemaName;
      if not found then exit; end if;           
      count := count + 1;
      execute 'drop schema ' || schemaName || ' cascade;';
   end loop;  
   close curs1;
   return count;
end $function$;
de>

-- MTDB_Initialize

de  >CREATE OR REPLACE FUNCTION public.mtdb_initialize(schemanameprefix character varying, numberofschemas integer, numberoftablesperschema integer, createviewforeachtable boolean)
 RETURNS integer
 LANGUAGE plpgsql
AS $function$
declare   
   currentSchemaId integer;
   currentTableId integer;
   currentSchemaName varchar(100);
   currentTableName varchar(100);
   currentViewName varchar(100);
   count integer;
begin
   -- clear
   perform MTDB_Destroy(schemaNamePrefix);

   count := 0;
   currentSchemaId := 1;
   loop
      currentSchemaName := schemaNamePrefix || ltrim(currentSchemaId::varchar(10));
      execute 'create schema ' || currentSchemaName;

      currentTableId := 1;
      loop
         currentTableName := currentSchemaName || '.' || 'table' || ltrim(currentTableId::varchar(10));
         execute 'create table ' || currentTableName || ' (f1 integer, f2 integer, f3 varchar(100), f4 varchar(100), f5 varchar(100), f6 varchar(100), f7 boolean, f8 boolean, f9 integer, f10 integer)';
         if (createViewForEachTable = true) then
            currentViewName := currentSchemaName || '.' || 'view' || ltrim(currentTableId::varchar(10));
            execute 'create view ' || currentViewName || ' as ' ||
                     'select t1.* from ' || currentTableName || ' t1 ' ||
             ' inner join ' || currentTableName || ' t2 on (t1.f1 = t2.f1) ' ||
             ' inner join ' || currentTableName || ' t3 on (t2.f2 = t3.f2) ' ||
             ' inner join ' || currentTableName || ' t4 on (t3.f3 = t4.f3) ' ||
             ' inner join ' || currentTableName || ' t5 on (t4.f4 = t5.f4) ' ||
             ' inner join ' || currentTableName || ' t6 on (t5.f5 = t6.f5) ' ||
             ' inner join ' || currentTableName || ' t7 on (t6.f6 = t7.f6) ' ||
             ' inner join ' || currentTableName || ' t8 on (t7.f7 = t8.f7) ' ||
             ' inner join ' || currentTableName || ' t9 on (t8.f8 = t9.f8) ' ||
             ' inner join ' || currentTableName || ' t10 on (t9.f9 = t10.f9) ';                    
         end if;
         currentTableId := currentTableId + 1;
         count := count + 1;
         if (currentTableId > numberOfTablesPerSchema) then exit; end if;
      end loop;   

      currentSchemaId := currentSchemaId + 1;
      if (currentSchemaId > numberOfSchemas) then exit; end if;     
   end loop;
   return count;
END $function$;
de>

Access all the objects in a single session
-- MTDB_RunTests

de  >CREATE OR REPLACE FUNCTION public.mtdb_runtests(schemanameprefix character varying, rounds integer)
 RETURNS integer
 LANGUAGE plpgsql
AS $function$
declare
   curs1 cursor(prefix varchar) is select table_schema || '.' || table_name from information_schema.tables where table_schema like prefix || '%' and table_type = 'VIEW';
   currentViewName varchar(100);
   count integer;
begin
   count := 0;
   loop
      rounds := rounds - 1;
      if (rounds < 0) then exit; end if;

      open curs1(schemaNamePrefix);
      loop
         fetch curs1 into currentViewName;
         if not found then exit; end if;
         execute 'select * from ' || currentViewName;
         count := count + 1;
      end loop;
      close curs1;
   end loop;
   return count;  
end $function$;
de>

test SQL:
prepare :
准备对象

de  >postgres=# select MTDB_Initialize('tenant', 100, 1000, true);
de>

访问对象
session 1 :

de  >postgres=# select MTDB_RunTests('tenant', 1);
 mtdb_runtests 
---------------
        100000
(1 row)
de>

访问对象
session 2 :

de  >postgres=# select MTDB_RunTests('tenant', 1);
 mtdb_runtests 
---------------
        100000
(1 row)
de>

观察内存的占用
memory view :

de  >  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+   COMMAND
 2536 digoal    20   0 20.829g 0.016t 1.786g S   0.0 25.7   3:08.20 postgres: postgres postgres [local] idle
 2453 digoal    20   0 6854896 187124 142780 S   0.0  0.3   0:00.68 postgres: postgres postgres [local] idle
de>

smem

de  >  PID User     Command                         Swap      USS      PSS     RSS 
 2536 digoal   postgres: postgres postgres        0 15022132 15535203 16894900 
 2453 digoal   postgres: postgres postgres        0 15022256 15535405 16895100 
de>


优化建议

.1. 应用层优化建议
对于长连接,建议空闲一段时间后,自动释放连接。
这样的话,即使因为某些原因一些连接访问了大量的对象,也不至于一直占用这些缓存不释放。
我们可以看到pgpool-II的设计,也考虑到了这一点,它会对空闲的server connection设置阈值,或者设置一个连接的使用生命周期,到了就释放重建。
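下面给出一个基于pg_stat_activity的辅助检查思路(仅为示意,阈值1小时为假设值,需以超级用户或有相应权限的账号执行),用于找出空闲时间过长的连接,确认后再断开,配合连接池的生命周期设置使用:

de  >-- 查看空闲超过1小时的连接(示意)
select pid, usename, now() - state_change as idle_time
  from pg_stat_activity
 where state = 'idle'
   and now() - state_change > interval '1 hour';

-- 确认无误后,断开这些空闲连接(示意)
select pg_terminate_backend(pid)
  from pg_stat_activity
 where state = 'idle'
   and now() - state_change > interval '1 hour';
de>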

.2. PostgreSQL内核优化建议
优化relcache的管理,为relcache等缓存提供LRU管理机制,限制总的大小,淘汰不经常访问的对象,同时建议提供SQL语法给用户,允许用户自主的释放cache。

阿里云RDS PG正在对内核进行优化,修正目前社区版本PG存在的这个问题。

参考

https://www.postgresql.org/message-id/flat/20160708012833.1419.89062%40wrigleys.postgresql.org#20160708012833.1419.89062@wrigleys.postgresql.org

de  >Every PostgreSQL session holds system data in own cache. Usually this cache
is pretty small (for significant numbers of users). But can be pretty big
if your catalog is untypically big and you touch almost all objects from
catalog in session. A implementation of this cache is simple - there is not
delete or limits. There is not garabage collector (and issue related to
GC), what is great, but the long sessions on big catalog can be problem.
The solution is simple - close session over some time or over some number
of operations. Then all memory in caches will be released.

Regards 

Pavel 
de>

随时欢迎来杭交流PostgreSQL相关技术,记得来之前请与我联系哦。

阿里云数据库Greenplum版发布啦


经过阿里云ApsaraDB小伙伴们几个月的不懈努力,Greenplum 终于上云了。
(这里有PostgreSQL内核小组的宇宙第一小鲜肉,还有宇宙无敌老腊肉)

云数据库Greenplum版(ApsaraDB for Greenplum)是基于Greenplum开源数据库项目的MPP大规模并行处理数据仓库产品,提供全面的SQL支持(包括符合SQL2008标准的OLAP分析函数),业界流行的BI软件都可以直接使用Greenplum进行在线业务分析。支持行存储和列存储混合模式,提高分析性能;同时提供数据压缩技术,降低存储成本。支持XML、GIS地理信息、模糊字符串等丰富的数据类型,为物联网、互联网、金融、政企等行业提供丰富的业务分析能力。
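以行列混合存储和数据压缩为例,下面给出一个列存压缩表的建表示意(仅为语法示意,表结构为假设,appendonly、orientation、compresstype等存储参数以实际版本文档为准):

de  >create table orders_col (
  order_id bigint,
  user_id  bigint,
  amount   numeric(12,2),
  created  timestamp
)
with (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5)
distributed by (order_id);
de>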

Greenplum从2008年在国内推广以来,生态已经非常的成熟,在 运营商、金融、物流、公安、政府、互联网 等行业都有非常庞大的用户群体。

从百TB到PB的OLAP仓库,Greenplum无疑是非常好的选择。

阿里云并不是简单的将Greenplum云化,还添加了一些非常贴地气的特性:

内核增强

  1. 支持插件 oss_ext、PostGIS、orafunc(Oracle兼容包)、DBLINK、MADlib(机器学习方面的函数库)、fuzzystrmatch插件,字符串模糊匹配;
    (OSS_EXT插件,读取存放在OSS(Object Storage Service,对象存储服务)上的文件。)
  2. 支持 create extension 语法创建插件
  3. 通过 dbsync 从 mysql,pg,ppas,gp 全量或增量同步到 pg,ppas,gp
  4. 引入第三方合作伙伴 ETL
  5. 支持ORCA优化器
  6. 只读实例(只允许select,drop,delete,copy to,truncate)
  7. 支持rds_superuser
  8. 修复BUG , gp_workfile_limit_per_segment 无法限制spill file使用量
  9. 修复BUG , Primary与Mirror数据同步缺省为非同步模式
  10. 修复BUG , copy 内存泄露
  11. 支持限制单个segment临时文件空间

异构数据导入

MySQL数据库可以通过mysql2pgsql进行高性能数据导入,同时业界流行的ETL工具均支持以Greenplum为目标的ETL数据导入
screenshot

OSS异构存储

可将存储于OSS中的格式化文件作为数据源,通过外部表模式进行实时操作,使用标准SQL语法实现数据查询
screenshot

透明数据复制(实现HTAP)

支持数据从PostgreSQL/PPAS透明流入,持续增量无需编程处理,简化维护工作,数据入库后可再进行高性能内部数据建模及数据清洗
screenshot

安全性

IP白名单配置

最多支持配置1000个允许连接RDS实例的服务器IP地址,从访问源进行直接的风险控制。

DDOS防护

在网络入口实时监测,当发现超大流量攻击时,对源IP进行清洗,清洗无效情况下可以直接拉进黑洞。

一键扩容

对于用户来说,再也不需要因数据的爆炸性增长而措手不及,只需在控制台点一个按钮,即可轻松应对扩容需求。(公测阶段暂不提供)

方案介绍

GIS地理数据分析方案

阿里云ApsaraDB for RDS(PostgreSQL)及Greenplum都已经内置符合OpenGIS标准的空间数据库引擎,可以实现实时的定位及路径规划,并直接支持业界广泛使用的ArcGIS。用户在应用程序中使用简单的SQL操作配合GIS函数,即可处理复杂的空间地理数据模型。得益于Greenplum的OLAP数据综合分析能力,用户更可以实现基于地理信息的海量数据分析工作,为物联网、移动互联网、物流配送、智慧出行(智慧城市)、LBS位置服务、O2O业务系统等提供强大的决策分析支持。
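例如,查找某坐标点5公里范围内的兴趣点,大致可以用如下SQL表达(仅为示意,假设表poi带有geography类型的geo字段):

de  >select name
  from poi
 where ST_DWithin(
         geo,
         ST_SetSRID(ST_MakePoint(120.19, 30.26), 4326)::geography,
         5000);  -- 距离单位为米
de>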
screenshot

OLTP+OLAP综合解决方案

用户现有Greenplum数据仓库可以通过原生的导出及导入方式将数据直接迁移到ApsaraDB for Greenplum实现云端数据仓库的OLAP在线分析使用。用户无需再进行复杂的Greenplum运维管理,同时阿里云为用户提供完整的扩容及可用性保障,让DBA及开发人员专注于如何通过SQL提供企业的业务生产力。通过阿里云ApsaraDB for RDS用户可以实现高性能的OLTP应用,同时RDS(PPAS)还提供了Oracle语法及PL/SQL的高度兼容特性;结合Greenplum后,所有前端RDS(PPAS)及RDS(PostgreSQL)中的OLTP数据将可实现与Greenplum的流式透传,用户只需要简单配置,即可实现OLTP到OLAP数据库的数据同步。
screenshot

Quick BI数据报表整合

Greenplum通过阿里云 数加 平台的 Quick BI报表功能,可以直接在线上实现丰富的可视化数据展现,与此同时在这里所生产的报表还可以平滑嵌入到自有系统,与用户的软件合为一体。Greenplum强劲的OLAP分析能力及高性能数据库列存,为多维分析提供性能的有效保障,从百GB到百TB性能平滑扩展,并支持复杂SQL查询。
screenshot

品尝地址

公测申请
https://cn.aliyun.com/product/gpdb
欢迎提出宝贵建议, 欢迎随时来阿里云促膝长谈业务需求 ,恭候光临。

还有一大波特性将要来袭

  1. 流式备份与恢复
  2. 支持部分节点执行计划(部分节点参与运算,而非所有节点)
  3. 支持节点间 connection pool
  4. 支持 replication table
  5. OSS外部表直接写入
  6. LLVM动态编译,等等

阿里云的小伙伴们加油,努力做最贴地气的云数据仓库。

[转载]聊聊Greenplum的那些事


原文  

http://dbaplus.cn/news-21-341-1.html


聊聊Greenplum的那些事

李巍 2016-04-01 14:15:00 1024

 

开卷有益——作者的话

 

 

有时候真的感叹人生岁月匆匆,特别是当一个IT人沉浸于某个技术领域十来年后,蓦然回首,总有说不出的万千感慨。

 

笔者有幸从04年就开始从事大规模数据计算的相关工作,08年作为Greenplum 早期员工加入Greenplum团队(当时的工牌是“005”,哈哈),记得当时看了一眼Greenplum的架构(嗯,就是现在大家耳熟能详的那个好多个X86框框的图),就义无反顾地加入了,转眼之间,已经到了第8个年头。

 

在诸多项目中我亲历了Greenplum在国内的生根发芽到高速发展,再到现在拥有一百多个企业级用户的过程。也见证了Greenplum从早期的2.1版本到当前的4.37版本,许多NB功能的不断增强、系统稳定性的不断大幅提高,在Greenplum的发展壮大中,IT行业也发生着巨大的变化,业界潮流沿着开放、开源的方向走向了大数据和云计算时代。由此看出,Greenplum十来年的快速发展不是偶然发生的,这与其在技术路线上始终保持与整个IT行业的技术演进高度一致密不可分的。

 


 

多年历练中接触过大大小小几十个数据类项目,有些浅尝辄止(最短的不到一周甚至还有远程支持),有些周期以年来计(长期出差现场、生不如死),客观来说, 每个项目都有其独一无二的的特点,只要有心,你总能在这个项目上学到些什么或有所领悟。我觉得把这些整理一下,用随笔的方式写下来,除了自己备忘以外,也许会对大家更深入地去了解GP带来一些启发,也或许在某个技术点上是你目前遇到的问题,如乱码怎么加载?异构数据库如何迁移?集群间如何高速复制数据?如何用C API扩展实现高效备份恢复等。希望我的这篇文章能够给大家带来帮助,同时也希望大家多拍砖。

 

Greenplum的起源

 

 

Greenplum最早是在10多年前(大约在2002年)出现的,基本上和Hadoop是同一时期(Hadoop 约是2004年前后,早期的Nutch可追溯到2002年)。当时的背景是:

 

1. 互联网行业经过之前近10年的由慢到快的发展,累积了大量信息和数据,数据在爆发式增长,这些海量数据急需新的计算方式,需要一场计算方式的革命;

2. 传统的主机计算模式在海量数据面前,除了造价昂贵外,在技术上也难于满足数据计算性能指标,传统主机的Scale-up模式遇到了瓶颈,SMP(对称多处理)架构难于扩展,并且在CPU计算和IO吞吐上不能满足海量数据的计算需求;

3. 分布式存储和分布式计算理论刚刚被提出来,Google的两篇著名论文发表后引起业界的关注,一篇是关于GFS分布式文件系统,另外一篇是关于MapReduce并行计算框架的理论,分布式计算模式在互联网行业特别是搜索引擎和分词检索等方面获得了巨大成功。

 

 


 

由此,业界认识到对于海量数据需要一种新的计算模式来支持,这种模式就是可以支持Scale-out横向扩展的分布式并行数据计算技术。

 

当时,开放的X86服务器技术已经能很好的支持商用,借助高速网络(当时是千兆以太网)组建的X86集群在整体上提供的计算能力已大幅高于传统SMP主机,并且成本很低,横向的扩展性还可带来系统良好的成长性。
 

问题来了,在X86集群上实现自动的并行计算,无论是后来的MapReduce计算框架还是MPP(海量并行处理)计算框架,最终还是需要软件来实现,Greenplum正是在这一背景下产生的,借助于分布式计算思想,Greenplum实现了基于数据库的分布式数据存储和并行计算(GoogleMapReduce实现的是基于文件的分布式数据存储和计算,我们过后会比较这两种方法的优劣性)。

 


 

话说当年Greenplum(当时还是一个Startup公司,创始人家门口有一棵青梅 ——greenplum,因此而得名)召集了十几位业界大咖(据说来自google、yahoo、ibm和TD),说干就干,花了1年多的时间完成最初的版本设计和开发,用软件实现了在开放X86平台上的分布式并行计算,不依赖于任何专有硬件,达到的性能却远远超过传统高昂的专有系统。

 

大家都知道Greenplum的数据库引擎层是基于著名的开源数据库Postgresql的(下面会分析为什么采用Postgresql,而不是mysql等等),但是Postgresql是单实例数据库,怎么能在多个X86服务器上运行多个实例且实现并行计算呢?为了这,Interconnnect大神器出现了。在那1年多的时间里,大拿们很大一部分精力都在不断的设计、优化、开发Interconnect这个核心软件组件。最终实现了对同一个集群中多个Postgresql实例的高效协同和并行计算,interconnect承载了并行查询计划生产和Dispatch分发(QD)、协调节点上QE执行器的并行工作、负责数据分布、Pipeline计算、镜像复制、健康探测等等诸多任务。

 

在Greenplum开源以前,据说一些厂商也有开发MPP数据库的打算,其中最难的部分就是在Interconnect上遇到了障碍,可见这项技术的关键性。

 

Greenplum为什么选择PostgreSQL做轮子

 

说到这,也许有同学会问,为什么Greenplum 要基于Postgresql? 这个问题大致引申出两个问题:

 

1、为什么不从数据库底层进行重新设计研发?

 

 道理比较简单,所谓术业有专攻,就像制造跑车的不会亲自生产车轮一样,我们只要专注在分布式技术中最核心的并行处理技术上面,协调我们下面的轮子跑的更快更稳才是我们的最终目标。而数据库底层组件就像车轮一样,经过几十年磨砺,数据库引擎技术已经非常成熟,大可不必去重新设计开发,而且把数据库底层交给其它专业化组织来开发(对应到Postgresql就是社区),还可充分利用到社区的源源不断的创新能力和资源,让产品保持持续旺盛的生命力。

 

这也是我们在用户选型时,通常建议用户考察一下底层的技术支撑是不是有好的组织和社区支持的原因,如果缺乏这方面的有力支持或独自闭门造轮,那就有理由为那个车的前途感到担忧,一个简单判断的标准就是看看底下那个轮子有多少人使用,有多少人为它贡献力量。

 

2、为什么是Postgresql而不是其它的?

 

我想大家可能主要想问为什么是Postgresql而不是Mysql(对不起,还有很多开源关系型数据库,但和这两个大牛比起来,实在不在一个起跑线上)。本文无意从详细技术点上PK这两个数据库孰优孰劣(网上很多比较),我相信它们的存在都有各自的特点,它们都有成熟的开源社区做支持,有各自庞大的fans群众基础。个人之见,Greenplum选择Postgresql有以下考虑:

 

1)Postgresql号称最先进的数据库(官方主页“The world’s most advanced open source database”),且不管这是不是自我标榜,就从OLAP分析型方面来考察,以下几点Postgresql确实胜出一筹:

 

1. PG有非常强大的SQL支持能力和非常丰富的统计函数、统计语法支持,除对ANSI SQL完全支持外,还支持分析函数(SQL2003 OLAP window函数),还可以用多种语言来写存储过程,对Madlib、R的支持也很好。这一点上MYSQL就差得很远,很多分析功能都不支持,而Greenplum作为MPP数据分析平台,这些功能都是必不可少的。

2. Mysql查询优化器对于子查询、复杂查询如多表关联、外关联的支持等较弱,特别是在关联时对于三大join技术:hash join、merge join、nestloop join的支持方面,Mysql只支持最后一种nestloop join(据说未来会支持hash join),而多个大表关联分析时hash join是必备的利器,缺少这些关键功能非常致命,将难于在OLAP领域充当大任。我们最近对基于MYSQL的某内存分布式数据库做对比测试时,发现其优点是OLTP非常快,TPS非常高(轻松搞定几十万),但一到复杂多表关联性能就立马下降,即使其具有内存计算的功能也无能为力,究其原因估计还是受到mysql在这方面的限制。

 

 

2)扩展性方面,Postgresql比mysql也要出色许多,Postgres天生就是为扩展而生的,你可以在PG中用Python、C、Perl、TCL、PLSQL等等语言来扩展功能,在后续章节中,我将展现这种扩展是如何的方便,另外,开发新的功能模块、新的数据类型、新的索引类型等等非常方便,只要按照API接口开发,无需对PG重新编译。PG中contrib目录下的各个第三方模块,在GP中的postgis空间数据库、R、Madlib、pgcrypto各类加密算法、gptext全文检索都是通过这种方式实现功能扩展的。
 

3)在诸如ACID事物处理、数据强一致性保证、数据类型支持、独特的MVCC带来高效数据更新能力等还有很多方面,Postgresql似乎在这些OLAP功能上都比mysql更甚一筹。

 

4)最后,Postgresql许可是仿照BSD许可模式的,没有被大公司控制,社区比较纯洁,版本和路线控制非常好,基于Postgresql可让用户拥有更多自主性。反观Mysql的社区现状和众多分支(如MariaDB),确实够乱的。

 

好吧,不再过多列举了,这些特点已经足够了,据说很多互联网公司采用Mysql来做OLTP的同时,却采用Postgresql来做内部的OLAP分析数据库,甚至对新的OLTP系统也直接采用Postgresql。

 

   

相比之下,Greenplum更强悍,把Postgresql作为实例(注:该实例非Oracle实例概念,这里指的是一个分布式子库)架构在Interconnect下,在Interconnect的指挥协调下,数十个甚至数千个Sub Postgresql数据库实例同时开展并行计算,而且,这些Postgresql之间采用share-nothing无共享架构,从而更将这种并行计算能力发挥到极致。

 

除此之外,MPP采用两阶段提交和全局事务管理机制来保证集群上分布式事务的一致性,Greenplum像Postgresql一样满足关系型数据库的包括ACID在内的所有特征。

    

 

 

从上图进而可以看到,Greenplum的最小并行单元不是节点层级,而是在实例层级,安装过Greenplum的同学应该都看到每个实例都有自己的postgresql目录结构,都有各自的一套Postgresql数据库守护进程(甚至可以通过UT模式进行单个实例的访问)。正因为如此,甚至一个运行在单节点上的GreenplumDB也是一个小型的并行计算架构,一般一个节点配置6~8个实例,相当于在一个节点上有6~8个Postgresql数据库同时并行工作,优势在于可以充分利用到每个节点的所有CPU和IO 能力。

 

Greenplum单个节点上运行能力比其它数据库也快很多,如果运行在多节点上,其提供性能几乎是线性的增长,这样一个集群提供的性能能够很轻易的达到传统数据库的数百倍甚至数千倍,所管理数据存储规模达到100TB~数PB,而你在硬件上的投入,仅仅是数台一般的X86服务器和普通的万兆交换机。

 

Greenplum采用Postgresl作为底层引擎,良好的兼容了Postgresql的功能,Postgresql中的功能模块和接口基本上99%都可以在Greenplum上使用,例如odbc、jdbc、oledb、perldbi、python psycopg2等,所以Greenplum与第三方工具、BI报表集成的时候非常容易;对于postgresql的contrib中的一些常用模块Greenplum提供了编译后的模块开箱即用,如oraface、postgis、pgcrypt等,对于其它模块,用户可以自行将contrib下的代码与Greenplum的include头文件编译后,将动态so库文件部署到所有节点就可进行测试使用了。有些模块还是非常好用的,例如oraface,基本上集成了Oracle常用的函数到Greenplum中,曾经在一次PoC测试中,用户提供的22条Oracle SQL语句,不做任何改动就能运行在Greenplum上。

 

最后特别提示,Greenplum绝不仅仅只是简单的等同于“Postgresql+interconnect并行调度+分布式事务两阶段提交”,Greenplum还研发了非常多的高级数据分析管理功能和企业级管理模块,这些功能都是Postgresql没有提供的:

 

   
  • 外部表并行数据加载

  • 可更新数据压缩表

  • 行、列混合存储

  • 数据表多级分区

  • Bitmap索引

  • Hadoop外部表

  • Gptext全文检索

  • 并行查询计划优化器和Orca优化器

  • Primary/Mirror镜像保护机制

  • 资源队列管理

  • WEB/Browser监控

   

 

Greenplum的艺术,一切皆并行(Parallel Everything)

 

 

前面介绍了Greenplum的分布式并行计算架构,其中每个节点上所有Postgresql实例都是并行工作的,这种并行的Style贯穿了Greenplum功能设计的方方面面:外部表数据加载是并行的、查询计划执行是并行的、索引的建立和使用是并行的,统计信息收集是并行的、表关联(包括其中的重分布或广播及关联计算)是并行的,排序和分组聚合都是并行的,备份恢复也是并行的,甚而数据库启停和元数据检查等维护工具也按照并行方式来设计,得益于这种无所不在的并行,Greenplum在数据加载和数据计算中表现出强悍的性能,某行业客户深有体会:同样2TB左右的数据,在Greenplum中不到一个小时就加载完成了,而在用户传统数据仓库平台上耗时半天以上。

 

在该用户的生产环境中,1个数百亿表和2个10多亿条记录表的全表关联中(只有on关联条件,不带where过滤条件,其中一个10亿条的表计算中需要重分布),Greenplum仅耗时数分钟就完成了,当其它传统数据平台还在为千万级或亿级规模的表关联性能发愁时,Greenplum已经一骑绝尘,在百亿级规模以上表关联中展示出上佳的表现。

 

Greenplum建立在Share-nothing无共享架构上,让每一颗CPU和每一块磁盘IO都运转起来,无共享架构将这种并行处理发挥到极致。相比一些其它传统数据仓库的Sharedisk架构,后者最大瓶颈就是在IO吞吐上,在大规模数据处理时,IO无法及时feed数据给到CPU,CPU资源处于wait 空转状态,无法充分利用系统资源,导致SQL效率低下:

 

一台内置16块SAS盘的X86服务器,每秒的IO数据扫描性能约在2000MB/s左右,可以想象,20台这样的服务器构成的机群IO性能是40GB/s,这样超大的IO吞吐是传统的 Storage难以达到的。

 

(MPP Share-nothing架构实现超大IO吞吐能力)

 

另外,Greenplum还是建立在实例级别上的并行计算,可在一次SQL请求中利用到每个节点上的多个CPU CORE的计算能力,对X86的CPU超线程有很好的支持,提供更好的请求响应速度。在PoC中接触到其它一些国内外基于开放平台的MPP软件,大都是建立在节点级的并行,单个或少量的任务时无法充分利用资源,导致系统加载和SQL执行性能不高。

 

记忆较深的一次PoC公开测试中,有厂商要求在测试中关闭CPU超线程,估计和这个原因有关(因为没有办法利用到多个CPU core的计算能力,还不如关掉超线程以提高单core的能力),但即使是这样,在那个测试中,测试性能也大幅低于Greenplum(那个测试中,各厂商基于客户提供的完全相同的硬件环境,Greenplum是唯一一家完成所有测试的,特别在混合负载测试中,Greenplum的80并发耗时3个多小时就成功完成了,其它厂商大都没有完成此项测试,唯一完成的一家耗时40多小时)

 

前文提到,得益于Postgresql的良好扩展性(这里是extension,不是scalability),Greenplum 可以采用各种开发语言来扩展用户自定义函数(UDF)(我个人是Python和C的fans,后续章节与大家分享)。这些自定义函数部署到Greenplum后可以充分享受到实例级别的并行性能优势,我们强烈建议用户将库外的处理逻辑,改用MPP数据库UDF这种In-Database的方式来处理,你将获得意想不到的性能和方便性;例如我们在某客户实现的数据转码、数据脱敏等,只需要简单改写原有代码后部署到GP中,通过并行计算获得数十倍性能提高。
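下面是一个数据脱敏UDF的最简化示意(假设对手机号做掩码,函数声明为immutable以便下推到各实例并行执行;表名和字段均为假设):

de  >create or replace function mask_phone(phone text) returns text as $$
  select case
           when phone is null then null
           else substr(phone, 1, 3) || '****' || substr(phone, 8)
         end;
$$ language sql immutable;

-- 库内并行脱敏(示意)
-- insert into users_masked select id, mask_phone(phone) from users;
de>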

 

另外,GPTEXT(基于Lucene/Solr的全文检索)、Apache Madlib(开源挖掘算法)、SAS algorithm、R都是通过UDF方式实现在Greenplum集群中分布式部署,从而获得库内计算的并行能力。这里可以分享的是,SAS曾经做过测试,对1亿条记录做逻辑回归,采用一台小型机耗时约4个多小时,通过部署到Greenplum集群中,耗时不到2分钟就全部完成了。以GPTEXT为例,下图展现了Solr全文检索在Greenplum中的并行化风格。

 


 

最后,也许有同学会有问题,Greenplum采用Master-slave架构,Master是否会成为瓶颈?完全不用担心,Greenplum所有的并行任务都是在Segment数据节点上完成后,Master只负责生成和优化查询计划、派发任务、协调数据节点进行并行计算。

 

按照我们在用户现场观察到的,Master上的资源消耗很少有超过20%情况发生,因为Segment才是计算和加载发生的场所(当然,在HA方面,Greenplum提供Standby Master机制进行保证)。

 

再进一步看,Master-Slave架构在业界的大数据分布式计算和云计算体系中被广泛应用,大家可以看到,现在主流分布式系统都是采用Master-Slave架构,包括:Hadoop FS、Hbase、MapReduce、Storm、Mesos……无一例外都是Master-Slave架构。相反,采用Multiple Active Master 的软件系统,需要消耗更多资源和机制来保证元数据一致性和全局事务一致性,特别是在节点规模较多时,将导致性能下降,严重时可能导致多Master之间的脑裂引发严重系统故障。

 

Greenplum不能做什么?

 

Greenplum最大的特点总结就一句话:基于低成本的开放平台基础上提供强大的并行数据计算性能和海量数据管理能力。这个能力主要指的是并行计算能力,是对大任务、复杂任务的快速高效计算,但如果你指望MPP并行数据库能够像OLTP数据库一样,在极短的时间处理大量的并发小任务,这个并非MPP数据库所长。请牢记,并行和并发是两个完全不同的概念,MPP数据库是为了解决大问题而设计的并行计算技术,而不是大量的小问题的高并发请求。

 

再通俗点说,Greenplum主要定位在OLAP领域,利用Greenplum MPP数据库做大数据计算或分析平台非常适合,例如:数据仓库系统、ODS系统、ACRM系统、历史数据管理系统、电信流量分析系统、移动信令分析系统、SANDBOX自助分析沙箱、数据集市等等。

 

而MPP数据库都不擅长做OLTP交易系统,所谓交易系统,就是高频的交易型小规模数据插入、修改、删除,每次事务处理的数据量不大,但每秒钟都会发生几十次甚至几百次以上交易型事务 ,这类系统的衡量指标是TPS,适用的系统是OLTP数据库或类似Gemfire的内存数据库。

 

Greenplum MPP 与 Hadoop

 

MPP和Hadoop都是为了解决大规模数据的并行计算而出现的技术,两种技术的相似点在于:

 

   
  • 分布式存储数据在多个节点服务器上

  •  采用分布式并行计算框架

  • 支持横向扩展来提高整体的计算能力和存储容量

  • 都支持X86开放集群架构

 

 

但两种技术在数据存储和计算方法上,也存在很多显而易见的差异:

 

   
  • MPP按照关系数据库行列表方式存储数据(有模式),Hadoop按照文件切片方式分布式存储(无模式)

  • 两者采用的数据分布机制不同,MPP采用Hash分布,计算节点和存储紧密耦合,数据分布粒度在记录级的更小粒度(一般在1k以下);Hadoop FS按照文件切块后随机分配,节点和数据无耦合,数据分布粒度在文件块级(缺省64MB)。

  • MPP采用SQL并行查询计划,Hadoop采用Mapreduce框架

   

 

基于以上不同,体现在效率、功能等特性方面也大不相同:

 

1.计算效率比较:

 

先说说Mapreduce技术,Mapreduce相比而言是一种较为蛮力计算方式(业内曾经甚至有声音质疑MapReduce是反潮流的),数据处理过程分成Map-〉Shuffle-〉Reduce的过程,相比MPP 数据库并行计算而言,Mapreduce的数据在计算前未经整理和组织(只是做了简单数据分块,数据无模式),而MPP预先会把数据有效的组织(有模式),例如:行列表关系、Hash分布、索引、分区、列存储等、统计信息收集等,这就决定了在计算过程中效率大为不同:

 

1. MAP效率对比:

 

  • Hadoop的MAP阶段需要对数据的再解析,而MPP数据库直接取行列表,效率高

  • Hadoop按照64MB分拆文件,而且数据不能保证在所有节点均匀分布,因此MAP过程的并行化程度低;MPP数据库按照数据记录拆分和Hash分布,粒度更细,数据分布在所有节点中非常均匀,并行化程度很高

  •  Hadoop HDFS没有灵活的索引、分区、列存储等技术支持,而MPP通常利用这些技术大幅提高数据的检索效率;

2. Shuffle效率对比:(Hadoop Shuffle对比MPP计算中的重分布)

 

  • 由于Hadoop数据与节点的无关性,因此Shuffle是基本避免不了的;而MPP数据库对于相同Hash分布数据不需要重分布,节省大量网络和CPU消耗;

  • Mapreduce没有统计信息,不能做基于cost-base的优化;MPP数据库利用统计信息可以很好的进行并行计算优化,例如,对于不同分布的数据,可以在计算中基于Cost动态的决定最优执行路径,如采用重分布还是小表广播

3. Reduce效率对比:(对比MPP数据库的SQL执行器executor)

 

  • Mapreduce缺乏灵活的Join技术支持,MPP数据库可以基于COST来自动选择Hash join、Merger join和Nestloop join,甚至可以在Hash join通过COST选择小表做Hash,在Nestloop Join中选择index提高join性能等等;

  • MPP数据库对于Aggregation(聚合)提供Multiple-agg、Group-agg、Sort-agg等多种技术来提升计算性能,Mapreduce则需要开发人员自己实现;

4. 另外,Mapreduce在整个MAP->Shuffle->Reduce过程中通过文件来交换数据,效率很低,MapReduce要求每个步骤间的数据都要序列化到磁盘,这意味着MapReduce作业的I/O成本很高,导致交互分析和迭代算法开销很大;MPP数据库采用Pipeline方式在内存数据流中处理数据,效率比文件方式高很多;

 

 

总结以上几点,MPP数据库在计算并行度、计算算法上比Hadoop更加SMART,效率更高;在客户现场的测试对比中,Mapreduce对于单表的计算尚可,但对于复杂查询,如多表关联等,性能很差,性能甚至只有MPP数据库的几十分之一甚至几百分之一,下图是基于MapReduce的Hive和Greenplum MPP在TPCH 22个SQL测试性能比较:(相同硬件环境下)

 


 

还有,某国内知名电商在其数据分析平台做过验证,同样硬件条件下,MPP数据库比Hadoop性能快12倍以上。

 

2.功能上的对比

 

MPP数据库采用SQL作为主要交互式语言,SQL语言简单易学,具有很强数据操纵能力和过程语言的流程控制能力,SQL语言是专门为统计和数据分析开发的语言,各种功能和函数琳琅满目,SQL语言不仅适合开发人员,也适用于分析业务人员,大幅简化了数据的操作和交互过程。

 

而对MapReduce编程明显是困难的,在原生的Mapreduce开发框架基础上的开发,需要技术人员谙熟于JAVA开发和并行原理,不仅业务分析人员无法使用,甚至技术人员也难以学习和操控。为了解决易用性的问题,近年来SQL-0N-HADOOP技术大量涌现出来,几乎成为当前Hadoop开发使用的一个技术热点趋势。

 

这些技术包括:Hive、Pivotal HAWQ、SPARK SQL、Impala、Presto、Drill、Tajo等等很多,这些技术有些是在Mapreduce上做了优化,例如Spark采用内存中的Mapreduce技术,号称性能比基于文件的Mapreduce提高10倍;有的则采用C/C++语言替代Java语言重构Hadoop和Mapreduce(如MapR公司及国内某知名电商的大数据平台);而有些则直接绕开了Mapreduce另起炉灶,如Impala、HAWQ借鉴MPP计算思想来做查询优化和内存数据Pipeline计算,以此来提高性能。

 

虽然SQL-On-Hadoop比原始的Mapreduce在易用性上有所提高,但在SQL成熟度和关系分析上目前还与MPP数据库有较大差距:

 

   
  • 上述系统,除了HAWQ外,对SQL的支持都非常有限,特别是分析型复杂SQL,如SQL 2003 OLAP WINDOW函数,几乎都不支持,以TPC-DS测试(用于评测决策支持系统(大数据或数据仓库)的标准SQL测试集,99个SQL)为例,包括SPARK、Impala、Hive只支持其中1/3左右;

 

  • 由于HADOOP 本身Append-only特性,SQL-On-Hadoop大多不支持数据局部更新和删除功能(update/delete);而有些,例如Spark计算时,需要预先将数据装载到DataFrames模型中;

 

  • 基本上都缺少索引和存储过程等等特征

 

  • 除HAWQ外,大多对于ODBC/JDBC/DBI/OLEDB/.NET接口的支持有限,与主流第三方BI报表工具兼容性不如MPP数据库

 

  •  SQL-ON-HADOOP不擅长于交互式(interactive)的Ad-hoc查询,多通过预关联的方式来规避这个问题;另外,在并发处理方面能力较弱,高并发场景下,需要控制计算请求的并发度,避免资源过载导致的稳定性问题和性能下降问题;

 

 

 3.架构灵活性对比:

 

前文提到,为保证数据的高性能计算,MPP数据库节点和数据之间是紧耦合的,相反,Hadoop的节点和数据是没有耦合关系的。这就决定了Hadoop的架构更加灵活-存储节点和计算节点的无关性,这体现在以下2个方面:

 

1. 扩展性方面

 

  • Hadoop架构支持单独增加数据节点或计算节点,依托于Hadoop的SQL-ON-HADOOP系统,例如HAWQ、SPARK均可单独增加计算层的节点或数据层的HDFS存储节点,HDFS数据存储对计算层来说是透明的;

  • MPP数据库扩展时,一般情况下是计算节点和数据节点一起增加的,在增加节点后,需要对数据做重分布才能保证数据与节点的紧耦合(重新hash数据),进而保证系统的性能;Hadoop在增加存储层节点后,虽然也需要Rebalance数据,但相较MPP而言,不是那么紧迫。

2. 节点退服方面

 

  •  Hadoop节点宕机退服,对系统的影响较小,并且系统会自动将数据在其它节点扩充到3份;MPP数据库节点宕机时,系统的性能损耗大于Hadoop节点。

 

 

Pivotal将GPDB 的MPP技术与Hadoop分布式存储技术结合,推出了HAWQ高级数据分析软件系统,实现了Hadoop上的SQL-on-HADOOP,与其它的SQL-on-HADOOP系统不同,HAWQ支持完全的SQL语法 和SQL 2003 OLAP 语法及Cost-Base的算法优化,让用户就像使用关系型数据库一样使用Hadoop。底层存储采用HDFS,HAWQ实现了计算节点和HDFS数据节点的解耦,采用MR2.0的YARN来进行资源调度,同时具有Hadoop的灵活伸缩的架构特性和MPP的高效能计算能力.

 

当然,有得也有所失,HAWQ的架构比Greenplum MPP数据库灵活,在获得架构优越性的同时,其性能比Greenplum MPP数据库要低一倍左右,但得益于MPP算法的红利,HAWQ性能仍大幅高于其它的基于MapReduce的SQL-on-HADOOP系统。

 

4.选择MPP还是Hadoop

 

总结一下,Hadoop MapReduce和SQL-on-HADOOP技术目前都还不够成熟,性能和功能上都有很多待提升的空间,相比之下,MPP数据在数据处理上更加SMART,要填平或缩小与MPP数据库之间的性能和功能上的GAP,Hadoop还有很长的一段路要走。

 

就目前来看,个人认为这两个系统都有其适用的场景,简单来说,如果你的数据需要频繁的被计算和统计、并且你希望具有更好的SQL交互式支持和更快计算性能及复杂SQL语法的支持,那么你应该选择MPP数据库,SQL-on-Hadoop技术还没有Ready。特别如数据仓库、集市、ODS、交互式分析数据平台等系统,MPP数据库有明显的优势。

 

而如果你的数据加载后只会被用于读取少数次的任务和用于少数次的访问,而且主要用于Batch(不需要交互式),对计算性能不是很敏感,那Hadoop也是不错的选择,因为Hadoop不需要你花费较多的精力来模式化你的数据,节省数据模型设计和数据加载设计方面的投入。这些系统包括:历史数据系统、ETL临时数据区、数据交换平台等等。

 

总之,Bear in mind,千万不要为了大数据而大数据(就好像不要为了创新而创新一个道理),否则,你项目最后的产出与你的最初设想可能将差之千里,行业内不乏失败案例。

 

最后,提一下,Greenplum MPP数据库支持用“Hadoop外部表”方式来访问、加载Hadoop FS的数据,虽然Greenplum的Hadoop外部表性能大幅低于MPP内部表,但比Hadoop 自身的HIVE要高很多(在某金融客户的测试结果,比HIVE高8倍左右),因此可以考虑在项目中同时部署MPP数据库和Hadoop,MPP用于交互式高性能分析,Hadoop用于数据Staging、MPP的数据备份或一些ETL batch的数据清洗任务,两者相辅相成,在各自最擅长的场景中发挥其特性和优势。
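一个Hadoop外部表的定义示意如下(以GP 4.x的gphdfs协议为例,namenode地址、路径和表结构均为假设,具体语法以所用版本的文档为准):

de  >create external table ext_weblog (
  ts      timestamp,
  user_id bigint,
  url     text
)
location ('gphdfs://namenode:8020/data/weblog')
format 'text' (delimiter '|');
de>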

 


 

5.未来GP发展之路

 

过去十年,IT技术潮起潮落发生着时刻不停的变化,而在这变化中的不变就是走向开放和开源的道路,即将到来的伟大变革是云计算时代的到来,任何IT技术,从硬件到软件到服务,都逃不过要接受云计算的洗礼,不能赶上时代潮流的技术和公司都将被无情的淘汰。大数据也要拥抱云计算,大数据将作为一种数据服务来提供(DaaS-Data as A Service),依靠云提供共享的、弹性、按需分配的大数据计算和存储的服务。

 

Greenplum MPP数据库从已一开始就是开放的技术,并且在2015年年底已经开源和成立社区(在开源第一天就有上千个Download),可以说,Greenplum已经不仅仅只是Pivotal公司一家的产品,我们相信越来越多组织和个人会成为Greenplum的Contributor贡献者,随着社区的发展将推动Greenplum MPP数据库走向新的高速发展旅程。(分享一下开源的直接好处,最近我们某用户的一个特殊需求,加载数据中有回车等特殊字符,我们下载了GP外部表gpfdist源代码,不到一天就轻松搞定问题)

 

Greenplum也正在积极的拥抱云计算,Cloud Foundry的PaaS云平台正在技术考虑把Greenplum MPP做为DaaS服务来提供,对于Mesos或其它云计算技术的爱好者,也可以考虑采用容器镜像技术+集群资源框架管理技术来部署Greenplum,从而可以实现在公共计算资源集群上的MPP敏捷部署和资源共享与分配。

总之,相信沿着开放、开源、云计算的路线继续前行,Greenplum MPP数据库在新的时代将保持旺盛的生命力,继续高速发展。

 

下一期敬请关注:

《决定GP数据库良好运行的几个关键因素》

 

本文来源pivotal_china官方微信,经作者同意由DBA+社群编辑整理。 

 

作者简介  李巍

  • 资深Greenplum 数据库专家, 拥有超过10年从事分布式计算和大数据计算的经历,对大规模数据计算有着丰富的实战经验。

  • 曾在包括大型电商、国有大银行、电信行业在内的数十个项目中担任架构师或技术专家,在很多关键性问题的解决上做出过多项贡献。

  • 长期关注开源技术,对开源数据库及Python、C、Perl语言极其熟练,信念“实践出真知”的真理。

PostgreSQL cluster大幅减少nestloop离散IO的优化方法


背景

对于数据量较大的表,如果JOIN发生在有索引的字段上,且驱动侧是小结果集,用nestloop JOIN是比较好的方法。

但是nestloop带来的一个问题就是离散IO,这个是无法回避的问题,特别是硬件IO能力不行的情况下,性能会比较糟糕。

有什么优化方法呢?

PostgreSQL提供了一个命令,可以修改物理存储的顺序,减少离散IO就靠它了。

例子

创建两张表

de  >postgres=# create unlogged table test01(id int primary key, info text);
CREATE TABLE
postgres=# create unlogged table test02(id int primary key, info text);
CREATE TABLE
de>

产生一些离散primary key数据

de  >postgres=# insert into test01 select trunc(random()*10000000), md5(random()::text) from generate_series(1,10000000) on conflict on constraint test01_pkey do nothing;
INSERT 0 6322422

postgres=#  insert into test02 select trunc(random()*10000000), md5(random()::text) from generate_series(1,10000000) on conflict on constraint test02_pkey do nothing;
INSERT 0 6320836
de>

分析表

de  >postgres=# analyze test01;
postgres=# analyze test02;
de>

清除缓存,并重启

de  >$ pg_ctl stop -m fast
# echo 3 > /proc/sys/vm/drop_caches
$ pg_ctl start
de>

第一次调用,耗费大量的离散IO,执行时间18.490毫秒(我这台机器是SSD,IOPS能力算好的,差的机器时间更长)

de  >postgres=# explain (analyze,verbose,timing,costs,buffers) select t1.*,t2.* from test01 t1,test02 t2 where t1.id=t2.id and t1.id between 1 and 1000;
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=19.25..7532.97 rows=623 width=74) (actual time=0.465..17.221 rows=402 loops=1)
   Output: t1.id, t1.info, t2.id, t2.info
   Buffers: shared hit=1929 read=1039 dirtied=188
   ->  Bitmap Heap Scan on public.test01 t1  (cost=18.82..2306.39 rows=623 width=37) (actual time=0.416..8.019 rows=640 loops=1)
         Output: t1.id, t1.info
         Recheck Cond: ((t1.id >= 1) AND (t1.id <= 1000))
         Heap Blocks: exact=637
         Buffers: shared hit=5 read=637 dirtied=123
         ->  Bitmap Index Scan on test01_pkey  (cost=0.00..18.66 rows=623 width=0) (actual time=0.254..0.254 rows=640 loops=1)
               Index Cond: ((t1.id >= 1) AND (t1.id <= 1000))
               Buffers: shared hit=4 read=1
   ->  Index Scan using test02_pkey on public.test02 t2  (cost=0.43..8.38 rows=1 width=37) (actual time=0.013..0.013 rows=1 loops=640)
         Output: t2.id, t2.info
         Index Cond: (t2.id = t1.id)
         Buffers: shared hit=1924 read=402 dirtied=65
 Planning time: 26.668 ms
 Execution time: 18.490 ms
(17 rows)
de>

第二次,缓存命中5.4毫秒

de  >postgres=# explain (analyze,verbose,timing,costs,buffers) select t1.*,t2.* from test01 t1,test02 t2 where t1.id=t2.id and t1.id between 1 and 1000;
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=19.25..7532.97 rows=623 width=74) (actual time=0.392..5.150 rows=402 loops=1)
   Output: t1.id, t1.info, t2.id, t2.info
   Buffers: shared hit=2968
   ->  Bitmap Heap Scan on public.test01 t1  (cost=18.82..2306.39 rows=623 width=37) (actual time=0.373..1.760 rows=640 loops=1)
         Output: t1.id, t1.info
         Recheck Cond: ((t1.id >= 1) AND (t1.id <= 1000))
         Heap Blocks: exact=637
         Buffers: shared hit=642
         ->  Bitmap Index Scan on test01_pkey  (cost=0.00..18.66 rows=623 width=0) (actual time=0.218..0.218 rows=640 loops=1)
               Index Cond: ((t1.id >= 1) AND (t1.id <= 1000))
               Buffers: shared hit=5
   ->  Index Scan using test02_pkey on public.test02 t2  (cost=0.43..8.38 rows=1 width=37) (actual time=0.004..0.004 rows=1 loops=640)
         Output: t2.id, t2.info
         Index Cond: (t2.id = t1.id)
         Buffers: shared hit=2326
 Planning time: 0.956 ms
 Execution time: 5.434 ms
(17 rows)
de>

根据索引字段调整表的物理顺序,降低离散IO。

de  >postgres=# cluster test01 using test01_pkey;
CLUSTER
postgres=# cluster test02 using test02_pkey;
CLUSTER
postgres=# analyze test01;
postgres=# analyze test02;
de>

清除缓存,重启数据库

de  >$ pg_ctl stop -m fast
# echo 3 > /proc/sys/vm/drop_caches
$ pg_ctl start
de>

第一次调用,降低到了5.4毫秒

de  >postgres=# explain (analyze,verbose,timing,costs,buffers) select t1.*,t2.* from test01 t1,test02 t2 where t1.id=t2.id and t1.id between 1 and 1000;
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.86..5618.07 rows=668 width=74) (actual time=0.069..4.072 rows=402 loops=1)
   Output: t1.id, t1.info, t2.id, t2.info
   Buffers: shared hit=2323 read=12
   ->  Index Scan using test01_pkey on public.test01 t1  (cost=0.43..30.79 rows=668 width=37) (actual time=0.040..0.557 rows=640 loops=1)
         Output: t1.id, t1.info
         Index Cond: ((t1.id >= 1) AND (t1.id <= 1000))
         Buffers: shared hit=5 read=6
   ->  Index Scan using test02_pkey on public.test02 t2  (cost=0.43..8.35 rows=1 width=37) (actual time=0.004..0.004 rows=1 loops=640)
         Output: t2.id, t2.info
         Index Cond: (t2.id = t1.id)
         Buffers: shared hit=2318 read=6  --  注意在cluster之后,shared hit并没有下降,因为LOOP了多次,但是性能确比cluster 之前提升了很多,因为需要访问的HEAP page少了,OS cache可以瞬间命中。 
 Planning time: 42.356 ms
 Execution time: 5.426 ms
(13 rows)
de>

第二次调用,3.6毫秒

de  >postgres=# explain (analyze,verbose,timing,costs,buffers) select t1.*,t2.* from test01 t1,test02 t2 where t1.id=t2.id and t1.id between 1 and 1000;
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.86..5618.07 rows=668 width=74) (actual time=0.055..3.414 rows=402 loops=1)
   Output: t1.id, t1.info, t2.id, t2.info
   Buffers: shared hit=2335
   ->  Index Scan using test01_pkey on public.test01 t1  (cost=0.43..30.79 rows=668 width=37) (actual time=0.037..0.374 rows=640 loops=1)
         Output: t1.id, t1.info
         Index Cond: ((t1.id >= 1) AND (t1.id <= 1000))
         Buffers: shared hit=11
   ->  Index Scan using test02_pkey on public.test02 t2  (cost=0.43..8.35 rows=1 width=37) (actual time=0.003..0.004 rows=1 loops=640)
         Output: t2.id, t2.info
         Index Cond: (t2.id = t1.id)
         Buffers: shared hit=2324
 Planning time: 1.042 ms
 Execution time: 3.620 ms
(13 rows)
de>

小结

通过cluster,将表的物理存储顺序和索引顺序对齐,如果查询的值是连续的,在使用嵌套循环时可以大幅减少离散IO,取得非常好的查询优化效果。
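cluster并analyze之后,可以通过pg_stats的correlation字段确认表物理顺序与索引列的相关性(接近1表示物理顺序与该列基本一致),示意如下:

de  >select tablename, attname, correlation
  from pg_stats
 where tablename in ('test01', 'test02')
   and attname = 'id';
de>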

如果查询的值是跳跃的,那么这种方法就没有效果啦,不过好在PostgreSQL有bitmap index scan,在读取heap tuple前,会对ctid排序,按排序后的ctid取heap tuple,也可以起到减少离散IO的作用。

连接 0.0.0.0/32 发生了什么


根据RFC 3330, 1700 的描述, 0.0.0.0/32 可以用作当前网络的源地址。

de  >0.0.0.0/8 - Addresses in this block refer to source hosts on "this" network.  
Address 0.0.0.0/32 may be used as a source address for this host on this network; 
other addresses within 0.0.0.0/8 may be used to refer to specified hosts on this network.
[RFC1700, page 4].
de>

0.0.0.0/32 作为目标地址使用时,与127.0.0.1含义一样。
但是0.0.0.0还有更多的含义,如下

de  >IP address numbers in Internet Protocol (IP) version 4 (IPv4) range from 0.0.0.0 up to 255.255.255.255. The IP address 0.0.0.0 has several special meanings on computer networks. It cannot be used as a general-purpose device address, however.

IPv6 networks have a similar concept of an all-zeros network address.

0.0.0.0 on Clients

PCs and other client devices normally show an address of 0.0.0.0 when they are not connected to a TCP/IP network. A device may give itself this address by default whenever they are offline. It may also be automatically assigned by DHCP in case of address assignment failures.  When set with this address, a device cannot communicate with any other devices on that network over IP.

0.0.0.0 can also theoretically set as a device's network (subnet) mask rather than its IP address. However, a subnet mask with this value has no practical purpose. Both the IP address and network maskare typically assigned as 0.0.0.0 on a client together.  

Software Application and Server Uses of 0.0.0.0
Some devices, particularly network servers, possess more than one IP network interface. TCP/IP software applications use 0.0.0.0 as a programming technique to monitor network traffic across all of the IP addresses currently assigned to the interfaces on that multi-homed device.

While connected computers do not use this address, messages carried over IP sometimes include 0.0.0.0 inside the protocol header when the source of the message is unknown.

The Use of 0.0.0.0 vs. 127.0.0.1 on Local Networks

Students of computer networks sometimes confuse the usages of 0.0.0.0 and 127.0.0.1 on IP networks. Whereas 0.0.0.0 has several defined uses as described above, 127.0.0.1 has the one very specific purpose of allowing a device to send messages to itself.

Troubleshooting IP Address Problems with 0.0.0.0

If a computer is properly configured for TCP/IP networking yet still shows 0.0.0.0 for an address, try the following to troubleshoot this problem and obtain a valid address:

On networks with dynamic address asssignment support, release and renew the computer's IP address. Failures with DHCP assignment can be intermittent or persistent. If the failures persist, troubleshoot the DHCP server configuration: Common causes of failure include having no available addresses in the DHCP pool.
For networks that require static IP addressing, configure a valid IP address on the computer.
de>

0.0.0.0/32 可以用来表示当前网络,与0.0.0.0建立连接,实际上是与回环地址建立连接。

如下

de  ># ping 0.0.0.0
PING 0.0.0.0 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.018 ms
64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.033 ms
de>

如果把回环地址shutdown,就无法连接0.0.0.0了。

de  ># ifdown lo

$ psql -h 0.0.0.0
psql: could not connect to server: Connection timed out
        Is the server running on host "0.0.0.0" and accepting
        TCP/IP connections on port 1921?

# ping 0.0.0.0
PING 0.0.0.0 (127.0.0.1) 56(84) bytes of data.
^C
--- 0.0.0.0 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 1999ms
de>

所以连接0.0.0.0匹配的PostgreSQL服务端的pg_hba.conf条目是127.0.0.1/32。而不是0.0.0.0/0 。

de  >pg_hba.conf
host all all 127.0.0.1/32  trust
de>

参考
https://www.rfc-editor.org/rfc/rfc1700.txt
https://www.rfc-editor.org/rfc/rfc3330.txt
http://compnetworking.about.com/od/workingwithipaddresses/g/0_0_0_0_ip-address.htm

PostgreSQL Oracle 兼容性 之 - PL/SQL record, table类型定义


背景

Oracle PL/SQL是非常强大的一门SQL编程语言,许多Oracle用户也使用它来处理一些要求延迟低且数据一致性或可靠性要求很高的业务逻辑。

PostgreSQL也有一门非常高级的内置SQL编程语言,plpgsql。与Oracle PL/SQL语法极其类似,但是还是有一些不一样的地方。
(PS:除了plpgsql,PostgreSQL还支持C,java,python,perl等流行的语言作为数据库的函数编程语言)

本文是针对有Oracle用户遇到的一些函数语法与PostgreSQL不兼容的地方,给出的修改建议。
涉及type xx is table of xxxx index by binary_integer语法、type xx is record语法。

Oracle PL/SQL 例子

de  >CREATE OR REPLACE FUNCTION f_xml(p_xml CLOB) RETURN INT
AS

...
type rec_tk is record
(
tkno VARCHAR2(100) ,
cg_zdj number(12,0) := 0 ,
cg_jsf number(12,0) := 0
);

type tklist is table of rec_tk index by binary_integer;


type rec_cjr is record
(
cjrid varchar2(30) ,
tk tklist
);

type cjr is table of rec_cjr index by binary_integer;
p_cjrs cjr;

FOR j IN 0..v_nllen-1 LOOP
  BEGIN

...

   p_cjrs(j).cjrid := v_nodevalue;

...

   p_cjrs(j).tk(v_tkcount).tkno := v_nodevalue;
   p_cjrs(j).tk(v_tkcount).cg_zdj := nvl(v_nodevalue,0);
   p_cjrs(j).tk(v_tkcount).cg_jsf := nvl(v_nodevalue,0);

...

   v_tkcount:=v_tkcount+1;

END LOOP;
de>

在这个例子中,用到了Oracle在PL/SQL中支持的type定义,以及type table 的定义,这个在PostgreSQL中用法不太一样。

PostgreSQL PL/SQL 兼容性例子

PostgreSQL的type定义需要在数据库中定义,而不是函数中定义。

以上PL/SQL函数在plpgsql中需要作出如下调整:
.1.

de  >type rec_tk is record
(
tkno VARCHAR2(100) ,
cg_zdj number(12,0) := 0 ,
cg_jsf number(12,0) := 0
);

type tklist is table of rec_tk index by binary_integer;
de>

修改为
函数外执行创建类型的SQL

de  >create type rec_tk as 
(
tkno VARCHAR(100) ,
cg_zdj numeric(12,0) ,
cg_jsf numeric(12,0) 
);
de>

.2.

de  >type rec_cjr is record
(
cjrid varchar2(30) ,
tk tklist
);

type cjr is table of rec_cjr index by binary_integer;
p_cjrs cjr;
de>

修改为
函数外执行创建类型的SQL

de  >create type rec_cjr as
(
cjrid varchar(30) ,
tk rec_tk[]
);
de>

函数内对table的使用修改为数组的使用,数组的下标从1开始。

de  >p_cjrs rec_cjr[];
de>

.3.

de  >   p_cjrs(j).cjrid := v_nodevalue;
...
   p_cjrs(j).tk(v_tkcount).tkno := v_nodevalue;
   p_cjrs(j).tk(v_tkcount).cg_zdj := nvl(v_nodevalue,0);
   p_cjrs(j).tk(v_tkcount).cg_jsf := nvl(v_nodevalue,0);
de>

plpgsql目前不能直接修改复合数组对应的composite.element
需要修改为

de  >declare
   v_p_cjrs rec_cjr;
   v_tk rec_tk;
...

   v_p_cjrs.cjrid := v_nodevalue;
   p_cjrs[j] := v_p_cjrs;
...

   v_tk.tkno := v_nodevalue;
   v_tk.cg_zdj := nvl(v_nodevalue,0);
   v_tk.cg_jsf := nvl(v_nodevalue,0);
   v_p_cjrs.tk[v_tkcount] := v_tk;
   p_cjrs[j] := v_p_cjrs;

de>

或者请参考如下例子

de  >do language plpgsql $$
declare
  vtk rec_tk;
  vtk_a rec_tk[];
  vcjr rec_cjr;
  vcjr_a rec_cjr[];
begin
  vtk := row('a', 1,2);
  -- or vtk.tkno := 'a'; vtk.cg_zdj := 1; vtk.cg_jsf := 2; 
  vtk_a[1] := vtk;

  vcjr := row('test', vtk_a);
  -- or vcjr := row('test', array[row('a',1,2)]);
  -- or vcjr.cjrid := 'test'; vcjr.tk := vtk_a;
  -- or vcjr_a[1] := row('test', array[row('a',1,2)]);
  vcjr_a[1] := vcjr;
  raise notice 'vtk %, vtk_a % vcjr % vcjr_a % ', vtk, vtk_a, vcjr, vcjr_a;
end;
$$;

NOTICE:  00000: vtk (a,1,2), vtk_a {"(a,1,2)"} vcjr (test,"{""(a,1,2)""}") vcjr_a {"(test,\"{\"\"(a,1,2)\"\"}\")"} 
LOCATION:  exec_stmt_raise, pl_exec.c:3216
DO

de>

nvl函数参考PostgreSQL Oracle兼容包orafce。

.4. 
array用法简介
http://blog.163.com/digoal@126/blog/static/163877040201201275922529/
https://www.postgresql.org/docs/9.5/static/arrays.html
https://www.postgresql.org/docs/9.5/static/plpgsql-control-structures.html#PLPGSQL-FOREACH-ARRAY

循环

de  >[ <<label>> ]
FOREACH target [ SLICE number ] IN ARRAY expression LOOP
    statements
END LOOP [ label ];
de>
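
结合前文创建的rec_tk类型,下面给出一个FOREACH遍历复合类型数组的简单示意:

de  >do language plpgsql $$
declare
  vtk_a rec_tk[];
  vtk   rec_tk;
begin
  vtk := row('a', 1, 2);
  vtk_a[1] := vtk;
  vtk := row('b', 3, 4);
  vtk_a[2] := vtk;

  foreach vtk in array vtk_a loop
    raise notice 'tkno=% cg_zdj=% cg_jsf=%', vtk.tkno, vtk.cg_zdj, vtk.cg_jsf;
  end loop;
end;
$$;
de>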

行列转换
https://www.postgresql.org/docs/9.5/static/functions-array.html

de  >    unnest(ARRAY[1,2])
1
2
de>

小结

  1. 使用composite type替代了PL/SQL的type定义。
  2. 使用array替代了PL/SQL的table定义。
  3. 复合类型的数组,不能直接修改复合类型的element,需要先用标量修改好后赋值。

RDS PG内核改进建议

  1. 新增 CREATE TYPE [ IF NOT EXISTS ] 语法。这样的话用户就不需要将这个语法写在函数外了,可以在函数内直接执行。
  2. PL/SQL的type是函数内的局部定义,而PostgreSQL的type是全局对象,这个也需要注意,如果多个PL/SQL函数用到了同样的type name但是结构不一样,port到plpgsql时,需要创建多个type,在plpgsql中分别使用对应的type name。
  3. plpgsql 暂时不支持composite数组直接设置element的值,需要加强plpgsql的语法功能。

PostgreSQL PL/Perl 钩子安全性分析


背景

plperl 是PostgreSQL支持的函数语言之一。

在使用plperl时,可以使用plperl提供的钩子功能,满足一些特殊场景的需求。

钩子分2种,一种是加载plperl.so库时的钩子,一种是加载perl语言解释器时的钩子。

钩子的使用有安全问题吗?

钩子用法介绍

加载plperl.so库时的钩子

相关参数
plperl.on_init (string)

de  >Specifies Perl code to be executed when a Perl interpreter is first initialized, before it is specialized for use by plperl or plperlu. 
The SPI functions are not available when this code is executed. 
If the code fails with an error it will abort the initialization of the interpreter and propagate out to the calling query, causing the current transaction or subtransaction to be aborted.

The Perl code is limited to a single string. 
Longer code can be placed into a module and loaded by the on_init string. Examples:

plperl.on_init = 'require "plperlinit.pl"'
plperl.on_init = 'use lib "/my/app"; use MyApp::PgInit;'

Any modules loaded by plperl.on_init, either directly or indirectly, will be available for use by plperl. 
This may create a security risk. 
To see what modules have been loaded you can use:

DO 'elog(WARNING, join ", ", sort keys %INC)' LANGUAGE plperl;

Initialization will happen in the postmaster if the plperl library is included in shared_preload_libraries, in which case extra consideration should be given to the risk of destabilizing the postmaster. 
The principal reason for making use of this feature is that Perl modules loaded by plperl.on_init need be loaded only at postmaster start, and will be instantly available without loading overhead in individual database sessions. 

However, keep in mind that the overhead is avoided only for the first Perl interpreter used by a database session — either PL/PerlU, or PL/Perl for the first SQL role that calls a PL/Perl function. 
Any additional Perl interpreters created in a database session will have to execute plperl.on_init afresh. Also, on Windows there will be no savings whatsoever from preloading, since the Perl interpreter created in the postmaster process does not propagate to child processes.

This parameter can only be set in the postgresql.conf file or on the server command line.
de>

当设置了 shared_preload_libraries = 'plperl' 预加载时,plperl.on_init 只会被调用一次。

当没有设置 shared_preload_libraries = 'plperl' 预加载时,plperl.on_init 会在每个会话第一次装载plperl.so时被调用。
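例如,在postgresql.conf中预加载plperl并设置on_init(示意,plperlinit.pl为文档示例中假设的文件名):

de  >shared_preload_libraries = 'plperl'
plperl.on_init = 'require "plperlinit.pl"'
de>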

代码
src/pl/plperl/plperl.c

de  >        /*
         * plperl.on_init is marked PGC_SIGHUP to support the idea that it might
         * be executed in the postmaster (if plperl is loaded into the postmaster
         * via shared_preload_libraries).  This isn't really right either way,
         * though.
         */
        DefineCustomStringVariable("plperl.on_init",
                                                           gettext_noop("Perl initialization code to execute when a Perl interpreter is initialized."),
                                                           NULL,
                                                           &plperl_on_init,
                                                           NULL,
                                                           PGC_SIGHUP, 0,
                                                           NULL, NULL, NULL);


/*
 * Create a new Perl interpreter.
 *
 * We initialize the interpreter as far as we can without knowing whether
 * it will become a trusted or untrusted interpreter; in particular, the
 * plperl.on_init code will get executed.  Later, either plperl_trusted_init
 * or plperl_untrusted_init must be called to complete the initialization.
 */
static PerlInterpreter *
plperl_init_interp(void)
{

...
        if (plperl_on_init && *plperl_on_init)
        {
                embedding[nargs++] = "-e";
                embedding[nargs++] = plperl_on_init;
        }
...

/*
 * _PG_init()                   - library load-time initialization
 *
 * DO NOT make this static nor change its name!
 */
void
_PG_init(void)
{
...
        /*
         * Create the first Perl interpreter, but only partially initialize it.
         */
        plperl_held_interp = plperl_init_interp();
...
de>

plperl.on_init (string) 只能设置在配置文件中,或者在启动postgres时命令行指定。
对只开放普通数据库用户的环境来说没有安全问题。

加载perl语言解释器时的钩子

plperl 函数语言钩子:当会话中第一次加载perl语言解释器时,将自动执行

de  >plperl.on_plperl_init (string)  
plperl.on_plperlu_init (string)  
de>

这两个参数所设置的代码串。
具体调用哪个,和函数语言有关,plperl则调用on_plperl_init, plperlu则调用on_plperlu_init。

需要注意的是,这两个参数可以在配置文件中设置,也能在会话中设置,但是在会话中设置的话,如果perl解释器已经加载了,就不会触发修改后的值。

另外需要注意on_plperl_init是在plperl安全化后执行的,所以即使在这里配置了不安全的属性,也不怕,因为会直接报错。(与调用plperl的用户权限无关,plperl是不允许执行不安全操作的,例如调用system接口)

这两个参数的解释 :
https://www.postgresql.org/docs/9.5/static/plperl-under-the-hood.html
plperl.on_plperl_init (string)
plperl.on_plperlu_init (string)

de  >These parameters specify Perl code to be executed when a Perl interpreter is specialized for plperl or plperlu respectively. 

This will happen when a PL/Perl or PL/PerlU function is first executed in a database session, or when an additional interpreter has to be created because the other language is called or a PL/Perl function is called by a new SQL role. 

This follows any initialization done by plperl.on_init. 
The SPI functions are not available when this code is executed. 

The Perl code in plperl.on_plperl_init is executed after "locking down" the interpreter, and thus it can only perform trusted operations.  

If the code fails with an error it will abort the initialization and propagate out to the calling query, causing the current transaction or subtransaction to be aborted.   

Any actions already done within Perl won't be undone; however, that interpreter won't be used again. If the language is used again the initialization will be attempted again within a fresh Perl interpreter.

Only superusers can change these settings. 
Although these settings can be changed within a session, such changes will not affect Perl interpreters that have already been used to execute functions.
de>

代码如下

de  >src/pl/plperl/plperl.c

        DefineCustomStringVariable("plperl.on_plperl_init",
                                                           gettext_noop("Perl initialization code to execute once when plperl is first used."),
                                                           NULL,
                                                           &plperl_on_plperl_init,
                                                           NULL,
                                                           PGC_SUSET, 0,
                                                           NULL, NULL, NULL);

        DefineCustomStringVariable("plperl.on_plperlu_init",
                                                           gettext_noop("Perl initialization code to execute once when plperlu is first used."),
                                                           NULL,
                                                           &plperl_on_plperlu_init,
                                                           NULL,
                                                           PGC_SUSET, 0,
                                                           NULL, NULL, NULL);


src/backend/utils/misc/guc.c
                case PGC_SUSET:
                        if (context == PGC_USERSET || context == PGC_BACKEND)
                        {
                                ereport(elevel,
                                                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                                                 errmsg("permission denied to set parameter \"%s\"",
                                                                name)));
                                return 0;
                        }
                        break;
de>

只有超级用户能设置这两个值,普通用户在会话中设置plperl.on_plperl_init时,触发设置则报错。

de  >postgres=> set session plperl.on_plperl_init=''; 
SET
postgres=> SELECT * FROM test_munge();
WARNING:  42501: permission denied to set parameter "plperl.on_plperl_init"
LOCATION:  set_config_option, guc.c:5794
de>

测试例子

postgresql.conf 参数

de  >#plperl.on_plperlu_init = ' system("touch /home/digoal/t123") '
plperl.on_plperl_init = ' system("touch /home/digoal/t123") '
#plperl.on_init=' system("touch /home/digoal/tttt") '
de>

测试

de  >CREATE TABLE test (
    i int,
    v varchar
);

INSERT INTO test (i, v) VALUES (1, 'first line');
INSERT INTO test (i, v) VALUES (2, 'second line');
INSERT INTO test (i, v) VALUES (3, 'third line');
INSERT INTO test (i, v) VALUES (4, 'immortal');

CREATE OR REPLACE FUNCTION test_munge() RETURNS SETOF test AS $$
    my $rv = spi_exec_query('select i, v from test;');
    my $status = $rv->{status};
    my $nrows = $rv->{processed};
    foreach my $rn (0 .. $nrows - 1) {
        my $row = $rv->{rows}[$rn];
        $row->{i} += 200 if defined($row->{i});
        $row->{v} =~ tr/A-Za-z/a-zA-Z/ if (defined($row->{v}));
        return_next($row);
    }
    return undef;
$$ LANGUAGE plperl;

SELECT * FROM test_munge();
de>

使用 stat touch /home/digoal/t123 查看时间戳的变化
判断是否触发。

plperl和plperlu语言的区别

plperl是trust语言,在创建和执行它的函数时,会检测安全性,例如禁止执行OS命令等操作。普通用户和超级用户都可以创建plperl语言的函数。

plperlu则是untrusted语言,允许任何操作,只有超级用户能创建plperlu的函数。
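作为对比,下面是一个只有超级用户才能创建的plperlu函数示意,它可以执行plperl中被禁止的OS操作(仅为演示,生产环境慎用):

de  >create or replace function ls_tmp() returns text as $$
  return `ls /tmp`;
$$ language plperlu;

select ls_tmp();
de>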

如果已经设置了plperl.on_plperl_init是一个不安全的值,则新建plperl函数会报错。

de  >postgres=# show plperl.on_plperl_init;
        plperl.on_plperl_init        
-------------------------------------
  system("touch /home/digoal/t123") 
(1 row)

postgres=# CREATE OR REPLACE FUNCTION test_munge() RETURNS SETOF test AS $$
    my $rv = spi_exec_query('select i, v from test;');
    my $status = $rv->{status};
    my $nrows = $rv->{processed};
    foreach my $rn (0 .. $nrows - 1) {
        my $row = $rv->{rows}[$rn];
        $row->{i} += 200 if defined($row->{i});
        $row->{v} =~ tr/A-Za-z/a-zA-Z/ if (defined($row->{v}));
        return_next($row);
    }
    return undef;
$$ LANGUAGE plperl;

ERROR:  38000: 'system' trapped by operation mask at line 2.
CONTEXT:  while executing plperl.on_plperl_init
compilation of PL/Perl function "test_munge"
LOCATION:  plperl_trusted_init, plperl.c:1016
de>

调用时,如果触发了不安全的plperl.on_plperl_init,也会报错。

de  >$ vi postgresql.conf
plperl.on_plperl_init = ' system("touch /home/digoal/t123") '
$ pg_ctl reload

postgres=# SELECT * FROM test_munge();
ERROR:  38000: 'system' trapped by operation mask at line 2.
CONTEXT:  while executing plperl.on_plperl_init
compilation of PL/Perl function "test_munge"
LOCATION:  plperl_trusted_init, plperl.c:1016
de>

小结

  1. PostgreSQL将函数语言分为两类,一类是trust的另一类是untrust的。
    trust的语言,不允许执行有破坏性的操作,例如系统命令,文件访问等。普通用户可以创建trust语言的函数。
    untrust的语言,允许执行任何操作,只有superuser能创建untrust语言的函数。
    如果只开放普通数据库用户出去,是没有安全风险的。

  2. PostgreSQL 为 plperl或plperlu语言设置了两种钩子,分别允许在加载libperl.so时被触发(在_PG_init(void)里面实现);
    或者在加载perl解释器时被触发,其中加载解释器时又分为两种,plperl和perlu的设置。
    用户利用钩子,可以实现一些特殊场景的应用。

  3. 数据库普通用户无法修改钩子参数

    de  >#plperl.on_plperlu_init = ' system("touch /home/digoal/t123") '
    plperl.on_plperl_init = ' system("touch /home/digoal/t123") '
    #plperl.on_init=' system("touch /home/digoal/tttt") '
    de>
  4. 即使设置了危险的plperl.on_plperl_init参数,因为这个参数的内容是在plperl函数风险评估后执行的,所以如果有风险也不允许执行,不存在安全风险。

    de  >plperl.on_plperl_init = ' system("touch /home/digoal/t123") '
    postgres=# SELECT * FROM test_munge();
    ERROR:  'system' trapped by operation mask at line 2.
    CONTEXT:  while executing plperl.on_plperl_init
    compilation of PL/Perl function "test_munge"
    de>

综上,PostgreSQL对语言的管理是非常安全的,只要不随意把超级用户放出去,不随意使用untrust语言创建不安全的函数。

如何防止远程程序与RDS PG连接中断


背景

偶尔有用户会遇到远程程序连接RDS PG,在不做任何操作一段时间后可能中断。

其实可能是用户和RDS PG之间,某些网络设备设置了会话空闲超时,会主动中断会话。

那么有什么方法能解决这个问题呢?

运维的同学可能有这方面的经验,例如使用securecrt或者其他终端连接服务器时,可以设置这些管理工具的no-op,周期性的发一些空字符过去,保证会话上有流量。

但是数据库连接怎么搞呢?
PostgreSQL提供了tcp keep alive的参数可供用户设置。

例子

为了避免会话中断的问题, 可以通过tcp层的keepalive机制来达到传输心跳数据的目的.

方法一,设置数据库参数

PostgreSQL支持会话级别的设置, 数据库级别的设置在$PGDATA/postgresql.conf,
建议设置如下三个参数的值

de  ># - TCP Keepalives -  
# see "man 7 tcp" for details  
tcp_keepalives_idle = 60                # TCP_KEEPIDLE, in seconds;  
                                        # 0 selects the system default  
tcp_keepalives_interval = 10            # TCP_KEEPINTVL, in seconds;  
                                        # 0 selects the system default  
tcp_keepalives_count = 10                # TCP_KEEPCNT;  
                                        # 0 selects the system default  
de>

解释详见本文末尾[参考1].

代码详见本文末尾[参考2].

参数解释
tcp_keepalives_idle : 定义这个tcp连接间隔多长后开始发送 第一个 tcp keepalive 包.
tcp_keepalives_interval : 定义在以上发送第一个tcp keepalive包后如果在这个时间间隔内没有收到对端的回包, 则开始发送第二个tcp keepalive包. 在这个时间内再没有回包的话则发送第三个keepalive包....直到达到tcp_keepalives_count次则broken 连接.
tcp_keepalives_count : 定义一共发送多少个tcp keepalive包, 达到这个数字后如果对端都没有回响应包, 则关闭这个连接.
另外需要注意的是, 这几个PostgreSQL参数对PostgreSQL数据库服务端的backend process生效.
所以如果发出第一个keepalive包后, 在tcp_keepalives_interval秒内有客户端回包, 则又回到tcp_keepalives_idle计数(注意此时计数是tcp_keepalives_idle 减去 tcp_keepalives_interval 秒).

例如 :
CLIENT (172.16.3.33) :

de  >psql -h 172.16.3.150 -p 1919 -U postgres postgres  
postgres=# show tcp_keepalives_idle;  
 tcp_keepalives_idle   
---------------------  
 60  
(1 row)  
postgres=# show tcp_keepalives_interval;  
 tcp_keepalives_interval   
-------------------------  
 10  
(1 row)  
postgres=# show tcp_keepalives_count;  
 tcp_keepalives_count   
----------------------  
 10  
(1 row)  
de>

查找数据库端对应的process id.

de  >postgres=# select pg_backend_pid();  
 pg_backend_pid   
----------------  
           11016  
(1 row)  
de>

SERVER (172.16.3.150) :
在数据库端查看keepalive timer

de  >root@digoal-PowerEdge-R610:~# netstat -anpo|grep 11016  
tcp        0      0 172.16.3.150:1919       172.16.3.33:50326       ESTABLISHED 11016/postgres: pos keepalive (39.73/0/0)  
de>

CLIENT (172.16.3.33) :
在客户端查看keepalive timer

de  >postgres=# \!  
[pg92@db-172-16-3-33 ~]$ netstat -anpo|grep 1919  
(Not all processes could be identified, non-owned process info  
 will not be shown, you would have to be root to see it all.)  
tcp        0      0 172.16.3.33:50326           172.16.3.150:1919           ESTABLISHED 20408/psql          keepalive (7143.19/0/0)  
de>

继承了操作系统的keepalive设置

通过tcpdump可以观察间隔一定的时间, 会发出keepalive包.

方法二、设置操作系统级的参数:

de  >/etc/sysctl.conf  
net.ipv4.tcp_keepalive_intvl = 75  
net.ipv4.tcp_keepalive_probes = 9  
net.ipv4.tcp_keepalive_time = 7200  
de>

设置CLIENT服务器系统级的keepalive, 然后重新连接到数据库, 看看客户端的keepalive timer会不会发生变化

de  >[root@db-172-16-3-33 ~]# sysctl -w net.ipv4.tcp_keepalive_time=70  
net.ipv4.tcp_keepalive_time = 70  
[root@db-172-16-3-33 ~]# su - pg92  
pg92@db-172-16-3-33-> psql -h 172.16.3.150 -p 1919 -U postgres postgres  
psql (9.2.4)  
Type "help" for help.  
postgres=# \!  
[pg92@db-172-16-3-33 ~]$ netstat -anpo|grep 1919  
(Not all processes could be identified, non-owned process info  
 will not be shown, you would have to be root to see it all.)  
tcp        0      0 172.16.3.33:50327           172.16.3.150:1919           ESTABLISHED 20547/psql          keepalive (55.44/0/0)  
de>

系统层设置的keepalive已经生效了.

其他

.1.
通过tcpdump观察keepalive包, 也可以将这些包抓下来通过wireshark查看.

de  >pg92@db-172-16-3-33-> psql -h 172.16.3.150 -p 1919 -U postgres postgres  
postgres=# set tcp_keepalives_idle=13;  
SET  
root@digoal-PowerEdge-R610:~# tcpdump -i eth0 -n 'tcp port 1919'  
08:43:27.647408 IP 172.16.3.150.1919 > 172.16.3.33.15268: Flags [P.], seq 4937:4952, ack 58, win 115, length 15  
08:43:27.647487 IP 172.16.3.33.15268 > 172.16.3.150.1919: Flags [.], ack 4952, win 488, length 0  
08:43:40.667410 IP 172.16.3.150.1919 > 172.16.3.33.15268: Flags [.], ack 58, win 115, length 0  
08:43:40.667536 IP 172.16.3.33.15268 > 172.16.3.150.1919: Flags [.], ack 4952, win 488, length 0  
08:43:53.691417 IP 172.16.3.150.1919 > 172.16.3.33.15268: Flags [.], ack 58, win 115, length 0  
08:43:53.691544 IP 172.16.3.33.15268 > 172.16.3.150.1919: Flags [.], ack 4952, win 488, length 0  
08:44:06.715416 IP 172.16.3.150.1919 > 172.16.3.33.15268: Flags [.], ack 58, win 115, length 0  
08:44:06.715544 IP 172.16.3.33.15268 > 172.16.3.150.1919: Flags [.], ack 4952, win 488, length 0  
08:44:19.739422 IP 172.16.3.150.1919 > 172.16.3.33.15268: Flags [.], ack 58, win 115, length 0  
08:44:19.739544 IP 172.16.3.33.15268 > 172.16.3.150.1919: Flags [.], ack 4952, win 488, length 0  
08:44:32.763416 IP 172.16.3.150.1919 > 172.16.3.33.15268: Flags [.], ack 58, win 115, length 0  
08:44:32.763546 IP 172.16.3.33.15268 > 172.16.3.150.1919: Flags [.], ack 4952, win 488, length 0  
de>

每个sock会话, 每隔13秒, 数据库服务端会发送心跳包.

.2.
由于每个tcp会话都需要1个计时器, 所以如果连接数很多, 开启keepalive也是比较耗费资源的.
可以使用setsockopt关闭该会话keepalive的功能. 下一篇BLOG介绍如何禁用keepalive.

.3.
如果tcp_keepalives_idle小于tcp_keepalives_interval, 那么间隔多长时间发1个心跳包呢?
例如tcp_keepalives_idle=2, tcp_keepalives_interval=10.
答案是10, 因为检查计时需要10秒.

de  >postgres=# set tcp_keepalives_idle=2;  
SET  
postgres=# set tcp_keepalives_interval=10;  
SET  
root@digoal-PowerEdge-R610-> tcpdump -i eth0 -n 'tcp port 1919'  
09:32:27.035424 IP 172.16.3.150.1919 > 172.16.3.33.47277: Flags [.], ack 195, win 115, length 0  
09:32:27.035608 IP 172.16.3.33.47277 > 172.16.3.150.1919: Flags [.], ack 366, win 54, length 0  
09:32:37.051426 IP 172.16.3.150.1919 > 172.16.3.33.47277: Flags [.], ack 195, win 115, length 0  
09:32:37.051569 IP 172.16.3.33.47277 > 172.16.3.150.1919: Flags [.], ack 366, win 54, length 0  
09:32:47.067423 IP 172.16.3.150.1919 > 172.16.3.33.47277: Flags [.], ack 195, win 115, length 0  
09:32:47.067552 IP 172.16.3.33.47277 > 172.16.3.150.1919: Flags [.], ack 366, win 54, length 0  
09:32:57.083428 IP 172.16.3.150.1919 > 172.16.3.33.47277: Flags [.], ack 195, win 115, length 0  
09:32:57.083574 IP 172.16.3.33.47277 > 172.16.3.150.1919: Flags [.], ack 366, win 54, length 0  
de>

参考

.1. http://www.postgresql.org/docs/9.2/static/runtime-config-connection.html

de  >tcp_keepalives_idle (integer)  
Specifies the number of seconds before sending a keepalive packet on an otherwise idle connection. A value of 0 uses the system default. This parameter is supported only on systems that support the TCP_KEEPIDLE or TCP_KEEPALIVE symbols, and on Windows; on other systems, it must be zero. In sessions connected via a Unix-domain socket, this parameter is ignored and always reads as zero.  
Note: On Windows, a value of 0 will set this parameter to 2 hours, since Windows does not provide a way to read the system default value.  
tcp_keepalives_interval (integer)  
Specifies the number of seconds between sending keepalives on an otherwise idle connection. A value of 0 uses the system default. This parameter is supported only on systems that support the TCP_KEEPINTVL symbol, and on Windows; on other systems, it must be zero. In sessions connected via a Unix-domain socket, this parameter is ignored and always reads as zero.  
Note: On Windows, a value of 0 will set this parameter to 1 second, since Windows does not provide a way to read the system default value.  
tcp_keepalives_count (integer)  
Specifies the number of keepalive packets to send on an otherwise idle connection. A value of 0 uses the system default. This parameter is supported only on systems that support the TCP_KEEPCNT symbol; on other systems, it must be zero. In sessions connected via a Unix-domain socket, this parameter is ignored and always reads as zero.  
Note: This parameter is not supported on Windows, and must be zero. 
de>

.2. /usr/share/doc/kernel/Documentation/networking/ip-sysctl.txt

de  >tcp_keepalive_time - INTEGER  
        How often TCP sends out keepalive messages when keepalive is enabled.  
        Default: 2hours.  

tcp_keepalive_probes - INTEGER  
        How many keepalive probes TCP sends out, until it decides that the  
        connection is broken. Default value: 9.  

tcp_keepalive_intvl - INTEGER  
        How frequently the probes are send out. Multiplied by  
        tcp_keepalive_probes it is time to kill not responding connection,  
        after probes started. Default value: 75sec i.e. connection  
        will be aborted after ~11 minutes of retries.  
de>

.3. src/backend/libpq/pqcomm.c

de  >截取一个设置interval的函数.  
int  
pq_setkeepalivesinterval(int interval, Port *port)  
{  
        if (port == NULL || IS_AF_UNIX(port->laddr.addr.ss_family))  
                return STATUS_OK;  

#if defined(TCP_KEEPINTVL) || defined (SIO_KEEPALIVE_VALS)  
        if (interval == port->keepalives_interval)  
                return STATUS_OK;  

#ifndef WIN32  
        if (port->default_keepalives_interval <= 0)  
        {  
                if (pq_getkeepalivesinterval(port) < 0)  
                {  
                        if (interval == 0)  
                                return STATUS_OK;               /* default is set but unknown */  
                        else  
                                return STATUS_ERROR;  
                }  
        }  

        if (interval == 0)  
                interval = port->default_keepalives_interval;  

        if (setsockopt(port->sock, IPPROTO_TCP, TCP_KEEPINTVL,  
                                   (char *) &interval, sizeof(interval)) < 0)  
        {  
                elog(LOG, "setsockopt(TCP_KEEPINTVL) failed: %m");  
                return STATUS_ERROR;  
        }  

        port->keepalives_interval = interval;  
#else                                                   /* WIN32 */  
        return pq_setkeepaliveswin32(port, port->keepalives_idle, interval);  
#endif  
#else  
        if (interval != 0)  
        {  
                elog(LOG, "setsockopt(TCP_KEEPINTVL) not supported");  
                return STATUS_ERROR;  
        }  
#endif  

        return STATUS_OK;  
}  
de>

.4. man netstat

de  >   -o, --timers  
       Include information related to networking timers.  
de>

.5. man 7 tcp

de  >   /proc interfaces  
       System-wide TCP parameter settings can be accessed by files in the directory /proc/sys/net/ipv4/.  In addition,  most  IP  
       /proc  interfaces  also  apply  to  TCP; see ip(7).  Variables described as Boolean take an integer value, with a nonzero  
       value ("true") meaning that the corresponding option is enabled, and a zero value ("false") meaning that  the  option  is  
       disabled.  
       tcp_keepalive_intvl (integer; default: 75; since Linux 2.4)  
              The number of seconds between TCP keep-alive probes.  

       tcp_keepalive_probes (integer; default: 9; since Linux 2.2)  
              The maximum number of TCP keep-alive probes to send before giving up and killing the connection if no response  is  
              obtained from the other end.  

       tcp_keepalive_time (integer; default: 7200; since Linux 2.2)  
              The  number of seconds a connection needs to be idle before TCP begins sending out keep-alive probes.  Keep-alives  
              are only sent when the SO_KEEPALIVE socket option is enabled.  The default value is 7200 seconds  (2  hours).   An  
              idle  connection  is  terminated  after approximately an additional 11 minutes (9 probes an interval of 75 seconds  
              apart) when keep-alive is enabled.  

   Socket Options  
       To set or get a TCP socket option, call getsockopt(2) to read or setsockopt(2) to write the option with the option  level  
       argument set to IPPROTO_TCP.  In addition, most IPPROTO_IP socket options are valid on TCP sockets.  For more information  
       see ip(7).  
       TCP_KEEPCNT (since Linux 2.4)  
              The  maximum number of keepalive probes TCP should send before dropping the connection.  This option should not be  
              used in code intended to be portable.  

       TCP_KEEPIDLE (since Linux 2.4)  
              The time (in seconds) the connection needs to remain idle before TCP  starts  sending  keepalive  probes,  if  the  
              socket  option  SO_KEEPALIVE  has  been set on this socket.  This option should not be used in code intended to be  
              portable.  

       TCP_KEEPINTVL (since Linux 2.4)  
              The time (in seconds) between individual keepalive probes.  This option should not be used in code intended to  be  
              portable.  
de>

.6. netstat core :
The following is reproduced from:
http://vzkernel.blogspot.tw/2012/09/description-of-netstat-timers.html

de  >It's not easy to find a detailed description of a network socket timer on the internet, so I did some digging today.  

The manual page from netstat:  

   -o, --timers  
       Include information related to networking timers.  

Then we check some command output:  

netstat -nto | head  
Active Internet connections (w/o servers)  
Proto Recv-Q Send-Q Local Address               Foreign Address             State       Timer  
tcp        0      0 127.0.0.1:5005              127.0.0.1:55309             SYN_RECV    on (5.14/1/0)  
tcp        0      0 127.0.0.1:5005              127.0.0.1:55312             SYN_RECV    on (1.34/0/0)  
tcp        0      0 127.0.0.1:5005              127.0.0.1:55310             SYN_RECV    on (2.34/0/0)  
tcp        0      0 127.0.0.1:5005              127.0.0.1:55303             SYN_RECV    on (4.14/1/0)  
tcp        0      0 192.168.1.16:57018          74.125.128.132:443          ESTABLISHED off (0.00/0/0)  
tcp        0      0 192.168.1.16:41245          203.208.46.2:443            ESTABLISHED off (0.00/0/0)  
tcp        0      0 192.168.1.16:42636          203.208.46.7:443            TIME_WAIT   timewait (44.66/0/0)  
tcp        0      0 127.0.0.1:55302  

The Timer field with the format (5.14/1/0), what does it mean?  

Let's figure it out.  


The second step, check the source code from net-tools, grab the source code from source forge:  

git clone git://net-tools.git.sourceforge.net/gitroot/net-tools/net-tools  

from userspace netstat.c:  
tcp_do_one():  
{  
....  
        if (flag_opt)  
            switch (timer_run) {  
            case 0:  
                snprintf(timers, sizeof(timers), _("off (0.00/%ld/%d)"), retr, timeout);  
                break;  

            case 1:  
                snprintf(timers, sizeof(timers), _("on (%2.2f/%ld/%d)"),  
                         (double) time_len / HZ, retr, timeout);  
                break;  

            case 2:  
                snprintf(timers, sizeof(timers), _("keepalive (%2.2f/%ld/%d)"),  
                         (double) time_len / HZ, retr, timeout);  
                break;  

            case 3:  
                snprintf(timers, sizeof(timers), _("timewait (%2.2f/%ld/%d)"),  
                         (double) time_len / HZ, retr, timeout);  
                break;  

            default:  
                snprintf(timers, sizeof(timers), _("unkn-%d (%2.2f/%ld/%d)"),  
                         timer_run, (double) time_len / HZ, retr, timeout);  
                break;  
            }  

Both the fields are grabbed from proc/net/tcp, let's check the content of it:  

$ head /proc/net/tcp  
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode                                                       
   0: 00000000:036B 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 13012 1 ffff88007baf5400 299 0 0 2 -1                       
   1: 0100007F:138D 00000000:0000 0A 00000000:00000006 00:00000000 00000000   500        0 472674 1 ffff880021ab0380 299 0 0 2 -1                      
   2: 00000000:006F 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 11242 1 ffff8800796006c0 299 0 0 2 -1                       
   3: 00000000:BD50 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 13056 1 ffff880078da7440 299 0 0 2 -1                       
   4: 017AA8C0:0035 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 14066 1 ffff880078dac100 299 0 0 2 -1     

description from proc_net_tcp.txt  

timer_active:  
  0  no timer is pending  
  1  retransmit-timer is pending  
  2  another timer (e.g. delayed ack or keepalive) is pending  
  3  this is a socket in TIME_WAIT state. Not all fields will contain  
     data (or even exist)  
  4  zero window probe timer is pending  

Not much of a clue yet? Let's dive into the kernel code to have a look at how proc_net_tcp is defined:  


net/ipv4/tcp_ipv4.c:  

static int tcp4_seq_show(struct seq_file *seq, void *v)  
{  
        struct tcp_iter_state *st;  
        int len;  

        if (v == SEQ_START_TOKEN) {  
                seq_printf(seq, "%-*s\n", TMPSZ - 1,  
                           "  sl  local_address rem_address   st tx_queue "  
                           "rx_queue tr tm->when retrnsmt   uid  timeout "  
                           "inode");  
                goto out;  
        }  
        st = seq->private;  

        switch (st->state) {  
        case TCP_SEQ_STATE_LISTENING:  
        case TCP_SEQ_STATE_ESTABLISHED:  
                get_tcp4_sock(v, seq, st->num, &len);  
                break;  
        case TCP_SEQ_STATE_OPENREQ:  
                get_openreq4(st->syn_wait_sk, v, seq, st->num, st->uid, &len);  
                break;  
        case TCP_SEQ_STATE_TIME_WAIT:  
                get_timewait4_sock(v, seq, st->num, &len);  
                break;  
        }  
        seq_printf(seq, "%*s\n", TMPSZ - 1 - len, "");  
out:  
        return 0;  
}  


static void get_tcp4_sock(struct sock *sk, struct seq_file *f, int i, int *len)  
{  
....  
        seq_printf(f, "%4d: %08X:%04X %08X:%04X %02X %08X:%08X %02X:%08lX "  
                        "%08X %5d %8d %lu %d %pK %lu %lu %u %u %d%n",  
                i, src, srcp, dest, destp, sk->sk_state,  
                tp->write_seq - tp->snd_una,  
                rx_queue,  
                timer_active,  
                jiffies_to_clock_t(timer_expires - jiffies),  
                icsk->icsk_retransmits,  
                sock_i_uid(sk),  
                icsk->icsk_probes_out,  
                sock_i_ino(sk),  
                atomic_read(&sk->sk_refcnt), sk,  
                jiffies_to_clock_t(icsk->icsk_rto),  
                jiffies_to_clock_t(icsk->icsk_ack.ato),  
                (icsk->icsk_ack.quick << 1) | icsk->icsk_ack.pingpong,  
                tp->snd_cwnd,  
                tcp_in_initial_slowstart(tp) ? -1 : tp->snd_ssthresh,  
                len);  
}  

which defined in include/net/inet_connection_sock.h:  

/** inet_connection_sock - INET connection oriented sock  
 *                
 * @icsk_accept_queue:     FIFO of established children   
 * @icsk_bind_hash:        Bind node  
 * @icsk_timeout:          Timeout  
 * @icsk_retransmit_timer: Resend (no ack)  
 * @icsk_rto:              Retransmit timeout  
 * @icsk_pmtu_cookie       Last pmtu seen by socket  
 * @icsk_ca_ops            Pluggable congestion control hook  
 * @icsk_af_ops            Operations which are AF_INET{4,6} specific  
 * @icsk_ca_state:         Congestion control state  
 * @icsk_retransmits:      Number of unrecovered [RTO] timeouts  
 * @icsk_pending:          Scheduled timer event  
 * @icsk_backoff:          Backoff  
 * @icsk_syn_retries:      Number of allowed SYN (or equivalent) retries  
 * @icsk_probes_out:       unanswered 0 window probes  
 * @icsk_ext_hdr_len:      Network protocol overhead (IP/IPv6 options)  
 * @icsk_ack:              Delayed ACK control data  
 * @icsk_mtup;             MTU probing control data  
 */  

For a not established socket   
static void get_openreq4(const struct sock *sk, const struct request_sock *req,  
                         struct seq_file *f, int i, int uid, int *len)  
{  
        const struct inet_request_sock *ireq = inet_rsk(req);  
        int ttd = req->expires - jiffies;  

        seq_printf(f, "%4d: %08X:%04X %08X:%04X"  
                " %02X %08X:%08X %02X:%08lX %08X %5d %8d %u %d %pK%n",  
                i,  
                ireq->loc_addr,  
                ntohs(inet_sk(sk)->inet_sport),  
                ireq->rmt_addr,  
                ntohs(ireq->rmt_port),  
                TCP_SYN_RECV,  
                0, 0, /* could print option size, but that is af dependent. */  
                1,    /* timers active (only the expire timer) */  
                jiffies_to_clock_t(ttd),  
                req->retrans,  
                uid,  
                0,  /* non standard timer */  
                0, /* open_requests have no inode */  
                atomic_read(&sk->sk_refcnt),  
                req,  
                len);  
}  
static void get_timewait4_sock(const struct inet_timewait_sock *tw,  
                               struct seq_file *f, int i, int *len)  
{  
        __be32 dest, src;  
        __u16 destp, srcp;  
        int ttd = tw->tw_ttd - jiffies;  

        if (ttd < 0)  
                ttd = 0;  

        dest  = tw->tw_daddr;  
        src   = tw->tw_rcv_saddr;  
        destp = ntohs(tw->tw_dport);  
        srcp  = ntohs(tw->tw_sport);  

        seq_printf(f, "%4d: %08X:%04X %08X:%04X"  
                " %02X %08X:%08X %02X:%08lX %08X %5d %8d %d %d %pK%n",  
                i, src, srcp, dest, destp, tw->tw_substate, 0, 0,  
                3, jiffies_to_clock_t(ttd), 0, 0, 0, 0,  
                atomic_read(&tw->tw_refcnt), tw, len);  
}  

Let's back to our questions, The description of the 'Timer' field from 'netstat -o'  
which with the format (1st/2nd/3rd)  

1. The 1st field indicates when the timer will expire  
2. The 2nd field is the retransmits which already have done  
3. The 3rd field - for a synreq socket (not yet established) and a timewait socket it is always 0; for an established socket it is 'unanswered 0 window probes'   

TCP zero window probe means that the receiver has reduced his receive buffer (a.k.a. window) to zero, basically telling the sender to stop sending - usually for performance reasons.  If the receiver does not recover and send a so-called "Window Update" with a buffer size greater than zero (meaning the sender is allowed to continue), the sender will become "impatient" at some point and "check" whether the receiver is able to receive more data. That "check" is the Zero Window Probe.  

TCP Keep-Alive - Occurs when the sequence number is equal to the last byte of data in the previous packet. Used to elicit an ACK from the receiver.  
TCP Keep-Alive ACK - Self-explanatory. ACK packet sent in response to a "keep-alive" packet.  
TCP DupACK - Occurs when the same ACK number is seen AND it is lower than the last byte of data sent by the sender. If the receiver detects a gap in the sequence numbers, it will generate a duplicate ACK for each subsequent packet it receives on that connection, until the missing packet is successfully received (retransmitted). A clear indication of dropped/missing packets.  
TCP ZeroWindow - Occurs when a receiver advertises a receive window size of zero. This effectively tells the sender to stop sending because the receiver's buffer is full. Indicates a resource issue on the receiver, as the application is not retrieving data from the TCP buffer in a timely manner.  
TCP ZerowindowProbe - The sender is testing to see if the receiver's zero window condition still exists by sending the next byte of data to elicit an ACK from the receiver. If the window is still zero, the sender will double his persist timer before probing again.  de>
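
Back in PostgreSQL, a quick way to map one line of netstat output to the backend that owns the connection is to match the client address and port against pg_stat_activity. A minimal sketch (column names as in PostgreSQL 9.2 and later):

-- find the backend behind a TCP connection seen in netstat
SELECT pid, usename, client_addr, client_port, state, xact_start
FROM pg_stat_activity
WHERE client_port IS NOT NULL
ORDER BY xact_start;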

PostgreSQL: An Illustrated Look at the Garbage-Collection "Pitfall" Ping An Technology Ran Into


Background

Recently, Hai'an from Ping An Technology reported a somewhat "unbelievable" problem encountered while running PostgreSQL in production.

On a frequently updated table, looking up records by primary key was found to scan an abnormally large number of data blocks.

This article dissects the problem in detail and gives both ways to avoid it and ideas for changing the kernel.

It also explains the structure of the index along the way; read it carefully and you will certainly gain something.

Root-cause analysis

.1. It is related to long transactions. As I have mentioned in many articles, during garbage collection PostgreSQL only reclaims dead versions that predate the oldest transaction in the database; versions produced after that point are not reclaimed.

So when a long transaction is open while the same records are modified over and over, some garbage versions cannot be reclaimed.
screenshot

.2. PostgreSQL indexes carry no version information, so the heap tuple must be visited to determine visibility.
screenshot
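
The "oldest transaction" that blocks garbage collection can be watched directly. A minimal monitoring sketch (assuming the 9.2+ pg_stat_activity layout):

-- list the longest-running open transactions
SELECT pid, usename, state, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 5;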

How to reproduce

Test table

de  >postgres=# create unlogged table test03 (id int primary key, info text);
de>

Frequently update 100 records

de  >$ vi test.sql
\setrandom id 1 100
insert into test03 values(:id, repeat(md5(random()::text), 1000)) on conflict on constraint test03_pkey do update set info=excluded.info;

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 48 -j 48 -T 10000000
de>

Open a long transaction and do nothing

de  >postgres=# begin;
BEGIN
postgres=# select txid_current();
 txid_current 
--------------
   3474642778
(1 row)
de>

After the updates have been running for a while, the lookup now needs to visit a large number of data blocks.

de  >postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test03 where id=2;
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Index Scan using test03_pkey on public.test03  (cost=0.42..8.44 rows=1 width=417) (actual time=0.661..4.440 rows=1 loops=1)
   Output: id, info
   Index Cond: (test03.id = 2)
   Buffers: shared hit=1753
 Planning time: 0.104 ms
 Execution time: 4.468 ms
(6 rows)
de>

Verify that most of the blocks being visited are heap blocks

de  >postgres=# set enable_indexscan=off;
SET

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test03 where id=2;
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.test03  (cost=4.43..8.44 rows=1 width=416) (actual time=5.818..5.819 rows=1 loops=1)
   Output: id, info
   Recheck Cond: (test03.id = 2)
   Heap Blocks: exact=1986
   Buffers: shared hit=1996
   ->  Bitmap Index Scan on test03_pkey  (cost=0.00..4.43 rows=1 width=0) (actual time=0.418..0.418 rows=1986 loops=1)
         Index Cond: (test03.id = 2)
         Buffers: shared hit=10
 Planning time: 0.200 ms
 Execution time: 5.851 ms
(10 rows)
de>

Before the long transaction commits, vacuum verbose shows that these continuously produced garbage pages (both index and heap pages) cannot be reclaimed.
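
A minimal sketch of what to check while the long transaction is still open (the exact numbers differ per run):

-- VACUUM reports dead row versions that are "not yet removable"
vacuum verbose test03;

-- the dead-tuple counter keeps growing instead of dropping back to 0
select n_live_tup, n_dead_tup from pg_stat_user_tables where relname = 'test03';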

Commit the long transaction

de  >postgres=# end;
COMMIT
de>

Wait for the autovacuum worker to reclaim the garbage and delete the half-dead index pages.
The number of data blocks that need to be visited drops.

de  >postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test03 where id=2;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.test03  (cost=4.43..8.45 rows=1 width=417) (actual time=0.113..0.118 rows=1 loops=1)
   Output: id, info
   Recheck Cond: (test03.id = 2)
   Heap Blocks: exact=3
   Buffers: shared hit=14
   ->  Bitmap Index Scan on test03_pkey  (cost=0.00..4.43 rows=1 width=0) (actual time=0.067..0.067 rows=3 loops=1)
         Index Cond: (test03.id = 2)
         Buffers: shared hit=11
 Planning time: 0.101 ms
 Execution time: 0.148 ms
(10 rows)
de>

Deeper analysis

Use pageinspect to observe how the content of the index pages changes during the test

Create the extension

de  >postgres=# create extension pageinspect;
de>

Open a long transaction

de  >postgres=# begin;
BEGIN
postgres=# select txid_current();
de>

Run the update workload for 60 seconds

de  >pgbench -M prepared -n -r -P 1 -f ./test.sql -c 48 -j 48 -T 60
de>

Check how many data blocks need to be scanned

de  >postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test03 where id=1;
                                                          QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
 Index Scan using test03_pkey on public.test03  (cost=0.43..8.45 rows=1 width=417) (actual time=0.052..15.738 rows=1 loops=1)
   Output: id, info
   Index Cond: (test03.id = 1)
   Buffers: shared hit=2663
 Planning time: 0.572 ms
 Execution time: 15.790 ms
(6 rows)

postgres=# set enable_indexscan=off;
SET

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test03 where id=1;
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.test03  (cost=4.44..8.45 rows=1 width=417) (actual time=6.138..6.139 rows=1 loops=1)
   Output: id, info
   Recheck Cond: (test03.id = 1)
   Heap Blocks: exact=2651
   Buffers: shared hit=2663
   ->  Bitmap Index Scan on test03_pkey  (cost=0.00..4.44 rows=1 width=0) (actual time=0.585..0.585 rows=2651 loops=1)
         Index Cond: (test03.id = 1)
         Buffers: shared hit=12
 Planning time: 0.093 ms
 Execution time: 6.218 ms
(10 rows)
de>

Inspect the index pages: root=412, level=2

de  >postgres=# select * from bt_metap('test03_pkey');
 magic  | version | root | level | fastroot | fastlevel 
--------+---------+------+-------+----------+-----------
 340322 |       2 |  412 |     2 |      412 |         2
(1 row)
de>

Look at the content of the root page

de  >postgres=# select * from bt_page_stats('test03_pkey',412);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags 
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
   412 | r    |          3 |          0 |            13 |      8192 |      8096 |         0 |         0 |    2 |          2
(1 row)

postgres=# select * from bt_page_items('test03_pkey',412);
 itemoffset |  ctid   | itemlen | nulls | vars |          data           
------------+---------+---------+-------+------+-------------------------
          1 | (3,1)   |       8 | f     | f    | 
          2 | (584,1) |      16 | f     | f    | 21 00 00 00 00 00 00 00
          3 | (411,1) |      16 | f     | f    | 46 00 00 00 00 00 00 00
(3 rows)
de>

Look at the content of the leftmost branch page

de  >postgres=# select * from bt_page_items('test03_pkey',3);
 itemoffset |  ctid   | itemlen | nulls | vars |          data           
------------+---------+---------+-------+------+-------------------------
          1 | (58,1)  |      16 | f     | f    | 21 00 00 00 00 00 00 00
          2 | (1,1)   |       8 | f     | f    | 
          3 | (937,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          4 | (767,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          5 | (666,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          6 | (572,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          7 | (478,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          8 | (395,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          9 | (307,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
         10 | (173,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
         11 | (99,1)  |      16 | f     | f    | 01 00 00 00 00 00 00 00
         12 | (951,1) |      16 | f     | f    | 02 00 00 00 00 00 00 00
         13 | (867,1) |      16 | f     | f    | 02 00 00 00 00 00 00 00
         14 | (773,1) |      16 | f     | f    | 02 00 00 00 00 00 00 00
         15 | (660,1) |      16 | f     | f    | 02 00 00 00 00 00 00 00
         16 | (564,1) |      16 | f     | f    | 02 00 00 00 00 00 00 00
         17 | (496,1) |      16 | f     | f    | 02 00 00 00 00 00 00 00
         18 | (413,1) |      16 | f     | f    | 02 00 00 00 00 00 00 00
         19 | (319,1) |      16 | f     | f    | 02 00 00 00 00 00 00 00
         20 | (204,1) |      16 | f     | f    | 02 00 00 00 00 00 00 00
         21 | (151,1) |      16 | f     | f    | 02 00 00 00 00 00 00 00
         22 | (64,1)  |      16 | f     | f    | 02 00 00 00 00 00 00 00
         23 | (865,1) |      16 | f     | f    | 03 00 00 00 00 00 00 00
         24 | (777,1) |      16 | f     | f    | 03 00 00 00 00 00 00 00
de>

Look at the content of the leftmost leaf page containing the minimum value

de  >postgres=# select * from bt_page_items('test03_pkey',1);
 itemoffset |    ctid    | itemlen | nulls | vars |          data           
------------+------------+---------+-------+------+-------------------------
          1 | (57342,14) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          2 | (71195,14) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          3 | (71171,12) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          4 | (71185,1)  |      16 | f     | f    | 01 00 00 00 00 00 00 00
          5 | (71150,17) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          6 | (71143,1)  |      16 | f     | f    | 01 00 00 00 00 00 00 00
......
de>

Look at the content of the rightmost leaf page that still contains the minimum value

de  >postgres=# select * from bt_page_items('test03_pkey',99);
 itemoffset |    ctid    | itemlen | nulls | vars |          data           
------------+------------+---------+-------+------+-------------------------
          1 | (66214,10) |      16 | f     | f    | 02 00 00 00 00 00 00 00
          2 | (12047,15) |      16 | f     | f    | 01 00 00 00 00 00 00 00
......
         40 | (11052,15) |      16 | f     | f    | 01 00 00 00 00 00 00 00
         41 | (11009,6)  |      16 | f     | f    | 01 00 00 00 00 00 00 00
         42 | (11021,6)  |      16 | f     | f    | 01 00 00 00 00 00 00 00
         43 | (71209,3)  |      16 | f     | f    | 02 00 00 00 00 00 00 00
         44 | (69951,1)  |      16 | f     | f    | 02 00 00 00 00 00 00 00
de>

Count how many items with data='01 00 00 00 00 00 00 00' these leaf index pages contain; this corresponds to how many heap pages need to be scanned

de  >select count(distinct substring(ctid::text, 1, "position"(ctid::text, ','))) from (
select * from bt_page_items('test03_pkey',1) 
union all
select * from bt_page_items('test03_pkey',937) 
union all
select * from bt_page_items('test03_pkey',767) 
union all
select * from bt_page_items('test03_pkey',666) 
union all
select * from bt_page_items('test03_pkey',572) 
union all
select * from bt_page_items('test03_pkey',478) 
union all
select * from bt_page_items('test03_pkey',395) 
union all
select * from bt_page_items('test03_pkey',307) 
union all
select * from bt_page_items('test03_pkey',173) 
union all
select * from bt_page_items('test03_pkey',99) 
union all
select * from bt_page_items('test03_pkey',951)
) t 
where data='01 00 00 00 00 00 00 00';

 count 
-------
  2652
(1 row)
de>

The count of 2652 corresponds to the 2651 heap blocks seen in the execution plan above.

Commit the long transaction

de  >postgres=# end;
COMMIT
de>

Wait for autovacuum to finish

de  >postgres=# select * from pg_stat_all_tables where relname='test03';
-[ RECORD 1 ]-------+------------------------------
relid               | 14156713
schemaname          | public
relname             | test03
seq_scan            | 39
seq_tup_read        | 5137822
idx_scan            | 3522865664
idx_tup_fetch       | 3521843178
n_tup_ins           | 1022487
n_tup_upd           | 3476465702
n_tup_del           | 22387
n_tup_hot_upd       | 3433472972
n_live_tup          | 100
n_dead_tup          | 0
n_mod_since_analyze | 0
last_vacuum         | 2016-07-15 00:03:53.909086+08
last_autovacuum     | 2016-07-15 00:32:04.177672+08
last_analyze        | 2016-07-15 00:03:53.909825+08
last_autoanalyze    | 2016-07-15 00:07:23.541629+08
vacuum_count        | 10
autovacuum_count    | 125
analyze_count       | 7
autoanalyze_count   | 99
de>

Check how many blocks need to be scanned now

de  >postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test03 where id=1;
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.test03  (cost=40.40..44.41 rows=1 width=417) (actual time=0.026..0.027 rows=1 loops=1)
   Output: id, info
   Recheck Cond: (test03.id = 1)
   Heap Blocks: exact=1
   Buffers: shared hit=5
   ->  Bitmap Index Scan on test03_pkey  (cost=0.00..40.40 rows=1 width=0) (actual time=0.014..0.014 rows=1 loops=1)
         Index Cond: (test03.id = 1)
         Buffers: shared hit=4
 Planning time: 0.137 ms
 Execution time: 0.052 ms
(10 rows)
de>

Look at the index page content now; the half-dead pages have been removed

de  >postgres=# select count(distinct substring(ctid::text, 1, "position"(ctid::text, ','))) from (
select * from bt_page_items('test03_pkey',1) 
union all
select * from bt_page_items('test03_pkey',937) 
union all
select * from bt_page_items('test03_pkey',767) 
union all
select * from bt_page_items('test03_pkey',666) 
union all
select * from bt_page_items('test03_pkey',572) 
union all
select * from bt_page_items('test03_pkey',478) 
union all
select * from bt_page_items('test03_pkey',395) 
union all
select * from bt_page_items('test03_pkey',307) 
union all
select * from bt_page_items('test03_pkey',173) 
union all
select * from bt_page_items('test03_pkey',99) 
union all
select * from bt_page_items('test03_pkey',951)
) t 
where data='01 00 00 00 00 00 00 00' ;
NOTICE:  page is deleted
NOTICE:  page is deleted
NOTICE:  page is deleted
NOTICE:  page is deleted
NOTICE:  page is deleted
NOTICE:  page is deleted
NOTICE:  page is deleted
NOTICE:  page is deleted
NOTICE:  page is deleted
-[ RECORD 1 ]
count | 2
de>

Inspect the index page content again; it has been compacted by autovacuum

de  >postgres=# select * from bt_metap('test03_pkey');
 magic  | version | root | level | fastroot | fastlevel 
--------+---------+------+-------+----------+-----------
 340322 |       2 |  412 |     2 |      412 |         2
(1 row)

postgres=# select * from bt_page_items('test03_pkey',412);
 itemoffset |  ctid   | itemlen | nulls | vars |          data           
------------+---------+---------+-------+------+-------------------------
          1 | (3,1)   |       8 | f     | f    | 
          2 | (584,1) |      16 | f     | f    | 21 00 00 00 00 00 00 00
          3 | (411,1) |      16 | f     | f    | 46 00 00 00 00 00 00 00
(3 rows)

postgres=# select * from bt_page_items('test03_pkey',3);
 itemoffset |  ctid   | itemlen | nulls | vars |          data           
------------+---------+---------+-------+------+-------------------------
          1 | (58,1)  |      16 | f     | f    | 21 00 00 00 00 00 00 00
          2 | (1,1)   |       8 | f     | f    | 
          3 | (99,1)  |      16 | f     | f    | 01 00 00 00 00 00 00 00
          4 | (865,1) |      16 | f     | f    | 02 00 00 00 00 00 00 00
          5 | (844,1) |      16 | f     | f    | 03 00 00 00 00 00 00 00
          6 | (849,1) |      16 | f     | f    | 04 00 00 00 00 00 00 00
          7 | (18,1)  |      16 | f     | f    | 05 00 00 00 00 00 00 00
          8 | (95,1)  |      16 | f     | f    | 06 00 00 00 00 00 00 00
          9 | (63,1)  |      16 | f     | f    | 07 00 00 00 00 00 00 00
         10 | (34,1)  |      16 | f     | f    | 08 00 00 00 00 00 00 00
         11 | (851,1) |      16 | f     | f    | 09 00 00 00 00 00 00 00
         12 | (10,1)  |      16 | f     | f    | 0a 00 00 00 00 00 00 00
         13 | (71,1)  |      16 | f     | f    | 0b 00 00 00 00 00 00 00
         14 | (774,1) |      16 | f     | f    | 0c 00 00 00 00 00 00 00
         15 | (213,1) |      16 | f     | f    | 0d 00 00 00 00 00 00 00
         16 | (881,1) |      16 | f     | f    | 0e 00 00 00 00 00 00 00
         17 | (837,1) |      16 | f     | f    | 0f 00 00 00 00 00 00 00
         18 | (100,1) |      16 | f     | f    | 10 00 00 00 00 00 00 00
         19 | (872,1) |      16 | f     | f    | 11 00 00 00 00 00 00 00
         20 | (32,1)  |      16 | f     | f    | 12 00 00 00 00 00 00 00
         21 | (65,1)  |      16 | f     | f    | 13 00 00 00 00 00 00 00
         22 | (870,1) |      16 | f     | f    | 14 00 00 00 00 00 00 00
         23 | (841,1) |      16 | f     | f    | 15 00 00 00 00 00 00 00
         24 | (850,1) |      16 | f     | f    | 16 00 00 00 00 00 00 00
         25 | (30,1)  |      16 | f     | f    | 17 00 00 00 00 00 00 00
         26 | (91,1)  |      16 | f     | f    | 18 00 00 00 00 00 00 00
         27 | (829,1) |      16 | f     | f    | 19 00 00 00 00 00 00 00
         28 | (16,1)  |      16 | f     | f    | 1a 00 00 00 00 00 00 00
         29 | (784,1) |      16 | f     | f    | 1b 00 00 00 00 00 00 00
         30 | (31,1)  |      16 | f     | f    | 1c 00 00 00 00 00 00 00
         31 | (88,1)  |      16 | f     | f    | 1d 00 00 00 00 00 00 00
         32 | (48,1)  |      16 | f     | f    | 1e 00 00 00 00 00 00 00
         33 | (822,1) |      16 | f     | f    | 1f 00 00 00 00 00 00 00
         34 | (817,1) |      16 | f     | f    | 20 00 00 00 00 00 00 00
         35 | (109,1) |      16 | f     | f    | 21 00 00 00 00 00 00 00
(35 rows)

postgres=# select * from bt_page_items('test03_pkey',1);
 itemoffset |    ctid    | itemlen | nulls | vars |          data           
------------+------------+---------+-------+------+-------------------------
          1 | (57342,14) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          2 | (71195,14) |      16 | f     | f    | 01 00 00 00 00 00 00 00
(2 rows)

postgres=# select * from bt_page_items('test03_pkey',99);
 itemoffset |    ctid    | itemlen | nulls | vars |          data           
------------+------------+---------+-------+------+-------------------------
          1 | (66214,10) |      16 | f     | f    | 02 00 00 00 00 00 00 00
          2 | (71209,3)  |      16 | f     | f    | 02 00 00 00 00 00 00 00
(2 rows)
de>

Related source code

src/backend/access/nbtree/nbtpage.c

de  >/*
 * Unlink a page in a branch of half-dead pages from its siblings.
 *
 * If the leaf page still has a downlink pointing to it, unlinks the highest
 * parent in the to-be-deleted branch instead of the leaf page.  To get rid
 * of the whole branch, including the leaf page itself, iterate until the
 * leaf page is deleted.
 *
 * Returns 'false' if the page could not be unlinked (shouldn't happen).
 * If the (new) right sibling of the page is empty, *rightsib_empty is set
 * to true.
 */
static bool
_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
{
...
        /*
         * Mark the page itself deleted.  It can be recycled when all current
         * transactions are gone.  Storing GetTopTransactionId() would work, but
         * we're in VACUUM and would not otherwise have an XID.  Having already
         * updated links to the target, ReadNewTransactionId() suffices as an
         * upper bound.  Any scan having retained a now-stale link is advertising
         * in its PGXACT an xmin less than or equal to the value we read here.  It
         * will continue to do so, holding back RecentGlobalXmin, for the duration
         * of that scan.
         */
        page = BufferGetPage(buf);
        opaque = (BTPageOpaque) PageGetSpecialPointer(page);
        opaque->btpo_flags &= ~BTP_HALF_DEAD;
        opaque->btpo_flags |= BTP_DELETED;
        opaque->btpo.xact = ReadNewTransactionId();
...
de>

contrib/pageinspect/btreefuncs.c

de  >                if (P_ISDELETED(opaque))
                        elog(NOTICE, "page is deleted");
de>

References

1. B-Tree internals
https://yq.aliyun.com/articles/54437

Optimization measures

1. Database-side optimizations for frequently updated tables
1.1 Monitor long transactions and keep them strictly under control

1.2 Reduce the autovacuum naptime (to 1s),
increase the number of autovacuum workers (to 10),
set the autovacuum cost delay to 0,
increase autovacuum work memory (to 512MB or more),
and put frequently modified tables and indexes on devices with good IOPS.
Do not underestimate these parameters; they are critical (see the sketch after this list).

1.3 If, after the long transaction has ended and vacuum has already been triggered on the table, lookups still have to visit many pages, the index pages were not deleted and compacted, probably because they did not meet the conditions for compaction. In that case a REINDEX is needed.
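
A minimal sketch of the settings mentioned in 1.2, written as ALTER SYSTEM commands (9.4+; the values are the starting points from this article, not universal recommendations):

ALTER SYSTEM SET autovacuum_naptime = '1s';
ALTER SYSTEM SET autovacuum_max_workers = 10;          -- needs a server restart to take effect
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = 0;
ALTER SYSTEM SET autovacuum_work_mem = '512MB';
SELECT pg_reload_conf();                               -- the other three take effect after a reload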

2. PostgreSQL 9.6 solves this long-transaction pitfall once and for all with the "snapshot too old" feature.
The vacuum improvement in 9.6 is shown below
screenshot

How "snapshot too old" is detected is shown below
screenshot

https://www.postgresql.org/docs/9.6/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-ASYNC-BEHAVIOR
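
A minimal sketch of enabling the 9.6 feature (old_snapshot_threshold only takes effect after a server restart; -1, the default, disables it):

-- allow "snapshot too old" errors for snapshots older than 1 hour,
-- so vacuum can reclaim dead tuples that such snapshots would otherwise pin
ALTER SYSTEM SET old_snapshot_threshold = '1h';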

3. The garbage-collection mechanism in 9.6 still has room for improvement, namely finer-grained version control. I have shared the approach before: record each transaction's isolation level in the transaction list and use it to decide which versions must be kept, instead of simply keeping every version newer than the oldest transaction.

Have fun, everyone, and you are always welcome to come to Alibaba Cloud for a long chat about your business requirements.

Keep it up, Alibaba Cloud teammates: let's build the most down-to-earth cloud database.


PostgreSQL vs. Oracle Compatibility - How to Rewrite Inserted Values (e.g. Replacing NULL with Another Value)


Oracle has a feature that replaces a NULL value supplied by the user with a specified value.
Its intent differs from a DEFAULT: a DEFAULT is used only when the user does not specify a value at all.
For example

postgres=# alter table test alter column id set default 1;
ALTER TABLE
postgres=# create table t(id int, info text default 'abc');
CREATE TABLE
postgres=# insert into t values (1);
INSERT 0 1
postgres=# select * from t;
 id | info 
----+------
  1 | abc
(1 row)

But when the user explicitly supplies NULL, what gets stored is NULL.

postgres=# insert into t values (1,NULL);
INSERT 0 1
postgres=# select * from t;
 id | info 
----+------
  1 | abc
  1 | 
(2 rows)

NULL replacement, on the other hand, means that even when the user supplies NULL it is replaced with another value.
How can this be done in PostgreSQL?
A trigger does the job.

postgres=# create or replace function tgf1() returns trigger as $$
declare
begin
  if NEW.info is null then
    NEW.info = (TG_ARGV[0])::text;
  end if;
  return NEW; 
end;
$$ language plpgsql;
CREATE FUNCTION

postgres=# create trigger tg1 before insert on t for each row execute procedure tgf1('new_value');
CREATE TRIGGER

postgres=# insert into t values (3,NULL);
INSERT 0 1
postgres=# select * from t where id=3;
 id |   info    
----+-----------
  3 | new_value
(1 row)

You can even substitute different values depending on the current user

postgres=# create or replace function tgf1() returns trigger as $$
declare
begin
  if NEW.info is null then
    select case when current_user = 'test' then 'hello' else 'world' end into NEW.info;   
  end if;     
  return NEW; 
end;
$$ language plpgsql;
CREATE FUNCTION

postgres=# insert into t values (5,NULL);
INSERT 0 1
postgres=# select * from t where id=5;
 id | info  
----+-------
  5 | world
(1 row)

postgres=# create role test superuser login;
CREATE ROLE
postgres=# \c postgres test

postgres=# insert into t values (6,NULL);
INSERT 0 1
postgres=# select * from t where id=6;
 id | info  
----+-------
  6 | hello
(1 row)

Fun, isn't it?
Since PL/pgSQL is a very capable language, there is plenty of room for imagination in this kind of value rewriting; feel free to improvise. The same idea also covers UPDATEs, as sketched below.
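
A minimal sketch of extending the trigger to UPDATEs (tg2 is just an illustrative name; drop tg1 first so the rewrite does not fire twice on INSERT):

drop trigger tg1 on t;
create trigger tg2 before insert or update on t
for each row execute procedure tgf1('new_value');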

An Accessible Deep Dive into the PostgreSQL B-Tree Index Structure


PostgreSQL's B-Tree is a variant (a high-concurrency B-tree management algorithm); for the details of the algorithm see
src/backend/access/nbtree/README


PostgreSQL B-Tree index pages fall into several categories

meta page
root page         #  btpo_flags=2
branch page    #  btpo_flags=0
leaf page         #  btpo_flags=1

If a page is both leaf and root, then btpo_flags=3

The meta page and root page always exist; the meta page occupies one page and records the page id of the root page.
As the number of records grows, a single root page can no longer hold all heap items, so leaf pages appear, then branch pages, and possibly multiple levels of branch pages.
The total number of branch and leaf levels is recorded in the level field of the btree metadata.
4

We can use the pageinspect extension to look inside the B-Tree structure.

The level can be read from the btpo field of bt_page_stats; it is the level the current index page sits at.
Note that a level is not a single page; the pages at, say, btpo=3 may be divided into several of them.
By analogy, Tencent's T3 job grade is itself split into several sub-grades; it is much the same here, except there is no value distinguishing the sub-grades, although we will see their existence later.
btpo=0 is the bottom level; the items (ctids) stored in index pages at this level point to heap pages.


Category and level are not tied to each other; a category may span multiple levels, but only index pages at level 0 store ctids that point to heap pages. Index pages at other levels store ctids that point either to other index pages at the same level (a doubly linked list) or to index pages one level down.
.1.
A zero-level structure has only a meta page and a root page.
The maximum number of items the root page can hold depends on the length of the indexed column data and on the index page size.
1
Example

postgres=# create extension pageinspect;

postgres=# create table tab1(id int primary key, info text);
CREATE TABLE
postgres=# insert into tab1 select generate_series(1,100), md5(random()::text);
INSERT 0 100
postgres=# vacuum analyze tab1;
VACUUM

Look at the meta page: root page id = 1.
The index level = 0, which means there are no branch or leaf pages.

postgres=# select * from bt_metap('tab1_pkey');
 magic  | version | root | level | fastroot | fastlevel 
--------+---------+------+-------+----------+-----------
 340322 |       2 |    1 |     0 |        1 |         0
(1 row)

Using root page id = 1, look at the root page stats.
btpo=0 means we are already at the bottom level.
btpo_flags=3 means the page is both a leaf and the root.

postgres=# select * from bt_page_stats('tab1_pkey',1);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags 
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     1 | l    |        100 |          0 |            16 |      8192 |      6148 |         0 |         0 |    0 |          3
(1 row)

btpo_prev and btpo_next are this page's neighbouring pages (pages at a level form a doubly linked list).

The btpo_flags bits are defined in the source code (src/include/access/nbtree.h):

/* Bits defined in btpo_flags */
#define BTP_LEAF                (1 << 0)        /* leaf page, i.e. not internal page */
#define BTP_ROOT                (1 << 1)        /* root page (has no parent) */
#define BTP_DELETED             (1 << 2)        /* page has been deleted from tree */
#define BTP_META                (1 << 3)        /* meta-page */
#define BTP_HALF_DEAD   (1 << 4)        /* empty, but still in tree */
#define BTP_SPLIT_END   (1 << 5)        /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6)        /* page has LP_DEAD tuples */
#define BTP_INCOMPLETE_SPLIT (1 << 7)   /* right sibling's downlink is missing */

Look at the ctids (i.e. items) stored in the level-0 page.
At level 0 a ctid is the address of a heap page. (In a multi-level structure, a ctid in a branch page is the address of a same-level btree page, as a chain link, or of a lower-level btree page.)
When a ctid points to the heap, data holds the corresponding column value. (In multi-level structures data means something different, as explained later.)

postgres=# select * from bt_page_items('tab1_pkey',1);
 itemoffset |  ctid   | itemlen | nulls | vars |          data           
------------+---------+---------+-------+------+-------------------------
          1 | (0,1)   |      16 | f     | f    | 01 00 00 00 00 00 00 00
          2 | (0,2)   |      16 | f     | f    | 02 00 00 00 00 00 00 00
...
         99 | (0,99)  |      16 | f     | f    | 63 00 00 00 00 00 00 00
        100 | (0,100) |      16 | f     | f    | 64 00 00 00 00 00 00 00
(100 rows)

Use the ctid to fetch the heap record

postgres=# select * from tab1 where ctid='(0,100)';
 id  |               info               
-----+----------------------------------
 100 | 68b63c269ee8cc2d99fe204f04d0ffcb
(1 row)


.2.
A one-level structure includes a meta page, a root page and leaf pages.
2
Example

postgres=# truncate tab1;
TRUNCATE TABLE
postgres=# insert into tab1 select generate_series(1,1000), md5(random()::text);
INSERT 0 1000
postgres=# vacuum analyze tab1;
VACUUM

Look at the meta page: root page id = 3, index level = 1.
level = 1 means the index contains leaf pages.

postgres=# select * from bt_metap('tab1_pkey');
 magic  | version | root | level | fastroot | fastlevel 
--------+---------+------+-------+----------+-----------
 340322 |       2 |    3 |     1 |        3 |         1
(1 row)

Using the root page id, look at the root page stats.
btpo = 1 means we have not reached the bottom level yet (the bottom level is btpo=0; only the ctids stored there point to heap pages).
btpo_flags=2 means this page is the root page.

postgres=# select * from bt_page_stats('tab1_pkey',3);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags 
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     3 | r    |          3 |          0 |            13 |      8192 |      8096 |         0 |         0 |    1 |          2
(1 row)

Look at the leaf page items stored in the root page (they point to leaf pages).
There are 3 leaf pages in total; data holds the minimum value stored in each leaf page.

postgres=# select * from bt_page_items('tab1_pkey',3);
 itemoffset | ctid  | itemlen | nulls | vars |          data           
------------+-------+---------+-------+------+-------------------------
          1 | (1,1) |       8 | f     | f    | 
          2 | (2,1) |      16 | f     | f    | 6f 01 00 00 00 00 00 00
          3 | (4,1) |      16 | f     | f    | dd 02 00 00 00 00 00 00
(3 rows)

The first entry is empty because this leaf page is the leftmost page, which does not store a minimum value.
For a leaf page that has a right sibling, the first item stored is the link to that right sibling;
only the second entry is the real starting item.
Also note that although only the right link is stored as an item, leaf pages still form a doubly linked list; the prev and next pages are visible in the stats.
Use the leaf page id to look at the stats.
Leftmost leaf page = 1;
its prev btpo points to the meta page.

btpo = 0, so this page is a bottom-level page.
btpo_flags=1 means it is a leaf page.
postgres=# select * from bt_page_stats('tab1_pkey',1);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags 
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     1 | l    |        367 |          0 |            16 |      8192 |       808 |         0 |         2 |    0 |          1
(1 row)

For the rightmost leaf page = 4, next btpo points to the meta page.
btpo_flags=1 means it is a leaf page.

postgres=# select * from bt_page_stats('tab1_pkey',4);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags 
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     4 | l    |        268 |          0 |            16 |      8192 |      2788 |         2 |         0 |    0 |          1
(1 row)

The middle leaf page = 2.
btpo_flags=1 means it is a leaf page.

postgres=# select * from bt_page_stats('tab1_pkey',2);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags 
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     2 | l    |        367 |          0 |            16 |      8192 |       808 |         1 |         4 |    0 |          1
(1 row)

Look at the heap ctids (i.e. heap items) stored in the leaf pages.
Example with a right sibling, index page 1:
the first entry is the first item of the right sibling link; only the second entry is the page's real starting item.

postgres=# select * from bt_page_items('tab1_pkey',1);
 itemoffset |  ctid   | itemlen | nulls | vars |          data           
------------+---------+---------+-------+------+-------------------------
          1 | (3,7)   |      16 | f     | f    | 6f 01 00 00 00 00 00 00
          2 | (0,1)   |      16 | f     | f    | 01 00 00 00 00 00 00 00
          3 | (0,2)   |      16 | f     | f    | 02 00 00 00 00 00 00 00
...
        367 | (3,6)   |      16 | f     | f    | 6e 01 00 00 00 00 00 00
(367 rows)

Example without a right sibling, index page 4:
the first entry is already the real starting ctid (item).

postgres=# select * from bt_page_items('tab1_pkey',4);
 itemoffset |  ctid   | itemlen | nulls | vars |          data           
------------+---------+---------+-------+------+-------------------------
          1 | (6,13)  |      16 | f     | f    | dd 02 00 00 00 00 00 00
          2 | (6,14)  |      16 | f     | f    | de 02 00 00 00 00 00 00
...
        268 | (8,40)  |      16 | f     | f    | e8 03 00 00 00 00 00 00
(268 rows)

Use the ctid to fetch the heap record

postgres=#              select * from tab1 where ctid='(0,1)';
 id |               info               
----+----------------------------------
  1 | 6ebc6b77aebf5dd11621a2ed846c08c4
(1 row)


.3.
When the number of records exceeds what a one-level index can hold, it splits into a two-level structure: besides the meta page and root page, there may be one level of branch pages and one level of leaf pages.
For a boundary page (branch or leaf) there is no page in one direction, and the linked-list pointer in that direction points to the meta page.
3
Example

create table tab2(id int primary key, info text);  
postgres=# select 285^2;
 ?column? 
----------
    81225
(1 row)
postgres=# insert into tab2 select trunc(random()*10000000), md5(random()::text) from generate_series(1,1000000) on conflict on constraint tab2_pkey do nothing;
INSERT 0 951379
postgres=# vacuum analyze tab2;
VACUUM

Look at the meta page: root page id = 412, index level = 2, i.e. one level of branch pages plus one level of leaf pages.

postgres=# select * from bt_metap('tab2_pkey');
 magic  | version | root | level | fastroot | fastlevel 
--------+---------+------+-------+----------+-----------
 340322 |       2 |  412 |     2 |      412 |         2
(1 row)

Using the root page id, look at the root page stats.
btpo = 2: this page is at level 2, which also tells us the level below is 1.
btpo_flags = 2 means it is the root page.

postgres=# select * from bt_page_stats('tab2_pkey', 412);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags 
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
   412 | r    |         11 |          0 |            15 |      8192 |      7936 |         0 |         0 |    2 |          2
(1 row)

Look at the branch page items stored in the root page (they point to branch pages)

postgres=# select * from bt_page_items('tab2_pkey', 412);
 itemoffset |   ctid   | itemlen | nulls | vars |          data           
------------+----------+---------+-------+------+-------------------------
          1 | (3,1)    |       8 | f     | f    | 
          2 | (2577,1) |      16 | f     | f    | e1 78 0b 00 00 00 00 00
          3 | (1210,1) |      16 | f     | f    | ec 3a 18 00 00 00 00 00
          4 | (2316,1) |      16 | f     | f    | de 09 25 00 00 00 00 00
          5 | (574,1)  |      16 | f     | f    | aa e8 33 00 00 00 00 00
          6 | (2278,1) |      16 | f     | f    | 85 90 40 00 00 00 00 00
          7 | (1093,1) |      16 | f     | f    | f6 e9 4e 00 00 00 00 00
          8 | (2112,1) |      16 | f     | f    | a3 60 5c 00 00 00 00 00
          9 | (411,1)  |      16 | f     | f    | b2 ea 6b 00 00 00 00 00
         10 | (2073,1) |      16 | f     | f    | db de 79 00 00 00 00 00
         11 | (1392,1) |      16 | f     | f    | df b0 8a 00 00 00 00 00
(11 rows)

Using a branch page id, look at its stats.
btpo = 1: this page is at level 1, which also tells us the level below is 0.
btpo_flags = 0 means it is a branch page.

postgres=# select * from bt_page_stats('tab2_pkey', 3);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags 
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     3 | i    |        254 |          0 |            15 |      8192 |      3076 |         0 |      2577 |    1 |          0
(1 row)

Look at the leaf page ctids stored in the branch page (they point to leaf pages).
For every page except the rightmost one, the first entry is the starting item of the right sibling;
only the second entry is this page's own starting ctid.
Note that the starting item of every branch page has an empty data field,
which means a branch page does not store the minimum indexed value of all the leaf pages it covers.

postgres=# select * from bt_page_items('tab2_pkey', 3);
 itemoffset |   ctid   | itemlen | nulls | vars |          data           
------------+----------+---------+-------+------+-------------------------
          1 | (735,1)  |      16 | f     | f    | e1 78 0b 00 00 00 00 00
          2 | (1,1)    |       8 | f     | f    | 
          3 | (2581,1) |      16 | f     | f    | a8 09 00 00 00 00 00 00
          4 | (1202,1) |      16 | f     | f    | f8 13 00 00 00 00 00 00
...
        254 | (3322,1) |      16 | f     | f    | ee 6f 0b 00 00 00 00 00
(254 rows)

Use the ctid to look at a leaf page.
btpo = 0: this page is at level 0, the bottom level, where heap ctids are stored.
btpo_flags = 1 means it is a leaf page.

postgres=# select * from bt_page_stats('tab2_pkey', 1);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags 
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     1 | l    |        242 |          0 |            16 |      8192 |      3308 |         0 |      2581 |    0 |          1
(1 row)

postgres=# select * from bt_page_items('tab2_pkey', 1);
 itemoffset |    ctid    | itemlen | nulls | vars |          data           
------------+------------+---------+-------+------+-------------------------
          1 | (4985,16)  |      16 | f     | f    | a8 09 00 00 00 00 00 00
          2 | (7305,79)  |      16 | f     | f    | 01 00 00 00 00 00 00 00
          3 | (2757,120) |      16 | f     | f    | 09 00 00 00 00 00 00 00
...
        242 | (1329,101) |      16 | f     | f    | a0 09 00 00 00 00 00 00
(242 rows)

Look at the heap page items contained in the leaf page.
Based on how the index pages are structured we can work out that (7305,79) is the minimum value, and fetching it proves the point.

postgres=# select * from tab2 where ctid='(7305,79)';
 id |               info               
----+----------------------------------
  1 | 18aaeb74c359355311ac825ae2aeb22a
(1 row)

postgres=# select min(id) from tab2;
 min 
-----
   1
(1 row)



.4.
A multi-level structure: besides the meta page, it may contain multiple levels of branch pages plus one level of leaf pages.

4
Example

postgres=# create table tab3(id int primary key, info text);
CREATE TABLE
postgres=# insert into tab3 select generate_series(1, 100000000), md5(random()::text);  

Look at the meta page; note that the level is now 3.

meta page
postgres=# select * from bt_metap('tab3_pkey');
 magic  | version |  root  | level | fastroot | fastlevel 
--------+---------+--------+-------+----------+-----------
 340322 |       2 | 116816 |     3 |   116816 |         3
(1 row)

btpo_flags=2 means root page
btpo = 3 means level 3

postgres=# select * from bt_page_stats('tab3_pkey', 116816);
 blkno  | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags 
--------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
 116816 | r    |          3 |          0 |            13 |      8192 |      8096 |         0 |         0 |    3 |          2
(1 row)

postgres=# select * from bt_page_items('tab3_pkey', 116816);
 itemoffset |    ctid    | itemlen | nulls | vars |          data           
------------+------------+---------+-------+------+-------------------------
          1 | (412,1)    |       8 | f     | f    | 
          2 | (116815,1) |      16 | f     | f    | 5f 9e c5 01 00 00 00 00
          3 | (198327,1) |      16 | f     | f    | bd 3c 8b 03 00 00 00 00
(3 rows)

btpo_flags=0 means branch page
btpo = 2 means level 2

postgres=# select * from bt_page_stats('tab3_pkey', 412);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags 
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
   412 | i    |        286 |          0 |            15 |      8192 |      2436 |         0 |    116815 |    2 |          0
(1 row)

postgres=# select * from bt_page_items('tab3_pkey', 412);
 itemoffset |   ctid    | itemlen | nulls | vars |          data           
------------+-----------+---------+-------+------+-------------------------
          1 | (81636,1) |      16 | f     | f    | 5f 9e c5 01 00 00 00 00  -- this ctid points to the right sibling at the same level
          2 | (3,1)     |       8 | f     | f    |    -- note: this is the page's real starting entry
          3 | (411,1)   |      16 | f     | f    | 77 97 01 00 00 00 00 00
          4 | (698,1)   |      16 | f     | f    | ed 2e 03 00 00 00 00 00
...
        286 | (81350,1) |      16 | f     | f    | e9 06 c4 01 00 00 00 00
(286 rows)

btpo_flags=0 means branch page
btpo = 1 means level 1

postgres=# select * from bt_page_stats('tab3_pkey', 3);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags 
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     3 | i    |        286 |          0 |            15 |      8192 |      2436 |         0 |       411 |    1 |          0
(1 row)

postgres=# select * from bt_page_items('tab3_pkey', 3);
 itemoffset |  ctid   | itemlen | nulls | vars |          data           
------------+---------+---------+-------+------+-------------------------
          1 | (287,1) |      16 | f     | f    | 77 97 01 00 00 00 00 00
          2 | (1,1)   |       8 | f     | f    | 
          3 | (2,1)   |      16 | f     | f    | 6f 01 00 00 00 00 00 00
          4 | (4,1)   |      16 | f     | f    | dd 02 00 00 00 00 00 00
...
        286 | (286,1) |      16 | f     | f    | 09 96 01 00 00 00 00 00
(286 rows)

btpo_flags=1 means leaf page
btpo = 0 means level 0

postgres=# select * from bt_page_stats('tab3_pkey', 1);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags 
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     1 | l    |        367 |          0 |            16 |      8192 |       808 |         0 |         2 |    0 |          1
(1 row)

postgres=# select * from bt_page_items('tab3_pkey', 1);
 itemoffset |  ctid   | itemlen | nulls | vars |          data           
------------+---------+---------+-------+------+-------------------------
          1 | (3,7)   |      16 | f     | f    | 6f 01 00 00 00 00 00 00
          2 | (0,1)   |      16 | f     | f    | 01 00 00 00 00 00 00 00
          3 | (0,2)   |      16 | f     | f    | 02 00 00 00 00 00 00 00
...
        367 | (3,6)   |      16 | f     | f    | 6e 01 00 00 00 00 00 00
(367 rows)

With the level-0 ctid we can finally reach the heap.
Heap tuple example

postgres=# select * from tab3 where ctid='(0,1)';
 id |               info               
----+----------------------------------
  1 | 370ee1989a2b7f5d8a5b43243596d91f
(1 row)


How to interpret the number of btree pages scanned in explain analyze
Hands-on example 1

postgres=# create table tbl1(id int primary key, info text);
CREATE TABLE
postgres=# insert into tbl1 select trunc(random()*10000000), md5(random()::text) from generate_series(1,5000000) on conflict on constraint tbl1_pkey do nothing;
INSERT 0 3934875
postgres=# select ctid,* from tbl1 limit 10;
  ctid  |   id    |               info               
--------+---------+----------------------------------
 (0,1)  | 2458061 | 5c91812b54bdcae602321dceaf22e276
 (0,2)  | 8577271 | fe8e7a8be0d71a94e13b1b5a7786010b
 (0,3)  | 4612744 | 56983e47f044b5a4655300e1868d2850
 (0,4)  | 3690167 | 4a5ec8abf67bc018dcc113be829a59da
 (0,5)  | 2646638 | 7686b47dcb94e56c11d69ec04d6017f3
 (0,6)  | 6023272 | 4779d9a849c8287490be9d37a27b4637
 (0,7)  | 7163674 | 35af37f479f48caa65033a5ef56cd75e
 (0,8)  | 4049257 | 12fa110d927c88dce0773b546cc600c6
 (0,9)  | 5815903 | 69ed9770ede59917d15ac2373ca8c797
 (0,10) | 4068194 | 738595f73670da7ede40aefa8cb3d00c
(10 rows)
postgres=# vacuum analyze tbl1;
VACUUM

First we need to know the index level in order to correctly work out how many index pages must be scanned to fetch a single record.

postgres=# select * from bt_metap('tbl1_pkey');
 magic  | version | root | level | fastroot | fastlevel 
--------+---------+------+-------+----------+-----------
 340322 |       2 |  412 |     2 |      412 |         2
(1 row)

A btree with level = 2 should look like this
6

.1. The following query hits 1 record and uses an index only scan.
It reads 4 index pages: 1 meta page, 1 root page, 1 branch page, 1 leaf page.

postgres=#  explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id = 1;
                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using tbl1_pkey on public.tbl1  (cost=0.42..1.44 rows=1 width=4) (actual time=0.019..0.020 rows=1 loops=1)
   Output: id
   Index Cond: (tbl1.id = 1)
   Heap Fetches: 0
   Buffers: shared hit=4
 Planning time: 0.072 ms
 Execution time: 0.072 ms
(7 rows)

.2. The following query hits 0 records and uses an index only scan.
It reads 4 index pages: 1 meta page, 1 root page, 1 branch page, 1 leaf page,
but explain only counts 3, missing the leaf page visit; call it a small bug.

postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id in (3);
                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using tbl1_pkey on public.tbl1  (cost=0.43..1.45 rows=1 width=4) (actual time=0.010..0.010 rows=0 loops=1)
   Output: id
   Index Cond: (tbl1.id = 3)
   Heap Fetches: 0
   Buffers: shared hit=3
 Planning time: 0.073 ms
 Execution time: 0.031 ms
(7 rows)

.3. The following query hits 7 records and uses an index only scan.
It reads 22 index pages:
1 meta page + 7 * (1 root + 1 branch + 1 leaf) = 22
In other words, every value walks the root, branch and leaf separately.

postgres=#  explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id in (1,2,3,4,100,1000,10000);
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using tbl1_pkey on public.tbl1  (cost=0.42..10.10 rows=7 width=4) (actual time=0.018..0.033 rows=7 loops=1)
   Output: id
   Index Cond: (tbl1.id = ANY ('{1,2,3,4,100,1000,10000}'::integer[]))
   Heap Fetches: 0
   Buffers: shared hit=22
 Planning time: 0.083 ms
 Execution time: 0.056 ms
(7 rows)

.4. The following query hits 2 records and uses an index only scan,
yet it still reads 22 index pages:
1 meta page + 7 * (1 root + 1 branch + 1 leaf) = 22
Again, every value walks the root, branch and leaf.

postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id in (1,2,3,4,5,6,7);
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using tbl1_pkey on public.tbl1  (cost=0.43..10.13 rows=7 width=4) (actual time=0.039..0.046 rows=2 loops=1)
   Output: id
   Index Cond: (tbl1.id = ANY ('{1,2,3,4,5,6,7}'::integer[]))
   Heap Fetches: 0
   Buffers: shared hit=22
 Planning time: 0.232 ms
 Execution time: 0.086 ms
(7 rows)

.5. The following query returns the same rows as the one above, also via an index only scan,
but it reads only 4 index pages:
1 meta page + 1 root + 1 branch + 1 leaf

postgres=# explain (analyze,verbose,timing,costs,buffers) select id from tbl1 where id>0 and id <=7;
                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using tbl1_pkey on public.tbl1  (cost=0.43..1.49 rows=3 width=4) (actual time=0.008..0.009 rows=2 loops=1)
   Output: id
   Index Cond: ((tbl1.id > 0) AND (tbl1.id <= 7))
   Heap Fetches: 0
   Buffers: shared hit=4
 Planning time: 0.127 ms
 Execution time: 0.028 ms
(7 rows)

For the fourth query, 22 blocks were scanned. The optimizer has room for improvement here: it could, for example, take 1 and 7 as boundary values, and once the first value is located, look at the minimum key of the next leaf page to conclude that 1,2,3,4,5,6,7 can all be fetched from the current leaf page, with no need to repeat the descent for every element.
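Until the planner learns that trick, a hand-written workaround is possible when the IN list is dense: let the range drive the index descent and keep the IN list only as a re-check filter. This is just a hedged sketch (the "+ 0" is a hypothetical way to stop the IN list from being used as an index condition); whether it actually reduces buffer hits should be verified with EXPLAIN (ANALYZE, BUFFERS) on your own data.

explain (analyze,verbose,timing,costs,buffers)
select id from tbl1
where id between 1 and 7             -- bounds derived from the IN list
  and (id + 0) in (1,2,3,4,5,6,7);   -- "+ 0" keeps this out of the index condition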

PostgreSQL ECPG Development DEMO


ECPG is a way of embedding SQL inside C code.
After writing the .pgc file, run the ecpg preprocessor to turn it into a .c file that you compile like any other C source.
Detailed usage is documented at
https://www.postgresql.org/docs/9.5/static/ecpg.html

In ecpg, a statement that starts with EXEC SQL is treated as embedded SQL.

Some simple usages follow.
They are case sensitive.
.1. Connecting to a database

EXEC SQL CONNECT TO target [AS connection-name] [USER user-name];

target : 
dbname[@hostname][:port]
tcp:postgresql://hostname[:port][/dbname][?options]
unix:postgresql://hostname[:port][/dbname][?options]

Examples:
EXEC SQL CONNECT TO mydb@sql.mydomain.com;

EXEC SQL CONNECT TO unix:postgresql://sql.mydomain.com/mydb AS myconnection USER john;

EXEC SQL BEGIN DECLARE SECTION;
const char *target = "mydb@sql.mydomain.com";
const char *user = "john";
const char *passwd = "secret";
EXEC SQL END DECLARE SECTION;
 ...
EXEC SQL CONNECT TO :target USER :user USING :passwd;
/* or EXEC SQL CONNECT TO :target USER :user/:passwd; */

.2. Declaring ecpg variables
The mapping between PostgreSQL data types and the types ecpg uses is described at
https://www.postgresql.org/docs/9.5/static/ecpg-variables.html#ECPG-VARIABLES-TYPE-MAPPING
Some types have no one-to-one mapping and must be converted with ecpg's pgtypes functions.



Below is a demo.
First write the .pgc file.

$ vi t.pgc
#include <stdio.h>
#include <stdlib.h>
#include <pgtypes_numeric.h>

EXEC SQL WHENEVER SQLERROR STOP;

int
main(void)
{
EXEC SQL BEGIN DECLARE SECTION;
    numeric *num;
    numeric *num2;
    decimal *dec;
EXEC SQL END DECLARE SECTION;

    EXEC SQL CONNECT TO tcp:postgresql://xxxcs.com:3433/postgres AS db_digoal USER digoal USING digoal;

    num = PGTYPESnumeric_new();
    dec = PGTYPESdecimal_new();

    EXEC SQL SELECT 12.345::numeric(4,2), 23.456::decimal(4,2) INTO :num, :dec;

    printf("numeric = %s\n", PGTYPESnumeric_to_asc(num, 0));
    printf("numeric = %s\n", PGTYPESnumeric_to_asc(num, 1));
    printf("numeric = %s\n", PGTYPESnumeric_to_asc(num, 2));

    /* Convert decimal to numeric to show a decimal value. */
    num2 = PGTYPESnumeric_new();
    PGTYPESnumeric_from_decimal(dec, num2);

    printf("decimal = %s\n", PGTYPESnumeric_to_asc(num2, 0));
    printf("decimal = %s\n", PGTYPESnumeric_to_asc(num2, 1));
    printf("decimal = %s\n", PGTYPESnumeric_to_asc(num2, 2));

    PGTYPESnumeric_free(num2);
    PGTYPESdecimal_free(dec);
    PGTYPESnumeric_free(num);

    EXEC SQL COMMIT;
    EXEC SQL DISCONNECT ALL;
    return 0;
}

The local environment needs the required header files and libraries.
Preprocess the .pgc file.
-t enables autocommit mode.

ecpg -t -c -I/home/digoal/pgsql9.6/include -o t.c t.pgc

Inspect the generated .c file:

/* Processed by ecpg (4.12.0) */
/* These include files are added by the preprocessor */
#include <ecpglib.h>
#include <ecpgerrno.h>
#include <sqlca.h>
/* End of automatic include section */

#line 1 "t.pgc"
#include <stdio.h>
#include <stdlib.h>
#include <pgtypes_numeric.h>

/* exec sql whenever sqlerror  stop ; */
#line 5 "t.pgc"


int
main(void)
{
/* exec sql begin declare section */




#line 11 "t.pgc"
 numeric * num ;

#line 12 "t.pgc"
 numeric * num2 ;

#line 13 "t.pgc"
 decimal * dec ;
/* exec sql end declare section */
#line 14 "t.pgc"


    { ECPGconnect(__LINE__, 0, "tcp:postgresql://rdsqm2ffv0wjxnxk5nbsi.pg.rds.aliyuncs.com:3433/postgres" , "digoal" , "digoal" , "db_digoal", 1); 
#line 16 "t.pgc"

if (sqlca.sqlcode < 0) exit (1);}
#line 16 "t.pgc"


    num = PGTYPESnumeric_new();
    dec = PGTYPESdecimal_new();

    { ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_normal, "select 12.345 :: numeric ( 4 , 2 ) , 23.456 :: decimal ( 4 , 2 )", ECPGt_EOIT, 
        ECPGt_numeric,&(num),(long)1,(long)0,sizeof(numeric), 
        ECPGt_NO_INDICATOR, NULL , 0L, 0L, 0L, 
        ECPGt_decimal,&(dec),(long)1,(long)0,sizeof(decimal), 
        ECPGt_NO_INDICATOR, NULL , 0L, 0L, 0L, ECPGt_EORT);
#line 21 "t.pgc"

if (sqlca.sqlcode < 0) exit (1);}
#line 21 "t.pgc"


    printf("numeric = %s\n", PGTYPESnumeric_to_asc(num, 0));
    printf("numeric = %s\n", PGTYPESnumeric_to_asc(num, 1));
    printf("numeric = %s\n", PGTYPESnumeric_to_asc(num, 2));

    /* Convert decimal to numeric to show a decimal value. */
    num2 = PGTYPESnumeric_new();
    PGTYPESnumeric_from_decimal(dec, num2);

    printf("decimal = %s\n", PGTYPESnumeric_to_asc(num2, 0));
    printf("decimal = %s\n", PGTYPESnumeric_to_asc(num2, 1));
    printf("decimal = %s\n", PGTYPESnumeric_to_asc(num2, 2));

    PGTYPESnumeric_free(num2);
    PGTYPESdecimal_free(dec);
    PGTYPESnumeric_free(num);

    { ECPGtrans(__LINE__, NULL, "commit");
#line 39 "t.pgc"

if (sqlca.sqlcode < 0) exit (1);}
#line 39 "t.pgc"

    { ECPGdisconnect(__LINE__, "ALL");
#line 40 "t.pgc"

if (sqlca.sqlcode < 0) exit (1);}
#line 40 "t.pgc"

    return 0;
}

Compile and link the .c file:

gcc -I/home/digoal/pgsql9.6/include  -Wall -g  t.c  -L/home/digoal/pgsql9.6/lib -lecpg -lpq -lpgtypes -o t

It can also be written as a Makefile:
$ vi Makefile
Note that the recipe lines must be indented with a TAB.

ECPG = ecpg
CC = gcc

INCLUDES = -I$(shell pg_config --includedir)
LIBPATH = -L$(shell pg_config --libdir)
CFLAGS += $(INCLUDES)
LDFLAGS += -Wall -g
LDLIBS += $(LIBPATH) -lecpg -lpq -lpgtypes

%.c: %.pgc
        $(ECPG) -t -c $(INCLUDES) -o $@ $<

%: %.o
        $(CC) $(CFLAGS) $(LDFLAGS) $(LDLIBS) -o $@ $<

TESTS = t t.c

default: $(TESTS)

clean:
        rm -f *.o *.so t t.c

Remember to put pg_config on the PATH before running make:

$ export PATH=/home/digoal/pgsql9.6/bin:$PATH
$ make

Run the compiled binary t:

./t 
numeric = 12
numeric = 12.4
numeric = 12.35
decimal = 23
decimal = 23.5
decimal = 23.46

PostgreSQL Multidimensional Analysis CASE


This is a pain point I heard about while chatting with folks from Xiaomi.
Many developers who build BI reports for the business will recognize it: the operations team asks for a report on one set of dimensions today, and comes back tomorrow to "torture" you with a different combination.
For the developers this is genuinely painful, so what is a good way to cope, ideally one that also looks impressively sophisticated to the operations side?

Multidimensional analysis is exactly what is needed here: if your table has, say, 10 columns, let the operations people combine any of them and get the corresponding report.
Many commercial databases ship this feature; few open source databases do. PostgreSQL is a real credit to the industry ~~~

Example
Assume there are 4 business columns plus one timestamp column.

postgres=# create table tab5(c1 int, c2 int, c3 int, c4 int, crt_time timestamp);
CREATE TABLE

Generate some test data:

postgres=# insert into tab5 select 
trunc(100*random()), 
trunc(1000*random()), 
trunc(10000*random()), 
trunc(100000*random()), 
clock_timestamp() + (trunc(10000*random())||' hour')::interval 
from generate_series(1,1000000);
INSERT 0 1000000

postgres=# select * from tab5 limit 10;
 c1 | c2  |  c3  |  c4   |          crt_time          
----+-----+------+-------+----------------------------
 72 |  46 | 3479 | 20075 | 2017-02-02 14:56:36.854218
 98 | 979 | 4491 | 83012 | 2017-06-13 08:56:36.854416
 54 | 758 | 5838 | 45956 | 2016-09-18 02:56:36.854427
  3 |  67 | 5148 | 74754 | 2017-01-01 01:56:36.854431
 42 | 650 | 7681 | 36495 | 2017-06-20 15:56:36.854435
  4 | 472 | 6454 | 19554 | 2016-06-18 19:56:36.854438
 82 | 922 |  902 | 17435 | 2016-07-21 14:56:36.854441
 68 | 156 | 1028 | 13275 | 2017-07-16 10:56:36.854444
  0 | 674 | 7446 | 59386 | 2016-07-26 09:56:36.854447
  0 | 629 | 2022 | 52285 | 2016-11-04 13:56:36.85445
(10 rows)

Create a table for the aggregated results. The bitmap column records which dimensions were aggregated, one 0/1 flag per position.

create table stat_tab5 (c1 int, c2 int, c3 int, c4 int, time1 text, time2 text, time3 text, time4 text, cnt int8, bitmap text);

Generate the statistics for every combination of the business columns, combined with exactly one of the four time granularities.
PS: if the business columns can contain NULLs, wrap them with coalesce() during aggregation so that real NULLs are not mistaken for rolled-up dimensions (see the sketch after the INSERT below).

insert into stat_tab5
select c1,c2,c3,c4,t1,t2,t3,t4,cnt, 
'' || 
case when c1 is null then 0 else 1 end || 
case when c2 is null then 0 else 1 end || 
case when c3 is null then 0 else 1 end || 
case when c4 is null then 0 else 1 end || 
case when t1 is null then 0 else 1 end || 
case when t2 is null then 0 else 1 end || 
case when t3 is null then 0 else 1 end || 
case when t4 is null then 0 else 1 end
from 
(
select c1,c2,c3,c4,
to_char(crt_time, 'yyyy') t1, 
to_char(crt_time, 'yyyy-mm') t2, 
to_char(crt_time, 'yyyy-mm-dd') t3, 
to_char(crt_time, 'yyyy-mm-dd hh24') t4, 
count(*) cnt
from tab5 
group by 
cube(c1,c2,c3,c4), 
grouping sets(to_char(crt_time, 'yyyy'), to_char(crt_time, 'yyyy-mm'), to_char(crt_time, 'yyyy-mm-dd'), to_char(crt_time, 'yyyy-mm-dd hh24'))
)
t;

INSERT 0 49570486
Time: 172373.714 ms
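As a hedged alternative sketch (not part of the original statement above): the GROUPING() function, available since PostgreSQL 9.5, reports directly whether a column belongs to the current grouping set (0 = grouped, 1 = rolled up), so the bitmap can be derived without relying on NULLs; coalesce() is then only needed if you want real NULL business values to show up as a concrete value. Only one time granularity is shown here; extend the select list and the group by clause the same way as the full statement above.

select c1, c2, c3, c4,
       to_char(crt_time, 'yyyy') as t1,
       count(*) as cnt,
       (1 - grouping(c1))::text || (1 - grouping(c2))::text ||
       (1 - grouping(c3))::text || (1 - grouping(c4))::text as bitmap  -- 1 = dimension present
from tab5
group by cube(c1, c2, c3, c4), to_char(crt_time, 'yyyy')
limit 10;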

Create an index on bitmap so that any chosen combination can be fetched quickly.

create index idx_stat_tab5_bitmap on stat_tab5 (bitmap);

When the user ticks a few dimensions, fetch the data for exactly that combination.

c1,c3,c4,t3 = bitmap(10110010)

postgres=# select c1,c3,c4,time3,cnt from stat_tab5 where bitmap='10110010' limit 10;
 c1 | c3 |  c4   |   time3    | cnt 
----+----+-------+------------+-----
 41 |  0 | 30748 | 2016-06-04 |   1
 69 |  0 | 87786 | 2016-06-04 |   1
 70 |  0 | 38805 | 2016-06-04 |   1
 79 |  0 | 65892 | 2016-06-08 |   1
 51 |  0 | 13615 | 2016-06-11 |   1
 47 |  0 | 42196 | 2016-06-28 |   1
 45 |  0 | 54736 | 2016-07-01 |   1
 50 |  0 | 21605 | 2016-07-02 |   1
 46 |  0 | 40888 | 2016-07-16 |   1
 41 |  0 | 90258 | 2016-07-17 |   1
(10 rows)
Time: 0.528 ms

postgres=# select * from stat_tab5 where bitmap='00001000' limit 10;
 c1 | c2 | c3 | c4 | time1 | time2 | time3 | time4 |  cnt   |  bitmap  
----+----+----+----+-------+-------+-------+-------+--------+----------
    |    |    |    | 2016  |       |       |       | 514580 | 00001000
    |    |    |    | 2017  |       |       |       | 485420 | 00001000
(2 rows)
Time: 0.542 ms
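If reports are usually also bounded by time, a composite index is a natural refinement. This is a hypothetical addition, not part of the original case; since time3 is stored as 'yyyy-mm-dd' text, a lexicographic range comparison works.

create index idx_stat_tab5_bitmap_t3 on stat_tab5 (bitmap, time3);

select c1, c3, c4, time3, cnt
from stat_tab5
where bitmap = '10110010'
  and time3 between '2016-06-01' and '2016-06-30';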

Back to the big INSERT ... SELECT above: its execution plan shows the elegant handling of sorts, where one sort is reused and rolled up many times inside a single GroupAggregate, not a naive UNION ALL of every combination.

                                                    QUERY PLAN                                                    
------------------------------------------------------------------------------------------------------------------
 Insert on stat_tab5  (cost=208059.84..142986926.23 rows=1536000000 width=184)
   ->  Subquery Scan on t  (cost=208059.84..142986926.23 rows=1536000000 width=184)
         ->  GroupAggregate  (cost=208059.84..35466926.23 rows=1536000000 width=152)
               Group Key: (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c4, tab5.c2, tab5.c1
               Group Key: (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c4, tab5.c2
               Group Key: (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c4
               Group Key: (to_char(tab5.crt_time, 'yyyy'::text))
               Sort Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text))
                 Group Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text))
               Sort Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text))
                 Group Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text))
               Sort Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text))
                 Group Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text))
               Sort Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, 'yyyy'::text))
                 Group Key: tab5.c3, tab5.c4, (to_char(tab5.crt_time, 'yyyy'::text))
               Sort Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c1
                 Group Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c1
                 Group Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text))
               Sort Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c1
                 Group Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c1
                 Group Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text))
               Sort Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c1
                 Group Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c1
                 Group Key: tab5.c2, tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text))
               Sort Key: tab5.c1, tab5.c4, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, tab5.c4, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, tab5.c4, (to_char(tab5.crt_time, 'yyyy'::text))
               Sort Key: tab5.c2, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c3, tab5.c4
                 Group Key: tab5.c2, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c3, tab5.c4
                 Group Key: tab5.c2, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c3
                 Group Key: tab5.c2, (to_char(tab5.crt_time, 'yyyy'::text))
               Sort Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c2
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text))
               Sort Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c2
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text))
               Sort Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c1, tab5.c4
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c1, tab5.c4
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c1
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text))
               Sort Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c2
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy-mm'::text))
               Sort Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c1, tab5.c4
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c1, tab5.c4
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c1
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text))
               Sort Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c1, tab5.c4
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c1, tab5.c4
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c1
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy-mm'::text))
               Sort Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c1, tab5.c4
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c1, tab5.c4
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c1
                 Group Key: tab5.c3, (to_char(tab5.crt_time, 'yyyy'::text))
               Sort Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c2, tab5.c3
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c2
                 Group Key: tab5.c1, (to_char(tab5.crt_time, 'yyyy'::text))
               Sort Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c1, tab5.c2, tab5.c3
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c1, tab5.c2, tab5.c3
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c1
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm'::text))
               Sort Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c1, tab5.c2, tab5.c3
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c1, tab5.c2, tab5.c3
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c1
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd'::text))
               Sort Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c1, tab5.c2, tab5.c3
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c1, tab5.c2, tab5.c3
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c1
                 Group Key: tab5.c4, (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text))
               Sort Key: (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c2, tab5.c3, tab5.c4
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c2, tab5.c3, tab5.c4
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c2, tab5.c3
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm'::text)), tab5.c2
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm'::text))
               Sort Key: (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c2, tab5.c3, tab5.c4
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c2, tab5.c3, tab5.c4
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c2, tab5.c3
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text)), tab5.c2
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm-dd hh24'::text))
               Sort Key: (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c2, tab5.c3, tab5.c4
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c2, tab5.c3, tab5.c4
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c2, tab5.c3
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm-dd'::text)), tab5.c2
                 Group Key: (to_char(tab5.crt_time, 'yyyy-mm-dd'::text))
               ->  Sort  (cost=208059.84..210559.84 rows=1000000 width=144)
                     Sort Key: (to_char(tab5.crt_time, 'yyyy'::text)), tab5.c4, tab5.c2, tab5.c1
                     ->  Seq Scan on tab5  (cost=0.00..26370.00 rows=1000000 width=144)
(93 rows)

Follow-up optimization options
.1. Partitioned tables
Because there are so many dimension combinations, the result set is very large.
Partitioning the data helps query efficiency: for example, partition by bitmap first, then sub-partition each bitmap partition by the time dimension (a sketch follows this list).

.2. Streaming computation
Use pipelinedb together with cube and grouping sets to turn the statistics above into streaming statistics, which improves the user experience by delivering reports faster.

.3. Use an MPP product such as Greenplum to scale out storage and computation.
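A minimal sketch of item .1, using the inheritance-style partitioning available in 9.x; the partition names and bounds are made up for illustration, and in practice inserts would be routed into the children rather than the parent.

-- one child per bitmap value, sub-partitioned by month on the daily time bucket
create table stat_tab5_10110010 (
    check (bitmap = '10110010')
) inherits (stat_tab5);

create table stat_tab5_10110010_201606 (
    check (bitmap = '10110010' and time3 >= '2016-06-01' and time3 < '2016-07-01')
) inherits (stat_tab5_10110010);

-- with constraint_exclusion = partition (the default), a query that filters on
-- bitmap and time3 with plain constants skips the children that cannot match
select c1, c3, c4, time3, cnt
from stat_tab5
where bitmap = '10110010'
  and time3 >= '2016-06-01' and time3 < '2016-07-01';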

Reference
https://www.postgresql.org/docs/9.6/static/queries-table-expressions.html#QUERIES-GROUPING-SETS

PostgreSQL function volatility and constraint_exclusion: a CASE of filtering partitions by logical inference


I have written about PostgreSQL function volatility before, and the PG optimizer relies on volatility in many places when making optimization choices.
http://www.tudou.com/programs/view/p6E3oQEsZv0/
The case shared here is also about function volatility. When we use partitioned (inheritance) tables, PostgreSQL can compare the children's constraints with the conditions supplied in the SQL and, by logical inference, exclude tables that do not need to be scanned.
Logical inference was also covered earlier:
https://yq.aliyun.com/articles/6821

Let me state the conclusion up front: for constraint checking, any function that appears in the condition must be immutable; only then can the logical inference be applied and the unneeded tables be filtered out.
Why isn't stable good enough?
Because execution plans can be cached and an excluded table never enters plan generation, the function being filtered on must return the same result every time it is called with the same arguments; only then is the plan built with exclusion guaranteed to give the same answers as the plan built without it for the same input.
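As a side note (not from the original text): constraint exclusion itself is controlled by a GUC, and the default value partition applies it only to inheritance children and UNION ALL subqueries.

show constraint_exclusion;            -- 'partition' by default
set constraint_exclusion = partition; -- valid values: on, off, partition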

OK, let's look at an example:

postgres=# create table p1(id int, t int);
CREATE TABLE
postgres=# create table c1(like p1) inherits(p1);
NOTICE:  merging column "id" with inherited definition
NOTICE:  merging column "t" with inherited definition
CREATE TABLE
postgres=# create table c2(like p1) inherits(p1);
NOTICE:  merging column "id" with inherited definition
NOTICE:  merging column "t" with inherited definition
CREATE TABLE
postgres=# select to_timestamp(123);
      to_timestamp      
------------------------
 1970-01-01 08:02:03+08
(1 row)

postgres=# alter table c1 add constraint ck check(to_char(to_timestamp(t::double precision), 'yyyymmdd'::text) >= '20150101'::text AND to_char(to_timestamp(t::double precision), 'yyyymmdd'::text) < '20150102'::text);
ALTER TABLE
postgres=# alter table c2 add constraint ck check(to_char(to_timestamp(t::double precision), 'yyyymmdd'::text) >= '20150102'::text AND to_char(to_timestamp(t::double precision), 'yyyymmdd'::text) < '20150103'::text);
ALTER TABLE
postgres=# explain select * from p1 where to_char((to_timestamp(t::double precision)), 'yyyymmdd'::text)='20150101'::text;
                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
 Append  (cost=0.00..110.40 rows=23 width=8)
   ->  Seq Scan on p1  (cost=0.00..0.00 rows=1 width=8)
         Filter: (to_char(to_timestamp((t)::double precision), 'yyyymmdd'::text) = '20150101'::text)
   ->  Seq Scan on c1  (cost=0.00..55.20 rows=11 width=8)
         Filter: (to_char(to_timestamp((t)::double precision), 'yyyymmdd'::text) = '20150101'::text)
   ->  Seq Scan on c2  (cost=0.00..55.20 rows=11 width=8)
         Filter: (to_char(to_timestamp((t)::double precision), 'yyyymmdd'::text) = '20150101'::text)
(7 rows)

The reason is that both functions are only stable:

                                                                                List of functions
   Schema   |     Name     |     Result data type     |      Argument data types       |  Type  | Security | Volatility |  Owner   | Language |     Source code     |               Description                
------------+--------------+--------------------------+--------------------------------+--------+----------+------------+----------+----------+---------------------+------------------------------------------
 pg_catalog | to_timestamp | timestamp with time zone | double precision               | normal | invoker  | stable     | postgres | internal | float8_timestamptz  | convert UNIX epoch to timestamptz
 pg_catalog | to_char      | text                     | timestamp with time zone, text | normal | invoker  | stable     | postgres | internal | timestamptz_to_char | format timestamp with time zone to text

A stable function is guaranteed to return the same result for the same arguments within one transaction, but not at arbitrary points in time.
For example, within a single session, separate calls may disagree. (With cached execution plans, excluding child partitions based on such a function would therefore be dangerous.)

Why are these two functions only stable? Because their results depend on environment factors such as the session's TimeZone setting.
With that understood, it is clear why the earlier query could not use the constraints to exclude any children.
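A quick illustration of that environment dependency (a hedged sketch; the exact output depends on your server's TimeZone setting, and 1420041600 is the epoch for 2015-01-01 00:00:00 at UTC+8):

set timezone = 'Asia/Shanghai';
select to_char(to_timestamp(1420041600), 'yyyymmdd');   -- 20150101

set timezone = 'UTC';
select to_char(to_timestamp(1420041600), 'yyyymmdd');   -- 20141231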
Workarounds:
.1. Create user-defined immutable wrapper functions and change both the SQL and the constraints to use them.

create or replace function im_to_char(timestamptz,text) returns text as $$
select to_char($1,$2);
$$ language sql immutable;

create or replace function im_to_timestamp(double precision) returns timestamptz as $$
select to_timestamp($1);
$$ language sql immutable;

postgres=# alter table c1 drop constraint ck;
ALTER TABLE
postgres=# alter table c2 drop constraint ck;
ALTER TABLE

postgres=# alter table c1 add constraint ck check(im_to_char(im_to_timestamp(t::double precision), 'yyyymmdd'::text) >= '20150101'::text AND im_to_char(im_to_timestamp(t::double precision), 'yyyymmdd'::text) < '20150102'::text);
ALTER TABLE
postgres=# alter table c2 add constraint ck check(im_to_char(im_to_timestamp(t::double precision), 'yyyymmdd'::text) >= '20150102'::text AND im_to_char(im_to_timestamp(t::double precision), 'yyyymmdd'::text) < '20150103'::text);
ALTER TABLE

postgres=# explain select * from p1 where im_to_char((im_to_timestamp(t::double precision)), 'yyyymmdd'::text)='20150101'::text;
                                                QUERY PLAN                                                 
-----------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..1173.90 rows=12 width=8)
   ->  Seq Scan on p1  (cost=0.00..0.00 rows=1 width=8)
         Filter: (im_to_char(im_to_timestamp((t)::double precision), 'yyyymmdd'::text) = '20150101'::text)
   ->  Seq Scan on c1  (cost=0.00..1173.90 rows=11 width=8)
         Filter: (im_to_char(im_to_timestamp((t)::double precision), 'yyyymmdd'::text) = '20150101'::text)
(5 rows)

.2. A riskier approach is to change the volatility of the two built-in functions directly. It is risky precisely because, as the TimeZone demo above shows, their results really can change with session settings; only consider it if the timezone never changes.

alter function to_timestamp(double precision) immutable;
alter function to_char(timestamptz, text) immutable;

Done. Exclusion now works:

postgres=# explain select * from p1 where to_char((to_timestamp(t::double precision)), 'yyyymmdd'::text)='20150101'::text;
                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
 Append  (cost=0.00..55.20 rows=12 width=8)
   ->  Seq Scan on p1  (cost=0.00..0.00 rows=1 width=8)
         Filter: (to_char(to_timestamp((t)::double precision), 'yyyymmdd'::text) = '20150101'::text)
   ->  Seq Scan on c1  (cost=0.00..55.20 rows=11 width=8)
         Filter: (to_char(to_timestamp((t)::double precision), 'yyyymmdd'::text) = '20150101'::text)
(5 rows)