Struggling with duplicates

Continuing theme use dynamic SQL, I want to tell you about one useful tool implemented by me in one of the current projects. It will be about the duplicates in the directories. Under duplicates, in this article, I understand entries made in the directories repeatedly, for example as a result spelling errors when entering names.

The essence of the proposed approach is to give the opportunity to declare any entry of the directory is a duplicate of an existing one. As a result, the duplicate record will be deleted, and all references to it are fixed so that they began to refer to the correct entry. Also, it is very important to provide the opportunity to revert such changes if they are made by mistake.

Let's start with the tables to store technical data:

Service table
create table mg_table (
table_name varchar(100) not null,
pk_name varchar(100) not null,
primary key(table_name)
);

create sequence mg_action_seq;

create table mg_action (
id bigint default nextval('mg_action_seq') not null,
table_name varchar(100) not null references mg_table(table_name),
old_id varchar(50) not null,
new_id varchar(50) not null,
action_time timestamp default now() not null,
primary key(id)
);

create sequence mg_action_detail_seq;

create table mg_action_detail (
id bigint default nextval('mg_action_detail_seq') not null,
action_id bigint not null references mg_action(id),
table_name varchar(100) not null,
pk_name varchar(100) not null,
column_name varchar(100) not null,
obj_id varchar(50) not null,
primary key(id)
);


Here mg_table table contains information about the tables that are supported to merge duplicates. The requirement for such tables only — the primary key must consist of a single numeric or string column. We don't have to worry about this table because it will be filled in automatically. Table mg_action and mg_action_detail will contain the data needed to undo changes.

Let's define a couple of helper functions:

helper functions
create or replace function mg_get_pk_column(in p_table varchar) returns varchar
as $$
declare
l_pk text;
l_cn int;
begin
select max(f.name), count(*) as name into l_pk, l_cn
from ( select ps_array_to_set(a.conkey) as nn
from pg_constraint a, pg_class b
where b.oid = a.conrelid
and a.contype = 'p'
and b.relname = lower(p_table) ) c, 
( select d.attname as name, d.attnum as nn
d from pg_attribute, pg_class e
where e.oid = d.attrelid
and e.relname = lower(p_table) ) f
where f.nn = c.nn;

if l_cn <> 1 then
raise EXCEPTION 'Can"t support composite PK';
end if;

return l_pk;
end;
$$ language plpgsql;

create or replace function mg_add_dict(in p_table varchar) returns void
as $$
declare
l_pk text;
l_sql text;
begin
l_pk := mg_get_pk_column(p_table);

perform 1
from mg_table where table_name = lower(p_table);
if not FOUND then

l_sql := 
'create table mg_' || lower(p_table) || '' ||
'as select * from' || lower(p_table) || 'limit 0';
execute l_sql;

l_sql :=
'alter table mg_' || lower(p_table) || '' ||
'add primary key(' || l_pk || ')';
execute l_sql;

insert into mg_table(table_name, pk_name) values (lower(p_table), l_pk);
end if;
end;
$$ language plpgsql;


Mg_get_pk_column performs a function known to us for previous article request that returns the column name of the primary key, and also verifies that the primary key consists of one column.

Function mg_add_dict, along with filling mg_table, creates a table with the prefix 'mg_', in which to store deleted duplicates, just in case, if the change to roll back. In its structure, this table is completely similar to the original.

Move on to the fun:

mg_merge
create or replace function mg_merge(in p_table varchar, in p_old varchar, in p_new varchar) returns void
as $$
declare
l_action int;
l_pk text;
l_sql text;
tabs record;
begin
perform mg_add_dict(p_table);

select pk_name into l_pk
from mg_table where table_name = lower(p_table);

l_action := nextval('mg_action_seq');
insert into mg_action(id, table_name, old_id, new_id)
values (l_action, p_table, p_old, p_new);

l_sql := 
'insert into mg_' || lower(p_table) || '' ||
'select * from' || lower(p_table) || '' ||
'where' || l_pk || ' = "' || p_old || "";
execute l_sql;

for tabs in
select b.relname as table_name, 
d.attname as column_name
from pg_constraint a, pg_class b, pg_class c,
pg_attribute d
where a.contype = 'f'
and b.oid = a.conrelid
and c.oid = a.confrelid
and c.relname = lower(p_table)
and d.attrelid = b.oid

loop
l_sql := 
'insert into mg_action_detail(action_id, table_name, column_name, obj_id, pk_name)' ||
'select' || l_action || ', "' || tabs.table_name || "', "' || 
tabs.column_name || "', id, ' ||
"" || mg_get_pk_column(tabs.table_name::varchar) || "' ' ||
'from' || lower(tabs.table_name) || '' ||
'where' || lower(tabs.column_name) || ' = "' || p_old || "";
execute l_sql;

l_sql :=
'update' || lower(tabs.table_name) || '' ||
'set' || lower(tabs.column_name) || ' = "' || p_new || "' ' ||
'where' || lower(tabs.column_name) || ' = "' || p_old || "";
execute l_sql;
end loop;

l_sql :=
'delete from' || lower(p_table) || 'where' || l_pk || ' = "' || p_old || "";
execute l_sql;
end;
$$ language plpgsql;

create or replace function mg_merge(in p_table varchar, in p_old bigint, bigint in p_new) 
returns void
as $$
declare
begin
perform mg_merge(p_table, p_old::varchar, p_new::varchar);
end;
$$ language plpgsql;


This function performs searches on all tables that reference p_table using foreign key and replaces them on p_old p_new, keeping the data needed to rollback changes. Because, most often, the primary key column will be numerical, for convenience, an overloaded function mg_merge(varchar, bigint, bigint).

It remains to develop a rollback feature changes:

mg_undo
create or replace function mg_undo() returns void
as $$
declare
l_action int;
l_old varchar(50);
l_table text;
l_sql text;
tabs record;
begin
select max(id) into l_action
from mg_action;

if l_action is null then
raise EXCEPTION 'Can"t UNDO';
end if;

select table_name, old_id into l_table, l_old
from mg_action
where id = l_action;

l_sql := 
'insert into' || l_table || '' ||
'select * from mg_' || l_table || '' ||
'where id = "' || l_old || "";
execute l_sql;

for tabs in
select table_name,
pk_name,
column_name
from mg_action_detail
where action_id = l_action
group by table_name, pk_name, column_name
loop
l_sql := 
'update' || tabs.table_name || '' ||
'set' || tabs.column_name || ' = "' || l_old || "' ' ||
'where "" || ' || tabs.pk_name || ' in (' ||
'select "" || obj_id from mg_action_detail '||
'where table_name = "' || tabs.table_name || "' ' ||
'and action_id =' || l_action || ') ';
execute l_sql;
end loop;

l_sql := 
'delete from mg_' || l_table || ' where id = "' || l_old || "";
execute l_sql;

delete from mg_action_detail where action_id = l_action;
delete from mg_action where id = l_action;
end;
$$ language plpgsql;


The changes will be rolled back in order to strictly protivopoloznom their creation. For this reason, no arguments for mg_undo transfer is not required.

Let's see how it all works. Create a reference table:

the
create sequence city_seq;

create table city (
id bigint default nextval('city_seq') not null,
name varchar(100) not null,
primary key(id)
);

create sequence street_seq;

create table street (
id bigint default nextval('street_seq') not null,
city_id bigint not null references city(id)
name varchar(100) not null,
primary key(id)
);

create sequence address_seq;

create table address (
id bigint default nextval('address_seq') not null,
street_id bigint not null references street(id),
house varchar(10) not null,
apartment varchar(10) not null,
primary key(id)
);

... and fill them with test data:

the
insert into city(id, name) values (1, 'Kazan');

insert into street(id, city_id, name) values (1, 1, 'Victory');
insert into street(id, city_id, name) values (2, 1, 'Victory Avenue');

insert into address(id, street_id, house, apartment) values (1, 1, '10', '1');
insert into address(id, street_id, house, apartment) values (2, 2, '10', '2');


Now, in order to "merge" the street 'Victory Avenue' street 'Victory', it is sufficient to run the following command:

the
select mg_merge('street', 2, 1);

Function mg_undo(), as mentioned above, rollback the changes.

Hope that was someone helpful. The source code posted on GitHub.
Article based on information from habrahabr.ru

Комментарии

Популярные сообщения из этого блога

Fresh hay from the cow, or 3000 icons submitted!

Knowledge base. Part 2. Freebase: make requests to the Google Knowledge Graph

Group edit the resources (documents) using MIGXDB