Reading large amounts of data in Python/PostgreSQL
Technology stack: PostgreSQL 9.3, Python 2.7, the psycopg2 module installed.
Problem:
How often in your practice have you had to deal with processing large tables (more than 10 million records)? I think you will agree that this task is quite resource-intensive, both in processing time and in system resources. Today I will try to show an alternative way of solving the problem.
Suggestion:
PostgreSQL has a great statement for working with large volumes of data: COPY. It lets us read and write huge amounts of data to and from a table; in this article we will look at the read side.
According to the documentation, COPY gives us several output modes (to a file or to STDOUT) as well as a variety of formats, including csv. We will try to use them to maximum advantage.
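As a quick illustration of the two modes, here is a minimal sketch (the connection string is a placeholder and big_data is the test table created below): COPY can either write a file on the server side or stream to STDOUT, which psycopg2 captures into any file-like object on the client.

from StringIO import StringIO  # Python 2.7

import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")  # placeholder connection
cursor = conn.cursor()

# COPY ... TO STDOUT: the server streams csv to the client and psycopg2
# writes it into whatever file-like object we pass in.
buf = StringIO()
cursor.copy_expert("COPY big_data TO STDOUT (FORMAT 'csv', HEADER true)", buf)
print buf.getvalue()[:200]  # the first characters of the csv output

# COPY ... TO 'file': the server itself writes the file; the path is
# server-side and (in 9.3) requires superuser rights.
# cursor.execute("COPY big_data TO '/tmp/big_data.csv' (FORMAT 'csv', HEADER true)")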
Preparation:
As a "Guinea pig" we will create a table with 1 million records and will write a small script that reflects the essence of the method. The Sql file can be found in my git repository (you can find the link at the bottom of the article).
Also don't forget to install the psycopg2 extension!
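The actual SQL for generating the data is in the repository; purely as a sketch of what such a script might look like (only the table name big_data and the fname column come from the example below, the other column and the generated values are made up):

import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")  # placeholder connection
cursor = conn.cursor()

# Create the test table and fill it with 1 million generated rows.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS big_data (
        id      serial PRIMARY KEY,
        fname   text,
        payload text
    )
""")
cursor.execute("""
    INSERT INTO big_data (fname, payload)
    SELECT 'name_' || mod(i, 1000), md5(i::text)
    FROM generate_series(1, 1000000) AS i
""")
conn.commit()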
Implementation:
To fetch the data we will use the wonderful copy_expert method, which allows us to run COPY queries from the Python client.
query = """
SELECT * from big_data big_data as inner join t1 USING(fname)
"""
output = StringIO()
self.cursor.copy_expert("COPY (%s) TO STDOUT (FORMAT 'csv', HEADER true)" % query, output)
data = output.getvalue()
output.close()
result = list()
for item in getResults(data):
# do whatever we need
item = {k: None if v == "" else v for k, v in item.items()}
result.append(item)
Explanations to the code:

- In the query I join the table with itself to make it more complex (I have noticed that the gain in speed is directly proportional to the complexity of the query);
- As a buffer object I use StringIO, into which the data from the cursor is written;
- The resulting string is parsed by the getResults generator;
- For ease of interpretation I convert all empty strings to None, because after COPY we only get string values;
- I also want to mention that I use the csv format with a leading header row; the reason will become clear a little later.
Generator code:
import csv
from StringIO import StringIO  # Python 2.7


def getResults(stream):
    """
    Yield the rows of the csv stream one by one as dictionaries.
    """
    f = StringIO(stream)
    reader = csv.DictReader(f, restkey=None)
    for item in reader:
        yield item
    f.close()
Explanations:

- As you can see from the listing, I again use the now-familiar StringIO buffer;
- To convert the csv strings into dictionaries I use DictReader from the standard csv library. By default it takes the first row as the list of dictionary keys (a small sketch of this follows below).
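A minimal illustration of that default behaviour (the column names fname and age and the values are made up for the example):

import csv
from StringIO import StringIO  # Python 2.7

# The first line of the buffer is the header row produced by COPY ... (HEADER true);
# DictReader turns it into the dictionary keys.
buf = StringIO("fname,age\nalice,30\nbob,\n")
for row in csv.DictReader(buf, restkey=None):
    # Every value comes back as a string; empty fields are "" (hence the None conversion above)
    print row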
That's all we need!
My configuration: MacBook Air 2013, processor: 1.3 GHz Intel Core i5, RAM: 4 GB 1600 MHz DDR3, SSD.
PS:
I want to note that this approach does not always speed up reading: if you have a fairly simple table of 3-5 fields, you will not notice a tangible difference (at least up to about 1 million records). However, with complex queries this method shows a huge gain in speed, reaching 10-20 times! The result is also heavily influenced by the hardware the script runs on.
The entire code can be found in the git repository https://github.com/drizgolovicha/python_bulk_read.
I would be glad to hear comments and suggestions for improvement!
Thanks for reading to the end.
UPD:
The results of measurements on a sample of 14k records (a rough timing sketch follows the list):

- Direct SELECT with a WHERE condition on a non-indexed field: 21.4
- COPY of the previous query: 13.1
- The same SELECT, but from a materialized view with an index on the field: 12.6
- COPY of the materialized view: 1.8
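For reference, a minimal sketch of how such a comparison could be timed (the connection string and the WHERE value are placeholders; big_data and fname come from the example above):

import time
from StringIO import StringIO  # Python 2.7

import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")  # placeholder connection
cursor = conn.cursor()

query = "SELECT * FROM big_data WHERE fname = 'john'"  # condition on a non-indexed field

# 1) Plain SELECT + fetchall()
start = time.time()
cursor.execute(query)
rows = cursor.fetchall()
print "SELECT: %.1f s" % (time.time() - start)

# 2) The same query streamed through COPY ... TO STDOUT
start = time.time()
buf = StringIO()
cursor.copy_expert("COPY (%s) TO STDOUT (FORMAT 'csv', HEADER true)" % query, buf)
data = buf.getvalue()
print "COPY:   %.1f s" % (time.time() - start)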