Recently I ran into contention issues while loading files into a Redshift database with the psycopg2 library for Python. I was processing each file in S3 through a multi-threaded pool, and it took a ridiculous amount of time. Within the pool, I reused a single connection rather than creating a new one for each worker.
The psycopg2 library behaves in a way that may not be intuitive if you are unfamiliar with it: transactions are connection-specific. When you create a connection, a transaction begins when the first cursor issues an execute. Every command executed by every cursor after that is part of the same transaction until you commit or roll back. This is the default behavior for the connection unless you enable autocommit (conn.autocommit = True).
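psycopg2 follows the Python DB-API transaction model, so the same semantics can be sketched with the standard library's sqlite3 module, used here only so the example runs without a database server (the demo table is made up; in psycopg2 the autocommit switch is conn.autocommit = True, where sqlite3 spells it isolation_level = None):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for psycopg2.connect(...)
conn.execute("CREATE TABLE demo (id INTEGER)")
conn.commit()

# Two cursors on the SAME connection share one transaction.
cur_a = conn.cursor()
cur_b = conn.cursor()
cur_a.execute("INSERT INTO demo VALUES (1)")
cur_b.execute("INSERT INTO demo VALUES (2)")
conn.rollback()  # undoes BOTH inserts, from both cursors

after_rollback = conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
print(after_rollback)  # 0 -- all cursors were inside one transaction

# Autocommit mode: each statement commits immediately on its own.
conn.isolation_level = None  # psycopg2 equivalent: conn.autocommit = True
conn.execute("INSERT INTO demo VALUES (3)")
conn.rollback()  # nothing to roll back -- the insert already committed

after_autocommit = conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
print(after_autocommit)  # 1 -- the autocommitted row survived the rollback
```

This is exactly the trap in the pool scenario: every worker's statements pile into the one shared transaction, and any lock taken is held until someone finally commits or rolls back.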
Essentially, my pool of Python workers were fighting with one another because they all shared the same connection and kept locking the table. I probably could have solved the problem either by enabling autocommit or by creating a new connection for every process in the pool.
However, while investigating the issue I found that the Redshift COPY command inherently leverages the massively parallel processing (MPP) architecture to read and load data in parallel from files in Amazon S3. My data is split into per-minute log files: roughly 60 files per hour, 24 hours per day, from about 14 domains, so I receive over 800 files per hour. This just happens to be the perfect layout for Redshift's parallel query execution, which scales with the number of slices in your cluster.
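That means a single COPY pointed at the common hour-level S3 prefix lets Redshift split the files across slices itself. A sketch of what that looks like, where the table name, bucket path, and IAM role ARN are placeholders and the format options depend on your files:

```sql
-- One COPY per hour: Redshift fans the ~800 files under this
-- prefix out across the cluster's slices and loads them in parallel.
COPY access_logs
FROM 's3://my-log-bucket/logs/2016/05/01/13/'  -- hour-level prefix
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
GZIP
DELIMITER '\t';
```

Because COPY treats the FROM value as a key prefix, every minute file under that hour is picked up by the one command, with no client-side pooling at all.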
In the end, I dropped the pooled processing and let the Redshift architecture handle it, reusing the S3 structure I already had and copying a full hour of log files at a time. I saw order-of-magnitude performance gains: over 80 megabytes of compressed data comprising more than 800 files loaded in as little as 6 seconds.