Last week I wrote about how to find duplicate rows in a database table. This week I'm going to show you how to delete them.

I'm going to assume that you want to keep at least one row of the duplicates for the purposes of this.

Remember our table structure had three columns:

  • TestID
  • DataA
  • DataB

And we had some issue where DataA and DataB had duplicated data even though TestID's were unique:

How do we delete these rows? We need another query. This one works great in Postgres:

view plain print about
1delete from test_table a
2using test_table b
3where a.testID < b.testID and a.dataA = b.dataA and a.dataB = b.dataB

And I did something similar with my client project. I am actually using SQL Server on my writing machine and it does not like that syntax. It appears the T-SQL delete does not support the using clause. I switched over to using the EXISTS clause, which uses a subquery:

view plain print about
1DELETE a FROM test_table a
2WHERE EXISTS
3 ( SELECT *
4 FROM test_table b
5 WHERE a.testID >
b.testID and
6a.DataA = b.DataA and
7a.DataB = b.DataB)

Now I can run the query to find duplicates from the previous post

view plain print about
1SELECT DataA, DataB, count(*)
2FROM Test_table
3GROUP BY DataA, DataB
4HAVING count(*) >
1;

You'll see no more duplicates found:

Select the full table:

You'll see that our duplicate entries are no longer displayed.

Next week I'm going to revisit the same problem, but for tables that a composite key and no single primary key.