I've been playing around with SQL and wrote a blog post about how to find duplicate rows in a database table, and one about deleting those duplicate rows.

Last week's post required a table entry that had a primary key so you could determine what to keep. What happens when you are dealing with something like an intersection table that has no primary key of its own?

Take a table like this:

Ideally when you created this table it was set up with proper constraints so duplicate rows cannot exist in that table. But, let's pretend for a second that whoever set this table up neglected to add such a constraint--or worse removed such a constraint.

You can select to show all the duplicate rows:

view plain print about
1SELECT
2 ROW_NUMBER() OVER(ORDER BY FK1 ASC) AS Row#,
3 FK1,
4 FK2,
5 count(*)
6FROM Test_table_xref
7GROUP BY
8 FK1,
9 FK2
10HAVING count(*) >
1;

Run the query to see something like this:

If you want to delete all the duplicate rows you can execute something like this:

While this works it wipes out the duplicate rows entirely--a row is not left behind. Whatever relationship was represented with these rows; it no longer exists. If that is what you need, then you can stop reading now. But, most likely you'll want to leave one row behind from each set of duplicates.

The answer, at least in SQL Server, is to use a Common Table Expression, like this:

view plain print about
1WITH CTE(
2 [FK1],
3 [FK2],
4 duplicatecount)
5AS (SELECT [FK1],
6            [FK2],
7 ROW_NUMBER() OVER(PARTITION BY [FK1],
8            [FK2]
9 ORDER BY FK1, FK2) AS DuplicateCount
10 FROM Test_table_xref)
11SELECT *
12FROM CTE;

A CTE creates a temporary result set. The use of ROW_NUMBER() and PARTION BY gives us which duplicated entry is in the row. Select from the CTE

view plain print about
1SELECT *
2FROM CTE;

You see in this the duplicate count increases with each duplicate row. For 1, 1 you can see it has three entries with the duplicate count incrementing from 1 to 3. For 2, 2 there are two entries with the duplicate count increasing from 1 to 2. All other items have a single entry with the duplicate count of 1.

With that info behind us, you can delete from the CTE using the duplicate count, greater than 1:

You should see a message that 3 rows were deleted. And reselect from the original table:

Your duplicate entries are gone, but the existing relationships are preserved. Great Job!