I've been working on a project, and we realized we had some duplicate rows in our data. We had missed a unique constraint when creating the table, which allowed the duplicate data to be created.

Let's pretend we have a table with a Primary Key: testID, and two data values: DataA and DataB. Something like this:
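Here's a minimal sketch of that setup. The column types and the data values are just illustrative, but the duplicates match the description that follows, and the commented-out UNIQUE constraint is the kind of thing we forgot:

CREATE TABLE Test_table (
    testID INT PRIMARY KEY,
    DataA VARCHAR(50),
    DataB VARCHAR(50)
    -- UNIQUE (DataA, DataB)   -- the missing constraint that would have prevented this
);

INSERT INTO Test_table (testID, DataA, DataB) VALUES
    (1, 'apple',  'red'),
    (2, 'apple',  'red'),      -- duplicates row 1
    (3, 'banana', 'yellow'),
    (4, 'grape',  'purple'),
    (5, 'lime',   'green'),
    (6, 'orange', 'orange'),
    (7, 'plum',   'purple'),
    (8, 'lime',   'green');    -- duplicates row 5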

If you ignore the primary key, you can see from my test data that there are a few rows with duplicate data. The first and second rows have the same values for DataA and DataB, and the fifth and eighth rows have the same issue. Allowing fully duplicate data rows like this violates the basic uniqueness the table should enforce.

Our problem was that these constraints were not in place, and our Flyway database migration scripts were being modified and re-run, re-creating entries with new primary keys but duplicate data. I discovered the issue while doing unrelated work.

First, we need a query to find which rows have duplicate data:

SELECT DataA, DataB, count(*)
FROM Test_table
GROUP BY DataA, DataB
HAVING count(*) > 1;

Select the values that you suspect are duplicated. This query purposely ignores the primary key, which should always be unique. The count is also selected. We GROUP BY those columns, which collapses rows with identical values into groups. Then the HAVING clause, which works like a WHERE clause but lets us filter on the result of an aggregate function, keeps only the groups where the count is greater than 1.

Run this against the table above:
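With the illustrative data from the sketch above, the output would look something like this:

DataA   DataB   count(*)
-----   -----   --------
apple   red     2
lime    green   2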

Now you've found the duplicate rows!

You could use that info to delete them individually, but next week I'm going to share a delete script that you can use.