Query API and De-Duping Data

Hi All,

An open ended conversation and question that I'm sure you have all encountered.

When it comes to pulling data out of the database through the Query API, how do you deal with duplicates? Our team is just diving into the Query API, and while replicating some of our current reports in Power BI we realized that we pull a lot of our data through exports.

Is there an automatic way anyone has found to de-dupe the data, either through Power Automate or Power BI, that will allow us to create clean datasets through the Query API?

Thanks all!

Comments

  • Alex Wong
    Alex Wong Community All-Star
    Ninth Anniversary Kudos 5 PowerUp Challenge #3 Gift Management Name Dropper

    @David Kortes
    There are multiple ways to handle duplicates mindlessly:

    • Power Automate
      • use the union() expression (it drops exact duplicates when combining arrays)
    • Power BI - Power Query
      • Remove Rows > Remove Duplicates based on a column
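    The "remove duplicates on a column" option can be sketched as simple key-based de-duplication. A minimal Python sketch, with hypothetical field names, just to make the mechanics concrete:

    ```python
    # Key-based de-duplication, analogous to Power Query's
    # "Remove Rows > Remove Duplicates" on a chosen column.
    # Field names here are hypothetical.
    def dedupe_by_key(rows, key):
        seen = set()
        out = []
        for row in rows:
            if row[key] not in seen:
                seen.add(row[key])
                out.append(row)
        return out

    rows = [
        {"constituent_id": 1, "name": "Ann"},
        {"constituent_id": 1, "name": "Ann"},
        {"constituent_id": 2, "name": "Bob"},
    ]
    clean = dedupe_by_key(rows, "constituent_id")
    ```

    Note this keeps only the first row per key, which matters when the "duplicates" actually differ in other columns (see the next point).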

    However, I suggest making a more mindful decision about why you are getting duplicate rows in your query. The only reason you get duplicate rows in a query is because you are filtering, outputting, or sorting by a field that has a one-to-many relationship. If you don't handle your query output carefully, you may remove duplicates incorrectly.
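    To illustrate why one-to-many fields cause this, here is a small Python sketch (the data shape is hypothetical): joining a constituent to its phone numbers repeats the constituent once per phone, and naively removing "duplicates" by constituent id silently drops data.

    ```python
    # One constituent, two phone numbers (a one-to-many relationship).
    constituents = [{"id": 1, "name": "Ann"}]
    phones = [
        {"constituent_id": 1, "number": "555-0100"},
        {"constituent_id": 1, "number": "555-0101"},
    ]

    # The join repeats the constituent once per phone: 2 rows for 1 person.
    joined = [
        {**c, "phone": p["number"]}
        for c in constituents
        for p in phones
        if p["constituent_id"] == c["id"]
    ]

    # Naively de-duping on the constituent id keeps only one row,
    # silently losing a phone number.
    naive_dedupe = list({row["id"]: row for row in joined}.values())
    ```

    The rows were never true duplicates; they only look that way if you ignore the one-to-many column.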

  • @Alex Wong
    Thanks for this Alex! We will definitely take a look at how we are currently using queries in our report, especially as we move towards automation.

    We're currently kicking around ideas and just wanted to know what our options were!

  • @David Kortes It looks like I'll be building out something like this soon. My initial idea is to build out a bunch of base table queries with really strong Select/Where filtering, then use the Query merge type to pull from the base tables. That'll eliminate any dupes and let me re-use code. In the end, it's not too far from any other data warehouse/reporting database practices.

    I guess in short I would say: only make queries that pull what you need, and do more of the selection/merges inside the API code itself, not in Power Automate.
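    One way to keep merges duplicate-free by construction, sketched in Python with hypothetical data: roll up the "many" side (gifts) first, so the final merge against the constituent table is one-to-one.

    ```python
    from collections import defaultdict

    # Hypothetical base-table pulls: gifts (one-to-many) and
    # constituents (one row per id).
    gifts = [
        {"constituent_id": 1, "amount": 100.0},
        {"constituent_id": 1, "amount": 250.0},
    ]
    constituents = {1: {"name": "Ann"}}

    # Aggregate the many side before merging, so the merge key is unique.
    totals = defaultdict(float)
    for g in gifts:
        totals[g["constituent_id"]] += g["amount"]

    report = [
        {"name": constituents[cid]["name"], "total": amt}
        for cid, amt in totals.items()
    ]
    ```

    Because the aggregation collapses gifts to one row per constituent, there is nothing left to de-dupe downstream.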

  • Alex Wong

    @Lucien Piekut
    if you are doing anything more than just 1 or 2 reports, and provided that you have a cloud location for the data, you should consider pulling the data without filters:

    • constituent table
      • all one-to-one data points that you may use in future reporting
    • separate tables for one-to-many relationship data
      • email address
      • phone number
      • relationship
      • opportunity
      • attribute/custom field
    • gift table
      • the same separate tables for one-to-many data: soft credits, gift solicitors/fundraisers, gift splits (campaign, fund, appeal, package, split amount), gift attributes/custom fields, etc.

    Then, using a reporting tool such as Power BI or Tableau, you can filter and use only the data needed for certain reports.

    Note: a clear understanding of which data points are one-to-many relationships will ensure your tables do not have duplicates.
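    The table layout above amounts to normalizing each nested API record into one row per constituent plus separate child tables. A Python sketch of that split (the record shape here is hypothetical, not the actual API response):

    ```python
    # Hypothetical nested record, e.g. as returned from an API pull.
    record = {
        "id": 1,
        "name": "Ann",
        "emails": [
            {"address": "ann@example.org"},
            {"address": "ann@work.org"},
        ],
    }

    # One-to-one data goes in the constituent table: exactly one row.
    constituent_row = {"id": record["id"], "name": record["name"]}

    # One-to-many data goes in its own table, keyed back to the constituent.
    email_rows = [
        {"constituent_id": record["id"], "address": e["address"]}
        for e in record["emails"]
    ]
    ```

    Keeping the child rows out of the constituent table is exactly what prevents duplicate constituent rows later.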

  • @Alex Wong Sorry, what I meant re: method is that I do not pull all reporting warehouse/BI data daily, or on demand.

    For building a warehouse I keep a 1:1 set as you mention, but I build it for adds/appends only, to reduce load on the server and keep queries fast.

    For ad-hoc queries via the API (for example, a table that needed grades for Semester 2), I would filter Grade entries for Semester 2, then use existing DB indexes to pull the related tables.

    In short: pull only what's needed, when it's needed. (This is really the most common issue I see with new data/dev folks, so I hope to steer anyone away from that practice!)
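    The add/append approach above boils down to filtering by a last-modified watermark instead of re-pulling everything. A minimal Python sketch with hypothetical field names:

    ```python
    from datetime import date

    # Incremental (add/append) pull: keep only rows modified since the
    # last successful sync. Field names are hypothetical stand-ins.
    def rows_modified_since(rows, last_sync):
        return [r for r in rows if r["date_modified"] > last_sync]

    rows = [
        {"gift_id": 1, "date_modified": date(2024, 1, 5)},
        {"gift_id": 2, "date_modified": date(2024, 3, 1)},
    ]
    new_rows = rows_modified_since(rows, date(2024, 2, 1))
    ```

    After appending the new rows you would advance the watermark to the time of this sync, so the next pull is just as small.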

  • Alex Wong

    @Lucien Piekut
    maybe you have already thought of this or implemented a workaround, but in case anyone else is reading, here are some considerations:

    • just doing add/append may not be good enough if you only fetch last-modified records. For example:
      • constituent record id 1 has a gift record id 99
      • gift record id 99 is soft-credited to constituent record id 10
      • constituent record id 10 is found to be a dup of record id 100, is merged into record id 100, and gets deleted
      • gift record id 99 will not be considered modified, so it would not come up when pulling records modified since your last gift pull
      • in your data warehouse, gift record id 99 will still have soft credit id 10, which is no longer a valid constituent record
      • your constituent table will still have constituent record id 10, out of sync with RE NXT where that record has been deleted (unless you have another arrangement, such as a constituent delete webhook that removes the record from the data warehouse)
    • I agree that whether to pull "all" data depends on need. However, most custom fundraising reporting built on a data warehouse is going to need the full constituent and full gift tables.
      • even if an org thinks it will not report on gifts from more than 10 years ago, if it ever wants to report on "new donors", "consecutive donors", or "long-term donors", it will need the full gift history to make those determinations.
      • other data tables like opportunity, relationship, etc. can be pulled in full or in part depending on need.
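    The merge/delete scenario above suggests a periodic reconciliation pass over the warehouse: flag gift rows whose soft-credit constituent id no longer exists in the constituent table. A Python sketch with hypothetical data mirroring the example:

    ```python
    # Ids currently present in the warehouse constituent table:
    # id 10 was merged into id 100 and deleted upstream.
    constituent_ids = {1, 100}

    # Warehouse gift rows; gift 99 still points at the deleted id 10.
    gifts = [
        {"gift_id": 99, "constituent_id": 1, "soft_credit_id": 10},
    ]

    # Reconciliation: find gifts referencing a constituent that no
    # longer exists, so they can be re-pulled or remapped.
    stale = [g for g in gifts if g["soft_credit_id"] not in constituent_ids]
    ```

    Flagged rows could then be refreshed from the API regardless of their last-modified date, which closes the gap that a pure add/append sync leaves open.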
