Regrouping is something that I've had to do many times during my career in data analytics, unfortunately a lot of relationships aren't defined through a single unique identifier, at least not at first. This code snippet is designed to flush out the transitive relationships defined between pairs of elements and assign all elements in the graph a single unique identifier. Get ready for some SQL looping action...
The code below is simplistic and I'm sure there are better more efficient ways to do this so please let me know in the comments. The example is written in T-SQL (Microsoft SQL Server) and makes use of analytics functions. If you're using another dialect of SQL, it is possible to do the same thing using group by statements or equivalent analytics functions but they're not detailed here.
Before I delve into the code, I'll take a minute to explain what the algorithm does and why it's so useful to have something equivalent in your toolbox.
Calculating two way comparisons is logically straightforward, A=B?, A>B?, A<=B… in each we deal with a pair of elements and from the comparison we get a binary answer, True or False. As such, coding comparisons between pairs is conceptually straightforward and a natural choice. Capturing the results is easy:
However, what if we add another relationship comparing A and E:
And this is where this piece of code becomes useful. In a procedural language, we can capture relationships between these elements in a data structure designed for it, for example a tree or graph. In SQL, we're stuck with tables, and so while it's easy to see the relationships between A and B, and A and E, it is not easy to see the relationship between A and E. Using a regrouping technique, we can store a single 'equivalence' identifier that can be used to identify all equivalent elements in the dataset.
To illustrate how the code works I've included an example set of elements, shown in the graph below and re-created in the table #pairs:
create table #pairs ( Input1 nvarchar(1), Input2 nvarchar(1) ) insert into #pairs values ('A', 'B') insert into #pairs values ('B', 'C') insert into #pairs values ('C', 'D') insert into #pairs values ('C', 'F') insert into #pairs values ('D', 'E') insert into #pairs values ('G', 'H') insert into #pairs values ('G', 'I')
Let's assign an identifier to each pair of elements using the row_number function, it's important that we use a numerical identifier for the pair, as we will rely upon a comparison later to update the pair number as part of the regrouping.
select *, row_number() over (order by Input1, Input2) as pair into #pairs_indexed from #pairs
We need to transpose the pairs so that each pair occurs over two rows, one for each element. This is most easily done using a union all.
select Input1 as Id, Pair into #pairs_transposed from #pairs_indexed union all select Input2 as Id, Pair from #pairs_indexed
Now we start the looping, this is the real code snippet. The loop will continue indefinitely until we call a break to stop looping. This should only be used when it is guaranteed that the break statement will be called in an acceptable amount of time, otherwise your database will be going forever.
Note that steps 1 and 2 are run as a single query
DECLARE @rowcount INT WHILE (1=1) BEGIN if OBJECT_ID('tempdb..#minimums') is not null drop table #minimums -- step 2: Map pair numbers to minimum pair numbers select pair, min(min_pair) as min_pair into #minimums from ( -- step 1: For each element Id, what is the lowest pair number select id, pair, min(pair) over(partition by id) as min_pair from #pairs_transposed ) x group by pair -- step 3: Are there any left where pair <> min_pair, if not, we're done select @rowcount = count(*) from #minimums where pair <> min_pair if @rowcount = 0 begin break end -- step 4: Update pairs with minimum pairs update ex set pair = min_pair from #pairs_transposed ex inner join #minimums mins on ex.pair = mins.pair END GO
You can run each statement by itself and watch as the algorithm converges on a solution. You'll see that the algorithm follows a breadth-first approach and so it will take longer to converge when the chains of equivalent elements are deep rather than wide.
Try it out and let me know what improvements you can make.
By Niall Napiercomments powered by Disqus