Introduction
Sometimes SQL programmers come up with a requirement to use multi-valued columns or variables in an SQL query especially when the data
is originated from external sources. Since there is no builtin support for an array or list or a similar datatype in t-SQL, quite often
folks use delimited VARCHAR strings to lump multiple values together.
We all know that having multiple values in a single column is against the fundamental tenet of relational model since an attribute in a
relation can have only one value drawn from an appropriate type. For more information on such elementary details ( including a fundamental
exposition of a type, its meaning and the distinction between internal and external predicates ) refer to a good relational database text
book.
SQL tables, by their very nature, do not allow multiple values in its columns. However, users can interpret a single string as a
combination of smaller strings often delimited by a character say a comma or a space. Therefore, by leveraging the existing string
functions, programmers can extract such smaller parts from the concatenated string.
It is very important to note that some of the methods are kludgy while some appear to violate the fundamentals of data independence
altogether. Such methods are only suggested here due to its notoriety and are not recommended to be used for production mode systems. If
you are using any methods that are undocumented in the product manual, use them with utmost discretion and all relevant caveats apply.
Basic recommendations for using proper datatype conversion techniques, avoiding positional significance for columns etc must be
considered for stable production code.
Considerations
Coming to the methods of parsing an array and using it for data manipulations, the lion's share of "array" usage in stored procedure comes
to insert data as multiple rows into tables. There are other instances where badly designed data represented in a table has to be cleaned
up and queried.
The following sections illustrate a variety of methods one can employ to identify and enlist subsections of a string represented in a
variable, parameter or even as a column value in a table. The examples use a comma separated list, commonly known as a CSV string where
the value is represented as: 'item-1,item-2,item-3....item-n'. In practice, you can use any character including a space to delimit and
improvise the methods accordingly.
For the examples below, a few customer identifiers are randomly chosen from the Customers table in the Northwind database.
DECLARE @p VARCHAR(50)
SET @p = 'ALFKI,LILAS,PERIC,HUNGC,SAVEA,SPLIR,LONEP,GROSR'
Northwind is a sample database in SQL Server 2000 default installations. You can download a copy from the
Microsoft Downloads
Direct Data comparison
For simple comparisons, there is no need for complicated routines. The inherent pattern matching features in Transact SQL can be used
directly in most cases. One prime example is to pass in a list of values and use it in the IN list of a WHERE clause. Here are some common
methods:
Using the CHARINDEX function:
SELECT CustomerID, ContactName, CompanyName
FROM Northwind.dbo.Customers
WHERE CHARINDEX( ',' + CustomerID + ',', ',' + @p + ',' ) > 0 ;
CustomerID ContactName CompanyName
---------- ------------------------------ ------------------------------
ALFKI Maria Anders Alfreds Futterkiste
GROSR Manuel Pereira GROSELLA-Restaurante
HUNGC Yoshi Latimer Hungry Coyote Import Store
LILAS Carlos González LILA-Supermercado
LONEP Fran Wilson Lonesome Pine Restaurant
PERIC Guillermo Fernández Pericles Comidas clásicas
SAVEA Jose Pavarotti Save-a-lot Markets
SPLIR Art Braunschweiger Split Rail Beer & Ale
(8 row(s) affected)
Using pattern matching with PATINDEX:
SELECT CustomerID, ContactName, CompanyName
FROM Northwind.dbo.Customers
WHERE PATINDEX( '%,' + CustomerID + ',%', ',' + @p + ',' ) > 0 ;
Using LIKE operator for pattern matching
SELECT CustomerID, ContactName, CompanyName
FROM Northwind.dbo.Customers
WHERE ',' + @p + ',' LIKE '%,' + CustomerID + ',%' ;
String Parsing
In many cases, you may want to use the parsed list of values as a resultset that can be used in subsequent operations. For instance, for
larger lists, it may be more effective to use a JOIN rather using an IN() clause. Another common scenario is the case of multi-row inserts
where the list is parsed and the individual elements are inserted using a single INSERT statement.
Solutions using a table of numbers
In most cases with larger strings, the faster solutions are often the ones using a table of sequentially incrementing numbers. However,
the performance assertions in general should be taken with a grain of salt since without testing it is almost impossible to conclude which
method performs better than another.
A table of monotonically increasing numbers can be created in a variety of ways. Either a base table or a view or any expression that can
create a sequence of numbers can be used in these scenarios. Some of the common methods to create a table of numbers can be found in the
article Creating a table of numbers.
The general idea here is to use the sequence value as the second argument for the SUBSTRING() function and then use the WHERE clause to
limit the number of splits using a terminating criteria. Note than this general logic can be written in a variety of ways.
SELECT SUBSTRING( ',' + @p + ',', n + 1,
CHARINDEX( ',', ',' + @p + ',', n + 1 ) - n - 1 ) AS "value"
FROM Nbrs
WHERE SUBSTRING( ',' + @p + ',', n, 1 ) = ','
AND n < LEN( ',' + @p + ',' ) ;
Similar logic from a query in a public newsgroup posting by Linda Wiezbecki.
SELECT SUBSTRING( p, n + 1,
CHARINDEX( ',', p, n + 1 ) - ( n + 1 ) ) AS "value"
FROM Nbrs
JOIN ( SELECT ',' + @p + ',' ) D ( p )
ON SUBSTRING( p, n, LEN( p ) ) LIKE ',_%' ;
Here are a couple of methods that not only parses the delimited list, but returns the position of each item in the list as well. This one
is using a self join and is popularized by Joe Celko in SQL newsgroups.
postings.
SELECT SUBSTRING( ',' + @p + ',', MAX( n1.n + 1 ),
n2.n - MAX( n1.n + 1 ) ) AS "value",
COUNT( n2.n ) AS "pos"
FROM Nbrs n1
JOIN Nbrs n2 ON n1.n < n2.n
AND n2.n <= LEN( ',' + @p + ',' ) + 1
WHERE SUBSTRING( ',' + @p + ',', n1.n, 1 ) = ','
AND SUBSTRING( ',' + @p + ',', n2.n, 1 ) = ','
GROUP BY n2.n ;
The following is from a posting by Umachandar Jayachandran that uses the REPLACE function to get the number of commas in the string and
calculate the position of each item value.
SELECT SUBSTRING( p, n + 1, CHARINDEX( ',', p, n + 1 ) - n - 1 ) AS "value",
LEN( SUBSTRING( p, 1, CHARINDEX( ',', p, n + 1 ) ) ) -
LEN( REPLACE( SUBSTRING( p, 1,
CHARINDEX( ',', p, n + 1 ) ), ',', '' )) - 1 AS "pos"
FROM Nbrs
JOIN ( SELECT ',' + @p + ',' ) D ( p )
ON n BETWEEN 1 And LEN( p ) - 1
AND SUBSTRING( p, n , 1 ) = ',' ;
The following method also gives you the positional value and is simpler
SELECT SUBSTRING( @p, n, CHARINDEX( ',', @p + ',', n ) - n ) AS "value",
n + 1 - LEN( REPLACE( LEFT( @p, n ), ',', '' ) ) AS "pos"
FROM Nbrs
WHERE SUBSTRING(',' + @p, n, 1) = ','
AND n < LEN(@p) + 1 ;
With the newly introduced ROW_NUMBER() and RANK() functions in SQL 2005, one could create the positional values much easily like:
SELECT SUBSTRING( ',' + @p + ',', n + 1,
CHARINDEX( ',', ',' + @p + ',', n + 1 ) - n - 1 ) AS "value",
ROW_NUMBER() OVER ( ORDER BY n ) AS "pos"
FROM Nbrs
WHERE SUBSTRING( ',' + @p + ',', n, 1 ) = ','
AND n < LEN( ',' + @p + ',' ) ;
You can wrap any of these methods using a table of sequentially incrementing numbers into a table valued UDF or another stored procedure
and make it more generic, reusable and handy.
With Common Table Expressions
Most of the methods described above can be re-written using common table expressions. Here is one way of using it to get the starting
position and ending position of each element item and then using SUBSTRING to extract the value:
WITH CTE ( pos_begin, pos_end ) AS (
SELECT n1.n, ( SELECT MIN( n2.n )FROM Nbrs n2
WHERE SUBSTRING( ',' + @p + ',' , n2.n , 1 ) = ','
AND n2.n > n1.n )
FROM Nbrs n1
WHERE n1.n <= LEN( ',' + @p + ',' ) - 1
AND SUBSTRING( ',' + @p + ',' , n1.n , 1 ) = ',' )
SELECT SUBSTRING( @p , pos_begin, pos_end - pos_begin - 1 ) AS "Value"
FROM CTE ;
Another concise approach with Common table expressions is to use a recursive CTE. It does not use a
table of sequentially incrementing numbers, however it can be a bit inefficient for larger strings.
Here is an example.
WITH CTE ( pos, pos_begin, pos_end ) AS (
SELECT 0, 1, CHARINDEX( ',', @p + ',' )
UNION ALL
SELECT pos + 1, pos_end + 1, CHARINDEX( ',', @p + ',', pos_end + 1 )
FROM CTE
WHERE CHARINDEX( ',', @p + ',', pos_end + 1 ) > 0 )
SELECT pos + 1, SUBSTRING( @p, pos_begin , pos_end - pos_begin ) AS "value"
FROM CTE
OPTION ( MAXRECURSION 0 ) ;
As mentioned before, any of the above mentioned parsing routines can be written as a table valued user defined function or a view and can
be used directly in the queries like:
SELECT CustomerID, ContactName, CompanyName
FROM Northwind.dbo.Customers c
JOIN dbo.udf_parsed_list () p
ON c.CustomerID = p.value ;
Methods that replace the delimiters
Apart from the above mentioned approaches there are certain other tricks which can be used in Transact SQL. Basically the following
methods uses Dynamic SQL, a bit different from traditional queries, but can be used as an approach for smaller string parsing
requirements in certain cases.
Here is an example using Dynamic SQL with IN list in the WHERE clause
DECLARE @SQLx NVARCHAR(4000)
SET @SQLx = N'
SELECT CustomerID, ContactName, CompanyName
FROM Northwind.dbo.Customers
WHERE CustomerID IN ( ''' + REPLACE( @p, N',', N''',''' ) + N''' )' ;
EXEC sp_ExecuteSQL @SQLx ;
This is another illustration of replacing the element delimiters with ' UNION SELECT '. The results can be stored in a table variable or table variable and then can be used in subsequent operations.
DECLARE @tbl TABLE ( val VARCHAR(10) NOT NULL PRIMARY KEY );
DECLARE @SQLx VARCHAR(8000)
SET @SQLx = 'SELECT ''' + REPLACE( @p, ',', ''' UNION SELECT ''') + ''''
INSERT @tbl EXEC( @SQLx ) ;
SELECT val FROM @tbl ;
A similar approach is proposed by Alejandro Mesa in a public newsgroup posting that uses separate INSERT statements as well. Here is an
example where @t is a valid table created prior to execution:
DECLARE @SQLx VARCHAR(8000)
SET @SQLx= 'INSERT ' + @t + ' VALUES (' +
REPLACE( @p, ',', ' ) INSERT ' + @t + ' VALUES (') + ')'
EXEC ( @SQLx ) ;
XML solutions
OPENXML function
An approach that is getting much attention, is the OPENXML method. This method, thought may not be ideal for larger datasets, can be
effectively used for relatively small number of items, especially if the XML string is generated from a client application. The basic
idea is to pass the values as an XML document to the stored procedure instead of a CSV. Here is an example. Note that you can use the
REPLACE function to change a CSV string to XML format, say within a stored procedure for instance.
DECLARE @doc VARCHAR(500)
DECLARE @XMLDoc INT
SET @doc = '
<ROOT>
<Customer pos="1" id="ALFKI"></Customer>
<Customer pos="2" id="LILAS"></Customer>
<Customer pos="3" id="PERIC"></Customer>
<Customer pos="4" id="HUNGC"></Customer>
<Customer pos="5" id="SAVEA"></Customer>
<Customer pos="6" id="SPLIR"></Customer>
<Customer pos="7" id="LONEP"></Customer>
<Customer pos="8" id="GROSR"></Customer>
</ROOT>'
EXEC sp_xml_preparedocument @XMLDoc OUTPUT, @doc ;
SELECT pos, id
FROM OPENXML ( @XMLDoc , '/ROOT/Customer', 1 )
WITH ( pos INT, id VARCHAR(5) ) ;
EXEC sp_xml_removedocument @XMLDoc ;
One could generate the xml document from the csv variable with only the id values easily like:
SELECT '
<ROOT>
<Customer id="' +
REPLACE( @p, ',', '"></Customer>
<Customer id="') + '"></Customer>
</ROOT>'
The nodes() method
This is a SQL Server 2005 only feature using the xml datatype. You can use the nodes() method to shred the data that can be mapped to a
new row. The value() method can be applied as shown below to extract the individual elements:
DECLARE @doc XML;
SET @doc = '
<ROOT>
<Customer pos="1" id="ALFKI"></Customer>
<Customer pos="2" id="LILAS"></Customer>
<Customer pos="3" id="PERIC"></Customer>
<Customer pos="4" id="HUNGC"></Customer>
<Customer pos="5" id="SAVEA"></Customer>
<Customer pos="6" id="SPLIR"></Customer>
<Customer pos="7" id="LONEP"></Customer>
<Customer pos="8" id="GROSR"></Customer>
</ROOT>'
SELECT D.element.value('@id', 'VARCHAR(5)'),
D.element.value('@pos', 'INT')
FROM @doc.nodes('/ROOT/Customer') AS D ( element )
Simple WHILE loop
Another popular method is to use a procedural WHILE loop. Very popular among programmers by virtue of its simplicity, this method is not
much efficient for larger datasets. Here is an example:
WHILE LEN( @param ) > 0 BEGIN
IF CHARINDEX( ',', @param ) > 0
SELECT @val = LEFT( @param, CHARINDEX( ',', @param ) - 1 ) ,
@param = RIGHT( @param, LEN( @param ) - CHARINDEX( ',', @param ) )
ELSE
SELECT @val = @param, @param = SPACE(0)
EXEC('INSERT tbl VALUES (' + @val + ')' )
END
Conclusion
Be judicious and apply common sense while using any of the string parsing routines in t-SQL. Each of the methods described above has
certain benefits in some scenarios and may not be appropriate for all situations. If you need additional detailed analysis of these and a
few other methods including performance considerations, consider the articles in the references section.
References
Acknowledgements
Erland Sommarskog, Umachandar Jayachandran, Linda Wierzbecki, Joe Celko, Alejandro Mesa
|