Home >> Transact SQL >> Faking arrays in t-SQL

Faking arrays in Transact SQL

Introduction

Sometimes SQL programmers come up with a requirement to use multi-valued columns or variables in an SQL query especially when the data is originated from external sources. Since there is no builtin support for an array or list or a similar datatype in t-SQL, quite often folks use delimited VARCHAR strings to lump multiple values together.

We all know that having multiple values in a single column is against the fundamental tenet of relational model since an attribute in a relation can have only one value drawn from an appropriate type. For more information on such elementary details ( including a fundamental exposition of a type, its meaning and the distinction between internal and external predicates ) refer to a good relational database text book.

SQL tables, by their very nature, do not allow multiple values in its columns. However, users can interpret a single string as a combination of smaller strings often delimited by a character say a comma or a space. Therefore, by leveraging the existing string functions, programmers can extract such smaller parts from the concatenated string.

It is very important to note that some of the methods are kludgy while some appear to violate the fundamentals of data independence altogether. Such methods are only suggested here due to its notoriety and are not recommended to be used for production mode systems. If you are using any methods that are undocumented in the product manual, use them with utmost discretion and all relevant caveats apply. Basic recommendations for using proper datatype conversion techniques, avoiding positional significance for columns etc must be considered for stable production code.

Considerations

Coming to the methods of parsing an array and using it for data manipulations, the lion's share of "array" usage in stored procedure comes to insert data as multiple rows into tables. There are other instances where badly designed data represented in a table has to be cleaned up and queried.

The following sections illustrate a variety of methods one can employ to identify and enlist subsections of a string represented in a variable, parameter or even as a column value in a table. The examples use a comma separated list, commonly known as a CSV string where the value is represented as: 'item-1,item-2,item-3....item-n'. In practice, you can use any character including a space to delimit and improvise the methods accordingly.

For the examples below, a few customer identifiers are randomly chosen from the Customers table in the Northwind database.
DECLARE @p VARCHAR(50)
SET @p = 'ALFKI,LILAS,PERIC,HUNGC,SAVEA,SPLIR,LONEP,GROSR'
Northwind is a sample database in SQL Server 2000 default installations. You can download a copy from the Microsoft Downloads

Direct Data comparison

For simple comparisons, there is no need for complicated routines. The inherent pattern matching features in Transact SQL can be used directly in most cases. One prime example is to pass in a list of values and use it in the IN list of a WHERE clause. Here are some common methods:

Using the CHARINDEX function:
    SELECT CustomerID, ContactName, CompanyName
      FROM Northwind.dbo.Customers
     WHERE CHARINDEX( ',' + CustomerID + ',', ',' + @p + ',' ) > 0 ;
     
    CustomerID ContactName                    CompanyName                              
    ---------- ------------------------------ ------------------------------
    ALFKI      Maria Anders                   Alfreds Futterkiste
    GROSR      Manuel Pereira                 GROSELLA-Restaurante
    HUNGC      Yoshi Latimer                  Hungry Coyote Import Store
    LILAS      Carlos González                LILA-Supermercado
    LONEP      Fran Wilson                    Lonesome Pine Restaurant
    PERIC      Guillermo Fernández            Pericles Comidas clásicas
    SAVEA      Jose Pavarotti                 Save-a-lot Markets
    SPLIR      Art Braunschweiger             Split Rail Beer & Ale

    (8 row(s) affected)
Using pattern matching with PATINDEX:
    SELECT CustomerID, ContactName, CompanyName
      FROM Northwind.dbo.Customers
     WHERE PATINDEX( '%,' + CustomerID + ',%', ',' + @p + ',' ) > 0	; 
Using LIKE operator for pattern matching
    SELECT CustomerID, ContactName, CompanyName
      FROM Northwind.dbo.Customers
     WHERE ',' + @p + ',' LIKE '%,' + CustomerID + ',%' ;
String Parsing

In many cases, you may want to use the parsed list of values as a resultset that can be used in subsequent operations. For instance, for larger lists, it may be more effective to use a JOIN rather using an IN() clause. Another common scenario is the case of multi-row inserts where the list is parsed and the individual elements are inserted using a single INSERT statement.

Solutions using a table of numbers

In most cases with larger strings, the faster solutions are often the ones using a table of sequentially incrementing numbers. However, the performance assertions in general should be taken with a grain of salt since without testing it is almost impossible to conclude which method performs better than another.

A table of monotonically increasing numbers can be created in a variety of ways. Either a base table or a view or any expression that can create a sequence of numbers can be used in these scenarios. Some of the common methods to create a table of numbers can be found in the article Creating a table of numbers.

The general idea here is to use the sequence value as the second argument for the SUBSTRING() function and then use the WHERE clause to limit the number of splits using a terminating criteria. Note than this general logic can be written in a variety of ways.
    SELECT SUBSTRING( ',' + @p + ',', n + 1, 
                CHARINDEX( ',', ',' + @p + ',', n + 1 ) - n - 1 ) AS "value"
      FROM Nbrs
     WHERE SUBSTRING( ',' + @p + ',', n, 1 ) = ','
       AND n < LEN( ',' + @p + ',' ) ;
Similar logic from a query in a public newsgroup posting by Linda Wiezbecki.
    SELECT SUBSTRING( p, n + 1, 
                CHARINDEX( ',', p, n + 1 ) - ( n + 1 ) ) AS "value"
      FROM Nbrs
      JOIN ( SELECT ',' + @p + ',' ) D ( p )
        ON SUBSTRING( p, n, LEN( p ) ) LIKE ',_%' ;
Here are a couple of methods that not only parses the delimited list, but returns the position of each item in the list as well. This one is using a self join and is popularized by Joe Celko in SQL newsgroups. postings.
    SELECT SUBSTRING( ',' + @p + ',', MAX( n1.n + 1 ),  
                        n2.n - MAX( n1.n + 1 ) ) AS "value",
           COUNT( n2.n ) AS "pos"
      FROM Nbrs n1
      JOIN Nbrs n2 ON n1.n < n2.n 
       AND n2.n <= LEN( ',' + @p + ',' ) + 1 
     WHERE SUBSTRING( ',' + @p + ',', n1.n, 1 ) = ','
       AND SUBSTRING( ',' + @p + ',', n2.n, 1 ) = ','
     GROUP BY n2.n ;
The following is from a posting by Umachandar Jayachandran that uses the REPLACE function to get the number of commas in the string and calculate the position of each item value.
    SELECT SUBSTRING( p, n + 1, CHARINDEX( ',', p, n + 1 ) - n - 1 ) AS "value",
           LEN( SUBSTRING( p, 1, CHARINDEX( ',', p, n + 1 ) ) ) - 
           LEN( REPLACE( SUBSTRING( p, 1, 
                    CHARINDEX( ',', p, n + 1 ) ), ',', '' ))  - 1 AS "pos"
      FROM Nbrs
      JOIN ( SELECT ',' + @p + ',' ) D ( p ) 
        ON n BETWEEN 1 And LEN( p ) - 1 
       AND SUBSTRING( p, n , 1 ) = ',' ;
The following method also gives you the positional value and is simpler
    SELECT SUBSTRING( @p, n, CHARINDEX( ',', @p + ',', n ) - n ) AS "value",
           n + 1 - LEN( REPLACE( LEFT( @p, n ), ',', '' ) ) AS "pos"
      FROM Nbrs  
     WHERE SUBSTRING(',' + @p, n, 1) = ','
       AND n < LEN(@p) + 1 ;
With the newly introduced ROW_NUMBER() and RANK() functions in SQL 2005, one could create the positional values much easily like:
    SELECT SUBSTRING( ',' + @p + ',', n + 1, 
                CHARINDEX( ',', ',' + @p + ',', n + 1 ) - n - 1 ) AS "value",
           ROW_NUMBER() OVER ( ORDER BY n ) AS "pos"
      FROM Nbrs
     WHERE SUBSTRING( ',' + @p + ',', n, 1 ) = ','
       AND n < LEN( ',' + @p + ',' ) ;
You can wrap any of these methods using a table of sequentially incrementing numbers into a table valued UDF or another stored procedure and make it more generic, reusable and handy.

With Common Table Expressions

Most of the methods described above can be re-written using common table expressions. Here is one way of using it to get the starting position and ending position of each element item and then using SUBSTRING to extract the value:
    WITH CTE ( pos_begin, pos_end ) AS (
        SELECT n1.n, ( SELECT MIN( n2.n )FROM Nbrs n2
                        WHERE SUBSTRING( ',' + @p + ',' , n2.n , 1 ) = ',' 
                          AND n2.n > n1.n ) 
          FROM Nbrs n1
         WHERE n1.n <= LEN( ',' + @p + ',' ) - 1 
           AND SUBSTRING( ',' + @p + ',' , n1.n , 1 ) = ',' )
    SELECT SUBSTRING( @p , pos_begin, pos_end - pos_begin - 1 ) AS "Value"
      FROM CTE ;
Another concise approach with Common table expressions is to use a recursive CTE. It does not use a table of sequentially incrementing numbers, however it can be a bit inefficient for larger strings. Here is an example.
    WITH CTE ( pos, pos_begin, pos_end ) AS (
	    SELECT 0, 1, CHARINDEX( ',', @p + ',' ) 
	     UNION ALL
	    SELECT pos + 1, pos_end + 1, CHARINDEX( ',', @p + ',', pos_end + 1 ) 
	      FROM CTE
	     WHERE CHARINDEX( ',', @p + ',', pos_end + 1 ) > 0 )
    SELECT pos + 1, SUBSTRING( @p, pos_begin , pos_end - pos_begin ) AS "value"
      FROM CTE
    OPTION ( MAXRECURSION 0 ) ;
As mentioned before, any of the above mentioned parsing routines can be written as a table valued user defined function or a view and can be used directly in the queries like:
    SELECT CustomerID, ContactName, CompanyName
      FROM Northwind.dbo.Customers c
      JOIN dbo.udf_parsed_list () p
        ON c.CustomerID = p.value ;
Methods that replace the delimiters

Apart from the above mentioned approaches there are certain other tricks which can be used in Transact SQL. Basically the following methods uses Dynamic SQL, a bit different from traditional queries, but can be used as an approach for smaller string parsing requirements in certain cases.

Here is an example using Dynamic SQL with IN list in the WHERE clause
    DECLARE @SQLx NVARCHAR(4000)
    SET @SQLx = N'
    SELECT CustomerID, ContactName, CompanyName
      FROM Northwind.dbo.Customers  
     WHERE CustomerID IN ( ''' + REPLACE( @p, N',', N''',''' ) + N''' )' ;
    EXEC sp_ExecuteSQL @SQLx ;
This is another illustration of replacing the element delimiters with ' UNION SELECT '. The results can be stored in a table variable or table variable and then can be used in subsequent operations.
    DECLARE @tbl TABLE ( val VARCHAR(10) NOT NULL PRIMARY KEY );
    DECLARE @SQLx VARCHAR(8000)
    SET @SQLx = 'SELECT ''' + REPLACE( @p, ',', ''' UNION SELECT ''') + ''''
    INSERT @tbl EXEC( @SQLx ) ;
    SELECT val FROM @tbl ;
A similar approach is proposed by Alejandro Mesa in a public newsgroup posting that uses separate INSERT statements as well. Here is an example where @t is a valid table created prior to execution:
    DECLARE @SQLx VARCHAR(8000)
    SET @SQLx= 'INSERT ' + @t + ' VALUES (' +
                   REPLACE( @p, ',', ' ) INSERT ' + @t + ' VALUES (') + ')'
    EXEC ( @SQLx ) ;
XML solutions

OPENXML function

An approach that is getting much attention, is the OPENXML method. This method, thought may not be ideal for larger datasets, can be effectively used for relatively small number of items, especially if the XML string is generated from a client application. The basic idea is to pass the values as an XML document to the stored procedure instead of a CSV. Here is an example. Note that you can use the REPLACE function to change a CSV string to XML format, say within a stored procedure for instance.
    DECLARE @doc VARCHAR(500)
    DECLARE @XMLDoc INT
    SET @doc = '
	    <ROOT>
		    <Customer pos="1" id="ALFKI"></Customer>
		    <Customer pos="2" id="LILAS"></Customer>
		    <Customer pos="3" id="PERIC"></Customer>
		    <Customer pos="4" id="HUNGC"></Customer>
		    <Customer pos="5" id="SAVEA"></Customer>
		    <Customer pos="6" id="SPLIR"></Customer>
		    <Customer pos="7" id="LONEP"></Customer>
		    <Customer pos="8" id="GROSR"></Customer>
	    </ROOT>'			
    EXEC sp_xml_preparedocument @XMLDoc OUTPUT, @doc ;
    SELECT pos, id
      FROM OPENXML ( @XMLDoc , '/ROOT/Customer', 1 )
              WITH ( pos INT, id VARCHAR(5) )  ;
    EXEC sp_xml_removedocument @XMLDoc ;
One could generate the xml document from the csv variable with only the id values easily like:
    SELECT '
    <ROOT>
	    <Customer id="' +
            REPLACE( @p, ',', '"></Customer>
	    <Customer id="') + '"></Customer>
    </ROOT>'
The nodes() method

This is a SQL Server 2005 only feature using the xml datatype. You can use the nodes() method to shred the data that can be mapped to a new row. The value() method can be applied as shown below to extract the individual elements:
    DECLARE @doc XML;
    SET @doc = '
	    <ROOT>
		    <Customer pos="1" id="ALFKI"></Customer>
		    <Customer pos="2" id="LILAS"></Customer>
		    <Customer pos="3" id="PERIC"></Customer>
		    <Customer pos="4" id="HUNGC"></Customer>
		    <Customer pos="5" id="SAVEA"></Customer>
		    <Customer pos="6" id="SPLIR"></Customer>
		    <Customer pos="7" id="LONEP"></Customer>
		    <Customer pos="8" id="GROSR"></Customer>
	    </ROOT>'			
    SELECT D.element.value('@id', 'VARCHAR(5)'),
           D.element.value('@pos', 'INT')
      FROM @doc.nodes('/ROOT/Customer') AS D ( element )
Simple WHILE loop

Another popular method is to use a procedural WHILE loop. Very popular among programmers by virtue of its simplicity, this method is not much efficient for larger datasets. Here is an example:
    WHILE LEN( @param ) > 0 BEGIN 
	    IF CHARINDEX( ',', @param ) > 0 
		    SELECT @val = LEFT( @param, CHARINDEX( ',', @param )  - 1 ) ,
		           @param = RIGHT( @param, LEN( @param ) - CHARINDEX( ',', @param ) ) 
	    ELSE 
		    SELECT @val = @param, @param = SPACE(0)
	    EXEC('INSERT tbl VALUES (' + @val + ')' ) 
    END
Conclusion

Be judicious and apply common sense while using any of the string parsing routines in t-SQL. Each of the methods described above has certain benefits in some scenarios and may not be appropriate for all situations. If you need additional detailed analysis of these and a few other methods including performance considerations, consider the articles in the references section.

References

Acknowledgements

Erland Sommarskog, Umachandar Jayachandran, Linda Wierzbecki, Joe Celko, Alejandro Mesa