Home >> Transact SQL >> Character replacements in t-SQL

Character replacements in t-SQL

Introduction

Quite often SQL programmers are left with the dirty job of working with badly formatted strings mostly generated from external sources. Typical examples are badly structured date values, social security numbers with misplaced hyphens, badly formatted phone numbers etc. When the data set if small, in many cases, one can easily fix by a one time cleanup code snippet, but for larger sets one will need more generalized routines.

Caveats

A general term "cleanup" does not do justification to the examples detailed below. All format routines are dependent upon the specific business rules defined at the conceptual model. Also, for a large set of characters that are to be changed, it makes sense to use a table with those characters. That way one can join this table to the table that has the actual data and use set based logic for data cleanups.

It is quite important to realize that the string manipulation capabilities in Transact SQL are pretty limited. So given an option where the formatting routine is application specific, it is always recommended that the formatted should be done at the application level keeping the data source intact. However if the string formatting is a general requirement often used in multiple applications or the requirement is to keep the string data at the source to be represented in a well formatted fashion, it makes sense to do the routines using SQL.

Illustrations

The following examples are using a telephone number. As it sounds, phone numbers are numeric values. However, they are represented in the various formats. For instance US phone numbers are often formatted as (123) 456-7890 and there may be a rule that requires this string to be formatted as 123456789 for a variety of reasons.
DECLARE @str VARCHAR(15)
SET @str = '(123) 456-7890'
1. Using a series of REPLACE() functions

Here each REPLACE is used to change one unwanted character to a blank space or SPACE(0). The nesting of REPLACE function in recent version of the SQL Server product can go to hundreds of levels.
SELECT REPLACE( REPLACE( REPLACE( REPLACE( 
	@str, '(', '' ), ')', '' ), ' ', '' ), '-', '' );
2. Using SUBSTRING in a series of CASE expressions

This is a very blank approach where each of the unwanted character is identified using a SUBSTRING function and then replaced with a space one by one.
SELECT CASE WHEN SUBSTRING( @str, 1, 1 ) LIKE '[0-9]' 
            THEN SUBSTRING( @str, 1, 1 ) ELSE '' END + 
       CASE WHEN SUBSTRING( @str, 2, 1 ) LIKE '[0-9]' 
            THEN SUBSTRING( @str, 2, 1 ) ELSE '' END + 
	   .....
       CASE WHEN SUBSTRING( @str, 14, 1 ) LIKE '[0-9]' 
            THEN SUBSTRING( @str, 14, 1 ) ELSE '' END ;
3. Using REPLACE in a WHILE loop

Here the REPLACE function is applied to each character in a loop. One way we can make this more generic is by wrapping the code snippet in a scalar user defined function.

WHILE PATINDEX( '%[^0-9]%', @str ) > 0 
	SET @str = REPLACE( @str, SUBSTRING( @str, 
			PATINDEX( '%[^0-9]%', @str ), 1 ), '' ) 
SELECT @str
4. Using the "blackbox" XML method

This approach is named blackbox since nobody seems to know how the order of evaluation of concatenating is done with the FOR XML clause. So far it seems to work reasonably and is suggested unofficially by vendor as an option.
; WITH CTE( s ) AS (
SELECT CASE WHEN SUBSTRING( @str, n, 1 ) LIKE '[0-9]' 
	   THEN SUBSTRING( @str, n, 1 )
	   ELSE '' END + '~'
  FROM Nbrs
 WHERE LEN( @str ) >= n
 ORDER BY n
   FOR XML PATH('') )
SELECT REPLACE( s, '~', '' )
  FROM CTE ;
Illogical methods that "seems to" work:

This is the notorious aggregate operation using SELECT statement that seems to internally loop through the rows when a column value is used along with a variable that is used on both sides of the assignment.
SELECT @Str = REPLACE( @Str , chr , '' )
  FROM ( SELECT '(' UNION 
	 SELECT ')' UNION 
	 SELECT ' ' UNION 
	 SELECT '-' ) D ( chr )
 WHERE CHARINDEX( chr , @Str ) > 0 ;
SELECT @str ;
Another method in the same illogical category is to use a t-SQL UPATE extension. This appears to be usable only when the characters to be replaced are represented in a non-derived construct say a table, #temp table or a table variable.
SET NOCOUNT ON
DECLARE @t TABLE ( chr CHAR(1) NOT NULL )
INSERT @t SELECT ')' UNION
	  SELECT '(' UNION
	  SELECT ' ' UNION 
	  SELECT '-' ;
UPDATE @t 
   SET @str = REPLACE( @str, chr, '' ) 
 PRINT @str ;
Conclusion

Even though only replacing certain characters with a space has been illustrated in the above mentioned examples, there are several other string manipulation requirements like proper casing etc. that can use similar approaches. As with any other string manipulation routines, if the requirement is application specific or rarely used, it makes sense to leverage the features of the client side language to do them.

Acknowledgements

Umachandar Jayachandran