RegexClean Transformation

by Darren Green 11 Oct 2008 05:27

Use the power of regular expressions to cleanse your data right there inside the Data Flow. This transformation includes a full user interface for simple configuration, as well as advanced features such as error output configuration.

Two regular expressions are used, a match expression and a replace expression. The transformation is designed around the named capture groups or match groups, and even supports multiple expressions. This allows for rich and complex expressions to be built, all through an easy to reuse transformation where a bespoke Script Component was previously the only alternative.

Some simple properties are available for each column selected –

Behaviour
The two behaviour modes offer similar functionality but with a difference. Replace, replaces tokens with the input, and Emit overwrites the whole string.

Cascade
Cascade allows you to define multiple expressions, each on a new line. The match expression will be processed into one operation per line, which are then processed in order at run-time. Multiple replace expressions can also be specified, again each on a new line. If there is no corresponding replace expression for a match expression line, then the last replace expression will be used instead. It is common to have multiple match expressions, but only a single replace expression.

Match Expression
The expression used to define the named capture groups. This is where you can analyse the data, and tag or name elements within it as found by the match expression.

Replace Expression
The replace determines the final output. It will reference the named groups from the match expression and assembles them into the final output.

If you want to use regular expressions to validate data then try the Regular Expression Transformation.

Quick Start Guide

  • Select a column. A new output column is created for each selected column; there is no option for in-place replacement of column values. One input column can be used to populate multiple output columns, just select the column again in the lower grid, using the Input Columns drop-down selector.
  • Amend the output column name and size as required. They default to the same as the input column selected.
  • Amend the behaviour as required, the default is Replace.
  • Amend the cascade option as required, the default is true.
  • Finally enter your match and replace regular expressions
  • Quick Sample #1

    Parse an email address and extract the user and domain portions. Format as a web address passing the user portion as a URL parameter. This uses two match groups, user and host, which correspond to the text before the @ and after it respectively.

    Behaviour is Emit, and cascade of false, we only have a single match expression.

    Match Expression ^(?<user>[^@]+)@(?<host>.+)$

    Replace Expression - http://www.${host}?user=${user}

    Results

    Sample Input Sample Output
    zheng0@adventure-works.com http://www.adventure-works.com?user=zheng0

    The component is provided as an MSI file, however to complete the installation, you will have to add the transformation to the Visual Studio toolbox manually. Right-click the toolbox, and select Choose Items.... Select the SSIS Data Flow Items tab, and then check the RegexClean Transformation from the list.

    Downloads

    The RegexClean Transformation is available for both SQL Server 2005 and SQL Server 2008. Please choose the version to match your SQL Server version, or you can install both versions and use them side by side if you have both SQL Server 2005 and SQL Server 2008 installed.

    RegexClean Transformation for SQL Server 2005

    RegexClean Transformation for SQL Server 2008

    Version History

    SQL Server 2005

    Version 1.0.0.105 - Public Release
    (28 Jan 2008)

    SQL Server 2005

    Version 1.0.0.105 - Public Release
    (28 Jan 2008)

    Screenshot

    RegexClean Transformation Editor dialog

    Comments

    1/22/2009 3:12:57 PM #

    A nice component, unfortunately it doesn't seem to be able to do what I want at the moment...

    I would like it to check a column from a Flat File Source to see if it contains a date. Since there's no ISDATE in SSIS I wanted to see if this was an OK workaround. I'm OK with regexp checking for yyyy/mm/dd, even if it's 1981/44/55 - if the system creating the flat file has a date to export it's correct. Otherwise it's either an empty string or some blanks. If there is no date in the column, It's possible to derive that info from another column - but RegexClean don't let you get data from another column...

    Tommy Petersson Sweden

    1/23/2009 11:27:29 AM #

    Tommy

    It sounds like you may just need to validate the column value in which case www.sqlis.com/.../...xpression-Transformation.aspx may be more appropriate.  if your validation fails then on the alternate output you can hand a Derived Column transform which will be used to derive the dae from the other columns you mention.

    Allan Mitchell Germany

    2/20/2009 2:32:29 AM #

    Anyone tried doing a replace with an empty expression?

    I want to strip whitespace out of the input, and don't see any mechanism to specify empty place in RegEx, and the component won't accept an empty expression for a Replace.

    Tim

    2/20/2009 7:56:32 AM #

    I would probably use a Derived Column transform to replace my existing column.  There is a string REPLACE() function in there.

    Allan Mitchell United Kingdom

    2/24/2009 1:43:12 AM #

    Cheers Allan, I had decided Replace was the way to go anyway.
    Another issue: Doing a Replace to replace following characters with a space that I then do the Dervied Replace on:
    [\+\-\(\)]

    The output column value is null though if none of those characters are in input - this doesn't seem right to me.

    Tim

    2/24/2009 1:44:34 AM #

    My temp solution is to append a ' ' at end of column in the query..

    Tim

    3/26/2009 6:26:39 PM #

    Hi, I am trying to remove all alphanumeric characters from the string using [^0-9] in Match section, but couldn't find the correct expression for the Replace section. Thanks,

    Jason United Kingdom

    5/1/2009 3:14:11 PM #

    Any way to access members within the group itself? i.e. if i have a string of xyz.abc.bab.wqrx and i have a match expression of ^(?<stringsonly>[^.]*), I want to access the different members separately i the replace syntax.

    Steven Molen United States

    6/25/2009 7:18:05 PM #

    @Jason

    I did something similar and posted it out on MSDN. (social.msdn.microsoft.com/.../adcdc0f5-9191-4937-97a2-5a63c34fc685)  I didn't find an eloquent way to do it.

    Devo United States

    8/3/2009 9:34:47 PM #

    I have unstructured text in a nvarchar(max) column that I am trying to extract into sepearate fields. I need to search for specific strings e.g. "THC METABOLITE CONFIRMED BY GC/MS 21 NG/ML" and extract the numbers after the GC/MS. I've written a regex \bTHC\sMETABOLITE\sCONFIRMED\sBY\sGC/MS\b.(?<amt>\d{4})to search for my string. How do I get the RegEx Clean to replace the string with the 21. The RegEx transformation correctly identifies every occurence, however I can not get the RegEx Clean to work. Any suggestions would be appreciated

    Derrick

    8/6/2009 10:22:21 PM #

    Built a Script Destination Component to put text into the data stream and used the RegEx clean in emit mode with the cascade option to search 34 specific patterns and replaced them all. Strings are similar to the ones in my previous post (\bTHC\sMETABOLITE\sCONFIRMED\sBY\sGC/MS\b.(?<amt>\d{4}). The transform works well.

    Good job!

    Derrick

    9/25/2009 5:44:28 PM #

    Tool has been very useful- only my [mis]understanding of regexp getting in my way.
    Although, I have yet to figure out how to remove a row once it's been created.
    I am reading an http log file and culling out several data elements from the string. One of them is causing an error (I guess my regexp is incorrect) but I can't figure out how to remove that field from the transform in order to move forward.

    Any help is appreciated.

    Eric United States

    11/14/2009 12:26:02 PM #

    datenfabrik.email

    datenfabrik.email

    SSIS-Components.net

    1/27/2010 7:05:50 PM #

    This works great for what I need to do involving dates.  However, I recieve an error when the column is null and the next output is to a derived column (I'm combining the date and time to a datetime after formatting the date and time in regex). "[RegexClean] Error: ocessing with an exception. Value cannot be null. Parameter name: input"


    Any words?

    Kyle United States

    1/28/2010 7:55:14 PM #

    Devo,

    I have written a SSIS component called RegExtractor that does what you would like to do.  You can specify a regular expression to operate on any text column and the captured matches of that regex get output to new columns.  The SSIS component is free and open source.  You can get it at:

    http://regextractor.codeplex.com

    Hope this helps (I realize I'm late in my response)

    Eric

    Eric United States

    3/20/2010 2:26:23 AM #

    Can someone expand upon what the Cascade option does? I've tried playing with it, but it doesn't produce any different results whether the box is checked or unchecked. I'm hoping someone can provide an example as to when you would or would not use it.
    Thanks.

    Spencer United States

    8/10/2010 2:19:02 AM #

    Have you adapted this for VS 2010 yet?

    andrew Canada

    8/10/2010 1:09:58 PM #

    Sorry, I got it installed in the right place.  Really nice utility, thanks.

    andrew Canada

    Add comment




      Country flag

    biuquote * Required fields. Your email address will not shown, check the preview for what you see. We use it to send you notifications on new comments if you ask us to below.
    • Comment
    • Preview
    Loading




    Welcome to SQLIS.com our free SQL Server Integration Services (SSIS) resource site.

    MVP

    RecentComments

    Comment RSS