Regular Expression, to use or not to use...
This message was discovered on microsoft.public.dotnet.general.
Tom
I have struggled for a long time with the question of whether or not to use Regular Expressions, and after implementing many text-manipulation solutions both ways, I've found that writing specialized code instead of an RE is almost always the better solution. Here is why.
RE's are complex. Sure, it is one line of code, but it is one hell of a line. Some of my REs remind me of the obfuscated code contest winners, where one line of cryptic characters generates the 12 Days of Christmas. After you have worked out a complex RE, it is almost impossible for anyone (including yourself) to figure out what it is really doing. Less code is not necessarily simpler code. Plus it is not like writing this "one line" of code is saving you any time. I have had really complex REs that took me hours to write and that I needed external tools to help me with. Sure, it was only one line of code in the end, but I could have written a page of easy-to-follow code in half the time. Which brings me to my next point.
RE's are not debug-able. If I have a page of well written code I (or anyone else) can easily step through it. When I send my 150 char RE into the RE engine it is a black box. I am just left to wonder why it didn't work.
Now I am sure you RE masters out there are just calling me a Dumb***: "Just learn REs better and you won't have any problems." But the reality is that I can assume that anybody looking at my code is proficient in C++/C#, because it is a C++/C# program; I can't assume that they are an RE guru.
It's difficult to do some things in RE. REs work great for some searches but not so well for others. I usually find myself having to get a bunch of results back and then doing some other processing on them to get the text out, which means special code anyway. Finally, I don't know what you guys are saying about REs ever being faster. In my experience REs are slow, very slow: on the order of 10 times slower than straightforward string parsing code. For me this is the final kicker. Originally I went through all the trouble of using REs in all my complex text parsing code because I thought it would be faster. This could be a result of the .NET engine; regardless, I was really disappointed by this, and it caused me to roll back a lot of my solutions.
With all this said, I still use REs sometimes, but only for really simple string operations where I can come up with the expression in a couple of seconds, I don't care about performance, and I don't feel like writing 5-10 lines of code. To me it is kind of like a quick hack.
Anyways that's my 2 bits.
Thomas
Niki Estner
"Tom" wrote: [Original message clipped]
Yesterday somebody asked a string matching question on the C# newsgroup. I've copied two of the answers here. The C# version:
    // The start Position
    int Start = InputName.IndexOf(":")+1;
    // The end position
    int End = InputName.IndexOf(" T");
    // The output
    string OutputName = InputName.Substring(Start,End - Start);
    // Clear off leading / trailing spaces
    OutputName = OutputName.Trim();

The Regex version:

    Match m = Regex.Match(inputString, "From: (.*) To:");
    if (m.Success)
        OutputName = m.Groups[1];
I may be a bad C# programmer, but if I read the former piece of source code I wouldn't have a clue what it was doing, even though every line is commented and the code is rather simple. The RE version, on the other hand, seems pretty self-explanatory to me.
> Plus it is not like writing this "one line" of code is saving you any time.
It does. Look at the above code snippets and think about how much time the poor plain-C#-guy will spend debugging, trying to figure out why people like "James T. Kirk" don't get their email, and why his program sometimes simply crashes...
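A minimal sketch of the two failure modes mentioned above, assuming an input such as "From: James T. Kirk To: Spock". The class name, method names and sample strings are made up for illustration; only the snippet bodies come from the thread (with `.Value` added so the regex version compiles):

```csharp
using System;
using System.Text.RegularExpressions;

static class NameExtraction
{
    // The IndexOf-based snippet from the thread, wrapped in a method.
    public static string WithIndexOf(string input)
    {
        int start = input.IndexOf(":") + 1;
        int end = input.IndexOf(" T");   // also matches the " T" in " T. Kirk"
        return input.Substring(start, end - start).Trim();
    }

    // The regex snippet from the thread; returns null when nothing matches.
    public static string WithRegex(string input)
    {
        Match m = Regex.Match(input, "From: (.*) To:");
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        string input = "From: James T. Kirk To: Spock";
        Console.WriteLine(WithIndexOf(input)); // prints "James"
        Console.WriteLine(WithRegex(input));   // prints "James T. Kirk"

        // Malformed input: both IndexOf calls return -1, so Substring throws.
        try { WithIndexOf("no header here"); }
        catch (ArgumentOutOfRangeException) { Console.WriteLine("IndexOf version threw"); }

        Console.WriteLine(WithRegex("no header here") == null); // prints "True"
    }
}
```

The regex version reports a miss through `m.Success`, while the IndexOf version has to check for -1 explicitly before calling `Substring`.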
[Original message clipped]
Interesting. Could you post such an RE, and the plain code?
[Original message clipped]
Putting it in a RegEx testing program like Expresso and removing parts of it usually does a similar job.
[Original message clipped]
On the other hand, you usually can't copy C# code to some other program and simply run it there, because it is coupled with other parts of the project.
[Original message clipped]
That applies to pretty much any tool I can think of...
[Original message clipped]
Depends on what you're doing, and how you're doing it. If you are searching for a long pattern like "Thomas Jefferson" in a long string (> 100 characters), the RE is about 10 times faster than a culture-invariant IndexOf.
[Original message clipped]
If it's in the 10% of the code that uses 90% of the time, you should of course choose the fastest code you can get.
[Original message clipped]
So what?
> To me it is kind of like a quick hack.
A quick hack with full error handling, that's usually much more stable and robust than those untested "5-10 lines of code"...
Niki
Tom
"Niki Estner" wrote: [Original message clipped]
Okay, there are various problems with this example.
First of all, this is one of the extremely trivial examples that I might actually use a Regular Expression for. Consider an RE like this:
^(?:(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00)))(\/|-|\.)(?:0?2\1(?:29))$)|(?:(?:1[6-9]|[2-9]\d)?\d{2})(\/|-|\.)(?:(?:(?:0?[13578]|1[02])\2(?:31))|(?:(?:0?[1,3-9]|1[0-2])\2(29|30))|(?:(?:0?[1-9])|(?:1[0-2]))\2(?:0?[1-9]|1\d|2[0-8]))$
And tell me what it does. Better yet, it doesn't work 1% of the time; tell me why. This is an actual RE, btw. You have to imagine that the person who made this didn't type it up and have it work in one shot, in a couple of seconds. Sure, you can use a 3rd party tool to decipher this, one that costs extra money and that nobody else probably has or is familiar with, but why? So I have something that will search the string 10x slower?
Next the C# code is incorrect. It should really be this.
    // The start Position
    int Start = InputName.IndexOf("From:")+"From:".Length;
    // The end position
    int End = InputName.IndexOf("To:",Start);
    // The output
    string OutputName = InputName.Substring(Start,End - Start);
    // Clear off leading / trailing spaces
    OutputName = OutputName.Trim();

What's hard to understand about these 3 lines of code? The methods are clearly named, and better yet, if you really don't understand it, step through it and see. Which brings me back to my point: at least if that code wasn't working I could easily find out why.
Next your RE is broken too. Consider the following input.

    From:John To:Mike
    From: JohnTo:Mike
    From:John To:Mike
    From: To:Bob
    From:To:Bob
    From: John To:Mike
    From:John To:Mike
    From: John To:Mike
Your RE will only match 3 of these strings. Which ones? Sounds like you need to do some debugging also? I was able to easily debug your C# code.
For the sake of argument, let's correct that too.
The Regex version:

    Match m = Regex.Match(inputString, "From:(\w*)(.*)(\w*)To:");
    if (m.Success)
        OutputName = m.Groups[1];

Even this simple RE gets a little more complex and cryptic.
Okay, but there is another thing overlooked here.
For the input string "From:John To:Mike", the C# code will return the string
"John"
and the RE will return "From:John To:"
Looks like I need to run my C# code anyway?! Now I have done 2 string operations and the 2 lines have turned into about six. This is the extra processing that I refer to with REs. REs usually require 2 steps: a match, then some extra processing on the match. Sure, you can do this with the whole prefix/postfix syntax, but again this "simple" RE is getting more and more complex.
Also, neither of these implementations takes really poorly formatted input into consideration. With the C#, since you are deep in the process, you could throw errors like "There is no 'To' after 'From' on line 3", instead of just missing matches.
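As a rough sketch of what such error reporting could look like (the `HeaderParser` class, the `ParseSender` method and the choice of `FormatException` are assumptions for illustration, not code from the thread):

```csharp
using System;

static class HeaderParser
{
    // Hypothetical helper: returns the text between "From:" and "To:",
    // and throws a descriptive error instead of failing silently.
    public static string ParseSender(string line, int lineNumber)
    {
        const string from = "From:";
        const string to = "To:";

        int fromIndex = line.IndexOf(from, StringComparison.Ordinal);
        if (fromIndex < 0)
            throw new FormatException("There is no 'From:' on line " + lineNumber);

        int start = fromIndex + from.Length;
        int end = line.IndexOf(to, start, StringComparison.Ordinal);
        if (end < 0)
            throw new FormatException("There is no 'To:' after 'From:' on line " + lineNumber);

        return line.Substring(start, end - start).Trim();
    }

    static void Main()
    {
        Console.WriteLine(ParseSender("From:John To:Mike", 1)); // prints "John"

        try { ParseSender("From:John", 3); }
        catch (FormatException e) { Console.WriteLine(e.Message); }
        // prints "There is no 'To:' after 'From:' on line 3"
    }
}
```

The explicit -1 checks are what turn a crash in `Substring` into a message that names the missing piece and the line.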
Let's Chalk this up to a bad example and move on.
[Original message clipped]
Initially I ran into this while building an RE to extract all the links out of an HTML page. Because of NDAs and copyrights I don't feel comfortable posting the code. But this should be an easy one, right? My RE code to do it is about a page, mostly because of having to post-process the results. It turns out to be about the same amount of code as the C# string operations, actually, except the C# code runs about 10x faster, because it doesn't have to do any auxiliary copying or post-processing. It finds exactly what it needs, extracts it, and never goes over a single byte in the string twice.
[Original message clipped]
Again, now I need a 3rd party tool. I have a good debugging tool for C#; why not use that?
[Original message clipped]
It's true I can't copy it into a Perl program or something, but who cares? As far as moving C# code around, it's no problem; why would a string operation be coupled to anything? The RE code is C# code in the end too. This is kind of a moot point.
[Original message clipped]
Yeah, that's true. Like I said, I still use them occasionally, as a hack, when I don't care about performance, it actually saves me some code, and I am not really scared of what happens when my "untested" RE silently fails to match.
[Original message clipped]
Wrong, I tried that exact example. Please show me where you get this data.
    string input="ndvskjdfsnkvjfdsdsfjvkdfsnklvjndfsjkvldfsnjvndfsnjvklfdjslvdfsnklvdfskvdfs Thomas Jefferson mvkdskl;dfskl;vmdfskvlmfdskldfsmkl;vmfdsmkldfskl;vm";
    {
        DateTime start=DateTime.Now;
        Regex rex=new Regex("Thomas Jefferson");
        for(int i=0;i<1000000;i++)
        {
            Match m=rex.Match(input);
        }
        DateTime end=DateTime.Now;
        TimeSpan ts=end-start;
        Console.WriteLine(ts.TotalMilliseconds.ToString());
    }

    {
        DateTime start=DateTime.Now;
        for(int i=0;i<1000000;i++)
        {
            int imp=input.IndexOf("Thomas Jefferson");
        }
        DateTime end=DateTime.Now;
        TimeSpan ts=end-start;
        Console.WriteLine(ts.TotalMilliseconds.ToString());
    }
The RE took over 2,000 ms where the IndexOf branch took only 300 ms. This is about 5x slower. If you could show me some code where the RE was actually FASTER I would love to see it. Or better yet, how to fix this example; maybe I am just missing something.
But again, the time usually doesn't come from the match itself but from what you have to do to the match, like in the From:To: example. The extra time it takes to copy an intermediate string and process that overshadows all of this. A custom-tailored algorithm will just never do this.
Also this is an extremely simple RE, no |'s or complex expressions. Also, most of the time in complex examples I don't use IndexOf but rather walk the string char by char and maintain a state machine. Yes, sometimes this is more complex, but this complexity is usually proportional to the complexity of the RE, and again I can debug it.
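For readers unfamiliar with the char-by-char approach, here is a deliberately simplified sketch of a single-pass state machine that collects double-quoted `href` values. It is not the code referred to in this post (which was never shown); `LinkScanner`, `ExtractHrefs` and the two-state design are assumptions, and it ignores single quotes, whitespace around `=`, and anything else a real HTML scanner would need:

```csharp
using System;
using System.Collections.Generic;

static class LinkScanner
{
    enum State { Text, InValue }

    // Hypothetical helper: one pass over the input, no intermediate strings
    // apart from the extracted values themselves.
    public static List<string> ExtractHrefs(string html)
    {
        const string marker = "href=\"";
        var links = new List<string>();
        State state = State.Text;
        int matched = 0;      // number of marker characters matched so far
        int valueStart = 0;   // index of the first character of the current value

        for (int i = 0; i < html.Length; i++)
        {
            char c = html[i];
            if (state == State.Text)
            {
                char lower = char.ToLowerInvariant(c);
                if (lower == marker[matched])
                    matched++;
                else
                    matched = (lower == marker[0]) ? 1 : 0; // 'h' may start a new marker

                if (matched == marker.Length)
                {
                    state = State.InValue;
                    valueStart = i + 1;
                    matched = 0;
                }
            }
            else if (c == '"')
            {
                // Closing quote: emit the value and go back to scanning text.
                links.Add(html.Substring(valueStart, i - valueStart));
                state = State.Text;
            }
        }
        return links;
    }

    static void Main()
    {
        string html = "<a href=\"a.html\">A</a> <A HREF=\"b.html\">B</A>";
        foreach (string link in ExtractHrefs(html))
            Console.WriteLine(link); // prints "a.html", then "b.html"
    }
}
```

Every state transition is an ordinary statement, so a breakpoint can be set on any of them, which is the debuggability argument made above.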
So again I'm looking at harder to write and debug and about 5x slower.
[Original message clipped]
Agreed, this is where using REs is just not an option.
[Original message clipped]
It's a hack because it DOESN'T have full error handling and is dog slow. When it fails, sure, it doesn't crash (neither does IndexOf); it just doesn't return a match. Which leaves the RE "untested" and in need of debugging also. If I am concerned about my parsing code crashing I can wrap it in a try/catch block and just let it pass through, so it works like the RE failing, but it might actually be helpful for it to crash in a debugger so I can see what is going wrong instead of just guessing.
Again, maybe this is just Microsoft's implementation of REs. If the REs were really fast, this would start to make sense to me. I wonder if anyone has information on this.
Tom
> Niki
Niki Estner
"Tom" wrote: [Original message clipped]
I didn't have a closer look at it; it probably does some numeric range check (wild guessing). It doesn't look as if using regular expressions for that problem was a smart decision.
If that's the kind of problem you're thinking of, I'd agree with you. Conventional string parsing would be the better choice here. (Books about REs usually contain samples like these just to show what is *possible* with them.)
[Original message clipped]
I use Expresso and Regulator. Both are free and quite common.
[Original message clipped]
Counting them, apparently ;-)
[Original message clipped]
I can only guess you never spent hours debugging an error that's not reproducible? The code above will simply crash if the input is malformed. Suppose the input was coming in on some kind of server application: it'll run fine in all your test environments, while your customer will complain that it keeps crashing twice a day...
[Original message clipped]
Now *you* are guessing: You don't know what the strings are supposed to look like. I wouldn't be surprised if the spaces were actually required.
[Original message clipped]
Use grouping parentheses. That's not an argument; it's ignorance.
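A minimal sketch of the difference, using the pattern from earlier in the thread. The sample input and the `GroupDemo` class are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupDemo
{
    // Returns only the text captured by the first pair of parentheses,
    // or null when the pattern does not match.
    public static string CaptureSender(string input)
    {
        Match m = Regex.Match(input, "From: (.*) To:");
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        Match m = Regex.Match("From: John To: Mike", "From: (.*) To:");
        Console.WriteLine(m.Value);           // whole match: "From: John To:"
        Console.WriteLine(m.Groups[1].Value); // first group: "John"
    }
}
```

`Match.Value` is the entire matched text, while `Groups[1]` holds only what the first parenthesized sub-expression captured, so no second string operation is needed to strip the prefix and suffix.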
[Original message clipped]
You could, which would turn those 4 lines into a few hundred, if you really considered every possible error. However, if we're comparing apples with apples, the C# version simply crashes if the input is malformed.
> Let's Chalk this up to a bad example and move on.
It's a common everyday example...
Just to extend it a little further: suppose that string matching code had to be localized... Say, to a right-to-left reading language...
[Original message clipped]
Do you use a bitmap editor? A dialog editor? A zip or other compression tool? A tooth-brush??? Or do you do all these tasks with your debugger? I don't see your point...
Different tools for different purposes.
[Original message clipped]
"Reusability"; ever heard the word before?
To give you a clearer example:
    private With ParseWithStatement(){
      Context withCtx = this.currentToken.Clone();
      AST obj = null;
      AST block = null;
      this.blockType.Add(BlockType.Block);
      try{
        GetNextToken();
        if (JSToken.LeftParen != this.currentToken.token)
          ReportError(JSError.NoLeftParen);
        GetNextToken();

        this.noSkipTokenSet.Add(NoSkipTokenSet.s_BlockConditionNoSkipTokenSet);
        try{
          obj = ParseExpression();
          ....
Do you think you could simply copy that kind of code somewhere and debug it, or reuse it? (BTW: This is real string parsing code)
[Original message clipped]
    public static void Main()
    {
        string searchString = "Thomas Jefferson";
        string bigString;
        StringBuilder builder = new StringBuilder();

        const int stringLength = 100;
        const int repeatCount = 100000;

        Random rnd = new Random(1234);

        for (int i=0; i<stringLength; i++)
            builder.Append((char)(rnd.Next()%10+'0'));

        bigString = builder.ToString();

        CompareInfo info = CultureInfo.InvariantCulture.CompareInfo;

        {
            Console.WriteLine("Testing String.IndexOf:");
            long start_time = DateTime.Now.Ticks;
            for (int i=0; i<repeatCount; i++)
            {
                int index = info.IndexOf(bigString, searchString);
            }
            long end_time = DateTime.Now.Ticks;
            Console.WriteLine(new TimeSpan(end_time - start_time));
        }

        {
            Console.WriteLine("Testing Regex.Match:");
            Regex regex = new Regex(searchString);
            long start_time = DateTime.Now.Ticks;
            for (int i=0; i<repeatCount; i++)
            {
                Match m = regex.Match(bigString);
            }
            long end_time = DateTime.Now.Ticks;
            Console.WriteLine(new TimeSpan(end_time - start_time));
        }

        Console.ReadLine();
    }
You may play with the constants: REs use the Boyer-Moore algorithm, which is about O(N/M) in the best case, while an ordinary String.IndexOf is about O(N). The longer the strings in question are, the faster the regex gets compared to IndexOf.
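The two benchmarks in this thread do not measure the same call: one uses `String.IndexOf(string)`, which compares using the current culture, and the other uses the invariant culture's `CompareInfo.IndexOf`. A hedged sketch that times both next to an ordinal search and a regex, using `Stopwatch` rather than `DateTime.Now`, might look like the following. The `SearchBenchmark` class, the helper names and the constants are assumptions, and the timings depend on the runtime and machine, so no results are claimed here:

```csharp
using System;
using System.Diagnostics;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

static class SearchBenchmark
{
    // Builds a string of random digits with the needle appended at the end,
    // so every search has to scan the whole haystack before it succeeds.
    public static string BuildHaystack(int length, string needle)
    {
        var rnd = new Random(1234);
        var sb = new StringBuilder(length + needle.Length);
        for (int i = 0; i < length; i++)
            sb.Append((char)('0' + rnd.Next(10)));
        sb.Append(needle);
        return sb.ToString();
    }

    // Runs the search 'repeat' times and returns the elapsed milliseconds.
    public static long Time(int repeat, Func<int> search)
    {
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < repeat; i++)
            search();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        const string needle = "Thomas Jefferson";
        const int repeat = 100000;
        string haystack = BuildHaystack(1000, needle);

        var regex = new Regex(Regex.Escape(needle));
        CompareInfo invariant = CultureInfo.InvariantCulture.CompareInfo;

        Console.WriteLine("Regex.Match:        {0} ms",
            Time(repeat, () => regex.Match(haystack).Index));
        Console.WriteLine("Invariant IndexOf:  {0} ms",
            Time(repeat, () => invariant.IndexOf(haystack, needle)));
        Console.WriteLine("Current-culture:    {0} ms",
            Time(repeat, () => haystack.IndexOf(needle)));
        Console.WriteLine("Ordinal IndexOf:    {0} ms",
            Time(repeat, () => haystack.IndexOf(needle, StringComparison.Ordinal)));
    }
}
```

Varying the haystack length and the needle length is the experiment suggested above; the ordinal row is included only as an extra point of comparison.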
[Original message clipped]
We've had that. Look up "grouping parentheses".
> Also this is an extremely simple RE, no |'s or complex expressions.
Right. Simple ones are fast ones. What did you expect?
[Original message clipped]
You're comparing apples with oranges here.
There are many cases where using a state machine or a recursive parser is far superior to an RE; no one doubts that. But sometimes REs are simply the better solution.
Niki