Regular expressions (regexes or REs) are strings that describe
patterns of characters in other strings. They are effective tools
for searching and manipulating text and are built-in features of
Perl, Python and many other scripting languages. They are a
ubiquitous aspect of programming and computer science, but were
absent from the standard Java API until now. This article
discusses the java.util.regex package, which is a new
feature of J2SE 1.4.
Package java.util.regex provides regular
expression-based pattern matching. It contains a
Pattern class and a Matcher class.
Matcher instances use Pattern instances to
find, replace and otherwise compare a regex to an input character
sequence.
The Pattern class is a regex container. It is
instantiated by "compiling" an expression with
Pattern pat = Pattern.compile("x*yz*");
Alternatively, a static convenience method compiles an expression and tests it against an input string in one call:
boolean isMatch = Pattern.matches("x*yz*", "xxxyzzz");
Both examples compile a pattern that matches any number of 'x' characters (including none) followed by a single 'y' character and then by any number of 'z' characters. The strings "y", "xxy" and "xyz" are valid matches for the pattern.
The matches() method in the second example is a
shortcut that compiles the expression and matches it in one step. For one-time tests of a pattern, the
static matches() method of the Pattern class
eliminates the need to instantiate a Matcher and call
its matches() method. However, because it does not allow a
compiled expression to be reused, it is less efficient when the match
is repeated several times.
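The trade-off between the two approaches can be sketched as follows. This is a minimal illustration, not code from the example application; the class name PatternReuseDemo and the helper countMatches() are hypothetical names chosen for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternReuseDemo {
    // Compiled once and reused for every input.
    private static final Pattern XYZ = Pattern.compile("x*yz*");

    // Counts the inputs that match the whole pattern, reusing the compiled Pattern.
    static int countMatches(String[] inputs) {
        int count = 0;
        for (String input : inputs) {
            Matcher m = XYZ.matcher(input);
            if (m.matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] inputs = { "y", "xxy", "xyz", "xz", "yy" };
        // Repeated matching: one compile, many matchers.
        System.out.println(countMatches(inputs)); // prints 3

        // One-time test: the static shortcut compiles and matches in a single call.
        System.out.println(Pattern.matches("x*yz*", "xxxyzzz")); // prints true
    }
}
```

The inputs "xz" and "yy" fail because the pattern requires exactly one 'y'.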
The regex grammar recognized by the Pattern class is
similar in many respects to Perl. The list below shows some of its
common expressions.
Summary of Regex Grammar
| Expression | Matches |
| . | any character (except line terminators, by default) |
| x | character x |
| x? | zero or one of character x |
| x* | zero or more of character x |
| \t \n \r \f | tab, line feed, carriage-return, form-feed |
| \d \D | digit, non-digit |
| \s \S | whitespace, non-whitespace |
| \w \W | word, non-word |
| [a-z] | the lowercase characters a through z inclusively |
| [A-Z] | the uppercase characters A through Z inclusively |
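| ^ | the beginning of a line (by default, the beginning of the entire input; MULTILINE mode makes it match at each line) |
| $ | the end of a line (by default, the end of the entire input; MULTILINE mode makes it match at each line) |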
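A few of the expressions in the table can be tried out with a short program. This is a sketch under the default matching flags; the class name GrammarDemo and the helper full() are hypothetical, and the last two lines assume the standard Pattern.MULTILINE flag to show how it changes the anchors.

```java
import java.util.regex.Pattern;

class GrammarDemo {
    // True when the regex matches the entire input string.
    static boolean full(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(full("colou?r", "color"));      // true: 'u' is optional
        System.out.println(full("colou?r", "colour"));     // true
        System.out.println(full("\\d\\d\\s\\w", "42 a"));  // true: digit, digit, whitespace, word character
        System.out.println(full("[A-Z][a-z]*", "Java"));   // true: one uppercase, then lowercase letters
        System.out.println(full(".", "\n"));               // false: '.' skips line terminators by default

        // Without MULTILINE, ^ and $ anchor to the whole input, not to each line.
        System.out.println(Pattern.compile("^b$").matcher("a\nb").find());                    // false
        System.out.println(Pattern.compile("^b$", Pattern.MULTILINE).matcher("a\nb").find()); // true
    }
}
```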
A Matcher object is instantiated by invoking the
matcher() method of a Pattern instance.
Matchers are used to search or manipulate a specified string.
Suppose a programmer needs to swap one substring for another. He
might choose to deconstruct the string either with a
StringTokenizer or with methods in the
String class. He would have to write logic to
disassemble the string, swap the tokens and reassemble the string.
Alternatively, in C fashion, he could "walk" the string a
character at a time and make the appropriate character
substitutions. He would have to look ahead and account for shifting
characters if the original substring and its replacement were not the same
length. Matcher objects avoid this complexity: they replace one
substring with another in a single call.
// replace 'are' with 'is'
Pattern pat = Pattern.compile("are");
Matcher mat = pat.matcher("Java are fun.");
String sentence = mat.replaceAll("is"); // 'Java is fun.'
In addition to substitution, the Matcher class provides
methods that find the next matching substring, test complete and
partial string matches and return matched substrings.
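The other Matcher operations mentioned above might be exercised as in the sketch below. The class name MatcherMethodsDemo and the helper findAll() are hypothetical; find(), group(), matches() and lookingAt() are the Matcher methods being illustrated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherMethodsDemo {
    // Collects every substring of the input that matches the regex.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {          // advance to the next matching substring
            hits.add(m.group());    // the text of the match just found
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\d+", "call 555-3141 or 555-3147"));
        // prints [555, 3141, 555, 3147]

        Matcher m = Pattern.compile("\\d+").matcher("555-3141");
        System.out.println(m.matches());   // false: the hyphen prevents a complete match
        System.out.println(m.lookingAt()); // true: the input starts with a match
        System.out.println(m.group());     // prints 555
    }
}
```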
The split() method, also new in J2SE 1.4, is a
regex-based addition to the String class. It is similar
to the split routine in Perl. It uses an input regex as a delimiter
and breaks the contents of a string into an array of
strings. This method is useful for parsing character-delimited text
files or user input. To parse a colon-delimited line from a
user password file, one could use:
String x = "joe:x:670:500::/home/joe:/bin/false";
String arr[] = x.split(":");
The contents of arr[] are
{ "joe", "x", "670", "500", "", "/home/joe", "/bin/false" }.
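One detail worth checking when parsing delimited records is how empty fields are treated. The sketch below, with the hypothetical class SplitDemo and helper names of its own, shows that split() keeps empty fields in the middle of a record but drops trailing ones unless a negative limit is passed to the two-argument overload.

```java
import java.util.Arrays;

class SplitDemo {
    // Splits on colons; trailing empty fields are discarded.
    static String[] parse(String line) {
        return line.split(":");
    }

    // Splits on colons and keeps trailing empty fields (negative limit).
    static String[] parseKeepingTrailing(String line) {
        return line.split(":", -1);
    }

    public static void main(String[] args) {
        String[] fields = parse("joe:x:670:500::/home/joe:/bin/false");
        System.out.println(fields.length);                   // prints 7
        System.out.println(fields[4].isEmpty());             // prints true: the empty middle field is kept
        System.out.println(fields[5]);                       // prints /home/joe

        System.out.println(Arrays.toString(parse("a:b::")));                // prints [a, b]
        System.out.println(Arrays.toString(parseKeepingTrailing("a:b::"))); // prints [a, b, , ]
    }
}
```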
Often, websites relegate form validation to the browser and JavaScript. Authors of heavily trafficked sites may choose browser-side form validation to offload server-side processing. Problems can arise when users disable scripting in their browsers, and server cycles may ultimately be traded for developer cycles, because the JavaScript is a separate codebase that must be maintained. With regular expressions, server-side form validation is easily implemented.
Consider a small web application developed in anticipation of an area code change from 318 to 543. The user enters his name and phone number. A servlet or JSP verifies the proper format of the name and phone number and checks a data source to see if the area code is changing. If so, it displays the phone number with the new area code. A war file of this application is available for download. Below is the bean that implements its regular expression logic.
package com.ociweb.jnb;

import java.util.regex.*;

public class RegexValidate {
    private String name = "";
    private String phone = "";
    private String response =
        "Your name and phone number are not on record.";

    // area code 318 will change; 314 will not
    private static final String OLD318 = "318";
    private static final String NEW318 = "543";

    // precompiled patterns
    // notice double escape of special characters
    private Pattern phonePattern =
        Pattern.compile("\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d");
    private Pattern namePattern =
        Pattern.compile("^[A-Z]\\s[A-Z][a-z]*");
    private Pattern newAreaCodePattern =
        Pattern.compile("^31\\d");

    private String nameList[] = {
        // "name:phoneNumber"
        "S White:314-555-3141",
        "J Smith:318-555-3147",
        "C Brown:314-555-6543",
        "T Hogan:318-555-3180",
        "E James:636-555-8970"
    };
    private int NUMNAMES = nameList.length;

    public RegexValidate() { }

    public void setName(String name) {
        this.name = name;
        if (!isValidName())
            throw new IllegalArgumentException("Invalid Name.");
    }

    public void setPhone(String phone) {
        this.phone = phone;
        if (!isValidPhone())
            throw new IllegalArgumentException("Invalid Phone Number.");
    }

    // the whole input must be in the form ddd-ddd-dddd
    private boolean isValidPhone() {
        Matcher phoneMat = phonePattern.matcher(this.phone);
        return phoneMat.matches();
    }

    // the whole input must be an initial, a space and a capitalized last name
    private boolean isValidName() {
        Matcher nameMat = namePattern.matcher(this.name);
        return nameMat.matches();
    }

    public String getResponse() {
        boolean found = false;
        String nameData[] = new String[2];
        for (int ctr = 0; ctr < NUMNAMES && !found; ctr++) {
            nameData = this.nameList[ctr].split(":");
            if (nameData[0].equals(this.name) &&
                nameData[1].equals(this.phone)) {
                found = true;
                this.response = "Your phone number, " +
                    this.phone + ", will not change.";
                // check old and new area code
                Matcher mat = newAreaCodePattern.matcher(this.phone);
                if (mat.find()) {
                    if (mat.group().equals(OLD318)) {
                        this.response = "Your new phone number is " +
                            mat.replaceAll(NEW318);
                    }
                }
            }
        }
        return this.response;
    }
}
The isValidPhone() and isValidName()
methods use precompiled patterns to limit allowed input for the
phone number and name fields. The first constrains phone numbers to
a 10-digit hyphenated format. The second ensures that names are
properly capitalized and entered as first initial space last name.
The getResponse() method uses the split()
method to deconstruct a list of names and phone numbers. The split
phone numbers are matched against a pattern that selects the area
code. While the match may be performed against each record in the
data source, all the matches are against the same pattern, so the
code is more efficient if the pattern is compiled outside of the
getResponse() method and reused.
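The validation rules can be tried in isolation with a short sketch. The class ValidationDemo below is a hypothetical stand-alone version written for illustration; it reuses the two patterns from the bean but is not part of the example application.

```java
import java.util.regex.Pattern;

class ValidationDemo {
    // Same expressions as the bean, compiled once and reused.
    private static final Pattern PHONE =
        Pattern.compile("\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d");
    private static final Pattern NAME =
        Pattern.compile("^[A-Z]\\s[A-Z][a-z]*");

    // matches() succeeds only if the entire input fits the pattern.
    static boolean isValidPhone(String phone) {
        return PHONE.matcher(phone).matches();
    }

    static boolean isValidName(String name) {
        return NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidPhone("318-555-3180"));  // true
        System.out.println(isValidPhone("3185553180"));    // false: hyphens are required
        System.out.println(isValidPhone("318-555-31800")); // false: extra trailing digit
        System.out.println(isValidName("J Smith"));        // true
        System.out.println(isValidName("j smith"));        // false: not capitalized
        System.out.println(isValidName("John Smith"));     // false: only a first initial is allowed
    }
}
```

Because matches() tests the whole input, the leading ^ in the name pattern is redundant here; it would matter only if the pattern were used with find().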
The pattern '^31\\d' matches both '314' and '318'. Since 318 is the
only area code that changes, the example could use a pattern that
matched only 318. The partial match pattern was chosen to show the
versatility of the regex implementation and to demonstrate the
group() method. The example uses the
group() method of the Matcher class to
return the substring of the last match. It then checks the substring
and calls replaceAll() if the substring is '318'. The
replaceAll() method changes the area code to '543'.
Since the pattern only matches digits at the start of the line,
replaceAll() does not replace any '318' substrings that
occur later in the string.
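The area code logic described above can be sketched on its own. AreaCodeDemo and updateAreaCode() are hypothetical names for this illustration; the pattern and the find(), group() and replaceAll() calls mirror those in the bean.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AreaCodeDemo {
    // Anchored pattern: matches '31' plus one digit at the start of the input only.
    private static final Pattern AREA_CODE = Pattern.compile("^31\\d");

    // Returns the phone number with area code 318 changed to 543, or unchanged otherwise.
    static String updateAreaCode(String phone) {
        Matcher m = AREA_CODE.matcher(phone);
        if (m.find() && m.group().equals("318")) {
            // replaceAll() resets the matcher and rescans the input from the start;
            // the anchor limits the replacement to the leading area code.
            return m.replaceAll("543");
        }
        return phone;
    }

    public static void main(String[] args) {
        System.out.println(updateAreaCode("318-555-3180")); // prints 543-555-3180
        System.out.println(updateAreaCode("314-555-3141")); // prints 314-555-3141
        System.out.println(updateAreaCode("636-555-8970")); // prints 636-555-8970
    }
}
```

In the first call, the '318' inside '3180' is left alone because it does not occur at the start of the input.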
The new java.util.regex package provides regular
expression functionality that has been absent from the standard Java
API. The methods of the new Matcher and
Pattern classes and the grammar they support let
developers describe and manipulate sequences of characters
succinctly. They can replace string machinations with simpler regex
constructs that are more powerful and easier to use and maintain.
For those who don't have access to J2SE 1.4 or are tied to earlier versions of the JDK, there are several third-party regular expression packages. Among them are ORO and Regexp. Both are part of the Jakarta project. ORO boasts more features and seems to have more active development. The Free Software Foundation distributes the gnu.regex package, and Pat is a regex package compatible with JDK 1.0.