
Friday, January 20. 2017

Legato Developers Corner #18: Regular Expressions and Legato

In this blog we will discuss regular expressions and how you can employ them in your Legato scripts to add robust field validation as well as patterned string replacements. Regular expressions are a powerful tool available in many languages, so the concepts here also apply outside the scope of Legato. There are a few different “flavors” of regular expressions, but Legato uses the ECMAScript standard (very similar to Perl). If you already have a good understanding of regular expressions, you may wish to skip to the “Using Regular Expressions in Legato” section.


A Quick Introduction to Regular Expressions


For those who have not used regular expressions before, the concept is straightforward. You create a pattern using the regular expression syntax and then compare data to that pattern. A common parallel in everyday computing is using wildcards to find files on your computer. When you try to open a Word document in Office, Word limits the directory listing to files that end in .doc or .docx. Regular expressions can do simple matching like that and so much more.


Basic Syntax and Examples


There are many resources for regular expressions on the Internet, so we are going to cover just some of the basics. For a more in-depth look at the available options, the Wikipedia article on regular expressions is a good place to start.


The most basic pattern is just text. Any characters that do not have any special meaning are matched as-is. For example, the pattern “cat” matches only the word “cat”. If we wanted to make it match many pets, we could use the or operator ‘|’ like this: “cat|dog|bird|turtle”. This pattern would match any of the words listed. It is important to note that the character ‘.’ means any character, so if you want to match a literal period (or any other special character), it needs to be escaped with a backslash.


To add more power we can use quantifiers. Quantifiers say how often the preceding item is allowed to occur. For example, if we wanted the above example to allow plural words, we could make it “(cat|dog|bird|turtle)s?”. The parentheses group the names together, and the question mark means match the preceding item ‘s’ once or not at all. The quantifiers are: ?, *, +, {#}, {#,} and {#,#} where # is a number. ‘*’ is zero or more, ‘+’ is one or more, “{2}” means exactly 2 times, “{2,}” means 2 or more, and “{2,5}” means 2 to 5 times.
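
To see the plural pattern in action, here is a minimal sketch using Legato's IsRegexMatch function, which is covered later in this post and requires the entire string to match. The test words are made up for illustration.

```legato
int main() {
    string pattern = "(cat|dog|bird|turtle)s?";

    AddMessage("%d", IsRegexMatch("cat", pattern));       // should match: the trailing 's' is optional
    AddMessage("%d", IsRegexMatch("turtles", pattern));   // should match: one trailing 's'
    AddMessage("%d", IsRegexMatch("catss", pattern));     // should not match: 's?' allows at most one 's'
    AddMessage("%d", IsRegexMatch("horse", pattern));     // should not match: not one of the listed words
    return 0;
    }
```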


The next basic pattern concept is sets. A set matches any single character in the collection. Sets are denoted by the “[ ]” characters. For example, the pattern “[abc]” means any character that is ‘a’, ‘b’, or ‘c’. Sets can also have ranges like “[A-Z]” (any character between ‘A’ and ‘Z’) and be inverted like “[^abc]” (any character that is not ‘a’, ‘b’, or ‘c’).


Another special option is the character class. Character classes are shorthand sequences that stand for specific sets of characters. For example, “\d” means any digit. This is equivalent to the set “[0-9]” but is shorter. Likewise, “\w” means any word character, which by default translates into “[A-Za-z0-9_]”. This means:


ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_


The available character classes depend on the particular regular expression engine, but classes for digits, word characters, and whitespace are generally available.


So bringing all this together we can make some patterns:


  [0-9]{10}    Matches a CIK (exactly ten digits)
  (-|\+)?[0-9]+(\.[0-9]+)?    Matches any decimal number (with optional -/+)
  ([^A-Z ])([A-Z])    Matches a capital letter together with the character before it, when that character is neither a capital letter nor a space
  (a|e|i|o|u|A|E|I|O|U)[^ ]*    Matches any word that begins with a vowel
  \S+@(\S+\.)+[A-Za-z]{2,}    Matches an email address
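
As a rough check of two of these patterns, the sketch below runs them through Legato's IsRegexMatch function (described later in this post). The sample values are invented for illustration, and the backslashes are doubled because the patterns are written as Legato string literals.

```legato
int main() {
    string cik = "[0-9]{10}";
    string decimal = "(-|\\+)?[0-9]+(\\.[0-9]+)?";        // "\\+" and "\\." reach the regex engine as \+ and \.

    AddMessage("%d", IsRegexMatch("0001234567", cik));    // should match: exactly ten digits
    AddMessage("%d", IsRegexMatch("1234567", cik));       // should not match: only seven digits
    AddMessage("%d", IsRegexMatch("-12.50", decimal));    // should match: sign, digits, fraction
    AddMessage("%d", IsRegexMatch("12.", decimal));       // should not match: the period must be followed by a digit
    return 0;
    }
```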

The last regular expression topic to discuss is using backreferences. A backreference is a way to reuse text matched earlier in the pattern. Say we want to create a pattern to detect if the same word is written twice. We can do that easily with a backreference. The syntax for backreferences is a backslash followed by the group number within the pattern. The numbering gets a little complicated when there are nested groups, but the concept is still the same. If we wanted to create a pattern for that duplicate word example above, it would be “([^ ]+) \1”. This pattern is a group of any nonspace character one or more times, followed by a space. Then the backreference means the same characters that matched group 1 again.
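
A minimal sketch of the duplicate word pattern, using the two Legato regular expression functions introduced in the next section. The sample sentences are hypothetical. Note that the backslash in “\1” is doubled inside the string literal, while the replace string uses “$1”.

```legato
int main() {
    string pattern = "([^ ]+) \\1";

    // Whole-string test: the input consists of a word, a space, and the same word again
    AddMessage("%d", IsRegexMatch("hello hello", pattern));    // should match
    AddMessage("%d", IsRegexMatch("hello world", pattern));    // should not match

    // Collapse the doubled word to a single copy; expected result: Please file the report
    AddMessage(ReplaceInStringRegex("Please file the the report", pattern, "$1"));
    return 0;
    }
```

Because this simple pattern has no notion of word boundaries, it can also match inside longer words (for example, “the the” within “the theme”), so treat it as a starting point.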


Using Regular Expressions in Legato


Legato has two regular expression functions:


boolean = IsRegexMatch ( string text, string pattern );


and


string = ReplaceInStringRegex ( string text, string pattern, string replace );


The IsRegexMatch function tests a string against a pattern and returns true if it matches and false if it does not. This makes the function ideal for validating user input. The entire input string must match the pattern. If a partial match is needed the pattern should be changed to reflect that.
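
As a rough illustration of the whole-string rule, the sketch below tests one made-up filename against three patterns. Wrapping a pattern in “.*” on both sides is one possible way to turn it into a partial match; it is shown here as an assumption about what you might want, not as the only option.

```legato
int main() {
    string name = "slide12.jpg";

    AddMessage("%d", IsRegexMatch(name, "slide"));               // should not match: pattern covers only part of the name
    AddMessage("%d", IsRegexMatch(name, "slide[0-9]+\\.jpg"));   // should match: pattern covers the whole name
    AddMessage("%d", IsRegexMatch(name, ".*slide.*"));           // should match: ".*" absorbs the rest of the name
    return 0;
    }
```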


The ReplaceInStringRegex function finds text that matches the pattern and replaces it with the string specified by the replace parameter. This function replaces all occurrences of text that match the pattern parameter. The pattern does not need to completely match the input. The replace parameter can also contain backreferences to items matched in the pattern parameter. The syntax is slightly different from before. In the replace string, we use “$1” to represent a backreference to the first group instead of “\1”.
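
A short sketch of both behaviors, using an invented date string: the first call replaces every occurrence of a simple pattern, and the second uses “$1”, “$2”, and “$3” to reorder the captured groups.

```legato
int main() {
    string date = "2017-01-20";

    // Every hyphen is replaced, not just the first; expected result: 2017/01/20
    AddMessage(ReplaceInStringRegex(date, "-", "/"));

    // Groups are year, month, day; expected result: 01/20/2017
    AddMessage(ReplaceInStringRegex(date, "([0-9]{4})-([0-9]{2})-([0-9]{2})", "$2/$3/$1"));
    return 0;
    }
```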


Here is an example of backreferences in a replace operation:


string pattern = "(^'|( )')([^']+)'";


AddMessage(ReplaceInStringRegex("I've poorly 'quoted things'", pattern, "$2\"$3\""));


Creates this output:


I've poorly "quoted things"


The above example replaced the single quotes surrounding the text with double quotes while ignoring the apostrophe in “I’ve”. The $2 indicates the second backreference (in this case, the space in parentheses) and the $3 indicates the third backreference, which is the text that is quoted.


One important thing to note when using regular expressions in Legato is that the expressions themselves may require extensive use of the backslash character. This character also needs to be escaped in Legato when using string literals, so every backslash in the pattern is written twice. A simple pattern like “\w+\.txt” will need to be escaped:


AddMessage("%d", IsRegexMatch("cat.txt", "\\w+\\.txt"));
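
To show why the escaped period matters, here is a small sketch comparing the pattern above with a variant that leaves the period unescaped. The filenames are made up for illustration.

```legato
int main() {
    string escaped = "\\w+\\.txt";      // regex engine sees \w+\.txt (literal period)
    string unescaped = "\\w+.txt";      // regex engine sees \w+.txt  (period means any character)

    AddMessage("%d", IsRegexMatch("cat.txt", escaped));       // should match
    AddMessage("%d", IsRegexMatch("catxtxt", escaped));       // should not match: there is no literal period
    AddMessage("%d", IsRegexMatch("catxtxt", unescaped));     // should match: '.' accepts the 'x'
    return 0;
    }
```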


A Sample Function Using Regular Expressions


This sample function uses regular expressions to rename files within a directory. The function has many production uses, like renaming slides from PowerPoint, as well as personal applications, such as renaming camera pictures or music files.


// Renames all files in path (and optionally its subdirectories) whose names match pattern
int RenameInDirectory(string path, string pattern, string replace,
                      boolean bProcessSub, boolean bRename) {
    
    int matched; // Matched items
    handle hFF;
    boolean hasfiles;
    string src, dst, name, nname;
    dword dw;
    int rc;
    
    
    hFF = GetFirstFile(AddPaths(path, "*"));
    if (IsError(hFF)) {
      return 0;
      }
    
    matched = 0;
    hasfiles = true;
    while (hasfiles) {
      name = GetName(hFF);
      dw = GetFileAttributeBits(hFF);
      if ((dw & FILE_ATTRIBUTE_DIRECTORY) != 0) {
        if (bProcessSub == TRUE && name != "." && name != "..") {
          name = AddPaths(path, name);
          ConsolePrint("Entering Directory %s\r\n", name);
          matched += RenameInDirectory(name, pattern, replace, bProcessSub, bRename);
          }
        }
      else {
        if (IsRegexMatch(name, pattern)) {
          nname = ReplaceInStringRegex(name, pattern, replace);
          ConsolePrint("Renaming File %s to %s\r\n", name, nname);
          src = AddPaths(path, name);
          dst = AddPaths(path, nname);
          if (bRename == true) {
            rc = RenameFile(src, dst);
            if (IsError(rc)){
              ConsolePrint("   FAILED! 0x%08X\r\n", rc);
              }
            else {
              matched++;
              }
            }
          else {
            matched++;
            }
          }
        else {
          ConsolePrint("Skipping File %s\r\n", name);
          }
        }

      rc = GetNextFile(hFF);
      if (IsError(rc)) {
        hasfiles = false;
        }
      }
    CloseHandle(hFF);
    return matched;
    }

The RenameInDirectory function takes several parameters: the path in which to search, the pattern to match, the text with which the function will perform the replace operation, and then two booleans. One boolean indicates whether to process subdirectories, and the second indicates whether to rename files. This function returns the number of files that have been / would be renamed. It first defines the many necessary variables.


    hFF = GetFirstFile(AddPaths(path, "*"));
    if (IsError(hFF)) {
      return 0;
      }
    
    matched = 0;
    hasfiles = true;
    while (hasfiles) {
      name = GetName(hFF);
      dw = GetFileAttributeBits(hFF);

The process starts by using the GetFirstFile function to get a list of files and directories in the supplied path. Then it gets the name of the file using the GetName function and the attributes of the file using the GetFileAttributeBits function. This occurs in a while loop to go through every file.


      if ((dw & FILE_ATTRIBUTE_DIRECTORY) != 0) {
        if (bProcessSub == TRUE && name != "." && name != "..") {
          name = AddPaths(path, name);
          ConsolePrint("Entering Directory %s\r\n", name);
          matched += RenameInDirectory(name, pattern, replace, bProcessSub, bRename);
          }
        }

The if statement checks to see if the current item is a directory. If it is a directory, the user wants to process subdirectories, and it is not the “.” or “..” directory, it is also processed. The ConsolePrint function is used to tell the user what is going on during the rename process. The RenameInDirectory function is called again with the subdirectory as the path, and the number of matched items is increased by the number returned from the inner call. This makes our RenameInDirectory function recursive with a very clear ending point (a directory with no subdirectories).


      else {
        if (IsRegexMatch(name, pattern)) {
          nname = ReplaceInStringRegex(name, pattern, replace);
          ConsolePrint("Renaming File %s to %s\r\n", name, nname);
          src = AddPaths(path, name);
          dst = AddPaths(path, nname);

If the item was not a directory, the regular expression magic happens. The function checks to see if the name of the file matches the pattern using the IsRegexMatch SDK function. This is important since it forces the pattern to match the entire filename. The script could do partial matching instead; we’ll discuss this more later. The ReplaceInStringRegex function is then called on the filename to create the new filename. The resulting names are printed to the console and then qualified paths are created.


          if (bRename == true) {
            rc = RenameFile(src, dst);
            if (IsError(rc)){
              ConsolePrint("   FAILED! 0x%08X\r\n", rc);
              }
            else {
              matched++;
              }
            }
          else {
            matched++;
            }

If the user wants to rename the files (not just test them), the function uses the RenameFile function to rename the file. If the rename fails, it prints an error to the console; otherwise, it increases the count of files that were renamed. In test mode, nothing is renamed and the count is simply increased for every matching file.


          }
        else {
          ConsolePrint("Skipping File %s\r\n", name);
          }
        }

This prints a message to the user if the file was skipped since it did not match the pattern.


      rc = GetNextFile(hFF);
      if (IsError(rc)) {
        hasfiles = false;
        }
      }
    CloseHandle(hFF);
    return matched;

Finally, our script uses the GetNextFile function to get the next file in the directory. If there are no more files, the loop exits, the search handle is closed with CloseHandle, and the function returns the count of files that were renamed or would be renamed.


This relatively simple function uses the power of regular expressions to rename thousands of files in a predictable fashion. If we wanted to rename a bunch of jpgs from PowerPoint into GoFiler format (with leading 0s), we could just run:


    RenameInDirectory("C:\\PPT", "slide([0-9])\\.jpg", "image_00$1.jpg", false, true);
    RenameInDirectory("C:\\PPT", "slide([0-9]{2})\\.jpg", "image_0$1.jpg", false, true);
    RenameInDirectory("C:\\PPT", "slide([0-9]{3})\\.jpg", "image_$1.jpg", false, true);
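
To preview what those three pattern and replace pairs produce without touching any files, the sketch below applies them to hypothetical filenames with ReplaceInStringRegex alone. Each name is paired with the pattern that fits its digit count, since RenameInDirectory only renames a file when IsRegexMatch accepts the entire filename.

```legato
int main() {
    // One digit gets two leading zeros; expected result: image_007.jpg
    AddMessage(ReplaceInStringRegex("slide7.jpg", "slide([0-9])\\.jpg", "image_00$1.jpg"));

    // Two digits get one leading zero; expected result: image_042.jpg
    AddMessage(ReplaceInStringRegex("slide42.jpg", "slide([0-9]{2})\\.jpg", "image_0$1.jpg"));

    // Three digits need no padding; expected result: image_123.jpg
    AddMessage(ReplaceInStringRegex("slide123.jpg", "slide([0-9]{3})\\.jpg", "image_$1.jpg"));
    return 0;
    }
```

Three separate calls are used because a single replace string cannot vary the number of leading zeros based on how many digits were captured.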

Combining the Function with a User Interface


This next script combines the function from above with a user interface. This script CAN be dangerous to use as it will rename many files almost instantly. Use the “Test” button before executing the rename operation to see what the script will do. To use the script, save it as an .ls file and run it in GoFiler.


// Renames all files in path (and optionally its subdirectories) whose names match pattern
int RenameInDirectory(string path, string pattern, string replace,
                      boolean bProcessSub, boolean bRename) {
    
    int matched; // Matched items
    handle hFF;
    boolean hasfiles;
    string src, dst, name, nname;
    dword dw;
    int rc;
    
    
    hFF = GetFirstFile(AddPaths(path, "*"));
    if (IsError(hFF)) {
      return 0;
      }
    
    matched = 0;
    hasfiles = true;
    while (hasfiles) {
      name = GetName(hFF);
      dw = GetFileAttributeBits(hFF);
      if ((dw & FILE_ATTRIBUTE_DIRECTORY) != 0) {
        if (bProcessSub == TRUE && name != "." && name != "..") {
          name = AddPaths(path, name);
          ConsolePrint("Entering Directory %s\r\n", name);
          matched += RenameInDirectory(name, pattern, replace, bProcessSub, bRename);
          }
        }
      else {
        if (IsRegexMatch(name, pattern)) {
          nname = ReplaceInStringRegex(name, pattern, replace);
          ConsolePrint("Renaming File %s to %s\r\n", name, nname);
          src = AddPaths(path, name);
          dst = AddPaths(path, nname);
          if (bRename == true) {
            rc = RenameFile(src, dst);
            if (IsError(rc)){
              ConsolePrint("   FAILED! 0x%08X\r\n", rc);
              }
            else {
              matched++;
              }
            }
          else {
            matched++;
            }
          }
        else {
          ConsolePrint("Skipping File %s\r\n", name);
          }
        }

      rc = GetNextFile(hFF);
      if (IsError(rc)) {
        hasfiles = false;
        }
      }
    CloseHandle(hFF);
    return matched;
    }


//
// Supporting Dialog and Main Entry
// --------------------------------

#define DLG_LOCATION    201
#define DLG_PATTERN     202
#define DLG_REPLACE     203
#define DLG_BROWSE      101
#define DLG_TEST        102
#define DLG_RENAME      103
#define DLG_SUBDIR      301
#define DLG_EX1         501
#define DLG_EX2         502

int main() {
    DialogBox("RegexRename", "rr_");
    return 0;
    }
    

int rr_run_function(boolean bRename) {

    string location;
    string pattern;
    string replace;
    boolean bSub;
    int rc;

    // Get Location    
    location = EditGetText(DLG_LOCATION, "Location", EGT_FLAG_REQUIRED);
    if (location == "") {
      return ERROR_EOD;
      }
    if (IsPathQualified(location) != TRUE) {
      MessageBox('x', "Location must be a qualified path.");
      return ERROR_EOD;
      }
    
    // Get Pattern
    pattern = EditGetText(DLG_PATTERN, "Pattern", EGT_FLAG_REQUIRED);
    if (pattern == "") {
      return ERROR_EOD;
      }

    // Get Replace
    replace = EditGetText(DLG_REPLACE, "Replace", EGT_FLAG_REQUIRED);
    if (replace == "") {
      return ERROR_EOD;
      }
      
    // Get Subdirectories
    if (CheckboxGetState(DLG_SUBDIR) == BST_CHECKED) {
      bSub = TRUE;
      }
    else {
      bSub = FALSE;
      }

    ConsolePrint("\r\nBeginning Rename...\r\n");
    rc = RenameInDirectory(location, pattern, replace, bSub, bRename);
    if (bRename == TRUE) {
      MessageBox('i', "%d file(s) renamed.", rc);
      }
    else {
      MessageBox('i', "%d file(s) would be renamed.", rc);
      }
    return ERROR_NONE;
    }

int rr_load() {

    EditSetText(DLG_EX1, "(e.g. slide([0-9]+)\\.jpg )");
    EditSetText(DLG_EX2, "(e.g. image_$1.jpg )");
    return ERROR_NONE;
    }

int rr_action(int id, int action) {

    string path;

    switch (id) {
      case DLG_BROWSE:
        path = EditGetText(DLG_LOCATION);
        path = BrowseFolder("Select Location", path);
        if (GetLastError() == ERROR_CANCEL) {
          break;
          }
        EditSetText(DLG_LOCATION, path);
        break;
      case DLG_TEST:
        rr_run_function(false);
        break;
      case DLG_RENAME:
        rr_run_function(true);
        break;
      }
    return ERROR_NONE;
    }

#beginresource

RegexRename DIALOGEX 0, 0, 300, 104, 0
EXSTYLE WS_EX_DLGMODALFRAME
STYLE DS_MODALFRAME | DS_3DLOOK | WS_POPUP | WS_VISIBLE | WS_CAPTION | WS_SYSMENU
CAPTION "Rename Files with Regular Expressions"
FONT 8, "MS Shell Dlg"
{
 CONTROL "Options", -1, "static", SS_LEFT | WS_CHILD | WS_VISIBLE, 6, 4, 57, 8, 0
 CONTROL "", -1, "static", SS_ETCHEDFRAME | WS_CHILD | WS_VISIBLE, 36, 9, 260, 1, 0
 CONTROL "&Location:", -1, "static", SS_LEFT | WS_CHILD | WS_VISIBLE, 12, 19, 38, 8, 0
 CONTROL "", DLG_LOCATION, "edit", ES_LEFT | ES_AUTOHSCROLL | WS_CHILD | WS_VISIBLE | WS_BORDER | WS_TABSTOP, 50, 18, 186, 12, 0
 CONTROL "&Browse", DLG_BROWSE, "button", BS_PUSHBUTTON | BS_CENTER | WS_CHILD | WS_VISIBLE | WS_TABSTOP, 243, 18, 45, 12, 0
 CONTROL "Process subdirectories (recurse)", DLG_SUBDIR, "button", BS_AUTOCHECKBOX | WS_CHILD | WS_VISIBLE | WS_TABSTOP, 50, 32, 186, 12, 0
 CONTROL "&Pattern:", -1, "static", SS_LEFT | WS_CHILD | WS_VISIBLE, 12, 49, 38, 8, 0
 CONTROL "", DLG_PATTERN, "edit", ES_LEFT | ES_AUTOHSCROLL | WS_CHILD | WS_VISIBLE | WS_BORDER | WS_TABSTOP, 50, 47, 170, 12, 0
 CONTROL "", DLG_EX1, "static", SS_LEFT | WS_CHILD | WS_VISIBLE, 222, 49, 76, 8, 0
 CONTROL "&Replace:", -1, "static", SS_LEFT | WS_CHILD | WS_VISIBLE, 12, 65, 38, 8, 0
 CONTROL "", DLG_REPLACE, "edit", ES_LEFT | ES_AUTOHSCROLL | WS_CHILD | WS_VISIBLE | WS_BORDER | WS_TABSTOP, 50, 63, 170, 12, 0
 CONTROL "", DLG_EX2, "static", SS_LEFT | WS_CHILD | WS_VISIBLE, 222, 65, 76, 8, 0
 CONTROL "", -1, "static", SS_ETCHEDFRAME | WS_CHILD | WS_VISIBLE, 6, 80, 290, 1, 0
 CONTROL "&Test", DLG_TEST, "button", BS_DEFPUSHBUTTON | BS_CENTER | WS_CHILD | WS_VISIBLE | WS_TABSTOP, 136, 86, 50, 14, 0
 CONTROL "Re&name", DLG_RENAME, "BUTTON", BS_PUSHBUTTON | BS_CENTER | WS_CHILD | WS_VISIBLE | WS_TABSTOP, 191, 86, 50, 14
 CONTROL "Close", IDOK, "BUTTON", BS_PUSHBUTTON | BS_CENTER | WS_CHILD | WS_VISIBLE | WS_TABSTOP, 246, 86, 50, 14
}

#endresource

The Next Step


If you want to test what you’ve learned here and your Legato skills, try taking the above script and editing it to allow for partial matches in filenames.






David Theis has been developing software for Windows operating systems for over fifteen years. He has a Bachelor of Sciences in Computer Science from the Rochester Institute of Technology and co-founded Novaworks in 2006. He is the Vice President of Development and is one of the primary developers of GoFiler, a financial reporting software package designed to create and file EDGAR XML, HTML, and XBRL documents to the U.S. Securities and Exchange Commission.

Posted by
David Theis
in Development at 13:31