Legato Developers Corner #15: String Testing and Translation

Friday, December 30. 2016

This week we will be covering scanning, testing, and some functions for manipulating textual string data.

Introduction

In certain cases, it is necessary to manually scan and test string content. One could manually step through a string character by character and perform specific tests. This would be slow and cumbersome. Fortunately, the Legato SDK contains hundreds of functions to process and manipulate strings. In this article, we will be discussing parsing, basic boolean test functions, word analysis functions, and some conversion functions.

Boolean and Bitwise Results

Functions that test or get data frequently return either a boolean value or a bitwise result. While both are fundamentally numbers, a boolean value of 0 means FALSE and any non-zero value (usually 1) means TRUE. Bitwise return values are typically defined as a dword, a 32-bit unsigned integer. Conventionally, the top bit (0x80000000) is not used as part of the result, since it normally indicates that the value is a formatted error code.

With boolean return values, it is easy to perform operations such as:

if (HasText(mystring)) { ... }

Bitwise return values can contain a lot of information and are normally tested with bit and ordinal masks. This will be covered in detail later. Since working with raw values like 0x00010000 can be clumsy, the SDK defines named constants for the values returned by all common bitwise functions.

A sample use of bitwise return values might look like this:

result = AnalyzeText(mystring);

if ((result & TEXT_TYPE_MASK) == TEXT_TYPE_HEADING) { ... }
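The two conventions described in this section can be modeled outside of Legato. The following is a minimal Python sketch, not SDK code: `ERROR_BIT` is the top-bit value quoted above, while `is_true`, `classify_result`, and the sample values are hypothetical names and numbers chosen only for illustration.

```python
# Python model of the return-value conventions described above (not Legato code).
ERROR_BIT = 0x80000000  # top bit: conventionally marks a formatted error code


def is_true(value: int) -> bool:
    """Boolean convention: 0 is FALSE, any non-zero value is TRUE."""
    return value != 0


def classify_result(result: int) -> str:
    """Hypothetical helper: separate error codes from ordinary bitwise results."""
    result &= 0xFFFFFFFF  # treat the value as a 32-bit unsigned dword
    if result & ERROR_BIT:
        return "error"
    return "bitwise result"


print(is_true(0), is_true(1), is_true(7))   # False True True
print(classify_result(0x00010000))          # bitwise result
print(classify_result(0x80000005))          # error
```

Checking the error bit before applying any mask keeps an error code from being misread as a valid set of flags.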

A Short Review of the Word Parse Object

Generally speaking, we have to have data to test. The easiest way to look through a string of text is by using the Word Parse Object. The object supports a series of functions that are specifically tailored to “parse” or process textual data. The general-purpose parser runs in three modes: text, tags, and program. The text mode provides word parsing for reading general text. The tags mode is tailored to work on XML, HTML, or SGML tags and character entities. Finally, program mode is made to parse typical program or script text.

We will be focusing on the default parsing mode or text mode (WP_GENERAL), which stops on word spaces, returns (line endings), and punctuation within the textual information.

Basic Operation

The general steps are as follows: create (get handle), load/set data, and iterate until the data has been exhausted. New data can be repeatedly loaded to the same object to process multiple buffers or lines. After completion, the Word Parse Object handle should be closed.

As each item is parsed, the leading spaces and statistics are stored. For example, the caller can check to see if there are leading spaces and even get the white space character as a string.

Once the source data is set, the source variable can be changed or released. The Word Parse Object makes an internal copy of the data.

Setting Up a Parse Operation

The first action is to create a Word Parse Object and retrieve a handle. That handle is then used in subsequent operations to move through the text and examine each parsed item. For example:

        handle          hWP;
        string          s1, s2;
        int             spaces, count, pos;
        
        // Sample text with a double space and an embedded return
        s1 = "My favorite pastime is  waiting for my browser to load a page.\rEnd.";

        // Create the Word Parse Object (default text mode)
        hWP = WordParseCreate();
        if (hWP == NULL_HANDLE) {
          MessageBox('x', "Error on handle");
          exit;
          }
        // Load the data, then read words until an empty string is returned
        WordParseSetData(hWP, s1);
        s2 = WordParseGetWord(hWP);
        while (s2 != "") {
          count++;
          pos = WordParseGetPosition(hWP);
          spaces = WordParseGetSpaceSize(hWP);
          AddMessage("   %3d %3d %3d :%s:", count, pos, spaces, s2);
          s2 = WordParseGetWord(hWP);
          }

        // Release the parse object
        CloseHandle(hWP);

The result in the log:

     1   2   0 :My:
     2  11   1 :favorite:
     3  19   1 :pastime:
     4  22   1 :is:
     5  31   2 :waiting:
     6  35   1 :for:
     7  38   1 :my:
     8  46   1 :browser:
     9  49   1 :to:
    10  54   1 :load:
    11  56   1 :a:
    12  62   1 :page.:
    13  67   1 :End.:

In this case, the parse object is created with the default mode (text). A string is added to the parse object and then each successive word is retrieved along with certain attributes. The log columns are the word count, the parse position after the word, and the size of the leading space. When added to the log, we surround the returned string value with “::” to illustrate that the string does not contain white space. Note that, as shown in the log, the first entry has no leading spaces, there is an additional space before “waiting”, and the return before “End.” is counted as a single leading space.
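To make the meaning of the three numeric columns concrete, the loop can be imitated in Python. This is only a stand-in for the Word Parse Object, and it assumes words are separated by white space alone, which happens to match this sample. The function name `parse_words` is hypothetical.

```python
import re


# Python stand-in for the parse loop above (not the Word Parse Object itself).
# Assumption: words are split on white space only, which matches this sample.
def parse_words(text: str):
    rows = []
    previous_end = 0
    for count, match in enumerate(re.finditer(r"\S+", text), start=1):
        spaces = match.start() - previous_end  # white space skipped before the word
        rows.append((count, match.end(), spaces, match.group()))
        previous_end = match.end()
    return rows


s1 = "My favorite pastime is  waiting for my browser to load a page.\rEnd."
for count, pos, spaces, word in parse_words(s1):
    # Same layout as the AddMessage call: count, position, spaces, word
    print("   %3d %3d %3d :%s:" % (count, pos, spaces, word))
```

Under that assumption, the position column is the zero-based index just past each word, and the spaces column is the number of white space characters skipped to reach it.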

Functions are provided to retrieve and change the parsing position. In addition, a parse object can be used repeatedly, provided the parse mode remains the same.

Skipping Through a String

Another option while parsing is to skip through a string looking for text or spaces. A series of functions is provided, each of which takes a string and a zero-based index as parameters and returns a new index position.

	Function	Description
	SkipBackWordSpaces	Skips back from a specified index to the first non-word space character.
	SkipToLineEnding	Skips forward to the next line ending character (0x0D/0x0A).
	SkipToNonText	Skips forward to a character that is not alpha-numeric.
	SkipToWordSpace	Skips forward until a word space is found.
	SkipWordSpaces	Skips forward until not on a word space.

Once positions have been established, they can be used with the word parser or functions such as the GetStringSegment function.
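The index-in, index-out pattern of these functions can be illustrated with a short Python sketch. The two helpers below are hypothetical re-implementations written for this example, not the SDK functions. The set of word space characters is taken from the IsWordSpace description later in this article, and the end-of-string behavior is an assumption.

```python
# Python stand-ins for two of the skip functions (hypothetical re-implementations).
# Word space set from the IsWordSpace description: space, return, tab, new line, 0x00.
WORD_SPACES = " \r\t\n\x00"


def skip_word_spaces(text: str, index: int) -> int:
    """Move forward until the index is not on a word space (or the end is reached)."""
    while index < len(text) and text[index] in WORD_SPACES:
        index += 1
    return index


def skip_to_word_space(text: str, index: int) -> int:
    """Move forward until a word space is found (or the end is reached)."""
    while index < len(text) and text[index] not in WORD_SPACES:
        index += 1
    return index


line = "  Total   revenue"
start = skip_word_spaces(line, 0)       # index of the "T" in "Total"
end = skip_to_word_space(line, start)   # index of the space after "Total"
print(start, end, line[start:end])      # 2 7 Total
```

Chaining the two calls yields a start and end index for a word, which is the kind of position pair that can then be handed to a segment extraction function.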

Testing a Word

To avoid having to test each character in a word to determine its type, a large set of SDK functions is provided to test whether a string, or in some cases a character, meets a particular criterion. The following is a partial list of the ‘Is’ or ‘Has’ functions that return TRUE (1) or FALSE (0) depending on the result.

	Function	Description
	HasNumeric	Tests a string for any numeric characters (digits).
	HasText	Tests a string for any text (alpha characters).
	IsAllLower	Tests a string for all lower case on any text that is present.
	IsAllUpper	Tests a string for all upper case on any text that is present.
	IsASCII	Tests a string for non-ASCII characters allowing for return and tab characters.
	IsASCIICharacters	Tests a string for non-ASCII (no control characters).
	IsAccounting	Tests a string for accounting characters (i.e., -123 or (34,555.44)).
	IsAlpha8859	Tests a character or string for ASCII Letters and ISO-8859 Latin letter characters.
	IsAlphaNumeric8859	Tests a character or string for ASCII letters, numbers, and ISO-8859 Latin letter characters.
	IsAlpha	Tests a character or string for ASCII letters.
	IsCurrency	Tests a character or string as currency group (allows ‘.’ ‘,’ and ‘(’ ‘)’ characters).
	IsCurrencyFormatted	Tests a string as properly formatted currency (US, Euro, Pounds, Yen, cents).
	IsCurrencyPrefix	Tests a string for currency leader (i.e., USD$ or CAN$) allowing for other ISO-4217 codes. Following number can be loosely formatted as accounting.
	IsCurrencyProper	Tests a string for properly structured currency, i.e., €128,333, allowing for multiple commas and periods for US and European formats. It does not check the number of digits.
	IsDrawing	Tests character or string for a limited set of characters commonly used for drawing such as ‘-’ or ‘=’.
	IsFalse	Checks string for common terms such as ‘no’, ‘false’, ‘0’ or empty checkboxes as being the same as logical FALSE.
	IsFootnoteReference	Checks string for typical footnote characters, numbers or letters in the hole: ‘(1)’ or ‘(b)’.
	IsHTML	Tests a string as being HTML by checking for certain HTML tags.
	IsHex	Tests a character or string as being valid hex characters. It cannot have the ‘0x’ prefix.
	IsInString	Checks for all or part of string or character inside of a target string (like InString but with a boolean result).
	IsLeaderBackFill	Looks backward in a string for leader fill characters. The string can contain text prior to the leader.
	IsLeaderFill	Looks in string for leader fill characters.
	IsLower	Checks a string (word) for being lower case. The word cannot contain any non-alpha characters.
	IsNil	Checks a character or string as being a financial ‘nil’ value.
	IsNonBreakingSpace	Checks a character or a string as being non-breaking space character(s) (0xA0 or 160).
	IsNonBreakingSpaceEntity	Checks for the start of a string being a non-breaking space entity (PCDATA), written as &nbsp; or &#160;.
	IsNonBreakingSpacePCDATA	Checks for string containing only non-breaking spaces and optional word spaces.
	IsNumeric	Tests a string for strictly numeric (digits).
	IsPCDATARequired	Checks a string for a requirement to encode as PCDATA.
	IsPercentage	Checks a string as a percentage value such as 0% or 00.00% etc.
	IsReal	Checks a string as being a real number.
	IsRealStrict	Checks a string as a real number but with strict requirements.
	IsRegexMatch	Performs a regular expression pattern match on string data.
	IsRoman	Tests a character as a roman numeral or a string as a roman number.
	IsSectionNumber	Tests a string for being a section number (e.g., 1, 2.2, 2.1.1., etc.).
	IsStringPadded	Tests to see if a string has space padding either before or after.
	IsTabbedString	Checks a string for tab characters.
	IsText	Tests a string for being a textual word with or without conventional punctuation.
	IsTrue	Tests a string for common terms such as ‘yes’, ‘true’, ‘1’ or various checked checkbox styles being the same as logical TRUE.
	IsUpper	Checks a string (word) for being upper case. The word cannot contain any non-alpha characters.
	IsSGMLCharacterEntity	Tests a string for basic SGML character entity structure. It does not check the actual character specification.
	IsSGMLEmptyElement	Tests a string for basic SGML empty element structure.
	IsSGMLTag	Tests a string for basic SGML tag structure. Does not check the content other than for <a> or </a> structure.
	IsValidSGMLAttribute	Tests a string for a valid syntax attribute name (with name space).
	IsValidSGMLElement	Tests a string for a valid syntax element name (with name space).
	IsWildListMatch	Matches a list of items against a target string with wildcards. The match string list can be a semicolon separated list of test cases.
	IsWildMatch	Checks two strings for wild card match. Matches string 2 to string 1.
	IsWildString	Tests a string for containing one or more wild card characters.

Some of the above functions will also accept and test an individual character. Functions that test only characters are also available:

	Function	Description
	IsANSISpace	Tests a character as white space including backspace (back tab, `0x08`) and tab (`0x09`).
	IsAlphaNumeric	Tests a character for ASCII letters or numbers.
	IsDigit	Tests a character as a number (0-9).
	IsExpressionCharacter	Tests a character that could be used in an expression (e.g., `{` `<` `+`, etc.).
	IsExpressionGroup	Tests a character used in an expression group (i.e., `" '` `(` or `[` ).
	IsExtendedAlpha	Tests a character to see if it is part of ISO-8859 latin alpha set commonly used in English.
	IsFinancial	Checks character as a number, currency or ‘`,`’ or ‘`.`’.
	IsSentenceDelimiter	Tests character as one that delimits a sentence (`. ! : ?`).
	IsValidSGMLCharacter	Tests a character that can be part of an SGML element or attribute.
	IsValidSGMLStartCharacter	Tests a character as a start character in an SGML element or attribute.
	IsValidVariableCharacter	Tests a character that can be used in a programming variable (no lead character exclusion).
	IsVowel	Tests a character as western vowel (A E I O U, upper and lower case).
	IsWordDelimiter	Tests a character as word style delimiter (‘`' . , : ; ! ? ( ) [ ] { }`’).
	IsWordSpace	Tests a character as a Word Space (space, return, tab, new line, `0x00`).

Another powerful tool is the GetWordType function. The GetWordType function analyzes the content of a provided word and returns the type and attributes. The prototype:

dword = GetWordType ( string data );

The data parameter is a string containing a word without leading or trailing spaces. The returned value is a 32-bit dword (or int, unsigned) containing bitwise information. The results can be any of the following:

Definition		Bitwise	Description
Item Types
	WT_TYPE_ITEM_MASK	0x000F0000	Item Type Mask
	WT_TYPE_UNKNOWN	0x00000000	Unknown Value
	WT_TYPE_WORD	0x00010000	Word (dog, cat, monkey)
	WT_TYPE_NUMBER	0x00020000	Number
	WT_TYPE_NUMBER_SERIAL	0x00030000	Serial Number (12, 63)
	WT_TYPE_LEADER	0x00040000	Leader Line
	WT_TYPE_RULER	0x00050000	Ruler (possible or dash, nil)
	WT_TYPE_CURRENCY_LEADER	0x00060000	Opening Currency “$ 1,121”
	WT_TYPE_NIL	0x00070000	Nil or Compound Nil “--(a)” or “—” or “$-”
	WT_TYPE_DATE	0x00080000	Date “12/12/12”, “12.12.12”, “23:22” or ISO
Word Variations
	WT_WORD_MASK	0x00700000	Word Type Mask
Types
	WT_WORD_UNKNOWN	0x00000000	Unknown or General Word Type
	WT_WORD_LOWER	0x00100000	Lower Case Word
	WT_WORD_UPPER	0x00200000	Upper Case Word
	WT_WORD_INITIAL	0x00300000	Initial Capital
Word Flags
	WT_WORD_TRAIL_MASK	0x000000FF	Punctuation (low in char)
	WT_WORD_TRAIL_PUNCTUATION	0x00800000	Trails Punctuation (in low char)
	WT_WORD_QUOTED	0x01000000	Word Quoted (can be partial)
	WT_WORD_IN_HOLE	0x02000000	Word has Parenthesis or Brackets
	WT_WORD_LEADER_TRAIL	0x04000000	Word has a Trailing Leader Line
Lexicon
	WT_WORD_LEXICON_MASK	0x70000000	Lexicon Mask
	WT_WORD_DATE_MONTH	0x10000000	Word is in Month Lexicon
	WT_WORD_DATE_DAY	0x20000000	Word is in Day Lexicon
	WT_WORD_HONORIFIC	0x30000000	Word is in Honorific Lexicon
Number Variations
	WT_NUMBER_ALIGN_MASK	0x000000FF	Alignment Position at Size
Types
	WT_NUMBER_MASK	0x00700000	Number Type Mask
	WT_NUMBER_UNKNOWN	0x00000000	Unknown Type
	WT_NUMBER_YEAR	0x00100000	Number is Year (1900-2099)
	WT_NUMBER_DAY	0x00200000	Number is Day (1-31)
	WT_NUMBER_FORMATTED	0x00300000	Number is Formatted
	WT_NUMBER_LIST	0x00400000	Part of a List (1-99 with trail)
Number Flags
	WT_NUMBER_NEGATIVE	0x01000000	Negative Number (000) or -000
	WT_NUMBER_IN_HOLE	0x02000000	Negative Number (000)
	WT_NUMBER_FOOTNOTE	0x04000000	Has Footnote
	WT_NUMBER_CURRENCY	0x08000000	Has Currency
	WT_NUMBER_PERCENT	0x10000000	Has Percent
	WT_NUMBER_IN_HOLE_ERROR	0x20000000	Error in Parenthetical
	WT_NUMBER_BAD_FORMAT	0x40000000	Bad Format (characters, not structure)
Leader Variation
	WT_LEADER_SIZE_MASK	0x00000FFF	Word Type Mask (character in bottom)
Ruler Variations
	WT_RULER_MASK	0x00700000	Drawing Character in the Lower 8-bits
	WT_RULER_CHARACTER	0x000000FF	Mask for Ruler Character
Ruler Types
	WT_RULER_MIXED	0x00000000	Of Indeterminate Type
	WT_RULER_SUBTOTAL	0x00100000	Subtotal Type
	WT_RULER_TOTAL	0x00200000	Total Type
Ruler Flags
	WT_RULER_DASH	0x01000000	Possible Connecting Dash
Date Variations
	WT_DATE_MASK	0x0F000000	Date Code Mask
	WT_DATE_AS_GENERAL	0x00000000	Date as Any Type (short mm/yy not supported)
	WT_DATE_ISO_8601	0x01000000	Date as ISO (in part, w w/o time)
	WT_DATE_TIME_ONLY	0x02000000	A Time with Optional AM/PM
Unknown Word Data
	WT_UNKNOWN_ALPHA	0x0000000F	Alpha Count
	WT_UNKNOWN_NUMERIC	0x000000F0	Numeric Count
	WT_UNKNOWN_CURRENCY	0x00000300	Currency Count (4)
	WT_UNKNOWN_PUNCTUATION	0x00000C00	Sentence Punctuation Count (4)
	WT_UNKNOWN_COMMA_PERIOD	0x00003000	Comma Period Count
	WT_UNKNOWN_GROUP	0x0000C000	Parenthesis/Brace Group
	WT_UNKNOWN_QUOTE	0x00300000	Quote Character Count
	WT_UNKNOWN_FOOTNOTE	0x00C00000	Footnote Type Characters
	WT_UNKNOWN_RULE	0x03000000	Rule Character Count
	WT_UNKNOWN_ELLIPSE	0x0C000000	Ellipse Count
	WT_UNKNOWN_OTHER	0x30000000	Other Count

Depending on your programming background, bitwise operations may be a bit foreign. They are widely used under the hood in many environments and can be very efficient at conveying a lot of information in a small form factor. Generally, the binary information is segmented into flags and ordinals. Flags are simple: if a bit is set, then the condition is true. Ordinals, on the other hand, require a mask to isolate a group of associated bits, and the value of those bits represents one of a set of conditions. For example, the resulting dword can be filtered by ‘ANDing’ the result with the WT_TYPE_ITEM_MASK value:

        code = GetWordType(word);
        switch (code & WT_TYPE_ITEM_MASK) {
          case WT_TYPE_UNKNOWN:
            break;
          case WT_TYPE_WORD:
            break;
          case WT_TYPE_NUMBER:
            break;
          case WT_TYPE_NUMBER_SERIAL:
            break;
          case WT_TYPE_LEADER:
            break;
          case WT_TYPE_RULER:
            break;
          case WT_TYPE_CURRENCY_LEADER:
            break;
          case WT_TYPE_NIL:
            break;
          case WT_TYPE_DATE:
            break;
          }

Each case section can then count or act upon the details of the item. For example, if the type is date, then the WT_DATE_ items can be tested to narrow the type of date.
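The narrowing step for dates can be sketched as follows. This is a Python model of the mask arithmetic only. The constant values are copied from the table above, the sample values are constructed by combining those constants rather than taken from a real GetWordType call, and `describe_date` is a hypothetical helper.

```python
# Constants copied from the table above (Python model of the bit arithmetic).
WT_TYPE_ITEM_MASK = 0x000F0000
WT_TYPE_WORD = 0x00010000
WT_TYPE_DATE = 0x00080000
WT_DATE_MASK = 0x0F000000
WT_DATE_AS_GENERAL = 0x00000000
WT_DATE_ISO_8601 = 0x01000000
WT_DATE_TIME_ONLY = 0x02000000

DATE_NAMES = {
    WT_DATE_AS_GENERAL: "general date",
    WT_DATE_ISO_8601: "ISO 8601 date",
    WT_DATE_TIME_ONLY: "time only",
}


def describe_date(code: int) -> str:
    """Hypothetical helper: check the item type first, then the date ordinal."""
    if (code & WT_TYPE_ITEM_MASK) != WT_TYPE_DATE:
        return "not a date"
    return DATE_NAMES.get(code & WT_DATE_MASK, "unrecognized date code")


# Constructed sample values, not output from a real call.
iso_sample = WT_TYPE_DATE | WT_DATE_ISO_8601
print("0x%08X" % iso_sample, describe_date(iso_sample))   # 0x01080000 ISO 8601 date
print(describe_date(WT_TYPE_DATE | WT_DATE_TIME_ONLY))    # time only
print(describe_date(WT_TYPE_WORD))                        # not a date
```

The item type is tested before the date mask because the bits under WT_DATE_MASK overlap with flags that have other meanings for words and numbers.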

The GetWordType function is useful for aggregating information from a text stream to perform high level analysis. For example, a line of text can be parsed, information accumulated, and the first and last word data examined to determine the probability of the line being a heading, part of a paragraph, or perhaps a row of a table.

Analysis is performed on a gross level basis. That is, types of characters are counted and then run through logic to perform a basic analysis. For example, if one or two dashes are present without text, the content will be considered a “nil” value as would be seen in a table. On the other hand, three dashes would be considered a possible rule or visual aid.
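The "count, then decide" idea can be shown with a toy Python sketch of the dash example. This is not the logic inside GetWordType, which is not documented here. The function `classify_dashes` is hypothetical, handles only the plain hyphen character, and ignores every other case the real function recognizes.

```python
# Toy illustration of gross-level analysis: count character types, then decide.
# Hypothetical logic covering only the dash example from the text.
def classify_dashes(word: str) -> str:
    dashes = sum(1 for ch in word if ch == "-")
    others = len(word) - dashes
    if dashes == 0 or others > 0:
        return "other"              # no dashes, or dashes mixed with other characters
    return "nil" if dashes <= 2 else "rule"


for sample in ["-", "--", "---", "n-a"]:
    print("%-4s %s" % (sample, classify_dashes(sample)))
```

One or two lone dashes are reported as a nil value, three or more as a possible rule, and anything containing other characters falls through to further analysis.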

In addition, there are the related functions GetListType and GetNumericType, which are similar in operation to GetWordType but return data specific to list items and numbers, respectively.

The words to test should be passed to the function without spaces. If the Word Parse Object is employed in WP_GENERAL mode, the words it returns are suitable for this analysis.

Converting Common String Forms

The Legato SDK also contains a number of functions for performing common string conversions and operations:

	Function	Description
	ChangeCase	Changes the case of a string, including HTML.
	CharacterToLowerCase	Converts a character to lower case (ANSI only).
	CharacterToUpperCase	Converts a character to upper case (ANSI only).
	ConformAddressString	Conforms the case and style of an address line.
	ConvertAddNewlines	Copies string and adds newline (0x0A) characters to return (0x0D) characters.
	ConvertDeleteNewlines	Copies string while deleting newline (0x0A) characters.
	ConvertFromEscapeCharacters	Copies from escaped characters (with backslash such as \r or \n).
	ConvertFromUnderbars	Converts underbars in a string to spaces.
	ConvertFromUnderlines	Removes the static control underline characters.
	ConvertNoCodes	Converts a string and changes all control codes (including newlines, returns, tabs) to period (‘.’) characters.
	ConvertNoPunctuation	Converts a string by removing any punctuation.
	ConvertNoSpaces	Converts a string by removing all space characters (0x20).
	ConvertSoftBreaksToSpaces	Converts soft break characters (0x09, 0x0D, 0x0A) to spaces.
	ConvertToEscapeCharacters	Copies to escaped characters (with backslash such as \r or \n).
	ConvertToUnderbars	Copies with spaces changed to underbars.
	ConvertToUnderlines	Copies to static control underline characters using escaped ‘&’.
	ConvertWordSpaces	Converts all word spaces to single spaces.
	MakeLowerCase	Makes a string lower case (ANSI only).
	MakeUpperCase	Makes a string upper case (ANSI only).
	PadString	Pads a string to a specified size with an optional fill string.
	ReplaceInString	Replaces matching strings inside another string with or without case sensitivity.
	ReplaceInStringRegex	Replaces matching strings inside another string using regular expression rules.
	ReverseString	Reverses the character position content of a string.
	TrailStringAfter	Trails off a string with an ellipsis (‘...’ characters) if it exceeds the specified size.
	TrailStringAfterAlways	Trails off a string with an ellipsis (‘...’ characters) at specified size or always.
	TrailStringBefore	Truncates a string and adds an ellipsis (‘...’ characters) at the start of the string if the length exceeds the specified size.
	TrimNonBreakingSpaces	Trims non-breaking spaces (as raw characters).
	TrimPadding	Trims the padding on both left and right sides of string.
	TrimString	Trims the trailing spaces from the right side (end) of a string.

Changing case is a common operation, which can be performed using the MakeLowerCase and MakeUpperCase functions. The ChangeCase function is substantially more sophisticated and allows for the processing of sentences of data in a number of modes, such as title capitalization.

Expanding Our Example

Let us add a few things to the above example:

        handle          hWP;
        string          s1, s2, s3;
        dword           type;
        int             spaces, count, pos;
        
        s1  = "On July 27, 2016, the Company: (i) purchased a dog; (ii) found a vet; ";
        s1 += "(iii) purchased a dog bed; and, (iv) spent $110 on a doggy ID chip. ";

        hWP = WordParseCreate();
        if (hWP == NULL_HANDLE) {
          MessageBox('x', "Error on handle");
          exit;
          }
        WordParseSetData(hWP, s1);
        s2 = WordParseGetWord(hWP);
        while (s2 != "") {
          count++;
          pos = WordParseGetPosition(hWP);
          spaces = WordParseGetSpaceSize(hWP);
          type = GetWordType(s2);
          s2 = ":" + s2 + ":";
          s2 = PadString(s2, 12);
          switch (type & WT_TYPE_ITEM_MASK) { 
            case WT_TYPE_UNKNOWN: s3 = "Unknown"; break; 
            case WT_TYPE_WORD: s3 = "Word"; break; 
            case WT_TYPE_NUMBER: s3 = "Number"; break; 
            case WT_TYPE_NUMBER_SERIAL: s3 = "Number (Serial)"; break; 
            case WT_TYPE_LEADER: s3 = "Leader"; break; 
            case WT_TYPE_RULER: s3 = "Ruler"; break; 
            case WT_TYPE_CURRENCY_LEADER: s3 = "Currency"; break; 
            case WT_TYPE_NIL: s3 = "Nil"; break; 
            case WT_TYPE_DATE: s3 = "Date"; break; 
            default: s3 = "";
            }
          AddMessage("   %3d %3d %3d 0x%08X %s %s", count, pos, spaces, type, s2, s3);
          s2 = WordParseGetWord(hWP);
          }

        CloseHandle(hWP);

The result in the log would appear as:

     1   2   0 0x00310000 :On:         Word
     2   7   1 0x10310000 :July:       Word
     3  11   1 0x00230002 :27,:        Number (Serial)
     4  17   1 0x00030005 :2016,:      Number (Serial)
     5  21   1 0x00110000 :the:        Word
     6  30   1 0x00B1003A :Company::   Word
     7  34   1 0x02110000 :(i):        Word
     8  44   1 0x00110000 :purchased:  Word
     9  46   1 0x00110000 :a:          Word
    10  51   1 0x0011003B :dog;:       Word
    11  56   1 0x02110000 :(ii):       Word
    12  62   1 0x00110000 :found:      Word
    13  64   1 0x00110000 :a:          Word
    14  69   1 0x0011003B :vet;:       Word
    15  75   1 0x02110000 :(iii):      Word
    16  85   1 0x00110000 :purchased:  Word
    17  87   1 0x00110000 :a:          Word
    18  91   1 0x00110000 :dog:        Word
    19  96   1 0x0011003B :bed;:       Word
    20 101   1 0x0011002C :and,:       Word
    21 106   1 0x02110000 :(iv):       Word
    22 112   1 0x00110000 :spent:      Word
    23 117   1 0x08020004 :$110:       Number
    24 120   1 0x00110000 :on:         Word
    25 122   1 0x00110000 :a:          Word
    26 128   1 0x00110000 :doggy:      Word
    27 131   1 0x00210000 :ID:         Word
    28 137   1 0x0091002E :chip.:      Word

We are using the PadString function to make a fixed size field in the log for the word, and we are still maintaining the ‘::’ convention that encloses the word. The return value from the GetWordType function is both translated to a friendly string and printed in hexadecimal form in the log. Note that the words “27,” and “2016,” are considered serial numbers, as in a list that could appear within narrative as opposed to a table cell.
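The hexadecimal values in the log carry more than the item type. As a sketch of how the remaining bits can be read, the Python model below decodes a few of the logged values using constants copied from the GetWordType table. The helper `describe` is hypothetical and covers only a handful of the available flags.

```python
# Constants copied from the GetWordType table (Python model, not Legato code).
WT_TYPE_ITEM_MASK = 0x000F0000
ITEM_TYPES = {0x00010000: "word", 0x00020000: "number", 0x00030000: "serial number"}
WT_WORD_MASK = 0x00700000
WORD_CASES = {0x00100000: "lower", 0x00200000: "upper", 0x00300000: "initial capital"}
WT_WORD_TRAIL_MASK = 0x000000FF
WT_WORD_TRAIL_PUNCTUATION = 0x00800000
WT_WORD_IN_HOLE = 0x02000000
WT_WORD_LEXICON_MASK = 0x70000000
WT_WORD_DATE_MONTH = 0x10000000
WT_NUMBER_CURRENCY = 0x08000000


def describe(code: int) -> str:
    """Hypothetical helper: item type first, then the flags valid for that type."""
    item = ITEM_TYPES.get(code & WT_TYPE_ITEM_MASK, "other")
    notes = [item]
    if item == "word":
        notes.append(WORD_CASES.get(code & WT_WORD_MASK, "general case"))
        if code & WT_WORD_TRAIL_PUNCTUATION:
            notes.append("trailing %r" % chr(code & WT_WORD_TRAIL_MASK))
        if code & WT_WORD_IN_HOLE:
            notes.append("in parentheses or brackets")
        if (code & WT_WORD_LEXICON_MASK) == WT_WORD_DATE_MONTH:
            notes.append("month name")
    elif item == "number" and (code & WT_NUMBER_CURRENCY):
        notes.append("has currency")
    return ", ".join(notes)


# Values taken from the log above.
for word, code in [("July", 0x10310000), ("Company:", 0x00B1003A),
                   ("(i)", 0x02110000), ("$110", 0x08020004)]:
    print("%-9s 0x%08X %s" % (word, code, describe(code)))
```

The same bit can mean different things depending on the item type. For example, 0x10000000 is WT_WORD_DATE_MONTH for a word but WT_NUMBER_PERCENT for a number, which is why the sketch branches on the item type before testing any flags.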

Conclusion

Since Legato is a part of GoFiler and GoFiler specializes in converting and editing text, many of the foundational string functions are exposed as script functions. If you cannot find a function to match your particular operation, contact technical support as it may already exist.

Scott Theis is the President of Novaworks and has been involved in the EDGAR industry for over thirty years. He has worked with the EDGAR system at multiple levels: as a financial printer, a member of the EDGAR design team, and as a software developer. He has extensive expertise with EDGAR, HTML, XBRL, and other programming languages.