• Solutions
    • FERC XBRL Reporting
    • FDTA Financial Reporting
    • SEC Compliance
    • Windows Clipboard Management
    • Legato Scripting
  • Products
    • GoFiler Suite
    • XBRLworks
    • SEC Exhibit Explorer
    • SEC Extractor
    • Clipboard Scout
    • Legato
  • Education
    • Training
    • SEC and EDGAR Compliance
    • Legato Developers
  • Blog
  • Support
  • Skip to blog entries
  • Skip to archive page
  • Skip to right sidebar

Friday, December 16. 2016

Legato Developers Corner #13: Exploding Strings and Files

No, they don’t actually go BOOM! Legato features both implode and explode functions that allow strings to be created from or turned into arrays of strings. There are actually a number of functions for different purposes, such as exploding formatted text or a parameter array. These functions bundle a lot of horsepower into just a few operations and when combined with other functions, they make it easy to load or write complex data.


The ExplodeString function allows a single string to be cut into smaller pieces and placed into an array. Conversely, the ImplodeArray function allows an array to be assembled into a single string. Both functions use what is commonly referred to as ‘glue’ (a delimiter) to perform the disassembly and reassembly.


This blog post will cover using these functions, and will provide a real world example of their utility. It should be noted that as with many programming tasks, there are multiple ways to approach a problem, each with their respective pros and cons. This article is biased to using strings and arrays.


Strings To and From Files


Let’s begin by loading data into a string variable. Many times data is either collected from a file or an internet source. Two functions directly copy the contents of a file into or out of a string:


string = FileToString ( string name );


and


int = StringToFile ( string data, string name );


In both functions, the name parameter is a fully qualified filename. In the case of the FileToString function, the name could also be an HTTP URL. This is not so for the StringToFile function because HTTP does not support a direct file write operation.


The FileToString function will return an empty string on error but it could also return an empty string if the source file is empty or contains binary data starting with a zero. The IsError and GetLastError functions can be used to determine if an error occurred and the nature of the error. The source file is opened using the lowest privileges and sharing possible and is immediately closed after the data has been copied. The resulting string can be large. In the example provided, we are opening an EDGAR archive index which is in excess of 30mb. For extremely large files, other methods, such as the OpenMappedTextFile function, would be more appropriate.


On the other hand, the StringToFile function writes the string data verbatim to a specified file. If the file already exists, it is overwritten. The function returns a formatted error code, so it is not necessary to call the GetLastError function. The IsError function is fine to check the function’s return value for an error. 


Translating Strings to Arrays


Once we have loaded data into the string using the FileToString function, let’s suppose that we want to access that data in sections, and those sections of text are delimited by line endings. Exploding a string takes the content of the string and finds segments by a delimiter, which are in turn returned as a list or single-dimension array.


string[] = ExplodeString ( string data, [string delimiter], [int limit] );


The data parameter would typically be from a source, such as the returned string from the FileToString function as described above. If the data is to be split by lines, not specifying the delimiter parameter causes any type of line ending to be used as a delimiter.


Other delimiters can also be used, for example:


string   s1;
string   items[];
int      ix, size;

s1 = "The quick brown fox jumped over the lazy dog's back";

items = ExplodeString(s1, " ");

size = ArrayGetAxisDepth(items);
for (ix = 0; ix < size; ix++) {
  AddMessage("%d - '%s'", ix, items[ix]);
  }

creates the following in the default log:


0 - 'The'
1 - 'quick'
2 - 'brown'
3 - 'fox'
4 - 'jumped'
5 - 'over'
6 - 'the'
7 - 'lazy'
8 - 'dog's'
9 - 'back'

As a side note, the ExplodeString function is not suitable to parse CSV data since there are multiple methods of delimiting and escaping the data. Finally, the ExplodeString function has an optional parameter called limit. The limit parameter allows two different methods of capturing only part of the data. In one mode, the limit will simply add the requested number of items. In the other mode, the function will add the items and place the unprocessed data after the limit as the last item in the array.


On the flipside, the ImplodeArray function will glue an array together into a string. If we add the code below to end of the above script:


s1 = ImplodeArray(items, "+");
AddMessage(s1);

and run the script again, the following will be added to the log:


The+quick+brown+fox+jumped+over+the+lazy+dog's+back

The function prototype is as follows:


string = ImplodeArray ( string[] array, [string glue] );


The glue is optional and, if omitted, will be set to “\r\n” as conventional line endings. In the above example, we used a ‘+’ symbol. It is worth noting that the function does not encode source data to avoid conflicts with the glue string. So take care in the choice of glue and the content of the array.


Translating a Formatted String


Another useful function that takes a string and makes an array from the contents is the FormattedTextToArray function. This function grabs multiple segments and dumps the result into a returned list. It assumes that the source data is a formatted text string with fixed field positions.


Field positions are supplied as a list of integers. The function prototype is as follows:


string[] = FormattedTextToArray ( string data, int[] positions );


The data parameter should not contain control characters, such as tabs and line endings. If the source contains tab characters, they should be expanded using the ExpandTabs function. Fields are identified by the positions array, which should contain integers identifying the ending position of each segment. The first position starts at index 0. As an example similar to the one above, we can read formatted data:


string  s1;
string  items[];
int     pos[];
int     ix, size;

s1 = "Peanut Butter              167cal    15g       6g        7g";

pos[0] = 26; pos[1] = 33; pos[2] = 42; pos[3] = 52; pos[4] = 0;

items = FormattedTextToArray(s1, pos);

size = ArrayGetAxisDepth(items);
for (ix = 0; ix < size; ix++) {
  AddMessage("%d - '%s'", ix, items[ix]);
  }

and send it to the log:


0 - 'Peanut Butter'
1 - ' 167cal'
2 - '    15g'
3 - '     6g'
4 - '     7g'

Note that the resulting data is trimmed for trailing spaces but not leading spaces. (After Legato version 4.16a, both leading and trailing spaces are trimmed.) The last position can be set to some value beyond the end of the longest line or zero to indicate to go to the maximum left position. As we will see later, this function can be useful for processing large lists of data. It could also be used to parse an EDGAR ASCII table that has been marked with EDGAR table tags.


Putting Exploded and Formatted Strings Together


Below is a script that will download a quarterly EDGAR archive index, parse through the lines, create a CSV file, and then open the file in Data View.


                                                                // -- Variables
    string              s1;
    string              lines[];
    string              items[];
    string              table[][];
    int                 pos[20];
    int                 ix, size;
    int                 cx, cols;
    int                 rx;
    int                 rc;
        
                                                                // -- Initialize
    ProgressOpen("Get SEC Quarterly Index", 0);
    pos[0] = 62; table[0][0] = "Company Name";
    pos[1] = 74; table[0][1] = "Form Type";
    pos[2] = 86; table[0][2] = "CIK";
    pos[3] = 98; table[0][3] = "Date Filed";
    pos[4] = 0;  table[0][4] = "File Name";
    rx = 1;
        

                                                                // -- Get the Archive Data
    ProgressSetStatus("Downloading IDX File");
    s1 = FileToString("https://www.sec.gov/Archives/edgar/full-index/2016/QTR3/company.idx");
    if (IsError()) {
      rc = GetLastError();
      MessageBox('x', "Error loading idx file (%08X) ", rc);
      exit;
      }
    lines = ExplodeString(s1);
    size = ArrayGetAxisDepth(lines);
                                                                // -- Line loop (start at data)    
    ProgressSetStatus("Translating");
    ix = 10;
    while (ix < size) {
      ProgressUpdate(ix, size);
      items = FormattedTextToArray(lines[ix], pos);
      table[rx][0] = items[0];
      table[rx][1] = items[1];
      table[rx][2] = items[2];
      table[rx][3] = items[3];
      table[rx][4] = items[4];
      ix++; rx++;
      }
                                                                // -- Write Out File
    ProgressSetStatus("Write File");
    s1 = GetTempFile();                                         // (change file as desired)
    CSVWriteTable(table, s1);                                   // (will be deleted on app reopen)

    ProgressClose();
    DataViewOpenFile(s1);

This script can essentially be broken into four parts: (i) download data from the SEC into a variable; (ii) explode it into an array; (iii) translate each line into table entries; and, (iv) write the data to CSV and open it in Data View. Overall run time will be from 30 to 60 seconds (or more) depending on your internet connection speed and the speed of your computer. The index file can be downloaded locally and the script can point to the download while experimenting to avoid repeatedly downloading the 30mb index file.


Positions and table headings are set in the beginning section of the script (marked in the comments as “Initialize”). They are coded side by side to illustrate the position and column relationships. You will also note that a progress window has been added to show the user how things are moving along. (I hate programs that present the spinning wheel of death with no information as to what is happening.)


The FileToString function points directly to the SEC URL for the index to download. The code checks to see if the download succeeded and then continues to convert the string to an array with the ExplodeString function.


Now that we have an array of data in the lines variable, we can start processing the body of the index data. In this case, we assume that the data starts at row 10. A more sophisticated approach could be taken to locate the index body, which would be preferable should the SEC change the format of the data. This is also true of the column positions.


The FormattedTextToArray function is called on each line of the data from the lines variable. Note that the intermediate variable items is used to translate into the table. (Legato does not presently support partial array transfers such as having the table row equal the function.)


Finally, upon completion, the entire table variable is written to a temporary file using the CSVWriteTable function and the file is then opened in Data View using the DataViewOpenFile function. Note that the temporary file will be deleted by the application the next time it starts up as part of its house keeping. Obviously the resulting file could be written to any location or even a location selected by the user location via the BrowseSaveFile function.


Conclusion


In next week’s blog, we will continue our discussion with parameters and arrays.

Posted by
Scott Theis
in Development at 16:47
Trackbacks
Trackback specific URI for this entry

No Trackbacks

Comments
Display comments as (Linear | Threaded)
No comments
The author does not allow comments to this entry

Quicksearch

Categories

  • XML Accounting
  • XML AICPA News
  • XML FASB News
  • XML GASB News
  • XML IASB News
  • XML Development
  • XML Events
  • XML FERC
  • XML eForms News
  • XML FERC Filing Help
  • XML Filing Technology
  • XML Information Technology
  • XML Investor Education
  • XML MSRB
  • XML EMMA News
  • XML FDTA
  • XML MSRB Filing Help
  • XML Novaworks News
  • XML GoFiler Online Updates
  • XML GoFiler Updates
  • XML XBRLworks Updates
  • XML SEC
  • XML Corporation Finance
  • XML DERA
  • XML EDGAR News
  • XML Investment Management
  • XML SEC Filing Help
  • XML XBRL
  • XML Data Quality Committee
  • XML GRIP Taxonomy
  • XML IFRS Taxonomy
  • XML US GAAP Taxonomy

Calendar

Back May '25 Forward
Mo Tu We Th Fr Sa Su
Sunday, May 18. 2025
      1 2 3 4
5 6 7 8 9 10 11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28 29 30 31  

Feeds

  • XML
Sign Up Now
Get SEC news articles and blog posts delivered monthly to your inbox!
Based on the s9y Bulletproof template framework

Compliance

  • FERC
  • EDGAR
  • EMMA

Software

  • GoFiler Suite
  • SEC Exhibit Explorer
  • SEC Extractor
  • XBRLworks
  • Legato Scripting

Company

  • About Novaworks
  • News
  • Site Map
  • Support

Follow Us:

  • LinkedIn
  • YouTube
  • RSS
  • Newsletter
  • © 2024 Novaworks, LLC
  • Privacy
  • Terms of Use
  • Trademarks and Patents
  • Contact Us