Often data files have variable record types, identified by a control character in the first one or two positions of the record. This Tip of the Day walks you through the basic steps of importing and parsing the data.
|Assume your data file looks like this:
The first character of each row identifies what type of row it is. Some examples:
1 = header record
5 = batch header record
9 = file control record
Different row types have different layouts.
How do you extract all the data in a reliable, repeatable way? Read on!
|Step 1: Import the file as a fixed width file with a single column as wide as the entire width of the file.|
|In the illustration above, the following parameters have been set
|Step 2: Create a project. Use the dataset you imported in Step 1 as the anchor dataset
|Step 3: Split the one column, by Length with the first column being 1 character wide and the second column being the remaining characters (93 in the case of the example).
|Step 4: For each record type, set a filter on that record type and split the unparsed_data column based on that record layout specification.|
This illustration shows parsing a header type record (1) of an ACH file.
|Once you have parsed each record type, you would likely have additional work to finalize the data preparation, but that is a subject for another day. Hint: “Fill Down”!|
n.b.: There are many standards which use this type of format. In finance, there are countless applications of this type of format, in genomic research, the VCF (Variant Call Format) is used for storing gene sequence variations, in statistics XPORT / XPT make up the SAS Transport File Format Family. There are countless applications and industries which use this type of layout. Paxata is the Swiss Army knife that will open them all!