
CSVSort

Sorts and merges CSV files.

The component either sorts a single CSV or merges an array of CSVs into one sorted CSV. Sorting can use several key fields, each in ascending or descending order. Fields may be strings (any characters), words (only alphabetic characters and digits), natural numbers, or real numbers.
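The multi-key ordering described here can be illustrated with a minimal Python sketch. It is not the component's implementation (the component delegates to the sort utility); the function name sort_rows, the CONVERTERS table, and the way the word type is handled (non-alphanumeric characters are dropped before comparison) are assumptions made for illustration only.

```python
import csv
import io

# Hypothetical converters for the four column types; the "word" handling
# (drop everything except letters and digits) is an assumption.
CONVERTERS = {
    "string": str,
    "word": lambda s: "".join(ch for ch in s if ch.isalnum()),
    "natural": int,
    "real": float,
}


def sort_rows(text, key_columns, types=None, modes=None):
    """Sort CSV text by several key columns, each with its own type and direction."""
    types = types or {}
    modes = modes or {}
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last gives the combined multi-key order.
    for column in reversed(key_columns):
        convert = CONVERTERS[types.get(column, "natural")]
        descending = modes.get(column, "asc") == "des"
        rows.sort(key=lambda row: convert(row[column]), reverse=descending)
    return reader.fieldnames, rows


if __name__ == "__main__":
    data = "Name,Score\nb,2\na,10\nc,2\n"
    fields, ordered = sort_rows(
        data, ["Score", "Name"], types={"Name": "string"}, modes={"Score": "des"}
    )
    print(",".join(fields))
    for row in ordered:
        print(",".join(row[f] for f in fields))
```

Columns without an explicit type are treated as natural and columns without an explicit mode as ascending, mirroring the defaults listed under Parameters.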

The component invokes the GNU sort utility and therefore supports external sorting: files too large to fit into RAM can still be sorted. NOTE: For extremely large input CSVs (bigger than RAM), it is recommended to disable the preliminary sort check (set skipSortCheck=true) if the check is not needed.
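As a rough illustration of how key columns, types, and modes could map onto a GNU sort command line, the sketch below only builds an argument list and does not run anything. The command the component actually constructs is not documented here, so the function build_sort_argv, the TYPE_FLAGS mapping (including the choice of -d for word and -g for real), and the tuple layout of each key are assumptions. The flags themselves (-t, -k, -n, -g, -d, -r, -f, -s, -u, -c) are standard GNU sort options.

```python
# Hypothetical mapping from column types to GNU sort key flags:
# n = numeric, g = general numeric, d = dictionary order (blanks and alphanumerics).
TYPE_FLAGS = {"string": "", "word": "d", "natural": "n", "real": "g"}


def build_sort_argv(keys, sort_cmd="sort", stable=True, unique=False, check_only=False):
    """Build an argument list for GNU sort.

    keys is a list of (position, column_type, descending, ignore_case) tuples,
    where position is the 1-based index of the column in the CSV.
    """
    argv = [sort_cmd, "-t", ","]  # comma as field separator
    for position, column_type, descending, ignore_case in keys:
        flags = TYPE_FLAGS[column_type]
        if descending:
            flags += "r"  # reverse the comparison for this key
        if ignore_case:
            flags += "f"  # fold lower case to upper case
        argv += ["-k", f"{position},{position}{flags}"]
    if stable:
        argv.append("-s")  # disable last-resort comparison
    if unique:
        argv.append("-u")  # keep only the first of equal rows
    if check_only:
        argv.append("-c")  # only check whether the input is sorted
    return argv


if __name__ == "__main__":
    print(" ".join(build_sort_argv([(2, "natural", True, False), (3, "string", False, True)])))
```

The sketch ignores the header row and quoted fields; a real invocation would have to keep the header out of the sorted data and handle quoting first.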

Version 1.2
Bundle tools
Categories
Authors Vladimir Rogojin (vladimir.rogojin@helsinki.fi)
Requires GNU sort
Source files component.xml CSVSort.java

Inputs

Name Type Mandatory Description
in CSV Optional An input CSV. If inArray is defined, this CSV is merged with the CSVs from the array.
inArray Array<CSV> Optional An array of input CSVs. If defined, all CSVs from the array are merged. If in is also defined, it is merged with the array.
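How the two inputs combine can be sketched as follows. This is an illustration under assumptions, not the component's code: the array is modelled as a Python dict from key to rows, select_inputs and merge_sorted are hypothetical names, and the regex parameter is applied with a full match against each key, which may differ from how the component applies its Java regular expression.

```python
import re


def select_inputs(single, array, regex=".*"):
    """Collect the row lists to merge: the optional single CSV plus matching array entries."""
    pattern = re.compile(regex)
    selected = [] if single is None else [single]
    # Assumption: an array entry is used only if the whole key matches the pattern.
    selected += [rows for key, rows in array.items() if pattern.fullmatch(key)]
    return selected


def merge_sorted(inputs, key_index=0):
    """Concatenate all selected rows and sort them by one key column."""
    merged = [row for rows in inputs for row in rows]
    merged.sort(key=lambda row: row[key_index])
    return merged


if __name__ == "__main__":
    array = {
        "file1": [["b", "1"]],
        "file2": [["a", "2"]],
        "other": [["z", "9"]],
    }
    single = [["c", "3"]]
    for row in merge_sorted(select_inputs(single, array, regex="file.")):
        print(",".join(row))
```

With regex set to file. only the entries keyed file1 and file2 are selected, so the row from the entry keyed other does not appear in the result.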

Outputs

Name Type Description
out CSV The sorted CSV
status TextFile Contains sorted if all input CSVs are sorted, unsorted if at least one of the input files is not sorted, or unknown if no sort check was performed.
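The three status values can be summarised with a small sketch. It assumes each input is a list of rows and that sortedness is judged per input on a single key in ascending order; the function name sort_status and its parameters are hypothetical, and the component itself may perform the check differently.

```python
def sort_status(inputs, key=lambda row: row[0], skip_sort_check=False):
    """Return "sorted", "unsorted", or "unknown" for a list of row lists."""
    if skip_sort_check:
        # No check performed, so nothing can be said about the inputs.
        return "unknown"
    for rows in inputs:
        keys = [key(row) for row in rows]
        # One out-of-order neighbouring pair is enough to report "unsorted".
        if any(a > b for a, b in zip(keys, keys[1:])):
            return "unsorted"
    return "sorted"


if __name__ == "__main__":
    print(sort_status([[["a"], ["b"]], [["c"], ["d"]]]))
    print(sort_status([[["a"], ["b"]], [["d"], ["c"]]]))
    print(sort_status([[["d"], ["c"]]], skip_sort_check=True))
```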

Parameters

Name Type Default Description
ignoreCase string "" Comma-separated list of key columns to compare case-insensitively.
keyColumns string "" Ordered, comma-separated list of key columns by which to sort the CSV files. Rows are sorted by the first listed column, then by the second, and so on. By default, the first column of the CSVs is the key column. NOTE: All CSVs must have the same columns in the same order!
mode string "" Comma-separated list of sorting modes, one entry per column, in the format column_name=mode. Modes: asc (ascending), des (descending). By default, all columns are sorted in ascending order.
noSort boolean false Whether to only check that the inputs are sorted instead of sorting them. If true, no sorting is performed: all CSVs are checked for sortedness and an empty CSV is produced.
preprocess boolean true If the input CSVs contain quotes delimiting columns and/or spaces inside field values, use preprocess=true to transform the inputs into a format suitable for the sort utility.
regex string ".*" Java regular expression matched against the keys of the CSVs in the input array; it selects which CSVs to merge/sort.
skipSortCheck boolean false Whether to skip the sort check. If true, the initial check of whether the input CSVs are already sorted is not performed. Setting it to true is recommended to improve performance on large input CSVs. If no sort check is performed, the status output is unknown.
sortCmd string "sort" The command used to invoke the sort utility.
stable boolean true Enables the stable option of sort, which disables the last-resort comparison (rows with equal keys keep their input order).
types string "" Comma-separated list of column types. Format: column_name = column_type. The following column types are defined: string (any character sequence), word (sequence of alphabetic letters and/or digits), natural (non-negative integer), real (real number). By default, all columns are of type natural.
unique boolean false Return unique rows only.
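The list-valued parameters above (keyColumns, mode, types, ignoreCase) share a comma-separated syntax, optionally with name=value entries. A minimal parsing sketch follows; the helper names parse_list and parse_mapping are hypothetical, and tolerating trailing commas and spaces around = is a choice of this sketch, not documented behaviour of the component.

```python
MODES = {"asc", "des"}
TYPES = {"string", "word", "natural", "real"}


def parse_list(value):
    """Split a comma-separated parameter value, ignoring empty entries."""
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_mapping(value, allowed):
    """Parse entries of the form column_name=setting into a dict."""
    mapping = {}
    for item in parse_list(value):
        name, separator, setting = item.partition("=")
        name, setting = name.strip(), setting.strip()
        if not separator or not name or setting not in allowed:
            raise ValueError(f"bad entry: {item!r}")
        mapping[name] = setting
    return mapping


if __name__ == "__main__":
    print(parse_list("Col2,Col3,Col4,"))
    print(parse_mapping("Col3=string,Col4 = real,", TYPES))
    print(parse_mapping("Col2=des", MODES))
```

Columns absent from the resulting dicts fall back to the documented defaults: ascending order and type natural.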

Test cases

Test case Parameters IN in IN inArray OUT out OUT status
case1 (missing) in (missing) out (missing)
case10_unique_rows properties in (missing) out (missing)

unique=true

case2 (missing) in (missing) out (missing)
case3 properties in (missing) out (missing)

keyColumns=Col2,Col3,Col4,
types=Col3=string,Col4=real,
skipSortCheck=true

case4 properties in (missing) out (missing)

keyColumns=Col2,Col3,Col4,
types=Col3=string,Col4=real,
mode=Col2=des

case5 properties in inArray out (missing)

keyColumns=Col1,Col3,Col2,
types=Col2=natural,Col1=string,Col3=string,
mode=Col3=des,
ignoreCase=Col1

case6 properties (missing) inArray out (missing)

keyColumns=Col1,Col3,Col2,
types=Col2=natural,Col1=string,Col3=string,
mode=Col3=des,
ignoreCase=Col1,
regex=file.

case7 properties (missing) inArray out (missing)

keyColumns=Col1,Col3,Col2,
types=Col2=natural,Col1=string,Col3=string,
mode=Col3=des,
ignoreCase=Col1,
regex=file.,
noSort=true

case8 properties in (missing) out (missing)

keyColumns=MEDIANSURIVALPVALUE,
mode=MEDIANSURIVALPVALUE=asc,
types=MEDIANSURIVALPVALUE=real

case9 properties in (missing) out (missing)

keyColumns=Col_2,Col3,
types=Col_2=string,Col3=natural


Generated 2018-12-12 07:42:06 by Anduril 2.0.0