Up: Component summary Component

CSVFilter

Filters columns and/or rows from CSV files using flexible criteria.

Columns are filtered based on column names (see "includeColumns"); it is also possible to rename columns (see "rename"). Columns can be reordered using "includeColumns".

Rows can be filtered using several criteria. A row is included in output if all conditions are satisfied. Rows having too many missing values are filtered using the "nonMissing" parameter. The parameters "regexp", "lowBound" and "highBound" filter rows based on cell contents.

If the "auxiliary" input is given, the result contains those rows from the "csv" input file whose values match those from the auxiliary file. Column names for matching are specified using "idColumn" and "matchColumn".

Default settings do not filter out anything. The "negate" parameter reverses all inclusion criteria. The component reads the "csv" input one line at a time and is memory-efficient for files with a large number of lines.

Version 1.1.1
Bundle tools
Categories Filter
Authors Kristian Ovaska (kristian.ovaska@helsinki.fi), Marko Laakso (Marko.Laakso@Helsinki.FI)
Issue tracker View/Report issues
Source files component.xml CSVFilter.java
Usage Example with default values

Inputs

Name Type Mandatory Description
in CSV Mandatory CSV file to be filtered.
aux CSV Optional If given, contains one column (see "matchColumn") whose values are matched to a column in the "csv" input (see "idColumn").
includeColumnsFile IDList Optional Column names that should be included in the result file. This input is used together with includeColumns parameter. The actual list represents their union. The order of column names is significant: the output columns are in the order given. Order of this input overrides the parameter in case of duplicates.

Outputs

Name Type Description
out CSV Filtered CSV file.

Parameters

Name Type Default Description
colOrder boolean false Enables including columns from the includeColumns parameter before the columns in the includeColumns input. Secify whether to include columns specified by the includeColumns parameter last (default) or first. See also includeColumns input and includeColumns parameter.
highBound string "" A row is included if the value in given numeric columns are at most as large as given numeric bounds. The parameter has a format COLNAME1=HIGH1,COLNAME2=HIGH2.
idColumn string "" Column name in "csv" that contains ID values. If the parameter is empty, the first column is used.
includeColumns string "*" Comma-separated list of column names that should be included in the result file. The special value * includes all columns. The order of column names is significant: the output columns are in the order given. See also includeColumns input and colOrder parameter. The parameter supports regular expressions for column name matching.
lowBound string "" A row is included if the value in given numeric columns are at least as large as given numeric bounds. The parameter has a format COLNAME1=LOW1,COLNAME2=LOW2. For example, "col=5" includes rows where the column col has a value of at least 5.
matchColumn string "" Column name in "auxiliary" containing values that must match the ID column in "csv" (specified using "idColumn"). If empty, the first column of "auxiliary" is used.
negate boolean false If true, all inclusion criteria are negated so that rows or columns are excluded instead of included if they match the criteria. For example, includeColumns=* then excludes all columns (which would be an error since at least one column must be included).
nonMissing float 0 Include those rows that have non-missing (non-NA) values in at least this many columns. Only those columns that are part of the output are counted. If nonMissing is less than one, it is interpreted as a percentage value. Note that '1' is interpreted as an absolute value, not as 100 percent.
regexp string "" Row filtering based on regular expression. A row is included in the result if values in the given columns match given regular expressions. The parameter has a format COLNAME1=EXPRESSION,COLNAME2=EXPRESSION2 where COLNAMEs are column names in "csv" and EXPRESSIONs are regular expressions using Java syntax. For example, "col=a|b" includes rows where the column col has a value of "a" or "b". Commas have to be escaped with "\\" if they are used in column names or regular expressions.
rename string "" Comma-separated list of column renaming rules (OLDNAME=NEWNAME). Parameters that refer to column names refer to old names; renaming is done last.

Test cases

Test case Parameters IN
in
IN
aux
IN
includeColumnsFile
OUT
out
case01_names properties in (missing) (missing) out

includeColumns=Gene,C3,C2,
regexp=Gene=R4|R3|R1

case02_aux properties in aux (missing) out

matchColumn=A2

case03_colstar properties in (missing) (missing) out

includeColumns=*,
regexp=Gene=R5

case04_missing properties in (missing) (missing) out

includeColumns=C3,C2,
nonMissing=2

case05_bounds properties in (missing) (missing) out

lowBound=C1=2,C2=3,
highBound=C1=5.1,C3=6

case06_nofilter (missing) in (missing) (missing) out
case07_negate properties in (missing) (missing) out

includeColumns=C1,C2,
regexp=Gene=R2|R3|R5,
negate=true

case08_rename properties in (missing) (missing) out

includeColumns=Gene,C1,C4,
regexp=Gene=R1|R2|R3,
rename=Gene=GeneNew,C4=C4New

case09_negateCols properties in (missing) (missing) out

includeColumns=col1,
idColumn=ID,
negate=true

case10_includeCols properties in (missing) includeColumnsFile out

includeColumns = B,A

case11 properties in (missing) includeColumnsFile out

includeColumns = B,A,
colOrder = true

case12_aux_empty (missing) in aux (missing) out
case13_misPercent properties in (missing) (missing) out

includeColumns=C3,C2,
nonMissing=0.5

case14_includeCols_regex properties in (missing) (missing) out

includeColumns = B,A,.*match.*


Generated 2018-12-17 07:42:31 by Anduril 2.0.0