Up: Component summary Component

IDConvert

Converts IDs from a column in a CSV file to other type of IDs and/or combines duplicated rows into one. The conversion is done using a conversion table.

This component can be used to make an ID column unique by giving the same CSV file as both "csv" and "conversionTable" input, using the same column for sourceColumn and conversionColumn, and giving unique=true.

Version 1.0
Bundle tools
Categories Convert
Specialties generic
Authors Kristian Ovaska (kristian.ovaska@helsinki.fi)
Issue tracker View/Report issues
Requires commons-math3-3.2.jar (jar) ; commons-primitives-1.0.jar (jar)
Source files component.xml IDConvert.java
Usage Example with default values

Type parameters (generics)

Inputs

Name Type Mandatory Description
in T (generic) Mandatory Source CSV file. Contains the column that is to be converted. The column is named by the parameter 'sourceColumn'.
conversionTable CSV Mandatory Conversion table. Must contain the column named by the parameter 'conversionColumn'.

Outputs

Name Type Description
out T (generic) Converted CSV file. This is a modified copy of the source CSV file that contains the converted column.

Parameters

Name Type Default Description
collapseNumeric string "median" Method for collapsing numeric columns with duplicate rows (when unique=true). Legal values are "median" (take median of non-NA values), "mean", "sum", "max", "min", "first" (take the first row), "majority" (take the value that is present on the largest number of rows), "consensus" (require that all rows have the same value) and "indicator".
Here, "majority" and "consensus" are mostly useful for discrete (integer) values). For "consensus", it is an error if duplicate rows contain distinct values. "indicator" assumes that data are divided into negative (-1), neutral (0) and positive (1) items, such as discretized expression values. The result is (a) 1 if there is a 1 and no -1 value; (b) -1 if there is a -1 and not 1 value; (c) 0 if there is a 0 and no -1 or 1; or (d) NA otherwise.
collapseString string "consensus" Method for collapsing non-numeric columns with duplicate rows (when unique=true). Legal values are "first" (take the first row), "majority" (take the value that is present on the largest number of rows), "longest" (take the longest value), "shortest" (take the shortest value) and "consensus" (require that all rows have the same value). For "consensus", it is an error if duplicate rows contain distinct values.
conversionColumn string (no default) Column in 'conversionTable' that is used to fetch new values for the source column. If the parameter is empty, the first column is used.
dropMissing boolean false Determines whether rows with missing conversion values are dropped. See also 'originalWhenMissing'.
keyColumn string "" Column in 'conversionTable' that is used to match ID values from the source CSV. If the parameter is empty, the first column is used.
originalWhenMissing boolean true Determines how values are converted when the item is not found in the conversion table. If true, the original (unconverted) ID is inserted to the output. If false, NA is inserted. If 'dropMissing' is set true, NA rows are dropped from the output.
sourceColumn string "" Column in 'csv' that is going to be converted. If empty, the first column is used.
split boolean false If true, split values in the source column by commas before ID conversion. For example, if the source column contains the string "a,b,c", it is first split into "a", "b" and "c"; then, individual IDs are converted; finally, the converted IDs are joined by commas into a string of format "A,B,C".
splitConverted boolean false If true, split converted values by commas after ID conversion. A row with comma separated values in the target column will be split to otherwise identical rows but so that the rows have a single ID value in the target column. This is done before possible collapsing of duplicate rows.
targetColumn string "" Target column name in 'converted'. If empty, the column name is the same as the original column (sourceColumn).
unique boolean false If true, rows with identical identifiers (sourceColumn) are collapsed into one row so that sourceColumn is unique. Collapsing may result in error if duplicate rows contain conflicting values. See parameters collapseNumeric and collapseString for methods to collapse numeric and non-numeric columns, respectively.
varMeasure string "" Method for calculating the variation of numeric duplicate values. If the variation is above the threshold given in varThreshold , duplicate values are collapsed to NA. Possible values are 'MAD' and 'variance'. 'MAD' stands for the median absolute deviation and 'variance' for the population variance. If no values is given, no threshold for combining numeric values is used.
varThreshold float 2 Maximum allowed variability in the duplicate values. If varMeasure is given duplicate values with variance above the threshold are collapsed to NA

Test cases

Test case Parameters IN
in
IN
conversionTable
OUT
out
case1 properties in conversionTable out

sourceColumn=Column2,
conversionColumn=Convert,
targetColumn=Column2Converted

case10_unique_max properties in conversionTable out

unique=true,
conversionColumn=idtype2,
collapseNumeric=max

case11_unique_min properties in conversionTable out

unique=true,
conversionColumn=idtype2,
collapseNumeric=min

case12_longest properties in conversionTable out

unique=true,
conversionColumn=idtype2,
collapseString=longest

case13_dropMissing properties in conversionTable out

sourceColumn=Column2,
conversionColumn=Convert,
targetColumn=Column2Converted,
originalWhenMissing=false,
dropMissing=true

case14_sum properties in conversionTable out

unique=true,
collapseNumeric=sum,
conversionColumn=ID

case15_consensus_numeric properties in conversionTable (expecting failure)

unique=true,
collapseNumeric=consensus,
conversionColumn=gene_id,
keyColumn=gene_id,
sourceColumn=gene_id

case2_split properties in conversionTable out

sourceColumn=Column2,
conversionColumn=Convert,
split=true

case3_unique_median properties in conversionTable out

unique=true,
conversionColumn=idtype2

case4_unique2 properties in conversionTable out

unique=true,
sourceColumn=idtype1,
conversionColumn=idtype2,
collapseNumeric=majority,
collapseString=first

case5_missing properties in conversionTable out

sourceColumn=Column2,
conversionColumn=Convert,
targetColumn=Column2Converted,
originalWhenMissing=false

case6_split_converted properties in conversionTable out

sourceColumn=Column2,
conversionColumn=Convert,
targetColumn=Column2Converted,
splitConverted=true

case7_split_converted_unique properties in conversionTable out

sourceColumn=Column2,
conversionColumn=Convert,
targetColumn=Column2Converted,
splitConverted=true,
unique = true

case8_mad_unique properties in conversionTable out

unique=true,
sourceColumn=idtype1,
conversionColumn=idtype2,
varMeasure=MAD

case9_unique_mean properties in conversionTable out

unique=true,
conversionColumn=idtype2,
collapseNumeric=mean


Generated 2018-12-12 07:42:06 by Anduril 2.0.0