KeywordMatcher

Identifies gene/protein IDs using fuzzy keyword matching between gene aliases and descriptions. Keyword matching is based on Czekanowski-Dice distance between pairwise keyword sets, taking into account the information content of the keywords.

To illustrate, consider mapping gene names into database identifiers. The query consists of gene names, which may include non-official names, and textual descriptions. The reference annotation contains database IDs, official gene names, gene aliases and textual descriptions for the whole genome (or a subset of interest), against which query items are compared. Query columns are mapped against each annotation column in turn and a distance metric is computed. If the distance is below threshold, the match is reported. Also reported are keywords that uniquely identify query items in the annotation file.

Method: CSV items in the annotation and query files are tokenized using regular expressions, producing keywords. Keywords are converted to lower case, and their order does not matter.

From the annotation file, the normalized information content (NIC; range 0..1) of each keyword is computed as log(k/N)/log(1/N), where k is the frequency of the keyword and N is the total number of annotation items with non-empty content in the current annotation column. If a query keyword is not present in the annotation file, its NIC is 1.

The distance (range 0..1) between query row i and annotation row j is the minimum of the pairwise column distances on the row. Each pairwise comparison involves two keyword sets, for which we compute the symmetric difference D, the union U and the intersection I. Keywords are weighted by NIC, so more informative keywords have greater weight. The distance is defined as SUM(D)/(SUM(U)+SUM(I)), where SUM is the sum of the NICs in the set.

CPU performance: O(RA*RQ*CA*CQ), where RA, RQ, CA and CQ are the number of rows (R) and selected columns (C) in the annotation (A) and query (Q) files.
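The NIC and distance formulas above can be sketched in a few lines of Python. This is an illustration, not the component's Java implementation. The function names and the sample keyword sets are made up. The sketch assumes that the frequency k counts the annotation items containing a keyword, and it picks fallback values for the degenerate cases (N < 2, or an all-zero denominator), which the description does not specify.

```python
import math
from collections import Counter


def normalized_ic(column_keyword_sets):
    """NIC per keyword for one annotation column: log(k/N) / log(1/N)."""
    # N counts only the items that have non-empty content in this column.
    non_empty = [kws for kws in column_keyword_sets if kws]
    n = len(non_empty)
    # Assumption: k is the number of items that contain the keyword.
    counts = Counter(kw for kws in non_empty for kw in set(kws))
    if n < 2:
        # log(1/N) is zero for N == 1; this fallback is an assumption.
        return {kw: 1.0 for kw in counts}
    return {kw: math.log(k / n) / math.log(1 / n) for kw, k in counts.items()}


def keyword_distance(query, target, nic):
    """Weighted distance SUM(D) / (SUM(U) + SUM(I)) between two keyword sets."""
    def weight(keywords):
        # Keywords that are absent from the annotation file get NIC 1.
        return sum(nic.get(kw, 1.0) for kw in keywords)

    d = weight(query ^ target)  # symmetric difference D
    u = weight(query | target)  # union U
    i = weight(query & target)  # intersection I
    denominator = u + i
    # Assumption: with no informative keywords at all, report the maximum distance.
    return d / denominator if denominator > 0 else 1.0


# Hypothetical annotation column with N = 4 non-empty items.
annotation_column = [
    {"tumor", "protein", "p53"},
    {"protein", "kinase"},
    {"protein", "kinase", "receptor"},
    {"protein"},
]
nic = normalized_ic(annotation_column)
print(keyword_distance({"p53", "protein"}, {"tumor", "protein", "p53"}, nic))
```

In this sample, "protein" occurs in every item and so has NIC 0, "kinase" occurs in two of the four items and has NIC 0.5, and keywords that occur once have NIC 1. The printed distance is 1/3: only "tumor" (weight 1) is in the symmetric difference, the union weighs 2 and the intersection weighs 1.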

Version 1.0
Bundle microarray
Categories Convert
Authors Kristian Ovaska (kristian.ovaska@helsinki.fi)
Issue tracker View/Report issues
Source files component.xml KeywordMatcher.java
Usage Example with default values

Inputs

Name Type Mandatory Description
annotation CSV Mandatory Annotation for a set of genes or other entities. One column is a key column that is reported in the output and others are annotation columns.
query CSV Optional Attributes of query genes or other entities. These are matched against the annotation file. The query may be omitted, in which case statistics on the annotation file are produced as output.

Outputs

Name Type Description
mapping CSV Match results. For each query row, there is one output row. The columns are QueryID, TargetID (comma-separated list of target IDs), TargetDistance (distances corresponding to each target match ID) and Keywords (list of keywords that uniquely identify the item).
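As a rough illustration of how one mapping row could be assembled from precomputed row distances, consider the Python sketch below. It is not taken from the component source. The function names, the identifiers T1..T3 and the number formatting are hypothetical, and the strict "below threshold" comparison and nearest-first ordering are assumptions. The defaults mirror the documented matchDistance and maxMatches parameters.

```python
def select_matches(distances, match_distance=0.2, max_matches=1):
    """Return up to max_matches (distance, target_id) pairs below the threshold."""
    # Assumption: a match requires a distance strictly below match_distance.
    hits = sorted((d, target_id) for target_id, d in distances.items()
                  if d < match_distance)
    # Assumption: the nearest targets are reported first.
    return hits[:max_matches]


def mapping_row(query_id, distances, keywords, match_distance=0.2, max_matches=1):
    """Build one output row with the four documented mapping columns."""
    hits = select_matches(distances, match_distance, max_matches)
    return {
        "QueryID": query_id,
        "TargetID": ",".join(target_id for _, target_id in hits),
        # The three-decimal formatting is an assumption.
        "TargetDistance": ",".join(f"{d:.3f}" for d, _ in hits),
        "Keywords": ",".join(keywords),
    }


# Hypothetical distances from one query row to three annotation rows.
row = mapping_row("q1", {"T1": 0.05, "T2": 0.15, "T3": 0.6}, ["p53"],
                  match_distance=0.2, max_matches=2)
print(row)
```

With these made-up values, T1 and T2 fall below the threshold and T3 does not, so TargetID and TargetDistance each hold two comma-separated entries.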

Parameters

Name Type Default Description
annotationColumns string "*" Comma-separated list of column names in the annotation file that are used for keyword matching. The special value * includes all columns, including the key column.
annotationKeyColumn string "" Column name in the annotation file that contains target identifiers. If empty, the first column is used.
matchDistance float 0.2 Distance threshold below which two rows are considered to represent the same gene or other entity. Must be between 0 and 1.
maxMatches int 1 Maximum number of matches that are reported for each query row. Setting this to 1 improves performance.
pruneKeywordIC float 0.1 Keywords whose information content is below this threshold are removed to improve performance. Effectively, their NIC=0. Setting this too high lowers accuracy.
queryColumns string "*" Comma-separated list of column names in the query file that are used for keyword matching. The special value * includes all columns, including the key column.
queryKeyColumn string "" Column name in the query file that contains query identifiers. If empty, the first column is used.
removePattern string "[()\[\]\{\};]" Java regular expression that is applied to CSV cells before splitting into tokens. Matching portions are replaced with a space character.
tokenizePattern string "[, ]+" Java regular expression that is used to split CSV cells into tokens.
trimPattern string "[:._-]+" Java regular expression that is applied to tokens (keywords) to remove leading and trailing portions of the token. For example, if trimPattern="[_.]+", the token "__abc_def__." is transformed into "abc_def".
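The three pattern parameters describe a small tokenization pipeline, which the Python sketch below imitates. It uses Python's re module in place of Java regular expressions. The default patterns are copied from the table above and use only syntax that both engines share. The function names and the sample cell text are hypothetical, and the order of trimming and lower-casing is an assumption.

```python
import re

REMOVE_PATTERN = r"[()\[\]\{\};]"   # default removePattern
TOKENIZE_PATTERN = r"[, ]+"         # default tokenizePattern
TRIM_PATTERN = r"[:._-]+"           # default trimPattern


def trim_token(token, trim_pattern=TRIM_PATTERN):
    """Strip leading and trailing runs that match trim_pattern."""
    return re.sub(rf"^(?:{trim_pattern})|(?:{trim_pattern})$", "", token)


def tokenize(cell, remove_pattern=REMOVE_PATTERN,
             tokenize_pattern=TOKENIZE_PATTERN, trim_pattern=TRIM_PATTERN):
    """Turn one CSV cell into a set of lower-case keywords."""
    # Portions matching removePattern are replaced with a space.
    cleaned = re.sub(remove_pattern, " ", cell)
    keywords = set()
    for token in re.split(tokenize_pattern, cleaned):
        # Assumption: tokens are trimmed first and lower-cased afterwards.
        keyword = trim_token(token, trim_pattern).lower()
        if keyword:  # drop empty tokens left by leading or trailing separators
            keywords.add(keyword)
    return keywords


print(tokenize("Tumor protein p53 (TP53); antigen NY-CO-13"))
print(trim_token("__abc_def__.", "[_.]+"))
```

The second call reproduces the trimPattern example from the table: the leading and trailing runs of underscores and dots are removed, while the underscore inside the token is kept.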

Test cases

Test case Parameters annotation (IN) query (IN) mapping (OUT)
case1 properties (matchDistance = 0.5, maxMatches = 2) annotation query mapping
case2_noquery properties (maxMatches = 1) annotation (missing) mapping

Generated 2018-12-18 07:42:21 by Anduril 2.0.0