
Soft edit distance for differentiable comparison of symbolic sequences

lib:f9adbcaa3b7d4906 (v1.0.0)

Authors: Evgenii Ofitserov, Vasily Tsvetkov, Vadim Nazarov
ArXiv: 1904.12562
Abstract URL: http://arxiv.org/abs/1904.12562v1


Edit distance, also known as Levenshtein distance, is an essential way to compare two strings that has proved particularly useful in the analysis of genetic sequences and in natural language processing. However, edit distance is a discrete function that is known to be hard to optimize, which hampers the use of this metric in machine learning. Even an algorithm as simple as K-means fails to cluster a set of sequences using edit distance if they are of variable length and abundance. In this paper we propose a novel metric, soft edit distance (SED), which is a smooth approximation of edit distance. It is differentiable, and therefore it can be optimized with gradient methods. Like the original edit distance, SED and its derivatives can be computed with recurrent formulas in polynomial time. We demonstrate the usefulness of the proposed metric on synthetic datasets and on clustering of biological sequences.
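To illustrate the general idea behind a differentiable relaxation of edit distance, the sketch below replaces the hard min in the classical Levenshtein dynamic program with a temperature-controlled soft-min (a negative log-sum-exp). This is a minimal illustration of the smoothing principle the abstract describes, not the exact SED formulation from the paper; the function names, unit costs, and the `tau` temperature parameter are assumptions for this example.

```python
import math

def soft_min(values, tau=0.1):
    # Smooth approximation of min(values); approaches the hard minimum as tau -> 0.
    m = min(values)  # subtract the minimum for numerical stability
    return m - tau * math.log(sum(math.exp(-(v - m) / tau) for v in values))

def soft_edit_distance(a, b, tau=0.1, sub_cost=None):
    # Standard O(len(a) * len(b)) dynamic program; only the min is softened.
    # NOTE: illustrative sketch only, not the paper's exact SED recurrence.
    if sub_cost is None:
        sub_cost = lambda x, y: 0.0 if x == y else 1.0
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)   # boundary: deletions only
    for j in range(1, m + 1):
        d[0][j] = float(j)   # boundary: insertions only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = soft_min([
                d[i - 1][j] + 1.0,                               # deletion
                d[i][j - 1] + 1.0,                               # insertion
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),  # substitution
            ], tau)
    return d[n][m]

if __name__ == "__main__":
    # With a small temperature the value is close to the classic Levenshtein distance (3).
    print(soft_edit_distance("kitten", "sitting", tau=0.05))
```

Because every operation in the recurrence is smooth, the result is differentiable with respect to any continuous substitution costs (or, in the paper's setting, soft symbol representations), which is what allows gradient-based optimization such as K-means-style centroid updates over sequences.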

Relevant initiatives

Related knowledge about this paper:
- Reproduced results (crowd-benchmarking and competitions)
- Artifact and reproducibility checklists
- Common formats for research projects and shared artifacts
- Reproducibility initiatives
