Document Type

Technical Report

Publication Date






Technical Report Number



The DNA restriction mapping problem can be abstracted to the Shortest Common Matching String problem by viewing the restriction fragment lengths as symbols and the clones as bags. The Shortest Common Matching String problem can be decomposed into the Bag Sequencing problem and the Symbol Sequencing problem. All three of these problems have bene shown to be NP-hard. Rhee has proposed a family of greedy algorithms to compute polynomial time approximations to the Bag Sequencing problem, which produced surprisingly good performance results on abstracted data. In the test data generated by Rhee for pragmatic analysis, the symbols in bags represented instances of fragments whose length equivalency had been (magically) determined, a priori; conceptually, fragments of identical length were identified as being equivalent. Also, the distribution of the abstract symbols was assumed to be a uniform distribution, instead of a geometric distribution, as exhibited by fragment length data. This report presents modifications of Rhee's work, adapted to (non-abstracted) fragment length data, and the performance of the modified algorithms when applied to simulated data and laboratory data. The modifications reflect the actual length properties of the restriction fragments, identifying two instance of fragments as being similar if their measured lengths differ by no more than some predefined error measurement bound (often taken to be about 3%). This relation is not an equivalence relation, since it is not transitive. The modified algorithms were applied to fragments length data containing normal random (measurement) error containing the classical geometric distribution. The performance that these modified algorithms achieved when applied against such fragment length data was again surprisingly good.


Permanent URL: