A Combinatorial Approach to Search and Clustering
Michael Houle, National Institute of Informatics
26 April 2007
M E Houle @ NII
Overview
I.  Shared-Neighbor Clustering
II. The Relevant-Set Correlation Model
    a. Association Measures
    b. Significance of Association
    c. Partial Significance and Reshaping
III. The GreedyRSC Heuristic
IV. Experimental Results
V.  Extensions and Applications
    a. Query Result Clustering
    b. Outlier Detection
    c. Feature Set Evaluation and Selection
What is Clustering?
Clustering is:
- Organization of data into well-differentiated groups of highly-similar items.
- A form of unsupervised learning.
- A fundamental operation in data mining & knowledge discovery.
- An important tool in the design of efficient algorithms and heuristics.
- Closely related to search & retrieval.
Clustering Paradox
Clustering models/methods traditionally make assumptions on the nature of the data:
- Data representation.
- Similarity measures.
- Data distribution.
- Cluster numbers, sizes and/or densities.
- Definition of noise.
… but cluster analysis seeks to discover the nature of the data!
Shared-Neighbor Clustering
Similarity measures not fully trusted?
- Curse of dimensionality – concentration effect.
- Variations in density.
- Lack of objective meaning.
Shared-neighbor information: "If two items have many neighbors in common, they are probably closely related."
- Similarity measure used primarily for ranking.
- Adaptive to variations in density.
Shared-Neighbor Clustering Methods (1): Jarvis-Patrick (1973)
- Hierarchical clustering heuristic.
- Single-linkage merge criterion.
- Fixed-cardinality neighborhoods.
- Merge threshold t. Merge if there exists a pair a, b such that:
  - a and b are k-NNs of one another;
  - the intersection of their k-NN lists contains at least tk items.
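The merge criterion above can be sketched in a few lines of Python; `jp_merge` and the `knn` map are illustrative names, not code from the talk:

```python
def jp_merge(a, b, knn, k, t):
    """Jarvis-Patrick merge test: cluster a and b together only if they are
    k-NNs of one another AND their k-NN lists share at least t*k items."""
    mutual = b in knn[a] and a in knn[b]
    shared = len(set(knn[a]) & set(knn[b]))
    return mutual and shared >= t * k

# Toy neighbor lists (k = 3): a and b are mutual neighbors sharing one item.
knn = {"a": ["b", "c", "d"], "b": ["a", "c", "e"]}
print(jp_merge("a", "b", knn, k=3, t=1 / 3))  # True: 1 shared item suffices
print(jp_merge("a", "b", knn, k=3, t=0.5))    # False: would need 1.5 items
```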
Shared-Neighbor Clustering Methods (2): ROCK (Guha, Rastogi, Shim 2000)
- Hierarchical clustering heuristic.
- Fixed-radius neighborhoods.
- Pairwise linkage defined as the size of the intersection of neighborhoods.
- Merge if the total (size-weighted) inter-cluster linkage size is maximized.
Shared-Neighbor Clustering Methods (3): SNN (Ertöz, Steinbach, Kumar 2003)
Based on DBSCAN (1996):
- Density over fixed-radius neighborhood.
- Core points – density exceeding a supplied threshold.
- Merging – if one core point is contained in the neighborhood of another.
SNN: DBSCAN with
- fixed-cardinality neighborhoods;
- similarity: intersection size of fixed-cardinality neighborhoods.
Drawbacks of Shared-Neighbor Clustering
Fixed k-NNs:
- Bias towards clusters of size order k. Examples: Jarvis-Patrick, SNN.
- How to choose k?
Fixed-radius neighborhoods:
- Bias towards clusters of larger density. Example: ROCK.
- How to choose the radius?
Clustering depends on parameters that make implicit assumptions regarding the data.
Desiderata for Clustering
Fully automated clustering:
- Similarity measure, but used strictly for ranking.
- Otherwise, no knowledge of data distribution.
- Parameters must have domain-independent interpretation.
- Automatic determination of the number of clusters and cluster sizes.
Other desiderata:
- Scalable heuristics.
- Adaptive to variations in density.
- Handles cluster overlap.
Query-Based Clustering
How can we cluster when the nature of the data is hidden?
- No pairwise (dis)similarity measure?
- Only assumption: relevancy rankings for queries-by-example.
- Q(q, k): ranked relevant set for query item q, with |Q(q, k)| = k.
- Clusters will be patterned on query relevant sets Q(q, k) for some q in S.
Confidence
Two sets A and B are related according to their degree of overlap.
Natural measure – confidence (inspired by Association Rule Mining; related to the Jarvis-Patrick merge criterion):

    0 ≤ conf(A, B) = |A ∩ B| / |A| ≤ 1

Interpretation of conf: precision & recall (as in IR).
- Query result Q for concept set C.
- Precision is conf(Q, C).
- Recall is conf(C, Q).
Mutual Confidence
Symmetric measure – mutual confidence, the geometric mean of the two confidences:

    0 ≤ MC(A, B) = √( conf(A, B) · conf(B, A) ) = |A ∩ B| / √( |A| · |B| ) ≤ 1

Interpretation of MC: cosine of the angle between set vectors.
- If item j is a member, the j-th coordinate equals 1; otherwise it equals 0.
- cos⁻¹(MC(A, B)) is a distance metric.
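As a concrete sketch of the two overlap measures (function names are mine; the sets are arbitrary):

```python
import math

def conf(a, b):
    """conf(A,B) = |A ∩ B| / |A|: the fraction of A that also lies in B."""
    return len(a & b) / len(a)

def mutual_confidence(a, b):
    """MC(A,B): geometric mean of conf(A,B) and conf(B,A), i.e. the cosine
    of the angle between the 0/1 set vectors of A and B."""
    return len(a & b) / math.sqrt(len(a) * len(b))

A, B = {1, 2, 3, 4}, {3, 4, 5, 6}
print(conf(A, B))               # 0.5 -- precision of result A for concept B
print(mutual_confidence(A, B))  # 0.5 -- equal sizes, so MC coincides here
```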
Set Correlation
Pearson correlation formula:

    r = ( Σᵢ xᵢyᵢ − n·x̄·ȳ ) / √( (Σᵢ xᵢ² − n·x̄²) · (Σᵢ yᵢ² − n·ȳ²) )

Apply this to the coordinate pairs of the set vectors for A, B ⊆ S.
Gives the set correlation between A and B:

    R(A, B) = ( |S|·|A ∩ B| − |A|·|B| ) / √( |A|·|B|·(|S| − |A|)·(|S| − |B|) )

Tends to the cosine similarity measure when A and B are small relative to S.
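A minimal sketch of the set correlation, checking the claim that it approaches the cosine (mutual confidence) measure when the sets are small relative to S; the sizes below are arbitrary:

```python
import math

def set_correlation(a, b, n):
    """R(A,B): Pearson correlation of the 0/1 indicator vectors of A and B
    over a universe of n items."""
    na, nb, nab = len(a), len(b), len(a & b)
    return (n * nab - na * nb) / math.sqrt(na * nb * (n - na) * (n - nb))

def cosine(a, b):
    """|A ∩ B| / sqrt(|A|·|B|): the cosine (mutual confidence) measure."""
    return len(a & b) / math.sqrt(len(a) * len(b))

# With |A|, |B| = 10 and |S| = 1000, R(A,B) is close to the cosine measure.
A, B, n = set(range(10)), set(range(5, 15)), 1000
print(set_correlation(A, B, n), cosine(A, B))
```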
Intra-set Association
How do we measure the goodness of a cluster candidate C? No pairwise similarity measure is available!
First-order criterion – if v belongs to C, then:
- The items relevant to v should belong to C.
- R(C, Q(v, |C|)) should be high.
Second-order criterion – if v, w belong to C, then:
- The items relevant to v should coincide with those relevant to w.
(Will discuss only the first-order criterion here.)
Self-confidence
Measure – self-confidence: the average mutual confidence between a set and the (same-sized) relevant sets of its members. Denoted SC(A), where

    SC(A) = (1/|A|) Σ_{v∈A} MC(A, Q(v, |A|)) = (1/|A|²) Σ_{v∈A} |A ∩ Q(v, |A|)|

Related to the SNN density criterion.
Self-correlation
Measure – self-correlation: the average correlation between a set and the (same-sized) relevant sets of its members. Denoted SR(A), where

    SR(A) = (1/|A|) Σ_{v∈A} R(A, Q(v, |A|)) = ( |S|·SC(A) − |A| ) / ( |S| − |A| )
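Both intra-set measures follow directly from the definitions. In this sketch, `Q` is a toy ranking (the k nearest integers on a line) standing in for a real query engine; all names are mine:

```python
def Q(v, k, items=tuple(range(10))):
    """Toy relevant set: the k items nearest to v on the integer line."""
    return set(sorted(items, key=lambda x: (abs(x - v), x))[:k])

def self_confidence(A):
    """SC(A) = (1/|A|^2) * sum over members v of |A ∩ Q(v,|A|)|."""
    k = len(A)
    return sum(len(A & Q(v, k)) for v in A) / k ** 2

def self_correlation(A, n):
    """SR(A) via the identity SR(A) = (|S|·SC(A) − |A|) / (|S| − |A|)."""
    return (n * self_confidence(A) - len(A)) / (n - len(A))

A = {0, 1, 2}
print(self_confidence(A))       # 8/9: one member's relevant set leaks out
print(self_correlation(A, 10))  # 53/63, a bit below the SC value
```

For A = {0, 1, 2}, two members' relevant sets stay inside A while one leaks a single item out, giving SC(A) = 8/9 and SR(A) = 53/63 ≈ 0.84.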
Significance & Size
Which aggregation of points is more significant?
- SC(A) = 0.8525, SR(A) = 0.815625.
- SC(B) = 1.0, SR(B) = 1.0.
- SC(C) = 0.45, SR(C) ≈ 0.3888889.
Set size must be considered.
Note: proper interpretation of the Pearson correlation requires a test of significance.
Randomness Hypothesis
What if every query relevant set (QRS) were independently selected uniformly at random?
- The size of the intersection between a QRS and a fixed set is distributed hypergeometrically.
- Expectation and variance can be determined.
If X = |A ∩ B|, where A is fixed and B is selected randomly from S, then:

    E(X) = |A|·|B| / |S|
    Var(X) = |A|·|B|·(|S| − |A|)·(|S| − |B|) / ( |S|²·(|S| − 1) )
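The closed forms can be verified by brute-force enumeration on a small universe (pure stdlib; the sizes are arbitrary):

```python
import itertools

def hyper_moments(n, a, b):
    """E and Var of X = |A ∩ B| for fixed |A| = a and a uniformly random
    size-b subset B of an n-element universe (hypergeometric moments)."""
    e = a * b / n
    var = a * b * (n - a) * (n - b) / (n * n * (n - 1))
    return e, var

# Enumerate every size-4 subset B of an 8-element universe and compare.
n, a, b = 8, 3, 4
A = set(range(a))
xs = [len(A & set(B)) for B in itertools.combinations(range(n), b)]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
print((mean, var))
print(hyper_moments(n, a, b))  # matches the enumeration above
```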
Significance & Standard Scores
Standard score under the randomness hypothesis: a measure of the deviation from randomness.
- Z_SC(A): number of standard deviations of SC(A) from its expectation.
- Z_SR(A): number of standard deviations of SR(A) from its expectation.
- The greater the standard scores, the more significant the aggregation.

    Z_SC(A) = ( SC(A) − E(SC(A)) ) / √Var(SC(A))
    Z_SR(A) = ( SR(A) − E(SR(A)) ) / √Var(SR(A))
Intra-Set Significance
Under the randomness hypothesis:

    E(R(A, B)) = ( E(|A ∩ B|) − |A|·|B|/|S| ) · |S| / √( |A|·|B|·(|S| − |A|)·(|S| − |B|) ) = 0

    Var(R(A, B)) = Var(|A ∩ B|) · |S|² / ( |A|·|B|·(|S| − |A|)·(|S| − |B|) ) = 1 / (|S| − 1)

Therefore:

    Z_SR(A) = ( SR(A) − (1/|A|) Σ_{v∈A} E(R(A, Q(v, |A|))) ) / √( (1/|A|²) Σ_{v∈A} Var(R(A, Q(v, |A|))) )
            = √( |A|·(|S| − 1) ) · SR(A)
Intra-Set Significance
Similarly, for self-confidence:

    E(SC(A)) = (1/|A|²) Σ_{v∈A} E(|A ∩ Q(v, |A|)|) = |A| / |S|

    Var(SC(A)) = (1/|A|⁴) Σ_{v∈A} Var(|A ∩ Q(v, |A|)|) = (|S| − |A|)² / ( |A|·|S|²·(|S| − 1) )

Can show:

    Z_SR(A) = Z_SC(A) = √( |A|·(|S| − 1) ) · SR(A) ≜ Z(A)

Z(A): the intra-set significance of set A.
Example
Set significances of A, B, C:
- SC(A) = 0.8525, SR(A) = 0.815625, Z(A) ≈ 36.29.
- SC(B) = 1.0, SR(B) = 1.0, Z(B) ≈ 22.25.
- SC(C) = 0.45, SR(C) ≈ 0.3888889, Z(C) ≈ 12.24.
Z(A) > Z(B) > Z(C).
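The quoted values are mutually consistent with the formula Z(A) = √(|A|·(|S|−1))·SR(A) under |S| = 100 and set sizes |A| = 20, |B| = 5, |C| = 10; these sizes are inferred, not stated on the slide:

```python
import math

def z_score(sr, size, n):
    """Intra-set significance Z(A) = sqrt(|A| * (|S| - 1)) * SR(A)."""
    return math.sqrt(size * (n - 1)) * sr

# Inferred sizes (an assumption): |S| = 100, |A| = 20, |B| = 5, |C| = 10.
for name, sr, size in [("A", 0.815625, 20), ("B", 1.0, 5), ("C", 35 / 90, 10)]:
    print(name, round(z_score(sr, size, 100), 2))  # 36.29, 22.25, 12.24
```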
Inter-Set Significance
Z(A, B): the inter-set significance of (the relationship between) A and B. Under the randomness hypothesis:

    E(R(A, B)) = 0                    Var(R(A, B)) = 1 / (|S| − 1)
    E(MC(A, B)) = √(|A|·|B|) / |S|    Var(MC(A, B)) = (|S| − |A|)·(|S| − |B|) / ( |S|²·(|S| − 1) )

    Z(A, B) = ( R(A, B) − E(R(A, B)) ) / √Var(R(A, B))
            = ( MC(A, B) − E(MC(A, B)) ) / √Var(MC(A, B))
            = √(|S| − 1) · R(A, B)

For fixed |S|, inter-set significance is equivalent to the set correlation R(A, B).
Contributions to Significance
Some members contribute more than others towards the set significance Z(A).
Contribution of member v to SR(A):

    t(v|A) = (1/|A|) · R(A, Q(v, |A|))

Can consider potential contributions Z(v|A) even when v ∉ A.
Partial Significance
Standard score of t(v|A) w.r.t. the randomness hypothesis:

    Z(v|A) = √(|S| − 1) · R(A, Q(v, |A|))

The significance of A can be expressed in terms of these partial significances:

    Z(A) = (1/√|A|) Σ_{v∈A} Z(v|A)
Set Reshaping (1)
Idea: modify A so as to boost its significance.
A new set A′ has average mutual correlation to A:

    SR(A′|A) = (1/|A′|) Σ_{v∈A′} R(A, Q(v, |A|))

Significance of SR(A′|A) w.r.t. the randomness hypothesis:

    Z(A′|A) = (1/√|A′|) Σ_{v∈A′} Z(v|A)
Set Reshaping (2)
For fixed size |A′|, we can maximize Z(A′|A) by taking the largest values of Z(v|A).
For this example:
- Z(A|A) = Z(A) = 36.29.
- Z(A′|A) = 37.18.
- Maximum achieved at A′.
A serves as a pattern for the discovery of A′.
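Greedy reshaping, i.e. ranking every item by its partial significance and keeping the top |A′|, can be sketched as follows over a toy ranking (the k nearest integers on a line); all names are mine:

```python
import math

def set_correlation(a, b, n):
    return (n * len(a & b) - len(a) * len(b)) / math.sqrt(
        len(a) * len(b) * (n - len(a)) * (n - len(b)))

def Q(v, k, items=tuple(range(10))):
    # Toy relevant set: the k items nearest to v on the integer line.
    return set(sorted(items, key=lambda x: (abs(x - v), x))[:k])

def reshape(A, size, items=tuple(range(10))):
    """Keep the `size` items v with the largest Z(v|A), i.e. the largest
    R(A, Q(v,|A|)) -- the sqrt(|S|-1) factor does not affect the ranking."""
    n, k = len(items), len(A)
    ranked = sorted(items, key=lambda v: set_correlation(A, Q(v, k), n),
                    reverse=True)
    return set(ranked[:size])

print(reshape({0, 1, 2}, 3))
```

Here the pattern {0, 1, 2} is already optimal at size 3, so reshaping returns it unchanged.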
Partitioning
What if we need to assign item v to a single group? We want a group C for which both:
- the set significance is high; and
- the significance of the relationship with v is high.
In practice, we can choose the group C (reshaped from pattern C*) satisfying:

    maximize over C*:  Z(v|C*) · Z(C|C*)
Cluster Map Generation
- Nodes are sets having sufficiently high significance scores.
- Edges appear between set nodes having sufficiently high inter-set significances.
- Retained cluster candidates should not be too highly correlated with other retained candidates.
- The final clusters form an "independent set" within an initial candidate cluster map.
Candidate Map
- Nodes meeting a minimum threshold on set significance.
- Edges meeting a minimum threshold on inter-set significance (correlation).
- Nodes ranked by significance (red is highest).
- Thick edges join nodes whose set correlations are too high.
- Need an independent node set within the thick-edged subgraph.
- Heuristic: greedy by significance.
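The greedy step can be sketched abstractly; the `z` scores and the `correlation` function below are placeholders for the set significances and set correlations of real candidates:

```python
def greedy_select(candidates, z, correlation, max_corr=0.5):
    """Scan candidates in decreasing significance, keeping each one only if
    it is not too highly correlated with any already-kept candidate."""
    kept = []
    for c in sorted(candidates, key=z, reverse=True):
        if all(correlation(c, k) <= max_corr for k in kept):
            kept.append(c)
    return kept

# Three toy candidates: 'b' is redundant with the more significant 'a'.
zs = {"a": 30.0, "b": 20.0, "c": 10.0}
corr = {frozenset(p): r for p, r in
        [(("a", "b"), 0.8), (("a", "c"), 0.2), (("b", "c"), 0.9)]}
print(greedy_select("abc", zs.get, lambda x, y: corr[frozenset((x, y))]))
# ['a', 'c']
```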
Cluster Map
- The final cluster map is typically disconnected.
- Cluster nodes form a rough cover of the data set.
GreedyRSC
Pipeline (reconstructed from the flow diagram): DB → Queries → Sample Relevant Sets → Candidate Patterns → Pattern Pruning → Member Assignment → Cluster Pruning → Cluster Map Generation.
Scalability Issues
Problems:
- Curse of dimensionality: computing queries Q(q, k) can take time linear in |S| even when k is small.
- Computing Z(A) takes time quadratic in |A|.
Workarounds:
- Use approximate neighborhoods via the SASH search structure [H. '03, H. & Sakuma '05].
- Pattern generation over samples of varying sizes, with a fixed range for |A|.
Sampling Strategy
- Create bands of samples of sizes |S|, |S|/2, |S|/4, ….
- Within each sample, compute a candidate pattern for each element via maximization of set significance. Fixed range of pattern sizes a < k < b.
- Within each sample, select patterns greedily and eliminate duplicates.
- Reshape patterns to form cluster candidates.
- Eliminate duplicate candidates to form the final cluster set.
- Generate cluster map edges.
Overall Time Complexity
Without data partitioning (n = |S|):
- Precompute relevant sets: O(n log n) queries.
- Compute pattern for O(n log n) neighborhoods: O(b²) time each.
- Compute all patterns: O(b²·n log n) time.
- Eliminate duplicate patterns: O((b² + σ²)·n log n), where σ² is the average variance of inverted member list sizes over each sample.
- Form candidate clusters: bounded by O(b·n log²n).
- Eliminate duplicate clusters & create map: O((b² + τ²)·n log n), where τ² is the average variance of inverted member list sizes over each sample.
- Total, excluding relevant set computation: O((b² + σ² + τ²)·n log n + b·n log²n).
Data partitioning (details omitted!) introduces a factor of c, the number of data chunks.
Clustering Parameters
For comparison of significance values, we can normalize significance scores for convenience:
- A common factor dependent on |S| can be dropped.
- Normalizing inter-set significance → set correlation.
- Normalizing the square of set significance:

    0 ≤ Z²(A) / (|S| − 1) = |A| · SR²(A) ≤ |A|

For all experiments:
- Minimum normalized squared set significance = 4.
- Maximum normalized inter-set significance = 0.5.
- Minimum normalized inter-set significance = 0.1 (for the cluster map).
Images
Amsterdam Library of Object Images (ALOI):
- Dense feature vectors, colour & texture histograms (prepared by INRIA-Rocquencourt).
- Number of vectors: 110,250. 641 features per vector.
- 5322 clusters computed in < 4 hours on a desktop (older, slower implementation).
- Maximum cluster size: 7201. Median cluster size: 12. Minimum cluster size: 4.
- SASH accuracy of ~96%.
Journal Abstracts
Medline medical journal abstracts, 1996 to mid-2003:
- Vectors with TF-IDF term weighting, NO dimensional reduction (prepared by IBM TRL).
- Number of vectors: 1,055,073. ~75 non-zero attributes per vector. Representational dimension: 1,101,003.
- 9789 clusters computed in < 24 hours on a 3.0 GHz desktop.
- Maximum cluster size: 15,255. Median cluster size: 45. Minimum cluster size: 4.
- SASH accuracy of 51%, 115× faster than sequential search.
Protein Sequences
Bacterial ORFs (protein sequences):
- Vectors of gapped BLAST scores with respect to a fixed sample of 1/10th size (as per Liao & Noble, 2003) (prepared by the DNA Data Bank of Japan).
- Number of vectors: 378,659. ~125 non-zero attributes per vector. Representational dimension: 40,000.
- Vector preparation: < 1 day on a 16-node PC cluster.
- 8907 clusters computed in 7 hours on a 3.0 GHz desktop.
- Maximum cluster size: 69,859. Median cluster size: 20. Minimum cluster size: 4.
- SASH accuracy of 75%, 34× faster than sequential search.
Demos
Query Result Clustering
"Pure" shared-neighbor clustering can be used to cluster the results of queries:
- Produce long ranked query result lists.
- The ranking function can be hidden.
- The database must support queries-by-example.
- Otherwise, it can be performed without the cooperation of the database manager.
Example at NII:
- WEBCAT Plus library database.
- GETA search engine.
- A WEBCAT Plus QRC tool is currently under development (with N. Grira).
Query Result Clustering
Adapting RSC for QRC:
- The database size may not be known; the self-correlation and significance formulas are then undefined.
- Approximation: assume infinite database size. As |S| tends to infinity, the normalized squared significance tends to:

    Z²(A) / (|S| − 1) = |A| · SR²(A) → |A| · SC²(A)

- This formula does not depend on |S|.
- Computation as per the GreedyRSC heuristic.
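The limit is easy to check numerically; `norm_sq_significance` is an illustrative name, and SC = 0.8, |A| = 20 are arbitrary values:

```python
def norm_sq_significance(sc, size, n):
    """|A| * SR(A)^2, with SR recovered from SC via
    SR = (|S|*SC - |A|) / (|S| - |A|)."""
    sr = (n * sc - size) / (n - size)
    return size * sr * sr

# As |S| grows, the value approaches the |S|-free limit |A| * SC^2.
for n in (100, 10_000, 1_000_000):
    print(n, norm_sq_significance(0.8, 20, n))
print(20 * 0.8 ** 2)  # the limiting value, |A| * SC^2
```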
Outlier Detection
RSC model:
- Patterns of low significance can indicate the presence of outliers.
- Can be detected as per the initial stages of GreedyRSC.
- Many potential definitions are possible.
Some possible formulations (to be minimized):

    Z(Q(v, k))                  SR(Q(v, k))
    max_{1≤i≤k} Z(Q(v, i))      max_{1≤i≤k} SR(Q(v, i))

Work in progress with M. Gebski, NICTA.
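One of the formulations above, minimizing max over i ≤ k of Z(Q(v, i)), can be sketched over a toy ranking (the k nearest integers on a line); all names and data are mine:

```python
import math

def set_correlation(a, b, n):
    return (n * len(a & b) - len(a) * len(b)) / math.sqrt(
        len(a) * len(b) * (n - len(a)) * (n - len(b)))

def Q(v, k, items=tuple(range(10))):
    # Toy relevant set: the k items nearest to v on the integer line.
    return set(sorted(items, key=lambda x: (abs(x - v), x))[:k])

def significance(A, n=10):
    """Z(A) = sqrt(|A|(|S|-1)) * SR(A)."""
    sr = sum(set_correlation(A, Q(v, len(A)), n) for v in A) / len(A)
    return math.sqrt(len(A) * (n - 1)) * sr

def outlier_score(v, ks=range(2, 6)):
    """max over k of Z(Q(v,k)); items for which this stays LOW never anchor
    a significant pattern at any scale, marking them as outlier candidates."""
    return max(significance(Q(v, k)) for k in ks)

print(outlier_score(0))
```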
Feature Set Evaluation
Feature selection methods:
- To our knowledge, current feature selection techniques have been developed only for supervised learning: a training set is needed to guide the process.
- Work with N. Grira: unsupervised feature set evaluation and selection.
Assumptions:
- Similarity measure and candidate features unknown.
- Assess the effectiveness of the features & similarity measure for search and clustering.
Good Feature Sets
For any given 'true' cluster C:
- For any item v in C, any query based at v should ideally rank the items of C ahead of any other items.
- Two-set classification.
- Best result – when there exists a partition of the data set into clusters such that the two-set classification property holds.
Distinctiveness
Distinctiveness criterion:
- A variant of self-correlation in which external items that are well-correlated with A are penalized.
- Equals 1 if A is perfectly associated (SR(A) = 1) and the external points are uncorrelated with A.

    DR(A) = (1/|A|) Σ_{v∈A} R(A, Q(v, |A|)) − (1/(|S| − |A|)) Σ_{v∉A} R(A, Q(v, |A|))
          = SR(A) − (1/(|S| − |A|)) Σ_{v∉A} R(A, Q(v, |A|))
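A direct transcription of DR over a toy ranking (the k nearest integers on a line; names mine). Note that DR can exceed 1 when the external relevant sets are anti-correlated with A, as happens in this example:

```python
import math

def set_correlation(a, b, n):
    return (n * len(a & b) - len(a) * len(b)) / math.sqrt(
        len(a) * len(b) * (n - len(a)) * (n - len(b)))

def Q(v, k, items=tuple(range(10))):
    # Toy relevant set: the k items nearest to v on the integer line.
    return set(sorted(items, key=lambda x: (abs(x - v), x))[:k])

def distinctiveness(A, items=tuple(range(10))):
    """DR(A) = SR(A) minus the average correlation of A with the relevant
    sets of the items OUTSIDE A (external overlap is penalized)."""
    n, k = len(items), len(A)
    r = lambda v: set_correlation(A, Q(v, k), n)
    sr = sum(r(v) for v in A) / k
    outside = sum(r(v) for v in items if v not in A) / (n - k)
    return sr - outside

print(distinctiveness({0, 1, 2}))  # > 1: external sets are anti-correlated
```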
Significance
Significance can be derived as under RSC, via the randomness hypothesis:

    Z_DR(A) = √( |A|·(|S| − |A|)·(|S| − 1) / |S| ) · DR(A)

- Unlike the self-correlation significance, the distinctiveness significance tends to 0 as |A| tends to |S|.
- Distinctiveness is expensive to compute in practice; SR(A) can be used to approximate DR(A).
Feature Set Evaluation
Criterion for feature set selection:
- For each item q, estimate the most significant relevant set based at q.
- Average the self-correlations of the most significant relevant sets identified.
- Serves as the basis for search, e.g. local improvement methods such as Tabu Search.

    maximize (1/|S|) Σ_{q∈S} SR(Q(q, k_q)),  where  k_q = argmax_{1≤k≤|S|} DR(Q(q, k))
Feature Set Evaluation
Properties of the criterion maximize (1/|S|) Σ_{q∈S} SR(Q(q, k_q)), with k_q = argmax_{1≤k≤|S|} DR(Q(q, k)):
- If all relevant sets are randomly generated, the criterion is 0.
- If all relevant sets are identical, the criterion is 0.
- If two-set classification holds for all q, the criterion is 1 (the maximum possible).
- Conjectured: for any disjoint partition of the data into clusters of size at least some constant, there exists a set of rankings for which the maximum of 1 is achieved.
Background: Protein Sequence Analysis
Applications of clustering:
- Classification of sequences of unknown functionality with respect to clusters of sequences of known functionality.
- Discovery of new motifs from clusters of sequences of previously unknown function.
Problems:
- Single-linkage (agglomerative) clustering techniques produce clusters with poor internal association.
- Traditional clustering techniques do not scale well to large set sizes.
- Protein sequence data is "inherently" high-dimensional.
Protein Sequence Similarity
Pairwise (gapped) sequence alignment scoring:
- BLAST heuristic [Altschul et al. '97].
- Bonus for matching and near-matching symbols.
- Penalty for non-matching symbols.
- Penalty for gaps (increases with gap length).
- Dynamic programming (expensive!). Faster heuristics exist.

Example alignment:
MSVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDEREIKKRDIFSLLLGVA 60
M M++K+L+PTDFSE A A++ + ++ EVILLHVIDE +++ L+ G +
MIFMFRKVLFPTDFSEGAYRAVEVFEKRNKMEVGEVILLHVIDEGTLEE-----LMDGYS 55
Alignment-based Similarity
Problem: direct use of BLAST scores fails!
- Not transitive, since alignments are incomplete.
- A SASH index for approximate neighbourhood computation achieves a poor accuracy-vs-time trade-off.
- Sequential search would work, but is too expensive.
Example: pairs (A, B) and (B, C) have high BLAST scores, but pair (A, C) has a score of zero.
Reference Set Similarity
Solution (with Å. J. Västermark): vectors of gapped-alignment BLAST E-values, relating the full set of sequences to a set of sample sequences.
- BLAST scores computed with respect to a fixed sample of 1/10th size (as per Liao & Noble, 2003).
- Conversion of BLAST scores to E-values.
- Vector sparsification via thresholding to zero.
- Vector angle distance metric for neighbourhood computation.
Analogy to Text
- Each sequence is analogous to a document.
- Reference (sample) sequences are analogous to terms.
- Sparse vectorization.
- Significant BLAST scores are analogous to terms appearing in a document.
- Strong BLAST scores are analogous to the dominant terms of a document (as per TF-IDF weighting).