-
Notifications
You must be signed in to change notification settings - Fork 24
/
08-upset.Rmd
1074 lines (883 loc) · 33 KB
/
08-upset.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# UpSet plot {#upset-plot}
```{r, echo = FALSE, message = FALSE}
library(circlize)
library(GenomicRanges)
```
[UpSet plot](https://caleydo.org/tools/upset/) provides an efficient way to
visualize intersections of multiple sets compared to the traditional
approaches, i.e. the Venn Diagram. It is implemented in the [UpSetR
package](https://github.com/hms-dbmi/UpSetR) in R. Here we re-implemented
UpSet plots with the **ComplexHeatmap** package with some improvements.
## Input data {#input-data}
To represent multiple sets, the variable can be represented as:
1. A list of sets where each set is a vector, e.g.:
```{r, eval = FALSE}
list(set1 = c("a", "b", "c"),
set2 = c("b", "c", "d", "e"),
...)
```
2. A binary matrix/data frame where rows are elements and columns are sets,
e.g.:
```
set1 set2 set3
h 1 1 1
t 1 0 1
j 1 0 0
u 1 0 1
w 1 0 0
...
```
In the matrix, e.g., for row `t`, it means, `t` is in set **set1**, not in set **set2**, and
in set **set3**. Note the matrix is also valid if it is a logical matrix.
If the variable is a data frame, the binary columns (only contain 0 and 1) and
the logical columns are only used.
Both formats can be used for making UpSet plots, users can still use
`list_to_matrix()` to convert from list to the binary matrix.
```{r}
lt = list(set1 = c("a", "b", "c"),
set2 = c("b", "c", "d", "e"))
list_to_matrix(lt)
```
You can also set the universal set in `list_to_matrix()`:
```{r}
list_to_matrix(lt, universal = letters[1:10])
```
If the universal set does not completely cover the input sets,
those elements are not in the universal set are removed:
```{r}
list_to_matrix(lt, universal = letters[1:4])
```
3. The set can be genomic intervals, then it can only be represented as a list
of `GRanges`/`IRanges` objects.
```{r, eval = FALSE}
list(set1 = GRanges(...),
set2 = GRanges(...),
...)
```
## Mode {#upset-mode}
E.g. for three sets (**A**, **B**, **C**), all combinations of selecting
elements in the set or not in the set are encoded as following:
```
A B C
1 1 1
1 1 0
1 0 1
0 1 1
1 0 0
0 1 0
0 0 1
```
A value of 1 means to select that set and 0 means not to select that set.
E.g., `1 1 0` means to select set A, B while not set C. Note there is no `0 0
0`, because the background set is not of interest here. In following part of
this section, we refer **A**, **B** and **C** as **sets** and each combination
as **combination set**. The whole binary matrix is called **combination
matrix**.
The UpSet plot visualizes the **size** of each combination set. With the
binary code of each combination set, next we need to define how to calculate
the size of that combination set. There are three modes:
1. `distinct` mode: 1 means in that set and 0 means not in that set, then `1 1
0` means a set of elements both in set **A** and **B**, while not in **C**
(`setdiff(intersect(A, B), C)`). Under this mode, the seven combination
sets are the seven partitions in the Venn diagram and they are mutually
exclusive.
2. `intersect` mode: 1 means in that set and 0 is not taken into account,
then, `1 1 0` means a set of elements in set **A** and **B**, and they can
also in **C** or not in **C** (`intersect(A, B)`). Under this mode, the
seven combination sets can overlap.
3. `union mode`: 1 means in that set and 0 is not taken into account. When
there are multiple 1, the relationship is __OR__. Then, `1 1 0` means a set
of elements in set **A** or **B**, and they can also in **C** or not in
**C** (`union(A, B)`). Under this mode, the seven combination sets can
overlap.
The three modes are illustrated in following figure:
```{r, fig.width = 6*0.8, fig.height = 8*0.8, echo = FALSE}
source("upset_mode.R")
```
## Make the combination matrix {#make-the-combination-matrix}
The function `make_comb_mat()` generates the combination matrix as well as
calculates the size of the sets and the combination sets. The input can be one
single variable or name-value pairs:
```{r}
set.seed(123)
lt = list(a = sample(letters, 5),
b = sample(letters, 10),
c = sample(letters, 15))
m1 = make_comb_mat(lt)
m1
m2 = make_comb_mat(a = lt$a, b = lt$b, c = lt$c)
m3 = make_comb_mat(list_to_matrix(lt))
```
`m1`, `m2` and `m3` are identical.
The mode is controlled by the `mode` argument:
```{r}
m1 = make_comb_mat(lt) # the default mode is `distinct`
m2 = make_comb_mat(lt, mode = "intersect")
m3 = make_comb_mat(lt, mode = "union")
```
The UpSet plots under different modes will be demonstrated in later sections.
When there are too many sets, the sets can be pre-filtered by the set sizes.
The `min_set_size` and `top_n_sets` are for this purpose. `min_set_size`
controls the minimal size for the sets and `top_n_sets` controls the number of
top sets with the largest sizes.
```{r}
m1 = make_comb_mat(lt, min_set_size = 6)
m2 = make_comb_mat(lt, top_n_sets = 2)
```
The subsetting of the sets affects the calculation of the sizes of the
combination sets, that is why it needs to be controlled at the combination
matrix generation step. The subsetting of combination sets can be directly
performed by subsetting the matrix:
```{r}
m = make_comb_mat(lt)
m[1:4]
```
`make_comb_mat()` also allows to specify the universal set so that the complement
set which contains elements not belonging to any set is also considered.
```{r}
m = make_comb_mat(lt, universal_set = letters)
m
```
The universal set can be smaller than the union of all sets, then for each
set, only the intersection to universal set is considered.
```{r}
m = make_comb_mat(lt, universal_set = letters[1:10])
m
```
If you already know the size of the complement size, you can directly set
`complement_size` argument.
```{r}
m = make_comb_mat(lt, complement_size = 5)
m
```
When the input is matrix and it contains elements that do not belong to any of
the set, these elements are treated as complement set.
```{r}
x = list_to_matrix(lt, universal_set = letters)
m = make_comb_mat(x)
m
```
Next we demonstrate a second example, where the sets are genomic regions.
**When the sets are genomic regions, the size is calculated as the sum of the
width of regions in each set (or in other words, the total number of base
pairs).**
```{r}
library(circlize)
library(GenomicRanges)
lt2 = lapply(1:4, function(i) generateRandomBed())
lt2 = lapply(lt2, function(df) GRanges(seqnames = df[, 1],
ranges = IRanges(df[, 2], df[, 3])))
names(lt2) = letters[1:4]
m2 = make_comb_mat(lt2)
m2
```
We don't recommend to use **the number of regions** for the intersection of two
sets of genomic regions. There are two reasons: 1. the value is not symmetric, i.e.
the number of intersected regions measured in set1 is not always identical to the number
of intersected regions measured in set2, thus, it is difficult to assign a value
for the intersection between set1 and set2; 2. if one long region in set1 overlaps to
another long region in set2, but only in a few base pairs, does it make sense to say
these two regions are common in the two sets?
The universal set also works for sets as genomic regions.
## Utility functions {#upset-utility-functions}
`make_comb_mat()` returns a matrix, also in `comb_mat` class. There are some
utility functions that can be applied to this `comb_mat` object:
- `set_name()`: The set names.
- `comb_name()`: The combination set names. The names of the combination sets
are formatted as a string of binary bits. E.g. for three sets of **A**,
**B**, **C**, the combination set with name "101" corresponds to selecting set
**A**, not selecting set **B** and selecting set **C**.
- `set_size()`: The set sizes.
- `comb_size()`: The combination set sizes.
- `comb_degree()`: The degree for a combination set is the number of sets that
are selected.
- `t()`: Transpose the combination matrix. By default `make_comb_mat()`
generates a matrix where sets are on rows and combination sets are on
columns, and so are they on the UpSet plots. By transposing the combination
matrix, the position of sets and combination sets can be swtiched on the
UpSet plot.
- `extract_comb()`: Extract the elements in a specified combination set. The
usage will be explained later.
- Functions for subsetting the matrix.
Quick examples are:
```{r}
m = make_comb_mat(lt)
set_name(m)
comb_name(m)
set_size(m)
comb_size(m)
comb_degree(m)
t(m)
```
For using `extract_comb()`, the valid combination set name should be from
`comb_name()`. Note the elements in the combination sets depends on the "mode"
set in `make_comb_mat()`.
```{r}
extract_comb(m, "101")
```
And the example for sets that are the genomic regions:
```{r}
# `lt2` was generated in the previous section
m2 = make_comb_mat(lt2)
set_size(m2)
comb_size(m2)
```
And now `extract_comb()` returns genomic regions that are in the corresponding combination set.
```{r}
extract_comb(m2, "1010")
```
With `comb_size()` and `comb_degree()`, we can filter the combination matrix as:
```{r}
m = make_comb_mat(lt)
# combination set size >= 4
m[comb_size(m) >= 4]
# combination set degree == 2
m[comb_degree(m) == 2]
```
For the complement set, the name for this special combination set is only
composed of zeros.
```{r}
m2 = make_comb_mat(lt, universal_set = letters)
comb_name(m2) # see the first element
comb_degree(m2)
```
If `universal_set` was set in `make_comb_mat()`, `extract_comb()` can be
applied to the complement set.
```{r}
m2 = make_comb_mat(lt, universal_set = letters)
extract_comb(m2, "000")
m2 = make_comb_mat(lt, universal_set = letters[1:10])
extract_comb(m2, "000")
```
When `universal_set` was set, `extract_comb()` also works for genomic region sets.
In previous examples, we demonstrated using "one-dimensional index" such as:
```{r, eval = FALSE}
m[comb_degree(m) == 2]
```
Since the combination matrix is natually a matrix, the indices can also be applied to
the both dimensions. In the default settings, sets are on the rows and combination
sets are on the columns, thus, indices on the first dimension of the matrix
correspond to sets and indices on the second dimension correspond to combination sets:
```{r, eval = FALSE}
# by set names
m[c("a", "b", "c"), ]
# by nummeric indicies
m[3:1, ]
```
**New empty sets** can be added to the combination matrix by:
```{r, eval = FALSE}
# `d` is the new empty set
m[c("a", "b", "c", "d"), ]
```
Note when the indices specified do not cover all non-empty sets in the original
combination matrix, the combination matrix will be re-calculated because it
affects the values in the combination sets:
```{r, eval = FALSE}
# if `c` is a non-empty set
m[c("a", "b"),]
```
Similar for subsetting on the second dimension which correspond to the combination
sets:
```{r, eval = FALSE}
# reorder
m[, 5:1]
# take a subset
m[, 1:3]
# by charater indices
m[, c("110", "101", "011")]
```
New empty combination sets can also be added by setting the character indices:
```{r, eval = FALSE}
m[m, c(comb_name(m), "100")]
```
Indices can be set on both dimension simultaneously only when the set indices
covers all non-empty sets:
```{r, eval = FALSE}
m[3:1, 5:1]
# this will throw an error because `c` is a non-empty set
m[c("a", "b"), 5:1]
```
If the combination matrix was transposed, the margin of the matrix where
set indices and combination set indices should be switched.
```{r, eval = FALSE}
tm = t(m)
tm[reverse(comb_name(tm)), reverse(set_name(tm))]
```
If only the indices for the combination sets are set as one-dimensional,
it automatically works for both matrices that are transposed or not:
```{r, eval = FALSE}
m[1:5]
tm[1:5]
```
## Make the plot {#upset-making-the-plot}
Making the UpSet plot is very straightforward that users just send the
combination matrix to `UpSet()` function:
```{r, fig.width = 5, fig.height = 3}
m = make_comb_mat(lt)
UpSet(m)
```
By default the sets are ordered by the size and the combination sets are
ordered by the degree (number of sets that are selected).
The order is controlled by `set_order` and `comb_order`:
```{r, fig.width = 5, fig.height = 3}
UpSet(m, set_order = c("a", "b", "c"), comb_order = order(comb_size(m)))
```
Color of dots, size of dots and line width of the segments are controlled by
`pt_size`, `comb_col` and `lwd`. `comb_col` should be a vector corresponding
to the combination sets. In following code, since `comb_degree(m)` returns a
vector of integers, we just use it as index for the color vector.
```{r, fig.width = 5, fig.height = 3}
UpSet(m, pt_size = unit(5, "mm"), lwd = 3,
comb_col = c("red", "blue", "black")[comb_degree(m)])
```
Colors for the background (the rectangles and the dots representing the set is
not selected) are controlled by `bg_col`, `bg_pt_col`. The length of `bg_col`
can have length of one or two.
```{r, fig.width = 5, fig.height = 3}
UpSet(m, comb_col = "#0000FF", bg_col = "#F0F0FF", bg_pt_col = "#CCCCFF")
UpSet(m, comb_col = "#0000FF", bg_col = c("#F0F0FF", "#FFF0F0"), bg_pt_col = "#CCCCFF")
```
Transposing the combination matrix swtiches the sets to columns and
combination sets to rows.
```{r, fig.width = 3, fig.height = 5}
UpSet(t(m))
```
As we have introduced, if do subsetting on the combination sets, the subset of
the matrix can be visualized as well:
```{r, eval = FALSE}
UpSet(m[comb_size(m) >= 4])
UpSet(m[comb_degree(m) == 2])
```
```{r, echo = FALSE, fig.height = 3, fig.width = 7}
grid.newpage()
pushViewport(viewport(x = 0, width = 0.5, just = "left"))
draw(UpSet(m[comb_size(m) >= 4], column_title = "comb_size(m) >= 4"), newpage = FALSE)
popViewport()
pushViewport(viewport(x = 0.5, width = 0.5, just = "left"))
draw(UpSet(m[comb_degree(m) == 2], column_title = "comb_degree(m) == 2"), newpage = FALSE)
popViewport()
```
Following compares the different mode in `make_comb_mat()`:
```{r, eval = FALSE}
m1 = make_comb_mat(lt) # the default mode is `distinct`
m2 = make_comb_mat(lt, mode = "intersect")
m3 = make_comb_mat(lt, mode = "union")
UpSet(m1)
UpSet(m2)
UpSet(m3)
```
```{r, echo = FALSE, fig.height = 8, fig.width = 5}
m1 = make_comb_mat(lt) # the default mode is `distinct`
m2 = make_comb_mat(lt, mode = "intersect")
m3 = make_comb_mat(lt, mode = "union")
grid.newpage()
pushViewport(viewport(y = 1, height = 1/3, just = "top"))
draw(UpSet(m1, column_title = "mode = 'distinct'"), newpage = FALSE)
popViewport()
pushViewport(viewport(y = 2/3, height = 1/3, just = "top"))
draw(UpSet(m2, column_title = "mode = 'intersect'"), newpage = FALSE)
popViewport()
pushViewport(viewport(y = 1/3, height = 1/3, just = "top"))
draw(UpSet(m3, column_title = "mode = 'union'"), newpage = FALSE)
popViewport()
```
For the plot containing complement set, there is one additional column showing
this complement set does not overlap to any of the sets (all dots are in grey).
```{r, fig.width = 5, fig.height = 3}
m2 = make_comb_mat(lt, universal_set = letters)
UpSet(m2)
```
Remember if you already know the size for the complement set, you can directly assign it
by `complement_size` argument in `make_comb_mat()`.
```{r, fig.width = 5, fig.height = 3}
m2 = make_comb_mat(lt, complement_size = 10)
UpSet(m2)
```
For the case where the universal set is smaller than the union of all sets:
```{r, fig.width = 5, fig.height = 3}
m2 = make_comb_mat(lt, universal_set = letters[1:10])
UpSet(m2)
```
There are some cases that you may have complement set but you don't want to show it,
especially when the input for `make_comb_mat()` is a matrix which already contains
complement set, you can filter by the combination degrees.
```{r, fig.width = 5, fig.height = 3}
x = list_to_matrix(lt, universal_set = letters)
m2 = make_comb_mat(x)
m2 = m2[comb_degree(m2) > 0]
UpSet(m2)
```
## UpSet plots as heatmaps {#upset-plots-as-heatmaps}
In the UpSet plot, the major component is the combination matrix, and on the
two sides are the barplots representing the size of sets and the combination
sets, thus, it is quite straightforward to implement it as a "heatmap" where
the heatmap is self-defined with dots and segments, and the two barplots are
two barplot annotations constructed by `anno_barplot()`.
The default top annotation is:
```{r, eval = FALSE}
HeatmapAnnotation("Intersection\nsize" = anno_barplot(comb_size(m),
border = FALSE, gp = gpar(fill = "black"), height = unit(3, "cm")),
annotation_name_side = "left", annotation_name_rot = 0)
```
This top annotation is wrapped in `upset_top_annotation()` which only contais
the upset top barplot annotation. Most of the arguments in
`upset_top_annotation()` directly go to the `anno_barplot()`, e.g. to set
the colors of bars:
```{r, fig.width = 5, fig.height = 3}
UpSet(m, top_annotation = upset_top_annotation(m,
gp = gpar(col = comb_degree(m))))
```
To control the data range and axis:
```{r, fig.width = 5, fig.height = 3}
UpSet(m, top_annotation = upset_top_annotation(m,
ylim = c(0, 15),
bar_width = 1,
axis_param = list(side = "right", at = c(0, 5, 10, 15),
labels = c("zero", "five", "ten", "fifteen"))))
```
To control the annotation name:
```{r, fig.width = 5, fig.height = 3}
UpSet(m, top_annotation = upset_top_annotation(m,
annotation_name_rot = 90,
annotation_name_side = "right",
axis_param = list(side = "right")))
```
The settings are very similar for the right annotation:
```{r, fig.width = 5, fig.height = 3}
UpSet(m, right_annotation = upset_right_annotation(m,
ylim = c(0, 30),
gp = gpar(fill = "green"),
annotation_name_side = "top",
axis_param = list(side = "top")))
```
`upset_top_annotation()` and `upset_right_annotation()` can automatically
recognize whether sets are on rows or columns.
`upset_top_annotation()` and `upset_right_annotation()` only contain one
barplot annotation. If users want to add more annotations, they need to
manually construct a `HeatmapAnnotation` object with multiple annotations.
To add more annotations on top:
```{r, fig.width = 5, fig.height = 3}
UpSet(m, top_annotation = HeatmapAnnotation(
degree = as.character(comb_degree(m)),
"Intersection\nsize" = anno_barplot(comb_size(m),
border = FALSE,
gp = gpar(fill = "black"),
height = unit(2, "cm")
),
annotation_name_side = "left",
annotation_name_rot = 0))
```
To add more annotation on the right:
```{r, fig.width = 5, fig.height = 3}
UpSet(m, right_annotation = rowAnnotation(
"Set size" = anno_barplot(set_size(m),
border = FALSE,
gp = gpar(fill = "black"),
width = unit(2, "cm")
),
group = c("group1", "group1", "group2")))
```
To move the right annotation to the left of the combination matrix, use `upset_left_annotation()`:
```{r, fig.width = 5, fig.height = 3}
UpSet(m, left_annotation = upset_left_annotation(m))
```
To add numbers on top of the bars:
```{r, fig.width = 5, fig.height = 3}
UpSet(m, top_annotation = upset_top_annotation(m, add_numbers = TRUE),
right_annotation = upset_right_annotation(m, add_numbers = TRUE))
```
The object returned by `UpSet()` is actually a `Heatmap` class object, thus,
you can add to other heatmaps and annotations by `+` or `%v%`.
```{r, fig.width = 6, fig.height = 3}
ht = UpSet(m)
class(ht)
ht + Heatmap(1:3, name = "foo", width = unit(5, "mm")) +
rowAnnotation(bar = anno_points(1:3))
```
```{r, fig.width = 5, fig.height = 3}
ht %v% Heatmap(rbind(1:7), name = "foo", row_names_side = "left",
height = unit(5, "mm")) %v%
HeatmapAnnotation(bar = anno_points(1:7),
annotation_name_side = "left")
```
Add multiple UpSet plots:
```{r, fig.width = 5, fig.height = 7}
m1 = make_comb_mat(lt, mode = "distinct")
m2 = make_comb_mat(lt, mode = "intersect")
m3 = make_comb_mat(lt, mode = "union")
UpSet(m1, row_title = "distinct mode") %v%
UpSet(m2, row_title = "intersect mode") %v%
UpSet(m3, row_title = "union mode")
```
Or first transpose all the combination matrices and add them horizontally:
```{r, fig.height = 5, fig.width = 7}
m1 = make_comb_mat(lt, mode = "distinct")
m2 = make_comb_mat(lt, mode = "intersect")
m3 = make_comb_mat(lt, mode = "union")
UpSet(t(m1), column_title = "distinct mode") +
UpSet(t(m2), column_title = "intersect mode") +
UpSet(t(m3), column_title = "union mode")
```
The three combination matrices are actually the same and plotting them three
times is redundant. With the functionality in **ComplexHeatmap** package, we
can directly add three barplot annotations.
```{r, fig.width = 5, fig.height = 5}
top_ha = HeatmapAnnotation(
"distict" = anno_barplot(comb_size(m1),
gp = gpar(fill = "black"), height = unit(2, "cm")),
"intersect" = anno_barplot(comb_size(m2),
gp = gpar(fill = "black"), height = unit(2, "cm")),
"union" = anno_barplot(comb_size(m3),
gp = gpar(fill = "black"), height = unit(2, "cm")),
gap = unit(2, "mm"), annotation_name_side = "left", annotation_name_rot = 0)
# the same for using m2 or m3
UpSet(m1, top_annotation = top_ha)
```
Similar when the combination matrix is transposed:
```{r, fig.width = 5, fig.height = 5}
right_ha = rowAnnotation(
"distict" = anno_barplot(comb_size(m1),
gp = gpar(fill = "black"), width = unit(2, "cm")),
"intersect" = anno_barplot(comb_size(m2),
gp = gpar(fill = "black"), width = unit(2, "cm")),
"union" = anno_barplot(comb_size(m3),
gp = gpar(fill = "black"), width = unit(2, "cm")),
gap = unit(2, "mm"), annotation_name_side = "bottom")
# the same for using m2 or m3
UpSet(t(m1), right_annotation = right_ha)
```
## Example with the movies dataset
[UpsetR package](https://github.com/hms-dbmi/UpSetR) also provides a `movies`
dataset, which contains 17 genres for 3883 movies. First load the dataset.
```{r}
movies = read.csv(system.file("extdata", "movies.csv", package = "UpSetR"),
header = TRUE, sep = ";")
head(movies) # `make_comb_mat()` automatically ignores the first two columns
```
To make a same UpSet plot as in [this vignette](https://cran.r-project.org/web/packages/UpSetR/vignettes/basic.usage.html#example-2-choosing-the-top-largest-sets-and-plot-formatting):
```{r, fig.width = 10, fig.height = 4}
m = make_comb_mat(movies, top_n_sets = 6)
m
m = m[comb_degree(m) > 0]
UpSet(m)
```
Following code makes it look more similar as the orignal plot. The code is a
little bit long, but most of the code mainly customize the annotations and
row/column orders.
```{r, fig.width = 10, fig.height = 4}
ss = set_size(m)
cs = comb_size(m)
ht = UpSet(m,
set_order = order(ss),
comb_order = order(comb_degree(m), -cs),
top_annotation = HeatmapAnnotation(
"Genre Intersections" = anno_barplot(cs,
ylim = c(0, max(cs)*1.1),
border = FALSE,
gp = gpar(fill = "black"),
height = unit(4, "cm")
),
annotation_name_side = "left",
annotation_name_rot = 90),
left_annotation = rowAnnotation(
"Movies Per Genre" = anno_barplot(-ss,
baseline = 0,
axis_param = list(
at = c(0, -500, -1000, -1500),
labels = c(0, 500, 1000, 1500),
labels_rot = 0),
border = FALSE,
gp = gpar(fill = "black"),
width = unit(4, "cm")
),
set_name = anno_text(set_name(m),
location = 0.5,
just = "center",
width = max_text_width(set_name(m)) + unit(4, "mm"))
),
right_annotation = NULL,
show_row_names = FALSE)
ht = draw(ht)
od = column_order(ht)
decorate_annotation("Genre Intersections", {
grid.text(cs[od], x = seq_along(cs), y = unit(cs[od], "native") + unit(2, "pt"),
default.units = "native", just = c("left", "bottom"),
gp = gpar(fontsize = 6, col = "#404040"), rot = 45)
})
```
In `movies` dataset, there is also one column `AvgRating` which gives the
rating of each movie, we next split all the movies into five groups based on
the ratings.
```{r}
genre = c("Action", "Romance", "Horror", "Children", "SciFi", "Documentary")
rating = cut(movies$AvgRating, c(0, 1, 2, 3, 4, 5))
m_list = tapply(seq_len(nrow(movies)), rating, function(ind) {
m = make_comb_mat(movies[ind, genre, drop = FALSE])
m[comb_degree(m) > 0]
})
```
The combination matrices in `m_list` might have different combination sets:
```{r}
sapply(m_list, comb_size)
```
To compare between multiple groups with UpSet plots, we need to normalize all
the matrices to make them have same sets and same combination sets.
`normalize_comb_mat()` basically adds zero to the new combination sets which
were not there before.
```{r}
m_list = normalize_comb_mat(m_list)
sapply(m_list, comb_size)
```
We calculate the range for the two barplots:
```{r}
max_set_size = max(sapply(m_list, set_size))
max_comb_size = max(sapply(m_list, comb_size))
```
And finally we add the five UpSet plots vertically:
```{r, fig.width = 7, fig.height = 12}
ht_list = NULL
for(i in seq_along(m_list)) {
ht_list = ht_list %v%
UpSet(m_list[[i]], row_title = paste0("rating in", names(m_list)[i]),
set_order = NULL, comb_order = NULL,
top_annotation = upset_top_annotation(m_list[[i]], ylim = c(0, max_comb_size)),
right_annotation = upset_right_annotation(m_list[[i]], ylim = c(0, max_set_size)))
}
ht_list
```
After comparing the five UpSet plots, we can see most of the movies are rated
between 2 and 4. Horror movies tend to have lower ratings and romance movies
tend to have higher ratings.
Instead of directly comparing the size of the combination sets, we can also
compare the relative fraction to the full sets. In following code, we remove
the group of `c(0, 1]` because the number of movies are too few there.
```{r, fig.width = 7, fig.height = 9}
m_list = m_list[-1]
max_set_size = max(sapply(m_list, set_size))
rel_comb_size = sapply(m_list, function(m) {
s = comb_size(m)
# because the combination matrix is generated under "distinct" mode
# the sum of `s` is the size of the full set
s/sum(s)
})
ht_list = NULL
for(i in seq_along(m_list)) {
ht_list = ht_list %v%
UpSet(m_list[[i]], row_title = paste0("rating in", names(m_list)[i]),
set_order = NULL, comb_order = NULL,
top_annotation = HeatmapAnnotation(
"Relative\nfraction" = anno_barplot(
rel_comb_size[, i],
ylim = c(0, 0.5),
gp = gpar(fill = "black"),
border = FALSE,
height = unit(2, "cm"),
),
annotation_name_side = "left",
annotation_name_rot = 0),
right_annotation = upset_right_annotation(m_list[[i]],
ylim = c(0, max_set_size))
)
}
ht_list
```
Now the trend is more clear that horror movies are rated low and documentaries
are rated high.
Next we split the movies by years:
```{r, fig.width = 14, fig.height = 10}
year = floor(movies$ReleaseDate/10)*10
m_list = tapply(seq_len(nrow(movies)), year, function(ind) {
m = make_comb_mat(movies[ind, genre, drop = FALSE])
m[comb_degree(m) > 0]
})
m_list = normalize_comb_mat(m_list)
max_set_size = max(sapply(m_list, set_size))
max_comb_size = max(sapply(m_list, comb_size))
ht_list1 = NULL
for(i in 1:5) {
ht_list1 = ht_list1 %v%
UpSet(m_list[[i]], row_title = paste0(names(m_list)[i], "s"),
set_order = NULL, comb_order = NULL,
top_annotation = upset_top_annotation(m_list[[i]], ylim = c(0, max_comb_size),
height = unit(2, "cm")),
right_annotation = upset_right_annotation(m_list[[i]], ylim = c(0, max_set_size)))
}
ht_list2 = NULL
for(i in 6:10) {
ht_list2 = ht_list2 %v%
UpSet(m_list[[i]], row_title = paste0(names(m_list)[i], "s"),
set_order = NULL, comb_order = NULL,
top_annotation = upset_top_annotation(m_list[[i]], ylim = c(0, max_comb_size),
height = unit(2, "cm")),
right_annotation = upset_right_annotation(m_list[[i]], ylim = c(0, max_set_size)))
}
grid.newpage()
pushViewport(viewport(x = 0, width = 0.5, just = "left"))
draw(ht_list1, newpage = FALSE)
popViewport()
pushViewport(viewport(x = 0.5, width = 0.5, just = "left"))
draw(ht_list2, newpage = FALSE)
popViewport()
```
Now we can see most of the movies were produces in 1990s and the two major
genres are actions and romance.
Similarly, if we change the top annotation to the relative fraction to the
full sets (code not shown):
```{r, fig.width = 14, fig.height = 10, echo = FALSE}
max_set_size = max(sapply(m_list, set_size))
rel_comb_size = sapply(m_list, function(m) {
s = comb_size(m)
s/sum(s)
})
ht_list1 = NULL
for(i in 1:5) {
ht_list1 = ht_list1 %v%
UpSet(m_list[[i]], row_title = paste0(names(m_list)[i], "s"),
set_order = NULL, comb_order = NULL,
top_annotation = HeatmapAnnotation(
"Relative\nfraction" = anno_barplot(
rel_comb_size[, i],
ylim = c(0, 0.5),
gp = gpar(fill = "black"),
border = FALSE,
height = unit(2, "cm"),
),
annotation_name_side = "left",
annotation_name_rot = 0),
right_annotation = upset_right_annotation(m_list[[i]],
ylim = c(0, max_set_size))
)
}
ht_list2 = NULL
for(i in 6:10) {
ht_list2 = ht_list2 %v%
UpSet(m_list[[i]], row_title = paste0(names(m_list)[i], "s"),
set_order = NULL, comb_order = NULL,
top_annotation = HeatmapAnnotation(
"Relative\nfraction" = anno_barplot(
rel_comb_size[, i],
ylim = c(0, 0.5),
gp = gpar(fill = "black"),
border = FALSE,
height = unit(2, "cm"),
),
annotation_name_side = "left",
annotation_name_rot = 0),
right_annotation = upset_right_annotation(m_list[[i]],
ylim = c(0, max_set_size))
)
}
grid.newpage()
pushViewport(viewport(x = 0, width = 0.5, just = "left"))
draw(ht_list1, newpage = FALSE)
popViewport()
pushViewport(viewport(x = 0.5, width = 0.5, just = "left"))
draw(ht_list2, newpage = FALSE)
popViewport()
```
Finally we can add the statistics of years, ratings and number of watches for
each combination set as boxplot annotations to the right of the UpSet plot.
```{r, fig.width = 6, fig.height = 8}
m = make_comb_mat(movies[, genre])
m = m[comb_degree(m) > 0]
comb_elements = lapply(comb_name(m), function(nm) extract_comb(m, nm))
years = lapply(comb_elements, function(ind) movies$ReleaseDate[ind])
rating = lapply(comb_elements, function(ind) movies$AvgRating[ind])
watches = lapply(comb_elements, function(ind) movies$Watches[ind])
UpSet(t(m)) + rowAnnotation(years = anno_boxplot(years),
rating = anno_boxplot(rating),
watches = anno_boxplot(watches),
gap = unit(2, "mm"))
```
We can see the movies with genre "Scifi + Children" were produced quite old
but the ratings are not bad. The movies with genre "Action + Children" have
the lowest ratings.
## Example with the genomic regions
The H3K4me3 ChIP-seq peaks from six [Roadmap](http://www.roadmapepigenomics.org/) samples are
visualized by UpSet plot. The six samples are:
- [ESC, E016](https://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak/E016-H3K4me3.narrowPeak.gz)
- [ES-derived, E004](https://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak/E004-H3K4me3.narrowPeak.gz)
- [ES-derived, E006](https://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak/E006-H3K4me3.narrowPeak.gz)
- [Brain, E071](https://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak/E071-H3K4me3.narrowPeak.gz)
- [Muscle, E100](https://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak/E100-H3K4me3.narrowPeak.gz)
- [Heart, E104](https://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak/E104-H3K4me3.narrowPeak.gz)
First read the files and convert to `GRanges` objects.
```{r}
file_list = c(
"ESC" = "data/E016-H3K4me3.narrowPeak.gz",
"ES-deriv1" = "data/E004-H3K4me3.narrowPeak.gz",
"ES-deriv2" = "data/E006-H3K4me3.narrowPeak.gz",
"Brain" = "data/E071-H3K4me3.narrowPeak.gz",
"Muscle" = "data/E100-H3K4me3.narrowPeak.gz",
"Heart" = "data/E104-H3K4me3.narrowPeak.gz"
)
library(GenomicRanges)
peak_list = lapply(file_list, function(f) {