A toolkit for preprocessing single-cell sequencing data.
- Run through ruff to check and format files
- Error handling to capture missing or incorrect parameters, and unexpected file content
- Peaks in between barcodes need further investigation
- The plot generated by harvest does not yet handle more than one barcode peak per whitelist (CSV output is unaffected)
- Benchmark different assays (SPLiT-seq, Parse, 10x) and methods (split-pipe, scarecrow, UMI-tools)
  - barcode recovery
  - alignment (STAR and kallisto)
- Test alignment with kallisto and STAR
  - may need to alter sequence header formatting depending on what is retained in the BAM file
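
Until the error handling listed above exists in the toolkit itself, a wrapper script can at least check its inputs before calling scarecrow. The following is a minimal sketch in POSIX shell; `require_file` is a hypothetical helper and not part of scarecrow.

```shell
# require_file: hypothetical helper, not part of scarecrow.
# Returns non-zero and prints a message if the file is missing or empty.
require_file() {
    if [ ! -s "$1" ]; then
        echo "error: missing or empty file: $1" >&2
        return 1
    fi
}
```

It could be called on each FASTQ and barcode file before the first scarecrow command, for example `require_file "${R1}" || exit 1`.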
# Input FASTQ files
R1=100K_1.fastq
R2=100K_2.fastq
# One colon-separated entry per barcode; the last field is the path to the barcode file
BARCODES=(
    BC1:R1_v3:/Users/s14dw4/Documents/scarecrow_test/barcodes/bc_data_n123_R1_v3_5.barcodes
    BC2:v1:/Users/s14dw4/Documents/scarecrow_test/barcodes/bc_data_v1.barcodes
    BC3:R3_v3:/Users/s14dw4/Documents/scarecrow_test/barcodes/bc_data_R3_v3.barcodes
)
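# Run seed once per entry; ${BARCODE%:*:*} keeps only the first field (BC1, BC2, BC3)
mkdir -p ./results
for BARCODE in "${BARCODES[@]}"
do
    scarecrow seed --fastqs "${R1}" "${R2}" --strands pos neg \
        -o "./results/barcodes_${BARCODE%:*:*}.csv" --barcodes "${BARCODE}"
done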
# Pass the per-barcode CSVs written by seed to harvest
FILES=(./results/barcodes_BC*.csv)
scarecrow harvest "${FILES[@]}" --barcode_count 3 --min_distance 11 \
    --conserved ./results/barcodes_BC1_conserved.tsv --out barcode_positions.csv
# Run reap with the positions written by harvest, timing the run
time scarecrow reap --fastqs "${R1}" "${R2}" -p ./barcode_positions.csv --barcode_reverse_order \
    -j 2 -m 2 -q 30 --barcodes "${BARCODES[@]}" --extract 1:1-64 --umi 2:1-10 --out ./cDNA.fq --threads 4
# Run tally on the reap output
scarecrow tally -f ./cDNA.fq -m 2
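
The output file name in the seed loop relies on shell parameter expansion: `${BARCODE%:*:*}` removes the shortest trailing match of `:*:*`, which leaves the text before the first colon. The sketch below shows that behaviour on a hypothetical entry. It assumes the three colon-separated fields are a label, a whitelist name, and a file path, which is an interpretation of the entries above rather than documented scarecrow behaviour.

```shell
# Hypothetical entry in the same three-field form as the BARCODES array.
BARCODE="BC1:R1_v3:./barcodes/example.barcodes"

# Same expansion as in the seed loop: drop the shortest suffix matching ":*:*".
label="${BARCODE%:*:*}"

# The remaining fields can be peeled off with prefix/suffix removal.
rest="${BARCODE#*:}"        # everything after the first colon
whitelist="${rest%%:*}"     # text before the next colon
path="${rest#*:}"           # text after it

echo "${label}"      # BC1
echo "${whitelist}"  # R1_v3
echo "${path}"       # ./barcodes/example.barcodes
```

These expansions assume the path itself contains no colon; a path with a colon would shift the split points.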