These subfolders contain the genome assemblies and associated data:
- $assembly.2bit --> repeat-softmasked genome in UCSC 2bit format (see below for how to convert to fasta)
- spliceAi/ --> subfolder that contains four bigWig files listing the SpliceAI probabilities for donor and acceptor sites on both strands for each genomic position
- $assembly.consensi.fa.classified --> repeat library obtained from RepeatModeler


========= Why do you provide 2bit and HOW can I convert 2bit to fasta? ========
UCSC's 2bit format (https://genome.ucsc.edu/goldenpath/help/twoBit.html) is, in our view, the best way to store genome sequences, as it is compact and provides random access via the kent source tool twoBitToFa.
To save disk space, we provide only 2bit files. TOGA2 also requires 2bit as input.

twoBitToFa allows you to
- convert the 2bit file into a fasta file
- extract (very fast) a set of genomic regions, including entire scaffolds, with support for the reverse-complement strand
- preserve or strip the repeat softmasking.

twoBitToFa is available
- as part of the kent source code (https://github.com/ucscGenomeBrowser/kent)
- as a 64-bit Linux binary (https://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa)
- on Galaxy (https://usegalaxy.eu/?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Fucsc_twobittofa%2Fucsc-twobittofa%2F332)

================================================================================


========= References: ========
SpliceAI:
Jaganathan, Kishore, Sofia Kyriazopoulou Panagiotopoulou, Jeremy F. McRae, Siavash Fazel Darbandi, David Knowles, Yang I. Li, Jack A. Kosmicki, et al. 2019. "Predicting Splicing from Primary Sequence with Deep Learning." Cell 176 (3): 535-48.e24.

RepeatModeler:
Flynn, Jullien M., Robert Hubley, Clément Goubert, Jeb Rosen, Andrew G. Clark, Cédric Feschotte, and Arian F. Smit. 2020. "RepeatModeler2 for Automated Genomic Discovery of Transposable Element Families." Proceedings of the National Academy of Sciences of the United States of America 117 (17): 9451-57.
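

========= Example: calling twoBitToFa (sketch) ========
The three twoBitToFa use cases listed above (full conversion, region extraction, stripping the softmasking) can be sketched as a small Python helper that builds the command lines. This is a minimal sketch, not an official wrapper. It assumes twoBitToFa is on your PATH and uses the options -seq, -start, -end, -bed and -noMask as documented in the tool's usage message; to our understanding, -start is zero-based, -end is non-inclusive, and -bed extracts the regions of a BED file and uses its strand column. The file names (assembly.2bit, regions.bed) and the scaffold name scaffold_1 are hypothetical placeholders. Run twoBitToFa without arguments to check the options of your installed version.

```python
import shutil
import subprocess


def twobit_to_fa_cmd(twobit, out_fa, seq=None, start=None, end=None,
                     bed=None, no_mask=False):
    """Build the argument list for one twoBitToFa call.

    Shell equivalent: twoBitToFa [options] input.2bit output.fa
    """
    cmd = ["twoBitToFa"]
    if seq is not None:
        cmd.append(f"-seq={seq}")      # restrict output to one sequence
    if start is not None:
        cmd.append(f"-start={start}")  # zero-based start within -seq
    if end is not None:
        cmd.append(f"-end={end}")      # non-inclusive end within -seq
    if bed is not None:
        cmd.append(f"-bed={bed}")      # extract the regions listed in a BED file
    if no_mask:
        cmd.append("-noMask")          # write all bases in upper case
    cmd += [twobit, out_fa]
    return cmd


def run(cmd):
    """Execute a command built above; fails early if the binary is missing."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file and scaffold names, for illustration only.
    examples = [
        # 1) convert the whole genome, keeping the softmasking (lower case)
        twobit_to_fa_cmd("assembly.2bit", "assembly.fa"),
        # 2) extract the first 1000 bases of one scaffold
        twobit_to_fa_cmd("assembly.2bit", "region.fa",
                         seq="scaffold_1", start=0, end=1000),
        # 3) extract a set of regions listed in a BED file
        twobit_to_fa_cmd("assembly.2bit", "regions.fa", bed="regions.bed"),
        # 4) convert the whole genome and strip the softmasking
        twobit_to_fa_cmd("assembly.2bit", "assembly.unmasked.fa",
                         no_mask=True),
    ]
    for cmd in examples:
        # Print only; call run(cmd) to actually execute twoBitToFa.
        print(" ".join(cmd))
```

The script only prints the command lines, so they can be inspected or pasted into a shell; replace the print call with run(cmd) to execute them.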