Loading a single sequence from a file#
In this case, the filename suffix is used to infer the data format.
Warning
If a file has more than one sequence, only the first one is loaded.
Note
It’s also possible to load a sequence from a url.
Directly use the fasta format parser to load a sequence#
The cogent3 parsers return standard Python data types. iter_fasta_records() is a generator, so it yields one record at a time. Because we know there’s a single sequence in this file, we wrap the call with list and select the first record.
You can provide a converter that transforms the sequence data into the type you want. In this example, we use a cogent3 builtin to return a numpy array of unsigned 8-bit integers. We first get all the IUPAC characters for DNA and construct the converter. The converter maps each of the provided characters to an integer according to its order of occurrence in dna_alpha.
Note
The characters provided to the delete argument are whitespace characters. They are essential for ensuring that line feeds are removed from the sequence data.
We then use the parser as before but provide our custom converter.
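The idea behind such a converter can be sketched with plain Python and numpy. This is not the cogent3 builtin: the function to_uint8 is hypothetical, and the four-character dna_alpha is an assumption made for brevity (the full IUPAC DNA character set is larger).

```python
import numpy as np

# Assumed alphabet for illustration only.
dna_alpha = b"TCAG"

# 256-entry translation table: each alphabet character maps to its index.
# Bytes that are not in dna_alpha keep their original value.
table = bytearray(range(256))
for index, char in enumerate(dna_alpha):
    table[char] = index


def to_uint8(raw: bytes) -> np.ndarray:
    """Hypothetical converter: drop whitespace, map characters to indices."""
    cleaned = raw.translate(bytes(table), delete=b"\n\r\t ")
    return np.frombuffer(cleaned, dtype=np.uint8)


# Raw sequence bytes as they might appear in a file, line feeds included.
arr = to_uint8(b"TCAG\nGACT\n")
print(arr)
```

Without the delete argument, the line feed bytes would remain in the output array as the value 10, which is why removing whitespace matters.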
Directly use the genbank format parser to load a sequence and annotations#
The cogent3 parsers return standard Python data types. iter_genbank_records() is a generator, so it yields one record at a time. Because we know there’s a single sequence in this file, we wrap the call with list and select the first record.
As the output indicates, variable anns is a dictionary. The features in the GenBank feature table are available as a list under the "features" key. (See getting GenBank features as primitives.)
Directly use the fastq format parser to load reads with quality scores#
The iter_fastq_records() generator yields (label, sequence, quality) tuples for each record in a fastq file. By default both the sequence and quality strings are decoded to str.
To convert read quality into Phred scores, pass a converter built with make_qual_converter; the quality is then returned as a numpy.uint8 array. Select the scoring scheme with a PhredEncoding member or by name ("phred+33" or "phred+64", case insensitive).
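What such a conversion does can be sketched with numpy alone. This is an illustration of the phred+33 and phred+64 arithmetic, not the make_qual_converter implementation; the function phred_scores and the quality string are hypothetical.

```python
import numpy as np


def phred_scores(quality: str, offset: int = 33) -> np.ndarray:
    """Hypothetical helper: subtract the encoding offset from each ASCII code."""
    raw = np.frombuffer(quality.encode("ascii"), dtype=np.uint8)
    return raw - np.uint8(offset)


# "I" is ASCII 73, "5" is 53, "+" is 43 and "!" is 33.
scores = phred_scores("II5+!")
print(scores)
```

For phred+64 encoded data the same function would be called with offset=64.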
Note
As with the fasta and genbank iterative parsers, a converter argument exists for providing a custom transformer to be applied to the sequence data too.
Loading sequence collections from a file or url#
Loading aligned sequences#
Any file in which the sequences have exactly the same length can be loaded as an alignment.
Note
The load functions record the origin of the data in a .source attribute.
Loading unaligned sequences#
Files containing sequences that may differ in length can be loaded using load_unaligned_seqs(), which returns a sequence collection.
Loading unaligned sequences from multiple files#
You can create a single sequence collection containing sequences from all files in a directory that match a wildcard (or glob) pattern (e.g. "path/to/dir/*.<filename suffix>"). We load data from files ending with .fa using load_unaligned_seqs(). This approach can be taken for all supported sequence file formats.
Note
The function limits loading to just one sequence per file.
Loading from a url#
The cogent3 load functions support loading from a url. We load the above fasta file directly from GitHub.
Specifying the file format#
The loading functions use the filename suffix to infer the file format. This can be overridden using the format argument.
Specifying the sequence molecular type#
Making an alignment from standard python objects#
From a dict of strings#
From a dict of numpy arrays#
From a series of strings#
The sequence names will be automatically created.
Changing sequence labels on loading#
Load a list of aligned nucleotide sequences, while specifying the DNA molecule type and stripping the comments from the label. In this example, we rename sequences by passing a function that removes everything after the first whitespace to the label_to_name parameter.
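The renaming function itself is ordinary Python. The sketch below shows one way to write it and applies it to hypothetical labels directly; first_word is an illustrative name, and in practice the function would be passed as the label_to_name argument.

```python
def first_word(label: str) -> str:
    """Keep only the text before the first whitespace in a label."""
    return label.split()[0]


# Hypothetical labels with trailing comments.
labels = ["sample1 abcd", "sample2 efgh"]
names = [first_word(label) for label in labels]
print(names)
```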
Making a sequence collection from standard python objects#
This is done using make_unaligned_seqs(), which returns a SequenceCollection instance. The function arguments match those of make_aligned_seqs(). We demonstrate only for the case where the input data is a dict.
Loading sequences using format parsers#
load_aligned_seqs() and load_unaligned_seqs() are just convenience interfaces to format parsers. It can sometimes be more effective to use the parsers directly, say when you don’t want to load everything into memory.
Loading FASTA sequences from an open file or list of lines#
To load FASTA formatted sequences directly, you can use iter_fasta_records().
Note
This returns the sequences as strings.
Handling overloaded FASTA sequence labels#
The FASTA label field is frequently overloaded, with different information fields present in the field and separated by some delimiter. This can be flexibly addressed using the LabelParser. By creating a custom label parser, we can decide which part we use as the sequence name. We show how to convert a field into something specific.
RichLabel objects have an Info object as an attribute, allowing specific reference to all the specified label fields.
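The underlying idea of splitting an overloaded label into typed fields can be sketched in plain Python. This is a hand-rolled stand-in and not the LabelParser API: the function parse_label, the field names, the "|" delimiter and the example label are all hypothetical.

```python
def parse_label(label, field_names, converters=None, sep="|"):
    """Hypothetical helper: split a label and convert selected fields."""
    converters = converters or {}
    fields = {}
    for name, value in zip(field_names, label.split(sep)):
        convert = converters.get(name, str)
        fields[name] = convert(value.strip())
    return fields


# Hypothetical overloaded label: species, chromosome and start position.
fields = parse_label(
    "human|chr1|12345",
    field_names=("species", "chrom", "start"),
    converters={"start": int},
)

# Choose which field serves as the sequence name.
name = fields["species"]
print(name, fields)
```

Keeping all parsed fields alongside the chosen name mirrors the role the Info object plays for RichLabel.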
Using a third-party plugin for sequence storage#
Sequence collections and alignments have a .storage attribute which holds the underlying sequence data and provides basic functions for obtaining it. Users can install a third-party plugin that is customized for different types of sequence data. The following examples require that you install the cogent3-h5seqs plugin. This project provides alternative storage for both unaligned sequences and alignments.
$ pip install cogent3-h5seqs
Selecting an alternate storage backend#
Specify the storage using the storage_backend argument.
That’s it!
For the cogent3-h5seqs package you specify a different storage backend for unaligned sequences.
Set the default storage#
You can set the default storage process-wide, so you don’t need to use the storage_backend argument.
When you apply operations to these objects, the results also use the new storage backend.
Note
To revert to the cogent3 defaults, use the reset argument.