Metadata-Version: 2.4
Name: atol-genome-launcher
Version: 0.14.0
Summary: Draft code for the AToL genome launcher
Author-email: Amy Tims <amy.tims@unimelb.edu.au>, Emily Marshall <emily@biocommons.org.au>, Keeva Connolly <keeva.connolly@qcif.edu.au>, Tom Harrop <tharrop@unimelb.edu.au>
License-Expression: GPL-3.0-or-later
Project-URL: Homepage, https://github.com/tomharrop/atol-genome-launcher
Classifier: Development Status :: 3 - Alpha
Classifier: Natural Language :: English
Classifier: Operating System :: POSIX :: Linux
Classifier: Private :: Do Not Upload
Classifier: Programming Language :: Python :: 3.11
Requires-Python: <3.15,>=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: argparse
Requires-Dist: pandas<3,>=2.3.3
Requires-Dist: pydantic>=2.12.0
Requires-Dist: snakedeploy>=0.11.0
Requires-Dist: snakemake<10,>=9.11.6
Dynamic: license-file

## atol-genome-launcher

Utility code for AToL's Genome Engine. This package provides modules for
launching assemblies and annotations based on metadata ingested by the
[atol-bpa-datamapper](https://github.com/TomHarrop/atol-bpa-datamapper).


### Standardised metadata parsing

The `yaml_manifest` module provides standardised parsing of AToL's assembly
manifest. The schema for the manifest is at
[src/yaml_manifest/schema.json](./src/yaml_manifest/schema.json).

> [!IMPORTANT]
>
> Despite the name, the preferred input is JSON. See the [example JSON
> file](./test-data/dummy_pb.json). A legacy parser for YAML is available as
> `Manifest.from_yaml()`. 


#### Load the manifest

```python3
from yaml_manifest import Manifest

with open("manifest.json", "rb") as f:
  manifest = Manifest.model_validate_json(f.read())
```

If you have already processed the manifest in Python, you can load it straight
from a dict.

```python3
from yaml_manifest import Manifest

manifest = Manifest.from_dict(config)
```

#### Specimen metadata

Specimen metadata is available as `Manifest` properties, e.g.

```python3
manifest.dataset_id
manifest.scientific_name
manifest.taxon_id
manifest.busco_lineage
manifest.hic_motif
```

#### Read file information

Read file information is available as `ReadFile` objects, which can be queried
for processing.

```python3
hic_reads = manifest.hic_reads

hic_reads.is_paired_end   # check file types
hic_reads.names           # get names, URLs etc
hic_reads.all_urls
```


#### Standardised directory structure

The directory layout for each stage of read file processing is standardised
and [configured in JSON](src/yaml_manifest/directory_layout.json).

Only the *raw* and *qc* stages are configured for now.

`ReadFile` objects can be queried to get the appropriate `Paths` for each
stage.

```python3
my_file = manifest.reads.get("353997_AusARG_BRF_HMGMJDRXY")

print(my_file.paths("raw"))
print(my_file.paths("qc"))
print(my_file.stats_path("qc"))
```

Generic directories are available from the `Manifest` object.

```python3
manifest.get_dir("downloads")

# Specific directories are available by data_type
manifest.get_dir("downloads", data_type="Hi-C") 
```

#### Automatic `jinja2` template rendering

`jinja2` templates can be rendered with the `render_template_file` method (for
a template file) or the `render_template` method (for a Python string).

Keys in the manifest will automatically be matched to keys in the template.

Keys in the template that aren't directly available as `Manifest` properties
can be passed as extra args, *e.g.* `platform` and `custom_param` below. 

```python3
rendered = manifest.render_template_file(
    "templates/pipeline_config.yaml.j2",
    platform="pacbio",
    custom_param="value",
)
```
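
To illustrate how manifest keys and extra arguments might combine at render
time, here is a minimal sketch that uses `jinja2` directly. It is not the
package's implementation: the `manifest_values` dict and the template text are
hypothetical stand-ins for a real `Manifest` and template file.

```python
from jinja2 import Environment, StrictUndefined

# Hypothetical stand-in for values that a Manifest would supply.
manifest_values = {"dataset_id": "example_dataset", "taxon_id": 12345}

# Hypothetical template text; a real template would live in a .j2 file.
template_text = (
    "dataset: {{ dataset_id }}\n"
    "taxon: {{ taxon_id }}\n"
    "platform: {{ platform }}\n"
)

# StrictUndefined makes rendering fail loudly if a template key has no value.
env = Environment(undefined=StrictUndefined)
template = env.from_string(template_text)

# Extra keyword arguments fill in keys that the manifest does not provide.
rendered = template.render(**manifest_values, platform="pacbio")
print(rendered)
```

Using `StrictUndefined` here is a design choice for the sketch: a template key
that is neither in the manifest nor passed as an extra argument raises an error
instead of silently rendering as an empty string.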

### deploy-pipeline

Deploy the AToL Genome Launcher's pipelines. This prepares a `run-dir` to run
jobs for an assembly `manifest`. 

The suggested usage is to have a single working directory for each assembly
manifest, so you can run `deploy-pipeline manifest.yaml`. The deployed
workflow, runscripts and manifest could then be committed to a private
repository.


#### Usage

```bash
usage: deploy-pipeline [-h] [-n] [--workflow_url WORKFLOW_URL] [--workflow_tag WORKFLOW_TAG]
                       [--force] [--run-dir RUN_DIR]
                       manifest_file

positional arguments:
  manifest_file         Path to the manifest

options:
  -h, --help            show this help message and exit

Outputs:
  --run-dir RUN_DIR     Run directory for the assembly (default: /home/tharrop/Projects/atol-
                        genome-launcher)

Settings:
  -n                    Dry run (default: False)
  --workflow_url WORKFLOW_URL
                        genome-launcher-workflow URL (default: SplitResult(scheme='https',
                        netloc='github.com', path='/AToL-Bioinformatics/genome-launcher-
                        workflow', query='', fragment=''))
  --workflow_tag WORKFLOW_TAG
                        genome-launcher-workflow tag (default: 0.0.3)
  --force               Passed to snakedeploy (default: False)
```

### request-assembly-repo

Generate an assembly repo on GitHub for a `manifest` file.

#### Usage

```bash
usage: request-assembly-repo [-h] [-n] [--assignees ASSIGNEES] [--label_flag LABEL_FLAG] [--token_env_var TOKEN_ENV_VAR] manifest

positional arguments:
  manifest

options:
  -h, --help            show this help message and exit

Settings:
  -n                    Dry run
  --assignees ASSIGNEES
                        GitHub user names to assign to the issue.
  --label_flag LABEL_FLAG
                        Label for this assembly.
  --token_env_var TOKEN_ENV_VAR
                        The name of the environment variable containing the GitHub personal access token with permission to run the Action.
```


### assembly-data-downloader

Read an assembly `manifest_file` and download the raw read files from BPA.

#### Usage

```bash
usage: assembly-data-downloader [-h] [-n] [--parallel_downloads PARALLEL_DOWNLOADS] manifest_file

positional arguments:
  manifest_file         Path to the manifest

options:
  -h, --help            show this help message and exit
  -n                    Dry run
  --parallel_downloads PARALLEL_DOWNLOADS
                        Number of parallel downloads
```
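
`--parallel_downloads` sets the number of downloads that run at once. As an
illustration of bounded parallelism in general (not of this tool's internals),
a minimal sketch with a thread pool follows. `fake_download` is a hypothetical
placeholder that performs no network access, and the URLs are invented.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_download(url):
    """Hypothetical placeholder: return the file name a download would write."""
    return url.rsplit("/", 1)[-1]


def download_all(urls, parallel_downloads=2):
    """Run at most parallel_downloads downloads at a time, keeping input order."""
    with ThreadPoolExecutor(max_workers=parallel_downloads) as pool:
        return list(pool.map(fake_download, urls))


# Hypothetical URLs for illustration only.
urls = [f"https://example.org/reads/file_{i}.fastq.gz" for i in range(4)]
downloaded = download_all(urls, parallel_downloads=2)
print(downloaded)
```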

### bpa-file-downloader

Downloads a file from `bioplatforms_url` to `file_name`. Requires the
environment variable `BPA_APIKEY` to be set.

#### Usage

```bash
atol-genome-launcher version 0.1.3.dev0+g09f43177b.d20251021
usage: bpa-file-downloader [-h] [--file_checksum FILE_CHECKSUM] bioplatforms_url file_name

positional arguments:
  bioplatforms_url
  file_name

options:
  -h, --help            show this help message and exit
  --file_checksum FILE_CHECKSUM
```

### pipeline-result-uploader

Reads the YAML `manifest` and walks the output directory to find result files
for a `stage`, e.g. "genomeassembly".

Uploads the files to the given `bucket`, under the same path as the result
file. If the files are specified for compression in the
[config](src/yaml_manifest/directory_layout.json), they will be compressed
before upload.

**Requires the same [environment
variables](https://github.com/TomHarrop/atol-genome-launcher?tab=readme-ov-file#required-environment-variables)
as `result-file-uploader`**.

#### Usage

```bash
usage: pipeline-result-uploader [-h] --stage STAGE --bucket BUCKET [--parallel_downloads PARALLEL_DOWNLOADS] [-n] manifest receipts_file

Collect pipeline result files and upload them to S3-compatible object storage using rclone.

positional arguments:
  manifest              Path to the YAML manifest file.
  receipts_file         jsonl file to store the upload receipts

options:
  -h, --help            show this help message and exit
  --stage STAGE         Pipeline stage to collect results from (e.g. 'genomeassembly', 'ascc').
  --bucket BUCKET       Name of the S3 bucket.
  --parallel_downloads PARALLEL_DOWNLOADS
                        Number of parallel downloads
  -n                    Dry run
```

For testing, the rclone remote name can be set using `--rclone_remote_name`,
and the directory to search for files to upload can be set using
`--result_dir`.
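
The `receipts_file` is described as jsonl, i.e. one JSON object per line. A
minimal sketch of reading such a file back follows. The field names in the
sample (`remote_path`, `sha256sum`) are assumptions for illustration, not the
documented receipt format.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Return one parsed object per non-empty line of a JSON Lines file."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                records.append(json.loads(line))
    return records


# Write a small hypothetical receipts file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    receipts_path = Path(tmp_dir) / "receipts.jsonl"
    receipts_path.write_text(
        '{"remote_path": "results/a.txt", "sha256sum": "abc"}\n'
        '{"remote_path": "results/b.txt", "sha256sum": "def"}\n',
        encoding="utf-8",
    )
    receipts = read_jsonl(receipts_path)

print(len(receipts))
```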

### result-file-uploader

Uploads a result file to object storage. Prints the remote path and sha256sum
to stdout.
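
Because the sha256sum is printed, it can be cross-checked against the local
file. A minimal sketch of computing a SHA-256 digest with the standard library
follows. The helper name `sha256_of_file` is hypothetical and not part of this
package.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large result files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Create a tiny example file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    example_file = Path(tmp_dir) / "result.txt"
    example_file.write_bytes(b"hello\n")
    local_digest = sha256_of_file(example_file)

print(local_digest)
```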

> [!WARNING]
>
> Uses `rclone copyto`, so **destination files will be overwritten**.


#### Required environment variables

 | Variable                                 | Description                | Example |
 | ---------------------------------------- | -------------------------- | ------- |
 | `RCLONE_CONFIG_UPLOAD_TYPE`              | Rclone backend type        | "s3"    |
 | `RCLONE_CONFIG_UPLOAD_PROVIDER`          | S3-compatible provider     | "Ceph"  |
 | `RCLONE_CONFIG_UPLOAD_ACCESS_KEY_ID`     | S3 access key              |         |
 | `RCLONE_CONFIG_UPLOAD_SECRET_ACCESS_KEY` | S3 secret key              |         |
 | `RCLONE_CONFIG_UPLOAD_ENDPOINT`          | S3-compatible endpoint URL |         |

#### Usage

```bash
usage: result-file-uploader [-h] --bucket BUCKET local_file remote_path

Upload a single file to S3-compatible object storage using rclone.

positional arguments:
  local_file       Path to the local file to upload.
  remote_path      Destination key/path within the bucket.

options:
  -h, --help       show this help message and exit
  --bucket BUCKET  Name of the S3 bucket.
```
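
A missing environment variable is easier to diagnose before the upload starts.
The following minimal sketch checks that the required variables listed above
are set. The helper `missing_rclone_vars` is hypothetical and not part of this
package.

```python
import os

# Variable names taken from the "Required environment variables" table.
REQUIRED_VARS = (
    "RCLONE_CONFIG_UPLOAD_TYPE",
    "RCLONE_CONFIG_UPLOAD_PROVIDER",
    "RCLONE_CONFIG_UPLOAD_ACCESS_KEY_ID",
    "RCLONE_CONFIG_UPLOAD_SECRET_ACCESS_KEY",
    "RCLONE_CONFIG_UPLOAD_ENDPOINT",
)


def missing_rclone_vars(environ=None):
    """Return the required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_rclone_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required rclone variables are set.")
```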


### rnaseq-manifest-generator

Queries the mapped metadata for an organism (`organism_grouping_key`) and
outputs a CSV-format manifest of RNASeq files.

#### Usage

```bash
usage: rnaseq-manifest-generator [-h] --resources RESOURCES --packages PACKAGES organism_grouping_key manifest

Generate a manifest of RNAseq data for an organism.

positional arguments:
  organism_grouping_key
                        Data Mapper organism_grouping_key
  manifest              Path to output the manifest

options:
  -h, --help            show this help message and exit
  --resources RESOURCES
                        Mapped Resources CSV. FIXME. Should be JSON.
  --packages PACKAGES   Mapped Packages CSV. FIXME. Should be JSON.
```

### rnaseq-reads-downloader

Takes a CSV-format manifest of RNASeq files, runs the `bpa-file-downloader` for
each file, and combines the downloaded files by sample.

#### Usage

```bash
usage: rnaseq-reads-downloader [-h] [--parallel_downloads PARALLEL_DOWNLOADS] manifest outdir

positional arguments:
  manifest              Path to the manifest
  outdir                Output directory

options:
  -h, --help            show this help message and exit
  --parallel_downloads PARALLEL_DOWNLOADS
                        Number of parallel downloads
```
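
The combining step can be pictured as grouping manifest rows by sample and
concatenating each group's files. The sketch below is an assumption about the
general approach, not the tool's implementation. The column names `sample` and
`file_name` are hypothetical, and it relies on the fact that concatenated gzip
files form a valid gzip stream.

```python
import csv
import gzip
import io
import shutil
import tempfile
from collections import defaultdict
from pathlib import Path


def group_files_by_sample(manifest_csv_text):
    """Map each sample to its list of file names, in manifest order."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(manifest_csv_text)):
        groups[row["sample"]].append(row["file_name"])
    return dict(groups)


def concatenate(parts, destination):
    """Append the bytes of each part to destination, in order."""
    with open(destination, "wb") as out_handle:
        for part in parts:
            with open(part, "rb") as in_handle:
                shutil.copyfileobj(in_handle, out_handle)


# Hypothetical manifest with two files for sample s1 and one for s2.
manifest_text = "sample,file_name\ns1,a.fastq.gz\ns1,b.fastq.gz\ns2,c.fastq.gz\n"
groups = group_files_by_sample(manifest_text)

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp = Path(tmp_dir)
    # Create tiny gzip files so the example is self-contained.
    for file_names in groups.values():
        for file_name in file_names:
            with gzip.open(tmp / file_name, "wb") as handle:
                handle.write(f"read from {file_name}\n".encode())
    combined = tmp / "s1.combined.fastq.gz"
    concatenate([tmp / name for name in groups["s1"]], combined)
    # Reading the combined file back decompresses both gzip members.
    with gzip.open(combined, "rb") as handle:
        combined_text = handle.read().decode()

print(combined_text)
```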

### assembly-config-generator

Generates config files for the sanger-tol/genomeassembly pipeline.

#### Usage

```bash
usage: assembly-config-generator [-h] --long_reads LONG_READS [--hic_reads HIC_READS] [--template TEMPLATE] config pipeline_config

positional arguments:
  config
  pipeline_config

options:
  -h, --help            show this help message and exit
  --long_reads LONG_READS
  --hic_reads HIC_READS
  --template TEMPLATE
```
