# One Health Microbiome + Exposomics Skill Roadmap


This repository tracks my technical skill-building journey across 24 months (2 hours/day), focused on computational biology, exposomics, and microbiome science in the One Health framework.
## 1. Biological Domain Knowledge
Total Time: ~7 weeks
Focus: Microbiome ecology, host–microbe–toxin interaction, One Health, exposure biology, CRISPR, TA systems
### 1.1. Microbiome Ecology
| Sub-Skill | Learn To... | Notes |
| ----------------------------------- | --------------------------------------------------------------------- | ------------------------------------------------- |
| Microbial ecology principles | Understand diversity, richness, evenness; define core/rare taxa | Alpha/beta/gamma diversity; richness vs abundance |
| Functional guilds & metabolic roles | Interpret what microbes do (e.g., SCFA production, vitamin synthesis) | Use KEGG/MetaCyc modules to connect pathways |
| Colonization resistance & dysbiosis | Identify when microbial communities shift toward pathogenic states | Important for exposome impact assessment |
| Cross-biome connections | Compare oral–gut–skin–lung microbiomes | One Health relevance in zoonotic exchange |
### 1.2. Host–Microbe Interactions
| Sub-Skill | Learn To... | Notes |
| ------------------------------ | --------------------------------------------------------------------- | -------------------------------------------- |
| Mucosal immunity | Understand immune education by commensals | Especially IL-10, Tregs, IgA |
| Microbe-induced signaling | Analyze microbial metabolites (e.g., SCFA, bile acids) affecting host | Integration with metabolomics/exposomics |
| Barrier function | Understand epithelial integrity, tight junctions, and permeability | Used in toxicology and inflammation contexts |
| Host gene expression responses | Link microbiome shifts to host transcriptomic changes | Required for multi-omics correlation work |
### 1.3. One Health Systems Thinking
| Sub-Skill | Learn To... | Notes |
| ---------------------------------------- | ----------------------------------------------------------- | ------------------------------------------------------- |
| Cross-species microbiome transmission | Track microbial exchange between human–animal–environment | Zoonoses, AMR transfer, ecological modeling |
| Reservoirs and environmental persistence | Understand how soil, water, or air act as microbial vectors | Especially relevant for exposure sources |
| Shared exposure consequences | Compare how toxins affect different hosts | Use same metabolite pathways across hosts |
| Surveillance and global monitoring | Explore WHO/FAO/OIE approaches to microbial surveillance | Helpful for integrative use cases or policy translation |
### 1.4. Exposomics and Environmental Toxicology
| Sub-Skill | Learn To... | Notes |
| -------------------------------- | ---------------------------------------------------------------------------------------- | -------------------------------------------------------------------------- |
| Exposure routes and kinetics | Inhalation, dermal, oral; ADME principles | Know how chemicals reach microbes and host tissues |
| Xenobiotic metabolism | Identify microbial enzymes (e.g., azoreductase, nitroreductase) that transform chemicals | Impacts metabolomics interpretation |
| Host–microbe–chemical crosstalk | Recognize interaction effects (e.g., dysbiosis after exposure) | Synergistic or antagonistic outcomes matter |
| Microbial indicators of exposure | Use taxa as biomarkers of toxin presence | Differential abundance signatures or functional genes (e.g., efflux pumps) |
### 1.5. CRISPR, Toxin–Antitoxin (TA), and Other Mobile Elements
| Sub-Skill | Learn To... | Notes |
| ----------------------------------- | --------------------------------------------------------------- | ------------------------------------------------ |
| CRISPR-Cas systems | Distinguish Class I vs II; identify arrays and cas operons | Required for Spacerome, viral defense analysis |
| Toxin–antitoxin systems | Understand types (I–VI), addiction modules, and stress response | Metatranscriptomics of TA genes |
| Mobile genetic elements | Detect plasmids, phages, ICEs | Important in AMR spread, functional plasticity |
| Horizontal gene transfer mechanisms | Conjugation, transformation, transduction | For exposure-driven evolution/adaptation studies |
### 1.6. Taxonomic, Phylogenetic & Evolutionary Foundations
| Sub-Skill | Learn To... | Notes |
| ----------------------------- | ------------------------------------------------------------ | ---------------------------------------------------- |
| Prokaryotic taxonomy | Understand NCBI/SILVA/GTDB hierarchies | Required for mapping and data standardization |
| Strain-level diversity | Differentiate between species, strains, and genotypes | Applies in MAG analysis, shotgun metagenomics |
| Phylogenetic inference | Interpret phylogenies based on marker genes or whole genomes | Used for tree-based analysis, functional predictions |
| Evolution of microbial traits | Understand selective pressures, niche adaptation | For interpreting functional enrichment results |
## 2. Biostatistics & Quantitative Modeling
Total Time: ~8 weeks
Focus: GLMs, ZINB, PERMANOVA, compositional stats, differential abundance, mixed models
### 2.1. Foundations of Statistical Thinking
| Sub-Skill | Learn To... | Notes |
| --------------------------- | ---------------------------------------------------------------------------------- | ------------------------------------------------------ |
| Probability theory | Interpret distributions, likelihoods, prior beliefs | Essential for understanding Bayesian models |
| Key distributions | Normal, Poisson, Binomial, Negative Binomial, Zero-Inflated, Dirichlet Multinomial | Needed for microbiome data and overdispersion modeling |
| Estimation & testing | Perform point estimation, confidence intervals, hypothesis testing | Learn frequentist and Bayesian logic |
| Multiple testing correction | Apply FDR (Benjamini–Hochberg), Bonferroni, q-value | Required in multi-omic differential abundance tests |
### 2.2. Generalized Linear and Mixed Models
| Sub-Skill | Learn To... | Notes |
| --------------------------------------- | ---------------------------------------------------------------------- | -------------------------------------------------------- |
| Linear models (LM) | Run `lm()` models with continuous outcomes | Foundation for ANOVA, regression, and many multivariate methods |
| GLMs | Use `glm()` for count and binary outcomes (log, logit, identity links) | `family = poisson`, `quasipoisson`, `binomial`; negative binomial via `MASS::glm.nb` |
| Generalized linear mixed models (GLMMs) | Use `lme4::glmer`, `glmmTMB`, `nlme` to handle nested/repeated designs | For longitudinal exposome or repeated microbiome samples |
| Model diagnostics | Residuals, AIC, pseudo-R², overdispersion checks | Ensure you're not misinterpreting noise as signal |
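
A minimal sketch of the models this table points to, assuming a hypothetical long-format data frame `taxon_long` with columns `count`, `group`, `timepoint`, `subject_id`, and `library_size`; the family, offset, and random-effect structure must be matched to the real design.

```r
# Negative-binomial GLMM sketch for repeated per-subject counts (glmmTMB).
# All column names below are hypothetical placeholders.
library(glmmTMB)

fit <- glmmTMB(
  count ~ group * timepoint +           # fixed effects of interest
    offset(log(library_size)) +         # adjust for sequencing depth
    (1 | subject_id),                   # random intercept per subject
  family = nbinom2,                     # negative binomial for overdispersed counts
  data = taxon_long
)

summary(fit)                            # coefficients, dispersion, random-effect variance
# performance::check_overdispersion(fit)  # optional diagnostic if {performance} is installed
```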
### 2.3. Zero-Inflation, Compositionality & Normalization
| Sub-Skill | Learn To... | Notes |
| ------------------------- | ---------------------------------------------------------- | ----------------------------------------------------------- |
| Compositional data issues | Understand sum-constrained proportions, false correlations | Affects microbiome, metabolomics, exposome data |
| Transformations | Apply CLR, ILR, ALR, log + pseudocount, TSS | Use `microbiome::transform`, `compositions::clr` |
| Normalization | VST, RLE, CSS, GMPR, rarefaction | Learn when to use each based on your method |
| Zero-inflated modeling | Choose ZINB, hurdle, or ZIBR based on dropout structure | Use `glmmTMB`, `zinbwave`, `ZIBR`, `ANCOM-BC` appropriately |
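
To make the transformation step concrete, here is a small CLR sketch on made-up counts; on real data, `compositions::clr()` or `microbiome::transform()` (for phyloseq objects) achieve the same result.

```r
# CLR transform of a toy taxa-by-sample count matrix (values are invented).
counts <- matrix(c(10,  0,  5,
                   200, 50, 0,
                   30,  20, 80),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("taxon", 1:3), paste0("sample", 1:3)))

pseudo <- counts + 1                                # pseudocount so zeros survive the log
props  <- sweep(pseudo, 2, colSums(pseudo), "/")    # per-sample proportions (TSS)

# CLR: log proportion minus the per-sample mean of log proportions
clr_mat <- apply(props, 2, function(x) log(x) - mean(log(x)))
clr_mat

# Equivalent with the compositions package (expects compositions in rows):
# compositions::clr(t(props))
```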
### 2.4. Differential Abundance & Expression Modeling
| Sub-Skill | Learn To... | Notes |
| --------------------------- | ------------------------------------------------------------- | --------------------------------------------------------------- |
| Classical methods | Apply t-test, ANOVA, Kruskal–Wallis, Wilcoxon | Only for simple comparisons — avoid overuse |
| RNA-seq-inspired models | Use `DESeq2`, `edgeR`, `limma-voom`, `voomWithQualityWeights` | Base for protein, transcript, or taxonomic abundance shifts |
| Compositional-aware methods | Use `ALDEx2`, `ANCOM-BC`, `MaAsLin2`, `metagenomeSeq` | Choose based on effect size structure, covariates, and sparsity |
| Model comparison & post-hoc | Compare fit via AIC/BIC; apply post-hoc Tukey or emmeans | Necessary when doing multi-group comparisons |
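
A hedged DESeq2 sketch for the RNA-seq-inspired route, assuming a hypothetical integer feature-by-sample matrix `count_mat` and metadata `meta` with a `group` column; for sparse, compositional microbiome tables the compositional-aware tools listed above are often preferable.

```r
# Differential-abundance sketch with DESeq2; count_mat and meta are hypothetical objects.
library(DESeq2)

dds <- DESeqDataSetFromMatrix(
  countData = count_mat,           # integer counts, features in rows, samples in columns
  colData   = meta,                # sample metadata; rows align with columns of count_mat
  design    = ~ group              # extend with covariates, e.g. ~ batch + group
)

dds <- DESeq(dds)                  # size factors, dispersion estimates, Wald tests
res <- results(dds, alpha = 0.05)  # log2 fold changes with BH-adjusted p-values
head(res[order(res$padj), ])       # strongest signals first
```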
### 2.5. Multivariate & Ordination Methods
| Sub-Skill | Learn To... | Notes |
| ------------------------- | ------------------------------------------------------------- | ------------------------------------------------------- |
| PCA / PCoA / NMDS | Reduce dimensionality and visualize beta-diversity | Use `vegan::metaMDS`, `ape::pcoa`, `prcomp` |
| Distance metrics | Understand Bray-Curtis, Aitchison, Jaccard, Euclidean | Choice affects ordination structure and interpretation |
| PERMANOVA & Adonis | Run `vegan::adonis2` for group separation on distance matrices | Standard for microbiome beta-diversity comparisons |
| Procrustes & Mantel tests | Compare ordination structures across omics | Important for integrative exposomics/microbiome studies |
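
A short vegan sketch tying distance, PERMANOVA, and ordination together, assuming a hypothetical samples-by-taxa matrix `otu` and metadata `meta` with a `group` column.

```r
# Beta-diversity sketch with vegan; `otu` (samples x taxa) and `meta` are hypothetical.
library(vegan)

bray <- vegdist(otu, method = "bray")      # Bray-Curtis dissimilarity matrix

# PERMANOVA: does community composition differ between groups?
adonis2(bray ~ group, data = meta, permutations = 999)

# NMDS ordination for visualization
ord   <- metaMDS(bray, k = 2, trymax = 50)
sites <- scores(ord, display = "sites")    # sample coordinates
plot(sites, col = as.factor(meta$group), pch = 19,
     xlab = "NMDS1", ylab = "NMDS2")
```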
### 2.6. Longitudinal & Hierarchical Modeling
| Sub-Skill | Learn To... | Notes |
| ------------------------ | ---------------------------------------------------------------- | ---------------------------------------------------- |
| Repeated measures models | Mixed models (`lmer`, `glmmTMB`, `nlme`) for repeated timepoints | Use for exposome × time, diet interventions, etc. |
| Time series smoothing | Use `splines`, loess smoothers, or `mgcv::gam` for trend detection | Needed for exposome drift and adaptation modeling |
| MaAsLin2, ZIBR, ZINBMM | Use longitudinal microbiome/exposome-specific tools | Part of MiNDSET development and interpretation logic |
### 2.7. Bayesian & Simulation-Based Inference
| Sub-Skill | Learn To... | Notes |
| ----------------------------- | ---------------------------------------------------------------- | ----------------------------------------------------------------------- |
| Bayesian modeling | Use `brms`, `rstanarm`, `JAGS`, `Stan` | Full uncertainty modeling for complex biological systems |
| Posterior estimation & priors | Understand credible intervals, shrinkage, regularization | Required for ZIDM, empirical Bayes, and probabilistic exposome inference |
| Simulated data testing | Use `simstudy`, `synthpop`, or custom scripts for power analysis | Especially when working with underpowered animal/environmental datasets |
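
For orientation, a small brms sketch of a Bayesian negative-binomial mixed model, assuming a hypothetical data frame `dat` with columns `count`, `exposure`, and `subject_id`; the prior is only illustrative and fitting requires a working Stan toolchain.

```r
# Bayesian negative-binomial mixed model sketch with brms (compiles a Stan model).
library(brms)

fit <- brm(
  count ~ exposure + (1 | subject_id),              # exposure effect, subject-level intercepts
  family = negbinomial(),                           # overdispersed counts
  prior  = set_prior("normal(0, 1)", class = "b"),  # weakly informative slope prior (illustrative)
  data   = dat,
  chains = 4, iter = 2000, seed = 1
)

summary(fit)   # posterior estimates, credible intervals, Rhat diagnostics
```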
## 3. Programming & Scripting
Total Time: ~6 weeks
Focus: R (tidyverse, Bioconductor), Python (pandas, Biopython), Bash (file parsing, job scripts)
### 3.1. R Programming for Statistical and Microbiome Analysis
| Sub-Skill | Learn To... | Notes |
| -------------------------- | ----------------------------------------------------------------------------- | ---------------------------------------------------- |
| Tidyverse core | Use `dplyr`, `tidyr`, `ggplot2`, `tibble`, `forcats` for clean pipelines | Tidy data in, tidy data out |
| Bioconductor packages | Use `phyloseq`, `DESeq2`, `edgeR`, `limma`, `vegan`, `microbiome` | Essential for omics analysis |
| Plotting | Make publication-ready plots with `ggplot2`, `patchwork`, `cowplot`, `ggpubr` | Modular, themable, scalable plots |
| Data reshaping | Use `pivot_longer`, `pivot_wider`, `separate`, `unite` | Clean metadata or taxonomic tables |
| Writing functions | Wrap routines into reusable, modular functions | Critical for scaling code and pipelines |
| Error handling & debugging | Use `tryCatch`, `message()`, `stop()` | For large-scale batch processing and robust wrappers |
| Literate programming | Write `RMarkdown`, `Quarto`, `.Rmd` reports | For reproducible reports and notebooks |
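
A compact tidyverse sketch (toy numbers) of the reshape-then-plot pattern these rows describe.

```r
# Reshape a wide relative-abundance table to long format and plot it (toy data).
library(dplyr)
library(tidyr)
library(ggplot2)

wide <- tibble(
  sample_id   = c("S1", "S2", "S3", "S4"),
  group       = c("control", "control", "exposed", "exposed"),
  Bacteroides = c(0.42, 0.38, 0.31, 0.25),
  Prevotella  = c(0.10, 0.12, 0.22, 0.30)
)

long <- wide %>%
  pivot_longer(cols = c(Bacteroides, Prevotella),
               names_to = "taxon", values_to = "rel_abund")

ggplot(long, aes(x = group, y = rel_abund, fill = taxon)) +
  geom_boxplot() +
  labs(y = "Relative abundance", x = NULL) +
  theme_minimal()
```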
### 3.2. Python for Parsing, Preprocessing & Machine Learning
| Sub-Skill | Learn To... | Notes |
| ------------------------ | ------------------------------------------------------------------------------ | ----------------------------------------------------------- |
| Data manipulation | Use `pandas`, `numpy`, `glob`, `os` to wrangle and process files | Ideal for working with metadata, logs, sequence files |
| Plotting | Use `matplotlib`, `seaborn`, `plotly` for dynamic plots | `sns.clustermap` for heatmaps, `plt.subplots` for panels |
| Machine learning | Use `scikit-learn` for RF, SVM, XGBoost, pipelines | For supervised learning & interpretable ML |
| Bioinformatics utilities | Use `Biopython`, `ete3`, `scikit-bio`, `PyPHLAWD` | FASTA/FASTQ parsing, phylogeny, alignment |
| API access & automation | Use `requests`, `json`, `xml`, `BeautifulSoup` for scraping or data extraction | Programmatic data retrieval (e.g., SRA, MGnify, BioSamples) |
| Writing clean scripts | Write `.py` modules, argparse CLI wrappers, logging | For building scalable tools and batch jobs |
### 3.3. Bash and Command-Line Proficiency
| Sub-Skill | Learn To... | Notes |
| ------------------------- | ------------------------------------------------------------------ | ------------------------------------------------------------ |
| Navigation & file ops | Use `cd`, `ls`, `mv`, `cp`, `rm`, `mkdir`, `find`, `xargs`, `tree` | For fast project traversal and data prep |
| File parsing | Use `cut`, `awk`, `sed`, `sort`, `uniq`, `grep`, `head`, `tail` | For log parsing, sample lists, metadata cleanup |
| Job scheduling (HPC) | Use `sbatch`, `qsub`, `squeue`, `sacct`, `#!/bin/bash` headers | For running batch pipelines on clusters |
| Permissions & environment | Use `chmod`, `chown`, `PATH`, `export`, `.bashrc` | To avoid execution errors and dependency hell |
| Writing shell scripts | Automate routine jobs with `.sh` scripts and parameterized loops | For reproducible project automation |
| Command-line utilities | Install and use tools like `fastqc`, `multiqc`, `seqkit`, `jq` | Widely used in pre/post processing for multi-omics workflows |
### 3.4. Inter-language Integration & Reusability
| Sub-Skill | Learn To... | Notes |
| --------------------------------- | ------------------------------------------------------------------------- | ------------------------------------------------------- |
| Call R from Python (`rpy2`) | Seamlessly combine ggplot2 with scikit-learn models | Advanced, but powerful |
| Call Python from R (`reticulate`) | Use machine learning tools within R pipelines | Use inside Quarto/Rmd |
| Modular function writing | Split analysis into reusable scripts or R/Py functions | Better for pipelines & reproducibility |
| Logging and messaging | Use `logging` (Python), `message()` (R), `echo` (Bash) for robust outputs | Helps in debugging large runs |
| Config-driven scripts | Load parameters via `.yaml`, `.json`, or `.toml` | Needed for Snakemake, Nextflow, or pipeline integration |
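
A minimal reticulate sketch for the Python-from-R direction, assuming reticulate can find a Python environment with scikit-learn installed; the data are random toys.

```r
# Call scikit-learn from R via reticulate (toy data; requires scikit-learn in the Python env).
library(reticulate)

sk <- import("sklearn.ensemble")

X <- matrix(rnorm(100 * 5), nrow = 100)        # toy feature matrix (auto-converted to numpy)
y <- sample(c(0L, 1L), 100, replace = TRUE)    # toy binary labels

rf <- sk$RandomForestClassifier(n_estimators = 200L)
rf$fit(X, y)
pred <- rf$predict(X)
mean(pred == y)                                # training accuracy, for illustration only
```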
## 4. Pipeline Development & Workflow Management
Total Time: ~5 weeks
Focus: Snakemake, Nextflow, workflow portability and sharing
### 4.1. Workflow Architecture Fundamentals
| Sub-Skill | Learn To... | Notes |
| ----------------------- | ---------------------------------------------------------- | --------------------------------------------------------- |
| DAG thinking | Design workflows as Directed Acyclic Graphs (DAGs) | Know how input/output files drive task dependencies |
| Rule chaining | Define jobs that depend on outputs from prior rules | Crucial for reproducible logic |
| Inputs, outputs, params | Write pipeline rules with dynamic wildcards and parameters | Know the difference between rule-level and global configs |
| Rule modularity | Separate reusable rules into modules/snippets | Encourages reusability across pipelines |
| Configuration files | Use YAML/JSON to parameterize your pipeline | Keeps code clean and flexible across datasets |
### 4.2. Snakemake
| Sub-Skill | Learn To... | Notes |
| ------------------------- | ------------------------------------------------------------------ | ----------------------------------------------------- |
| Rule definition | Write rules with `input`, `output`, `params`, `shell`, `resources` | Core of Snakemake pipelines |
| Wildcards and checkpoints | Handle variable filenames, sample-specific outputs | Use `{sample}` or `{group}` patterns |
| Config files | Load sample sheets and params from `.yaml` | Essential for dataset-specific runs |
| Logging & reports | Write `log:` blocks and generate workflow reports | `snakemake --report` for HTML summaries |
| Conda/env integration | Use `conda:` block to manage per-rule environments | Promotes reproducibility and avoids version conflicts |
| Cluster execution | Use `--cluster sbatch` and profile configs | Integrates seamlessly with SLURM and SGE |
### 4.3. Nextflow
| Sub-Skill | Learn To... | Notes |
| -------------------------------- | -------------------------------------------------------- | --------------------------------------------------- |
| Processes and channels | Write processes and connect them via channels | Nextflow is channel-driven (dataflow paradigm) |
| Input/output declaration | Use `from`, `into`, `tuple`, `file()` for channel IO | More explicit than Snakemake’s rule chaining |
| Parameterization | Use `params.config` and `.nf` config files | Supports environment switching, AWS profiles, etc. |
| Docker & Singularity integration | Declare container for each process | Enables exact reproducibility and cloud portability |
| DSL2 modular structure | Use modules, workflows, main.nf for large-scale projects | nf-core compliant structure |
### 4.4. Workflow Design Best Practices
| Sub-Skill | Learn To... | Notes |
| ----------------------------- | ------------------------------------------------------------------------- | ---------------------------------------------- |
| Sample sheet integration | Load `.csv` or `.tsv` sample metadata for looping rules | Important for automating per-sample operations |
| Reusability and encapsulation | Split logic into subworkflows or module files | Avoids massive monolithic scripts |
| Dry-run and benchmarking | Test rules without execution (`--dry-run`, `touch`) | Safe pre-run validation |
| Resource optimization | Set `threads`, `memory`, and `runtime` per rule | Required for HPC scaling or SLURM efficiency |
| Error handling and debugging | Use `--rerun-incomplete`, `--printshellcmds`, and logging | For traceability and crash recovery |
| Output structure | Keep `results/`, `logs/`, `config/`, `scripts/`, and `workflow/` separate | Makes repos easier to navigate and share |
### 4.5. Workflow Portability & Sharing
| Sub-Skill | Learn To... | Notes |
| ----------------------------------- | ------------------------------------------------------------------------------ | --------------------------------------------------------------- |
| Containerized rules | Run pipelines with `--use-conda`, `--use-singularity`, `docker.enabled = true` | For portable pipelines |
| Profiles for different systems | Create cluster profiles for local, HPC, AWS | Config switching via `--profile hpc` etc. |
| Publishing workflows | Structure repos with `README.md`, `envs/`, `config/`, `workflow/` | Enables GitHub distribution or nf-core compatibility |
| Workflow documentation | Write usage examples, environment setup instructions | Critical for collaboration, reproducibility, and reviewer trust |
| Integration with `make` or wrappers | Optional: use `Makefile` to wrap Snakemake or Nextflow commands | For simplified command-line entry points |
## 5. Containerization & Environment Management
Total Time: ~5 weeks
Focus: Conda, R, Python, Docker, Singularity
### 5.1. Conda & Environment Management (Cross-language)
| Sub-Skill | Learn To... | Notes |
| ------------------------------ | ------------------------------------------------------------------------- | ---------------------------------------- |
| Creating environments | Use `conda create -n env_name pkg1 pkg2`, `conda activate` | Standard across Python/R pipelines |
| Exporting & sharing | `conda env export > env.yml`; recreate with `conda env create -f env.yml` | Always version-lock your environments |
| Environment isolation | Use different envs for different projects | Prevent dependency conflicts |
| Bioconda & channels | Learn how to install tools from `bioconda`, `conda-forge` | e.g., `conda install -c bioconda fastqc` |
| Snakemake/Nextflow integration | Use `conda:` block per rule to autoinstall envs | Enables reproducibility per-step |
### 5.2. R Environment Management with renv
| Sub-Skill | Learn To... | Notes |
| ----------------------- | -------------------------------------------------------- | -------------------------------------------- |
| Project isolation | Use `renv::init()` to create a local project environment | Tracks all installed packages in `renv.lock` |
| Dependency tracking | Auto-record versions of every R package | Essential for reproducibility in notebooks |
| Environment restoration | Use `renv::restore()` on a new system to rebuild the env | Like `conda env create`, but for R |
| GitHub integration | Commit `renv.lock`, ignore `renv/library` | Makes repos portable but lightweight |
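
The typical renv loop, sketched as commands run from inside an R project.

```r
# renv workflow sketch (run from within the project).
install.packages("renv")

renv::init()       # create a project-local library and an initial renv.lock
# ...install and use packages as usual, e.g. install.packages("vegan")...
renv::snapshot()   # record exact package versions in renv.lock
renv::status()     # check for drift between the library and the lockfile

# On a fresh clone of the repository:
renv::restore()    # rebuild the project library from renv.lock
```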
### 5.3. Docker (Containers for Everything)
| Sub-Skill | Learn To... | Notes |
| --------------------------- | ------------------------------------------------------------------------- | ------------------------------------- |
| Dockerfile creation | Write `FROM`, `RUN`, `COPY`, `CMD` layers | Builds a reproducible container image |
| Building and running images | `docker build -t mytool .`, `docker run -v $PWD:/data mytool` | Mount volumes, pass arguments |
| Versioned images | Tag builds with version numbers (`:v1.0`, `:latest`) | Use in pipeline configs |
| Base images | Choose wisely: `rocker/rstudio`, `continuumio/miniconda`, `biocontainers` | Ensures reproducibility across builds |
| Entrypoints & CMDs | Create CLI wrappers using Python, R, or Bash | For command-line tool containers |
### 5.4. Singularity (for HPC & Academic Clusters)
| Sub-Skill | Learn To... | Notes |
| -------------------------------- | ----------------------------------------------------------------- | ---------------------------------------- |
| Running images | Use `singularity run mycontainer.sif` | Needed when Docker is unavailable on HPC |
| Building containers | Convert Docker image: `singularity build out.sif docker://ubuntu` | Leverages DockerHub ecosystem |
| Binding volumes | Use `--bind /path:/container/path` for data I/O | Required for filesystem access on HPC |
| Snakemake & Nextflow integration | Use `--use-singularity` or process-level container declaration | Fully portable pipelines |
| Managing versions | Store `.sif` files with version info | Stable and audit-ready for publications |
### 5.5. Container Registries & Distribution
| Sub-Skill | Learn To... | Notes |
| --------------------------- | -------------------------------------------------------------------------- | ----------------------------------------- |
| DockerHub | Push/pull your Docker containers with `docker push` | Enables sharing tools publicly |
| Biocontainers | Use community-curated tools via `bioconda` or `docker://biocontainers/...` | Saves effort building from scratch |
| GitHub Container Registry | Optional: publish private/protected containers with GitHub Actions | For organizations or controlled workflows |
| Registry tagging & security | Manage tokens, tags, and container metadata | Especially when deploying at scale |
### 5.6. Integrated Environment Strategies
| Strategy | Use Case | Stack |
| ----------------------------------------------------- | ---------------------------------------------- | ------------------------------------- |
| Conda-only | Lightweight, flexible, simple pipelines | `env.yml` |
| Docker + Conda | Portability + dependency control | Dockerfile + `env.yml` |
| Singularity + Conda | HPC-compatible reproducibility | `.sif` + `env.yml` |
| Full stack (Snakemake/Nextflow + Singularity + Conda) | Production-grade, fully reproducible workflows | Multi-config systems with YAML params |
## 6. Multi-Omics Data Processing
Total Time: ~12 weeks
Focus: 16S amplicon, shotgun metagenomics/WGS, metatranscriptomics, metaproteomics, metabolomics/exposomics
### 6.1. Amplicon Sequencing (16S rRNA / ITS)
| Sub-Skill | Learn To... | Tools |
| ------------------------------ | ---------------------------------------------- | ------------------------------------------------ |
| Import and demultiplex | Handle multiplexed read files, barcodes | QIIME2, cutadapt, fastp |
| Denoising | Use ASV-level methods for error correction | DADA2, Deblur |
| Chimera filtering | Identify and remove artifacts | DADA2 built-in, VSEARCH |
| Taxonomic classification | Use reference databases to assign taxonomy | SILVA, Greengenes, GTDB with Naive Bayes, IDTAXA |
| Phylogenetic tree construction | Align representative sequences and build trees | MAFFT + FastTree |
| Diversity analyses | Alpha/beta diversity, ordination, PERMANOVA | QIIME2, phyloseq, vegan |
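
A condensed DADA2 sketch of the denoise-to-taxonomy path above; the file paths, truncation lengths, and SILVA training-set filename are placeholders that must be adapted to the actual run.

```r
# DADA2 ASV sketch; paths and parameters below are hypothetical placeholders.
library(dada2)

fnFs   <- sort(list.files("reads", pattern = "_R1.fastq.gz", full.names = TRUE))
fnRs   <- sort(list.files("reads", pattern = "_R2.fastq.gz", full.names = TRUE))
filtFs <- file.path("filtered", basename(fnFs))
filtRs <- file.path("filtered", basename(fnRs))

filterAndTrim(fnFs, filtFs, fnRs, filtRs,
              truncLen = c(240, 180), maxEE = c(2, 2), compress = TRUE)

errF <- learnErrors(filtFs)                 # learn run-specific error models
errR <- learnErrors(filtRs)
ddF  <- dada(filtFs, err = errF)            # ASV inference, forward reads
ddR  <- dada(filtRs, err = errR)            # ASV inference, reverse reads
merged <- mergePairs(ddF, filtFs, ddR, filtRs)

seqtab <- makeSequenceTable(merged)
seqtab <- removeBimeraDenovo(seqtab, method = "consensus")   # chimera removal
taxa   <- assignTaxonomy(seqtab, "silva_nr99_v138_train_set.fa.gz")  # placeholder reference
```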
### 6.2. Shotgun Metagenomics
| Sub-Skill | Learn To... | Tools |
| ------------------------------ | ----------------------------------------- | ---------------------------------- |
| Preprocessing | QC, trimming, filtering low-quality reads | fastp, TrimGalore, FastQC, MultiQC |
| Host/decontamination filtering | Remove host or contaminant reads | Bowtie2, BMTagger |
| Assembly (optional) | Reconstruct contigs from reads | MEGAHIT, SPAdes |
| Taxonomic profiling | Generate species-level profiles | MetaPhlAn3, Kraken2, Bracken |
| Functional profiling | Annotate gene families and pathways | HUMAnN3 (UniRef90, MetaCyc) |
| Binning (optional) | Cluster contigs into MAGs | MetaBAT2, CONCOCT, MaxBin2 |
### 6.3. Metatranscriptomics
| Sub-Skill | Learn To... | Tools |
| ----------------------- | ------------------------------------------------------ | ---------------------------- |
| RNA-seq preprocessing | Trimming, QC, rRNA depletion | TrimGalore, SortMeRNA, fastp |
| Mapping/quantification | Map reads to reference or quantify via pseudoalignment | Salmon, Kallisto, BWA |
| rRNA vs mRNA separation | Separate ribosomal from protein-coding reads | SortMeRNA, custom filters |
| Normalization | TPM, RPKM, DESeq2 VST | DESeq2, edgeR, limma-voom |
| Differential expression | Compare across conditions | DESeq2, limma-voom |
| Gene/pathway annotation | Map to GO, KEGG, eggNOG, MetaCyc | eggNOG-mapper, InterProScan |
### 6.4. Metaproteomics
| Sub-Skill | Learn To... | Tools |
| --------------------- | -------------------------------------------------------- | ------------------------------- |
| Raw data processing | MS file format conversion (.raw to .mzML) | MSConvert |
| Database search | Match spectra to UniProt sequences | FragPipe, MSFragger |
| Quantification | Protein/peptide intensities | IonQuant, MaxQuant |
| Taxonomic assignment | Peptide-based LCA profiling | UniPept, MetaProteomeAnalyzer |
| Functional annotation | Map to GO, KEGG, EC, MetaCyc | eggNOG, InterProScan |
| Output integration | Generate peptide → protein → function abundance matrices | Custom R/Python parsing scripts |
### 6.5. Metabolomics / Exposomics
| Sub-Skill | Learn To... | Tools |
| ---------------------- | ----------------------------------------- | ----------------------------------------------------- |
| Raw data preprocessing | Peak detection, alignment, deconvolution | XCMS, MZmine |
| Annotation & ID | Match m/z features to chemical IDs | GNPS, HMDB, MS-DIAL |
| Normalization | Log, Pareto, scaling by TIC | `MetaboAnalystR`, `MSPrep` |
| Batch correction | ComBat, LOESS, internal standards | `sva::ComBat`, `normStats` |
| Pathway enrichment | Map m/z features to pathways | Mummichog, MetaCyc, KEGG |
| Exposure integration | Link to host response or microbiome shift | Statistical or ML-based integration with omics layers |
### 6.6. Cross-Omics Output Formatting
| Sub-Skill | Learn To... | Format |
| -------------------------------- | --------------------------------------------------------------- | ------------------------------------------------- |
| Unified abundance tables | Construct sample × feature matrix for taxa/proteins/metabolites | TSV, CSV, biom |
| Metadata linkage | Join omics tables with sample-level metadata | `left_join`, `merge`, `sample_data()` in phyloseq |
| Feature mapping | Map gene/protein/peptide IDs to function, taxonomy | UniProt ID mapping, eggNOG, InterProScan |
| Output structuring for pipelines | Write reusable scripts for creating final tables | R, Python, Snakemake rules |
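
A tidy-format sketch of the linkage step using toy tables; real feature IDs, layers, and abundances would come from the upstream profilers.

```r
# Join a long feature table with sample metadata, then widen if a matrix is needed (toy data).
library(dplyr)
library(tidyr)

abund <- tibble(
  sample_id = rep(c("S1", "S2"), each = 2),
  feature   = rep(c("K00001", "K00002"), times = 2),
  layer     = "metagenome_KO",
  abundance = c(12.1, 3.4, 8.8, 5.0)
)

meta <- tibble(
  sample_id = c("S1", "S2"),
  host      = c("human", "bovine"),
  exposure  = c("high", "low")
)

combined <- left_join(abund, meta, by = "sample_id")   # attach sample-level metadata

wide <- combined %>%                                   # sample x feature matrix when needed
  pivot_wider(id_cols = c(sample_id, host, exposure),
              names_from = feature, values_from = abundance)
```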
## 7. Functional Annotation & Pathway Inference
Total Time: ~4 weeks
Focus: Gene and protein functional annotation, pathway enrichment, metabolic reconstruction, ontology handling
### 7.1. Gene & Protein Functional Annotation
| Sub-Skill | Learn To... | Tools & Resources |
| ---------------------------------- | --------------------------------------------------------------------------------- | ----------------------------------------------------------- |
| Map genes to orthologous groups | Assign functional context across species | **eggNOG-mapper**, **OrthoFinder**, **KEGG Orthology (KO)** |
| Assign GO terms (BP, MF, CC) | Retrieve Gene Ontology Biological Process, Molecular Function, Cellular Component | **InterProScan**, **Blast2GO**, **UniProtKB**, **biomaRt** |
| EC number annotation | Assign Enzyme Commission codes to genes/proteins | **eggNOG**, **KAAS**, **UniProt** |
| Filter high-confidence annotations | Remove generic or uninformative hits (e.g., “hypothetical protein”) | Custom R/Python parsing logic |
| Functional redundancy analysis | Quantify whether multiple features map to the same function | Often important in microbial communities |
### 7.2. Pathway Annotation & Enrichment
| Sub-Skill | Learn To... | Tools & Resources |
| ---------------------------- | ------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------- |
| Map functions to pathways | Connect gene/protein/metabolite IDs to pathways | **KEGG**, **MetaCyc**, **Reactome**, **GOslim** |
| Functional hierarchy mapping | Annotate multi-level (L1-L4) functions (e.g., metabolism → amino acid metabolism → lysine biosynthesis) | **HUMAnN3**, **eggNOG**, **UniRef90**, **KEGG modules** |
| Perform pathway enrichment | Run Fisher’s exact test or GSEA-like approaches to find enriched functions | **clusterProfiler**, **gProfiler2**, **MetaboAnalyst**, **Pathview**, **DAVID** |
| Pathway coverage estimation | Determine % of pathway covered by your observed genes | **HUMAnN3**, **MinPath**, **MetaPath** |
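
A hedged over-representation sketch with `clusterProfiler::enricher()` and a custom term-to-gene map (toy KEGG module and KO identifiers), which is often the practical route for microbial community data where organism-specific databases do not apply.

```r
# Over-representation test with a custom TERM2GENE mapping; all IDs below are toy values.
library(clusterProfiler)

term2gene <- data.frame(
  term = c("M00016", "M00016", "M00016", "M00021", "M00021"),
  gene = c("K00003", "K00928", "K01714", "K00640", "K01738")
)

hits     <- c("K00003", "K00928", "K01714")   # e.g., differentially abundant KOs
universe <- unique(term2gene$gene)            # background: everything actually detected

enr <- enricher(gene = hits,
                universe = universe,
                TERM2GENE = term2gene,
                pAdjustMethod = "BH",
                minGSSize = 2,
                pvalueCutoff = 1, qvalueCutoff = 1)   # keep all terms for illustration
as.data.frame(enr)
```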
### 7.3. Metabolic Reconstruction
| Sub-Skill | Learn To... | Tools & Resources |
| ---------------------------------------------- | --------------------------------------------------- | -------------------------------------------------------- |
| Reconstruct metabolic maps | Build draft metabolic models from annotations | **Pathway Tools**, **CarveMe**, **ModelSEED**, **AGORA** |
| Assign function to taxa (functional potential) | Assess whether taxon X encodes pathway Y | **Tax4Fun2**, **PICRUSt2**, **HUMAnN3** |
| Link environment to function | Ask: “What does this community do in this context?” | Critical for exposomics and One Health research |
### 7.4. Ontology Handling & Term Curation
| Sub-Skill | Learn To... | Tools & Resources |
| ------------------------------ | ----------------------------------------------------------------------------------------------- | ---------------------------------------------------------- |
| Standardize terms | Use controlled vocabularies: GO, KEGG BRITE, MetaCyc | **OntologyLookupService**, **OBO Foundry**, **Ontobee** |
| Deduplicate IDs | Merge multiple IDs mapping to the same function | Use UniProt cross-mapping tables |
| Visualize ontology graphs | Show GO term hierarchies or relationships | **topGO**, **REVIGO**, **QuickGO**, **ggraph**, **igraph** |
| Custom annotation dictionaries | Create dictionaries of curated functional categories (e.g., “stress response”, “toxin-related”) | R/Python dictionaries, JSON term banks |
### 7.5. Output Structuring for Downstream Use
| Sub-Skill | Learn To... | Output Format |
| ----------------------------------- | ----------------------------------------------------------- | ---------------------------------------------------- |
| Build functional abundance matrices | Sample × GO Term / EC / KO matrix | Used for DA, enrichment, ordination |
| Join taxonomy + function | Create combined tables for taxon-function interpretation | e.g., “Bacteroides → LPS biosynthesis” |
| Store annotations as metadata | Keep raw feature ID + annotation + abundance in tidy format | For use in MiNDSET, MoMo-MAP, Shiny dashboards, etc. |
| Create interpretable summaries | Summarize functions by category, module, or subsystem | Often shown in stacked barplots, pathway heatmaps |
## 8. Multi-Omics Integration & Modeling
Total Time: ~5 weeks
Focus: Combining, correlating, and co-modeling different omics layers (e.g., metagenomics + transcriptomics + metaproteomics + metabolomics + exposomics)
### 8.1. Pre-Integration Harmonization
| Sub-Skill | Learn To... | Tools & Notes |
| ------------------------------- | --------------------------------------------------------- | --------------------------------------------------------- |
| Normalize each omics table | Apply within-omics normalization (CLR, VST, log2, Pareto) | `DESeq2::vst`, `metagenomeSeq::cumNorm`, `metaboAnalystR` |
| Filter uninformative features | Remove low-variance or sparse features to avoid noise | `nearZeroVar()`, `rowVars()`, prevalence filtering |
| Align sample IDs across tables | Ensure perfect match across omics layers | Use `intersect(colnames(...))`, consistent metadata |
| Match feature IDs to annotation | Harmonize feature-level info (e.g., gene symbols, KO IDs) | Ensures cross-mapping is valid |
### 8.2. Unsupervised Multi-Omics Integration
| Sub-Skill | Learn To... | Tools & Methods |
| --------------------------------- | ------------------------------------------------------- | ----------------------------------------------------------------- |
| Joint dimension reduction | Use multiblock PCA or sGCCA to extract shared structure | **mixOmics** (`wrapper.rgcca`, `block.pls`), **PMA::CCA** |
| Cluster samples across omics | Perform consensus clustering or integrative clustering | `iClusterPlus`, `MOFA`, `NMF` |
| Co-abundance network construction | Build networks of features that co-vary across omics | `WGCNA`, `SpiecEasi`, `cooccur`, `CoNet` |
| Multitable correlation heatmaps | Plot cross-omics correlation matrices | `mixOmics::plotVar`, `corrplot`, `pheatmap` |
### 8.3. Supervised Multi-Omics Integration
| Sub-Skill | Learn To... | Tools & Methods |
| ----------------------------------- | ----------------------------------------------------------- | ----------------------------------------------------------- |
| Predict conditions from multi-omics | Use classifiers trained on multiple omics blocks | **DIABLO**, **Random Forest**, **SVM**, `caret::train()` |
| Identify discriminative features | Detect key multi-omics markers that distinguish groups | `mixOmics::selectVar()`, VIP scores, SHAP values |
| Integrate covariates & confounders | Adjust models for group, timepoint, diet, etc. | Include as covariates or random effects in model formula |
| Visualization | Plot sample projection, circos plots, or component loadings | `circlize`, `ggplot2`, `mixOmics::plotIndiv()`, `plotVar()` |
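
A minimal DIABLO sketch with `mixOmics::block.splsda()` on random toy blocks; the design matrix, `keepX`, and number of components all require tuning (and cross-validation) on real data.

```r
# Supervised multi-omics integration sketch (DIABLO); all data below are random toys.
library(mixOmics)

set.seed(1)
n <- 30
X <- list(
  microbiome = matrix(rnorm(n * 50), nrow = n),
  metabolome = matrix(rnorm(n * 40), nrow = n)
)
rownames(X$microbiome) <- rownames(X$metabolome) <- paste0("S", 1:n)
colnames(X$microbiome) <- paste0("taxon", 1:50)
colnames(X$metabolome) <- paste0("met", 1:40)
Y <- factor(rep(c("control", "exposed"), each = n / 2))

design <- matrix(0.1, nrow = 2, ncol = 2)   # cross-block connection strength
diag(design) <- 0

fit <- block.splsda(X, Y, ncomp = 2, design = design,
                    keepX = list(microbiome = c(10, 10),
                                 metabolome = c(10, 10)))

plotIndiv(fit, legend = TRUE)    # sample projections per block
selectVar(fit, comp = 1)         # features selected on component 1, per block
```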
### 8.4. Cross-Modal Correlation and Causal Modeling
| Sub-Skill | Learn To... | Tools & Methods |
| --------------------------- | ------------------------------------------------------ | ------------------------------------------------------- |
| Feature–feature correlation | Identify associations between genes–metabolites–taxa | `psych::corr.test()`, `sparcc`, `cclasso`, `HAllA` |
| Multi-omics networks | Construct multi-layered bipartite or tripartite graphs | `igraph`, `networkD3`, `ggraph`, `mixOmics::network()` |
| Time-resolved modeling | Integrate temporal data across omics layers | `metalonda`, `MaSigPro`, `MEtime`, `LongitudinalDIABLO` |
| Causal inference (optional) | Use DAGs or Bayesian networks to model causality | `bnlearn`, `DoWhy`, `SEM`, `gCastle` |
### 8.5. Biological Interpretation of Integrated Models
| Sub-Skill | Learn To... | Tools & Outputs |
| --------------------------- | ------------------------------------------------------------ | ----------------------------------------------------------------------- |
| Map components to biology | Interpret component loadings as functional or taxonomic axes | Annotate with KEGG/GO info |
| Functional module discovery | Identify multi-omic features converging on same pathway | e.g., “Gene X, Protein Y, Metabolite Z all part of Lysine biosynthesis” |
| Metadata association | Relate integrated signatures to exposure, disease, behavior | Use LMs, GLMMs, or custom modeling |
| Report generation | Generate reproducible reports from integration results | `quarto`, `Rmd`, interactive dashboards (`shiny`, `dash`) |
### 8.6. Integration Frameworks - Must Master
| Framework | Type | Notes |
| ------------------------ | --------------------------- | ------------------------------------------------------------------ |
| **mixOmics** | sGCCA, DIABLO, MINT | Flexible, scalable, supports supervised + unsupervised integration |
| **iClusterPlus** | Integrative clustering | Bayesian hierarchical model, good for discovery |
| **MOFA/MOFA2** | Factor analysis | Very powerful latent space model |
| **WGCNA** | Co-expression networks | Can be extended for multi-omics if processed correctly |
| **HAllA** | Feature-feature association | Hypothesis-free, correlation discovery |
| **metalonda / MaSigPro** | Longitudinal integration | Time-aware DE modeling |
| **MINT (mixOmics)** | Cohort integration | Multi-batch integration (e.g., different animal/human groups) |
## 9. Experimental Context Awareness
Total Time: ~2 weeks
Focus: ask better questions, detect biases, adjust for confounders, and interpret results accurately.
### 9.1. Microbiome and Multi-Omics Study Design
| Sub-Skill | Learn To... | Notes |
| -------------------------------------------------- | --------------------------------------------------------------------------------- | --------------------------------------------------- |
| Understand cross-sectional vs longitudinal designs | Know when repeated measures require paired models | Choose GLMM vs simple ANOVA correctly |
| Biological vs technical replicates | Distinguish what replicates truly capture | Avoid overestimating power or missing batch effects |
| Confounder identification | Identify age, diet, medication, sex, cage effect (animals), housing (environment) | Plan to adjust or stratify |
| Matching strategies | Understand matched-case control, paired samples, or stratified sampling | Influences the modeling framework |
| Sample size calculation | Estimate power given variability and expected effect size | Use `pwr`, `simr`, or simulation for guidance |
### 9.2. Sequencing Technology & Protocol Artifacts
| Sub-Skill | Learn To... | Notes |
| -------------------------------------------- | ---------------------------------------------------------------------------- | --------------------------------------------------------- |
| Understand amplicon vs shotgun tradeoffs | Resolution, cost, amplification bias | 16S typically resolves genus (sometimes species); shotgun can give strain/function |
| Biases from extraction kits | Know how DNA/RNA/protein extraction method can shape composition | Choose batch correction accordingly |
| Batch effects from library prep or sequencer | Understand when batch correction is necessary | Use `ComBat`, `removeBatchEffect()`, or random effects |
| rRNA depletion vs poly-A selection | Impacts metatranscriptomics — do you have total RNA or eukaryotic mRNA only? | Guides filtering strategy |
| Multi-batch/multi-platform datasets | Identify issues with mixed Illumina/Nanopore platforms | Use `MINT` or batch-aware integration methods |
### 9.3. Sample Type, Source, and Environmental Matrix
| Sub-Skill | Learn To... | Examples |
| --------------------------------- | ---------------------------------------------------------------------- | ----------------------------------------------------------- |
| Biological matrix | Saliva, plaque, stool, blood, air filters, water, soil | Each has unique challenges (e.g., inhibitors, low biomass) |
| Human vs animal vs environmental | Understand matrix-specific variation in microbiome + exposome profiles | Needed for One Health-aware modeling |
| Invasive vs non-invasive sampling | Know limits of what can be measured | Interpreting host transcriptomics from buccal swabs ≠ blood |
| Storage & transport artifacts | Freeze-thaw, preservative usage, time-to-processing | Introduces noise or biases if unaccounted for |
### 9.4. Exposure Context and Toxicology Considerations
| Sub-Skill | Learn To... | Notes |
| --------------------------------- | ---------------------------------------------------------------------------------------------------- | ------------------------------------------------------------- |
| Exposure route & dose | Oral, dermal, inhaled; acute vs chronic | Needed to model biologically relevant gradients |
| Chemical classes | Know what VOCs, heavy metals, endocrine disruptors, and antibiotics do | Guides functional annotation and hypothesis generation |
| Host–microbe–chemical interaction | Understand tripartite effects: e.g., antibiotics reduce diversity; diet alters xenobiotic metabolism | Can confound or explain findings |
| Bioaccumulation & persistence | Know which chemicals stick around and impact long-term | e.g., PFAS in One Health datasets |
| Internal vs external exposome | External = environment; Internal = host response | Can be profiled via transcriptomics, metabolomics, proteomics |
### 9.5. Metadata Quality and Annotation Depth
| Sub-Skill | Learn To... | Examples |
| ---------------------------------- | ------------------------------------------------------------------------------------------ | ------------------------------------------- |
| Identify critical missing metadata | Antibiotic use? Age? Sample collection time? | Lack of metadata = limited interpretability |
| Standardize metadata vocabularies | Use MIxS, EnvO, UBERON, CHEBI | Improves interoperability and integration |
| Assess granularity of metadata | Is exposure “yes/no” or “µg/m³”? Is diet “vegetarian” or a detailed macronutrient breakdown? | Influences modeling choice |
| Hierarchical modeling readiness | Know when samples are nested (e.g., repeated stool samples per subject per cage) | Required for proper random effects design |
## 10. Tool Development & Packaging
Total Time: ~5 weeks
Focus: Building an analysis package, a command-line utility, or a reproducible function library
### 10.1. R Package Development (CRAN & Bioconductor-Ready)
| Sub-Skill | Learn To... | Tools & Notes |
| ----------------------------- | ------------------------------------------------------------ | ------------------------------------------------------ |
| Initialize package structure | Use `usethis::create_package()` or RStudio’s devtools wizard | Creates `R/`, `man/`, `DESCRIPTION`, `NAMESPACE`, etc. |
| Write functions with roxygen2 | Document functions with `#'` tags above each function | Use `devtools::document()` to auto-generate help files |
| Define metadata files | Fill `DESCRIPTION`, `NAMESPACE`, `README.md`, `LICENSE` | Include `Imports`, `Suggests`, `Authors@R`, versioning |
| Add vignettes | Write long-form examples using `usethis::use_vignette()` | Required by Bioconductor; strongly encouraged for CRAN |
| Unit testing | Write `testthat` scripts to verify behavior | `usethis::use_testthat()` sets up structure |
| Package build & install | Use `devtools::install()`, `check()`, `build()` | Run `rcmdcheck()` locally or with GitHub Actions |
| CRAN/BioC submission | Follow format checks and `BiocCheck()` | Bioconductor review expects a version bump for each revision and timely responses to reviewers |
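
The usual usethis/devtools loop, sketched with placeholder names; each call is run interactively from the package project.

```r
# R package development sketch; the path and file names are placeholders.
library(usethis)
library(devtools)

create_package("~/projects/mypkg")   # scaffolds R/, DESCRIPTION, NAMESPACE
use_mit_license()                    # or another license suited to the project
use_r("normalize_counts")            # R/normalize_counts.R for a roxygen2-documented function
use_testthat()                       # tests/testthat/ scaffold
use_test("normalize_counts")         # matching unit-test file
use_vignette("getting-started")      # long-form documentation

document()                           # regenerate man/ pages and NAMESPACE from roxygen2 tags
check()                              # run R CMD check locally before any submission
```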
### 10.2. Python CLI Tool & Package Development
| Sub-Skill | Learn To... | Tools & Notes |
| ----------------------------- | -------------------------------------------------------- | --------------------------------------------------------------------------------------- |
| Write command-line interfaces | Use `argparse`, `click`, or `typer` for CLI logic | Modularize main functions |
| Setup packaging | Define `setup.py`, `pyproject.toml`, `__init__.py` | Install via `pip install .` |
| Create modules & scripts | Split tools into logical files (`utils.py`, `main.py`) | Reuse functions across tools |
| Publish to PyPI | Register your package with version, license, description | Use `twine upload dist/*` |
| Unit tests & coverage | Use `pytest`, `unittest`, `coverage`, `tox` | Test logic + input/output integrity |
| Conda packaging | Write `meta.yaml` for Bioconda | Submit pull request to [bioconda-recipes](https://github.com/bioconda/bioconda-recipes) |
### 10.3. Tool Versioning, Distribution, and Releases
| Sub-Skill | Learn To... | Tools & Notes |
| ------------------------ | ------------------------------------------------------------ | ------------------------------------------ |
| Semantic versioning | Use `MAJOR.MINOR.PATCH` format (e.g., 1.3.2) | `DESCRIPTION`, `setup.py`, Git tags |
| GitHub releases | Create release tags with changelog and binaries | Use `gh release create` or GitHub UI |
| GitHub Actions for CI/CD | Automate `R CMD check`, `pytest`, `coverage`, release builds | YAML workflows in `.github/workflows` |
| Binder / Docker deploy | Wrap tool in Docker and serve with Binder or DockerHub | Great for interactive tutorials |
| Documentation sites | Use `pkgdown` (R) or `Sphinx/MkDocs` (Python) | Host via GitHub Pages |
| Citation files | Add `CITATION`, Zenodo DOI, and `codemeta.json` | So your software can be cited like a paper |
### 10.4. Best Practices in Packaging
| Principle | Why It Matters |
| ------------------------------------------- | -------------------------------------------------------------- |
| Write small, testable functions | Easier to debug, document, and reuse |
| Avoid hardcoded paths and assumptions | Use `here::here()` (R) or `os.path.abspath(__file__)` (Python) |
| Keep outputs tidy and predictable | Required for automation in pipelines |
| Separate core logic from CLI or UI | So it can be tested and re-used in other contexts |
| Document everything | Functions, outputs, edge cases, usage examples |
| Write CHANGELOGs and ROADMAPs | Useful for collaborators, users, and reviewers |
| Make everything version-controlled and open | GitHub/GitLab is non-negotiable for serious tools |
## 11. Programmatic Data Access & Automation
Total Time: ~3 weeks
Focus: Accessing public repositories, downloading datasets, parsing metadata, and integrating retrieval into pipelines via APIs, web scraping, and FTP scripting
### 11.1. RESTful APIs for Biological Databases
| Sub-Skill | Learn To... | APIs & Tools |
| ----------------------- | ---------------------------------------------------------------------- | --------------------------------------------------- |
| Understand REST APIs | Use `GET`, `POST`, headers, parameters | Learn to read API docs (JSON inputs/outputs) |
| Query NCBI E-utilities | Use `esearch`, `efetch`, `esummary` to retrieve BioSample/SRA metadata | `rentrez`, `biopython`, or direct via `requests` |
| Fetch MGnify metadata | Access study/sample/taxon/functional annotations | `https://www.ebi.ac.uk/metagenomics/api/` |
| Access UniProt records | Retrieve protein sequences, GO terms, taxonomy | `https://rest.uniprot.org/`, `biomaRt`, `Biopython` |
| Query GNPS/MetaboLights | Get metabolomics datasets, chemical IDs | GNPS REST API, ReDU APIs |
| EPA/Exposome datasets | Use `epa.gov` endpoints, CompTox API | Useful for exposure chemistry and risk data |
### 11.2. Python + R for Programmatic Access
| Sub-Skill | Learn To... | Tools |
| ------------------------ | ------------------------------------------------------------------------- | --------------------------------------------------------- |
| Use `requests` in Python | Programmatically send GET/POST, parse JSON | Perfect for all REST APIs |
| Use `httr` in R | R wrapper to make web queries | Pairs with `jsonlite::fromJSON()` |
| Parse JSON, XML, CSV | Clean and extract fields from API outputs | `jsonlite`, `xml2`, `pandas.read_json()`, `ElementTree` |
| Retry logic and timeouts | Handle API failures with graceful retries | Use `tryCatch` (R), `try/except` (Python), `backoff` libs |
| Build downloaders | Write scripts that download raw FASTQ, metadata, annotation files via API | Modular CLI downloaders for projects |
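
A small httr/jsonlite sketch against the NCBI E-utilities `esearch` endpoint; the search term and `retmax` value are only illustrative.

```r
# Query NCBI E-utilities (esearch) and parse the JSON response.
library(httr)
library(jsonlite)

resp <- GET(
  "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
  query = list(db = "sra", term = "oral microbiome", retmode = "json", retmax = 20)
)

stop_for_status(resp)                                      # fail loudly on HTTP errors
parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
ids <- parsed$esearchresult$idlist                         # matching SRA UIDs
length(ids)
```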
### 11.3. FTP / Aspera / HTTP Direct File Retrieval
| Sub-Skill | Learn To... | Tools |
| ---------------------- | ----------------------------------------------------- | ---------------------------------- |
| Connect to FTP servers | Navigate FTPs from ENA, NCBI, EBI, etc. | `wget`, `curl`, `lftp`, `ncftp` |
| Download with `wget` | Automate large file downloads with wildcards, filters | Use `wget -r -l1 -A ".fna.gz"` |
| Use Aspera/aspera-cli | Speed up large file downloads (faster than FTP) | Used with SRA, EGA, ENA |
| Directory traversal | Parse file trees and download only what you need | Automate data syncs with scripting |
### 11.4. Web Scraping (When APIs Don’t Exist)
| Sub-Skill | Learn To... | Tools |
| ----------------------------------- | ----------------------------------------------------------------- | ------------------------------------- |
| Extract data from HTML tables/pages | Use XPath, CSS selectors, or regex | `rvest` (R), `BeautifulSoup` (Python) |
| Simulate form submissions | Handle POST forms with payloads | `requests.post()` |
| Headless browsing | Interact with JavaScript-rendered pages | `selenium`, `RSelenium`, `playwright` |
| Ethics and etiquette | Respect `robots.txt`, avoid overloading servers, use `user-agent` | Follow FAIR data principles |
### 11.5. Batch Query and Data Wrangling Strategies
| Sub-Skill | Learn To... | Tools |
| ---------------------- | -------------------------------------------------------------- | -------------------------------------- |
| Batch queries | Chunk 1000s of queries into multiple calls | Use `split()`, rate limiting, sleeps |
| BioSample ID workflows | Go from assembly → BioSample → metadata → download | Use `efetch` chains or custom logic |
| Clean messy metadata | Deduplicate fields, standardize units, merge tables | Use `pandas`, `dplyr`, `janitor` |
| Merge with omics data | Join programmatically retrieved metadata with abundance tables | Use `left_join`, `merge`, `pd.merge()` |
### 11.6. Automating and Integrating into Pipelines
| Sub-Skill | Learn To... | Tools |
| ------------------------- | --------------------------------------------------- | ----------------------------------------- |
| Write CLI downloaders | Wrap API/FTP scripts into command-line tools | `argparse`, `optparse`, `click`, `docopt` |
| Cache results | Store API responses locally to avoid repeated calls | Use `.cache/`, `memoise`, or save JSONs |
| Logging | Log successes/failures/downloads | `logging` (Py), `futile.logger` (R) |
| Use in Snakemake/Nextflow | Integrate metadata fetching into rules or processes | Fully automated input generation |
## 12. Data Visualization & Interactive Communication
Total Time: ~4 weeks
Focus: clear, interpretable, and publication-grade visualizations.
### 12.1. Publication-Quality Static Visualizations
| Sub-Skill | Learn To... | Tools & Notes |
| --------------------------- | ---------------------------------------------------------------------- | ------------------------------------------------------ |
| Grammar of graphics | Use `ggplot2` in R or `seaborn/matplotlib` in Python for layered plots | Build from `aes()` and `geom_*()` up |
| Customize themes and scales | Use `theme()`, `scale_*_manual()`, `facet_wrap()` | Create consistent, journal-quality figures |
| Save with high resolution | Use `ggsave(filename, dpi = 600)` or `plt.savefig(dpi=600)` | Always generate vector + raster versions |
| Combine panels | Use `patchwork`, `cowplot`, or `matplotlib.gridspec` | For composite figures and multi-omics overlays |
| Add annotations | Show stats, significance, highlights | `geom_text()`, `annotate()`, or `stat_compare_means()` |
Common plots to master (a minimal `ggplot2` sketch follows this list):
- Heatmaps (with clustering or ordering)
- Volcano plots, MA plots
- PCA, PCoA, NMDS
- Boxplots + jitter/dot
- Time series + spline/loess smoothing
- Correlation matrices
- Network plots (functional or taxon co-occurrence)
- Phylogenetic trees (with metadata overlay)
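
As referenced above, a minimal `ggplot2` volcano-plot sketch on simulated effect sizes and p-values, illustrating layering, manual scales, and high-resolution export.

```r
# Volcano-plot sketch with ggplot2 (simulated effect sizes and p-values).
library(ggplot2)

set.seed(1)
df <- data.frame(
  feature = paste0("f", 1:500),
  log2fc  = rnorm(500),
  pval    = runif(500)
)
df$padj   <- p.adjust(df$pval, method = "BH")
df$status <- ifelse(df$padj < 0.1 & abs(df$log2fc) > 1, "candidate", "ns")

p <- ggplot(df, aes(x = log2fc, y = -log10(pval), colour = status)) +
  geom_point(alpha = 0.7, size = 1.5) +
  scale_colour_manual(values = c(candidate = "firebrick", ns = "grey70")) +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
  labs(x = "log2 fold change", y = "-log10(p)", colour = NULL) +
  theme_bw(base_size = 11)

ggsave("volcano_toy.png", p, width = 5, height = 4, dpi = 600)  # export vector (PDF/SVG) versions too
```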
### 12.2. Interactive Visualizations
| Sub-Skill | Learn To... | Tools & Notes |
| --------------------------------- | ------------------------------------------------------ | --------------------------------------------------- |
| Build tooltips and zoomable plots | Use `plotly`, `ggplotly()`, `plotly.express`, `Altair` | Interactive scatter, heatmap, volcano, or PCA plots |
| Highlight dynamic subsets | Allow filters, sliders, selectors | Used in Shiny or Dash |
| Animate time or group shifts | Use `gganimate`, `plotly`, or `dash_core_components` | For exposomics and longitudinal omics |
| Render large datasets smoothly | Use `datatables`, `reactable`, `dash_table.DataTable` | Supports rapid querying and exploration |
### 12.3. Phylogenetic & Hierarchical Visualization
| Sub-Skill | Learn To... | Tools |
| --------------------------- | --------------------------------------------------------------- | ---------------------------------------------------- |
| Draw trees with annotations | Use `ggtree`, `iTOL`, `ape`, `ete3`, `phylotree.js` | Annotate with function, abundance, metadata |
| Visualize GO/KEGG hierarchy | Show ontology structure with term labels and enrichment results | `REVIGO`, `topGO`, `ggraph`, `graphviz`, `circlize` |
| Clustered heatmaps | Combine distance metrics with abundance tables | `pheatmap`, `ComplexHeatmap`, `seaborn.clustermap()` |
### 12.4. Network Visualization
| Sub-Skill | Learn To... | Tools |
| ------------------------------ | ----------------------------------------------------------------- | ------------------------------------------------------- |
| Generate force-directed graphs | Use `igraph`, `ggraph`, `visNetwork`, `networkx` | For spacerome, co-abundance, or function-function links |
| Bipartite and tripartite plots | Model taxa–function–exposure or host–microbe–chemical connections | Visualize layered relationships |
| Overlay metadata on networks | Size/color nodes by metadata (abundance, category, etc.) | Use edge thickness or node fill |
| Export for Cytoscape | Save as `.graphml` or `.gml` for full network annotation | Cytoscape excels at complex layouts |
### 12.5. Dashboarding & Web Applications
| Sub-Skill | Learn To... | Tools |
| ---------------------------- | -------------------------------------------------- | ---------------------------------------------- |
| Build dashboards in R | Use **Shiny**, `shinydashboard`, `DT`, `plotly` | Deployable on shinyapps.io or internal server |
| Build dashboards in Python | Use **Dash**, `dash_core_components`, `dash_table` | Flask-style flexibility |
| Modular app design | Structure app as multiple reactive components | Required for reusability |
| Secure deployment (optional) | Add login, user-level access, persistent data | `auth0`, `firebase`, or server-based solutions |
### 12.6. Figure Legends, Captions, and Manuscript Context
| Sub-Skill | Learn To... | Tools |
| ------------------------------------- | ------------------------------------------------------------------------------------- | ----------------------------------------------------- |
| Write detailed figure captions | Include: what’s shown, what’s important, how to read it, how significance is computed | Often the most ignored yet essential part of a figure |
| Connect visuals to biological meaning | Move beyond "blue is high, red is low" — interpret patterns in context | Tell a story, don’t just show a plot |
| Version control for figures | Save different versions during revision | Use suffixes or GitHub to track |
## 13. Data Management, Metadata & FAIR Practices
Total Time: ~2 weeks
Focus: Keeping data organized, discoverable, interoperable, and reusable (FAIR): structured metadata annotation, version control, ontology usage, and public deposition across projects
### 13.1. Project Directory Structure & Naming Standards
| Sub-Skill | Learn To... | Notes |
| --------------------------------- | ---------------------------------------------------------------------- | ------------------------------------------------- |
| Modular project structure | Use `raw/`, `processed/`, `results/`, `scripts/`, `logs/`, `metadata/` | Adopt from `cookiecutter-data-science`, `nf-core` |
| File naming consistency | Encode sample, date, version, and processing stage | e.g., `G1_Caries_TP1_2024_trimmed.fastq.gz` |
| Use relative paths | Avoid absolute paths (`/home/user/`) to maximize portability | Use `here::here()` (R) or `pathlib.Path` anchored at the project root (Python) |
| Organize outputs by analysis step | Keep each pipeline module’s outputs in a defined subdirectory | Easier to debug, rerun, or review |
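A small `pathlib` sketch of the structure above, under a hypothetical project name; it scaffolds the standard subdirectories and builds paths relative to the project root rather than hard-coding `/home/user/...`.

```python
# Scaffold a modular project tree and build portable, root-relative paths (project name is hypothetical).
from pathlib import Path

ROOT = Path("oral_exposome_project")
for sub in ["raw", "processed", "results", "scripts", "logs", "metadata"]:
    (ROOT / sub).mkdir(parents=True, exist_ok=True)

# Paths are expressed relative to ROOT, never as absolute machine-specific paths.
fastq = ROOT / "raw" / "G1_Caries_TP1_2024_trimmed.fastq.gz"
print(fastq)
```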
### 13.2. Metadata Collection, Cleaning, and Curation
| Sub-Skill | Learn To... | Tools |
| ------------------------------- | --------------------------------------------------------------------- | ------------------------------------------------------------ |
| Design metadata templates | Include sample ID, timepoint, host species, condition, exposure, etc. | Make these required from collaborators |
| Reshape metadata formats | Convert between wide and long format, ensure tidy structure | Use `pivot_longer()`, `reshape2`, `melt()` |
| Clean missing/invalid data | Detect blanks, typos, inconsistent units | `janitor`, `dplyr::mutate(across())`, `assertthat`, `pandas` |
| Harmonize categorical variables | Standardize case, spelling, and codes | e.g., `Yes` vs `yes` vs `TRUE` vs `1` |
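A `pandas` sketch of the cleaning steps above on a toy metadata sheet (column names are placeholders): harmonize inconsistent categorical coding, then reshape wide timepoint columns into a tidy long table.

```python
# Clean and reshape a toy metadata sheet: harmonize categories, then wide -> long.
import pandas as pd

meta = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3"],
    "exposed": ["Yes", "yes", "TRUE"],        # inconsistent coding of the same category
    "shannon_TP1": [3.1, 2.7, 2.9],
    "shannon_TP2": [2.8, 2.5, 3.0],
})

# Standardize case/spelling and map everything to one boolean.
meta["exposed"] = (
    meta["exposed"].str.strip().str.lower()
    .map({"yes": True, "true": True, "1": True, "no": False, "false": False, "0": False})
)

# Wide -> long: one row per sample x timepoint.
tidy = meta.melt(id_vars=["sample_id", "exposed"],
                 value_vars=["shannon_TP1", "shannon_TP2"],
                 var_name="timepoint", value_name="shannon")
tidy["timepoint"] = tidy["timepoint"].str.replace("shannon_", "", regex=False)
print(tidy)
```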
### 13.3. Ontology-Aware Annotation & Controlled Vocabularies
| Sub-Skill | Learn To... | Examples |
| ------------------------------------ | -------------------------------------------------------------- | --------------------------------------------------------------------- |
| Apply domain-specific vocabularies | Use **MIxS**, **EnvO**, **UBERON**, **CHEBI**, **PO**, **EFO** | For One Health, exposome, and microbiome interoperability |
| Map metadata terms to ontology IDs | Link `saliva` → `UBERON:0001836` | Enables semantic search and dataset alignment |
| Use term validators | Use `ontofox`, `obofoundry`, `ols4R`, `ontologyLookupService` | Check term validity and retrieve labels |
| Integrate ontology terms in metadata | Store alongside readable label and source ontology | e.g., `sample_type = "oral cavity"`, `ontology_id = "UBERON:0001836"` |
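A short sketch of the last row above: store the ontology ID and source ontology next to the readable label. The mapping dictionary is a placeholder built from the example terms in this table.

```python
# Attach ontology IDs alongside readable sample-type labels (IDs from the examples above).
import pandas as pd

term_map = {
    "saliva": ("UBERON:0001836", "UBERON"),
    "oral cavity": ("UBERON:0000167", "UBERON"),
}

samples = pd.DataFrame({"sample_id": ["S1", "S2"], "sample_type": ["saliva", "oral cavity"]})
mapped = samples["sample_type"].map(term_map)           # Series of (id, ontology) tuples
samples["ontology_id"] = mapped.str[0]
samples["source_ontology"] = mapped.str[1]
print(samples)
```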
### 13.4. FAIR Principles Implementation
| Principle | Learn To... | Tools & Notes |
| ------------- | -------------------------------------------------------- | ------------------------------------------------------ |
| Findable | Assign unique, stable identifiers to datasets and files | Use DOIs via Zenodo, figshare, or OSF |
| Accessible | Store data in trusted public or internal repositories | SRA, ENA, Zenodo, MG-RAST, MetaboLights |
| Interoperable | Use standard formats (TSV, JSON, biom) and ontologies | Avoid Excel files with merged cells, color codes, etc. |
| Reusable | Provide full metadata, code, documentation, and licenses | CC-BY, MIT, or GPL for open use |
### 13.5. Public Repository Submission & Archiving
| Sub-Skill | Learn To... | Platform |
| ------------------------------ | ---------------------------------------------------------------- | -------------------------------------------------- |
| Submit to NCBI SRA or ENA | Use BioProject → BioSample → Run structure | Use `sra-tools`, `ena-upload-cli`, or Webin portal |
| Archive functional annotations | Upload to Zenodo, include README and schema | For pathway tables, gene sets, GO mappings |
| Deposit multi-omics studies | Use MGnify, GNPS, MetaboLights | Link all layers to BioSample accessions |
| Reference your data in papers | Include accession numbers and DOIs in figure legends and methods | This is a reviewer expectation |
### 13.6. Reproducibility Tracking & Versioning
| Sub-Skill | Learn To... | Tools |
| ------------------------ | ----------------------------------------------------------------- | ------------------------------------------------------- |
| Track data provenance | Maintain a data log: where it came from, how it was processed | Use `CHANGELOG.md`, `data_manifest.csv`, Snakemake DAGs |
| Version datasets | Snapshot raw + processed data with time/version label | e.g., `v1.0`, `v1.1-fixed-headers`, `v2.0-annotation` |
| Integrate with Git | Track metadata, scripts, configs — avoid tracking raw FASTQ files | Use `.gitignore` for large/binary files |
| Create data dictionaries | Document every column in your metadata and output tables | Describe variable names, units, and allowed values |
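A sketch for the data-dictionary row above: auto-generate a skeleton (one row per metadata column with dtype, missingness, and an example value), then fill in descriptions and units by hand before committing it next to the data. The toy table stands in for a real metadata sheet.

```python
# Auto-generate a data-dictionary skeleton from a (toy) metadata table.
import pandas as pd

meta = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3"],
    "host_species": ["human", "human", "bovine"],
    "exposure_ppb": [12.5, None, 3.1],
})

dictionary = pd.DataFrame({
    "variable": meta.columns,
    "dtype": [str(t) for t in meta.dtypes],
    "n_missing": meta.isna().sum().to_numpy(),
    "example": [meta[c].dropna().iloc[0] if meta[c].notna().any() else "" for c in meta.columns],
    "description": "",                      # fill in manually
    "units_or_allowed_values": "",          # fill in manually
})
dictionary.to_csv("data_dictionary.csv", index=False)
print(dictionary)
```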
14. Cloud & HPC Integration
Total Time: ~3 weeks
Focus: high-performance clusters (HPCs) and cloud platforms like AWS, GCP, Terra, or DNAnexus
Know More
### 14.1. High-Performance Computing (HPC) Basics
| Sub-Skill | Learn To... | Tools & Concepts |
| ------------------------------------ | ----------------------------------------------------------------------- | ------------------------------------------ |
| Navigate login nodes & scratch space | Understand `/home`, `/scratch`, `/work`, `/tmp` | Minimize I/O and memory issues |
| Use job schedulers | Submit, monitor, cancel jobs via `sbatch`, `squeue`, `scancel`, `sacct` | SLURM (Argon), PBS, or LSF |
| Write SLURM job scripts | Set `#SBATCH` parameters: time, memory, cpus, output logs | Automate your Snakemake/Nextflow runs |
| Allocate resources wisely | Choose appropriate `--cpus-per-task`, `--mem`, and `--time` | Prevents wasting or crashing jobs |
| Debug failed jobs | Read `.err` logs, SLURM exit codes, check quota | Use `--mail-type=FAIL` to get email alerts |
### 14.2. Run Workflows on HPC
| Sub-Skill | Learn To... | Tools |
| ----------------------------- | --------------------------------------------------------------------------- | -------------------------------------------------------- |
| Run Snakemake with SLURM | Use `--cluster 'sbatch ...' --jobs 50` or use a cluster profile | Define `cluster.yaml` for rule-specific resource control |
| Run Nextflow on HPC | Use `-profile slurm` or custom config with `process.executor = 'slurm'` | Modular config allows full portability |
| Use `singularity` on clusters | Replace Docker with `.sif` for containerized tools | `--use-singularity` in Snakemake/Nextflow |
| Cache environments | Prevent repeated downloads by using `~/.snakemake/conda/` or shared modules | Share across jobs for efficiency |
### 14.3. Cloud Platform Essentials
| Sub-Skill | Learn To... | Platforms |
| -------------------------------- | ---------------------------------------------------------------- | ----------------------------------------- |
| Cloud vs HPC tradeoffs | Understand cost, scalability, persistence, access | Cloud = on-demand; HPC = fixed queues |
| Use AWS S3 buckets | Upload/download files using `aws s3 cp` or `boto3` | Store reference databases, sample outputs |
| Launch workflows in Terra | Run WDL-based workflows from Dockstore or FireCloud | For genomics projects at scale |
| Use DNAnexus or Seven Bridges | GUI-based, drag-and-drop interfaces for clinical-grade pipelines | Useful for regulated environments |
| Use Nextflow Tower | Monitor Nextflow runs on AWS/GCP with real-time dashboards | Visualize DAGs, logs, and metrics |
| Use Google Colab for prototyping | Great for rapid notebooks, but not for big data | Useful for teaching or pilot testing |
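A minimal `boto3` sketch for the S3 row above; the bucket name and object keys are hypothetical, and credentials are assumed to come from the usual AWS configuration or environment variables.

```python
# Push a results table to S3 and pull a reference database back down (bucket/keys are hypothetical).
import boto3

s3 = boto3.client("s3")

# Upload a processed table under a project prefix.
s3.upload_file("results/taxa_table.tsv", "my-onehealth-bucket", "project1/results/taxa_table.tsv")

# Download a reference database for a pipeline run.
s3.download_file("my-onehealth-bucket", "references/silva_138.fasta", "refs/silva_138.fasta")
```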
### 14.4. Environment Portability Across Platforms
| Sub-Skill | Learn To... | Tools |
| ------------------------------------------ | --------------------------------------------------------- | ---------------------------------------------------------- |
| Build platform-agnostic workflows | Always use config files, containers, and modular scripts | Makes HPC/cloud transitions seamless |
| Use Conda and Singularity for environments | Avoid system-wide installs | Prevents “works here but not there” issues |
| Transfer data securely | Use `rsync`, `scp`, `sftp`, or cloud-specific CLIs | Preserve permissions and timestamps |
| Sync large data efficiently | Use `rclone`, `wget -c`, `aria2c` for resumable downloads | Crucial for multi-TB exposomics or metaproteomics datasets |
### 14.5. Cost and Resource Management
| Sub-Skill | Learn To... | Notes |
| ------------------------ | -------------------------------------------------- | ------------------------------------------------ |
| Monitor usage | Use `sacct`, `htop`, `du -sh`, billing dashboards | Prevent quota violations or surprise cloud bills |
| Spot instance strategies | Use preemptible/spot instances for cheaper runs | Great for fault-tolerant pipelines |
| Budget-aware planning | Estimate compute time × cost × data transfer ahead | Plan ahead for shared or grant-funded platforms |
15. Machine Learning & Predictive Modeling
Total Time: ~12 weeks
Focus: supervised and unsupervised ML methods to classify, cluster, regress, and interpret high-dimensional omics data
Know More
### 15.1. Foundations of ML for Biologists
| Sub-Skill | Learn To... | Notes |
| ------------------------------- | ------------------------------------------------------------------- | -------------------------------------------------- |
| Understand learning paradigms | Supervised vs unsupervised, regression vs classification | Know what to use for what goal |
| Cross-validation & resampling | k-fold CV, stratified CV, leave-one-out, bootstrap | Avoid overfitting on sparse high-dimensional data |
| Train-test splitting | Use `train_test_split()` (Py) or `caret::createDataPartition()` (R) | Always track random seed and ensure stratification |
| Confusion matrix and ROC curves | Evaluate classification performance | `pROC`, `scikit-learn`, `yardstick` |
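A scikit-learn sketch tying the rows above together on simulated data (feature matrix and labels are random placeholders): stratified train/test split with a fixed seed, stratified k-fold cross-validation, then a confusion matrix and AUC on the held-out set.

```python
# Stratified split, cross-validation, and basic evaluation on simulated data.
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))               # 100 samples x 50 features (toy data)
y = rng.integers(0, 2, size=100)             # binary labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print("CV AUC:", cross_val_score(clf, X_tr, y_tr, cv=cv, scoring="roc_auc").mean())

clf.fit(X_tr, y_tr)
print(confusion_matrix(y_te, clf.predict(X_te)))
print("Test AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```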
### 15.2. Supervised Learning Models
| Model Type | Learn To... | R/Python Tools |
| ---------------------------------------- | ---------------------------------------------- | -------------------------------------------------------- |
| Logistic regression | Build baseline interpretable classifiers | `glm()`, `statsmodels`, `sklearn.linear_model` |
| Random forest (RF) | Handle non-linear, high-dimensional data | `ranger`, `randomForest`, `sklearn.ensemble` |
| Support vector machines (SVM) | Classify in complex feature spaces | `e1071`, `kernlab`, `sklearn.svm` |
| Gradient boosting (XGBoost/LightGBM) | Boosted decision trees with regularization | `xgboost`, `lightgbm`, `catboost` |
| Penalized regression (LASSO, ElasticNet) | Perform feature selection & shrinkage | `glmnet`, `sklearn.linear_model` |
| Naive Bayes | Probabilistic classifier for sparse count data | `e1071::naiveBayes`, `sklearn.naive_bayes`; good for 16S datasets with compositional transformations |
| Deep learning (optional) | Use MLPs, VAEs, CNNs, Transformers | `keras`, `torch`, `PyTorch`, `scvi-tools` (advanced) |
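A sketch for the penalized-regression row above: an L1-penalized logistic regression in scikit-learn used as a baseline feature selector on toy data where only two features are truly informative (the data-generating setup is invented for illustration).

```python
# L1-penalized logistic regression as a baseline feature selector (toy data, p >> n).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 200))                                          # many more features than samples
y = (X[:, 0] - X[:, 3] + rng.normal(scale=0.5, size=80) > 0).astype(int)  # only features 0 and 3 matter

X_std = StandardScaler().fit_transform(X)                               # penalization assumes comparable scales
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_std, y)

selected = np.flatnonzero(lasso.coef_[0])                               # non-zero coefficients = retained features
print("features retained:", selected)
```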
### 15.3. Unsupervised Learning & Clustering
| Task | Learn To... | Tools |
| ------------------------ | ---------------------------------------------------------- | ----------------------------------------------- |
| Dimensionality reduction | PCA, t-SNE, UMAP, MDS | `prcomp`, `umap-learn`, `Rtsne`, `scikit-learn` |
| Feature clustering | k-means, hierarchical, DBSCAN | `stats::hclust`, `fpc`, `scikit-learn` |
| Sample clustering | Discover subtypes of exposure response or microbial states | Useful for phenotype discovery |
| Distance metrics | Use Aitchison, Bray-Curtis, cosine, Jaccard | Impacts clustering and ordination |
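A short sketch of the first two rows above: reduce simulated samples to two principal components, then cluster them with k-means. The two "exposure states" are just shifted Gaussian blobs for illustration.

```python
# PCA for dimensionality reduction, then k-means sample clustering (toy data).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Two toy sample groups with shifted means in feature space.
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 40)),
               rng.normal(1.5, 1.0, size=(30, 40))])

scores = PCA(n_components=2).fit_transform(X)                      # 2-D ordination of samples
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
print(labels)
```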
### 15.4. Model Interpretation & Feature Importance
| Sub-Skill | Learn To... | Tools |
| ----------------------------- | ---------------------------------------------- | ------------------------------------------------- |
| Variable importance | Use permutation-based or impurity-based scores | `varImpPlot()`, `vip`, `sklearn.inspection` |
| SHAP values | Visualize model predictions per feature | `shap` (Python), `iml` (R) |
| LIME | Locally interpret model predictions | `lime` (R/Py) |
| Visualize decision boundaries | For low-dimensional classifiers | Especially useful for SVM and logistic regression |
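A sketch of the permutation-importance row above using `sklearn.inspection` on a random forest; the simulated data has two known informative features, so the ranking can be sanity-checked against ground truth.

```python
# Permutation importance on a held-out set (avoids the bias of impurity-based scores).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 30))
y = (X[:, 5] + 0.5 * X[:, 10] + rng.normal(scale=0.5, size=120) > 0).astype(int)  # features 5 and 10 matter

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

result = permutation_importance(rf, X_te, y_te, n_repeats=20, random_state=0)
top = np.argsort(result.importances_mean)[::-1][:5]
print("top features:", top)
print("importance:", result.importances_mean[top])
```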
### 15.5. Multi-Omics & High-Dimensional Specific Methods
| Sub-Skill | Learn To... | Tools |
| --------------------- | --------------------------------------------------------- | ------------------------------------------------ |
| Sparse models | Use sPLS-DA, LASSO, DIABLO for feature selection | `mixOmics`, `glmnet`, `mlr3` |
| Multi-view models | Use `DIABLO`, `MOFA`, `iClusterPlus`, `mint.block.splsda` | Integrates datasets with distinct feature spaces |
| Ensemble models | Combine RF + SVM + GLM for robustness | `caretEnsemble`, `superlearner`, `mlxtend` |
| Autoencoders and VAEs | Perform latent space discovery and denoising | `keras`, `PyTorch`, `scvi-tools`, `DeepBioSim` |
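A sketch for the ensemble row only: a scikit-learn soft-voting combination of RF + SVM + logistic regression on toy data, as a Python stand-in for `caretEnsemble`-style stacking (it is not a multi-view method like DIABLO or MOFA).

```python
# Soft-voting ensemble of RF + SVM + logistic regression (toy data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 60))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.8, size=100) > 0).astype(int)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),   # probability=True enables soft voting
        ("glm", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)
print("CV accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())
```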
16. Simulation & Synthetic Data Generation
Total Time: ~6 weeks
Focus: generate synthetic datasets that mimic real-world complexity, including sparsity, zero-inflation, overdispersion, compositionality, time series, and interactions. Use these datasets to benchmark tools, test hypotheses, validate models, and develop your own pipelines.
Know More
### 16.1. Understand Why Simulation Matters
| Simulation Use Case | Why It's Critical |
| --------------------- | ---------------------------------------------------------------- |
| Benchmarking methods | Compare DA tools (DESeq2 vs ALDEx2 vs ANCOM-BC) on known truth |
| Power estimation | Determine sample size needed to detect known effect |
| Stress-testing models | See how ML behaves under noise, dropout, overfitting |
| Developing new tools | Create ground-truth datasets to validate your package or method |
| Training ML pipelines | Create controlled training/test splits when real data is limited |
### 16.2. Simulate Microbiome Data (Counts, Compositional)
| Sub-Skill | Learn To... | Tools |
| ------------------------------------- | ------------------------------------------------------------------ | --------------------------------------------------------- |
| Simulate zero-inflated count matrices | Use ZINB, DM, hurdle, NB, or log-normal | `NBZIMM`, `zinbwave`, `metagenomeSeq::fitZig()`, `simPop` |
| Control sparsity and compositionality | Simulate rare taxa, fixed total reads, or constant sum constraints | Custom R functions, `scMicrobiomeSim`, `phyloseqSim` |
| Add group effects | Simulate DA taxa with known fold change | Custom `rnbinom()` loops, `mixturemodels` |
| Add batch effects or confounders | Include nested or crossed effects (e.g., cage, site, diet) | `lme4`, `simstudy` |
| Simulate longitudinal data | Use splines, autoregressive error, or custom time series | `splines`, `metalonda`, `MaSigPro`, `splineTimeR` |
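A plain NumPy sketch of the first three rows above (a hand-rolled alternative to the R packages listed): zero-inflated negative binomial counts for two groups, with a known fold change injected into a recorded subset of taxa so benchmarking has ground truth. All parameter values are arbitrary.

```python
# Zero-inflated negative binomial counts with known differentially abundant taxa (toy parameters).
import numpy as np

rng = np.random.default_rng(5)
n_taxa, n_per_group = 200, 25
base_mu = rng.lognormal(mean=2, sigma=1, size=n_taxa)      # baseline mean per taxon
dispersion = 0.5                                           # NB size parameter (smaller = more overdispersed)
true_da = rng.choice(n_taxa, size=20, replace=False)       # record the truly affected taxa
fold_change = np.ones(n_taxa)
fold_change[true_da] = 4.0

def simulate_group(mu):
    # NB parameterized by size and prob so that the mean equals mu; then apply Bernoulli dropout.
    p = dispersion / (dispersion + mu)
    counts = rng.negative_binomial(dispersion, p, size=(n_per_group, mu.size))
    dropout = rng.random(counts.shape) < 0.3               # 30% extra zeros (zero inflation)
    return np.where(dropout, 0, counts)

control = simulate_group(base_mu)
exposed = simulate_group(base_mu * fold_change)
print(control.shape, exposed.shape, "true DA taxa:", np.sort(true_da))
```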
### 16.3. Simulate Metatranscriptomic, Proteomic, or Functional Data
| Sub-Skill | Learn To... | Tools |
| --------------------------------- | ------------------------------------------------ | -------------------------------------------------------- |
| Generate gene expression matrices | Use realistic fold change, dropout, and noise | `polyester`, `compcodeR`, `scDesign2` |
| Add GO term enrichment effects | Simulate functional enrichment across conditions | Custom ID mapping + signal injection |
| Simulate MS intensities | Use gamma/normal distributions for peak areas | `MSstats`, `mssims`, custom `rnorm()` |
| Functional pathway simulation | Simulate taxon–function abundance matrices | `PICRUSt2` + pathway weighting, synthetic mapping tables |
### 16.4. Simulate Exposure / Chemical Data (Exposomics)
| Sub-Skill | Learn To... | Tools |
| ----------------------------------------- | --------------------------------------------------------------- | ---------------------------------------------------- |
| Simulate multi-chemical exposure matrices | Create concentration data with realistic co-occurrence patterns | `mvtnorm`, `BayesNetSim`, custom correlation matrix |
| Add exposure–omics interaction | Simulate effect of chemical on microbe/gene abundance | Define `beta` effect sizes manually |
| Simulate external vs internal exposome | Model difference between exposure and host response | 2-layer simulation strategy |
| Add noise, limit of detection effects | Simulate censored or thresholded data (e.g., LODs) | `truncnorm`, `survival::Surv()` with censoring flags |
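A NumPy sketch of the first and last rows above: correlated log-normal chemical concentrations from a chosen correlation matrix, then left-censoring at a per-chemical limit of detection with a flag retained for downstream modeling. Correlation strength and LOD quantile are arbitrary choices.

```python
# Correlated log-normal exposure matrix with left-censoring at a limit of detection (toy parameters).
import numpy as np

rng = np.random.default_rng(6)
n_samples, n_chems = 100, 5

# Exchangeable correlation (0.6) between chemicals on the log scale.
corr = np.full((n_chems, n_chems), 0.6)
np.fill_diagonal(corr, 1.0)
log_conc = rng.multivariate_normal(mean=np.zeros(n_chems), cov=corr, size=n_samples)
conc = np.exp(log_conc)                                  # log-normal concentrations

lod = np.quantile(conc, 0.2, axis=0)                     # per-chemical limit of detection
below_lod = conc < lod                                   # keep the censoring flag for modeling
censored = np.where(below_lod, np.nan, conc)             # values below LOD recorded as missing
print("fraction censored per chemical:", below_lod.mean(axis=0))
```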
### 16.5. Deep Generative Models (Advanced)
| Model | Use For | Tools |
| -------------------------------- | -------------------------------------------------- | ----------------------------------------------- |
| Variational autoencoders (VAE) | Generate realistic latent features for multi-omics | `keras`, `scvi-tools`, `DeepBioSim` (your tool) |
| GANs or diffusion models | Simulate realistic sample distributions | `tensorflow`, `torch`, `diffusers`, `scGen` |
| Dirichlet-multinomial simulators | Match real microbiome feature distributions | `dirmult`, `HMP`, `DMsim` |
| Time-aware models | Learn longitudinal or autoregressive dynamics | `RecurrentVAEs`, `GaussianProcesses`, `LongVAE` |
### 16.6. Evaluate Simulated Data Fidelity
| Sub-Skill | Learn To... | Tools |
| ---------------------------------- | --------------------------------------------------------------- | ---------------------------------------------------- |
| Compare distributions to real data | Use KS test, Aitchison distances, correlation | `ks.test`, `compositions`, `vegan::vegdist()` |
| Check diversity indices | Simulated alpha/beta diversity should resemble empirical values | `phyloseq`, `microbiome::alpha()` |
| Visualize structure | Use PCA, t-SNE, heatmaps to confirm signal and separability | `prcomp`, `Rtsne`, `pheatmap` |
| Annotate ground truth | Store true class labels, fold changes, and noise levels | So benchmarking has measurable accuracy (TP/FP/etc.) |
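A short sketch of the first row above: compare a summary statistic of simulated data (here stand-in sequencing depths) against the "real" distribution with a two-sample Kolmogorov–Smirnov test from `scipy`.

```python
# KS test comparing simulated vs observed distributions (both generated here as stand-ins).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
real_depths = rng.lognormal(mean=10, sigma=0.4, size=100)   # stand-in for observed library sizes
sim_depths = rng.lognormal(mean=10, sigma=0.6, size=100)    # simulated library sizes to evaluate

stat, pval = ks_2samp(real_depths, sim_depths)
print(f"KS statistic = {stat:.3f}, p = {pval:.3g}")
```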
17. Ontology Mapping & Semantic Harmonization
Total Time: ~5 weeks
Focus: standardized, machine-readable, and biologically meaningful annotations that enable cross-study comparison, integration, and semantic search.
Know More
### 17.1. Understand Ontology Basics
| Sub-Skill | Learn To... | Concepts |
| --------------------------------------------- | ---------------------------------------------------------------- | ------------------------------------------------------------------- |
| What is an ontology? | Understand terms, relationships (is\_a, part\_of), and hierarchy | Ontologies = structured knowledge graphs |
| Difference between vocabularies vs ontologies | Vocab = flat terms. Ontology = terms + relationships + context | Ontologies have parents, children, synonyms |
| Components | Understand IDs, labels, synonyms, definitions, namespaces | e.g., `UBERON:0001836`, "saliva", "oral fluid", `is_a` "body fluid" |
### 17.2. Key Ontologies in Microbiome + Exposomics + One Health
| Domain | Ontologies |
| ----------------------- | --------------------------------------------------------------------------------- |
| Sample type / tissue | **UBERON** (anatomy), **FMA**, **BTO** |
| Environmental context | **EnvO** (Environmental Ontology), **ENVO-lite** |
| Experimental metadata | **EFO** (Experimental Factor Ontology), **OBI** |
| Host species & taxonomy | **NCBITaxon**, **VTO** (vertebrate), **ITIS** |
| Chemicals & exposures | **CHEBI** (chemicals), **ExO** (Exposome Ontology), **CompTox**, **MeSH D-terms** |
| Microbial function | **GO**, **KEGG BRITE**, **MetaCyc**, **Reactome** |
| Disease associations | **MONDO**, **DOID**, **HPO** |
| Microbiome metadata | **MIxS**, **BioSamples**, **MGnify Terms** |
### 17.3. Map and Validate Ontology Terms in Your Metadata
| Sub-Skill | Learn To... | Tools |
| ------------------------------- | -------------------------------------------------------------------- | ------------------------------------------------------------- |
| Look up ontology terms | Search using label, synonym, ID, or definition | **OLS (Ontology Lookup Service)**, **Ontobee**, **BioPortal** |
| Assign term IDs | Link `"oral cavity"` → `UBERON:0000167`, `"feces"` → `ENVO:02000057` | Store both label + ID in metadata |
| Validate terms programmatically | Use OLS4R (R), `ols-client` (Python), or REST APIs | Automate annotation pipelines |
| Match terms via synonyms | Handle common non-standard entries (e.g., "stool", "poop" → "feces") | Use fuzzy match, `stringdist`, `agrep` |
| Handle multiple ontologies | Use preferred hierarchy or cross-map (`EFO → MeSH`) | BioPortal xrefs or OBO cross-references |
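A sketch of the synonym-matching row above using only the standard library: an exact synonym lookup first, then `difflib` fuzzy matching against a controlled vocabulary as a Python stand-in for `stringdist`/`agrep`. The synonym table and cutoff are placeholders.

```python
# Map messy free-text sample types onto controlled labels via synonyms + fuzzy matching.
import difflib

synonyms = {"stool": "feces", "poop": "feces", "spit": "saliva"}
controlled = ["feces", "saliva", "oral cavity"]

def harmonize(term, cutoff=0.8):
    t = term.strip().lower()
    t = synonyms.get(t, t)                                      # exact synonym lookup first
    match = difflib.get_close_matches(t, controlled, n=1, cutoff=cutoff)
    return match[0] if match else None                          # None -> flag for manual review

for raw in ["Stool", "salivaa", "oral cavty", "plasma"]:
    print(raw, "->", harmonize(raw))
```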
### 17.4. Ontology Integration in Pipelines and Outputs
| Sub-Skill | Learn To... | Use Cases |
| ------------------------------------ | ------------------------------------------------------------- | ---------------------------------------------------- |
| Store ontology metadata with outputs | Save alongside feature tables (e.g., GO term name, ID, level) | So your DA results carry functional context |
| Include in phylogenies/networks | Overlay node info with term-based categories | e.g., color samples by `ENVO` biome |
| Annotate sample sheets | Add `ontology_term_id`, `ontology_label`, `source_ontology` | MIxS-compliant metadata structure |
| Enable semantic querying | Query: "find samples with air biome AND lung tissue" | Powered by ontology annotations in database backends |
### 17.5. Semantic Harmonization Across Studies
| Sub-Skill | Learn To... | Tools & Concepts |
| ------------------------------------ | ----------------------------------------------------------- | ---------------------------------------- |
| Align metadata from multiple sources | Map “oral swab” (study A) and “saliva” (study B) → UBERON | Enables clean cross-cohort meta-analysis |
| Collapse terms by parent category | e.g., group all "aquatic biome" terms under `ENVO:00002030` | Use ontology hierarchy traversal |
| Map multiple ontologies together | CHEBI → KEGG → MetaCyc → GO | Build layered functional insight |
| Harmonize chemical metadata | Match synonyms, IDs, InChIs, SMILES, and classes | Use CompTox Dashboard or PubChem API |
18. Collaboration, Deployment & Translation
Total Time: ~5 weeks
Focus: collaborate with interdisciplinary teams, deploy tools and pipelines, and translate results into real-world One Health decisions.
Know More
### 18.1. Collaborative Practices in Multi-Team Projects
| Sub-Skill | Learn To... | Tools & Practices |
| -------------------------------------------------------------- | ------------------------------------------------------------ | -------------------------------------------------------- |
| Work with biologists, clinicians, and environmental scientists | Speak each group’s language; clarify data assumptions | Use integrative metadata templates and shared glossaries |
| Use Git collaboratively | Branching, forking, pull requests, merge conflicts | GitHub issues, `CONTRIBUTING.md`, project boards |
| Maintain collaborative pipelines | Keep YAML configs and environment files modular and editable | Use `config.yaml`, `params.json`, or `.env` files |
| Create shared cloud workspaces | Terra, DNAnexus, Google Drive or AWS S3 with IAM | Permissions, reproducible notebooks, synced outputs |
| Co-author manuscripts efficiently | Use Overleaf (LaTeX) or Quarto with GitHub sync | Avoid version conflicts, enable tracked edits |
### 18.2. Deploy Tools, Pipelines, or Dashboards
| Sub-Skill | Learn To... | Tools |
| -------------------------------------- | ------------------------------------------------------------------- | ----------------------------------------------- |
| Package your pipeline for others | Use `snakemake --containerize`, `conda-lock`, or Docker | Push to GitHub + Zenodo/DOI |
| Make R/Python packages installable | `devtools`, `usethis`, `reticulate`, `setup.py`, `pip install -e .` | Make docs with `pkgdown`, `quarto`, `sphinx` |
| Build web dashboards for non-coders | Use `Shiny`, `Dash`, `Streamlit`, `Observable` | Deploy on shinyapps.io, Hugging Face, or Heroku |
| Host static documentation or tutorials | Use GitHub Pages, Netlify, or ReadTheDocs | Good for protocols, wikis, or project portals |
### 18.3. Communicate Results to Diverse Stakeholders
| Sub-Skill | Learn To... | Channels |
| ------------------------------------------------- | ---------------------------------------------------------------- | -------------------------------------------------- |
| Build science communication artifacts | Visual abstracts, explainer diagrams, blog posts | Canva, BioRender, Quarto blogs |
| Tailor data summaries for non-academics | Summarize findings for farmers, veterinarians, or policy leaders | Use interpretable plots, glossaries, key messages |
| Present findings to stakeholders | Short slide decks, executive summaries, clear visuals | Avoid jargon, emphasize impact |
| Translate results into actionable recommendations | Link findings to surveillance, prevention, or regulation | Highlight risk levels, trends, early warning signs |
### 18.4. Enable Translation in One Health Contexts
| Domain | Translational Strategy | Notes |
| ------------------------ | ----------------------------------------------------------------------------------- | -------------------------------------------- |
| Human health | Anticipate how exposome–microbiome interactions inform early diagnostics, therapy | Personalized recommendations |
| Animal health | Use microbiome insights to improve feed, reduce antibiotic usage, prevent outbreaks | Livestock & zoonotic pathogen surveillance |
| Environmental monitoring | Track microbial biomarkers of pollution, water safety, AMR hotspots | Early detection and bioremediation triggers |
| Policy & public health | Use FAIR pipelines to support reproducible decision-making | Interact with NGOs, agencies (EPA, WHO, FAO) |
### 18.5. Build a Deployable, FAIR-Compliant Portfolio
| Sub-Skill | Learn To... | Tools |
| ------------------------------------ | ---------------------------------------------------------- | -------------------------------------------- |
| Assign persistent identifiers (DOIs) | Use Zenodo, Figshare, OSF | Link to GitHub releases |
| Create full project documentation | README, LICENSE, environment.yml, inputs/outputs | Mimic professional repositories |
| Publish metadata and data openly | Submit to MGnify, SRA, BioSamples, GNPS, PANGEA | Respect privacy/ethics for human data |
| Track provenance & reproducibility | Ensure workflow + code + data are re-runnable from scratch | Use containerized + version-controlled setup |
19. Continuous Integration & Deployment (CI/CD)
Total Time: ~2 weeks
Focus: Build, test, and deploy your pipelines, packages, or dashboards automatically
Know More
### 19.1. CI/CD Core Concepts
| Sub-Skill | Learn To... | Tools |
| --------------------------------- | ------------------------------------- | ------------------------------------------------ |
| Set up automated workflows | Trigger on push, pull requests, tags | GitHub Actions, CircleCI |
| Run package tests automatically | Validate builds for R/Python packages | `rcmdcheck`, `testthat`, `pytest`, `lintr` |
| Auto-build and deploy docs | Publish Quarto sites, pkgdown docs | GitHub Pages, Netlify, `quarto publish` |
| Build & test containers | Validate Docker/Singularity builds | `docker build`, `snakemake --use-singularity` |
| Integration testing for pipelines | Use test configs and data | `nf-test` (Nextflow), mini datasets in Snakemake |
| Send automated reports | Notify team or yourself on pass/fail | Slack, email, or GitHub issue comments |
20. Causal Inference & Intervention Modeling
Total Time: ~2 weeks
Focus: Causal graphs, counterfactual reasoning, and g-methods
Know More
### 20.1. Foundations of Causal Thinking
| Sub-Skill | Learn To... | Tools |
| ---------------------------------------------- | ---------------------------------------------- | --------------------------------------------------- |
| Construct and interpret DAGs | Identify confounding, mediation, collider bias | `ggdag`, `dagitty`, `bnlearn` |
| Identify causal paths vs. spurious paths | Understand d-separation, backdoor criterion | DAG-based reasoning |
| Design observational studies to emulate trials | Causal diagrams, synthetic control | Structural assumptions, inverse probability weights |
### 20.2. Applied Causal Modeling
| Sub-Skill | Learn To... | Tools |
| ----------------------------------- | --------------------------------------------------------------- | ----------------------------------------------- |
| Estimate causal effects | Average Treatment Effect (ATE), mediation | `mediation`, `MatchIt`, `DoWhy`, `causalimpact` |
| Handle time-varying exposures | MSMs, g-methods | `ipw`, `gfoRmula` |
| Causal discovery | Learn DAGs from omics data | PC algorithm, NOTEARS, `pycausal`, `bnlearn` |
| Integrate into exposomics workflows | Model how environmental exposures affect disease via microbiome | Mediation analysis with multi-omics inputs |
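A minimal sketch of inverse-probability weighting for the ATE row above, on simulated data with one measured confounder (all variable names and effect sizes are invented); it is a hand-rolled Python stand-in for `ipw`/`DoWhy`-style workflows, not their API.

```python
# IPW estimate of the average treatment effect of a binary exposure (simulated data, true effect = 2.0).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)
n = 2000
confounder = rng.normal(size=n)                                    # e.g., a diet score
p_exposed = 1 / (1 + np.exp(-0.8 * confounder))                    # exposure depends on the confounder
exposure = rng.binomial(1, p_exposed)
outcome = 2.0 * exposure + 1.5 * confounder + rng.normal(size=n)   # confounded outcome

# Propensity scores from the measured confounder.
ps = (LogisticRegression()
      .fit(confounder.reshape(-1, 1), exposure)
      .predict_proba(confounder.reshape(-1, 1))[:, 1])

weights = exposure / ps + (1 - exposure) / (1 - ps)                # inverse-probability weights
ate = (np.average(outcome[exposure == 1], weights=weights[exposure == 1])
       - np.average(outcome[exposure == 0], weights=weights[exposure == 0]))
print(f"IPW ATE estimate: {ate:.2f} (truth: 2.0)")
```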