cyniphile/singlecell-kaggle

https://www.kaggle.com/competitions/open-problems-multimodal/

How To Run

pyenv install 3.10.6
brew install poetry
# Make a local `.venv` directory inside the project
poetry config virtualenvs.in-project true --local
poetry install

If poetry gives you issues, there is also a requirements.txt file for a standard installation of the packages into a virtualenv.

  1. Download the dataset and extract to data/original.

  2. Download the sparse dataset and extract to data/sparse.

  3. Run basic tests: ./test.sh

  4. Check out run data: prefect orion start

To study

- Competition announcement twitter thread

  • Paper spun off from last year: https://twitter.com/satijalab/status/1498319810459062287
  • Cell Types: To help guide your analysis, we performed a preliminary cell type annotation based on the RNA gene expression using information from the following paper: https://www.nature.com/articles/ncb3493. Note, cell type annotation is an imprecise art, and the concept of assigning discrete labels to continuous data has inherent limitations. You do not need to use these labels in your predictions; they are primarily provided to guide exploratory analysis. In the data, there are the following cell types:
    • MasP = Mast Cell Progenitor
    • MkP = Megakaryocyte Progenitor
    • NeuP = Neutrophil Progenitor
    • MoP = Monocyte Progenitor
    • EryP = Erythrocyte Progenitor
    • HSC = Hematopoietic Stem Cell
    • BP = B-Cell Progenitor
  • Probably worth deepening your understanding of the central dogma of molecular biology, including epigenetics, post-transcriptional modification, transcription factors, gene expression, and the correlation between RNA and protein levels.
  • Pseudotime Algorithms
    • "In single-cell data science, dynamic processes have been modeled by so-called pseudotime algorithms that capture the progression of the biological process. Yet, generalizing these algorithms to account for both pseudotime and real time is still an open problem."
  • Potentially useful tools

Notes

Multivariate regression

For a simple linear model, predicting a length-$n$ vector seems to be equivalent to running $n$ separate regressions, one per output, but I would like to work through the math.
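A sketch of why this holds for ordinary least squares: with targets $Y \in \mathbb{R}^{m \times n}$ and coefficient matrix $B$, the objective $\|XB - Y\|_F^2 = \sum_{j=1}^{n} \|X b_j - y_j\|_2^2$ is a sum of terms that each involve only one column $b_j$, so every column can be minimized independently. This relies on the loss having no term that couples the outputs; models with a shared penalty or a rank constraint across outputs would not decompose this way. The check below uses synthetic data with arbitrary toy sizes (an assumption for illustration, not the competition data).

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n = 200, 5, 3  # samples, features, outputs (arbitrary toy sizes)
X = rng.normal(size=(m, d))
Y = X @ rng.normal(size=(d, n)) + 0.1 * rng.normal(size=(m, n))

# Joint fit: one least-squares solve for the full d x n coefficient matrix.
B_joint, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Separate fits: one least-squares solve per output column.
B_separate = np.column_stack(
    [np.linalg.lstsq(X, Y[:, j], rcond=None)[0] for j in range(n)]
)

# True when both approaches give the same coefficients up to float tolerance.
coefficients_match = np.allclose(B_joint, B_separate)
print(coefficients_match)
```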

A few sklearn estimators are natively multi-output (LinearRegression and related models, KNeighborsRegressor, DecisionTreeRegressor, RandomForestRegressor). For the rest there is a wrapper, MultiOutputRegressor(model), that fits $n$ single-output regressions using any single-output model.
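A minimal sketch of both routes on synthetic data (toy shapes are assumptions, not the competition data). LinearRegression takes a 2-D target directly, while SVR accepts only a 1-D target and is used here purely as an example of a single-output model that needs the wrapper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Y = X @ rng.normal(size=(4, 3)) + 0.1 * rng.normal(size=(100, 3))

# Natively multi-output: fit directly on the 2-D target matrix.
native = LinearRegression().fit(X, Y)
native_pred = native.predict(X)

# Single-output model: the wrapper fits one clone per target column.
wrapped = MultiOutputRegressor(SVR(kernel="linear")).fit(X, Y)
wrapped_pred = wrapped.predict(X)
n_fitted = len(wrapped.estimators_)  # one fitted SVR per output column

print(native_pred.shape, wrapped_pred.shape, n_fitted)
```

The fitted per-output models are available in wrapped.estimators_, which is handy for inspecting or scoring individual targets.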

CD34+ hematopoietic stem and progenitor cells (HSPCs)

Engineering Notes

  • On visualization:
    • "It's time to stop making t-SNE & UMAP plots. In a new preprint w/ Tara Chari we show that while they display some correlation with the underlying high-dimension data, they don't preserve local or global structure & are misleading. They're also arbitrary." https://twitter.com/lpachter/status/1431325969411821572?s=21&t=HtzVmulBKba77ShXSQcKIQ
    • "My rule of my thumb, if the data has structure it should be immediately obvious. PCA then UMAP is a reasonable place to start. Never a good idea to fiddle parameters until you find what you're looking for."
    • "Well, I’ve used a lot of umaps in my day, but scanpy has really convenient plotting tools for heatmaps, dotplots, violin plots, and much more. https://scanpy-tutorials.readthedocs.io/en/latest/plotting/core.html"
  • Still don't understand ATAC. It seems to measure something different from the genes themselves (chromatin accessibility rather than expression).

Random ideas

About

Code for multimodal singlecell kaggle competition
