scHi-CSim: a flexible simulator that generates high-fidelity single-cell Hi-C data for benchmarking

Overview of scHi-CSim

scHi-CSim is a single-cell Hi-C simulator for generating high-fidelity data. To address the sparseness and heterogeneity of single-cell data, scHi-CSim merges neighboring cells to overcome sparseness, samples interactions within distance-stratified chromosomes to preserve the heterogeneity of single cells, and estimates the empirical distribution of restriction fragments to generate simulated data. We verify the fidelity of the simulated data by comparing its performance in single-cell clustering and in the detection of higher-order chromosomal structures against raw data. scHi-CSim is also flexible: both the sequencing depth and the number of simulated replicates can be changed. It requires real single-cell Hi-C sequencing data (fragment-interaction format) as input, along with user-defined simulation parameters.

scHi-CSim workflow:

Figure: scHi-CSim pipeline diagram.

scHi-CSim Usage

1. Preparation

git clone https://github.com/zhanglabtools/scHi-CSim

Once the repository has been cloned, scHi-CSim is installed. The repository includes an example in the data folder, consisting of 20 mouse embryonic stem cells (the cells are available at https://github.com/tanaylab/schic2). With the default settings, scHi-CSim is ready to run the example once the environment requirements are met. The requirements of scHi-CSim are as follows:

  • python (>= 3.7.4)
  • pandas (>= 0.25.1)
  • numpy (>= 1.16.5)
  • scipy (>= 1.3.1)
  • tqdm (>= 4.36.1)
  • seaborn (>= 0.9.0)
  • joblib (>= 0.13.2)

The environment can also be installed quickly from the python-requirements.txt file by running

    pip install -r python-requirements.txt

or with conda:

  1. Install conda.
  2. Create the environment:
      conda env create -f requirements.yml
  3. Activate the environment:
      conda activate scHi-CSim

Once the environment is installed, the example of 20 cells in the data folder can be run by following the guidelines below.

2. Setting the parameters in the "parameters.txt" file

2.1 Parameters set for general folders

0. python        : Path to the python interpreter (e.g., C:\Program Files\anaconda3\python.exe is a common alternative path).
1. work_dir      : Path to the scHi-CSim repository.
2. src           : Path to the scHi-CSim src folder (e.g., work_dir\src).
3. cell_base_info: Path to the folder containing basic information about all cells, such as cell name, chromosome length and so on (e.g., work_dir\data\cell_base_info).
4. raw_data      : Path to the folder containing the raw data of all cells (e.g., work_dir\data\raw_data).
5. sim_data      : Path to the folder containing the simulated data of all cells (e.g., work_dir\data\sim_data).
6. features      : Path to the folder containing the feature sets of the raw data (e.g., work_dir\data\features).
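
Before running the pipeline it can help to confirm that every folder parameter points to an existing directory. The sketch below is only an illustration: it assumes the parameters have already been loaded into a plain dictionary named `params` (how scHi-CSim itself reads "parameters.txt" is not described here), and `missing_folders` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# Folder parameters from section 2.1. The python interpreter path is a file,
# not a folder, so it is not checked here.
FOLDER_KEYS = ["work_dir", "src", "cell_base_info", "raw_data", "sim_data", "features"]


def missing_folders(params):
    """Return the folder parameters that are unset or are not existing directories.

    `params` is a hypothetical dict mapping parameter names to path strings.
    """
    missing = []
    for key in FOLDER_KEYS:
        value = params.get(key)
        if not value or not Path(value).is_dir():
            missing.append(key)
    return missing
```

An empty list means all six folders exist; otherwise the returned names show which entries to fix.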

2.2 Parameters set for simulation: the numbers of fragment interactions and replicates

Figure: Flow chart showing the designation of the fragment-interaction number.

7. fragment_interaction_number_designating_mode: Two-value parameter (all_cell or each_cell) that determines how the number of fragment interactions per cell is designated.
   When set to all_cell, the fragment-interaction numbers of all cells are specified together
   by the parameter all_cell_seqDepthTime,
   and each_cell_seqDepthTime is equal to all_cell_seqDepthTime.
   The basic fragment-interaction number of each simulated cell is generated
   by stratified sampling from the library-size distribution of the raw data.
   The final fragment-interaction number of each simulated cell is equal to
   the basic fragment-interaction number multiplied by all_cell_seqDepthTime.
   When set to each_cell, see the description of the parameter each_cell_fragment_interaction_number_designating_mode.
   The default value is all_cell.
8. all_cell_seqDepthTime: Multiple of the sequencing depth (e.g., 0.1, 0.5, 1 or 2 times the sequencing depth).
   all_cell_seqDepthTime takes effect when fragment_interaction_number_designating_mode is set to all_cell. The default value is 1.
9. each_cell_fragment_interaction_number_designating_mode: Two-value parameter (sequence_depth_time or fragment_interaction_number) that determines how the fragment-interaction number of each cell is designated.
   When set to sequence_depth_time, a tab-separated file named "each_cell_sequencing_depth_time.txt" should be provided in the "cell_base_info" folder,
   and each_cell_seqDepthTime is read from "each_cell_sequencing_depth_time.txt".
   The basic fragment-interaction number of each simulated cell is generated
   by stratified sampling from the library-size distribution of the raw data.
   The final fragment-interaction number of each simulated cell is equal to the basic fragment-interaction number multiplied by each_cell_seqDepthTime.
   When set to fragment_interaction_number, a tab-separated file named "each_cell_fragment_interaction_number.txt"
   should be provided in the "cell_base_info" folder,
   and each_cell_fragment_interaction_number is read from "each_cell_fragment_interaction_number.txt".
   each_cell_fragment_interaction_number_designating_mode takes effect when fragment_interaction_number_designating_mode is set to each_cell.
   The default value is sequence_depth_time.
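
The decision logic of parameters 7 to 9 can be summarized in a short sketch. It is a minimal illustration, not the repository's code: the function name and argument names are hypothetical, the basic fragment-interaction numbers are taken as given (the stratified sampling that produces them is not reproduced), and rounding to an integer is an assumption.

```python
def designate_fragment_numbers(basic, mode="all_cell", all_cell_seq_depth_time=1.0,
                               each_cell_mode="sequence_depth_time",
                               each_cell_seq_depth_time=None,
                               each_cell_fragment_interaction_number=None):
    """Return the final fragment-interaction number of each simulated cell.

    basic: dict mapping cell name -> basic fragment-interaction number
           (assumed to come from the stratified sampling step).
    """
    if mode == "all_cell":
        # One multiplier shared by every cell.
        return {cell: round(n * all_cell_seq_depth_time) for cell, n in basic.items()}
    if mode != "each_cell":
        raise ValueError(f"unknown mode: {mode}")
    if each_cell_mode == "sequence_depth_time":
        # One multiplier per cell, e.g. read from each_cell_sequencing_depth_time.txt.
        return {cell: round(n * each_cell_seq_depth_time[cell]) for cell, n in basic.items()}
    if each_cell_mode == "fragment_interaction_number":
        # Numbers given directly, e.g. read from each_cell_fragment_interaction_number.txt.
        return {cell: int(each_cell_fragment_interaction_number[cell]) for cell in basic}
    raise ValueError(f"unknown each-cell mode: {each_cell_mode}")
```

For example, with basic numbers of 1000 and 2500 and all_cell_seqDepthTime set to 0.5, the two cells would be simulated with 500 and 1250 fragment interactions.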

Figure: Flow chart showing the designation of the number of replicates.

10. replicates_number_designating_mode: Two-value parameter (all_cell or each_cell) that determines how the number of replicates per cell is designated.
    When set to all_cell, the replicate numbers of all cells are specified together by the parameter all_cell_replicates_number,
    and each_cell_replicates_number is equal to all_cell_replicates_number.
    When set to each_cell, a tab-separated file named "each_cell_replicates_number.txt" should be provided in the "cell_base_info" folder,
    and each_cell_replicates_number is read from "each_cell_replicates_number.txt".
    The default value is all_cell.
11. all_cell_replicates_number: Number of replicates (e.g., 1, 2, 3 or 4 replicates per cell).
    all_cell_replicates_number takes effect when replicates_number_designating_mode is set to all_cell.
    The default value is 1.

2.3 Parameters set for simulation: Number of merged cells and others

12. combineNumber: The number of merged cells. The default value is 20.
13. step: The step size used when dividing chromosomes into different distances. The default value is 0.04.
14. Bin_interval_number: Number of intervals used in stratified sampling. The default value is 200.
15. parallel: Two-value parameter (True or False) that determines whether to simulate in parallel. The default value is True.
16. kernel_number: The number of CPU cores (kernels) used in the simulation. kernel_number takes effect when parallel is set to True. The default value is 24.
17. filter_distance: The threshold of chromosomal distance used for filtering noisy
    signals. Interactions at distances greater than the threshold are denoised. filter_distance takes effect
    when the number of replicates set above is greater than 1. The default value is 1000000.
18. filter_value_percentile: The percentile of count values used for filtering noisy signals.
    Fragment interactions whose chromosomal distance is greater than filter_distance and
    whose count is below the filter_value_percentile percentile are filtered by controlling the
    simulated sequencing depth. filter_value_percentile takes effect when the number of replicates
    set above is greater than 1. The default value is 20.

Explanation of 17 and 18: scHi-CSim samples fragment interactions independently at each chromosomal distance, and Hi-C counts at longer distances are relatively small. If the number of replicates set above is greater than 1, it is necessary to reduce the sampling probability of interactions that are far from the diagonal and have low values. This reduces the number of such interactions in the simulated data and avoids conspicuous noise points at long distances.
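
The effect of parameters 17 and 18 can be illustrated with a simplified hard filter. This is a sketch under stated assumptions, not the repository's implementation: scHi-CSim is described as lowering sampling probabilities rather than deleting records, the percentile is assumed here to be taken over the counts of long-range interactions only, and both function names are hypothetical.

```python
import math


def nearest_rank_percentile(values, q):
    """Nearest-rank percentile (q between 0 and 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def filter_long_range(interactions, filter_distance=1_000_000, filter_value_percentile=20):
    """Drop long-range, low-count interactions.

    interactions: list of (pos1, pos2, count) tuples on the same chromosome.
    An interaction is removed only if its distance exceeds filter_distance AND
    its count is below the percentile threshold of the long-range counts.
    """
    long_range = [c for p1, p2, c in interactions if abs(p2 - p1) > filter_distance]
    if not long_range:
        return list(interactions)
    threshold = nearest_rank_percentile(long_range, filter_value_percentile)
    return [(p1, p2, c) for p1, p2, c in interactions
            if abs(p2 - p1) <= filter_distance or c >= threshold]
```

Short-range interactions are always kept, whatever their count; only the sparse far-from-diagonal region is thinned.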

3. Pre-processing

3.1 Input-file format

A tab-separated ('\t') file, named chr_pos, that contains on each line:

<chr1> <pos1> <chr2> <pos2> <count> <cell_name>

  • chr = chromosome (must be a chromosome in the genome)
  • pos = position of the corresponding restriction fragment
  • count = number of restriction fragment interactions
  • cell_name = name of the cell corresponding to the current file
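
As an illustration of the six-column layout above, the following sketch parses chr_pos records with the standard csv module. It assumes the file has no header line and that positions and counts are integers; `Interaction` and `read_chr_pos` are hypothetical names, not part of scHi-CSim.

```python
import csv
from typing import NamedTuple


class Interaction(NamedTuple):
    chr1: str
    pos1: int
    chr2: str
    pos2: int
    count: int
    cell_name: str


def read_chr_pos(lines):
    """Parse tab-separated chr_pos records from an iterable of text lines."""
    records = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row:  # skip blank lines
            continue
        chr1, pos1, chr2, pos2, count, cell_name = row
        records.append(Interaction(chr1, int(pos1), chr2, int(pos2), int(count), cell_name))
    return records
```

Passing an open file handle works the same way, for example `read_chr_pos(open(path, newline=""))`, where `path` is the location of one cell's chr_pos file.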

3.2 File conversion (optional)

Convert the adj (fends-fends interaction) file to the chr_pos file by running the following script:

python convert_adj_to_chr_pos.py -p parameters.txt -f GATC.fends

GATC.fends is the projection file that maps fragment ends (fends) to chromosomes (chr) and coordinates (coord); it is placed in the cell_base_info directory (because of GitHub's file size limit, the file has been compressed into rar format and must be decompressed before use). scHiC2 provides scripts and guidelines for generating adj files (https://github.com/tanaylab/schic2). The website also supplies Hi-C contact maps with processed adj files.

3.3 Extracting features

Construct the feature sets, PCC and CDD, by running the following script:

python extract_features.py -p parameters.txt -b 10

The -b flag sets the lower bound on the number of contacts used when extracting PCC. A bin is filtered out if its total number of contacts across all cells is less than the lower bound. If a "ValueError" occurs, decrease the bound value. Recommended bound values for different numbers of cells are listed below. In addition, if there are more than 2000 cells, it is strongly recommended to turn off parallel processing by setting parallel to "False" to avoid memory overflow.

cell number    bound value
20-100         1-10
100-500        10-30
500-1000       30-50
>1000          50
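
The lower-bound rule behind the -b flag can be sketched as follows. This is an illustration only: the data structure (a dict of per-cell bin counts) and the name `bins_passing_bound` are assumptions, and extract_features.py may organize the computation differently.

```python
def bins_passing_bound(bin_counts_per_cell, bound):
    """Return the bins whose total contacts across all cells reach the lower bound.

    bin_counts_per_cell: dict mapping cell name -> dict mapping bin id -> contacts.
    """
    totals = {}
    for counts in bin_counts_per_cell.values():
        for bin_id, contacts in counts.items():
            totals[bin_id] = totals.get(bin_id, 0) + contacts
    # Bins with a total below the bound are filtered out.
    return sorted(bin_id for bin_id, total in totals.items() if total >= bound)
```

Raising the bound removes more low-coverage bins, which is why a larger value suits larger cell numbers, and lowering it keeps more bins when too few survive.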

3.4 Calculating the cell-cell distances

Calculate the cell-cell distances from PCC and CDD as follows:

python calculate_cell_cell_distance.py -p parameters.txt -c 2

The script uses PCA (principal component analysis) to reduce the dimension of the feature sets, PCC and CDD. The -c flag sets the number of principal components used in calculating cell-cell distances. The cell-cell distance file, cell_cell_distance.txt, is then generated and placed in the features directory. Users can also supply their own cell-to-cell distances by placing a file named cell_cell_distance.txt in the features folder. For circular cell trajectories, it is recommended to use CIRCLET (https://github.com/zhanglabtools/CIRCLET) to construct the distance relationship between cells.

4. Simulating

Simulate cells according to cell_name_list.txt as follows:

python simulating.py -p parameters.txt

The simulated cells, named chr_pos, are placed in the sim_data folder.

5. Post-processing

5.1 Merging the simulated cells

python merge_cell.py -p parameters.txt -m data\merge_data\merge_cell_name_list.txt -i data\sim_data -o data\merge_data

5.2 Converting the chr_pos to bin_pairs file

python convert_chr_pos_to_bin.py -p parameters.txt -i combine_data\chr_pos -o combine_data\bin_pairs  -r resolution
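
Conceptually, converting chr_pos records to bin pairs means mapping each position to a fixed-width bin at the chosen resolution and summing the counts that fall into the same pair of bins. The sketch below shows that idea only; the actual output layout of convert_chr_pos_to_bin.py is not described here, and `to_bin_pairs` is a hypothetical helper.

```python
from collections import Counter


def to_bin_pairs(interactions, resolution):
    """Aggregate fragment interactions into bin pairs.

    interactions: iterable of (chr1, pos1, chr2, pos2, count) tuples.
    resolution:   bin width in base pairs (e.g. 1_000_000 for 1 Mb bins).
    Returns a dict mapping ((chr, bin), (chr, bin)) -> summed count.
    """
    bins = Counter()
    for chr1, pos1, chr2, pos2, count in interactions:
        a = (chr1, pos1 // resolution)
        b = (chr2, pos2 // resolution)
        # Order the two ends so that (a, b) and (b, a) land in the same entry.
        key = (a, b) if a <= b else (b, a)
        bins[key] += count
    return dict(bins)
```

A coarser resolution yields fewer, denser bin pairs; a finer one preserves more detail at the cost of sparser maps.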

Run time and Complexity

  • Simulating the example of 20 cells is expected to take less than 10 minutes with a single CPU core on a normal PC or server. Time and memory usage for different numbers of cores are shown below for reference. The complexity of Step 1 is O(n1), where n1 is the total number of fragment interactions in the raw data. The complexity of Step 2 and Step 3 is O(n2), where n2 is the total number of fragment interactions in the simulated data. Step 2 and Step 3 are performed independently for each cell, so they scale well under parallel computation. In summary, the complexity of scHi-CSim is O(n), where n = max(n1, n2), and a multi-core CPU significantly accelerates the simulation. With 1, 4, 8, 12, and 24 cores, the peak memory usage in these experiments is 736 MB, 3,674 MB, 10,678 MB, 13,669 MB, and 21,849 MB, respectively.

Figure: Usage of time and memory.

6. Creating simulation with your own data

  1. Follow section 1 "Preparation" to download the scHi-CSim repository and install the required modules.
  2. Download or prepare your own single-cell Hi-C sequencing data (chr_pos format). Put the data in the raw_data folder specified in the "parameters.txt" file. Each cell should be placed in a separate folder under raw_data. Then, as in the example, put the basic information of the data in the cell_base_info folder, which contains cell_name_list.txt, chr_length, and fragment_num. If you choose to specify the simulation information of each cell independently in the "parameters.txt" file, you also need to provide the corresponding additional files, such as each_cell_sequencing_depth_time.txt, each_cell_fragment_interaction_number.txt, and each_cell_replicates_number.txt. All of the above files are tab-separated.
  3. Generate the cell_cell_distance.txt file according to the pre-processing guidelines in section 3 and put it in the features folder.
  4. Designate the name of the sim_data folder in the "parameters.txt" file for this run. After simulating.py has run, the simulated cells are placed in the sim_data folder. Use a different sim_data name for each run to avoid overwriting earlier results.
  5. Create merge_cell_name_list.txt in the merge_data folder. Then run the merge_cell.py script to generate merged Hi-C files for downstream analysis. We also supply a script, convert_chr_pos_to_bin.py, to convert chr_pos files to bin_pairs files.
