How to Read My Alpha Genomix Results
Genomix
What is Genomix?
Genomix is a parallel genome assembly system built from the ground up with scalability in mind. It can assemble large and high-coverage genomes from fastq files in a short time and produces assemblies similar to Velvet or Ray in quality. Genomix uses the De Bruijn graph to represent the assembly and cleans, prunes, and walks the graph completely in parallel.
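To make the De Bruijn graph idea concrete, here is a minimal single-machine sketch in Python. It is not Genomix code: the function names are hypothetical, it uses (k-1)-mers as nodes (one common textbook convention), and it ignores reverse complements, sequencing errors, and coverage, all of which a real assembler has to handle.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes one edge: prefix (k-1)-mer -> suffix (k-1)-mer.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while the current node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop instead of looping forever on a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


if __name__ == "__main__":
    reads = ["ACGTTG", "GTTGCA"]  # two overlapping toy reads
    graph = build_de_bruijn(reads, k=4)
    print(walk_unambiguous(graph, "ACG"))  # ACGTTGCA
```

The two toy reads overlap on GTTG, so walking the graph from ACG stitches them into one contig. Branching nodes (more than one successor) are where cleaning steps such as tip removal and bubble merging come in.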
Under the hood, Genomix employs Pregelix, a graph-based, bulk-synchronous parallel message-passing framework. We currently handle graph compression, cleaning and scaffolding. Pregelix is an open-source implementation of Google's Pregel system, the bulk-synchronous parallel vertex-oriented programming model for large-scale graph analytics, allowing Genomix to scale to very large clusters and very large graphs. Pregelix uses main memory as far as it's available, and seamlessly and efficiently spills to disk. This allows us to produce results quickly but also allows us to scale the process to arbitrarily-sized graphs. Genomix can run on a single machine or scale to large, cheap clusters and, in our benchmarks, can run 100x faster than Hadoop-based solutions.
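The bulk-synchronous, vertex-oriented model can be illustrated with a small in-memory simulation. The sketch below is plain Python, not the Pregelix API: `run_supersteps` is a hypothetical name, and the example task (labelling each vertex with the smallest vertex id in its connected component) is chosen only because it is short. It assumes an undirected graph given as an adjacency dict in which every neighbor is also a key.

```python
def run_supersteps(neighbors, max_supersteps=50):
    """Label every vertex with the smallest vertex id in its connected component."""
    labels = {v: v for v in neighbors}

    # Superstep 0: every vertex sends its own label to all of its neighbors.
    inbox = {v: [] for v in neighbors}
    for v, nbrs in neighbors.items():
        for n in nbrs:
            inbox[n].append(labels[v])

    for _ in range(max_supersteps):
        outbox = {v: [] for v in neighbors}
        any_message = False
        for v, messages in inbox.items():
            if not messages:
                continue  # no incoming messages: the vertex stays halted
            smallest = min(messages)
            if smallest < labels[v]:
                # The vertex learned a smaller label, so it updates and notifies neighbors.
                labels[v] = smallest
                for n in neighbors[v]:
                    outbox[n].append(smallest)
                    any_message = True
        if not any_message:
            break  # every vertex has halted and no messages are in flight
        inbox = outbox  # barrier: messages become visible only in the next superstep
    return labels


if __name__ == "__main__":
    graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
    print(run_supersteps(graph))  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

Each vertex only sees its own state and its incoming messages, and messages sent in one superstep are delivered in the next. That restriction is what lets a framework of this kind partition vertices across machines.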
Usage
Currently, the Genomix code is checked directly into the Pregelix codebase. All Genomix code is under the /genomix folder.
/genomix-data: the basic data structures, such as Node, Kmer, etc.
/genomix-driver: the driver that connects the whole pipeline.
/genomix-hadoop: a first attempt at comparing results against a Hadoop implementation (now obsolete).
/genomix-hyracks: the graph building step.
/genomix-pregelix: the graph cleaning and scaffolding step.
To build Genomix:
git clone https://github.com/uci-cbcl/genomix.git
cd genomix
mvn package -am -pl genomix/genomix-driver -DskipTests
# Wait a few minutes...
# At this point, the complete genomix package has been packaged under:
cd genomix/genomix-driver/target/genomix-driver-0.2.10-SNAPSHOT/
The command line usage:
Usage: bin/genomix [options]
  -bridgeRemove_maxLength N : Nodes with length <= bridgeRemoveLength that bridge separate paths are removed from the graph
  -bubbleMerge_maxDissimilarity N : Maximum dissimilarity (1 - % identity) allowed between two kmers while still considering them a "bubble" (leading to their collapse into a single node)
  -bubbleMerge_maxLength N : The maximum length an internal node may be and still be considered a bubble
  -bubbleMergewithsearch_maxLength N : Maximum length can be searched
  -bubbleMergewithsearch_searchDirection VAL : Maximum length can be searched
  -clusterWaitTime N : The amount of time (in ms) to wait between starting/stopping CC/NC
  -debugKmers VAL : Log all interactions with the given comma-separated list of kmers at the FINE log level (check conf/logging.properties to specify an output location)
  -extraConfFiles VAL : Read all the job confs from the given comma-separated list of multiple conf files
  -graphCleanMaxIterations N : The maximum number of iterations any graph cleaning job is allowed to run for
  -hdfsInput VAL : HDFS directory containing input for the first pipeline step
  -hdfsOutput VAL : HDFS directory where the final step's output will be saved
  -hdfsWorkPath VAL : HDFS directory where pipeline temp output will be saved
  -kmerLength N : The kmer length for this graph.
  -localInput VAL : Local directory containing input for the first pipeline step
  -localOutput VAL : Local directory where the final step's output will be saved
  -logReadIds : Log all readIds with the selected edges at the FINE log level (check conf/logging.properties to specify an output location)
  -maxReadIDsPerEdge N : The maximum number of readids that are recorded as spanning a single edge
  -num-lines-per-map N : The kmer length for this graph.
  -outerDistMeans VAL : Average outer distances (from A to B: A==> <==B) for paired-end libraries
  -outerDistStdDevs VAL : Standard deviations of outer distances (from A to B: A==> <==B) for paired-end libraries
  -pairedEndFastqs VAL : Two or more local fastq files as inputs to graphbuild. Treated as paired-end reads. See also -outerDistMean and -outerDistStdDev
  -pathMergeRandom_probBeingRandomHead N : The probability of being selected as a random head in the random path-merge algorithm
  -pipelineOrder VAL : Specify the order of the graph cleaning process
  -plotSubgraph_numHops N : The minimum vertex length that can be the head of scaffolding
  -plotSubgraph_startSeed VAL : The minimum vertex length that can be the head of scaffolding
  -plotSubgraph_verbosity N : Specify the level of details in output graph: 1. UNDIRECTED_GRAPH_WITHOUT_LABELS, 2. DIRECTED_GRAPH_WITH_SIMPLELABEL_AND_EDGETYPE, 3. DIRECTED_GRAPH_WITH_KMERS_AND_EDGETYPE, 4. DIRECTED_GRAPH_WITH_ALLDETAILS. Default is 1.
  -profile : Whether or not to do runtime profiling
  -randomSeed N : The seed used in the random path-merge or split-repeat algorithm
  -readLengths VAL : Read lengths for each library, with paired-end libraries first
  -removeLowCoverage_maxCoverage N : Nodes with coverage lower than this threshold will be removed from the graph
  -runAllStats : Whether or not to run a STATS job after each normal job
  -runLocal : Run a local instance using the Hadoop MiniCluster.
  -saveIntermediateResults : Whether or not to save intermediate steps to HDFS (default: true)
  -scaffold_seedLengthPercentile N : Choose scaffolding seeds as the nodes with longest kmer length. If this is 0 < percentile < 1, this value will be interpreted as a fraction of the graph (so .01 will mean 1% of the graph will be a seed). For fraction >= 1, it will be interpreted as the (approximate) *number* of seeds to include. Mutually exclusive with -scaffold_seedScorePercentile.
  -scaffold_seedScorePercentile N : Choose scaffolding seeds as the highest 'seed score', currently (length * numReads). If this is 0 < percentile < 1, this value will be interpreted as a fraction of the graph (so .01 will mean 1% of the graph will be a seed). For fraction >= 1, it will be interpreted as the (approximate) *number* of seeds to include. Mutually exclusive with -scaffold_seedLengthPercentile.
  -scaffolding_serialRunMinLength N : Rather than processing all the nodes in parallel, run separate scaffolding jobs serially, running with a seed of all nodes longer than this threshold
  -setCutoffCoverageByFittingMixture : Whether or not to automatically set cutoff coverage based on fitting mixture
  -singleEndFastqs VAL : One or more local fastq files as inputs to graphbuild. Treated as single-end reads.
  -stats_expectedGenomeSize N : The expected length for this whole genome data
  -stats_minContigLength N : The minimum contig length included in statistics calculations
  -threadsPerMachine N : The number of threads to use per slave machine. Default is 1.
  -tipRemove_maxLength N : Tips (dead ends in the graph) whose length is less than this threshold are removed from the graph
  -useExistingCluster : Don't start or stop a cluster (use one that's already running)

Example:
  bin/genomix -kmerLength 55 -pipelineOrder BUILD_HYRACKS,MERGE,TIP_REMOVE,MERGE,BUBBLE,MERGE -localInput /path/to/readfiledir/
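The two scaffold seed options treat a value between 0 and 1 as a fraction of the graph and a value of 1 or more as an approximate seed count. The Python sketch below illustrates that rule as described in the help text. It is an assumption-laden illustration, not the Genomix implementation: `seed_count` and `select_seeds` are hypothetical names, and the rounding and the floor of one seed are choices made here for the example.

```python
def seed_count(percentile, num_nodes):
    """Turn a seed percentile setting into a number of seed nodes."""
    if percentile <= 0:
        raise ValueError("percentile must be positive")
    if percentile < 1:
        # Fraction of the graph, e.g. 0.01 means 1% of the nodes.
        # Rounding and the minimum of one seed are assumptions of this sketch.
        return max(1, round(percentile * num_nodes))
    # Values >= 1 are read as an absolute number of seeds, capped at the graph size.
    return min(num_nodes, int(percentile))


def select_seeds(node_lengths, percentile):
    """Pick the longest nodes as seeds; ties are broken by node id for determinism."""
    count = seed_count(percentile, len(node_lengths))
    ranked = sorted(node_lengths, key=lambda node: (-node_lengths[node], node))
    return ranked[:count]


if __name__ == "__main__":
    lengths = {"a": 500, "b": 120, "c": 900, "d": 60}
    print(select_seeds(lengths, 0.5))  # ['c', 'a']
    print(select_seeds(lengths, 3))    # ['c', 'a', 'b']
```

Ranking by a score such as length * numReads instead of length alone would mirror the -scaffold_seedScorePercentile variant. Only the sort key would change.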
Acknowledgement
YourKit is supporting the Genomix open source project with its full-featured Java Profiler. YourKit, LLC is the creator of innovative and intelligent tools for profiling Java and .NET applications. Take a look at YourKit's leading software products: YourKit Java Profiler and YourKit .NET Profiler.
Source: https://libraries.io/github/uci-cbcl/genomix