How to Read My Alpha Genomix Results

Genomix

What is Genomix?

Genomix is a parallel genome assembly system built from the ground up with scalability in mind. It can assemble large and high-coverage genomes from fastq files in a short time and produces assemblies similar to Velvet or Ray in quality. Genomix uses the De Bruijn graph to represent the assembly and cleans, prunes, and walks the graph completely in parallel.
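As a rough illustration of the De Bruijn graph idea mentioned above, the sketch below builds a tiny graph from two reads. It is not Genomix's data structure: the function name `build_de_bruijn` and the toy reads are hypothetical, nodes here are (k-1)-mers, and reverse complements and read ids are ignored for brevity.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read.

    Hypothetical helper for illustration only; repeated entries in a list
    show how many times that edge was observed across the reads.
    """
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes one edge: prefix -> suffix.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


reads = ["ACGTC", "CGTCA"]  # toy, overlapping reads
graph = build_de_bruijn(reads, k=3)
for prefix, suffixes in sorted(graph.items()):
    print(prefix, "->", suffixes)
```

Walking the edges from `AC` onward spells out the overlapping sequence, which is the intuition behind assembling contigs from such a graph.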

Under the hood, Genomix employs Pregelix, a graph-based, bulk-synchronous parallel message-passing framework. We currently handle graph compression, cleaning and scaffolding. Pregelix is an open-source implementation of Google's Pregel system, the bulk-synchronous parallel vertex-oriented programming model for large-scale graph analytics, allowing Genomix to scale to very large clusters and very large graphs. Pregelix uses main memory as far as it's available, and seamlessly and efficiently spills to disk. This allows us to produce results quickly but also allows us to scale the process to arbitrarily-sized graphs. Genomix can run on a single machine or scale to large, cheap clusters and, in our benchmarks, can run 100x faster than Hadoop-based solutions.
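To make the bulk-synchronous, vertex-oriented model more concrete, here is a minimal single-process sketch of Pregel-style supersteps. It does not use the Pregelix API; the function `run_supersteps` and the example graph are assumptions for illustration. The task shown (propagating the smallest vertex id through each connected component) is a generic example, not one of Genomix's cleaning jobs, and it assumes the adjacency lists are symmetric.

```python
def run_supersteps(neighbors, max_supersteps=50):
    """Pregel-style min-label propagation on an undirected graph.

    neighbors: dict mapping vertex id -> list of adjacent vertex ids.
    Returns a dict mapping vertex id -> smallest vertex id in its component.
    """
    labels = {v: v for v in neighbors}

    # Superstep 0: every vertex sends its own label to its neighbors.
    inbox = {v: [] for v in neighbors}
    for v, adj in neighbors.items():
        for u in adj:
            inbox[u].append(labels[v])

    for _ in range(max_supersteps):
        outbox = {v: [] for v in neighbors}
        any_message = False
        # Each vertex looks only at its own state and its incoming messages.
        for v, messages in inbox.items():
            if messages and min(messages) < labels[v]:
                labels[v] = min(messages)
                for u in neighbors[v]:
                    outbox[u].append(labels[v])
                    any_message = True
        if not any_message:
            # No vertex sent anything: all have effectively voted to halt.
            break
        # Barrier: messages sent now become visible in the next superstep.
        inbox = outbox
    return labels


example = {1: [2], 2: [1, 3], 3: [2], 7: [8], 8: [7]}
print(run_supersteps(example))
```

Because a vertex only reads messages delivered at the previous barrier, the per-vertex work inside a superstep can be spread across machines, which is the property a framework like Pregelix exploits.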

Usage

Currently, Genomix code is injected directly into the Pregelix codebase. All Genomix code is under the /genomix folder.

            /genomix-data       basic data structures, such as Node, Kmer, etc.
            /genomix-driver     the driver that connects the whole pipeline
            /genomix-hadoop     an early attempt to compare results against Hadoop (now obsolete)
            /genomix-hyracks    the graph building step
            /genomix-pregelix   the graph cleaning and scaffolding step

To build Genomix:

            git clone https://github.com/uci-cbcl/genomix.git
            cd genomix
            mvn package -am -pl genomix/genomix-driver -DskipTests
            # Wait a few minutes...
            # At this point, the complete genomix package has been packaged under:
            cd genomix/genomix-driver/target/genomix-driver-0.2.10-SNAPSHOT/

The command line usage:

            Usage: bin/genomix [options]

             -bridgeRemove_maxLength N              : Nodes with length <= bridgeRemoveLength
                                                      that bridge separate paths are removed
                                                      from the graph
             -bubbleMerge_maxDissimilarity N        : Maximum dissimilarity (1 - % identity)
                                                      allowed between two kmers while still
                                                      considering them a "bubble" (leading
                                                      to their collapse into a single node)
             -bubbleMerge_maxLength N               : The maximum length an internal node
                                                      may be and still be considered a bubble
             -bubbleMergewithsearch_maxLength N     : Maximum length can be searched
             -bubbleMergewithsearch_searchDirection : Maximum length can be searched
             VAL                                    :
             -clusterWaitTime N                     : the amount of time (in ms) to wait
                                                      between starting/stopping CC/NC
             -debugKmers VAL                        : Log all interactions with the given
                                                      comma-separated list of kmers at the
                                                      FINE log level (check
                                                      conf/logging.properties to specify an
                                                      output location)
             -extraConfFiles VAL                    : Read all the job confs from the given
                                                      comma-separated list of multiple conf
                                                      files
             -graphCleanMaxIterations N             : The maximum number of iterations any
                                                      graph cleaning job is allowed to run
                                                      for
             -hdfsInput VAL                         : HDFS directory containing input for
                                                      the first pipeline step
             -hdfsOutput VAL                        : HDFS directory where the final step's
                                                      output will be saved
             -hdfsWorkPath VAL                      : HDFS directory where pipeline temp
                                                      output will be saved
             -kmerLength N                          : The kmer length for this graph.
             -localInput VAL                        : Local directory containing input for
                                                      the first pipeline step
             -localOutput VAL                       : Local directory where the final
                                                      step's output will be saved
             -logReadIds                            : Log all readIds with the selected
                                                      edges at the FINE log level (check
                                                      conf/logging.properties to specify an
                                                      output location)
             -maxReadIDsPerEdge N                   : The maximum number of readids that
                                                      are recorded as spanning a single edge
             -num-lines-per-map N                   : The kmer length for this graph.
             -outerDistMeans VAL                    : Average outer distances (from A to B:
                                                      A==>    <==B)  for paired-end
                                                      libraries
             -outerDistStdDevs VAL                  : Standard deviations of outer distances
                                                      (from A to B:  A==>    <==B)  for
                                                      paired-end libraries
             -pairedEndFastqs VAL                   : Two or more local fastq files as
                                                      inputs to graphbuild. Treated as
                                                      paired-end reads. See also,
                                                      -outerDistMean and -outerDistStdDev
             -pathMergeRandom_probBeingRandomHead N : The probability of being selected as
                                                      a random head in the random path-merge
                                                      algorithm
             -pipelineOrder VAL                     : Specify the order of the graph
                                                      cleaning process
             -plotSubgraph_numHops N                : The minimum vertex length that can be
                                                      the head of scaffolding
             -plotSubgraph_startSeed VAL            : The minimum vertex length that can be
                                                      the head of scaffolding
             -plotSubgraph_verbosity N              : Specify the level of details in
                                                      output graph:
                                                      1. UNDIRECTED_GRAPH_WITHOUT_LABELS,
                                                      2. DIRECTED_GRAPH_WITH_SIMPLELABEL_AND_EDGETYPE,
                                                      3. DIRECTED_GRAPH_WITH_KMERS_AND_EDGETYPE,
                                                      4. DIRECTED_GRAPH_WITH_ALLDETAILS.
                                                      Default is 1.
             -profile                               : Whether or not to do runtime profiling
             -randomSeed N                          : The seed used in the random path-merge
                                                      or split-repeat algorithm
             -readLengths VAL                       : read lengths for each library, with
                                                      paired-end libraries first
             -removeLowCoverage_maxCoverage N       : Nodes with coverage lower than this
                                                      threshold will be removed from the
                                                      graph
             -runAllStats                           : Whether or not to run a STATS job
                                                      after each normal job
             -runLocal                              : Run a local instance using the Hadoop
                                                      MiniCluster.
             -saveIntermediateResults               : whether or not to save intermediate
                                                      steps to HDFS (default: true)
             -scaffold_seedLengthPercentile N       : Choose scaffolding seeds as the nodes
                                                      with longest kmer length.  If this is
                                                      0 < percentile < 1, this value will
                                                      be interpreted as a fraction of the
                                                      graph (so .01 will mean 1% of the
                                                      graph will be a seed).  For fraction
                                                      >= 1, it will be interpreted as the
                                                      (approximate) *number* of seeds to
                                                      include. Mutually exclusive with
                                                      -scaffold_seedScorePercentile.
             -scaffold_seedScorePercentile N        : Choose scaffolding seeds as the
                                                      highest 'seed score', currently
                                                      (length * numReads).  If this is 0 <
                                                      percentile < 1, this value will be
                                                      interpreted as a fraction of the
                                                      graph (so .01 will mean 1% of the
                                                      graph will be a seed).  For fraction
                                                      >= 1, it will be interpreted as the
                                                      (approximate) *number* of seeds to
                                                      include. Mutually exclusive with
                                                      -scaffold_seedLengthPercentile.
             -scaffolding_serialRunMinLength N      : Rather than processing all the nodes
                                                      in parallel, run separate scaffolding
                                                      jobs serially, running with a seed of
                                                      all nodes longer than this threshold
             -setCutoffCoverageByFittingMixture     : Whether or not to automatically set
                                                      cutoff coverage based on fitting
                                                      mixture
             -singleEndFastqs VAL                   : One or more local fastq files as
                                                      inputs to graphbuild. Treated as
                                                      single-end reads.
             -stats_expectedGenomeSize N            : The expected length for this whole
                                                      genome data
             -stats_minContigLength N               : the minimum contig length included in
                                                      statistics calculations
             -threadsPerMachine N                   : The number of threads to use per
                                                      slave machine. Default is 1.
             -tipRemove_maxLength N                 : Tips (dead ends in the graph) whose
                                                      length is less than this threshold
                                                      are removed from the graph
             -useExistingCluster                    : Don't start or stop a cluster (use
                                                      one that's already running)

            Example:
                bin/genomix -kmerLength 55 -pipelineOrder BUILD_HYRACKS,MERGE,TIP_REMOVE,MERGE,BUBBLE,MERGE -localInput /path/to/readfiledir/
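The two-way reading of the -scaffold_seedLengthPercentile value described in the usage text (a fraction of the graph when below 1, an approximate seed count when 1 or more) can be sketched as follows. This is only an illustration of that rule, not Genomix's code: the helper names `seed_count` and `pick_seeds_by_length`, the rounding choice, and the toy node lengths are all assumptions.

```python
import math


def seed_count(value, num_nodes):
    """Hypothetical helper: how many seeds the option value implies."""
    if value <= 0:
        raise ValueError("value must be positive")
    if value < 1:
        # Fraction of the graph; rounding up is an assumption.
        return math.ceil(value * num_nodes)
    # A value >= 1 is read as an (approximate) number of seeds.
    return min(int(value), num_nodes)


def pick_seeds_by_length(node_lengths, value):
    """Return the ids of the longest nodes, as many as `value` implies."""
    n = seed_count(value, len(node_lengths))
    ranked = sorted(node_lengths, key=node_lengths.get, reverse=True)
    return ranked[:n]


lengths = {"a": 100, "b": 500, "c": 50, "d": 300}  # toy node id -> length
print(pick_seeds_by_length(lengths, 0.5))  # half of the graph
print(pick_seeds_by_length(lengths, 1))    # exactly one seed
```

The same interpretation would apply to -scaffold_seedScorePercentile, with the ranking key swapped for the seed score (length * numReads) named in the usage text.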

Acknowledgement

YourKit is supporting the Genomix open source project with its full-featured Java Profiler. YourKit, LLC is the creator of innovative and intelligent tools for profiling Java and .NET applications. Take a look at YourKit's leading software products: YourKit Java Profiler and YourKit .NET Profiler.
