waffles_transform

A command-line tool for transforming datasets. It contains import/export functionality, unsupervised algorithms, and other useful transforms that you may wish to perform on a dataset. Here's the usage information:

Full Usage Information
[Square brackets] are used to indicate required arguments.
<Angled brackets> are used to indicate optional arguments.

waffles_transform [command]
   Transform data, shuffle rows, swap columns, matrix operations, etc.
   add [dataset1] [dataset2]
      Adds two matrices together element-wise. Results are printed to stdout.
      [dataset1]
         The filename of the first matrix.
      [dataset2]
         The filename of the second matrix.
   addindexcolumn [dataset] <options>
      Add a column that specifies the index of each row. This column will be
      inserted as column 0. (For example, suppose you would like to plot the
      values in each column of your data against the row index. Most plotting
      tools expect one of the columns to supply the position on the horizontal
      axis. This feature will create such a column for you.)
      [dataset]
         The filename of a dataset.
      <options>
         -start [value]
            Specify the initial index. (The default is 0.0.)
         -increment [value]
            Specify the increment amount. (The default is 1.0.)
         -name [value]
            Specify the name of the new attribute.
   addcategorycolumn [dataset] [name] [value]
      Add a column with a constant categorical value. This column will be
      inserted as column 0.
      [dataset]
         The filename of a dataset.
      [name]
         The name of the new column or attribute.
      [value]
         The name of the constant value to insert in every row.
   addnoise [dataset] [dev] <options>
      Add Gaussian noise with the specified deviation to all the elements in
      the dataset. (Assumes that the values are all continuous.)
      [dataset]
         The filename of a dataset.
      [dev]
         The deviation of the Gaussian noise
      <options>
         -seed [value]
            Specify a seed for the random number generator.
         -excludelast [n]
            Do not add noise to the last [n] columns.
   aggregatecols [n]
      Make a matrix by aggregating each column [n] from the .arff files in the
      current directory. The resulting matrix is printed to stdout.
   aggregaterows [n]
      Make a matrix by aggregating each row [n] from the .arff files in the
      current directory. The resulting matrix is printed to stdout.
   align [a] [b]
      Translates and rotates dataset [b] to minimize mean squared difference
      with dataset [a]. (Uses the Kabsch algorithm.)
      [a]
         The filename of a dataset.
      [b]
         The filename of a dataset.
   autocorrelation [dataset]
      Compute the autocorrelation of the specified time-series data.
   cholesky [dataset]
      Compute the Cholesky decomposition of the specified matrix.
   correlation [dataset] [attr1] [attr2] <options>
      Compute the linear correlation coefficient of the two specified
      attributes.
      [dataset]
         The filename of a dataset.
      [attr1]
         A zero-indexed attribute number.
      [attr2]
         A zero-indexed attribute number.
      <options>
         -aboutorigin
            Compute the correlation about the origin. (The default is to
            compute it about the mean.)
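
The difference between the default and -aboutorigin can be sketched in Python. This is a minimal illustration, assuming the default is the standard Pearson coefficient; the function name linear_correlation is hypothetical and is not part of Waffles.

```python
import math

def linear_correlation(xs, ys, about_origin=False):
    """Linear correlation of two equal-length numeric sequences.

    about_origin=False centers both sequences on their means (Pearson).
    about_origin=True leaves them uncentered, mirroring -aboutorigin.
    Assumes neither sequence is constant (otherwise the denominator is 0).
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("sequences must be non-empty and the same length")
    cx = 0.0 if about_origin else sum(xs) / len(xs)
    cy = 0.0 if about_origin else sum(ys) / len(ys)
    sxy = sum((x - cx) * (y - cy) for x, y in zip(xs, ys))
    sxx = sum((x - cx) ** 2 for x in xs)
    syy = sum((y - cy) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

For two attributes that decrease together linearly but are both positive, the two modes can disagree in sign, which is why the choice of center matters.
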
   covariance [dataset]
      Compute the covariance matrix of the specified matrix.
   colstats [dataset]
      Generates a 4-row table. Row 0 contains the min value of each column in
      [dataset]. Row 1 contains the max value of each column in [dataset]. Row
      2 contains the mean value of each column in [dataset]. Row 3 contains the
      median value of each column in [dataset].
      [dataset]
         The filename of a dataset.
   cumulativecolumns [dataset] [column-list]
      Accumulates the values in the specified columns. For example, a column
      that contains the values 2,1,3,2 would be changed to 2,3,6,8. This might
      be useful for converting a histogram of some distribution into a
      histogram of the cumulative distribution.
      [dataset]
         The filename of a dataset.
      [column-list]
         A comma-separated list of zero-indexed columns to transform. A hyphen
         may be used to specify a range of columns. Example: 0,2-5,7
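
The accumulation described above amounts to a running total per column. The following Python sketch reproduces the 2,1,3,2 example; the function name cumulative_columns and the list-of-rows representation are assumptions made for illustration, not the tool's implementation.

```python
from itertools import accumulate

def cumulative_columns(rows, columns):
    """Return a copy of rows with running totals in the given columns."""
    result = [list(row) for row in rows]
    for col in columns:
        # accumulate yields the running sum of the original column values.
        totals = accumulate(row[col] for row in rows)
        for out_row, total in zip(result, totals):
            out_row[col] = total
    return result

histogram = [[2, 10], [1, 20], [3, 30], [2, 40]]
print(cumulative_columns(histogram, [0]))
# → [[2, 10], [3, 20], [6, 30], [8, 40]]
```
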
   determinant [dataset]
      Compute the determinant of the specified matrix.
   discretize [dataset] <options>
      Discretizes the continuous attributes in the specified dataset.
      [dataset]
         The filename of a dataset.
      <options>
         -buckets [count]
            Specify the number of buckets to use. If not specified, the default
            is to use the square root of the number of rows in the dataset.
         -colrange [first] [last]
            Specify a range of columns. Only continuous columns in the
            specified range will be modified. (Columns are zero-indexed.)
   dropcolumns [dataset] [column-list]
      Remove one or more columns from a dataset and print the results to
      stdout. (The input file is not modified.)
      [dataset]
         The filename of a dataset.
      [column-list]
         A comma-separated list of zero-indexed columns to drop. A hyphen may be
         used to specify a range of columns.  A '*' preceding a value means to
         index from the right instead of the left. For example, "0,2-5" refers
         to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1"
         refers to all but the last column.
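
The column-list syntax can be made concrete with a small parser. This is only a sketch of the rules stated above; parse_column_list is a hypothetical helper, it performs no bounds checking, and the real tool may handle malformed input differently.

```python
def parse_column_list(spec, column_count):
    """Expand a column list such as "0,2-5" or "0-*1" into column indexes."""

    def index_of(token):
        # A leading '*' counts from the right: "*0" is the last column.
        if token.startswith("*"):
            return column_count - 1 - int(token[1:])
        return int(token)

    columns = []
    for part in spec.split(","):
        # "a-b" is an inclusive range; a bare value is a single column.
        first, dash, last = part.partition("-")
        start = index_of(first)
        end = index_of(last) if dash else start
        columns.extend(range(start, end + 1))
    return columns

print(parse_column_list("0-*1", 4))
# → [0, 1, 2]
```
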
   drophomogcols [dataset]
      Remove all columns that are homogeneous (have zero variance).
   dropiftooclose [dataset] [col] [gap]
      Drop each row whose value in the specified column exceeds the value in the
      previous row by less than [gap].
      [dataset]
         The filename of a dataset.
      [col]
         The column to evaluate.
      [gap]
         The minimum gap between sequential values.
   droprows [dataset] [after-size]
      Removes all rows except for the first [after-size] rows.
      [dataset]
         The filename of a dataset.
      [after-size]
         The number of rows to keep
   dropmissingvalues [dataset]
      Remove all rows that contain missing values.
   droprandomvalues [dataset] [portion] <options>
      Drop random values from the specified dataset. The resulting dataset with
      missing values is printed to stdout.
      [dataset]
         The filename of a dataset.
      [portion]
         The portion of the data to drop. For example, if [portion] is 0.1,
         then 10% of the values will be replaced with unknown values
      <options>
         -seed [value]
            Specify a seed for the random number generator.
   dropunusedvalues [dataset]
      Drops any nominal meta-data values that are not used.
   export [dataset] <options>
      Print the data as a list of comma separated values without any meta-data.
      [dataset]
         The filename of a dataset.
      <options>
         -tab
            Separate with tabs instead of commas.
         -space
            Separate with spaces instead of commas.
         -r
            Use "NA" instead of "?" for missing values. (This is the format
            used by R.)
         -columnnames
            Print column names on the first row. (The default is to not print
            column names.)
         -precision [val]
            Specify how many digits of precision to use before truncating and
            resorting to scientific notation.
   fillmissingvalues [dataset] <options>
      Replace all missing values in the dataset. (Note that the
      fillmissingvalues command in the waffles_recommend tool performs a
      similar task, but it can intelligently predict the missing values instead
      of just using the baseline value.)
      [dataset]
         The filename of a dataset
      <options>
         -seed [value]
            Specify a seed for the random number generator.
         -random
            Replace each missing value with a randomly chosen non-missing value
            from the same attribute. (The default is to use the baseline value.
            That is, the mean for continuous attributes, and the most-common
            value for nominal attributes.)
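
The default baseline behavior can be sketched as follows. The sketch assumes missing values are represented as None and handles one column at a time; baseline_fill is a hypothetical name, and tie-breaking among equally common nominal values is not specified by the text above.

```python
from collections import Counter
from statistics import mean

def baseline_fill(column, continuous):
    """Replace None entries with the column's baseline value."""
    known = [v for v in column if v is not None]
    if continuous:
        # Baseline for continuous attributes: the mean of the known values.
        baseline = mean(known)
    else:
        # Baseline for nominal attributes: the most common known value.
        baseline = Counter(known).most_common(1)[0][0]
    return [baseline if v is None else v for v in column]

print(baseline_fill([1.0, None, 3.0], continuous=True))
# → [1.0, 2.0, 3.0]
```
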
   filterelements [dataset] [attr] [min] [max] <options>
      Remove each element in the specified attribute that does not fall in a
      certain range.
      [dataset]
         The filename of a dataset
      [attr]
         A zero-indexed column number
      [min]
         The minimum acceptable value
      [max]
         The maximum acceptable value
      <options>
         -invert
            Drop elements that fall within the range instead of elements that
            do not fall within the range.
   filterrows [dataset] [attr] [min] [max] <options>
      Remove each row where the value of the specified attribute does not fall
      in a certain range. Rows with unknown values in the specified attribute
      will also be deleted.
      [dataset]
         The filename of a dataset
      [attr]
         A zero-indexed column number
      [min]
         The minimum acceptable value
      [max]
         The maximum acceptable value
      <options>
         -invert
            Drop the row if the value falls within the range instead of
            dropping it if the value does not fall within the range.
         -preserveOrder
            Preserve the order of the input matrix. By default, the delete
            operation does not guarantee the order will be preserved.
   function [dataset] [equations]
      Compute new data as a function of some existing data. Each row in the
      output is computed from the corresponding row of the input dataset. Each
      equation, f1, f2, f3, ... will produce one column in the output data.
      [dataset]
         The filename of a dataset
      [equations]
         A set of equations to compute the output data. The equations must be
         named f1, f2, f3, etc. The parameters to these equations may have any
         name, but will correspond with the columns of the input data in order.
   geodistance [dataset] [lat1] [lon1] [lat2] [lon2] <options>
      For each row in [dataset], compute the distance (in kilometers) between
      two points (specified in latitude and longitude) by following a great
      circle on the surface of a perfectly spherical Earth, using the haversine
      formula.
      [dataset]
         The filename of a dataset
      [lat1]
         The latitude of point 1 in degrees.
      [lon1]
         The longitude of point 1 in degrees.
      [lat2]
         The latitude of point 2 in degrees.
      [lon2]
         The longitude of point 2 in degrees.
      <options>
         -radius [r]
            Specify the radius of the Earth (or the sphere upon which the
            points occur). The results will have the same units as the radius
            specified. The default is 6371.0, which is approximately the radius
            of the Earth in kilometers.
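
For reference, the haversine formula named above can be written in a few lines of Python. This is a generic sketch of the formula, not the tool's source; the function name haversine is an assumption, and the radius parameter plays the role of the -radius option.

```python
import math

def haversine(lat1, lon1, lat2, lon2, radius=6371.0):
    """Great-circle distance between two points given in degrees.

    The result is in the same units as radius (kilometers by default).
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))
```

Passing a radius of 3958.8 (an approximate Earth radius in miles) would yield distances in miles instead.
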
   import [dataset] <options>
      Convert a text file of comma separated (or otherwise separated) values to
      a .arff file. The meta-data is automatically determined. The .arff file
      is printed to stdout. This makes it easy to operate on structured data
      from a spreadsheet, database, or pretty-much any other source.
      [dataset]
         The filename of a dataset.
      <options>
         -tab
            Data elements are separated with a tab character instead of a
            comma.
         -space
            Data elements are separated with a single space instead of a comma.
         -whitespace
            Data elements are separated with an arbitrary amount of whitespace.
         -semicolon
            Data elements are separated with semicolons.
         -separator [char]
            Data elements are separated with the specified character.
         -columnnames
            Use the first row of data for column names.
         -maxvals [n]
            Specify the maximum number of unique values in a categorical
            attribute before parsing of that attribute will be aborted.
         -time [attr] [format]
            Specify that a particular attribute is a date or time stamp in a
            particular format. Example format: "YYYY-MM-DD hh:mm:ss".
         -nominal [attr]
            Indicate that the specified attribute should be treated as
            nominal.
         -real [attr]
            Indicate that the specified attribute should be treated as real.
   enumeratevalues [dataset] [col]
      Enumerates all of the unique values in the specified column, and replaces
      each value with its enumeration. (For example, if you have a column that
      contains the social-security-number of each user, this will change them
      to numbers from 0 to n-1, where n is the number of unique users.)
      [dataset]
         The filename of a dataset
      [col]
         The column index (starting with 0) to enumerate
   keeponlycolumns [dataset] [column-list]
      Removes all unlisted columns from a dataset and prints the results to
      stdout. (The input file is not modified.)
      [dataset]
         The filename of a dataset.
      [column-list]
         A comma-separated list of zero-indexed columns that will not be
         dropped. A hyphen may be used to specify a range of columns.  A '*'
         preceding a value means to index from the right instead of the left.
         For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers
         to the last column. "0-*1" refers to all but the last column.
   measuremeansquarederror [dataset1] [dataset2] <options>
      Print the mean squared error between two datasets. (Both datasets must be
      the same size.)
      [dataset1]
         The filename of a dataset
      [dataset2]
         The filename of a dataset
      <options>
         -fit
            Use a hill-climber to find an affine transformation to make
            dataset2 fit as closely as possible to dataset1. Report results
            after each iteration.
         -sum
            Sum the mean-squared error over each attribute and only report this
            sum. (The default is to report the mean-squared error in each
            attribute.)
   mergehoriz [dataset1] [dataset2]
      Merge two (or more) datasets horizontally. All datasets must already have
      the same number of rows. The resulting dataset will have all the columns
      of both datasets.
      [dataset1]
         The filename of a dataset
      [dataset2]
         The filename of a dataset
   mergevert [dataset1] [dataset2] <options>
      Merge two datasets vertically. Both datasets must already have the same
      number of columns. The resulting dataset will have all the rows of both
      datasets.
      [dataset1]
         The filename of a dataset
      [dataset2]
         The filename of a dataset
      <options>
         -f
            Force merge, even if attribute names do not match.
   multiply [a] [b] <options>
      Matrix multiply [a] x [b]. Both arguments are the filenames of .arff
      files. Results are printed to stdout.
      [a]
         The filename of a dataset
      [b]
         The filename of a dataset
      <options>
         -transposea
            Transpose [a] before multiplying.
         -transposeb
            Transpose [b] before multiplying.
   multiplyscalar [dataset] [scalar]
      Multiply all elements in [dataset] by the specified scalar. Results are
      printed to stdout.
      [dataset]
         The filename of a dataset.
      [scalar]
         A scalar to multiply each element by.
   normalize [dataset] <options>
      Normalize all continuous attributes to fall within the specified range.
      (Nominal columns are left unchanged.)
      [dataset]
         The filename of a dataset
      <options>
         -range [min] [max]
            Specify the output min and max values. (The default is 0 1.)
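
One common way to achieve this is min-max scaling, sketched below for a single column. The text above does not state the exact formula, so treat this as an assumption; normalize_column is a hypothetical name, and the handling of constant columns is a choice made for the sketch.

```python
def normalize_column(values, out_min=0.0, out_max=1.0):
    """Linearly rescale values so they span [out_min, out_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; this sketch maps it to out_min.
        return [out_min for _ in values]
    scale = (out_max - out_min) / (hi - lo)
    return [out_min + (v - lo) * scale for v in values]

print(normalize_column([2.0, 4.0, 6.0]))
# → [0.0, 0.5, 1.0]
```
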
   normalizemagnitude [dataset]
      Normalize the magnitude of each row-vector to 1.
      [dataset]
         The filename of a dataset
   nominaltocat [dataset] <options>
      Convert all nominal attributes in the data to vectors of real values by
      representing them as a categorical distribution. Columns with only two
      nominal values are converted to 0 or 1. If there are three or more
      possible values, a column is created for each value. The column
      corresponding to the value is set to 1, and the others are set to 0.
      (This is similar to Weka's NominalToBinaryFilter.)
      [dataset]
         The filename of a dataset
      <options>
         -maxvalues [cap]
            Specify the maximum number of nominal values for which to create
            new columns. If not specified, the default is 12.
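
The encoding described above can be sketched for one nominal column. The function name nominal_to_cat and the explicit list of possible values are assumptions for illustration; the sketch does not model the -maxvalues cap.

```python
def nominal_to_cat(column, values):
    """Encode a nominal column as rows of real values.

    values lists the attribute's possible nominal values in order.
    """
    if len(values) == 2:
        # Two possible values collapse to a single 0/1 column.
        return [[float(values.index(v))] for v in column]
    # Otherwise emit one indicator column per possible value.
    return [[1.0 if v == candidate else 0.0 for candidate in values]
            for v in column]

print(nominal_to_cat(["red", "blue"], ["red", "green", "blue"]))
# → [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
```
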
   obfuscate [data]
      Strips comments from an ARFF file, and replaces meta-data with generic
      meaningless values, thus making it difficult to determine what the data
      means. (You may also want to normalize the data to make the range of
      continuous attributes meaningless.) Note that the values of the actual
      data are not altered, so it may still be possible to derive meaning from
      them.
   overlay [base] [over]
      Combines two same-sized matrices by placing [over] on top of [base], such
      that elements from [base] are used only if the same element is missing in
      [over].
      [base]
         The matrix of values to use when they are missing in the other one.
      [over]
         The matrix of values to use as long as they are not missing.
   powercolumns [dataset] [column-list] [exponent]
      Raises the values in the specified columns to some power (or exponent).
      [dataset]
         The filename of a dataset.
      [column-list]
         A comma-separated list of zero-indexed columns to transform. A hyphen
         may be used to specify a range of columns. Example: 0,2-5,7
      [exponent]
         An exponent value, such as 0.5, 2, etc.
   prettify [json-file]
      Pretty-prints a JSON file.
   pseudoinverse [dataset]
      Compute the Moore-Penrose pseudo-inverse of the specified matrix of real
      values.
   reducedrowechelonform [dataset]
      Convert a matrix to reduced row echelon form. Results are printed to
      stdout.
   rotate [dataset] [col_x] [col_y] [angle_degrees]
      Rotate by [angle_degrees] around the origin in the col_x,col_y plane.
      Only affects the values in col_x and col_y.
      [dataset]
         The filename of a dataset.
      [col_x]
         The zero-based index of an attribute to serve as the x coordinate in
         the plane of rotation.  Rotation from x to y will be 90 degrees. col_x
         must be a real-valued attribute.
      [col_y]
         The zero-based index of an attribute to serve as the y coordinate in
         the plane of rotation.  Rotation from y to x will be 270 degrees.
         col_y must be a real-valued attribute.
      [angle_degrees]
         The angle in degrees to rotate around the origin in the col_x,col_y
         plane.
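
The rotation convention stated above (90 degrees carries the x axis onto the y axis) matches the standard two-dimensional rotation, sketched here for a single row. The name rotate_xy is hypothetical; this is an illustration rather than the tool's code.

```python
import math

def rotate_xy(x, y, angle_degrees):
    """Rotate the point (x, y) about the origin, turning x toward y."""
    theta = math.radians(angle_degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return x * cos_t - y * sin_t, x * sin_t + y * cos_t
```

With this convention, rotating (1, 0) by 90 degrees gives approximately (0, 1), and rotating (0, 1) by 270 degrees gives approximately (1, 0), up to floating-point rounding.
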
   reordercolumns [dataset] [column-list]
      Reorder the columns as specified in the column list.
      [dataset]
         The filename of a dataset.
      [column-list]
         A comma-separated list of zero-indexed columns. A hyphen may be used to
         specify a range of columns.  A '*' preceding a value means to index
         from the right instead of the left. For example, "0,2-5" refers to
         columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1"
         refers to all but the last column.
   samplerows [dataset] [portion]
      Randomly samples from the rows in the specified dataset and prints them
      to stdout. This tool reads each row one-at-a-time, so it is well-suited
      for reducing the size of datasets that are too big to fit into memory.
      (Note that unlike most other tools, this one does not convert CSV to ARFF
      format internally. If the input is CSV, the output will be CSV too.)
      [dataset]
         The filename of a dataset. ARFF, CSV, and a few other formats are
         supported.
      [portion]
         A value between 0 and 1 that specifies the likelihood that each row
         will be printed to stdout.
      <options>
         -seed [value]
            Specify a seed for the random number generator.
   samplerowsregularly [dataset] [freq]
      Samples from the rows in the specified dataset at regular intervals and
      prints them to stdout. This tool reads each row one-at-a-time, so it is
      well-suited for reducing the size of datasets that are too big to fit
      into memory. (Note that unlike most other tools, this one does not
      convert CSV to ARFF format internally. If the input is CSV, the output
      will be CSV too.)
      [dataset]
         The filename of a dataset. ARFF, CSV, and a few other formats are
         supported.
      [freq]
         The number of rows read for each row printed.
   scalecolumns [dataset] [column-list] [scalar]
      Multiply the values in the specified columns by a scalar.
      [dataset]
         The filename of a dataset.
      [column-list]
         A comma-separated list of zero-indexed columns to transform. A hyphen
         may be used to specify a range of columns. Example: 0,2-5,7
      [scalar]
         A scalar value.
   shiftcolumns [dataset] [column-list] [offset]
      Add [offset] to all of the values in the specified columns.
      [dataset]
         The filename of a dataset.
      [column-list]
         A comma-separated list of zero-indexed columns to transform. A hyphen
         may be used to specify a range of columns. Example: 0,2-5,7
      [offset]
         A positive or negative value to add to the values in the specified
         columns.
   shuffle [dataset] <options>
      Shuffle the row order.
      [dataset]
         The filename of a dataset
      <options>
         -seed [value]
            Specify a seed for the random number generator.
   significance [dataset] [attr1] [attr2] <options>
      Compute statistical significance values for the two specified attributes.
      [dataset]
         The filename of a .arff file.
      [attr1]
         A zero-indexed column number.
      [attr2]
         A zero-indexed column number.
      <options>
         -tol [value]
            Sets the tolerance value for the Wilcoxon Signed Ranks test. The
            default value is 0.001.
   sortcolumn [dataset] [col] <options>
      Sort the rows in [dataset] such that the values in the specified column
      are in ascending order and print the results to stdout. (The input
      file is not modified.)
      [dataset]
         The filename of a dataset.
      [col]
         The zero-indexed column number by which to sort
      <options>
         -descending
            Sort in descending order instead of ascending order.
   split [dataset] [rows] [filename1] [filename2] <options>
      Split a dataset into two datasets. (Nothing is printed to stdout.)
      [dataset]
         The filename of a dataset.
      [rows]
         The number of rows to go into the first file. The rest go in the
         second file.
      <options>
         -seed [value]
            Specify a seed for the random number generator.
         -shuffle
            Shuffle the input data before splitting it.
      [filename1]
         The filename for one half of the data.
      [filename2]
         The filename for the other half of the data.
   splitclass [data] [attr] <options>
      Splits a dataset by a class attribute, such that a separate file is
      created for each unique class label. The generated filenames will be
      "[data]_[value]", where [value] is the unique class label value.
      [data]
         The filename of a dataset.
      [attr]
         The zero-indexed column number of the class attribute.
      <options>
         -dropclass
            Drop the class attribute after splitting the data. (The default is
            to include the class attribute in each of the output datasets,
            which is rather redundant since every row in the file will have the
            same class label.)
   splitfold [dataset] [i] [n] <options>
      Divides a dataset into [n] parts of approximately equal size, then puts
      part [i] into one file, and the other [n]-1 parts in another file. (This
      tool may be useful, for example, to implement n-fold cross validation.)
      [dataset]
         The filename of a dataset.
      [i]
         The (zero-based) index of the fold, or the part to put into the
         training set. [i] must be less than [n].
      [n]
         The number of folds.
      <options>
         -out [train_filename] [test_filename]
            Specify the filenames for the training and test portions of the
            data. The default values are train.arff and test.arff.
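
To drive n-fold cross validation, one splitfold call is needed per fold. The Python sketch below only builds the command lines, using the arguments documented above; the helper name splitfold_commands and the train_i.arff/test_i.arff filenames are hypothetical, and the training and evaluation steps are left out.

```python
import shlex

def splitfold_commands(dataset, folds):
    """Build one splitfold command line per fold."""
    commands = []
    for i in range(folds):
        args = [
            "waffles_transform", "splitfold", dataset, str(i), str(folds),
            "-out", f"train_{i}.arff", f"test_{i}.arff",
        ]
        # shlex.join quotes any argument that needs it for a POSIX shell.
        commands.append(shlex.join(args))
    return commands

for line in splitfold_commands("data.arff", 3):
    print(line)
```

Each printed line could then be run from a shell script or passed to a process launcher, followed by whatever training and testing commands the experiment requires.
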
   squareddistance [a] [b]
      Computes the sum and mean squared distance between dataset [a] and [b].
      ([a] and [b] are each the names of files in .arff format. They must have
      the same dimensions.)
      [a]
         The filename of a dataset.
      [b]
         The filename of a dataset.
   swapcolumns [dataset] [col1] [col2]
      Swap two columns in the specified dataset and print the results to
      stdout. (Columns are zero-indexed.)
      [dataset]
         The filename of a dataset
      [col1]
         A zero-indexed column number.
      [col2]
         A zero-indexed column number.
   transition [action-sequence] [state-sequence] <options>
      Given a sequence of actions and a sequence of states (each in separate
      datasets), this generates a single dataset to map from action-state pairs
      to the next state. This would be useful for generating the data to train
      a transition function.
      <options>
         -delta
            Predict the delta of the state transition instead of the new state.
   threshold [dataset] [column] [threshold]
      Outputs a copy of dataset such that any value v in the given column
      becomes 0 if v <= threshold and 1 otherwise.  Only works on continuous
      attributes.
      [dataset]
         The filename of a dataset.
      [column]
         The zero-indexed column number to threshold.
      [threshold]
         The threshold value.
   transpose [dataset]
      Transpose the data such that columns become rows and rows become columns.
   uglify [json-file]
      Prints a JSON file with whitespace removed.
   unique [dataset] [col] <options>
      Discard rows with redundant values in [col].
      [dataset]
         The dataset on which to operate.
      [col]
         The column in which to preserve only one of each unique value.
      <options>
         -last
            Preserve the last row with a unique value in [col]. (The default is
            to preserve the first row with a unique value in [col].)
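
The first-versus-last choice can be sketched in Python. The function name unique_rows is hypothetical, and the sketch assumes surviving rows keep their input order, which the text above does not specify.

```python
def unique_rows(rows, col, keep_last=False):
    """Keep one row per distinct value found in column col."""
    chosen = {}
    for index, row in enumerate(rows):
        key = row[col]
        # Default: remember the first occurrence only.
        # keep_last=True: overwrite so the last occurrence wins (like -last).
        if keep_last or key not in chosen:
            chosen[key] = index
    return [rows[i] for i in sorted(chosen.values())]

rows = [["a", 1], ["b", 2], ["a", 3]]
print(unique_rows(rows, 0))
# → [['a', 1], ['b', 2]]
print(unique_rows(rows, 0, keep_last=True))
# → [['b', 2], ['a', 3]]
```
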
   zeromean [dataset]
      Subtracts the mean from all values of all continuous attributes, so that
      their means in the result are zero. Leaves nominal attributes untouched.
   usage
      Print usage information.
