Importing a data set requires two files: a data dictionary named filename.dct and an ASCII data set named filename.dat, where filename is replaced with a name of the user’s choice. Appendix A presents a sample data dictionary. Each line of the data dictionary describes a single variable. The first element on the line must be a variable name of up to eight characters. Next comes the column (or columns) in which the item is found. Table 1 describes the additional keywords that may (or sometimes must) appear in the data dictionary.
If you are importing data from another file format, you will need to write a dictionary file only if you want to use the MML procedures for test data. In this case, importing is a two-step process: first the data are imported, then the metadata are updated. The second step is necessary to define tests, the item parameters of test items, and so on. Dictionaries for updating data imported from other formats use the same format as dictionaries for importing from ASCII, but they do not specify a file or the location of the data within the file. Typically, they define tests and the items on those tests. The remainder of this document describes reading ASCII data and the format of dictionary files.
Reading ASCII data into AM works much as it does in other statistics programs. It requires a dictionary file (<name>.dct) and a data file containing a flat, rectangular matrix of data. The dictionary file contains all of the metadata (information about the data).
The dictionary file is organized in sequences. A sequence is a string of keywords and information about the data. Some types of sequences contain other sequences. Each type of sequence is described below.
Test sequences
TEST="<name>" ID=<positive integer>
[SCALE=<real>]
[LOCATION=<real>]
[CUTSCORES=<real[,real…]>]
<subtest sequence>
[<subtest sequence> …]
A test sequence defines a test. The SCALE, LOCATION, and CUTSCORES keywords are not currently used, but they will define the reporting metric and the numeric scores associated with achievement levels. All subsequent references to the test are made through its ID number (to save typing), which can be any positive integer.
Tests must be defined before any items are associated with them (see Variable sequence, below).
Subtest sequences
SUBTEST="<name>" ID=<positive integer> [SCALE=<real> LOCATION=<real>
CUTSCORES=<real[,real…]> WEIGHT=<real>]
Subtest sequences must be embedded in test sequences. For readability, they typically form the second and subsequent lines of the test sequence. Subtests can define their own scales, locations, and cutscores. They also can define a weight to be used in the formation of a composite. Currently, these default values are not used, and users must input them directly.
Format sequences
FORMAT=<name> <real>="label" [ <real>="label" …]
Format sequences define the labels that will be applied to the values of variables with which they are associated.
Format sequences must appear before variable sequences reference them.
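To make the value-to-label mapping concrete, the following Python sketch parses a format sequence into a lookup table. It is an illustration only, not AM's own parser: the function name parse_format_sequence and the sample format GENDER are hypothetical, and the sketch assumes that values are plain decimal numbers and that labels contain no embedded double quotes.

```python
import re

def parse_format_sequence(text):
    """Parse 'FORMAT=<name> <real>="label" ...' into (name, {value: label}).

    Assumptions of this sketch: values are plain decimal numbers and
    labels contain no embedded double quotes.
    """
    head = re.match(r'\s*FORMAT\s*=\s*(\S+)', text)
    if head is None:
        raise ValueError("not a FORMAT sequence")
    # Every <real>="label" pair that follows the format name.
    pairs = re.findall(r'(-?\d+(?:\.\d+)?)\s*=\s*"([^"]*)"', text[head.end():])
    return head.group(1), {float(value): label for value, label in pairs}

# Hypothetical format named GENDER with two labeled values.
name, labels = parse_format_sequence('FORMAT=GENDER 1="Male" 2="Female"')
print(name, labels)
```

A variable sequence that carries FORMAT=GENDER would then display the label stored under each data value.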
File sequence
FILE = "<pathname>" [LRECL=<positive integer>]
Pathname specifies the fully qualified file name of the file containing the data matrix. LRECL is required only if records are not terminated by a linefeed/carriage return. Note that the system currently expects all data in a record to appear on a single line.
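The role of LRECL can be illustrated with a short Python sketch that splits the raw contents of a data file into records. This is not how AM itself reads files; split_records is a hypothetical helper, and the sketch assumes that when LRECL is given every record occupies exactly that many bytes with no terminator.

```python
def split_records(raw, lrecl=None):
    """Split raw file bytes into records.

    With lrecl=None, records are assumed to be line-terminated. Otherwise
    every record is assumed to occupy exactly lrecl bytes (an assumption
    of this sketch) with no terminator between records.
    """
    if lrecl is None:
        return raw.splitlines()
    if len(raw) % lrecl != 0:
        raise ValueError("file size is not a multiple of LRECL")
    return [raw[i:i + lrecl] for i in range(0, len(raw), lrecl)]

# The same two 7-column records, with and without line terminators.
print(split_records(b"0011234\n0025678\n"))
print(split_records(b"00112340025678", lrecl=7))
```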
Variables sequence
VARIABLES <variable sequence> [<variable sequence> …]
The keyword VARIABLES signals the start of a series of variable sequences.
Variable sequence
<name> <positive integer> LENGTH=<positive integer> TYPE=<B | R | D>
[ FORMAT=<format name> ]
[F=%<f|d>[.<positive integer>]]
[DESIGN=<S | C | W>]
[irm sequence]
The variable sequence is formatted somewhat differently from the other sequences. Users are not required to type VARIABLE for each one; instead, they simply begin the sequence with the name of the variable. The second entry (<positive integer>) identifies the first column of the variable’s data in the data file. LENGTH specifies the number of columns used to store the variable in the data file. Alternatively, users can enter <startcolumn>-<endcolumn> and omit the LENGTH statement. Note, however, that one-column variables must still indicate both a start and an end column (e.g., 2-2).
TYPE specifies the storage type of the data. The options are B (byte: positive integers up to 254 only), R (any real number, stored in single precision), or D (any real number, stored in double precision). We will expand the system to handle short and long integers in the future. D should rarely be used, since input data virtually never require double precision. (All calculations are done in double precision regardless of the storage type of the variables.)
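The two ways of locating a variable (a start column plus LENGTH, or a start-end range) can be sketched in Python as follows. The helper column_slice and the sample record are hypothetical, and the sketch assumes that columns are numbered from 1, which the text does not state explicitly.

```python
def column_slice(position, length=None):
    """Turn a column specification into a Python slice.

    position is either a start column ("5", used together with LENGTH)
    or a "start-end" range ("5-7"). Columns are assumed to be numbered
    from 1 (an assumption of this sketch).
    """
    if "-" in position:
        start, end = (int(part) for part in position.split("-"))
    else:
        if length is None:
            raise ValueError("a start column needs a LENGTH")
        start = int(position)
        end = start + length - 1
    if start < 1 or end < start:
        raise ValueError("invalid column range")
    return slice(start - 1, end)

# Hypothetical 10-column record: an ID, a one-column code, and a score.
record = "0012M 37.5"
print(record[column_slice("1", length=3)])  # start column plus LENGTH
print(record[column_slice("5-5")])          # one-column variable as a range
print(record[column_slice("7-10")])         # start-end range
```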
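The practical difference between the storage types can be illustrated in Python. This sketch is not AM code: fits_byte and as_single are hypothetical helpers, and the lower bound of 0 for byte storage is an assumption, since the text specifies only the upper limit of 254.

```python
import struct

def fits_byte(value, lowest=0):
    """True if value is a whole number that byte storage could hold.

    The upper limit of 254 comes from the text; the lower bound (0 here)
    is an assumption of this sketch.
    """
    return float(value).is_integer() and lowest <= value <= 254

def as_single(value):
    """Round a Python float (double precision) to single precision."""
    return struct.unpack("f", struct.pack("f", value))[0]

print(fits_byte(3), fits_byte(255), fits_byte(1.5))
print(as_single(0.5) == 0.5)   # 0.5 is exactly representable in single precision
print(as_single(0.1) == 0.1)   # 0.1 is not, so a small rounding error appears
```

For typical survey and test data the rounding introduced by single-precision storage is far smaller than the precision of the input, which is consistent with the advice to reserve double precision for rare cases.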
FORMAT names a value format, defined earlier using a format sequence.
F identifies a display format for numeric values, and is not currently used.
DESIGN identifies variables that play a special role in the sample design: S (stratum ID), C (cluster or PSU ID), or W (weight). If set, these become the default values for the design information used in the calculation of Taylor series standard errors.
IRM sequence
IRM=<3PL | PCL | 3PN | GRN> ONTEST = <test id number> ONSUBTEST=<subtest id number>
<parameters-depend on model, see Table 1. All parameters are real numbers>
[OMITTED=<positive integer>]
Parameter sequences
Model | Parameters
3PL   | IPA=<real> IPB=<real> [IPC=<real>]
3PN   | IPA=<real> IPB=<real> [IPC=<real>]
PCL   | IPA=<real> IPB=<real> IPD1=<real> [IPD2=<real> … IPD10=<real>]
GRN   | IPA=<real> IPB=<real> IPD1=<real> [IPD2=<real> … IPD10=<real>]
These sequences define test items, identifying the test (ONTEST) and subtest (ONSUBTEST) of which they are part, using the ID numbers defined in a preceding test sequence.
NOTE: Any variables defined as items MUST be of TYPE=B.
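The rules for items (items must be TYPE=B and must refer to a previously defined test and subtest) can be checked mechanically. The Python sketch below shows one possible check; the function check_items and its data structures are hypothetical, and it assumes that subtest ID numbers are looked up within the test named by ONTEST.

```python
def check_items(test_ids, variables):
    """Return a list of problems found in item definitions.

    test_ids: dict mapping each test ID to the set of its subtest IDs.
    variables: list of dicts with keys 'name' and 'type', plus 'ontest'
    and 'onsubtest' when the variable carries an IRM sequence.
    Both structures are hypothetical, chosen for this sketch.
    """
    problems = []
    for var in variables:
        if "ontest" not in var:
            continue  # no IRM sequence, so the variable is not a test item
        if var["type"] != "B":
            problems.append(f"{var['name']}: items must be TYPE=B")
        if var["ontest"] not in test_ids:
            problems.append(f"{var['name']}: unknown test id {var['ontest']}")
        elif var.get("onsubtest") not in test_ids[var["ontest"]]:
            problems.append(f"{var['name']}: unknown subtest id {var.get('onsubtest')}")
    return problems

# Hypothetical dictionary contents: test 1 has subtests 1 and 2.
tests = {1: {1, 2}}
variables = [
    {"name": "AGE", "type": "R"},
    {"name": "ITEM1", "type": "B", "ontest": 1, "onsubtest": 1},
    {"name": "ITEM2", "type": "R", "ontest": 1, "onsubtest": 3},
]
for problem in check_items(tests, variables):
    print(problem)
```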