Skip to main content
Logo image

Section 3.12 Working with data

It is possible to read in an external file and access its content from within a PreFigure source file. This can be used to create, for example, some common statistical diagrams, such as scatter plots and histograms.

Subsection 3.12.1 Reading an external file

The <read> element is used for reading in an external file. The name of the file needs to be supplied with the @filename attribute, and the contents can later be referenced with the @name attribute. Here are some additional attributes
type
This describes the type of file to be read. At present, the only possible file type is CSV though additional file types are expected to be added. The default is therefore @type="csv".
delimiter
For reading a CSV, the value of this attribute defines the file’s delimiter. By default, the delimiter is a single comma.
quotechar
The value of this attribute tells the CVS reader what character to view as quotation marks, whose enclosed content is read as a single entry. By default, the delimiter is a single quote.
string-columns
When reading a CSV file, the value of this parameter can be a list of headers in the file whose columns should be interpreted as strings rather than numbers.
Note that this feature is not yet implemented in PreTeXt, but it will be soon.
In the simplest case, PreFigure looks for the file in the file structure relative to the current PreFigure source. This can be modified using an <external-root> element in a PreFigure publication file. If this element has the attribute @name="/home/data", and the <read> element has @name="input.csv", then PreFigure will expect the file to be located at /home/data/input.csv.
Inside a PreTeXt document, the <external-root> element should be included in the <prefigure-preamble> element inside <docinfo>.
PreFigure expects CSV files to have some structure. In particular, the first line of the file will be interpreted as headers for the columns. If first line is 0,1,2, then the headers will be referenced later as “0”, “1”, and “2”. It is then assumed that each subsequent row has at least as many entries as headers. If a header appears in the list defined by the @string-columns attribute, then the entries in that column are interpreted as strings. Otherwise, they are interpreted as numerical data.
A well-known dataset contains measurements of 150 irises along with the species they belong to. The first two lines of the file are
  sepal length,sepal width,petal length,petal width,species
  5.1,3.5,1.4,0.2,Iris-setosa
The element <read name="data" filename="iris.csv"> will read this file with the results later referenced as data. For instance, data['sepal length'] contains the list of sepal lengths of 150 irises. data['species'] contains a list of the species each flower belongs to. You may notice that we did not need to specify that the last column consists of strings since this is automatically detected by PreFigure.
There are two useful operations.
filter
The filter function can be used to trim a column of the dataset by some criterion. For instance, filter(df['sepal length', 'species', 'Iris-setosa']) returns a list of the sepal lengths of the flowers whose species attribute is “Iris-setosa”.
zip_lists
If we have two lists with the same number of entries, this function returns a list of points whose entries come from the two lists. For instance, zip_lists(df['sepal length'], df['sepal width']) will be a list of 150 points whose first entry is the sepal length and whose second is the sepal width of each flower.
Next we will see some things we can do with data like this.

Subsection 3.12.2 Scatter plots

We can create scatter plots in PreFigure. Of course, we don’t need to use data that is loaded from an external source, but that might be the most common use case. This is demonstrated in Figure 3.12.1.
Figure 3.12.1. A scatter plot built from the source in Listing 3.12.2.
<diagram xmlns="https://prefigure.org"
         dimensions="(300,300)"
         margins="(30,30,5,5)">
  <read name="df" filename="iris.data"
        string-columns="['species']"/>
  <definition>x=filter(df, 'sepal length', 'species', 'Iris-versicolor')</definition>
  <definition>y=filter(df, 'petal width', 'species', 'Iris-versicolor')</definition>
  <definition>points=zip_lists(x, y)</definition>
  
  <coordinates bbox="(4,0,9,3)">
    <grid-axes>
      <xlabel alignment="nw">Sepal length</xlabel>
      <ylabel alignment="se">Petal width</ylabel>
    </grid-axes>

    <scatter points="points" fill="blue"/>

  </coordinates>
</diagram>
Listing 3.12.2. The PreFigure source for Figure 3.12.1.
Notice that we read the data using a <read> element and then form a set of points using the filter and zip_lists functions. These points are given to the <scatter> element through the @name attribute. Each point will be plotted using a <point> element, which is internally created using attributes supplied by the <scatter> element. In this example, the <scatter> element simply defined the fill of those points to be “blue”.
As a convenience, the <scatter> element offers a more direct way to build scatter plots directly from the data source. This is demonstrated in Figure 3.12.3.
Figure 3.12.3. A scatter plot built from the source in Listing 3.12.4.
<diagram xmlns="https://prefigure.org"
         dimensions="(400,300)"
         margins="(30,30,5,5)">
  <read name="df" filename="iris.data"
        string-columns="['species']"/>
  <coordinates bbox="(4,0,9,3)">
    <grid-axes>
      <xlabel alignment="nw">Sepal length</xlabel>
      <ylabel alignment="se">Petal width</ylabel>
    </grid-axes>

    <scatter at="setosa" data="df"
             x="sepal length" y="petal width"
             filter="('species','Iris-setosa')"
             fill="blue"
             annotate="yes" text="Data from 50 Iris setosas"
             point-text="An iris setosa"/>

    <scatter at="versicolor" data="df"
             x="sepal length" y="petal width"
             filter="('species','Iris-versicolor')"
             fill="red" style="box" size="4"
             annotate="yes" text="Data from 50 Iris versicolor"
             point-text="An iris versicolor"/>

    <scatter at="virginica" data="df"
             x="sepal length" y="petal width"
             filter="('species','Iris-virginica')"
             fill="green" style="diamond" size="4"
             annotate="yes" text="Data from 50 Iris virginica"
             point-text="An iris virginica"/>

    <legend anchor="(bbox[2],0.3)"
            alignment="nw" opacity="0.9">
      <item ref="setosa-point">Iris-setosa</item>
      <item ref="versicolor-point">Iris-versicolor</item>
      <item ref="virginica-point">Iris-virginica</item>
    </legend>

  </coordinates>

  <annotations>
    <annotation ref="figure"
                text="Scatter plot of the iris data set">
      <annotation ref="setosa"/>
      <annotation ref="versicolor"/>
      <annotation ref="virginica"/>
    </annotation>
  </annotations>
</diagram>
Listing 3.12.4. The PreFigure source for Figure 3.12.3.
Notice that the <scatter> element has an attribute @data that points to the dataset read in earlier. We then state the headers of the columns of the data set that we intend to plot using the @x and @y attributes. An optional @filter attribute can be supplied to filter the data, which is then zipped together internally.
This example also demonstrates a few more features, some of which we will explore in more detail later. For now, notice that the <legend> needs the handle of a point. The handle of the second <scatter> element is “versicolor” so the handle of the points created by PreFigure are “versicolor-point”; that is “-point” is appended to the handle of the <scatter> element.
Each scatter plot is annotated by setting @annotate="yes", specifying the text for the group using @text, and declaring the text for each point using @point-text. More details about annotating a group of objects like this is included in Section 6.1.

Subsection 3.12.3 Histograms

The <histogram> element is used to create histograms as demonstrated in Figure 3.12.5.
Figure 3.12.5. A histogram built from the source in Listing 3.12.6.
<diagram xmlns="https://prefigure.org"
         dimensions="(400,300)"
         margins="(30,30,5,5)">
  <read name="df" filename="iris.data"
        string-columns="['species']"/>
  <coordinates bbox="(3,0,9,25)">
    <grid-axes>
      <ylabel alignment="se">Sepal length</ylabel>
    </grid-axes>

    <histogram at="histogram" data="df['sepal length']" outline="tactile"
               min="4" max="9" bins="20"
               fill="lightskyblue" stroke="black"
               annotate="yes"
               text="The histogram with 20 bins"
               bin-text="A bin from ${__histogram_x[bin_num]}"/>

  </coordinates>

  <annotations>
    <annotation ref="figure"
                text="A histogram of irises with 30 bins grouped by sepal length">
      <annotation ref="histogram"/>
    </annotation>
  </annotations>
</diagram>
Listing 3.12.6. The PreFigure source for Figure 3.12.5.
Notice that the histogram is defined through a @data attribute, which is one column of the dataset, along with the attributes @min, @max, and @bins, which is the number of bins.
Once again, some more advanced annotation features are demonstrated. There are two lists created internally, __histogram_x and __histogram_y, that describe the upper-left corner of each bar. There is a parameter bin_num that iterates through each bin and that can be used for annotating. Another quantity delta_x is also defined as the width of the bins.
The handle of a bar created by the <histogram> is “-bin” appended to the handle of the <histogram>.

Subsection 3.12.4 Plotting a trajectory

There may be times when one generates data in some way and then wants to plot it. Perhaps you have solved a differential equation using a specialized or novel technique and saved the results in a CSV file. Figure 3.12.7 illustrates a solution to the three-body problem found using an adaptive method.
Figure 3.12.7. Plotting the solution to a differential equation solved externally.
<diagram dimensions="(300,300)" margins="5">
  <read filename="gragg_data.csv" name="data" delimiter=' '/>
  <definition>x = data['x']</definition>
  <definition>y = data['y']</definition>
  <definition>points=zip_lists(x,y)</definition>
  <coordinates bbox="(-15,-8,15,8)">
    <grid-axes xlabel="x" ylabel="y"/>
    <polygon points="points" stroke="blue"/>
  </coordinates>
</diagram>
Listing 3.12.8. The PreFigure source for Figure 3.12.7.