New In

Java integration

Now you can embed and execute Java code directly in Stata. Previously, you could create and use Java plugins in Stata, but that required you to compile your code and bundle it in a JAR file. You can still do this, but you can now also write your Java code directly in do-files, ado-files, or even interactively. The Java code you write compiles on the fly—an external compiler is not necessary! You may even be able to write parallelized code to take advantage of multiple cores.

The included Stata Function Interface (sfi) Java package provides a bidirectional connection between Stata and Java.

Stata bundles the Java Development Kit (JDK) with its installation, so there is no additional setup involved. This version of Stata includes Java 11, which is the current long-term support (LTS) version.

Highlights

  • Java integration with Stata
    • Use Java interactively (like JShell) from within Stata
    • Embed Java code in do-files
    • Embed Java code in ado-files
    • Compile Java code on the fly without external programs
  • Stata Function Interface (sfi) Java package
    • Bidirectional connection between Stata and Java
    • Access Stata datasets, frames, macros, scalars, matrices, value labels, characteristics, global Mata matrices, date and time values, and more from Java

Let’s see it work

Example 1: Invoke Java interactively

If you are familiar with JShell, the following will look similar to you.

. java:
java (type end to exit and /help for help)
java> System.out.println("Hello, Java!");
Hello, Java!
java>

In this example, we asked Java to print “Hello, Java!” We can call Java interactively from Stata’s Command window.

Example 2: Embed Java code in a do-file

We can embed Java in a do-file, just as we do with Mata and Python code. Just place the Java code between java: and end.

--------------------- do-file (begin)----------------------------
java:
ArrayList<String> coffeeList = new ArrayList<>(Arrays.asList(
        "Latte", "Mocha", "Espresso", "Cappuccino"));

Collections.sort(coffeeList);
for ( String coffee : coffeeList ) {
    SFIToolkit.displayln(coffee) ;
}
end // leave Java
--------------------- do-file (end) -----------------------------

--------------------- output (begin) ----------------------------
. java:
java (type end to exit and /help for help)
java> ArrayList<String> coffeeList = new ArrayList<>(Arrays.asList(
 ...>         "Latte", "Mocha", "Espresso", "Cappuccino"));
coffeeList ==> [Latte, Mocha, Espresso, Cappuccino]
java>
java> Collections.sort(coffeeList);
java> for ( String coffee : coffeeList ) {
 ...>     SFIToolkit.displayln(coffee) ;
 ...> }
Cappuccino
Espresso
Latte
Mocha
java> end // leave Java
--------------------- output (end) -------------------------------

We have an ArrayList containing four different coffee drinks (“Latte”, “Mocha”, “Espresso”, “Cappuccino”).

We sorted the ArrayList using Collections.sort() and then printed the sorted list, one drink per line:

Cappuccino
Espresso
Latte
Mocha

Example 3: Embed Java code in an ado-file

In contrast with Python, Java libraries tend to have a lower-level implementation, meaning you might have to write a bit more code to do what you want. This, however, often gives greater flexibility and performance. One of Java’s strengths is in its extensive APIs, which are packaged with the Java virtual machine. See the Java Development Kit for details. There are also many useful third-party libraries available.

Let’s assume that Stata’s egen command did not already have a rowmean function. We could use Java integration to write a Stata command with this functionality.

--------- rowmean.ado (begin) ------------------------------------------------
program rowmean
    version 17
    syntax varlist [if] [in], GENerate(string)
    confirm new variable `generate'
    preserve
    quietly generate `generate' = .
    java `varlist' `if' `in' : Demo.rowmean("`generate'")
    restore, not
end

java:
import java.util.stream.LongStream;
public class Demo {
    public static void rowmean(String newvar) {
        long obs1 = Data.getObsParsedIn1(); // first observation of the in range
        long obs2 = Data.getObsParsedIn2(); // last observation of the in range
        int varCount = Data.getParsedVarCount();
        int idxStore = Data.getVarIndex(newvar);

        // loop over observations
        LongStream.rangeClosed(obs1, obs2).forEach(obs -> {
            double sum = 0;
            long count = 0;
            if (!Data.isParsedIfTrue(obs)) {
                return; // skip iteration of lambda expression
            }

            // loop over each variable
            for (int i = 1; i <= varCount; i++) {
                final int var = Data.mapParsedVarIndex(i);
                if (Data.isVarTypeString(var)) continue; // skip if string

                double value = Data.getNum(var, obs);
                if (Missing.isMissing(value)) continue;
                sum += value; count++;
            }
            // leave the result missing when every value in the row is missing
            if (count > 0) Data.storeNumFast(idxStore, obs, sum/count);
        });
    }
}
end // end Java block
---------- rowmean.ado (end) -------------------------------------------------

We put our Java code right in our ado-file. It compiles on the fly as the ado-file is loaded and executed.
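
The per-observation rule inside Demo.rowmean() can also be tried outside Stata. The following is a minimal sketch, not the sfi code itself: it assumes a plain double[] stands in for one observation and Double.NaN stands in for Stata's missing value, and the names RowMeanSketch and rowMean are hypothetical.

```java
// Plain-Java stand-in for the row-mean rule. Assumptions: a double[] replaces
// one Stata observation and Double.NaN replaces Stata's missing value.
class RowMeanSketch {

    // Mean of the non-missing entries in one row; NaN if every entry is missing.
    static double rowMean(double[] row) {
        double sum = 0;
        long count = 0;
        for (double value : row) {
            if (Double.isNaN(value)) continue; // skip "missing" values
            sum += value;
            count++;
        }
        // Guard the empty case explicitly instead of relying on 0.0 / 0.
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) {
        double[][] data = {
            {1.0, 2.0, 3.0},
            {4.0, Double.NaN, 6.0},
            {Double.NaN, Double.NaN, Double.NaN}
        };
        for (double[] row : data) {
            System.out.println(rowMean(row));
        }
    }
}
```

The second row shows why the sum is divided by the count of non-missing values rather than by the number of variables: the missing entry is skipped, so the mean is taken over the two remaining values.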

Let’s set up some sample data.

---------- setup-data.do (begin) ----------------------------
clear
set seed 12345
set obs 5000000           // create five million observations
forvalues i = 1/10 {      // generate ten variables
    gen v`i' = runiform()
}
---------- setup-data.do (end) -------------------------------

Let’s run the command.

. rowmean v*, gen(rmean)

That ran in 3.2 seconds. That’s not bad. Can we make it faster? You bet we can! We can take advantage of our multiple processors. The machine I am testing on has an Intel Core i7-5820K with 6 CPU cores. Another optimization is to stop asking Stata for each variable’s index on every observation. Instead, we can look up the indexes once and cache them.

Example 4: Make it faster

Here is our improved command:

--------- rowmean.ado (begin) ------------------------------------------------
program rowmean
    version 17
    syntax varlist [if] [in], GENerate(string)
    confirm new variable `generate'
    preserve
    quietly generate `generate' = .
    java `varlist' `if' `in' : Demo.rowmean("`generate'")
    restore, not
end

java:
import java.util.stream.LongStream;
public class Demo {
    public static void rowmean(String newvar) {
        long obs1 = Data.getObsParsedIn1(); // first observation of the in range
        long obs2 = Data.getObsParsedIn2(); // last observation of the in range
        int varCount = Data.getParsedVarCount();
        int idxStore = Data.getVarIndex(newvar);

        // cache the variable indexes
        ArrayList<Integer> varmap = new ArrayList<Integer>();
        for (int i = 1; i <= varCount; i++) {
            int idx = Data.mapParsedVarIndex(i);
            if (!Data.isVarTypeString(idx)) {
                 varmap.add(idx);
            }
        }

        // loop over observations
        LongStream.rangeClosed(obs1, obs2).parallel().forEach(obs -> {
            double sum = 0;
            long count = 0;
            if (!Data.isParsedIfTrue(obs)) {
                return; // skip iteration of lambda expression
            }

            // loop over each variable
            for (int var : varmap) {
                double value = Data.getNum(var, obs);
                if (Missing.isMissing(value)) continue;
                sum += value; count++;
            }
            // leave the result missing when every value in the row is missing
            if (count > 0) Data.storeNumFast(idxStore, obs, sum/count);
        });
    }
}
end // end Java block
---------- rowmean.ado (end) -------------------------------------------------

Did that make it faster? It sure did! Our original timing was 3.2 seconds. This time, using the same dataset, the command finished in 0.79 seconds. Most of the improvement came from making our code run in parallel. If you don’t look closely, you might not notice the change we made. Because we use LongStream to loop over the observations, we can ask Java to run the loop in parallel by calling parallel() on the stream before forEach(). Keep in mind that writing parallel code has many pitfalls and may not be easy for more complicated problems. And for other problems, parallelization may not be beneficial.
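
One such pitfall is shared state. The row-mean loop parallelizes cleanly because each iteration writes only to its own observation. A statistic that combines rows, such as a grand total, would not be safe if every thread updated the same variable. The sketch below illustrates both cases in plain Java, outside Stata. It assumes a double[][] stands in for the dataset, and the names ParallelPitfallSketch, rowMeans, and grandTotal are hypothetical.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Plain-Java sketch: a double[][] stands in for the Stata dataset (assumption).
class ParallelPitfallSketch {

    // Safe: iteration r writes only to out[r], so threads never share a slot.
    static double[] rowMeans(double[][] data, boolean parallel) {
        double[] out = new double[data.length];
        IntStream rows = IntStream.range(0, data.length);
        if (parallel) {
            rows = rows.parallel();
        }
        rows.forEach(r -> out[r] = Arrays.stream(data[r]).average().orElse(Double.NaN));
        return out;
    }

    // A total across rows is shared state. Updating one shared variable from
    // inside a parallel forEach() would be a data race, so this method lets
    // the stream perform the reduction instead.
    static double grandTotal(double[][] data) {
        return Arrays.stream(data)
                .parallel()
                .mapToDouble(row -> Arrays.stream(row).sum())
                .sum();
    }

    public static void main(String[] args) {
        double[][] data = { {1.0, 2.0, 3.0}, {4.0, 5.0, 6.0} };
        double[] sequential = rowMeans(data, false);
        double[] parallel = rowMeans(data, true);
        // The per-row results do not depend on how the work was scheduled.
        System.out.println(Arrays.equals(sequential, parallel));
        System.out.println(grandTotal(data));
    }
}
```

With floating-point data, a parallel reduction may add values in a different order than a sequential loop, so totals can differ in the last digits. The example uses whole numbers to avoid that effect.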