— 2 min read
Stata is fine for the small stuff, but Python is way better for anything
intensive. However, you'll often have data in Stata's
.dta format that you
need to read. This post will detail the nice features available in Python's
We'll use the 1978 Automobile Data that comes with Stata. First export this data into a file in your working directory:
1sysuse auto2save "auto.dta", replace
Now open up Python. First import Pandas, the module in Python used to work with rectangular data frames.
1import pandas as pd
The most straightforward way to import a Stata file is a single line:
1auto = pd.read_stata('auto.dta')
This is really simple, and is fine for small files, but with larger files, you often need to finesse your data import. Imagine you have a 100GB Stata file. For most computers, that's too big to import into memory.
First we need to create an iterator, which reads the metadata attached to the
.dta file, but importantly doesn't read the data itself yet.
1itr = pd.read_stata('auto.dta', iterator = True)
Now it's possible to read in just a chunk of the data at a time.
1auto = itr.get_chunk(5)
You can also easily loop over the data like so:
1itr = pd.read_stata('auto.dta', iterator = True, chunksize = 10)2for df in itr:3 # Program operating on 10 rows of the dataset at a time
Now without importing the file, we can get the data label, number of observations, number of variables, and the timestamp at which the data were last saved.
If we want to see the names and labels of the variables, we can use
itr.variable_labels() returns a dictionary where the keys of the dictionary are the variable names and the values of the dictionary are the variable labels. So we can access the labels with something like:
1labels = itr.variable_labels()2# Gets the label of `mpg`3labels['mpg']4# Gets all keys5labels.keys()6# Gets all values7labels.values()
If you're working with a large dataset that might run up against memory constraints, you might want to keep in mind exactly how much memory the imported data will take up.
You can get a list of the number of bytes each column takes up with the
auto.dta, the first column takes up 18 bytes for each row, while the rest of the columns take up between 1 and 4 bytes.
Lets get a better idea of what data types these columns are.
The former shows you the data types that will be used in Python upon import and the latter shows the display formats the data had used in Stata (see
itr.dtyplist, we can see that the first column is a string of length 18, while the rest are types
numpy.float32. These data types come from Numpy, a scientific library that Pandas is based upon, and correspond to Stata's
float, respectively (see Stata's
help data types).
The size of the data in memory is almost exactly the number of rows times the sum of the number of bytes needed for each row. I.e. if the number of rows isN
and the number of bytes each column uses isBcol
then the total memory use of the dataset isN∗∑colBcol
This can be helpful with understanding how many rows of a file to import at once. Let's say you want to not use more than 1GB of memory at once. If you want to import all columns of
auto.dta, each row takes up
sum(itr.col_sizes) = 43 bytes. So the number of rows you can import at a time is
Obviously with the
auto.dta dataset we don't need to add restrictions on rows or columns, but in datasets with columns -- especially those with many string columns -- you might not be able to read in your whole dataset at once.