01 Aug Using Machine Learning To Automate Data Coding At The Burea…
Government agencies are awash in documents. Many of these documents are paper-based, but even with electronic documents a human is often still needed to process and understand them before they can be used for vital services. Federal agencies are increasingly looking to AI to improve these document-heavy, human-bound processes by applying machine learning, neural network, and natural language processing (NLP) technologies. While these technologies are fairly new to many organizations, some government agencies have been using them for many years to augment and enhance various workflows and tasks.
The Bureau of Labor Statistics (BLS), for example, is mandated to conduct the Survey of Occupational Injuries and Illnesses to determine workplace injuries and help guide policy. To perform this survey, BLS has dozens of trained staff in offices throughout the country who classify injuries and illnesses using workplace-generated survey data. However, this classification was performed manually, causing inconsistencies in labeling, coding errors, and speed and cost bottlenecks.
To streamline this process, BLS turned to machine learning. About a decade ago, Alex Measure, an economist at the Bureau of Labor Statistics, decided to explore how machine learning (ML) could help the agency improve. In this article, he shares how he incorporated AI into BLS, some of the unique challenges the federal government faces around data usage that could be obstacles for agencies looking to use AI, and what he’s most excited to see in the coming years. He also shares his insights on applying ML to the sorts of document and human-bound processes that exist throughout the government.
What are some of the unique challenges of the BLS with regard to data and data collection?
Alex Measure: The Bureau of Labor Statistics produces information about a wide variety of topics covering everything from employment and prices to time-use and workplace injuries. One thing that all of these activities have in common, however, is language. When we go out there to collect this information, whether by interviews, surveys, or some other means, most of the information we are collecting is communicated in the form of language. One of the ways we convert this language into statistics is through a process we call coding, in which we assign standardized classifications to indicate key characteristics of interest. For example, the Survey of Occupational Injuries and Illnesses collects hundreds of thousands of written descriptions of work-related injury and illness each year. In order to answer questions like “What is the most common cause of injuries for janitors?”, we go through each of these descriptions and assign codes to indicate things like the occupation of the worker and the event that caused their injury. The resulting information can then be aggregated to answer our questions. One problem, at least until recently, is that this is a lot of work, and work that mostly has to be done by hand. For the Survey of Occupational Injuries and Illnesses, we estimate it requires about 25,000 hours of labor each year. If you want it done quickly, that means you need to have a lot of people working on it simultaneously, and that means you need to train a lot of people and make sure they are all interpreting things consistently. It’s not easy. In fact, we find that even when we ask experienced experts to code the exact same injury narratives, any two experts will agree on the same codes for the same case only about 70% of the time. That’s a big challenge not just at BLS, but in many organizations working on similar tasks around the world.
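To make the coding-and-aggregation idea concrete, here is a minimal Python sketch. The records and the occupation and event labels are invented for illustration and are not real BLS codes or data. The helper names `most_common_event` and `agreement_rate` are likewise hypothetical; the second simply measures how often two coders assign identical codes to the same cases, which is one plausible way to arrive at a figure like the 70% mentioned above.

```python
from collections import Counter

# Hypothetical coded cases as (occupation, event) pairs.
# Labels are illustrative only, not actual BLS classification codes.
coded_cases = [
    ("janitor", "fall on same level"),
    ("janitor", "overexertion"),
    ("janitor", "fall on same level"),
    ("nurse", "overexertion"),
]


def most_common_event(cases, occupation):
    """Return the most frequent event label among cases for one occupation."""
    counts = Counter(event for occ, event in cases if occ == occupation)
    return counts.most_common(1)[0][0] if counts else None


def agreement_rate(codes_a, codes_b):
    """Fraction of cases on which two coders assigned exactly the same code."""
    if len(codes_a) != len(codes_b):
        raise ValueError("both coders must code the same set of cases")
    if not codes_a:
        return 0.0
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)
```

Calling `most_common_event(coded_cases, "janitor")` answers the janitor question for this toy data, and `agreement_rate` can be applied to any two equal-length lists of codes assigned by different coders.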
How is the Bureau of Labor Statistics using machine learning to solve these problems?
Alex Measure: Seven years ago, BLS did all of the coding for the Survey of Occupational Injuries and Illnesses by hand. This past year we did more than 85% of it automatically using supervised machine learning, specifically with deep neural networks. BLS is increasingly applying these same techniques to a wide variety of related tasks covering everything from the classification of occupations and products to medical benefits and job requirements.
How has the BLS’s view and use of AI evolved over the years?
Alex Measure: When I started at BLS nearly 12 years ago, the main approach people were using was what’s sometimes called the knowledge engineering or rule-based approach. The basic idea is, if you want a computer to do something, you need to explicitly tell it every rule and piece of information that is necessary to perform the task. If you’re classifying occupations, for example, that could mean creating a list of all the job titles that might show up, and the corresponding occupation codes that should be assigned when they do.
This approach works well for simple and standardized inputs, but unfortunately that’s rarely the case with human language, even in a domain as narrow as job titles. In the Survey of Occupational Injuries and Illnesses, for example, we found that each year we received about 2,000 different job titles all corresponding to the occupation “janitor”. To make matters worse, many of those job titles had never occurred in our data previously. To make matters still worse, many of those job titles were associated with different occupations depending on other factors like the naming practices of the individual company or the industry of the employer. The result is that you need a huge number of often complicated rules, all to assign just one of the more than 840 occupation classifications we assign. Building and maintaining this sort of system can be incredibly time-consuming and difficult.
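The rule-based approach and its weaknesses can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tables `TITLE_RULES` and `CONTEXT_RULES`, the function `classify_title`, and every title, industry, and occupation label in them are hypothetical and do not reflect the rules or codes BLS actually used.

```python
# Hypothetical lookup table: normalized job title -> occupation label.
TITLE_RULES = {
    "janitor": "janitor",
    "custodian": "janitor",
    "building cleaner": "janitor",
}

# Hypothetical context rules: the same title can map to different
# occupations depending on the employer's industry.
CONTEXT_RULES = {
    ("attendant", "hospital"): "nursing assistant",
    ("attendant", "parking"): "parking attendant",
}


def classify_title(title, industry=None):
    """Return an occupation label from hand-written rules, or None if no rule applies."""
    key = " ".join(title.lower().split())  # normalize case and whitespace
    if (key, industry) in CONTEXT_RULES:
        return CONTEXT_RULES[(key, industry)]
    return TITLE_RULES.get(key)  # None when the title was never anticipated
```

Every new spelling, never-before-seen title, or industry-specific meaning requires another hand-written entry, and a title such as "environmental services tech" falls through to `None` until someone adds a rule for it.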
Supervised machine learning provides an alternative: instead of telling the computer everything it needs to know and do, we tell the computer how to learn from data and then feed it lots of data showing how some task should be performed. If you have lots of this data, which we do, since we’ve been doing this by hand for many years, you can often build a very effective system with very little additional work. In our case, we built our first machine learning systems using…