
classifyLLM brings the power of modern large language models (LLMs) directly into tidyverse data pipelines.It offers a simple, transparent, and auditable way to classify text into predefined categories—without the need to train or maintain your own machine-learning models.
By combining R’s native data-wrangling syntax with LLM-based reasoning, classifyLLM allows analysts to apply consistent classification logic across large datasets using a single line of code inside mutate(). The package handles prompt construction, model communication, and output parsing automatically, returning results as tidy columns that integrate seamlessly with existing workflows.
This makes classifyLLM particularly useful for text-rich humanitarian, social science, or policy datasets—where labels or categories are often context-specific and traditional supervised models are difficult to build due to limited training data or changing definitions.
🧭 Why this package
Analysts and researchers often need to classify open-ended text fields: - survey responses or interview transcripts
- “other” categories in datasets
- qualitative notes from reports
- lists of job titles, symptoms, or objects
Traditional NLP workflows require model training, feature engineering, or external tools.classifyLLM lets you use an LLM (e.g. GPT-4) to perform classification directly, within your data pipeline.
🚀 Features
| Feature | Description |
|---|---|
| 🧹 Tidyverse integration | Works naturally inside mutate(), across(), or map() pipelines. |
| ⚙️ Deterministic and auditable | Set temperature = 0 for reproducible results. |
| 📦 Batching and rate control |
batch_size and delay prevent rate-limit errors. |
| 🔐 Secure key management | Use set_openai_key() or environment variable OPENAI_API_KEY. |
| 🧪 Testing support | API calls skipped if key not set; mock mode planned for CI. |
| 💬 Fallback logic | Normalizes and corrects near matches, prevents blank outputs. |
📦 Installation
You can install the development version of classifyLLM from GitHub using {remotes}:
# install remotes if needed
install.packages("remotes")
# install classifyLLM from GitHub
remotes::install_github("dante042/classifyLLM")
# load the package
library(classifyLLM)Before using the package, make sure your OpenAI API key is available as an environment variable:
Sys.setenv(OPENAI_API_KEY = "your_api_key_here")Or store it permanently in your .Renviron file:
usethis::edit_r_environ()
# then add: OPENAI_API_KEY=your_api_key_here🧩 Example
library(classifyLLM)
library(dplyr)
Sys.setenv(OPENAI_API_KEY = "sk-...") # or classifyLLM::set_openai_key()
tibble::tibble(animal = c("siamese kitty", "golden retriever", "parakeet")) |>
mutate(species = classify_llm(
animal,
categories = c("cat", "dog", "bird"),
model = "gpt-4o-mini",
temperature = 0
))🧠 New: Classify using a data frame of categories
While classify_llm() lets you define categories directly in the function call, the new classify_df() function lets you provide a tidy data frame of categories and optional descriptions, perfect when your taxonomy is stored in a CSV or shared file.
library(classifyLLM)
library(dplyr)
# Example texts
texts <- tibble::tibble(
id = 1:3,
content = c(
"Food distribution in border camp delayed by insecurity.",
"Price inflation accelerates in host communities.",
"Asylum application processing times decrease."
)
)
# Category definitions
categories <- tibble::tribble(
~category, ~description,
"Protection", "Risks, incidents, access to territory/asylum, GBV/CP",
"Basic Needs", "Shelter, food, WASH, core relief items",
"Livelihoods", "Jobs, income, markets, prices",
"Procedures", "RSD, documentation, processing, status"
)
# Classify with a tidy category table
texts |>
classify_df(content, categories = categories, model = "gpt-4o-mini")
#> # A tibble: 3 × 4
#> id content .pred_category .pred_score
#> <int> <chr> <chr> <dbl>
#> 1 1 Food distribution in border camp delayed by ... Basic Needs 0.87
#> 2 2 Price inflation accelerates in host communit... Livelihoods 0.91
#> 3 3 Asylum application processing times decrease. Procedures 0.93