Google Summer of Code 2026 – mlr3hf Project
Contributor: Sachin Kumar
Download a dataset from Hugging Face using pure R, parse it into a
structured data.frame, automatically detect the target column,
infer task type (classification/regression), and convert it into a valid
mlr3::Task object.
Dataset: perplexity-ai/draco
Format: JSONL
Rows: 100
Columns: id, problem, answer, domain
The domain column was automatically selected as the classification target:
its name is on the function's preferred-name list, and it is categorical
with a limited number of unique values.
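The detection order can be sketched in isolation on a toy data frame. This is a minimal sketch in base R, not the project's final API: the helper name detect_target and the toy data are hypothetical, and the preferred-name list is copied from the function in this document.

```r
# Hypothetical helper (illustration only): choose a target column.
detect_target <- function(data,
                          preferred = c("label", "target", "class", "domain")) {
  # 1. Prefer a column whose name matches a conventional target name.
  by_name <- intersect(preferred, names(data))
  if (length(by_name) > 0) {
    return(by_name[1])
  }
  # 2. Otherwise fall back to the categorical column with the fewest
  #    unique values.
  is_cat <- vapply(data, function(x) is.character(x) || is.factor(x),
                   logical(1))
  cat_cols <- names(data)[is_cat]
  if (length(cat_cols) == 0) {
    stop("No categorical column found; pass `target` explicitly.")
  }
  n_unique <- vapply(data[cat_cols], function(x) length(unique(x)),
                     integer(1))
  cat_cols[which.min(n_unique)]
}

# Toy data with made-up values; only the column names mirror the dataset.
toy <- data.frame(
  id = 1:4,
  problem = c("p1", "p2", "p3", "p4"),
  domain = c("math", "law", "math", "law"),
  stringsAsFactors = FALSE
)
# No preferred name present, so the cardinality fallback applies.
toy2 <- data.frame(
  text = c("a", "b", "c", "d"),
  group = c("x", "y", "x", "y"),
  stringsAsFactors = FALSE
)

detect_target(toy)   # matched by name: "domain"
detect_target(toy2)  # fewest unique values: "group"
```

Because the name match is tried first, a column called domain is chosen even when another categorical column has fewer unique values. Passing target explicitly avoids that.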
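htsk <- function(repo_id, filename, target = NULL) {
  # Build the direct-download URL for a file in a Hugging Face dataset repo.
  url <- paste0(
    "https://huggingface.co/datasets/",
    repo_id,
    "/resolve/main/",
    filename
  )

  # Download to a temporary file so that a nested path in `filename`
  # does not have to exist in the working directory.
  path <- tempfile()
  on.exit(unlink(path), add = TRUE)
  download.file(url, path, mode = "wb")

  # Parse according to the file extension.
  if (grepl("\\.jsonl$", filename)) {
    data <- jsonlite::stream_in(file(path), verbose = FALSE)
  } else if (grepl("\\.csv$", filename)) {
    data <- read.csv(path)
  } else {
    stop("Unsupported file type: ", filename)
  }

  # Detect the target column if the caller did not supply one.
  if (is.null(target)) {
    preferred <- c("label", "target", "class", "domain")
    candidate <- intersect(preferred, names(data))
    if (length(candidate) > 0) {
      # A conventional target name is present: use the first match.
      target <- candidate[1]
    } else {
      # Fall back to the categorical column with the fewest unique values.
      cat_cols <- names(data)[sapply(data, function(x)
        is.character(x) || is.factor(x))]
      if (length(cat_cols) == 0) {
        stop("No categorical column found; supply `target` explicitly")
      }
      uniq_counts <- sapply(data[cat_cols], function(x)
        length(unique(x)))
      target <- cat_cols[which.min(uniq_counts)]
    }
  }
  if (!target %in% names(data)) {
    stop("Target column not found: ", target)
  }

  # Numeric target -> regression; any other type -> classification.
  if (is.numeric(data[[target]])) {
    task <- mlr3::TaskRegr$new(id = repo_id, backend = data, target = target)
  } else {
    data[[target]] <- as.factor(data[[target]])
    task <- mlr3::TaskClassif$new(id = repo_id, backend = data, target = target)
  }
  return(task)
}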
<TaskClassif> (100x4)
• Target: domain
• Properties: multiclass
• Features: id, problem, answer
Automatic target detection successful
Multiclass classification inferred
Pure R implementation (no Python / reticulate)
Generalizable design for future package development
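The task-type rule used in the function (numeric target means regression, anything else means classification) can be checked on its own. This is a minimal base R sketch. The helper name infer_task_type and the return labels "regr" and "classif" are hypothetical and chosen only for illustration.

```r
# Hypothetical helper (illustration only): map a target vector to a task type.
infer_task_type <- function(x) {
  if (is.numeric(x)) "regr" else "classif"
}

infer_task_type(c(1.5, 2.0, 3.7))           # "regr"
infer_task_type(c("math", "law", "math"))   # "classif"

# Caveat: integer-coded class labels are numeric, so the rule reads them
# as a regression target.
infer_task_type(c(0L, 1L, 1L, 0L))          # "regr"

# Converting such labels to a factor first yields classification.
infer_task_type(factor(c(0L, 1L, 1L, 0L)))  # "classif"
```

For datasets whose labels are stored as integers, one option is to convert the target column to a factor before building the task.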