Hard Task: Hugging Face → mlr3 Integration

Google Summer of Code 2026 – mlr3hf Project

Contributor: Sachin Kumar  

Objective

Download a dataset from Hugging Face using pure R, parse it into a structured data.frame, automatically detect the target column, infer the task type (classification or regression), and convert the result into a valid mlr3::Task object.

Dataset Used

Dataset: perplexity-ai/draco

Format: JSONL

Rows: 100

Columns: id, problem, answer, domain

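The domain column was automatically detected as the classification target. Its name matches one of the preferred target names that htsk() checks first (label, target, class, domain), and it is a categorical column with few unique values, which is also what the fallback heuristic looks for.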
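The detection order described above can be tried on a toy data frame. This is a minimal base-R sketch, not the package implementation: `detect_target` is a hypothetical helper name, and the toy values are made up for illustration.

```r
# Hypothetical helper mirroring the two-stage heuristic:
# 1. take the first preferred name that exists as a column,
# 2. otherwise take the categorical column with the fewest unique values.
detect_target <- function(data,
                          preferred = c("label", "target", "class", "domain")) {
  hit <- intersect(preferred, names(data))
  if (length(hit) > 0) return(hit[1])

  is_cat <- vapply(data, function(x) is.character(x) || is.factor(x), logical(1))
  cat_cols <- names(data)[is_cat]
  if (length(cat_cols) == 0) stop("no categorical column found")

  n_unique <- vapply(data[cat_cols], function(x) length(unique(x)), integer(1))
  cat_cols[which.min(n_unique)]
}

# Toy data (made up), shaped like the id/problem/domain columns above
toy <- data.frame(
  id      = c("a1", "a2", "a3", "a4"),
  problem = c("p1", "p2", "p3", "p4"),
  domain  = c("law", "law", "medicine", "medicine"),
  stringsAsFactors = FALSE
)
by_name <- detect_target(toy)            # "domain" (name match)

# Rename the column so that no preferred name matches
toy2 <- toy
names(toy2)[3] <- "topic"
by_cardinality <- detect_target(toy2)    # "topic" (2 unique values vs. 4)
```

Note that when several columns tie on the number of unique values, `which.min()` returns the first one, so the column order decides the fallback result.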

Implementation Strategy

  1. Download dataset via direct HTTPS link
  2. Detect file type (.jsonl / .csv)
  3. Load into R
  4. Automatically detect target column
  5. Infer task type (classification vs regression)
  6. Create and return mlr3 Task
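Steps 1 and 2 can be sketched without any network access. The helper names `hf_dataset_url` and `detect_format` are hypothetical, and the `revision` argument is an assumption added for illustration (the prototype always uses the `main` branch). The repository and file names in the example are placeholders.

```r
# Hypothetical helper for step 1: build the direct download URL.
# `revision` is an illustrative extra; the prototype hard-codes "main".
hf_dataset_url <- function(repo_id, filename, revision = "main") {
  paste0("https://huggingface.co/datasets/", repo_id,
         "/resolve/", revision, "/", filename)
}

# Hypothetical helper for step 2: detect the file type from the extension,
# case-insensitively, and fail early on anything unsupported.
detect_format <- function(filename) {
  ext <- tolower(tools::file_ext(filename))
  if (!ext %in% c("jsonl", "csv")) {
    stop("Unsupported file type: ", filename)
  }
  ext
}

url <- hf_dataset_url("user/dataset", "data/train.csv")  # placeholder names
fmt <- detect_format("data/train.CSV")                   # "csv"
```

Checking the extension before downloading avoids fetching a file that cannot be parsed anyway.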

htsk() Prototype Function

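htsk <- function(repo_id, filename, target = NULL) {

  # Reject unsupported formats before downloading anything
  ext <- tolower(tools::file_ext(filename))
  if (!ext %in% c("jsonl", "csv")) {
    stop("Unsupported file type: ", filename)
  }

  # Direct download URL for a file on the repository's main branch
  url <- paste0(
    "https://huggingface.co/datasets/",
    repo_id,
    "/resolve/main/",
    filename
  )

  # Download to a temporary file so that nested paths such as
  # "data/train.csv" work and the working directory stays clean
  path <- tempfile(fileext = paste0(".", ext))
  on.exit(unlink(path), add = TRUE)
  download.file(url, path, mode = "wb", quiet = TRUE)

  if (ext == "jsonl") {
    data <- jsonlite::stream_in(file(path), verbose = FALSE)
  } else {
    data <- read.csv(path, stringsAsFactors = FALSE)
  }

  if (is.null(target)) {
    # First choice: a column with a conventional target name
    preferred <- c("label", "target", "class", "domain")
    candidate <- intersect(preferred, names(data))

    if (length(candidate) > 0) {
      target <- candidate[1]
    } else {
      # Fallback: the categorical column with the fewest unique values
      is_cat <- vapply(data, function(x)
        is.character(x) || is.factor(x), logical(1))
      cat_cols <- names(data)[is_cat]

      if (length(cat_cols) == 0) {
        stop("Could not detect a target column; please supply `target`")
      }

      uniq_counts <- vapply(data[cat_cols], function(x)
        length(unique(x)), integer(1))

      target <- cat_cols[which.min(uniq_counts)]
    }
  } else if (!target %in% names(data)) {
    stop("Target column not found: ", target)
  }

  # Numeric target -> regression; anything else -> classification
  if (is.numeric(data[[target]])) {
    task <- mlr3::TaskRegr$new(id = repo_id, backend = data, target = target)
  } else {
    data[[target]] <- as.factor(data[[target]])
    task <- mlr3::TaskClassif$new(id = repo_id, backend = data, target = target)
  }

  task
}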
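One limitation of the type check in the prototype is that any numeric target is treated as a regression target, so integer-coded class labels (for example 0/1) would produce a regression task. A possible refinement is sketched below in base R. It is not part of the prototype: `infer_task_type` is a hypothetical helper, and `max_classes` is an illustrative threshold chosen arbitrarily.

```r
# Hypothetical helper: decide between "classif" and "regr" for a target vector.
# Non-numeric targets are always classification. Numeric targets count as
# classification only if all values are whole numbers and there are at most
# `max_classes` distinct values (an arbitrary, illustrative cut-off).
infer_task_type <- function(y, max_classes = 10L) {
  if (!is.numeric(y)) return("classif")
  vals <- unique(y[!is.na(y)])
  is_whole <- all(vals == round(vals))
  if (is_whole && length(vals) <= max_classes) "classif" else "regr"
}

infer_task_type(c("law", "medicine", "law"))  # "classif": character target
infer_task_type(c(0, 1, 1, 0))                # "classif": integer-coded labels
infer_task_type(c(0.5, 1.7, 3.2))             # "regr": continuous target
```

A heuristic like this can misclassify count-valued regression targets, so an explicit override argument would still be needed in practice.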

Result

<TaskClassif> (100x4)
• Target: domain
• Properties: multiclass
• Features: id, problem, answer

  - Automatic target detection successful
  - Multiclass classification inferred
  - Pure R implementation (no Python / reticulate)
  - Generalizable design for future package development

Future Improvements