SecureCode: security-aware code models (3B–20B), trained for review + remediation
I’ve been frustrated by how often code assistants recommend patterns that pass tests but fail security review (e.g., string-built SQL, brittle auth logic, unsafe parsing, insecure defaults). So I built **SecureCode**: a collection of **8 code models (3B → 20B)** trained to behave more like a security reviewer.
What you should expect from SecureCode:
- identify likely vuln patterns and explain *why* they’re risky
- outline plausible abuse paths (defensive framing)
- propose a secure rewrite (drop-in where possible)
- include defense-in-depth guidance + regression tests/checks
> You are a senior application security engineer. Review the code below.
> Output: (1) findings with severity, (2) likely exploit scenarios (high level), (3) secure rewrite,
> (4) defense-in-depth recommendations, (5) regression tests/checks.
> Code: `...`
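To make the workflow concrete, here is an illustrative finding of the kind the prompt above asks for: the string-built SQL pattern from the intro, its secure rewrite, and a regression check. This is my own minimal sketch, not model output; the `users` table and column names are hypothetical.

```python
import sqlite3

def find_user_unsafe(conn, username):
    # VULNERABLE: string-built SQL. A crafted username like
    # "' OR '1'='1" changes the query's meaning (SQL injection).
    query = f"SELECT id, name FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def find_user_safe(conn, username):
    # Secure rewrite: parameterized query. The driver treats the
    # input strictly as data, never as SQL syntax.
    query = "SELECT id, name FROM users WHERE name = ?"
    return conn.execute(query, (username,)).fetchall()

# Regression check: a classic injection payload must return nothing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

payload = "' OR '1'='1"
assert find_user_unsafe(conn, payload) == [(1, "alice")]  # injection succeeds
assert find_user_safe(conn, payload) == []                # injection blocked
```

The regression check is the part I most want the models to emit: a test that fails on the vulnerable version and passes on the fix.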
**I’m looking for real-world feedback**
- Your “this slipped through review once” snippets (sanitized is fine)
- False positives / false negatives you observe
- Contributions of new CVE-grounded examples
If you drop a snippet, please include language/framework + what the *correct* remediation looks like in your environment. If you have any contributions or suggestions for the dataset, I'd be happy to hear them. I have some new features and enhancements planned for v3 that are already underway, but for now, I'm focused on testing as many use cases as possible. Appreciate you all!
We’ve released two conversational speech datasets from oto on Hugging Face 🤗 Both are based on real, casual, full-duplex conversations, but with slightly different focuses.
Dataset 1: Processed / curated subset (otoearth/otoSpeech-full-duplex-processed-141h)
* Full-duplex, spontaneous multi-speaker conversations
* Participants filtered for high audio quality
* PII removal and audio enhancement applied
* Designed for training and benchmarking S2S or dialogue models
Dataset 2: Larger raw(er) release (otoearth/otoSpeech-full-duplex-280h)
* Same collection pipeline, with broader coverage
* More diversity in speakers, accents, and conversation styles
* Useful for analysis, filtering, or custom preprocessing experiments
We intentionally split the release to support different research workflows: a clean, ready-to-use subset versus a larger raw release for more exploratory work.
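For the exploratory workflow, a typical first step is filtering the raw release down to a custom subset. A minimal sketch of that idea is below; the records are synthetic and the field names (`duration_s`, `num_speakers`, `snr_db`) are assumptions for illustration, not the dataset's actual schema.

```python
def keep(example, min_duration=5.0, min_snr_db=15.0):
    """Keep clips that are long enough and clean enough."""
    return (
        example["duration_s"] >= min_duration
        and example["num_speakers"] == 2      # strict full-duplex pairs
        and example["snr_db"] >= min_snr_db   # drop noisy recordings
    )

# Synthetic stand-ins for raw-release records (hypothetical fields).
raw = [
    {"id": "a", "duration_s": 42.0, "num_speakers": 2, "snr_db": 22.5},
    {"id": "b", "duration_s": 3.1,  "num_speakers": 2, "snr_db": 30.0},  # too short
    {"id": "c", "duration_s": 60.0, "num_speakers": 3, "snr_db": 18.0},  # not a pair
]

curated = [ex for ex in raw if keep(ex)]
print([ex["id"] for ex in curated])  # ['a']
```

In practice the same predicate would be applied to the real records once access is granted; the point is only that the raw release leaves these thresholds up to you.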
The datasets are currently private, but we’re happy to approve access requests — feel free to request access if you’re interested.
If you’re working on speech-to-speech (S2S) models or are curious about full-duplex conversational data, we’d love to discuss and exchange ideas together.