Need data for a new model
It would be nice if someone could help me gather the required data (10B tokens) for training a model on most of the internet's knowledge.
You can make it multimodal if you want to.
OK, so how large is the model (parameters)?
You know that would take ~19 days of training (19 TFLOPS, 500M parameters, 10B tokens), but sure, I can in fact provide that. Just tell me what kind of data you need :D
I need text data, and the model is 200M parameters (I know, overfitting to some extent, but hey, this is a sparse MoE). I've got way more than 19 TFLOPS.
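For anyone checking the numbers in this thread: the ~19-day figure lines up with the common rule of thumb that training costs roughly 6 × parameters × tokens FLOPs. Below is a minimal sketch of that arithmetic. The function name `training_days` and the `utilization` parameter are made up for illustration, and the rule itself is only an approximation (for a sparse MoE it would more plausibly apply to the active parameters per token, not the total count).

```python
SECONDS_PER_DAY = 86_400


def training_days(params, tokens, flops_per_second, utilization=1.0):
    """Rough training time in days, assuming ~6 * params * tokens total FLOPs.

    `utilization` is a hypothetical knob for the fraction of peak
    throughput actually achieved (1.0 means perfect utilization).
    """
    total_flops = 6 * params * tokens
    seconds = total_flops / (flops_per_second * utilization)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # 500M parameters, 10B tokens, 19 TFLOPS (the numbers quoted above)
    print(round(training_days(500e6, 10e9, 19e12), 1))  # 18.3
    # 200M parameters on the same budget
    print(round(training_days(200e6, 10e9, 19e12), 1))  # 7.3
```

At perfect utilization the 500M case comes out a little over 18 days, so ~19 days is a fair ballpark. Real runs usually achieve well under peak throughput, which pushes the estimate up.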
And thanks for being kind enough to help me find that!!!
It's for a general-purpose AI covering coding and whatever else counts as general purpose.
How do you want me to give it to you?
In this discussion, or by making a dataset?
I have 5B tokens of data right now. Is that fine?
Yes
Any updates so far?
My bad for vanishing, I was working on a new web crawler (I am NOT babysitting it for ~12 hrs). If you need data quickly, download the large English Wikipedia dump and run WikiExtractor on it (the text is human-written).
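A rough sketch of how the extracted dump could be read back in Python, in case it helps. It assumes WikiExtractor was run with its `--json` option, so each output file holds one JSON object per line with `title` and `text` fields, and that the files are named `wiki_*` under the output directory. The function names here are made up for illustration, and the whitespace word count is only a crude stand-in for real tokenizer counts.

```python
import json
from pathlib import Path


def iter_articles(lines):
    """Yield (title, text) pairs from JSON-lines output, skipping empty pages."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        text = record.get("text", "").strip()
        if text:
            yield record.get("title", ""), text


def count_whitespace_tokens(lines):
    """Crude size estimate: whitespace-separated words, not tokenizer tokens."""
    return sum(len(text.split()) for _, text in iter_articles(lines))


def count_directory(root):
    """Sum the word counts over every wiki_* file below `root` (assumed layout)."""
    total = 0
    for path in sorted(Path(root).rglob("wiki_*")):
        with path.open(encoding="utf-8") as handle:
            total += count_whitespace_tokens(handle)
    return total
```

Calling `count_directory("extracted")` on the output folder (the folder name is just an example) gives a ballpark of how far the dump goes toward the 10B-token target.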
Alright, thanks!
Did it work? :D
Well, I couldn't find it lol.
ok