In an age of continuous acceleration and speed, these slow and detailed explanations are pure gold. Thanks in advance for upcoming videos.
@Jeny-dj2eh
11 months ago
Standing ovation🤩
@schmiddisen2450
A year ago
This video is amazing
@SamuelAlbanie1
A year ago
Thank you Schmiddisen.
@jiawei8319
A year ago
Thank you for delivering an outstanding lecture! By the way, I noticed that the link for the slides isn't working. Would it be possible for you to update it? I'd really appreciate it. Thank you so much!
@SamuelAlbanie1
A year ago
Thank you for flagging this Jiawei. I've updated the link to the slides in the video description.
@creativeuser9086
A year ago
Is there a more recent update on GPT-4's performance for code generation? Also, what about other models like StarCoder? Finally, I'm wondering whether RAG (retrieval-augmented generation), where a good chunk of GitHub code is indexed and embedded, would enhance HumanEval performance?
@SamuelAlbanie1
A year ago
By combining prompting strategies with GPT-4, it's possible to get substantially higher performance than Codex. For instance, by using "reflexion", Shinn et al. report a score of 88% pass@1 on HumanEval (see nanothoughts.substack.com/p/reflecting-on-reflexion for details). To the best of my knowledge, there is still something of a gap between models like StarCoder and closed LLMs (see huggingface.co/blog/starcoder for a summary of results). As to whether RAG will help code generation, it's certainly possible that it boosts performance - I'm not aware of a recent study that explores this with the most powerful LLMs (feel free to point such a study out if you encounter one). It's plausible that the latest frontier models use retrieval, but simply do not describe it in their technical reports (which exclude model details). One increasing challenge with using explicit retrieval is that as more time passes, more examples from HumanEval are likely to appear on GitHub, so more careful test set de-duplication is required.
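For context on the "88% pass@1" figure: pass@k on HumanEval is typically computed with the unbiased estimator from the original HumanEval paper (generate n samples per problem, count the c that pass the unit tests). A minimal sketch of that estimator (function name and variable names here are my own, not from any particular codebase):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021, HumanEval).

    n: total samples generated for a problem
    c: number of those samples that pass the unit tests
    k: evaluation budget (e.g. k=1 for pass@1)

    Estimates the probability that at least one of k randomly drawn
    samples (out of the n generated) passes: 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer failing samples than the budget: every draw of k
        # samples must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 3 pass -> pass@1 estimate is 0.3
print(pass_at_k(10, 3, 1))
```

A reported "pass@1 of 88%" then means this estimate, averaged over all 164 HumanEval problems, is 0.88.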
@creativeuser9086
A year ago
@@SamuelAlbanie1 In fact, I struggle to believe that there's a good (and fair, I should say) evaluation method for both code and other language-related tasks, for the very reason you mentioned. It kind of feels like all the models are cheating. Is there a standard way of guarding against training-data contamination? Also, thanks a lot for your elaborate response; I'm a big fan of your academic work. It's amazing that you use KZitem to explain it.
Comments: 9