We introduce a Deep Stochastic IOC1 RNN Encoder- decoder framework, DESIRE, with a conditional Variational Auto-Encoder and multiple RNNs for the task of future predictions of multiple interacting agents in dynamic scenes. Accurately predicting the location of objects in the future is an extremely challenging task. An effective prediction model must be able to 1) account for the multi-modal nature of the future prediction (i.e., given the same context, future may vary), 2) fore-see the potential future outcomes and make a strategic prediction based on that, and 3) reason not only from the past motion history, but also from the scene context as well as the interactions among the agents. DESIRE can address all aforementioned challenges in a single end-to-end trainable neural network model, while being computationally efficient. The model first obtains a diverse set of hypothetical future prediction samples employing a conditional variational auto-encoder, which are ranked and refined via the following RNN scoring-regression module. We evaluate our model on two publicly available datasets: KITTI and Stanford Drone Dataset. Our experiments show that the proposed model significantly improves the prediction accuracy compared to other baseline methods.