We propose a method for joint image and per-pixel annotation synthesis with GAN. We demonstrate that GAN has good high-level representation of target data that can be easily projected to semantic segmentation masks. This method can be used to create a training dataset for teaching separate semantic segmentation network. Our experiments show that such segmentation network successfully generalizes on real data. Additionally, the method outperforms supervised training when the number of training samples is small, and works on variety of different scenes and classes. The source code of the proposed method will be publicly available.