Video analysis plays an important role in computer vision and has applications in many areas. Fine-grained event classification, such as echocardiogram function prediction and social insect behavior classification, is among the most challenging problems in video analysis because of subtle differences between classes and limited training examples. The patterns of interest in these tasks are difficult to distinguish, so annotating unlabeled videos requires domain experts with professional training. As a result, annotated video datasets are typically small or severely imbalanced. The performance of traditional shallow learning methods is bounded by handcrafted feature extraction and data scarcity. Recently, deep learning methods, such as convolutional neural networks (CNNs), have made substantial advances in many vision tasks by learning feature representations in a purely data-driven manner. In this dissertation, we propose a set of methods to address three fine-grained video classification problems involving rare events. We first present a 3D CNN approach to classify fine-grained echocardiogram videos with subtle inter-class differences and limited training data. Then, we investigate a 3D convolutional autoencoder with an additional one-class support vector machine (OCSVM) layer to detect impaired-heart videos in an imbalanced echocardiogram dataset. Finally, we propose a pipeline that localizes fine-grained pairwise ant behaviors by generating behavior proposals from convolutional feature maps computed by a 3D CNN.
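The methods above all rest on 3D convolution, which slides a kernel over time as well as space and can therefore respond to motion cues that a per-frame 2D convolution cannot express. As a toy illustration of this idea (not code from the dissertation; the function name and the temporal-difference kernel are purely illustrative), a naive single-channel 3D convolution in NumPy might look like:

```python
import numpy as np

def conv3d_valid(video, kernel):
    """Naive valid-mode 3D convolution (cross-correlation) of a
    single-channel video of shape (T, H, W) with a (t, h, w) kernel.
    Illustrative only; real 3D CNN layers add channels, strides,
    padding, and learned weights."""
    t, h, w = kernel.shape
    T, H, W = video.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(video[i:i + t, j:j + h, k:k + w] * kernel)
    return out

# A temporal-difference kernel: averages frame i+1 minus frame i over
# a 3x3 spatial window, so it fires only where brightness changes in time.
video = np.zeros((4, 5, 5))
video[2:, :, :] = 1.0            # brightness jumps at frame 2
kernel = np.zeros((2, 3, 3))
kernel[0] = -1.0 / 9
kernel[1] = 1.0 / 9

response = conv3d_valid(video, kernel)
# The response is zero where consecutive frames agree and peaks at the
# temporal position straddling the jump (frames 1 and 2).
```

A 2D convolution applied to each frame independently would see identical uniform frames before and after the jump and produce the same output everywhere; the temporal axis of the kernel is what lets the filter detect the event.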